As part of my 5 Common Gotchas with SQL Server AlwaysOn presentation that I recently gave, I mentioned that I’d provide a links and additional resources as a follow up. (And for anyone who didn’t attend but would like to, that link should provide on-demand access to this presentation for 12 months—but might take a day or two [after the initial presentation date] to become active/available.)
Otherwise (in an order that doesn’t exactly match the order in which these were presented), here are some additional links and follow-up details.
I don't have any additional links or details here—other than a simple reminder: If your databases are important enough to merit redundancy, don't risk trying to 'sneak' Windows Server Failover Clustering (WSFC) Services 'underneath' your production SQL Servers. Doing so can end in spectacular disaster and robs you of the chance to spin up new hardware where you can safely configure everything, work out all kinks and issues before going live, and then use the new environment as a sandbox where you can safely test failover, crashes, patching, and the likes.
Here's a link to a PDF that covers the two 'rules' of licensing that I threw out ("All active workloads must be licensed" and "each license comes with a free license—as long as free means: 'used only for passive failover purposes only') in MUCH greater detail. (In other words, the rules I threw out ARE accurate. But this is licensing—so you know there are some caveats and additional details. Go read up on them. It'll take you 10-20 minutes.)
This one's a beast. One of the better and more succinct overviews of some of the gotchas I've seen is Brent Ozar's review of AlwaysOn Availability Groups - Real Life Lessons Learned. It's well worth your time if you don't have a solid handle on Quorum mechanics and best practices.
Server vs Availability Group Scope
This one's too complex to try and summarize in this post—if you're curious to know what this means, go watch the presentation. Otherwise, here's the documentation for that Backup Preferences part of the setup wizard. And note how the documentation says that these preferences are NOT enforced.
I also addressed some of the issues with Availability Groups, Active Secondaries, and Scale-Out scenarios—and specifically mentioned how that's a lot harder to implement correctly (especially in existing applications) than it is to talk about. For more a great work-around I can happily recommend ScaleArc's iDB—it's a game-changing solution that's complements AlwaysOn Availability Groups insanely well.
Here are two additional topics or details that I wanted to cover, but just didn’t have time to address.
Failover is Not Immediate. WSFC failover is pretty impressive—it usually happens in a matter of 2-3 seconds or less (where there are 'clean' failures). Still, once a SQL Server instance spins up somewhere and takes control of database resources, it's going to have to walk through the recovery process. With Availability Group Replicas, this recovery time is typically going to be brief (when we're talking about synchronous replicas)—where brief can be on the order of 20-40 seconds. With Failover Cluster Instances, however, the new active node is basically starting from scratch—a cold-start as it were—and typically takes a bit under a minute to be ready to start serving all databases as needed. For more info, check out my post on indirect checkpoint—a VERY cool (but powerful) feature of SQL Server 2012.
Scoping and Sizing. I was planning on covering this initially, but found I just didn't have time to do it justice. It's a common gotcha though. One that tends to 'bite' Sys-admins and reluctant DBAs much more than it happens to burn experienced DBAs. And that’s mostly because this one is a question of having a good feeling of the characteristics of your workloads.
There are two things you need to watch out for here:
And the first is best explained by means of an example.
Assume that you’ve got a fairly heavily used server that’s, let’s say, commonly churning through large amounts of data – to the point where it’s regularly pulling just under 300MB/sec of data off the drives.
Then, let’s assume you throw a new, less powerful, server in as your secondary—thinking that it’ll be LESS optimal as a failover box, but that you’ll only be using it on the odd occasion when you either schedule patches or updates—during non-peak hours, or when the primary temporarily fails and you have to shunt operations over to this secondary box. But let’s say this temporary box tops out at 200MB/sec in terms of disk throughput.
In such an overly-simplified scenario, if you team these two servers together as replicas, your primary server will be cobbled to the point where it’s only able to do about 200MB/sec of throughput—simply because each operation or transaction run on the primary server has to be mirrored – in real time – with the slower secondary or replica.
Or, that’s the behavior you get when you use synchronous replication—as that means that transactions on the primary, for example, can NOT be treated as completed until they’ve been successfully completed on ALL other replicas actively in your topology. Otherwise, you run the very real risk of data loss—which you can OPTIONALLY allow when you use ASYNCHRONOUS replication or synchronization.
And where this is commonly a gotcha is when Availability Groups are deployed WITHOUT a good, solid, understanding of the current workload AND/OR some of the bottlenecks you’ll be putting into place by ‘teaming’ multiple hosts together.
Similar problems can and will occur when transactional throughput rates are so frequent and/or large that they saturate existing network connections between servers, OR where the VERY SLIGHT amount of transactional overhead put into place to ensure synchronous replication manages to be JUST enough overhead to push heavily used systems ‘over the edge’ when under peak load.
And, one of the most common places where this gotcha manifests is when it comes to index maintenance—something that can generate HUGE amounts of IO. (And this is commonly done when indexes are being too frequently or over-zealously being rebuilt.)
The good news is that these problems can all be pretty easily avoided IF you have a solid understanding of your existing workloads and their hardware utilization characteristics. And, if you don’t, then you can simply profile your workloads over a period of a few days and watch for key indicators like max IO throughput rates and any potential bottlenecks—then plan accordingly.
Questions? Finally, as I mentioned in my presenation, if you have any questions or concerns that weren't addressed as part of the presentation or during the Q&A segment, feel free to reach out and ask me directly. My contact information can be found on my site.