AlwaysOn Availability Groups and SQL Server Jobs, Part 2: Putting AlwaysOn into Context

AlwaysOn is a marketing term used to describe two different high-availability (HA) and disaster recovery (DR) solutions from Microsoft: AlwaysOn Failover Cluster Instances (FCIs) or raditional clusters and AlwaysOn Availability Groups (AGs) which are a lot like Database Mirroring 3.0.

At a high level, both FCIs and AGs share the common theme or notion of using multiple SQL Server hosts to distribute a workload as a means of increasing fault tolerance (and in situations where one or more of the hosts involved in handling the load is remotely deployed, can also address concerns around disaster recovery via remote availability).

Despite similar intentions and high-level goals (and despite the fact that both AGs and FCIs are heavily reliant upon Windows Server Failover Clustering, or WSFC), the way in which FCIs and AGs tackle the exact specifics of how they enable HA and DR are quite different.

High-Level Differences between AGs and FCIs

Simply put, there are enough architectural and fundamental differences between AGs and FCIs to fill entire books. So I won’t be able to do the topic any justice in a single post. In brief, however, Failover Cluster Instances use clustering technology to create two or more nodes or teamed hosts that coordinate with the domain controller to specify which of those physical nodes will control a virtual IP address, Virtual Network Name, and a virtual SQL Server Instance that spans or straddles all of the hosts—but can be active on only one physical host at a time.

On the other hand, AlwaysOn Availability Groups use the notion of mirroring (or server-to-server communication via endpoints) to keep distinct copies of data on multiple hosts synchronized and then, also, coordinate with the Domain Controller to, optionally, keep Availability Group Listeners pointed at a Read/Write replica for normal database interactions, as well as enable the possibility of having read-only replicas made available for scale-out scenarios.

To help put key differences into a slightly bit better perspective, FCIs and AGs each come with some pros and cons that help identify where they’re best suited:

AlwaysOn Availability Groups

Pro: Each member of the solution has its own copy of data. (Unlike with FCIs—where there’s only a single copy of databases—which represents a single point of failure.)
Con: Since each replica (or copy) of the database is, indeed, distinct, each additional server added into an AG topology group adds +1N the amount of storage required. (i.e., if you have 200GBs of databases, 2 replicas will require 400GBs of total space, 3 replicas 600GB, and so on.)
Pro: Enables scale-out scenarios (or the option to spin up additional (fully-licensed) SQL Server hosts that can enable read-only operations to help improve overall system scalability).
Pro: Typically a bit easier to set up than FCIs.
Con: Usually more expensive to own and manage than FCIs (especially when it comes to logins, jobs, and other server-level settings/configuration).
Con: Only provides fault-tolerance at user-database level.

AlwaysOn Failover Cluster Instances

Con: Non-trivial to set up and configure – because of the need for shared storage (i.e. disk shared between all nodes in the Quorum).
Pro: Since there’s only one copy of the data, overall data storage requirements are less (i.e., just 1N).
Pro: Easier to manage and own than AGs.
Pro: Provides fault tolerance for an entire SQL Server Instance (i.e., server-level details like Logins, Linked Servers, endpoints, Jobs, etc., are all ‘synchronized’ or managed—unlike AGs which can only synchronize configured user-databases).

In addition to the pros and cons of each of those AlwaysOn options, FCIs and AGs both share the same benefits or key strengths that make them worthwhile in the first place:

Automatic Fault Tolerance. Both solutions effectively provide magical fault-tolerance (when properly configured) meaning that if the active node/replica fails at any given time, other members in the cluster will detect this failure—typically within a matter of just a few seconds—and initiate an automatic failover. Failover (or passing the baton from one server to another) typically happens within literally 1-2 seconds at the WSFC level in most cases—but SQL Server will then need to spin up on whichever host has now become the primary and will need to run through the Recovery Process. (In most cases, this should execute in < 60 seconds. See my post on Indirect Checkpoints for more info on this subject.)
Decreased Downtime During Maintenance. Logically, this is the same concern or benefit as automatic fault tolerance. Only, the idea here is that admins can use the ability to force failover from one host to another as a means of implementing rolling upgrades or rolling servicing. In other words, if you need to add more RAM/CPU/etc., to a host, or need to push a SQL Server Service Pack or deploy Windows Patches, etc., the general idea is that you can patch/upgrade the secondary server, bring it back into service and then force a failover to this server (so that it’s now the primary), and then patch/upgrade the previous Primary as well—all as a means of decreasing overall downtime. (Done during off-peak hours SQL Server’s Recovery time can typically be kept down around 10-15 seconds in many cases – meaning only a slight hiccup in up-time while failover is being forced.)
Remote Availability as an Options. Both AGs and FCIs can enable the notion of stretch availability or remote availability as a means of helping move fault-tolerance from being a host-level concern (within a particular data center) to making entire datacenters fault tolerant or redundant. These are very advanced options (that I won’t really address in this series at all)—but they are one of the big draws in implementing both FCIs and AGs.

A Question of Context

So, why go over the differences between FCIs and AGs in a series of posts on SQL Server AlwaysOn Availability Groups and SQL Server Agent Jobs? Simple: If you use Failover Cluster Instances, then all of the concerns we’ll cover in this series of posts about how to get SQL Server Agent Jobs to work when AlwayOn Availability Groups are thrown into the mix are simply a non-issue. Or, in other words, assume that, for whatever reason, we need a job to run against the production version of our Widget database every 10 minutes.

If we use Failover Clustering to achieve High Availability, then if we set up that job to run against the Widget Database, it won’t matter which physical host our SQL Server Instance is running against—every 10 minutes the job will just run (because FCIs provide instance-level protection and synchronization as there’s really only ever one SQL Server Agent active or working and only ever one copy of the msdb database that keeps schedule meta data and state around for scheduling and executing SQL Server Agent Jobs).

On the other hand, if we use a simple 2-node AlwaysOn Availability Group and need our job to fire every 10 minutes against the Widget database, then not only do we have 2 copies (or Replicas) of the Widget Database (one in read/write or primary status and the other in a Recovery status or serving as a warm failover database), but we also have two totally separate instances of SQL Server running—meaning two distinct SQL Server Agents and two totally different and distinct copies of the msdb database with their own state info and meta data describing jobs. The problem, though, is which server’s SQL Server Agent owns or runs this job every 10 minutes? Obviously, we want to have the job fired on the server that’s hosting the Primary Replica. The trick is that getting that to work is a bit more complex and complicated than it might initially appear. Hence, this series of posts.

Using Failover Cluster Instance to Achieve HA

The point of this post, however, is to call out that if you are just looking for high-availability or fault-tolerance within your datacenter (and have a smoke-and-rubble contingency plan, i.e., backups, etc.) for what happens should your datacenter encounter an outage), then you are likely much better off using SQL Server Failover Cluster Instances to achieve HA instead of using Availability Groups.

Availability Groups may be all the rage and they do allow scale-out scenarios (though I think there are much cheaper options for achieving this than using AGs), and they may be a bit easier to set up and configure (in some ways), but they’re hands-down more expensive and troublesome to manage and own. As such, only use them if they 100 percent make sense—not simply because they’re cool or because you don’t want the hassle of setting up a real cluster.

Part 3: Defining Batch Jobs

Comments

Plain text