High-availability options for Microsoft SQL Server have stabilized since AlwaysOn Availability Groups were released in Enterprise Edition SQL Server 2012. Yes, there have been modifications since then (a limited version of AlwaysOn Availability Groups was included in Standard Edition SQL Server 2016, for example), but the high-availability offerings that include log shipping, clustering and mirroring seem to have hit maturity. To most effectively leverage the capabilities, database administrators must have an understanding of the key issues involved with working with SQL Server Agent and AlwaysOn Availability Groups.
High-Level Explanation of AG Architecture
It’s important to at least have a cursory understanding of what AlwaysOn Availability Groups are before proceeding. Availability Groups, or AGs, consist of two or more separate servers hosting an instance of SQL Server. These servers are referred to as replicas. Each replica hosts a copy of one or more databases that participate in an Availability Group. For any given AG, only one replica will host the databases in a readable and writeable state; that replica is referred to as the primary replica. The other replicas’ copies of the databases participating in an AG are either readable or non-readable, depending how you structured your AG when building it.
When it comes to failovers, all databases participating in an AG will fail over together even if there is a trigger condition of some kind that arises in only one of the databases that participate in the AG. Failovers can be configured to occur manually or automatically. When an AG failover occurs--particularly in the wee hours of the morning when the DBA team is likely asleep--you want the fewest steps possible to occur to conform to the failover. Adjusting SQL Server Agent jobs’ enabled status should not be one of them.
The Hazards of Inconsistent SQL Job Configuration
SQL Server Agent is not part of the AG. This feature and the underlying databases that host the metadata used in its processes--namely msdb and master--exist independently of the AG on each replica. I have supported environments in which a SQL Server instance has jobs that are scheduled to execute through SQL Server Agent and should only be run against the writeable copy of the databases or against a database hosted outside the AG on the primary replica.
I’ve seen this accomplished by creating the job identically on each replica but enabling the job only on the current primary replica. The issue with this method is that the current primary is never guaranteed to always be the primary replica. If that was the case, then there would be no need for high-availability constructs like AGs. In this model, a failover occurs and then requires intervention by the DBA to manually disable the job on the old primary replica while enabling it on the new primary replica. This manual intervention is fraught with issues, including possibly hundreds or thousands of jobs all requiring this intervention or overlooking a job in the manual enablement process.
It is also likely that the AG may start to “flap” at some point in its life cycle: failing over and back. Keeping up with a flapping AG is untenable. This is why I’ve taken a different approach to working with SQL Agent jobs when it comes to Availability Groups: Create the jobs and enable them identically on every replica in the AG, but enable a smart workflow to test for whether the replica is the current primary before executing the job code.
A Simple Smart Workflow for SQL Agent Jobs and Availability Groups
To accomplish this mode of SQL Agent Job creation, I employ a function to test for whether a database participating in an AG is writeable. If it is, then proceed through the remainder of the SQL Agent Job steps. If it’s not, then toss an error that is not raised to the end user or application and exit the job gracefully.
There are only two requirements to create this smart workflow process:
- Creation of a user-defined function to test for the writeable state of a database
- An initial step in the SQL Server Agent Job that calls this function. The On Success action for the step is to proceed to the next step in the job. The On Failure action is to exit the job reporting success. This prevents any alerting you may have set up natively or via third-party tools from firing when you’re using this methodology. The job didn’t truly “fail,” after all; you simply detected it should not run and exited gracefully.
The following code is what I use for this litmus test for updatability of a given database:
CREATE FUNCTION [dbo].[fn_is_writeable_replica] (@dbname sysname)
WITH EXECUTE AS CALLER
DECLARE @is_writeable BIT;
IF EXISTS(SELECT 1 FROM master.sys.databases WHERE [name] = @dbname)
IF (DATABASEPROPERTYEX(@dbname, 'Updateability') <> 'READ_WRITE')
SELECT @is_writeable = 0
SELECT @is_writeable = 1
SELECT @is_writeable = 0
The crux of the code examines the value for the updateability property for the database you pass into it. If the database is updateable (that is, writeable), the function returns a value of 1. Otherwise, it returns the value of 0. When you call this function in the first job step (see screen shots below), if the call returns 1 the job will continue through to the next step(s). If a 0 value is returned the job step fails and exits the job in a graceful fashion.
In the code block above I chose to create the function in the master database. In the real world I tend to have a user database that exists on each server and is used as a repository of procedures and functions that I use without cluttering the master.
With the function created above in all replicas that participate in the AG, it’s time to employ it in the SQL Agent Job. I’ll illustrate that configuration through the following screen shots.
As shown in Figure 1 below, I inserted a new step before the original step one and named it “Litmus Test.” I also made sure to change the starting step of the job to the new step.
Figure 1. Job Steps
Inside of the new job step I coded a call to the fn_is_writeable_replica function and passed in the database name I wanted to use as my check. This name should be the name of the AG-participating database you intend to perform actions against later in the script. This is illustrated in Figure 2. The code is also provided here for copy/paste purposes:
DECLARE @is_prime bit;
SELECT @is_prime = dbo.fn_is_writeable_replica('sql_cruise_db');
IF @is_prime = 0
PRINT 'EXITING GRACEFULLY';
THROW 51000, 'This is not a writeable replica', 1;
Figure 2. Litmus Test Step Code
The last action to employ is creating the correct workflow for step outcomes as denoted in Figure 3. We want to have the job continue on discovery of a writeable database and exit if not.
Figure 3. Step Workflow
Inserting the above step in front of an existing job and changing the starting step to use the new “litmus test” step can be integrated into any SQL Agent Job you currently employ against a database hosted in any AG. There is very little additional work, and you don’t need to start from scratch. This has saved my team hours and has prevented job failures ever since we switched to this model.