Now that we’ve outlined the process to let servers in a SQL Server AlwaysOn Availability Group "talk to each other" by means of setting up linked servers, it’s possible to set up some additional or improved checks on Availability Group Health.
Why Simple Failover and Data Flow Alerts Aren’t Enough
In post #5 of my ongoing series, I outlined how to set up Alerts for when Failover Occurs--or for when data flow changes occur. Those kinds of alerts are highly recommended but don’t paint the entire picture of your Availability Group’s health. For example, assume you’ve got a "simple" two-node Availability Group. For starters, if you truly only have two nodes in your Availability Group, then you’ve “done it wrong": What I mean is that while you can have Availability Groups with only two active SQL Server Nodes, you’ll want or need a third server participating in your cluster to help establish quorum (and stop and squabbles or disputes should Server1 think it’s Spartacus when it detects "link" problems with Server2, which also detects similar communications issues and wants to determine that it’s Spartacus, as well).
In such a simple two-node setup, if Server1 goes down, you’ll get an alert about data flow and failover--as per the alerts defined in post #5. Likewise, if Server1 stays healthy and isn’t having any problems but Server2 crashes, then you should, hopefully, get information about a change in data flow/synchronization. However, if your third node (that is, a "witness node" that doesn’t need to run SQL Server but that helps determine quorum) goes down, then you’re running in a potentially compromised situation where a perfect storm could cause a squabble between Server1 and Server2 . And, without a neutral server to settle the squabble, you could end up in serious trouble.
Setting Up Additional Health Checks
Rather than run the risk of one of your cluster nodes silently falling down without you knowing about it, a better practice is to set up regularly scheduled health checks--to ask SQL Server how all members in the Availability Group’s underlying cluster are doing and to raise a report or alert if anything is amiss.
Technically speaking, you could run this on any/all of the SQL Server nodes in your Cluster/Availability Group, but a better approach is to just set "affinity" for this check to have it run on whatever SQL Server node is currently hosting your Availability Group (or one of your AGs, if you have multiples).
Pulling this all off is pretty simple given much of the logic I’ve outlined in previous posts (that is, about how to detect and arbitrarily run code on just a single server), as well as when we take advantage of a number of DMVs designed to showcase not only Availability Group health and data flows at the SQL Server level, but across all members of the cluster itself.
As an example of how you can check up on the state of all Cluster Node members (and not just the SQL Server nodes in your cluster), the following stored procedure represents a simple example of how you can work your way through a series of progressively more detailed health checks and report on any problems or issues--on anything ranging from no-primary member of the AG defined, to sick/non-healthy members of the quorum, and all the way up to issues with synchronization.
USE master; GO IF OBJECT_ID('dbo.dba_CheckOnAndReportAGStatus','P') IS NOT NULL DROP PROC dbo.dba_CheckOnAndReportAGStatus; GO CREATE PROC dbo.dba_CheckOnAndReportAGStatus @GroupName sysname, -- the name of the availability group to watch @ProfileName sysname, -- the name of the profile being used to send emails @OperatorName sysname -- the name of the operator to notify if there are issues AS SET NOCOUNT ON; DECLARE @errorMessage nvarchar(max); DECLARE @subject nvarchar(300); DECLARE @crlf char(2) = CHAR(13) + CHAR(10); -- 0) Make sure that there's an active PRIMARY -- (and if there is, and it's not this server, then stop further checks): DECLARE @primaryReplica sysname; SELECT @primaryReplica = agstates.primary_replica FROM sys.availability_groups AS ag LEFT OUTER JOIN sys.dm_hadr_availability_group_states as agstates ON ag.group_id = agstates.group_id WHERE ag.name = @GroupName; IF ISNULL(@primaryReplica,'') = '' BEGIN SET @subject = N'[SQL SERVER HIGH AVAILABILITY PROBLEM] - No Active PRIMARY'; SET @errorMessage = N'There is currently no Replica defined as an ONLINE PRIMARY for the following Availability Group: '; SET @errorMessage += @crlf + @GroupName; EXEC msdb..sp_notify_operator @profile_name = @ProfileName, @name = @OperatorName, @subject = @subject, @body = @errorMessage; END; ELSE BEGIN -- otherwise, there IS a good/up primary. If this code is NOT running on the -- primary, then we're done (i.e., short-circuit/escape/return); -- grab a db and check on it: DECLARE @dbName sysname; SET @dbName = ( SELECT TOP 1 database_name FROM sys.availability_databases_cluster WHERE group_id = (SELECT group_id FROM sys.availability_groups WHERE name = @GroupName)); IF(NOT dbo.fn_hadr_database_is_primary(@dbName) = 1) BEGIN -- then bail - we don't need to run any additional checks or tests. RETURN 0; -- report a NORMAL/successful return value (i.e., expected outcome). END; END; -- 1) Check on Cluster/Quorum Status: DECLARE @status varchar(50); SELECT @status = (SELECT TOP 1 quorum_state_desc FROM sys.dm_hadr_cluster); IF @status IS NULL OR UPPER(@status) != 'NORMAL_QUORUM' BEGIN SET @subject = N'[SQL SERVER HIGH AVAILABILITY PROBLEM] - Non Healthy Cluster Status'; SET @errorMessage = N'WSFC Cluster Status Problem.'; SET @errorMessage += @crlf + N'Quorum not detected or not normal. [SELECT * FROM sys.dm_hadr_cluster for more information.]'; EXEC msdb..sp_notify_operator @profile_name = @ProfileName, @name = @OperatorName, @subject = @subject, @body = @errorMessage; END; -- 2) Check on Cluster Members: DECLARE @nonUp int; SELECT @nonUP = COUNT(*) FROM sys.dm_hadr_cluster_members WHERE member_state != 1; -- UP IF @nonUp > 0 BEGIN SET @subject = N'[SQL SERVER HIGH AVAILABILITY PROBLEM] - Non-Operational Members'; DECLARE @members nvarchar(200) = N''; SELECT @members += member_name + N',' FROM sys.dm_hadr_cluster_members WHERE member_state != 1; SET @errorMessage = N'The Following Availability Group Hosts/Servers are currently Down: '; SET @errorMessage += @crlf + LEFT(@members, LEN(@members) - 1); EXEC msdb..sp_notify_operator @profile_name = @ProfileName, @name = @OperatorName, @subject = @subject, @body = @errorMessage; END; -- 3) Check on Synchronization Status: DECLARE @nonHealthy int; SELECT @nonHealthy = COUNT(*) FROM sys.dm_hadr_availability_group_states WHERE synchronization_health != 2; IF @nonHealthy > 0 BEGIN SET @subject = N'[SQL SERVER HIGH AVAILABILITY PROBLEM] - Non-Synchronized Status'; SET @members = N''; SELECT @members += ags.name + ' : ' + hags.synchronization_health_desc + ',' FROM sys.dm_hadr_availability_group_states hags INNER JOIN sys.availability_groups ags ON ags.group_id = hags.group_id WHERE hags.synchronization_health != 2; SET @errorMessage = N'The following Availability Groups have the following (non-healthy/non-synchronized) Synchronization States: '; SET @errorMessage += @crlf + LEFT(@members, LEN(@members) - 1); EXEC msdb..sp_notify_operator @profile_name = @ProfileName, @name = @OperatorName, @subject = @subject, @body = @errorMessage; END; RETURN 0; GO
This stored procedure relies upon the UDF dbo.fn_hadr_database_is_primary, which I provided in post #7: Detecting Primary Replica Ownership. So, if you end up using the script/sproc above, make sure you grab the UDF, as well.
To set up the sproc for regular checks on your Cluster and AG Health Statuses, you can create a simple SQL Server Agent Job (with a name like “Regular AG Health Checkup”) and drop the following in as the command to execute, say, every 1 to 5 minutes. (This script takes all of 0 seconds to run each time it’s executed, so it’s safe to run it every minute.)
EXEC master.dbo.dba_CheckOnAndReportAGStatus @GroupName = N'Name Of AG To Watch Here', @ProfileName = N'General', @OperatorName = N'Alerts';
Where @GroupName is the name of the Availability Group you want to watch – like ‘SSV’, ‘Production’, ‘Widgets’, ‘MyFirstAG’ or whatever.
And, of course, once you deploy it to one of the SQL Servers hosting your Availability Group(s), you’ll need to deploy it to all other servers hosting your AG, as well.
Part 1: Introduction
Part 2: Putting AlwaysOn into Context
Part 3: Defining Batch Jobs
Part 4: Synchronizing Server-Level Details
Part 5: Setting Up Failover Alerts
Part 6: High-Level Options and Rationale for Handling Batch Jobs
Part 7: Detecting Primary Replica Ownership
Part 8: Dynamic Detection Gotcha #1
Part 9: Dynamic Detection Gotcha #2
Part 10: Dynamic Detection Gotcha #3
Part 11: A Fork in the Road
Part 12: Issues with Enabling and Disabling Batch Jobs
Part 13: Creating and Populating a Batch Job State Table
Part 14: Establishing Synchronization Checks
Part 15: Setting Up Linked Servers
Part 16: Job Synchronization Checkup Logic
Part 17: Creating Jobs to Check on Synchronization