Practical SQL Server
AlwaysOn Availability Groups and SQL Server Jobs, Part 18: Health Checks for Availability Groups

AlwaysOn Availability Groups and SQL Server Jobs, Part 18: Health Checks for Availability Groups

Now that we’ve outlined the process to let servers in a SQL Server AlwaysOn Availability Group "talk to each other" by means of setting up linked servers, it’s possible to set up some additional or improved checks on Availability Group Health.

Why Simple Failover and Data Flow Alerts Aren’t Enough

In post #5 of my ongoing series, I outlined how to set up Alerts for when Failover Occurs--or for when data flow changes occur. Those kinds of alerts are highly recommended but don’t paint the entire picture of your Availability Group’s health. For example, assume you’ve got a "simple" two-node Availability Group. For starters, if you truly only have two nodes in your Availability Group, then you’ve “done it wrong": What I mean is that while you can have Availability Groups with only two active SQL Server Nodes, you’ll want or need a third server participating in your cluster to help establish quorum (and stop and squabbles or disputes should Server1 think it’s Spartacus when it detects "link" problems with Server2, which also detects similar communications issues and wants to determine that it’s Spartacus, as well).

In such a simple two-node setup, if Server1 goes down, you’ll get an alert about data flow and failover--as per the alerts defined in post #5. Likewise, if Server1 stays healthy and isn’t having any problems but Server2 crashes, then you should, hopefully, get information about a change in data flow/synchronization. However, if your third node (that is, a "witness node" that doesn’t need to run SQL Server but that helps determine quorum) goes down, then you’re running in a potentially compromised situation where a perfect storm could cause a squabble between Server1 and Server2 . And, without a neutral server to settle the squabble, you could end up in serious trouble.

Setting Up Additional Health Checks

Rather than run the risk of one of your cluster nodes silently falling down without you knowing about it, a better practice is to set up regularly scheduled health checks--to ask SQL Server how all members in the Availability Group’s underlying cluster are doing and to raise a report or alert if anything is amiss.

Technically speaking, you could run this on any/all of the SQL Server nodes in your Cluster/Availability Group, but a better approach is to just set "affinity" for this check to have it run on whatever SQL Server node is currently hosting your Availability Group (or one of your AGs, if you have multiples).

Pulling this all off is pretty simple given much of the logic I’ve outlined in previous posts (that is, about how to detect and arbitrarily run code on just a single server), as well as when we take advantage of a number of DMVs designed to showcase not only Availability Group health and data flows at the SQL Server level, but across all members of the cluster itself.

As an example of how you can check up on the state of all Cluster Node members (and not just the SQL Server nodes in your cluster), the following stored procedure represents a simple example of how you can work your way through a series of progressively more detailed health checks and report on any problems or issues--on anything ranging from no-primary member of the AG defined, to sick/non-healthy members of the quorum, and all the way up to issues with synchronization.

USE master;
GO

IF OBJECT_ID('dbo.dba_CheckOnAndReportAGStatus','P') IS NOT NULL
	DROP PROC dbo.dba_CheckOnAndReportAGStatus;
GO

CREATE PROC dbo.dba_CheckOnAndReportAGStatus
	@GroupName			sysname,	-- the name of the availability group to watch
	@ProfileName		sysname,	-- the name of the profile being used to send emails
	@OperatorName		sysname		-- the name of the operator to notify if there are issues
AS
	SET NOCOUNT ON;

	DECLARE @errorMessage nvarchar(max);
	DECLARE @subject nvarchar(300);
	DECLARE @crlf char(2) = CHAR(13) + CHAR(10);

	-- 0) Make sure that there's an active PRIMARY 
	--	(and if there is, and it's not this server, then stop further checks):
	DECLARE @primaryReplica sysname;
	SELECT @primaryReplica = agstates.primary_replica
	FROM sys.availability_groups AS ag
	LEFT OUTER JOIN 
		sys.dm_hadr_availability_group_states as agstates ON ag.group_id = agstates.group_id
	WHERE 
	ag.name = @GroupName;

	IF ISNULL(@primaryReplica,'') = '' BEGIN 
		
		SET @subject = N'[SQL SERVER HIGH AVAILABILITY PROBLEM] - No Active PRIMARY';

		SET @errorMessage = N'There is currently no Replica defined as an ONLINE PRIMARY for the following Availability Group: ';
		SET @errorMessage += @crlf + @GroupName;

		EXEC msdb..sp_notify_operator 
			@profile_name = @ProfileName, 
		    @name = @OperatorName, 
		    @subject = @subject,
		    @body = @errorMessage;

	  END;
	ELSE BEGIN 
		-- otherwise, there IS a good/up primary. If this code is NOT running on the 
		-- primary, then we're done (i.e., short-circuit/escape/return);
		
		-- grab a db and check on it:
		DECLARE @dbName sysname;
		SET @dbName = (
			SELECT TOP 1 database_name 
			FROM sys.availability_databases_cluster 
			WHERE group_id = (SELECT group_id 
				FROM sys.availability_groups 
				WHERE name = @GroupName));

		IF(NOT dbo.fn_hadr_database_is_primary(@dbName) = 1) BEGIN   
			-- then bail - we don't need to run any additional checks or tests. 
			RETURN 0; -- report a NORMAL/successful return value (i.e., expected outcome).
		END;

	END;

	-- 1) Check on Cluster/Quorum Status:
	DECLARE @status varchar(50);
	SELECT @status = (SELECT TOP 1 quorum_state_desc FROM sys.dm_hadr_cluster);
	IF @status IS NULL OR UPPER(@status) != 'NORMAL_QUORUM' BEGIN 
		
		SET @subject = N'[SQL SERVER HIGH AVAILABILITY PROBLEM] - Non Healthy Cluster Status';

		SET @errorMessage = N'WSFC Cluster Status Problem.';
		SET @errorMessage += @crlf + N'Quorum not detected or not normal. [SELECT * FROM sys.dm_hadr_cluster for more information.]';

		EXEC msdb..sp_notify_operator 
			@profile_name = @ProfileName, 
		    @name = @OperatorName, 
		    @subject = @subject,
		    @body = @errorMessage;
	END;

	-- 2) Check on Cluster Members:
	DECLARE @nonUp int;
	SELECT @nonUP = COUNT(*) FROM sys.dm_hadr_cluster_members WHERE member_state != 1; -- UP
	IF @nonUp > 0 BEGIN 
		SET @subject = N'[SQL SERVER HIGH AVAILABILITY PROBLEM] - Non-Operational Members';

		DECLARE @members nvarchar(200) = N'';
		SELECT @members += member_name + N',' FROM sys.dm_hadr_cluster_members WHERE member_state != 1;

		SET @errorMessage = N'The Following Availability Group Hosts/Servers are currently Down: ';
		SET @errorMessage += @crlf + LEFT(@members, LEN(@members) - 1);

		EXEC msdb..sp_notify_operator 
			@profile_name = @ProfileName, 
		    @name = @OperatorName, 
		    @subject = @subject,
		    @body = @errorMessage;
	END;

	-- 3) Check on Synchronization Status:
	DECLARE @nonHealthy int;
	SELECT @nonHealthy = COUNT(*) FROM sys.dm_hadr_availability_group_states WHERE synchronization_health != 2;
	IF @nonHealthy > 0 BEGIN 
		SET @subject = N'[SQL SERVER HIGH AVAILABILITY PROBLEM] - Non-Synchronized Status';

		SET @members = N'';
		SELECT @members += ags.name + ' : ' + hags.synchronization_health_desc + ','
		FROM sys.dm_hadr_availability_group_states hags
		INNER JOIN sys.availability_groups ags ON ags.group_id = hags.group_id
		WHERE hags.synchronization_health != 2; 

		SET @errorMessage = N'The following Availability Groups have the following (non-healthy/non-synchronized) Synchronization States: ';
		SET @errorMessage += @crlf + LEFT(@members, LEN(@members) - 1);

		EXEC msdb..sp_notify_operator 
			@profile_name = @ProfileName, 
		    @name = @OperatorName, 
		    @subject = @subject,
		    @body = @errorMessage;

	END;

	RETURN 0;
GO

This stored procedure relies upon the UDF dbo.fn_hadr_database_is_primary, which I provided in post #7: Detecting Primary Replica Ownership. So, if you end up using the script/sproc above, make sure you grab the UDF, as well.

To set up the sproc for regular checks on your Cluster and AG Health Statuses, you can create a simple SQL Server Agent Job (with a name like “Regular AG Health Checkup”) and drop the following in as the command to execute, say, every 1 to 5 minutes. (This script takes all of 0 seconds to run each time it’s executed, so it’s safe to run it every minute.)

EXEC master.dbo.dba_CheckOnAndReportAGStatus
	@GroupName = N'Name Of AG To Watch Here', 
	@ProfileName = N'General',
	@OperatorName = N'Alerts';

Where @GroupName is the name of the Availability Group you want to watch – like ‘SSV’, ‘Production’, ‘Widgets’, ‘MyFirstAG’ or whatever.

And, of course, once you deploy it to one of the SQL Servers hosting your Availability Group(s), you’ll need to deploy it to all other servers hosting your AG, as well.

Part 1: Introduction
Part 2: Putting AlwaysOn into Context
Part 3: Defining Batch Jobs
Part 4: Synchronizing Server-Level Details
Part 5Setting Up Failover Alerts
Part 6: High-Level Options and Rationale for Handling Batch Jobs
Part 7: Detecting Primary Replica Ownership
Part 8: Dynamic Detection Gotcha #1
Part 9: Dynamic Detection Gotcha #2
Part 10: Dynamic Detection Gotcha #3
Part 11: A Fork in the Road
Part 12: Issues with Enabling and Disabling Batch Jobs
Part 13: Creating and Populating a Batch Job State Table
Part 14: Establishing Synchronization Checks
Part 15: Setting Up Linked Servers
Part 16: Job Synchronization Checkup Logic
Part 17: Creating Jobs to Check on Synchronization

Hide comments

Comments

  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.
Publish