Practical SQL Server
AlwaysOn Availability Groups and SQL Server Jobs, Part 16: Job Synchronization Checkup Logic

AlwaysOn Availability Groups and SQL Server Jobs, Part 16: Job Synchronization Checkup Logic

With Linked Servers configured to allow cross-host or cross-box communication between the servers you’re using to host your AlwaysOn Availability Groups, it’s time to start looking at some scripts that can be used to regularly check whether or not SQL Server Agent jobs are correctly synchronized across all servers in your topology. So, before we actually look at the code and implementation (which we’ll address in my next post), let’s take a high-level look at some of the needs and assumptions going into this process.

Defining a Synchronization Checkup Process

Ultimately, the synchronization checkup process is fairly straightforward in that what we’re doing is trying to prevent a nightmare from occurring by allowing SQL Server Agent Job details such as schedule, permissions, alerts, enabled/disabled, or the commands being executed themselves from getting out of sync from one server to another, as that would be really ugly if/when a failover from one host to another occurred. With that in mind, synchronization checks need to do a couple of things:

  • Check on details of our server or instance-level jobs (things like server maintenance routines and the likes).
  • Check on details for Backups (which can be similar-ish to server-level jobs in terms of scope – but which take on a slightly different aspect or character when we’re typically backing up on a single instance of a database spanning multiple servers).
  • Check on details about batch-jobs—or jobs that specifically target an Availability Group database.
  • Furthermore, depending upon how we’re handling batch-jobs, not only do we want to ensure that logic, schedules, permissions, alerts, and the likes are the same, but we may need to determine whether the job should be enabled/disabled or not on each given server.
  • Likewise, for whatever reasons, we might also want to be able to set up the ability to skip checks (i.e., ignore) certain jobs that we just don’t want to synchronize or bother with (no matter what their scope or no matter what our concerns are).

To address these concerns, in our next post we’ll end up creating two stored procedures—one that checks on server level details for SQL Server Agent jobs, and one that we use to check on the jobs belonging to a single SQL Server AlwaysOn Availability Group (such that if we have, say, 3 different AGs hosted on our servers, we’d end up with a job that executes four sproc calls; one for server level checks, and 3 different calls for each of our AGs.

A High-Level Example of Execution

As I discussed in Post #11: A Fork in the Road, there are two primary ways to tackle batch-jobs: either by Enabling/Disabling entire jobs on a given server (based upon whether it’s currently hosting the Primary Replica or not), or to simply use if/then logic that defers execution into a stored procedure (which is going to be the better option in many cases). In this post, I’ll assume, however, that we’re going with the first option (Enabling/Disabling) as that’s a tougher approach to jobs management, and you may need it if you’re running things like SSIS packages as jobs and so on. Under such an approach, we’ll want to assume that not only do we want our regular synchronization checkups to check on AG-level job synchronization, but that we also want it to toggle enabled/disabled states should we find that a failover has occurred. Of course, as I outlined previously, we’ll also need to account for batch jobs that might’ve temporarily been disabled—and which should STAY disabled even through/after a failover.

Under such parameters, let’s assume we’ve got a single Availability Group, called SSV, and that we’ve got a Batch Job set up and correlated with that AG by means of a SQL Server Agent Job Category called SSV as well and that we’ve already set up and populated a corresponding Jobs State table. Then assume that we’ll have our synchronization checks running every, say, 3 minutes and that (since we’re using the approach of enabling/disabling jobs based upon their affinity with the Primary Replica) we’ve also got logic in place to enable/disable jobs should we find a failover has occurred. If this were the case and everything were running on SERVER1, we’d have executions at, say, 10:03, 10:06, 10:09AM, and so on. Then assume that SERVER1 crashes (for whatever reason) at 10:10AM. Assumer further that we’ve failed over and recovered operations by 10:12AM. At this point, we’ll expect code/logic executing as part of our synchronization checkup job to do the following on SERVER2:

  • Determine that a failover has occurred: i.e. the AG called SSV used to have the Primary Replica hosted on SERVER1, but SERVER2 is now the Primary.
  • Make sure that any AG-Level jobs that
    1. target the SSV database and
    2. are defined by the Jobs State Table as being ‘Enabled’ are now Enabled on SERVER2. Similarly, we should try to make sure that any AG-level jobs targeting the SSV database/AG are disabled on our OTHER servers (SERVER1) in the Availability Group. For our purposes, let’s assume that SERVER1 doesn’t come back online until 11:15AM – because it crashed spectacularly.
  • Try to do Server-Level, backup-level, and AG-level synchronization checks between SERVER2 and SERVER1 to make sure that everything is still synchronized. (And, obviously, if there are more servers involved in your AG, you’ll need to include them as well.)

That’s a lot to pull off, and some of the things we expect to have happen simply can’t or won’t happen—like setting AG-Level batch-jobs to Disabled on SERVER1 or comparing jobs details between SERVER2 and SERVER1—all simply because SERVER1 is dead and non-reachable for about an hour. On the other hand, our synchronization logic should be able to activate the sole SSV-related job on SERVER2, and that’ll mean that it should fire as expected in its new home going forward. Of course, if that SSV batch job was running at a high rate of execution say, once every minute, then there would have been a few minutes (10:10, 10:11, and EVEN 10:12AM) where it didn’t fire because the failover occurred and our toggling logic hadn’t yet fired to activate this job on SERVER2. In most cases, that’s probably going to be okay. Technically speaking, if you’re firing off a batch job every minute or so, chances are that you’re doing it wrong. If it’s not, then you’ll potentially want to address this as needed by increasing the frequency at which you’re firing off these synchronization checks.

In our next post we’ll look at some of the logic involved in these checks and help make all of these abstract concepts a bit more concrete.

Part 1: Introduction
Part 2: Putting AlwaysOn into Context
Part 3: Defining Batch Jobs
Part 4: Synchronizing Server-Level Details
Part 5Setting Up Failover Alerts
Part 6: High-Level Options and Rationale for Handling Batch Jobs
Part 7: Detecting Primary Replica Ownership
Part 8: Dynamic Detection Gotcha #1
Part 9: Dynamic Detection Gotcha #2
Part 10: Dynamic Detection Gotcha #3
Part 11: A Fork in the Road
Part 12: Issues with Enabling and Disabling Batch Jobs
Part 13: Creating and Populating a Batch Job State Table
Part 14: Establishing Synchronization Checks
Part 15: Setting Up Linked Servers

Hide comments

Comments

  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.
Publish