AlwaysOn Availability Groups and SQL Server Jobs, Part 6: High-Level Options and Rationale for Handling Batch Jobs

As a recap, up to this point we’ve defined three different kinds of SQL Server Agent Jobs (or activities) that need to be addressed when SQL Server AlwaysOn Availability Groups are deployed:

Server/Instance-Level Jobs. Or jobs that are run against the server itself—for things like SQL Server Agent job history truncation, index maintenance/corruption checks (typically fired against all active databases on the server), etc.
Availability Group (AG) Level Jobs. Or jobs that specifically target databases that are part of an Availability Group—or jobs which I’m calling ‘batch jobs’ for the sake of these posts.
Backups. Which are fairly similar in some ways to server/instance-level jobs—but not quite the same (as they share some similarities to jobs that target AG-level databases).

In this post, we’ll focus on AG-level jobs or batch jobs and options that we have for executing them.

Part 1: Introduction
Part 2: Putting AlwaysOn into Context
Part 3: Defining Batch Jobs
Part 4: Synchronizing Server-Level Details
Part 5: Setting Up Failover Alerts

Executing Batch Jobs

At a very high level, we have a couple of options for how we could handle batch jobs when AGs are thrown into the mix.

One option we could take is simply to do nothing—or not think ahead, plan ahead, or address what could and would happen should a failover occur. And while it’s almost ridiculous for me to even throw this out as an option, the reality is that I’m guessing that someone—somewhere—is probably doing this already. Possibly not—because AGs are a tiny bit more involved to set up than database mirroring topologies. But, I happen to know for a fact that gobs of organizations commonly set up database mirroring as an high availability (HA) solution, and don’t ever really do ANYTHING to address what happens to their jobs if/when they failover to their secondary server (at least until it’s too late or until after they’ve failed over a few times).

Apart from that non-option, there are really, then, two approaches we could take to tackling SQL Server Agent Jobs that need to execute against databases that are part of an AG:

TARGETING: With this line of thought—or approach to tackling batch jobs—the idea is that if we can somehow get each job to know if it’s running against the active replica, then our job is mostly done. In other words, if we can basically inject some logic into the start or beginning of each SQL Server Agent Job (or Job Step) that says "If this code is running on a host that owns the primary replica for the target database/Availability Group, then go ahead and run—and if NOT, then go ahead and just short-circuit or exit execution." If that were possible, then Jobs management across servers would be pretty easy—you’d just need to modify existing jobs to do an if check at the start of execution, and then just ensure that this code/logic was synchronized or replicated across all servers (and stayed that way through subsequent changes and modifications to jobs/schedules/etc.).
DYNAMICALLY ENABLING/DISABLING: If it’s not possible (or feasible) to inject targeting (or if-checks) into existing jobs, then—unfortunately – we need to take a bit of a different approach: one where we dynamically enable or disable jobs based upon whether they hosted on the same server or not as the Active Replica of the targeted database. In this sense, a bit of logic or code could be used to make sure that jobs hosted/found on servers that are NOT hosting the primary replica are disabled—that logic’s easy enough. By the same token, however, we can’t assume that jobs targeting an AG database found on server hosting the primary replica should all, by the same logic, be enabled – because such an approach wouldn’t account for situations where a job had been explicitly disabled and should then, by definition, be disabled on a new host if/when a failover occurs. Consequently, the only real way to tackle the ability to dynamically enable/disable jobs means that a state table or some other mechanism of capturing, defining, and enforcing job enabled/disabled states needs to be thrown into the mix (thus complicating things and adding an additional moving part into your HA solution—which you typically want to keep as simple as possible).

Given the concerns with the overhead involved with dynamically managing Enabled/Disabled states of jobs, it’s pretty obvious that an approach that favors targeting would be preferred. Sadly, however, this approach is—in my experience and perspective—actually much harder to pull off given some limitations of how SQL Server works. As such, the solution that I’ve ended up implementing is actually based on dynamically enabling/disabling jobs. But in subsequent posts I’ll outline (at length) some of the pitfalls and reasons why targeting doesn’t work.

Synchronization is a Must

Otherwise, the key thing to note here is that SYNCHRONIZATION of job details is non-negotiable. Without it, changes to job details (such as scheduling, enabled/disabled states, owners, job-steps, etc.) would only be made on one server (typically/logically the Primary server) and wouldn’t be replicated into place on other servers and jobs therefore wouldn’t run as desired (or at all) once a failover occurred—meaning that we’d be right back at more or less square one or in the same state as if we hadn’t done anything to address job execution once AGs were thrown into the mix. As such, once I cover reasons why targeting doesn’t work, subsequent blog posts will then cover mechanisms and details for synchronization.

Next: AlwaysOn Availability Groups and SQL Server Jobs, Part 7: Detecting Primary Replica Ownership

Comments

Plain text