Note: These three gotcha blog posts are somewhat optional in the sense that if you're reading along and are just interested in a solution, these posts won't help too much. Instead, they're designed to show WHY trying to dynamically detect whether to execute a job or not (based upon whether the server it's running on hosts the primary replica or not) is problematic.
In my last post, I outlined why trying to dynamically inject a USE myTargetDatabase; statement simply won't work in SQL Server without doing something ugly like using redirection from within a sproc or using dynamic SQL. I also alluded to the notion that you could interject a Job Step into your Jobs and have this new/injected Job Step do some checking for you. In this blog post, we’ll look at everything that's involved in such an approach, and along the way, it'll become abundantly clear why this approach is just not even close to an optimal solution for something you'd want to deploy alongside your AlwaysOn Availability Groups.
Dynamic Checking Within SQL Server Agent Jobs
At its most basic, the technique I'm outlining here is one where you simply interject a new SQL Server Agent Job step into/before your existing jobs and have that new Job Step run some code to determine if the code in subsequent steps should be run or not. Or, in other words, since running an IF check inside your Job Step won't work because we can't access or target (non Primary) databases that are in the restoring or synchronizing state—and because we can't dynamically execute a USE statement—then the technique outline here effectively just moves that check to another job step entirely.
Assume that your job initially looks like the following—with just a single step (which is the step I outlined in previous posts):
To inject a check before this job step, open up your SQL Server Agent Job and switch to the Steps node. Then, click on the Insert button down near the bottom of the Steps pane, and interject or Insert a new Job step into the mix.
Make sure the Job Step is set to run in the master database, and then give it some logic similar to what follows—where we'll raise or throw an error if we're not on the Server hosting the Primary:
Then, since this IS a hack, you'll have to switch to the Advanced tab for the job step in question, and then specify that you’d like this Job step to continue to the NEXT job step upon success, and QUIT the job (reporting success) when the job step fails:
Again, this IS quite a hack.
In fact, it's enough of a hack that you'll also have to go in and make sure that the Job actually starts with Job Step #1 (even though you pushed the Insert button to inject this step into the mix):
At this point, though, your Job Steps should look similar to the following:
With those changes made, you can then actually go into your real Job Step (or the one that existed previously), and tweak it so that it no longer needs any of the IF/ELSE checks that would be present if you were following along from previous posts. Or, in other words, since Job Step #1 in this job CHECKS to see if we should keep running, then IF we get to Job Step #2, we don't need to do any checks, and we can target the MyAGDatabase and have our code/logic run as expected and without the need to wrap it in any checks, as follows:
Finally, you'll need to push this updated job out to both servers (i.e., Primary and Secondary servers in your AG).
The Real Problem(s) with this Approach
At this point, we’ve jumped through a few hoops to interject a pre-execution check: either we're on the server hosting the Primary database in our AG and we'll keep processing—or we're not and we'll get an error in Job Step #1 and then the job will terminate gracefully. Overall, that’s a few more hoops than I’d prefer to jump through, especially since setting up something like this carries the negatives of being a bit counter-intuitive (i.e., someone on-call or troubleshooting failures on/with this Job might not be able to immediately grok what’s been done here—or why).
The bigger issue, though, is what you end up with if you take this approach.
For comparison, here's what you get when looking at the Job’s execution History on the server hosting the Primary replica for the target database:
To me that looks fine—or, frankly, perfect. The job is running every 10 minutes as scheduled and both job steps ae clearly running without any issues.
Compare that to what you get when you look at the Job’s history on a non-primary server:
This is what I would describe as HIDEOUS. Yes, the job IS executing correctly every 10 minutes, exactly as expected. AND, if you’ve got alerts or notifications set up to tell you if/when this job fails, well . . . the job hasn't failed so you won't be getting any alerts. So, that's all working fine and as expected.
The problem, however, is that if you're checking up on this server and trying to get a feel for how a job is behaving, the last thing you want to see are a bunch of WARNINGS in the Jobs history or list of outcomes. To me that's way too much NOISE vs. signal—as my personal preference with alerting and notifications is to only raise alerts and/or to only flag warnings or problems when there are, well, problems or issues that need human intervention and attention. As such, a database administrator (DBA) taking a quick view of job histories/outcomes on a server with this Job deployed would almost always initially think, "Woah? Wait? What's going on?" when seeing all those yellow warning icons—which is something I wouldn't want to push into production.
So, given that this approach works but: a) is fairly non-intuitive; b) requires us to jump through too many hoops (meaning that if we miss a step along the way things might not run as expected); and c) causes the potential for you or other members of your team to think that something is BUSTED (when it's not), I’m not a fan of this approach. Instead, the best approach that I can see is to dynamically enable or disable jobs based upon whether or not the servers they're living on are currently hosting the target/Primary replica or not. So, that's what we’ll look at in subsequent posts.
Part 1: Introduction
Part 2: Putting AlwaysOn into Context
Part 3: Defining Batch Jobs
Part 4: Synchronizing Server-Level Details
Part 5: Setting Up Failover Alerts
Part 6: High-Level Options and Rationale for Handling Batch Jobs
Part 7: Detecting Primary Replica Ownership