AlwaysOn Availability Groups and SQL Server Jobs, Part 8: Dynamic Detection Gotcha #1

Note: The next three blog posts are somewhat optional in the sense that if you’re reading along and are just interested in a solution, these posts won’t help too much. Instead, they’re designed to show WHY trying to dynamically detect whether to execute a job or not (based upon whether the server it’s running on hosts the primary replica or not) is problematic.

In previous posts, I’ve outlined that there are effectively two different ways to address SQL Server Agent Jobs that ‘target’ databases that are part of AlwaysOn Availability Groups. First, you can either determine, at run time, whether or not the job should run based on whether or not the code being executed is being run on the server hosting the Primary Replica for the target database or not. Or, second, you can figure out some way to either enable/disable jobs entirely, based upon whether they’re on the server hosting your primary replica or not.

Detection is Easy But Only the Tip of the Iceberg

In this post, we’ll start looking at why the option of dynamically detecting whether or not to run at execution time is quite a bit more problematic than you might think.

Given that it’s pretty much trivial to determine (at run time) whether a given SQL Server instance is the host of a Primary Replica or not (using the two UDFs I defined in my previous point), it would seem pretty logical that if you’ve got a very simple job, that you could just inject if/else checks into your job steps and then your jobs/job steps would only run against your Primary Replicas.

Assume, therefore, a very TRIVIAL batch job—something that is designed to go out every 4 hours and DELETE any values in a MyAgDatabase.dbo.UserActivityLogs table. (Never mind that this might not be the best way to tackle such an issue —I’m just trying to keep the job logic trivial.)

If such a scenario were the case, you’d expect to have a SQL Server Agent Job with a name like, Delete Older User Activity Logs, with a corresponding schedule, and with a single Job Step with details that would look like the following:

And, where you’d note that the job is set to run in the MyAGDatabase which is both a member of an Availability Group (creatively entitled TestAG), and that is also the database where the UserActivityLogs table is found.

With that background in place, imagine that you make the following, simple, change to your Job Step:

Where you’ve simply added in some if/else logic that will dynamically detect—at run time—whether the server that this job is being run on is hosting the Primary replica, or not. Then, assume you’ve only got two servers or hosts in your Availability Group and distribute this single job to both servers (assuming that the logic you’ve just put in place will solve your problems and make sure that the Delete Older Logs job will now intelligently only run on the only server/database it could) the Primary (where read/write operations are permitted).

If you do that, and then let the job run, it’ll do just FINE on the server hosting your Primary Replica. On the secondary server, however, you’ll end up getting errors that look like the following in the Job’s History:

And that’s because when this Job actually runs on the secondary server, the MyAGDatabase database is NOT accessible—it’s in a non-accessible state (because it’s a non-read-only replica, and note: setting it to a read-only replica would require an ADDITIONAL license) and simply can’t be accessed. Accordingly, the SQL Server Agent that is instructed by this Job to use the MyAGDatabase database, simply can’t access the MyAGDatabase database at all. Meaning that it’s not even able to execute the check and, instead, we’re met with a simple error.

Welcome to Gotcha #1. It’s trivial to see what’s going on and why, but the underlying issue betrays the fact that dynamically trying to detect and then EXECUTE code within a database that may or may not be accessible is MUCH harder to pull off than you might think. In the next two posts, we’ll look at some options, or work-arounds, we might think to use in such a scenario and we’ll see why (from my perspective) these work-arounds fall miserably short of what you’d want to use and maintain as part of a production-level solution for managing Batch Jobs on multiple servers when they’re part of an AlwaysOn Availability Group.

Part 1: Introduction
Part 2: Putting AlwaysOn into Context
Part 3: Defining Batch Jobs
Part 4: Synchronizing Server-Level Details
Part 5: Setting Up Failover Alerts
Part 6: High-Level Options and Rationale for Handling Batch Jobs
Part 7: Detecting Primary Replica Ownership

Next: AlwaysOn Availability Groups and SQL Server Jobs, Part 9: Dynamic Detection Gotcha #2

Comments

Plain text