AlwaysOn Availability Groups and SQL Server Jobs, Part 29: Practical Implementation Tips

My initial goal in writing this series of posts was to outline some of the concerns surrounding Availability Groups (AGs) and SQL Server Agent Jobs – and call out how there is virtually no guidance from Microsoft on this front and then detail some of the pitfalls and options available for tackling this problem domain. I initially expected this series of posts to have between 25 and 30 posts – according to some of the early outlines I created ‘way back when’. As such, anyone who’s followed along with all of the posts in this series will have a better idea of some of the core concerns, pitfalls, and options available to them in tackling AGs and Jobs. But for anyone coming ‘late to the party’ and who is just looking for a set of instructions on how to best tackle AGs, this series of posts doesn’t end up with very much concrete, or practical, guidance. As such, that’s something I wanted to try to tackle a bit better in this final post.

Start at the Top

To correctly tackle how to manage SQL Server Agent Jobs in high-availability scenarios, you need to start by ensuring that Availability Groups are your best choice – as outlined or discussed in post #2 – Putting AlwaysOn Into Context – because a Failover Cluster Instance is going to make jobs management effectively trivial in comparison to what you’ll find is the case when Availability Groups are involved. Otherwise, once you’re sure that AGs are what you need, you’ll then need to do a ‘survey’ of what kinds of jobs and specific concerns you’ll need to manage:

Backups. Unless your backups are being 100% executed from a 3^rd party solution with its OWN scheduling functionality, you’ll be executing backups via the SQL Server Agent – so ‘backup jobs’ are effectively a given that you’ll need to account for. Happily, they’re pretty easy to address – as outlined in posts 19 – 25.
Batch Jobs. In this series of posts I define Batch Jobs as SQL Server Agent jobs that are run against your target databases to execute some sort of business need or evaluate business rules, or perform ‘logical maintenance’ on your database – something DIFFERENT than what you’d expect from normal DBA-level ‘maintenance tasks’. (See Post #3 – Defining Batch Jobs for more information) If you’ll be supporting these kinds of jobs, things will get a bit more complicated – as I’ll outline below.
Maintenance Jobs. Index Maintenance, DBCC CHECKS, and other ‘system level’ jobs that target user databases will be something that you’ll have to address too. Likewise, you should have jobs in place to truncate backup history, jobs history, and do other bits of key maintenance at both the server level and against the msdb database. These kinds of jobs you’ll be running on all of your Availability Group hosts – unlike Batch Jobs which you’ll only want to fire off (usually) against the Primary Replica (unless you’re executing read-only reports or something).
SSIS / ETL Jobs. Further, if you’ve got ETL jobs (or any jobs powered by SSIS), then you’ll need to tackle some additional details when it comes to jobs management. Obviously, any of these jobs should be modified such that they now target the AG Listener (instead of targeting servers/databases directly), but you’ll still end up having to tackle the notion of WHERE these jobs end up being executed. Technically, you could execute them ‘off box’ – or out on, say, a ‘Master Server’ somewhere that’s not a part of the Availability Group (see Post # 26 for more insights), or you’ll need these jobs to run on ONE of your AG hosts. And, if that’s the case, you’re either going to need to modify the SQL Server Agent jobs ‘hosting’ these ETL/SSIS jobs – as per Post #10 – to include ‘if/else’ logic in the SQL Server Agent job itself, OR you’re going to have to modify your SSIS packages – so that they’ll only run/execute when they’re run on the desired host – as per Post #23.

Finally, you’ll also have to determine what you’ll need to do to address the notion of Disabling/Enabling Jobs and how you’ll keep jobs synchronized between AG hosts. (Because if you fail to keep job details synchronized between servers, failover from one host to another could end up being anything from a serious pain all the way up to a full-blown nightmare/disaster.) Accordingly, you’ll want to check out Post#28 – where I outline some options for how you’ll manage both of those concerns.

Putting it All Together

Ultimately, the path you take really depends upon your needs. For example, if the ONLY jobs you need to manage are backups, then you can simply setup backup preferences (as per posts #21 and #22), and then either set up a policy that ENSURES that any change made to backups on one server MUST be replicated to all other servers in the Availability Group or, if you’re a bit skeptical that such a policy would ALWAYS be 100% followed/enforced, you’ll probably want to look into setting up synchronization checks – or regularly scheduled jobs that’ll let you KNOW if the details about specific jobs fall out of ‘synchronization’ with each other across servers. (For more details on this need, see posts #14 – 17 if you’re just dealing with a ‘simple’ 2-node AG, and if you’re dealing with > 2 nodes, you’ll want to check out post #27 for some ideas on how to ‘tweak’ synchronization checks to account for multiple servers.)

Otherwise, if you’ve got batch jobs, and feel that you’ll want or need the ability to DISABLE some of these jobs (rather than fully removing them from all servers/nodes and re-adding them if/when they should be turned back on), then you’ll want to look into the notion of building a Jobs ‘state table’ – or something that’ll track whether or not a SQL Server Agent job ‘assigned’ to an Availability Group should be enabled or not. (Otherwise, without a state table, assume that you Disable the ‘Delete Empty Shopping Carts’ job on one your primary server at one point because (let’s say) the analytics group wants to grab data over a longer period of time. Then, assume that you failover from NodeA to NodeB – where there was an identical copy of this job that was already disabled because it was on the non-active node. At this point, UNLESS you’ve got a state table (or something similar), your ‘Failover Response Logic’ will likely just enable all disabled jobs on NodeB (assigned to the AG in question) and now this job will be ‘accidentally’ running when it logically should not be.) For information on these needs, make sure to check out Posts #12 and #13.

Finally, if you’d like a list of “If this, then that” definitions for what you could or should do, I thought about creating a post that would try to outline those details to some degree or another. But, there were two problems with that idea. First, there are a LOT of different needs and ways to meet those needs – which would mean a post outlining those needs and their targeted solutions would have been pretty massive and complex. Second, I didn’t set out to write the DEFINITIVE guide to EXACTLY how to handle Jobs and Availability Groups. Instead, I set out to outline that there ARE a number of non-trivial problems in this area – that Microsoft has completely and utterly ignored – and sought to provide some ideas, rationale, and explanations for ways I could think of to tackle some of these needs. Along the way, I’ve gotten a number of great comments and emails about OTHER options and approaches – and I think that’s all simply indicative of the fact that there are MULTIPLE ways to tackle this problem – and that the best solution REALLY does depend upon your organization’s specific needs, culture, and AG configurations. As such, trying to ‘dumb down’ this content to make it a one-size-fits-all (or three-sizes-fit-just-about-everything) approach just doesn’t feel right.

So, hopefully this series of posts has helped serve as a sounding-board for anyone tackling these issues. And, if you’re tackling these concerns, then I wish you the best of luck – and feel free to ping me if you have any questions or brilliant ideas.

Part 1: Introduction
Part 2: Putting AlwaysOn into Context
Part 3: Defining Batch Jobs
Part 4: Synchronizing Server-Level Details
Part 5: Setting Up Failover Alerts
Part 6: High-Level Options and Rationale for Handling Batch Jobs
Part 7: Detecting Primary Replica Ownership
Part 8: Dynamic Detection Gotcha #1
Part 9: Dynamic Detection Gotcha #2
Part 10: Dynamic Detection Gotcha #3
Part 11: A Fork in the Road
Part 12: Issues with Enabling and Disabling Batch Jobs
Part 13: Creating and Populating a Batch Job State Table
Part 14: Establishing Synchronization Checks
Part 15: Setting Up Linked Servers
Part 16: Job Synchronization Checkup Logic
Part 17: Creating Jobs to Check on Synchronization
Part 18: Health Checks for Availability Groups
Part 19: Availability Group Database Backups
Part 20: Avoiding Backup Fragmentation
Part 21: Assigning Backup Preferences
Part 22: Executing Backups
Part 23: Maintenance Plan Backups and SSIS Packages
Part 24: Synchronizing SSIS Packages
Part 25: Final Remarks on Backups
Part 26: Using Master Servers for Administration
Part 27: Options and Concerns for More Advanced Deployments
Part 28: Additional Options for Tackling Jobs Failover

Comments

Plain text