SQL Server Recovery Time Objectives and Recovery Point Objectives

In this first post for the Practical SQL Server blog, I wanted to lay the groundwork for topics and concepts that I’m sure will show up over and over again when addressing future topics such as High Availability, Disaster Recovery, Performance Tuning, Budgeting, and so on.

Why RTOs and RPOs?

Based upon my SQL Server consulting experience, many organizations haven’t actually sat down to quantify their Recovery Time Objectives (RTOs) or their Recovery Point Objectives (RPOs)—especially when it comes to their SQL Server databases.

Without properly identifying RTOs and RPOs, IT professionals risk a huge potential mismatch between what they’re able to recover during a disaster, and what a business needs to effectively survive a disaster.

Stated differently, if you haven’t actually quantified and defined these objectives, then it’s safe to say that you’re at risk.

RTOs and RPOs Defined

Within the IT community, Recovery Time Objectives are commonly defined as the amount of time it takes to recover from a disaster – or to get a system back online after it goes offline or crashes. Similarly, Recovery Point Objectives are commonly described as the amount of data that was lost during the outage and recovery period.

Personally, I’m not a big fan of those definitions – because they fail to address the fact that RTOs and RPOs are both centered on the notion of ‘Objectives’ instead of actual down time and data loss.

In other words, when used correctly, RTOs and RPOs cease being mere buzzwords (or an annoyance put in place by management) and can become very effective tools for addressing the very real potential for disaster and proactively ensuring data protection and business continuity. More specifically, when leveraged correctly, RTOs and RPOs represent a great way for businesses (meaning IT and management) to work together to establish acceptable windows for downtime and data loss, and then begin working towards solutions that meet (or exceed) those windows or objectives.

Even better, once RTOs and RPOs are defined, IT departments can take these benefits to the next level by codifying them into full-blown Service Level Agreements (SLAs), which shift the focus from a discussion of acceptable amounts of loss to a proactive focus on overall uptime and availability. This, in turn, helps IT professionals clearly establish their commitment to business continuity and growth, and can be a pivotal component in transitioning from a tactical (or reactive) approach to systems management into a more strategic (or proactive) approach.

Practical RTOs and RPOs

Practically speaking then, the question is how do you go about figuring out what your objectives should be when it comes to restoring availability and dealing with data loss?

To address that issue, I typically like to think of RTOs and RPOs in terms of downtime—which really comes in two forms:

First, and most obviously, there’s the downtime that occurs when applications and employees aren’t able to work when a system is down, offline, or crashed.
Second, there is also the ‘downtime’ represented by the time users spent entering data that was subsequently lost, along with the additional time it takes to put that data back into the system.

In other words, when IT professionals typically address RTOs and RPOs, they commonly only focus on how much time and data loss is acceptable—instead of focusing on the total amount of disruption that a disaster can incur.

To put this into better perspective, consider an example centered on a medical billing application with, say, 20 or 30 semi-active users who regularly log data into the application during business hours. Then assume that the system goes down. Something similar to the following will characterize how things play out in most organizations:

At first, end-users might not actually care too much about the outage. They may have other things to do or may be happy to go stand around the water cooler and take a quick break.
Management, on the other hand, may feel differently.
Once the system has been down for, say, 10 or 15 minutes, end users may start to joke or complain as they start worrying or wondering if they’ll have to stay late at work to get their jobs done.

Then, let’s assume a happy ending: After 35 minutes of downtime, the database is brought back online with only 15 minutes of data being lost from before the crash. In such an event, for example, the following considerations help contribute to addressing the true, or total, cost of this outage:

First, end users lost 35 minutes of productivity while the system was down. (Addressing this window of time loss corresponds to Recovery Time Objective—or the goals you put in place to minimize this kind of downtime.)
Second, these same end users lost 15 minutes of work because of the 15 minutes of data that was lost during the crash or outage.
Third, these same end users are also going to lose whatever time it takes them to re-enter that same lost, or missing, data back into the system. With some applications, this lost time can be minimized if everything is automated or queued in the sense of being highly fault-tolerant. But with applications that rely heavily upon end-user interactions, it’s commonly safe to assume that this amount of lost time might actually be larger than the amount of time represented by lost data alone—as end users now have to go back through previous work, validate that it was correctly added or changed in the system, and re-enter anything that they find missing or incorrect. (And addressing the cost of this lost time is the role of Recovery Point Objectives – in the sense that these objectives are what you use to minimize the costs associated with lost or missing data.)
Roughly summarized, the true amount of lost time for end users is more accurately going to be 75 minutes instead of just the 35 minutes that IT spent frantically trying to recover the database. (And the math for this goes something like: 15 minutes of lost data + 35 minutes of downtime + 5 minutes to tell everyone that the system is back up + roughly 20 additional minutes for them to figure out what was lost and put missing data or changes back into the system.)
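To make that arithmetic easy to rerun with your own numbers, here is a minimal Python sketch. The four figures come straight from the example above; the function names are hypothetical, and multiplying by head count (using the 20 and 30 users from the scenario) is my own rough way of turning per-user minutes into team-wide hours.

```python
def per_user_disruption(downtime, lost_data, notification, reentry_overhead):
    """Minutes of disruption each user absorbs from a single outage."""
    return downtime + lost_data + notification + reentry_overhead


def team_hours_lost(users, per_user_minutes):
    """Total hours lost across the team, assuming everyone is affected equally."""
    return users * per_user_minutes / 60


# Figures from the medical billing example, all in minutes.
per_user_minutes = per_user_disruption(
    downtime=35,          # system offline
    lost_data=15,         # work entered before the crash that was lost
    notification=5,       # telling everyone the system is back up
    reentry_overhead=20,  # finding and re-entering what went missing
)
print(per_user_minutes)  # 75

for users in (20, 30):
    print(f"{users} users: {team_hours_lost(users, per_user_minutes):.1f} hours lost")
```

Assuming every user is affected equally is a simplification, but it shows how quickly a 35-minute outage grows once it is multiplied across a team.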

Consequently, when it comes to figuring out what your own RTOs and RPOs should be, you need to consider the potential costs associated with lost time and lost data for your own business. There is no ‘one size fits all’ approach to RTOs or RPOs, nor is it adequate to just assume that you can achieve zero loss of time or data, because you may not have the budget, solutions, or resources needed to meet such lofty goals.

As such, one way that I commonly recommend that organizations address the potential costs of lost time and data is simply to take average monthly business revenues and then divide those amounts into days, hours, and minutes as necessary. While this is a vastly over-simplified approach to calculating the potential costs of outages, it typically does a great job of underscoring just how at-risk many organizations are when they don’t have any plans or solutions in place to address this potential for loss. Likewise, another big benefit of this overly-simplistic approach to calculating the cost of downtime is that it can help IT professionals establish budget and resource allocations to address the cost of downtime by making a very clear case to management of the risks involved.
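As a rough illustration of that division, the following Python sketch breaks a monthly revenue figure into per-day, per-hour, and per-minute amounts and then prices an outage. The revenue figure is invented purely for the example, the function names are hypothetical, and the 30-day month and 24-hour day are assumptions you should swap for your own operating hours.

```python
def revenue_per_minute(monthly_revenue, days_per_month=30, hours_per_day=24):
    """Divide monthly revenue down to a per-minute figure.

    days_per_month and hours_per_day are assumptions; a business that only
    earns revenue during an 8-hour day should pass hours_per_day=8.
    """
    per_day = monthly_revenue / days_per_month
    per_hour = per_day / hours_per_day
    return per_hour / 60


def outage_cost(monthly_revenue, outage_minutes, **schedule):
    """Very rough revenue exposure for an outage of the given length."""
    return revenue_per_minute(monthly_revenue, **schedule) * outage_minutes


# Hypothetical business earning 432,000 per month, around the clock.
monthly_revenue = 432_000
print(revenue_per_minute(monthly_revenue))  # 10.0 per minute
print(outage_cost(monthly_revenue, 35))     # 350.0 for a 35-minute outage
```

The same outage looks very different if revenue is concentrated in business hours, so it can be worth running the numbers both ways before taking them to management.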

Putting RTOs and RPOs to Work

Once you’ve figured out how much data loss and downtime will cost or impact your business, you’re then able to formulate objectives for how to mitigate those costs. That is the exact role and nature of RTOs and RPOs: they specify the objectives (or goals) you’d like to meet in minimizing those costs. From this point, you are then able to contrast the kinds of budgets and solutions available to meet these objectives and begin putting technical solutions into place that meet your RTOs and RPOs, which in turn help you meet your service level agreements.

However, without testing, RPOs and RTOs are just an expression of how potentially expensive outages can be, because:

There’s no verification that the solutions you have in place to mitigate potential downtime and data loss can meet your stated objectives.
You run the risk that your technical solution might not even work as anticipated—meaning that RTO and RPO take a back seat to seeing if the business will actually survive an outage or not.

Furthermore, without regular testing you also:

Run the risk that changes within your environment or infrastructure (think of network or system changes clear on down to things like patches or service packs) can render your disaster recovery and high availability plans moot.
Run the risk that growth or increased workloads can negatively impact your ability to actually achieve your Recovery Time Objectives (and, in some cases, your Recovery Point Objectives as well) due to increasing load or demand on your failover or recovery systems.
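One simple way to picture what a test result should tell you is to compare what a recovery drill actually measured against the stated objectives. The Python sketch below is only an illustration: the class, the function, and the 30-minute RTO and 15-minute RPO are all hypothetical, while the measured values reuse the 35 minutes of downtime and 15 minutes of lost data from the earlier example.

```python
from dataclasses import dataclass


@dataclass
class Objectives:
    rto_minutes: float  # maximum acceptable downtime
    rpo_minutes: float  # maximum acceptable window of lost data


def check_drill(objectives, measured_downtime, measured_data_loss):
    """Return a list of missed objectives; an empty list means both were met."""
    misses = []
    if measured_downtime > objectives.rto_minutes:
        misses.append(
            f"RTO missed: {measured_downtime} min down "
            f"vs {objectives.rto_minutes} min objective"
        )
    if measured_data_loss > objectives.rpo_minutes:
        misses.append(
            f"RPO missed: {measured_data_loss} min of data lost "
            f"vs {objectives.rpo_minutes} min objective"
        )
    return misses


# Hypothetical objectives checked against the earlier outage figures.
objectives = Objectives(rto_minutes=30, rpo_minutes=15)
results = check_drill(objectives, measured_downtime=35, measured_data_loss=15)
for line in results:
    print(line)
```

Recording results like these after every drill makes it easier to spot when environment changes or growing workloads start pushing recovery times past the objective.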

As such, in future posts we’ll look at some practical approaches to validation, testing, and documentation that you can leverage to make sure that you can meet (or exceed) your RTOs and RPOs.
