As an executive, you've likely heard the terms “RPO” and “RTO” in conversations with your IT department. As an IT manager, you’re committed to meeting your RPOs and RTOs. As a DBA or other data professional, you’re working hard to configure your environments to meet RPOs and RTOs. Or, you may be hearing these terms used, but aren't quite sure what they mean and at this point are afraid to ask.
Here's the lowdown.
What is RPO?
RPO stands for recovery point objective. This is a fancy way of saying: Just how much work (or data, for the sake of this discussion) am I willing to lose if a crisis cuts end users and customers off from critical systems? Unlike so many other uncertainties in the technology sector, this is a goal you have almost complete control over meeting. The RPO is a function of multiple things, all of which are in the hands of your data platform team. The following factors weigh into setting an achievable backup strategy:
- Impact of an outage
- Timing of the outage
Depending on how you structure your backup strategy and how much you’re willing to spend on it, you may be able to achieve near-zero RPO. Much of this is within your control, although different causes of outages have different impacts. Frequent transaction log backups on top of a solid set of full or differential backups, all hosted on the server as well as in a local data center and in the cloud, offer a recovery path to near-zero data loss (at a price tag commensurate with the level of redundancy involved). However, an outage that results from a hardware failure at the server level requires a different recovery path, involving fewer teams and processes, than an outage caused by corruption of data within the database or by an infrastructure issue that affects an entire data center.
Even though a DBA has control over the creation and execution of the backup strategy, circumstances around the cause of the outage impact the ability to meet the RPO. Consider the following backup strategy:
- Weekly full database backups on Sunday at 1:00 a.m.
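- Daily differential backups (sometimes loosely called “incremental” backups, although a SQL Server differential contains all changes since the last full backup) taken Monday through Saturday at 1:00 a.m.
- Transaction log backups every 5 minutes on the “3s” and “8s” (that is, at minutes ending in 3 and 8)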
Given how SQL Server transaction logging works, under the strategy above you could conceivably lose anywhere from 0 to 5 minutes of data changes. The amount depends on the cause of the failure, and specifically on whether the transaction log, or the disk hosting it, was damaged in the incident. If the log survives, a tail-log backup can capture the changes made since the last scheduled log backup; if it does not, those changes are gone. Budget also affects RPO in this scenario, depending on how much is spent on storage, secondary storage and perhaps tertiary cloud storage. If you back up only to local disk and do not offload the backups to secondary storage, losing that disk could cost you hours, or even days, of data. Timing and circumstances also affect the RPO.
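To make the 0-to-5-minute range concrete, here is a minimal Python sketch of the arithmetic for the log backup schedule above. It assumes each log backup completes instantly at its scheduled minute, which real backups do not. The function names and the `log_intact` flag are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta

# Log backups run at minutes ending in 3 and 8 (the "3s" and "8s").
LOG_BACKUP_LAST_DIGITS = (3, 8)


def last_log_backup(failure: datetime) -> datetime:
    """Most recent scheduled log backup at or before the failure time.

    Assumption: the backup completes instantly at its scheduled minute.
    """
    t = failure.replace(second=0, microsecond=0)
    # A scheduled slot is never more than 4 whole minutes back.
    while t.minute % 10 not in LOG_BACKUP_LAST_DIGITS:
        t -= timedelta(minutes=1)
    return t


def data_loss_window(failure: datetime, log_intact: bool) -> timedelta:
    """Data changes at risk for a failure at the given time.

    If the transaction log survived, a tail-log backup can capture
    everything since the last log backup, so nothing is lost.
    """
    if log_intact:
        return timedelta(0)
    return failure - last_log_backup(failure)


if __name__ == "__main__":
    failure = datetime(2024, 1, 3, 10, 7, 30)  # hypothetical incident time
    print(data_loss_window(failure, log_intact=False))
    print(data_loss_window(failure, log_intact=True))
```

The worst case in this model is a failure just before a scheduled slot with the log destroyed, which approaches the full 5 minutes. The best case is an intact log, where the window is zero.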
What is RTO?
RTO stands for recovery time objective. This is the amount of time that is acceptable for returning the failed system to normal operations. You may be able to recover 100% of your data and achieve an RPO of 0, but if you need to pull a backup out of cold storage in the cloud, rebuild networking infrastructure or deal with any of a million other variables, you’re going to face an actual recovery time that is vastly different from your RTO.
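One rough way to see the gap between the objective and the actual recovery time is to add up the steps a recovery would take and compare the total to the RTO. The Python sketch below does only that. Every step name and duration in it is a made-up placeholder, not a measurement, and it assumes the steps run one after another.

```python
from datetime import timedelta


def actual_recovery_time(steps: dict[str, timedelta]) -> timedelta:
    """Total time for recovery steps that run one after another."""
    return sum(steps.values(), timedelta(0))


def meets_rto(steps: dict[str, timedelta], rto: timedelta) -> bool:
    """True when the summed recovery steps fit within the objective."""
    return actual_recovery_time(steps) <= rto


# Hypothetical durations, for illustration only.
local_restore = {
    "restore full and differential backups": timedelta(minutes=40),
    "restore transaction log backups": timedelta(minutes=15),
}
cold_storage_restore = {
    "retrieve backups from cloud cold storage": timedelta(hours=5),
    **local_restore,
}

rto = timedelta(hours=1)  # hypothetical objective

if __name__ == "__main__":
    for name, plan in [("local", local_restore), ("cold storage", cold_storage_restore)]:
        total = actual_recovery_time(plan)
        print(f"{name}: {total}, meets RTO: {meets_rto(plan, rto)}")
```

With these placeholder numbers, the same data is fully recoverable in both cases, yet only the local restore fits inside the one-hour objective.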
RTO is often expressed as acceptable downtime. Think of phrases such as “five-nines,” which corresponds to being “up” 99.999% of whatever time is the basis of that measurement, whether that is business hours, 24 hours per day or some other measure. Budget and planning are the two largest factors in attainable RTO. The higher the bar, and the more complex it is to make the system redundant, the higher the cost. In general, minor incidents are likely to be less impactful and better planned for than crises that affect multiple services, platforms and infrastructure components.
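To see how much the basis of measurement matters, the following Python sketch converts an availability percentage into a downtime allowance. The two measurement periods (a 365-day year around the clock, and a year of 8-hour days, 5 days a week) are assumptions chosen for illustration, and the function name is hypothetical.

```python
def allowed_downtime_minutes(availability_pct: float, period_hours: float) -> float:
    """Minutes of downtime permitted over the measurement period."""
    return period_hours * 60 * (1 - availability_pct / 100)


# Assumed measurement bases, for illustration only.
HOURS_24X7_YEAR = 365 * 24          # around the clock, non-leap year
HOURS_BUSINESS_YEAR = 52 * 5 * 8    # 8-hour days, 5 days a week

if __name__ == "__main__":
    for label, hours in [("24x7", HOURS_24X7_YEAR), ("business hours", HOURS_BUSINESS_YEAR)]:
        minutes = allowed_downtime_minutes(99.999, hours)
        print(f"five-nines over a {label} year: {minutes:.2f} minutes")
```

Under these assumptions, five-nines allows a little over 5 minutes of downtime per year when measured around the clock, and only a bit more than 1 minute per year when measured against business hours.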
How Do You Protect Yourself?
Books and thousands of articles have been written on the subject of RPO and RTO. Suffice it to say that planning, executing, funding and testing your disaster recovery plans are key to successfully recovering from incidents affecting business continuity. You’ll likely never be able to protect yourself completely from critical system failures, but if you have a solid plan that is tested, hardened and extensively redundant for multiple levels of failure, you’ll be prepared to recover from most incidents you may face.