Ask most IT leaders how they mitigate downtime, and you'll probably receive answers centered on minimizing unplanned downtime, meaning outages that no one anticipates. Traditionally, unplanned downtime has been the boogeyman that businesses have worked hard to keep at bay so that they can uphold availability guarantees that are often in the range of 99.5% or better.
In the cloud-native world, many of these issues simply do not exist. I remember talking to a very large online streaming company back in 2013. They were going on about how resilient their infrastructure was, and I recall wishing that way of thinking could be applied to the world of legacy applications. The truth is that now it can be (well, nearly!).
There's another form of downtime, though, that tends not to receive as much attention, but should: planned downtime, which means events where IT teams deliberately turn systems off in order to perform maintenance work, such as installing updates. If you really want to minimize total downtime at your organization, you should invest in ways to avoid planned outages just as much as unplanned ones. Both types of downtime cut into overall availability rates, both harm the business, and both can be mitigated — although avoiding planned downtime requires different tools and strategies than those that reduce unplanned downtime.
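To make the availability point concrete, here is a small arithmetic sketch in Python. The 99.5% target is the figure mentioned earlier; the planned and unplanned hours are made-up illustrative numbers, not data from any real organization, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_budget_hours(availability_pct: float) -> float:
    """Total downtime per year (in hours) that an availability target allows."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def achieved_availability(planned_hours: float, unplanned_hours: float) -> float:
    """Availability (%) once planned and unplanned downtime are both counted."""
    total_down = planned_hours + unplanned_hours
    return 100 * (1 - total_down / HOURS_PER_YEAR)


# Illustrative assumptions only: 36 hours of maintenance windows, 12 hours of outages.
planned, unplanned = 36.0, 12.0

print(round(downtime_budget_hours(99.5), 1))                # 43.8 hours per year
print(round(achieved_availability(planned, unplanned), 2))  # 99.45, below the target
print(round(achieved_availability(0.0, unplanned), 2))      # 99.86, above the target
```

Under these assumed numbers, the unplanned outages alone would fit comfortably inside a 99.5% budget. It is the planned maintenance windows that push the total over it.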
Keep reading for a look at why planned downtime causes so much harm to the business, along with tips on how to minimize planned outages across your IT estate.
The High Cost of Planned Downtime
Historically, businesses haven't tended to spend much time or effort mitigating planned downtime. There are two main reasons why:
One is that planned downtime often feels less costly than unplanned outages. If you can plan ahead for an outage, you can theoretically minimize its impact on the organization by, for example, scheduling downtime during periods of low business activity.
That might be true for smaller organizations that operate on a predictable schedule. However, not all businesses can simply work around planned downtime. If you need to be operational 24/7, planned downtime will impact your business because it will disrupt at least some of the services that employees and customers depend on.
That's especially true if your IT systems are centralized in a single data center — as many are today due to infrastructure consolidation and migrations to the cloud — rather than being distributed across a series of regional locations. If you only have one instance of a critical business system, any downtime to that system at all will bring operations that depend on the system to a total halt.
Indeed, if you measure downtime costs in terms of money lost per minute or hour, you'll find that the costs are about the same regardless of whether the downtime is planned or unplanned. For example, a Fortune 100 oil and gas company recently tasked us with eliminating the planned downtime for its ERP systems during a turnaround event, or TAR. A TAR is a planned (and very expensive) period of regeneration in a plant or refinery that typically lasts between two weeks and two months. For the duration, an entire part of the operation is shut down for inspection and revamping. Under these circumstances, prolonged downtime can cost millions of dollars per day due to lost production and increased labor and equipment spending. To mitigate these costs, the plant's ERP systems must experience the least possible amount of planned downtime during the TAR, because they are critical to managing the process effectively and getting the plant back up and running quickly.
The second reason why planned downtime has not historically received much attention from IT leaders is that the typical IT organization has become accustomed to thinking of planned downtime as an unavoidable fact of life. Historically, most software was designed such that it required periodic downtime in order for IT engineers to perform maintenance work. Downtime was a necessary evil that businesses just accepted.
Saying Goodbye to Planned Downtime — Even for Legacy Apps
Thanks to technological changes, however, planned downtime has ceased to be a necessary evil in many cases today.
Modern applications — meaning those designed to operate on clusters of servers using a microservices architecture — can typically be updated or redeployed without requiring any downtime. This is possible because these applications run multiple instances of the same services. Thus, when you need to make changes to a service in order to, say, apply a patch or deploy a new feature, you simply modify one instance while leaving the others operational, then switch over to the updated instance in real time once updates are complete. This is part of the reason why distributed, cloud-native application architectures and orchestration platforms, like Kubernetes, have become so popular.
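The update pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not how Kubernetes or any particular orchestrator is implemented: the Instance class, the rolling_update function, and the version strings are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    version: str
    in_rotation: bool = True  # True while the instance is serving traffic


def serving(instances):
    """Instances currently able to take traffic."""
    return [inst for inst in instances if inst.in_rotation]


def rolling_update(instances, new_version):
    """Update one instance at a time and return the lowest number of
    instances that were serving at any point during the update."""
    if len(instances) < 2:
        raise ValueError("need at least two instances to update without downtime")
    min_serving = len(serving(instances))
    for inst in instances:
        inst.in_rotation = False    # take this one instance out of rotation
        min_serving = min(min_serving, len(serving(instances)))
        inst.version = new_version  # stand-in for applying the patch
        inst.in_rotation = True     # switch traffic back to the updated instance
    return min_serving


pool = [Instance(f"svc-{n}", "1.0") for n in range(3)]
lowest = rolling_update(pool, "1.1")
print(lowest, [inst.version for inst in pool])  # 2 ['1.1', '1.1', '1.1']
```

With three instances, at least two are serving at every step, so users never see the service go offline. The check for fewer than two instances reflects the point made earlier about single-instance systems: with only one copy, any update to it is an outage.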
But it's not just cloud-native applications that can make planned downtime a thing of the past. Even in cases where you can't overhaul your applications to run as microservices on platforms like Kubernetes, you can still avoid planned downtime by applying the same update strategy to legacy apps that is possible for cloud-native apps.
In the world of SAP, this means deploying automation and orchestration tools that can not only manage the state and configuration of cloud infrastructure but also capture the very specific SAP application configurations. SAP creates a conundrum for typical SRE teams because the critical configuration information resides buried inside the SAP databases, which adds complexity when seeking the benefits of immutable operations. Coupling cloud automation with SAP configuration management makes it possible to build parallel landscapes, apply patches or upgrades to them, and then fail forward to them with minimal disruption.
Your legacy application platforms themselves don't necessarily include all of the tooling necessary to perform this feat. But if you pair them with a solid understanding of critical SAP configuration information, as we've done at Lemongrass to achieve near-zero-downtime patching for SAP apps, you can perform maintenance work whenever you need, without taking systems down.
The result is the ability to push out changes more rapidly without disrupting the business — and to move beyond the mindset where the business has to accept downtime as a cost for adding new features or services to its software.
Of course, we have not covered the other benefits of this type of approach related to agility and cost savings (that is for another paper!), or the fact that an exercise that would normally take weeks of preparation and involve multiple client teams can now be done by a single administrator, with better quality, in minutes rather than days!
I won't tell you that it's possible to guarantee zero unplanned outages, because it's not. But I will tell you we are approaching a world where planned downtime can be virtually eliminated. Whether you run cloud-native applications, legacy apps, or a combination thereof, you can take advantage of modern automation and orchestration technology to perform routine maintenance without having to take applications offline.
You can, in other words, say goodbye to planned downtime and stop accepting outages as a necessary cost of innovation.
Tim Wintrip is the Chief Sales and Customer Officer at Lemongrass.