A typical data center can sustain itself and meet its service-level objectives without regular maintenance for one month, perhaps two, according to experts. Although the first warning signs about the impending coronavirus pandemic came in early January, for North America and Europe, coronavirus has been an event for no longer than two months. Thus far, data center operators around the world appear to have responded to the challenge appropriately: adjusting shifts, minimizing handoffs and interpersonal contact, taking precautions to clean and sterilize facilities, and switching to essential-only maintenance schedules.
Data centers, like any other complex infrastructure, are dependent upon strong supply chains. Equipment failures – inevitable even in normal times – are tolerated through planned redundancy and person-to-person supply chain management.
The novel coronavirus has already negatively impacted supply chains, with the suspension of manufacturing in Asia and of assembly and distribution worldwide. For example, Clint, one of the world’s principal manufacturers of liquid chiller units and HVAC components, produces most of its components in plants throughout Italy.
Data center operators have not yet felt the effects of the broader pandemic supply shock. (Meanwhile, some of their biggest users have reported greatly expanded lead times for computing equipment they fill data centers with.) If the world were to emerge from its state of economic lockdown soon, conceivably, suppliers would be able to ramp up production in time to address the demand shocks that are expected to follow a supply shock of this magnitude.
While we could see a handful of smaller US states re-open to some degree in May, experts currently project that others wouldn’t be able to do the same until late June or July. But even that goalpost may shift. It’s still unclear when testing for the virus and for antibodies can be scaled to adequate levels (one of the key prerequisites for re-opening). A vaccine appears to be more than a year out. Meanwhile, the number of confirmed COVID-19 cases in the US keeps rising.
Can data centers hold out until coronavirus testing becomes commonplace?
Even the most risk-averse business verticals (at least when it comes to their infrastructure), such as financial services, have outsourced data centers, going from highly fault-tolerant, robust infrastructure to “a lower level of redundancy in terms of a platform,” Ed Ansett, chairman of data center engineering consultancy i3 Solutions, told us.
“Which is fine, as long as, one, you’ve got the people there with the skills to run the system – and there’s a big question there in terms of the impact the virus will have – and, two, the availability of spare parts to fix stuff when it goes wrong.”
It all comes back to how long the pandemic will last. A six-month timeframe could be sustainable and recoverable, he said. The industry could find the resilience it would need to bounce back. But if it were to continue for much longer, “then we’re going to see nightmares,” Ansett said. “Because of the highly interdependent nature of the network, the unintended consequences, we just don’t know what they’re going to be.”
Data center operators, especially commercial service providers, keep inventories of spare parts on hand. The big question is whether they will be able to replenish those inventories as they get low.
“Of course, we have critical spares on site, always,” Bob Woolley, senior VP of operations for NTT Global Data Centers, said. “We’ve tried to stock up on a few extra items, but there’s only so much you can do in a short period of time.”
NTT, like other data center operators, typically has supplies on hand in each facility for workers to shelter in place, but that’s for short-term emergencies, such as hurricanes and earthquakes, Woolley told DCK. For the long term, data center managers are focusing on a six-month time frame (which, applied from February, would result in a threshold date sometime in July or early August).
Data center managers are accustomed to suspending maintenance for one month, maybe two, including in response to explicit customer requests for deferral. But at six months, risk levels start to be measurable by lenders, insurers, and other institutions that underwrite these facilities. Before too long, the risk level could become unacceptable.
“We do have the luxury of having reliability engineers that work at NTT, and we’ve got them doing risk assessments around maintenance deferment, so that we are continually evaluating what the level of risk is,” Woolley said. “At the same time, we are continually re-juggling our maintenance schedule with anticipation that we’ll resume activity in a month or two months.”
But this balancing act wouldn’t be sustainable for longer than six months, at which point resuming normal operations would become difficult, he warned. If a six-month maintenance stasis were lifted, engineering teams playing catch-up to bring their service levels back to nominal would greatly strain the supply chain. Teams would be backlogged with up to five times their normal workloads. NTT would want replacement parts – at any price – and so would everyone else. The spike in demand would set maintenance schedules back even further.
Not all operators have switched to a reduced maintenance schedule. CoreSite, for example, one of the largest data center providers in the US, has stuck to its regular schedule, Anthony Hatzenbuehler, senior VP of data center operations at CoreSite, told DCK.
“We have reduced the number of staffing that actually go into the data center, as many of our colleagues have in the industry, but we’re still maintaining our maintenance program,” he said. “We’ve asked vendors to, obviously, work with us on the protocol that needs to be occurring, social distancing and so forth.”
CoreSite keeps enough generator fuel on site at its facilities for pre-defined periods of time, but it doesn’t manage spare-parts inventory based on timeframes, Hatzenbuehler explained. Spares kept on site include UPS and generator parts, but the biggest stockpiles are usually of mechanical components, he said.
Hatzenbuehler said the vendors CoreSite has been speaking with haven’t given the company reason for concern about future availability of spare parts. Some vendors, he said, were grateful to CoreSite for keeping its regular maintenance schedule. That’s one less customer they will have to add to their backlog when the expected rush of data center operators with pent-up maintenance demand occurs.
Schneider Electric and Vertiv, two of the world’s largest suppliers of data center power and cooling infrastructure equipment, declined to comment for this story.
From Pandemic to Endemic
Chris Brown, CTO of the Uptime Institute, which has been surveying its network of data center operators, said another potential outcome of running data centers the way they are run now for an extended period of time could be new institutional habits.
“If we’re still dealing with this — God forbid — in February 2021, while we still wait for a vaccine to get developed, then it may become the new norm, because folks don’t like change,” Brown told DCK.
Brown believes people in any line of business resist change, at least for a while. Once temporary protocols are lifted, standards and practices usually snap back to their old familiar patterns. But at some point, in the absence of an “all clear” signal, those temporary patterns become “the new normal.” Conditions historically considered intolerable would then become familiar, if not entirely comfortable. Behaviors first reserved for a pandemic would, ironically, become endemic.
Case in point: Some of Uptime’s clients have switched to staffing their facilities with fewer, more junior personnel, wearing helmet cams that link them to senior staffers off site. The seniors can then guide the juniors through repair and maintenance procedures in which they may not otherwise be skilled.
This practice could continue once the crisis is over “if we stay in this mode that long, because you’d have some track history,” Brown said. The sooner operators get out of the current mode, the less likely the pandemic is to drive “substantive, long-term changes in behavior. People will tell themselves, ‘We got lucky, and we got by for five months.’ If you make it 12, 14 months, you have more track history to be comfortable with it.”
Danny Lane, Digital Realty’s senior VP for global operations, said his team was actively researching ways to implement automation and even AI to do work humans do today. “We have active R&D programs in place evaluating AI and ML capabilities to streamline engineering workstreams and equip our technical experts with improved data analytics with a focus on working smarter, not leaner,” he told DCK.
Such adaptations, assuming they’re discovered, would very likely become permanent. However, replacing on-staff mechanical engineers with junior personnel equipped with helmet cams will not be an option, Lane told us.
The Limits of Resilience
In any economy, fundamental behavioral changes among consumers alter, if not remake, supply chains. If data center operators eventually accept the emergency mindset as the new norm – and Uptime’s Brown said some of them could do that as soon as July – it would pave the way for a much more conservative consumption pattern, with component and appliance lifecycles extended, and facilities learning to do more with less. Longer lifecycles could lead to deceleration of innovation. Capital investment in new and replacement facilities that are less than critical would be shelved indefinitely.
If there is such a thing as consensus in the context of pandemic analysis, here is where things stand today: Data centers have already made adaptations to their supply chain management policies that should enable them to remain reasonably functional through May. By July, or perhaps as soon as June, facilities may not be able to retain the redundancies they would need to sustain outages or equipment failures. By September, such failures could start to emerge, and service levels could start to plummet.
“One of the things that will come about, given enough time, is a higher rate of failures,” i3’s Ansett remarked. “The cause of many of those failures will not be just this parts shortage – in fact, that may turn out to be the least of our problems. I wouldn’t be surprised to find out a lot of systems have failed because they weren’t tested and commissioned properly, or that they weren’t designed properly in the first place. I’m sounding like the prophet of doom, but I find myself down this road. And those are the consequences.”
-- Yevgeniy Sverdlik contributed to this report.