The pandemic has been forcing data center operators to invest in costly measures to make their systems more robust. The hardships imposed on them by the coronavirus became so great that some saw their services fail.
The risk of an unreliable data infrastructure rattling the economy and society has meanwhile become so great that the British government is considering laws to force the industry to become more resilient, a government official told an industry meeting last week. In the wake of the COVID-19 pandemic, the government is for the first time treating physical data infrastructure as a distinct industry, where before it was considered a supporting service to other crucial sectors. Officials have begun considering how this might affect the way government handles the industry.
Yet data center operators blame software suppliers for inadequate resilience of data services, saying it has dragged the data industry down even as their own physical infrastructure is becoming more reliable.
The data center industry is already planning further measures to increase its resilience against faults and disasters, potentially raising its operating costs, but certainly needing greater investment of capital, executives suggested at the meeting and in conversations with Data Center Knowledge.
Data centers belong to an industry reputedly obsessed with resilience and proud of its achievements. They already strive to have so many safeguards and failover systems that they can guarantee to keep data services running, on average, 99.982 percent of the time, no matter what disasters might hit them, according to the Uptime Institute, a private enterprise that acts as an industry body, setting and accrediting standards for such precautions. That reliability figure, which Uptime associates with a data center designed to its ‘Tier III’ standard, is what most operators aspire to even if they don't actually reach it, Uptime executive director Andy Lawrence told DCK. There have been calls for data services to become as reliable as electricity.
Three Percent of Operators Report Data Center Outages
Three percent of data center operators admitted that their services simply crumpled and failed after adversities wrought by the virus pitted their preparations against a real disaster, according to a survey Uptime plans to publish later this month. The Institute does not yet know how bad the outages were, or what effect they have had on the industry's overall reliability. But the failures are roughly representative of the industry's performance under COVID-19 stringencies, Amber Williamson, a data center engineer and consultant, told us in a telephone call afterwards. Williamson presented Uptime's survey findings at the meeting.
The reported data center outages were likely exceptional, Williamson said. "A Tier III data center should be able to do any maintenance without impacting IT and services. They shouldn't have any outages at all," she said.
The outages might have been caused by severe staff shortages, as people stayed home from work to stop the virus spreading, said Williamson. People might have been absent when they were needed to manage problems, she said. Spare parts might not have been available.
New Investment in Infrastructure Resilience
Two-thirds of data center operators plan to make their facilities more resilient in response to the pandemic, according to Uptime's unpublished survey. This means building more redundant systems that can take over in an emergency.
"We will see an increase in resiliency, which means we are going to have an increase in capital expenditure," said Williamson. One cloud computing firm had already mandated that its data center providers must henceforth have two redundant systems in the wings for each of the foundational data center components, such as cooling and power.
Uptime’s Lawrence, who co-authored the research, said people shortages implied a need for greater resilience. "The fewer people you have on site, if you want operations to continue, then you've obviously got to plan for it to continue to operate even if components fail. That [degree of] fault tolerance is Tier IV," he said.
The industry was already suffering a "critical" skills shortage before the pandemic, according to an annual survey of 1,100 operators Uptime did last year. Industry reports have cited a growing belief in a need for 100 percent reliability in data services to make them as assured as electricity, as innovations such as driverless cars come to depend on them vitally. Lawrence said industry had already shown greater interest in Tier IV reliability, which Uptime Institute considers to be capable of delivering 99.995 percent uptime. Most data centers don't seek formal accreditation of their reliability at all though, Lawrence said. (Uptime's business is built on its proprietary Tier certifications.)
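To put the two uptime figures quoted above in perspective, the short sketch below converts an availability percentage into expected downtime per year. It is only an illustration: the function name and the assumption of a 365-day year are ours, not Uptime's, and the result is an average, not a prediction of how any single facility will behave.

```python
# Illustrative only: converts an availability percentage into average yearly downtime.
# The 365-day year and the function name are assumptions made for this sketch.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the average hours of downtime per year implied by an availability figure."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# The two figures quoted in the article for Uptime's Tier III and Tier IV standards.
for tier, availability in [("Tier III", 99.982), ("Tier IV", 99.995)]:
    hours = annual_downtime_hours(availability)
    print(f"{tier}: {availability}% uptime allows about {hours * 60:.0f} minutes of downtime a year")
```

On those assumptions, 99.982 percent works out to roughly an hour and a half of downtime a year, and 99.995 percent to less than half an hour.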
Data Center Reliability Under UK Government’s Microscope
The UK government’s Department for Digital, Culture, Media and Sport (DCMS), which has been charged with keeping the national data infrastructure running during the crisis, has been trying to determine whether that infrastructure is resilient enough not to fail when the vital data services that run hospitals and the economy need it. That is according to a presentation a government official gave at the meeting, a webinar hosted on Thursday by the industry body techUK.
"We are very keen to understand the nature of the sector, and how we can ensure future-proofing of policy making," Sam Roberts, head of open government and open data at DCMS, told the meeting. "And that could be things like skills, access to materials, and inbuilt structural resilience.
"Coronavirus has shone a spotlight on the criticality of physical data infrastructure. It is very important we look at this as an essential underpinning infrastructure for wider economic and societal outcomes.
"We felt it would be prudent to assess the resilience of the sector in its entirety. This is the first time Whitehall is looking at the whole sector in its entirety, as opposed to the supply chains of others. This is a step change in the way the government has been looking at the sector," said Roberts, speaking as a representative of the Data Infrastructure Resilience Team, which DCMS set up in March to help make sure vital data services kept running during the crisis.
"We are assessing whether there is a greater role for government to play," he said. But DCMS recognized that resilience and security were already competitive drivers in the data infrastructure sector. His resilience team would continue working at least for another year to "address issues of resilience and security for the sector," he said. The team's emergency work has meanwhile eased.
He said the government was considering whether to designate physical data infrastructure as a formal part of the critical national infrastructure (CNI). Until now, data centers have been considered CNI only when they provide data to another CNI sector, such as health. The question has become important now that the government is treating data infrastructure as a distinct sector. Operators have expressed reluctance at the prospect of the greater regulation a CNI designation would bring.
Software Resilience “Isn’t Quite There”
Neil Cresswell, CEO of data center operator Virtus, told the meeting that physical data infrastructure resilience is increasing.
"Maybe the resilience at the software level isn't quite there," he said. The crisis forced Virtus to manage with 70 to 80 percent of its usual staff onsite. Management’s choice was one of life and death. "Nobody wants to be in a situation where you force somebody to work and they get ill or heaven forbid die for the sake of building a data center early. It's just not worth it," he said.
Virtus has been doing all it can to increase resilience and has accelerated its use of remote management systems, so staff don’t need to be on site, and of automation to carry out operations and repairs.
But the stringencies, combined with bottlenecked supply chains, had delayed Virtus's construction of new data centers by three months, Cresswell said. Williamson cited reports that demand for data services has nearly doubled during the crisis.
Andrew Jay, executive director of CBRE Data Center Solutions, one of the world's largest operators, told the meeting that data infrastructure resilience had actually been declining, but because of problems at the software level.
"If people believe they need to go above and beyond Tier III, there's cost implications and efficiency implications. I would love to understand more about this increased resilience," he said.
Uptime said in its 2019 annual survey that two thirds of all outages in data centers were caused by software and network problems, suggesting they might be beyond a data center operator's control. Another third were caused by power failures at the data centers themselves.
"Outages continue to be costly and common," it said. One third of operators had outages in the last year. One fifth had been so severe that they caused financial losses, reputational damage, regulatory breaches and safety issues. One in ten cost $1 million in damage.
Almost all of the 250 data center operators who responded to the Uptime survey said they intended to make greater use of remote management as a consequence of their experience operating under COVID-19 stringencies. Three quarters said they intended to use automation.