The novel coronavirus has already devastated the global economy. Our worst-case ideas of what the “D” in “BC/DR” could stand for have been reconsidered, and the value of the “C” has gone from a constant to a variable. Historically, most business continuity plans for data centers were based on local scenarios, where “acts of God” wreaked havoc on one place. Rarely had anyone considered that the one place might be all of Earth.
“I think where we lied to ourselves was when we thought that, well, when the stuff hits the fan, we’re going to have enough time to respond,” Chris Brown, CTO of Uptime Institute, remarked. “I think what this pandemic has taught us (through the marvel of modern aviation) is that a pandemic is going to spread around the world very rapidly — much faster than you can respond to it.”
“As a society, we’ve spent the last 50 years planning for a nuclear attack, haven’t we,” Ed Ansett, chairman of the mission-critical IT engineering firm i3 Solutions Group, said. “That was the perceived major threat. And there has been lots of modeling around pandemics. But the fact is, we — particularly in the West — have not got our heads around the pandemic thing.”
A Change in Mindset
The pandemic is not, at least not yet, the equivalent of a worldwide hurricane. Today, the world’s data centers are for the most part functional. Some facilities are now running at the peak of their capabilities, but overall, digital infrastructure is showing considerable resilience.
Modern enterprise data centers had already been designed to operate with as few as three full-time staff members onsite. Scaling back personnel, experts told DCK, has meant not only paring onsite staff back to two at any one time but also adjusting shifts.
“Rotational shift schedules have been put in place to minimize the number of people on-site, while ensuring 24x7 coverage of engineering and security at most locations,” Danny Lane, Digital Realty’s senior VP of global operations, said. “Social distancing is being practiced by all teams at Digital Realty, and face-to-face meetings have been replaced with phone conversations and video conferencing. We continue to allow access to authorized people to our sites, but we have encouraged all customers to keep such traffic to essential personnel only if possible. If necessary, at higher-traffic locations we monitor traffic in the lobbies and use traffic-control methods where needed (floor stanchions, barriers, outlined walkways, etc.) to support social distancing efforts.”
Bob Woolley, senior VP of operations for NTT Global Data Centers (formerly RagingWire and other operators NTT has acquired in recent years), told us his organization has moved to one- or two-person shifts, with some shift rotations extended from 8 to 12 hours, including in California, where overtime laws typically make shifts longer than eight hours expensive to implement. Each shift crew will conduct in-person monitoring and ensure operational continuity.
In addition, maintenance and technical teams will be assigned to periodic shifts on weekdays. On occasion supervisors may stand in for on-site technicians, enabling those technicians to work remotely. Since non-essential personnel are now prohibited from entering facilities, some maintenance personnel have been suspended, Woolley said.
“We really don’t need as many of the maintenance crew there during the day, so we peeled those people back and are holding them in reserve,” he said. For some campuses, extended monitoring shifts may be reduced to just the central building. There’s no formula for this yet, he acknowledged; there’s never been one for an event of this magnitude.
“The whole philosophy right now is to minimize exposure to the core technical staff that actually know how to fix things that could go wrong in the data center,” Woolley told us. “These are the people we’re trying to protect in a special way. This special level of protection is designed to keep these people who are unique in their ability to keep the data center running and recover from a failure in a place where they’re available.”
Operations and technical staff members are each assigned to a single building and may not walk between buildings. They enter through the buildings’ shipping and receiving entrances or through any entrances not being used by customers. In customer lobbies, guests are received by receptionists behind glass shields and, if appropriate, escorted to their assets at a distance. In some situations, customers are being received in separate vestibules at the edges of campuses.
“Our customers are some of the most important critical services providers that are struggling to keep infrastructure up and running so we can have conference calls,” Woolley said. “Those customers are still doing work to expand their footprints because they’re struggling to meet demand.”
In a note to DCK, Jon Lin, president of the Americas at Equinix, said the world’s largest data center provider was continuing “to comply with all governmental regulations and public health guidance.”
Equinix is also enforcing a policy of minimal staffing, he said, although its current strategy is to reduce the amount of time each individual staffer spends in their designated complex. To minimize customer visits to sites, Equinix is stepping up use of its smart-hands services. Data center providers and their customers overall have been leaning on smart-hands services and remote management tools far more than they have in the past.
“In areas with larger numbers of confirmed cases of COVID-19 all visitors to the IBX [International Business Exchange] are required to have their temperature checked by the security staff using no-contact infrared thermometers,” Lin said. “Those with a body temperature above 37.3 degrees C (99.1 degrees F) will not be allowed to enter.”
For some data centers, newly offsite personnel include facilities managers, who are being instructed to stay at home unless their presence is unavoidable, Uptime’s Brown told us. Operators adopting the two-shift, 12-hour strategy are sequestering the third shift, holding the staff in reserve in case anyone in the primary crew exhibits symptoms.
Handoffs between shifts are now contactless. “One shift will wipe down the control room and leave,” he explained. “The other shift will come in, and they’ll have the turnovers via cellphone.”
Typical rounds itineraries for a shift — which normally include walkthroughs of critical facility areas such as the data floor, equipment rooms, and operating plants — are being trimmed. In a normal world, one benefit of such walkthroughs is enabling personnel to sense trouble before it happens. Now, where feasible, itinerary checklist items are being replaced with remote monitoring.
Investing Now for the Long Term
Some data center operators are making capital investments in remote monitoring tools and services for the long term, Brown told us — the first clear indication that the pandemic is having a permanent impact on normal management patterns. In a worst-case scenario, such tools could make a facility completely operable with no personnel onsite, although in such a scenario critical repairs and replacements could be deferred.
“All the data centers I know about are using various capabilities of remote monitoring and remote control to keep a better eye on their data centers with smaller shifts, smaller number of crew, or without having roving people around the data center — protecting them,” Brown said.
Will Headcount Reductions Become Permanent?
Collectively, these policy shifts are partly responsible for the surprisingly good quality of service end users have been experiencing since the start of the pandemic. No major internet or cloud outages have been reported so far. Nor has there been a report of a big public-facing enterprise (a major bank, for instance) experiencing business disruption due to failure of its technical infrastructure.
But it’s only April. The challenge of maintaining these tolerable conditions over an extended period of time looms large for data center operators.
“The problem we’ve got here is we believe that we have things under control through automation,” i3’s Ansett warned. “We believe we’ve got the tools to do things, remote hands and so on. And I think that’s largely true, but it’s the exceptions that we’re interested in, not the generality, when one thing’s gone wrong, when people can’t get money out of the bank. It could be just a small network patch. And it can only be exacerbated in our current situation.”
But Ansett believes automation could play a positive role in data center maintenance post-pandemic. Removing human intervention from maintenance processes could reduce the chances of human error. Making the remaining human processes more regular might have the added benefit of reducing such chances even further while also making maintenance processes more trainable — thus addressing the pre-existing skills shortage.
“Every time somebody walks through the door of the data center, the probabilities of it failing go up exponentially,” he said. “That’s just the human factor — the statistical measure of how reliability works.”
But Uptime’s Brown said that past experience showed that people, including those in data center management, tend to go back to familiar patterns once a crisis is over.
“Everybody I talk to says this is going to change the way we live from here on out,” he said. “But I’m also a student of human nature and history. Throughout history, whenever we’ve faced major challenges, everybody always comments, ‘That was a life-changing moment!’ But not much changed in life. They made some minor changes, but often, once the pain was forgotten about, they went right back to the way they were operating.”
There’s a possibility the same will happen after the current crisis, Brown continued. “But I don’t know if as humans we will rise above basic human nature: not liking change and wanting to go back to the good ol’ days [but] actually making some changes.”
In almost a quarter-century of experience automating data centers, Brown told us, “the one thing I’ve never seen result from that is a reduction of head count... The thing I could see from this might be that people want to reduce the contacts between shifts, and rely more on automation and monitoring. But based on history, I don’t know that some of these contingencies will become the new norm [and] result in permanent changes of behavior.”