OVH, the quickly growing French cloud provider that’s been aggressively going after US market share of giants like Amazon and Microsoft, is planning to shut down and disassemble two of the three data centers on its campus in Strasbourg, France, following a power outage that brought down the entire campus Friday, causing prolonged disruption to customer applications that lasted throughout the day and well into the evening.
About 40 minutes after Strasbourg went dark, the company’s campus in Roubaix -- its largest one, located about 500 kilometers away -- lost all connectivity to six crucial points of presence on its network, which the company’s founder and CEO Octave Klaba said was an incident unrelated to the Strasbourg data center outage and caused by an optical networking equipment software bug.
The embarrassing incident is a major setback for the company, valued above $1 billion. Roubaix-based OVH has been enjoying lots of momentum recently, securing new financial backing and expanding into new markets across Europe and North America. Earlier this year it acquired VMware's public cloud business and announced construction of a data center in Oregon and a new office in Reston, Virginia. It's also building a data center in Vint Hill, Virginia, not far from Reston. OVH already has a live data center footprint in the Montreal market.
“This is probably the worst-case scenario that could have happened to us,” Klaba wrote in a detailed blog post Friday, in which he also described the decision to get rid of two Strasbourg data centers, which were built using shipping containers in order to shorten construction time.
Even if this morning's incident was caused by third-party automaton, we cannot deny our own liability for the breakdown. We have some catching up to do on SBG to reach the same level of standards as other OVH sites.
The Strasbourg site was without power for 3.5 hours on Friday, but it took OVH staff many hours to restart servers and restore applications. Many servers, which OVH builds by itself, apparently experienced hardware failure as a result of the outage; on Friday morning, a team in Roubaix loaded a truck with spare parts and sent it to Strasbourg, with technicians there working well into the night to replace parts and boot up computers.
Bringing connectivity in Roubaix back to normal was easier and took a lot less time, but its impact was wide-ranging nevertheless. The optical network that went down connects the campus to network PoPs in Paris, Frankfurt, Amsterdam, London, and Brussels, all of which, save for Brussels, are among the most important network interconnection hubs in Europe.
While attributing the bug to the optical networking equipment vendor, whom he did not name, Klaba said OVH was ultimately responsible for the outage by not being “paranoid” enough:
We will work with the OEM to find the source of the problem and help fix the bug. We do not doubt the equipment manufacturer, even if this type of bug is particularly critical. Uptime is a matter of design that must consider every eventuality, including when nothing else works. OVH must make sure to be even more paranoid than it already is in every system that it designs.
The third-party automation failure Klaba referred to in the quote above was the failure of a motorized failover system in Strasbourg to switch to generator power when the campus lost utility power. The company tests the failover system regularly, Klaba said, and the last test was conducted – without problems – this May.
He admitted, however, that the company could have done more in terms of infrastructure design to avoid Friday’s meltdown. The entire site is fed by a single 20kV utility feed, as opposed to the standard practice of having two redundant feeds, often from two separate electrical grids.
OVH uses redundant utility feeds and separate power grids for individual data centers on its other campuses, according to the chief executive, but not in Strasbourg, where two of the buildings (SBG1 and SBG2) are on the same grid.
The company developed the container-based design, where it essentially stacks shipping containers on top of each other instead of building walls and a roof, to speed up deployment by removing the time constraints associated with receiving building permits, Klaba explained. It was also a way to “test the appetite for each market, with new cities and new countries,” before making a big investment into a new location.
SBG1, the first Strasbourg data center, consisting of eight containers, came online in 2012, after less than 2 months of work, he wrote. There turned out to be high-enough demand in the market, so the company built a non-container data center, SBG2, there in 2016, using its “Tower” design and started construction of a third one, SBG3.
Before SBG2, however, as it was struggling to meet demand for capacity in Strasbourg, OVH built a second container data center there, SBG4, in 2013.
Now that OVH’s decision to save time and money in order to deploy capacity quickly has backfired in such a spectacular way, the company has decided to invest 2 to 3 million euros to install a second utility feed for the campus and millions more to put buildings on separate power grids, move customers out of shipping containers and into SBG3, and uninstall the containers.
Here's a play-by-play timeline of the worst day in OVH's history (all times are local, GMT+1):