An accidental release of fire-suppression agent into the data center environment in Europe triggered a chain of events that led to an Azure cloud outage for a group of customers on September 29.
Microsoft Azure engineers attributed what they referred to as a “Storage Related Incident,” which led to systems going down for customers hosting virtual infrastructure in the company’s data center in Northern Europe, to an accident during routine maintenance of the facility’s fire-suppression system.
From the incident report posted on the Azure status page:
During a routine periodic fire suppression system maintenance, an unexpected release of inert fire suppression agent occurred. When suppression was triggered, it initiated the automatic shutdown of Air Handler Units (AHU) as designed for containment and safety. While conditions in the data center were being reaffirmed and AHUs were being restarted, the ambient temperature in isolated areas of the impacted suppression zone rose above normal operational parameters. Some systems in the impacted zone performed auto shutdowns or reboots triggered by internal thermal health monitoring to prevent overheating of those systems.
The staff brought the air handlers back online 35 minutes after the unexpected gas release, returning facility temperature to a normal operational level and restoring all systems. But because of temperature variance in some areas, some servers and storage units hadn’t gone through a controlled shutdown, which meant more time would be needed to troubleshoot and recover those systems.
All in all, seven hours had elapsed between the time the fire-suppression system went off and the time a storage scale unit and services that depended on it were back to normal. Services that depended on the affected storage resource included Virtual Machines, Cloud Services, Azure Backup, and 10 others. The effects ranged from latency and errors to services not being available.
In what has become common practice for cloud providers, Microsoft said customers with redundant virtual machines deployed across multiple isolated hardware clusters would not have been affected by the outage. The Azure-specific name for this high-availability feature is “Availability Sets.”
While it’s always wise to build a degree of redundancy into IT infrastructure, physical or virtual, it is more costly, and some companies don’t do it, often to avoid extra costs associated with running applications they don’t consider critical.
In addition to Availability Sets, Microsoft Azure recently started rolling out availability zones, which are multiple, isolated data centers within a single cloud availability region. For now, only two regions have this multi-zone setup as a preview – one in Northern Virginia and the other in Western Europe.
Azure North Europe region is hosted in Microsoft data centers in Dublin.