You can spend billions designing and building redundant infrastructure to make sure your data centers don’t go down, but things can and do go wrong anyway – sometimes simply because of bad weather.
A cooling problem in one of its San Antonio, Texas, data centers led to Office 365 and Azure cloud outages for some customers Tuesday, Microsoft said. The problem was caused by a voltage spike that resulted from a “severe weather event, including lightning strikes” near the facility, the company said.
The problems started shortly after 2 am Pacific, according to Microsoft’s cloud status notifications. The company said the outage was limited to resources hosted in the South Central US availability region, which is hosted in San Antonio, but acknowledged that customers in other regions may have had problems too.
While the “preliminary root-cause” was a severe weather-related cooling issue, applications went down when the system automatically started to shut off hardware to prevent damage from high temperature:
“A severe weather event, including lightning strikes, occurred near one of the South Central US datacenters. This resulted in a power voltage increase that impacted cooling systems. Automated datacenter procedures to ensure data and hardware integrity went into effect and critical hardware entered a structured power down process.”
Engineers had restored power to the facility and recovered most of the impacted network devices, observing signs of recovery for some services, company representatives wrote in a status update posted more than nine hours after the outages started.
Close to 40 Azure services – not all services hosted in Microsoft’s San Antonio data centers – were affected, according to the Azure status dashboard. Office 365 services like Exchange, SharePoint, and Teams were also affected, according to another service health status notification. Also affected were non-region-specific Azure services, such as Active Directory, Bot Service, and Resource Manager.
The Azure health status dashboard itself was unavailable for some time as well, with the support team relying on Twitter for communicating with affected customers, according to news reports.