One of Google’s cloud services went down for nearly two hours Monday across all regions, taking a second service down with it and causing errors and higher latency in a third. Ironically, the cause was a problem in the system designed to keep the service from falling over.
The service that went down in the afternoon US Pacific time was Memcache – part of App Engine, Google’s Platform-as-a-Service, which provides a platform and services for developers to build applications without worrying about managing the underlying infrastructure.
Memcache speeds up responses to data store queries by caching them in memory. Answers to the most popular searches on a website, for example, can be served straight from memory instead of getting retrieved from a database or some other type of data store, which typically takes longer.
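The caching idea can be illustrated with a minimal sketch. It is not App Engine's Memcache API: `SlowStore` and `CachedReader` are hypothetical names, and a plain dictionary stands in for the in-memory cache.

```python
class SlowStore:
    """Hypothetical stand-in for a database or other data store."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.reads = 0  # counts how often the backing store is actually queried

    def get(self, key):
        self.reads += 1
        return self.rows.get(key)


class CachedReader:
    """Serve repeated lookups from memory, falling back to the store on a miss."""

    def __init__(self, store):
        self.store = store
        self.cache = {}  # in-memory cache (assumption: no eviction or expiry)

    def get(self, key):
        if key in self.cache:
            return self.cache[key]  # cache hit: the store is not touched
        value = self.store.get(key)  # cache miss: pay for the slower lookup
        if value is not None:
            self.cache[key] = value
        return value


store = SlowStore({"popular search": "cached answer"})
reader = CachedReader(store)
for _ in range(1000):
    reader.get("popular search")
print(store.reads)  # the store was queried once; the other 999 lookups hit the cache
```

If the cache is taken away, every one of those lookups lands on the backing store instead, which is the kind of load shift described in the incident.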
The automatic failover system Google has set up for Memcache requires a consistent view of which data center is serving each application. That way, when there’s a problem with one data center, the system can smoothly switch an application’s traffic to another one. The incident happened when the database entity that stores this data center configuration became unavailable after a configuration update.
Here’s how Google engineers described the root cause in an incident report:
The App Engine Memcache service requires a globally consistent view of the current serving datacenter for each application in order to guarantee strong consistency when traffic fails over to alternate datacenters. The configuration which maps applications to datacenters is stored in a global database.
The incident occurred when the specific database entity that holds the configuration became unavailable for both reads and writes following a configuration update. App Engine Memcache is designed in such a way that the configuration is considered invalid if it cannot be refreshed within 20 seconds. When the configuration could not be fetched by clients, Memcache became unavailable.
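The 20-second rule in the report can be sketched roughly as follows. This is an illustration under stated assumptions, not Google's implementation: `ServingConfigClient`, `fetch_from_database`, and the simulated clock are all hypothetical, and only the 20-second validity window comes from the incident report.

```python
import time

CONFIG_MAX_AGE_SECONDS = 20  # validity window quoted in the incident report


class ConfigUnavailable(Exception):
    """Raised when no sufficiently fresh configuration is available."""


class ServingConfigClient:
    """Hypothetical client holding the application -> data center mapping."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch  # callable that reads the mapping from the database
        self._clock = clock
        self._config = None
        self._fetched_at = None

    def refresh(self):
        """Try to re-read the mapping; on failure keep the old copy and timestamp."""
        try:
            config = self._fetch()
        except OSError:
            return False
        self._config = config
        self._fetched_at = self._clock()
        return True

    def datacenter_for(self, app_id):
        """Return the serving data center, refusing to answer from stale config."""
        if self._fetched_at is None:
            raise ConfigUnavailable("configuration was never fetched")
        age = self._clock() - self._fetched_at
        if age > CONFIG_MAX_AGE_SECONDS:
            raise ConfigUnavailable(f"configuration is {age:.0f}s old")
        return self._config[app_id]


# Simulated clock and database so the example runs instantly.
now = [0.0]
database_up = [True]


def fetch_from_database():
    if not database_up[0]:
        raise OSError("configuration entity unavailable")
    return {"example-app": "datacenter-a"}


client = ServingConfigClient(fetch_from_database, clock=lambda: now[0])
client.refresh()
before = client.datacenter_for("example-app")

database_up[0] = False  # the entity becomes unreadable
now[0] = 25.0           # more than 20 seconds pass without a successful refresh
refreshed = client.refresh()
try:
    client.datacenter_for("example-app")
    after = "served"
except ConfigUnavailable:
    after = "unavailable"

print(before, refreshed, after)
```

Refusing to serve from an expired mapping trades availability for consistency: a client never routes to a data center that may no longer be the serving one, but it also stops working whenever the configuration cannot be read.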
During the incident, because Memcache was unavailable, requests to Memcache went to the Datastore service, creating a “surge of Datastore activity,” which led to errors and latency. Another surge of traffic to Datastore came when the Memcache outage ended, causing elevated latency for some applications hosted in the US for an additional 40 minutes after Memcache itself was restored.
Another affected service was Managed Virtual Machines, which launched in 2014 as something between raw Infrastructure-as-a-Service and PaaS but was later replaced by a service called Flexible Environment. Google still has customers running applications that use Managed VMs, all of whom saw failures for all HTTP requests and App Engine API calls during the outage. The incident did not affect Flexible Environment users.
Google provides two tiers of service for Memcache: free shared Memcache, a best-effort service whose capacity depends on resource availability at any given moment; and dedicated Memcache, which secures fixed cache capacity at $0.06 per GB per hour.
Google is one of the leaders in the PaaS space, competing with Salesforce’s Heroku, which has dominated for years. Other heavyweights include Microsoft, Mendix, OutSystems, IBM, Red Hat, SAP, and Oracle.
PaaS is a much smaller market than subscription-based software applications (Software-as-a-Service, or SaaS) or raw infrastructure (Infrastructure-as-a-Service, or IaaS), but analysts project it will grow at a healthy rate over the next four years. The global PaaS market reached $7.17 billion in size last year, according to Gartner, which estimates it will grow to $8.85 billion this year and to $14.8 billion in 2020.
Much of that growth will be driven by widespread adoption of Internet of Things applications. According to the research firm, by 2020 more than half of new applications built on PaaS will be “IoT-centric.”
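For a sense of what the Gartner figures above imply as annual growth rates, here is a short worked calculation. The dollar amounts come from the article. The four-year span between this year's figure and the 2020 figure is an assumption based on the "next four years" wording, and `growth_rate` is a helper written for this example.

```python
def growth_rate(start, end, years=1):
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1 / years) - 1


# Market size in billions of US dollars, as cited from Gartner.
last_year = 7.17
this_year = 8.85
in_2020 = 14.8
years_to_2020 = 4  # assumption: "this year" is four years before 2020

one_year = growth_rate(last_year, this_year)
through_2020 = growth_rate(this_year, in_2020, years_to_2020)

print(f"Last year to this year: {one_year:.1%}")
print(f"This year to 2020, per year: {through_2020:.1%}")
```

On those assumptions, the projection works out to roughly 23 percent growth this year, followed by roughly 14 percent a year through 2020.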