How to Take Down the Cloud with a Single Finger

It sounds like either a Vegas magic act or the result of years of martial arts training, but did you hear that a single "fat finger" took down the Joyent Cloud?

Joyent touts itself as the high-performance cloud infrastructure and big data analytics company, serving an eclectic mix of companies I'm sure you've never heard of. In addition to promoting a high-performance service that it reports is up to 200% faster than Amazon Web Services in some areas, the company also promises high availability, with a 99.9999% uptime record in every service region and a 100% SLA.
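To put that uptime figure in perspective, here is a quick back-of-the-envelope calculation in Python. It is only a sketch: it assumes a 365-day year and reads the percentage as a yearly availability number, and the function name is mine, not anything Joyent publishes.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap, 365-day year


def allowed_downtime_seconds(uptime_percent: float) -> float:
    """Return the yearly downtime (in seconds) implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    downtime_fraction = 1.0 - uptime_percent / 100.0
    return downtime_fraction * SECONDS_PER_YEAR


if __name__ == "__main__":
    # Compare a few common availability levels, including the one quoted above.
    for pct in (99.9, 99.99, 99.9999):
        print(f"{pct}% uptime allows {allowed_downtime_seconds(pct):.1f} s of downtime per year")
```

At 99.9999%, that works out to roughly 31.5 seconds of downtime per year, which does not leave much room for rebooting an entire datacenter.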

But that's where things get murky after a recent incident. On May 27, 2014, an operator at Joyent was in the midst of performing capacity upgrades. Instead of rebooting only the servers that were part of the upgrades, the operator mistyped and specified all servers in the datacenter for the reboot. And, of course, the reboot command tools did not have enough validation to make sure the operator was "really sure" that the reboot should be performed against all systems.
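The kind of validation that was missing does not have to be elaborate. Here is a minimal Python sketch of an "are you really sure" guard. To be clear, this is not Joyent's tooling: the function name, the `RebootRefused` exception, the 25% threshold, and the server names are all hypothetical, chosen only to illustrate the idea. Small requests pass straight through, while a request that touches a large share of the fleet requires the operator to type the exact number of servers affected.

```python
from typing import Callable, Iterable, List


class RebootRefused(Exception):
    """Raised when a reboot request fails a safety check (hypothetical)."""


def plan_reboot(
    targets: Iterable[str],
    fleet: Iterable[str],
    confirm: Callable[[str], str],
    max_fraction: float = 0.25,
) -> List[str]:
    """Return the sorted list of servers to reboot, or raise RebootRefused.

    Illustrative guard only: a request touching more than `max_fraction`
    of the fleet must be confirmed by typing the number of affected servers.
    """
    fleet_set = set(fleet)
    target_list = sorted(set(targets))

    if not target_list:
        raise RebootRefused("no servers specified")

    unknown = [name for name in target_list if name not in fleet_set]
    if unknown:
        raise RebootRefused(f"unknown servers: {unknown}")

    # Large blast radius: make the operator acknowledge the exact scope.
    if len(target_list) > max_fraction * len(fleet_set):
        prompt = (
            f"This will reboot {len(target_list)} of {len(fleet_set)} servers. "
            "Type the number of servers to continue: "
        )
        if confirm(prompt).strip() != str(len(target_list)):
            raise RebootRefused("confirmation did not match; nothing rebooted")

    return target_list
```

In an interactive tool, `confirm` could simply be the built-in `input`, for example `plan_reboot(["cn-01", "cn-02"], fleet, input)`. Asking for the server count rather than a plain "y" means a reflexive keystroke is not enough to approve a datacenter-wide reboot.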

In a post to Hacker News, Bryan Cantrill, CTO at Joyent, said this:

It should go without saying that we're mortified by this. While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter. As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are/will be making to both the software and to operational procedures to assure that this doesn't happen in the future (and that the recovery is smoother for failure modes of similar scope).

Later, on May 28, 2014, the Joyent Team put up a blog post to better describe the entire situation, along with a gushing apology.

We've all been there before and know that gut-wrenching feeling when the entire datacenter around us goes dark. Things happen, and there's no way to eliminate human error entirely. But in the Cloud world there's a difference.

When this happens in a company, the company goes dark for as long as it takes for the servers to come back online. When this happens at a Cloud provider, ALL companies tied to the provider's datacenter go dark, and they no longer have a single person to point a finger at. The problem is far larger, and it highlights an issue that will become more prominent in the near future.

Mark Russinovich said recently, during the Mark Minasi interview session at TechEd 2014, that he believes the Cloud will come down to 3, maybe 4, providers in the future because the smaller providers will not be able to match the big ones on compute, scalability, performance, and particularly pricing. We've all watched the Cloud pricing wars of late, and the cost of Cloud services is being driven through the floor.

But there's another aspect I think Mark should have highlighted: the big 3 or 4 providers need to build Clouds that are built to fail. Built to fail? What does that mean exactly?

I've sort of talked about this before. In Build the Cloud Like a Bad Guy, I suggested that Cloud providers are building the Cloud all wrong. It's great to see new features added at such a frantic pace, but what happens when those features are unavailable? Microsoft and others have the opportunity to build out datacenters that can fail over automatically during a disaster, but smaller Cloud providers don’t have the funds and resources to do the same. Additionally, if you've ever had the chance to visit an Azure datacenter, you know how startling it is to see just how much of the system is automated (orchestrated). Most upgrades are automated with failover built into the process, eliminating the "fat finger" scenario that could be blamed on a single operator. Again, it's tough for smaller Cloud providers to do this with limited resources. Even Rackspace, considered a larger Cloud provider, is rumored to be looking to exit the Cloud provider business because Microsoft and others are advancing so fast and cutting prices so low that they are pushing it out of the market.

So, it seems we're getting there, and the Joyent operator with fat fingers really highlights how important the selection criteria are for choosing the right Cloud provider, particularly if vendors are tied to a high-availability commitment.

BTW: If you missed the Mark and Mark and the Funky Bunch session at TechEd 2014, here it is:
