You’ve probably heard of George Santayana, but you might not have heard of David Sifry. Both men have something valuable to teach us.
Santayana (1863-1952) was a Spanish-born philosopher and literary critic who was educated and taught in the United States. Of his many works, he’s best known for a single line, usually paraphrased as “Those who do not learn from history are doomed to repeat it.” Sifry is the inventor and developer who launched the popular Technorati Web site (http://www.technorati.com). Technorati is, at heart, a search engine dedicated to searching Web logs and exposing connections between them. The site has become exceptionally popular, in part because of a cool feature that shows which sites link to which other sites. So, what do these two men have in common?
Recently, Technorati was down for an entire weekend--from Friday night until Monday morning--because of a chain of problems. Sifry detailed the outage, and his company’s response to it, in his blog. If you adhere to Santayana's philosophy, you can learn several lessons from Sifry's experience--and, with luck, avoid a similar trauma yourself.
Lesson 1 comes simply from the fact that Sifry took the unusual step of publicizing his experience. Most companies would rather die than admit that they’ve had an outage. The first lesson, therefore, is that being open about your availability problems can benefit others (although political constraints might keep you from spilling all the beans).
So, what happened to Technorati’s service? First, an electrical fire at the collocation facility that houses Technorati’s servers cut power to the servers. The facility's UPS units kicked in, but the backup generator didn’t (the generator either failed or was damaged by the fire). The UPS units eventually ran out of power, and the servers shut down ungracefully, leading to widespread data corruption. Hence, lesson 2: Always have an emergency mechanism in place to shut down your servers gracefully before backup power runs out.
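To make lesson 2 concrete, here is a minimal sketch of the decision such an emergency mechanism has to make: once the servers are running on battery, start a clean shutdown while enough runtime remains to finish it. This is an illustration only, not a description of what Technorati or its hosting facility used. The UpsStatus type, the should_shut_down function, and the timing values are all hypothetical; a real deployment would read the battery state from whatever monitoring interface its UPS hardware provides.

```python
from dataclasses import dataclass


@dataclass
class UpsStatus:
    """Hypothetical snapshot of a UPS, as reported by its monitoring interface."""
    on_battery: bool          # True when utility power is lost
    runtime_remaining_s: int  # estimated seconds of battery left


def should_shut_down(status: UpsStatus,
                     shutdown_duration_s: int = 120,
                     safety_margin_s: int = 180) -> bool:
    """Return True when a graceful shutdown should begin now.

    Assumed policy: do nothing while utility power is present; on battery,
    shut down as soon as the remaining runtime is no more than the time a
    clean shutdown takes plus a safety margin.
    """
    if not status.on_battery:
        return False
    return status.runtime_remaining_s <= shutdown_duration_s + safety_margin_s


if __name__ == "__main__":
    # Fifteen minutes of battery left: keep running and keep watching.
    print(should_shut_down(UpsStatus(on_battery=True, runtime_remaining_s=900)))
    # Four minutes left: begin the graceful shutdown.
    print(should_shut_down(UpsStatus(on_battery=True, runtime_remaining_s=240)))
```

The safety margin matters because runtime estimates tend to become less reliable as a battery drains; shutting down a little early is far cheaper than restoring corrupted data afterward.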
The good news is that Sifry and his staff diligently kept daily backups, which they were able to restore. The bad news is that they had to restore those backups on more than 100 servers. That certainly isn't how most of us would choose to spend a weekend, and it vividly illustrates lesson 3: Tier 1 hosting facilities are expensive, but the extra money buys you a much higher degree of insurance against the kinds of problems Sifry experienced.
The incident, says Sifry, sparked a comprehensive review of his company's availability requirements and processes. His team had planned to implement a better shutdown system for their servers, but decided to defer the implementation until after they'd moved the servers to a new collocation facility. Lesson 4: Putting off disaster planning doesn’t pay.
What if your servers were all down for an entire weekend? Would it put you out of business or be just a minor inconvenience? For most of us, the answer is somewhere in between the two extremes, but it’s a darned good idea to know the answer ahead of time, and find any holes in your disaster recovery plan before disaster strikes.
And while we're on the subject of repeating history: Florida has had a really tough hurricane season this year, and I'd like to repeat my suggestion that you consider donating to the Florida Hurricane Relief Fund (http://www.flahurricanefund.org) or the American Red Cross (http://www.redcross.org). There are a lot of people who can use our help right now.