The High Availability Puzzle

Of all a DBA's missions, none is more important than ensuring that vital business services are available to end users. All of your high-end scalability hardware and modern .NET coding techniques will make little difference if users can't access data. Unplanned downtime for an application or the database server can cost an organization dearly in money and reputation. Outages for large online retailers or financial institutions can cost millions of dollars per hour, and when users can't access a site or its vital applications, the organization loses face and customer goodwill.

Microsoft and other enterprise database vendors have devised several high-availability technologies. For example, Microsoft Clustering Services lets one or more cluster nodes assume the work of any failed nodes. Log shipping and replication help organizations protect against both server and site failure by duplicating a database on a remote server. And traditional backup-and-restore technology protects against server and site failure as well as application-data corruption by periodically saving a database's data and log files so you can rebuild the database to a specified date and time. Although these technologies can help you create a highly available environment, by themselves they can go only so far. Technology alone can't address two critical pieces of the complex high-availability puzzle: the people and processes that touch your system.

Server and site failures can produce downtime, but they're relatively rare compared with human error. The mean time between failures (MTBF) for servers is high, and today's hardware, although not perfect, is usually reliable, making server failures uncommon. In contrast, users, operators, programmers, and administrators interact with your systems virtually all the time, and that high volume of interaction creates more opportunities for problems to arise. Thus, the ability to quickly and efficiently recover from human errors is essential for a highly available system. An operator error can take down a database or server in a few seconds, but recovery could take hours. However, with proper planning, you can reduce downtime due to human error by creating adequate application documentation and by ensuring that personnel receive proper training.
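
The effect of recovery time on availability can be made concrete with the standard steady-state ratio MTBF / (MTBF + MTTR). MTTR (mean time to repair) is a term introduced here for illustration, and the figures below are purely hypothetical. A minimal sketch in Python:

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours, mttr_hours):
    """Expected hours of downtime per year for the given MTBF and MTTR."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical numbers: an incident roughly every 2,000 hours,
    # recovered in 4 hours without rehearsed procedures versus
    # 30 minutes with them.
    for label, mttr in [("unrehearsed recovery", 4.0), ("rehearsed recovery", 0.5)]:
        print(f"{label}: availability={availability(2000, mttr):.5f}, "
              f"downtime/yr={annual_downtime_hours(2000, mttr):.1f} h")
```

With the failure rate held constant, shortening recovery is the only lever in this ratio, which is why documentation and training matter so much for availability.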

Processes are also critical for a highly available environment. Standardized operating procedures can help reduce unnecessary downtime and enable quicker recovery from planned and unplanned downtime. You need written procedures for performing routine operational tasks as well as documentation that covers the steps necessary to recover from various types of disasters. In addition, the DBA and operations staff should practice these recovery plans to verify their accuracy and effectiveness. Another process-related factor that can contribute to high availability is standardizing hardware and software configurations. Standardized hardware components simplify implementing system repairs and acquiring replacement components after a hardware failure. Standardized software configurations make routine operations simpler, reducing the possibility of operator error.

Technology provides the foundation for a highly available environment, but creating one requires more than technology alone. True high availability combines platform capabilities, effective operating procedures, and appropriate training of everyone involved with the system.
