Top Causes of SQL Server Downtime

Several dozen readers responded to my call for feedback about the primary causes of planned and unplanned SQL Server outages in their environments. I tallied the results, and the following list presents the reasons in order of how many readers reported them as their primary cause of downtime, not in order of severity. After all, your server is either available or it isn't. Users and employers don't care why they can't get to the database; they simply want the database up and running when they need it. Here's what's causing your SQL Server downtime:

1. Applying Service Packs and Other Patches (28 votes)

Not surprisingly, applying service packs and other patches was the leading cause of downtime (I include OS- and database-level patches and service packs in this group). In light of the many security-related critical updates that Microsoft has recently released, the company needs to improve patch-management functionality so that customers can apply service packs and patches without having to reboot their servers.

2. Problems with SQL Mail (11 votes)

I was surprised that problems associated with SQL Mail were so prominent. Technically, most problems with SQL Mail stem from Messaging API (MAPI)-level issues rather than from the SQL Mail code itself, but SQL Mail remains responsible for many downtime incidents among readers. Again, this problem falls squarely in Microsoft's lap.

3. Random Bugs and Unknown Problems (8 votes)

What are random bugs? Most readers described them as memory-leak problems. Everyone wants to blame Microsoft for these kinds of problems, but I've seen many cases where client-developed code or third-party software was responsible for the outages.

4. Errors in Administration and Maintenance Procedures (5 votes)

Few readers said, "It was my fault." But one reader summed up this common IT problem by saying, "The reasons we experience downtime are usually caused by human error or network problems. Our main problems come from technicians who either reconfigure something without notifying anyone or test something without notification or from equipment that breaks down. My main frustration comes from technicians who forget to communicate." This reason falls into the "oops, I forgot" class of problems and is a reminder that high availability is equal parts technology and human policies and procedures.

5. Lack of Knowledge and Training (3 votes)

This reason is directly related to the errors-in-administration problem. The difference is that, in the previous problem, DBAs know what to do and simply either don't do it or do it incorrectly. However, sometimes DBAs make mistakes because they don't know better. You'll never have a highly available database environment without investing in the human component of high availability through policies, procedures, and training.

The following responses garnered one vote each:

Adding indexes to very large tables, which causes blocking
Virus attacks
Complex environment interdependencies

This reader comment describes how complex environment interdependencies can lead to downtime: "What really makes my SQL Servers reboot too often is their interdependency with the heterogeneous and complex environment they live in. There are bunches of DNS servers, domain controllers, backup servers, firewalls, proxies, routers, switches, repeaters, thousands of network cables, power supply, redundant power supply, disks shared on a SAN, kilometers of fiber channel linking you to your backup site, and last but not least, other database servers. All these elements require maintenance, upgrades, reboots, and the like. The problem is that when one of these elements starts to malfunction, it may have consequences on other elements, and eventually, Windows or your SQL Server will hang and require a reboot."

Are systems getting so complex that one person, or even a group of people, can't possibly keep them running reliably? That's a depressing thought, but I suspect this phenomenon is at work more often than we realize. Over the past few years, Microsoft has made great strides in helping customers deploy highly available SQL Servers, but clearly both Microsoft and the SQL Server community still have work to do. I hope this commentary serves as a reminder that technology alone isn't enough to keep your systems running. Planning, careful adherence to well-thought-out procedures, and an investment in training your team are core components of high availability.
