In the media, you'll find a wealth of information about hardware and software technologies that can provide the ultimate in server performance—five nines of uptime. Making hardware and applications available 99.999 percent of the time is no trivial matter. Even ascending from 98 percent to 99 percent uptime significantly increases hardware and software costs. Going for that fifth nine can result in an exponential cost increase.
Fortunately, few applications require five nines of uptime. Most network administrators would be happy with two nines (i.e., 99 percent server uptime), which is far easier to achieve than five nines. Regardless of whether your server platform is Windows 2000 or Windows NT 4.0, you can make a few specific decisions about server software, as well as the purchase and deployment of server hardware, to provide a strong base for reliable servers.
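To put those nines in perspective, it helps to convert an uptime percentage into a yearly downtime budget. The short Python sketch below does that arithmetic; the function name is my own, and it assumes a 365-day year. By this measure, 98 percent uptime allows about 175 hours of downtime per year, 99 percent allows about 87.6 hours, and five nines allows only a little more than 5 minutes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime budget (in minutes) implied by an uptime percentage."""
    return MINUTES_PER_YEAR * (100.0 - uptime_percent) / 100.0


# Print the budget for a few common availability targets.
for pct in (98.0, 99.0, 99.9, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% uptime: {minutes:.1f} minutes ({minutes / 60:.2f} hours) of downtime per year")
```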
Basic System Design
Before I start introducing panic into your life, let me say this: The basic server-class hardware that major system vendors sell is fairly reliable. Perusing Microsoft's Hardware Compatibility List (HCL) before you start shopping is a good idea—and a necessity if you're shopping for Win2K Server hardware.
So what should you look for first in a reliable server? I'm going to step out on a limb and recommend that a reliable server primarily needs good systems management software (along with a properly instrumented setup). All the first-tier server system vendors include systems management software as a standard part of their server package.
This first level of software typically includes basic systems management services and system-health management and reporting capabilities. Generally, you'll find full support for systems management standards such as Desktop Management Interface (DMI) and Common Information Model (CIM). Such support lets these systems provide data to the upstream systems management tools that enterprise environments often use (e.g., Hewlett-Packard's OpenView, Computer Associates' Unicenter, Tivoli). At the same time, support for these standards lets systems provide significant levels of system information, accessible on a system-by-system basis, without requiring an investment in those large-scale systems management tools.
First-level systems management tools provide the basic information you need to keep an eye on your servers' health and—not incidentally—performance. The fundamentals of hardware management include monitoring system voltage, fan speed, and thermal conditions, and examining the system hardware for specific types of faults. When the systems management software or firmware detects such faults, it can take a couple of different actions. In some cases—for example, when the software detects transient voltage drops or spikes—the software can only generate alerts. In other cases—for example, when the software detects a bad fan—the software can not only alert the network administrator but also transfer the cooling load to other fans in the box.
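The alert-only versus alert-and-compensate distinction can be sketched in a few lines of Python. This is an illustration, not any vendor's implementation: the sensor names, the reading kinds, and every limit in the table are assumptions I made up for the example, and real limits come from the hardware vendor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reading:
    sensor: str   # hypothetical sensor name, e.g. "fan1" or "12v_rail"
    kind: str     # "fan_rpm", "temp_c", or "volts"
    value: float


# Hypothetical (low, high) limits per reading kind; not taken from any real server.
LIMITS = {
    "fan_rpm": (1500.0, 12000.0),
    "temp_c": (5.0, 70.0),
    "volts": (11.4, 12.6),
}


def evaluate(reading: Reading) -> List[str]:
    """Return the actions to take for one reading; an empty list means healthy."""
    low, high = LIMITS[reading.kind]
    if low <= reading.value <= high:
        return []
    actions = [f"ALERT: {reading.sensor} out of range ({reading.value})"]
    # A failed fan is a fault the system can compensate for; a voltage dip is not.
    if reading.kind == "fan_rpm":
        actions.append("COMPENSATE: raise the speed of the remaining fans")
    return actions


print(evaluate(Reading("fan1", "fan_rpm", 0.0)))      # alert plus compensation
print(evaluate(Reading("12v_rail", "volts", 10.9)))   # alert only
```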
High-end systems might also be able to analyze system information and alert the systems administrator about potential system failures. Vendors market this capability under names such as prefail warranty or predictive failure analysis. Coupled with warranties and service contracts that offer prefail replacement parts, predictive failure analysis can give the systems administrator plenty of warning when key components (e.g., hard disks, memory, CPUs) are on the quick path to failure and require replacement.
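Vendors don't publish the exact rules behind their prefail analysis, so the following Python sketch shows only one plausible approach: flag a component when the number of corrected errors it reports over a recent window exceeds a limit. The class name, the window size, and the limit are all hypothetical.

```python
from collections import deque


class PrefailMonitor:
    """Flag a component when corrected errors in the last `window` samples exceed `limit`."""

    def __init__(self, window: int = 24, limit: int = 10):
        # Only the most recent `window` samples are kept; older ones fall off.
        self.samples = deque(maxlen=window)
        self.limit = limit

    def record(self, corrected_errors: int) -> bool:
        """Add one sample and return True if the component looks likely to fail."""
        self.samples.append(corrected_errors)
        return sum(self.samples) > self.limit


monitor = PrefailMonitor(window=3, limit=5)
for count in (1, 2, 3):
    warn = monitor.record(count)
    print(f"corrected errors this interval: {count}, prefail warning: {warn}")
```

The point of the window is that a component that corrects the occasional error is healthy, whereas one whose corrected-error count keeps climbing is a candidate for replacement before it fails outright.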
The next step up the reliability path is the add-on, hardware-based systems management card. IBM integrates this hardware—the Advanced System Management Processor—into its high-end servers. Compaq offers a couple of versions of its Remote Insight Board. Dell's implementation is the Remote Assistant Card. Any vendor selling enterprise-level server hardware offers similar technology. These boards and add-on processors build on the basic capabilities of the systems' standard management tools. Look for features such as in-band and out-of-band management, and modems that support direct dial-out to alphanumeric pagers or even to the vendor. IBM trumpets its high-end Netfinity servers' ability to call technical support—without human intervention.
These server-management cards often include the ability to dial in to a down server and run diagnostics or reboot the system—an important feature in a small shop with limited resources. The cards also typically offer the ability to redirect the server console to another system through an in-band or out-of-band connection (or both). Therefore, you can run your servers headless (i.e., sans console), eliminating the keyboard, mouse, and monitor as points of failure. Some cards even provide all the server control and diagnostic features through a Web-browser interface, letting you monitor and (in some cases) control systems from any computer that can see the server. Talk to your system vendor about the features that are most useful to you. Not every vendor offers the same combinations of hardware and management features, and some vendors (e.g., Compaq) offer multiple versions of systems management cards with varying features and options.
After you decide on a systems management approach, the remainder of the server-reliability equation becomes simpler. Now you need to think about straightforward hardware and implementation concerns.
Let's start with power. Redundant, hot-swappable power supplies are standard on any server that claims to offer increased levels of reliability or availability. However, I've lost count of how many shops I've visited in which both redundant power supplies were plugged into the same UPS, which turns that UPS into a single point of failure. Spend some time thinking about how power runs into your server room. If possible, give each side of those redundant power systems its own UPS and its own circuit, each protected by a separate breaker. We've all heard stories about custodial personnel unplugging crucial equipment to make room for the vacuum-cleaner cord. Why run the risk of such silliness bringing down a server?
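If you keep even a simple inventory of which power supply plugs into which UPS or circuit, a few lines of Python can spot the shared-source mistake for you. The sketch below assumes a hand-maintained dictionary; the server, supply, and UPS names are invented for the example.

```python
def find_shared_sources(power_map):
    """Return {server: [sources]} for servers whose supplies share an upstream source."""
    risky = {}
    for server, supplies in power_map.items():
        sources = list(supplies.values())
        # A source that feeds more than one supply is a single point of failure.
        shared = sorted({s for s in sources if sources.count(s) > 1})
        if shared:
            risky[server] = shared
    return risky


# Hypothetical inventory: supply name -> upstream UPS or circuit.
inventory = {
    "web1": {"psu_a": "ups1", "psu_b": "ups1"},   # both supplies on one UPS
    "db1": {"psu_a": "ups1", "psu_b": "ups2"},    # properly split
}
print(find_shared_sources(inventory))
```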
Memory failures can quickly crash a server. Most server vendors offer some form of memory protection that can prevent soft errors from causing a server failure, and additional Error-Correcting Code (ECC) memory protection can prevent hard errors from faulting an entire memory DIMM. (A soft error, typically caused by electrical surges, goes away when memory is refreshed, whereas a hard error is bad memory that you must replace.) Vendors such as Compaq, with its Advanced ECC memory, offer additional memory protection schemes that provide enhanced protection from hard memory errors. IBM's Chipkill technology, when incorporated into the system motherboard, enables the use of nonproprietary memory and gives you additional protection from hard memory errors.
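To see how an error-correcting code can repair a flipped bit, consider the classic Hamming(7,4) code, which protects 4 data bits with 3 parity bits. This is a toy illustration only: ECC memory in real servers uses wider codewords, and I'm not claiming this is how Compaq's or IBM's schemes work. The function names are my own.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(word):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(word)  # work on a copy so the caller's list is untouched
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 0 means clean; otherwise the bad position
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


stored = encode([1, 0, 1, 1])
stored[4] ^= 1                        # simulate a single-bit memory error
print(decode(stored) == [1, 0, 1, 1])
```

The parity bits are arranged so that the pattern of failed checks spells out the position of the bad bit, which is why the decoder can repair the error rather than merely detect it.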
Hard disk failures are probably the most common severe hardware failures that systems administrators encounter. You've probably heard most of the following suggestions, but they bear repeating.
Use some form of hardware fault tolerance. Both Win2K and NT let you implement RAID through software. However, hardware RAID is generally more reliable and provides better performance. Hardware RAID offers a better selection of configuration options and almost always lets you take more detailed control and management of the disks.
Be careful how you configure. Take a look at the layout of your system and the data partitions on your RAID array. Whereas systems administrators in large enterprises are often familiar with how to optimize RAID arrays, administrators in smaller shops tend to avoid changing factory configurations of the arrays.
From recent discussions with small-network IT people, I came away with some surprising information. In many cases, these small shops use RAID 5 with only three drives on one SCSI channel. In essence, they've put all their eggs in one basket. Most surprisingly, the shops are using these drives for both system and data partitions. Many of these administrators don't realize that RAID 5 actually decreases write performance relative to other striping technologies, because each write must also update parity data. Also, because their swap files and system partitions share the same stripe set as their data files, they're experiencing an unnecessary performance hit.
If possible, use more than three drives for a RAID 5 array, don't put your system partition on the same drives as your data, and make use of the multichannel SCSI controllers that every server system vendor offers. With current SCSI technology, you can mix different-speed SCSI devices in the same chain and still get close to maximum performance from each device. Keep those hot-swap trays full, and make sure that you have spare drives available for an instant swap. Administrators who have never had to service user requests while working on a server that is rebuilding a RAID 5 stripe set are often blinded by RAID 5's "It keeps working" promise. Yes, the system will stay up, but you don't want to put much of a load on it while it recovers from a dead drive.
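A small Python sketch shows both why RAID 5 survives a dead drive and why the rebuild is expensive. Parity is the XOR of the data blocks in a stripe, so a lost block can be recomputed only by reading every surviving block in that stripe. The example is simplified: it uses one stripe with a fixed parity drive, whereas real RAID 5 rotates parity across the drives, and the function names are my own.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def usable_fraction(drive_count: int) -> float:
    """Fraction of raw capacity left for data; one drive's worth goes to parity."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (drive_count - 1) / drive_count


data = [b"ABCD", b"EFGH", b"IJKL"]   # one stripe across three data drives
parity = xor_blocks(data)            # stored on a fourth drive

# The second drive dies: rebuild its block from every survivor plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])

print(usable_fraction(3), usable_fraction(5))
```

Adding drives also improves the capacity math: a three-drive array gives up a third of its raw space to parity, whereas a five-drive array gives up only a fifth.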
Consider an external enclosure. Even a small business can benefit from an external drive array. After you've properly configured an external array, it's not likely to be your single point of failure. If all the data resides outside the system box, the ability to simply plug in an entire replacement server can drastically reduce downtime.
Another feature you should consider is the ability to hot-swap a PCI card, a feature typically relegated to only top-end servers. Marketed under different trade names, hot-swappable PCI lets you keep servers running when SCSI controllers and NICs fail. A limited number of hardware devices support this technology, primarily because vendors must write specialized drivers that let the system stop and restart an individual PCI device dynamically. However, the feature is making its way down the product chain and is worth your consideration.
The High-Availability Bandwagon
The average systems administrator in a small or midsized business can take quite a few steps to keep servers up and running, particularly through careful selection and maintenance of the server hardware platform. The standard tools that accompany top-tier servers provide most of what you need to ensure the availability of your crucial business data, without requiring you to invest in more nines than you need or can afford. Although enterprise-class users will leap on the high-availability bandwagon and invest in systems that guarantee 99.9 percent uptime (and even more nines, now that Win2K Datacenter Server is shipping), systems administrators on a budget can still provide a high level of availability to their smaller—and no less crucial—business.