Hardware doesn’t live forever. But you can extend the life of your server hardware by following these practical maintenance steps.
1. Get to Know Your Server
Get reacquainted with your Exchange Server. The components most likely to fail are the server's hard disks, because they contain moving parts and are almost constantly in use. Over time, hard disks simply wear out.
Most servers contain mechanisms for predicting drive failures. Although such mechanisms are not entirely reliable, you need to pay attention to their warnings. The problem is that no industry standard exists for drive-failure warnings. Self-Monitoring, Analysis, and Reporting Technology (SMART) is one of the most widely used drive-monitoring mechanisms, but even SMART technology is implemented differently from one manufacturer to another, so guessing how a server is going to warn you of an impending drive failure can be difficult. Some servers might generate warnings through the OS, while others might notify you only through the BIOS or a warning light. Knowing which notification method your server uses is important so that you can recognize the signs of an impending drive failure and periodically check the health of your server's drives.
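As a rough illustration of what a periodic drive-health check might look at, the sketch below assumes the common SMART convention in which each attribute has a normalized value and a vendor-defined threshold, and an attribute counts as failing once its value drops to or below that threshold. The class name, function name, and sample readings are all hypothetical. Because manufacturers implement SMART differently, treat this as a starting point rather than a description of how any particular server reports problems.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    """One SMART attribute as a drive might report it (illustrative fields)."""
    name: str
    value: int      # current normalized value; higher is usually healthier
    threshold: int  # vendor-defined failure threshold; 0 is treated as informational


def failing_attributes(attributes):
    """Return names of attributes whose value has dropped to or below the threshold."""
    return [
        a.name
        for a in attributes
        if a.threshold > 0 and a.value <= a.threshold
    ]


# Hypothetical readings for a single drive, not taken from any real device.
readings = [
    SmartAttribute("Reallocated_Sector_Ct", value=36, threshold=36),
    SmartAttribute("Spin_Up_Time", value=97, threshold=21),
    SmartAttribute("Power_On_Hours", value=54, threshold=0),
]

print(failing_attributes(readings))
```

How you obtain the readings depends on the server: some expose them through the OS, while others surface warnings only through the BIOS or a warning light.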
2. Make Sure Drives Are Still Available
Microsoft recommends placing Exchange Server databases on RAID arrays, which offer both performance improvements and fault tolerance. In some instances, however, a fault-tolerant array provides a false sense of security.
A friend of mine had an aging server that was connected to a fault-tolerant array. Although the array was several years old, none of the drives had ever failed. However, when a drive did eventually fail, the manufacturer had discontinued production of the required drives. My friend tried to track down replacement drives on the Internet, but a second drive failed before he could find one. The array had been designed to provide fault tolerance against one drive failure, but provided no mechanism for protecting data against multiple drive failures. In this case, the array proved no more reliable than a single drive—all because the replacement parts were unavailable.
If your Exchange Server is running on older hardware, periodically check that drives are still available for your arrays. Even if it looks like drives will be available forever, keep a couple of spare drives on hand so that you can immediately recover from any failures that might occur. Although fault-tolerant arrays are designed to keep running even after a failure, replacing a failed drive as quickly as possible is a good idea: until you do, an array built to survive one drive failure has no protection against a second one, and depending on the type of array you’re using, the server's performance might suffer dramatically.
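To put a number on why replacement speed matters, the following sketch estimates the chance that a second drive fails while you are still waiting for a replacement. It is a simplification under stated assumptions: failures are treated as independent and occurring at a constant annual rate, and the 8% rate and drive count are made-up inputs, not measurements. Drives of the same age and batch often fail close together, so real-world risk on an aging array may be higher than this estimate.

```python
def second_failure_probability(remaining_drives, annual_failure_rate, days_to_replace):
    """Chance that at least one remaining drive fails before the replacement arrives.

    Assumes independent failures at a constant annual rate (a simplification).
    """
    # Probability that a single drive survives the waiting period.
    per_drive_survival = (1.0 - annual_failure_rate) ** (days_to_replace / 365.0)
    # The array survives only if every remaining drive survives.
    return 1.0 - per_drive_survival ** remaining_drives


# Hypothetical array: 5 drives left after one failure, 8% assumed annual failure rate.
for days in (1, 14, 60):
    risk = second_failure_probability(5, 0.08, days)
    print(f"{days:>2} days without a replacement: {risk:.1%} chance of a second failure")
```

A spare on the shelf keeps the waiting period to hours, whereas hunting for a discontinued drive can stretch it to weeks.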
3. Remove Dust Bunnies
Although dust seems trivial when compared with other maintenance issues, excessive dust and grime can lead to overheating, which can cause data corruption, equipment damage, or downtime related to random errors.
Dust contamination causes three primary types of server problems. First, excessive dust tends to cause problems with CD and DVD drives. Dust can coat the laser lens or prevent the lens from moving correctly, causing the drive to perform slowly or unreliably—or even to stop working completely.
The second, and probably most common, dust-related issue I’ve encountered is dust clogging and inhibiting fans. Over time, if the power supply's fan isn’t working properly, the power supply could overheat and burn out. Unless your server is equipped with redundant power supplies, a burned-out power supply means immediate downtime.
And third, dust can cause the CPU to overheat. As dust accumulates, the case fans and CPU fans can become less efficient and might stop working altogether. Heat sinks can also become clogged, reducing the volume of air that can flow across them. When CPUs overheat, weird things can happen. Excessive amounts of heat can cause CPUs to make computational errors. Otherwise reliable machines start producing blue-screen errors for no apparent reason. And I’ve even witnessed a few instances in which database corruption has occurred as a direct result of errors produced when a computer overheated. Other symptoms of overheating include video and other component failure.
Many servers are equipped with internal temperature probes designed to shut down the server before serious damage can occur. However, the shutdown still results in server downtime. To combat dust and grime, I clean my servers every six months, removing the covers and vacuuming out any dust. While I'm at it, I verify that all the fans are working.
4. Test UPS Batteries
So far, I’ve primarily covered problems related to aging servers. However, the aging process tends to be a lot harder on UPSs than on servers. Although UPSs have improved dramatically over the years, the batteries tend to lose capacity over time. Eventually, an aging UPS might not be able to sustain a server for more than a minute or two during a power failure.
Although you can’t do much to prevent this loss of capacity, I like to know exactly what to expect of my UPSs when a power failure occurs. A couple of times a year, I test my UPSs and document how long they were able to sustain a server. These tests tell me how long my servers would be able to stay online during a power failure, and because I keep records of each UPS’s performance, I can also determine whether a UPS’s runtime is decreasing and decide when to replace the UPS.
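The record keeping described above could be as simple as a dated list of runtimes. The sketch below is one possible way to read such a log: it fits a least-squares line to estimate how fast runtime is changing and flags a UPS whose latest test fell below a minimum you choose. The function names, the sample log, and the 10-minute minimum are all hypothetical.

```python
from datetime import date


def runtime_trend(tests):
    """Least-squares slope of runtime in minutes per year.

    Expects at least two (date, minutes) tests taken on different dates.
    """
    first_day = tests[0][0]
    xs = [(day - first_day).days / 365.0 for day, _ in tests]
    ys = [minutes for _, minutes in tests]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def needs_replacement(tests, minimum_minutes):
    """True when the most recent test fell below the runtime you require."""
    return tests[-1][1] < minimum_minutes


# Hypothetical test log: (test date, minutes the UPS sustained the server).
ups_log = [
    (date(2023, 1, 15), 22.0),
    (date(2023, 7, 15), 20.5),
    (date(2024, 1, 15), 18.0),
    (date(2024, 7, 15), 14.5),
]

slope = runtime_trend(ups_log)
print(f"Runtime is changing by {slope:.1f} minutes per year")
print("Replace now" if needs_replacement(ups_log, 10.0) else "Still above the minimum")
```

A negative slope means the battery is losing runtime; comparing it against the margin left above your minimum gives a rough idea of how soon to budget for a replacement.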