A Microsoft Exchange Server administrator's job is all about getting the mail through, keeping the system running, and giving users the impression that Exchange Server is 100 percent dependable. To achieve these goals, you need to concentrate on several performance fundamentals: hardware, design, and operation. These three basics come into play for all Exchange Server organizations, whether you're running Exchange 2000 Server or Exchange Server 5.5. Reliable and capable hardware is the foundation of Exchange Server performance, but the design into which you place that hardware and the way that you operate the hardware are just as important when you want to achieve maximum performance over a sustained period.
Fundamental No. 1: Hardware
An adequate and balanced hardware configuration provides the platform for good Exchange Server performance. To meet your performance goals, your servers must strike a balance among these three essential components: CPU, memory, and disk.
Server configuration essentially comes down to how many CPUs the server has and how fast they are, how much memory the server has, and what type of disk subsystem the server uses. Given the speed of today's CPUs and the relatively low cost of memory and disks, you'll be hard pressed to underconfigure a server. Even the smallest off-the-shelf server—with, for example, a 700MHz CPU, 256MB of RAM, and three 18GB disks with a RAID 5 controller—can support several hundred Exchange Server mailboxes. (For an explanation of the benchmarks that vendors use in server-sizing guides, see the sidebar "Making Sense of Benchmarks.") High-end servers (i.e., servers that support more than 1000 mailboxes or servers that are designed for high availability) present more of a challenge, mostly because of the different combinations of hardware that you can apply to provide the desired performance level.
CPU statistics. Most Exchange Server machines are equipped with only one CPU, but hardware vendor tests demonstrate that Exchange 2000 scales well using SMP. In Compaq tests, an increase from two processors to four processors led to a 50 percent improvement in capacity; an increase from four processors to eight processors also led to a 50 percent increase. Considering SMP's overhead, this performance is excellent and testifies to the Exchange Server code base's multiprocessing capabilities. (Microsoft says Exchange 2000 Service Pack 1—SP1—will support more than eight processors, but factors other than the CPU will probably limit most configurations, and 8-way machines can currently meet the needs of even the largest corporate Exchange 2000 organization.)
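Those scaling figures compound: each doubling of processors buys roughly another 50 percent of capacity. A quick back-of-the-envelope sketch (the function name and Python form are mine; the 1.5x-per-doubling factor is the Compaq result cited above, an approximation rather than a guarantee):

```python
def relative_capacity(cpus, base_cpus=2, scale_per_doubling=1.5):
    """Estimate capacity relative to a base_cpus server, applying the
    ~50 percent gain per doubling of processors reported in the Compaq
    tests (an approximation, not a guarantee)."""
    capacity = 1.0
    n = base_cpus
    while n * 2 <= cpus:
        capacity *= scale_per_doubling
        n *= 2
    return capacity

print(relative_capacity(4))  # a 4-way delivers ~1.5x a 2-way
print(relative_capacity(8))  # an 8-way delivers ~2.25x a 2-way
```

The sublinear curve is why the article notes that factors other than the CPU will probably limit most configurations before processor count does.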
CPUs get faster all the time. (Intel recently announced the availability of a 1.5GHz Pentium 4 processor.) Level 2 cache is particularly important for good Exchange Server performance, so use systems with as much Level 2 cache as possible. This special area of fast memory close to the processor caches instructions and data, and its size depends on the processor's model. Intel's Pentium III processors' typical Level 2 cache size is 256KB, whereas the company's Xeon processors can have Level 2 caches as big as 2MB. Currently, Xeon processors are the best platform for Exchange Server because of their cache size and the nature of the chipset, which Intel has optimized for server applications.
Keep in mind that on a well-balanced server the CPU should be the limiting factor; to reach that state, you need to remove every other bottleneck. Typically, you should first address any memory deficiencies, then add storage subsystem capacity, then increase network speed (100Mbps should be sufficient for all but the largest Exchange Server systems). If the CPU is saturated after you've tuned these components, you can add more processors or increase the processors' clock speed.
Memory and cache. To optimize the memory demands that Exchange Server makes on the OS, Exchange 2000 and Exchange Server 5.5 both implement a mechanism called Dynamic Buffer Allocation. DBA monitors the level of activity on a server and adjusts the amount of virtual memory that Exchange Server's database engine (i.e., the Extensible Storage Engine—ESE) uses. Exchange Server implements DBA as part of the Store (Microsoft's generic term for the Information Store—IS), so you'll sometimes see the Store process grow and contract quite dramatically on a server that experiences intermittent periods of heavy demand. You'll also see the Store fluctuate on Exchange Server systems that run other applications—especially database applications such as Microsoft SQL Server—when multiple applications request memory resources.
On servers that run only Exchange Server and therefore experience a consistent level of demand, the Store process tends to grow to a certain level, then remain constant. (Exchange 2000 machines aren't in this group because all Exchange 2000 servers also run Microsoft IIS to support Internet protocol access.) Don't let a large Store process worry you—it simply means that DBA has observed that no other active application wants to utilize memory and so has requested additional memory to cache database pages. The net result is a reduction in the amount of system paging; this reduction aids performance.
To see how much memory the ESE is utilizing, you can use the Database Cache Size (Information Store) counter on the Performance Monitor's Database object. Figure 1 shows a typical value from a small Exchange 2000 server under moderate load. This server has 256MB of RAM but has allocated approximately 114MB of virtual memory to the ESE. Note that the amount of virtual memory that the ESE uses will increase as you load more storage groups (SGs) and databases—one reason why experienced systems administrators stop to think before they partition the Store.
The ESE uses RAM to cache database pages in memory. When a server doesn't have enough physical memory, the ESE caches fewer pages and increases disk access to fetch information from the database. The result is an increased strain on the I/O subsystem, which must handle more operations than usual. You might conclude that you need to upgrade the I/O subsystem, probably installing additional disks and redistributing I/O. However, increasing the amount of RAM available to the server is usually more cost-effective, so always consider this step before any other action.
Maximum disk performance. Exchange Server is essentially a database application, and like all other database applications, it generates a considerable I/O load and stresses both disk and controller components. Any performance-management exercise must therefore include three steps: Properly distribute the source of I/O (i.e., the files involved in messaging activity) across disks, ensure the correct level of protection for essential files, and install the most appropriate hardware to handle disk activity.
The first step is to separate transaction-log sets and database files, even on the smallest system. This separation maximizes the chance of recovering data after a disk crash. If you place logs on one spindle and the database on another, the loss of one won't affect the other, and you can use backups to recover the database to a known state. If you place logs and database on the same spindle, however, a disk failure inevitably results in data loss.
The second step is to properly protect the transaction-log sets. The data in the transaction logs represents changes in information between the database's current state (i.e., pages in RAM) and its state at the last backup (i.e., pages on disk). Never place transaction logs on an unprotected disk; use RAID 1 volumes with controller-based write-back cache, which provides adequate protection without reducing performance. Separate transaction-log sets on servers running Exchange 2000 Enterprise Server with multiple SGs. The ideal situation is to assign a separate volume for each log set.
The third step is to give as much protection as possible to the Store databases. Use RAID 5 or RAID 0+1 to protect the disks that hold the mailbox and public stores. RAID 5 is the most popular approach (Microsoft has recommended a RAID 5 approach since it launched Exchange Server 4.0), but RAID 0+1 is becoming more common because it delivers better I/O performance and avoids the necessity of enabling a write-back cache. Compaq has performed tests that demonstrate that one spindle in a RAID 0+1 set can support the I/O that 250 active Exchange Server users generate. Thus, a large server that supports 2500 active users would need a 10-disk RAID 0+1 set. For maximum performance, concentrate on each disk's I/O capacity rather than on how many gigabytes it can store.
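That sizing rule is easy to put numbers on. A minimal sketch (the function name and Python form are mine; the 250-active-users-per-spindle figure is the Compaq test result cited above, so treat it as a planning estimate, not a guarantee):

```python
import math

def raid01_spindles(active_users, users_per_spindle=250):
    """Size a RAID 0+1 set by I/O capacity rather than gigabytes:
    one spindle per ~250 active users (the Compaq figure cited in
    the text), rounded up to a whole disk."""
    return math.ceil(active_users / users_per_spindle)

print(raid01_spindles(2500))  # 10 disks, matching the article's example
```

The point of sizing this way is the article's closing advice for this step: concentrate on each disk's I/O capacity, not its capacity in gigabytes.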
After you've equipped your server with sufficient storage, you need to monitor the situation to ensure that the server delivers the desired performance. To get a complete picture, monitor each device that hosts a potential hot file (i.e., a file that generates most of the I/O traffic). For Exchange Server 5.5 machines, these devices include those that hold the Store, the Message Transfer Agent (MTA), and the Internet Mail Service (IMS) work files. For Exchange 2000, take the same approach but also monitor the device that hosts the SMTP mail drop directory, and consider that the Store might be partitioned into multiple databases, each of which you need to monitor.
To quickly assess storage performance for an Exchange Server machine, you can use several Performance Monitor counters that examine disk response times and queue lengths. (For information about Performance Monitor and similar tools, see Cris Banson, "NT Performance Tuning," page 40, or read Curt Aubley, "Windows 2000 Performance Tools," http://www.win2000mag.com, InstantDoc ID 8198.) To monitor response time, you can use any of the following Physical Disk object counters:
- Avg. Disk sec/Read
- Avg. Disk sec/Write
- Avg. Disk sec/Transfer
For acceptable performance, each device should perform random I/O operations in 20 milliseconds (ms) or less. Sequential operations should take place in less than 10ms. You need to take action when devices exceed these thresholds. The easiest course is to move files so that less heavily used devices take more of the load. The alternatives are more drastic: Relocate mailboxes, public folders, or connectors to another server, or install additional disks and separate the hot files.
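Those thresholds are simple enough to check programmatically. A minimal sketch (the function name and Python form are mine; the 20ms/10ms limits are the rules of thumb stated above):

```python
def disk_response_ok(avg_sec_per_transfer, workload="random"):
    """Check a Physical Disk latency sample against the rules of
    thumb in the text: 20ms or less for random I/O, under 10ms for
    sequential. avg_sec_per_transfer is the counter value in
    seconds, as Performance Monitor reports it."""
    limit = 0.020 if workload == "random" else 0.010
    return avg_sec_per_transfer <= limit

print(disk_response_ok(0.015))                # True: 15ms random I/O is acceptable
print(disk_response_ok(0.015, "sequential"))  # False: too slow for sequential I/O
```

A device that repeatedly fails this check is a candidate for the remedies listed above, starting with the easiest: moving files to less heavily used devices.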
You should also monitor the Page Reads/sec and Page Writes/sec performance counters for the Memory object because they count the hard page faults that result in disk accesses. The sum of these two values shouldn't exceed 80 per second (i.e., roughly the I/O limit for one disk drive). If the sum consistently exceeds 80, you should add more physical memory to your server and possibly locate the page file on a fast drive (although the latter solution is less efficient than adding memory).
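Expressed as a check (the function name and Python form are mine; the limit of roughly 80 hard page faults per second, about one disk's worth of I/O, is the rule of thumb from the text):

```python
def needs_more_ram(page_reads_per_sec, page_writes_per_sec, limit=80):
    """Apply the rule of thumb from the text: the sum of the Memory
    object's Page Reads/sec and Page Writes/sec counters shouldn't
    exceed ~80, roughly the I/O limit of a single disk drive. A
    sustained breach suggests adding physical memory."""
    return page_reads_per_sec + page_writes_per_sec > limit

print(needs_more_ram(50, 45))  # True: 95 hard faults/sec exceeds the limit
print(needs_more_ram(30, 30))  # False: 60/sec is within bounds
```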
The Current Disk Queue Length counter for the Physical Disk object reports the number of outstanding operations to a particular volume. Although Win2K reports separate queue lengths for read and write operations, the current aggregate value is what matters. A good rule of thumb is that the queue length should always be less than half of the number of disks in a volume. For example, if you have a 10-disk RAID volume, the queue length should be less than 5. Your aim is to ensure that the volumes have sufficient headroom to handle peak demand. Consistently high queue-length values are a signal that the volume can't keep up with the rate of I/O requests from an application and that you need to take action.
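The half-the-spindle-count rule reduces to a one-line comparison. A minimal sketch (the function name and Python form are mine; the threshold is the rule of thumb stated above):

```python
def queue_length_ok(current_queue_length, disks_in_volume):
    """Rule of thumb from the text: Current Disk Queue Length should
    stay below half the number of disks in the volume, leaving the
    volume headroom to absorb peak demand."""
    return current_queue_length < disks_in_volume / 2

print(queue_length_ok(4, 10))   # True: under the threshold of 5
print(queue_length_ok(23, 10))  # False: a volume under strain
```

A single breach during a burst of activity isn't cause for alarm; as the text says, it's consistently high values that signal the volume can't keep up.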
Figure 2 shows monitoring results for the F disk (on which the Store databases reside) and the L disk (on which the log files reside) on an Exchange 2000 server. The Current Disk Queue Length counter for the Physical Disk object shows that the Store disk is under considerable strain: an average queue length of 23.59 I/O operations and a peak of 73 I/O operations. This I/O load occurred when a user created some large public-folder replicas (e.g., one folder contained 32,000 items) in a public store. Exchange Server generated 230 log files—1.3GB of data—during the replication process. Users will notice such a load because Exchange Server will queue any request they make to the Store, and Exchange Server response won't be as snappy as usual.
Fundamental No. 2: Design
Exchange Server machines fit within a design that determines each server's workload: the number of mailboxes the server supports, whether it hosts a messaging connector or public folders, or whether it performs a specific role (e.g., key management). This design also needs to accommodate how much data the server must manage and how available you want the server to be.
You expect data requirements to increase as servers support more mailboxes, but you must also deal with the "pack rat syndrome." Users love to keep messages, so they appeal to administrators for larger mailbox quotas. Disks are cheap, so increasing the quota is the easiest response. Default quotas for Exchange Server organizations have gradually increased from 25MB in 1996 to about 100MB today. Some administrators manage to keep quotas smaller than 100MB and still keep their users happy (which is a feat in itself). Other administrators permit quotas larger than 100MB and put up with the need for more disk space and longer backup times. (For tips about dealing with these types of problems, see "Mailbox Management," October 2000.)
Slapping a few extra drives into a cabinet and bringing them online might increase an Exchange Server machine's available storage but isn't a good way to ensure performance. Every ad hoc upgrade hurts server availability, and the chance always exists that something will go wrong during an upgrade procedure. A better approach is to plan out the maximum storage that you expect a server to manage during its lifetime, then design your storage infrastructure accordingly. If you're using Exchange 2000, your design also needs to take into consideration the interaction between Exchange Server and Active Directory (AD—for details about this relationship, see the sidebar "Exchange 2000 and AD").
New and upcoming Exchange 2000 and hardware features offer capabilities that could improve your organization's availability. Exchange 2000's improved clustering support makes clustering a more attractive option. (For information about Exchange 2000's clustering capabilities, see Jerry Cochran, "Clustering Exchange 2000, Part 1," December 2000, and "Clustering Exchange 2000, Part 2," January 2001.) New hardware capabilities such as Storage Area Networks (SANs) make systems more resilient to disk failures, which are Exchange Server's Achilles' heel. (For information about how Exchange Server can work with SANs, see Jerry Cochran, "Storage Area Networks in an Exchange Server Environment," http://www.win2000mag.com, InstantDoc ID 7513.) True online snapshot backups, a feature that Microsoft originally planned for Exchange 2000 but has now scheduled for a future product, will increase server availability by making recovery from database corruption easier and faster.
Fundamental No. 3: Operations
Flawed operational procedures can render useless the best possible hardware and most comprehensive design. Your organization is only as good as its weakest link, and all too often you don't discover that link until an operational problem occurs.
Carefully observing your production systems is the key to good operations. The OS and Exchange Server write information to the event logs. You need to either scan that information manually or use a product such as NetIQ's AppManager to watch for events that point to potential problems. For example, if the Store isn't absolutely satisfied that the database engine has fully committed a transaction to a database, Exchange Server generates a -1018 error in the Application log. In versions earlier than Exchange Server 5.5, a -1018 error might be the result of a timing glitch between a disk controller and the OS, but Exchange 2000 and Exchange Server 5.5 include code to retry transactions and so overcome any intermittent problems. A -1018 error in Exchange 2000 or Exchange Server 5.5 could mean that a hardware failure has occurred and the database is corrupt. If you don't check the hardware and restore the database from backup, Exchange Server might generate more -1018 errors as the database becomes more and more corrupt and eventually fails. Of course, any backups you made during this time contain corrupted data. The Eseutil utility might be able to fix minor corruptions, but it can't fix the fundamental data loss that a hardware failure causes, so a -1018 error is a serious event. (For information about using Eseutil, see Paul Robichaux, Getting Started with Exchange, "The Sorcerer's Apprentices," May 2000.)
Many other daily events provide insight into the proper workings of an Exchange Server system. For example, you can find information in the Application log about background defragmentation, an operation that the Store usually performs automatically in the middle of the night. Exchange Server logs events that report the start of a defragmentation pass (event ID 700), the end of the pass (event ID 701), and how much free space exists in the database after the pass (event ID 1221).
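If you export the Application log to a file, picking out those defragmentation events is straightforward to script. A hedged sketch: the event IDs (700, 701, 1221) are the ones listed above, but the CSV column layout (TimeGenerated, Source, EventID, Message) is an assumption about your export format, and the function name is mine; adjust both to match your tooling.

```python
import csv

# Event IDs from the text; the mapping descriptions are mine.
DEFRAG_EVENTS = {700: "defrag pass started",
                 701: "defrag pass finished",
                 1221: "free space after defrag"}

def defrag_events(csv_path):
    """Pull the nightly background-defragmentation events (IDs 700,
    701, and 1221) out of an Application log exported to CSV.
    Assumes columns named TimeGenerated, Source, EventID, Message."""
    hits = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            event_id = int(row["EventID"])
            if event_id in DEFRAG_EVENTS:
                hits.append((row["TimeGenerated"], event_id,
                             DEFRAG_EVENTS[event_id]))
    return hits
```

Running a scan like this each morning is a lightweight alternative to eyeballing the log, though a product such as NetIQ's AppManager automates the same job more thoroughly.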
The event that Figure 3 shows reports 486MB of free space (i.e., roughly 7.2 percent of the database) after defragmentation of a 6.67GB Exchange 2000 mailbox store. Exchange 2000 will use this space to store new messages and attachments as they arrive.
Although you can use Eseutil to perform an offline rebuild and shrink the database, you should do so only when you can recover a significant amount of free space (i.e., more than 30 percent of the database) and you either need the disk space or want to reduce backup time. Because an offline rebuild prevents users from accessing email and takes a long time—at least 1 hour per 4GB of data, plus time for backups before and after the rebuild—you're better off buying more disks or considering faster backup devices than running Eseutil.
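The rebuild decision combines two of the figures above: the 30 percent free-space threshold and the 1-hour-per-4GB time estimate. A minimal sketch of the trade-off (the function name and Python form are mine; the backup time before and after the rebuild is extra and not modeled here):

```python
def consider_offline_rebuild(db_size_gb, free_space_gb,
                             free_threshold=0.30, hours_per_4gb=1.0):
    """Weigh an offline Eseutil rebuild using the rules of thumb in
    the text: worthwhile only when recoverable free space exceeds
    ~30 percent of the database, and budget at least 1 hour per 4GB
    of data (backups before and after add more time)."""
    free_fraction = free_space_gb / db_size_gb
    est_hours = db_size_gb / 4.0 * hours_per_4gb
    return free_fraction >= free_threshold, est_hours

# The 6.67GB store from Figure 3 with 486MB (~0.486GB) free:
worthwhile, hours = consider_offline_rebuild(6.67, 0.486)
print(worthwhile)  # False: ~7.2 percent free doesn't justify the downtime
```

In that Figure 3 case the function confirms the article's advice: leave the free space for the Store to reuse and skip the rebuild.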
The Application log is also the place to look for signs of MTA errors, details of incoming replication messages, and situations in which someone has logged on to another user's mailbox using a Win2K or Windows NT account that isn't associated with that mailbox. (Some antivirus products provoke the latter type of event when they log on to mailboxes to monitor incoming messages for any attached viruses.)
Exchange Server also logs backup-related events. Good systems administrators are paranoid about backups and always verify that backups begin successfully, process all expected data, and finish. According to Murphy's Law, backup tapes will become unreadable at the worst possible time, and any readable backup tapes you fetch when you're under pressure will contain corrupt data.
Backups are the single most important and fundamental task for an Exchange Server administrator. Anyone can lose system availability because of a hardware problem, but your boss won't easily forgive extended system downtime or data loss when incorrect or lazy procedures result in a backup failure. (See Jerry Cochran, "Exchange Server Backup Woes," http://www.win2000mag.com, InstantDoc ID 16542 for a refresher course about backup best practices.)
Losing system availability is the ultimate performance failure. Your goal should be to minimize the negative effects of any problem that requires you to restore data. The only way to meet this goal is to be scrupulous about observing the following requirements:
- Make daily backups, and confirm their success. Figure 4 shows event ID 213, which Exchange Server writes to the Application log at the end of a successful backup.
- Know how to restore a failed Exchange Server database. Take note: Exchange 2000 makes this task both more complex and easier than it is for Exchange Server 5.5. On the one hand, Exchange 2000 Enterprise supports multiple databases, so you might need to restore more than one database. On the other hand, the Exchange 2000 Store can keep running while you restore the databases, so service is only unavailable to users whose mailboxes are on the failed databases. (For more information about restoring databases, see Jerry Cochran, "Repairing Your Exchange Server Databases," http://www.win2000mag.com, InstantDoc ID 8864, and "Detecting and Repairing Logical Corruption of Your Exchange Server Databases," http://www.win2000mag.com, InstantDoc ID 8906.)
- Know the signs of imminent failure, and monitor system health to catch problems early.
- Practice a disaster-recovery plan. Make sure that everyone who might need to restore data knows where to find the backup media, how to restore both Exchange Server and the OS (in case of a catastrophic hardware failure), and when to call for help. Calling Microsoft Product Support Services (PSS) for assistance won't help you if you've already botched up a restore. If you don't know what to do, call for help first. (For information about disaster recovery, see Paul Robichaux, Getting Started with Exchange, "Mitigating Disaster," July 2000.)
Backups are boring, and performing them correctly day in and day out can be tedious. But a good backup is invaluable when a disk or controller fails, and you'll be glad (and able to keep your job) when a successful restore gets users back online quickly.
Stay in the Know
Knowledge is the key to achieving and maintaining great performance within an Exchange Server infrastructure. If you don't understand the technology you deal with, you can't create a good design or properly operate your servers, and the first hint of a hardware problem could lead to data loss and extended downtime. A huge body of knowledge is available for Exchange Server 5.5 and is developing rapidly for Exchange 2000. Although newsgroups and mailing lists have a high noise-to-data ratio, you'll be surprised at how many gems you can mine from the discussions. Conferences such as Microsoft TechEd and the Microsoft Exchange Conference (MEC) offer a chance to listen to other people's experiences and learn what the future might hold.
The only thing we can be sure of is that technology will keep changing. Make sure you maintain your personal knowledge base so that you can take advantage of new hardware and software technologies. By doing so, you'll maintain—and improve—your Exchange Server organization's performance.