As an Exchange 2000 Server administrator, you must make decisions each day about how you will accomplish your tasks. One of your most important decisions is to be proactive about the health of your servers. To make sure that your Exchange 2000 system continues to work properly, you need to perform seven checks every day:
- Check the backup events.
- Check the event logs.
- Check the partitions.
- Check the virus definition file updates.
- Check the queues.
- Check the nondelivery reports (NDRs).
- Physically check the hardware.
If you're proactive and perform these checks, your days won't degenerate into an endless cycle of reacting to problems. If you have a large, complex system, you might need to use an automated monitoring tool to gather the information for some of these checks. Even with such a tool, occasionally performing manual checks is a good idea because an unusual situation might develop that the monitoring tool doesn't account for. Whether you gather the information for the checks manually or with automated monitoring tools, you still need to interpret that information.
1. Check the Backup Events
Checking the backup events in Exchange 2000's Application log can give you two important pieces of information. First, you can confirm whether the backup will serve its primary purpose—recovering the Exchange 2000 system—should the system fail. Second, you can get a general idea of the health of the databases that make up the Store.
To determine the usability of the backup for recovery, you need to look at the backup from both Exchange 2000's and the backup utility's perspectives. Typically, file-based backups occur when the files to be backed up aren't in use. In the case of Exchange systems, the only time the databases aren't in use is when the system is down. Luckily, Exchange 2000 provides an API that enables backup utilities to work in tandem with the Extensible Storage Engine (ESE) so that you can perform backups when the databases are in use.
During the backup, ESE reads the database and passes the information to the backup utility, which saves the information to a backup medium. (For more information about the ESE backup process, see Jerry Cochran's Exchange & Outlook UPDATE article, "The ESE Backup Process: An Inside Look," http://www.exchangeadmin.com, InstantDoc ID 25350.) Because two components (i.e., ESE and the backup utility) handle the data during the backup, you need to ensure that both components consider the saved data valid. To do so, you can use Table 1, page 2, to interpret the Application log. Table 1 includes the backup-related event IDs for ESE and NTBackup, Windows 2000's backup utility. (The backup utility's event IDs depend on the backup utility you use, but ESE's event IDs will be the same as those in Table 1.)
Interpreting the event log is simple. For example, look at the sample Application log that Figure 1 shows. Event IDs 8000 and 8001 signal the start and end of the backup, respectively, from NTBackup's perspective. Event IDs 210 and 213 signal the start and end, respectively, of a normal backup from ESE's perspective. (Other types of backups, such as incremental and differential backups, will have different event IDs.)
NTBackup logs event IDs 8008 and 8009 to signal the start and end of the backup verification process, respectively. During the verification process, NTBackup reads the data and associated checksums from the backup medium to help confirm that the backup is usable. If a hardware problem occurred or if the backup medium is damaged, NTBackup reports an error. Because the verification process finishes after ESE signals the successful completion of the backup, Exchange 2000 can consider the backup a success even though the backup utility considers it a failure. Thus, you should consider the backup a success only when both ESE and the backup utility report success.
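The rule that both components must agree can be expressed as a short check. The following Python sketch is illustrative only. It assumes that you've already exported the relevant Application log entries as (source, event ID, type) tuples, that the event sources are named "ESE" and "NTBackup", and that verification is enabled in NTBackup. The event IDs are the ones discussed above; the function name and data layout are hypothetical.

```python
# Event IDs come from the article. The source names, the tuple layout,
# and the function name are assumptions made for this sketch.
REQUIRED = {
    "NTBackup": {8000, 8001, 8008, 8009},  # backup start/end, verify start/end
    "ESE": {210, 213},                     # normal backup start/end
}


def backup_succeeded(events):
    """Return True only if ESE and NTBackup both report a clean backup.

    events: iterable of (source, event_id, event_type) tuples.
    """
    seen = {source: set() for source in REQUIRED}
    for source, event_id, event_type in events:
        if source not in REQUIRED:
            continue  # ignore events from unrelated sources
        if event_type == "Error":
            return False  # either component reporting an error fails the backup
        seen[source].add(event_id)
    # Every expected start and end event must be present for both components.
    return all(REQUIRED[source] <= seen[source] for source in REQUIRED)


sample = [
    ("NTBackup", 8000, "Information"),
    ("ESE", 210, "Information"),
    ("ESE", 213, "Information"),
    ("NTBackup", 8001, "Information"),
    ("NTBackup", 8008, "Information"),
    ("NTBackup", 8009, "Information"),
]
print(backup_succeeded(sample))  # True
```

If you use a different backup utility, you'd replace the NTBackup entries with that utility's own event IDs.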
Checking the Application log's events is a good way to determine a backup's usability. However, if you want to be absolutely certain that you can use your backup for recovery, you need to test the backup by restoring it to a recovery server. For more information about this type of backup testing, see "Build an Offline Exchange 2000 Server in 9 Steps," September 2001, InstantDoc ID 21801.
By checking backup events, you can also get an idea of the overall health of your databases. When you perform a normal backup, ESE reads the databases in chunks called pages. Each page contains a checksum to help ensure that the data on the page hasn't been corrupted. As ESE reads each page, it calculates a new checksum and compares it with the stored version to detect corruption. If ESE detects any corruption, it logs an error and the backup terminates. If you're unfamiliar with these errors, read the Microsoft article "XADM: Understanding and Analyzing -1018, -1019, and -1022 Exchange Database Errors" (http://support.microsoft.com/default.aspx?scid=kb;en-us;q314917). Early detection of these types of problems is essential. If you don't detect these problems and you continue to run Exchange, you run the risk of not being able to restore the database and replay all the transactions because you won't have all the necessary logs or backup sets.
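To make the page-checksum idea concrete, here is a minimal Python sketch. It is not ESE's actual page format or checksum algorithm: the page size, the 4-byte header layout, and the use of CRC-32 from the standard zlib module are all assumptions chosen for illustration. It shows only the general technique of storing a checksum with each page and recalculating it on read.

```python
import zlib

PAGE_SIZE = 4096      # illustrative page size (assumption)
CHECKSUM_BYTES = 4    # checksum stored at the start of each page (assumption)


def make_page(payload: bytes) -> bytes:
    """Build a page: a CRC-32 of the body followed by the zero-padded body."""
    if len(payload) > PAGE_SIZE - CHECKSUM_BYTES:
        raise ValueError("payload does not fit in one page")
    body = payload.ljust(PAGE_SIZE - CHECKSUM_BYTES, b"\x00")
    return zlib.crc32(body).to_bytes(CHECKSUM_BYTES, "big") + body


def page_is_intact(page: bytes) -> bool:
    """Recalculate the checksum and compare it with the stored value."""
    stored = int.from_bytes(page[:CHECKSUM_BYTES], "big")
    return zlib.crc32(page[CHECKSUM_BYTES:]) == stored


def first_bad_page(pages):
    """Return the index of the first corrupted page, or None if all pass."""
    for number, page in enumerate(pages):
        if not page_is_intact(page):
            return number
    return None
```

A reader that stops at the first bad page mirrors the behavior described above, in which the backup terminates as soon as corruption is detected.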
2. Check the Event Logs
As the Exchange 2000 system runs and encounters unexpected situations, it writes error, warning, and informational events to the event logs. At least once a day, you need to review the event logs on each Exchange 2000 server and research events that deviate from what you expect to see. To make this task useful, you must perform it frequently and regularly so that you establish a baseline of what to expect in the log. Like reviewing performance statistics, reviewing event logs requires that you know the difference between normal and abnormal system behavior.
By default, Exchange 2000 logs a large amount of data in the event logs. Establishing a baseline saves time because you learn which events need attention. Although some informational events, such as the backup events I described previously, are crucial for you to review, others are not. For example, event ID 1221 tells you how much free space a message store contains (e.g., 10MB). As users delete items from mailboxes, Exchange 2000 doesn't reduce the message store's size but rather flags the freed space for reuse. So, this type of information can be helpful if you're running low on disk space, but most days you can consider it superfluous. Table 2 lists this event and other events that I typically exclude from my daily review.
When you begin developing a baseline, I recommend that you focus your attention on error and warning events. You'll need to research these events to determine what caused them and the consequences, if any. For example, suppose you find event ID 2090 in the event log, as Figure 2 shows. If you research this warning event, you'll discover that it occurs when you use an Exchange 2000 server's Directory Access property tab (a feature added with Exchange 2000 Service Pack 2, or SP2) to specify a domain controller (DC) or Global Catalog (GC) that Exchange should use, and that server is unreachable. Because you most likely specified the DC or GC for a reason, the failure to locate this server likely has performance repercussions, such as causing Exchange 2000 to locate and use a GC on the far side of a slow WAN link.
When you have many Exchange 2000 servers, reviewing event logs can be a huge task; I recommend that you use an automated monitoring tool, such as Microsoft Operations Manager (MOM), Aelita Software's EventAdmin, NetIQ's AppManager Suite (AppManager), or Hewlett-Packard's HP OpenView. These packages provide mechanisms that let you filter out the events that you consider superfluous. These packages also provide the added benefit of notifying you as soon as a significant event occurs.
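The kind of filtering described in this section can be sketched in a few lines of Python. The sketch below assumes that events have already been exported as dictionaries with "id" and "type" keys, which is a hypothetical layout. The ID lists contain only the event IDs mentioned in this article; you'd extend them from your own baseline. Because event IDs aren't unique across event sources, a production filter would probably key on the source and the ID together.

```python
# Backup events are informational but still worth reviewing every day.
ALWAYS_REVIEW_IDS = {210, 213, 8000, 8001, 8008, 8009}
# Events considered superfluous on most days (extend from your baseline).
SUPERFLUOUS_IDS = {1221}


def events_to_review(events, superfluous=SUPERFLUOUS_IDS,
                     always_review=ALWAYS_REVIEW_IDS):
    """Return the events that deserve attention in the daily review.

    events: iterable of dicts with "id" (int) and "type" (str) keys.
    """
    keep = []
    for event in events:
        if event["id"] in superfluous:
            continue  # known noise from the baseline
        if event["type"] in ("Error", "Warning") or event["id"] in always_review:
            keep.append(event)
    return keep
```

Everything that passes the filter still needs the research step described above; the filter only shortens the list.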
3. Check the Partitions
Exchange 2000 servers that stop because of low disk space cause grief for many administrators. Typically, Exchange 2000 servers have separate disk partitions for the OS, the transaction logs, and the Store. Some servers might also have separate disk partitions for other components, such as message-tracking logs, SMTP connector queues, and quarantines for files that antivirus software captures. You need to check the amount of free disk space on each partition every day to ensure that you're not running out of free disk space. The amount of space you need on a partition depends on factors such as the volume of mail the system handles in a day and your standard operating procedures for retaining Exchange logs, quarantined files, and other files. In addition to checking the amount of free space on each partition, you should do the following:
- Confirm that Exchange 2000 is purging the transaction logs. Assuming you perform a normal backup on all the databases within a storage group (SG), you should see transaction log files only from around the time of the last backup. Exchange 2000 purges the transaction logs only when the backup utility has successfully backed up all databases in the SG. If you see older logs, Exchange 2000 isn't purging the logs, in which case you might have a database that the backup utility isn't backing up.
- Check the size of the antivirus software's quarantine and reports. Some antivirus software manufacturers caution that the performance of their software might decline as these files accumulate or grow. The manufacturers likely give this warning because the software writes these files sequentially to the disk and the files can grow quite large over time. However, I've never seen a noticeable impact in a production environment. In most environments, keeping reports and quarantined items for 15 to 30 days is probably sufficient. This time period lets you recover false-positive quarantined items.
- Check the size of the SMTP log directory and purge logs if necessary. Although Exchange 2000 gives you much more control over log rollover so that you don't have one continuously growing log, it doesn't automatically purge old logs. Allowing the logs to accumulate unchecked is a recipe for disaster. If your logging directory is on the default system partition, Windows will crash when all the space on the partition is consumed. If your logging directory is on the same partition as your SMTP virtual server's working directories, the SMTP server will stop processing mail when no free disk space exists. If you have the space, I recommend that you retain logs for 21 to 30 days. On more than one occasion, I've been asked to research a problem that occurred days and sometimes weeks in the past. This retention period lets you perform trend analyses and research problems that users say have been happening for a while.
- Delete archived messages that are outdated. If you use the SMTP archive sink, make sure that you delete or move the archived messages to secondary storage after a reasonable retention period related to your reason for archiving. Companies archive messages for a variety of reasons, ranging from troubleshooting to content monitoring. However, when no one actively reviews the messages, they tend to be ignored and the number grows until all disk space on the partition is consumed.
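The partition checks above lend themselves to a small script. The following Python sketch uses only the standard library. The 20 percent free-space threshold, the .log suffix, and the function names are assumptions for illustration, and the sketch only reports files; it deletes nothing. You could use the same stale-file helper to look for transaction logs older than the last backup or for SMTP logs that are past your retention period.

```python
import os
import shutil
import time

SECONDS_PER_DAY = 86400


def free_fraction(path):
    """Return the fraction of the partition holding path that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def partitions_low_on_space(paths, min_free_fraction=0.20):
    """Return the paths whose partitions fall below the free-space threshold."""
    return [p for p in paths if free_fraction(p) < min_free_fraction]


def retention_cutoff(days, now=None):
    """Return the timestamp before which files exceed the retention period."""
    if now is None:
        now = time.time()
    return now - days * SECONDS_PER_DAY


def files_older_than(directory, cutoff_timestamp, suffix=".log"):
    """List files in directory with the suffix, last modified before the cutoff."""
    stale = []
    for entry in os.scandir(directory):
        if (entry.is_file()
                and entry.name.lower().endswith(suffix)
                and entry.stat().st_mtime < cutoff_timestamp):
            stale.append(entry.path)
    return sorted(stale)
```

For example, files_older_than(log_directory, retention_cutoff(30)) would list the logs that a 30-day retention policy no longer needs, where log_directory is whatever path your server uses.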
4. Check the Virus Definition File Updates
You never know when the next virus will make its way to your system. The best defense against viruses is to ensure that your antivirus software's virus definition files are up-to-date. You need to check for virus definition file updates at least once a day. Typically, the antivirus software records the process of checking for updates and installing them in the Windows event log. If your antivirus software doesn't log this information as an event, the software probably writes the information to its own log. For example, Sybari Software's Antigen writes the update events to its programlog.txt file, and Trend Micro's ScanMail for Microsoft Exchange 2000 writes events to its update.log file. Whether you check the Windows event log or the antivirus software's log, make sure that the virus definition file updates are being checked for, retrieved, and installed correctly.
You might think that this check is a no-brainer, but on several occasions I've seen problems arise. For example, one company stopped its Internet connectivity for a while because of the Nimda virus. Because the antivirus software couldn't access the vendor's Web site to download the update, the Exchange administrator used a dial-up connection to download the virus definition file updates to a CD-ROM. When the administrator copied the updates from the CD-ROM to the server, the copied files retained the read-only attribute. Later, when the company restored Internet connectivity, the auto-update process failed for more than a week because the new updates couldn't overwrite the old files.
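A problem like the read-only definition files in this example is easy to detect with a script. The following Python sketch lists the files in a directory that lack the owner-write permission bit, which is how Python reports the Windows read-only attribute. The function name is hypothetical, and the location of the definition files depends on your antivirus product, so you'd supply that path yourself.

```python
import os
import stat


def read_only_files(directory):
    """Return the names of files in directory that are marked read-only."""
    blocked = []
    for entry in os.scandir(directory):
        if not entry.is_file():
            continue
        mode = os.stat(entry.path).st_mode
        # On Windows, a set read-only attribute clears the write bit.
        if not mode & stat.S_IWRITE:
            blocked.append(entry.name)
    return sorted(blocked)
```

An empty result doesn't prove that updates are succeeding, so you still need to check the update log, but a non-empty result points to files that an update process might not be able to overwrite.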
5. Check the Queues
Unless an Exchange 2000 server handles an extremely high volume of mail, the server won't usually experience queued messages for any extended duration. Having extended periods of queuing typically indicates an abnormal system event that warrants your attention. Spikes in queued messages can occur when someone sends a message to a large distribution list (DL), when someone sends an extremely large message to many people, or when a message's destination is across a slow network link. These situations aren't cause for alarm. What is cause for alarm is finding hundreds of messages queued to the same account or many messages queued to a particular server or domain. Having hundreds of messages queued to the same account can be a symptom of a mail loop or Denial of Service (DoS) attack. Having many messages queued to a particular server or domain can indicate that a server is down, a service is stopped, or a network disruption is preventing the system from establishing a connection.
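The two alarm conditions above (many messages queued to one account, or many messages queued to one domain) amount to counting and comparing against a threshold. The following Python sketch assumes you've already obtained the recipient SMTP address of each queued message as a plain list; how you collect that list is outside the sketch. The thresholds of 100 and the function name are assumptions that you'd replace with values from your own queue baseline.

```python
from collections import Counter


def queue_hotspots(queued, per_recipient_limit=100, per_domain_limit=100):
    """Flag recipients and domains with an unusual number of queued messages.

    queued: iterable of recipient SMTP addresses, one per queued message.
    """
    addresses = [address.lower() for address in queued]
    by_recipient = Counter(addresses)
    # Everything after the last "@" is treated as the destination domain.
    by_domain = Counter(address.rpartition("@")[2] for address in addresses)
    return {
        "recipients": {name: count for name, count in by_recipient.items()
                       if count >= per_recipient_limit},
        "domains": {name: count for name, count in by_domain.items()
                    if count >= per_domain_limit},
    }
```

A flagged recipient suggests a possible mail loop or DoS attack, whereas a flagged domain suggests a down server, a stopped service, or a network disruption.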
As with all other metrics, you need to develop a queue baseline so that you know what is normal and abnormal behavior. To keep tabs on your queues, you can use the Microsoft Exchange 2000 Server Resource Kit's MailQ tool. (For information about MailQ, see Donald Livengood, "How to Use WinRoute and MailQ," May 2002, InstantDoc ID 24434.) If you find that you have queue problems, you can use Exchange System Manager (ESM) to investigate the cause of those problems.
6. Check the NDRs
NDRs are common. The two biggest reasons for NDRs are misspelled usernames and list servers sending messages to users who've left an organization but failed to unsubscribe from those list servers before they left. Servers send NDRs to the message originator, but you can configure your system to send a copy of these reports to your mailbox so that you have access to them. To do so, you need to use ESM to access the Properties dialog box for each SMTP virtual server. On the Messages tab, enter your account's SMTP address in the Send copy of non-delivery reports to field. After you've made the entry, stop and restart the virtual server.
Trying to determine and correct the cause of each NDR isn't practical as a daily task. Instead, you want to compare the number of NDRs against your baseline. To determine the baseline, you need to get a feel for the typical number of NDRs generated (or received) each day of the week. The number might vary greatly. For example, you might find that on Monday, you typically get 25 NDRs every 10 minutes but on Friday you get only 25 NDRs an hour.
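To compare like with like, it helps to convert the rates to a common unit. In the example above, 25 NDRs every 10 minutes is 150 NDRs an hour, which is six times the Friday rate of 25 an hour. The following Python sketch shows one way to compare an observed rate against a per-weekday baseline. The baseline values are just the example figures from the text, and the multiplier of 3 and the function name are assumptions that you'd tune for your own system.

```python
# Example baseline in NDRs per hour, taken from the figures in the text:
# 25 per 10 minutes on Monday is 25 * 6 = 150 per hour.
BASELINE_PER_HOUR = {"Monday": 25 * 6, "Friday": 25}


def ndr_jump(weekday, observed_per_hour, baseline=BASELINE_PER_HOUR, factor=3.0):
    """Return True if the observed NDR rate far exceeds that day's baseline."""
    expected = baseline[weekday]
    return observed_per_hour > expected * factor


print(ndr_jump("Monday", 150))  # False
print(ndr_jump("Friday", 150))  # True
```

The same rate is unremarkable on one day and alarming on another, which is why the baseline needs to be kept per weekday.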
A large jump in the number of NDRs usually indicates a problem, such as a DoS attack or a message loop. A message loop can form when users configure a rule to forward their mail to personal ISP accounts. Users sometimes make this configuration when they're going on an extended leave or they don't have remote access to corporate email but are waiting for a particular message. In theory, this configuration seems benign. However, if the personal ISP account address is misspelled, the ISP mailbox reaches its quota, or another problem occurs, the ISP server sends an NDR—and the rule happily forwards the NDR right back to the same account that was just undeliverable, resulting in a loop of NDRs.
7. Physically Check the Hardware
One task that's often overlooked is physically checking the hardware. At least once a day, you should go to the computer room and check each server. For example, you should check the disk indicators to make sure no disks have failed and check the console to make sure that no applications have crashed.
Even if you use automated tools to monitor your servers, you should check them physically because unforeseen problems can arise. For example, sometimes when a process crashes, a notification dialog box appears on the system console. In these situations, the system doesn't completely close the process, release the file locks, and log an event until someone clicks OK in that dialog box. I've seen this situation occur a few times with antivirus signature updates. The update process crashed, and the notification dialog box appeared. Because no one physically checked the server and saw this dialog box, subsequent updates failed because the crashed process's file locks weren't released and the update file couldn't be overwritten.
Know Your Norms
What types of events does your Exchange 2000 server typically record in the event log? How many messages are usually in the queue at any one time? How many NDRs does your system typically generate and receive each day? Knowing the answers to such questions is crucial. If you don't know how your system typically behaves, you can't determine when your server is experiencing an abnormal event. To help answer these questions, you can perform the seven daily checks to develop baselines and proactively monitor the health of your Exchange 2000 system.