What's worse than discovering that one of your connectors has been stacking up mail messages for the past 12 hours?
Discovering the fact from your enraged users!
Messaging has become so integral in business that some companies lose money if the messaging system fails. Yet I have visited many companies that haven't set up systems to proactively monitor their servers or connections. Consequently, when the cc:Mail Connector or the Message Transfer Agent (MTA) shuts down, the first sign of trouble is users' complaints. The situation is exacerbated when the systems administrator discovers that the failure occurred merely because a server partition ran out of disk space.
Server outages occur for three reasons: inadequate resources (e.g., disk, memory, network); external problems (e.g., modem failure, Internet Service Provider—ISP—error, unsolicited commercial email—UCE, user error); or hardware or software failure. In my experience, most messaging service failures occur because of inadequate resources or external problems, areas that Windows NT Performance Monitor excels at spotting.
You can do little about hardware and software failures, except use hardware monitoring utilities, such as Compaq's Insight Manager (http://www.compaq.com/products/servers/management/insight42-description.html) and IBM's Tivoli Enterprise (http://www.tivoli.com/o_products/html/body_products.html). However, you can use Performance Monitor to quickly detect potential problems and attend to them before they become critical.
In this article, I'll briefly explain how to use Performance Monitor. Then, I'll discuss which counters to use to keep track of Exchange Server's performance, so that you can receive useful realtime feedback and alerts to head off disastrous events.
Performance Monitor Overview
You can use Performance Monitor to find out more than how fast your server is performing or how much of the processor you're utilizing. For an introduction to Performance Monitor, see Michael D. Reilly's Windows NT Magazine articles "The Windows NT Performance Monitor" (March 1997) and "More Windows NT Performance Monitor" (April 1997). "The Exchange Server Troubleshooter" in Exchange Administrator (August 1998) also discusses using Performance Monitor for Exchange.
Performance Monitor groups related counters into logical objects, such as Processor, Memory, MSExchangeIS, and MSExchangeMTA. Within the MSExchangeIS object, for example, are counters such as Active User Count (number of users who have performed some Exchange-related activity within the last 10 minutes) and Maximum Users (maximum concurrent user count since the Exchange services last started). Some counters have multiple instances. For example, if your system has four processors, you have four instances of the counter %Processor Time for the Processor object, so you can monitor the utilization of each processor.
Because systems administrators want to display this bewildering amount of information in varying ways, Performance Monitor offers four views: Chart, Alert, Log, and Report. The Chart view provides realtime information in the form of either a histogram graph or a line graph.
The Alert view displays only the counters that have exceeded defined thresholds. A queue size of 30 on your Internet Mail Connector (IMC) Outbound Queue might be a valid threshold, for example. When a counter reaches a threshold, the Performance Monitor application can generate a Performance Monitor alert and an NT alert. A Performance Monitor alert is a message within the Performance Monitor application that pinpoints which counter has reached its threshold value. An NT alert uses the messenger service to deliver an onscreen notification to a specified workstation.
The Report view provides a more scientific view explicitly listing the value for each counter as Performance Monitor updates it. The Log view presents no realtime information on screen, but it lets you write information to a Log file that you can import into other applications, such as Microsoft Excel, for further investigation.
Getting a Baseline Reading
Later in this article, I discuss particular objects and counters to monitor and methods to determine the appropriate threshold value. However, before you can set any threshold values, you must obtain a baseline reading (i.e., typical values that a normally functioning system generates). My company usually recommends that its customers monitor counters daily for at least a week, in Log mode with a different log for each day.
The amount of information you can export to a comma-separated values (CSV) format is one full screen, as viewed in Chart View. One screen in Chart View (graph mode) records 100 readings, so you calculate the graph time (i.e., the amount of time the red line takes to travel to the other side of the screen) as Interval (seconds) multiplied by 100. Therefore, if you want your chart to record 24 hours worth of information in one screen, the interval time is the number of seconds in 24 hours divided by 100 readings (i.e., 86,400/100, or one reading every 864 seconds). For the base readings, you could increase the interval to 6048 seconds (the number of seconds in 1 week divided by 100 readings), but this measurement leaves a large gap between readings, and you might miss crucial but normal spikes of activity. Therefore, I suggest creating one chart per day for 7 days.
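The interval arithmetic above is easy to get wrong when you change the monitoring window, so here is a minimal Python sketch of it. The function name and constants are my own; the only figure taken from Performance Monitor is the 100 readings per chart screen.

```python
READINGS_PER_SCREEN = 100  # one Chart View screen holds 100 readings


def chart_interval_seconds(window_seconds, readings=READINGS_PER_SCREEN):
    """Return the update interval that fits window_seconds into one chart screen."""
    if readings <= 0:
        raise ValueError("readings must be positive")
    return window_seconds / readings


SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY

print(chart_interval_seconds(SECONDS_PER_DAY))   # 864.0
print(chart_interval_seconds(SECONDS_PER_WEEK))  # 6048.0
```

The weekly figure shows why one chart per day is the safer choice: a reading every 6048 seconds is one sample roughly every 100 minutes.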
You can use Excel to chart the servers' activity each day to obtain an understanding of the day-to-day values a healthy system generates. The following section describes how to use this baseline information to set thresholds for each counter.
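If you prefer a script to a spreadsheet, the same baseline summary can be sketched in Python. This example assumes a simplified CSV layout (a header row of counter names followed by one row per reading); the layout of a real Performance Monitor export may differ, and the column names and values below are invented for illustration.

```python
import csv
import io
import statistics


def summarize_baseline(csv_text):
    """Return {counter_name: (minimum, mean, maximum)} for each numeric column."""
    columns = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name, value in row.items():
            try:
                number = float(value)
            except (TypeError, ValueError):
                continue  # skip timestamps, blanks, and other non-numeric cells
            columns.setdefault(name, []).append(number)
    return {
        name: (min(values), statistics.mean(values), max(values))
        for name, values in columns.items()
    }


# Hypothetical sample data, not a real export.
sample = """Time,Work Queue Length,Free Megabytes
09:00,2,1500
09:15,5,1498
09:30,3,1490
"""

for counter, (low, mean, high) in summarize_baseline(sample).items():
    print(f"{counter}: min={low} mean={mean:.1f} max={high}")
```

Running this against each day's log gives you the minimum, average, and peak values that a healthy system generates for each counter.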
Exchange Performance Monitor Workspaces
Exchange Server automatically installs into \exchsrvr\bin eight Performance Monitor workspaces (.pmw files): Server Health, Server History, IMS Queues, IMS Statistics, IMS Traffic, Load, Queues, and Users. These workspaces are views of groups of Exchange Server performance counters. The workspaces provide invaluable information for each Exchange server. Their usefulness is limited, however, for several reasons. The workspaces are relevant only to the server on which they reside, so you can't take the .pmw files and run them on a remote dedicated monitoring workstation without modification. Furthermore, although the workspaces provide useful realtime information, they don't include thresholds, and therefore they can't generate NT or Performance Monitor alerts. Finally, the groups of counters that make up the workspaces might not be suitable for your environment. I find that customers prefer the information grouped logically. For example, they like to display the MTA Work Queues for each server in Chart mode in one Performance Monitor and Free Megabytes for each disk partition on each server in another.
The monitors in the Performance Monitor workspaces are useful starting points for monitoring your Exchange Server infrastructure, but I'll show you how to improve and tailor the monitors to your specific requirements. The sidebar "How to Configure Counters and Alerts," page 2, describes how to use Performance Monitor.
Counters to Monitor
Although Exchange Server 5.5 Enterprise Edition provides approximately 500 Exchange-specific counters, you need only about 20 counters to get a good picture of Exchange Server's performance. Table 1 lists these counters by object, and the following sections discuss each object's counters in depth.
LogicalDisk. Because several key Exchange Server components shut down if the amount of free disk space falls below 10MB, an obvious counter to monitor is Free Megabytes for each partition on each server. The value to set as a threshold depends on your circumstances. For example, if you have many active Exchange users on a server running a highly utilized Public Information Store, a fax connector, and several Event Scripts, you need to configure a very high threshold, perhaps 1GB. Alternatively, a few users who use the server only lightly require a lower threshold, perhaps 250MB.
You determine the threshold value by the amount of time you need between discovering the problem and the server failing. For example, if your dedicated fax connector handles 3000 faxes per working day, and the average fax is three 50KB pages, the fax connector processes about 450MB of faxes per day. If the connector or the fax server fails, faxes begin queuing up on your Exchange server. A 1GB threshold value gives you about 2 days before the server fails because of insufficient disk space. If you need a 4-day window, you need a 2GB threshold.
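The fax example can be worked through in a few lines of Python. This is a rough sketch under two assumptions of mine: 1MB is treated as 1000KB to match the rounded figures above, and the 10MB floor at which Exchange components shut down is subtracted from the usable space. The function names are hypothetical.

```python
SHUTDOWN_FLOOR_MB = 10  # free space below which key Exchange components shut down


def daily_volume_mb(items_per_day, pages_per_item, kb_per_page):
    """Daily queue growth in MB if nothing is delivered (1MB taken as 1000KB)."""
    return items_per_day * pages_per_item * kb_per_page / 1000


def response_window_days(threshold_mb, volume_mb_per_day, floor_mb=SHUTDOWN_FLOOR_MB):
    """Days between the alert firing and free space reaching the shutdown floor."""
    return (threshold_mb - floor_mb) / volume_mb_per_day


def threshold_for_window_mb(days, volume_mb_per_day, floor_mb=SHUTDOWN_FLOOR_MB):
    """Free Megabytes threshold that leaves the requested number of days."""
    return days * volume_mb_per_day + floor_mb


volume = daily_volume_mb(3000, 3, 50)
print(volume)                                    # 450.0
print(round(response_window_days(1000, volume), 1))  # 2.2
print(threshold_for_window_mb(4, volume))        # 1810.0
```

The results agree with the rounded figures in the text: a 1GB threshold buys a little over 2 days, and a 4-day window needs roughly 2GB.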
Memory. The Pages/sec counter records the number of times per second the system must page conventional memory to disk-based virtual memory, a process that can slow the performance of your server dramatically. A number that is significantly and consistently higher than your baseline readings signals a low-memory problem. The only resolution is to add more physical memory to the server.
MSExchangeMTA. Always monitor the Work Queue Length counter on the MSExchangeMTA object. This counter tracks the number of messages the MTA is working on and provides an overall perspective on the general health of the MTA. This counter represents the sum of all the Queue Length counters in the MSExchangeMTA Connections object. If you are monitoring remote Exchange servers and want to conserve WAN use, you can monitor just the Work Queue Length rather than all the individual MTA Queues described in the next section.
MSExchangeMTA Connections. Every logical connection the MTA has to any other messaging component, either on the same server or to a remote MTA, is an association. Within the MSExchangeMTA Connections object, each component with which the local MTA can form an association is a separate instance. For example, the instances listed on a client's Exchange server are FAXSRGATEWAY, cc:Mail Connector, the IMC (Performance Monitor and the NT services still call this connector the IMC, even though Microsoft changed its name to the Internet Mail Service, or IMS, in Exchange Server 5.5), Microsoft Private MDB, Microsoft Public MDB, and the seven other servers in the site. Exchange Server 5.5 lists 28 counters under MSExchangeMTA Connections, so you could monitor 308 counters for only one server. Fortunately, most of the counters are for troubleshooting rather than for performance monitoring.
The most valuable counter is Queue Length. The most important instances to monitor are the queue lengths to the other servers in the site, especially if the servers are in other countries. Set the threshold value according to the length of time it takes for the queue to reach that value. For example, a message taking more than 30 minutes for delivery, anywhere in the world, might be unacceptable to some companies. Calculate the threshold value per link; the value is the number of messages that would build up in 30 minutes if the link fails. The Send Message/sec and Receive Message/sec counters help you determine the overall usage of the system.
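As a minimal sketch of the per-link calculation, assume you know from your baseline how many messages per hour each link normally carries. The link names and rates below are invented, and the function name is my own.

```python
import math


def queue_length_threshold(messages_per_hour, max_delay_minutes=30):
    """Messages that would queue up during max_delay_minutes if the link failed."""
    return math.ceil(messages_per_hour * max_delay_minutes / 60)


# Hypothetical baseline rates (messages per hour) for two links in a site.
baseline_rates = {"LONDON01": 120, "TOKYO01": 45}

for link, rate in baseline_rates.items():
    print(link, queue_length_threshold(rate))
# LONDON01 60
# TOKYO01 23
```

Rounding up means the alert fires at or just after the 30-minute mark rather than before it. Use a peak-hour rate rather than a daily average if you want the alert to hold during busy periods.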
MSExchangeIMC. In addition to their MTA-related counters, all standard connectors have other counters; the IMS in Exchange Server 5.5 has at least 35. When problems occur with your Internet connection, the first sign is usually an increased queue length. You need to monitor four queue-related counters: Queued Inbound, Queued MTS-IN, Queued MTS-OUT, and Queued Outbound. Queued Inbound is the first port of call for inbound IMS traffic. Messages wait in this queue for the IMS to convert them into Exchange Server's internal format before passing them to the Information Store (IS)-based Queued MTS-IN, either for local delivery or for routing through the Exchange Server infrastructure. IS-based Queued MTS-OUT shows the number of messages waiting for the IMS to convert them into Simple Mail Transfer Protocol (SMTP) format. Queued Outbound is the messages' final resting place before delivery to the Internet.
From your baseline readings for the IMC, you'll know the range in queue size you can expect during the course of a week. From this baseline, you can work out an acceptable threshold, taking into account how quickly you want to know about the problem, how reliable your ISP is, and how few false alarms you can tolerate. For example, to learn quickly about a potential problem, you need a relatively low threshold value, but to minimize false alarms, you need to set a higher value. If you schedule your Internet connection or use slow links, choose suitably higher figures. Only you can decide the trade-off and work out the appropriate threshold value. For example, one of our larger customers has a 2Mbps connection to a large ISP with a very active user base, and the company's threshold value is 40. Keep in mind that monitoring your IMC queues gives you useful information, but a more efficient way to trap problems is to use an Exchange Link Monitor from the Microsoft Exchange Administrator program.
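One way to weigh the trade-off is to replay your baseline readings against a few candidate thresholds and count how often each would have raised an alert on a healthy system. The following Python sketch does that; the readings and candidate values are invented, and the function name is hypothetical.

```python
def false_alarm_counts(baseline_readings, candidate_thresholds):
    """Count baseline readings that exceed each candidate threshold."""
    return {
        threshold: sum(1 for reading in baseline_readings if reading > threshold)
        for threshold in candidate_thresholds
    }


# Hypothetical Queued Outbound readings from a normal week.
readings = [3, 8, 12, 5, 41, 7, 22, 9, 15, 4]

print(false_alarm_counts(readings, [10, 20, 40]))
# {10: 4, 20: 2, 40: 1}
```

A threshold of 10 would have fired four times during a week with no real problem, whereas 40 would have fired once. The lower value warns you sooner, at the price of more false alarms.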
Inbound Messages/Hr, Outbound Messages/Hr, NDRs Total Inbound, and NDRs Total Outbound are also valuable counters, because unusually high numbers can signal a UCE, or spam, attack in progress. I've heard of UCE attacks in which companies have used the telephone directory as the basis for addressing email, using popular conventions to guess email addresses, for example:
- [email protected] ([email protected])
- [email protected] ([email protected])
- First%[email protected] ([email protected])
If you see a higher number of Internet nondelivery reports (NDRs) than usual, you might be receiving UCE. NDR notifications go to the Administrator's Mailbox, as you define it on the General tab on the IMS. If you suspect UCE because of high Internet counter values, examine the NDRs for common errors or frequently used From addresses.
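If the NDR volume is too high to scan by eye, a short script can tally the From addresses. The sketch below assumes you have already extracted the From header of each NDR's original message into a list of strings; how you extract them depends on your mail client and isn't shown. The addresses are invented.

```python
from collections import Counter
from email.utils import parseaddr


def frequent_senders(from_headers, minimum_count=2):
    """Return (address, count) pairs for addresses seen at least minimum_count times."""
    counts = Counter(parseaddr(header)[1].lower() for header in from_headers)
    return [
        (address, count)
        for address, count in counts.most_common()
        if address and count >= minimum_count
    ]


# Hypothetical From headers taken from a batch of NDRs.
headers = [
    "Offers <bulk@example.com>",
    "bulk@example.com",
    "Alice <alice@example.org>",
    "BULK@EXAMPLE.COM",
]

print(frequent_senders(headers))
# [('bulk@example.com', 3)]
```

Lowercasing the address before counting keeps differently capitalized copies of the same sender from being tallied separately.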
A good tip is to make a public folder visible in the address book and use this folder as the Administrator's Mailbox for the IMS and configure notifications for all options for delivery to the administrator. This way, any member of your messaging team can view the NDRs and spot problems. Screen 1 shows that Advisacom calls its IMS Administrator's Mailbox Internet NDRs; it's really a public folder that the administrator has made visible in the Global Address List (GAL).
Process. Performance Monitor can't generate an alert only when a counter stays above a threshold for a specific amount of time; an alert fires as soon as a reading crosses the threshold. Therefore, when you're monitoring processor time, configure a Performance Monitor Chart View but don't set any alerts in the Alert View, because spikes of 100 percent are common for Exchange servers. For example, you can get a spike when the IS is flushing transactions to disk or when the Directory Service (DS) starts the Knowledge Consistency Checker (KCC).
However, getting consistently high processor utilization at times when you don't expect it can signal a problem with Exchange Server. Several Microsoft articles describe scenarios in which store.exe (the Exchange Information Store service) can reach and maintain 100 percent utilization until you restart the service. These articles include "XADM: Store Uses 100 Percent CPU When Sending Postscript Attachments" and "XADM: Store Uses 100% of CPU on Incoming MIME Binhex Message" (http://support.microsoft.com/support/kb/articles/q170/0/60.asp). Microsoft discovered that various bugs can cause this high processor utilization. Although Microsoft has corrected the bugs through service packs, you still need to monitor processor use. The counter to monitor is % Processor Time for the Process object for the instances DSAMAIN, EMSMTA, STORE, MAD, and EVENTS.
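Because Performance Monitor can't tell a brief spike from sustained high utilization, one workaround is to log the counter and check the logged readings afterward. The following Python sketch finds runs of consecutive readings above a threshold. It assumes the readings are evenly spaced and already loaded into a list; the sample values and function name are invented.

```python
def sustained_breaches(samples, threshold, min_consecutive):
    """Return (start_index, run_length) for each run of readings above threshold
    that lasts at least min_consecutive readings."""
    runs = []
    run_start = None
    for index, value in enumerate(samples):
        if value > threshold:
            if run_start is None:
                run_start = index  # a new run begins here
        else:
            if run_start is not None and index - run_start >= min_consecutive:
                runs.append((run_start, index - run_start))
            run_start = None
    # A run may still be open when the log ends.
    if run_start is not None and len(samples) - run_start >= min_consecutive:
        runs.append((run_start, len(samples) - run_start))
    return runs


# Hypothetical % Processor Time readings for the STORE instance.
store_cpu = [20, 100, 35, 95, 97, 99, 98, 40, 96, 96]

print(sustained_breaches(store_cpu, threshold=90, min_consecutive=4))
# [(3, 4)]
```

The single 100 percent spike at the second reading is ignored, but the four consecutive readings above 90 percent are reported. With a 15-second logging interval, for example, min_consecutive=4 corresponds to one minute of sustained load.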
System. % Total Processor Time represents the sum of the individual % Processor Time values for each process running on the NT server. If this figure is consistently close to 90 or 100 percent but the Exchange services aren't reporting unusually high processor utilization figures, you have another application that is seriously affecting the performance of your Exchange server. Perhaps you need to upgrade your server or split the applications over multiple servers.
Message Flow Counters
Performance Monitor also provides counters that can show you how much total mail traffic passes through the system. Table 2 lists some of these counters. From this information, you can observe the increase in messaging traffic as messaging becomes more popular in your organization and, therefore, plan for server upgrades. This data can give you ammunition to justify more powerful hardware in the future.
Only the cc:Mail Connector and the IMS have per-hour counters. Listen up, Microsoft: Per-hour counters for the MTA, the IS, and the DS would be very useful for monitoring the total increase in system usage and for planning for future requirements. And let's have a per-day and a per-week counter, too.
Commercial products are rapidly appearing that do a lot of the monitoring I've described here, and more. NetIQ AppManager (http://www.netiq.com) and IBM's Tivoli are two products that have advanced features such as enhanced reporting capabilities, enhanced logging, knowledge scripts, and automatic monitoring of new servers. In addition, Sentinel, a Microsoft BackOffice Resource Kit (BORK—2nd Edition) utility, overcomes some of the limitations of Performance Monitor and Event Viewer; Sentinel even has the distinction of being one of the few BORK utilities that comes with instructions.
However, Performance Monitor is certainly the most inexpensive option. It can achieve much of what the commercial applications offer, but you have to do much of the configuration. The commercial applications have the benefit of tying up all the functionality of monitoring in one package with enhanced functionality but at the price of increased cost and complexity. Whether these products are better than Performance Monitor depends on your circumstances, but regardless, Performance Monitor is a good starting point on the road to improved service times.