Tuning the X.400 Connector and the Message Transfer Agent

If you're operating Exchange Server in an environment with slow or unreliable WAN links, you've most likely considered tuning the Message Transfer Agent (MTA) to obtain the most efficient message delivery. The MTA is the core service that routes messages between two or more Exchange servers. The X.400 connector is the best connector to use with slow and unreliable WAN links—including T1 links that can become saturated with non-Exchange traffic—because the X.400 connector doesn't require permanent, high-bandwidth connectivity and it doesn't need remote procedure calls (RPCs) to connect. Let's see how you can get the most out of the MTA over X.400 connections.

The most obvious way to tune the MTA is to alter values on the MTA Site Configuration object's Messaging Defaults tab. However, if you're using the X.400 connector instead of the Site connector, you must change the values on the Override tab of each end of the X.400 connection to configure the MTA's behavior. The values on the connector's Override tab take precedence over the values on the MTA's Messaging Defaults tab. When you use the X.400 connector, messages flow from the sending MTA to the sending X.400 connector to the receiving X.400 connector to the receiving MTA.

When you optimize the MTA, you try to strike the optimal balance between maximum use of available bandwidth and a minimum number of message retransfers. You also want to control the MTA's behavior when those links become saturated or broken.

When the MTA, via the X.400 connector, can't successfully deliver a message to a downstream MTA, the originating MTA returns a nondelivery report (NDR) to the sender. Contrary to what many people think, an NDR means that the originating MTA couldn't deliver the message to the downstream MTA, not to the intended recipient. Fortunately, as an Exchange administrator, you can adjust MTA values to reduce the number of NDRs the originating MTA returns.

To achieve optimal message transfer performance, Exchange lets you configure four sets of values that can help control the flow of messages:

Reliable Transfer System (RTS)—the frequency with which the MTA verifies message transfer and receipt
Connection retry values—the number of times you want the MTA to try to open a connection or resend a message
Association parameters—the number of communication channels between Exchange and other systems
Transfer timeout values—the interval the MTA waits before sending an NDR, based on the importance of the message.

To configure these values, open the Microsoft Exchange Administrator program, and go to the Override tab on the X.400 connector, which Screen 1 shows. For this article, I assume that these links have no Exchange messaging RPC traffic. I also assume that you connect the two Exchange servers with an X.400 connector on top of the TCP/IP transport stack.

RTS Values
The RTS values you configure are Checkpoint size (K), Recovery timeout (sec), and Window size. Checkpoint size is the amount of data one MTA transfers to another before the receiving MTA acknowledges receipt of the data. In addition, if the MTA needs to retransmit portions of the data stream for any reason, the sending MTA can retransmit from the last acknowledged checkpoint instead of from the beginning of the data stream. This feature minimizes bandwidth usage for retransmission.

Be sure to set the Checkpoint size value on the X.400 connector below that of your available bandwidth for Exchange data. For instance, if you're connecting two sites over a 56Kbps frame relay link and, after accounting for other traffic, you have 24Kbps left for Exchange data transfers, set your Checkpoint size below 24Kbps, perhaps between 18Kbps and 22Kbps. If you set the value higher than the available bandwidth, your messages can saturate your link before the sending MTA can insert a checkpoint into the data stream. The receiving MTA then can't acknowledge receipt of the data, and error messages result. Any necessary retransmissions would therefore start from the beginning of the data stream and slow the data transfer rate.

Recovery timeout is how long the MTA caches unacknowledged checkpoint data. The sending MTA continues to transfer messages with unacknowledged checkpoints for a set time before it assumes that the receiving MTA hasn't received the data, because the receiving MTA hasn't acknowledged the data. When the Recovery timeout period expires, the sending MTA assumes that none of the data it has sent arrived at the receiving MTA and retransmits the data from the beginning. Therefore, if your link is unreliable, increase the Recovery timeout setting.

For example, if your T1 line often becomes saturated with non-Exchange traffic, increasing the Recovery timeout value on the X.400 connector from the 60-second default to 120 seconds lets your sending MTA cache more unacknowledged checkpoint data. However, be sure to keep the setting below 15 minutes (900 seconds) if you've configured your X.400 connector to transfer messages Always rather than at scheduled times. The Always setting on any Exchange schedule tab means every 15 minutes.

Window size refers to the number of checkpoints that can go unacknowledged before the sending MTA suspends message transfer. As with the Recovery timeout value, if the Window size value is reached, the sending MTA assumes that the downstream MTA hasn't received its data and begins retransmitting the data stream from the beginning.

The Window size default value is 5. However, Exchange Server 5.5 automatically negotiates this value down to 3. To force Exchange to use your configured values instead of 3, install Service Pack 2 (SP2). If your link is slow or unreliable, increase this value to 8 or 10 checkpoints on your X.400 connector. This increase lets message transfer continue in case acknowledgments aren't making their way back to the sending MTA fast enough.

However, if your link is fast but has frequent, brief saturation periods, try leaving the Window size value alone and increasing the Recovery timeout value. If this action doesn't reduce the number of NDRs, increase the Window size value to 8 or 10.

Multiplying the Window size and the Checkpoint size results in Memory Size Allocation, which denotes the maximum amount of data that the MTA can transfer without an acknowledgment. Hence, a 20KB Checkpoint size multiplied by a Window size of 10 results in a maximum of 200KB being held in memory without an acknowledgment.
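The multiplication is easy to check in a few lines of Python. This is only a sketch of the arithmetic: the function name is hypothetical, and it takes the Checkpoint size in kilobytes and the Window size as a count of checkpoints.

```python
def max_unacknowledged_kb(checkpoint_size_kb: int, window_size: int) -> int:
    """Maximum data (in KB) the MTA can send without an acknowledgment."""
    return checkpoint_size_kb * window_size


# The example from the text: a 20KB checkpoint and a window of 10 checkpoints.
print(max_unacknowledged_kb(20, 10))  # 200
```

Raising either value raises the amount of data that must be resent if the window is exhausted, so it pays to change one value at a time and watch the result.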

In summary, if you're passing messages between two Exchange servers over an unreliable link, decrease the MTA's Checkpoint size and increase the Recovery timeout interval. These settings reduce the number of message retransfers. However, if you have a reliable but slow link, make the Checkpoint size smaller than the available bandwidth and increase both the Window size value and the Recovery timeout interval. Finally, if you have a fast but often saturated link, increase both the Recovery timeout interval and the Window size value.

Connection Retry Values
The sending MTA invokes Connection retry values when it can't initiate or maintain a connection with the receiving MTA. You can configure Max open retries, Max transfer retries, Open interval (sec), and Transfer interval (sec) values.

The Max open retries value designates the maximum number of times a connector attempts to open a connection to another connector before sending an NDR. The default is 144.

The Max open retries value works in concert with the Open interval value, which specifies the number of seconds the MTA waits before attempting to reopen a connection that has previously failed. The default is 600 seconds (10 minutes). So, the sending connector attempts to make 144 connections to the remote connector and waits 10 minutes between each connection attempt before having the MTA return an NDR to the user. This sequence takes 24 hours—(144 * 10) / 60 = 24.
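If you want to see how a change to either value moves the point at which the user finally receives an NDR, the same arithmetic can be sketched in Python. The function name is hypothetical, and the sketch ignores the time each connection attempt itself takes.

```python
def hours_until_ndr(max_open_retries: int = 144, open_interval_sec: int = 600) -> float:
    """Approximate hours the MTA keeps retrying before it returns an NDR."""
    return max_open_retries * open_interval_sec / 3600


print(hours_until_ndr())           # 24.0 (the defaults)
print(hours_until_ndr(144, 1200))  # 48.0 (Open interval doubled to 20 minutes)
```

Note that doubling the Open interval without lowering Max open retries also doubles how long a user waits to learn that a message never left the site.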

For unreliable links, set the Open interval value to be greater than the time it takes to establish a connection. And if you have a very slow link, consider setting this value as high as 1200 seconds (20 minutes).

If your connection is reliable but your link is slow, you can change the Max transfer retries and Transfer interval values. The Max transfer retries value is the number of times the MTA attempts to send a message across an open association before returning an NDR. The default is 2. If your link is slow, you can raise this value on the X.400 connector to between 5 and 10, thereby allowing your MTA more attempts to deliver messages before concluding that the receiving MTA is unavailable. The Transfer interval is the amount of time the MTA waits before retransmitting a message across an open connection after an error. The default is 120 seconds. Lowering the Transfer interval value causes the message retries to occur more quickly after the failure of a message transfer. The effect of this change is to force the MTA to attempt retransfer more often and allow more failures before invoking and incrementing the Open interval value.

One problem you need to plan for is when your WAN link disconnects for discrete periods of time (e.g., from 1 minute to more than 1 hour for a power outage or cut cable). During this time, outbound messages accumulate in the MTA's message queue. On the local server, you can see these messages as *.dat files in the exchsrvr\mtadata\mtacheck.out folder. When you reestablish the connection, your link must accommodate not only current messaging traffic but also the backlog of messages. Therefore, if your link is slow or unreliable, the link might experience message saturation soon after you reestablish the connection.

Screen 2 illustrates what happens when your link disconnects and messages begin to build up in the MTA message queue. In this example, I was sending two 2KB messages from the Minneapolis site to the Indianapolis site every 3 seconds. Because I had only one association and virtually no messages were in the MTA queue, the graph of both the Associations and the Queue Length counters is flat. But the TCP/IP Transmit Bytes/sec counter shows that the MTA is sending messages regularly from the Minneapolis site to the Indianapolis X.400 connector.

In Screen 3, you can see what happens when I unplug the patch cable from the Minneapolis server. Because I've severed the link to my Indianapolis site, the MTA can't transfer messages; therefore, the TCP/IP Transmit Bytes/sec counter (i.e., the line that drops to 0) immediately flattens. However, the Queue Length counter (i.e., the line that is steadily increasing) records an increasing number of messages that the MTA has placed in the queue for Indianapolis and that the MTA can't send. The Associations line also looks flat, but the number actually drops from 1 to 0 because loss of the physical connection also means loss of the association.

After 20 minutes, I plugged my patch cable back into the Minneapolis server. Contrary to what you might think, an immediate rush of messages passing between the MTAs didn't occur. Because the Open interval was set to 600 seconds (i.e., 10 minutes), reestablishing an association from Minneapolis to Indianapolis took several minutes. However, after the MTA established a new association, the messages in the queue began passing to Indianapolis. You can see this process in Screen 4 in the first vertical line (the TCP/IP Transmit Bytes/sec counter), which soared to the top of the chart as the messages began to flow.

After I created the first association, the Minneapolis MTA discerned that more than 50 messages were waiting in the queue for transfer and opened seven additional associations to Indianapolis, as Screen 5 shows in the line that moves from 0 to 8. As the number of bytes transmitted over the X.400 connector to Indianapolis substantially increased, the queue length counter dropped dramatically, as Screen 6 shows. Because I'd set the Lifetime (sec) parameter to 300 seconds (5 minutes), the number of associations stayed at eight for 5 minutes after the MTA had flushed all the messages from the Minneapolis queue.

Association Parameters
An association is the virtual pipe through which messages travel between MTAs. When you use Windows sockets, the transport layer must create a TCP connection over port 102 before two MTAs can create an association. This connection places each MTA in a listening state: You can use the Netstat utility to observe this state.

Each association consumes a TCP/IP control block. The default setting in Exchange Server 5.5 is 20 control blocks per Exchange server. Before SP2, the MTA used one control block per association plus one control block to remain in a listening state. In SP2, the MTA no longer requires a control block to be in a listening state. Therefore, SP2 lets you use system resources more efficiently.

If you have a link slower than 128Kbps or large messages that result in message backlogging activity, you can allow up to 2000 control blocks (associations) per Exchange server. To increase the control block value, go to the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeMTA\Parameters Registry key and enter the number of blocks you want as a decimal value under the TCPIP Control Blocks value.

If you don't know the minimum number of control blocks you need, use the equation

(Number of X.400 connectors * 10) + 10

and place the results in the key as a decimal value. For example, on a bridgehead server that hosts eight X.400 connections, enter (8 * 10) + 10 = 90 TCP/IP control blocks.
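If you manage several bridgehead servers, you might want to script the calculation. The following Python sketch applies the equation above; the function name is hypothetical, and capping the result at the 2000-block maximum mentioned earlier is my assumption about how you'd want to handle very large connector counts.

```python
MAX_CONTROL_BLOCKS = 2000  # upper limit per Exchange server cited in this article


def min_control_blocks(x400_connectors: int) -> int:
    """Apply (number of X.400 connectors * 10) + 10, capped at the maximum (assumption)."""
    if x400_connectors < 0:
        raise ValueError("connector count cannot be negative")
    return min(x400_connectors * 10 + 10, MAX_CONTROL_BLOCKS)


print(min_control_blocks(8))  # 90, the bridgehead example above
```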

The Association parameters you configure are Lifetime (sec), Disconnect (sec), and Threshold (msgs). The Lifetime parameter designates the length of time an MTA attempts to hold open an association after the successful completion of message transfers. With unreliable WAN links, decreasing this value from 300 seconds (5 minutes) to 180 or 120 seconds forces the MTA to close the open association more quickly, resulting in a more efficient use of system resources. If your link is slow but reliable and you know that more message transfers will occur, you can increase this value to 600 or more to reduce the association creation traffic. To see which port numbers each association is using, run netstat -n at the command prompt.

The Disconnect value specifies how long the sending MTA waits for a response to its disconnect request before severing its end of the association. The default is 120 seconds. Hence, the order of activity is

Open association.
Send mail.
Wait 5 minutes after mail is sent.
Send disconnect request.
Receive Disconnect OK and then disconnect.
If no Disconnect OK is received, wait the number of seconds you set in the Disconnect value, then kill the association.

The sum of the Disconnect and Lifetime values is the maximum amount of time the connection remains open after message transfer completes successfully. For slow but reliable links, increase the Disconnect value as needed to prevent unnecessary association setup traffic. For unreliable links, decrease this value to 5 or 10 seconds to prevent the MTA from invoking the Connection retry values unnecessarily.
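To see how long an idle association can linger in the worst case (that is, when no Disconnect OK ever arrives), you can add the two values. The Python sketch below uses the defaults given in this article; the function name is hypothetical.

```python
def max_linger_sec(lifetime_sec: int = 300, disconnect_sec: int = 120) -> int:
    """Worst-case seconds an association stays open after the last transfer."""
    return lifetime_sec + disconnect_sec


print(max_linger_sec())         # 420 (defaults: 300 + 120)
print(max_linger_sec(120, 10))  # 130 (values suggested for unreliable links)
```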

The Threshold value denotes the number of messages you can transfer over a given association. Increasing this number decreases the need for additional associations between two MTAs. This value is useful when you have high-volume traffic between two or more MTAs. For example, if you've set the number of control blocks to 90, by default only 50 messages can pass over each association, thereby limiting you to 4500 (90 x 50) messages that can transfer at a time. If you increase the Threshold value to 100, you double the number of messages that you can transfer over the 90 control blocks. This capability is especially useful when you know you have a backlog of messages and you're ready to reestablish a link to another site. Increasing the Threshold value will increase the number of messages that the MTA can transfer over your associations.
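The capacity figures in this example are a simple product, which the following Python sketch reproduces. The function name is hypothetical, and the sketch assumes, as the example does, that every control block carries one association.

```python
def max_messages_in_transfer(control_blocks: int, threshold: int = 50) -> int:
    """Messages that can be in transfer at once: control blocks * Threshold."""
    return control_blocks * threshold


print(max_messages_in_transfer(90))       # 4500 (default Threshold of 50)
print(max_messages_in_transfer(90, 100))  # 9000 (Threshold raised to 100)
```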

Transfer Timeouts
When message transfer fails, the MTA sends an NDR according to the Transfer timeout values you set on the X.400 connector Override tab. The MTA handles messages according to their priority: Urgent (default of 1000 seconds per kilobyte), Normal (default of 2000 seconds per kilobyte), or Non-urgent (default of 3000 seconds per kilobyte). Urgent messages receive NDRs more quickly than nonurgent messages. If your link is unreliable or slow, you can increase the Urgent value to slow the rate at which the MTA returns NDRs to the user.
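Because the Transfer timeout values are expressed in seconds per kilobyte, the wait before an NDR depends on message size as well as priority. The Python sketch below assumes a simple linear model (message size in KB multiplied by the per-kilobyte value); the function and dictionary names are hypothetical, and the MTA's exact behavior for very small messages isn't covered here.

```python
# Default Transfer timeout values (seconds per kilobyte) from this article.
DEFAULT_TIMEOUT_SEC_PER_KB = {"urgent": 1000, "normal": 2000, "non-urgent": 3000}


def transfer_timeout_sec(message_kb: float, priority: str = "normal") -> float:
    """Assumed model: timeout = message size in KB * per-KB value for the priority."""
    return message_kb * DEFAULT_TIMEOUT_SEC_PER_KB[priority]


print(transfer_timeout_sec(2, "urgent"))      # 2000
print(transfer_timeout_sec(2, "non-urgent"))  # 6000
```

Under this model, a 2KB urgent message times out after about 33 minutes, whereas the same message sent as non-urgent waits 100 minutes.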

You might think slowing the rate of returning NDRs seems like the wrong thing to do because you want users to receive timely NDRs. However, remember that an NDR, too, is an email message that is subject to all the same limitations and rules as other email messages. Increasing the NDR return rate over a slow or unreliable WAN link might result in unnecessary message retries and link saturation. Obviously, you need to balance your users' need to receive a timely NDR against your link's limitations.

Using Performance Monitor to Assess How to Set Your Values
Several Performance Monitor counters can help you assess how to tune the MTA. Particularly useful is the MTA Throughput Performance Monitor Chart. This chart is a standard set of Exchange-related Performance Monitor counters that is part of the Exchange Server installation. Because most unreliable or slow connections work best over the X.400 connector, monitor the TCP/IP Receive Bytes/sec and TCP/IP Transmit Bytes/sec counters. (You monitor the TCP/IP counters, which relate to the X.400 connector, instead of LAN counters, which relate to the Site connector.) These counters help you gauge the amount of traffic passing over the WAN link and assess whether your WAN link is approaching saturation. In the frame relay example I mentioned previously, if you have 24Kbps of bandwidth available on a 56Kbps frame relay, use these counters to make sure you're not passing more than 2000 to 3000 bytes per second (at 8 bits per byte, 3000 bytes/sec * 8 = 24,000bps, or 24Kbps).
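Performance Monitor reports bytes per second, but link capacity is quoted in kilobits per second, so a quick conversion helps when you read the counters. The following Python sketch is one way to do it; the function names are hypothetical, and it treats 1Kbps as 1000 bits per second.

```python
def bytes_per_sec_to_kbps(bytes_per_sec: float) -> float:
    """Convert a Bytes/sec counter reading to kilobits per second."""
    return bytes_per_sec * 8 / 1000


def exceeds_budget(bytes_per_sec: float, available_kbps: float) -> bool:
    """True if the observed rate is above the bandwidth left for Exchange."""
    return bytes_per_sec_to_kbps(bytes_per_sec) > available_kbps


print(bytes_per_sec_to_kbps(3000))  # 24.0
print(exceeds_budget(3500, 24))     # True
```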

Moreover, observing the transfer rates before you change the MTA values gives you a baseline against which to measure the effectiveness of your changes and to determine which values to change when you adjust the MTA. In addition, you can monitor the MSExchangeMTA: Work Queue Length counter to show you the number of messages in the MTA queue. An increasing counter signifies more messages attempting to be transferred than the MTA can transfer. The only solution to this problem is to increase available bandwidth. You can increase bandwidth by rerouting non-Exchange traffic over other connections (e.g., a dial-up or another T1 connection), rerouting Exchange traffic through other connections to other sites in your organization, or increasing bandwidth over your current connections.

When you use an X.400 connector, you can schedule messages for delivery when the link is most available—most often at night. If your messaging traffic can wait up to 12 hours after a user sends it, you can use the Schedule tab for the X.400 connector to transfer messages during off-peak hours. This approach might be the best way to increase available bandwidth for your Exchange traffic.

Multiple X.400 Connectors with Varying Connections
If you have multiple X.400 connectors and several types of slow or unreliable links, such as frame relay, Dynamic RAS (DRAS), or an often-saturated T1, you still need to modify each end of your X.400 connectors manually. Unlike Site connectors, X.400 connectors can't all be modified simultaneously. The benefit of this restriction is maximum flexibility when you configure your X.400 connectors over varying types of connections.

Common Problems and Solutions
When the number of messages in the queue exceeds the maximum number of allowable associations between MTAs, you might receive an Event 57 in the application event log with text similar to <X.500 Distinguished Name referring to a remote server> has been reached. The limit is 9. If you receive this error message, remember that the maximum number of associations between MTAs over an X.400 connector is 10—9 associations for low and normal priority mail and 1 association for urgent mail. To eliminate this error message, increase your Threshold value from the default of 50 to 400 or even 1000. This setting lets more messages flow over each association and reduces the number of Event 57 messages in your application log.

If your Exchange server has been offline for a few hours or days, the MTA might start rejecting messages, holding messages in queues, or creating message loops, and your system will record Event ID 200 in the application event log. The cause of this behavior is an Open interval value that is set too low (usually below 30 seconds). Therefore, set this value no lower than 60 seconds.

Finally, if you observe that messages aren't flowing over an X.400 connector while messages are building in the queue and no error messages appear in the sending server's application log, check the application log on the receiving server for an Event 9301 and 9202 set of error messages. This sequence of events usually points to a DNS problem; that is, the X.400 connectors are configured with Fully Qualified Domain Names (FQDNs) rather than IP addresses on the Stack tab. The MTA that has logged the error can't resolve the FQDN to an IP address or can't do a reverse lookup from the IP address back to the FQDN. Use the Nslookup utility at the command prompt to ensure that you've entered the right FQDN for the server's IP address and that a pointer record is entered into your DNS tables.
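If you'd rather script the check than run Nslookup by hand, the following Python sketch performs the same forward and reverse lookups with the standard socket module. The function name is hypothetical, and the two resolver parameters exist only so that you can substitute test doubles; by default the sketch queries whatever resolver the local machine uses.

```python
import socket
from typing import Callable


def forward_and_reverse_match(
    fqdn: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], str] = lambda ip: socket.gethostbyaddr(ip)[0],
) -> bool:
    """Return True if fqdn resolves to an IP address whose pointer record names fqdn."""
    try:
        ip_address = forward(fqdn)          # forward lookup: FQDN to IP address
        resolved_name = reverse(ip_address)  # reverse lookup: IP address to name
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return False
    # Compare case-insensitively and ignore a trailing dot on either name.
    return resolved_name.rstrip(".").lower() == fqdn.rstrip(".").lower()
```

Run the function on both the sending and receiving servers with the FQDN you entered on the Stack tab. A False result on either side suggests a missing or mismatched host or pointer record.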

Keep a Sharp Eye
Adjusting MTA values can improve Exchange Server's performance over slow or unreliable links. If you use monitoring tools to review your counters and make careful comparisons, you can continue to adjust the MTA's configurations until you reach optimal settings.
