LoadSim Revealed: Scientific Method to the Rescue

Microsoft provides a handy little tool with Exchange Server 4.0 called LoadSim (see Screen A), which functions as a load generator and user simulator for capacity testing a messaging platform, specifically Exchange. It runs on one or more client machines in tandem, sending and receiving messages, accessing public or private folders, and so on, as it emulates the activities of a normal Exchange user.

While LoadSim was intended to be a capacity planning tool (to find out how many users you can support on a system, and with what kind of response times), it also makes an excellent performance testing tool if used properly. However, LoadSim is not without problems: client dependencies, a quirky user interface, and sometimes unpredictable behavior. If you are aware of these holes and plan your testing strategy around them, you can use LoadSim to test existing systems or to find out what a new one will do for you. In the Windows NT Magazine Lab, we decided that LoadSim would make an excellent first step in testing server hardware as messaging platforms. We can tune the system configuration (number of CPUs, amount of memory, disk and network layouts, etc.) and change the user load (number of users, transaction mix) to come up with curves that tell a more complete story about a particular machine. Instead of a single number characterizing the performance of an entire client/server system, we can use these curves to find trends and breakpoints for various types of systems.
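To make the idea of a breakpoint concrete, here is a minimal sketch of one way to read it off a load curve. It is not part of LoadSim. The function name, the doubling threshold, and the sample numbers are all assumptions made up for illustration; none of them are measurements from our tests.

```python
def find_breakpoint(curve, factor=2.0):
    """Return the first user count whose response time exceeds
    factor x the response time at the lightest load, or None.

    curve is a list of (users, response_ms) pairs.
    The factor of 2.0 is an arbitrary, illustrative threshold.
    """
    points = sorted(curve)
    if not points:
        return None
    baseline_ms = points[0][1]
    for users, response_ms in points[1:]:
        if response_ms > factor * baseline_ms:
            return users
    return None


# Made-up numbers, purely to show the call.
example_curve = [(100, 500), (500, 600), (1000, 900), (1500, 2400)]
print(find_breakpoint(example_curve))  # 1500
```

Running the same function over curves from different server configurations gives one breakpoint per configuration, which is easier to compare than the raw curves.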

Know Your Enemy
First, let's look at the problems we know about. Client dependencies in LoadSim are fairly significant: the horsepower of the client system has a large bearing on measured response times. LoadSim is more memory constrained than CPU constrained, but even with a large amount of memory, the client falls down at high user counts. Besides, you have to keep the test realistic. You can't simulate 1000 users on a single physical client system, because doing so introduces dependencies at the client level that you are trying to avoid. Worse, these are the same dependencies that you are trying to measure on the server! With too high a user count, regardless of whether the client's CPU is fully taxed or its memory is adequate, the I/O capabilities of the client system get in the way. With an appropriately fat client, you can simulate a certain number of users and attain the same throughput for each one (within an acceptable tolerance) as you would by having a separate physical machine for each simulated user. If you go too far, you hit bottlenecks in the client (network bandwidth, memory, CPU, disk utilization, and so on) that warp your results.

When we set up our testing environment for the Tricord review, we ran tests using a maximum configuration on the server (four CPUs, 1GB of RAM), while varying the number of users simulated on a single physical client system. We found that the response time didn't start degrading noticeably until we went above 100 users (that is, the response time at 10 users was within 10% to 15% of that for 100 users). Other vendors such as Compaq, and even Microsoft, have performed similar tests in an environment comparable to ours, and came up with the same results for client load. We also tuned the user load and think times (how long the pause is between user operations) to values between absolute "real world" conditions (an eight-hour day with long breaks between actions) and a livable testing environment that wouldn't take 24 hours to get a single data point. We ended up with a two-hour day and a four-hour test run, which neither overwhelmed the client system nor represented an unrealistic environment. We took data points from the two middle hours (the last half of the first day and the first half of the second day), so that neither the ramp-up time (the first hour) for the test to reach steady state nor the ramp-down as the users log off influenced the results.
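The arithmetic behind the measurement window and the client-load check can be sketched in a few lines. This is only an illustration of the reasoning above, not anything LoadSim does for you. The function names are hypothetical, and the 15% default simply mirrors the upper end of the range we observed.

```python
def steady_state_window(day_hours, run_hours):
    """Return (start, end) in hours for the middle of the run.

    Keeps the last half of the first simulated day and the first half
    of the second, so it assumes a run of exactly two simulated days.
    """
    if run_hours != 2 * day_hours:
        raise ValueError("this sketch assumes a run of exactly two simulated days")
    return day_hours / 2, day_hours + day_hours / 2


def within_tolerance(baseline_ms, loaded_ms, tolerance=0.15):
    """True if the loaded response time is within tolerance of the baseline."""
    return abs(loaded_ms - baseline_ms) <= tolerance * baseline_ms


# A two-hour day and a four-hour run give a window from hour 1 to hour 3.
start, end = steady_state_window(day_hours=2, run_hours=4)
print(start, end)  # 1.0 3.0
```

If you pick a different day length, the same rule applies: discard the first and last half-day and keep what is left in the middle.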

Since we could operate within a reasonable range of real world results, and keep the test believable and repeatable, we determined that LoadSim was a good starting point for messaging tests. But what about the other problems I mentioned, like the inconsistent interface and unpredictability, which would seem to contradict using this tool at all?

The interface is a resolvable issue: it just takes a little babysitting of the test runs. The utility itself follows the typical Microsoft GUI guidelines (rather than being a command-line interface), but the error trapping is a little weak, so restarting the tool or reloading a set of test parameters can change test settings. Before each run, we had to double-check every system to make sure that it was going to run the test we intended.

Unpredictability is a little more difficult to deal with, and it is a two-fold problem. First is the unpredictability of the interface, which I just explained. Second is the unpredictability of the test results. On the one hand, LoadSim is a fine end-to-end testing environment; on the other hand, you don't really know what it is measuring, and can only infer certain things by analyzing the results against server operations (such as disk and CPU utilization). There is a narrow band of settings in the test, as well as a specific hardware configuration on the server, that seems to give relatively error-free logs (see the section on load and scalability in the main article). A test run isn't necessarily invalid if there are errors; the errors just point to bottlenecks in the system.

I say that you don't know what LoadSim is really measuring, because response times behave in an odd way when compared to server configuration. On a server with lots of memory, the response time goes up (which is a bad thing). With less memory, the response time drops (which is a good thing), but the message queue at the server is incredibly long, and doesn't finish processing messages until long after the test has actually completed. So, are you measuring user response time (i.e., how long it takes for the interface to return control of the system to the user so that he/she can send another message), or are you measuring total message processing times (server latency)?

Exchange seems to behave such that if the resources are available, it uses them to the best of its abilities. If they aren't, Exchange holds things back (it queues them up), such as outgoing messages, until the proper resources are again available. In our tests, we saw as much as an hour of post-processing after a run with one or two CPUs and 128MB of RAM.

After a few test runs, we knew some of what was going on behind the scenes, and could account for certain values in the results. For your tests, now that you know what some of the issues are, you can deal with the problems at the start: The errors will make more sense, or you can eliminate them entirely.

The Tool
The LoadSim tool itself is easy to use once you know what you are looking for. However, don't go to Microsoft asking for support or waste time searching for extensive documentation, because there is none. Microsoft provides it as a "use it at your own risk" utility, and will not support your efforts with it. There is meager support documentation on the Exchange Server distribution CD, but that's about it.

LoadSim lets you tune a test run in a variety of ways. You can use any number of physical clients, and simulate any number of users performing a wide range of operations. Simply install the tool on each system you intend to use. When you enter all of the names of the client systems in the Configuration/Client Machines dialog, a user list is generated based on all available systems, which you then import into the Exchange server.

You can set test parameters for user level (high, medium, or low usage, representing the number of transactions in a day), what the users will do (send, receive, access folders, etc.), how long a test runs, the think time between operations, how long day and night are, and a great deal more. These settings are saved to .SIM files for later recall. We used the default settings for everything but the length of the day and the overall test, so you should be able to reproduce our tests almost exactly.

While a test is running, all statistics and messages are displayed in a console window on the client systems. You can see the current response time (shown as "score"), message types, activities, errors, and current test status (total time, current user count, etc.). This data is logged to a file for use by the lslog utility, which actually calculates the test results.

The LSLOG.EXE program can truncate data to eliminate ramp-up and ramp-down periods in the test, concatenate log data sets from multiple client systems, and determine the 95th percentile response time from the steady-state period of the test run (the response time at or below which 95% of all transactions completed while all users were logged on and the activity was at its peak). You can then plot this data, as you see in Graph 1.
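If you want to sanity-check a result by hand, the three steps (concatenate, truncate, take the 95th percentile) can be sketched as follows. This is not the log utility's own code or file format. The sample layout of (timestamp in seconds, response time in milliseconds) pairs, the function names, and the nearest-rank percentile method are all assumptions for illustration, and the sample values are made up.

```python
import math


def truncate(samples, start_s, end_s):
    """Keep (timestamp_s, response_ms) samples inside the steady-state window."""
    return [(t, r) for (t, r) in samples if start_s <= t < end_s]


def percentile_nearest_rank(values, pct):
    """Smallest value such that at least pct percent of values are <= it."""
    if not values:
        raise ValueError("no samples in the window")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def score(client_logs, start_s, end_s, pct=95):
    """Concatenate per-client logs, truncate, and return the percentile."""
    merged = [sample for log in client_logs for sample in log]
    steady = truncate(merged, start_s, end_s)
    return percentile_nearest_rank([r for _, r in steady], pct)


# Made-up samples from two hypothetical clients; window is hour 1 to hour 3.
client_a = [(100, 900), (4000, 400), (5000, 450)]
client_b = [(3700, 500), (9000, 2000), (11000, 5000)]
print(score([client_a, client_b], start_s=3600, end_s=10800))  # 2000
```

Note how the two samples outside the window (one during ramp-up, one during ramp-down) never reach the percentile calculation.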

Errors
Our tests were not error free (as you can see by the one bogus data point in Graph 1). We found network dependencies rooted in how the server was handling authentication requests (since it was the domain controller, too), how it was processing network packets, or both, as well as possible problems in the network configuration. The server, the network hardware, or Exchange itself couldn't support 1500 simultaneous client logons, so Exchange choked, giving Messaging API (MAPI) errors and connection timeouts.

Possible reasons for this behavior include network collisions, packets arriving or being processed out of order, or a network that was simply overloaded. The MAPI errors and timeouts occurred only when all of the clients were running at the same time with a low memory configuration on the server. Adding more memory to the server minimized these effects, and reducing the number of simultaneous logons by staggering the client startup procedure helped a great deal (although too much of a delay between client initializations caused problems with LoadSim because it couldn't find the other email accounts). Lowering the total user count and turning up the think time (increasing the length of the simulated day) also reduced the error frequency, the message queue length at the end of the test run, and memory and CPU utilization.
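Staggering the startup amounts to giving each client a start offset while keeping the total spread short enough that the clients still find each other's accounts. The sketch below shows that trade-off. It is not a LoadSim feature: the function name, the client names, and the max_spread_s limit are hypothetical, and we did not measure a specific safe limit, so you would have to find that value by experiment.

```python
def staggered_starts(clients, delay_s, max_spread_s):
    """Map each client name to a start offset in seconds.

    Raises ValueError if the last client would start later than
    max_spread_s after the first (max_spread_s is an assumed limit).
    """
    spread_s = delay_s * max(len(clients) - 1, 0)
    if spread_s > max_spread_s:
        raise ValueError(
            f"last client would start {spread_s}s after the first; "
            f"limit is {max_spread_s}s"
        )
    return {name: i * delay_s for i, name in enumerate(clients)}


# Hypothetical client names and delays, for illustration only.
schedule = staggered_starts(["client1", "client2", "client3"],
                            delay_s=60, max_spread_s=600)
print(schedule)  # {'client1': 0, 'client2': 60, 'client3': 120}
```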

Is this a network hardware problem or a memory vs. network I/O processing problem? The errors (both symptoms and solutions) point to server and NT limitations rather than to a physical network bottleneck alone, since most errors occurred at test initialization rather than during the test run.

The Big Question
Can LoadSim help you capacity test your systems, and does it answer the burning question of NT scalability? Without waffling too much, I can say yes. Properly used, LoadSim can tell you a great deal about your server's performance, what an optimal configuration is, and what load it can support.

We will continue to use LoadSim for performance testing servers in the Windows NT Magazine Lab, and we'll be able to analyze a number of factors according to the load we use: from absolute capacity to best hardware for messaging. Stay tuned for the latest data and stress tests.
