
Tips for Interpreting Messaging Benchmarks

As messaging and collaboration applications such as Microsoft Exchange Server, Lotus Domino, and Novell GroupWise gain popularity, the industry is scrambling to find a standard for comparing these products' performance. Benchmarking standards organizations and third parties such as Bluecurve want to add standard messaging-server benchmark suites to their portfolios, and software vendors such as Microsoft and Lotus have already defined standards.

A traditional benchmark for hardware performance is users per server. Hardware vendors publish headlines proclaiming that their products can support 10,000, 20,000, or 30,000 users on one server. But users per server is probably not an appropriate performance benchmark for messaging applications. Understandably, administrators are confused. They ask, "What do these benchmark standards mean?" "Which benchmark can I trust?" and "How do I interpret these benchmarks and apply them to my environment?" Let's briefly examine a few of the benchmarks that have emerged for comparing messaging server performance. Then, I'll explain some limits of benchmarks and offer some tips for comparing test results.

A Look at Messaging Benchmarks
Some current messaging benchmark tools are Microsoft's Messaging API (MAPI) Messaging Benchmark (MMB), Bluecurve's Dynameasure Messaging Mark (DMM), and Lotus' NotesBench. These products have similarities and differences.

MMB. Microsoft developed the Exchange Server Load Simulator (LoadSim) tool early in Exchange Server 4.0 development to simulate loads against an Exchange server by mimicking MAPI calls. (For more information about LoadSim, see Greg Todd's four-part Windows NT Magazine series "Understanding and Using LoadSim 5.0," January, February, April, and May 1998.) Microsoft originally developed LoadSim as an internal tool. Later, however, Microsoft, consultants, and customers realized LoadSim's value for capacity planning. Although LoadSim isn't perfect, it lets you customize workloads and report metrics while simulating a load against an Exchange server. The LoadSim tool also lets you compare software releases and hardware platforms (either from competing vendors or within one vendor's product line).

Microsoft introduced MMB to shift the focus away from users per server so that customers won't automatically base their deployments on that metric. MMB is a benchmark workload based on the LoadSim workload profile (a set of messaging actions) for what Microsoft has determined to be a typical corporate email user (known as the Medium LoadSim User). The MMB metric is a number that represents the user transaction load attained during the benchmark run. In addition, MMB reports a 95th percentile response-time score (in milliseconds) for the test run.
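
If the percentile terminology is unfamiliar, the following minimal Python sketch shows how a 95th percentile response time can be derived from a set of transaction latencies. It illustrates the statistic only; it isn't LoadSim or Microsoft's auditing code, and the sample latencies are hypothetical.

  import math

  # Minimal sketch of the 95th percentile statistic; not LoadSim or MMB code.
  def percentile_95(latencies_ms):
      """Return the latency below which 95 percent of the samples fall."""
      ordered = sorted(latencies_ms)
      # Nearest-rank method: take the sample at the 95th-percentile position.
      rank = max(1, math.ceil(0.95 * len(ordered)))
      return ordered[rank - 1]

  # Hypothetical latencies, in milliseconds, gathered during a simulated run.
  samples = [120, 95, 310, 150, 240, 180, 90, 400, 130, 160]
  print(f"95th percentile response time: {percentile_95(samples)}ms")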

Microsoft audits and approves results from hardware vendors that want to publish MMB scores for their server platforms. Hardware vendors follow guidelines that Microsoft sets forth in its Exchange Server OEM Benchmarking Policy Guidelines document. Vendors document results and submit them in a specified format to the Exchange Server performance team. The team evaluates the submissions for accuracy and validates that the results are within Microsoft's criteria for a successful benchmark run. MMB and LoadSim have attained widespread acceptance in the Exchange Server space as the de facto standard for comparing Exchange Server performance. For more information about Microsoft's MMB, see http://www.microsoft.com/exchange/guide/perform_scale.asp.

DMM. Bluecurve's Dynameasure/Messaging, Professional Edition is a valuable capacity-planning and measurement tool. Dynameasure/Messaging uses Active Measurement technology to let you create sophisticated workload models and include elements such as Exchange Server topology, transaction mixes, and even variable Information Store (IS) sizes. You can vary these attributes over time (e.g., throughout a workday) and as your deployment grows. Dynameasure also lets you look at your end-to-end deployment by accounting for factors such as client and network loads and variable user behavior.

To give users and vendors a common reference for Exchange Server performance, Bluecurve has developed DMM, an add-on to Dynameasure. DMM offers a predefined workload profile or test suite that lets you compare messaging server performance for MAPI-based servers such as Exchange Server. DMM uses metrics such as transactions per second and average response time to report results. Bluecurve says the test contains a realistic mix of user transactions applied against a repository of messaging data and that it measures important and commonly used Exchange Server components. DMM provides a representative workload for Exchange Server. However, DMM and MMB vary in the work they produce and the methods they use to simulate the workload. DMM measures transactions per second and gives a response-time metric. MMB measures a user load combined with response time.

Overall, DMM is a promising benchmarking tool. DMM changes the focus from users per server to transactions per second, which is a better method of comparing performance. Although DMM has garnered Intel's support, top-tier Exchange Server vendors such as Compaq, HP, and IBM haven't endorsed the product. These vendors haven't published results based on DMM primarily because Microsoft and its customers haven't shown interest in DMM as a comparison standard for Exchange Server.

One disadvantage of DMM is that anyone who wants to publish its benchmarks must purchase Dynameasure/Messaging Professional. For more information about Bluecurve's Dynameasure/Messaging Professional and DMM, go to http://www.bluecurve.com.

IBM/Lotus NotesBench. Although Lotus NotesBench doesn't apply to Exchange Server, comparisons between Exchange Server and Lotus Notes or Lotus Domino confuse some people when vendors publish performance benchmarks. Like MMB, NotesBench reports a user transaction load and a response-time measurement. The NotesBench organization—a vendor consortium composed of representatives from Iris, Lotus, Compaq, Sun Microsystems, and other hardware vendors and independent test and tool agencies—promotes and supports NotesBench. The consortium meets quarterly to discuss the benchmark run and audit rules and to refine the benchmark to represent end-user workload profiles.

NotesBench is an independent organization, but its involvement with Lotus makes it somewhat biased. NotesBench's position is akin to using Microsoft's MMB for Exchange Server if Microsoft dictated not only the benchmark but also the run and audit process. Limited information about NotesBench is available in the public domain, because only NotesBench licensees can obtain the full details. However, you can obtain some information about NotesBench by registering online at http://www.notesbench.org. The consortium has defined benchmark test suites for specific workloads such as mail, mail/db, GroupWare, and replication server. I don't recommend comparing NotesBench results with results from MMB or DMM tests, because workloads are likely to be different and the comparisons inaccurate. (See "Tips for Using Benchmarks.") Lotus offers Server.Planner, a free planning tool that customers and partners can use to run comparisons of server loads on different platforms and configurations and to modify the workload depending on their requirements.

Other benchmarks. Other organizations are working on mail server benchmarks as part of their industry-standard suites. For example, the Standard Performance Evaluation Corporation (SPEC—http://www.spec.org) wants to develop a mail server benchmark for comparing Internet mail-based profiles (which ISP deployments use) for mail protocols such as POP3, SMTP, and Internet Message Access Protocol (IMAP). A benchmark like this one, however, would have little applicability in the corporate mail deployments that use Exchange's MAPI. As more organizations turn to applications such as Exchange Server as their messaging server, you can count on an increase in the number of organizations involved in performance and benchmarking activities for Exchange Server. When more benchmarking options are available, understanding how to interpret results and apply them in your organization will become paramount.

What the Benchmarks Don't Tell
Most capacity-simulation tools simulate canonical or customized workloads against a messaging server such as Exchange Server, but these tools don't produce identical workloads. For example, Bluecurve's DMM and Microsoft's MMB both attempt to simulate a typical corporate Exchange Server user in their respective workload profiles. Lotus' NotesBench attempts to simulate a typical Notes or Domino user. However, the Notes or Domino user and the Exchange Server user are different. The products use different protocols and APIs, so you can't compare one transaction mix to the other. Although MMB and DMM both attempt to simulate the same type of Exchange Server user, their methods for doing so differ. Because these benchmarks aren't the same under the covers, you need to consider more than the phenomenal users-per-server numbers that vendors proclaim.

Customer-deployable configurations. One consideration is whether companies perform benchmarks on customer-deployable configurations. A hardware vendor might publish a result of a test it conducted on a platform or configuration that customers would never use (and vendors wouldn't support). For example, many vendors publish benchmarks for disk subsystems configured as RAID 0 disk arrays. Although RAID 0 provides the highest level of disk subsystem performance, few customers use a RAID 0 system, because it fails to protect against data loss. A more realistic test would be to benchmark a RAID 5 system, because it provides the fault tolerance that typical deployments require. However, performance on a RAID 5 system isn't as good as that on a RAID 0 system, so most hardware vendors configure their test systems as RAID 0. To ensure more realistic results, the NotesBench consortium requires RAID 5 configuration for results published for Lotus Domino 4.6 and later.

In addition, many vendors conduct only single-server benchmark testing. Most messaging and collaboration application deployments aren't single-server deployments; therefore, single-server benchmarks aren't useful or valid for multiple-server deployments.

Disaster recovery and message store size. Because messaging and collaboration applications are quickly becoming mission-critical to most organizations, high availability and expedient disaster recovery are important. The need for high availability has created a dilemma for many implementers. Today's hardware platforms, combined with industry-leading software such as Exchange Server and Lotus Domino, can provide enterprise-class scalability, supporting thousands of users per server. However, many customers need their servers to be highly available, so they choose to deploy fewer users per server, because the IS grows as the user load increases and a larger IS takes longer to back up and restore. For example, if you have 1000 users on a server and allocate 30MB of mailbox storage to each user, the IS will undoubtedly exceed 30GB. In many instances, the storage requirements for the IS outweigh the I/O requirements, which benchmarks emphasize.

In addition, you need to consider working space for IS maintenance, upgrades, and other administrative activities. Often, additional storage requirements for IS administrative activities can double the user mailbox cumulative total. Suppose analysis of the I/O requirement shows that you need a disk subsystem of 12 drives. Analysis of the information storage requirements—including user mailbox size, single-instance storage ratio, deleted-item retention configuration, and IS maintenance—might reveal that the disk subsystem requires 24 drives. If you increase the user loads to 20,000 users per server, you can expect an IS larger than 1TB (assuming an average of 50MB per user), which might be beyond disaster-recovery capabilities.
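
The arithmetic behind this kind of sizing is simple enough to script. The following Python sketch reworks the 12-drive versus 24-drive example; the drive capacity, drive I/O rate, and per-user I/O figures are assumptions I've chosen for illustration, not vendor data, so substitute your own measurements.

  import math

  # Rough sizing sketch; drive capacity, drive I/O rate, and per-user I/O load
  # are hypothetical assumptions, not vendor data.
  users = 1000
  mailbox_quota_mb = 30        # per-user mailbox allocation
  maintenance_factor = 2.0     # working space for IS maintenance, upgrades, etc.
  drive_capacity_gb = 2.5      # assumed usable capacity per drive after RAID overhead
  drive_iops = 85              # assumed sustained I/Os per second per drive
  io_per_user = 1.0            # assumed peak I/Os per second generated per user

  # Drives needed to satisfy the I/O requirement alone.
  io_drives = math.ceil(users * io_per_user / drive_iops)

  # Drives needed to hold the IS plus administrative working space.
  store_gb = users * mailbox_quota_mb / 1024 * maintenance_factor
  capacity_drives = math.ceil(store_gb / drive_capacity_gb)

  print(f"IS plus working space: {store_gb:.0f}GB")
  print(f"Drives for I/O: {io_drives}; drives for capacity: {capacity_drives}")
  print(f"Configure the disk subsystem with {max(io_drives, capacity_drives)} drives")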

Deploying ISs larger than 1TB can be problematic. To avoid problems, you need to answer these important questions (a rough feasibility sketch follows the list):

  • Can the backup mechanisms in place meet the requirements within the current backup window (the time period available to perform backup activities) or, more important, the restore window?
  • Will the backup and restore times provided meet the IT department's service-level agreements?
  • Is the increased vulnerability to system outage (20,000 users unable to work vs. 2000 users) an acceptable risk?
  • How much time will the server be unavailable while a restore is in progress?
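
To put rough numbers behind the first two questions, here is a back-of-the-envelope Python sketch. The throughput figures are hypothetical; plug in the backup and restore rates you've measured and the service levels your IT department has agreed to.

  # Back-of-the-envelope check of backup and restore windows against IS size.
  # All throughput and window figures are hypothetical; measure your own hardware.
  is_size_gb = 1000            # a 1TB Information Store
  backup_rate_gb_per_hr = 25   # assumed sustained backup throughput
  restore_rate_gb_per_hr = 15  # restores usually run slower (assumed)
  backup_window_hr = 8         # nightly window available for backups
  restore_sla_hr = 12          # maximum acceptable outage per the SLA

  backup_hr = is_size_gb / backup_rate_gb_per_hr
  restore_hr = is_size_gb / restore_rate_gb_per_hr

  backup_ok = "fits" if backup_hr <= backup_window_hr else "exceeds"
  restore_ok = "meets" if restore_hr <= restore_sla_hr else "breaks"
  print(f"Backup: {backup_hr:.0f} hours ({backup_ok} the {backup_window_hr}-hour window)")
  print(f"Restore: {restore_hr:.0f} hours ({restore_ok} the {restore_sla_hr}-hour SLA)")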

Although many organizations want to deploy thousands of users per server, many opt for fewer users because of these concerns. Benchmarks don't account for these concerns or for other factors such as directory replication, interserver traffic, and peak loads. Therefore, although vendors need to publish benchmarking information, customers can't take these results as the last word on how to deploy the application in their organizations.

Tips for Using Benchmarks
The most important aspect of benchmark interpretation is understanding how to compare different results. This principle applies not only to messaging and collaboration applications but also to standard benchmarks for other applications, such as SPECmarks (for comparing processor/system performance), WinMarks (for measuring Windows performance of hardware components such as video cards), and TPC-C (for comparing performance of online transaction processing—OLTP—applications). Tables 1 and 2 provide an example of how benchmark results can be misleading if you don't know how to properly compare and interpret them. Here are some tips to help you understand test results.

Compare only results based on similar methodologies and user profiles. You can't compare runners to weight lifters in the Olympics. Similarly, you can't compare the popular TPC-C benchmark for OLTP applications to the TPC-D benchmark for decision-support applications, or NotesBench results to LoadSim/MMB results. The methodologies and user profiles in each of these benchmarks are different. Lotus doesn't even disclose methodologies and user profiles for NotesBench. Compare results only when you are sure that a user from result A is similar to a user from result B.

Compare only results based on similar OS architectures. You can't compare the race times of an alcohol-fueled dragster against those of a car that runs on diesel. The comparison isn't fair: alcohol burns hotter than diesel, so the diesel-fueled car can't compete. Likewise, you can't compare benchmarks from a test run on NT to those from a test run on Novell NetWare or UNIX.

For example, a look behind the results in Tables 1 and 2 reveals that testers ran the Exchange Server benchmarks on NT in native 32-bit mode with access to 4GB of RAM. Testers achieved the IBM/Lotus 27,030-user result, however, on an IBM OS/400 system in native mode with access to 40GB of RAM (yes, 40!). Testers reached the IBM/Lotus result using partitioned Domino server instances running on a single server platform; that is, multiple instances of the Domino server were running on the server, and each instance supported a subset of users. If you divide the 27,030 users among the 30 Domino server instances used in the benchmark, you get about 900 users per partitioned server. In its implementation data, IBM/Lotus doesn't recommend running more than six partitioned Domino server instances in a production environment (for Domino 4.62). Therefore, this benchmark is an unfair comparison not only between NT and OS/400 but also between Exchange Server and Domino. For more information about Exchange Server performance, go to http://www.microsoft.com/exchange/guide/perform_scale.asp.

Compare only results based on similar hardware configurations. Although end users can compare results from different vendors, comparing dissimilar configurations isn't appropriate. Consider, for example, the 15,000 MMB result from Compaq and the 16,000 MMB result from HP in Table 1. On a seemingly comparable server platform, the HP configuration appears to support a load 1000 MMB higher than Compaq's. Digging deeper, I learned that HP used a 24-drive disk subsystem, whereas Compaq achieved its result with only 18 drives. In this example, the additional I/O capacity of six more drives easily accounts for the 1000 MMB difference. Compaq achieved its 19,000 MMB result using a disk subsystem comparable to HP's but with faster processors. Differences in hardware configurations that directly affect performance occur in most published results.
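
A quick way to see through this kind of difference is to normalize each published score by the resource that differs, in this case the drive count. The short Python sketch below applies that idea to the scores just discussed; note that the 24-drive count for Compaq's 19,000 MMB result is my assumption based on the phrase "comparable to HP's," not a published figure.

  # Normalize published MMB scores by drive count to expose configuration
  # differences. The 24-drive count for the 19,000 MMB result is an assumption.
  results = [
      ("Compaq, 18 drives", 15000, 18),
      ("HP, 24 drives", 16000, 24),
      ("Compaq, faster CPUs, 24 drives (assumed)", 19000, 24),
  ]

  for label, mmb, drives in results:
      print(f"{label}: {mmb} MMB, about {mmb / drives:.0f} MMB per drive")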

Another example of this problem is comparing results from single-processor configurations to results from multiprocessor configurations, or results from Compaq Alpha processors to results from Intel processors. You need to compare results from similar hardware configurations and, more important, from configurations you might deploy in your environment.

Consider the price/performance ratio when comparing results. Some companies achieve benchmark results touting tens of thousands of transactions per second or users per server by using gargantuan hardware configurations. For example, companies used more than 1000 disk drives to achieve many of the current Transaction Processing Performance Council (TPC) benchmark results. When comparing results, end users need to consider not only the practicality of deploying these configurations but also the relative price/performance ratio. For example, although the IBM/Lotus benchmark reached 27,030 users, the cost of the hardware configuration (and the OS) far exceeds that of the Compaq and HP results for Exchange Server. The question to ask is whether you want a 20 percent performance benefit at 300 percent higher cost.
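
The same normalization works for cost. The brief Python sketch below uses entirely hypothetical scores and prices that mirror the 20 percent/300 percent trade-off; substitute real quotes and real scores for the configurations you would actually deploy.

  # Price/performance comparison for two hypothetical published results. The
  # scores and prices are placeholders chosen to mirror the 20 percent
  # performance / 300 percent cost example in the text.
  configs = [
      ("Platform A", 10000, 100000),
      ("Platform B", 12000, 400000),  # 20 percent faster, 300 percent costlier
  ]

  for label, score, cost in configs:
      print(f"{label}: score {score}, cost ${cost:,}, ${cost / score:.2f} per unit of work")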

Watch Out for Hype
Don't get caught up in the marketing hype that surrounds published benchmark results. Be sure that you're comparing apples to apples and interpreting messaging server performance benchmarks correctly. Although properly comparing and interpreting benchmark results requires diligence and a little investigation, you'll be a better-informed user if you make the effort.
