The Windows 2000 Magazine Lab conducted benchmark tests of the Compaq ProLiant 8000, a server with eight 550MHz Pentium III Xeon processors, to determine its CPU scalability under Windows NT Server, Enterprise Edition in various application environments. We report our conclusions in the Windows 2000 Magazine Lab Report, "ProLiant 8000," March 2000. Our testing and tuning procedures using Microsoft SQL Server 7.0 Enterprise Edition and Microsoft Internet Information Server (IIS) 4.0 workloads follow.
We configured the test system with 4GB of RAM, a Compaq 4250ES RAID controller, twenty-one 9GB Compaq 10,000rpm Ultra-2 LVD disk drives (model BD00911934), and two Compaq model 3131 dual-port Fast Ethernet controllers. We left the 32MB cache that is standard with the 4250ES array controller at its default allocation of 50 percent read cache and 50 percent write-back cache. We defined a 4GB NTFS boot volume (C drive) on a RAID 0 array comprising the first three drives on SCSI channel 1. Then we defined the balance of that RAID 0 array, approximately 22GB, as a second volume (E drive), and we placed a 4GB paging file on the array.
We combined three more drives on SCSI channel 1 into another RAID 0 array and allocated the array’s full 26GB as H drive for database log files. We combined 14 drives on SCSI channels 2 and 3 into another RAID 0 array, which we allocated as a 121GB F volume and used as the primary database location. We reserved the last of the 21 drives as a spare.
We configured all RAID 0 arrays with a 128KB stripe size and formatted logical drive C with NT’s default 512-byte allocation unit. We also formatted logical drives E, F, and H as NTFS partitions with a 16KB allocation unit.
To compare the relative scalability of the Profusion architecture with the older 4-way Intel 450NX PCIset architecture, we ran a series of tests with a Compaq ProLiant 7000 system, which we configured much like the ProLiant 8000. Like the ProLiant 8000, the ProLiant 7000 system had 4GB of RAM and a three-channel RAID controller (Compaq model 3100ES). The ProLiant 7000 had eighteen 9GB Compaq 10,000rpm Wide Ultra SCSI3 disk drives (Compaq model HD0093172C) and two Compaq model NC3122 dual-port Fast Ethernet controllers. We combined the first three drives on SCSI channel 1 to create a RAID 0 array with two logical volumes. Logical drive C was a 4GB boot drive. The balance of the array, approximately 22GB, was logical drive E and contained a 4GB paging file. We combined the other three drives on SCSI channel 1 to create another RAID 0 array, which we allocated as a 26GB drive H and used for database logging. We used the 12 drives on SCSI channels 2 and 3 to create a third RAID 0 array of 101GB, which we fully allocated to NT as drive F and used for primary database data storage. As on the ProLiant 8000, we configured each of the ProLiant 7000’s arrays with a 128KB stripe size and formatted all volumes for NTFS. We formatted drive C, the boot drive, with a 512-byte allocation unit and formatted the other volumes with a 16KB allocation unit.
SQL Server 7.0 Testing—OLTP Workload
The first series of benchmark runs tested CPU scalability for an online transaction processing (OLTP) workload under SQL Server 7.0 Enterprise Edition. Using Compaq’s SmartStart installation process, we installed NT 4.0, Enterprise Edition with Service Pack 5, Microsoft Internet Explorer (IE) 5.0, and SQL Server 7.0 Enterprise Edition with a prerelease version of SP2. Using Microsoft’s data generation tool for the Transaction Processing Performance Council’s TPC-C benchmark, we created an 800-warehouse set of databases for use with a transaction-processing class benchmark test that Client Server Solutions prepared to use with Benchmark Factory 1.5 in the Lab. We generated the databases on the large RAID 0 array of both the 8-way ProLiant 8000 and the 4-way ProLiant 7000. We used SQL Server 7.0 Enterprise Manager to create a disk-based backup of the database immediately after database generation completed. After selecting the TPC-C database in Enterprise Manager, we used the Backup option from the Tools menu to create a backup dataset. SQL Server 7.0 Enterprise Manager reported a database file size of 12,008MB with 11,502MB of space available after we ran the database restore. The difference, 506MB, was the amount of data generated for the test database. With NT 4.0, Enterprise Edition booted with the /3GB switch on the 4GB system, 3GB of system memory was available to SQL Server. Because the test dataset size was about 0.5GB, it easily fit into the SQL Server cache and let the database service read requests from the cache without requiring a disk read.
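The sizing argument above is simple arithmetic, and a short Python sketch makes it explicit. The two Enterprise Manager figures come from the text; treating the 3GB available to SQL Server as 3 x 1024MB is an assumption made only for this illustration.

```python
# Figures reported by Enterprise Manager after the restore (from the text).
DB_FILE_MB = 12_008
FREE_MB = 11_502

# Assumption for illustration: the 3GB available to SQL Server is 3 * 1024 MB.
SQL_ADDRESSABLE_MB = 3 * 1024

# Data actually generated for the test database.
data_mb = DB_FILE_MB - FREE_MB

# True when the whole dataset can sit in memory, so reads need no disk I/O.
fits_in_cache = data_mb < SQL_ADDRESSABLE_MB

print(f"data: {data_mb}MB, fits in cache: {fits_in_cache}")
```

The dataset occupies roughly a sixth of the memory available to SQL Server, which is why the workload stresses the CPUs rather than the disks.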
Because our intent was to test CPU scalability, we tuned the SQL Server 7.0 database to minimize I/O during the test and maximize stress on the system CPUs. We always applied tuning parameter changes after we restored the database, then backed up the database to ensure that the database-related tuning changes were restored with the database between test iterations. Listing 1 shows the SQL Server 7.0 tuning parameters we used.
We ran each test once for each allowable group of processors in the system we were testing. In this context, a test is a series of 10 iterations, each generating a different workload, from 100 to 1000 simulated users in 100-user increments. We selected this set of user loads and a 270ms think time after we ran several calibration tests. We hoped to achieve maximum throughput for the eight-processor test at a workload less than the 1000-user maximum and to achieve the maximum throughput for the single-processor test at a workload greater than the 100-user minimum. Test clients used standard SQL Server security through the system administrator (sa) account to authenticate access to the tested SQL Server machine.
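The test matrix described above can be sketched in a few lines of Python. The user loads and think time come from the text; the processor groups passed in at the bottom are hypothetical, because the article says only "each allowable group of processors," and the function and field names are invented for this illustration.

```python
from itertools import product

# From the text: 100 to 1000 simulated users in 100-user increments.
USER_LOADS = list(range(100, 1001, 100))
THINK_TIME_MS = 270


def build_schedule(processor_groups):
    """Return one run description per (processor group, user load) pair."""
    return [
        {"cpus": cpus, "users": users, "think_time_ms": THINK_TIME_MS}
        for cpus, users in product(processor_groups, USER_LOADS)
    ]


# Hypothetical processor groups, chosen only to show the shape of the matrix.
schedule = build_schedule([1, 2, 4, 8])
print(len(USER_LOADS), "iterations per test,", len(schedule), "runs in total")
```

Each entry corresponds to one benchmark run, so the total run count grows with the number of processor groups tested.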
We initialized the server to a known state before testing each group of 100 simulated users. To initialize, we restored the database, rebooted the system, and ran a SQL query that loaded the entire database into SQL cache in system memory. We ran the load cache query a second time; the query ran much faster and didn’t cause disk activity, indicating that the data was cached.
The benchmark test at each level of simulated users comprised 30 seconds of quiet time and 30 seconds of ramp-up time, followed by 8.5 minutes of execution time (during which test results were recorded), followed by 30 seconds of ramp-down time. The test took a total of 10 minutes. Each simulated user generated test transactions at intervals drawn from a negative exponential distribution with a mean interval of 270ms.
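As an illustration of the timing described above, the following Python sketch adds up the four phases and draws think times from a negative exponential distribution with a 270ms mean. It is not Benchmark Factory's implementation; the names are invented for this example, and the fixed seed is there only to make the sample repeatable.

```python
import random

# Phase durations in seconds, from the text (8.5 minutes = 510 seconds).
PHASES_S = {"quiet": 30, "ramp_up": 30, "execution": 510, "ramp_down": 30}
MEAN_THINK_S = 0.270


def total_minutes(phases):
    """Total length of one run in minutes."""
    return sum(phases.values()) / 60


def think_times(n, mean_s=MEAN_THINK_S, seed=None):
    """Draw n think times from a negative exponential distribution."""
    rng = random.Random(seed)
    # expovariate takes the rate (lambda), which is 1 / mean.
    return [rng.expovariate(1.0 / mean_s) for _ in range(n)]


samples = think_times(100_000, seed=1)
print("run length (minutes):", total_minutes(PHASES_S))
print("observed mean think time (s):", sum(samples) / len(samples))
```

An exponential distribution produces many short intervals and occasional long ones, so simulated users don't fire transactions in lockstep even though they share the same mean.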
A Cisco Catalyst 5000 100Mbps Ethernet switch is the heart of our testing network. We’ve segmented the switch into multiple virtual networks comprising four groups of eight switched ports. On our testing network we installed 48 load-generating client computers divided evenly among six Ethernet collision domains and three TCP/IP networks. Each client computer is based on a 350MHz AMD K6-2 processor with 64MB of RAM and 4GB of disk storage, and each computer contains an Adaptec model 62011 Fast Ethernet NIC. Each group of eight clients connects to a 100Mbps Ethernet hub defining one collision domain, which in turn connects to one of the switch ports. Two hubs connect to each of three switch segments. The fourth switch segment didn’t actively participate in this test. Each server we tested was multihomed and connected directly to the switch on each of the four network segments. The server connections were set to 100Mbps, half duplex.
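The client layout above reduces to a small calculation, sketched here in Python. The counts come from the text; the zero-based client, hub, and segment numbering is hypothetical and exists only to show how the clients divide up.

```python
CLIENTS = 48
HUBS = 6             # one collision domain per hub
ACTIVE_SEGMENTS = 3  # the fourth switch segment was idle in this test

clients_per_hub = CLIENTS // HUBS
hubs_per_segment = HUBS // ACTIVE_SEGMENTS
clients_per_segment = clients_per_hub * hubs_per_segment

# Map each (hypothetically numbered) client to its hub and switch segment.
layout = {
    client: {
        "hub": client // clients_per_hub,
        "segment": client // clients_per_segment,
    }
    for client in range(CLIENTS)
}

print(clients_per_hub, "clients per hub,", clients_per_segment, "per segment")
```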
We selected one multihomed computer connected to all four network segments to perform Performance Monitor checking. During the course of each test we recorded network utilization as observed by one load-generating client on each collision domain (i.e., hub) and recorded that client’s memory and CPU utilization. Similarly, we monitored the server under test for memory utilization and for CPU utilization on all processors. During the tuning and calibration phases of our testing we also monitored disk performance and SQL Server statistics. We didn’t install Network Monitor, and we disabled disk parameter monitoring (diskperf) for the final runs because both programs can affect test results.
IIS and Simple ASP Testing
We used IIS 4.0 for a second set of CPU scalability tests. From the NT 4.0 Option Pack, we installed IIS on the ProLiant 7000 and ProLiant 8000, then reinstalled NT 4.0 Service Pack 5 (SP5). In both cases, we placed WWWROOT on the large F drive, and we tuned both systems identically. The benchmark test relied on the Webstone.asp module, which Client Server Solutions provides with its Benchmark Factory product. We copied the module to a wwwroot\Webstone directory on each server.
Webstone.asp generates a specified number of random characters and sends them to the client Web browser. The server does most of the work with the .asp module and doesn’t perform disk I/O. With this test, we intended to push CPU utilization to 100 percent without a potential disk I/O bottleneck. We created a test transaction in Benchmark Factory by using the Webstone.asp module to generate 1000 random characters for display at the client browser. Our initial runs quickly hit a bottleneck that limited total CPU utilization to about 40 percent. The system needed some tuning.
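The article doesn't reproduce Webstone.asp, so the following is only a Python stand-in for the kind of work such a page does: build a response body of a requested length from random characters, using CPU time and no disk I/O. The function name and the choice of alphabet are assumptions made for this sketch.

```python
import random
import string

# Assumed alphabet; the article does not say which characters the page emits.
ALPHABET = string.ascii_letters + string.digits


def random_page(n_chars, seed=None):
    """Return a string of n_chars randomly chosen characters."""
    rng = random.Random(seed)
    return "".join(rng.choice(ALPHABET) for _ in range(n_chars))


# The test transaction in the article requested 1000 characters per page.
body = random_page(1000)
print(len(body), "characters generated")
```

Because every request is computed from scratch, throughput on a page like this depends on processor time and thread scheduling rather than on the disk subsystem.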
Microsoft TechNet includes a useful technical note, "Internet Information Server 4.0 Tuning Parameters for High-Volume Sites," which you can view at http://www.microsoft.com/ntserver/zipdocs/tuning.exe.
Following Microsoft’s guidelines, we made several tuning changes and used Performance Monitor and Benchmark Factory to monitor the effect of each change. As we worked, we discovered we had already made some of the recommended tuning changes. We set the Server service to Maximize Throughput for Network Applications. Using IIS Administrator, we disabled logging, turned on Unlimited Connections, and set the Performance Tuning tab option to More than 100,000 hits. Microsoft suggested additional tuning, including removing file extension mappings other than .asp and setting the TCP parameter MaxUserPort to 0xfffe and the TcpWindowSize parameter to 0x4470. None of these tuning changes greatly affected CPU utilization or overall transaction throughput. However, Microsoft also recommended some tuning changes specific to Active Server Pages (ASP). First, we set Enable Buffering under our Web site’s Properties page. The ASP Registry parameter ProcessorThreadMax limits the number of threads per CPU that IIS will allocate, whereas the parameter AspScriptEngineCacheMax lets each ASP thread cache a script engine, so ASPs process more efficiently. We found that setting ProcessorThreadMax to 1 and AspScriptEngineCacheMax to 8 (on the 8-way) and 4 (on the 4-way) dramatically improved transaction rates at each client-load level and let CPU utilization rise to 99.5 percent at the point of maximum throughput. Performance Monitor showed why: On the ProLiant 8000, the System object’s context switches per second performance counter (a measure of how frequently a processor begins to execute a different thread) dropped from over 60,000 per second to just over 12,000 per second. Because we obtained full CPU utilization, which was the goal of our tuning efforts, we were satisfied with the results, and we performed the test without further tuning.
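To make the numbers above easier to read, this Python sketch decodes the two hexadecimal TCP values and collects the ASP settings we used. It only restates values from the text and doesn't touch the Registry. The helper name is invented, tying AspScriptEngineCacheMax to the CPU count is a generalization from the two systems tested, and the context-switch figures are rounded from "over 60,000" and "just over 12,000."

```python
# TCP parameters from the text, written as the hexadecimal values quoted there.
TCP_PARAMS = {"MaxUserPort": 0xFFFE, "TcpWindowSize": 0x4470}


def asp_settings(cpu_count):
    """ASP values from the text: 8 on the 8-way system, 4 on the 4-way.

    Assumption: the cache value tracks the CPU count on other systems too.
    """
    return {"ProcessorThreadMax": 1, "AspScriptEngineCacheMax": cpu_count}


for name, value in TCP_PARAMS.items():
    print(f"{name} = {value:#x} = {value}")

# 0x4470 is 17520 bytes, a whole multiple of the 1460-byte Ethernet TCP MSS.
segments_per_window = TCP_PARAMS["TcpWindowSize"] // 1460

# Rounded context-switch rates before and after the ASP tuning.
before, after = 60_000, 12_000
print(f"context switches cut by about {1 - after / before:.0%}")
print(asp_settings(8), asp_settings(4))
```

The drop in context switches is the useful signal here: with one ASP thread per processor, the CPUs spend their time running script code instead of swapping threads.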
Performance tuning isn’t always easy but is often worth the effort. Having a way to test with a repeatable workload that is representative of your production workload is a key to successful tuning. A product such as Client Server Solutions’ Benchmark Factory simplifies this part of the task. Knowing the factors that can affect performance—and that you can tune—is the other half of the equation. The Microsoft white paper, "Windows 2000 Performance and Benchmarking," provides a relatively succinct overview of performance tuning factors, most of which apply to NT 4.0. The public version of this document should be available by the time this article publishes. If you can’t find the document, send me an email message ([email protected]) with "Win2000 Performance White Paper" in the subject line, and I’ll let you know if and where the document is available.