
SQL Server's Trace and Replay Tool

Identify performance problems before a SQL Server upgrade and hardware migration

You finally received the approval to upgrade SQL Server and buy a new server. As you think about all the new SQL Server features that you want to take advantage of and the benefits of having new hardware, your boss stops by. She asks, “Now that we can spend the money on this project, how are you going to be certain that we won’t have any issues? Are you sure that things will run faster on the new server? How will you know?” As your boss walks away, you break out in a cold sweat at the thought of the upgrade failing or performing poorly. “There must be an easy, low-cost way to predict performance,” you say to yourself as you start searching for a tool. The good news is that there is such a tool: SQL Server Profiler’s replayable trace functionality.

As a consultant, I’m often asked to assist with SQL Server upgrades and hardware migrations. I’ve found that many IT shops just assume that a new version of SQL Server will perform better than the old one. What they fail to realize is that they might be running legacy code that isn’t efficient or won’t benefit from the new functionality in the new SQL Server version. In the case of a migration to new hardware, many system administrators will configure a new server the same as the old one, which can perpetuate poor server response. Unfortunately, the poor response isn’t realized until a load is placed on the new server. This often happens the day that users are allowed to access the new server. Failing to test new versions of SQL Server or new hardware can result in unpleasant surprises after the upgrade or migration.

Using Profiler’s replayable trace functionality (i.e., the trace replay template), you can test query compatibility and performance in a new SQL Server version or test a workload on new hardware. You can also use it to find T-SQL code that needs to be optimized or disks that need to be reconfigured. You can even use it as an inexpensive stress tester. Although there are other tools you can use for these tasks (e.g., RML Utilities, SQLIO, SQL Server Migration Assistant), they might not be immediately available for use on a new version of SQL Server. This isn’t an issue with the replayable trace functionality because it’s built into SQL Server. So, I’ll concentrate on how to use this powerful yet flexible tool. I’ll explain how the replayable trace process works, describe possible uses for it, and cover some “gotchas” that you need to be aware of.

The Process

Figure 1 outlines the replayable trace process in a graphical workflow. As you can see, you need to perform a backup, capture the workload, move the backup and trace files to the test server, perform the control run, perform a test run, and compare the performance data from those runs.

Figure 1: Understanding the replayable trace process

Performing the backup. You need to back up the database (or databases) from which you want to capture a workload. You can use any method as long as it creates a backup that you can use to restore the database back to the point right before the workload was captured. For example, you can perform a full backup right before capturing the workload or bring the last nightly full backup up-to-date with transaction log backups.
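
For example, a minimal backup sequence might look like the following. The database name and file paths are placeholders for your own environment; adjust them to match whatever backup strategy you already use.

  -- Full backup taken immediately before the workload capture
  BACKUP DATABASE SalesDB
  TO DISK = N'D:\Backups\SalesDB_full.bak'
  WITH INIT;

  -- Or bring last night's full backup current with a transaction log backup
  BACKUP LOG SalesDB
  TO DISK = N'D:\Backups\SalesDB_log.trn'
  WITH INIT;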

Capturing the workload. You need to capture the workload by collecting trace data. Because there are differences in the trace script generated by newer versions of SQL Server, it’s best to use the server-side trace script generated by your production SQL Server instance. For example, if your production server is running SQL Server 2005, then you should use a SQL Server 2005-generated trace script. Similarly, if your production server is running SQL Server 2008, then you should use a SQL Server 2008-generated trace script.
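
To give you a sense of what such a script contains, here’s a heavily abbreviated sketch. The real script that Profiler generates from the TSQL_Replay template defines many more events and columns, so use the generated script rather than hand-rolling one; the file path here is a placeholder.

  DECLARE @TraceID int, @maxfilesize bigint, @on bit;
  SET @maxfilesize = 4096;
  SET @on = 1;

  -- Create the trace (option 2 = enable file rollover)
  EXEC sp_trace_create @TraceID OUTPUT, 2, N'D:\Traces\ReplayCapture', @maxfilesize, NULL;

  -- The generated script registers every event and column the replay engine needs;
  -- event 10 (RPC:Completed), column 1 (TextData) is shown purely as an example
  EXEC sp_trace_setevent @TraceID, 10, 1, @on;

  -- Start the trace (1 = start, 0 = stop, 2 = close and delete the definition)
  EXEC sp_trace_setstatus @TraceID, 1;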

Moving the files. After the backups and trace files are created, you need to move them to the test server. The test server should be running the same software as the production server, including the same versions of SQL Server and Windows.

Performing the control run. On the test server, you need to first restore the database using the backup files. Then, you need to replay the workload from the trace files. This is referred to as the control run.
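
A minimal restore might look like this; the database name and file paths are placeholders, and you might need WITH MOVE clauses if the test server’s drive layout differs from production. Once the database is restored, open the captured trace file in Profiler and start the replay against the test instance.

  -- Restore the full backup, then roll forward the log backup taken before the capture
  RESTORE DATABASE SalesDB
  FROM DISK = N'D:\Backups\SalesDB_full.bak'
  WITH NORECOVERY, REPLACE;

  RESTORE LOG SalesDB
  FROM DISK = N'D:\Backups\SalesDB_log.trn'
  WITH RECOVERY;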

During the control run, you need to capture performance data. Ideally, you should capture this data from a remote client because collecting performance data on the same server that’s running the workload can skew the server’s performance. The performance data captured can be as simple as a SQL trace. Here are some SQL trace events that are often collected:

  • Cursors
  • Database
  • Lock:Cancel and deadlock-related events
  • Errors and warnings
  • Scans
  • Stored procedures (completed events)
  • Transactions
  • T-SQL (completed events)
  • Deprecation events (especially when upgrading)

Be aware that collecting all stored procedure and T-SQL completed events can generate a significant amount of trace data. To avoid data overload, you can set a duration filter of 200ms so that only those statements that are most likely to be using a significant amount of resources are captured for analysis. This is especially important if you’re collecting performance data on the same server that’s running the workload.
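
If you’re capturing the performance trace with a server-side script, a sketch of that filter might look like the following. Note that SQL Server 2005 and later record Duration in microseconds at the trace level (Profiler displays milliseconds), so 200ms is expressed as 200,000; the trace ID value is a placeholder for the ID returned by sp_trace_create.

  DECLARE @TraceID int, @duration bigint;
  SET @TraceID = 2;        -- placeholder: the ID of the performance-capture trace
  SET @duration = 200000;  -- 200ms, expressed in microseconds
  -- Column 13 = Duration, logical operator 0 = AND, comparison operator 4 = greater than or equal
  EXEC sp_trace_setfilter @TraceID, 13, 0, 4, @duration;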

You can also use Performance Monitor counters to collect performance data. For example, when you want to measure the workload and assess processor utilization, disk response, and memory pressure, you can use the counters listed in Table 1.

Table 1: Commonly Used Performance Monitor Counters

There are a lot more counters and events that you can use. What I’ve listed is simply a starting point. You should use the events and counters that give you the information you need about your system.

Performing the test run. After the control run is complete, it’s time to prepare for and perform a test run. The specifics of the test run will vary, depending on what is being changed in the new environment. Suppose that you’re upgrading to a new version of SQL Server. In this case, you must use the same equipment for the test run (and any subsequent test runs) that you used for the control run. That way, you’ll be making an apples-to-apples comparison. On the test server, install the new version of SQL Server, Windows, and any other software needed in the new environment. Once that software is installed, replay the workload again and capture the same performance data.

Comparing the performance data. After you have performance data from the control and test runs, you need to analyze it. I like to import SQL traces into tables and clean up the queries so that they can be grouped. When queries are grouped, it’s easy to compare and drill into the T-SQL statements that are most in need of optimization. For example, I was part of a team that helped a client upgrade from SQL Server 2000 to SQL Server 2008. The client wanted to know what code needed to be changed to run on SQL Server 2008. So, we provided the client with information that indicated which stored procedures and statements needed to be changed and which didn’t need any modifications, as Figure 2 shows. As you can see in Figure 2, the usp_proc1 stored procedure had a severe degradation in performance, but the usp_proc10 and usp_proc4 stored procedures actually experienced a performance boost from SQL Server 2008. Drilling further into the data revealed the specific statements inside the stored procedures that needed to be reworked.

Figure 2: Comparing queries’ performance data in a table
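
As a rough sketch of that import-and-group approach, you can load the performance trace (not the workload trace you intend to replay; see The Gotchas) into a table with fn_trace_gettable and then aggregate the cleansed statements. The table and file names are placeholders.

  -- Load the control run's performance trace into a table
  SELECT *
  INTO dbo.ControlRunTrace
  FROM sys.fn_trace_gettable(N'D:\Traces\ControlRun_perf.trc', DEFAULT);

  -- After cleansing parameters from TextData, group to find the heaviest statements
  SELECT CAST(TextData AS nvarchar(4000)) AS QueryText,
         COUNT(*)    AS Executions,
         SUM(Reads)  AS TotalReads,
         SUM(Writes) AS TotalWrites,
         SUM(CPU)    AS TotalCPU
  FROM dbo.ControlRunTrace
  WHERE EventClass IN (10, 12)   -- RPC:Completed and SQL:BatchCompleted
  GROUP BY CAST(TextData AS nvarchar(4000))
  ORDER BY TotalCPU DESC;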

I also like to import Performance Monitor data into tables, where read, write, and CPU metrics can be summed. When comparing workloads, using graphs is a great way to make results easier to digest, especially for management. For example, the graph in Figure 3 compares the physical disk reads, transfers, and writes of a typical workload and a heavy workload. As you can see in Figure 3, the disk subsystem experienced dangerously high response times for reads during the test of the heavy workload, so the disk subsystem is a cause for concern under a heavy workload.

Figure 3: Comparing workloads’ performance data in a graph
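
As a sketch of that summarization, assuming the counter log has been imported into a table named dbo.ControlRunCounters with CounterName and CounterValue columns (for example, via the relog utility or a flat-file import), the run can be rolled up like this:

  -- Summarize the imported Performance Monitor samples per counter
  SELECT CounterName,
         COUNT(*)          AS Samples,
         AVG(CounterValue) AS AvgValue,
         MAX(CounterValue) AS MaxValue
  FROM dbo.ControlRunCounters
  GROUP BY CounterName
  ORDER BY CounterName;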

Based on what you find in the comparison, you might need to make changes (e.g., rewrite stored procedures, reconfigure hardware) and perform another test run. This can be done as many times as needed because you can continually restore the database and replay the workload. After you’ve determined what you need to change to make the SQL Server upgrade or hardware migration a success, you can prepare to make those changes in the new environment.

The Possibilities

You can use the replayable trace functionality for more than just testing a new SQL Server version or new hardware. You can use it to test changes you want to make in the existing SQL Server environment. For example, you can test changes you want to make to database IDs and logins. By testing logins, you can make sure that users and groups have appropriate permissions. If you’re not concerned with permissions, you can change the workload so that it uses only one login that has elevated database permissions. That way, you don’t spend time chasing errors that should have never been generated. You can even test changes you want to make to T-SQL statements and other T-SQL code.

You can also create a workload that can be used for stress testing. I recently worked on a project in which the workload had to be expanded to seven times the current load to mimic the load experienced during peak usage at certain times of the year. This was accomplished by capturing multiple workloads from different times of the day. These workloads were then combined by changing the times of the transactions so that all the workloads occurred around the same time. We had great success with the stress testing and used it to prove that a hardware vendor had misconfigured a SAN for this client.

The Gotchas

Like most other tools, replayable traces have some areas that can trip you up if you aren’t careful. I’ve encountered gotchas in all stages of the replayable trace process. When you’re first starting the process, a common gotcha is to let too much time lapse between the database backup and the workload trace. You must start the workload trace immediately after the database backup to ensure that data is in a state that will allow the captured transactions to run. For example, suppose that a backup is taken at 12 p.m. to use for replay and a new customer is added at 12:05 p.m., but no one starts collecting the workload until 12:30 p.m. On replay, the absence of the new customer in the backup will cause any T-SQL code that references the new customer in the workload to fail because the backup happened before the customer was added. The failure would, in turn, skew the replay performance analysis.

When setting up the control run in the test environment, you need to be aware of two common gotchas involving database IDs and logins. The database IDs in the test server must be the same as those in the production server. In addition, production logins need to be transferred to the test server. The only time you don’t have to transfer production logins is when you’re altering the workload to use a single login (like sa) with elevated database permissions.
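
A quick way to verify the first point on SQL Server 2005 and later is to run the following query on both the production and test servers and confirm that the database_id values line up:

  -- Run on both servers; the database_id values must match for the replay to work
  SELECT name, database_id
  FROM sys.databases
  ORDER BY database_id;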

When you’re testing an upgrade to a new version of SQL Server, a common mistake is not updating the statistics or rebuilding the indexes. Both actions are necessary for a successful performance comparison.
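
For example, after restoring the database on the upgraded instance, you might run something along these lines. The table name is a placeholder; in practice you’d rebuild the indexes on every user table (or script the whole database) and then refresh statistics.

  -- Rebuild all indexes on a table (repeat or script for each user table)
  ALTER INDEX ALL ON dbo.Orders REBUILD;

  -- Then update statistics across the database
  EXEC sp_updatestats;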

If you’re loading a captured workload into a table, you need to watch out for this gotcha: In test environments running SQL Server 2005 or later, you must use Profiler to load the captured workload into a table. Loading it with the fn_trace_gettable system function can result in errors when replaying the workload.

When replaying the workload in the test environment, expect the test server to be heavily utilized by the replay of the workload. This can cause other users of the test environment to be unable to work and might require that you reserve a time slot in the test environment dedicated only to replay.

When evaluating performance data, you need to be aware of two common gotchas. First, you should avoid using the duration counter in the SQL trace because it will produce inconsistent values. Second, you need to cleanse parameters from queries so that queries can be grouped. For example, you can substitute {Str} for string parameters, {#} for numeric parameters, and {MDY} for datetime parameters. You can find more information about cleansing parameters in “Troubleshooting and Analysis with Traces” on TechNet, an excerpt from the book Inside Microsoft SQL Server 2005: Query Tuning and Optimization (Microsoft Press, 2007) by Kalen Delaney, Sunil Agarwal, Craig Freedman, Ron Talmage, and Adam Machanic. Another book you might want to check out is Inside SQL Server 2005: T-SQL Querying (Microsoft Press, 2006) by Itzik Ben-Gan and Dejan Sarka, which provides a CLR user-defined function (UDF) that cleanses parameters. You can also find more information about the gotchas and the replayable trace process in a free three-part video series on replayable traces.

The End Result

Profiler’s replayable trace functionality is a powerful yet flexible tool for comparing and analyzing performance between different SQL Server environments. In a SQL Server upgrade and hardware migration, it can help you identify and fix performance problems before that big event. As a result, you and your managers will know what to expect when the new environment goes online.
