Use Xperf's Wait Analysis for Application-Performance Troubleshooting

Xperf has been a common theme in my recent articles, for very good reason. The tool's capabilities, coupled with the Windows operating system's built-in Event Tracing for Windows (ETW), are seemingly limitless.

Imagine that you have a desktop application that suffers some sort of performance problem in which the UI hangs intermittently. Perhaps the problem happens only in select environments outside your test lab. You are unable to reproduce the issue and therefore unable to capture the situation in a live debug. What should you do?

Often, if an application's UI hangs, it's because the thread that is responsible for pumping messages has temporarily quit doing so. Perhaps the UI thread is performing some long CPU-bound work. In that case, you can use Xperf as a sample profiler, to find the code that is responsible for the CPU spike. I wrote about this scenario in my previous article, "Windows Application Performance Profiling with Xperf."

Another potential cause of the UI freeze is that the message pump thread has entered a wait state, waiting for some object (e.g., an event) to become signaled. In this case, sample profiling with Xperf won't help you isolate the problem. You need Xperf's wait analysis capabilities instead.

What is wait analysis? Essentially, Xperf can use the data in the captured .etl (Event Trace Log) file to match a thread that is waiting on an object with the thread that signaled that object. This capability is extremely powerful because it can also help you diagnose general latency and contention within an application. In Xperf and ETW parlance, when a thread is switched out from the running state on a processor and a new thread is switched in, the ETW Kernel Logger can generate a CSwitch event to log the transition. Naturally, a thread that enters a wait state is immediately switched out. In addition, when another thread signals an object on which threads are waiting, the signaling thread also places the waiting threads into the ready state. This operation is captured by a ReadyThread event. Xperf, with its amazing analysis capabilities, can match related CSwitch and ReadyThread events, allowing you to follow the chains of these events.

The Sample Application

To demonstrate this capability, create a simple MFC application within Visual Studio (VS) 2010. I created a project called StuckUI and used the MFC Application wizard to create a dialog-based MFC application. Then, I placed one button on the dialog and wired that button to the following handler:

void CStuckUIDlg::OnBnClickedButtonWork()
{
    HANDLE hEvent = CreateEvent( NULL, TRUE, FALSE, NULL );

    QueueUserWorkItem(
        WorkFunction,
        hEvent,
        WT_EXECUTEDEFAULT
    );

    WaitForSingleObject( hEvent, INFINITE );
}

Notice that in this handler, I create an event, hand it off to WorkFunc via the process thread pool, and then wait on it. Since the button handler runs in the context of the main UI thread, I can easily hang the UI in my application by writing WorkFunc like this:

DWORD WorkFunction( void* param )
{
    HANDLE hEvent = (HANDLE) param;

    Sleep( 5000 );
    SetEvent( hEvent );

    return 0;
}

This approach will effectively hang the UI thread of the StuckUI application for roughly 5 seconds. Now, let's use Xperf to diagnose the problem.

Capturing the Trace

Since I'm on an x64 platform, I built my sample application, which I named StuckUI.exe, to match that platform. To capture the trace, I used Xperf:

Xperf -on Base+CSWITCH+DISPATCHER -stackwalk CSwitch+ReadyThread -BufferSize 1024 -MinBuffers 50 -MaxBuffers 50
StuckUI.exe
Xperf -d StuckUI.etl

There are several things to note when capturing the trace in this way. Notice that I used the Base kernel group and added CSWITCH and DISPATCHER. The latter two kernel flags capture the CSwitch and ReadyThread events that are necessary to perform wait analysis. Additionally, the CSwitch and ReadyThread flags after the stackwalk argument capture the necessary stacks when these events are generated. If you are suspicious that you might also have a CPU spike and haven't yet ruled that out, you can replace Base with Latency and add Profile to the stackwalk list. However, if you've already ruled out a CPU spike, then it's always best to capture only the necessary data so that your trace file does not grow unnecessarily large.

Since CSWITCH and DISPATCHER can quickly generate many events, I also gave Xperf a hint about how much of a memory buffer to allocate in non-paged memory. In this case, I specified a buffer size of 1MB and told Xperf to pre-allocate 50 buffers. I accomplished these tasks by setting MinBuffers and MaxBuffers to the same value -- 50 -- which reserves 50MB of non-paged memory. Depending on your situation, you might need to massage these buffer numbers so as not to lose events. As you can imagine, Xperf is extremely fragile to missed ETW events and will notify you of such events later, when you use Xperf to analyze your trace. It is next to impossible to trust the Xperf analysis if events are missing from the trace.

Don't forget that to capture stack traces on x64 platforms, you must disable paging of the executive and restart, if you haven't already done so. You can conveniently set the necessary registry setting by using this command line:

REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f

Analyzing the Trace

Now the fun begins! Open the resulting ETL file in Xperf. You're presented with several graphs, depending on the configuration of Xperf, which is automatically saved each time you run the tool. (Xperfview.exe is the application that is used for analyzing the ETL. You can launch this application implicitly, simply by passing your ETL file to Xperf.exe.) You should see a graph called Process Lifetimes.

If you don't see the Process Lifetimes graph, you might need to enable it via either the Graphs menu or the fly-out on the left side of the main Xperf window. Incidentally, if you don't see the option to turn on a specific graph that you're looking for, it's usually because your trace was not set up to capture the relevant information for that graph.

I started and stopped the sample application during the trace collection, so you should see diamond shapes in the Process Lifetimes chart; these shapes indicate the start and end times of the StuckUI.exe process. Notice that most of the processes that are running on the machine are grouped together in one line, with no specific start and stop times other than the beginning and end of the trace. If you want, you can separate all the processes by right-clicking and choosing Toggle Permanent Process Grouping.

If I know that the trace has captured the start and end times of my process, I like to select that range within the Process Lifetimes graph, right-click it, and select Clone Selection. This action replicates the selection across all the currently visible graphs. 'Notice that the start and stop diamonds act as snap points, making it easy to select the exact range of the StuckUI.exe process lifetime, as I've done in Figure 1.

Figure 1: Process Lifetimes graph

To start the wait analysis, first ensure that you have set up Xperf to point to the Microsoft public symbol store, as well as the path that contains the private symbols for your application. For example, my symbol path, which includes the directory to the Project Database (PDB) for StuckUI.exe on my machine, is as follows:

srv*c:\symbols*http://msdl.microsoft.com/download/symbols;C:\StuckUI\x64\Debug

To set up the symbol paths, select Trace, Configure Symbol Paths, from the menu. To load the symbols, choose Trace, Load Symbols. You might occasionally see the Xperf UI freeze or hang. This issue is caused by symbol loading and could happen at various times, since Xperf loads symbols on demand. Depending on how many symbols you already have cached and the speed of your connection, these hangs could last for several minutes.

Alternatively, you can set the environment variable _NT_SYMBOL_PATH to set up the symbol path for Xperf. In the Configure Symbol Paths dialog box, you will also see a value for _NT_SYMCACHE_PATH, which typically defaults to C:\symcache. This is an internal, Xperf-formatted symbol cache. Xperf really needs only a subset of the symbol information from the PDB. Therefore, it skims the relevant information from the PDBs and puts it in a file in symbol cache. If your symbol cache is sufficiently complete, you might not even need to set up _NT_SYMBOL_PATH and can simply point Xperf at the symcache path. This option might help avoid the annoying delays caused by Xperf symbol loading.

Now, enable the CPU Scheduling graph (if you haven't already) and select the appropriate range, which might even be the entire span of the trace (in which case no selection is necessary). My wait analysis graph looks like the one in Figure 2, but yours might look much different, depending on the amount of cores and work that are active on the machine during the trace.

Figure 2: CPU Scheduling graph

Next, right-click the CPU Scheduling graph and select Summary Table with Ready Thread.

The most powerful component of the summary table is the gold bar, or aggregator bar. Any column that is enabled and is to the left of this bar becomes a sorting column. The ordering of these columns is important. For example, my summary table begins with the following columns, which are all to the left of the gold bar: NewProcess, NewThreadId, NewThreadStack, Max:Waits (us), ReadyThreadStack, ReadyingProcess, and ReadyingThreadId. So first, I sort by process, then by thread ID, then by thread stack, and so on. If one of these columns is not visible in your chart, then make sure that it is enabled in the menu or the fly-out. To reorder the columns, you can drag and drop them. If you have many columns enabled and cannot see them all without horizontal scrolling, you can drag the column names within the fly-out to reorder them.

After locating the StuckUI.exe process in the NewProcess column, click the plus sign next to that process to expand the tree and reveal the threads within the process. You will then see the process threads in the NewThreadId column. In the NewThreadStack column, you should see two expandable items named [Root]. These items are the stacks that were captured during each CSwitch event. As you drill down the stacks and expand them, you will encounter "forks in the road," so to speak, where the call tree branches.

In an analysis like this, the more information you have, the easier the analysis will be. For example, I know that the UI message pump is stuck and that a UI window (HWND) is associated with one thread at most. Knowing these things, I can use Spy++ to identify the thread that owns the UI. In this example, identifying the right thread is easy because StuckUI.exe has only two threads recorded in the ETL. But if I were troubleshooting explorer.exe and an extension that it loaded, then knowing which of the many Explorer threads to focus on would be helpful indeed. In Figure 3, I have drilled down a few levels into the stacks of thread 9888, which happens to be my UI thread.

Figure 3: A portion of the CSwitch stacks for thread 9888

Now, here's the cool part! For each path in the stack tree, the Max:Waits (us) column displays the maximum amount of time, in microseconds, that the particular code branch has waited. I observed that the UI was stuck for about 5 seconds, so it isn't surprising to find a value close to 5,000,000 in this column. If you keep drilling down into the stack, following the branch paths of the maximum wait in which you're interested, you will get to the top of the stack, where the thread was switched out as it entered the wait state. That's when the magic of Xperf's wait analysis reveals itself, because when you reach that point in the NewThreadStack column, you will find some vital information in the ReadyThreadStack column, as Figure 4 shows.

Figure 4: Top of the waiting thread

Notice that I've called out the location in the waiting stack where CStuckUIDlg::OnBnClickedButtonWork has called WaitForSingleObjectEx, leading to the thread's wait state. If you expand the ReadyThreadStack column, you will see the exact call stack that readied the waiting thread. In the ReadyingProcess and ReadyingThreadId columns, you'll see values that indicate which thread of which process performed the readying work. If you're wondering on which kind of object the waiting stack was waiting, the readying stack might contain the answer. For example, the marked area in Figure 5 shows the call to SetEvent in the readying stack.

Figure 5: The readying stack

You might have noticed that the other thread within StuckUI.exe also had a max wait that was close to 5 seconds. If you drill down into that stack, as Figure 6 shows, you can see the reason.

Figure 6: The sleeping thread

The waiting stack indicates that this thread was the worker thread that called our WorkFunction and, subsequently, SleepEx. The sleep timeout passed was 5 seconds, which explains the max wait value. However, take a look at the readying stack and the associated process and thread ID. The thread ID of zero indicates that it is the idle thread that executes when there is essentially nothing for the processor to do. The idle thread is also responsible for putting the processor to sleep momentarily, to conserve energy. But why is the idle thread the readying thread? Because SleepEx is implemented by using a kernel timer, which fires a Deferred Procedure Call (DPC) to KiTimerExpiration within the kernel at the required time to expire the timer. In addition, the idle thread is a common point from which DPCs are executed. To learn more about DPCs, refer to Windows Internals 5th Edition, by Mark Russinovich, David Solomon, and Alex Ionescu (O'Reilly Media).

The last thing I want to point out is that the ETL also recorded the exact time that the waiting thread was switched out, that the readying thread readied the waiting thread, and that the waiting thread was allowed to execute again. You can see this information by arranging the LastSwitchoutTime (s), ReadyTime (s), and SwitchInTime (s) columns just to the right of the gold bar, as shown in Figure 7, where the information is revealed after you expand the readying stack.

Figure 7: Thread timings

Notice that there is a small amount of time between the thread being readied and the thread actually being switched back onto the processor. This delay is understandable: When a thread is readied, it is placed in the ready queue for the processor. The ready queue is priority-based, so the readied thread might have to wait behind other ready threads that have a higher priority.

An Invaluable Tool

Yet again, Xperf proves its amazing capabilities to analyze your application's execution. (For more details about Xperf, see the sidebar "Where to Get Xperf" below.) In this article, I've demonstrated how you can use Xperf to track down waiting threads and discover how to set them free. However, this is only one use of Xperf's wait analysis. You can also use it to analyze general latency issues, by analyzing which code paths are waiting on which threads. Given the advances in parallel development libraries, this capability is invaluable in validating the performance gain of the parallelization of applications. Happy troubleshooting!

Trey Nash ( [email protected]) is a senior escalation engineer at Microsoft working on the Windows OSs and various other products. When he is not working feverishly within the bowels of the OS, he is delivering training on .NET platform and kernel mode debugging or playing ice hockey.

Where to Get Xperf

Xperf is part of the Windows Performance Tools (WPT) Kit, which ships with the Windows SDK. As of this writing, the most recent release is contained in Windows SDK 7.1. You can learn more about the WPT Kit here .

Comments

Plain text