Heap usage is expensive! Any native (non-managed) Windows application that does anything interesting almost definitely uses one or more heaps. The C/C++ runtime library implements memory allocation and deallocation using Windows heaps. Anytime you allocate a block of memory from the heap, there is a certain amount of overhead involved including extra allocation space for bookkeeping and extra work to avoid excessive fragmentation of the heap, for example. Applications that misuse the heap will degrade when this overhead is excessive.
Additionally, concurrency is the hot new way to realize performance gains in the multiple core environments that are so common. Applications that rely on concurrency (by utilizing the thread pool, for example) must be careful not to defeat that concurrency by having those threads allocate too much from a single heap, which will force serialization.
In keeping with the theme of my previous articles, I'll show you how you can use Xperf (Windows Performance Analyzer) to analyze your applications' heap usage on Windows 7 and pinpoint inefficiencies. With the power of Xperf's stack capturing, you can also precisely identify problem areas in the code.
Note: Remember that to capture stack traces on x64 platforms, you must disable paging of the executive and restart if you have not already done so. You can set the registry setting conveniently by using the following command line:
REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f
The Example Application
For this article, I will use the console application in Figure 1 to demonstrate some of the common patterns to look for during your analysis. This application simulates three different types of work, one of which contains a real memory leak. You can create a Visual Studio 2010 project and solution for this code, or you can simply save it to a .cpp file and execute the following command from a Visual Studio 2010 command prompt (assuming you saved the code in test.cpp):
cl /EHa /Zi test.cpp /link /debug
This will produce test.exe, and when you execute the application, you'll see the following prompt:
Please choose one of the following:
1 : Do some work
2 : Do some more work
3 : Leak some memory
9 : Quit
Press one of the keys above...
Starting Xperf with the Heap Plug-in
Using Xperf's ETW controller capabilities, you must start both an NT Kernel Logger session and a session for capturing heap-usage data. The kernel session is global to the entire machine, whereas the heap session is specific to one process. When event collection stops, all events are merged into an ETL file.
When starting the heap session, you have three options for identifying the process for which to collect heap data:
- The -PidNewProcess option instructs Xperf to start the process directly,
- The -Pids option instructs Xperf to attach to an already running process, or
- By using the -Pids option while providing zero as the PID and by setting an appropriate registry value in Image File Execution Options (IFEO), Xperf will monitor an application as soon as it is launched.
Of these three options, the third option using IFEO is the most flexible. If you allow Xperf to start the application, your application will run as an elevated administrator because Xperf must be running at that level to start tracing. In many applications, running at such a high privilege will alter the behavior. If you use the option to attach to a process, your results will vary since the heap plug-in for Xperf lacks rundown support. That means it cannot track down and account for allocations that occur before the heap session is started. Therefore, I will demonstrate using the IFEO method, since it mitigates the two aforementioned concerns.
Since these commands are long-winded, I like to set them up in batch files. I have a starttest.bat, which looks like the file shown in Figure 2. The first line adds the necessary TracingFlags DWORD value under IFEO, assuming the application I'll be gathering data against is named test.exe. The second line starts the kernel logger session, and the last line starts the heap logger session, which I've named HeapSession. Note that you don't provide a name for the kernel logger, as it's the only kernel logger on the system.
There are a couple things to note regarding the commands in Figure 2. First, I am explicitly setting the buffer size to 1MB and specifying how many in-memory buffers to allocate by setting MinBuffers and MaxBuffer equal to each other. For the kernel logger, I'm setting up 10 buffers at 1MB each, and for the heap session, 50 buffers at 1MB each. ETW is a lossy mechanism, meaning it is not guaranteed to catch every event. Setting up these buffers (which happen to be in nonpaged memory) helps mitigate any problems. Losing an event during heap-usage collection is especially problematic since losing a free event associated with an allocation will surface as a false positive memory leak.
Additionally, you must make sure to capture Cswitch (context switch) events in the kernel logger since, during analysis, the Xperf state engine needs data recorded in the Cswitch event. For example, it would be inefficient to record the current thread in each ETW event logged. However, if the interesting event contains the current processor, then you can determine the current thread on the processor by finding the previous Cswitch event.
Once I run starttest.bat from an elevated command prompt, I then execute my application, which in this case is test.exe. Once I'm done with that, I then have a stoptest.bat file to turn off tracing, merge the ETL into heap.etl, then remove the IFEO TracingFlags value from the registry. Figure 3 shows my stoptest.bat file. Note that it's a best practice to stop the ETW sessions in the opposite order in which they were started. That is why I stop HeapSession before stopping the kernel session.
Analyzing the Data
While gathering data, I ran the test application (test.exe) and selected option 1 twice, then option 3 three times, and then option 2 twice before quitting, thus producing the following console output:
Work completed, what next?
Work completed, what next?
Done leaking memory, what next?
Done leaking memory, what next?
Done leaking memory, what next?
More work completed, what next?
More work completed, what next?
After executing stoptest.bat from an elevated command prompt, you may notice that Xperf takes some time to create the merged heap.etl trace file. After that's done, you can pass it to Xperf from the command line to start the Xperf viewer.
Remember that the first thing you want to do in Xperf is to verify your symbol path, then load symbols under the Trace submenu. I've set up my symbol path to point to the Microsoft public symbol store as follows: srv*C:\symbols*http://msdl.microsoft.com/download/symbols.
The two graphs of interest in the Xperf main window are Heap Outstanding Allocation Size, shown in Figure 4, and Heap Total Allocation Count, in Figure 5. Notice that I've highlighted interesting areas with colored squares. Xperf has a nifty feature that lets you overlay one graph onto another. However, until you know what sort of patterns to look for, I recommend viewing the graphs separately.
Now, imagine that you are testing this application and do not recall the source code or maybe don't have access to the source code. Consider what is in the blue box in the Heap Outstanding Allocation Size graph, which was generated when I selected option 1 twice in the test application. Nothing seems out of the ordinary in that it allocated some memory to do the work and the amount of memory grew over time before all that memory was freed.
But if you also consider the same time span in the Heap Total Allocation Count, you see something far more sinister. During the first time I ran option 1, test.exe performed around 100,000 allocations! Immediately you should start to question the efficiency of this work. By taking a more careful look at the function behind it, you can see that each time it adds a new long to the list, it allocates a new buffer, copies the old buffer contents into it, then appends the new value. Immediately, you can see the value in Xperf's heap analysis in helping point out inefficiencies without even having to look at the source code.
The portion of the graph within the red box in Figure 4 represents the memory leak behind option 3 in test.exe. We have allocated a certain amount of memory yet didn't free all of it when done. This is manifested by the spikes in the graph and by the steady state value of the total heap allocation rising. Although there is not enough data to pinpoint this as a memory leak simply by looking at the graph, anytime you see that the lower threshold of allocated memory is continually rising, you should investigate for memory leaks.
Finally, the area in green is an interesting one. When you view that area in the Heap Outstanding Allocation Size graph in Figure 4, it looks like nothing is happening. But Figure 5's Heap Total Allocation Count graph paints a far different picture. In this graph, the green area corresponds to option 2 in test.exe where the DoMoreWork() function makes a rapid series of allocations and immediately frees them. DoMoreWork() is meant to simulate functions that contain a loop and need temporary storage space for each iteration. In this case, you should investigate whether the loop should use the same allocated buffer for each iteration or whether it should simply use the stack for temporary storage instead.
Note: Stack allocations are far more efficient than heap allocation simply because they merely update a register to do the allocation. However, you must carefully consider whether your allocations may blow the stack limits or introduce a security concern if a buffer is able to overrun and corrupt the stack.
But What About Stacks?
Using stacks captured by Xperf, you can glean even more data regarding heap usage. Suppose you want to know which function within a contentious area is performing so many allocations. Simply highlight that portion within the Heap Outstanding Allocation Size graph, right-click, then select Summary Table. Let's take a look at one of the first iterations in the blue box.
When you select the range and open the summary table, a column named Type is available with some cryptic acronyms in the cells. They represent the memory allocations with respect to the selection you just made: Allocated Inside/Freed Inside (AIFI), Allocated Inside/Freed Outside (AIFO), Allocated Outside/Freed Inside (AOFI), and Allocated Outside/Freed Outside (AOFO). Since I selected a range fully enclosing one iteration of option 1, I am interested in AIFI only. Figure 6 shows my view of the summary table showing which columns I've turned on and how I am sorting.
Note that the stack is inverted from what you're used to seeing in the debugger. Additionally, there is a symbol bug manifesting itself in this view. The strange symbol name ntdll.dll! ?? ::FNODOBFM::`string' should actually be ntdll.dll!RtlAllocateHeap. Unfortunately, there is a problem with the way the public symbols on the symbol server are generated. The good thing is you should never see this problem when working with symbols from your own binaries.
Very clearly you can see that operator new was called 100,000 times during this time frame. Also notice that only eight of the allocations came from the low fragmentation heap.
Note: The gold bar is an aggregation point within the summary table. Any column type on the left side is used for partitioning or sorting the data set. So in this view, I first sort by process, then by allocation type, then by stack, and then by source (meaning source within the heap). You can do some pretty powerful things by rearranging the columns within the summary table view.
Now I want to show you another powerful feature of the summary table. Select the entire range of Heap Outstanding Allocation Size and bring up the summary table. Drill down into a stack until you find an instance of ntdll.dll! ?? ::FNODOBFM::`string' (which is actually ntdll!RtlAllocateHeap), then select it, right-click and choose View Callers|Innermost… from the context menu. Now Xperf presents a new summary table with ntdll!RtlAllocateHeap (named ntdll.dll! ?? ::FNODOBFM::`string') at the top. What it has done is invert the stack view, and you can now find out who has called RtlAllocateHeap within the selected range. Expand the RtlAllocateHeap node in the Stack column, then click the Count column to sort by the most frequent callers. You should get a summary table view similar to Figure 7, where I have expanded the stack a bit.
Notice that the top caller is malloc, but of the 403,901 times malloc called RtlAllocateHeap, 200,000 of them came from DoMoreWork() and another 200,000 from DoSomeWork().
A Heap of Knowledge
Allocating from the heap is expensive. It's wise to analyze your heap usage during performance analysis to address any unacceptable hotspots during development. Xperf's heap-analysis plug-in is an excellent tool to perform this analysis, allowing you to generate detailed and revealing views into how your application is really using the heap. Xperf also has plug-ins to monitor virtual allocations and system pool usage; however, they gather data systemwide whereas the heap plug-in is process specific. Armed with these tools, you can easily determine how your or any other application is utilizing system memory. Happy analyzing!
Trey Nash is a senior escalation engineer at Microsoft working on the Windows OSs and various other products and is the author of Accelerated C# 2010 (Apress). When he is not working feverishly within the bowels of the OS, he is delivering training on .NET platform and kernel-mode debugging or playing ice hockey.