So far in this troubleshooting series, I've shown you how to examine the results of a crash, but I haven't yet shown you how to delve into what can cause a crash. This month, I show you how heap corruption can make pointers go bad, overwrite critical data, or cause endless loops that hang systems.
What Is Heap Corruption?
Simply put, heap corruption occurs when misbehaving code overwrites data in the heap that the code doesn't own. (The data heap is a block of memory that the OS sets aside for an application to hold its data.) To better understand this corruption, let's first revisit how a multithreaded OS and application work.
Windows is both cooperative and preemptive. To implement the cooperative part, applications use synchronization objects that the OS provides. (For more information about synchronization objects, see "Debugging IIS Deadlocks and Blockings," October 2001.) To implement the preemptive part, Windows uses a thread scheduler and a complex set of algorithms. (For in-depth information about thread execution times, see the "Thread Scheduling" section, Chapter 4, David A. Solomon, Inside Microsoft Windows NT, 2nd edition, Microsoft Press, 1998.) When you're working with heap corruption, you must understand both of these concepts.
Memory Allocation for Thread Use
Let's look at a situation in which two threads are running independently and one thread corrupts the other thread's data. Thread 1 is processing a request from a client, so that thread requests memory from the heap. Ntdll.dll is responsible for handling this memory allocation; it looks at the heap, determines the best location to give the thread, and passes back a pointer for the memory to Thread 1. To request this memory, the pseudocode that Figure 1 shows calls ntdll.dll. Now Thread 1 has some memory it can use. Figure 2 shows the memory block in the heap.
When the time slice for Thread 1 is finished, Windows stops the thread's execution and determines that Thread 2 is next in line. Thread 2 starts and determines that it also needs memory from the heap, so it requests three pieces of memory that it will use to perform a math division routine. For the sake of this example, assume that Thread 2 stores its numbers as characters and converts them to numbers when it does the math. (Note that although this concept might seem strange, the practice is fairly common and has many uses and benefits.) Ntdll.dll looks at the heap and determines that the next available memory spot is at 15, so it starts giving out memory to Thread 2 from that spot. Figure 3 shows the pseudocode that calls ntdll.dll. Now Thread 2 has memory. Figure 4 shows the heap at this point.
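The two allocations so far can be mimicked with a toy model. This is only a sketch in Python, not the Windows heap manager: `ToyHeap`, `HEAP_SIZE`, and the one-byte size of each of Thread 2's variables are assumptions made for illustration. It shows nothing more than blocks being handed out one after another, which is why Thread 2's first variable lands at position 15.

```python
# Toy model of the heap in this example: a flat run of bytes handed out
# front to back. Real heap managers are far more elaborate.
HEAP_SIZE = 32  # hypothetical size, just large enough for the example


class ToyHeap:
    def __init__(self, size=HEAP_SIZE):
        self.memory = bytearray(size)  # every byte starts as 0
        self.next_free = 0             # offset of the next unallocated byte

    def alloc(self, nbytes):
        """Return the offset of a new block; nothing polices its use later."""
        if self.next_free + nbytes > len(self.memory):
            raise MemoryError("toy heap exhausted")
        offset = self.next_free
        self.next_free += nbytes
        return offset


heap = ToyHeap()
x = heap.alloc(15)   # Thread 1: 15 bytes for the request string
a = heap.alloc(1)    # Thread 2: three character variables
b = heap.alloc(1)    # (one byte each is an assumption of this sketch)
c = heap.alloc(1)
print(x, a, b, c)    # 0 15 16 17
```

Note that `alloc` returns only a starting position. Nothing in the model stops a thread from writing past the end of its own block, which is the weakness the rest of the example exploits.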
Thread 2 now decides to assign values based on the numbers that were passed in to two of the three character variables in the pseudocode in Figure 3:
a = number1 (converted to a character)
b = number2 (converted to a character)
Figure 5 shows the heap after the variable assignment. Windows determines that Thread 2's time slice is finished, so Windows stops Thread 2's execution and lets Thread 1 start again. Thread 1 picks up where it left off. Because Thread 1 has the memory it requested, it starts copying the request string—"Our string here"—into this memory.
String Storage in Memory
At this point, you must understand how the OS stores most strings in memory. You might have heard of null-terminated strings. A null-terminated string is a string in memory whose last character has an ASCII value of zero (referred to as a null character). Note that this character isn't the printed 0, which has an ASCII value of decimal 48. When Windows reads a string, it starts by reading the first character of the string (i.e., the character that the pointer points to), then reads each subsequent character until it finds a null character. Windows then knows to stop reading.
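The reading rule is easy to sketch. The following Python fragment is only an illustration: `read_c_string` is a hypothetical helper, and a `bytearray` stands in for raw memory. It walks forward from the pointer until it meets a byte whose value is 0.

```python
def read_c_string(memory, pointer):
    """Collect bytes from `pointer` up to, but not including, the first null."""
    end = pointer
    while memory[end] != 0:   # ASCII 0 marks the end of the string
        end += 1
    return memory[pointer:end].decode("ascii")


memory = bytearray(b"Our string here\x00leftover bytes")
print(read_c_string(memory, 0))   # Our string here
print(ord("0"))                   # 48: the printed digit, not the terminator
```

The reader has no other way to know where the string ends. If the terminator is missing, a real string routine keeps reading into whatever memory follows.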
Thread 1 copies the request string to the memory that X points to. This copy routine knows that a null-terminated string is involved and automatically tacks on a null character at the end of the string. The string is 15 characters long, but only 15 bytes were allocated for it, so the copy writes 16 bytes and the null character lands in the first byte past the end of Thread 1's block. Figure 6 shows the memory block following this copy.
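The copy can be replayed on the same kind of toy heap. This is a sketch under stated assumptions: the offsets follow the walkthrough, `copy_c_string` is a hypothetical stand-in for an unchecked string-copy routine, and the values 8 and 4 for number1 and number2 are invented for illustration.

```python
heap = bytearray(18)          # bytes 0-14: Thread 1's block; 15-17: a, b, c
X, A, B = 0, 15, 16           # offsets from the walkthrough
heap[A] = ord("8")            # hypothetical values for number1 and number2,
heap[B] = ord("4")            # stored as characters


def copy_c_string(memory, dest, text):
    """Copy `text` plus a null terminator, with no check on the block size."""
    data = text.encode("ascii") + b"\x00"
    memory[dest:dest + len(data)] = data
    return len(data)          # bytes actually written


written = copy_c_string(heap, X, "Our string here")
print(written)                # 16: one more than the 15 bytes allocated
print(heap[A])                # 0: Thread 2's variable a has been overwritten
```

Only one byte strays outside the block, and nothing fails at the moment of the write. The damage sits silently in Thread 2's data until that thread runs again.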
Thread 1 now parses the string, completes its work, then completes execution. Windows terminates the thread because the thread is finished. The memory is released, but it's not reset or overwritten. The memory is simply available for a new request from another thread. Note that only the original 15 bytes allocated for Thread 1 are available for reallocation.
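A small sketch makes the point about released memory concrete. The names `free` and `free_blocks` are hypothetical bookkeeping for this illustration, not how ntdll.dll tracks free memory.

```python
memory = bytearray(b"Our string here")   # Thread 1's 15-byte block, in use
free_blocks = []                          # (offset, size) pairs available again


def free(offset, size):
    """Mark a block reusable; the bytes themselves are left untouched."""
    free_blocks.append((offset, size))


free(0, 15)
print(free_blocks)      # [(0, 15)]
print(bytes(memory))    # b'Our string here'
```

The record covers only the 15 bytes that were originally handed out. The stray sixteenth byte that Thread 1 wrote is not part of it and stays where it landed.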
Debugging in the Real World
Returning to the example, because Thread 1 is finished, Windows turns control back over to Thread 2. Remember that Thread 2 has set up three character variables and has put values in the first two. Now Thread 2 does its math.
Figure 7 shows the pseudocode for this math. What do you think the result of this pseudocode will be? If you guessed a divide-by-zero exception, you're correct. Thread 2 reads the number stored in the heap location that the variable a points to. This number is now 0 because Thread 1's null terminator landed in that byte. However, here's the problem. If you hook a debugger up to the system when this code executes, the code will trip and generate a stack backtrace that points to the division statement in Thread 2 as the problematic code.
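Here is the failure replayed end to end on a toy heap. Several details are assumptions of this sketch: the division is taken to be c = b / a, the inputs 8 and 4 are invented, and `to_number` is a hypothetical conversion that turns any non-digit byte into 0.

```python
heap = bytearray(18)                        # 0-14: Thread 1; 15-17: a, b, c
A, B, C = 15, 16, 17
heap[A], heap[B] = ord("8"), ord("4")       # hypothetical inputs as characters
heap[0:16] = b"Our string here\x00"         # Thread 1's overflowing copy


def to_number(byte):
    """Digit character -> int; anything else -> 0 (an assumed conversion)."""
    return byte - ord("0") if ord("0") <= byte <= ord("9") else 0


failure = None
try:
    # Assumed form of the division in Thread 2: c = b / a
    heap[C] = ord("0") + to_number(heap[B]) // to_number(heap[A])
except ZeroDivisionError as error:
    failure = error
    print("Thread 2 faulted:", error)
```

The exception surfaces on the division line, which belongs to Thread 2. Nothing in it mentions the copy that actually did the damage.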
Now, because you've been following every step of both threads as the program ran, you know that Thread 2 is simply a victim in this case. In the real debugging world, you have no such information. As a matter of fact, if you look at the process you're debugging, Thread 1 isn't even running (remember that Windows already terminated it). The thread that caused the problem no longer exists. Furthermore, this type of problem is sporadic and random at best because it results from a very specific set of circumstances. Often, when you work with corruption that involves multiple threads such as in this example, you see different symptoms each time (i.e., the corruption affects different routines). In this example, you saw a divide-by-zero exception in the math division routine. However, in another crash, an Access Violation error could occur in a string-copy routine, or you might not see a crash at all. This variability is one of the big problems with trying to debug heap corruption.
Also, a long delay might take place before any noticeable problem occurs. In my example, the problem didn't show itself until long after the problematic code had actually executed. For this reason, you can't always assume that the faulting stack is actually pointing to the incorrect code. In addition, heap corruption doesn't always involve multiple threads. Often, a thread trashes its own data.
You might ask why Windows (more specifically, ntdll.dll, which controls memory) lets this corruption occur. The reason is that the heap isn't policed. Ntdll.dll keeps a record of the memory it has given out so that it knows what memory it has left, but that's all this DLL does. Application code is responsible for its own behavior. Thread 1's code in the example made one crucial mistake. The code based its memory request on the length of the incoming string, which is a common practice. However, the code forgot to add one extra byte for the null terminator. To catch this kind of problem, you can use a tool called PageHeap.
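The corrected size calculation is a one-line change. The sketch below reuses the toy heap layout from the walkthrough, with assumed offsets and an invented value for Thread 2's variable a, to show that the neighboring data survives once the request includes the terminator.

```python
request = "Our string here"
size = len(request) + 1          # 15 characters + 1 null terminator = 16

heap = bytearray(19)             # toy heap: 16-byte block, then a, b, c
X, A = 0, size                   # a now starts at offset 16, past the block
heap[A] = ord("8")               # hypothetical value for Thread 2's variable a

data = request.encode("ascii") + b"\x00"
heap[X:X + len(data)] = data     # the copy still writes 16 bytes

print(size)                      # 16
print(heap[A] == ord("8"))       # True: a is untouched
```

The copy routine behaves exactly as before. Only the size of the request changed, and that one byte is the whole difference between a correct program and a corrupted heap.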
To understand how PageHeap works, let's dive a little more deeply into Windows' memory manager. (I'll discuss PageHeap in more detail next month.) Allocating and committing memory in Windows can take a relatively long time (compared with using memory). Therefore, Windows creates heaps of memory and goes through the allocation and commitment of memory for each heap in one lump sum. Windows then has the memory ready to dole out as requests come in.
You can think of this process like requesting that a tanker truck full of gasoline be brought from the refinery to the gas station. Then, a driver can fill up his or her car from the station without waiting. Otherwise, each time a driver wanted gas, he or she would have to wait for that tank of gas to be brought from the refinery.
Windows increases the size of the heaps as necessary. The default size for the heap in a Windows application is 1MB. When Windows has given out half of that 1MB, the OS then doubles the heap. When Windows has given out half of that doubled amount, the OS will double the heap again. In this way, Windows is always ready to hand out memory (unless you run out of memory, which is a topic I'll cover in a future article).
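Taking the growth rule above at face value, the arithmetic can be sketched as follows. The function `heap_size_after` is hypothetical and models only the description in this article, not the actual behavior of the Windows heap manager.

```python
MB = 1024 * 1024


def heap_size_after(allocated_bytes, initial=1 * MB):
    """Heap size once `allocated_bytes` have been handed out, if the heap
    doubles every time half of its current size is in use."""
    size = initial
    while allocated_bytes >= size // 2:
        size *= 2
    return size


for used in (0, 512 * 1024, 1 * MB, 3 * MB):
    print(used // 1024, "KB in use ->", heap_size_after(used) // MB, "MB heap")
```

Under this rule, the heap always holds more than the amount currently in use, so a request rarely has to wait for a fresh allocation and commitment.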
When ntdll.dll doles out memory during a typical session with heaps, it gives a request the next available memory space following the previous request's allocation. However, when you use PageHeap, which is built into ntdll.dll, that behavior changes. Ntdll.dll doles out memory along with unused memory both before and after each piece of memory that's used. Ntdll.dll then marks this unused memory as "no access." Marking the memory in this way tells Windows that applications can't use that memory for any reason. Any read or write operation outside the requested area causes an Access Violation error, and a debugger can stop the program when the error occurs. (In the example, the Access Violation error would occur as soon as the code wrote the null terminator character.)
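The effect of the guard regions can be imitated with a toy heap that checks every write. This is a simulation only: `GuardedHeap`, `AccessViolation`, and the 4-byte guard width are assumptions of the sketch, and an explicit check in `write` plays the part that memory protection plays on a real system.

```python
class AccessViolation(Exception):
    """Stand-in for the fault raised when code touches no-access memory."""


class GuardedHeap:
    GUARD = 4  # hypothetical guard width in bytes, chosen for illustration

    def __init__(self, size=64):
        self.memory = bytearray(size)
        self.accessible = set()      # offsets that code is allowed to touch
        self.next_free = 0

    def alloc(self, nbytes):
        start = self.next_free + self.GUARD           # guard before the block
        self.accessible.update(range(start, start + nbytes))
        self.next_free = start + nbytes + self.GUARD  # guard after the block
        return start

    def write(self, offset, value):
        if offset not in self.accessible:
            raise AccessViolation(f"write to no-access offset {offset}")
        self.memory[offset] = value


heap = GuardedHeap()
x = heap.alloc(15)                   # Thread 1's 15-byte request
caught = None
try:
    for i, byte in enumerate(b"Our string here\x00"):
        heap.write(x + i, byte)
except AccessViolation as fault:
    caught = str(fault)
    print(caught)                    # write to no-access offset 19
```

The fault fires on the sixteenth byte, inside the copy itself. The stack at that moment points at the code that overran its block rather than at a later victim.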
The topic of heap corruption is too broad to cover in one article. Next month, I'll discuss the specifics of using PageHeap and familiarize you with the tool's requirements, pitfalls, and benefits. I'll also present other considerations about troubleshooting heap corruption.