In my previous article, "Troubleshooting Kernel Memory Corruption", I discussed kernel memory corruption and walked through a high-level kernel pool memory primer. In this article I'll continue the discussion focusing on Special Pool, the primary tool used by the Microsoft support team to troubleshoot kernel memory corruption. We left off by introducing Special Pool as the "smoking gun" methodology used to catch drivers corrupting memory in real time by allocating guard pages around memory allocations. The idea is to catch a driver writing beyond its allocation by forcing it to write into a guard page, causing the system to crash immediately with the culprit on top of the stack.
Using Special Pool to Catch a Problem Driver
In order to do this, a few things change in the memory model. When a driver is tracked under Special Pool, its allocations are no longer shared on a 4KB page. Instead an entire 4KB page is dedicated to the allocation with the driver's buffer placed at the bottom of the page. The intent is to later catch the driver reading or writing beyond its allocation and spilling into the next unallocated guard page. This overrun condition touches the guard page, causing a Bug Check 0xCD: PAGE_FAULT_BEYOND_END_OF_ALLOCATION blue-screen crash. Reviewing the memory dump should show the driver that wrote beyond the allocation.
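The overrun layout described above can be modeled in a few lines. This is only a sketch under simplifying assumptions: the buffer is taken to end exactly at the page boundary (any alignment padding the real memory manager might apply is ignored), and the function names and result labels are hypothetical, invented for illustration.

```python
PAGE_SIZE = 4096  # one 4KB page, as described in the article


def overrun_layout(alloc_size):
    """Return (buffer_start, buffer_end) offsets within the dedicated page.

    Assumption: the buffer is pushed flush against the end of the page,
    so the first byte past the buffer is the first byte of the guard page.
    """
    if not 0 < alloc_size <= PAGE_SIZE:
        raise ValueError("allocation must fit in a single page")
    return PAGE_SIZE - alloc_size, PAGE_SIZE


def classify_access(alloc_size, index):
    """Classify an access to buffer[index] in the overrun layout."""
    buffer_start, buffer_end = overrun_layout(alloc_size)
    offset = buffer_start + index  # offset relative to the start of the page
    if offset >= buffer_end:
        return "GUARD_AFTER"   # touches the guard page: Bug Check 0xCD
    if offset >= buffer_start:
        return "BUFFER"        # a valid access
    if offset >= 0:
        return "PATTERN"       # a write here is only noticed at free time
    return "GUARD_BEFORE"      # ran backward off the page entirely


if __name__ == "__main__":
    size = 0x30
    for index in (0, size - 1, size, -1):
        print(index, "->", classify_access(size, index))
```

The point of the model is that the very first byte past the end of the buffer already lies on the guard page, which is why an overrun is caught at the moment it happens rather than some time later.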
The rest of the page holding the buffer is filled with a random bit-pattern signature, which on the surface seems uninteresting; however, it serves a very useful purpose. When the memory manager frees the allocation, it scans the entire bit pattern looking for changes to the signature. If the signature was overwritten by even a single bit, the memory manager halts the machine with a Bug Check 0xC1: SPECIAL_POOL_DETECTED_MEMORY_CORRUPTION. Finding corruption in this bit pattern could indicate an underrun condition, in which case we would recommend the use of a special flag to enable the monitoring of underruns. In this model, the layout is flipped: the driver buffer moves to the top of the page, and the bit pattern fills the remainder down to the bottom of the page. The hope is to catch a read or write landing too early and hitting the guard page ahead of the allocation.
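To make the free-time signature check and the flipped underrun layout concrete, here is a small simulation. It is a sketch, not the memory manager's actual implementation: the class name, the fixed 0xA5 pattern byte, and the exception types are all assumptions chosen for illustration (the real signature is picked by the memory manager).

```python
PAGE_SIZE = 4096
PATTERN = 0xA5  # hypothetical signature byte, for illustration only


class SpecialPoolPage:
    """Toy model of one Special Pool page and its signature fill."""

    def __init__(self, alloc_size, verify_start=False):
        self.size = alloc_size
        # Overrun mode puts the buffer at the end of the page;
        # underrun mode (verify_start) puts it at the beginning.
        self.start = 0 if verify_start else PAGE_SIZE - alloc_size
        self.page = bytearray([PATTERN]) * PAGE_SIZE

    def write(self, index, value):
        offset = self.start + index
        if not 0 <= offset < PAGE_SIZE:
            # Leaving the page means touching a guard page: instant crash.
            raise MemoryError("guard page touched")
        self.page[offset] = value

    def free(self):
        # Scan every byte outside the buffer for a damaged signature.
        end = self.start + self.size
        outside = self.page[:self.start] + self.page[end:]
        if any(byte != PATTERN for byte in outside):
            raise RuntimeError("0xC1: signature overwritten")


if __name__ == "__main__":
    page = SpecialPoolPage(0x30)       # overrun layout
    page.write(-1, 0x00)               # underrun lands in the pattern
    try:
        page.free()
    except RuntimeError as err:
        print("detected at free:", err)

    flipped = SpecialPoolPage(0x30, verify_start=True)
    try:
        flipped.write(-1, 0x00)        # same bug now hits the guard page
    except MemoryError as err:
        print("detected immediately:", err)
```

In the default layout the stray write is only noticed when the allocation is freed, by which time the culprit may no longer be on the stack. In the flipped layout the same write faults immediately.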
Setting the Trap: Methods for Enabling Special Pool
Special Pool can be enabled using several methods. Historically you could enable Special Pool by directly editing the registry. Adding the value PoolTagOverruns with the value of 1 to the registry subkey HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management enables overrun detection; changing the value to 0 monitors for underruns. Under the same registry location, the value PoolTag indicates the tag to trace. This value allows a lot of freedom, including the use of wildcard characters.
In cases where we are unable to determine which tag is causing the corruption, we typically recommend using the hexadecimal value 0x2a (the ASCII equivalent of *) to monitor tags for all drivers. It's important to note that not all allocations will be allocated from Special Pool memory because it's a finite resource. It's also an expensive resource: the memory manager dedicates an entire 4KB page of memory to each buffer, plus two additional virtual no-access guard pages, so we don't recommend continuing to run in this mode after you've determined the root cause of the memory corruption.
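As an illustration of the registry method, the following sketch generates the text of a .reg file containing the two values discussed above. It assumes both PoolTag and PoolTagOverruns are stored as REG_DWORD values, which matches the numeric values given in this article but should be confirmed for your Windows version. The function name is hypothetical, and the script only produces text; it does not touch the registry.

```python
KEY = (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control"
       r"\Session Manager\Memory Management")


def special_pool_reg(pool_tag, detect_overruns=True):
    """Build .reg file text for the Special Pool values.

    pool_tag: numeric PoolTag value, e.g. 0x2a (ASCII '*') for all tags.
    detect_overruns: True writes PoolTagOverruns=1, False writes 0
    (underrun detection).
    Assumption: both values are REG_DWORD.
    """
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY}]",
        f'"PoolTag"=dword:{pool_tag:08x}',
        f'"PoolTagOverruns"=dword:{int(detect_overruns):08x}',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Wildcard: monitor every tag, watching for overruns.
    print(special_pool_reg(ord("*")))
```

Remember that a registry change made this way takes effect only after a reboot, and that the values should be removed once the investigation is finished.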
Another tool to enable Special Pool is the Global Flags (Gflags) utility, which is included with the Windows Debugging tools. Gflags comes in both GUI and command-line versions and includes a comprehensive Help file. Special Pool is enabled on the System Registry tab in the GUI version of Gflags, which Figure 1 shows, by entering the tag you want to track. There's an option to track overruns by selecting the Verify End radio button, or underruns by selecting Verify Start. Like the other tools, Gflags defaults to monitoring for overruns. This tool is another option to target all tags with a wildcard or a specific tag, but it doesn't give you the flexibility to select a list of tags. It does, however, give you the option to track allocations by size if you enter the size value in hex format instead of a four-character tag, but this isn't the best approach because it will monitor every driver that makes allocations of that size.
Both of the previous methods require a reboot, which may not be an option if your production server cannot be taken down for a maintenance window. The good news is that with versions of the Windows kernel starting with 6.0 (i.e., Windows Vista and later), you can use kernel flags to enable Special Pool without a reboot; however, the change will not persist across a reboot.
If you use the Gflags command-line utility to specify the kernel flag, you can enable Special Pool on the fly by specifying /k with the Special Pool switch +spp. Here's an example of using the command to monitor allocations of size 0x30 (48 bytes) until the next reboot:
Gflags /k +spp 0x30
Refer to the Global Flags Help file to find complete usage information for the tool.
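If you need to issue the kernel-flag command from a script, a small helper like the following can assemble the argument list. This is a sketch: the function name is hypothetical, and it uses only the /k and +spp switches shown above. Passing a tag string instead of a size is an assumption carried over from the GUI description, so check the Gflags Help file before relying on it.

```python
def gflags_kernel_special_pool(target):
    """Build the argument list for enabling Special Pool via kernel flags.

    target: an allocation size (int, emitted in hex as in the example
    above) or a pool tag string (assumption: tags are accepted here the
    same way they are in the GUI).
    """
    if isinstance(target, int):
        if target <= 0:
            raise ValueError("allocation size must be positive")
        arg = hex(target)
    else:
        arg = str(target)
    return ["gflags", "/k", "+spp", arg]


if __name__ == "__main__":
    # Reproduces the command line from the article.
    print(" ".join(gflags_kernel_special_pool(0x30)))
    # To actually run it on a Windows machine with the Debugging Tools
    # installed, pass the list to subprocess.run(..., check=True).
```

Building the command as a list rather than a single string avoids quoting problems if the helper is later handed to a process launcher.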
Driver Verifier is another tool that Microsoft support prescribes to enable Special Pool. It's my favorite tool because it provides more granular options than the other tools as well as walk-through wizards. The previously mentioned tools are limited to a single tag or a wildcard, whereas Verifier gives the option to choose several drivers from a list.
Here are the typical steps I would perform when using Driver Verifier with Special Pool to investigate a kernel-memory corruption problem:
- Open Driver Verifier by running verifier.exe from the command line.
- When the tool opens, select Create custom settings (for code developers) and click Next.
- Select the option Select individual settings from a full list and click Next.
- Select the Special Pool option. After you click Next, the screen in Figure 2 provides several options, among them Select driver names from a list. This option lets you use a more precise approach so that you choose only specific drivers to track.
After rebooting, the server will continue to run until Special Pool catches a driver corrupting pool. A memory dump may show the bad driver on the stack. Once you've completed your investigation, it's important that you disable Verifier by using the option available in the wizard.
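The wizard steps above also have a command-line counterpart, which is handy when scripting the setup across several servers. The sketch below assumes the standard verifier.exe switches /flags, /driver, and /reset, and that Special Pool corresponds to flag bit 0x1; confirm both against verifier /? on your system. The helper names and the driver names in the example are hypothetical.

```python
SPECIAL_POOL_FLAG = 0x1  # assumption: Special Pool is bit 0 of /flags


def verifier_enable(drivers):
    """Argument list enabling Special Pool for the named drivers."""
    if not drivers:
        raise ValueError("name at least one driver to verify")
    return ["verifier", "/flags", hex(SPECIAL_POOL_FLAG), "/driver", *drivers]


def verifier_reset():
    """Argument list that clears all Driver Verifier settings."""
    return ["verifier", "/reset"]


if __name__ == "__main__":
    # Hypothetical driver names, for illustration only.
    print(" ".join(verifier_enable(["suspect1.sys", "suspect2.sys"])))
    print(" ".join(verifier_reset()))
```

As with the wizard, settings applied this way take effect after a reboot, and the reset command should be run once the investigation is complete.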
A Few Caveats
Special Pool can be very useful in troubleshooting memory corruption, but it isn't perfect. In some cases, enabling Special Pool changes the timing and causes the problem to stop reproducing. In other cases, Special Pool catches the culprit while the machine is booting, which causes the machine to crash before logon. Remember, it's the job of Special Pool to crash the machine with the culprit on the stack. If this happens prematurely, the crashing machine may prevent the user from disabling Special Pool, which changes the dynamic of the scenario into a critical no-boot issue. If you find yourself in this situation, the Last Known Good option will disable Special Pool by restoring the registry settings back to their previous state.
Another downside of the tool is the red herring effect. Special Pool can very easily discover memory corruption bugs in other drivers that wouldn't normally cause system instability in production. This isn't necessarily a bad thing; however, it may increase the time of the investigation or might lead you to believe that you resolved the problem before catching the real guilty party.
This series merely scratched the surface of troubleshooting memory-corruption problems. In fact, I haven't even discussed the methodology used to determine which tags to target with Special Pool. This subject alone would fill an entire article in itself. In many cases, it's more of an art than science, much like debugging. Suffice it to say that for the majority of cases, Special Pool gets the job done well and helps Microsoft's support teams solve many kernel memory-corruption issues.
Ron Stock is an escalation engineer for Microsoft's Global Escalation Services team. He specializes in advanced Windows debugging and performance-related issues. For information about Windows debugging, visit his team's blog at blogs.msdn.com/ntdebugging.