Our three development domain controllers (DCs) were suffering from CPU spikes. As a result, users were experiencing slow logons and noticeable delays when querying Active Directory (AD). I ended up trying several techniques and tools to find the reason for the CPU spikes on our DCs.
I began by checking each DC's CPU usage in Process Explorer. This free tool shows you information about the objects (e.g., DLLs, handles, registry keys) a process has opened or loaded. Before using Process Explorer, though, you should configure it to download symbols so that raw addresses, such as thread start addresses and stack frames, are resolved to readable function names. To do so, you just need to select Configure Symbols on the Options menu. Process Explorer uses the Debugging Tools for Windows to make the conversions, so you need that toolset installed to use this option.
As Figure 1 shows, I found that the System process was typically consuming around 50 percent of CPU time.
Figure 1: Using Process Explorer to determine CPU usage at the process level
The System process isn't bound to an executable image like other processes. It exists to host the kernel-mode threads that the OS runs on behalf of Windows subsystems and device drivers. CPU spikes in the System process could therefore mean a misbehaving device driver. To find out whether that was the case, I decided to look at CPU usage at the thread and stack level.
I double-clicked the System process to bring up its Properties dialog box, then selected the Threads tab. As Figure 2 shows, I found several threads, each of which was consuming about 5 percent of CPU time.
Figure 2: Using Process Explorer to determine CPU usage at the thread level
These threads were associated with Srv.sys, which is the file server device driver that responds to network I/O requests for file data on disk partitions shared on a network. Because I had previously configured symbols for OS images in Process Explorer, the thread list also showed the function name, which was WorkerThread. In other words, they were system worker threads.
Because any device driver can submit work to a system worker thread, I still didn't know the source of the request. So I highlighted one of the threads and pressed the Module button to see more information about the file (aka module) behind that thread. Sometimes the Version tab in the dialog box that appears includes a description of the component that submitted the work, but unfortunately that wasn't the case this time.
Another way to determine the component that submitted work to a system worker thread is to use the Kernel Profiling Tool (kernrate.exe). This command-line tool lets you track CPU utilization by kernel-mode and user-mode processes. Although the Kernel Profiling Tool has been deprecated, you can still download it at www.microsoft.com/whdc/system/sysperf/krview.mspx. You can also find it in the Microsoft Windows Server 2003 Resource Kit.
When I ran the Kernel Profiling Tool, a module named mfehidk caught my attention. To find the device driver associated with this module, I used the free Strings utility. By running the command
strings *.sys | findstr mfehidk
from the C:\Windows\System32\drivers directory, I was able to determine that the module was associated with the McAfee device driver installed on my system. (You can also use other search techniques, which are discussed in the Microsoft article "How to find pool tags that are used by third-party drivers".)
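If the Strings utility isn't at hand, the same kind of search can be approximated in Python. The following is only a minimal sketch, not a replacement for Strings: files_containing is a hypothetical helper name, the match is case-insensitive (unlike findstr's default), and the text is looked for in both ASCII and UTF-16LE form because Windows binaries can store strings either way.

```python
from pathlib import Path


def files_containing(directory, needle, pattern="*.sys"):
    """Return names of files in `directory` whose bytes contain `needle`.

    The search is case-insensitive for ASCII letters and checks both the
    ASCII and the UTF-16LE encoding of the needle.
    """
    ascii_form = needle.lower().encode("ascii")
    utf16_form = needle.lower().encode("utf-16-le")
    matches = []
    for path in sorted(Path(directory).glob(pattern)):
        try:
            # bytes.lower() only changes ASCII letters, which is all we need.
            data = path.read_bytes().lower()
        except OSError:
            # Skip files that are locked or unreadable without elevation.
            continue
        if ascii_form in data or utf16_form in data:
            matches.append(path.name)
    return matches


if __name__ == "__main__":
    # Directory and module name are the ones from the article.
    for name in files_containing(r"C:\Windows\System32\drivers", "mfehidk"):
        print(name)
```

Reading each driver file fully into memory is acceptable here because driver images are small; for a very large directory tree a chunked read would be a better fit.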
I unregistered the module using the Regsvr32 command, then monitored the DCs for improvements in CPU usage. I was dismayed to see that the spikes didn't go away.
In despair, I turned to Process Monitor. This free tool lets you monitor file system, registry and process activities in real time. To output the activity of the System process only, I selected the Enable Advanced Output option on the Filter menu and selected Include 'System'. Figure 3 shows sample results.
Figure 3: Using Process Monitor to monitor the System process in real time
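A long capture is easier to read once it is summarized. Process Monitor can save a trace in CSV format, and the sketch below shows one way to count which operations and results the System process produced most often. It assumes the export contains the Process Name, Operation and Result columns; tally_results is a hypothetical helper, not part of Process Monitor.

```python
import csv
import sys
from collections import Counter


def tally_results(rows, process_name="System"):
    """Count (Operation, Result) pairs for one process.

    `rows` is any iterable of CSV lines (an open file works) whose header
    row is assumed to include Process Name, Operation and Result.
    """
    counts = Counter()
    for row in csv.DictReader(rows):
        if row.get("Process Name") == process_name:
            counts[(row.get("Operation"), row.get("Result"))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # utf-8-sig tolerates a byte-order mark in front of the header row.
    with open(sys.argv[1], newline="", encoding="utf-8-sig") as handle:
        for (operation, result), count in tally_results(handle).most_common(10):
            print(f"{count:6d}  {operation}  {result}")
```

Run against an exported trace, the script prints the ten most frequent operation and result pairs, which makes an unusual result code stand out quickly.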
The DCs were serving logon scripts at the time of the data capture, so several OPLOCK NOT GRANTED entries provided an important clue. After some investigation, I found that a badly designed logon script was the culprit. The logon script had a reference to a missing network share. After I corrected the script, the System process's CPU usage dropped to about 5 percent on the DCs.
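A reference to a missing share is the kind of mistake that can be checked for before a logon script is deployed. The following sketch pulls \\server\share prefixes out of a script's text and reports those that cannot be reached. The function names and the simple regular expression are illustrative assumptions, not the check I actually ran, and the pattern ignores share names that contain spaces.

```python
import os
import re

# Two backslashes, a server name, one backslash, a share name.
UNC_SHARE = re.compile(r"\\\\[\w.$-]+\\[\w.$-]+")


def unc_shares(script_text):
    """Return the distinct UNC share prefixes in the text, in order of appearance."""
    seen = []
    lowered = set()
    for match in UNC_SHARE.findall(script_text):
        # Server and share names are case-insensitive on Windows.
        if match.lower() not in lowered:
            lowered.add(match.lower())
            seen.append(match)
    return seen


def missing_shares(script_text, exists=os.path.isdir):
    """Return the referenced shares for which `exists` reports False.

    With the default check, probing an unreachable server can take a
    while because of network timeouts.
    """
    return [share for share in unc_shares(script_text) if not exists(share)]
```

Passing the check as a parameter keeps the extraction logic testable without network access; on a real DC you would read the script from disk and call missing_shares with the default os.path.isdir.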
My experiences will hopefully give you an idea of some of the tools and techniques you can use to troubleshoot performance problems in the System process. You can find more details on the troubleshooting steps I took in my blog entry "Troubleshooting the System Process (CPU Spikes)".