Last month, I showed you what can happen to IIS if it crashes. This month, I look at a more common and sometimes more difficult problem: IIS hanging. The call you receive might go like this:
CUSTOMER: "I can't get to your Web site. My browser times out."
ADMINISTRATOR: "Do you see any errors?"
CUSTOMER: "No. My browser just waits and waits."
ADMINISTRATOR: "The Web server seems to be running fine, and no
errors appear on the server."
You just experienced one symptom of an IIS hang. I classify hangs in two categories: deadlocks and blocking. I show you both types and walk you through the process for solving each type. (The dumps I show you are from real-world scenarios.)
What's a Hang?
A hang is any situation in which IIS or a subset of IIS stops responding to input but doesn't crash or stop running. The two types of hangs are deadlocks and blocking. Because IIS is a multithreaded application, it can suffer when two or more threads are waiting on each other to complete some work. Picture this: Two trains come to a railroad crossing at the same time, but neither can go through until the other has passed. You now have a deadlock. Other threads need resources that threads 1 and 2 have tied up, so those other threads will stop responding and eventually, all of IIS will hang.
Now consider a steam train coming into a train station. The train needs to get water. Unfortunately, the people responsible for keeping the water tower filled didn't do their job, so no water is available. The locomotive is stuck at the station waiting for water to come. You now have blocking. In this case, an external process or request (the water tower) has held up the internal thread (the train).
In "Diagnosing Problems in IIS," May 2001, I introduced you to Windows NT Synchronization objects. These objects let IIS work efficiently. However, they also let hangs occur. In my first scenario, I show you a deadlock that results from the improper use of Synchronization objects in a piece of code. Deadlocks are caused by locks held on critical sections of code. I use a command in WinDbg called !locks to see what's happening.
First, load the dead_locked.dmp file in WinDbg. (You can download dead_locked .dmp from the Code Library on the IIS Administrator Web site at http://www.iisadministrator .com.) Select Open Crash Dump from WinDbg's File menu to load dead_ locked.dmp. Figure 1 shows WinDbg's Command and Disassembly windows with dead_locked.dmp open. Note that I took the dump from a machine running NT 4.0 Service Pack 5 (SP5), so you need to set your symbol path to point to the proper symbols. To download these symbols, go to http://www.microsoft.com/ ntserver/nts/downloads/re commended/sp5/debug/ default.asp. (The encryption level is unimportant for this example.) Note that the more than 50MB download file expands to over 200MB, so be sure that you have the disk space. For information about symbols and how to install them, see "Understanding and Using Symbols" (June 2001).
After your symbols are in place, type
at the command prompt (i.e., the bottom pane that's preceded by 0:000>) in WinDbg. Figure 2 shows the output from running this command against dead_locked.dmp. (Note that I've deleted all but the two items you need to look at.) For easier viewing, I recommend that you paste the output of the !locks command into Microsoft Notepad.
When you go through this output, you first want to look at the LockCount value. If this value is 0, you can ignore the lock because no thread is currently using the Critical Section. When you've found all the locks that have a LockCount value of 1 or higher, you must examine the threads that own the locks. To find the thread ID, find the OwningThread value. Then, use the ~* command, which Figure 3 shows, to list all threads and their IDs.
When you know the ID of the owning thread, you can switch to that thread by using the ~<#>s command (where <#> is the thread number, or the first number in the ~* output line). Let's assume that you just found that thread 38 (ID 147) has a lock. You switch to thread 38 (~38s). Typing
in the prompt window produces the results that Figure 4 shows. Note that these results read top down (i.e., the item at the top represents what the thread is currently doing). Look at the WaitForCriticalSection line, or frame, to see what thread 38 is waiting on. In this case, the thread is waiting on the lock at argument 113c0590. Looking back at the list in Figure 2, you see that thread 43 (ID 3c) owns that lock. So, switch to thread 43 (~43s). Typing
in the prompt window produces the results that Figure 5 shows. Looking at these results, note that thread 43 is waiting on the lock at argument 160bb140. Looking again at Figure 2, you see that thread 38 owns that lock. Thus, thread 38 is waiting on thread 43, which is waiting on thread 38, which is waiting on thread 43.
Now, when you apply the debugging techniques from last month's article to see what the threads are doing, you'll know what the culprit is. Keep in mind that deadlocks don't always involve just two threads. The problem could be thread 1 waiting on thread 2 waiting on thread 3 waiting on thread 1. You must take the time to look at all locks and make sure that no cyclic reference exists.
Like deadlocks, blocking is related to NT Synchronization objects. However, unlike deadlocks, blocking doesn't necessarily involve Critical Sections. With blocking, you must look for one or more threads that are waiting on a task to finish and other threads that are waiting on those threads. Remember the train waiting for water? Now picture 10 other trains also waiting their turn. Everyone is blocked.
To see blocking in action, load the dump file, blocked.dmp, which you can download from the Code Library on the IIS Administrator Web site. For this file, you need to use NT 4.0 SP6a symbols. (You can download these symbols from http://www.microsoft .com/ntserver/nts/downloads/rec ommended/sp6/debug/default.asp.) Load the file in WinDbg, then type
in the WinDbg prompt window to see what all the threads are doing. At this point, stop reading and see whether you can spot what's blocking IIS. (Here's a hint: Remember that IIS runs as a service and, by default, isn't set to interact with the desktop.)
Did you find anything suspicious? If not, here's one more clue. Look closely at thread 35. Thread 35 has a call to MessageBox. A message box usually requires user interaction, but you're not set to interact with the desktop. So, start your investigation here. (To learn more about the MessageBox API call, you can look up the call on the Microsoft Developer Network—MSDN—Web site at http:// msdn.microsoft.com.) Notice that the call has four parameters (or Arguments to Child). The first parameter isn't important. The second parameter is a pointer to the message to display in the dialog box. The third parameter is a pointer to the title of the dialog box, and the fourth parameter is a set of flags telling the dialog box the buttons and icons to display. (Note that the actual API call you see in the stack is MessageBoxA. The A at the end stands for ANSI, as opposed to W for Unicode or wide byte strings. This naming convention is typical of Windows APIs.)
Let's look at the dialog box. The second argument to look at is 4a3a 470. To dump that argument, type
in the prompt window. Figure 6 shows the output of that command. The error tells you that IIS couldn't load a file (nddeapi.dll). Now look at the third argument, 41292d40. To dump that argument, type
in the prompt window. Figure 7 shows the output of that command. Here you see a fatal SQLSTR error. You have your culprit.
However, just to be thorough, check the dialog box type to make sure it's truly waiting on user input. To get to the fourth argument, you need to dump the raw stack starting at the Child Extended Base Pointer (ChildEBP) for the frame you want to examine—in this case, dc 1bbe404. (For a refresher about stacks and frames, see "Your First IIS Crash," September 2001.) Type
in the prompt window. Figure 8 shows the output of that command. Notice that the values in columns 2, 3, 4, and 5 of Figure 8 match the first four columns of
01bbe404 41288959 00000000 04a3a470 41292d40 USER32! MessageBoxA+0x16
which is the MessageBoxA frame, but you get more information.
The next value after 41292d40 is the fourth argument. In this case, the argument is just the hexadecimal number 10 (00000010). By looking in winuser.h, you can see the flag mapping for this value from the list that Figure 9, page 4, shows. (For information about what winuser.h is and where you can get it, see the sidebar "What's Winuser.h?") Note two values in Figure 9—one value is for MB_OK = 0, and the other value is for MB_ICONHAND = 10. These values tell you that the dialog box has an OK button and is displaying the Hand icon. This flag information tells us that the dialog box is waiting for input and is therefore the culprit.
The blocking dialog box problem doesn't come up that often, but it's a good example of how to dive into a dump file with no idea of a culprit and just look through the file for something out of place. Unfortunately, I can't tell you that every message box causes a hang. Some system message boxes might not require user interaction. Such message boxes most likely won't cause hangs. So, if you see a message box in a hang situation, you need to go through the steps I've presented to determine the exact problem. You might also want to enable the Allow service to interact with desktop setting in the World Wide Web Publishing Service Properties dialog box, which Figure 10 shows. This setting lets the dialog box appear to an administrator who's logged on to the server.
Follow Your Instincts
This month, I showed you two simple versions of hangs. Obviously, many more varied and complex hangs can occur. The key is to diagnose the symptoms fully. I recently received a call in which IIS was hanging as soon as it started. I saw only one connection ever being allowed, which seemed odd because, by default, the system had at least two Asynchronous Thread Queue (ATQ) threads to handle incoming requests. If the threads were blocked, IIS should have handled at least two requests. If the network were at fault, IIS should have handled 0. So, I got a dump of the process and found a call to MessageBoxA from custom code. When I asked the customer about the code, he said that the code was part of a high-priority Internet Server API (ISAPI) filter used to authenticate users. Therein was the problem: If the filter threw a dialog box on the first connection and all users were required to go through the filter, IIS would hang at one connection.
In another call I received recently, I saw a machine hanging with dozens of threads that had SockWaitForSingleObject near the top of the stack. I knew that SockWaitForSingleObject was an NT Synchronization object relating to Winsock (or TCP/IP) and told the customer to check the network for problems. In the server farm, the customer found a switch that was saturated with traffic every time the problem occurred. In this case, IIS was hung because of an external piece of hardware.
You can gather a lot of information from these logs by using common sense, following your hunches, and performing basic troubleshooting. Although there are logs that you can use to go right to a line of code and say "This statement is causing the problem," you don't always have to go to that level to get useful information.