Monday, May 16, 9:30 a.m.: Customer's server crashes for the umpteenth time.
Accusations hurtled through the air, and angry email messages and phone calls flew furiously between the small-to-midsized business (SMB) customer and the Value Added Reseller (VAR) that supported the customer's financial application. What spawned this IT battle scene? It all started when a Windows 2000 server that hosted the customer's application started crashing intermittently. I work for a Microsoft Business Solutions Gold Partner, and customers who use Microsoft Business Solutions for Financial Management—Great Plains software are an important part of our practice. My boss dispatched me to the client's site to assess the problem.
By the time the client called us, the server was crashing every few days. Before the crash, ODBC connections from Great Plains clients would become sluggish and finally disconnect. The client's accounting managers, IT people, and Great Plains implementers hurled epithets at each other over the fallen server.
The Great Plains implementer on this project is a capable technician, but his training and experience hadn't prepared him to handle the problem at hand: resolving server lockups and crashes. In desperation, he emailed the client/server coordinator and copied me on the message.
Our Microsoft Customer Relationship Management (CRM) system contains our clients' histories for contacts, product purchases, licensing keys, trouble tickets, and other relevant customer information. I located the client's resident IT support person in the CRM database and phoned him.
10:00 a.m.: I begin problem resolution by calling the client's onsite IT person.
I introduced myself to the IT support person and explained why I was calling. Quickly, I reassured him that I—the VAR—was on his side and that I wanted to help him resolve the problem. I won his trust, and he gave me his full cooperation.
He told me that the server was downed like a badly wounded soldier, bleeding memory slowly but continuously. He also told me that his company's security policies prohibited using remote management software, which would have let me examine the injured system. I'd have to find another way to investigate the problem.
10:20 a.m.: I examine the server event logs for clues.
I asked the IT person whether he could send me the server's System and Application event logs, SQL Server event logs, and perhaps a snapshot of Task Manager. He emailed them to me at 10:40 a.m.
I opened the logs and looked at the System log first. The first thing I saw was a bright red streak of Event ID 2019 errors flashing on my laptop screen: The server was unable to allocate from the system nonpaged pool because the pool was empty. Then, in the Application log, I saw Event ID 208. This error fingered the Great Plains application as part of the problem.
In the SQL Server event log, I saw the Event ID 17052 error. And finally, in the Task Manager snapshot, I got a little more information about the Event ID 2019 error, as Figure 1 shows.
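Eyeballing exported logs for recurring error IDs gets tedious fast. As a rough sketch of how that triage can be scripted, the function below counts occurrences of the interesting event IDs in an exported log dump. The tab-separated export format and the sample lines are assumptions for illustration; real exports vary by tool and locale.

```python
import re
from collections import Counter

def count_event_ids(log_lines, event_ids=frozenset({"2019", "208", "17052"})):
    """Count occurrences of the given event IDs in exported event-log text.

    Assumes a comma- or tab-separated export where the event ID appears
    as a standalone field (an assumption; check your export format).
    """
    counts = Counter()
    for line in log_lines:
        for field in re.split(r"[\t,]", line):
            field = field.strip()
            if field in event_ids:
                counts[field] += 1
    return counts

# Hypothetical exported lines, for illustration only
sample = [
    "Error\t5/16 9:28\tSrv\t2019\tUnable to allocate from the system nonpaged pool",
    "Error\t5/16 9:29\tSrv\t2019\tUnable to allocate from the system nonpaged pool",
    "Warning\t5/16 9:30\tGreat Plains\t208\tODBC connection dropped",
]
print(count_event_ids(sample))  # Counter({'2019': 2, '208': 1})
```

A cluster of Event ID 2019 entries in a short window is the pattern that points at nonpaged-pool exhaustion.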
I looked in the Microsoft Help and Support Knowledge Base and found an article at http://support.microsoft.com/?kbid=888928 that showed that the Event ID 2019 error might be related to having McAfee VirusScan installed on the server. McAfee VirusScan was, in fact, on the server, and the vendor had a hotfix for the problem. I notified the local IT support person, who downloaded and quickly applied the hotfix and rebooted the server. Alas, the hotfix failed to stop the resource bleeding.
11:30 a.m.: En route to the client's site, I find a fruitful lead.
Finally I persuaded the client to let me investigate the problem on site. To pass the time during my drive to the client's site, I listened to a CD; no, not Pink Floyd or Willie Nelson, but Mark Minasi's Tuning Your Windows 2000 Servers. While perusing the event logs, I'd been mulling over memory leaks and how to find them. On the CD, Mark talks about memory and mentions "leakers"—programs that allocate a file handle every few seconds. By itself, a file handle doesn't use much memory, but the repeated allocations gradually consume a great deal of it.
1:15 p.m.: I find the source of the problem.
When I arrived at the site, I met the IT support person, who ushered me into the server room. I opened Task Manager on the server and customized the view by adding the User Name, Paged Pool, Non-paged Pool, Handle Count, and Thread Count fields. I clicked OK, then maximized the Task Manager window and sorted by file handles.
For comparison, on my Windows XP laptop, svchost.exe uses 1424 handles and outlook.exe uses 1333 handles. On the client's server, however, I found an applet that reported status messages from the onboard SCSI card. That program had consumed 700,000 file handles in the 10 minutes since the server had been rebooted, and the file-handle count continued to climb.
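Minasi's "leaker" pattern has a simple signature: a handle count that only ever goes up. As a minimal sketch of how to spot it programmatically, the function below takes periodic per-process handle-count samples (as you might record from Task Manager or a polling script) and flags any process whose count never drops and grows past a threshold. The process names and numbers are invented for illustration.

```python
def find_leakers(samples, min_growth=1000):
    """Flag processes whose handle count climbs steadily across samples.

    `samples` is a list of {process_name: handle_count} dicts taken at
    regular intervals. A process is suspect if its count never decreases
    and its total growth exceeds `min_growth` handles.
    """
    suspects = []
    for name in set().union(*samples):
        counts = [s[name] for s in samples if name in s]
        if len(counts) < 2:
            continue
        monotonic = all(a <= b for a, b in zip(counts, counts[1:]))
        if monotonic and counts[-1] - counts[0] >= min_growth:
            suspects.append(name)
    return suspects

# Simulated one-minute samples: the (hypothetically named) SCSI-status
# applet keeps climbing; the other processes hover around steady counts.
samples = [
    {"scsiapp.exe": 50_000, "svchost.exe": 1_424, "sqlservr.exe": 2_100},
    {"scsiapp.exe": 120_000, "svchost.exe": 1_430, "sqlservr.exe": 2_090},
    {"scsiapp.exe": 190_000, "svchost.exe": 1_425, "sqlservr.exe": 2_105},
]
print(find_leakers(samples))  # ['scsiapp.exe']
```

The healthy processes fluctuate up and down around a baseline; only the leaker's count is both monotonic and large, which is exactly what the sorted Task Manager view showed at a glance.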
I did a quick Google search on the filename of the errant program, and the results showed that many people were having problems with this file and certain motherboards. That was further evidence that we'd found the problem. Earlier, I'd told the Great Plains consultant that I suspected a memory leak. As I stared intently at Task Manager, I exclaimed, "Well, I guess we found our 'leaker'!"
1:45 p.m.: I bring the "crouching server" back to life.
The final step was to fix the rogue program so that it no longer created file handles ad infinitum. Although the server hardware was under warranty, its service level agreement (SLA) didn't cover onsite support. The server housed sensitive financial information, so moving it off site for service wasn't an option.
My alternative (and easier) solution was to modify the applet's startup entries in the registry. I ran regedit, located the places from which the applet launched, and removed its value from the Run key (HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Run) so that the applet would no longer run when the server was rebooted.
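For readers who prefer a scriptable version of that fix, a .reg file like the following removes a startup value from the Run key. The value name "ScsiStatusApplet" is hypothetical; substitute the actual value name shown in regedit, and export a backup of the key first.

```reg
Windows Registry Editor Version 5.00

; Delete the (hypothetical) "ScsiStatusApplet" value from the Run key
; so the applet no longer launches at startup. The trailing "=-"
; syntax tells regedit to remove the value.
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Run]
"ScsiStatusApplet"=-
```

Disabling the startup entry is less invasive than uninstalling the vendor software, and it's trivially reversible if the applet turns out to be needed.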
Finally, I rebooted the server, and the problem vanished. The administrator signed my time sheet and wished me well. As I drove back to the office, I put my trusty Windows technical CD back in the player. For me, it was just another day of tracking down technical problems, dispelling customer qualms, and relearning something interesting about Windows.