I ran into such a bizarre problem last month that I felt compelled to write about the experience. One of my clients notified me that the Exchange databases on the company's enterprise Exchange Server 2003 server were dismounting. The client could manually remount the databases, and they would stay mounted for a few days, but then they would mysteriously dismount again. I first thought that the database stores were corrupted, but because I didn’t want to cause more downtime during the week, I waited until the weekend to troubleshoot the problem.
When I reviewed the server, I noticed that the Windows 2003 Enterprise Edition server didn't have Service Pack 1 (SP1) installed on it, although Exchange 2003 did have SP2 installed. I had a VPN connection into the client’s network, so I remotely installed Windows 2003 SP1 and rebooted the server. That’s when the fun began. After waiting 15 minutes for the server to reboot, I was still unable to access the server. I didn’t have physical access to the building, so I had to wait until Monday to get on site and see what happened.
When I arrived on site, the server was stuck in a reboot loop. The server is an HP ProLiant ML370. It would start the boot process, show the initial splash screen, then suddenly reboot. On the front of the server, the LED with the box and jagged line would turn red when the server rebooted. I later discovered that this LED indicates a problem with the Internal System Health. I tried to boot the server in Safe Mode, but the server still had the same problem, although the log indicated that the last driver to load was acpitabl.dat. I searched Google for acpitabl.dat and found that other people were having the same problem after installing Windows 2003 SP1. One possible solution was to change update.sys and cpqcissm.sys (the HP Disk Array driver) to the original versions. So using the Recovery Console, I changed these files, but I still had the reboot problem.
I called HP Technical Support hoping it had seen this problem before. Support staff said they had and suggested an OS reinstallation (repair). I ran the Windows 2003 setup and tried repairing the Windows 2003 installation. After the repair process was completed, I restarted the server and it crashed with the error C000021A - The Session Manager Initialization Failed. I decided to attempt a complete reinstallation of the OS, so I formatted the C drive and installed a fresh copy of Windows 2003. After installing the OS, I was able to boot the server, but I then discovered that the backup of this server was missing the System State. This server was a domain controller (DC), but without the System State, I would be unable to restore Active Directory (AD) onto this server. You shouldn't make a busy Exchange server a DC, so I decided to leave the Exchange server as a member server and moved the DC function to a different server. I installed SP1 and successfully rebooted the server. Then I installed the latest HP Windows 2003 Support Pack (7.40) and rebooted again. This time the server went back into the same reboot loop. Because the problem returned only after the HP Support Pack installation, it was probably hardware specific and didn’t have anything to do with Windows 2003 Service Pack 1.
I called HP again with the additional troubleshooting information. Support staff had me open the server case and look at all of the LEDs on the motherboard to see whether any of them turned from green to amber. This particular server had the redundant fan kit installed, and one of the fans was preventing me from seeing an LED on the motherboard, so I temporarily removed that fan. I carefully observed the LEDs on the motherboard and the server booted! Could the fan be the problem?
I tried moving one of the good fans to the slot where I removed the bad fan, and the server still booted, indicating that the motherboard was probably OK. For an additional test, I installed the bad fan in a different fan slot and the server started rebooting by itself again. I’ve seen bad fans before, but usually the LED on the fan turns red and the fan refuses to work. This fan was still running and its LED was still green, yet it caused the server to reboot. The bad fan was probably sending an incorrect signal, such as an overheating condition, to the server, which caused the server to reboot. This made more sense when I observed the booting process on the server. When an HP server first boots, the fans come on full blast. During the boot process, the HP fan drivers load, sense the temperature on the server, and typically reduce the fan speed. The server was rebooting at about the point where the fan speed slows down during a normal boot. I left the bad fan out and ordered a replacement from HP.
Because I had formatted the C drive as part of the troubleshooting process, and didn’t have the System State backup, I now had to rebuild the Exchange server. I ran the Exchange 2003 setup with the /disasterrecovery switch. After Exchange was reinstalled, the services refused to start. This is a common problem when you rejoin a server to the domain, because the server's SID changes. Using ADSIEdit.msc (included with the Windows 2003 Support Tools), go to CN=Configuration, DC=<domain_name>,DC=<domain_extension>, CN=Services, CN=Microsoft Exchange, CN=<Exchange_Organization>, CN=Administrative Groups, CN=<Administrative_Group_Name>, CN=Servers, CN=<Server_Name>, right-click and select Properties. Click the Security tab; you'll probably see the original SID value of the server before you rejoined the domain. You can delete this SID and grant full rights to the Exchange server by clicking Add, entering the Exchange server name, and granting Full Control to the AD Exchange Server object. After AD replication, I was able to successfully start the Exchange services. I had to reinstall the virus software, backup software, and other applications that were originally installed on the Exchange server. Make sure to go into the Exchange System Manager (ESM) and set all the Exchange databases to mount when the Exchange server starts. Fortunately, all the Exchange databases were located on the D and E drives, which were still intact. The Exchange server has been stable since the server rebuild. The replacement fan arrived the next day and was installed without any problems.
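The ADSI Edit path described above is read from the top of the tree down, while the same object's distinguished name (DN) is written in the opposite order, starting with the server. As a minimal sketch for anyone who needs the DN as a single string (for example, to paste into an LDAP tool), the following Python function assembles it. The function names and the sample values (EXCH01, Contoso, contoso.com) are hypothetical placeholders, the escaping is a simplified version of the RFC 4514 rules, and the code assumes the DNS name passed in is that of the forest root domain, which holds the Configuration container.

```python
def escape_rdn_value(value: str) -> str:
    """Backslash-escape characters that are special in a DN attribute value.

    Simplified take on RFC 4514: escapes " + , ; < > \\ anywhere,
    '#' or a space at the start, and a space at the end.
    """
    out = []
    last = len(value) - 1
    for i, ch in enumerate(value):
        special = ch in '",+;<>\\'
        leading = i == 0 and ch in "# "
        trailing = i == last and ch == " "
        out.append("\\" + ch if (special or leading or trailing) else ch)
    return "".join(out)


def exchange_server_dn(server: str, admin_group: str,
                       organization: str, forest_dns_name: str) -> str:
    """Build the DN of an Exchange 2003 server object in the Configuration partition.

    The container order is the reverse of the ADSI Edit navigation path:
    the server comes first and the domain components come last.
    """
    containers = [
        server,
        "Servers",
        admin_group,
        "Administrative Groups",
        organization,
        "Microsoft Exchange",
        "Services",
        "Configuration",
    ]
    rdns = [f"CN={escape_rdn_value(name)}" for name in containers]
    # Each label of the DNS name becomes one DC= component.
    rdns += [f"DC={escape_rdn_value(label)}" for label in forest_dns_name.split(".")]
    return ",".join(rdns)


if __name__ == "__main__":
    # Placeholder values; substitute the real server, group, and organization names.
    print(exchange_server_dn("EXCH01", "First Administrative Group",
                             "Contoso", "contoso.com"))
```

The function only formats a string. It doesn't contact AD or change any permissions, so the Security tab changes still have to be made in ADSI Edit as described.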
Looking back, if I had determined earlier that the fan was bad, I could have avoided rebuilding the Exchange server. However, it was pure luck that I happened to remove the old fan to view the LEDs on the motherboard and discovered that the fan was bad. Initially it looked like Windows 2003 SP1 caused the server to crash, but it ended up being the bad fan that prevented the server from booting. When troubleshooting problems like this, look at the obvious first, but don’t immediately rule out other factors that might be causing the problem.