When an Exchange Cluster Crashes

A few weeks ago, I received a call from a client notifying me that the company's Microsoft Exchange cluster had crashed. This was a big deal because more than 2000 users were using this Exchange cluster. The cluster is located at a Colorado site, and the company is finishing a migration from Exchange Server 5.5 to Exchange Server 2003.

The Exchange cluster lives in a parent domain that has four child domains. A domain controller (DC) for the parent domain, which is also a Global Catalog (GC) server, was already in place at the same site as the Exchange 2003 cluster. Earlier, on Thursday, a new DC for one of the child domains was brought up in the same site as the Exchange cluster. However, this new DC wasn't made a GC, and I wasn't made aware of it until later. On Friday, the company rebooted the parent domain DC to install some critical updates.

Then the client moved mailboxes from the Exchange 5.5 servers to the new Exchange 2003 cluster. All earlier mailbox moves had worked fine, but after this batch was moved, the users couldn't access their mailboxes. The mailboxes in this batch were marked with a red X and appeared as though they had been deleted, yet the client couldn't reconnect them to a valid Active Directory (AD) user account.

The client then noticed that the Message Transfer Agent (MTA) wasn't started on one of the cluster nodes. This particular cluster has three active nodes and one passive node that's a backup for the active nodes. As you might know, only one node in an Exchange cluster can run the MTA, so a stopped MTA on the other nodes is normal. The client, however, thought this was a problem and rebooted all the nodes in the Exchange cluster. That's when the fun began. After the reboot, the Exchange Services failed to start on all nodes, including the passive node.

When I received the client's phone call, I told the client to start the Exchange Services manually, but they didn't start. I then told the client to sort the services by status and look for services that were set to Automatic but hadn't started. The World Wide Web Publishing Service also hadn't started. The Exchange Services depend on the World Wide Web Publishing Service, so no World Wide Web Publishing Service, no Exchange Services. Often the World Wide Web Publishing Service fails to start because of a corrupt metabase. The metabase is the database in which Microsoft IIS stores its configuration. For more information, refer to the Microsoft article "How to troubleshoot IIS metabase corruption on a Windows 2000 Server-based computer that is running Exchange 2000 Server or Exchange Server 2003" (http://support.microsoft.com/?kbid=843093).
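The service check described above can be sketched in Python. This is only an illustration under stated assumptions: the `Service` record, the `snapshot` data, and the service named "Exchange service (example)" are hypothetical, and the code filters an in-memory list rather than querying a real server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical record for one row of the Services console."""
    name: str
    start_mode: str          # "Automatic", "Manual", or "Disabled"
    state: str               # "Running" or "Stopped"
    depends_on: tuple = ()   # names of services this one needs


def failed_automatic(services):
    """Return services that are set to Automatic but are not running."""
    return [s for s in services
            if s.start_mode == "Automatic" and s.state != "Running"]


def blocked_by(services, failed_name):
    """Return names of services that directly depend on failed_name."""
    return [s.name for s in services if failed_name in s.depends_on]


# Made-up snapshot that mirrors the situation in the article.
WWW = "World Wide Web Publishing Service"
snapshot = [
    Service(WWW, "Automatic", "Stopped"),
    Service("Exchange service (example)", "Automatic", "Stopped", (WWW,)),
    Service("Some manual service", "Manual", "Stopped"),
    Service("Some healthy service", "Automatic", "Running"),
]

for svc in failed_automatic(snapshot):
    print(svc.name, "is Automatic but not running")
print("Blocked by", WWW + ":", blocked_by(snapshot, WWW))
```

The point of the second function is the same one made above: a stopped service that other services depend on is the first thing to fix, because its dependents can't start until it does.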

I've seen this problem before, so I opened the IIS Manager, right-clicked the server, and selected All Tasks, Backup/Restore Configuration. After backing up the current configuration, I restored a 2-month-old metabase backup on one of the cluster nodes and attempted to restart the World Wide Web Publishing Service. It started! Then I tried to start the Exchange Services, and they started! I opened the Exchange System Manager (ESM) and looked at the Exchange databases, but they weren't mounted. When I attempted to mount the Exchange stores manually, I received an error stating that the database object couldn't be found or that I had to wait for AD replication to complete. After about 5 minutes, the Exchange Services and the World Wide Web Publishing Service stopped. Most likely, the databases' failure to mount caused the passive node to attempt to take over for the failed active node, and the passive node failed as well. Fortunately, the company had good backups going back more than a month, so I tried to restore the metabase from tape. I selected the backup from the Wednesday before the DC for the child domain was introduced into the network.

Unfortunately, this version of the metabase had the same problem. I could start the World Wide Web Publishing Service and Exchange Services, but the databases refused to mount. We considered restoring from tape to one of the cluster nodes, but I wasn't comfortable trying this unless I was on location, so I caught a flight to the client site.

When I arrived on site, I called Microsoft Product Support Services (PSS) to gain some insight into the problem. After explaining the problem, I asked for PSS's blessing to attempt a restore on one of the cluster nodes. PSS suggested I try running DomainPrep again in AD. I didn't think doing so would solve the problem. Worse, it would cause AD replication to start on all DCs across the WAN. I decided to try some other troubleshooting steps before calling back.

I first ran a backup on one of the cluster nodes, then tried to restore the cluster node as of Wednesday evening. Even after this restore, the problem persisted: the services would start and soon stop as the cluster attempted a failover to the passive node (whose services failed as well), causing the entire Exchange cluster to fail.

I knew that something global to the cluster was causing it to fail. I started the ADSI Edit tool on one of the cluster nodes and went to Configuration [server name], CN=Configuration, DC=, DC=, CN=Services, CN=Microsoft Exchange and found a duplicate entry for the Exchange organization. The duplicate entry was obvious because it had the Exchange organization name with a SID value appended to the end of it. Also, the duplicate Exchange organization entry didn't have any sub-entries. I deleted the duplicate entry and waited for replication to take place. After replication completed, I tried to start the World Wide Web Publishing Service and Exchange Services on one of the active cluster nodes. Not only did the services start, but the mail stores mounted successfully. Evidently the cluster node was trying to use the duplicate Exchange organization entry in AD, which caused the mail store mount to fail.
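The two clues I used to spot the duplicate (a name that is the real organization name plus an appended value, and no sub-entries) can be expressed as a small Python check. This is a rough sketch, not a tool that talks to AD: the function name, the "Contoso" organization name, the appended value, and the child counts are all hypothetical, and the input is a hand-built dictionary standing in for what you'd read off the ADSI Edit tree.

```python
def find_suspect_duplicates(children):
    """Flag entries that look like duplicates of a sibling entry.

    children maps each entry name under CN=Microsoft Exchange to its
    number of sub-entries. An entry is suspect when its name starts
    with a sibling's name, has extra text appended, and is empty.
    """
    suspects = []
    for name, sub_entries in children.items():
        if sub_entries != 0:
            continue  # a populated entry is unlikely to be the stray copy
        for other in children:
            if other != name and name.startswith(other):
                suspects.append(name)
                break
    return suspects


# Hypothetical view of the container; names and counts are made up.
exchange_container = {
    "Contoso": 5,
    "Contoso S-1-5-21-0000": 0,   # placeholder for the appended value
    "Some Other Container": 2,
}

print(find_suspect_duplicates(exchange_container))
```

A check like this only narrows down candidates. Before deleting anything in the Configuration partition, confirm by eye which entry is the populated original, as I did.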

I think the introduction of the DC for the child domain that wasn't a GC caused the duplicate entry for the Exchange organization in AD. In theory, when an Exchange server uses a DC that's not a GC, that DC is supposed to query a DC that's a GC for global AD information. For whatever reason, I think the DC that wasn't a GC couldn't contact a GC and created the duplicate entry in AD. As a general rule, you should always place a DC that's a GC in the same site as your Exchange cluster. Unfortunately, this new DC was brought up without my knowledge.

The scary part of this whole experience is how a minor corruption in AD can cause the entire Exchange cluster to fail (Microsoft, are you listening?). In hindsight, I'm glad I didn't run DomainPrep again, because it probably would have extended the duplicate Exchange organization branch in AD. This would have made it much more difficult to determine which entry was the correct entry and which entry was the duplicate.

If I hadn't spotted the duplicate entry in AD, the only way I could've fixed this problem would have been to manually uninstall the Exchange cluster completely and start from scratch, causing the client more downtime. If I had gone down this road, I hope I would've spotted the duplicate entry when I attempted to clean up AD with ADSI Edit, because this is one of the necessary steps when you roll back a failed Exchange installation. For more information on how to manually roll back a failed installation of Exchange, refer to the Microsoft article "How to roll back a failed migration from Exchange Server 5.5 to Exchange 2000 Server or to Exchange Server 2003" (http://support.microsoft.com/?kbid=839356). Clusters do go down, so plan accordingly.
