Trauma for Exchange 2013 servers when Managed Availability goes bad

Managed Availability is one of the more interesting and perhaps compelling new features introduced in Exchange 2013. The idea is simple: to incorporate the ability to monitor, detect, and fix common problems that occur in a messaging system within the product so that it can, in a sense, take care of itself.

All new technology, even that which is extensively tested by being deployed as a fundamental part of the management framework used for Exchange Online, is prone to teething problems. Not all of the lessons extracted from the datacenter can be applied to the on-premises world and not all on-premises configurations can be replicated or tested within a massively scalable multi-tenant datacenter deployment as used by Office 365.

And so we come to some problems with Database Availability Groups reported by on-premises customers after the deployment of Exchange 2013 RTM CU2. CU2 was released on July 11, 2013, re-released (V2) on July 29 to fix a public folder permissions bug, and then ran into the MS13-061 security update fiasco on August 14. Those who have downloaded and installed the various kits and patches released by Microsoft might be forgiven if their faith in Microsoft’s testing processes has wavered just a tad. Microsoft responded by announcing that they would delay the release of Exchange 2013 RTM CU3 “to ensure that we have enough run time testing”.

But then the wheels seemed to come off the wagon when reports of DAG member servers experiencing regular BSODs started to circulate. To be fair, the problem had been reported well before CU2 was available but the pace of problems accelerated following the release of CU2 and CU2 V2. The problem only appears when Exchange is deployed inside multi-domain Active Directory forests. It's not clear if the problem occurs for standalone servers because no public reports have been filed to indicate that this might be so. I only run multi-role DAG member servers inside a single-domain forest myself, so I have not seen the issue.

Microsoft’s Scott Schnoll responded in the thread with a detailed description of how to disable the ActiveDirectoryConnectivityConfigDCServerReboot responder, a component of Managed Availability that handles problems that might occur in the connection between Exchange and the domain controller that a server uses to retrieve configuration information. Exchange stores a lot of information about the organization, servers, and all manner of settings in the Microsoft Exchange container under Services in the Active Directory configuration naming context, a setup that has served Exchange well since it was first used in Exchange 2000. Active Directory is not the problem. As SCOM reports:

The AD Health Set has detected a problem with <2013 Server> at 8/22/2013 7:17:10 PM. The Health Manager is reporting that ActiveDirectoryConnectivityConfigDCProbe/Server Failed with Error message: Received a referral to <contoso.com> when requesting <abc.contoso.com> from <dc1.contoso.com>.

This information tells us that the Health Manager service has detected that the ActiveDirectoryConnectivityConfigDCProbe probe failed when it attempted to read configuration information from Active Directory. In this case, it seems like the probe failed for no good reason, leaving Exchange with an apparent problem to resolve. Not being able to retrieve accurate configuration data is a catastrophic problem for Exchange because it can lead to messages being routed to the wrong place and myriad other problems. Managed Availability therefore attempted to rectify the problem by invoking whatever actions are defined to cure such a situation and rebooted the server (in case of doubt, a nice server reboot clears everything out and starts afresh). Hence the BSODs.
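To see what Managed Availability has recorded for a server before it resorts to a reboot, the health cmdlets in the Exchange Management Shell can be queried. The following is only a sketch: the server name EX01 is a placeholder, and the health set name "AD" is taken from the SCOM alert quoted above.

```powershell
# EX01 is a placeholder server name; substitute a real DAG member.
$server = 'EX01'

# Summarize the state of every health set on the server.
Get-HealthReport -Identity $server

# List the monitors that make up the AD health set and their current alert state.
Get-ServerHealth -Identity $server -HealthSet 'AD' |
    Format-Table Name, AlertValue, HealthSetName -AutoSize
```

A monitor in the AD health set that shows an unhealthy alert value while Active Directory itself is fine would be consistent with the spurious probe failure described here.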

The fix is to use the Add-GlobalMonitoringOverride cmdlet to create an override that disables the ActiveDirectoryConnectivityConfigDCServerReboot responder, so that Managed Availability no longer reboots the server when the probe fails. This command does the trick for Exchange 2013 RTM CU2 by specifying that the override only applies to build number 15.0.712.24.

Add-GlobalMonitoringOverride -Identity Exchange\ActiveDirectoryConnectivityConfigDCServerReboot -ItemType Responder -PropertyName Enabled -PropertyValue 0 -ApplyVersion "15.0.712.24"
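After creating the override, it is worth confirming that it was recorded, and it can be removed later once a fixed build is in place. This is a sketch that reuses the identity, item type, and property name from the command above; nothing else is assumed about the environment.

```powershell
# List all global monitoring overrides defined for the organization
# and check that the responder override appears in the output.
Get-GlobalMonitoringOverride | Format-List

# Tidy up later: remove the override once it is no longer needed.
# The identity, item type, and property name must match those used to create it.
Remove-GlobalMonitoringOverride `
    -Identity 'Exchange\ActiveDirectoryConnectivityConfigDCServerReboot' `
    -ItemType Responder `
    -PropertyName Enabled
```

Because the override was created with -ApplyVersion, it only applies to build 15.0.712.24, so removing it after an upgrade is housekeeping rather than a requirement.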

Apparently Microsoft is working on a more permanent fix for the problem. Who knows... we might see it in Exchange 2013 RTM CU3.

[Update 19 Sept: The bug is formally described in KB2883203]

Some might ask why Microsoft’s commitment to deploy and use code in Office 365 before releasing it to customers didn’t catch a problem like this. The answer is simple. Customers run an on-premises environment where an Active Directory forest supports a single Exchange organization. Office 365 does not. Therefore the code path that caused the ActiveDirectoryConnectivityConfigDCProbe probe to misbehave in the way that it did was never exercised by Office 365. The question then is why Microsoft’s dogfood on-premises deployment didn’t catch the problem. We don’t have a good answer to that question right now.

Doubts persist as to the quality controls that surround how Microsoft releases new builds of Exchange to its paying on-premises customers. That is both sad and regrettable. Until Microsoft gets its quality under control, you should play safe and a) test any new code that is released to make sure that you have a good chance to detect any lurking problems and b) wait at least six weeks before deploying any new version of Exchange 2013 into production. Give someone else the chance to be the hero running software on the bleeding edge.

Follow Tony @12Knocksinna
