Like many others, I eagerly anticipated the arrival of Exchange 2013 SP1. The first service pack of any Microsoft server application has a special resonance with the installed base, most of whom are reluctant to deploy the RTM version of any software. Far better, the adage goes, to let others blaze a trail to glory. Wise administrators wait until software settles down, bugs are found and fixed, features are completed and enhanced, and sufficient knowledge exists to represent “best practice” for deployment and operation. By happy coincidence, these conditions appeared to come together when SP1 made its debut just a week ago.
I was therefore dismayed to read a note from Microsoft to the Exchange MVPs notifying us that a hotfix was to be released to fix a problem with third-party transport extensibility agents. The problem prevented the Transport services from restarting following an upgrade to Exchange 2013 SP1, and also prevented the agents from being installed on Exchange 2013 SP1. The fix is now available from Microsoft.
Transport is the one place in Exchange that you can absolutely guarantee a message will pass through on its outbound or inbound journey. That’s why transport rules are so effective in applying things like corporate disclaimers or why Microsoft has used transport rules for its Data Loss Prevention feature. Third parties build agents using the Transport Agents SDK to integrate products that need to interact with messages before or after they leave mailboxes. Anti-malware or anti-spam products are common examples that use transport agents.
The agents depend on .NET Framework code provided by Microsoft to integrate with Exchange. Apparently, a late-breaking fix applied during Exchange 2013 SP1 development caused the issue. The fix was intended to address a different problem (the MS13-061 security hotfix requires the installation media to uninstall), but a mistake in it introduced a badly formatted XML comment line in two assembly redirection policy files. These files allow code to work against different versions of Exchange. The result is that any attempt to load the transport agents into the Global Assembly Cache (GAC) is rejected, which in turn makes them unavailable to Exchange.
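To illustrate why a single badly formed comment is enough to make a whole policy file unusable, here is a minimal Python sketch. The policy text, assembly name, and version numbers are invented for illustration and are not the contents of the real Exchange files, and the broken comment shown (a terminator mistyped as `->`) is only one hypothetical way a comment can be malformed. The point is simply that an XML parser rejects the entire document rather than skipping the bad line.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical redirection policy; names and versions are made up.
GOOD_POLICY = """<configuration>
  <!-- hypothetical redirection policy -->
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <assemblyIdentity name="Example.Assembly" culture="neutral" />
        <bindingRedirect oldVersion="15.0.0.0-15.0.0.1" newVersion="15.0.0.2" />
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
</configuration>"""

# Assumed defect for illustration: the comment terminator is mistyped,
# so the comment never closes and the rest of the file is swallowed.
BAD_POLICY = GOOD_POLICY.replace("-->", "->", 1)

if __name__ == "__main__":
    print("good policy well-formed:", is_well_formed(GOOD_POLICY))
    print("bad policy well-formed:", is_well_formed(BAD_POLICY))
```

A check of this kind, run against configuration files before release, is cheap; whether it would have caught the actual defect depends on details of the real files that are not given here.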
The problem affects any third-party product that uses transport agents to integrate with Exchange, including well-known and widely used products such as TrendMicro Antivirus, Code Two, Symantec Anti-Virus, Exclaimer, and ORF anti-spam. Special credit must go to Peter Karsai of Vamsoft (the developers of ORF), who diagnosed and reported the problem on February 27 in the Exchange development forum. Reports of problems with the transport services in upgrades also surfaced in the comments to the EHLO blog post about SP1. On March 4, Code Two were the first company to provide full details of the required fix when they posted this article.
You won’t experience the problem if you upgrade a server that doesn’t use transport agents.
Fortunately, the fix – as is often the case with irritating problems like this – is simple. Microsoft has developed a PowerShell script to address the problem. Those who use a transport agent can incorporate the script into the installation process for Exchange 2013 SP1 or apply it afterward. The problem has also been corrected in the code base and will therefore not appear in the next cumulative update. I would have preferred Microsoft to rerelease SP1 with an integrated fix but understand that this would have been a slower and more complicated process.
The hopes of software engineering often founder on the twin rocks of expectation and dates. Expectations of new features and a high-quality SP1 release create pressure, which is compounded by the need to meet dates agreed within Microsoft and with its partners. In this case, Microsoft actually delayed the release of SP1 to allow for better validation and testing, so the fact that this bug has emerged is especially hard felt by the development group.
Whenever problems like this appear, people leap to the conclusion that Microsoft development processes are incapable of shipping robust and reliable software. It is likely that critics will level four charges against Microsoft:
- The automated test machinery used by Microsoft to validate new builds of Exchange does not result in high quality software. You can argue that this is true because multiple update releases of Exchange have experienced problems over the last few years. However, it’s also true that Exchange sits at the center of a massive ecosystem and it is very difficult to test all known combinations of software that might be deployed in the field. On the other hand, it should be possible to test against the top ten third-party products known to be deployed alongside Exchange by customers.
- The Technology Adoption Program (TAP) used by Exchange to validate early builds with customers, third parties and independent experts is not fit for purpose because the kind of bug found here should have been detected before SP1 shipped. In fact, the bug was found by a third party and reported after SP1 was made available. Build 847.32 is SP1 and (as revealed by Code Two), the bug appears to have a version stamp of 847.30. The Exchange TAP is a good program that has a strong track record of helping to find bugs and problems before code is made available to customers. It’s possible that the cadence of software releases might now be so fast that it is impossible for late-breaking bugs to be found by third parties by testing their products against beta builds released by Microsoft before a cumulative update is released. Every third party who participates in the TAP has a responsibility for testing their product against builds as they are released by Microsoft and if this doesn’t happen (often for good reason), then a potential opportunity for problems opens up.
- The triage process used to assess bugs is broken. I don’t think this assertion is true. The bug was reported to Microsoft on February 27, two days after SP1 shipped. Microsoft didn’t get a chance to decide whether to fix the problem or not (put the bug through the triage process) before SP1 appeared.
- Too much attention is paid to the needs of the service rather than to on-premises deployments. Office 365 is a huge and important environment for Microsoft, but it is a very regimented and structured deployment that bears little resemblance to anything you will find in an on-premises installation. In this case, Office 365 doesn’t use the kind of third-party products that depend on transport events and would never have encountered the problem. Microsoft might have been lulled into a false sense of security by the good results that they saw when new code was introduced to Exchange Online, especially in new technology that is proven in the service before it is delivered to on-premises customers (like the simplified DAG). Although I acknowledge the enormous difficulty involved in testing interaction with Exchange, I think Microsoft needs to pay more attention to add-on software when they validate new software before customer release.
In summary, a late-breaking bug that was reported after SP1 shipped has caused considerable embarrassment and reputational damage to Exchange. The good thing is that the development group has stepped up to the plate with early disclosure and a swift response to address the problem. Whether this will have the hoped-for reaction from customers is yet to be seen. In the meantime, I hold to my opinion that Exchange 2013 SP1, flaws and all (there will be more bugs – such is the case with software), is worth deploying.
Microsoft is both blessed and burdened by the size of the Exchange on-premises installed base. The size of the base is a huge strength because of the revenues that it generates. The number of add-on products that run alongside Exchange makes the base complex to fully understand and incredibly difficult to move forward. Microsoft’s natural desire for leadership in cloud services needs to be tempered by a little more care for those who elect to remain on-premises.
Follow Tony @12Knocksinna