Exchange 2010 SP2 RU1: A CAS glitch?

When the first Roll-up Update (RU1) for Exchange 2010 SP2 appeared this week, some commentators were struck by the fact that Microsoft included 58 individually documented fixes in the release. That rather large number of fixes didn’t worry me because I had successfully tested RU1 and it seemed very stable in my environment. Accordingly, I went ahead and wrote up an endorsement of RU1 and then left for a short vacation.

How quickly things change. Ever since, my mailbox has been humming with new messages describing a somewhat esoteric problem. It affects deployments where Client Access Servers (CAS) in Internet-facing Active Directory sites have to proxy incoming client traffic to CAS servers located in other, internal, Active Directory sites (see this TechNet article for details). The problem? Quite simply, Exchange refuses to proxy the traffic from incoming Outlook Web App (OWA) connections to the CAS servers in the internal sites. The upshot is that OWA connections fail. Other protocols seem to connect just fine, so this is a strange error. Eeek!

In my defense, while I still think SP2 RU1 is solid (based on my installation), at least my original article contained the caveat:

“Should you deploy Exchange 2010 SP2 RU1 now? I believe that you should, with the caveat that you should first test the new software by running it within an environment that replicates the essential characteristics of your production systems. That way you’ll find out whether the software works for you and make sure that you don’t encounter one of the edge cases that cause problems for just your users.”

The reports that I have seen so far indicate that the problem surfaces after the CAS servers in the Internet-facing sites are upgraded to Exchange 2010 SP2 RU1 and start to communicate with their counterparts in the internal sites that are not running RU1. A typical error as seen from a browser running the OWA client is:

Exception
Exception type: Microsoft.Exchange.Clients.Owa.Core.OwaAsyncOperationException
Exception message: ProxyProtocolRequest async operation failed

Call stack
Microsoft.Exchange.Clients.Owa.Core.ProxyProtocolRequest.EndSend(IAsyncResult asyncResult)
Microsoft.Exchange.Clients.Owa.Core.ProxyEventHandler.ProxyLogonCallback(IAsyncResult asyncResult)

Inner Exception
Exception type: Microsoft.Exchange.Clients.Owa.Core.OwaInvalidOperationException
Exception message: Invalid user context cookie found in proxy response

The problem goes away after the CAS servers in the internal site are upgraded to RU1, so that's the obvious and easiest solution for most companies. Microsoft is well aware of the issue and I know that the Exchange development group is working hard to track down the problem and issue a fix. I'm sure that some hard words were shouted in Redmond, WA when this issue popped up!
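For anyone who wants to check whether a deployment is in the mixed state described above, here is a minimal Python sketch. It assumes you already have an inventory of CAS servers with their build numbers, gathered however you normally do it. The `CasServer` class, the `at_risk_servers` function, and the sample server names are all hypothetical, and the RU1 build number is my assumption and should be verified against Microsoft's published build list before you rely on it.

```python
from dataclasses import dataclass

# Assumed build number for Exchange 2010 SP2 RU1.
# Verify against Microsoft's published build list before use.
RU1_BUILD = (14, 2, 283, 3)


@dataclass(frozen=True)
class CasServer:
    """Hypothetical inventory record for one Client Access Server."""
    name: str
    site: str
    internet_facing: bool
    build: tuple


def parse_build(text):
    """Turn a dotted build string such as '14.2.247.5' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def at_risk_servers(servers, ru1_build=RU1_BUILD):
    """Return internal CAS servers still below RU1 when at least one
    Internet-facing CAS is already at RU1 or later (the mixed state
    in which the OWA proxy failures were reported)."""
    edge_upgraded = any(
        s.internet_facing and s.build >= ru1_build for s in servers
    )
    if not edge_upgraded:
        return []
    return [
        s for s in servers
        if not s.internet_facing and s.build < ru1_build
    ]


if __name__ == "__main__":
    # Sample inventory; names, sites, and builds are made up for illustration.
    inventory = [
        CasServer("EDGE-CAS1", "Internet", True, parse_build("14.2.283.3")),
        CasServer("HQ-CAS1", "HQ", False, parse_build("14.2.247.5")),
        CasServer("HQ-CAS2", "HQ", False, parse_build("14.2.283.3")),
    ]
    for server in at_risk_servers(inventory):
        print(f"{server.name} in site {server.site} should be upgraded to RU1")
```

The check only flags internal servers once an Internet-facing server has been upgraded, because the reports so far point to that combination rather than to older builds on their own.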

How could yet another bug find its way into a roll-up update? Well, an assumption (that all CAS servers would be quickly upgraded to RU1) might have resulted in a scenario not being exercised in Microsoft's tests. Or perhaps no one tested OWA connections that flow through an Internet-facing Active Directory site to an internal site that holds mailbox servers. I’d be surprised if either theory turned out to be the case because Microsoft’s test matrix is extensive and is designed to catch regressions caused by incompatibility between different versions of Exchange. As I understand it, the test matrix also incorporates different client types, so it's puzzling that OWA has been affected in this manner. Whatever the reason, the problem is in SP2 RU1 and has been encountered by the kind of large company that is most likely to deploy Exchange in multiple Active Directory sites with carefully managed Internet connections channeled through specific sites.

We have been down this road before: in 2011, Microsoft suffered quality problems with both RU3 and RU4 for Exchange 2010 SP1. However, I don’t think the current situation is in the same category of failure because this problem doesn’t affect user data, has a workaround (reroute traffic), and will only affect deployments that have reasonably complex configurations. On the other hand, it’s disappointing to see yet another RU issue bubble to the surface. Oh well, back to the drawing board.

