A curious incident occurred in Exchange Online over the last week or so when some European users who connect to their mailboxes using the venerable IMAP4 protocol reported that they couldn’t receive new mail. There’s nothing particularly notable about a glitch occurring with a messaging protocol. After all, Exchange supports quite a few (MAPI, Exchange Web Services, ActiveSync, IMAP4, POP3, and SMTP) to give customers as wide a choice of clients as possible.
What was remarkable in this case was how long Microsoft took to diagnose and fix the problem. The incident (EX41924) started at 12:00AM UTC on Monday, January 18 and was eventually closed off at 11:00AM UTC on Sunday, January 24. The start time is when Microsoft formally accepted that something was wrong rather than when users began to experience problems.
Microsoft cites the following as the preliminary root cause:
“As part of our efforts to improve service performance, an update was deployed to a subset of components which are responsible for obtaining the subscribed folder list. However, the update caused a code issue that prevented the list from being automatically loaded.”
In effect, the issue meant that IMAP4 users were unable to receive new mail for a complete working week. In addition, “affected users may have experienced limited functionality with third-party email clients that connect to the service via the IMAP protocol.” Although some people might use Outlook to access Exchange Online via IMAP4, the majority of IMAP4 clients are third-party. Given that IMAP4 is a very straightforward and simple protocol, I wonder what “limited functionality” amounts to in practice. Some users reported that it meant “unable to connect”.
In any case, a workaround existed for the complete period in that users had full access to their mailboxes through Outlook Web App or one of the other access protocols. However, some folks are pretty attached to IMAP4 clients such as Thunderbird, Opera Mail, or Evolution and might not have been amused by the advice to use a browser for a week.
IMAP4 is popular in the academic community, which is a pretty big battleground over cloud services between Microsoft and Google. It’s also popular with Linux users who want a desktop email client to access their corporate email. An interesting factoid included in the customer impact statement makes it sound like some large universities were impacted:
“a limited number of customers appeared to be impacted by this event. However, those customers affected likely had a large number of users experiencing impact.”
In terms of what caused the problem, it appears that a software update was deployed to Exchange Online mailbox servers (where all protocols are processed now) and failed. From the problem statement, we can reasonably infer that the issue prevented Exchange from returning a list of subscribed folders (folders that the user wants to synchronize with the client) when requested. As the Inbox is usually a subscribed folder, the result was that users could not receive new messages.
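To make the idea of a subscribed folder list concrete, here is a minimal Python sketch of how a client might ask an IMAP4 server for that list (the LSUB command) and check whether the Inbox is in it. It uses the standard library's `imaplib`. The helper names (`parse_lsub_line`, `subscribed_folders`, `inbox_is_subscribed`) are my own inventions for illustration, the parsing is deliberately simplified, and nothing here reflects how Exchange Online or any particular client actually implements the exchange.

```python
import imaplib
import re

# One LSUB response line looks roughly like: (\HasNoChildren) "/" "INBOX"
# Simplified pattern: flags, hierarchy delimiter, then the folder name.
_LSUB_LINE = re.compile(
    rb'\((?P<flags>[^)]*)\) (?P<delim>"[^"]*"|NIL) (?P<name>.+)'
)


def parse_lsub_line(line):
    """Extract the folder name from a single LSUB response line (bytes)."""
    match = _LSUB_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized LSUB line: {line!r}")
    name = match.group("name").decode("utf-8", errors="replace").strip()
    # Strip surrounding double quotes if the server quoted the name.
    if len(name) >= 2 and name[0] == '"' and name[-1] == '"':
        name = name[1:-1]
    return name


def subscribed_folders(conn):
    """Return the folder names the account is subscribed to.

    `conn` is an authenticated imaplib.IMAP4 / IMAP4_SSL object (or anything
    with a compatible lsub() method). Non-bytes items, such as the None that
    imaplib returns for an empty result, are skipped.
    """
    status, data = conn.lsub()
    if status != "OK":
        raise RuntimeError(f"LSUB failed with status {status}")
    return [parse_lsub_line(item) for item in data if isinstance(item, bytes)]


def inbox_is_subscribed(folders):
    """INBOX is case-insensitive in IMAP, so compare without regard to case."""
    return any(folder.upper() == "INBOX" for folder in folders)


def check_account(host, user, password):
    """Hypothetical usage: connect, log in, and report on the Inbox."""
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.login(user, password)
        return inbox_is_subscribed(subscribed_folders(conn))
    finally:
        conn.logout()
```

If the server returns an empty subscribed folder list, `inbox_is_subscribed` is false. A client that only synchronizes subscribed folders would then have nothing to fetch, which would be consistent with the symptom users described.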
Anyone who has worked with technology for any length of time is all too painfully aware that software updates have a nasty habit of going wrong. Microsoft has attempted to automate Office 365 maintenance operations to the nth degree but obviously things can still go wrong.
As a Post-Incident Report (PIR) isn’t available yet, I reached out to Microsoft to understand why the situation unfolded as it did and learned that a fix was actually available for some time before it was applied. The resolution was included in a service update that was already scheduled to be rolled out. Engineering management took a decision to let the update be applied instead of attempting to patch the problem. There’s a solid argument to back this decision as it’s better to apply tested updates using a tried and trusted method than making one-off changes.
However, Microsoft screwed up royally by failing to communicate what they were doing. Six updates for EX41924 are listed in the Service Health Dashboard. You’d expect that each update would convey a little more information containing additional background and details about the issue and how Microsoft plans to resolve it. But each update contains essentially the same text and is simply a cut and paste exercise featuring some tortuous English that’s a masterpiece in obfuscation.
Regrettably, Microsoft has not mastered the art of communication when it comes to explaining why things break and how they will make things better. It’s simply not good enough to keep on repeating the same thing time after time in the hope that customers will go away – or more likely, go asleep.
The net result of the failure to communicate is that you end up with uninformed comment of the type that has appeared in some other outlets (here's one example). Microsoft can't complain about this bad press because they have only themselves to blame.
A complete working week is far too long for customers to wait for Microsoft to explain how they were going to fix a problem. They can and should do better.
Follow Tony @12Knocksinna