Returning to the topic of Managed Availability: in my last post, I covered the logic and rationale that drove Microsoft to develop and implement such a capability in Exchange 2013 (and, possibly more importantly for their own purposes, in Exchange Online). This post looks at the implementation and how successful it has been to date.
The notion of self-maintaining software is compelling because it should result in lower administrative costs. Providing software with the ability to recognize when it is functioning correctly and when problems happen is usually the first step. This approach allows the software to report problems for administrators to resolve and is what happens with monitoring frameworks such as Microsoft System Center Operations Manager (SCOM) or HP Operations Manager. These frameworks are designed to consume and report alerts signalled by the products they monitor and provide value through the correlation engines that they include to sort really important issues from the data reported by the applications that they monitor. However, because they are general-purpose monitors, no automated capability exists to resolve the problems that cause the alerts.
Managed Availability is composed of a set of probes, monitors, and responders. The probes are designed to measure the essential signs of Exchange server health and to gather data representing health status. Probes can be as simple as grabbing some information made available by a Performance Monitor counter or as complex as a complete end-to-end emulation of how an end user interacts with Exchange, which is when Exchange generates “synthetic transactions” using the health mailboxes that you’ll find in every mailbox database (use Get-Mailbox –Monitoring to see these mailboxes).
The data gathered by probes is reviewed by the monitors and compared to what is known to represent a healthy server. In effect, it’s like a doctor reviewing the information provided by a thermometer (the probe) and deciding what to do by comparing the temperature against signs that we know to represent good health or a fever. If everything is good the monitor can continue to the next set of data, just like a doctor moves on to looking for another health sign if someone’s temperature is fine. However, if the temperature is elevated, the doctor will take action. In terms of Managed Availability, this is when a monitor uses a responder. Where a doctor might prescribe a drug to reduce the patient’s temperature, a responder might do something like restart an IIS app pool.
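To make the probe, monitor, and responder relationship concrete, here is a minimal Python sketch of the idea. It is a toy model only, not how Exchange implements Managed Availability: the `Monitor` class, the `scripted_probe` helper, the 500 ms threshold, and the "two failures before responding" rule are all assumptions invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Monitor:
    """Hypothetical monitor: compares probe samples with a threshold."""
    name: str
    probe: Callable[[], float]          # gathers one measurement
    threshold_ms: float                 # samples above this count as failures
    failures_before_response: int       # consecutive failures needed to act
    responder: Callable[[str], None]    # recovery action to invoke
    consecutive_failures: int = 0

    def evaluate(self) -> str:
        sample = self.probe()
        if sample <= self.threshold_ms:
            self.consecutive_failures = 0
            return "Healthy"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failures_before_response:
            # The "doctor prescribes a drug" step: hand off to the responder.
            self.responder(self.name)
            self.consecutive_failures = 0
        return "Unhealthy"


def scripted_probe(samples: List[float]) -> Callable[[], float]:
    """Hypothetical probe that replays pre-recorded latency samples."""
    iterator = iter(samples)
    return lambda: next(iterator)


actions: List[str] = []


def restart_app_pool(component: str) -> None:
    # Stand-in for a real recovery action; it only records what it would do.
    actions.append(f"restart app pool for {component}")


monitor = Monitor(
    name="OWA",
    probe=scripted_probe([120, 900, 950, 100]),
    threshold_ms=500,
    failures_before_response=2,
    responder=restart_app_pool,
)
states = [monitor.evaluate() for _ in range(4)]
```

In this run the monitor tolerates the first slow sample and only calls the responder after the second consecutive failure, which is the "compare against known good health, then act" pattern described above.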
Here’s where the doctor analogy runs into difficulty. Every second of every day, Managed Availability probes measure hundreds of different health signs on an Exchange 2013 server. Humans are far more complex than an Exchange server, but humans have the ability to tell a doctor what’s wrong. Exchange servers can’t, but they can generate many different measurements of all of the processes that run on a server in such a way that software (the Microsoft Exchange Health service and its worker processes) can make sense of the data and decide whether the server is healthy or not.
Different kinds of responders exist to handle different situations and monitors can escalate their reaction to problems if their initial attempt to respond fails to restore a component to full health. For example, let’s say that users cannot connect to Outlook Web App. The first response might be to recycle the OWA app pool in IIS. If that doesn’t work, the responder might decide to ask Active Manager to transition active databases to another copy to move work off the server.
If a server becomes really unhealthy (think of a human administrator tearing their hair out in despair), Managed Availability can force a system bugcheck, thus proving that the well-proven system management 101 practice of “a reboot can’t hurt” is still valuable today.
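The escalation described above (recycle the app pool, then move databases off the server, then force a reboot as a last resort) can be sketched as an ordered chain of responders. Again, this is a simplified illustration under stated assumptions: the `escalate` function, the responder names, and the simulated `state` dictionary are hypothetical and do not reflect the actual Exchange implementation.

```python
from typing import Callable, Dict, List, Tuple

# A responder is modelled as a (name, action) pair; both are illustrative.
Responder = Tuple[str, Callable[[], None]]


def escalate(responders: List[Responder],
             is_healthy: Callable[[], bool]) -> List[str]:
    """Try responders in order until the health check passes.

    Returns the names of the responders that were actually invoked.
    """
    tried: List[str] = []
    for name, action in responders:
        if is_healthy():
            break
        action()
        tried.append(name)
    return tried


# Simulated server state, standing in for real probe results.
state: Dict[str, bool] = {"healthy": False}


def recycle_app_pool() -> None:
    # In this simulation the first response does not cure the problem.
    pass


def move_active_databases() -> None:
    # In this simulation moving work off the server restores health.
    state["healthy"] = True


def force_reboot() -> None:
    state["healthy"] = True


chain: List[Responder] = [
    ("recycle OWA app pool", recycle_app_pool),
    ("move active databases", move_active_databases),
    ("force reboot", force_reboot),
]
tried = escalate(chain, lambda: state["healthy"])
```

Because the second response restores health in this simulation, the most drastic responder in the chain is never invoked.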
Because of the blizzard of data generated by Managed Availability and the sheer number of probes, monitors, and responders required to measure all of the interactions and processing that happen within a complex software product like Exchange, it can be confusing to understand just what data is being gathered and why. The Exchange product group have written a number of recent blog posts (such as “Customizing Managed Availability”) in an attempt to throw some light onto these components. Good as these posts are, the message appears not to be getting through in some parts, possibly because Managed Availability functions behind the scenes and you really don’t have to get to grips with it until Managed Availability misfires for some reason (as in the recent case affecting multi-domain deployments). Fortunately, such problems have not been regular occurrences.
Another way of thinking about Managed Availability is that it represents a form of best practice as dictated by the development group for how to deal with different situations that arise on an Exchange server. In effect, the developers have thought through what might happen on a server and how best to turn an unhealthy situation into a healthy situation, and encapsulated the thought process into probes (to gather the data), monitors (to make sense of the data), and responders (to fix any problems). It’s important to realize that this effort has not happened in a vacuum. Microsoft just happens to have tens of thousands of Exchange 2013 servers running within Office 365 and the feedback gained from the operational experience of those servers forms the backbone of Managed Availability.
The advent of Managed Availability has affected the relationship of Exchange with SCOM. In the past, Exchange management packs shipped a mass of diagnostic data to SCOM and let SCOM’s correlation and reporting engine sort out and highlight important information to administrators. Because Managed Availability now processes and reacts to the information gathered by probes on Exchange 2013 servers, the Exchange 2013 management pack doesn’t provide the same amount of data to SCOM. In effect, SCOM now receives a more filtered set of data to allow it to report the overall health of servers.
Although this might seem to suggest that products such as SCOM are less useful in an Exchange 2013 environment, it’s important to recognize that reporting frameworks deal with many more products than Exchange. The fact that Exchange 2013 is better at self-diagnosis and healing than previous releases is immaterial. Administrators still need to have dashboard-type monitors that cover all the important applications running in a datacenter. Managed Availability might signal a trend toward building more resilience into applications, but that does not remove the need to know what’s going on across the datacenter, to report how applications are used, and to gather information required for long-term capacity planning.
I like the notion of self-healing servers, which is why I consider Managed Availability to be the most important technical advance to have happened in Exchange 2013. It’s true that the framework has some rough edges and that it misfires from time to time, and that it probably consumes a tad too much in terms of resources. But I anticipate that Managed Availability will improve and mature over time and that soon we won’t worry too much about the resources that it uses because the value delivered will be so obvious.
The problem, of course, is that people have to get to Exchange 2013 before they can really appreciate the usefulness of Managed Availability, which returns us to migration, a joyous task that awaits many in 2014.
Follow Tony @12Knocksinna