The Finer Details of Exchange High Availability

The esteemed Scott Schnoll, famed speaker at many TechEd and other conferences around the world, recently tweeted a reminder about one of the most useful articles on Exchange 2010 High Availability, originally published in May 2011. Good information like this ages well, and it’s always a pleasure to review it, both to reinforce knowledge and to discover details that had previously escaped your attention.

Scott presents a number of misconceptions about Exchange high availability in the article. I especially like #4, which reviews how Active Manager makes use of the AutoDatabaseMountDial setting during automatic database transition. This is a property of a mailbox server that can vary from server to server within a deployment. Of course, the name of this setting doesn’t immediately tell you what it governs. It might be better named as AutomaticDatabaseTransitionThreshold (too long) or AutoDatabaseMountThreshold. In any case, the setting is named as it is and we are stuck with it.

What’s important here is how the setting is used. When a failover occurs, the potential always exists that the database copies that are candidates to be mounted and take over service to clients are not completely up-to-date. Exchange 2010 SP1 introduced a new feature called block mode replication that allows DAG member servers to replicate updates between each other as transactions occur in memory, rather than having to wait until complete transaction logs are available to be copied.

At 1 megabyte, Exchange transaction logs aren’t big and they fill quickly on all but lightly loaded servers (a feature called log roll is used to keep a certain level of transaction log motion on servers that don’t generate enough transactions to fill logs). Thus, DAG member servers can keep each other updated by shipping transaction logs around, and the small log size means that new logs are available for copying very soon after transactions occur. However, a log can contain data from multiple transactions, and the first transaction written to a log has to wait until that log fills before it can be replicated to the servers that host database copies. Block mode replication addresses the problem by replicating transactional data as soon as transaction log buffers fill, so there’s less chance that a server or disk outage will result in data loss.

But problems do happen. Exchange only switches into block mode when replication is deemed to be healthy and up-to-date (no logs in the copy queue). If network glitches occur or something else interferes with block mode replication so that the passive copies fall too far behind the active server (approximately 4 MB), Exchange switches back into traditional log file replication, and copy and replay queues can start to build. If a failure now occurs, Exchange has to figure out what tolerance exists to allow it to perform an automatic activation of a database copy. This is where AutoDatabaseMountDial comes in, as it tells Exchange how many log files can be missing (unavailable) before automatic activation becomes impossible. The default setting is 6 (otherwise known as Good Availability), meaning that Exchange is able to automatically activate a database copy even if up to six transaction logs (6 MB of data) are unavailable.
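To make the threshold concrete, here is a minimal Python sketch of the comparison described above. It is an illustration only, not how Active Manager is implemented: the function names and the idea of passing in a copy queue length are my own assumptions, and the real selection process considers more than this single number. The three dial values (Lossless = 0, Good Availability = 6, Best Availability = 12) are the documented ones.

```python
from enum import IntEnum


class AutoDatabaseMountDial(IntEnum):
    """Dial values; each number is how many logs may be missing."""
    LOSSLESS = 0
    GOOD_AVAILABILITY = 6   # the default
    BEST_AVAILABILITY = 12


LOG_SIZE_MB = 1  # Exchange 2010 transaction logs are 1 MB each


def can_automount(missing_logs: int,
                  dial: AutoDatabaseMountDial = AutoDatabaseMountDial.GOOD_AVAILABILITY) -> bool:
    """Hypothetical helper: True if a copy that is missing `missing_logs`
    transaction logs is still within the tolerance set by the dial."""
    return missing_logs <= dial


def max_data_at_risk_mb(dial: AutoDatabaseMountDial) -> int:
    """Hypothetical helper: upper bound on unavailable log data (in MB)
    that the dial setting tolerates for automatic activation."""
    return int(dial) * LOG_SIZE_MB


if __name__ == "__main__":
    for missing in (0, 6, 7):
        print(missing, can_automount(missing))
    print(max_data_at_risk_mb(AutoDatabaseMountDial.BEST_AVAILABILITY))
```

With the default dial, a copy missing six logs still qualifies for automatic activation while one missing seven does not. Turning the dial to Lossless means that any missing log blocks automatic activation.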

The detail that is often missed in this discussion is the role that another interesting feature called incremental resynchronization plays if the previously active database copy is available. Scott explains what happens using a good example, but think of it like this. The newly activated database copy has a hole in its head when it starts to provide service to clients. That hole can be filled by the missing transaction logs. When the failed server comes back online it might be able to provide the missing information from the data that it holds, assuming that some horrible storage outage hasn’t occurred. Exchange 2010 includes the necessary intelligence to recognize that the hole (or divergence) exists, look for the missing bits, and then figure out how best to fill the hole with those bits.

All software has its limits and it’s entirely possible that Exchange will not be able to retrieve the missing data or figure out how to patch things up through incremental resynchronization. But what I like is the thought that has gone into the detail of high availability with features like block mode replication, single page patching, and incremental resynchronization. It’s evidence of maturity within the implementation. The best thing is that Exchange’s high availability story is likely to improve over time as additional details are addressed to make data more resilient and recovery more automatic. It kind of gives you confidence in the future.

Follow Tony’s ramblings via Twitter.
