Q. What are lossy failovers and divergence in Exchange?

A. Lossy failovers apply to Cluster Continuous Replication (CCR), an Exchange 2007 high availability solution. They can occur because transaction logs are copied asynchronously from the active node to the passive node as each log is closed. If the active node crashes and becomes unavailable before the passive node has collected all of the closed transaction logs, the resulting failover is lossy.
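As a rough illustration of that timing gap, the following Python sketch models each node's closed logs as a set of generation numbers. The function name and the set-based model are assumptions made for illustration; they are not something Exchange exposes.

```python
def logs_missing_on_passive(closed_on_active, copied_to_passive):
    """Return the closed log generations the passive node never received."""
    return sorted(set(closed_on_active) - set(copied_to_passive))


# Hypothetical state at the moment of the crash: the active node had closed
# generations 0x4E through 0x52, but only 0x4E through 0x50 had been copied.
closed_on_active = range(0x4E, 0x53)
copied_to_passive = range(0x4E, 0x51)

missing = logs_missing_on_passive(closed_on_active, copied_to_passive)
is_lossy = bool(missing)  # any uncopied closed log makes the failover lossy

print([f"{g:X}" for g in missing], is_lossy)
```

In this model the failover is lossy as soon as even one closed log has not reached the passive node.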

A lossy failover doesn't mean data is lost. As I described in an earlier FAQ, the passive node will query the Hub Transport servers in its AD site and recover any messages that it's missing. What does change is the log sequence: the passive node continues numbering its transaction logs from the last log it received. If the last log file on the previously passive node was E0000000050.log, its next log will be E0000000051.log, even though the previously active node may already have logs named E0000000051.log and E0000000052.log, as shown here.


The problem is that when the previously active node comes back online and attempts to resume synchronization, the Replication service will detect log files with the same name but different content. Exchange will be in a state of divergence between the active and the passive node because they have different content. This divergence can also occur in cluster split-brain situations where both nodes in a cluster are started or if an administrator runs the "eseutil /r" command.
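The name collision can be sketched in a few lines of Python. This is a toy model, assuming the file name is a three-character prefix (E00 here) followed by a fixed-width hexadecimal generation counter; the helper `next_log_name` and the byte strings standing in for log contents are hypothetical.

```python
def next_log_name(name):
    """Return the log file name that follows `name`, e.g. E0000000050.log."""
    stem, ext = name.rsplit(".", 1)
    prefix, generation = stem[:3], stem[3:]
    # Assumption: the generation is a fixed-width hexadecimal counter.
    following = int(generation, 16) + 1
    return f"{prefix}{following:0{len(generation)}X}.{ext}"


# Logs on the node that crashed (it had already closed 51 and 52).
old_active = {
    "E0000000050.log": b"shared",
    "E0000000051.log": b"old-51",
    "E0000000052.log": b"old-52",
}
# Logs on the node that took over (it only ever received up to 50).
new_active = {"E0000000050.log": b"shared"}

# After failover, the new active node continues from its own last log.
new_name = next_log_name(max(new_active))
new_active[new_name] = b"new-51"

# Same file name on both nodes, different content: divergence.
divergent = sorted(
    name
    for name in old_active.keys() & new_active.keys()
    if old_active[name] != new_active[name]
)
print(new_name, divergent)
```

Both nodes now hold an E0000000051.log, but the two files were written independently, which is the condition the Replication service has to detect.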

Fortunately, the Replication service can detect divergence by comparing the last log file on the CCR copy to the same log file on the active node. If they have the same content, then everything is OK. If they're different, the Replication service works backward through each pair of log files until it finds two that match, which marks the last point at which the two nodes were synchronized. Every log file on the copy after that point is deleted (including the open log, Enn.log, if found), and the later logs are copied from the active node, giving the two nodes the same log file content.
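The backward comparison described above can be sketched as follows. This is a minimal model, not the Replication service's actual implementation: logs are represented as dictionaries mapping a generation number to the log's content, and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def last_common_generation(active: Dict[int, bytes],
                           copy: Dict[int, bytes]) -> Optional[int]:
    """Walk backward from the copy's newest log until a pair matches."""
    for gen in sorted(copy, reverse=True):
        if active.get(gen) == copy[gen]:
            return gen
    return None


def resynchronize(active: Dict[int, bytes],
                  copy: Dict[int, bytes]) -> Tuple[int, List[int], List[int]]:
    """Delete divergent logs from the copy, then recopy from the active node."""
    common = last_common_generation(active, copy)
    if common is None:
        # This sketch does not model what happens when no pair matches.
        raise RuntimeError("no matching log pair found")
    deleted = sorted(g for g in copy if g > common)
    for g in deleted:
        del copy[g]
    copied = sorted(g for g in active if g > common)
    for g in copied:
        copy[g] = active[g]
    return common, deleted, copied


# The node that took over holds 0x50 plus its own 0x51; the returning node
# (now the copy) still holds its old, divergent 0x51 and 0x52.
active = {0x50: b"shared", 0x51: b"new-51"}
copy = {0x50: b"shared", 0x51: b"old-51", 0x52: b"old-52"}

common, deleted, copied = resynchronize(active, copy)
print(hex(common), [hex(g) for g in deleted], [hex(g) for g in copied])
```

After the call, the copy holds exactly the same logs as the active node. Comparing full contents is a simplification chosen for the sketch; the source does not say how the real comparison is performed.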

This process works only if none of the mismatched transaction logs has been written into the database, so that the databases still have the same content. If the databases have different content, you have a serious problem, because database differences can't be undone. Instead, the entire database of the passive node has to be reseeded from the active node, which could take a long time and should be avoided at all costs.
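The condition can be expressed as a simple comparison. The sketch below assumes two inputs that are hypothetical names for this example: the generation of the last matching log pair, and the highest log generation already replayed into the copy's database.

```python
def recovery_action(last_matching_gen, highest_gen_replayed_into_copy_db):
    """Choose between a log-only resync and a full reseed.

    Assumption: logs with a generation above `last_matching_gen` are the
    mismatched ones. If any of them has already been replayed into the
    copy's database, the databases differ and only a reseed can fix that.
    """
    if highest_gen_replayed_into_copy_db > last_matching_gen:
        return "reseed"
    return "resync-logs"


# Nothing past the last matching log (0x50) reached the database.
print(recovery_action(0x50, 0x50))
# A mismatched log (0x51) was already replayed into the database.
print(recovery_action(0x50, 0x51))
```

The first call selects the log-only resync and the second selects a reseed, matching the two cases described above.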

You might expect that under normal circumstances, with data being written to the database from transaction logs very quickly, database divergence would happen frequently in the event of lossy failovers, and thus full database reseeds would be required often. This would be the case, but Microsoft built a feature called Last Log Resiliency into Exchange 2007, which limits the need for full reseeds. I will discuss Last Log Resiliency in the next FAQ.

For more on Exchange's high availability, see the video "Exchange 2007 High Availability" at ITTV.net.


