The underappreciated Exchange Replay Lag Manager

Reg: All right, but apart from the sanitation, medicine, education, wine, public order, irrigation, roads, the fresh water system and public health, what have the Romans ever done for us?

Attendee: Brought peace?

Much like the members of the People’s Front of Judea struggled to find any good in the Romans in Monty Python’s “Life of Brian”, those running on-premises Exchange servers often wonder whether Microsoft’s whole-hearted embrace of the cloud has benefited them. It has, even if the rapid appearance of new features inside Office 365 sometimes makes the on-premises crowd feel like the forgotten few.

Which brings me to the topic of the ReplayLagManagerEnabled parameter for a Database Availability Group (DAG), a little-known and badly documented feature of Exchange 2013 that Microsoft developed to support its implementation of native data protection for Exchange Online. As you might know, Exchange Online eschews backups and protects mailboxes by deploying four database copies within a DAG. One of the copies is lagged with a 7-day delay, which raises the question of how best to use this copy to maintain high availability.

The Replay Lag Manager, introduced as part of the high availability enhancements in Exchange 2013, provides the solution. It is a component of the Microsoft Exchange DAG management service, the process that deals with tasks such as checking whether databases have sufficient redundancy. If enabled, the Replay Lag Manager checks the health of database copies every 60 seconds to decide whether it needs to force a lagged copy to play down its log set and so become a copy that can be activated should the other copies fail. The idea is to move the lagged copy from its normal condition of being several days behind the other copies to being up to date when the Replay Lag Manager observes signs of trouble, such as database copies going offline for some reason (disk or server failure, maintenance). When the problem condition eases and the set of copies returns to normal, the Replay Lag Manager instructs the lagged copy to begin accumulating transaction logs again and gradually go back to the lagged interval.

The Replay Lag Manager is part of Microsoft’s preferred architecture for Exchange 2013. The blog post on the topic states:

“The lagged database copy is configured with a seven day ReplayLagTime. In addition, the Replay Lag Manager is also enabled to provide dynamic log file play down for lagged copies. This feature ensures that the lagged database copy can be automatically played down and made highly available in the following scenarios:

  • When a low disk space threshold is reached
  • When the lagged copy has physical corruption and needs to be page patched
  • When there are fewer than three available healthy copies (active or passive) for more than 24 hours”

The ability to automatically deal with low disk space conditions is also helped by another feature called Loose Truncation.

The automatic nature of the intervention is the reason why this is important to Exchange Online. Although Microsoft does not publicly disclose the number of DAGs operating within Exchange Online, it’s likely to be over a thousand. Such a large number makes it terrifically difficult for human administrators to take the right action at the right time to maintain high availability. The Replay Lag Manager is just one of the pieces that enables Exchange Online to operate as automatically as possible. It’s less pervasive than other components like Managed Availability, but still critical in terms of keeping databases online and mailboxes available to users.

The ReplayLagManagerEnabled parameter is set to False by default, so a DAG doesn’t use the Replay Lag Manager unless you enable it.

Set-DatabaseAvailabilityGroup -Identity DAG1 -ReplayLagManagerEnabled $true

The parameter can be set on any DAG but doesn’t make much sense unless the DAG includes a lagged copy.
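As a rough sketch of how the pieces fit together, the commands below configure a seven-day lag on one copy and then check both the lag and the DAG-level switch. The names DB1, EX4, and DAG1 are placeholders rather than anything from a real deployment, so substitute your own database, server, and DAG.

```powershell
# Placeholders (assumptions): DB1 = database, EX4 = server hosting the lagged copy, DAG1 = DAG name.

# Give the copy of DB1 on EX4 a seven-day replay lag (format is days.hours:minutes:seconds)
Set-MailboxDatabaseCopy -Identity 'DB1\EX4' -ReplayLagTime 7.00:00:00

# Show the replay lag configured for each copy of the database
Get-MailboxDatabase -Identity DB1 | Format-List Name, ReplayLagTimes

# Confirm whether the Replay Lag Manager is enabled for the DAG
Get-DatabaseAvailabilityGroup -Identity DAG1 | Format-List Name, ReplayLagManagerEnabled
```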

Microsoft doesn’t provide a GUI to control how the Replay Lag Manager works. Instead, the feature uses default values when it is enabled, and you can adjust them through four registry DWORD values under HKLM\SOFTWARE\Microsoft\ExchangeServer\v15\Replay\Parameters\. The settings are:

| Registry value | Default | Meaning |
| --- | --- | --- |
| ReplayLagManagerEnableLagSuppressionWindowInSecs | 300 (five minutes) | The time in seconds that sufficient healthy copies must be available before lagging can resume. |
| ReplayLagManagerNumAvailableCopies | 3 | The minimum number of healthy copies that must be available for a database. When fewer are available, its lagged copy is automatically played down. |
| ReplayLagManagerDisableLagSuppressionWindowInSecs | 86400 (one day) | The time in seconds that the Replay Lag Manager waits after the number of healthy copies drops before instructing a lagged copy to begin playing down. |
| ReplayLagLowSpacePlaydownThresholdInMB | 10,000 (MB) | If less than this amount of free space is available on the drive holding the lagged copy, the copy starts to play down logs to free space. |

It might be a coincidence, but the default values correspond to the configuration used by Exchange Online! If you decide to change the default behavior of the Replay Lag Manager by altering the registry values, you have to restart the Microsoft Exchange DAG management service to make the changes effective. Obviously, it is wise to make the same changes on all member nodes in a DAG.
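One way to apply a change consistently is to loop through the DAG members, as in the minimal sketch below. It rests on several assumptions: DAG1 is a placeholder name, PowerShell remoting is enabled on every member, the DAG management service is registered under the name MSExchangeDagMgmt, and the value 2 is purely an example rather than a recommendation.

```powershell
# Assumptions: DAG1 is a placeholder, remoting works to each member,
# and 2 is an example value only.
$path = 'HKLM:\SOFTWARE\Microsoft\ExchangeServer\v15\Replay\Parameters'

foreach ($server in (Get-DatabaseAvailabilityGroup -Identity DAG1).Servers) {
    Invoke-Command -ComputerName $server.Name -ArgumentList $path -ScriptBlock {
        param($KeyPath)

        # Create the Parameters key if it is not already present
        if (-not (Test-Path -Path $KeyPath)) {
            New-Item -Path $KeyPath -Force | Out-Null
        }

        # Create or overwrite the DWORD value
        New-ItemProperty -Path $KeyPath -Name 'ReplayLagManagerNumAvailableCopies' `
            -PropertyType DWord -Value 2 -Force | Out-Null

        # Restart the DAG management service so the new value is read
        Restart-Service -Name MSExchangeDagMgmt
    }
}
```

Restarting the service on one member at a time, as the loop does, avoids having every node restart it at the same moment.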

The default values mean that if the Replay Lag Manager observes that the number of healthy copies for a database stays below three for more than a day, it instructs the lagged copy to begin to play down its transaction logs. If the number of healthy copies comes back to the required number for more than five minutes, the Replay Lag Manager lets the lagged copy resume lagging and gradually go back to its lagged interval (a maximum of 14 days).
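To make the thresholds concrete, here is a simplified model of the play-down decision as described in this article. It is not the Replay Lag Manager's actual code: the function name and parameters are hypothetical, the defaults simply mirror the registry defaults, and page patching is left out.

```powershell
# Illustrative model only, NOT Exchange code. Function and parameter names are hypothetical.
function Test-LagPlayDownNeeded {
    param(
        [int]$HealthyCopies,                 # healthy copies currently available
        [int]$SecondsBelowThreshold,         # how long the count has been below the minimum
        [int]$FreeSpaceMB,                   # free space on the lagged copy's drive
        [int]$MinCopies = 3,                 # ReplayLagManagerNumAvailableCopies
        [int]$DisableLagWindowSecs = 86400,  # ReplayLagManagerDisableLagSuppressionWindowInSecs
        [int]$LowSpaceThresholdMB = 10000    # ReplayLagLowSpacePlaydownThresholdInMB
    )

    # Too few healthy copies for longer than the suppression window
    $tooFewCopies = ($HealthyCopies -lt $MinCopies) -and
                    ($SecondsBelowThreshold -gt $DisableLagWindowSecs)

    # Free space on the lagged copy's drive has fallen under the threshold
    $lowDiskSpace = $FreeSpaceMB -lt $LowSpaceThresholdMB

    return ($tooFewCopies -or $lowDiskSpace)
}

# Two healthy copies for 25 hours: play down
Test-LagPlayDownNeeded -HealthyCopies 2 -SecondsBelowThreshold 90000 -FreeSpaceMB 50000   # True

# Two healthy copies for one hour: still inside the one-day window
Test-LagPlayDownNeeded -HealthyCopies 2 -SecondsBelowThreshold 3600 -FreeSpaceMB 50000    # False

# Plenty of copies, but only 5,000 MB free: play down to release space
Test-LagPlayDownNeeded -HealthyCopies 4 -SecondsBelowThreshold 0 -FreeSpaceMB 5000        # True
```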

The Replay Lag Manager also comes into play when a corrupt page is detected in a lagged database copy. Page patching only works when the lagged copy is at the same point as the other copies in the DAG, so after a problem is detected with a page (like the old -1018 error), the Replay Lag Manager orders the lagged copy to play down its logs to the point where page patching is possible. After the corrupt page is fixed, the lagged copy can resume normal operations.
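If you want to see whether a lagged copy is currently lagging or playing down, the copy status is the obvious place to look. The sketch below uses the placeholder names DB1 and EX4 and assumes that your Exchange build exposes the ReplayLagStatus property. If it does not, the replay queue length is still a useful hint, because it shrinks as logs are played down.

```powershell
# Placeholders (assumptions): DB1 = database, EX4 = server hosting the lagged copy.
# ReplayLagStatus is assumed to be available on your build of Exchange.
Get-MailboxDatabaseCopyStatus -Identity 'DB1\EX4' |
    Format-List Name, Status, CopyQueueLength, ReplayQueueLength, ReplayLagStatus
```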

As with many parts of technology, the devil is very much in the detail when it comes to discussing the benefits of innovation in the cloud for on-premises customers. Anything that automates the use (and usefulness) of lagged database copies is welcome. The Replay Lag Manager does just that.

Follow Tony @12Knocksinna
