How Windows Server 2012 Improves Active Directory Disaster Recovery

The day I sat down to write this column, springtime in Texas struck with a vengeance, spawning eleven tornadoes across the Dallas–Fort Worth area. Fortunately, we escaped the tornadoes and hail, and we only got several inches of rain. However, the event also spawned an idea for this month's column: disaster recovery for Active Directory (AD), and specifically how it's improved in Windows Server 2012.

How will Server 2012 help AD disaster recovery? I've already written about Server 2012's "virtualization-safe" AD features, but it wasn't until I was listening to the AD team at this year's MVP Summit that I understood the positive impact these features will have on the forest recovery process. To understand the improvement, you first must understand disaster recovery in the AD world.

Forest Recovery: Rare, but Possible

AD is wonderfully fault-tolerant to physical disruptions. Its distributed architecture lets you make updates on any domain controller (DC) and have them replicate to the other DCs in the domain or forest. This ensures that if a DC or group of DCs is taken out by a local power failure or act of nature, the domain or forest will continue operating with very little impact on the remaining active user population.

There are situations, however, in which you might have to perform a forest recovery. Problems with the root domain in a root-child forest, mismatched schema updates due to extensive replication problems, severe divergence caused by USN rollback due to performing image restores of virtual DCs—these possibilities are extremely uncommon, but they exist. And because your company probably depends on AD for basic functioning, you must have a plan in place.

What if you should have to perform a forest recovery? First, if you're wondering how you'll know when you must do a forest recovery, I have both a long and a short answer. The long answer comes in the form of a high-level list of possible forest recovery situations on TechNet. The short answer is that Microsoft will tell you, because you'll probably have been on the phone with Microsoft Customer Support Services (CSS) for many hours!

Forest Recovery Steps

At a high level, the procedure (as currently documented in "Planning for Active Directory Forest Recovery") involves several steps. Once you've taken all the DCs for the current, failed forest off the network, there are a number of pre-recovery steps you need to perform to prepare the environment.

The next step is to restore one DC per domain using the last known good set of backups from each domain. You’re taking backups of at least two DCs in every domain, aren't you? When each DC has been restored (starting with the root, if you have one), connect it to the network. You'll now have a seed forest, with one DC for every domain.

Once you've created the seed forest, you need to build it out as quickly as possible, performing fresh promotions of AD on the existing DCs. I recommend doing a Dcpromo /forceremoval rather than a complete OS reinstallation to prep the DCs for a fresh Dcpromo. Though not well known, /forceremoval basically rips out the AD role from the DC while leaving the OS untouched, so it's far faster when time is of the essence. This buildout phase is by far the most time-consuming part of the forest recovery process, and thus is the place to focus on streamlining. Keep in mind that this is a highly simplified description of the process. It ignores little practical considerations, such as the fact that every employee on the corporate network will be hammering these few DCs if you don't take precautions!

For Windows Server 2003, your streamlining options are limited to procedural improvements and Dcpromo from media. Procedurally, you can do a lot to ensure a speedy forest recovery, and though not technical, solid procedures are extremely important in a situation like this. Remember, this is an all-hands-on-deck situation; everyone's either running around with their hair on fire or drumming their fingers waiting for the hair-on-fire individuals to get AD back up. There's no time to be reading and evaluating TechNet articles and ActiveDir community forums on best practices. You must have solid and tested procedures that both your central AD team and remote operations can follow to the letter so that the restoration proceeds in a well-thought-out manner despite the stress of the moment.

Speeding Up the Recovery Process

With procedures behind us, let's talk about what you can do with Dcpromo. One of the primary actions of the Dcpromo process is the creation and population of the DC's local directory service database. There are several ways you can accomplish this, depending on what version of Windows Server your DC-to-be is running. The first way, which all versions of Windows Server support, is replicating AD objects in from other DCs in the domain. This method might be just peachy for normal operations, but at a time like this you definitely don't want to depend on network connectivity, reliability, and probable congestion to get your authentication infrastructure back up!

So what other promotion tricks can we do? All versions of Windows Server after Windows 2000 support Dcpromo from media (the /IFM option), which promotes a new DC using a system state backup as the source to populate the local directory service database. The advantages of Dcpromo /IFM are that it doesn't require a network, and it's very fast. This is especially impressive for very large databases; at Intel, using IFM cut a 19-hour over-the-network Dcpromo down to 10 minutes. To make this work in a disaster scenario, you must keep a running set of several versions of the system state backup stored on a non-system partition. Further, you must do this on every DC if you don't want to be dependent on the network for copying system state backups around.
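Keeping "a running set of several versions" on every DC is the kind of chore that gets skipped unless it's scripted. Here's a minimal Python sketch of the rotation half of that job. It doesn't take the backup itself; it only prunes old copies. The function name prune_backups, the one-folder-per-backup layout, and the ISO-date folder names (for example 2012-04-22) are all my assumptions for illustration, not anything a Windows backup tool produces by default.

```python
import shutil
from pathlib import Path
from typing import List


def prune_backups(backup_root: Path, keep: int = 3) -> List[Path]:
    """Keep the `keep` newest backup folders under backup_root; delete the rest.

    Assumes (hypothetically) one subfolder per system state backup, named
    with an ISO date such as 2012-04-22, so that name order equals age order.
    Returns the folders that were removed.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Newest first, because ISO dates sort lexicographically.
    folders = sorted(
        (p for p in backup_root.iterdir() if p.is_dir()),
        key=lambda p: p.name,
        reverse=True,
    )
    stale = folders[keep:]
    for folder in stale:
        shutil.rmtree(folder)
    return stale
```

You'd run something like prune_backups(Path("D:/IFM"), keep=3) after each scheduled backup completes, where D:/IFM stands in for whatever non-system partition you use.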

Virtualization gives you more options for a speedy forest recovery for current versions of Windows Server, if you're careful. You can't restore a virtualized AD DC from snapshots or image-based backups (i.e., external backups of the VM's hard disks), or bad things might happen. My general rule when working with AD and virtualization is: "Don't do anything to AD that it wouldn't expect in a physical environment." You can still use virtualization advantages to speed forest deployment, however. For example, let's say you have a hub-and-spoke network configuration. You could create a generic virtual machine (VM) image in the network hub, with a known good IFM backup loaded on it, then clone that image several times (before you've made it a DC, and thus avoid any AD virtualization problems). Then, perform Dcpromo /IFMs on the cloned images. This will quickly give you a number of DCs in the hub site to support the network load, and branch offices can temporarily authenticate over the WAN until you can rebuild the branch office DCs.
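The "bad things" from a snapshot restore usually come down to USN rollback, and a toy model makes the mechanism easier to see. The Python sketch below is a deliberate oversimplification of AD replication: one-way pulls, a single high-watermark per source, and class and function names (DC, simulate_restore) that are entirely my own. It shows only the core idea: a partner remembers the highest USN it has seen from a given invocation ID, so a DC that is rolled back without getting a new invocation ID reuses USNs its partner believes it already has.

```python
import copy


class DC:
    """Toy domain controller: a local USN counter plus replication state."""

    def __init__(self, name, invocation_id):
        self.name = name
        self.invocation_id = invocation_id
        self.usn = 0
        self.changes = {}     # local USN -> change originated on this DC
        self.known = set()    # every change this DC holds
        self.watermark = {}   # source invocation ID -> highest USN received

    def write(self, change):
        self.usn += 1
        self.changes[self.usn] = change
        self.known.add(change)

    def pull_from(self, source):
        # Ask only for changes above the USN already seen from this source.
        seen = self.watermark.get(source.invocation_id, 0)
        for usn, change in source.changes.items():
            if usn > seen:
                self.known.add(change)
        self.watermark[source.invocation_id] = max(seen, source.usn)


def simulate_restore(reset_invocation_id):
    """Return the changes DC2 never receives after DC1 is rolled back."""
    dc1, dc2 = DC("DC1", "inv-A"), DC("DC2", "inv-B")
    dc1.write("create user alice")
    snapshot = copy.deepcopy(dc1)      # image taken at USN 1
    dc1.write("create user bob")       # USN 2
    dc2.pull_from(dc1)                 # DC2 has now seen inv-A up to USN 2
    dc1 = copy.deepcopy(snapshot)      # image restore: back to USN 1
    if reset_invocation_id:
        dc1.invocation_id = "inv-C"    # what a supported restore does
    dc1.write("create user carol")     # reuses USN 2
    dc2.pull_from(dc1)
    return dc1.known - dc2.known


print(simulate_restore(reset_invocation_id=False))  # carol is stranded on DC1
print(simulate_restore(reset_invocation_id=True))   # nothing is stranded
```

In the first run, DC2 silently skips the new user because USN 2 from "inv-A" looks like old news. In the second, the fresh invocation ID makes DC1 look like a brand-new source, so DC2 pulls everything.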

Windows Server 2012's Boost to Forest Recovery

With that background in mind, how exactly can Server 2012 make AD disaster recovery a much easier process? It centers on the ability to clone Server 2012 virtual DCs. Server 2012 Hyper-V (and soon VMware vSphere) passes a value—the VM Gen ID—to VMs to tell them if they've been subjected to a virtualization activity such as being restored from a snapshot or image-based backup. Thus warned, the DC can take corrective actions to allow it to continue functioning correctly with the other DCs in the domain and forest.
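To make those "corrective actions" concrete, here's a small Python model of the check as I understand it from Microsoft's descriptions: the DC keeps its own copy of the VM Gen ID, compares it with the value the hypervisor currently presents, and on a mismatch resets its invocation ID and throws away its RID pool. The class and method names (VirtualDC, check_generation_id) and the RID range are hypothetical, and the real DC does more than this (SYSVOL is also resynchronized, for one).

```python
import uuid


class VirtualDC:
    """Toy model of a Server 2012 virtual DC's VM Gen ID safeguard."""

    def __init__(self, vm_gen_id):
        self.stored_gen_id = vm_gen_id        # copy kept in the DC's own database
        self.invocation_id = uuid.uuid4()     # identity used by replication partners
        self.rid_pool = range(1100, 1600)     # illustrative RID block, not a real default

    def check_generation_id(self, current_gen_id):
        """Compare the hypervisor's current value with the stored copy.

        Returns True if the DC detected a snapshot restore or copy and
        applied its safeguards, False if nothing changed.
        """
        if current_gen_id == self.stored_gen_id:
            return False
        # A new invocation ID makes partners treat this DC as a new source,
        # so reused USNs are not mistaken for changes they already have.
        self.invocation_id = uuid.uuid4()
        # Dropping the RID pool avoids handing out duplicate SIDs.
        self.rid_pool = None
        self.stored_gen_id = current_gen_id
        return True
```

The design point is that the DC no longer has to trust the administrator to avoid snapshots; it detects the rollback itself and behaves as if it had been restored the supported way.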

Let's take this ability to clone DCs and insert it into the forest recovery process. Now, when you need to scale up the seed forest rapidly, all you need to do is clone the seed DCs. Unlike with Server 2008 R2 and earlier, you don't need to go through any promotion process, so the AD scale-out process (at least in the same networks as the seed DCs) can be very fast indeed. You could speed up the scale-out even more, or simply make it fast in smaller environments, by using differencing disks as a temporary measure. Once you've got the initial virtual disk deployed, the differencing disk for each subsequent virtual DC is quite small (around 200KB), and thus deploying additional DCs to scale out becomes lightning fast. Remember, the point behind this initial scale-out is only to get enough DCs operational to allow your users, resources, and applications to begin authenticating and authorizing again; you must still finish the AD build-out in its final configuration across your network to bring things back to normal.

AD forest recovery is probably at the top of the "what keeps AD administrators up at night" list, yet I'll bet only a small number of administrators have a documented plan in place. Of those, few have actively tested it, and even fewer test it on a regular basis. Server 2012 will make the process simpler and faster, but only if you've done all the legwork first.

Sean writes about cloud identity, Microsoft hybrid identity, and whatever else he finds interesting at his blog on Enterprise Identity and on Twitter at @shorinsean.
