How many of you virtualize Active Directory (AD)? If you do, do you know and follow Microsoft's important guidelines about what you should and shouldn't do if you're virtualizing AD? That's what I thought. According to Microsoft's Customer Service & Support (CSS), AD is the top support area for Windows Server, and AD virtualization issues are at or near the top for AD issues. Shame on you! The good news is that Windows Server 2012's AD improvements will make your company’s implementation safer for you slackers.
Unless you've been spending a lot of time spelunking in the Cave of Kruber, as an IT person you're certainly aware of how virtualization has been taking over all aspects of the datacenter. The oldest—and thus most developed—area of virtualization is at the server level. Furthermore, because virtualization is one of the fundamental aspects of a computing cloud, its use is only accelerating.
If you're an AD administrator, what this means is that the virtualization team has been trying to get you to virtualize all of your AD domain controllers (DCs). AD administrators are by nature a risk-averse bunch, so your first reaction was probably, "No!" (If we meet in person, I'll tell you the story of how I managed to instantly expire 30,000 Intel user accounts during an upgrade, despite layers of precautions. OK, it wasn't me exactly.) However, these virtualization people are a tenacious bunch, and they probably want a good reason why they shouldn't convert your entire AD forest to virtual machines (VMs). I defended my position (successfully) when I was at Intel. At the time, there were two good reasons, and although they still apply today, one of them is going away.
Reasons to Be Careful About Virtualizing AD
Security is the first reason. When I was pushing back at the virtualization folks, there were two aspects of security that we were concerned about: guest VM isolation and host administration. We were concerned about security between guests because we weren't entirely convinced that one VM couldn't gain access to another VM. As server virtualization has matured, that concern has pretty much been put to rest. But the second aspect, operational security for host administration, is a current concern and will remain one for the foreseeable future.
Operational security for host administration is simply a long-winded way of saying that your virtualization host server administrators don't necessarily know anything about the care and feeding of virtualized AD DCs. And in all current versions of Windows Server, server host or virtualization administrators can seriously screw up your AD installation if they use some of the basic capabilities that any virtualization product provides, such as image-based restores, rollbacks to snapshots, or virtual DC duplications.
AD in Windows Server 2008 R2 and earlier is a distributed system that can't comprehend how virtualization products can change the state of a virtual DC in ways that can't happen to a physical DC. Because it isn't designed to handle these changes, the logical structure of the system can lose its integrity, specifically in the form of update sequence number (USN) rollback, an AD data integrity problem that is difficult to detect and more difficult to recover from. Microsoft has a comprehensive document about running virtualized DCs called “Running Domain Controllers in Hyper-V,” and it includes a section about USN rollback. If you suspect that you're already a victim of a USN rollback, check the Microsoft article “How to detect and recover from a USN rollback in Windows Server 2003, Windows Server 2008, and Windows Server 2008 R2” for information about how to detect and hopefully correct it.
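To make the failure mode concrete, here is a toy Python model of USN rollback. It is only a sketch under simplifying assumptions: the ToyDC class, its fields, and the demo function are invented for illustration and are not part of AD or any Microsoft API. The one real idea it borrows is that a replication partner remembers the highest USN it has already seen from a given database and asks only for changes above that number.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDC:
    """Hypothetical, heavily simplified DC: one database, one USN counter."""
    name: str
    invocation_id: str                               # identifies this copy of the database
    usn: int = 0                                     # highest USN issued locally
    changes: dict = field(default_factory=dict)      # usn -> description of the change
    watermarks: dict = field(default_factory=dict)   # partner invocation_id -> highest USN seen

    def write(self, description):
        """Record a local change under the next USN."""
        self.usn += 1
        self.changes[self.usn] = description
        return self.usn

    def pull_from(self, partner):
        """Fetch only the partner's changes above the highest USN already seen."""
        seen = self.watermarks.get(partner.invocation_id, 0)
        new = [d for u, d in sorted(partner.changes.items()) if u > seen]
        self.watermarks[partner.invocation_id] = max(seen, partner.usn)
        return new


def demo_usn_rollback():
    dc1 = ToyDC("DC1", invocation_id="A")
    dc2 = ToyDC("DC2", invocation_id="B")

    dc1.write("create user alice")               # USN 1
    snapshot = (dc1.usn, dict(dc1.changes))      # hypervisor snapshot taken here
    dc1.write("create user bob")                 # USN 2
    first = dc2.pull_from(dc1)                   # DC2 has now seen DC1 up to USN 2

    # The snapshot is applied: DC1 goes back in time but keeps its invocation ID.
    dc1.usn, dc1.changes = snapshot[0], dict(snapshot[1])
    dc1.write("create user carol")               # reuses USN 2
    second = dc2.pull_from(dc1)                  # DC2 believes it is already up to date
    return first, second
```

In this model the second pull returns nothing, so "carol" never reaches DC2, and DC1 has silently lost "bob". Neither DC reports an error, which is why the problem is so hard to spot in a real forest.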
How Server 2012 Makes AD Virtualization-Safe
Making AD completely safe in a virtualized environment was a top priority for the AD team, and they've achieved it in Server 2012. They haven’t just made it safe; they've enabled AD to take full advantage of virtualization's capabilities. Conceptually, how it's done is quite simple. First, you need to make a DC aware of when a rollback in time has happened. Second, the DC must take action that both preserves its integrity and allows it to function normally.
To accomplish the first step, the layer that's executing the change (the hypervisor) needs to flag that a rollback in time has occurred, and communicate it up through the virtualization stack. The application then has to recognize it. This process obviously requires that design changes be made to the hypervisor, the OS, and the AD application. The flagging mechanism is known as the VM-GenerationID.
The VM-GenerationID (or VM Gen ID) is a 128-bit value, held in the hypervisor, that represents the current generation of a VM's state. As long as a VM continues to move forward in time, uninterrupted, the VM Gen ID doesn't change. If the VM is rolled back in time, either by an image-based restore or by applying a snapshot, the ID is changed. This ID is mapped to an address in memory in the VM so that it's available to applications running in the VM at all times.
How does a DC know if its VM Gen ID has changed? When a Server 2012 DC is initially promoted (or upgraded), it stores the value of the VM Gen ID in the msDS-GenerationID attribute on the DC’s computer object in its copy of AD. Whenever the DC reboots or processes a transaction (e.g., updating an attribute), it compares the current value of the VM Gen ID in memory with the value stored in AD. If they're different, the VM has rolled back in time and the DC must take certain measures to preserve integrity. The VM Gen ID is hypervisor-independent, and other hypervisor manufacturers (e.g., VMware) are building this capability into their products.
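The comparison itself is simple enough to sketch in a few lines of Python. This is an illustration of the idea only, not how the DC actually reads the value: ToyVirtualDC and its method are hypothetical names, and a random UUID stands in for the 128-bit ID that the hypervisor exposes.

```python
import uuid
from dataclasses import dataclass


@dataclass
class ToyVirtualDC:
    """Hypothetical stand-in for a virtualized DC."""
    # Plays the role of msDS-GenerationID saved in the DC's copy of AD.
    stored_gen_id: uuid.UUID

    def rolled_back(self, current_gen_id):
        """True if the hypervisor's ID no longer matches the one saved in AD."""
        return current_gen_id != self.stored_gen_id


def demo_gen_id_check():
    hypervisor_gen_id = uuid.uuid4()                   # 128-bit value held by the hypervisor
    dc = ToyVirtualDC(stored_gen_id=hypervisor_gen_id)

    before = dc.rolled_back(hypervisor_gen_id)         # VM moving forward in time

    hypervisor_gen_id = uuid.uuid4()                   # snapshot applied: hypervisor changes the ID
    after = dc.rolled_back(hypervisor_gen_id)
    return before, after
```

Calling demo_gen_id_check() gives a False check before the snapshot is applied and a True check afterward. Because the stored value lives inside the AD database, it travels back in time with the VM, while the hypervisor's value does not. That mismatch is what gives the rollback away.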
If a VM rollback has been detected, the DC performs two actions to prevent USN rollback: It resets the AD database's invocationID and dumps its local Relative Identifier (RID) pool. Resetting the invocationID (the version number of the local database) is the same action that is triggered if a normal restore process is run against the DC. Other mechanisms then kick in to ensure that the DC receives the latest updates from its replication partners, including updates that it originated itself but no longer knows about because of the rollback. The RID pool is a collection of several hundred RIDs (part of the domain-unique SID) that the RID master allocates to the DC so that it can build SIDs when new security principals are created on the DC. Dumping the RID pool and requesting a new allocation from the RID master ensures that duplicate SIDs aren't created in the domain. Note that this doesn't get you out of regular backups, though!
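The two safeguards can be sketched as a toy Python model. Everything here is an assumption made for illustration: ToyRidMaster, ToyDC, the starting RID, and the block size of 500 are invented names and numbers, not AD's real implementation.

```python
import uuid
from dataclasses import dataclass


class ToyRidMaster:
    """Hypothetical RID master that hands out non-overlapping blocks of RIDs."""

    def __init__(self, start=1000, block_size=500):   # both values are arbitrary for this sketch
        self.next_rid = start
        self.block_size = block_size

    def allocate(self):
        block = range(self.next_rid, self.next_rid + self.block_size)
        self.next_rid += self.block_size
        return block


@dataclass
class ToyDC:
    """Hypothetical DC holding only the two pieces of state discussed above."""
    invocation_id: uuid.UUID
    rid_pool: range

    def recover_from_rollback(self, rid_master):
        # 1. New invocation ID: partners treat this as a database they have not seen before.
        self.invocation_id = uuid.uuid4()
        # 2. Throw away the old RID pool and ask the RID master for a fresh block.
        self.rid_pool = rid_master.allocate()


def demo_rollback_recovery():
    master = ToyRidMaster()
    dc = ToyDC(uuid.uuid4(), master.allocate())
    old_id, old_pool = dc.invocation_id, dc.rid_pool

    dc.recover_from_rollback(master)

    overlap = set(old_pool) & set(dc.rid_pool)
    return old_id != dc.invocation_id, len(overlap)
```

In this sketch the DC ends up with a different invocation ID and a RID block that shares no values with the old one, so any RIDs it handed out after the snapshot was taken can't be issued a second time.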
So technologically, AD will be fully virtualizable, although as of this writing the Microsoft AD team hasn't quite decided whether they're going to make it official. But do you want to virtualize AD entirely? You must remember to look at the big picture before you decide. The modern datacenter has (or certainly will have, going forward) layers and layers of abstraction between the AD service and the all-too-fallible hardware. Remember the "eggs in one basket" principle: Look at each layer below your service, work through the possible failure scenarios at each layer and how they’ll affect your service, and plan your service configuration accordingly. For example, you should consider running more than one virtualization solution for some critical parts of your infrastructure so that an issue with one solution (e.g., a bad driver in the VMware ESXi kernel, or an issue causing the Hyper-V parent partition to crash) isn't a single point of failure for your service. Or consider the case in which your VMs, though on different hosts and using different virtualization products, are all stored on a single SAN. If the only way to mitigate a scenario is to have a few physical DCs, so be it! When the virtualization team objects, point out (probably to their second-level management) that the cost of maintaining a few physical boxes is trivial compared with the risk of your entire corporation being unable to log on in the morning.
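One way to work through the "eggs in one basket" exercise is to write down what each DC depends on at every layer and look for a component that all of them share. The Python sketch below does just that. The function, the layer names, and the host and SAN names are all made up for illustration, so substitute whatever layers matter in your own datacenter.

```python
def shared_components(dcs):
    """Return {layer: component} for every layer where all DCs depend on the same component."""
    layers = {layer for stack in dcs.values() for layer in stack}
    shared = {}
    for layer in sorted(layers):
        # A DC that lacks the layer (e.g., a physical DC has no hypervisor) contributes None.
        components = {stack.get(layer) for stack in dcs.values()}
        if len(components) == 1:
            shared[layer] = components.pop()
    return shared


# Hypothetical inventory: two virtual DCs on different products and hosts, same SAN.
virtual_only = {
    "DC1": {"hypervisor": "Hyper-V", "host": "host-a", "storage": "san-1"},
    "DC2": {"hypervisor": "VMware ESXi", "host": "host-b", "storage": "san-1"},
}

# The same inventory with one physical DC on local disk added.
with_physical = dict(virtual_only, DC3={"host": "physical-1", "storage": "local-disk"})

print(shared_components(virtual_only))    # {'storage': 'san-1'}
print(shared_components(with_physical))   # {}
```

The first inventory flags the SAN as the one basket holding all the eggs. Adding a single physical DC on its own storage clears the list.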
Server 2012's Active Directory Domain Services (AD DS) has given you one less worry to keep you awake at night. But as with any new capability, you need to look at it in the context of your infrastructure and decide how you can best use it.