Troubleshooting Active Directory Replication

One of Active Directory’s (AD’s) advantages is that it’s a distributed application. Its functionality is spread across multiple domain controllers (DCs) so that the failure of any one DC won’t affect the overall availability of AD. To accomplish this, AD must move its information around freely and efficiently between its DCs in a process known as replication. The AD replication model is a powerful, fault-tolerant, and complex system. It’s also the area that seems to cause the most issues for AD administrators. But that’s usually not AD’s fault.

Why should you monitor replication and keep it working well? If replication isn’t working to one or more of your DCs, a segment of your user population won’t be kept current with the latest directory data. This could result in a host of problems: Password changes aren’t seen; accounts unlocked by administrators aren’t accessible by the account owner; users don’t have access to applications (even though they’ve been added to the correct groups); new users can’t log on (even though their accounts have been created); and, very importantly, terminated employees might be able to access the network after their accounts have been disabled.

Replication issues can also affect Group Policy functioning and site or subnet changes. A DC that hasn’t successfully replicated with its partner DCs for longer than the tombstone lifetime will be tombstoned out of the forest and must be rebuilt. Replication problems can also affect schema updates and have been known to cause forest-wide failures.

The Layered Approach

AD administrators should invest a little time to make sure that AD replication is working correctly for the health of their directory—and of their jobs. As a distributed application, AD depends on all the layers of infrastructure on which it’s built. Most of the issues that cause AD service interruptions—including replication—can be traced back to infrastructure or to administrative error (such as accidentally deleting objects). So, the first step in any AD replication troubleshooting must be to make sure that your infrastructure is working correctly. I call this technique troubleshooting from the wire up.

I use the seven-layer OSI network model (physical, data link, network, transport, session, presentation, and application) as a basis for my own AD troubleshooting model. My model is as follows:

Physical (i.e., the wire)
Network
Name resolution
OS
Authentication
The AD application itself

The physical layer refers to the physical network infrastructure: the wires that make networking function. If someone disconnects a network patch cord or runs a backhoe through a fiber circuit, replication isn’t going to work.

The network layer refers to network connectivity above the physical layer: router, switch, and, especially, firewall functionality. With regard to firewalls, DCs communicate over so many ports—some dynamically—that it’s important to carefully follow the guidance laid out in the Microsoft article “How to configure a firewall for domains and trusts”.

Another network-related issue is remote procedure call (RPC) errors, such as RPC server is unavailable. The Ask the Directory Services Team blog includes a very informative post about how to troubleshoot these errors by using the PORTQRY utility. (For more information, see the Microsoft Directory Services Team blog article “Using PORTQRY for troubleshooting”.)
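
As a rough illustration of the kind of check that PORTQRY performs, the following Python sketch tests whether a TCP connection can be opened to a handful of well-known DC ports. The port list is my own assumed subset (it omits UDP and the dynamic RPC range), so treat this as a quick first pass and not as a substitute for PORTQRY or for the firewall guidance.

```python
import socket

# A few fixed TCP ports that DCs commonly listen on (assumed subset;
# UDP ports and the dynamic RPC range are deliberately left out).
DC_TCP_PORTS = {
    53: "DNS",
    88: "Kerberos",
    135: "RPC endpoint mapper",
    389: "LDAP",
    445: "SMB",
    3268: "Global Catalog",
}


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_dc(host, ports=DC_TCP_PORTS, timeout=2.0):
    """Map each service name to True/False depending on reachability."""
    return {name: tcp_port_open(host, port, timeout) for port, name in ports.items()}
```

Calling probe_dc with the name of the DC that reports RPC errors (for example, a hypothetical godan.deuby.net) shows at a glance which of these services can't be reached from the machine that runs the script.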

Name Resolution: Suspect #1

Name resolution is where you should focus most of your AD troubleshooting efforts because the majority of AD-related problems are caused by name resolution configuration issues. Several years ago, Microsoft Product Support Services traced 80 percent of AD cases to name resolution issues. (For more information, see “Troubleshooting networks without NetMon”.)

AD is dependent on DNS to register and resolve all the myriad services and nodes it needs, and there are many ways to configure DNS incorrectly. Microsoft has long recognized this, and the DCPROMO wizard has grown increasingly sophisticated in the way that it configures DNS. Windows IT Pro has published a variety of articles about DNS, including several by Boyd Gerber, a Microsoft network escalation engineer who specializes in DNS. See the Learning Path for a list of Windows IT Pro DNS articles. (For more information about how to troubleshoot DNS, see the Microsoft TechNet article “Troubleshooting Active Directory-Related DNS Problems”.)

Probably the best command to debug DNS problems is DCDIAG /TEST:DNS. This diagnostic command comprehensively tests the DNS service of the local DC or of the server that you direct it to by using the /S switch. Using the /V (verbose) switch provides detailed test results. Adding the /E (enterprise) switch runs the command on all DNS servers in your forest. Finally, you can better analyze the volumes of information that this command provides by writing the output to a log file by using the /F switch.
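
If you run this test often, it can help to script it. The following Python sketch only assembles the DCDIAG switches described above into an argument list and, optionally, runs the tool. The function names and the server and log file names in the usage note are hypothetical, and the run step assumes a Windows machine with DCDIAG on the path.

```python
import subprocess


def dcdiag_dns_args(server=None, verbose=False, enterprise=False, logfile=None):
    """Build the argument list for a DCDIAG DNS test run."""
    args = ["dcdiag", "/test:dns"]
    if server:
        args.append(f"/s:{server}")    # target a specific server
    if verbose:
        args.append("/v")              # detailed test results
    if enterprise:
        args.append("/e")              # all DNS servers in the forest
    if logfile:
        args.append(f"/f:{logfile}")   # write results to a log file
    return args


def run_dcdiag_dns(**options):
    """Run DCDIAG (Windows only) and return its exit code."""
    return subprocess.run(dcdiag_dns_args(**options), check=False).returncode
```

For example, dcdiag_dns_args(server="kyoshi", verbose=True, logfile="dns.log") produces the equivalent of DCDIAG /TEST:DNS /S:kyoshi /V /F:dns.log.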

Many of these techniques are covered in the DNS page of my Active Directory Troubleshooting flowchart. You can find additional AD troubleshooting tips on my Active Directory Troubleshooting Tips and Tricks blog.

One aspect of AD that’s not well known is how name resolution is tied to replication. One of the most common errors we see when replication isn’t working is some kind of name resolution error, such as RPC server is unavailable or DNS lookup failure. Because we humans and most computer services locate other computers on the network by using the DNS A record (e.g., mycomputer.deuby.net), it’s natural to assume that this is also how DCs find each other for replication. They do—eventually. But only indirectly. For replication purposes, a DC’s directory service registers a GUID in DNS as a CNAME (alias) record. This GUID is unique in the forest. The CNAME is known as the DSA object GUID, and it resolves to the DC’s A record. When a directory service on a DC tries to locate its replication partners, it uses the Fully Qualified Domain Name (FQDN) of the CNAME (e.g., 802e2778-27d1-49ca-9d12-5c439f4c4d3b._msdcs.deuby.net).

If you want to find a DC in the same way that another DC really locates one, you have to find its GUID. There are several ways that you can find the DSA object GUID of a DC. One way is to look it up in the Microsoft Management Console (MMC) DNS Management snap-in under the _msdcs container of the domain’s zone. However, this method works only if the GUID is registered correctly in DNS. If you aren’t sure whether it is, a simple way to verify the registration is to run the command

REPADMIN /SHOWREPL dcname

In this command, dcname represents the name of the DC that’s experiencing replication problems. The DSA object GUID is one of the first items listed in the response. Append ._msdcs.domain.com to the GUID (where domain.com stands for your forest root domain), and the result is the name that you have to ping.

After you obtain the DSA GUID, ping it from a DC that’s receiving the errors. (You could also do this from your own client, but that would probably introduce another variable because you might be using a different DNS server than the one the DC is using.) If you get no response from the ping, or if you receive a “could not find host” error, the replication problem most likely occurs because the CNAME or A record isn’t registered correctly. Reregister the DC’s GUID and its SRV records either by running the NLTEST /DSREGDNS command or by restarting the NETLOGON service.
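
The name-building step is easy to get wrong by hand, so here is a minimal Python sketch of it. It assembles the CNAME from a DSA object GUID and a forest root domain, then tries to resolve it. The function names are my own, and the lookup uses whatever DNS server the machine running the script is configured with, so run it on the DC that’s receiving the errors for a meaningful result.

```python
import socket
import uuid


def dsa_cname(dsa_guid, forest_root):
    """Return the FQDN of the CNAME record for a DC's DSA object GUID."""
    guid = str(uuid.UUID(dsa_guid))  # validates and normalizes the GUID text
    return f"{guid}._msdcs.{forest_root.strip('.').lower()}"


def resolve_dsa(dsa_guid, forest_root):
    """Resolve the CNAME to an IPv4 address, or return None if lookup fails."""
    try:
        return socket.gethostbyname(dsa_cname(dsa_guid, forest_root))
    except OSError:
        return None
```

A None result from resolve_dsa corresponds to the “could not find host” case described above.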

Critical Layers: Health and Authentication

The importance of checking the OS health of the DC should be self-evident. AD is an application that runs on top of (or, in the case of Windows Server 2008 R2 or Windows Server 2008, is a role of) the Windows Server OS. There’s nothing unique about OS troubleshooting on a DC compared with troubleshooting any other application role. However, a dedicated DC does have an advantage over other application roles if you do encounter OS problems. Instead of spending hours trying to fix an ailing OS, you can simply demote the DC, or forcibly remove the role by using DCPROMO /FORCEREMOVAL. Then, you can quickly rebuild the OS and reinstall AD. This is often the quickest way to get a DC working again.

Similar to name resolution, the authentication layer of the AD troubleshooting model isn’t exactly a software layer. It’s a vital component within AD that, among other functions, determines the valid identities of the DCs themselves to allow them to safely communicate with one another. Kerberos is the security protocol that’s used, and the Kerberos Key Distribution Center (KDC) is part of every DC. If you aren’t familiar with this protocol (and every AD admin should be), the Microsoft Directory Services Team blog has a helpful article. (For more information, see “Kerberos for the Busy Admin”.)

Kerberos itself is an extremely reliable AD component. With respect to replication between DCs, many authentication-related failures are actually caused by external problems, such as time skew between computers. The W32TM utility is the main tool for correcting time skew, which it does by managing the Windows Time service. For example, you can perform the following actions by running the corresponding W32TM commands:

Check the last time that your target DC successfully synchronized its time, and with what server: w32tm /query /status
Force the service to use another DC in the domain: w32tm /config /syncfromflags:DOMHIER /update
Force the service to rediscover its network resources, then resynchronize with its time source: w32tm /resync /rediscover
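
To make the time skew problem concrete, the following Python sketch compares two clock readings against a tolerance. The five-minute value is the Kerberos default as I understand it and is an assumption here (your domain policy might set something different), and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed default Kerberos clock-skew tolerance (five minutes); your
# domain policy may set a different value.
MAX_SKEW = timedelta(minutes=5)


def clock_skew(local_time, partner_time):
    """Return the absolute difference between two timezone-aware clocks."""
    return abs(local_time - partner_time)


def skew_within_tolerance(local_time, partner_time, tolerance=MAX_SKEW):
    """True if the two clocks are close enough for Kerberos to succeed."""
    return clock_skew(local_time, partner_time) <= tolerance
```

If two DCs drift further apart than the tolerance, authentication between them fails, and replication fails with it, even though nothing is wrong with AD itself.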

If you’ve virtualized some of your DCs, make sure that they’re not synchronizing time with their host but are synchronizing instead with their partner DCs. (For more information about how to troubleshoot Kerberos, see the Microsoft article “Troubleshooting Kerberos Problems”. For an example of Kerberos troubleshooting with network traces in which the underlying cause turns out to be name resolution, see the Microsoft Directory Services Team blog article “Troubleshooting Kerberos Authentication problems – Name resolution issues”. The Windows Server 2003 R2 Kerberos Technology Center also provides a range of Kerberos-related articles.)

How Replication Works

Before you can effectively troubleshoot replication, you must understand how it works. Replication is the process of forwarding updates for a directory partition to all DCs that have a copy of that partition. For example, if you make a change to a user account in the domain child1.mycompany.com, replication forwards that change to the other child1 DCs because those controllers have a copy of (that is, they host) that domain partition. If you make a change to the site configuration for mycompany.com, replication forwards that change to all other DCs in the mycompany.com forest because site information is stored in the configuration partition that’s hosted on every DC in the forest. Replication works on a per-partition basis, making replication topology more complicated to understand. The good news is that when replication fails, it usually fails for all partitions on a DC because of issues that affect the supporting infrastructure.
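
The per-partition behavior can be sketched in a few lines of Python. The DC names and the partition layout below are hypothetical, and the model is deliberately simplified (it ignores application partitions and the partial replicas that Global Catalog servers hold).

```python
# Hypothetical forest: which directory partitions each DC hosts.
HOSTED_PARTITIONS = {
    "root-dc1": {"schema", "configuration", "mycompany.com"},
    "root-dc2": {"schema", "configuration", "mycompany.com"},
    "child1-dc1": {"schema", "configuration", "child1.mycompany.com"},
    "child1-dc2": {"schema", "configuration", "child1.mycompany.com"},
}


def replication_targets(partition, origin_dc, hosted=HOSTED_PARTITIONS):
    """Return the DCs that must receive a change made on origin_dc."""
    if partition not in hosted[origin_dc]:
        raise ValueError(f"{origin_dc} does not host {partition}")
    return sorted(
        dc for dc, partitions in hosted.items()
        if partition in partitions and dc != origin_dc
    )
```

A user change made on child1-dc1 in the child1.mycompany.com partition reaches only child1-dc2, whereas a site change in the configuration partition reaches every other DC in the forest.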

To fine-tune the way that DCs replicate with one another, you create an AD site topology that contains your forest's DCs. The site topology is a network of its own that has sites as its nodes and site links as the connections between the nodes. The topology is usually based on your company's LAN and WAN configuration. You can further tune the way that replication connections are generated between sites by changing the relative cost of the site link (i.e., how expensive the WAN circuit is).
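
To see how site link cost steers replication, consider this Python sketch, which finds the lowest-cost route between two sites by adding up link costs. The site names and costs are hypothetical, and the real topology calculation also takes schedules, transports, and site link bridging into account, so this illustrates the idea of cost only.

```python
import heapq

# Hypothetical site links: (site_a, site_b, cost). Lower cost is preferred.
SITE_LINKS = [
    ("HQ", "Branch1", 100),
    ("HQ", "Branch2", 100),
    ("Branch1", "Branch2", 500),
    ("Branch2", "Remote", 200),
]


def cheapest_route(links, start, goal):
    """Return (total_cost, [sites]) for the lowest-cost route, or None."""
    graph = {}
    for a, b, cost in links:  # site links connect sites in both directions
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, site, path = heapq.heappop(queue)
        if site == goal:
            return cost, path
        if site in visited:
            continue
        visited.add(site)
        for neighbor, link_cost in graph.get(site, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None
```

With these sample costs, the route from Branch1 to Remote goes through HQ and Branch2 because the direct Branch1 to Branch2 link is priced high. Lowering that link’s cost below 200 would make the direct route the cheaper one.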

Within a site, each DC uses its Knowledge Consistency Checker (KCC) and its knowledge of the site configuration that's stored in the configuration partition to create connection objects between DCs. Connection objects are the pathways that transmit AD objects and attributes to other DCs (replication partners) via the replication process. These connection objects are one-way pathways. This means that every DC must have at least one inbound connection object to receive updates from each upstream replication partner, and at least one outbound connection object to transmit updates to each downstream partner. Replication from one DC to another is triggered by the upstream DC when it advertises to its replication partners that it has an update to share. The DC advertises this almost immediately (within 15 seconds).

In the same way that DCs are connected within a site, sites are linked to each other for replication by connection objects. But the way that the connection objects are created is controlled by how you set up the site links. Most administrators turn down the site link replication interval to 15 minutes from its default of 180 minutes. If you allowed every DC in every site to replicate with every other DC, the situation would quickly become unmanageable. Therefore, one DC is configured as the bridgehead server for each directory partition in each site. In most cases, one bridgehead server handles intersite replication for all directory partitions.

Both within a site and between sites, replication is a pull operation. In other words, a DC always requests updates from its upstream partners instead of pushing them out to its downstream partners. Therefore, when you troubleshoot, you should always think of objects and attribute updates as incoming requests to the DC that you’re working on. (For comprehensive documentation about replication, see the Microsoft TechNet article “How Active Directory Replication Topology Works”.)
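
A small model helps keep the direction straight. In this Python sketch, each connection object is a one-way (source, destination) pair, and a DC’s upstream partners are simply the sources of its inbound connections. The topology is hypothetical (DC3 is an invented name), and the structure is a simplification of what the directory actually stores.

```python
from collections import namedtuple

# A connection object is one-way: updates flow from source to destination.
Connection = namedtuple("Connection", ["source", "destination"])

# Hypothetical topology; DC3 is an invented name.
CONNECTIONS = [
    Connection("KYOSHI", "GODAN"),
    Connection("GODAN", "KYOSHI"),
    Connection("KYOSHI", "DC3"),
]


def upstream_partners(dc, connections=CONNECTIONS):
    """DCs that `dc` pulls updates from (its inbound connections)."""
    return sorted(c.source for c in connections if c.destination == dc)


def downstream_partners(dc, connections=CONNECTIONS):
    """DCs that pull updates from `dc`."""
    return sorted(c.destination for c in connections if c.source == dc)
```

When a DC isn’t receiving updates, upstream_partners tells you which DCs to examine first. In this sample, DC3 has no downstream partners at all, so nothing ever pulls its changes.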

The Right Tools for the Job

Now that you have the basic concept under your belt, and you’ve presumably verified that all the underlying AD components are working correctly, what tools will you use to fix replication? The first thing to do is to run DCDIAG on the target DC to check its general health. DCDIAG is the main diagnostic utility for DCs. It runs a suite of 27 tests by default. For example, Figure 1 shows the Replications test failing for a DC named GODAN.

Figure 1: Results of running the DCDIAG tool Replications test for a DC named GODAN

If a DCDIAG test results in warnings or failures, and if the reason for it isn’t immediately obvious, you should rerun DCDIAG. In the follow-up run, focus on the specific test that failed, and specify verbose operation. In this case, DCDIAG /TEST:Replications /V provides little extra useful information; however, a follow-up run of the DCDIAG test on the source DC (KYOSHI) reveals that the directory service isn’t running.

The next utility to concentrate on is REPADMIN. REPADMIN is the Swiss Army knife of replication utilities. It has 69 different commands in three tiers of increasing complexity, from simple checks to destroy-your-own-directory commands. As if that weren’t enough, the syntax of commands often varies slightly between versions. Knowledge of some of the more arcane REPADMIN commands is a requirement for directory service nerd-dom. You can use REPADMIN /?:command to get detailed help about individual commands in Server 2008 R2 or Server 2008. Table 1 shows a list of REPADMIN commands. (For more information about how to use the Windows 2003 version of REPADMIN, see the Microsoft article "Troubleshooting replication with repadmin".)

Table 1: Common REPADMIN Commands

Generally, the first REPADMIN command to run is /SHOWREPL, which is targeted to the DC that’s not receiving updates. Figure 2 shows the result. This is an intimidating result if you haven’t looked at it before. The data is easier to understand if you break it into sections. The first section, preceding the dashed line, shows general information about the DC. In particular, the data shows that the DC is a Global Catalog server, and it shows the DSA GUID. The next section shows every partition, in distinguished name (DN) format, that this DC hosts. It also shows the DC’s replication partner (and the partner’s DSA GUID) and the time that the DC last replicated successfully.

Figure 2: Results of running the REPADMIN command /SHOWREPL

Knowing Where to Look

Replication usually fails on a per-DC basis. So if you see replication from one partition failing and from another partition succeeding, this probably means that the partitions are replicated from different DCs. In this simpler case, restarting the KYOSHI NETLOGON service clears up the problem. After you obtain and study this detailed replication information, troubleshoot from the wire up to eliminate the most likely suspects. (For more help, you can refer to the replication page of my Active Directory Troubleshooting flowchart on my Active Directory Troubleshooting Tips and Tricks blog.)

If the replication problem that you’re troubleshooting is between sites, first check that the sites of the upstream and downstream DCs are connected to one another by site links. To learn which DCs are the bridgehead servers between these sites, run

REPADMIN /BRIDGEHEADS *

(The asterisk returns the bridgehead servers for all your sites.) Then, run

REPADMIN /FAILCACHE FSMO_ISTG:site

This command targets the intersite topology generator for the site that’s represented by the site parameter. It also displays a list of failed replication links that are detected by its KCC. If the problem is caused by an incorrect site topology (e.g., someone moved a DC to a new site without creating a site link object to connect it to the other sites), or if you’re simply moving DCs around, REPADMIN /KCC will force the KCC to recalculate and create connection objects between DCs so that you don’t have to wait for its scheduled run.

When you think you’ve fixed the problem that’s preventing replication, you can trigger general replication for all your target DC’s partners by running

REPADMIN /SYNCALL

or for a specific partner and directory partition by running

REPADMIN /REPLICATE destination_dc source_dc directory_partition

in which the directory partition is, for example, DC=Deuby,DC=net.

It’s important to monitor replication on a regular basis so that you can correct any issues before they get out of hand. The easiest way to do this is to run

REPADMIN /REPLSUMMARY

regularly. Doing this provides you a replication summary of all the DCs in your forest. For deeper analysis, you can run

REPADMIN command *

(instead of using a DC name). This runs a REPADMIN command, such as /SHOWREPL, against every DC in your forest. Tim Springston, an escalation engineer in Microsoft’s Premier Customer Support Group, has blogged about how to use REPADMIN’s /CSV option to create an organized output of /SHOWREPL * that you can use to look at the replication status of all your DCs in Microsoft Excel. (See "Get the Lowdown on your Replication".)
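
If you’d rather filter that CSV output in a script than in Excel, a minimal Python sketch might look like the following. The sample rows are invented for illustration, and the column names are my assumption about the /CSV layout, so check them against your own output before you rely on this.

```python
import csv
import io

# Hypothetical sample shaped like REPADMIN /SHOWREPL * /CSV output.
# The column names are an assumption; check them against your own output.
SAMPLE = """showrepl_COLUMNS,Destination DSA Site,Destination DSA,Naming Context,Source DSA Site,Source DSA,Transport Type,Number of Failures,Last Failure Time,Last Success Time,Last Failure Status
showrepl_INFO,HQ,GODAN,"DC=Deuby,DC=net",HQ,KYOSHI,RPC,12,2024-01-02 08:15:00,2024-01-01 20:00:00,1722
showrepl_INFO,HQ,KYOSHI,"DC=Deuby,DC=net",HQ,GODAN,RPC,0,0,2024-01-02 08:10:00,0
"""


def failing_links(csv_text):
    """Return (destination, source, partition, failures) for failing links."""
    rows = csv.DictReader(io.StringIO(csv_text))
    failures = []
    for row in rows:
        count = int(row["Number of Failures"] or 0)
        if count > 0:
            failures.append(
                (row["Destination DSA"], row["Source DSA"], row["Naming Context"], count)
            )
    return failures
```

Running failing_links on the sample returns only the GODAN row, which is the inbound link that has recorded failures. In practice, you would pass in the text of the file that you redirected the REPADMIN output to.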

Here’s another tip that’s no more technical than a dry erase marker: Use a large whiteboard when you troubleshoot replication issues between multiple DCs or sites. Otherwise, the complexity of the relationships between DCs, directory partitions, and sites will quickly make your head spin.

Finally, I want to put in a good word for an old replication tool that doesn’t seem to get much respect: REPLMON. This utility is part of the Windows 2003 and Windows 2000 Support Tools, and it provides a graphical view of your replication topology. It can’t do nearly as many things as REPADMIN, and some features don’t work with Server 2008 R2 or Server 2008. But it’s the best way to learn how DCs establish connections with one another. (I created a short screencast about REPLMON that will walk you through the basic steps. To watch it, visit YouTube. To obtain Windows Support Tools, visit the Windows Server 2003 Service Pack 2 32-bit Support Tools download page.)

Bottom Line: Eternal Vigilance

AD replication is a process that’s prone to failure. But most of the time, a supporting component is the cause of the problem. If you experience replication problems, check those AD foundations—physical, network, name resolution, the OS, and authentication—before you spend much time on AD itself or on the replication process.

If you correct the underlying problems and give AD a little time to reestablish its connections, many problems will simply disappear. Become familiar with REPADMIN and keep a good image of the underlying structure, and you’ll keep your AD environment healthy.

Sean writes about cloud identity, Microsoft hybrid identity, and whatever else he finds interesting at his blog on Enterprise Identity and on Twitter at @shorinsean.
