Why AD Replication Troubleshooting Is Hard

A couple of weeks ago at TEC 2010, I gave a session called Active Directory Replication Troubleshooting. Rather than just demonstrate a bunch of REPADMIN tricks (which I’ll certainly do here!), I focused on what I think are the core reasons IT pros have a difficult time troubleshooting replication problems. Half of those reasons don’t have anything to do with replication itself.

The way I see it, there are four main reasons why troubleshooting AD in general, and AD replication in particular, is difficult:

1. You aren't really trained in a formal troubleshooting methodology.
2. You don't approach the problem logically and rigorously.
3. You don't really understand how replication works.
4. You don't understand how the tools work.

How much each of these applies to you depends on your background and your job. If you worked in a scientific or engineering field before becoming an AD admin, you may be pretty good at the first two. If you’re a full-time AD admin for a big company, or a hotshot AD consultant, you’re probably pretty good at the last two. But very few of us are good at all four, and you need all four to solve the really tough AD problems. Let’s take a look at the first point.

You aren’t really trained in a formal troubleshooting methodology. Most of us learned how to troubleshoot from coworkers who were already working in our area. My undergraduate degree was in Electrical Engineering (though I can barely screw in a light bulb now), so I had developed a fair amount of rigor. But in my experience, most IT pros aren’t consciously aware of following a formal methodology. Fortunately, there’s a good one lying around for you to simplify and use: the scientific method. Here’s a boiled-down version:

1. Use your experience. For example, a DC can’t communicate with its partners. Have you seen this before? Does it fit into a framework? What does the problem “smell like”?
2. Form a conjecture. Could a network firewall rule have changed?
3. Deduce a prediction from that conjecture. If a firewall rule changed, some or all connectivity will be lost between the DC and systems outside (but not inside) the firewall.
4. Test. Use ping and tracert to map out the connectivity.
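
If you want to make the test step repeatable, here is a minimal Python sketch of the firewall example above. It is only an illustration, not a tool from my session: the host names are hypothetical placeholders, the function names are my own, and it assumes the standard ping command is on the PATH. One function pings a host once; the other compares what you observed against what the firewall conjecture predicts (everything inside still reachable, at least one system outside lost).

```python
import platform
import subprocess
from typing import Dict, Set


def ping(host: str, timeout_s: int = 5) -> bool:
    """Send one ICMP echo request to host and report whether ping succeeded."""
    # Windows ping takes -n for the packet count; Unix-like systems take -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # Caveat: Windows ping can exit 0 on a "Destination host unreachable"
    # reply, so treat this as a rough check rather than proof.
    return result.returncode == 0


def evaluate_prediction(reachable: Dict[str, bool], inside: Set[str]) -> str:
    """Compare observed reachability with the firewall conjecture's prediction."""
    inside_ok = all(ok for host, ok in reachable.items() if host in inside)
    outside_lost = any(not ok for host, ok in reachable.items() if host not in inside)
    if not inside_ok:
        # The conjecture predicts hosts inside the firewall stay reachable.
        return "contradicted"
    if outside_lost:
        return "consistent"
    return "no failure observed"


# Canned observations (hypothetical host names) so the logic runs without a network.
observed = {"dc02.corp.example": True, "dc03.branch.example": False}
print(evaluate_prediction(observed, inside={"dc02.corp.example"}))  # prints "consistent"
```

To run it against real systems, build the dictionary with something like {host: ping(host) for host in hosts}. A "consistent" result doesn't prove the firewall is at fault; it only means the conjecture survived this test, and tracert can then show where the path stops. A "contradicted" result tells you to go back and form a new conjecture.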

This may sound overly simplistic, but what I’m trying to point out is that most troubleshooters aren’t actively aware of using a method. When you’re working through a complex problem, instinct will only take you so far; you need to consciously use a troubleshooting methodology.

Next time I’ll lay out my favorite troubleshooting principles that fit into this methodology.
