Troubleshooting from the Wire Up for Active Directory and Beyond

As an IT pro, you’ve read countless articles over the years on how to troubleshoot the various moving parts of IT that make up your job responsibilities. Exchange, SharePoint, SQL Server, and Active Directory—each application has its own unique set of tasks and troubleshooting procedures. These applications do, however, have one trait in common: At a high level, they share a common troubleshooting methodology. Unfortunately, many IT pros don’t consciously use a solid troubleshooting methodology to solve their problems and, as a result, spend more time working on issues than they need to. Understanding and consciously using solid troubleshooting principles will help you reduce the amount of time you spend backtracking in your testing, prevent you from getting lost in your troubleshooting steps, and allow you to easily reproduce fixes to your problems. We’ll focus on these principles in general and later apply them to Active Directory in particular, so you will spend less time fixing AD and more time performing useful work.

Using the Scientific Method

Troubleshooting is probably an IT pro’s single most important technical skill, and yet most of us learned it haphazardly. Usually, we learned from on-the-job training, perhaps from a more experienced team member or manager. It’s a rare IT pro who has had professional training in structured problem solving. As a result, we have a grab bag of tools that usually works for us and a roughly structured troubleshooting process. We throw the tools at the problem and hope that something we do fixes it.

The foundation of a solid troubleshooting skill set is to use an established methodology that guides you to the right solution. Fortunately, you don’t have to invent a new methodology; the scientific method has already been invented and works for just such a task. Your first reaction is probably that you’re already using the scientific method. You guess what the problem is and test your hypothesis. Did that fix the problem? If not, you go back to the first step. If you’ve used this method (and we all have) you’ve had experience with its biggest shortcoming. It’s a shotgun approach. A truly effective troubleshooting methodology is more precise and detailed than this high-level approach, of course.

Cisco has created an eight-step network troubleshooting model based on the scientific method. I’ve expanded this model by including certain troubleshooting principles from my IT experience. The steps that follow will give you guidance in some of the more critical areas of effective troubleshooting:

1. Define the problem precisely. In this step you want to determine what the problem is exactly. Remove all vagueness and ambiguity from the problem statement. For instance, don’t state, “DC1 isn’t replicating correctly.” This imprecise statement implies that DC1 isn’t pulling updates from any upstream DC, nor are its downstream partners getting updates from it. The precisely stated problem might be much smaller: “DC1 isn’t receiving updates to the configuration partition from DC2.” Precision weeds out assumptions before they can take root.

2. Gather detailed information. Ask yourself questions like: What doesn’t work? What does work? What changed since it was working? If it’s a client, is anyone else having this problem? If so, what do these clients have in common? Do they use the same OS build? Are they on the same subnet? Do they use the same application server? Is there anything unique about this system? If it’s a server, is it behind a firewall? If so, do any other servers show the same symptoms?

3. Consider probable causes for the failure. This is the critical step in troubleshooting, when you need to brainstorm and gather hypotheses: possible reasons the failure occurred. In a complex application such as Active Directory, there are hundreds of potential causes for failure if you don’t narrow down your choices. This is the time to apply your first troubleshooting principle, Occam’s Razor, which states that among the potential explanations for a problem, the simplest is most often the correct one. You’re probably already using Occam’s Razor throughout the troubleshooting process, but don’t swing it carelessly; you may quickly discover that your first guess as to the problem’s root cause, or maybe even your second or third, is wrong. Then what?

This is when you should apply a principle I call “troubleshooting from the wire up.” The component you support depends on a few or many other infrastructure components. Model your troubleshooting along the lines of the seven-layer Open Systems Interconnection (OSI) model that modern networked systems are based upon: start at the bottom, with the physical network wire, and work up to your component. In the case of a distributed application, such as Active Directory, the troubleshooting progression is a) physical network, b) name-to-IP-address resolution, and c) the server operating system. Check all of these before you even get to AD. When you brainstorm hypotheses, be sure to include tests to confirm each layer is indeed working as expected. The beauty of this bottom-up method is that it works for all networked systems because that’s how they’re designed to work.

Now, this brute force method may be overkill for simpler problems. If you’re logged onto the server remotely, for example, you know it has power and network connectivity (at least over some ports), so you can eliminate a few steps. Just be aware that you’ve considered them and eliminated them. By consciously noting a step before you dismiss it, you’ve at least made a mental checkmark that’s easier to revisit if you get stuck further in the process. Steps 2 and 3 often work in an iterative fashion; the act of gathering detailed information often reveals facts that inform new hypotheses.

4. Devise a plan to test the hypotheses. When planning your tests, as much as you may be in a hurry to bring that system back up, try to follow the next troubleshooting principle: Change only one variable at a time. If you make more than one change and the problem is fixed, you won’t know whether it was the change to variable A or variable B that fixed the problem. If server A can’t set up a session with server B, and you make changes to both servers’ networking, you’ll never know which was wrong. Sometimes the needs of the situation (or the manager) will require you to make more than one change at a time, but realize that as a result you won’t be able to pinpoint the root cause.

5. Implement the plan. If at all possible, run the test with a partner or auditor. There’s nothing like a second set of critical eyes to catch information the lead troubleshooter might have missed. Complex plans should always be thought through ahead of time, written down, and then closely followed. “Shooting from the hip” during the implementation might make the results invalid, leading you down a road of wrong assumptions.

6. Observe the results of the implementation. Did it fix the problem? If not, did it change the behavior of the problem at all?

7. Repeat the process from step 3, considering the next most likely hypothesis, if the plan does not resolve the problem. Work upward through the dependent layers of your application. If it’s a SharePoint problem, is SQL Server working correctly? If it’s a replication problem between two domain controllers, do they have network connectivity to each other?

8. Document the changes made to solve the problem. Once you have fixed the problem, do a post mortem to review your troubleshooting process. Some questions to ask yourself are: Did you proceed smoothly to a correct conclusion, or did you bang your head on your desk for a few hours first? If it’s the latter, examine why you missed the root cause and adjust your troubleshooting methodology for the next time. A post mortem doesn’t need to be large and formal (unless the outage is large). What counts is that you improve your troubleshooting process so a similar issue doesn’t catch you again. Head banging is better left to rock concerts than to work surfaces!
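
The “change only one variable at a time” principle from step 4 can be sketched in a few lines of Python. Treat this as an illustration rather than a tool: the function name isolate_fix, the settings dictionary, and the works callback are all hypothetical names invented for this example, and the callback stands in for whatever real test you would run.

```python
from typing import Callable, Dict, Optional, Tuple

Settings = Dict[str, str]


def isolate_fix(
    baseline: Settings,
    candidates: Dict[str, str],
    works: Callable[[Settings], bool],
) -> Optional[Tuple[str, str]]:
    """Try each candidate change alone against the unchanged baseline.

    Returns the first (setting, value) pair that makes `works` succeed,
    or None if no single change fixes the problem.
    """
    for name, value in candidates.items():
        trial = dict(baseline)  # fresh copy, so the previous trial is rolled back
        trial[name] = value     # exactly one variable differs from the baseline
        if works(trial):
            return name, value
    return None


# Hypothetical scenario: server A cannot set up a session with server B.
baseline = {"dns_server": "10.0.0.99", "gateway": "10.0.0.1"}
candidates = {"gateway": "10.0.0.254", "dns_server": "10.0.0.10"}


def works(settings: Settings) -> bool:
    # Stand-in for a real test; in this made-up case only the DNS setting matters.
    return settings["dns_server"] == "10.0.0.10"


print(isolate_fix(baseline, candidates, works))
```

Because every trial starts from a fresh copy of the baseline, a successful result points at one setting and nothing else. Had both changes been applied together, the test would also pass, but you could not say which change was responsible.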

Figure 1 refers to a flowchart for this method as part of the Active Directory troubleshooting flowchart linked at the “AD Troubleshooting Tips & Tricks” blog.

Troubleshooting Active Directory from the Wire Up

The eight-step troubleshooting methodology detailed above applies to a wide variety of situations, both inside and outside the IT world. For the rest of this article, let’s focus on troubleshooting AD and the tools you should use. AD is a complicated application that relies upon other complicated components. Which of those underlying components need to be in good shape for AD to work right?

Figure 2 shows the architectural layers most important to AD functionality: physical, network, name resolution, OS and authentication, and finally AD itself. The physical layer seems obvious: nothing works without power, and if the network cable isn’t connected, the packets can’t flow. Even so, I’ve lost track of the number of times operations was trying to troubleshoot an unavailable DC only to discover that a site was shut down for a national holiday and the site operations staff forgot to tell central operations. This layer also encompasses hardware failures on the DC itself.
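
The layer order above lends itself to a simple bottom-up loop: run one check per layer, lowest first, and stop at the first failure. The sketch below assumes nothing about how each check is implemented; first_failing_layer and the placeholder lambdas are hypothetical, and in practice each one would wrap a real test such as a ping, a DNS query, or an event log search.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run checks from the lowest layer upward; return the first layer that fails."""
    for layer, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes is treated as a failed layer
        if not ok:
            return layer
    return None  # every layer passed


# Placeholder checks in wire-up order; the results are made up for illustration.
checks = [
    ("physical", lambda: True),
    ("network", lambda: True),
    ("name resolution", lambda: False),
    ("OS and authentication", lambda: True),
    ("Active Directory", lambda: True),
]

print(first_failing_layer(checks))
```

Stopping at the first failure is deliberate. A fault at a lower layer usually makes the results from the layers above it meaningless, so there is little point in testing AD replication while name resolution is broken.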

The network layer is where you should check for IP address or subnet configuration errors, WAN or LAN failures, and firewall changes that block ports used by AD. Your best tools for this layer are PING, IPCONFIG, and TRACERT. Firewall changes in particular have proven to be a common (and frustrating) root cause, perhaps because the solution isn’t technology driven: what’s often needed is better communication between the network and directory teams.
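
PING and TRACERT won’t tell you whether a firewall is blocking a specific port, so a quick TCP connection test can help. The following is a minimal sketch using only the Python standard library. The port list covers well-known TCP ports commonly associated with AD, but which of them a given DC actually listens on depends on its roles (DNS, global catalog, LDAPS certificate), and the dynamic RPC port range is not covered. The function name check_tcp_ports is my own.

```python
import socket
from typing import Dict, Iterable

# Well-known TCP ports commonly associated with AD traffic.
# Which ones apply depends on the roles the DC holds.
AD_TCP_PORTS = {
    53: "DNS",
    88: "Kerberos",
    135: "RPC endpoint mapper",
    389: "LDAP",
    445: "SMB",
    636: "LDAPS",
    3268: "Global Catalog",
}


def check_tcp_ports(host: str, ports: Iterable[int], timeout: float = 2.0) -> Dict[int, bool]:
    """Return {port: True/False} showing whether a TCP connection succeeded."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:  # refused, timed out, unreachable, or name not resolved
            results[port] = False
    return results
```

You might call it as check_tcp_ports("dc1.example.com", AD_TCP_PORTS), where the host name is a placeholder. A False result only says the connection failed from where you ran the test; it doesn’t distinguish a firewall block from a stopped service, so run it from both sides of the suspected firewall before drawing conclusions.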

Besides AD itself, name resolution is the layer where you should have the strongest troubleshooting skills because that’s where most of the non-AD issues are found. DNS performs the hostname to IP address resolution for AD, but it also performs a variety of other important functions, such as service location and DSA ID (a unique identifier used for replication) translation. The tools for troubleshooting issues related to name resolution are NSLOOKUP, NLTEST, DCDIAG, and IPCONFIG.
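
To make the service location and DSA lookups concrete, the sketch below builds the NSLOOKUP commands for a few of the DNS records that DCs register. The record name patterns are the standard locator names; the function locator_queries, the example domain names, and the GUID are placeholders I made up. The DSA record is the CNAME of the form GUID._msdcs.forest-root that replication partners use to find each other.

```python
from typing import List, Optional


def locator_queries(
    domain: str,
    forest_root: Optional[str] = None,
    dsa_guid: Optional[str] = None,
) -> List[str]:
    """Build nslookup commands for common AD locator records."""
    forest = forest_root or domain  # single-domain forest if not given
    queries = [
        f"nslookup -type=SRV _ldap._tcp.dc._msdcs.{domain}",      # all DCs in the domain
        f"nslookup -type=SRV _kerberos._tcp.dc._msdcs.{domain}",  # KDCs in the domain
        f"nslookup -type=SRV _ldap._tcp.pdc._msdcs.{domain}",     # PDC emulator
        f"nslookup -type=SRV _ldap._tcp.gc._msdcs.{forest}",      # global catalogs
    ]
    if dsa_guid:
        # CNAME that maps a DC's DSA GUID to its host name for replication
        queries.append(f"nslookup -type=CNAME {dsa_guid}._msdcs.{forest}")
    return queries


# Placeholder names for illustration only.
for command in locator_queries("corp.example.com", forest_root="example.com"):
    print(command)
```

If one of these queries comes back empty for a DC that should be registered, that points at DNS registration rather than at AD itself, which is exactly the distinction the wire-up approach is meant to surface.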

Next is the operating system. Dedicated domain controllers are less complicated to troubleshoot at the OS level than many other application servers because they typically have only the base OS, Active Directory Domain Services, and DNS. In Windows 2003, you can use NETSH DIAG GUI and NETDIAG. In Windows 2008, you can use Server Manager and the Reliability and Performance Monitor. Windows Server 2008 R2 adds the Best Practices Analyzer. And all versions, of course, have the event log. Finally, MPSReports, available from the Microsoft Download Center, is a configuration-gathering tool used by Microsoft Support Services to help diagnose your system. You can use it, too, and view the results of the report using the MPSReports Viewer (also from Microsoft downloads).

Within the OS layer is authentication (i.e., Kerberos), which I’ve broken out because it’s a far more common error source than the general OS. Kerberos relies on close time synchronization and both ticket-granting tickets and session tickets to successfully authenticate identities to resources in the domain. The tools in this area are the event log, KERBTRAY, and KLIST (from the Microsoft Download Center). Sometimes the problem isn’t related to any failures in the authentication mechanism itself but to human “assistance.” I’ve had DCs begin failing authentication because certain Latin American countries legislated unique daylight saving time (DST) changes. The DC in that country didn’t have the hotfix that recognized the change, so operations changed the time manually. This time skew caused the DC’s Kerberos session tickets to fail, and the DC began throwing errors.
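
The DST story comes down to simple arithmetic. Kerberos compares timestamps in UTC, so moving the wall clock by hand (instead of fixing the time zone rules) shifts the DC’s UTC time by the same amount. The sketch below assumes the default AD Kerberos policy tolerance of five minutes; the function name and the sample times are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Default value of the Kerberos policy setting
# "Maximum tolerance for computer clock synchronization".
DEFAULT_MAX_SKEW = timedelta(minutes=5)


def skew_exceeds_tolerance(
    dc_time: datetime,
    reference_time: datetime,
    max_skew: timedelta = DEFAULT_MAX_SKEW,
) -> bool:
    """True if the two clocks differ by more than max_skew.

    Both datetimes must be timezone-aware so they compare in UTC.
    """
    return abs(dc_time - reference_time) > max_skew


reference = datetime(2024, 3, 10, 12, 0, tzinfo=timezone.utc)
moved_by_hand = reference + timedelta(hours=1)     # clock set forward manually for DST
slightly_off = reference + timedelta(minutes=2)    # ordinary drift

print(skew_exceeds_tolerance(moved_by_hand, reference))  # True
print(skew_exceeds_tolerance(slightly_off, reference))   # False
```

A one-hour manual adjustment is twelve times the default tolerance, which is why authentication failed outright rather than intermittently.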

Resolving issues in Active Directory requires all these skills to help isolate the various moving parts of this distributed application. As an IT professional you’re called upon to troubleshoot a wide variety of systems, from an Active Directory that supports thousands of users, to your parents’ home computer, to the kitchen light switch. If you build yourself a strong foundation of structured, logical troubleshooting skills, you can handle all of these situations and boost your professional reputation at the same time.

Sean writes about cloud identity, Microsoft hybrid identity, and whatever else he finds interesting at his blog on Enterprise Identity and on Twitter at @shorinsean.
