Last time I talked about the need to keep a solid troubleshooting methodology in mind, and promised some of my favorite troubleshooting principles. Of course, these aren’t just useful for figuring out why something in AD isn’t working. They’re equally handy when you’re trying to figure out how to fix a sprinkler, or why that last attempt at the almond-crusted mahi-mahi tasted like black licorice*.
- Consciously use a logical method. Pay attention to how you’re attacking the problem.
- Remember Occam’s Razor. William of Occam was a Franciscan friar and philosopher who lived around 1300 AD. Occam’s Razor – how to shave a problem – says that the simplest explanation, the one with the fewest assumptions and variables, is probably the right one. (Actually, the maxim usually attributed to him is “entia non sunt multiplicanda praeter necessitatem”, but then only Mark Minasi would understand it.)
- What changed? Unless a random, high-energy cosmic ray struck your DC to flip a bit in your DIT, something in the environment probably changed to cause the error. See Occam’s Razor above.
- Suppress your tendency to make assumptions. It’s okay when you’re starting out, but if it’s a tough issue, go back to an orderly elimination of variables. (More on that next time.) Intuitive leaps often land in the mud.
- Only change one variable at a time. If you hope to ever find the root cause of a problem, you must honor this principle. If you’re in a real hurry, and your boss is screaming that he doesn’t care why the problem occurred, then okay. But remind him of this when he wants to know the root cause in the post mortem meeting.
- Trust, but verify, problem evidence. An inadequate, inaccurate, or incomplete description of the problem or its symptoms can lead you and your team on a wild goose chase and waste precious time. Get a clear and thorough description of the symptoms before you start.
- Document your steps early in tough issues. If you realize you’re getting into a really thorny problem, stop and record all the steps you’ve taken so far. (Especially if you’re doing this at 3 AM.) It doesn’t take long before you aren’t sure whether you’ve run the same test against a problem more than once. This step is essential to logically eliminating all the potential causes. It’s also very important if you have a post mortem meeting, have to put the problem aside for a few hours or days and return to it, or hand it off to the next shift. I’ve often kept a time log of actions taken during production outages as well, for when managers want to know exactly why it took so long to recover.
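If a notepad isn't handy, the step log from that last principle can be as simple as a few lines of script. What follows is only a minimal sketch in Python, not a tool I'm claiming anyone uses: the names `TroubleshootingLog`, `Step`, `record`, and `report`, and the host `DC01`, are all made up for illustration. It timestamps each action and tells you when you're about to repeat a test you've already run.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Step:
    """One action taken while troubleshooting (hypothetical structure)."""
    when: datetime
    action: str
    result: str


@dataclass
class TroubleshootingLog:
    """A running, timestamped list of steps taken so far."""
    steps: list = field(default_factory=list)

    def record(self, action: str, result: str) -> bool:
        """Append a timestamped step.

        Returns True if the same action (ignoring case and surrounding
        whitespace) was already logged, so you know you are repeating it.
        """
        normalized = action.strip().lower()
        repeated = any(s.action.strip().lower() == normalized for s in self.steps)
        self.steps.append(Step(datetime.now(timezone.utc), action, result))
        return repeated

    def report(self) -> str:
        """One line per step, in the order taken, with UTC timestamps."""
        return "\n".join(
            f"{s.when:%Y-%m-%d %H:%M:%S}Z  {s.action} -> {s.result}"
            for s in self.steps
        )


if __name__ == "__main__":
    log = TroubleshootingLog()
    # "DC01" is a hypothetical domain controller name.
    log.record("Checked replication status on DC01", "no errors reported")
    if log.record("checked replication status on DC01", "no errors reported"):
        print("Already tried that one.")
    print(log.report())
```

The timestamps double as the time log for the post mortem, and the repeat check covers the 3 AM case where you can't remember whether you already ran a test.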
Do you have any favorite troubleshooting principles? I’d love to hear them.
Follow Sean on Twitter at @shorinsean or at TechNet at http://tinyurl.com/seantechnet.
* Accidentally grabbed the anise instead of the lemon pepper, in case you were wondering.