Your training-center instructor never told you about days like this. You're overseeing a large server upgrade, and all your preparatory work is finished. You have a detailed task list, a solid back-out plan, good data backups, hardware vendor support, and management approval. Then the bottom drops out: your servers suddenly fail. All hopes of an upgrade are gone. At this point, your lofty goal is simply to get back to your original configuration. You work through the weekend trying to get your systems back online in time for the next business day, but as employees start to arrive, they discover that their email and main file servers are nowhere to be found. You finally get things back online, but not without significant productivity losses in the user community. Despite your most careful preparations, you now find yourself in the hot seat, trying to justify your actions leading up to and following the meltdown. You start to think about updating your resume in case things don't go well when you meet with management.
The meeting with management might be called a postmortem, a debriefing, or an inquisition. Depending on how serious the outage was and the size of your company, you might be facing your manager, several managers, or a committee that includes other technical people. Management will be intent on finding out what happened, how it happened, and what steps are necessary to prevent it from happening again. You've probably spent much of your training learning how to perform certain procedures successfully, but you must also learn how to react in the face of disaster. Perhaps your studies have given you the impression that if you click the right buttons, everything will go well. However, in the high-pressure environment of the data center, any mistakes or unforeseen problems can come back to haunt you.
Before you embark on any significant project, remember the old adage: hope for the best and plan for the worst. Assemble an up-to-date list of phone, cell phone, and pager numbers for any managers you might need to reach in an emergency. Document procedures for contacting users and clients to describe how any outage might affect their work. Determine who will communicate with the organization about any outages. Keep contact information for your hardware and software vendors at hand.
As you work through a project, consider maintaining a list of the steps you take. This timeline could prove valuable if the project is a success and essential in the event of catastrophe. When you're putting out fires, your memory might fail you. But if you can provide a detailed timeline during the postmortem, no one can accuse you of being evasive.
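One low-effort way to keep such a timeline is to timestamp each step as you perform it rather than reconstructing events afterward. The sketch below is a minimal illustration of that idea; the file name `upgrade_log.txt` and the `log_step` helper are hypothetical choices, not part of any standard tooling.

```python
# Minimal sketch: append timestamped entries to a plain-text change log
# as you work, so the postmortem timeline writes itself.
from datetime import datetime

LOG_FILE = "upgrade_log.txt"  # arbitrary file name for illustration

def log_step(action: str, log_file: str = LOG_FILE) -> str:
    """Record one step of the maintenance window with a timestamp."""
    entry = f"{datetime.now().isoformat(timespec='seconds')}  {action}"
    # Append-only, so earlier entries are never overwritten mid-crisis.
    with open(log_file, "a") as f:
        f.write(entry + "\n")
    return entry

# Example usage during an upgrade window:
log_step("Verified full backup of mail and file servers")
log_step("Began firmware update on first server")
```

Even a simple append-only log like this gives you an ordered, timestamped record to hand to management, which is far more convincing than recollection alone.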
During the meeting, try to answer management's questions honestly. Now is not the time to try to fool anyone. If you made any mistakes, admit them. If you can provide evidence that you acted logically, took all the proper steps, called in the proper resources, and informed management in a timely way, you'll probably survive this crisis. No one expects you to anticipate the unforeseeable.
If you're a senior systems engineer, you might have to pose the tough questions to a colleague whose project went bad. Just remember that you were once in his or her shoes and that bad things can happen to good systems engineers. If you're in a management role, remember that the employees on the other side of the table are probably good at what they do. They might lack maturity, and they might need additional training, but if you support them in this difficult time, they'll work even harder for you. During one crisis I was involved in, our manager stepped in to tell us that she believed in us and knew we were trying our hardest to rectify the situation. Her demeanor was understanding, not confrontational. After that, she had our unequivocal support.