In "Recovering from NT Startup Failures, Part 1," September 1999, I discussed common causes of Windows NT startup failures and introduced you to several techniques that you can use to prevent and quickly recover from NT boot disasters. In this second installment, I provide more prevention and recovery tips, and discuss additional NT boot failure causes and the methods and troubleshooting tools you can use to quickly recover from them.
As I concluded in part 1, the most important step in NT recovery happens long before a failure occurs—preparing for a problem before it begins. To prepare for tomorrow's worst possibilities, you need to take precautionary steps today, such as properly designing your NT systems' hardware and software setup, backing up crucial system configuration information, and developing a disaster-recovery toolkit that includes all the utilities you'll need to recover from common NT boot problems. These resources are your ace in the hole if things go awry.
Most NT users know the importance of maintaining up-to-date copies of the Emergency Repair Disk (ERD) for NT systems. This disk contains a copy of the system Registry and provides crucial information that you need to use NT Setup's Repair process to locate and repair a damaged NT installation. Most IT shops perform regular system backups and create updates of the NT ERD for their NT servers. However, many organizations consider this process tedious and time-consuming because the process requires administrators to physically visit each server and run rdisk.exe. Thus, critical servers' ERDs aren't always as up-to-date as they need to be. If this situation sounds familiar, consider an alternative method of collecting ERD information for your NT machines. Aelita Software Group's ERDisk utility, which Screen 1, page 84 shows, can perform remote, over-the-network ERD creations. In addition to storing ERD information to any drive location (local or network drive) that you specify, ERDisk can handle multiple machines' batch jobs, which you can schedule to run automatically. ERDisk can automate the ERD update process on all your networked NT systems, so you don't have an excuse for not having updated ERDs. (For Aelita's contact information, see "Recovery Resources.")
You need to be vigilant about maintaining updated ERDs for each of your NT systems, but your preventive maintenance shouldn't stop there. In part 1, I discussed methods for maintaining Registry backups that are convenient when you have to perform a recovery operation. For example, the Microsoft Windows NT Server 4.0 Resource Kit regback.exe utility lets you create uncompressed copies of individual Registry hive files. These uncompressed Registry copies are convenient when you need to replace Registry hives. (For more information about regback.exe, see the sidebar "The Regback Profile Quirk," page 86.) However, common sense dictates that storing backup data on the hard disk of the system you're backing up isn't the most fault-tolerant practice. Alternatively, consider using cross-backups, in which you copy important system configuration data, such as Registry backups, from one machine to another machine on the network. The principle behind this practice is that more is always better when it comes to backups, and the best place to store a system's backup is anywhere but on that system.
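As a rough illustration of the cross-backup idea, the following Python sketch copies every file in a backup folder to a per-machine, date-stamped folder on another system. The function name, the folder layout, and the paths in the demo are assumptions made for this example; they aren't part of regback.exe or any NT tool.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def cross_backup(source_dir, dest_root, machine_name, stamp=None):
    """Copy each file in source_dir to dest_root/machine_name/stamp.

    dest_root is assumed to be a folder on a *different* machine
    (for example, a mapped network share). Returns the target folder
    and the list of copied file names.
    """
    stamp = stamp or datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(dest_root) / machine_name / stamp
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for item in sorted(Path(source_dir).iterdir()):
        if item.is_file():
            # copy2 preserves timestamps, which helps when you compare backups
            shutil.copy2(item, target / item.name)
            copied.append(item.name)
    return target, copied


if __name__ == "__main__":
    # Demo with temporary folders standing in for the real locations
    src = Path(tempfile.mkdtemp())
    dst = Path(tempfile.mkdtemp())
    (src / "SYSTEM").write_bytes(b"example hive data")
    target, copied = cross_backup(src, dst, "SERVER1")
    print(f"Copied {copied} to {target}")
```

Keeping each run in its own date-stamped folder means a bad backup never overwrites a good one, which fits the more-is-better principle.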
If cross-backups appeal to you, consider extending this practice beyond Registry data to other types of crucial data. For example, I periodically make offline backups of my Microsoft Exchange Server databases (i.e., dir.edb, pub.edb, and priv.edb) to another server on the network. My backup software uses an Exchange agent to make online backups of Exchange Server, but I've discovered that a recent offline backup simplifies full Exchange Server recoveries (i.e., when you have to restore Exchange Server from scratch). Keep in mind that cross-backups are an additional resource that complements your existing disaster-recovery plan; don't use cross-backups to replace your primary backup solution (e.g., tape backups).
If you don't want to junk up your systems with backup data, you can place this information on removable media, such as CD-Recordable (CD-R) and CD-Rewritable (CD-RW) discs, Zip and Jaz cartridges, magneto-optic (MO) cartridges, or similar media. This practice is a good idea because 3.5" disks, which are the only storage media that NT's ERD utility supports, don't have the reputation of being the most reliable media type.
Autostarting Services and Devices
In part 1, I talked about the following common causes of NT startup failures and the blue screen of death:
- Installing software that corrupts the HKEY_LOCAL_MACHINE portion of the Registry—particularly software that installs new services or drivers on the system.
- Changing a system's network configuration (e.g., in the Control Panel Network applet), followed by NT miswriting the configuration's network bindings in the Registry.
- Underlying file corruption that occurs on a key system file that was already in memory and working before the corruption.
In addition, I provided methods you can use to resolve these problems. The recovery methods I discussed involved wholesale replacement of Registry hive files.
This month, I highlight startup failures that result from a service or driver causing a STOP error when it initializes. Rather than completely restoring the Registry or overwriting entire Registry hives, you can edit the Registry to solve this problem. This solution might be preferable to replacing Registry hive files if you don't want to lose configuration settings or if you're not sure which service or driver is causing the problem.
In some cases, the STOP error results from a service or driver that loads before the GUI appears (i.e., when NT initializes the video display driver and shifts into graphics mode). In other cases, the error might occur after NT shifts into graphics mode; it can even happen during or after the logon process because some drivers and services might still be loading in the background after NT displays the logon prompt. This type of failure is especially likely right after you install a new service or driver or reinstall NT. Additional causes of a service/driver startup problem include software installations that install services or drivers that conflict with other services or drivers or with NT's service pack level, and changes to a system's hardware or software configuration that cause drivers or services that previously loaded successfully to become problematic. For example, physically changing the type of network card without first removing the driver can cause the old driver to produce a STOP error.
Another situation that results in a STOP error is when you change a video card driver on a system with a remote control package installed (e.g., Symantec's pcANYWHERE32). Most remote control applications hook the current display driver during their installation, so problems result when you pull the original display driver out from under these applications. The originally hooked driver is no longer active, so rebooting the system results in a STOP error or blue screen. To safely change a video driver on a system with a remote control package installed, uninstall the remote control software, change the video driver, then reinstall the remote control software.
Renaming, Moving, or Deleting Offending Files
You can employ several methods to prevent a service or driver STOP error. One method is to rename, move, or delete the file to stop the service or driver from loading. If you know the name of the offending service or driver, you can try booting into DOS if the boot volume is FAT or try a parallel NT installation if the boot volume is NTFS, then rename the file to a temporary name. In many cases, this solution causes the STOP error to disappear but leaves a reference in NT's configuration to a service or driver that is no longer there. If you choose this method, be sure to reinstall the service or driver or completely uninstall it after you've booted into NT. This renaming method doesn't work and can cause problems in situations that involve multiple chained services or drivers, such as the previous remote control software example.
Offline Registry Editing
Another method to resolve this service/driver startup problem is to edit the Registry to manually disable the service or driver. How do you edit the Registry if you can't boot NT? As long as you have an alternative method of accessing the volume that contains your original NT installation, you can edit the Registry. To gain access to Registry data from outside the original NT installation, you can boot to a parallel NT installation on the same system, or you can install a disk that contains the NT boot partition (i.e., the NT installation folder and Registry hive files) onto another NT system.
Gaining access to the Registry through a parallel NT installation on the same system is easier than installing the disk in another system because you don't have to physically move disks between machines. However, whether the NT boot partition is FAT or NTFS, you must boot NT to edit Registry data because only NT's Registry editors can edit the hive files, and those editors don't run outside NT. Unfortunately, no one has developed an NT Registry editor that runs under a different OS, such as DOS.
After you gain access to the original installation's Registry hive files, you're ready to begin offline editing. Although you're probably familiar with NT's Registry editors, you might not know that you can use them to open Registry hive files on other NT installations or alternate Registry sets from the same installation. To edit Registry hive files offline, open regedt32.exe (regedit.exe doesn't support loading native Registry hive files offline), highlight the HKEY_LOCAL_MACHINE root key, and select the Load Hive option in the Registry menu to locate the hive file you want to bring into the Registry editor. In this case, you want to change a service or driver's startup type, and NT stores this information in the SYSTEM hive. After you locate and select the file, the system will prompt you to provide a key name for the hive file contents, as Screen 2 shows. This activity doesn't modify the original hive file's name, nor does it permanently affect the Registry of the local installation you're booted under. In addition, the name you choose doesn't matter because the Registry editor will use the name only as a temporary Registry branch that contains the data of the original Registry hive file. After you provide a key name, it will appear in the HKEY_LOCAL_MACHINE window.
At this point, you're editing the SYSTEM hive from your original NT installation, and you can resolve your startup failure. As with any Registry editing session, back up the hive file you're working with before you edit. When you open your new key, SYSTEM2 in my example, the display is slightly different from what you usually see under the SYSTEM key. Most notably, the only ControlSet subkeys available are ControlSetxxx keys, where xxx is a number such as 001. The display doesn't exhibit the CurrentControlSet subkey that you usually see when editing the live Registry of a local machine. The display doesn't show CurrentControlSet because it's an alias for the control set that loaded when NT booted.
To ensure you're editing the correct control set and not the default control set of the parallel NT installation, choose the Select subkey under your newly created key. The right pane of the Registry editor will display several values, as Screen 3 shows. NT uses the values and their data to determine which control set is the default set loaded at startup, which value is the CurrentControlSet value, which data represents the Last Known Good configuration, and which set has failed to boot successfully. In Screen 3, the Current value tells you the last control set NT used during startup. This value represents the control set NT is using as the CurrentControlSet entry. In most cases, this value matches the Default value. In my example, the data contained in Current is 0x2, which tells you that ControlSet002 is the set you want to edit. After you locate the correct control set, you can modify your service or driver startup state.
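The mapping from a Select value to a control set name is simple enough to sketch in a few lines of Python. The sketch doesn't read a real Registry. The dictionary stands in for the Select subkey, and only the Current data of 0x2 comes from my example; the other numbers are made up for illustration.

```python
def control_set_name(number):
    """Turn a Select value such as 0x2 into a key name such as ControlSet002."""
    if not 1 <= number <= 999:
        raise ValueError(f"not a usable control set number: {number}")
    return f"ControlSet{number:03d}"


def pick_control_set(select_values, which="Current"):
    """Return the control set key name that the chosen Select value points to."""
    return control_set_name(select_values[which])


if __name__ == "__main__":
    # Stand-in for the Select subkey; only Current = 0x2 is from the article
    select = {"Current": 0x2, "Default": 0x2, "Failed": 0x0, "LastKnownGood": 0x1}
    print(pick_control_set(select))                   # ControlSet002
    print(pick_control_set(select, "LastKnownGood"))  # ControlSet001
```

The range check rejects a value of 0 because no control set has that number.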
The Registry entries related to your original NT installation's services and drivers are under the HKEY_LOCAL_MACHINE\SYSTEM2\ControlSet00x\Services\name of suspect service or device driver Registry key. In this key, SYSTEM2 refers to the subkey in my example, ControlSet00x reflects the control set you previously determined, and name of suspect service or device driver is the name of the service or device driver that you suspect is causing your problem. Each service and driver that the Services subkey lists stores several values within its root key name, including a Start value (i.e., REG_DWORD value). This value's number determines the current startup state of that service or device driver. Setting the Start value to 0x4 disables a service or driver and prevents NT from attempting to start it during the boot process. Table 1, page 88, lists the possible Start values for services and device drivers. After you finish editing your Registry offline, you must unload the imported hive file. To do so, highlight the key name you assigned to the hive and select Unload Hive from the Registry menu.
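To make the Start values concrete, here is a small Python sketch that builds the key path described above and decodes a Start value. It works on plain dictionaries, not a live Registry. The short labels follow the standard NT start types (0 through 4), and the service name ExampleDrv is hypothetical.

```python
# Standard NT start types for services and device drivers
START_TYPES = {
    0x0: "Boot",       # loaded by the boot loader
    0x1: "System",     # loaded during kernel initialization
    0x2: "Automatic",  # started automatically at system startup
    0x3: "Manual",     # started on demand
    0x4: "Disabled",   # never started
}


def service_key_path(hive_alias, control_set_number, service):
    """Build the path to a service's key under a hive loaded offline."""
    return "\\".join([
        "HKEY_LOCAL_MACHINE",
        hive_alias,
        f"ControlSet{control_set_number:03d}",
        "Services",
        service,
    ])


def describe_start(value):
    """Return a readable label for a Start value."""
    return START_TYPES.get(value, f"Unknown (0x{value:x})")


def disabled_copy(values):
    """Return a copy of a service's values with Start set to 0x4 (Disabled)."""
    updated = dict(values)
    updated["Start"] = 0x4
    return updated


if __name__ == "__main__":
    # ExampleDrv is a made-up driver name used only for this illustration
    entry = {"Start": 0x1, "Type": 0x1}
    print(service_key_path("SYSTEM2", 2, "ExampleDrv"))
    print("Before:", describe_start(entry["Start"]))
    print("After: ", describe_start(disabled_copy(entry)["Start"]))
```

Returning a modified copy leaves the original values intact, so you keep a record of the Start value to restore after you fix or reinstall the component.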
Now that you can disable services and device drivers in your original installation, you can disable the offending element that is preventing NT from booting. Determining which service or driver is the culprit might take experimentation, but you can use the events that led up to the problem and the information that the STOP error screen provides to help isolate and disable the problematic component.
A discussion of NT system recovery isn't complete without mentioning third-party utilities that can assist you in this process. Winternals Software's ERD Commander and Remote Recover, and Systems Internals' NTRecover are excellent products from the premier makers of NT recovery software. Although each of these utilities can help you recover a damaged NT system, they differ in their methodologies and strengths. For example, NTRecover, Systems Internals' original NT recovery utility, lets you access the hard disk of an unstable NT system by connecting a serial cable between the damaged system and a working NT system. After they're connected, you can use NTRecover to copy and delete files, or run Chkdsk or virus scan utilities on the remote disk. In most situations, NTRecover provides all the functionality required to successfully recover an unbootable system.
ERD Commander is a dream come true for NT administrators who long for the days of booting DOS disks to recover wayward DOS and Windows 95 installations. This command-line-based utility boots from a 3.5" disk and can read and write to NTFS volumes. Screen 4 shows ERD Commander's interface. The Professional Edition of this utility includes several enhanced features, such as support for fault-tolerant disk sets (i.e., disk sets using NT's ftdisk.sys driver), the ability to run Chkdsk, password recovery, support for FAT32 volumes, support for the Expand utility, and command-line options that let you selectively control or disable the startup state of services and drivers.
Remote Recover is the newest Winternals recovery-utility product. This utility provides a custom boot disk that includes Network Driver Interface Specification (NDIS) 2 driver support to let you remotely access an unstartable system's NTFS volumes over the network. This support lets Remote Recover remotely access the system and perform recovery functions similar to those of NTRecover and ERD Commander.
Don't Rule Out Hardware
Making assumptions about server disaster recovery is dangerous. For example, when you're dealing with a blue-screened NT installation, assuming that the problem is software-related is easy. However, defective hardware or hardware-related events (e.g., a failing hard disk or disk controller, bad main memory or cache RAM, overly aggressive BIOS performance settings) might be the culprit. By displaying STOP codes that don't indicate hardware as the problem's source, hardware-related blue screens sometimes masquerade as software-related failures.
Hardware-related problems are especially suspect if you have recently changed hardware or a power-related event has occurred (e.g., a full outage or series of voltage sags or spikes). For example, suppose you installed a new fax board in your server last week, and the fax board worked fine during your testing. However, a week later, the server blue screens and the STOP error message doesn't point you to a particular service, driver, or hardware component. The malefactor might be a hardware-related problem with the fax board or the interaction with its driver that occurs only under a heavy traffic load. In a situation like this one, assuming that NT has become damaged is easy. However, if you're fighting a hardware battle with software weapons (e.g., restoring the Registry, reinstalling NT), you might end up chasing your tail for a long time.
In part 1 and in this article, I've shown you advanced techniques and utilities that you can employ in emergency situations in which an NT system refuses to boot. More important, I've discussed proactive measures that you can take now to increase your chances of performing a successful system recovery as well as reduce the amount of time that a recovery operation will take. Microsoft's documentation covers traditional recovery techniques, such as using NT Setup Repair and restoring the Last Known Good configuration, but these measures often prove insufficient. If you perform proactive disaster preparation measures, you might never have to use Microsoft's recovery techniques.
Recovery Resources
ERD Commander, ERD Commander Professional Edition, and Remote Recover
Contact: Winternals Software, 512-330-9861 or 800-408-8415
ERDisk 3.01
Contact: Aelita Software Group, 614-336-9223 or 800-263-0036
NTRecover
Contact: Systems Internals