As Windows NT has become more widely adopted, the number of NT applications has increased dramatically. In addition, virtually every x86 hardware device currently available incorporates NT support. Microsoft estimates that about 12,000 NT-compatible applications and 24,000 NT-supported hardware devices exist. Although these ubiquitous applications and hardware devices make NT an attractive OS that is easy to use and deploy, they're also causing NT to experience growing pains. Misbehaving applications and poorly written device drivers give users the impression that NT is an unstable OS. To date, Microsoft has provided systems administrators only limited tools for system recovery and has largely moved the burden of developing correct applications and drivers to developers. But with Windows 2000 (Win2K), Microsoft has formulated goals to help developers help Microsoft make Win2K an OS with zero unplanned downtime and to help administrators address downtime problems when the problems arise.
How will Microsoft accomplish this? By introducing built-in reliability enhancements in Win2K, and by giving developers tools with which to catch problems in applications and drivers before end users do. However, Microsoft also recognizes that even with development-side efforts, systems will occasionally become unbootable. To address the need for better system recovery than NT 4.0's Emergency Repair Disk (ERD) offers, Microsoft is giving administrators some powerful new tools.
This month, I take you on a tour of the new system-recovery options in Win2K, including safe mode, the Recovery Console (RC), and a new crash dump option. Over the next 2 months, I'll describe the reliability features that Microsoft has added to Win2K (such as Windows File Protection) that protect against unruly applications. I'll conclude Part 3 by describing the Driver Verifier, a powerful tool that helps device driver developers quickly find bugs in their drivers that they might not have noticed before, and that helps administrators identify device drivers that are responsible for system crashes.
Perhaps the most common reason NT 4.0 systems become unbootable is that a device driver, either right after installation or for no apparent reason after having worked satisfactorily for a time, prevents a successful boot by crashing the machine during the boot sequence. Because software or hardware configurations can change over time, latent bugs can surface in drivers. If a driver crashes during the first boot after the system installed the driver, you're in luck. You can select the Last Known Good configuration during the boot sequence to restore the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet Registry key to the version that NT used during the last successful boot of the computer. If the Last Known Good configuration can't help, you sometimes have other options. For example, for a FAT system drive, you can boot off a DOS boot disk and manually delete or rename the driver file. If the system drive is NTFS, you must use a third-party recovery tool or load the computer with a parallel installation of NT so that you can access the NTFS drive.
Win2K is susceptible to device drivers that prevent a system from booting, but Win2K offers another way for an administrator to attack the problem: booting in safe mode. Safe mode is a concept Win2K borrows from Windows 9x—a boot configuration consisting of the minimal set of device drivers and services. By relying on only the drivers and services that are necessary for booting, Win2K avoids loading third-party and other nonessential drivers that might crash.
When Win2K boots, you press the F8 key to enter a special boot menu that contains the safe-mode boot options. You typically choose from three safe-mode variations: standard, networking-enabled, and safe mode with command prompt. Standard safe mode comprises the minimum number of device drivers and services necessary to boot successfully. Networking-enabled safe mode adds network drivers and services to those that standard safe mode includes. Finally, safe mode with command prompt is identical to standard safe mode, except that Win2K runs the command prompt application (i.e., cmd.exe) instead of Windows Explorer as the shell when the system enables GUI mode.
Win2K includes a fourth safe mode—directory services repair mode (DS-repair mode)—which is different from the standard and networking-enabled safe modes. You use DS-repair mode to boot the system into a mode that lets you restore the Active Directory (AD) of a domain controller from backup media. All drivers and services load during a DS-repair mode boot; therefore, you wouldn't use DS-repair mode to boot unbootable systems.
How does Win2K know which device drivers and services are part of standard and networking-enabled safe boots? The answer lies in the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SafeBoot Registry key. This key, which Screen 1 shows, contains the Minimal and Network subkeys. Each subkey contains a list of subkeys that specify the name either of a device driver or service or of a group of drivers. For example, in Screen 1 you can see the vga.sys subkey. This subkey identifies the VGA display device driver that the start-up configuration includes. The VGA display driver provides basic graphics services for any PC-compatible display adapter. The system uses this driver as the safe-mode display driver, in lieu of a driver that might take advantage of an adapter's advanced hardware features but that might also prevent the system from booting. Each subkey under the SafeBoot key has a default value that describes what the subkey identifies; the vga.sys subkey's default value is Driver.
You can also see the Boot file system subkey, whose default value is Driver Group. When developers design a device driver's installation script, they can specify that the device driver belongs to a driver group. The driver groups that a system defines are listed in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ServiceGroupOrder. Developers specify a driver as a member of a group to indicate to NT or Win2K when to start the driver during the boot process. The ServiceGroupOrder key's primary purpose is to define group load ordering; some driver types must load either before or after other driver types. The Group value beneath a driver's configuration Registry key associates the driver with a group. Driver and service configuration keys reside beneath HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services. Thus, if you look under this key, you'll find the VgaSave key for the VGA display device driver. Any file system drivers that Win2K requires for access to the Win2K system drive are in the Boot file system group. If the system drive is NTFS, then the NTFS driver is part of this group; otherwise, the Fastfat file system driver (which supports FAT12, FAT16, and FAT32 drives in Win2K) is part of this group. Other file system drivers are part of the FileSystem group, which the standard and networking-enabled safe-mode configurations also include. (For more detailed information about driver loading and the boot process, see NT Internals, "Inside the Boot Process, Part 1," November 1998, and "Inside the Boot Process, Part 2," January 1999.)
When you boot into a safe-mode configuration, the boot loader NT Loader (NTLDR) passes to the kernel (ntoskrnl.exe) an associated switch as a command-line parameter, with any switches you have specified in the boot.ini file for the installation you are booting. If you boot into any safe mode, NTLDR passes the /SAFEBOOT: switch. NTLDR appends one or more additional strings to /SAFEBOOT:, depending on which type of safe mode you select. For standard safe mode, NTLDR appends MINIMAL, and adds NETWORK for networking-enabled safe mode. NTLDR adds MINIMAL(ALTERNATESHELL) for safe mode with command prompt and DSREPAIR for DS-repair mode.
The Win2K kernel scans boot parameters in search of the safe-mode switches early during the boot, and sets the internal variable InitSafeBootMode to a value that reflects the switches the kernel finds. The kernel writes the InitSafeBootMode value to the Registry value HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SafeBoot\Options\OptionValue so that user-mode components, such as the Service Control Manager (SCM), can determine what boot mode the system is in. In addition, if the system is booting safe mode with command prompt, the kernel sets the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SafeBoot\Options\UseAlternateShell value to 1. The kernel records the parameters that NTLDR passes to it in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SystemStartOptions.
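To make the switch handling concrete, here is a minimal Python sketch of how a boot-option string could be turned into a safe-mode setting and an alternate-shell flag. It is a model of the behavior described above, not the kernel's code: the function name parse_boot_options, the SafeBootMode enumeration and its numeric values, and the /fastdetect switch in the usage note are assumptions chosen for illustration.

```python
from enum import Enum
from typing import Tuple


class SafeBootMode(Enum):
    """Safe-mode variants named in the article (numeric values are illustrative)."""
    NONE = 0
    MINIMAL = 1
    NETWORK = 2
    DSREPAIR = 3


_PREFIX = "/SAFEBOOT:"
_ALT_SHELL = "(ALTERNATESHELL)"


def parse_boot_options(options: str) -> Tuple[SafeBootMode, bool]:
    """Return (mode, use_alternate_shell) for a space-separated switch string."""
    mode = SafeBootMode.NONE
    use_alternate_shell = False
    for switch in options.upper().split():
        if not switch.startswith(_PREFIX):
            continue  # Unrelated switches are ignored by this sketch.
        argument = switch[len(_PREFIX):]
        # MINIMAL(ALTERNATESHELL) means minimal mode with cmd.exe as the shell.
        if argument.endswith(_ALT_SHELL):
            use_alternate_shell = True
            argument = argument[: -len(_ALT_SHELL)]
        if argument in SafeBootMode.__members__:
            mode = SafeBootMode[argument]
    return mode, use_alternate_shell
```

For example, parse_boot_options("/fastdetect /SAFEBOOT:MINIMAL(ALTERNATESHELL)") yields the minimal mode with the alternate-shell flag set, which corresponds to safe mode with command prompt.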
When the I/O Manager kernel subsystem loads device drivers that HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services specifies, the I/O Manager executes the function IopLoadDriver. When the Plug and Play Manager (PnP Manager) detects a new device and wants to dynamically load the device driver for the detected device, the PnP Manager executes the function IopCallDriverAddDevice. Both of these functions call the function IopSafeBootDriverLoad before they load the driver in question. IopSafeBootDriverLoad checks InitSafeBootMode's value and determines whether the driver should load. For example, if the system boots in standard safe mode, IopSafeBootDriverLoad looks for the driver's group, if the driver has one, under the Minimal subkey. If IopSafeBootDriverLoad finds the driver's group listed, IopSafeBootDriverLoad indicates to its caller that the driver can load. Otherwise, IopSafeBootDriverLoad looks for the driver's name under the Minimal subkey. If the driver's name is listed as a subkey, the driver can load. If IopSafeBootDriverLoad can't find the driver group or name subkeys, the driver can't load. If the system boots in networking-enabled safe mode, IopSafeBootDriverLoad performs the searches on the Network subkey. If the system doesn't boot in safe mode, IopSafeBootDriverLoad lets all drivers load.
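The decision that IopSafeBootDriverLoad makes can be sketched in a few lines of Python. This is only a model of the lookup order described above (group first, then driver name). The SAFE_BOOT_KEY dictionary stands in for the SafeBoot Registry key, and its contents, including the NDIS entry and the thirdparty.sys driver in the usage note, are hypothetical examples rather than a real system's configuration.

```python
from typing import Dict, Optional, Set

# Hypothetical stand-in for the SafeBoot key: each safe-mode subkey maps to the
# names of the subkeys (drivers, services, or groups) listed beneath it.
SAFE_BOOT_KEY: Dict[str, Set[str]] = {
    "Minimal": {"vga.sys", "Boot file system", "FileSystem"},
    "Network": {"vga.sys", "Boot file system", "FileSystem", "NDIS"},
}


def safe_boot_driver_load(mode: Optional[str], name: str, group: Optional[str]) -> bool:
    """Return True if a driver may load. mode is None, "Minimal", or "Network"."""
    if mode is None:
        return True  # Not a safe boot: every driver is allowed to load.
    allowed = {entry.lower() for entry in SAFE_BOOT_KEY[mode]}
    # First check the driver's group, if it has one.
    if group is not None and group.lower() in allowed:
        return True
    # Otherwise fall back to the driver's own name.
    return name.lower() in allowed
```

With this table, safe_boot_driver_load("Minimal", "thirdparty.sys", "NDIS") returns False, whereas the same driver is allowed when the mode is "Network".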
A loophole exists regarding the drivers that safe mode excludes from a boot: NTLDR, rather than the kernel, loads any driver whose Registry key has a Start value of 0, which specifies that the driver loads at boot time. Because NTLDR doesn't check the SafeBoot Registry key to identify which drivers to load, NTLDR loads all boot-start drivers.
When the SCM user-mode component (which services.exe implements) initializes during the boot process, SCM checks the value of HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SafeBoot\Options\OptionValue to determine whether the system is performing a safe boot. If so, SCM mirrors the actions of IopSafeBootDriverLoad. While SCM processes the services listed under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services, SCM loads only those services that the appropriate safe-mode subkey specifies by name.
Userinit (\winnt\system32\userinit.exe) is another user-mode component that needs to know whether the system is booting in a safe mode. Userinit, the component that initializes a user's environment when the user logs on, checks HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SafeBoot\Options\UseAlternateShell. If this value is set, Userinit runs the program that HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SafeBoot\AlternateShell specifies as the user's shell, rather than executing explorer.exe. Win2K writes the program name cmd.exe to AlternateShell during installation, making the Win32 command prompt the default shell for safe mode with command prompt. Even though command prompt is the shell, you can type explorer.exe at the command prompt to start Windows Explorer, and you can run any other GUI program from the command prompt as well.
Applications don't check the OptionValue Registry value to determine whether the system is booting in safe mode, because Microsoft has not officially documented OptionValue's existence. Instead, applications use the GetSystemMetrics(SM_CLEANBOOT) Win32 API. Batch scripts that need to perform certain operations when the system boots in safe mode look for the SAFEBOOT_OPTION environment variable because the system defines SAFEBOOT_OPTION only when booting in safe mode.
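A script can perform the environment-variable check in any language. The following Python sketch assumes the variable is named SAFEBOOT_OPTION, which is the name Windows defines during a safe-mode boot (with values such as MINIMAL or NETWORK). The function name safe_boot_option and its optional environ parameter are my own additions so that the check is easy to exercise on any machine.

```python
import os
from typing import Mapping, Optional


def safe_boot_option(environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the safe-mode name (for example "MINIMAL"), or None for a normal boot."""
    env = os.environ if environ is None else environ
    # The variable is absent on a normal boot, so a missing or blank value
    # is treated as "not in safe mode".
    value = env.get("SAFEBOOT_OPTION", "").strip()
    return value.upper() or None


if __name__ == "__main__":
    mode = safe_boot_option()
    print("normal boot" if mode is None else f"safe mode: {mode}")
```

Passing a dictionary instead of reading os.environ directly is purely a testing convenience; a real script would simply call safe_boot_option() with no arguments.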
When you direct the system to boot into a safe mode, NTLDR hands the BOOTLOG string to the Win2K kernel as a parameter, together with the parameter that requests the safe mode. When the kernel initializes, it checks for the presence of the BOOTLOG parameter, whether or not any safe-mode parameter is present. If the kernel detects BOOTLOG, the kernel records the action the kernel takes on every device driver it considers for loading. For example, if IopSafeBootDriverLoad tells the I/O Manager not to load a driver, the I/O Manager calls IopBootLog to record that the driver wasn't loaded. Likewise, after IopLoadDriver successfully loads a driver that is part of the safe-mode configuration, IopLoadDriver calls IopBootLog to record that the driver loaded. You can examine boot logs to see which device drivers are part of a boot configuration.
Because the kernel wants to avoid modifying the disk until chkdsk executes, late in the boot process, IopBootLog can't simply dump messages into a log file. Instead, IopBootLog records messages in the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\BootLog Registry value. As the first user-mode component to load during a boot, the Session Manager (\winnt\system32\smss.exe) executes chkdsk to ensure the system drives' consistency, then completes Registry initialization by executing the NtInitializeRegistry system call. The kernel takes this action as a cue that it can safely open a log file on the disk, which it does, invoking the function IopCopyBootLogRegistryToFile. IopCopyBootLogRegistryToFile creates the file ntbtlog.txt in the Win2K system directory (i.e., \winnt) and copies the contents of the BootLog Registry value to the file. IopCopyBootLogRegistryToFile also sets a flag for IopBootLog that lets IopBootLog know that writing directly to the log file, rather than recording messages in the Registry, is now OK. Listing 1 shows the partial contents of a sample boot log.
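The buffer-then-flush pattern that the boot log follows is easy to model. In the Python sketch below, an in-memory list plays the role of the BootLog Registry value, and copy_to_file plays the role of IopCopyBootLogRegistryToFile. The class name, method names, and sample messages are illustrative assumptions; the real routines live in the kernel and are not callable this way.

```python
import os
import tempfile
from typing import List, Optional


class BootLog:
    """Buffers messages until the disk is safe to write, then logs directly."""

    def __init__(self) -> None:
        self._pending: List[str] = []     # Stand-in for the BootLog Registry value.
        self._path: Optional[str] = None  # Set once the log file exists.

    def log(self, message: str) -> None:
        if self._path is None:
            self._pending.append(message)  # Too early to touch the disk.
        else:
            with open(self._path, "a", encoding="utf-8") as handle:
                handle.write(message + "\n")

    def copy_to_file(self, path: str) -> None:
        """Create the log file, copy buffered messages, and switch to direct writes."""
        with open(path, "w", encoding="utf-8") as handle:
            for message in self._pending:
                handle.write(message + "\n")
        self._pending.clear()
        self._path = path


def demo() -> List[str]:
    """Log two early messages, flush, log one late message, and return the file's lines."""
    log = BootLog()
    log.log("Loaded driver example1.sys")
    log.log("Did not load driver example2.sys")
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "ntbtlog.txt")
        log.copy_to_file(path)
        log.log("Loaded driver example3.sys")
        with open(path, encoding="utf-8") as handle:
            return handle.read().splitlines()
```

The point of the design is ordering: messages recorded before the flush and after it end up in the file in the sequence in which they were logged.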
The Recovery Console
Safe mode is a satisfactory fallback for systems that become unbootable because a device driver crashes during the boot sequence, but in some situations a safe-mode boot won't help the system boot. For example, if a driver that prevents the system from booting is a member of a Safe group, then safe-mode boots will fail. Another example of a situation in which safe mode won't help the system boot is when a third-party boot-start driver, such as a virus scanner driver, prevents the system from booting. (Boot-start drivers load whether the system is in safe mode or not.) Other situations in which safe-mode boots will fail are when a system module or critical device driver file that is part of a safe-mode configuration becomes corrupt, or when the system drive's Master Boot Record (MBR) is damaged. With Win2K's new RC tool, you can boot into a limited command-line shell from the Win2K CD-ROM or boot disks to repair an installation without having to boot the installation.
When you boot a system from the Win2K CD-ROM, you eventually see a screen that gives you the choice of either installing Win2K or repairing an existing installation. If you choose to repair an installation, the system prompts you to insert the Win2K CD-ROM (if the CD-ROM isn't already loaded in the system's CD-ROM drive), then the system prompts you to choose among three repair options: to start the RC, to initiate the emergency repair process, or to use the Advanced System Recovery feature to restore a Win2K installation from backup. If you press the F10 key at the Setup Welcome screen, you bypass the menu options and take a shortcut directly to the RC.
When you start the RC, the RC gives you a list of NT and Win2K installations it compiles when it scans the computer's hard disks. After you make a selection, the system prompts you to enter the Administrator account password to log on to the installation as the administrator. If you successfully log on, the system puts you into a command shell that is similar to a DOS command environment. Table 1 shows the list of commands that are at your disposal in this command shell. The command set is flexible and lets you perform simple I/O operations, enable and disable services and drivers, and even repair MBRs and boot records. However, the RC won't let you access directories other than root directories, the system directory of the installation you logged on to, or directories on removable drives such as CD-ROMs and 3.5" disks. This prohibition provides a certain level of security for data that an administrator might not usually be able to access.
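The directory restriction amounts to a simple access rule, which the following Python sketch models. It assumes that subdirectories of the system directory are allowed along with the directory itself, and that removable drives are identified by drive letter; both points are my reading of the rule rather than documented RC internals. The function name rc_allows_directory and the sample paths in the usage note are hypothetical.

```python
import ntpath
from typing import Iterable


def rc_allows_directory(path: str, system_dir: str, removable_drives: Iterable[str]) -> bool:
    """Return True if the modeled Recovery Console rule permits access to path."""
    # Normalize case and collapse ".." so comparisons are reliable.
    full = ntpath.normcase(ntpath.normpath(path))
    drive, rest = ntpath.splitdrive(full)
    # Any directory on a removable drive (for example "A:") is allowed.
    if drive in {letter.lower() for letter in removable_drives}:
        return True
    # The root directory of any drive is allowed.
    if rest == "\\":
        return True
    # The installation's system directory and (by assumption) its subdirectories.
    system = ntpath.normcase(ntpath.normpath(system_dir))
    return full == system or full.startswith(system + "\\")
```

For instance, with a system directory of C:\WINNT and A: as the only removable drive, the rule admits C:\ and C:\WINNT\system32\config but rejects C:\Documents and Settings.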
By the time the system gives you the choice to install or repair Win2K, the CD-ROM has booted a copy of the Win2K kernel, including all necessary supporting device drivers (e.g., NTFS or FAT drivers, SCSI drivers, a video driver). On x86 systems, the txtsetup.sif file in the i386 directory of the Win2K CD-ROM guides the boot from the CD-ROM; the file contains directives that identify which files need to load and where the files are located on the CD-ROM. Just as when you boot Win2K from a hard disk, the first user-mode program the kernel executes is Session Manager (smss.exe). The Session Manager that Win2K Setup uses differs from the standard-installation Session Manager. The subdirectory \system32 contains the Setup Session Manager, and this component presents you with the menus that let you install or repair Win2K and the menu that asks you what type of repair you want to perform. If you are installing Win2K, Session Manager is the component that guides you through choosing a partition to install to and copies files to the hard disk.
When you run the RC, Session Manager loads and starts two device drivers that implement the RC: spcmdcon.sys and setupdd.sys. Setupdd.sys is a support driver that gives spcmdcon.sys a set of functions that let spcmdcon.sys manage disk partitions, load Registry hives, and display and manage video output. Setupdd.sys communicates with disk drivers to manage disk partitions and uses basic video support built in the Win2K kernel to display messages on the screen.
When you choose an installation to log on to and enter your password, the RC must validate your logon attempt, even though the installation's Win2K security subsystem isn't up and running. Thus, the RC alone must determine whether your password matches the system's Administrator account. The RC's first step in this process is to use setupdd.sys to load the installation's SAM Registry hive from the disk. The SAM stores password information; the SAM hive resides in \winnt\system32\config\sam. After loading the hive, the RC determines whether a system key encrypts the hive. If so, the RC locates the system key in the installation's Registry. SAM hive encryption is a feature that NT 4.0 Service Pack 3 (SP3) introduced that adds protection against DOS-based password snoopers who try to read passwords directly out of a hive file.
Next, the RC locates the Administrator account password in the SAM and decrypts the password, if a system key encrypted the password. In the final authentication step, the RC uses the MD5 hash algorithm—the same algorithm that the Win2K logon process uses—to hash the password and compares the hash against the hashed password that the SAM stores. If the RC finds a match, the system considers you logged on. If the RC doesn't find a match, the system denies you access to the RC.
Most of the commands that the RC implements are uncomplicated. The RC uses the native Win2K system call interface to perform file I/O to support commands such as CD, RENAME, and MOVE. The ENABLE and DISABLE commands, which let you change the startup modes of device drivers and services, work differently. The RC also loads the SYSTEM hive (\winnt\system32\config\system) for the installation you log on to. This hive contains HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services, the Registry key that contains an installation's device driver and service parameters. For example, when you tell the RC that you want to disable a device driver, the RC reaches into the installation's Services key and manipulates the Start value of the specified driver's key, changing the value to SERVICE_DISABLED. The next time the installation boots, that device driver won't load.
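The effect of DISABLE and ENABLE on a driver's Start value can be sketched with a dictionary standing in for the Services key. The numeric start types below are the standard Windows service start constants; the function names, the dictionary layout, and the badscan driver in the usage note are hypothetical and serve only to illustrate the Registry change.

```python
from typing import Dict

# Standard Windows start types for drivers and services.
SERVICE_BOOT_START = 0    # Loaded by the boot loader.
SERVICE_SYSTEM_START = 1  # Loaded during kernel initialization.
SERVICE_AUTO_START = 2    # Started automatically by the SCM.
SERVICE_DEMAND_START = 3  # Started on request.
SERVICE_DISABLED = 4      # Never started.

ServicesKey = Dict[str, Dict[str, int]]  # Stand-in for the Services Registry key.


def disable(services: ServicesKey, name: str) -> int:
    """Set a driver's Start value to SERVICE_DISABLED and return the old value."""
    key = services[name]  # Raises KeyError if the driver has no key.
    previous = key["Start"]
    key["Start"] = SERVICE_DISABLED
    return previous


def enable(services: ServicesKey, name: str, start_type: int) -> None:
    """Set a driver's Start value to one of the non-disabled start types."""
    if start_type not in range(SERVICE_BOOT_START, SERVICE_DISABLED):
        raise ValueError(f"invalid start type: {start_type}")
    services[name]["Start"] = start_type
```

Returning the previous Start value from disable is a design choice in this sketch: it lets an administrator restore a driver such as the hypothetical badscan to its original start type later.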
Kernel Crash Dumps
The last Win2K recovery enhancement I'll describe is a new crash dump option. In NT 4.0, you can configure a system to dump the contents of physical memory to a file when the system crashes. The file that the system generates when a crash dump executes is slightly larger than the amount of physical memory present on the computer. However, much or most of the data in the dump file is useless to support technicians who examine the file to troubleshoot the cause of the crash. Crashes initiate in the NT or Win2K kernel, and the kernel contains interesting information regarding the state of a system at the time of a crash, including which applications were active, which device drivers were loaded, and what code was executing. Most of the time, application data (user-mode data) that physical memory stores is not useful for determining the cause of a crash and simply contributes to the size of a crash dump file.
Microsoft recognized this problem and added a new option to Win2K's Startup and Recovery dialog box, which Screen 2 shows. The option Write kernel information only lets you specify that a crash dump exclude all application data. When you enable this option, the crash analysis tools available for Win2K, including dumpexam and WinDbg, will recognize that a crash file contains only kernel data and therefore will interpret the dump appropriately. The space savings this option can achieve vary from system to system, and even from crash to crash, but on a typical 128MB system (in which a full crash dump would be 128MB), a kernel-data-only crash dump is usually around 40MB in size.