There we were, surrounded by cold cuts and veggies neatly arranged on serving trays. The food was forgotten, though, as we stood by, waiting for the crucial moment. No, it wasn't New Year's Eve—it was one of many evenings spent working on an Active Directory (AD) implementation. Yeah, you know all about it: long days, late nights, bad food—the whole package.
We were on day four of an upgrade from Windows NT 4.0 to Windows 2000 AD. Everything was going well, and we'd accomplished a lot during the preupgrade process. Then disaster struck, and the cold cuts and veggies started flying around the room like mobile homes in a tornado. What went wrong, and how did we fix the problem?
The environment we were dealing with included 30 domain controllers (DCs) in one NT 4.0 domain that spanned multiple locations. We'd gone through all the important upgrade preparation tasks: design meetings, testing the upgrade, creating a backup plan, taking an NT 4.0 DC offline, and everything else that goes into a good upgrade to Win2K and AD.
By the time the snacks arrived, we'd installed two Win2K DCs—DC2 and DC3—in the Phoenix central office and had 28 NT 4.0 BDCs left to upgrade in various locations. For the next phase of the upgrade, we planned to migrate a BDC in Delaware—DC1—to Win2K so that we could take advantage of AD's site capabilities when we reinstalled the Windows 95 clients in Delaware with Windows XP Professional Edition. A crew of two handled DC1 in Delaware; another techie and I sat at control central in Phoenix.
Each DC is different, so you need to consider each one individually when attempting a migration. Our team had rigorously examined how to migrate DC1 with the least amount of pain. Our options included performing a common upgrade, performing a scorched earth upgrade (i.e., using Fdisk or Symantec Ghost's GDisk feature to clean the hard disk and start over), or installing a new instance of Win2K on top of the existing \winnt directory. We wanted the cleanest Win2K installation possible, so we immediately ruled out the upgrade scenario. We didn't want to scorch the entire system because DC1 was a Microsoft Systems Management Server (SMS) Client Access Point (CAP) and thus contained more data than we wanted to remove and restore. Therefore, we moved forward with the third option.
Before we began installing Win2K on DC1, we ensured that the system and all files were backed up. The Delaware crew then maneuvered through Win2K's fresh installation options as we sat tight and sampled the cold cuts.
The remote team installed Win2K on DC1, specifying the system as a member server. After the system rebooted, the remote team ran Dcpromo and selected DC1 as a replica DC for the domain. AD synchronization began, and life seemed wonderful. The system rebooted after replication. We waited for the DC to show up in the Microsoft Management Console (MMC) Active Directory Users and Computers console, the MMC Active Directory Sites and Services console, and DNS; it did so, as planned. One problem remained: DC1 was in the wrong site. We (the Phoenix team) used the Active Directory Sites and Services console to move DC1 to the correct site—and things started to go wrong. The AD relationship between DC2 and DC1 failed, and DC1 seemed to fall off the face of the earth. (This was the point at which the veggies started to fly.)
After we tried a few reboots that had no effect, we analyzed our situation. We had a Win2K server that wasn't communicating with the domain as a Win2K DC. We ran Dcpromo on DC1 to try to demote the DC to a member server and clean up the database, but the demotion failed. DC1 had no connections to the domain and was drifting alone.
As we searched for a solution, we couldn't stop thinking about the time. It was nearly 11:00 p.m. in Phoenix, and we were all weary after a full day's work. Worse yet, we'd thought this step would be simple and straightforward, so we'd scheduled a 6:00 flight for the next morning—and we had a 2-hour drive to the airport. We needed a solution that we could implement quickly but that wouldn't cause more work or problems down the road. We had the following options:
- Reinstall DC1 with a different NetBIOS name. Because DC1 was an SMS CAP server, however, all clients in Delaware relied on the existing name. Although this option would take the least amount of time to perform, we'd have a large cleanup task ahead of us.
- Delete DC1 from AD, then reinstall DC1 with its original name. This option was quick but risky: If the existing DC site connections were left stranded in the AD database, the new DC would have problems involving duplicate names.
- Delete DC1 from AD, clean up the AD database, then reinstall DC1 with its original name. This option required the most work but seemed to offer the most stability.
We decided to follow the most robust plan of attack. We needed to remove DC1 from the enterprise, clean up the AD database, and reinstall DC1 with its original name.
When a DC is orphaned inside AD, cleanup involves many areas. You need to remove the computer object in the Active Directory Users and Computers console, as well as clean out DC-connection objects tucked deep in the AD database. You also need to clean up DNS. With all these tasks ahead of us, we quickly got to work.
First, we dove into the Active Directory Users and Computers console to try to clean out the computer object. We were in luck and were able to delete the object. We immediately forced replication to ensure that all DCs received the update that the DC object was no longer valid. We used the Replication Monitor utility (replmon.exe)—part of the Win2K Server CD-ROM's Support Tools—to force AD database replication across site boundaries. (For more information about replication and Replmon, see the sidebar "Crossing Site Boundaries," or "6 Essential Tools for Troubleshooting AD Replication," April 2002, http://www.winnetmag.com, InstantDoc ID 24222.)
After we verified that the other DCs were updated, we opened the Active Directory Sites and Services console to try to remove the server and connection objects from the associated site. However, we couldn't remove the server object from the AD database. We moved on to the next step, noting that we'd need to come back to the Active Directory Sites and Services console and remove the server object.
We were ready to dive into the abyss of the AD database. We decided to use Ntdsutil to clean up the orphaned database record. Ntdsutil—a powerful command-line tool—includes plenty of commands for working with the AD database. We needed the tool's Metadata cleanup command to delete an orphaned database record. You can read more about Ntdsutil and its commands by going to http://www.microsoft.com/windows2000/techinfo/reskit/en-us/default.asp and clicking Distributed Systems Guide, Appendixes, and Active Directory Diagnostic Tool (Ntdsutil.exe).
We were aware of some of the tool's tricky aspects. For example, Ntdsutil doesn't connect to a DC when you first launch the tool; instead, you must manually establish all connections to a DC. As Figure 1 shows, we used the Metadata cleanup command's connection option to connect to DC2, from which we could access the AD database.
After we accessed AD, we needed to point Ntdsutil to the orphaned DC1 object, and to do that, we needed to connect to the appropriate domain and site. As Figure 2 shows, we used the Metadata cleanup command's select operation target option to get a listing of domains and sites, then select the correct domain, site, and server.
After we selected the server object, we were ready to use the Metadata cleanup command's remove selected server option to remove the object from the database. As Figure 3 shows, we entered the command. A dialog box appeared asking if we were certain that we wanted to remove the server object from the database. After checking the server's name and location to ensure that we weren't removing the wrong server, we confirmed our decision in the dialog box and received confirmation of the object's successful removal. To make sure that all other DCs were aware of the change, we used replmon.exe again to force replication. We then used Ntdsutil on both DC2 and DC3 to ensure that the DC1 server record was removed from all DCs that housed the AD database.
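The Ntdsutil steps described above follow a fixed order: connect, select a target, remove. As a rough sketch, the Python below only assembles that command sequence as text so that you can review it before typing it at a console; it doesn't run Ntdsutil or touch AD. The function name and the index arguments are hypothetical. Ntdsutil numbers the domains, sites, and servers it lists at run time, so the zeros in the usage example are placeholders, not values from our environment.

```python
def metadata_cleanup_commands(connect_to, domain_idx, site_idx, server_idx):
    """Return, in order, the Ntdsutil commands for removing one orphaned DC.

    connect_to  -- name of a healthy DC to bind to (DC2 in this article)
    *_idx       -- the numbers Ntdsutil prints next to each listed item;
                   they are only known after running the "list" commands
    """
    return [
        "ntdsutil",
        "metadata cleanup",
        # Ntdsutil starts unconnected, so bind to a working DC first.
        "connections",
        f"connect to server {connect_to}",
        "quit",
        # Narrow the target down: domain, then site, then server.
        "select operation target",
        "list domains",
        f"select domain {domain_idx}",
        "list sites",
        f"select site {site_idx}",
        "list servers in site",
        f"select server {server_idx}",
        "quit",
        # Ntdsutil asks for confirmation before it removes the server.
        "remove selected server",
        "quit",
        "quit",
    ]


if __name__ == "__main__":
    # Placeholder indexes; read the real ones from the "list" output.
    for command in metadata_cleanup_commands("DC2", 0, 0, 0):
        print(command)
```

Printing the sequence first gives you a checklist to follow at the console, which helps at 11:00 p.m. when removing the wrong server is an easy mistake to make.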
The next crucial step in the cleanup process involved DNS. Depending on the type of DNS server you use, you need to take the appropriate action to remove the DC's host record (aka the A record) and SRV records. We were using a BIND DNS server, so we asked the DNS server administrator to remove the entries for us. After we used Win2K's Nslookup utility to verify that the entries were gone, we returned to the Active Directory Sites and Services console and tried again to remove the server object. This time, we were able to remove the object without a hitch. All of the server's entries in Active Directory Users and Computers, Active Directory Sites and Services, the AD database (as seen through Ntdsutil), and DNS were finally gone. We were ready to attempt the reinstallation of the DC.
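For the Nslookup check, it helps to know which SRV names to query. The sketch below builds a few of the well-known DC locator record names and the matching Nslookup command lines. The domain name example.com, the host name dc1.example.com, and the function names are assumptions for illustration, and the list isn't exhaustive (a DC registers other records too, such as Kerberos password-change and GUID-based entries).

```python
def dc_srv_record_names(dns_domain, site):
    """Return some of the SRV record names a DC registers (not exhaustive)."""
    return [
        f"_ldap._tcp.{dns_domain}",
        f"_ldap._tcp.{site}._sites.{dns_domain}",
        f"_ldap._tcp.dc._msdcs.{dns_domain}",
        f"_ldap._tcp.{site}._sites.dc._msdcs.{dns_domain}",
        f"_kerberos._tcp.{dns_domain}",
        f"_kerberos._tcp.dc._msdcs.{dns_domain}",
    ]


def nslookup_commands(dns_domain, site):
    """Return one Nslookup command line per SRV record name."""
    return [
        f"nslookup -type=SRV {name}"
        for name in dc_srv_record_names(dns_domain, site)
    ]


def mentions_host(nslookup_output, host_fqdn):
    """Crude check: does the captured Nslookup output still name the host?

    This is a plain case-insensitive substring test, so pass a fully
    qualified name to avoid matching similarly named servers.
    """
    return host_fqdn.lower() in nslookup_output.lower()


if __name__ == "__main__":
    # example.com is a hypothetical domain name; Delaware is the site.
    for command in nslookup_commands("example.com", "Delaware"):
        print(command)
```

Run each printed command, then confirm that the removed DC no longer appears in any of the answers; `mentions_host` shows one simple way to test saved output for the stale name.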
We installed the server and joined it to the domain. We then made sure all the TCP/IP settings (e.g., subnet mask, default gateway, DNS server, WINS server, dynamic update configuration) were correct. Finally, we ensured that the IP subnets were correct for the site in which DC1 was going to reside. We then ran Dcpromo, watching the entire process like a hawk. This procedure was pretty much the same as the one we followed before, but this time, everything worked fine. We still don't really know what the problem was the first time through.
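The subnet check mentioned above is worth doing on paper before you run Dcpromo, because AD uses subnet-to-site mappings to decide where a new DC belongs. Here's a minimal sketch of that lookup using Python's standard ipaddress module. The subnets and addresses are invented for illustration, and the rule that the most specific matching subnet wins is this sketch's assumption about how overlapping subnets resolve, not something we verified during the upgrade.

```python
import ipaddress

# Hypothetical subnet-to-site table; substitute the subnets that are
# defined in your own Active Directory Sites and Services console.
SITE_SUBNETS = {
    "10.1.0.0/16": "Phoenix",
    "10.2.0.0/16": "Delaware",
}


def site_for_address(address, site_subnets=SITE_SUBNETS):
    """Return the site whose subnet contains the address, or None.

    When several subnets contain the address, the one with the longest
    prefix (the most specific subnet) is chosen.
    """
    ip = ipaddress.ip_address(address)
    matches = []
    for cidr, site in site_subnets.items():
        network = ipaddress.ip_network(cidr)
        if ip in network:
            matches.append((network.prefixlen, site))
    if not matches:
        return None
    return max(matches)[1]


if __name__ == "__main__":
    # 10.2.5.10 is a made-up address for DC1.
    print(site_for_address("10.2.5.10"))
```

If the lookup returns a site other than the one you expect, or no site at all, fix the subnet definitions before promoting the server rather than moving the DC between sites afterward.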
Sweat the Small Stuff
As you can see from our experience, small errors can make big waves within AD. When you're dealing with a migration, be sure you're prepared to deal with both the small and large changes to the DCs in your existing enterprise. Let's hope you won't have many messes as a result, but if you do, you have plenty of good tools, such as Replmon and Ntdsutil, at your disposal.