Authentication Topology

You're designing a new Windows 2000 Active Directory (AD) deployment for a network with more than 10,000 users and more than 300 servers in 30 physical locations, connected with occasionally overcommitted T1 lines. What's your biggest headache? If you're like most AD designers, replication is what wakes you up at 3:00 a.m. and keeps your mind spinning long after your eighth cup of coffee at 10:00 a.m. But as important as your replication topology is, it probably isn't what you should be paying the most attention to. During typical operations, replication accounts for relatively little of the network traffic that AD generates, and an end-to-end replication latency of even a few hours won't cause users any real heartburn. But if you let even one user's AD authentication time grow to more than 20 seconds or so, you can bet that your Help desk phone will be ringing off the hook. When you understand how AD clients locate a domain controller (DC) to which to authenticate, you can configure your clients and DCs to shave time off authentication and improve overall performance.

Locating a DC
When a DC boots, and periodically while it's running, the DC's Netlogon service uses the dynamic DNS (DDNS) protocol to publish DNS SRV records (see the sidebar "DNS SRV Records," page 62, for more information about the format of these records). These SRV records describe the types of services that the DC provides—for example, Kerberos authentication, Lightweight Directory Access Protocol (LDAP)—based directory access, and AD Global Catalog (GC) lookups. AD DCs organize SRV records hierarchically so that clients can find the DCs that provide a specific service within a domain or within a particular site and domain.

Figure 1, page 62, shows the DNS naming hierarchy that DCs use to publish their SRV records. You've certainly seen this structure in your Win2K DNS. The structure lets clients locate services even when the clients don't have full knowledge of a service's characteristics. For example, to find any GC in the forest, the client need only specify the forest name and the protocol for which to search (i.e., search the forestname._tcp domain for _gc SRV records). This type of search would return SRV records for all the GC servers in the specified forest. If the client knows a specific AD site name, it can search the forestname._sites.sitename._tcp domain for _gc SRV records. This type of search will return the GCs in the specified site.

Publishing these SRV records in DNS helps clients find the DCs that can most efficiently handle their requests. When a client computer authenticates to a domain, the computer must first locate and authenticate with a DC in its domain. To get the best possible performance, the client computer needs to select a DC in its AD site.

The client's Netlogon process handles the client's end of the authentication process, and Netlogon's DC Locator component is responsible for finding a DC with which to authenticate. In earlier versions of Windows, the client's DC Locator uses WINS to locate a DC. However, Win2K machines and other AD-enabled clients search for the appropriate DNS SRV records.

The first time a workstation authenticates to its domain, the workstation doesn't know which site it's part of, so its first task is to find any DC in the domain. The workstation issues a DNS query for _ldap SRV records in the _tcp.dcs._msdcs.domainname domain. The client queries the primary DNS service specified in its TCP/IP configuration. If the client gets no response from this DNS service, the client fails over to the subsequent DNS services that the TCP/IP configuration specifies.

The DNS service responds with a list of SRV records that correspond to all the DCs in the client's domain. The client takes the records with the lowest-priority value and issues an AD ping (which is actually an LDAP-over-UDP query) to each DC in turn. If a DC doesn't respond within a tenth of a second, the client tries the next DC, and so on, until a DC responds.

When a DC receives an AD ping from a client, the DC calculates two crucial pieces of information before sending a response. First, the DC determines the site closest to the client; to do so, the DC compares the IP address in the request packet with an in-memory data structure that contains the site and subnet associations defined in AD's site objects. The DC also determines whether it's in the site closest (from an IP topology point of view) to the client's site. The DC sends this information and the name of the responding DC's site in a UDP response to the client.

When the client receives this response, it determines whether the responding DC is in the site closest to its site. If so, the client saves the returned client site name in the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters registry subkey's DynamicSiteName entry and uses that DC for further domain-authentication requests. If the DC response indicates that the DC isn't in the site closest to the client's site, the client returns to DNS to find a DC in the closest site. This time, because the client knows its site name, it queries DNS for _ldap SRV records in the _tcp.sitename.sites.dc._msdcs.domainname domain. DNS responds with a list of SRV records for DCs in the specified site. The client again selects those SRV records with the lowest priority and issues AD pings to each in turn until one responds within a tenth of a second.

What happens if no DC in the client's site responds? In this (I hope rare) case, the client sticks with the original DC, which might be a long distance away.

Let's walk through a few concrete examples. Figure 2 shows a network with two sites: one in Scottsdale, Arizona, and one in Amsterdam, The Netherlands. Each site contains one DC: DCSC1 and DCAM1, respectively. DNSSC1 is the primary DNS server for the new workstation WSSC1. WSSC1 first queries its DNS server (i.e., DNSSC1) for any DC in the ds.megacorp.com domain, and DNS responds with SRV records for DCAM1 and DCSC1. WSSC1 then sends an AD ping over the WAN to DCAM1, which responds within 100 milliseconds (ms). DCAM1 returns the name of WSSC1's site (i.e., Scottsdale) and a flag indicating that DCAM1 isn't in the site closest to Scottsdale. WSSC1 queries DNSSC1 again, this time requesting DCs in the Scottsdale site. DNSSC1 returns the SRV record for DCSC1. WSSC1 sends an AD ping to DCSC1 and gets a reply that DCSC1 is in the site closest to WSSC1. WSSC1 authenticates with DCSC1 and stores its site name (i.e., Scottsdale) in the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters registry subkey's DynamicSiteName value. The next time WSSC1 authenticates, it automatically uses the site name in DynamicSiteName, speeding subsequent authentications.

Now, suppose that WSSC1 is a laptop and that the user travels to the Amsterdam office. When the user plugs the laptop into the Amsterdam network, the laptop receives a new IP address and a new DNS server from the local DHCP server. But this time the transactions look a little different. As Figure 3 shows, WSSC1 queries the DNS server DNSAM1 for DCs in the Scottsdale site because that site is recorded in the laptop's DynamicSiteName registry value. DNSAM1 returns specifics for DCSC1. WSSC1 sends an AD ping to DCSC1, which replies that it isn't in the site closest to WSCC1's site and that WSSC1's site is now Amsterdam. WSSC1 then queries DNSAM1 for DCs in the Amsterdam site and receives the answer DCAM1. WSSC1 sends an AD ping to DCAM1 and gets a reply that DCAM1 is in the site closest to WSSC1's site. WSSC1 authenticates with DCAM1 and stores the site name Amsterdam in the DynamicSiteName registry value.

What if all the DCs in a client's site are down or unresponsive? In this case, the client falls back to whichever DC provided its site information. This DC is typically in another site, which means that the client authenticates to and sets up a secure channel with a remote DC over the WAN—not good for performance. In such circumstances, the client's Netlogon process tries to rediscover a DC in the closest site whenever another attempt to authenticate to the domain occurs. You can set the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters registry subkey's CloseSiteTimeout value to the number of seconds that must elapse before Netlogon will attempt to rediscover its DC. The default value is 900 seconds (i.e., 15 minutes); Netlogon will accept values between 60 and 4,233,600 seconds (i.e., between 1 minute and 49 days).

Sometimes, you don't want an AD client to determine its site membership through the typical DC query. For example, suppose you have a member server with two NICs on different subnets. You can't specify which subnet the system belongs in simply by setting the DynamicSiteName registry value because Netlogon resets that entry. However, the client's Netlogon service does provide a way to bypass the site-membership calculation: Set the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters registry subkey's SiteName value to the name of the site that you want the client to be a member of. The SiteName value requires a string, so you need to enter a name such as Amsterdam. Of course, keep in mind that manually manipulating the registry can be risky; take appropriate precautions to back up your data before you attempt such manipulation.

Covering DC-less Sites
One of the best ways to reduce the complexity and increase the manageability of your AD deployment is to reduce the number of DCs. Each DC you add to your forest increases your hardware, software, and operating costs, not to mention replication latency and troubleshooting time. Satellite offices with good network connectivity and only a few users are great candidates to be DC-less sites.

You might ask, "Why not just add satellite office subnets to the main office site?" This question makes a lot of sense from an AD perspective: Clients would authenticate to DCs in the main office and, because no DC exists at the remote office, you'd have no replication-topology problems to deal with. But AD isn't the only application that relies on the proper definition of sites. For example, Microsoft Dfs, the technology upon which File Replication Service (FRS) is based, locates nearby servers according to AD site definitions. If the satellite office has a Dfs server, you wouldn't want to collapse a satellite site into the main office site.

Suppose you have a small branch office with 30 employees and a reliable T1 connection to the main office. You have no reason to put a DC in the branch location because the network can easily support the authentication and lookup traffic that 30 workstations generate. But which DC will the branch-office clients use to authenticate? The answer lies in the notion of site coverage. DCs publish both generic and site-specific SRV records so that clients can find a DC anywhere in the domain or in a specific site. But by default, DCs also publish site-specific SRV records for nearby sites that have no DCs. This way, a client can use a site-specific DNS query to locate a DC even when no DCs exist in the client's site.

The rules for determining which DC will cover a DC-less site are fairly straightforward. When each DC starts up, it looks in its copy of AD for any DC-less sites. For each DC-less site, the DC determines whether its own site is closest to the DC-less site, which the site-link cost information in AD determines. If the DC's site is closest, the DC publishes site-specific SRV records for that DC-less site. If the DC determines that more than one site shares the lowest site-link cost value, the DC compares the number of DCs in its site with the number of DCs in the other sites. If the DC's site has the most DCs, the DC publishes site-specific SRV records for the DC-less site. Having the site with the most DCs cover the DC-less site provides more DCs to share the load of the DC-less site. If the sites have the same number of DCs, the site that comes first alphabetically covers the DC-less site.

Having all the DCs in a covering site cover the DC-less site has a downside. Each DC in the covering site publishes SRV records for the DC-less site, so a covering site with 5 DCs and 2 GCs will generate an additional 32 SRV records (4 site-specific SRV records per DC and an additional 2 records for each DC that's also a GC). If you have a hub-and-spoke topology with 25 DC-less satellite sites connected to one hub, and you have 25 DCs and 7 GCs in the hub, you'll get a whopping 3550 additional SRV records per domain.

Now consider that each DC is busy republishing its SRV records every hour. (This schedule is the default in Win2K; DCs republish their SRV records every 15 minutes in Windows Server 2003.) You can imagine the kind of network traffic these site-coverage SRV records will generate. And if you have AD-integrated DNS zones, you'll be faced with a significant amount of additional replication traffic going to each DC in the domain. Furthermore, AD stores SRV records with the same name in a multivalued attribute of one AD object. Because of attribute size limitations, AD-integrated zones can handle only about 800 SRV records with the same name. A large branch office could easily exceed this limit.

Thankfully, Win2K Service Pack 2 (SP2) and later provide a way to control site coverage through registry settings. First, you can disable automatic site coverage on the DC. To do so, set the system's HKEY_LOCAL_MACHINE\SYSTEM\CurrentCon-trolSet\Services\Netlogon\Parameters registry subkey's AutoSiteCoverage value to 0. Second, you can explicitly name the sites that a DC will cover. To do so, set the system's HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters subkey's SiteCoverage value to the list of sites to cover (use the sites' relative distinguished names—RDNs—separated by spaces). Setting this value forces the DC to publish SRV records for the named sites, whether or not those sites already contain DCs. You can set the subkey's GcSiteCoverage value in the same way to control the publication of site-specific GC locator records. You can use these registry settings to assign specific DCs in a hub site to service authentications from DC-less satellite sites. This process reduces the number of SRV records added to DNS and makes the system more manageable.

You have one other option to reduce the number of DNS records in your environment. Each DC that hosts an AD-integrated DNS zone publishes a Name Server (NS) record for the domain. DNS uses this record to find authoritative name servers for the zone. If you have many DCs in your domain, you'll get many NS records. To prevent the DNS service from publishing its NS record, set the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNS\Parameters registry subkey's DisableNSRecordsAutoCreation value to 1.

The effect of disabling the publication of NS records is subtle. When a server receives a direct query (e.g., from a client that lists the server as the primary resolver), its DNS service properly resolves names in the zone. But because no NS record exists for the server, other DNS servers, processing recursive queries, won't identify the server as a name server for the domain and won't pass queries to it. For more information about this process, see the Microsoft article "Problems with Many Domain Controllers with Active Directory Integrated DNS Zones" (http://support.microsoft.com/?kbid=267855).

Controlling DC Failovers and Loads
I've shown how you can control which DCs will cover DC-less sites, but what about controlling how the clients within a site use DCs? The answer involves the AD ping process that I described earlier.

Internet Engineering Task Force (IETF) Request for Comments (RFC) 2782 specifies two numeric values—priority and weight—associated with each SRV record. These values control which of several possible candidates the client should select.

SRV priority. The SRV selection rules in RFC 2782 state that clients must first attempt to connect to those hosts with the lowest-priority SRV records and attempt to connect to hosts with higher-valued priorities only when unable to connect to a host with a lower priority. This rule provides multileveled failover for crucial services: Simply set the priority of your primary hosts to 0 and set a higher priority for your backup hosts. By default, all DCs publish their SRV records with a priority of 0. To override this value, set the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters registry subkey's LdapSrvPriority value to a higher number. The DC will use this value in its _ldap SRV records' priority field.

Consider the example that Figure 4 shows. You have two DCs—DC1 and DC2—in the same site, and you want to make sure that DC2 is used only as a failover when DC1 is unavailable. You can set the priority on DC1 to 0 and the priority on DC2 to 10 (the exact value isn't important so long as it's greater than 0). Clients will select DC2 only when DC1 doesn't respond to an AD ping request within 100ms—a situation that likely will occur only when DC1 is down or extraordinarily busy.

SRV weight. RFC 2782 also provides a load-balancing mechanism of sorts. An SRV record's weight field is an integer that defines which SRV records the client should prefer in its selection process. Although the client randomly selects an SRV record from a set of records with the same priority, the probability that the client will select a particular SRV record is proportional to the weight associated with that record.

To define the weight that the DC will publish with its SRV records, set the DC's HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters registry subkey's LdapSrvWeight entry to a value from 0 to 100. By default, DCs use the value 100, indicating that the likelihood that a client will select that DC is 100 percent.

Suppose you have 5000 clients and two DCs in a site. DC1 is a dual-processor 1.6GHz Pentium 4 Xeon box with 1GB of RAM, and DC2 is a single-processor 800MHz Pentium with 256MB of RAM. Without any manual registry configuration, approximately half the clients will attempt to authenticate with DC1 and the other half will attempt to authenticate with DC2. As you can imagine, DC2 will be severely overloaded. However, if you set the weight on DC1 to 80 and the weight on DC2 to 20, clients will prefer the dual-processor box to the single-processor box by a factor of 4-to-1, as Figure 5 shows.

Configuring the Publication of Generic SRV Records
When a client can't find a DC through site-specific SRV records, it will search generic SRV records. This behavior can create some undesirable results. Consider a hub-and-spoke scenario in which the hub site and each satellite site contain a DC. If a client in a satellite site fails to contact its local DC, the client will search DNS for generic SRV records for a DC in its domain. Because all DCs publish generic records by default, the client might select a DC from anywhere in the network, including another remote satellite site—something you almost certainly don't want to happen.

AD provides a way to disable the publication of individual SRV records so that you can make a DC invisible to clients in other sites or even the DC's own site. This ability gives you great control over which DCs clients will use to authenticate. To control the SRV records that a DC publishes, set the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters registry subkey's DnsAvoidRegisterRecords value to a list of mnemonics (separated by spaces) that correspond to the SRV record you want to inhibit. Table 1 shows the available mnemonics and corresponding SRV records.

Consider our hub-and-spoke example: a simple network with a hub in Denver and satellite offices in Scottsdale, San Antonio, and Albuquerque, New Mexico. By default, a client in Scottsdale (i.e., WSSC1) will select the Scottsdale DC (i.e., DCSC1) for authentication. But suppose DCSC1 is down. Ideally, we'd like WSSC1 to fail over to a DC in the Denver hub site. But by default, WSSC1 will simply search the network's generic SRV records, with a good chance of selecting a DC in one of the other satellite sites (i.e., DCSA1 or DCAL1). The solution is to make the DCs in the satellite sites invisible to clients outside each site. In this example, you would disable the publication of all generic records for the DCs in the satellite sites so that clients from other sites won't select them. Specifically, for DCSC1, DCSA1, and DCAL1, set the DnsAvoidRegisterRecords value to include the following mnemonics: Dc, DcByGuid, Gc, GcIpAddress, GenericGc, Kdc, Ldap, LdapIpAddress, Rfc1510Kdc, Rfc1510Kpwd, Rfc1510UdpKdc, and Rfc1510UdpKpwd.

Here's another interesting scenario: Suppose you want to dedicate a DC at some central site for hot backup purposes only. You want the DC to sit there, replicating with the other DCs in the domain, but doing nothing else. In fact, you want to make the DC invisible both inside and outside its site. To do so, set all the mnemonics except DcByGuid, which controls publication of the SRV record that other DCs use to find replication partners.

Speeding Things Up
When you understand how the DC Locator mechanism works, how it depends completely on DCs' SRV records, and how you can control the way in which DCs publish those records, you can optimize your authentication topology to provide the fastest possible logons and directory lookups for your users. Your users—and your Help desk personnel—will thank you.

Authentication Topology

Comments

Plain text