Last week, a client called to tell me that a particular type of email hadn't been delivered to users for more than a week. The client said that users had no problems sending or receiving mail from the desktop, but mail created by a Cold Fusion application running on a Win2K Service Pack 4 (SP4) server, while generating no errors, simply never arrived. I checked the event log and verified that the server hadn't undergone any updates, configuration changes, or reboots during the 24-hour period before the problem surfaced. To complicate matters further, although desktop mail arrived at its destination address, mail sent by Cold Fusion to the same address simply disappeared into cyberspace.
This puzzling state of affairs had me scratching my head as I cruised through several logs on several servers, looking for evidence. The only network modification made on the fateful day was an update to the firewall, a machine that also functions as a Network Address Translation (NAT) device and caching DNS server. Hmmm, suspicious, yes, but not exactly obvious.
After some troubleshooting, I discovered that DNS queries issued with the Nslookup command on the server running the Cold Fusion application failed, whereas queries from other servers to the same DNS server worked as expected. First, I tried restarting the DNS service on both DNS servers; that didn't clear up the problem, but the heavy-handed approach did. After I rebooted the main and caching DNS servers, the week’s backlog of Cold Fusion mail started appearing in users' Inboxes.
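If you want something more repeatable than typing Nslookup by hand on each server, a small script can run the same lookup and report how long the answer took. What follows is only a rough sketch in Python, not what I ran on the client's network; the function name and its parameters are my own invention, and the resolver is passed in as an argument so that the logic can be tried without touching a live DNS server.

```python
import socket
import time


def timed_lookup(hostname, resolver=socket.gethostbyname, clock=time.monotonic):
    """Resolve hostname and return (ok, answer_or_error, seconds_elapsed).

    resolver and clock are injectable (a hypothetical convenience) so the
    function can be exercised with stand-ins instead of a real DNS server.
    """
    start = clock()
    try:
        answer = resolver(hostname)
        ok = True
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        answer = str(exc)
        ok = False
    return ok, answer, clock() - start
```

Calling timed_lookup with a name your mail depends on, once on the misbehaving server and once on a healthy one, gives you a side-by-side comparison: a failure that takes many seconds to come back tells a different story from one that returns instantly.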
Although I successfully worked around the problem, it took some research in the Microsoft Knowledge Base to locate a potential explanation for the sudden breakdown of name resolution on a single server. Here are three newly documented DNS bugs and hotfixes for Win2K SP4 systems. You might want to file them away in the event you encounter Win2K DNS hiccups, especially if your DNS configuration uses forwarders to resolve name lookup requests. Read the entire article before you call the support folks. It turns out that Microsoft released three new versions of DNS in 4 days; if you install only the latest and greatest version, you’ll save time for other activities with much greater fun potential.
During my troubleshooting, I started the Microsoft Management Console (MMC) DNS snap-in on the main DNS server and tried to connect to the caching DNS server. After a lengthy wait, the MMC snap-in reported that it was unable to locate the caching server and placed a red X over the server name in the left pane. According to a Knowledge Base posting dated September 16, this is a known problem in Win2K SP4 when you ask the MMC DNS snap-in to connect to a DNS server on a Win2K domain controller (DC). However, I experienced exactly the reverse: I started the MMC on the DC running DNS, and the snap-in was unable to connect to the caching DNS server. The Microsoft article "You cannot connect to a DNS server by using the DNS snap-in on a Windows 2000 Server-based domain controller" (http://support.microsoft.com/?kbid=884548) states that Microsoft Product Support Services (PSS) has a hotfix for this bug, a new version of dns.exe (version 5.0.2195.6971) with a file release date of August 25.
Here's one possible explanation for why name queries failed only on the Cold Fusion server. On this network, a Win2K DC operates as the main DNS server and forwards queries it can't resolve to a caching server. The caching server, in turn, forwards queries to root servers on the Internet. A September 8 Knowledge Base posting states that a bug exists in how DNS manages name resolution requests that use forwarders. Specifically, a Win2K DNS server ignores “server failure” responses from forwarders and continues to send the name resolution request to the forwarder or to each configured root hint server until the default recursion timeout of 15 seconds is reached. As a result, the DNS server doesn't communicate a lookup failure to the DNS client (the Cold Fusion server in my example) until the timeout period expires.
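To make the difference between the buggy and the fixed behavior concrete, here is a toy simulation in Python. It is a sketch under loud assumptions, not a model of Microsoft's code: it assumes every forwarder answers "server failure" instantly, and the retry spacing is a number I made up for illustration. Only the 15-second default comes from the Knowledge Base posting.

```python
RECURSION_TIMEOUT = 15.0  # seconds; the default recursion timeout cited above


def time_until_client_sees_failure(retry_interval, honor_servfail,
                                   timeout=RECURSION_TIMEOUT):
    """Return (seconds_before_client_is_told, queries_sent) in a simulation.

    Assumption: each query to a forwarder gets an immediate "server failure".
    retry_interval is a hypothetical spacing between re-sent queries.
    """
    if retry_interval <= 0:
        raise ValueError("retry_interval must be positive")
    elapsed = 0.0
    attempts = 0
    while True:
        attempts += 1                 # query sent; the failure comes straight back
        if honor_servfail:
            return elapsed, attempts  # fixed behavior: relay the failure at once
        elapsed += retry_interval     # buggy behavior: ignore it and try again
        if elapsed >= timeout:
            return timeout, attempts  # client hears nothing until the timeout
```

With a 3-second retry spacing, the buggy path returns (15.0, 5), meaning five wasted queries and a 15-second wait for the client, whereas the path that honors the failure returns (0.0, 1).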
Extrapolating from the forwarders bug, I suspect that the Cold Fusion application mishandles name lookup failures when the response is delayed this long. The application might generate a hundred DNS name resolution requests in the time it takes the DNS server to report a single failure.
The hotfix that eliminates the recursion delay is dated August 26, 2004, one day later than the version that eliminates the MMC snap-in bug. According to the Microsoft article "The DNS server waits for a recursion timeout before it sends a 'Server Failure' response to the client in Windows 2000" (http://support.microsoft.com/?kbid=873454), the hotfix containing version 5.0.2195.6972 is available only from PSS. After you install the hotfix, you must reboot the system to load the updated code.
The last DNS hotfix corrects a recursion coding error that causes DNS to consume all available CPU time and memory, potentially hanging the system when resources are exhausted. This bug crops up when the server processes a query for a delegated zone instead of a primary zone. According to the reference article "High memory usage and high CPU utilization in the DNS Server service in Windows 2000" (http://support.microsoft.com/?kbid=873441), the bug causes the DNS server to go into an infinite loop querying a record in the delegated zone that doesn't exist. The oldest of the August updates to DNS, version 5.0.2195.6970 with a file release date of August 23, is superseded by the versions that correct the MMC DNS snap-in and forwarders delay bugs.
To check the version of the currently installed DNS, use Windows Explorer to search for the file dns.exe in %systemroot%. Dns.exe should appear in two places: %systemroot%\system32 and %systemroot%\system32\dllcache. The version in \dllcache is the one that is currently running. Right-click %systemroot%\system32\dllcache\dns.exe, click Properties, select the Version tab, and click Product Version in the Item name box to display the version number in the right pane. When you install a new version of dns.exe, the new version is copied to %systemroot%\system32, but the running version in %systemroot%\system32\dllcache isn't updated until you restart the system.
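Once you have read the version number off the Version tab, compare it number by number rather than as text, because a plain string comparison can misorder dotted versions. The Python sketch below shows one way to do that; the function names are mine, and the script doesn't read dns.exe itself, so you type in the version string you saw.

```python
def parse_version(text):
    """Turn '5.0.2195.6972' into (5, 0, 2195, 6972) for numeric comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_update(installed, required="5.0.2195.6972"):
    """True when the installed dns.exe version is older than the required one.

    The default is the newest version discussed in this article.
    """
    return parse_version(installed) < parse_version(required)
```

For example, needs_update("5.0.2195.6970") is True, and needs_update("5.0.2195.6972") is False.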
As of this writing, the most current version of DNS for Win2K SP4 is the one documented in the Microsoft article "The DNS server waits for a recursion timeout before it sends a 'Server Failure' response to the client in Windows 2000" (http://support.microsoft.com/?kbid=873454): version 5.0.2195.6972 of dns.exe, with a file release date of August 26, 2004. If Microsoft follows its own rule that a later hotfix includes the earlier fixes to the same file, version 5.0.2195.6972 should contain fixes for all three problems I reviewed today. I haven’t installed it yet, so if you do and the results aren't what you expect, share your experiences by posting a comment below.
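For quick reference, the three builds and the articles that document them can be kept in a small lookup table. The Python sketch below rests on the same assumption I make above, namely that each later build includes the fixes from the earlier ones; I haven't verified that, and the function name is my own.

```python
# Last field of the dns.exe version mapped to the article that documents the fix.
FIXES_BY_BUILD = {
    6970: "KB873441: high CPU and memory use on a query for a delegated zone",
    6971: "KB884548: DNS snap-in cannot connect to the DNS server",
    6972: "KB873454: recursion timeout wait before 'Server Failure' is sent",
}


def fixes_included(version):
    """List the fixes a dns.exe version should contain, ASSUMING cumulative builds."""
    build = int(version.strip().split(".")[-1])
    return [text for number, text in sorted(FIXES_BY_BUILD.items())
            if number <= build]
```

Under that assumption, fixes_included("5.0.2195.6972") lists all three articles, and fixes_included("5.0.2195.6970") lists only the delegated zone fix.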