Let’s talk about legacy authentication on your network, specifically the scaling problems it can cause. Those problems can lead to outages once authentication activity reaches a certain threshold. The concern boils down to the different performance characteristics of NTLM and Kerberos. The reason to prefer Kerberos authentication in Windows 2000 Active Directory (AD) and later isn’t just its stronger security; Kerberos also performs better. Each NTLM-based authentication is unique, even if it’s a repeated authentication to the same resource by the same identity. Kerberos, on the other hand, gives that identity a reusable service ticket for the resource, and reusing the ticket requires no interaction with an authenticating server or domain controller (DC). NTLM is therefore a more expensive authentication protocol, as well as a less secure one.
It’s important to know when authentication is failing because a resource bottleneck is being overwhelmed by the volume of NTLM authentication. That’s what this article is about: recognizing when this problem is present in your environment, and knowing how to fix it.
A resource bottleneck can occur when a Windows computer needs to perform NTLM authentication for a user. (Figure 1 shows the NTLM authentication flow across domains.) If you’re familiar with the Windows architecture, you’ll recall that the Local Security Authority Subsystem (Lsass.exe) process is responsible for handling authentication requests. This is true for all versions and roles of Windows. Lsass.exe contains threads, which you can think of as the workers that execute the code. For NTLM authentication, there’s a maximum number of these worker threads that can run at any one time. The out-of-the-box default allows a single thread if the computer is a domain member and two threads if it’s a DC. A domain member uses its NTLM thread to send a request to a DC, and a similar thread on the DC is used to craft the reply. So, in a typical transaction, at least two computers can hit this bottleneck. During a single authentication transaction between a domain member and a DC in its domain, the domain member’s thread stays in a wait state until the DC replies.
If the user is requesting authentication from a trusted domain, an additional DC must be contacted to finish the transaction. The wait state I mentioned now ties up both the original client and that computer’s DC while they wait for the trusted DC’s reply.
Of course, a thread executing an NTLM transaction is faster than the blink of an eye. Speed isn’t a concern until you have a large number of simultaneous NTLM authentication requests, or if many of those transactions are across trusted boundaries to DCs in other domains. Add in a busy server—generating many NTLM authentication requests for its users—and you have a problem emerging.
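To see why volume rather than raw speed is the issue, here is a rough back-of-the-envelope sketch in Python. It assumes that every authentication holds one thread for its full round trip, and the latency figures are hypothetical values chosen for illustration, not measurements.

```python
def max_auths_per_second(threads: int, seconds_per_auth: float) -> float:
    """Upper bound on sustained NTLM authentications per second.

    Assumes each request holds one thread for the whole round trip,
    including any time spent waiting on a DC in a trusted domain.
    """
    if threads < 1 or seconds_per_auth <= 0:
        raise ValueError("threads must be >= 1 and seconds_per_auth > 0")
    return threads / seconds_per_auth


# Hypothetical round-trip times, chosen only for illustration.
same_domain = max_auths_per_second(threads=1, seconds_per_auth=0.02)
across_trust = max_auths_per_second(threads=1, seconds_per_auth=0.20)

print(f"same domain:  about {same_domain:.0f} authentications per second")
print(f"across trust: about {across_trust:.0f} authentications per second")
```

Under these assumptions, a tenfold increase in round-trip time cuts the ceiling tenfold, and any burst of requests above the ceiling has to wait for a free thread.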
The limit on NTLM authentication threads is named MaxConcurrentApi. MaxConcurrentApi (of data type REG_DWORD) can be configured in the registry under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters, and a change requires only a restart of the Netlogon service to take effect.
MaxConcurrentApi is the setting that caps how many threads can handle NTLM authentication requests at the same time. When no thread is free to handle an authentication request, the requesting clients (which can be remote computers) might time out, become unresponsive, or return Access Denied errors to the user. Because those symptoms are so ambiguous, the root cause can be very difficult to identify.
For all versions of Windows, the out-of-the-box default setting for MaxConcurrentApi is only 1 for a member server and 2 if the computer is a DC. In Windows Server 2003 and Windows Server 2008, you can change the MaxConcurrentApi setting as high as 10. If you have Server 2008 R2, the maximum is 150, though the defaults are the same. If you have an original install of Server 2008 (not R2), you can install a hotfix (described in the Microsoft article “You are intermittently prompted for credentials or experience time-outs when you connect to Authenticated Services” at support.microsoft.com/kb/975363), which will let you increase the maximum to 150 as well. That explains the mechanics of the bottleneck. Now let’s talk about identification.
Finding and Fixing the Problem
The most difficult aspect of identifying an authentication bottleneck is that there’s no event logged on any computer. Instead, the errors all happen within the application that requested authentication. Depending on the application's error handling, there might not be enough details to pinpoint NTLM bottlenecks.
Since you don’t have an event and might not have a useful error, you need to look for other symptoms. The thing to keep in mind is that it can happen to any application using NTLM. Common culprits are old line-of-business (LOB) applications, which use NTLM because that was the lowest common denominator at the time.
The best way to tell whether you’re reaching NTLM authentication bottlenecks is to determine if those failures are the result of volume. If the failures tend to occur during high-usage times (e.g., Monday morning, when users are arriving and beginning their work day), that’s an indicator but not necessarily conclusive.
Use the Performance Monitor Netlogon performance object to monitor the server in question during a time when that server is under load. Note that you should do this on the resource server that users are having trouble accessing, as well as on the DCs; you don’t want to miss a potential bottleneck. In the performance log (.blg) file, pay attention to the following (as Figure 2 shows):
- Semaphore Holders equal to the current value of the MaxConcurrentApi registry value setting
- Semaphore Timeouts with any number greater than 0
- Semaphore Waiters with any number greater than 0
If you have any timeouts or waiters, you have an NTLM authentication bottleneck.
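The checks above can be expressed as a small Python sketch. The NetlogonSample type and the assess function are hypothetical names introduced here, and the readings are made up; in practice you would fill them in from the counter values you collected.

```python
from dataclasses import dataclass


@dataclass
class NetlogonSample:
    """One reading of the three Netlogon semaphore counters listed above."""
    holders: int
    waiters: int
    timeouts: int


def assess(samples, max_concurrent_api):
    """Apply the article's rules to a series of counter readings."""
    # All threads busy: holders has reached the MaxConcurrentApi value.
    saturated = any(s.holders >= max_concurrent_api for s in samples)
    # Any waiter or timeout means requests could not get a thread.
    bottleneck = any(s.waiters > 0 or s.timeouts > 0 for s in samples)
    return {"saturated": saturated, "bottleneck": bottleneck}


# Made-up readings for a member server with MaxConcurrentApi left at 1.
readings = [
    NetlogonSample(holders=0, waiters=0, timeouts=0),
    NetlogonSample(holders=1, waiters=3, timeouts=0),
]
print(assess(readings, max_concurrent_api=1))
```

Keeping "saturated" separate from "bottleneck" reflects the distinction in the list: holders at the limit is a warning sign, while waiters or timeouts confirm the problem.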
Recall that I said trusted DCs might be involved and how that could increase those delays and timeouts. You can identify whether trusted domains are a factor by viewing this same performance data in the Report view, as Figure 3 shows. Each domain will appear with detailed numbers.
Identifying the bottleneck is only the first step. Next, you need to address the performance issues that are preventing your users from accessing the services they need to do their jobs. The easiest workaround is to increase the MaxConcurrentApi setting on all involved servers to a number that can handle the load. On Windows Server 2003 or Server 2008, where the maximum is 10, it’s best to raise the setting to 10; on Server 2008 R2 (or Server 2008 with the hotfix installed), you can go higher, up to 150. Then, restart the Netlogon service on those servers.
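As a minimal sketch of that workaround, the Python helper below builds the command lines you might run on each server. The platform labels and the build_commands function are hypothetical names introduced here. The maximums are the ones stated in this article, and the lower bound of 1 is an assumption of this sketch. It only produces the text of standard reg and net commands; it doesn’t change anything itself, so review the output before running it.

```python
# Registry location of the setting, as given in the article.
NETLOGON_PARAMS = r"HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters"

# Maximum values as stated in the article; the keys are hypothetical labels.
MAX_BY_PLATFORM = {
    "2003": 10,
    "2008": 10,
    "2008-hotfix": 150,
    "2008r2": 150,
}


def build_commands(platform: str, value: int) -> list[str]:
    """Return the commands that set MaxConcurrentApi and restart Netlogon."""
    limit = MAX_BY_PLATFORM[platform]
    # The lower bound of 1 is an assumption made for this sketch.
    if not 1 <= value <= limit:
        raise ValueError(f"value must be between 1 and {limit} on {platform}")
    return [
        f'reg add "{NETLOGON_PARAMS}" /v MaxConcurrentApi '
        f"/t REG_DWORD /d {value} /f",
        "net stop netlogon",
        "net start netlogon",
    ]


for command in build_commands("2008", 10):
    print(command)
```

Validating against the platform maximum before emitting anything keeps a value such as 150 from being applied to a server that only supports 10.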
When simply increasing the MaxConcurrentApi setting doesn’t resolve the outage, you have to dig a little deeper to find out which computers and user accounts are sending the authentication requests. The Netlogon service debug log has those answers. (See the Microsoft article “Enabling debug logging for the Net Logon service” at support.microsoft.com/kb/109626 for more information.) Verbose logging isn’t enabled by default, but it’s easy to turn on, it won’t fill up your drive, and its entries are time-stamped for reference.
Things to look for, both in the Netlogon service debug log and elsewhere—in the order of most common to least common—are the following:
- “NlpUserValidateHigher: Can't allocate Client API slot”: This entry in the Netlogon log indicates that the computer has NTLM authentication requests waiting but is already at the maximum number of threads. The entries preceding this one will tell you the username and computer the request is coming from.
- “NlAllocateClientApi timed out”: This entry in the Netlogon log indicates that one of the clients that was waiting to authenticate gave up after waiting for 45 seconds. The appearance of this entry means that a user somewhere received a credential prompt, an error code, or an indefinite wait.
- “(null)\”: Null entries in the Netlogon log indicate that a legacy client on your network is submitting NTLM authentication requests for a domain user but omitting the domain of the user, so instead of domain\user you see (null)\user. In Windows 2003, this can result in extra use of those authentication resources, turning a potential bottleneck into a real one. To resolve that concern, disable the ping behavior by using the Neverping setting, as the Microsoft article “The Lsass.exe process may stop responding if you have many external trusts on an Active Directory domain controller” (support.microsoft.com/kb/923241) describes. Note that this isn’t a concern for Server 2008 and later.
- Repeat offenders: Frequent, repeated authentication attempts (i.e., entries that start with SamLogon) from the same user and computer appearing in the Netlogon log might indicate an application that is malicious or inefficient.
- Kerberos PAC Validation: Oddly enough, this Kerberos security feature is implemented in Netlogon and uses the same threads that are a bottleneck for NTLM authentication. This behavior has an event that will appear in the System event log: event 7 with the source field of Kerberos. If you’re seeing a high volume of these events and also seeing intermittent authentication outages for your users, try disabling this additional security feature temporarily until you can add more servers to handle the load. Disabling this feature permanently isn’t recommended, and the point is moot for an Exchange Server or IIS app pool service, because the feature can’t be disabled for them. Otherwise, the Microsoft article “You experience a delay in the user-authentication process when you run a high-volume server program on a domain member in Windows 2000 or Windows Server 2003” (support.microsoft.com/kb/906736) describes how to do it.
If you confirm that you’re seeing NTLM bottlenecks, the best solution is to use Kerberos instead. Older applications are less likely to support Kerberos, so that might not be an option. That can lead to some tough conversations, weighing the costs of budgeting for new software versus the need for security and scalability. Ultimately, outages and poor performance will help security and scalability win that debate every time.
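The Netlogon log entries listed earlier can be tallied with a short Python sketch like the one below. The marker strings are the ones quoted in this article. The tally_markers function and its category names are hypothetical, and the matching is a plain substring test that assumes nothing about the rest of each log line.

```python
from collections import Counter

# Substrings quoted in the article; the dictionary keys are hypothetical labels.
MARKERS = {
    "no_free_slot": "Can't allocate Client API slot",
    "wait_timed_out": "NlAllocateClientApi timed out",
    "null_domain": "(null)\\",
    "sam_logon": "SamLogon",
}


def tally_markers(lines):
    """Count how many lines contain each marker substring."""
    counts = Counter()
    for line in lines:
        for name, text in MARKERS.items():
            if text in line:
                counts[name] += 1
    return counts
```

Passing an open file object for a copy of the debug log, for example tally_markers(open(path, errors="replace")) where path is wherever you saved it, gives a quick sense of which symptom dominates before you read the entries around it for usernames and computers.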
Two major trends are bringing this topic to the attention of IT folks everywhere: the consumerization of IT and the excellent performance of new hardware and software. Simply put, people want to use unmanaged or legacy clients to connect to really fast services over the cloud. It’s your job to make sure those things “just work” for them, so take a good long look at your network environment and don’t let authentication get you down.