Six Common Problems with Failover Clusters

As you might know, I'm a part of the Microsoft group that supports failover clusters. As a result, I've had to troubleshoot quite a few problems. I'll go over some of the common problems I've seen, explain why they occur, and show you how to fix them.

Common Problem 1

When the Cluster Service starts, it detects the networks on a node, then identifies the network cards in each network. A common problem that I've encountered is that people are unaware that Windows Server Failover Clustering (WSFC) uses only one network card per node in any given network. All other cards on that node in the same network will be ignored.

For example, suppose an administrator, Bill, configured a node with two cards in the same network:

Card1
IP Address: 10.10.10.1
Subnet Mask: 255.0.0.0

Card2
IP Address: 10.10.10.2
Subnet Mask: 255.0.0.0

The Cluster Network Driver (Netft.sys) will use only one network card (or team) per network. With this configuration, both cards fall in the same network (10.0.0.0/8), so Card1 will be used by Cluster Network 1 and Card2 will be ignored by WSFC and not used for any communication between the nodes. Because only one card is being used, if Card1 goes down or loses network connectivity, the node can't communicate with any other nodes. This is a single point of failure. To avoid this problem, you need to configure your cluster so that there are at least two network paths between nodes. That way, if one of the cards goes down, you still have communication between the nodes using the other card.
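To see why both of Bill's cards count as one network, it can help to work out the network ID by hand. The following is a minimal sketch, not something WSFC provides: the Get-NetworkId helper and the $cards sample data are hypothetical names used only for illustration. It masks each IP address with its subnet mask and flags any network that holds more than one card.

```powershell
# Illustrative helper (not part of WSFC): derive the network ID in CIDR form
# from an IPv4 address and its subnet mask.
function Get-NetworkId {
    param(
        [Parameter(Mandatory)][string]$IPAddress,
        [Parameter(Mandatory)][string]$SubnetMask
    )
    $ipBytes   = ([System.Net.IPAddress]::Parse($IPAddress)).GetAddressBytes()
    $maskBytes = ([System.Net.IPAddress]::Parse($SubnetMask)).GetAddressBytes()

    $netBytes = [byte[]]::new(4)
    $prefix   = 0
    for ($i = 0; $i -lt 4; $i++) {
        # Keep only the bits of the address that the mask covers
        $netBytes[$i] = $ipBytes[$i] -band $maskBytes[$i]
        # Count the 1 bits in this mask octet to build the prefix length
        $prefix += ([Convert]::ToString($maskBytes[$i], 2) -replace '0', '').Length
    }
    '{0}/{1}' -f ($netBytes -join '.'), $prefix
}

# Bill's two cards from the example above (sample data)
$cards = @(
    [pscustomobject]@{ Name = 'Card1'; IPAddress = '10.10.10.1'; SubnetMask = '255.0.0.0' }
    [pscustomobject]@{ Name = 'Card2'; IPAddress = '10.10.10.2'; SubnetMask = '255.0.0.0' }
)

# Group the cards by network ID and keep the networks with more than one card
$duplicates = $cards |
    Group-Object { Get-NetworkId -IPAddress $_.IPAddress -SubnetMask $_.SubnetMask } |
    Where-Object Count -gt 1

foreach ($group in $duplicates) {
    Write-Output ('{0}: {1}' -f $group.Name, ($group.Group.Name -join ', '))
}
```

With the sample data, both cards resolve to 10.0.0.0/8, which is why only one of them would be used for cluster communication.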

Common Problem 2

The second common problem is best described using scenarios. I'll describe the problem using two different cluster configurations: single site and multisite.

Single-site cluster. Suppose that Bill the administrator decided to reconfigure his cluster so that it has two networks between Node1 and Node2. On Node1, he changed the network cards' IP addresses and subnet masks to:

Card1
IP Address: 192.168.0.1 (Cluster Network 1)
Subnet Mask: 255.255.255.0

Card2
IP Address: 10.10.10.1 (Cluster Network 2)
Subnet Mask: 255.0.0.0

Bill also changed the IP addresses on Node2 (192.168.0.2 and 10.10.10.2). In addition, on Node1 in the cluster, he added a file server group, giving it the IP address 192.168.0.15.

Afterward, Bill tested the cluster to see whether the file server group would successfully move to Node2 during a failover. However, the IP address failed to come online, so the file server group stayed in an offline state. In the System event log, Bill saw event 1069, with the description that the IP address resource failed.

Why did this failure occur? The reason becomes evident if you use the Windows PowerShell Get-ClusterLog cmdlet to generate a cluster log. You simply need to run the command:

Get-ClusterLog

This command will generate a cluster log on each node. To generate a cluster log on only one node, you can add the -Node parameter followed by the node's name. You can also add the -TimeSpan parameter to create a log that contains information for only the past x minutes. For example, the following command generates a cluster log on Node2 that contains information for only the past 15 minutes:

Get-ClusterLog -Node Node2 -TimeSpan 15

In the results shown in Figure 1, notice "status 5035."

Figure 1: Receiving a Status of 5035 in the Cluster Log File

This error message basically tells you that a cluster network isn't available for the operation. If Bill were to navigate to Networks in Failover Cluster Manager, he'd see that the 192.168.0.0/24 network contains only one network card, the one for Node1. However, there's a new network, 192.0.0.0/8, that contains Node2's network card. When Bill changed the network card's IP address on Node2, he didn't change the subnet mask. The card kept its old 255.0.0.0 mask, so 192.168.0.2 landed in 192.0.0.0/8 instead of 192.168.0.0/24. So, error 5035 occurred because Bill misconfigured the card.

When an IP address resource is created, you have the option to specify the network to use based on the IP address. On its own, WSFC won't change the network that the IP address resource will use if the network doesn't exist on the node to which the resource is moving during a failover. In this example, given the IP address specified by Bill and the subnet mask that the IP address will use, the file server group is only going to work on Cluster Network 1 (192.168.0.0/24).
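A quick way to spot this kind of mismatch is to check that every cluster network contains a card from every node. The sketch below is illustrative only: Find-IncompleteNetwork is a hypothetical helper, and the sample records simply mirror Bill's cluster after the misconfiguration. On a real cluster, the records could presumably be built from the output of Get-ClusterNetworkInterface, as the closing comment suggests. That comment assumes the Network and Node properties convert to their names when cast to strings.

```powershell
# Hypothetical helper: report cluster networks that are missing a card from
# one or more of the expected nodes.
function Find-IncompleteNetwork {
    param(
        [Parameter(Mandatory)][object[]]$Interface,  # records with Network and Node properties
        [Parameter(Mandatory)][string[]]$NodeName    # every node expected on each network
    )
    foreach ($group in ($Interface | Group-Object Network)) {
        $present = @($group.Group | ForEach-Object { $_.Node })
        $missing = @($NodeName | Where-Object { $_ -notin $present })
        if ($missing.Count -gt 0) {
            [pscustomobject]@{
                Network      = $group.Name
                MissingNodes = $missing -join ', '
            }
        }
    }
}

# Sample data mirroring Bill's cluster after he changed the addresses
$sample = @(
    [pscustomobject]@{ Network = '192.168.0.0/24'; Node = 'Node1' }
    [pscustomobject]@{ Network = '192.0.0.0/8';    Node = 'Node2' }
    [pscustomobject]@{ Network = '10.0.0.0/8';     Node = 'Node1' }
    [pscustomobject]@{ Network = '10.0.0.0/8';     Node = 'Node2' }
)

$report = Find-IncompleteNetwork -Interface $sample -NodeName 'Node1', 'Node2'
$report

# On a cluster node (assumption: string conversion yields the object names):
# $live = Get-ClusterNetworkInterface | ForEach-Object {
#     [pscustomobject]@{ Network = "$($_.Network)"; Node = "$($_.Node)" }
# }
# Find-IncompleteNetwork -Interface $live -NodeName (Get-ClusterNode).Name
```

With the sample data, the report lists two incomplete networks: 192.168.0.0/24 is missing Node2 and 192.0.0.0/8 is missing Node1. That's the signature of a subnet mask mismatch.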

Multisite cluster. In the case of a multisite cluster, the nodes at each site typically sit on different IP networks. When you initially create the cluster and its roles using the New Resource Wizard, you'll be prompted to enter an IP address for each of the nodes' networks configured for client access, as Figure 2 shows.

When the New Resource Wizard creates the IP addresses and assigns the network name, it automatically gives the network name an "or" dependency. This means that as long as one of the IP addresses is online, the name will also be online. If you create the groups or resources before adding nodes from a different network, you need to manually create these secondary IP addresses and add the "or" dependency.
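If you have to add the "or" dependency by hand, the dependency is written as an expression in which each resource name is wrapped in square brackets. Here's a minimal sketch of building that expression. New-OrDependencyExpression is a hypothetical helper, and the resource names (including the second site's 172.16.0.15 address and the FileServer1 network name) are made up for illustration.

```powershell
# Hypothetical helper: build an "or" dependency expression from a list of
# IP address resource names.
function New-OrDependencyExpression {
    param([Parameter(Mandatory)][string[]]$ResourceName)
    # Wrap each name in square brackets and join the names with "or"
    ($ResourceName | ForEach-Object { "[$_]" }) -join ' or '
}

# Illustrative resource names, one IP address resource per site
$expression = New-OrDependencyExpression -ResourceName @(
    'IP Address 192.168.0.15'
    'IP Address 172.16.0.15'
)
$expression

# On a cluster node, the expression could then be applied to the network name
# resource (FileServer1 is an assumed name):
# Set-ClusterResourceDependency -Resource 'FileServer1' -Dependency $expression
```

The secondary IP address resource itself would still need to exist in the group before the dependency is set.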

Common Problem 3

When creating a cluster, you don't have to be a domain administrator but you do need to have the proper rights to create objects in Active Directory (AD). For starters, you need the Read and Create rights on the organizational unit (OU) where the Cluster Name Object (CNO) will be created. The CNO is the computer object associated with a cluster resource called Cluster Name. When creating a cluster, WSFC uses the user account with which you logged on to create the CNO in the same OU where the nodes reside. If you don't have rights to this OU, the cluster creation will fail with the error shown in Figure 3.

Figure 3: Getting an Error When Trying to Create a Cluster

In "Troubleshooting Windows Server 2012 Failover Clusters," I mentioned that you can use the Validate a Configuration Wizard in Failover Cluster Manager to help determine the root causes of problems. Using this wizard, you can run numerous tests, including the Validate Active Directory Configuration test. If you run this test and you don't have rights to the OU, you'll get errors like those shown in Figure 4. After you fix the rights, you should be able to create the cluster.

Figure 4: Getting Errors When Running the Validate Active Directory Configuration Test

All other cluster network name resources in the cluster are associated with Virtual Cluster Objects (VCOs), which are created in the same OU as the CNO. Because the CNO creates all the VCOs in the cluster, the CNO itself needs the Read and Create rights on that OU before you create roles in the cluster. If the CNO doesn't have these rights, the new role will be created but the name will be in a failed state. In this case, you'll see event ID 1194 in the System event log, as shown in Figure 5.

Figure 5: Receiving Event ID 1194 in the System Event Log

There are other settings on the local machine that can cause errors (including access denied errors) when creating VCOs in AD:

The local Users group no longer includes Authenticated Users. It's usually removed by Group Policy Objects (GPOs) or security templates.
In the local security policy, the Access this computer from the network or Add workstations to the domain option no longer includes Authenticated Users. It's usually removed by GPOs or security templates.
The following security rights are enabled:
Network access: Do not allow anonymous enumeration of SAM accounts
Network access: Do not allow anonymous enumeration of SAM accounts and shares
The Cluster Name resource is in a failed state.

Common Problem 4

The CNO and VCOs are computer accounts—and like user accounts, computer accounts have passwords. AD randomly generates the passwords for computer accounts. By default, the domain policy will reset the password for a computer account every 60 days.

The CNO is used for operations such as joining new nodes to the cluster, creating new objects in the domain, and performing a live migration of virtual machines (VMs) between nodes. The CNO's domain password must be up-to-date for these operations to occur. To be on the safe side, the Cluster Service will attempt to reset the password for its objects at the halfway point (30 days). If the password hasn't been reset at the 60-day mark, the name will fail to come online.
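To check where a CNO or VCO stands against these 30-day and 60-day marks, you can compare the time its password was last set with the current date. The following is a rough sketch under stated assumptions: Get-PasswordAgeStatus is a hypothetical helper, the status labels are my own, and the 60-day default simply reuses the figure quoted above. On a machine with the ActiveDirectory module, the PasswordLastSet value could come from Get-ADComputer, as the closing comment suggests.

```powershell
# Hypothetical helper: classify a computer account's password age against the
# 60-day policy and the 30-day halfway point described above.
function Get-PasswordAgeStatus {
    param(
        [Parameter(Mandatory)][datetime]$PasswordLastSet,
        [datetime]$Now = (Get-Date),
        [int]$MaxAgeDays = 60
    )
    $ageDays = [math]::Floor(($Now - $PasswordLastSet).TotalDays)
    if ($ageDays -ge $MaxAgeDays) {
        $status = 'PastPolicyLimit'       # name resource is at risk of failing
    } elseif ($ageDays -ge ($MaxAgeDays / 2)) {
        $status = 'PastHalfwayPoint'      # the Cluster Service should have reset it
    } else {
        $status = 'Current'
    }
    [pscustomobject]@{ AgeDays = $ageDays; Status = $status }
}

# Sample dates for illustration
$example = Get-PasswordAgeStatus -PasswordLastSet ([datetime]'2024-01-01') -Now ([datetime]'2024-02-15')
$example

# With the ActiveDirectory module available (Cluster1 is an assumed CNO name):
# $cno = Get-ADComputer -Identity 'Cluster1' -Properties PasswordLastSet
# Get-PasswordAgeStatus -PasswordLastSet $cno.PasswordLastSet
```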

To reset the password, you need to do a repair from within Failover Cluster Manager. As Figure 6 shows, you right-click the failed name resource, select More Actions, and choose Repair.

Figure 6: Resetting the CNO Password Manually in Failover Cluster Manager

When issuing a repair, Failover Cluster Manager uses the user account with which you logged on to contact AD to reset the password. Therefore, you must have the Change Password right on the CNO; otherwise, the repair will fail. You also need to make sure that the Reset Password right is enabled on the CNO and VCOs so that WSFC can reset the password when it needs to.

Common Problem 5

For a node to know which other nodes are actively participating in the cluster (i.e., to know the current membership), a series of heartbeats passes between the nodes over the network. These heartbeat packets are UDP datagrams that travel over port 3343.

Each packet includes a sequence number to track whether the packet is received. Here's how it works: If Node1 sends the sequence number 1111, it expects the return packet to include 1111. This continues between all nodes every second. If Node1 doesn't get the return packet, it will send the next sequence number (1112), and so on.

By default, if a node misses five heartbeats in a row (i.e., five seconds pass without a reply), WSFC determines that the node is down. A participating node still in the cluster will send a packet to the node determined to be down, telling it to terminate its Cluster Service, and will log event ID 1135 in the System event log, as Figure 7 shows.
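The five-second window is simply the heartbeat interval multiplied by the number of missed heartbeats that are tolerated. Here's a small sketch of that arithmetic. Get-HeartbeatTolerance is a hypothetical helper, and its defaults reuse the one-second interval and five-heartbeat limit described above. As far as I know, the corresponding cluster common properties are SameSubnetDelay and SameSubnetThreshold (with CrossSubnet equivalents for nodes in different subnets), but treat those names as an assumption and confirm them on your own cluster.

```powershell
# Hypothetical helper: compute how long a node can go without a heartbeat
# reply before it's considered down.
function Get-HeartbeatTolerance {
    param(
        [int]$DelayMs   = 1000,  # interval between heartbeats, in milliseconds
        [int]$Threshold = 5      # consecutive missed heartbeats tolerated
    )
    [pscustomobject]@{
        DelayMs          = $DelayMs
        Threshold        = $Threshold
        ToleranceSeconds = ($DelayMs * $Threshold) / 1000
    }
}

# Default behavior described in the article: 1 second x 5 heartbeats
$default = Get-HeartbeatTolerance
$default

# On a cluster node, the live values could be inspected with something like:
# Get-Cluster | Format-List *SubnetDelay, *SubnetThreshold
```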

Figure 7: Receiving Event ID 1135 in the System Event Log

There are multiple reasons why this occurs, many of which involve communication over port 3343 being blocked:

Network hardware failures
Out-of-date network card drivers or firmware
Network latency
IPv6 enabled on the servers but the following two rules disabled for inbound and outbound traffic in the Windows Firewall:
Core Networking - Neighbor Discovery Advertisement
Core Networking - Neighbor Discovery Solicitation
Switches, firewalls, or routers not properly configured to allow UDP datagram traffic
Performance problems (e.g., hangs, delays)
Improperly configured receive buffer settings on the network card driver

One of the first things that I always check is the Packets Received Discarded counter that's part of the Network Interface performance object in Performance Monitor. The Packets Received Discarded counter tracks the number of inbound packets that were chosen to be discarded, even though no errors had been detected to prevent their delivery to a higher layer protocol. The buffer is only so big. If it's not big enough, once the buffer fills up, it must discard packets to make room.

To add the Packets Received Discarded counter, open Performance Monitor, right-click its display, and select Add Counters to bring up the Add Counters dialog box. After specifying the appropriate computer, scroll to Network Interface and select the Packets Received Discarded counter. In the Instances of selected object drop-down list, choose the appropriate network card and click Add, as Figure 8 shows.

Figure 8: Adding the Packets Received Discarded Counter in Performance Monitor

After you add the counter, look at its Average, Minimum, and Maximum values. If any of the values are higher than zero, the receive buffer needs to be adjusted for the network adapter. Check with the vendor of the network card to see what it recommends as a setting. A reboot might be necessary.
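The same counter can be read from PowerShell if you'd rather not click through Performance Monitor on every node. The sketch below is illustrative: Select-DiscardingAdapter is a hypothetical helper that keeps only the adapters whose counter value is above zero, and the sample records stand in for real counter samples. On a Windows node, the samples could come from Get-Counter, as the closing comment suggests.

```powershell
# Hypothetical helper: keep only the adapters that have discarded inbound packets.
function Select-DiscardingAdapter {
    param([Parameter(Mandatory)][object[]]$Sample)  # records with InstanceName and CookedValue
    $Sample |
        Where-Object { $_.CookedValue -gt 0 } |
        Select-Object InstanceName, CookedValue
}

# Sample records standing in for counter samples (made-up adapter names and values)
$sample = @(
    [pscustomobject]@{ InstanceName = 'adapter a'; CookedValue = 0 }
    [pscustomobject]@{ InstanceName = 'adapter b'; CookedValue = 37 }
)
$flagged = Select-DiscardingAdapter -Sample $sample
$flagged

# On a Windows node, real samples could be gathered like this:
# $live = (Get-Counter '\Network Interface(*)\Packets Received Discarded').CounterSamples
# Select-DiscardingAdapter -Sample $live
```

Any adapter that shows up in the result is a candidate for a larger receive buffer, subject to the vendor's guidance.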

In a Windows Server 2012 R2 failover cluster, you can also use the Validate a Configuration Wizard to run the Network/Validate Network Communication test. This test checks to see whether it can communicate between the nodes over port 3343. If it can't, it will post an error and a possible cause.

Common Problem 6

Sometimes Failover Cluster Manager fails to open, giving you an error message like that shown in Figure 9. When Failover Cluster Manager opens, it opens a Windows Management Instrumentation (WMI) connection to each node in the cluster. In Figure 9, the error message is saying that one of the nodes has an invalid namespace, which means that the Cluster WMI instance (Cluswmi.mof) has been removed from a node. The trick is finding out which node had it removed, because the error message doesn't tell you that information.

Figure 9: Receiving an Error Message Noting That a Namespace Was Invalid

Listing 1 shows a Windows PowerShell script that you can run to identify the node that's missing the Cluster WMI instance.

# Query the cluster WMI namespace on every node to find the node that's missing it
$Nodes = Get-ClusterNode
ForEach ($Node in $Nodes)
{
  Write-Host -NoNewline "Testing $($Node.Name) "
  Try
  {
    # A successful query means root\MSCluster is present on this node
    $null = Get-WmiObject -Class "MSCluster_CLUSTER" `
      -Namespace "root\MSCluster" `
      -Authentication PacketPrivacy `
      -ComputerName $Node.Name -ErrorAction Stop
    Write-Host " : Successfully queried cluster node "
  }
  Catch
  {
    # A failure here (e.g., an invalid namespace) identifies the affected node
    Write-Host -NoNewline " : Failed to query cluster node "
    Write-Host -ForegroundColor Red -BackgroundColor Black `
      $_.Exception.Message
  }
}

After you've identified the node, you can run the following commands on that node to recompile the Cluster WMI instance:

Set-Location C:\Windows\System32\Wbem
Mofcomp.exe Cluswmi.mof

The most common reason for a node missing Cluswmi.mof actually stems from the old way of fixing WMI. To clear up problems with WMI, administrators would run the command Mofcomp.exe *.mof, which will compile all the Managed Object Format (MOF) files into the WMI repository. The problem is that there are quite a few uninstall files for the various roles and features in Windows, including Cluster WMI. So when the command is run, it installs Cluswmi.mof, then later uninstalls it. The proper way to rebuild the WMI repository is with the Winmgmt.exe command.

An Ounce of Prevention

As the adage goes, an ounce of prevention is worth a pound of cure. So, I'll conclude by mentioning something you probably already know: You need to keep your machines up-to-date as far as security patches and fixes are concerned. The Microsoft Failover Clustering Team has published articles listing the hotfixes that it would like to see on all clusters. Each Windows version has its own article.

These articles are updated as needed, so they're always pretty current. Note that they don't list every fix. Instead, they list the fixes most needed for stability reasons and the most widely requested fixes based on calls coming into Microsoft.
