
Troubleshooting Windows Server 2008 R2 Failover Clusters

Top things to look for

I want to discuss some of the troubleshooting techniques that we use with Windows Server 2008 R2 failover clusters. There are many ways to troubleshoot clusters, and some engineers might do things that others might not. So I wanted to pass along some of the most common things to look for and where to look for them. With that in mind, let’s first talk about the files that you’ll generally be looking at and their descriptions.

Related: 4 Failover Clustering Hassles and How to Avoid Them

One of the first things you’ll be working with is Failover Cluster Manager, the new interface for managing a cluster. With this tool, you’ll be managing groups and resources as well as performing some troubleshooting, which I’ll explain as I go along. Failover Cluster Manager can be accessed from the Start menu under Administrative Tools.
 

Event Channels

You’re probably familiar with the System event log. It’s where we log critical, error, and warning events. However, it’s not the only event log location that we write to. Starting in Server 2008, there are additional event channels. Figure 1 shows where to find the channels relevant to failover clustering. Here is where we’ll log all the informational-type events and debug/diagnostic events. You’ll find the following list of logs and their channels:

Figure 1: Channels relevant to failover clustering

  • FailoverClustering
    • Diagnostic (if Show Analytic and Debug Logs is selected)
    • Operational
    • Performance-CSV (if Show Analytic and Debug Logs is selected)
  • FailoverClustering-Client
    • Diagnostic (if Show Analytic and Debug Logs is selected)
  • FailoverClustering-Manager
    • Admin
    • Diagnostic (if Show Analytic and Debug Logs is selected)
  • FailoverClustering-WMIProvider
    • Admin
    • Diagnostic (if Show Analytic and Debug Logs is selected)

If you’re starting/stopping the cluster service, or you’re moving groups, or groups are coming online and offline, and so on, those events will be logged in the FailoverClustering\Operational log. For example:
 

Event ID: 1061
Description: The Cluster Service successfully formed the failover Cluster “JohnsCluster”
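If you prefer the command line, you can also pull these events with PowerShell’s Get-WinEvent cmdlet. The following is a minimal sketch; the channel name shown is the one these events are typically written under, so verify it against what Event Viewer shows on your own nodes:

# List the 20 most recent events from the cluster Operational channel
Get-WinEvent -LogName "Microsoft-Windows-FailoverClustering/Operational" -MaxEvents 20 |
  Format-Table TimeCreated, Id, Message -AutoSize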

Any failures connecting to other nodes when you open Failover Cluster Manager are logged in FailoverClustering-Manager\Admin. For example:
 

Event ID: 4684
Description: Failover Cluster Manager could not contact the DNS Servers to resolve name “W2K8-R2-NODE2.contoso.com”. For more information see the Failover Cluster Manager Diagnostics channel.

If you look at the FailoverClustering-Manager\Diagnostic log, you’ll see something like this:
 

Event ID: 4609
Description: An error was encountered while attempting to ping “W2K8-R2-NODE2.contoso.com”.  System.ApplicationException: Could not contact one or more DNS Servers. Please verify that DNS configuration is correct and the machine is fully connected to the network.


Event ID: 4612
Description: Server W2K8-R2-NODE2.contoso.com ping failed.

Just from these events, you can see that the node is having a problem reaching the DNS server, and you can start troubleshooting that specific problem. Without looking at these logs, all you might see is W2K8-R2-NODE2 showing as down in Failover Cluster Manager. (One of the other logs mentioned above is the FailoverClustering\Diagnostic log. I’ll discuss this log a bit later.)
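To confirm the DNS symptom that these events point to, you can test name resolution directly from the affected node. Here’s a quick sketch (the node name is just the example from the events above); it calls the .NET DNS class, which is available from PowerShell on Server 2008 R2:

# Try to resolve the remote node's name the same way the manager would
try {
    [System.Net.Dns]::GetHostEntry("W2K8-R2-NODE2.contoso.com")
}
catch {
    Write-Host "DNS resolution failed: $($_.Exception.Message)"
}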

Related: New Features of Windows Server 2012 Failover Clustering
 

Failover Cluster Manager

To make things a bit easier, you can also view system event errors and warnings from within Failover Cluster Manager. On the main page in the middle pane, there is a Recent Cluster Events link that you can select, as Figure 2 shows. This link provides a handy way to display all warnings and errors that have occurred with Failover Cluster as the source in the past 24 hours. It pulls these events from all nodes and gives you everything in one spot. So there’s no need to go to multiple machines and have multiple event logs open that you must switch between.

Figure 2: Recent Cluster Events

You can use the Query option to look for specific events. On the main page in the left pane, you’ll see Cluster Events. You can right-click Cluster Events and choose Query, or you can select Query from the Actions pane on the right. Figure 3 shows the Cluster Events Filter.

Figure 3: The Cluster Events Filter

This is also a good way to display everything in the same location. For example, suppose you’re experiencing the failure of a disk resource. You can bring up Failover Cluster Manager and have it query all nodes, filtering on the System event log, the error level, and the specific date. On the main page, you can see when the disk failed, on which node(s) it failed, and any other pertinent data (such as disk events where a path failed). You also have the ability to save these queries for later use.

You have two more options for looking up events. You can look up all resource-failure events for anything in a group, or you can be resource-specific. In the Actions menu, which Figure 4 shows, you can select Show the critical events for this application (any resource in the group) or Show the critical events for this resource (only the specific resource). Doing so will bring up the query for any of those events in the current event logs on all nodes. This option can also be beneficial for determining history and whether the event can be narrowed down to a specific time period or node.

Figure 4: The Failover Cluster Manager’s Actions menu
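If you’d rather gather the same kind of information from a script, a rough equivalent is to loop through the cluster nodes and filter each System log remotely. This is only a sketch, assuming the FailoverClusters PowerShell module is available and remote event log access is allowed between the nodes:

Import-Module FailoverClusters

# Collect FailoverClustering errors and warnings from the past 24 hours on every node
$since = (Get-Date).AddHours(-24)
Get-ClusterNode | ForEach-Object {
    Get-WinEvent -ComputerName $_.Name -FilterHashtable @{
        LogName      = 'System'
        ProviderName = 'Microsoft-Windows-FailoverClustering'
        Level        = 2,3        # 2 = Error, 3 = Warning
        StartTime    = $since
    } -ErrorAction SilentlyContinue
} | Sort-Object TimeCreated |
  Format-Table MachineName, TimeCreated, Id, LevelDisplayName -AutoSize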

The Cluster Debug Log

For those who remember the Windows Server 2003 server cluster days, the FailoverClustering\Diagnostic log is the Cluster.Log equivalent. Starting in Server 2008 failover clustering, the functionality is more in line with the Event Tracing for Windows (ETW) process. Instead of writing to a Cluster.Log text file, the cluster service writes to a diagnostic log located in the C:\Windows\System32\winevt\logs folder. There are three diagnostic logs that we write to (clusterlog.etl.001, clusterlog.etl.002, and clusterlog.etl.003), but only one of them is written to at a time on any given boot. For more information about these log files and how they’re used, check out the Understanding the Cluster Debug Log in 2008 blog post.

This log is enabled and always writing. To view the events it has written, right-click FailoverClustering\Diagnostic and select Disable Log; the events become visible only while the log is disabled. While it’s disabled, the system no longer writes to it, so new information won’t be saved. If you do this, it’s best to save the events out as an event log or text file and then enable the log again (see the command-line sketch after the list below). There are essentially three main events you’ll see:

  • Event 2049 is an informational event.
  • Event 2050 is a warning.
  • Event 2051 is an error.
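As mentioned above, when you disable the Diagnostic channel to view it, it’s best to save the events out before re-enabling it. One way to do that from the command line is the built-in wevtutil tool. Here’s a minimal sketch, assuming the channel name shown (verify the exact name in Event Viewer) and an existing C:\Temp folder:

# Stop writing to the Diagnostic channel so its events can be viewed and exported
wevtutil sl Microsoft-Windows-FailoverClustering/Diagnostic /e:false

# Save the events to an .evtx file, then turn the channel back on
wevtutil epl Microsoft-Windows-FailoverClustering/Diagnostic C:\Temp\ClusterDiag.evtx
wevtutil sl Microsoft-Windows-FailoverClustering/Diagnostic /e:true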

These events will only be from the current diagnostic .etl file being written to. You’ll see the event information just as you would in the System or Application event log. However, each event shows only one line of the log at a time, so going event by event through this diagnostic event log can be pretty tedious. You can instead create a Cluster.Log text file that combines all three of these logs into one, which makes the review much easier.

The PowerShell Get-ClusterLog cmdlet goes out to all nodes, generates a Cluster.Log on each node, and places it in the C:\Windows\Cluster\Reports folder. This is the Cluster.Log you might be more familiar with from Windows 2003. There are Get-ClusterLog switches you might want to consider, depending on the circumstances. For example, say you can reproduce a failure at will and need to find the reason for the failure. Simply reproduce the problem and use the command

Get-ClusterLog -TimeSpan 5

to get data from the past 5 minutes. Because you need only the log from the one node you reproduced the problem on, you could add the -Node NodeName switch to create the Cluster.Log on that single node. If you have a number of nodes and need to send these logs, it might take some time to connect to each node to get the file. In these circumstances, you could use the -Destination switch. This switch creates a Cluster.Log for each node, copies it to a folder you specify, and tags the name of the machine as part of the file name (e.g., W2K8-R2-Node1_Cluster.Log).
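Putting those switches together, here’s a short sketch of the variations just described (the node and folder names are placeholders to adjust for your environment):

Import-Module FailoverClusters

# Cluster.Log covering the past 5 minutes, generated only on the node where the failure was reproduced
Get-ClusterLog -TimeSpan 5 -Node W2K8-R2-NODE1

# Cluster.Log for every node, copied to one folder and tagged with each machine's name
Get-ClusterLog -Destination C:\ClusterLogs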

Remember that the Cluster.Log you’re creating is a snapshot in time. It captures what’s there right now and won’t update with anything after it’s generated. If there’s already a Cluster.Log in the Reports folder when a new one is generated, the old file gets deleted to make room for the new one.
 

Resource Hosting Subsystem

The next thing I want to discuss is the Resource Hosting Subsystem (RHS). One of its responsibilities is to monitor the health of all resources in the cluster. It does this through a series of health checks (basic and thorough). If a resource doesn’t respond to these checks, RHS will issue the following system event:
 

Event ID: 1230
Description: Cluster resource 'Cluster Disk 1' (resource type '', DLL 'clusres.dll') either crashed or deadlocked.

In this instance, the disk didn’t respond to the health check that was made. The cluster will fail the resource and restart it to get you back into production. If these checks weren’t in place, the problem could lead to a hung machine or no connectivity from a client application.

When troubleshooting an RHS event, you must consider the resource involved. If a disk deadlocks, you would need to consider everything in the disk stack. Was there slow disk I/O? Did you lose a path to the drive? That would be the focus of your troubleshooting. So, next up is reviewing the System event log for disk-related events, looking at Performance Monitor, updating drivers, and so on. If the resource was an IP address or a network name, your focus would be the network stack and everything there.
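As a starting point for that disk-stack review, you can filter the System log for storage-related sources. This is only a rough sketch; the provider names worth matching on will vary with your storage drivers (disk, MPIO, HBA vendor, and so on):

# Pull the past day's disk and storage-path events from the System log
Get-WinEvent -FilterHashtable @{
    LogName   = 'System'
    StartTime = (Get-Date).AddDays(-1)
} | Where-Object { $_.ProviderName -match 'disk|mpio|storport' } |
  Format-Table TimeCreated, ProviderName, Id, Message -AutoSize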

Cluster Validate

The last thing I want to mention is the Cluster Validate report. For a cluster to be "certified," all components must be listed on the Windows Server Catalog, and the cluster must pass a full Cluster Validate. Many people run Cluster Validate before the cluster is created or just afterward; however, if a problem arises later on, few people remember to run it again. You can use it as a troubleshooting tool! If you’re having disk problems, run the Storage Tests. If you’re having network-communication problems, run the Network Tests. You can also use Cluster Validate to get information about groups, resources, and settings for your currently running failover cluster, which you can reference at a later time.
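The validation wizard also has a PowerShell counterpart, Test-Cluster, which can run just a subset of the tests. Here’s a minimal sketch, assuming the FailoverClusters module is available (run Test-Cluster -List on your own system to see the exact test and category names):

Import-Module FailoverClusters

# See which validation tests are available
Test-Cluster -List

# Run only the storage tests (use "Network" instead for network-communication problems)
Test-Cluster -Include "Storage"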

The nice thing about Cluster Validate is that you can run it even while in production. When you run it and select the Storage Test, it will ask if you want to take the running groups offline, as you see in Figure 5. The default setting is to leave the online groups alone, so production won’t be affected. For the Storage Tests, it will test disks that are:

  • In groups that are offline
  • In the available storage group
  • Not a part of the cluster

 

Figure 5: Running Cluster Validate

Each time you run Cluster Validate, it creates a report file in the C:\Windows\Cluster\Reports directory on every node that the validation was run against, with the date and time tagged as part of the file name. So each run creates a new report rather than overwriting the previous one.

There are other ways to troubleshoot failover clusters—I just don’t have enough space to cover them all. However, this column should get you started for most of the problems you may face. For more information, check out the Ask the Core Team blog and the Clustering and High Availability blog. Happy clustering!
