Windows Server 2012 R2 Failover Clustering

With the release of Windows Server 2012 R2, it's time to discuss some of the new features and enhancements for failover clustering. The new features and enhancements were made with easier management, increased scalability, and more flexibility in mind. Here are the most noteworthy changes.

New Shared .vhdx Files

One new feature that seems to be getting the biggest raves is the ability to have shared Virtual Hard Disk (VHD) files (in the .vhdx file format) for guest clusters running in Hyper-V host clusters. What this means is that you can now use VHDs for your guest clusters without having to also attach your actual storage to those virtual machines (VMs). The shared .vhdx files must reside on the local Cluster Shared Volumes (CSVs) of the host cluster or on a remote Scale-Out File Server.

When you create a .vhdx file for a VM, there's a new option to mark it as shared. If you're using Hyper-V Manager, you select the Enable virtual hard disk sharing option in Advanced Features in the VM's settings, as shown in Figure 1.

Figure 1: Enabling .vhdx File Sharing in Hyper-V Manager

If you're using Microsoft System Center Virtual Machine Manager (VMM), you select the Share the disk across the service tier option on the Hardware Configuration page, as shown in Figure 2. You then add the same .vhdx file to each of the other VMs and select the same setting. When you attach these shared VHDs to guest VMs, they'll appear as Serial Attached SCSI (SAS) drives to the guest VMs.

Figure 2: Enabling .vhdx File Sharing in VMM

You can use Windows PowerShell to set up the .vhdx files if desired. For example, suppose that you want to create a 30GB .vhdx file and assign it to two VMs as a shared VHD. First, you need to create the .vhdx file by running a command like this:

New-VHD -Path C:\ClusterStorage\Volume1\Shared.VHDX `
  -Fixed -SizeBytes 30GB

Then, to assign it to each VM as a shared .vhdx file, you use commands like these:

Add-VMHardDiskDrive -VMName Node1 `
  -Path C:\ClusterStorage\Volume1\Shared.VHDX `
  -ShareVirtualDisk
Add-VMHardDiskDrive -VMName Node2 `
  -Path C:\ClusterStorage\Volume1\Shared.VHDX `
  -ShareVirtualDisk
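If you want to confirm the result, a minimal check along these lines should work. It reuses the VM names and path from the commands above and assumes that you run it on the Hyper-V host that currently owns each VM. The SupportPersistentReservations property reports whether the disk is attached as shared.

```powershell
# Path and VM names are taken from the example above.
$sharedPath = 'C:\ClusterStorage\Volume1\Shared.VHDX'

foreach ($vm in 'Node1', 'Node2') {
    # List the VM's disks and keep only the shared .vhdx file.
    Get-VMHardDiskDrive -VMName $vm |
        Where-Object { $_.Path -eq $sharedPath } |
        Select-Object VMName, ControllerType, Path, SupportPersistentReservations
}
```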

Using shared .vhdx files is ideal for:

File services running within a VM
SQL Server databases
Other database files that reside on guest clusters

You can find more information about shared .vhdx file requirements and configurations in the Virtual Hard Disk Sharing Overview web page.

New Node Shutdown Process

When you use failover clustering in Windows Server 2012 and earlier, Microsoft recommends that you first move all the VMs off a node before you shut it down (or reboot it). Here's why: when you shut down a node, it induces the cluster-controlled action of Quick Migration on each VM. A quick migration will put that VM in a saved state, move it to another node, then make it come out of the saved state.

When a VM is in a saved state, it's actually down, which means your productivity is down until the VM comes back online. The reasoning behind the recommendation of moving all VMs off the node before shutting it down is that you can use live migration to move those VMs so that there's no loss in productivity. However, if you follow this recommendation, shutting down a node can be a lengthy manual process.

In Server 2012 R2 failover clustering, Microsoft has changed what happens when a node is shut down. The new process has two main components: drain on shutdown, and "best available node" placement.

If you shut down a node without first putting it into maintenance mode, the cluster automatically issues a drain. During the drain, the cluster uses live migration to move the VMs off the node in order of VM priority (high, then medium, then low). All the VMs are moved, including the low-priority VMs.
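If you'd rather drain a node yourself before planned maintenance, the existing pause-and-drain cmdlets still apply. This is only a sketch, and Node1 is a placeholder node name.

```powershell
# Pause the node and move its roles to other nodes.
Suspend-ClusterNode -Name "Node1" -Drain

# After maintenance, bring the node back into service
# and let its roles fail back right away.
Resume-ClusterNode -Name "Node1" -Failback Immediate
```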

When migrating the VMs, the cluster uses "best available node" placement. Here's how it works: before the cluster starts migrating the VMs, it first checks the available memory of the remaining nodes. Using this information, it strategically places the VMs on the best available node, as Figure 3 shows. This ensures a smoother transition because it prevents moving high-priority VMs to a node that doesn't have enough memory.

Figure 3: Migrating the VMs to the Best Available Node

This new process is enabled by default. If you need to manually enable or disable it, you can configure the DrainOnShutdown cluster common property. To enable it, use the PowerShell command:

(Get-Cluster).DrainOnShutdown = 1

To disable it, run the command:

(Get-Cluster).DrainOnShutdown = 0
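Before changing the setting, you can read the same property to see how the cluster is currently configured:

```powershell
# 1 means drain on shutdown is enabled; 0 means it's disabled.
(Get-Cluster).DrainOnShutdown
```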

Additional Health Detection Feature for VM Networks

With Server 2012 R2 failover clustering, you have an additional health detection feature for the networks that the VMs utilize. If a node's network goes down, the cluster will first check to see if the network is down across all the nodes. If it is, the node's VMs will remain where they are. If this is the only node with the problem, the cluster will use live migration to move its VMs to a node in which the network is available.

This feature is enabled by default on all networks configured for VMs. If there are any networks that you don't want to protect with this feature, you can disable it using Hyper-V Manager. You simply need to clear the Protected network check box in Advanced Features in the VM's settings, as shown in Figure 4.
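The setting can also be changed from PowerShell. The following sketch assumes a VM named VM1 (a placeholder) and uses the NotMonitoredInCluster parameter of Set-VMNetworkAdapter. Treat the mapping between this parameter and the Protected network check box as an assumption to verify in your environment.

```powershell
# Stop the cluster from monitoring this VM's network adapters.
# Without -Name, the cmdlet targets every adapter on the VM.
Set-VMNetworkAdapter -VMName "VM1" -NotMonitoredInCluster $true
```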

Figure 4: Disabling the Protected Network Feature

New Clusters Dashboard

When managing multiple clusters in Server 2012 and earlier, you must switch from one cluster to the next to see whether any of them has errors or other concerns. That's no longer the case in Server 2012 R2 because Failover Cluster Manager has the new Clusters dashboard, which Figure 5 shows.

The new Clusters dashboard makes it easier to manage multi-cluster environments. You can quickly check the status of roles and nodes (e.g., up, down, failed) and see whether there are any recent events you need to review. Everything is hyperlinked, so simply clicking the link takes you to what you need to see. For example, clicking the Critical: 3, Error: 1, Warning: 2 link shown in Figure 5 brings up the list of these filtered events so that you can go through them.

CSV Enhancements

In Server 2012 R2 clustering, Microsoft has made several CSV enhancements. They include optimizing the CSV placement policy and adding a dependency check.

The CSV placement policy now spreads ownership of the CSV drives among the nodes to ensure they're evenly distributed. For example, suppose that you have three nodes with four CSV drives, each of which houses five VMs. When all the nodes are running, two of the nodes each own one CSV drive with five VMs. The other node owns two CSV drives, each with five VMs. You now need to add a fourth node to the cluster. As soon as you add the node, the cluster will automatically give this new node ownership of one of the CSV drives. All the VMs running on that CSV drive will then be moved to this new node using live migration. By doing this, the cluster has more evenly spread the load among all the nodes.

Another enhancement that Microsoft made to CSVs is adding a dependency check. When a node isn't the owner (or coordinator) of a CSV drive, it must go through the network with a Server Message Block (SMB) connection to the coordinator for any metadata updates needed for the drive. The coordinator node has an internal share to which all the other nodes connect for this purpose. This requires the Server Service to be running. If the Server Service were to go down for some reason, the non-coordinator nodes wouldn't have the SMB connection, which would cause errors. More important, any metadata updates would simply be cached rather than sent because there's no way to send them. To get out of this situation, you need to manually move ownership of the CSV drive to another node.

To help avoid this situation, Microsoft added a dependency check that monitors the health of both the internal share and the Server Service. If a dependency check reveals that the Server Service is down on a node, the cluster will move ownership of any CSV drives that the node owns to other nodes. The cluster will also follow the optimized CSV placement policy to evenly distribute the CSV drives. For example, suppose that you have a cluster with three nodes, each of which owns two CSV drives. If the Server Service on one of the nodes goes down, the cluster will move ownership of that node's two CSV drives to the remaining two nodes, one drive to each.
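To see how CSV ownership is currently distributed, or to move ownership by hand, commands like these can be used. The CSV name "Cluster Disk 1" and the node name Node2 are placeholders.

```powershell
# Show which node currently owns (coordinates) each CSV drive.
Get-ClusterSharedVolume | Format-Table Name, OwnerNode, State -AutoSize

# Manually move ownership of one CSV drive to another node.
Move-ClusterSharedVolume -Name "Cluster Disk 1" -Node "Node2"
```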

Improvement in Network Validation Tests

Failover clustering has always used port 3343 for all communications (e.g., health checks, status reporting) between the nodes. However, there has never been a check for this port. The network validation tests only checked basic network connectivity between the nodes. Because these tests never checked for connectivity over port 3343, you wouldn't know whether the Windows Firewall Port 3343 rule was disabled or whether a third-party firewall was blocking port 3343.

In Server 2012 R2, the new Validation Network Connectivity test checks for communication over port 3343. When troubleshooting communication problems in the past, you might not have always checked this port first. Now this test can be your first check. If the port is causing the problem, you'll save yourself quite a bit of troubleshooting time.
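You don't have to run the full validation wizard to get to the network tests. As a sketch, the following runs only the network category from PowerShell. Node1 and Node2 are placeholder node names.

```powershell
# Run only the network category of validation tests against two nodes.
Test-Cluster -Node "Node1", "Node2" -Include "Network"
```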

Dynamic Quorum Enhancements

In Server 2012 failover clustering, Microsoft introduced the concept of the dynamic quorum. When the dynamic quorum feature is enabled, the cluster automatically adjusts the number of votes required to keep a cluster running if nodes go down. In Server 2012 R2 failover clustering, Microsoft has gone a step further by introducing the dynamic witness feature and the LowerQuorumPriorityNodeID property.

When the dynamic witness feature is enabled, the cluster dynamically adjusts the vote of the witness resource (a disk or file share). If the cluster has an odd number of voting nodes, the witness resource will have its vote removed. If the cluster has an even number of voting nodes, the witness resource is dynamically given back its vote so that the total number of votes stays odd.

Because of the new dynamic witness feature, Microsoft has changed its witness recommendations. Previously, the recommendation was based on the number of nodes. If you had an even number of nodes, Microsoft recommended adding a witness resource to get to an odd number. If you had an odd number of nodes, it recommended not adding a witness resource.

With Server 2012 R2, the recommendation is to always add a witness resource. Because of the dynamic witness feature, the cluster will give the witness resource a vote if the cluster needs it or remove the vote if the cluster doesn't need it.
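To see what the cluster has decided at any given moment, you can read the relevant cluster common properties. This is a minimal sketch, and the values change as nodes join or leave.

```powershell
$cluster = Get-Cluster

# 1 means dynamic quorum is enabled.
$cluster.DynamicQuorum

# 1 means the witness currently has a vote; 0 means it doesn't.
$cluster.WitnessDynamicWeight
```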

The cluster also adjusts the node weights as needed when nodes go down or join the cluster. Because node weights now change dynamically, they're displayed in Failover Cluster Manager so that you can quickly see them without having to run any commands against the nodes. You can see these values by selecting Nodes in Failover Cluster Manager, as Figure 6 shows. Note that you still have the option within the quorum configuration to remove a node's vote if desired.
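If you prefer the command line, the same weights are exposed as node properties. In this sketch, NodeWeight is the configured vote and DynamicWeight is the vote the cluster is currently honoring. NodeName is a placeholder.

```powershell
# List each node's configured and current (dynamic) vote.
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight -AutoSize

# Remove a node's configured vote (set it back to 1 to restore the vote).
(Get-ClusterNode -Name "NodeName").NodeWeight = 0
```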

Another dynamic quorum enhancement has been made in the area of multi-site clusters. When you have nodes in two different sites and there's a network break between the two sites, only one site is going to remain running. In Server 2012 (and earlier) failover clustering, the site containing the node that gets the witness resource first is the site that remains running. However, this site might not be the one that you want to remain running. In other words, when you have a 50-50 split where neither site has quorum, you have no way of preselecting which site should remain running.

In Server 2012 R2 failover clustering, there's a new cluster common property that you can use to determine which site survives. You can set the LowerQuorumPriorityNodeID property to specify which node will have its vote removed in case of a 50-50 split.

For example, suppose you have three nodes in your primary site and another three nodes in an offsite location. You can set the LowerQuorumPriorityNodeID property to the ID of one of the offsite nodes so that if you have a 50-50 split, that node loses its vote, the primary site keeps the majority, and the offsite nodes stop their Cluster Service until network connectivity is restored. The property holds a single Node ID, so you need to pick only one offsite node. To set this up, you first need to know the Node IDs of the offsite nodes. You can find out this information by running the following PowerShell command for an offsite node (where NodeName is the name of that node):

(Get-ClusterNode -Name "NodeName").Id

Let's say that you find out the Node IDs of the offsite nodes are 4, 5, and 6. To ensure that the offsite nodes go down if you have a 50-50 split, you run this command with any one of those IDs (assigning the property again would simply overwrite the previous value):

(Get-Cluster).LowerQuorumPriorityNodeID = 4

Now if you have a break in communication, the offsite nodes will stop their Cluster Service and all roles in the cluster will stay in the primary site nodes, which will remain running.
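To confirm which node is currently designated, you can read the property back. This sketch assumes that a value of 0 means no node has been designated.

```powershell
# Show the Node ID that loses its vote in a 50-50 split.
(Get-Cluster).LowerQuorumPriorityNodeID

# Map Node IDs back to node names.
Get-ClusterNode | Format-Table Name, Id, State -AutoSize
```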

Even More Changes

Many new features and enhancements have been added to failover clustering in Server 2012 R2, and they're all for the better. I've introduced you to only some of them. If you want to learn about the changes I didn't discuss or want more information on those I've covered, see the What's New in Failover Clustering in Windows Server 2012 R2 web page.
