Q. What is compute and storage resiliency in Windows Server 2016 clustering?

A. Windows Server 2016 adds resiliency for transient communication interruptions between the nodes in a cluster. In Windows Server 2012, a transient network break partitioned the cluster: the nodes that could not make quorum lost the ability to host resources, and those resources were taken over by other nodes. Moving resources this way is costly, especially for VMs. A VM that could not be Live Migrated restarts on another node in a crash-consistent state, which may take many minutes. Meanwhile, the original node may have been disconnected for only seconds, so the service is down considerably longer than if the VM had stayed where it was. The problem is amplified for hosts that experience transient problems frequently.

Windows Server 2016 solves this problem with compute and storage resiliency. These features avoid outages from transient failures by introducing a short period during which a host can be out of the cluster and keep its resources running before they are failed over. By default this timeout is 4 minutes, which should be more than enough to handle most transient problems.

When a node becomes disconnected, it switches to an Isolated mode, and any VMs on the host go into an Unmonitored state. If a VM's storage is hosted on a Scale-out File Server (SoFS), the storage is still accessible and the VM continues to run. For other types of storage, such as Cluster Shared Volumes, the VM moves to a Paused Critical state so that it does not crash without its storage; the VM is unavailable during this time. Once the disconnected node rejoins the cluster, any Paused Critical VMs resume and continue to run.
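
The behavior described above can be summarized as a small decision model. This is only an illustrative Python sketch, not cluster code: the names `Storage`, `vm_outcome`, and `RESILIENCY_PERIOD_S` are hypothetical, and the state labels are simplified from the description above.

```python
from enum import Enum


class Storage(Enum):
    """Where the VM's files live (hypothetical labels for this sketch)."""
    SOFS = "sofs"  # share on a Scale-out File Server
    CSV = "csv"    # Cluster Shared Volume


# Default timeout of 4 minutes, as stated in the text.
RESILIENCY_PERIOD_S = 240


def vm_outcome(storage, outage_s, resiliency_period_s=RESILIENCY_PERIOD_S):
    """Return (state during the outage, result) for a VM on an isolated node."""
    if storage is Storage.SOFS:
        # Storage is still reachable, so the VM keeps running unmonitored.
        during = "Unmonitored, running"
    else:
        # Storage is not reachable, so the VM is paused instead of crashing.
        during = "Unmonitored, Paused Critical"

    if outage_s < resiliency_period_s:
        # The node rejoined in time: the VM stays where it was.
        result = "stays on original node"
    else:
        # The timeout expired: the resource is failed over.
        result = "failed over to another node"
    return during, result


if __name__ == "__main__":
    print(vm_outcome(Storage.SOFS, 20))
    print(vm_outcome(Storage.CSV, 20))
    print(vm_outcome(Storage.CSV, 600))
```

In this model a 20-second break never moves the VM, whichever storage type is used; only an outage that outlasts the timeout leads to a failover.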

Note that if a node goes into Isolated mode 3 times within an hour, it is placed in a two-hour quarantine and any resources are moved to other hosts. You can force a node out of quarantine using Failover Cluster Manager or PowerShell if required. Both the two-hour quarantine and the four-minute timeout can be changed; directions are in the links below.
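
To make the quarantine rule concrete, here is a minimal Python sketch of the counting logic. It is a model under stated assumptions, not the cluster service's implementation: the class `QuarantineTracker` and its methods are hypothetical, and treating "within an hour" as a sliding one-hour window is an assumption. The defaults (3 isolations, one hour, two hours) come from the text above.

```python
from collections import deque


class QuarantineTracker:
    """Hypothetical model of the 'isolated 3 times in an hour' rule."""

    def __init__(self, threshold=3, window_s=3600, duration_s=7200):
        self.threshold = threshold    # isolations that trigger quarantine
        self.window_s = window_s      # look-back window (assumed sliding)
        self.duration_s = duration_s  # quarantine length in seconds
        self._isolations = deque()    # timestamps of recent isolations
        self.quarantined_until = None

    def record_isolation(self, now_s):
        """Record an isolation at now_s; return True if quarantine starts."""
        # Forget isolations that fell out of the look-back window.
        while self._isolations and now_s - self._isolations[0] >= self.window_s:
            self._isolations.popleft()
        self._isolations.append(now_s)
        if len(self._isolations) >= self.threshold:
            self.quarantined_until = now_s + self.duration_s
            self._isolations.clear()
            return True
        return False

    def is_quarantined(self, now_s):
        """True while the quarantine period has not yet elapsed."""
        return self.quarantined_until is not None and now_s < self.quarantined_until

    def clear_quarantine(self):
        """Model an administrator forcing the node out of quarantine."""
        self.quarantined_until = None


if __name__ == "__main__":
    node = QuarantineTracker()
    for t in (0, 600, 1200):  # three isolations within 20 minutes
        started = node.record_isolation(t)
    print(started, node.is_quarantined(3000))
```

Changing `threshold` or `duration_s` in the sketch corresponds to adjusting the configurable values mentioned above; `clear_quarantine` stands in for the manual override.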

For more information on compute and storage resiliency see:
