Like any other Windows server, Storage Spaces Direct nodes require regular maintenance, such as installing the latest patches. However, that maintenance must be done in a way that keeps the cluster healthy and online. In this article, I will show you how to properly maintain your Storage Spaces Direct cluster nodes.
Before attempting any sort of maintenance, it is extremely important to make sure that the cluster and its storage are healthy. If you find that the cluster is in an unhealthy state, then those issues must be corrected before proceeding with your maintenance tasks.
The easiest way to do a health check is to use the Failover Cluster Manager. Simply navigate through the console tree to Storage | Disks and then make sure that the Status column indicates that your disks are online. You can see what this looks like in Figure 1.
Make sure that your storage volumes are healthy.
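If you would rather check from PowerShell, something along these lines should give you a similar picture. Treat it as a minimal sketch: it assumes an elevated session on one of the cluster nodes, and it only looks at the virtual disks, not at every health indicator that the cluster exposes.

```powershell
# Sketch: list any virtual disks that are not reporting Healthy / OK.
# Assumes an elevated PowerShell session on a cluster node.
$unhealthy = Get-VirtualDisk | Where-Object {
    $_.HealthStatus -ne 'Healthy' -or $_.OperationalStatus -ne 'OK'
}

if ($unhealthy) {
    # Show the problem disks so they can be investigated first.
    $unhealthy | Format-Table FriendlyName, HealthStatus, OperationalStatus
    Write-Warning 'Resolve these storage issues before starting maintenance.'
} else {
    Write-Output 'All virtual disks report Healthy / OK.'
}
```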
Assuming that your disks are healthy, it’s time to move forward with the node maintenance tasks. Since the cluster will need to continue running while you perform the maintenance work, you will have to service one node at a time.
The first step in preparing a node for maintenance is to drain any roles from the node. Select the Failover Cluster Manager’s Nodes tab to see a list of all of the cluster’s nodes. Before continuing, it’s important to make sure that the nodes are all online. (The node status should be Up.) Now, right-click on a node and choose the Pause | Drain Roles command from the shortcut menu. You can see what this looks like in Figure 2.
Make sure that the nodes are all showing a status of Up, and then drain a single node.
Depending on how heavily a node is being used, the process of draining the node can take a few minutes to complete. When the draining is complete, you should see the node’s status change to Paused, as shown in Figure 3.
The drained node shows a status of Paused.
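The same pause-and-drain step can also be done from PowerShell with the failover clustering cmdlets. The following is only a sketch: the node name S2D-Node1 is a made-up placeholder, and the commands are assumed to run on a cluster node (from another machine you would need to add the -Cluster parameter).

```powershell
# Hypothetical node name; replace it with the node you intend to service.
$nodeName = 'S2D-Node1'

# Confirm that every node is Up before taking one out of service.
$notUp = Get-ClusterNode | Where-Object { $_.State -ne 'Up' }

if ($notUp) {
    $notUp | Format-Table Name, State
    Write-Warning 'At least one node is not Up. Fix that before draining a node.'
} else {
    # Pause the node and drain its roles; -Wait blocks until the drain finishes.
    Suspend-ClusterNode -Name $nodeName -Drain -Wait

    # The node should now report a State of Paused.
    Get-ClusterNode -Name $nodeName | Format-Table Name, State
}
```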
At this point, it is safe to perform any required maintenance on the node. To avoid confusion, I recommend running the Failover Cluster Manager on a machine that is not a part of the S2D cluster. I once saw someone run the Failover Cluster Manager on one Storage Spaces Direct node, take a different node offline, and then accidentally patch and reboot the node on which the Failover Cluster Manager was running (which was still online). It’s an easy mistake to make, and managing the cluster from a machine that is not a cluster node makes it much less likely.
Once you are done with your maintenance, you will need to bring the node back online and fail back any roles that had previously been hosted on the node. To do so, open the Failover Cluster Manager, right-click on the node, and select the Resume | Fail Roles Back command from the shortcut menu. You can see what this looks like in Figure 4.
You will need to bring the node back online and fail back the services that had previously been running on the node.
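For reference, a PowerShell version of the resume step might look like the sketch below. As before, S2D-Node1 is a hypothetical node name, and the commands are assumed to run on a cluster node.

```powershell
# Hypothetical node name; use the node that you just finished servicing.
$nodeName = 'S2D-Node1'

# Resume the paused node and move its previous roles back right away.
Resume-ClusterNode -Name $nodeName -Failback Immediate

# The node should return to a State of Up.
Get-ClusterNode -Name $nodeName | Format-Table Name, State
```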
Once the node is back online, it’s time to take the next node offline for maintenance. Before you do, though, it is a good idea to make sure that the storage has resynchronized.
The amount of time that it takes the storage to resynchronize varies greatly depending on the data’s change rate and on how long the node was offline. To find out if the storage is still resynchronizing, open an elevated PowerShell session and enter the Get-StorageJob cmdlet. When you do, you will see an output similar to the one shown in Figure 5.
You can use the Get-StorageJob cmdlet to check the resynchronization status.
Notice in the figure that the storage jobs refer to a repair operation. In this case, the job state for both jobs is Completed, which means that it is safe to continue with the next node. If the job state were Running, then you would need to wait until the jobs completed before servicing the next node.
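If you don't want to rerun the cmdlet by hand, a small polling loop can do the waiting for you. This is a rough sketch rather than a definitive script: the 30-second interval is an arbitrary choice, and the loop simply treats any job whose state is not Completed as still pending, so a job that ends in a failed state would keep it waiting and needs to be investigated manually.

```powershell
# Sketch: wait until no storage jobs remain in a state other than Completed.
# Assumes an elevated PowerShell session on a cluster node.
while ($true) {
    $pending = @(Get-StorageJob | Where-Object { $_.JobState -ne 'Completed' })

    if ($pending.Count -eq 0) {
        break
    }

    # Show progress for the jobs that are still outstanding.
    $pending | Format-Table Name, JobState, PercentComplete
    Start-Sleep -Seconds 30   # arbitrary polling interval
}

Write-Output 'No outstanding storage jobs; safe to move on to the next node.'
```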