Executing a Zero-Downtime Storage Hardware Refresh

A storage hardware refresh that avoids downtime and data loss requires a thorough plan. Here’s a real-world example.

Brien Posey

February 22, 2024

In a recent article, I explained how I planned a storage refresh in my environment. I outlined five basic requirements that my refresh had to meet:

  1. Increase storage capacity to meet my needs for the next five years.

  2. Complete the storage upgrade without any downtime.

  3. Perform the storage upgrade without experiencing data loss.

  4. Ensure that the new storage maintains or improves upon the current level of resilience.

  5. Match the performance of the new storage with my current setup.

Given these requirements, I would like to discuss how I executed the storage refresh to ensure zero downtime and prevent any data loss (meeting requirements 2 and 3).

The Production Environment’s Setup

Before the hardware refresh, my production environment consisted of two Hyper-V hosts, each connected to a dedicated NAS. I have a single, very large virtual machine that contains all my data. The virtual machine is replicated across both servers by way of the Hyper-V replication feature.

I chose to build my production environment this way, instead of creating a failover cluster, to achieve genuine shared-nothing redundancy. The replication process occurs automatically every 30 seconds. As such, in the event of a critical failure, I could simply activate the standby replica, and so, theoretically, I should never lose more than 30 seconds’ worth of data.
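
To put the 30-second figure in concrete terms, here is a minimal Python sketch of the arithmetic. It assumes that Hyper-V Replica's selectable frequencies are 30 seconds, 5 minutes, and 15 minutes (true of recent Windows Server releases, but worth checking against your version). The write rate is a made-up number, and the function name is purely illustrative.

```python
# Replication frequencies offered by Hyper-V Replica, in seconds (assumed).
SUPPORTED_INTERVALS_SEC = (30, 300, 900)


def worst_case_unreplicated_mb(interval_sec: int, write_rate_mb_per_sec: float) -> float:
    """Estimate how much data can be written between two replication cycles.

    This ignores transfer time and any replication backlog, so the real
    exposure can be higher than the value returned here.
    """
    if interval_sec not in SUPPORTED_INTERVALS_SEC:
        raise ValueError(f"unsupported replication interval: {interval_sec}")
    if write_rate_mb_per_sec < 0:
        raise ValueError("write rate must be non-negative")
    return interval_sec * write_rate_mb_per_sec


# Hypothetical workload: 2 MB of changes per second at a 30-second interval.
exposure_mb = worst_case_unreplicated_mb(30, 2.0)
print(f"Worst-case unreplicated data: {exposure_mb:.0f} MB")
```

The estimate is only a lower bound on the risk: a slow link or a large burst of writes can leave more than one interval's worth of changes waiting to be sent.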

A Redundancy-Driven Approach

I decided to maintain this type of redundancy since it has worked so well for me in the past. For the hardware refresh, my plan involved creating an offline backup, which would act as a last line of defense if something went horribly wrong. From there, I would:

  1. Verify that the replicas are in sync, and then break the replica pair.

  2. Shut down the replica NAS and the replica host.

  3. Remove and replace the replica NAS, bring it online, and then re-enable Hyper-V replication.

  4. Once all data has replicated to the new NAS, perform a lossless failover to the replica server so that it hosts the running copy of the production virtual machine.

  5. Break the replica pair again, replace the other NAS, bring it back online, and then reestablish the replication process.

  6. Finally, perform one more lossless failover to return the running copy of the VM to its original host.
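
One way to sanity-check the ordering of these steps is to walk through them in a toy model. The Python sketch below is only an illustration, not a tool that touches Hyper-V. The host names, class, and method names are hypothetical, and the model treats replication as instantly in sync once it is re-enabled, which a real initial replication to an empty NAS is not.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Toy model of the two-host setup; the host names are placeholders."""
    running_host: str = "host-a"
    standby_host: str = "host-b"
    replicating: bool = True
    refreshed: set = field(default_factory=set)  # hosts whose NAS is new

    def break_pair(self) -> None:
        self.replicating = False

    def replace_standby_nas(self) -> None:
        # The NAS may only be swapped once replication to it has stopped.
        if self.replicating:
            raise RuntimeError("break the replica pair before swapping the NAS")
        self.refreshed.add(self.standby_host)

    def enable_replication(self) -> None:
        self.replicating = True

    def planned_failover(self) -> None:
        # A lossless failover needs an in-sync replica to fail over to.
        if not self.replicating:
            raise RuntimeError("cannot fail over without an in-sync replica")
        self.running_host, self.standby_host = self.standby_host, self.running_host


def run_refresh(env: Environment) -> list:
    """Apply the plan in order and log whether a standby replica exists."""
    plan = [
        ("break replica pair", env.break_pair),
        ("replace replica NAS", env.replace_standby_nas),
        ("re-enable replication", env.enable_replication),
        ("planned failover to replica", env.planned_failover),
        ("break replica pair again", env.break_pair),
        ("replace other NAS", env.replace_standby_nas),
        ("reestablish replication", env.enable_replication),
        ("planned failover back", env.planned_failover),
    ]
    log = []
    for name, action in plan:
        action()
        log.append((name, env.running_host, env.replicating))
    return log


env = Environment()
for name, host, protected in run_refresh(env):
    print(f"{name:30} running on {host}, standby available: {protected}")
```

The log makes the unprotected windows visible: after each "break replica pair" step and until replication is back, the running VM has no standby copy. Running the steps out of order, such as failing over before replication is re-enabled, raises an error in the model.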

Verifying a replica’s health in Hyper-V is a simple process. Just open the Hyper-V Manager, right-click on the virtual machine, and select the Replication | View Replication Health commands from the shortcut menus.

It’s a good idea to perform this check on both replication partner hosts. In rare circumstances, I have seen two replication partners report completely contradictory health data. Given that one of the replication partners will be taken offline, it’s important to thoroughly confirm the replication’s health.

Figure 1. It’s important to verify that Hyper-V replication is healthy.
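
Because the two partners can occasionally disagree, it helps to compare their reports side by side rather than trusting either one alone. The following Python sketch shows the idea under some loud assumptions. The dictionary keys are invented for illustration, the values are typed in by hand or exported by whatever means you prefer, and the "Normal" and "Warning" strings simply mirror the health states that Hyper-V displays. Matching timestamps exactly is also a simplification.

```python
def compare_health(primary: dict, replica: dict) -> list:
    """Return a list of problems; an empty list means the reports agree and look healthy."""
    problems = []
    for side, report in (("primary", primary), ("replica", replica)):
        if report.get("health") != "Normal":
            problems.append(f"{side} reports health {report.get('health')!r}")
    if primary.get("health") != replica.get("health"):
        problems.append("hosts disagree about replication health")
    if primary.get("last_replication") != replica.get("last_replication"):
        problems.append("hosts disagree about the last replication time")
    return problems


# Hypothetical reports, for example transcribed from each host's health dialog.
primary_report = {"health": "Normal", "last_replication": "2024-02-01T10:15:30"}
replica_report = {"health": "Warning", "last_replication": "2024-02-01T10:12:00"}

issues = compare_health(primary_report, replica_report)
if issues:
    print("Do not break the replica pair yet:")
    for issue in issues:
        print(" -", issue)
else:
    print("Both hosts agree that replication is healthy.")
```

The point of the sketch is the gate it creates: the replica pair is only broken when both views are healthy and consistent with each other.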

After verifying the replication health and confirming that all data has been replicated between the two hosts, the next step is to disable replication. In the Hyper-V Manager, right-click on the virtual machine and select the Replication | Remove Replication commands from the shortcut menus. This action needs to be performed on both Hyper-V hosts. This process does not delete the virtual machine copy (the replica), but it does stop any further data replication to it.

Figure 2. You can use the Remove Replication menu option to terminate the replication partnership.
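
The same removal can also be done from PowerShell with the Remove-VMReplication cmdlet, which ships with the Hyper-V module. As a small aid, this Python sketch builds the command line for each partner so that neither host is forgotten. The VM and host names are placeholders and the helper names are hypothetical. The sketch only prints the commands and does not run them.

```python
def ps_quote(value: str) -> str:
    """Wrap a value in PowerShell single quotes, doubling any embedded quote."""
    return "'" + value.replace("'", "''") + "'"


def removal_commands(vm_name: str, hosts: list) -> list:
    """Build one Remove-VMReplication command per replication partner."""
    if len(hosts) != 2:
        raise ValueError("expected exactly two replication partners")
    return [
        f"Remove-VMReplication -VMName {ps_quote(vm_name)} -ComputerName {ps_quote(host)}"
        for host in hosts
    ]


# Placeholder names; substitute the real VM and host names.
for command in removal_commands("FileServerVM", ["HV-HOST-A", "HV-HOST-B"]):
    print(command)
```

Generating both commands from one list of hosts reflects the requirement that the replication relationship be removed on each side.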

Addressing Downtime and Data Loss

This brings up two important points. First, as previously noted, my requirements included zero downtime and no data loss. Technically, the type of migration that I am performing cannot be accomplished with literally no downtime and no data loss. A lossless failover (referred to as a planned failover by Microsoft) requires powering down the virtual machine during the failover process. However, the downtime is minimal, usually lasting around a minute or so. The alternative, an unplanned failover, results in the loss of any data not yet replicated.
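
To show where the brief outage comes from, the sketch below lays out the order of operations for a planned failover using cmdlets from the Hyper-V PowerShell module. It reflects my understanding of the documented sequence and should be checked against Microsoft's documentation before use. The Python function, VM name, and host names are hypothetical, and nothing here is executed against a host.

```python
def planned_failover_steps(vm_name: str, primary: str, replica: str) -> list:
    """Return (host, command) pairs in the order a planned failover runs."""
    vm = "'" + vm_name.replace("'", "''") + "'"  # PowerShell single-quoted string
    return [
        # Downtime starts here: the VM must be off before the final sync.
        (primary, f"Stop-VM -Name {vm}"),
        # Send any changes that have not been replicated yet.
        (primary, f"Start-VMFailover -Prepare -VMName {vm}"),
        # Fail over to the replica copy.
        (replica, f"Start-VMFailover -VMName {vm}"),
        # Reverse the direction so the old primary becomes the replica.
        (replica, f"Set-VMReplication -Reverse -VMName {vm}"),
        # Downtime ends once the VM has booted on the replica host.
        (replica, f"Start-VM -Name {vm}"),
    ]


# Placeholder names; substitute the real VM and host names.
for host, command in planned_failover_steps("FileServerVM", "HV-HOST-A", "HV-HOST-B"):
    print(f"[{host}] {command}")
```

The VM is unavailable from the first step until the last one completes, which is why the outage is short but not zero.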

The second point is that the migration method I performed requires only a very small amount of downtime, but this holds true only so long as the primary Hyper-V host does not fail during the storage refresh. While the replica pair is broken, there is no standby virtual machine replica to fall back on. Even so, there is some hardware-level redundancy that helps mitigate the risk of a failure. For example, my Hyper-V host servers have redundant power supplies, while my existing NAS appliances are configured with redundancy to protect against disk failures.

About the Author

Brien Posey

Brien Posey is a bestselling technology author, a speaker, and a 20X Microsoft MVP. In addition to his ongoing work in IT, Posey has spent the last several years training as a commercial astronaut candidate in preparation to fly on a mission to study polar mesospheric clouds from space.

https://brienposey.com/
