Q. How does Live Migration work in Windows Server 2008 R2?

A. The first version of Hyper-V offered Quick Migration, in which a virtual machine (VM) running on one node in a cluster could be moved to another node in the cluster with a minimal amount of downtime. The downtime was essentially the time it took to write the contents of the VM's memory to the LUN, move the LUN to the new node, provision the VM on the new node, and read the memory back from the LUN. Typically all of this took about 30 seconds for a machine with a couple of gigabytes of memory. For many organizations, 30 seconds of planned downtime is acceptable. However, there are times when a zero-downtime migration is required. If you consider how Quick Migration functions, that presents two challenges: moving the memory to the new node while the VM is still running, and moving the LUN containing the VM configuration and, most importantly, the virtual hard disks to the new node with no interruption in access.

In the FAQ "How do Cluster Shared Volumes work in Windows Server 2008 R2?", we saw the new multiple concurrent access capability, which solves the problem of zero-downtime LUN access between nodes in a cluster.

The other challenge is the memory, which is actually not as bad as it seems:

• The new VM is provisioned on the target node, and a snapshot of the VM's memory is then sent to the target node. While the data is being copied, pages of memory will change and be marked dirty.

• Once the copy of the memory is complete, the pages that have been marked dirty are copied over, which is considerably faster than copying all the memory. While that copy is happening, other pages of memory will change, but not as many, because the copy takes far less time.

• The process of copying dirty memory continues until the amount of change is so small that it can be copied in milliseconds. At that point the VM is turned off on the original node, the partition state is copied, and the VM is activated on the new node with no visible downtime to clients connected to the VM. An Address Resolution Protocol (ARP) update is then issued so clients are pointed to the new active node. Once the migration is verified, the VM on the original node is deleted. This is shown in the pictures and video below.

New VM provisioned on target node but not active.

Content of memory copied from active node. During the copy, pages of memory are changed and marked dirty on the current active node.

Copy of dirty pages repeated until amount of memory delta left can be moved in milliseconds.

For the final copy, the VM on the active node is paused so no pages are dirtied during the final copy.

Clients access the VM on the new node and the VM on the old node is deleted.
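
The iterative memory copy described above can be sketched as a small simulation. This is only an illustrative model, not Hyper-V's actual implementation: the function name `simulate_precopy`, the constant link speed, the constant dirty rate, and the 50 ms cutover threshold are all assumptions chosen to make the arithmetic visible.

```python
def simulate_precopy(memory_mb, link_mb_per_s, dirty_mb_per_s,
                     cutover_ms=50, max_rounds=30):
    """Model repeated dirty-page copies until the remainder is small.

    All parameters are hypothetical inputs for illustration:
      memory_mb       total VM memory to move
      link_mb_per_s   assumed constant copy throughput between nodes
      dirty_mb_per_s  assumed constant rate at which the VM dirties memory
      cutover_ms      largest final copy time we accept while the VM is paused
    Returns (rounds, final_delta_mb, final_copy_ms, converged).
    """
    # Largest delta that can still be copied within the cutover window.
    threshold_mb = link_mb_per_s * cutover_ms / 1000.0

    rounds = []                  # (round number, MB copied, seconds taken)
    to_copy = float(memory_mb)   # the first round copies all memory

    for n in range(1, max_rounds + 1):
        if to_copy <= threshold_mb:
            break
        copy_s = to_copy / link_mb_per_s
        rounds.append((n, to_copy, copy_s))
        # Pages dirtied while this round was running form the next round.
        # The delta can never exceed the VM's total memory.
        to_copy = min(float(memory_mb), dirty_mb_per_s * copy_s)

    converged = to_copy <= threshold_mb
    final_copy_ms = to_copy / link_mb_per_s * 1000.0
    return rounds, to_copy, final_copy_ms, converged


if __name__ == "__main__":
    # Assumed example: 2 GB VM, 100 MB/s link, VM dirtying 10 MB/s.
    rounds, delta, final_ms, ok = simulate_precopy(2048, 100, 10)
    for n, mb, secs in rounds:
        print(f"round {n}: copied {mb:.2f} MB in {secs:.3f} s")
    print(f"final delta {delta:.3f} MB, paused copy {final_ms:.1f} ms, converged={ok}")
```

With these assumed numbers, each round is ten times smaller than the one before it, so the loop needs only three copy rounds before the remaining delta fits in the cutover window. If the VM dirties memory as fast as the link can copy it, the delta never shrinks and the model reports `converged=False`, which is why the `max_rounds` cap is included.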
