Q. I heard that VMware ESX allows you to overcommit memory, which means it pages out the memory of guests to a file, giving very poor performance. Is this true?

A. I had heard the same thing, so I decided to dig into exactly what memory overcommit in ESX consists of and the recommendations of its usage. Let's take a step back, though, and look at why ESX even has the feature, because hypervisors like Hyper-V don't.

A host system has a finite amount of memory that can be allocated to guest virtual machines (VMs). Within the guest, the OS allocates and de-allocates memory, constantly changing the amount of memory it's actually using. The hypervisor only sees the allocation of memory from the guest when the guest tries to read or write to a page of memory for the first time. The hypervisor is never told when a page of memory is de-allocated and put on the guest OS's free list. It's not practical for a hypervisor to monitor each OS's implementation of the free memory list because it would likely change between each OS version and service pack. Therefore, the hypervisor only sees the memory use of each guest grow over time, even though the guest may have stopped using large amounts of the memory assigned to it.

One approach to this problem is to make sure the host always has enough memory to cover the memory configured for each VM. If I have 10 VMs, each configured with 2GB of memory, I would need a host with about 24GB of memory—20GB for the VMs and the rest for the hypervisor and management processes, which use additional memory for shadow page tables to map VM memory to physical memory.
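The sizing arithmetic above can be sketched in a few lines. Note the overhead figures here (2GB for the hypervisor, 10 percent per VM for shadow page tables) are illustrative assumptions chosen to match the article's 24GB example, not VMware's actual numbers:

```python
def host_memory_needed_gb(vm_count, gb_per_vm,
                          hypervisor_overhead_gb=2,
                          shadow_overhead_fraction=0.1):
    """Estimate host memory for fully reserved (non-overcommitted) VMs.

    shadow_overhead_fraction is the per-VM fraction assumed for shadow
    page tables and other per-VM bookkeeping -- an illustrative guess.
    """
    vm_memory = vm_count * gb_per_vm
    return vm_memory * (1 + shadow_overhead_fraction) + hypervisor_overhead_gb

# 10 VMs at 2GB each needs roughly a 24GB host
print(host_memory_needed_gb(10, 2))  # 24.0
```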

The only problem with this approach is that a major reason to virtualize is to consolidate and get the most from the hardware. Consider that when you consolidate physical machines to VMs, you might take a physical box with 4GB of memory and make a VM with 4GB of memory. Most of the time, these VMs use maybe 30 percent of their memory, and they only need more occasionally, for certain workloads. This means your virtual environment has 70 percent of its memory idle, memory that could be used to run a lot more work.

Another approach, then, is to allow VMs to be created and started with total memory allocation that exceeds the amount of memory available on the host, overcommitting the memory. You could run 15 VMs, allocated with 2GB each, on the same 24GB host. So how does it work?

As I mentioned earlier, ESX doesn't allocate memory to a VM until the VM tries to access that memory through a read or write. Over time, the VM will touch more and more of its memory, and because ESX isn't told when the guest OS de-allocates memory, it can't free those pages. Eventually, the 15 VMs would all reach 2GB and the host would run out of memory, even though most of the VMs aren't actively using the whole 2GB; they've simply touched every memory area at some point. You therefore need ways to save and reclaim memory. ESX uses three methods in the order presented, each less attractive than the one before it: transparent page sharing (TPS), ballooning, and swapping (which you don't want).
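The escalation order can be sketched as a simple fall-through: try the cheapest technique first and only resort to swapping for whatever shortfall remains. This is an illustrative model of the ordering described above, not ESX's actual algorithm:

```python
def reclaim(pressure_mb, tps_available_mb, balloon_available_mb):
    """Illustrative reclamation order: TPS, then ballooning, then swap.

    Returns how many MB each technique would cover, with swapping
    taking only the shortfall the first two can't satisfy.
    """
    reclaimed = {"tps": 0, "balloon": 0, "swap": 0}
    need = pressure_mb
    reclaimed["tps"] = min(need, tps_available_mb)          # free: dedup pages
    need -= reclaimed["tps"]
    reclaimed["balloon"] = min(need, balloon_available_mb)  # guest-guided
    need -= reclaimed["balloon"]
    reclaimed["swap"] = need                                # last resort
    return reclaimed

print(reclaim(1024, 600, 300))  # {'tps': 600, 'balloon': 300, 'swap': 124}
```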

Transparent Page Sharing
TPS is enabled by default in ESX and is designed to avoid keeping duplicate copies of the same information in memory. Think of it as Single Instance Storage (SIS) for memory. In a file system, SIS looks for duplicate copies of a file and stores the file only once, replacing the copies with links to the single instance to save disk space. TPS works the same way: ESX creates a hash value for the content of each page of memory and stores it in a hash table. Periodically, the hashes are compared, and if multiple hashes have the same value, a bit-by-bit check of the pages is performed. If the pages really are identical, only a single copy of the memory page is stored, and each VM's memory maps to the shared page. This saves physical memory because information isn't duplicated. This process is illustrated here.
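The hash-then-verify approach can be sketched as follows. This is a toy model of the technique, not VMware's implementation; the function and variable names are mine:

```python
import hashlib

def share_pages(pages):
    """Hash-then-verify page dedup, mimicking TPS's two-step check.

    pages: list of bytes objects (one per guest page).
    Returns (store, mapping) where store holds unique page contents and
    mapping[i] is the index into store for each original page.
    """
    by_hash = {}  # digest -> indices into store with that digest
    store = []    # unique page contents
    mapping = []
    for page in pages:
        digest = hashlib.sha1(page).digest()
        # A hash match is only a hint; confirm with a full byte-by-byte
        # comparison so a hash collision can't merge different pages.
        for idx in by_hash.get(digest, []):
            if store[idx] == page:
                mapping.append(idx)
                break
        else:
            store.append(page)
            by_hash.setdefault(digest, []).append(len(store) - 1)
            mapping.append(len(store) - 1)
    return store, mapping

# Three 4KB pages, two identical: only two copies are stored.
store, mapping = share_pages([b"A" * 4096, b"B" * 4096, b"A" * 4096])
print(len(store), mapping)  # 2 [0, 1, 0]
```

In the real system a write to a shared page triggers copy-on-write, giving the writing VM its own private copy, as the next paragraph describes.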

TPS saves memory within a single VM, because even with shared libraries you'll find the same version of a library is loaded multiple times (on a clean Windows Server 2008 installation I saved about 60MB of memory with this). But TPS's real power is when you have multiple VMs running the same OS, because a huge amount of the memory contents will be the same for the OS and application code components. If you have 15 VMs, all the identical memory components are stored in memory once, instead of 15 times. If a guest tries to write to a shared page, the guest gets its own copy created and the change is made there. In one VMware lab test of cross-VM sharing, four 1GB Windows Vista installations initially used a total of about 3.5GB. After 30 minutes, the total memory use fell to 800MB through TPS. I like this technology the best of ESX's overcommitting techniques. In my testing, it works great and has almost no performance hit.

Ballooning
If TPS isn't reclaiming enough memory, how do you know which pages of memory aren't being used by VMs? The host can't see the guest's free memory, and you need to make sure the guest OS doesn't try to access memory the host has taken back. The solution is the balloon driver, which is installed with VMware Tools (ESX's equivalent of integration services) in the guest OS. The balloon driver is a high-privilege device driver, which means it can request memory from the guest OS and tell the OS that memory can't be paged out to a swap file (a capability device drivers have).

When ESX is under memory pressure and needs to reclaim some memory, it sets the balloon driver in the guest to a target size. The balloon driver then asks the guest OS for the amount of memory it needs. Because it's a high-privilege device driver, it will always get the memory it requests, but it may take a little time. The guest OS will look at its memory use and it may have enough memory free to give the balloon driver what it wants. If it doesn't, the guest OS will remove some items that haven't been accessed for a while from memory. The key is that the guest picks what to take out of memory if it has to, so the operation will have the least possible impact on the guest OS and the services and applications running within it.

The balloon driver pins the memory it's allocated, preventing the guest OS from paging it to disk, and informs the hypervisor of the pages it was given. Because the OS won't access the balloon driver's memory and the balloon driver doesn't actually use the memory, the hypervisor can de-allocate this memory from the guest and use it elsewhere, as shown here.

If the host gets more free memory, the balloon driver will be given a new, smaller target size and release some or all of the memory it's using back to the guest OS. This inflating and deflating of the balloon driver's memory use in the guest gives ESX the ability to reduce the physical memory footprint of a guest in a way the guest decides, instead of the hypervisor, which would have no knowledge and just pick memory at random. It's important to remember that this memory reduction isn't instant—the balloon driver is given a target size, and it takes time for the guest OS to allocate it the memory it wants.
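The inflate/deflate cycle can be modeled with a toy class. This is purely illustrative (the class and its methods are my own names, and a real guest takes time to satisfy the target, possibly paging things out first):

```python
class BalloonDriver:
    """Toy model of balloon inflation/deflation inside a guest."""

    def __init__(self, guest_total_mb):
        self.guest_total_mb = guest_total_mb
        self.inflated_mb = 0  # memory pinned by the balloon driver

    def set_target(self, target_mb):
        # The hypervisor sets a target size; the guest allocates (or
        # frees) pinned pages until the balloon reaches it. The balloon
        # can never exceed the guest's total memory.
        self.inflated_mb = min(target_mb, self.guest_total_mb)
        return self.reclaimable_mb()

    def reclaimable_mb(self):
        # Pages inside the balloon are pinned and unused by the guest,
        # so the hypervisor can safely hand them to other VMs.
        return self.inflated_mb

# Host under pressure: inflate to reclaim 512MB from a 2GB guest...
balloon = BalloonDriver(guest_total_mb=2048)
print(balloon.set_target(512))  # 512
# ...pressure eases: deflate, returning the memory to the guest OS.
print(balloon.set_target(0))    # 0
```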

Swapping
The last technology is swapping, used when ESX needs memory badly and can't reclaim it through TPS or reclaim it quickly enough through ballooning. In this case, ESX will randomly pick pages of the guests' memory and write them to a disk swap file. At this point, there will likely be a performance impact on the guest OSs; you never want swapping to occur.

When you hear about how bad memory overcommit is, you're hearing about swapping. This technology isn't good, but it's important to realize that it's only used as a last resort. In an environment that has been architected and sized correctly based on the types of workloads it will handle, TPS and ballooning should always manage the memory so swapping never happens. That's the key point: proper planning. If you try to run 100 2GB VMs on a 24GB box, then obviously swapping is going to occur and your environment will suck.

It's interesting to consider that Hyper-V's development team is looking at its own version of overcommit using a different approach, focusing on hot allocation and de-allocation of memory, much as hot-add works for SCSI storage. This method will work similarly to ESX's balloon driver, except that it will require the guests to support dynamic memory hot addition and removal. Instead of having a driver in the OS using and releasing memory, it will actually add and remove memory to and from the guest. This requirement will limit the guests that can support it.

I'm a Microsoft and Hyper-V fan, but in this case I really feel some of the press ESX's memory overcommit is getting isn't accurate. When used the right way, it's actually a cool feature.

Full disclosure: I work for EMC, which owns VMware, but that really had no influence on my opinion. I'm part of the Microsoft consulting group, so we do Hyper-V. I like the FAQs to be accurate, so I spent a lot of time investigating how ESX's memory management works and tried it out myself; I just wanted to answer this question accurately. Disclosure over.

I hope that clears up the overcommit myth. It was fascinating to research how it all works. Look for a video on this topic soon, with demos of the feature in action.

Related Reading:

Check out hundreds more useful Q&As like this in John Savill's FAQ for Windows. Also, watch instructional videos made by John at ITTV.net.