Big Data, meet Desktop Virtualization.
Hadoop, a widely used framework for processing very large data sets, is one of the foundations of the growing importance of Big Data analytics inside enterprises. In a standard Hadoop installation, data is spread across a cluster of commodity servers, which operate on it in parallel. Distributing the data among many machines allows huge data sets to be processed far more quickly than the serial architecture of a single machine would permit.
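The split-then-process-in-parallel pattern can be illustrated with a toy word count, the canonical MapReduce example. This is only a sketch: it stands in a `multiprocessing` pool for a real cluster, and the function names (`map_split`, `word_count`) are invented for illustration, not part of any Hadoop API.

```python
# Toy sketch of Hadoop's map/shuffle/reduce pattern. A real cluster
# distributes splits across machines; here a local process pool stands in.
from collections import Counter
from multiprocessing import Pool

def map_split(split):
    """Map phase: count words within a single data split."""
    return Counter(split.split())

def word_count(splits):
    """Map each split in parallel, then reduce the partial counts."""
    with Pool() as pool:
        partials = pool.map(map_split, splits)
    total = Counter()
    for partial in partials:  # reduce phase: merge partial results
        total += partial
    return total

if __name__ == "__main__":
    splits = ["big data big", "data analytics", "big analytics"]
    print(word_count(splits))
```

Because each split is counted independently, adding machines (or here, processes) scales the map phase almost linearly, which is the core of Hadoop's speed advantage on large data sets.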
Because Hadoop runs on "white box" commodity machines and keeps them at full capacity, it hasn't been seen as a natural fit for virtualization. But in something of a surprise, a recent VMware study showed performance gains of up to 13% from running a Hadoop cluster on virtual machines.
Since many Big Data users are interested in "real time" analytics, a speed boost like that could be significant.
The study reports that "virtualization overhead is never very large, but can be essentially eliminated through careful virtual machine sizing and configuration." And incorporating virtual machine architectures has other advantages, the report said, such as reducing the number of idle disks and increasing overall throughput.
Hadoop adoption could accelerate if it borrows some of the successful technology strategies of the VDI world. Now, it looks like that will happen.