To better cope with scale of its infrastructure Twitter is moving a portion of its data platform from its own data centers to Google’s cloud.
Managing scale is one of the most challenging aspects of operating a vast internet service. But cloud giants like Google have gotten so good at it that even other internet services of serious scale – such as Twitter, which over the years has developed substantial infrastructure expertise of its own – see value in letting them handle at least some of their workloads.
Twitter is moving cold storage and Hadoop clusters to Google Cloud Platform, the social network’s CTO, Parag Agrawal, said in a blog post Thursday. “This will enable us to enhance the experience and productivity of our engineering teams working with our data platform,” he wrote.
He listed expected benefits of the migration more specifically toward the bottom:
- Faster capacity provisioning
- Increased flexibility
- Access to a broader ecosystem of tools and services
- Security improvements
- Enhanced disaster recovery capabilities
Decoupling Hadoop Compute and Storage
A key expected improvement is the ability to separate compute and storage for the class of Hadoop workloads being migrated. Hadoop, the born-at-Yahoo open source software that ties a group of disparate servers together into a powerful computing cluster, was created when networks were much slower than today, so a CPU processing data stored on the same machine was substantially faster than shuttling data over a network.
But even as far back as 2015 studies showed that network improvements had shrunk the difference between reading data from local disks and reading it from remote disks to as little as 8 percent, Tom Phelan, co-founder and chief architect at the big data infrastructure specialist BlueData, wrote at the time.
“This is now relatively well known,” he wrote. “What may be less well known is that this 8% number is decreasing.”
With the difference in speed negligible, the benefits of disaggregating compute and storage far outweigh the benefits of keeping them closely coupled. According to Phelan, the benefits of disaggregation are: finetuning hardware for individual Hadoop modules, more storage-system options, and the ability to run Hadoop in a virtualized environment.
Agarwal said separating compute and storage would have “a number of long-term scaling and operational benefits.”
Less Than 20 Percent of Twitter Moving to GCP
It’s unclear from Agrawal’s post exactly how much of Twitter’s enormous Hadoop infrastructure will be moving into Google’s cloud. “Twitter runs multiple Hadoop clusters that are among the biggest in the world,” he wrote.
Those clusters occupied close to 20 percent of all hardware behind Twitter as of January 2017, according to a blog post by the company’s VP of infrastructure and operations, Mazdak Hashemi.
In this week’s post, Agrawal said Twitter’s Hadoop file systems host more than 300PB of data across tens of thousands of servers, but he also said only “flexible compute Hadoop clusters” and cold storage will be migrated, meaning only a portion of Twitter’s Hadoop capacity will be moved. Cold storage usually means lower-performance and lower-cost storage systems for old, infrequently accessed data.
Twitter hasn’t disclosed publicly where its data centers are located, but according to various reports over the years (including some by Data Center Knowledge), the company has leased large amounts of data center space from RagingWire in Sacramento, California, and from QTS Realty in Atlanta Metro.