Use of HDFS, the Java-based file system that has become nearly synonymous with the so-called big data revolution, has been declining over the last few years. This is primarily due to what can only be described as a general loss of interest in Hadoop. But what is it about the Hadoop architecture that has caused it to be abandoned in droves, when it held such promise only a short time ago?
To be perfectly frank, there does not seem to be one single definitive reason for Hadoop’s decline. Instead, the Hadoop architecture's loss of popularity may be attributed to several different factors.
Some IT professionals have expressed frustration over their inability to perform any kind of meaningful data analytics on Hadoop clusters. A common sentiment is that Hadoop is great for warehousing massive amounts of data, but tends not to be the best solution for those who need to give end users interactive access to that data.
Several vendors have created SQL-on-Hadoop solutions, which allow customers to run SQL queries against data residing on Hadoop. Even so, these tools are not created equal, and each SQL-on-Hadoop vendor seems to design its tool for a specific use case. It's easy to imagine how many organizations may have found out the hard way that the SQL-on-Hadoop engine they purchased was not well suited to their big data project.
It is not just the difficulty of making Hadoop do what an organization needs it to do that has led to the decrease in its usage. Another reason that has occasionally been cited for Hadoop’s waning popularity is that Hadoop does not mesh well with current IT trends.
The big data revolution took hold seemingly overnight, and, when it did, Hadoop was well positioned to answer the call. After all, according to Hortonworks, HDFS “has demonstrated production scalability of up to 200 PB of storage” and a single cluster of 4,500 servers supports “close to a billion files and blocks.” Clearly, HDFS can handle big data.
The problem is that the big data trend seems to be over. Even though IT shops are still working on projects that conceivably fall under the big data umbrella, the term is being used less and less. Instead, the IT fad of the moment seems to be machine learning, which is, of course, not natively supported by Hadoop. There are add-on tools such as Apache Mahout that enable machine learning for Hadoop, but they may be too little, too late.
Perhaps the biggest reason for the decline in Hadoop usage, however, is the maturity of IaaS clouds such as Amazon Web Services (AWS) and Microsoft Azure. There are a few different ways in which public clouds have played a significant role in the transition away from Hadoop.
The first reason is simple perception. We live in a cloud-first world. The public cloud providers have done a fantastic job of convincing people that it is far less expensive to run workloads in the cloud than to run those same workloads on premises.
There is also a tendency to perceive those who continue to deploy new workloads on premises as somehow being behind the times.
A second way in which the public cloud providers are slowly contributing to Hadoop’s demise is the fact that cloud providers have essentially built a better mousetrap. While it is worth noting that Amazon does support Hadoop and Spark through Amazon EMR, EMR is not Amazon’s only solution for organizations that need big data analytics capabilities. Amazon also offers Athena, which can be used to analyze petabyte-scale data stored in Amazon S3, and Elasticsearch, which allows for petabyte-scale log analytics, text search, and application monitoring capabilities.
My guess is that Hadoop and HDFS will never completely go away, at least not any time soon. I think that there will probably always be a need for high-capacity storage using commodity hardware. Besides, uploading massive amounts of data to the public cloud can be cost-prohibitive and may also pose logistical problems.
Having said that, it is becoming increasingly common for new big data projects to be born in the cloud, and IT pros are increasingly finding that cloud-native big data tools are easier to use and more effective than Hadoop.