The amount of data traversing the modern cloud platform is breaking new ground. Annual global data center IP traffic is projected to reach 7.7 zettabytes by the end of 2017, according to the latest Cisco Global Cloud Index report. Overall, data center IP traffic will grow at a compound annual growth rate (CAGR) of 25 percent from 2012 to 2017.
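To make that growth rate concrete, here is a minimal sketch of the compound-growth arithmetic behind the two figures. The function names are made up for illustration, and the 2012 baseline it prints is only what the rounded 25 percent rate implies, not a number quoted from the Cisco report.

```python
def implied_start_value(end_value: float, rate: float, years: int) -> float:
    """Back out the starting value implied by an end value and a CAGR."""
    return end_value / (1.0 + rate) ** years


def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between a start value and an end value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


END_ZB = 7.7   # projected 2017 traffic in zettabytes, from the article
RATE = 0.25    # 25 percent CAGR, from the article
YEARS = 5      # 2012 to 2017

baseline = implied_start_value(END_ZB, RATE, YEARS)
print(f"Implied 2012 traffic: {baseline:.2f} ZB")
print(f"Growth multiple over {YEARS} years: {(1 + RATE) ** YEARS:.2f}x")
print(f"Round-trip CAGR: {cagr(baseline, END_ZB, YEARS):.0%}")
```

In other words, five years of 25 percent annual growth compounds to roughly a threefold increase in traffic.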
More than ever, organizations rely on large data sets to help them run, measure and grow their business. Over the last couple of years, already large databases have grown from gigabytes to terabytes and even petabytes.
Furthermore, this data no longer resides in just one location. As the traffic growth numbers suggest, cloud computing has made this information truly distributed.
Big data and data science are taking off in pretty much every industry:
- Science: The Large Hadron Collider produces about 600 million collisions per second. Even working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments amounts to about 25 petabytes per year before replication (as of 2012). This becomes nearly 200 petabytes after replication.
- Research: NASA’s Center for Climate Simulation (NCCS) stores about 32 petabytes of climate observations and simulations on its supercomputer platform.
- Private/Public: Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 the company had the world’s three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
Organizations have been forced to find new and creative ways to manage and control this vast amount of information. The goal isn’t just to organize it, but to analyze and mine the data to help develop the business. To that end, there are several strong open-source options that large organizations should evaluate:
Apache HBase: This big data management platform is modeled after Google’s Bigtable storage system. An open-source, distributed database written in Java, HBase was designed to run on top of the already widely used Hadoop environment. Facebook adopted Apache HBase to manage the large amounts of data behind its messaging platform.
Apache Hadoop: Hadoop quickly became one of the standard technologies in big data management. When it comes to open-source management of large data sets, Hadoop is known as a workhorse for truly data-intensive distributed applications. The platform is flexible enough to run on commodity hardware, and it integrates easily with structured, semi-structured, and even unstructured data sets.
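Hadoop’s batch processing model is MapReduce. The sketch below imitates its three phases (map, shuffle, reduce) with a word count in plain, single-process Python. It is an illustration of the idea only: it does not use the Hadoop API, the function names are made up, and on a real cluster the framework would spread these phases across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values emitted for one key into a single count."""
    return key, sum(values)


# Two made-up input lines standing in for a much larger data set.
lines = ["big data big clusters", "data on commodity hardware"]

pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Because each map call looks at one line and each reduce call looks at one key, both phases can be parallelized, which is what lets this model scale out on commodity hardware.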
Apache Drill: How big is your data set? Really big? Drill is a great tool for very large data sets. By supporting HBase, Cassandra, and MongoDB, Drill provides an interactive analysis platform that allows for massive throughput and very fast results.
Apache Sqoop: Are you working with data potentially locked within an older system? Well, that’s where Sqoop can help. This platform allows for fast data transfers from relational database systems to Hadoop by leveraging concurrent connections, customizable mapping of data types, and metadata propagation. In fact, you can tailor imports (such as new data only) to HDFS, Hive, and HBase.
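As a rough sketch of how those features map onto a Sqoop 1 command line, the helper below assembles (but does not run) a `sqoop import` invocation. The helper name, database URL, table, column, and HDFS path are all hypothetical placeholders; connection credentials are left out, and the options should be checked against the Sqoop version you actually deploy.

```python
import shlex


def build_sqoop_import(jdbc_url, table, target_dir, check_column,
                       last_value, num_mappers=4, column_types=None):
    """Assemble (but do not execute) a `sqoop import` argument list."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        # Concurrent connections: number of parallel map tasks.
        "--num-mappers", str(num_mappers),
        # New data only: append rows whose check column exceeds last_value.
        "--incremental", "append",
        "--check-column", check_column,
        "--last-value", str(last_value),
    ]
    if column_types:
        # Customizable type mapping, for example {"order_id": "Long"}.
        mapping = ",".join(f"{col}={typ}" for col, typ in column_types.items())
        args += ["--map-column-java", mapping]
    return args


command = build_sqoop_import(
    jdbc_url="jdbc:mysql://db.example.com/sales",  # hypothetical source database
    table="orders",                                # hypothetical table
    target_dir="/data/sales/orders",               # hypothetical HDFS path
    check_column="order_id",                       # hypothetical key column
    last_value=100000,
    column_types={"order_id": "Long"},
)
print(shlex.join(command))
```

Building the argument list separately from running it keeps the example safe to execute on a machine without Sqoop installed; passing the list to `subprocess.run` would be the next step on a real cluster node.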
Apache Giraph: This is a powerful graph processing platform built for scalability and high availability. Already used by Facebook, Giraph jobs run as Hadoop workloads that can live on your existing Hadoop deployment. This way you get powerful distributed graph processing capabilities while using your existing big data processing engine.
Cloudera Impala: Impala sits on top of your existing Hadoop cluster and listens for incoming queries. Where technologies like MapReduce are powerful batch processing solutions, Impala does wonders for real-time SQL queries. Basically, you can get real-time insight into your big data platform via low-latency SQL queries.
Gephi: It’s one thing to correlate and quantify information, but it’s an entirely different story when it comes to creating powerful visualizations of this data. Gephi already supports multiple graph types and networks as large as 1 million nodes. With an already active user community, Gephi has numerous plug-ins and ways to integrate with existing systems. This tool can help visualize complex IT connections, the various points in a distributed system, and how data flows between them.
MongoDB: This solid platform has been growing in popularity among organizations looking to gain control over their big data needs. MongoDB was originally created at 10gen, a company founded by DoubleClick veterans, and is now being used by several companies as an integration piece for big data management. It is an open-source NoSQL database that stores and processes data as JSON-like documents. Currently, organizations such as the New York Times, Craigslist and a few others have adopted MongoDB to help them control big data sets. (Also check out Couchbase Server.)
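To illustrate what a JSON-like data model looks like in practice, here is a minimal sketch using plain Python dictionaries and the standard `json` module. No MongoDB server or driver is involved, and the collection, field names, and `find_by_tag` helper are hypothetical.

```python
import json

# Two records in the same hypothetical "articles" collection. Unlike rows in a
# relational table, they do not have to share the same set of fields.
articles = [
    {
        "title": "Managing Big Data",
        "tags": ["hadoop", "hbase"],
        "author": {"name": "A. Writer", "desk": "technology"},
    },
    {
        "title": "Graph Processing at Scale",
        "tags": ["giraph"],
        "views": 1024,
    },
]


def find_by_tag(collection, tag):
    """Return documents whose optional "tags" array contains the given tag."""
    return [doc for doc in collection if tag in doc.get("tags", [])]


for doc in find_by_tag(articles, "hadoop"):
    print(json.dumps(doc, indent=2))
```

Nested objects and arrays live inside a single record, so related data that would span several relational tables can be read back in one lookup.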
Our new “data-on-demand” society has resulted in vast amounts of information being collected by major IT systems. Whether it is social media photos or international store transactions, the amount of good, quantifiable data is increasing. The only way to control this growth is to quickly deploy an efficient management solution.
Remember, aside from just being able to sort and organize the data, IT managers must be able to mine the information and make it work for the organization. Business intelligence and the science behind data quantification will continue to grow and expand. Organizations seeking to gain an edge on their competition will be the ones with the most control over their data management systems.