Big Data is one of the most important trends in IT today because it enables organizations to derive new decision-making information from previously untapped data sources. Like Business Intelligence (BI), Big Data is all about making better and faster business decisions and gaining a competitive business advantage.
So what exactly is Big Data? Big Data is a term that is used to refer to large amounts of unstructured data. Big Data should not be confused with another database term, Large Objects (usually called LOBs); although the names seem similar, the two are not the same. LOB data is typified by things like video or audio files or XML documents that are associated with some other text and numeric relational data.
For instance, LOB data might be the picture of a product that’s sold on a Web site. LOB data can be stored inside the relational database itself or in the file system. Big Data is quite different. Unlike relational data, which is structured and is typically created by an OLTP application, Big Data is usually unstructured and is used for data warehousing and decision support applications. Big Data is often characterized by keywords such as volume, variety, and velocity: it refers to very large amounts of data (volume), where the nature of the data could be just about anything (variety), or it might describe the results of capturing real-time complex event streams (velocity).
Some of the driving forces behind the emergence of Big Data include the declining cost of storage, the reduced cost of data acquisition, and new data processing technologies. The cost of storage is lower, and larger disk drives are more widely available, than at any time in the past. When you couple these factors with the emerging availability of cloud storage, it’s clear that organizations now have the ability to store and process very large volumes of data.
Just as important as the availability of greater storage capacity is the fact that the cost of data acquisition has declined significantly. A decade ago, most data was produced by manual entry. In other words, someone was manually typing it into an application, which made the cost of acquiring that data quite high. Today, with the proliferation of mobile devices and the near-universal connectivity of the Internet, data acquisition can be automated. This automation drastically reduces the cost of acquiring data. The types of data and the ways they can be collected have expanded far beyond manual entry.
Finally, the third driving factor fueling the growth of Big Data is the emergence of new technologies like Hadoop that enable organizations to process and analyze large amounts of unstructured data in ways that were not really possible before. With Hadoop, a mainframe or supercomputer is not required to process the huge volumes of data collected. Instead, you can divvy up the processing tasks across a number of standard x86 compute nodes, where each node processes a portion of the query and the partial results are then joined together. You can think of Hadoop as a data warehousing and reporting solution for unstructured data.
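To make the divide-and-join idea concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (split_into_chunks, map_chunk, reduce_counts, word_count) and the sample log lines are hypothetical, and threads simply stand in for separate compute nodes. It only illustrates the pattern of splitting the data, letting each "node" process its own portion, and joining the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_nodes):
    """Deal the records out into num_nodes roughly equal portions."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def map_chunk(chunk):
    """Work done by one 'node': count the words in its own portion only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Join the per-node partial results into one final answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(records, num_nodes=3):
    chunks = split_into_chunks(records, num_nodes)
    # Assumption: threads stand in for separate compute nodes in this sketch.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


if __name__ == "__main__":
    # Hypothetical unstructured input: a few free-form log lines.
    log_lines = [
        "error disk full",
        "warning disk slow",
        "error network down",
        "info disk ok",
    ]
    for word, count in word_count(log_lines).most_common(2):
        print(word, count)
```

In a real cluster the portions would live on different machines and the framework would handle scheduling and failures, but the shape of the computation is the same: no single node ever needs to see all of the data.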
You can get started with Big Data in a number of ways. Microsoft’s HDInsight provides both an on-premises Windows-based Hadoop implementation and an Azure service that gives organizations access to Big Data services without needing to buy, configure, or manage the Hadoop infrastructure. You can also connect SQL Server to the HDInsight Server or Service using Microsoft’s SQL Server Connector for Apache Hadoop. Perhaps even more interesting is Microsoft’s upcoming Polybase technology. Polybase will provide a bridge to Big Data from SQL Server, allowing T-SQL queries to be run against data stored in a Hadoop cluster and to return the data as standard SQL results. Polybase will initially ship with the next release of Microsoft’s Parallel Data Warehouse (PDW).
EMC’s GreenPlum Unified Analytics Platform is another Big Data solution that can help enterprises begin to take advantage of the data insights that Big Data can offer. Unlike Polybase, which is a future technology, the GreenPlum Unified Analytics Platform is available today. Designed to provide enterprise-scale analytic processing, the GreenPlum Unified Analytics Platform consists of the GreenPlum Database, Pivotal HD Enterprise, and GreenPlum Chorus. GreenPlum Database is a highly scalable Massively Parallel Processing (MPP) analytic database. Pivotal HD Enterprise is an enterprise-hardened Hadoop implementation which also offers advanced SQL query services.
Pivotal HD Enterprise can use traditional Hadoop direct-attach data storage or it can make use of EMC’s Isilon OneFS Scale-Out NAS Storage, which provides 100% HDFS compatibility. Solutions like EMC Isilon OneFS Scale-Out NAS Storage can help you manage your Big Data storage costs and complexity. Isilon OneFS Scale-Out NAS Storage provides NAS simplicity in combination with advanced data protection, replication, and snapshots. GreenPlum Chorus is a data exploration and virtualization platform designed to help data scientists collaborate and deliver analytics insights.
Big Data is not a replacement for relational databases. Relational databases will continue to support the organization’s core mission-critical applications. However, Big Data will continue to be a big deal. Big Data opens up new business insights, enabling organizations to make better business decisions using data that was previously inaccessible.