Big Data is one of the most important growing trends in the database industry. Much like business intelligence (BI), Big Data is important because it allows an organization to derive decision-making information from new data sources. Unlike BI, Big Data makes it possible to use data sources, including unstructured data, that weren't previously available for decision making.
In my recent interview with David Campbell, Technical Fellow at Microsoft, Campbell pointed out some of the driving forces behind the emergence of Big Data, including the declining cost of storage and the availability of larger disk drives. One of the unwritten laws of storage is that you can always find a way to fill any and all unused storage, and Big Data can certainly do that. However, just as important as the availability of greater storage capacity is the fact that the cost of data acquisition has declined. Campbell noted that a decade ago, most data was produced by manual entry. In other words, somebody typed it into an application. Because this was a manual process, the cost of acquiring that data was high. Today, with the proliferation of devices and near-universal connectivity to the Internet, data acquisition can be more easily automated, which drastically reduces the cost of acquiring data.
Another factor driving the growth of Big Data is the emergence of new technologies such as Hadoop that enable organizations to process and analyze large amounts of unstructured data in ways that weren't previously possible. With Hadoop, a mainframe or supercomputer isn't required to wade through all the data. Instead, you can parcel the processing tasks out to a number of standard x86 compute nodes, where each node processes a portion of the query and the partial results are then combined. You can think of Hadoop as data warehousing for unstructured data. Microsoft recently released a Windows version of Hadoop called the HDInsight Server.
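To make the divide-and-combine idea concrete, here is a minimal Python sketch of the pattern. It isn't Hadoop code and doesn't use any Hadoop API. The function names, the sample log lines, and the two-node split are all assumptions made for illustration, and the "nodes" are simply list slices processed one after another in a single process.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(records, node_count):
    """Deal the records out round-robin, one slice per (simulated) node."""
    return [records[i::node_count] for i in range(node_count)]


def map_partition(lines):
    """Map step: a node counts the words in its own slice of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


# Hypothetical unstructured input: a handful of raw log lines.
log_lines = [
    "error disk full",
    "warning disk slow",
    "error network down",
    "info disk ok",
]

partitions = split_into_partitions(log_lines, node_count=2)
partial_results = [map_partition(p) for p in partitions]  # one result per node
total = reduce(merge_counts, partial_results, Counter())

print(total["disk"], total["error"])  # prints "3 2"
```

In a real cluster, each slice would live on a different machine and the map step would run where the data is stored. The sketch shows only the shape of the computation.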
Upcoming PolyBase Technology
Right now, Big Data and SQL Server are different islands of computing. SQL Server uses T-SQL to process queries over its relational databases, whereas Hadoop uses MapReduce to run jobs over data stored in the Hadoop Distributed File System (HDFS). You can transfer data between Hadoop and SQL Server using the Microsoft SQL Server Connector for Apache Hadoop. Microsoft's upcoming PolyBase technology will provide a bridge to Big Data from SQL Server. PolyBase will initially be released with the Parallel Data Warehouse (PDW). PolyBase allows SQL Server T-SQL queries to run against data stored in a Hadoop cluster; the data is returned as standard SQL results. PolyBase also permits queries to reference data stored in HDFS as if the data were in a relational table. In addition, PolyBase can perform joins between tables in the PDW and data in HDFS.
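The idea of querying file data as if it were a relational table, and joining it to a real table, can be illustrated with a small Python analogy. This is not PolyBase and it says nothing about PolyBase's actual syntax or internals. It uses SQLite as a stand-in for the relational engine and an in-memory string as a stand-in for a delimited file in HDFS. The table names, column names, and sample rows are all hypothetical. The sketch also copies the file rows into a temporary table, which is a simplification.

```python
import csv
import io
import sqlite3

# Stand-in for a delimited file stored in HDFS (hypothetical sample data).
hdfs_file = io.StringIO(
    "customer_id,page\n"
    "1,/home\n"
    "2,/pricing\n"
    "1,/checkout\n"
)

# Stand-in for the relational side: an ordinary table of customers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Contoso"), (2, "Fabrikam")],
)

# Present the file's rows to the SQL engine as a table named clickstream.
conn.execute("CREATE TEMP TABLE clickstream (customer_id INTEGER, page TEXT)")
file_rows = [(int(r["customer_id"]), r["page"]) for r in csv.DictReader(hdfs_file)]
conn.executemany("INSERT INTO clickstream VALUES (?, ?)", file_rows)

# One SQL query joins relational data with the file-based data.
result = conn.execute(
    """
    SELECT c.name, COUNT(*) AS page_views
    FROM customers AS c
    JOIN clickstream AS k ON k.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()

print(result)  # [('Contoso', 2), ('Fabrikam', 1)]
```

The point of the analogy is that the person writing the query works only in SQL. Where the clickstream rows physically live is hidden behind the table name.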
Although Big Data is defined by large volumes of data and high processing power, getting started with Big Data doesn't always require a huge investment in additional infrastructure. You don't necessarily have to go out and buy a lot of new servers and new storage. Cloud services such as Windows Azure HDInsight enable your organization to implement Hadoop clusters in the cloud, allowing you to pay for the storage and processing power you need, without the need to buy additional hardware. Windows Azure HDInsight provides a good way to gain experience with Big Data without spending a lot of cash.
Big Data Is Not Going Away
There's no doubt that, like BI, Big Data is a technology that's not going away. However, Big Data isn't a replacement for relational databases. Relational databases will continue to support an organization's core mission-critical applications. Like BI, Big Data will open up new business insights, enabling organizations to make better business decisions.