If you're in the database world these days, it seems that you can't make it to the water cooler without hearing the phrase "big data" used in some context. For most SQL Server database professionals, big data isn't an immediate, pressing concern. Most SQL Server professionals are busy taking care of their mission-critical OLTP databases, business intelligence data marts, and data warehouses. Big data is something that's still on the horizon. However, with all the hype about big data, it's useful to get a picture of what big data is all about, as well as explore the benefits it offers and the challenges it presents.
For the record, "big data" is the phrase typically used to refer to large amounts of unstructured data. Although their names sound a lot alike, big data is not the same as Large Object (LOB) data. LOB data is typified by things such as video or audio files, or documents that are associated with some other text and numeric relational data. For instance, an example of LOB data might be the picture of a product that's sold on a website. Sometimes LOB data is stored in the database, and sometimes it's stored in the file system.
Big Data vs. LOB Data
Big data is quite different from LOB data. Big data is often characterized by the keywords volume, variety, and velocity. Big data refers to very large amounts of data (volume), and the nature of the data could be just about anything (variety). Sometimes big data involves capturing real-time, complex event streams (velocity). Examples of big data include the extremely large volumes of data generated and stored by companies such as Google, Facebook, and Twitter. This is a different type of data storage from the structured data that makes up most OLTP applications.
The trend toward big data has come about for a variety of reasons—in part, simply because we can. High volumes of storage have become more affordable than ever, making it possible to economically collect larger amounts of data. If you have the data, you can use it to solve fundamental business problems. In addition, new technologies such as Hadoop and other open-source solutions make it possible to process this data and extract information from it in ways that open up new possibilities.
New Skill Set for SQL Server Administrators
While big data offers new possibilities, it also presents big hurdles. Creating Hadoop clusters requires an entirely different skill set from the one most Windows or SQL Server administrators have. Further, loading data into Hadoop clusters and then retrieving the data in meaningful ways also requires new skills and technologies that most SQL Server organizations don't have. Processing or querying big data implementations has traditionally been quite different from querying relational databases. For instance, with Hadoop, you might create something called a MapReduce job. MapReduce is a programming model that performs distributed computing on a Hadoop cluster: a map step turns each input record into key-value pairs, and a reduce step combines all the values that share a key. MapReduce jobs are often written in Java, but they can be written in other languages as well. The Hadoop runtime takes care of scheduling the job's execution across multiple nodes, handling the internode communication, and collecting the results. Hadoop is an entirely different animal from SQL Server, and this isn't familiar territory for SQL Server professionals.
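To make the MapReduce model a little more concrete, here is a minimal sketch of a word-count job written in plain Python. It runs in a single process and doesn't use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical and chosen only for illustration. On a real cluster, the Hadoop runtime would perform the grouping step and spread the map and reduce calls across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (a stand-in for the framework's shuffle step)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "data is data"]
    print(run_job(sample))
```

The contrast with T-SQL is worth noticing: what would be a single GROUP BY query with COUNT(*) in a relational database becomes two hand-written functions plus a grouping step that the framework supplies.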
Without a doubt, Microsoft is working toward integrating the world of big data with SQL Server. At the 2012 PASS conference, Microsoft presented several technologies designed to open up the realm of big data to SQL Server professionals. First, HDInsight provides both an on-premises Windows-based Hadoop implementation and an Azure service that gives organizations access to a big data service without the need to buy, configure, or manage the Hadoop infrastructure. Perhaps even more important, the upcoming PolyBase technology, which will be included in the next version of SQL Server Parallel Data Warehouse (PDW), will for the first time allow you to query Hadoop clusters using familiar T-SQL scripts. Similarly, Excel 2013 will be able to work directly with HDInsight data, allowing you to query big data with familiar tools.
When I talked to Doug Leland, General Manager of Product Management in the Business Platform Marketing Group for Microsoft, at PASS last year, I was struck by how much the move to embrace big data reminded me of SQL Server 7's initial support for OLAP services. Including big data support in the future release of SQL Server will definitely help businesses clear the hurdles to big data.