Big Data has become one of the most over-used and over-loaded terms in IT and data management in recent years. Since we focus on business intelligence for SQL Server in this blog, I think it’s fair to make the statement that many of us have had “big data” data warehouses and analytics in SQL Server for many years. Yahoo! has OLAP cubes in SSAS of many TBs, for example. I’ve personally built and managed CDR, telecom and ICA media data warehouses in SQL Server of many TBs in size. We used to call these “VLDBs” and truth be told, and there are some very specific complex configurations that you need to put in the place for SQL Server, focusing on columnar databases and in-memory cubes in SQL Server 2012 or look at the MPP SQL Server PDW. SQL Server Fast Track is another good option for large SMP scale-up data warehouses.
The difference with Big Data is that we start to look at providing analytics and insights into your organization’s data assets that are NOT in the data warehouse. Most enterprise BI systems provide maybe 10% of your overall data assets for your decision makers and BI users. The idea with Big Data is that we can additional provide 2 specific features in our data systems with SQL Server if we were to utilize the primary Big Data platform: Hadoop. These features are:
1. Optimized ETL. A key reason why your data warehouse does not contain all of the data for insights that your organization may need could be because the data sources are too varied, too large or unstructured in nature. Linux shops that take advantage of Apache’s Hadoop Big Data platform have been utilizing tools in Hadoop like Hive, Sqoop, HDFS and MapReduce for years now. Microsoft has just released a developer preview of their first Windows port of the Hadoop platform which I’ll talk about in more detail below. You can pre-process very complex large files in Hadoop MapReduce in a distributed manner and then ETL the resulting aggregated results into your SQL Server data warehouse using SSIS or the Apache Hadoop Big Data extraction tool Sqoop.
2. You can always just leave the source data in HDFS and run Hive queries against that data. This is the scenario I’ll introduce you to below. This is what I call Big Data Analytics.
The Microsoft port of Hadoop to Windows is available here in developer preview from Microsoft and Hortonworks: Microsoft HDInsight. There is a version that you can download and install locally on Windows or Windows Server. Or you can use the public cloud version of Hadoop on Azure. I downloaded and installed the local version of Hadoop on my Windows 7 laptop to get the Hive ODBC driver that I used in this demo. The distribution with install a name node and data node together on your box. That won’t give you the distributed quality that Hadoop provides you with MapReduce and HDFS. But it will give you a way of getting started on these Big Data technologies on your Windows platform and integrate them with Excel and SQL Server in a single-node configuration and includes some great examples of MapReduce, Pig and Hive.
Hive is part of the Apache Hadoop platform that allows users to build SQL-live queries using the Hive query language (HiveQL or HQL). This is where you can provide data discovery capabilities to your data scientists who want to find new patterns and answers in your organizational data assets that are not available in the aggregate or even in the detail in your data warehouse.
You can write HQL queries from the command-line for Hadoop or you can use the Hive ODBC driver for Excel in Microsoft’s HDInsights Hadoop distribution.
When you click over to the Hive Pane in Excel, you can then connect into your Hive system either on the Windows Hadoop platform or your Linux-based distribution.
Excel will present you with the list of tables available to you in Hive. These tables are built via metadata DDL statements that you use to create schemas that can be queried with data stored in HDFS.
Select your tables & columns and then execute the query. The ODBC driver will issue the Hive query out to the server and return the results from Hadoop to your Excel interface. Note the HQL query syntax which allows your users to write custom SQL-like HiveQL in the bottom-right pane.
Excel may be a more beneficial interface for your company’s analysts who wish to explore big data, or you can still use the HiveQL command-line interface. In either case, you can execute custom queries. In Excel, you are now able to bring this data into analytical tools (such as Excel or PowerPivot) for advanced Big Data Analytics from a single tool.