Hadoop is an open source software project designed for "reliable, scalable, distributed computing." The Hadoop ecosystem also includes a database that "supports structured data storage for large tables," a data warehouse infrastructure that provides "data summarization and ad hoc querying," an ETL framework, and a data mining library. That sounds a lot like a data management platform built by Microsoft!
But though they may seem like competitive technologies, there is an opportunity to combine the strengths of Hadoop with SQL Server to make a data professional's job easier—and more efficient. I'm going to walk you through a real-world example that highlights how to combine these two technologies.
The Case for Hadoop—and Hive
In late 2011, Dr. David Dewitt presented a Big Data keynote session, focused primarily on Hadoop, at the Professional Association for SQL Server (PASS) Summit event. He likened Hadoop to a NoSQL Data Warehouse—providing a scalable fault tolerant framework for storing and processing massive data sets. As an aside, while Hadoop has grown and changed quite a bit since this time, the video is still a great introduction for anyone who wants to learn more about Hadoop's key components and architecture (see the Resources section to view the session on-demand).
At 68 minutes into his presentation, Dr. Dewitt proceeded to give an overview of Hive Query Language (HiveQL, or Hive)—a SQL-like language originally developed by Facebook to wrap around Hadoop's complex and developer-intensive MapReduce framework. A few facts were shared in terms of Facebook's internal usage of MapReduce and Hive:
- ". . . MapReduce was not easy for end users. . ."
- ". . .users ended up spending hours (if not days) to write [MapReduce] programs for even simple analyses"
- "Of the 150K jobs Facebook runs daily, only 500 are MapReduce jobs. The rest are in HiveQL. . ."
Translation: Facebook used a SQL language 99.67% of the time. My initial reaction after learning these facts was to forget about Hadoop! But not so fast. Hadoop offers compelling capabilities, even when a SQL language is being used. These capabilities include parallel processing across commodity hardware, minimal to no ETL, and the ability to define a schema over data on-demand (i.e. schema on read). For more information, see Hadoop Basics.
Choosing a Hadoop Distribution
As an open source project, Hadoop is freely available for download directly from the Hadoop home page. Most organizations, however, use a Hadoop distribution from a company that provides support and/or value added components on top of the core Hadoop installation. Cloudera, MapR, and Hortonworks are example of popular distributions. In my examples, I'm using a distribution of Hadoop based on the Hortonworks Data Platform (HDP).
Microsoft and Hortonworks began a partnership a few years ago to bring Hadoop to Windows, and that partnership now includes Hadoop that can run both on-premise and the cloud. I should also note that Microsoft has put a good deal of effort into the functionality and performance optimization of Hive (see the Stinger Initiative link in the Resource section below for more information).
If you want to keep things on-premise, you can download and install HDP on a machine (or set of machines) running Windows Server 2008/2012. Or, you can download the HDP Sandbox—a preconfigured single-node virtual machine (VM) meant for evaluation/testing purposes. Note that the Sandbox image uses Linux, but it will run on Microsoft Hyper-V (in addition to Oracle VirtualBox and VMWare). Based on my own experiences, I recommend getting started with the Sandbox image. Or, you can leverage the cloud to get Hadoop up and running in minutes.
HDinsight: Hadoop Service That's Part of Azure
HDInsight is a Hadoop service that is part of Microsoft Azure (Microsoft Azure is a collection of cloud-based services), see Figure 1.
Based on HDP, HDInsight allows a customers to quickly spin up a cluster, do work, and then delete the cluster when finished. Since you only pay for services while the cluster is running, it is a very cost effective approach to Hadoop. Additionally, data storage is decoupled from the cluster (via Azure Storage), so you don't have to re-upload data when a cluster is restarted. Data storage in Azure is very cost effective—less than $25 a month per Terabyte.
To follow along with the samples in this article, let me point you to two free resources to walk you through the installation specific steps for Hadoop:
- HDInsight. Please refer to Chapter 2 of the free Ebook, Introducing Microsoft Azure HDInsight.
- HDP Sandbox with Hyper-V. After downloading the Sandbox image, you can use the install guide. Note that the steps around configuring Internet Connection Sharing are optional (the policies on my laptop did not allow sharing to be enabled).
A Sample Use Case – Job Growth in the United States
Let's say we want to analyze job growth statistics across the United States. There is a wealth of publicly available information at both the U.S. Bureau of Labor Statistics (BLS) and the U.S. Census Bureau. The Census Bureau maintains a large and very detailed Quarterly Workforce Indicator (QWI) dataset going back almost twenty years. Partitioned by state (and optionally by county), there are different QWI files that provide employment data by varying factors (e.g., employee gender and age, firm size and age, etc.). See Figure 2.
For example purposes, we'll download data for three states - California, Nebraska, and Texas. See the Loading QWI data form the U.S. Census Bureau into Hadoop step-by-step instructions to download the files into a Hadoop cluster. Note that the data for each state is stored in a compressed .gz file; each compressed file contains a single .csv file.
In order to analyze this dataset, we need to somehow merge these files before we query them. Using a relational data warehouse approach, we would:
- define a table in SQL Server representing the schema in the .csv file, and then
- design an ETL package to decompress each .gz file - and load the .csv file contents into the new relational table.
While not terribly difficult, this does require a bit of development effort—and upfront processing time. And while the amount of data for all 50 states isn't very large (around 33 GB of uncompressed data), we may at some point need to pull in additional QWI files to analyze data by different characteristics. The same dataset at a county level is around 130 GB of uncompressed data. We could easily end up loading terabytes of data to pull in all possible permutations of the workforce data. But, what if we could avoid the ETL effort, drop the compressed files into a folder, and query them as needed? Enter Hadoop.
Creating a Table Definition in Hadoop
After the three QWI files have been copied to a Hadoop folder (or a container in the case of HDInsight), we need to create a table definition in Hive. A Hive editor (similar to SQL Server Management Studio, or SSMS) may be launched from the command line, through Hue, or from the Manage Cluster link in the HDInsight Dashboard (see Figure 3).
Listing 1 shows the syntax needed to create an External Table in Hive. Notice the location of the external table points to a folder/container—rather than an individual file.
-- In Hive, Create a Database CREATE DATABASE CensusDB; -- Use this Database USE CensusDB; -- Create an external table definition CREATE EXTERNAL TABLE qwi2 ( periodicity varchar(256) COMMENT 'Periodicity of report', seasonadj varchar(256) COMMENT 'Seasonal Adjustment Indicator', geo_level varchar(256) COMMENT 'Group: Geographic level of aggregation', geography varchar(256) COMMENT 'Group: Geography code', ind_level varchar(256) COMMENT 'Group: Industry level of aggregation', industry varchar(256) COMMENT 'Group: Industry code', ownercode varchar(256) COMMENT 'Group: Ownership group code', sex varchar(256) COMMENT 'Group: Gender code', agegrp varchar(256) COMMENT 'Group: Age group code (WIA)', race varchar(256) COMMENT 'Group: race', ethnicity varchar(256) COMMENT 'Group: ethnicity', education varchar(256) COMMENT 'Group: education', firmage varchar(256) COMMENT 'Group: Firm Age group', firmsize varchar(256) COMMENT 'Group: Firm Size group', year int COMMENT 'Time: Year', quarter int COMMENT 'Time: Quarter', Emp int COMMENT 'Employment: Counts', EmpEnd int COMMENT 'Employment end-of-quarter: Counts', EmpS int COMMENT 'Employment stable jobs: Counts', EmpTotal int COMMENT 'Employment reference quarter: Counts', EmpSpv int COMMENT 'Employment stable jobs - previous quarter: Counts', HirA int COMMENT 'Hires All: Counts', HirN int COMMENT 'Hires New: Counts', HirR int COMMENT 'Hires Recalls: Counts', Sep int COMMENT 'Separations: Counts', HirAEnd int COMMENT 'End-of-quarter hires', SepBeg int COMMENT 'Beginning-of-quarter separations', HirAEndRepl int COMMENT 'Replacement hires', HirAEndR int COMMENT 'End-of-quarter hiring rate', SepBegR int COMMENT 'Beginning-of-quarter separation rate', HirAEndReplR int COMMENT 'Replacement hiring rate', HirAS int COMMENT 'Hires All stable jobs: Counts', HirNS int COMMENT 'Hires New stable jobs: Counts', SepS int COMMENT 'Separations stable jobs: Counts', SepSnx int COMMENT 'Separations stable jobs - next quarter: Counts', TurnOvrS int COMMENT 'Turnover stable jobs: Ratio', FrmJbGn int COMMENT 'Firm Job Gains: Counts', FrmJbLs int COMMENT 'Firm Job Loss: Counts', FrmJbC int COMMENT 'Firm jobs change: Net Change', FrmJbGnS int COMMENT 'Firm Gain stable jobs: Counts', FrmJbLsS int COMMENT 'Firm Loss stable jobs: Counts', FrmJbCS int COMMENT 'Firm stable jobs change: Net Change', EarnS int COMMENT 'Employees stable jobs: Average monthly earnings', EarnBeg int COMMENT 'Employees beginning-of-quarter : Average monthly earnings', EarnHirAS int COMMENT 'Hires All stable jobs: Average monthly earnings', EarnHirNS int COMMENT 'Hires New stable jobs: Average monthly earnings', EarnSepS int COMMENT 'Separations stable jobs: Average monthly earnings', Payroll int COMMENT 'Total quarterly payroll: Sum', sEmp int COMMENT 'Status: Employment: Counts', sEmpEnd int COMMENT 'Status: Employment end-of-quarter: Counts', sEmpS int COMMENT 'Status: Employment stable jobs: Counts', sEmpTotal int COMMENT 'Status: Employment reference quarter: Counts', sEmpSpv int COMMENT 'Status: Employment stable jobs - previous quarter: Counts', sHirA int COMMENT 'Status: Hires All: Counts', sHirN int COMMENT 'Status: Hires New: Counts', sHirR int COMMENT 'Status: Hires Recalls: Counts', sSep int COMMENT 'Status: Separations: Counts', sHirAEnd int COMMENT 'Status: End-of-quarter hires', sSepBeg int COMMENT 'Status: Beginning-of-quarter separations', sHirAEndRepl int COMMENT 'Status: Replacement hires', sHirAEndR int COMMENT 'Status: End-of-quarter hiring rate', sSepBegR int COMMENT 'Status: Beginning-of-quarter separation rate', sHirAEndReplR int COMMENT 'Status: Replacement hiring rate', sHirAS int COMMENT 'Status: Hires All stable jobs: Counts', sHirNS int COMMENT 'Status: Hires New stable jobs: Counts', sSepS int COMMENT 'Status: Separations stable jobs: Counts', sSepSnx int COMMENT 'Status: Separations stable jobs - next quarter: Counts', sTurnOvrS int COMMENT 'Status: Turnover stable jobs: Ratio', sFrmJbGn int COMMENT 'Status: Firm Job Gains: Counts', sFrmJbLs int COMMENT 'Status: Firm Job Loss: Counts', sFrmJbC int COMMENT 'Status: Firm jobs change: Net Change', sFrmJbGnS int COMMENT 'Status: Firm Gain stable jobs: Counts', sFrmJbLsS int COMMENT 'Status: Firm Loss stable jobs: Counts', sFrmJbCS int COMMENT 'Status: Firm stable jobs change: Net Change', sEarnS int COMMENT 'Status: Employees stable jobs: Average monthly earnings', sEarnBeg int COMMENT 'Status: Employees beginning-of-quarter : Average monthly earnings', sEarnHirAS int COMMENT 'Status: Hires All stable jobs: Average monthly earnings', sEarnHirNS int COMMENT 'Status: Hires New stable jobs: Average monthly earnings', sEarnSepS int COMMENT 'Status: Separations stable jobs: Average monthly earnings', sPayroll int COMMENT 'Status: Total quarterly payroll: Sum' ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/hadoop/censusqwi'; -- If using HDInsight, you can replace the LOCATION value with a pointer to an Azure container e.g., 'wasb://firstname.lastname@example.org/'; -- Verify table DESCRIBE qwi2; -- Sample Query - Average earnings by Year SELECT Year, Avg(EarnS) FROM qwi2 GROUP BY Year Order BY Year; -- Note, to Query a table using the Hive ODBC driver, make sure you grant select permissions on the table GRANT SELECT ON TABLE qwi2 to user hadaoop;
We can now query the table, in the Hive Editor, using HiveQL. Figure 4 shows a query that returns average monthly earnings by year. Notice the query syntax is equivalent to what it would look like using T-SQL. On the fly, Hive merges the three files together (and decompresses them), and uses the Table definition to extract/filter the file data. Hive natively supports several file formats (e.g., .csv, tab delimited) that can optionally be compressed using algorithms like Zip and Gzip. For non-standard file formats, Hive supports a plug-in type architecture to read custom file and record formats.
While the performance of this query is unlikely to match a well-tuned relational data warehouse, we've eliminated a lot of up-front work. Now, let's talk about how to integrate Hive with SQL Server. Also, see Improving Hive Performance.
Using the Hive ODBC Driver
Microsoft provides a Hive ODBC Driver which can be freely downloaded (see the Resource link). Part of the installation includes a very detailed "Microsoft Hive ODBC Driver Install Guide" pdf document. After the installation is complete, you can configure a Data Source Name (DSN) configuration using the Windows ODBC Data Source Administration client. Note that a sample System DSN, named "Sample Microsoft Hive DSN" is created during the installation. Figure 5 shows the configuration details for a System DSN.
In the Advanced Options dialog box (Figure 6), I selected the "Use Native Query" option to disable conversion of SQL specific syntax (e.g., TOP N) into an equivalent HiveQL syntax (e.g., LIMIT N).
With an ODBC connection established, you can query Hadoop directly from a client tool like Excel PowerPivot—as shown in Figure 7.
Let me point out a few quirks I had to work through. First of all, make sure you grant select privileges for the account used to connect to Hive (see the end of Listing 1); even though this wasn't needed to run queries in the Hive editor, I had to do this when using an ODBC connection. Secondly, I found the connector to be a bit finicky in terms of storing the password, which would result in a 10430 error code ("Not enough information provided to establish a connection to the data source and specified to not prompt for more information"). To work around this issue, I selected the 2nd option (Use Connection string) in the Connection tab of the Data Link Properties, as shown in Figure 8.
This approach provides two opportunities to enter the password (once when selecting the DSN name, and then again in the Connection tab itself). Finally, I was not able to use the PowerPivot graphical interface to select and filter tables; I had to choose the "Write a query that will specify the data to import" option.
Setting up a Linked Server Definition in SQL Server
With the ODBC driver installed, and a DSN configured, we can setup a linked server within SQL Server to point to a Hive database. A linked server allows a user to query Hive—without having to download/configure the Hive ODBC driver on their machine. It's also useful when a particular technology (e.g., multidimensional SQL Server Analysis Services) supports SQL Server, but not Hive, as a data source.
The syntax to create a Linked Server is as follows, where the datasrc parameter maps to a DSN entry:
EXEC master.dbo.sp_addlinkedserver @server = N'HiveTest', @srvproduct=N'HIVE', @provider=N'MSDASQL', @datasrc=N'HiveTest_LocalHive', @provstr=N'Provider=MSDASQL.1;Persist Security Info=True;User ID=hadoop; Password=xxxxxx;'
After the statement completes, you'll see a Linked Server entry in SSMS, as shown in Figure 9.
In terms of querying the linked server, let me point out a few areas that cost me several hours of time. First off, don't forget to set the security context for the connection. Using SSMS, you can right-click on the Linked Server icon (HiveTest in Figure 12), select Properties, and then (in the Security page) enter a remote login/password.
Secondly, I was not able to query a Hive Table using the four-part naming notation e.g., SELECT * FROM [HiveTest].[Hive].[censusdb].[qwi2]. Instead, I used OpenQuery, as shown in the following example:
SELECT * FROM OpenQuery(HiveTest, 'SELECT firmsize, Year FROM censusdb.qwi2 WHERE YEAR IS NOT NULL limit 100');
Finally, the ODBC driver doesn't seem to handle string data very well; queries may fail and return a conversion-related error. I was able to avoid this problem by changing the Default string column length value (in the Advanced Options dialog box of the ODBC driver, see Figure 9) to 8000.
Sometimes a data storage platform comes along with much fanfare, only to wither away or be subsumed by the relational database (e.g., Object Oriented databases). Other technologies live through the hype, and graduate to become full-fledged platforms of their own—like Online Analytical Processing (OLAP) databases. Hadoop is likely here to stay. As database professionals, we need to be ready to exploit this technology—and understand when and how it should be integrated with SQL Server.