SQL Server BI Blog
Use SSIS for ETL from Hadoop

Use SSIS for ETL from Hadoop

HDInsight is the Microsoft version of Hadoop on Windows which provides most of the commonly-used aspects of the Apache Hadoop Big Data platform including the HDFS file system, sqoop for data import/export, Hive for SQL queries, the MapReduce distributed programming infrastructure and ODBC drivers to connect to your data in HDFS from tools like Excel and SQL Server. In this demo, I am going to show you how to use those ODBC drivers to extract data from Hadoop using HDInsight as the data store, transform it in SSIS, and load the data into SQL Server (extraction, transformation and loading or ETL).

First, you need to create a simple schema on top of your data in Hadoop. I have a simple text file that I've loaded into HDInsight (Microsoft's Hadoop on Windows) by generating a schema in Hive (using HQL) and loading that flat file into that schema. Also note that I am only loading a few simple rows into the Hive table from a flat file for this simple example:

create external table ext_sales
  lastname string,
  productid int,
  quantity int,
  sales_amount float
row format delimited fields terminated by ',' stored as textfile location '/user/makromer/hiveext/input';

LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;
Kromer	123	5	55
Smith	567	1	25
Jones	123	9	99
James	11	12	1
Johnson	456	2	2.5
Singh	456	1	3.25
Yu	123	1	11

We can view the data in Hadoop from that Hive table using the HDInsight Web user tool:

Now, in Visual Studio (SSDT, to be precise), I've created a new Integration Services project that will use that Hadoop-based data store as the source by using the ODBC driver for Hive as the source. Note that I first created a user DSN in Windows with that ODBC connector which I can than reference in any tool uses ODBC such as SSIS, Excel, etc:

I used that ODBC source to connect with HDInsight as a source, used a Derived Column tranformation to add a 7% sales tax each sales row and then add that to the total sales amount, creating 2 new fields which will be added to the data flow and inserted into the target SQL Server 2012 database, which is using the normal SQL Server destination in SSIS:

Mapping the fields, changing data types between unicode and DTS data types all are the same actions that you take in SSIS as if your were using a file or another RDBMS as a source. No different here in that SSIS just sees Hive as an ODBC source:

Now you can run the package and the rows are tranfered into SQL Server, including the new additional columns from my Derived Column transformation:

That's pretty much it. Pretty simple & straightforward. With this technique for ETL, you can use the power of Hadoop's distributed nodes and MapReduce to crunch very large complex and unstructured data and then ETL portions of that into SQL Server for other systems, analytics, etc. It also is a more natural way to import data into SQL Server from Hadoop without needing the learn or use sqoop, which should benefit data warehouse and business intelligence developers and architects who are already well-versed in tools like SSIS.

Hide comments


  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.