ETL in the Cloud with Azure

For our data warehouse and BI developers who are responsible for ETL processes, let's talk a bit about the current state of Cloud ETL for a moment.

First, I know, it's been a while for me here @ SQL Mag. I've been on a bit of journey the past couple of years through open source and Big Data land, coming back to MSFT last year and now I'll be able to bring you more frequent updates on where we stand and where things are heading in the marketplace. Promise!

Ok, that being said, if you want a bit of history on Cloud BI in general and Microsoft's Azure past ventures, take a look at my posting here on this blog from 2012. Man, those were the days ...

We've come a long way since then! With Cloud adoption taking off and Microsoft investing heavily in Azure and Cloud tech, things have moved along very fast and are much improved. I'm going to present 3 good options for ETL in the Cloud. I want to state that my current role with Microsoft is now a Data Solution Architect where I spend my days working with customers to migrate on-prem data platforms or build net-new solutions in Azure.

That being said, these are 3 common patterns that I help customers on-board and they work very well ... Also, I have a couple of background slides that you can download here that I use when I discuss this topic with customers.

1. Let's start with traditional ETL. That is, ETL platforms that have enterprise-ready features for complex data pipelines that require heavy-duty scheduling, auditing, design environments, change control, etc. etc. Of course, SSIS is a great option for this. You can spin-up a SQL Server image in Azure very quickly from the gallery here. That is the developer edition, which is free. Pick the edition of SQL that you need for SSIS in production based on your ETL needs to ensure that you have all of the features that you require in production.

Of course, that is not truly Platform as a Service (PaaS), that is what we call Infrastructure as a Service (IaaS), which means that you are manually spinning up a VM and you still need to access Visual Studio to design and manage your ETL processes either from an on-prem thick client or RDP / Citrix into Visual Studio on that VM or another VM as a development box.

In that same vein, Microsoft also offers pre-built solution templates from partners that are rolled-up VMs ready to spin-up from the Azure ISV Marketplace Gallery: https://azure.microsoft.com/en-us/marketplace/. You'll find partner solutions ready to go there like Informatica and Attunity.

2. Next, let's move away from hybrid into pure PaaS tools like Azure Data Factory. ADF is positioned as a Data Processing Service, it's in the Cloud in Azure and it is fully managed by Microsoft. All you have to do is to build out data pipelines that, general speaking, focus on data movement, but also offer some feature for data transformation. But to get into advanced transformations, you'll want to use external execution engines through ADF's "activities" that can call out to Hadoop, Machine Learning and SQL Server. ADF also has a Data Management Gateway that allows you to connect up to on-prem data sources so that you are not limited to just Cloud data sources.

Let me walk you through a very simple example using the Copy Wizard in ADF:

a. Begin by spinning-up a new Data Factory from your Azure portal

ADF in Azure Portal

b. In your new ADF, we'll use the Copy Data Wizard to build a quick and easy data pipeline that will use a custom query from an Azure SQL DB data source, modify a flag field and the update another Azure SQL DB as the destination. This is meant to mimic the data capture, transformation and data load (ETL) of a traditional DW scenario in ADF. Select Copy Data (Preview):

c. You will simply walk through the wizard, selecting Azure SQL DB as the source.

d. For in-place data copies, you can just select the table and fields from the wizard that you want to transfer. To show you a little bit more about customizing for ETL, select the "Use Query" tab.

In my case, I want to set an "Is Old" flag on my destination to 0 to indicate that these are new records from the source. So I wrote a very simple custom SELECT statement, setting 'age' to 0:

$$Text.Format('select Average_Rating, Num_of_Ratings, price, Customer_Average_Age, 0 as age from stagingtable')

When you go back into your built-out pipeline, you can see that in the JSON definition of your pipeline under sqlReaderQuery K/V.

e. Now when you reach the Destination step in the Copy Wizard, select Azure SQL DB as the sink and here is where we'll use a very simple script to update the existing rows in the target table BEFORE writing new ones. Under "Repeatability" select "custom script" and there is where I entered this to set all existing rows to "old" ... That will get executed first:

$$Text.Format('update stagingtable set isold=1')

And in the JSON for your pipeline to set that value manually, just look for the key "sqlWriterCleanupScript".

So for ADF natively, you can use these sorts of SQL commands at data capture and before data write to perform your ETL using the graphical Copy Data Wizard. You can get much more advanced than that if you use Hadoop with Hive queries or SQL Server stored procedures in more complex pipelines. Here are 2 good links with more detail on those ADF pipeline activities ... These activities open up huge opportunities for complex ELT where you are landing data in staging and transforming in the appropriate execution engine:

https://docs.microsoft.com/en-us/azure/data-factory/data-factory-hive-activity

https://docs.microsoft.com/en-us/azure/data-factory/data-factory-stored-proc-activity

And here is a link for more activities for data transformation including Pig and MapReduce

3. Lastly, you may want to perform more advanced data transformations, aggregations, mining, machine learning, etc. For that, I like to use another fully managed PaaS service called Azure Machine Learning or AML. With AML, you can build the intelligent parts of the ETL in a very nice fully-hosted browser-based UI and the orchestrate those ETL processes via scheduled ADF pipelines.

Here is an example:

a. In the Azure ML Studio (https://studio.azureml.net), let's use the sample under Experiments | Samples called "Sample 8: Apply SQL transformation". This is a good starting point for using AML for SQL transformation in an ETL workflow.

What I've also tried to demonstrate in that example screenshot above is that you write SQLlite code for the SQL Transform and there are also a number of very good out-of-the-box Data Transformation available to you as a data engineer / developer in AML. Also notice that you get very nice Web-based design environment for your experiments, similar to SSIS in Visual Studio.

b. If you want to do something more complex, like train a machine learning algorithm and use the trained model in your ETL to score data such as in scenarios for sentiment analysis or fraud detection, then you build separate experiments in the same AML environment and also call those from ADF. This is an example that I did to simple estimate the cost of a product based on existing labeled data using the out-of-the-box Linear Regression algorithm in AML. I am a data dude, not a data scientist. So you should get the idea of how easy this is in AML!

c. I took the SQL Transformation sample above and added an output step to load the transformed data into an Azure SQL DB. This way, all I have to do is call this Transform from ADF and the data will get loaded without needing inputs or outputs in my ADF pipeline. In fact, when I test this experiment from the design environment, I can move & transform data using the Input and Output steps just like when you are in Visual Studio writing an SSIS package.

d. What I've shown above is the developer environment in Azure ML Studio where you create and test your experiments. In order to operationalize this for an ETL pipeline to be called by ADF, you need to turn that into a Web Service. Click the "Set up Web Service" button at the bottom of Azure ML Studio. This will generate code to expose your transformations as a Web Service. In the case of the SQL Transform above for ETL purposes, you do not the Input or Output Web Service steps, so just remove those. We're going to grab data within the AML service and load data all in the same service.

Here is a more in-depth tutorial on turning experiments into production-ready services. There you can also learn more about where those Web Services live and how to manage them post-deployment.

e. In ADF, build a pipeline from your Data Factory and select "Author and deploy". You'll need to build a Linked Service to the AML batch service for your AML SQL Transform. Here is mine as an example ... Note that at this point in authoring ADF pipelines, we are working in JSON within the Azure Portal:

{
"name": "AzureMLLinkedService",
"properties": {
"description": "",
"hubName": "kromer_hub",
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/abcdefg/services/123455678/jobs",
"apiKey": "**********"
}
}
}

To get the API Key and endpoint for your ML Service, go to Azure ML Studio and click the Web Services global icon on the left-hand panel. Find the service that you generated and there you will see the API Key and URL to sue for the batch endpoint:

f. Follow these directions here on details for building out the rest of an ML pipeline in ADF. Here is the JSON for my ETL activity call to AML in my ADF pipeline. Note that it is calling my AzureMLLinkedService above in order to reach the ETL endpoint in ML:

"activities": [
{
"type": "AzureMLBatchScoring",
"typeProperties": {},
"inputs": [
{
"name": "AzureSqlReportTable"
}
],
"outputs": [
{
"name": "AzureSqlEventsTable"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "AzureMLScoringActivityTemplate",
"description": "My demo ETL Pipeline",
"linkedServiceName": "AzureMLLinkedService"
}
],

Comments

Plain text