Server 2008 Data Profiling

Executive Summary: The Data Profiling task is an incredibly useful, yet often overlooked tool in Microsoft SQL Server 2008. This article explains how to configure and use both the Data Profiling task and the Data Profile Viewer (DPV) to examine and assess your data, then view the results in SQL Server Integration Services (SSIS).

I’ve noticed a distressing lack of coverage of SQL Server 2008 Integration Services’ (SSIS) new data profiling feature, even though data profiling capabilities are a significant feature for data warehouse developers. Data profiling tools have been available for some time, but until now, SQL Server hasn’t natively had the capability. The relative lack of coverage of SQL Server 2008’s data profiling capability truly makes it a hidden treasure.

Data profiling is the process of examining source system data and collecting various statistics for data quality and data integration assessment. It has been a frequently requested item for SSIS—or at least the larger SQL Server business intelligence (BI) platform— for some time now. Microsoft is competing heavily in the BI industry against other BI platform providers (many of which have now consolidated) that have offered data-profiling capabilities for some time. This article shows how to set up the Data Profiling task, designate profiles, and build and view profile output in the new Data Profile Viewer utility.

Available Profiles

The SSIS Data Profiling task contains a total of eight data profiles that you can use to assess the data quality of source systems. Five of the profiles analyze individual columns in a table or view. The remaining three profiles analyze multiple columns or relationships between columns, tables, or views.

At this time, it isn’t known if we’ll be able to create custom data profiles in the release-to-manufacturing (RTM) build of SQL Server 2008.

Some of the available profiles can only profile a particular data type. For example, if you designate a column-length profile on an integer-based column, you’ll receive a design-time warning on the Data Profiling task. If you attempt to execute either an individual Data Profiling task or a package that involves an incorrectly configured Data Profiling task, you’ll get a dialog box informing you of the problem and the execution process will halt.

Column-based profiles. The Column Length Distribution profile informs you of the various lengths for the values in a string-based column and the percentage (relative to the other distinct string lengths) those lengths represent. The Column Length Distribution profile supports columns with character data types. You can use the Column Pattern profile to determine the percentage of values in a column that match a particular profile. The Column Pattern profile supports character data types. If you need to review a column’s NULL count, you can leverage the Column Null Ratio profile, which provides the count and percentage of NULL values in a given column. Because the Column Null Ratio profile is looking only for NULLs, it can analyze a column with any data type.

Next up is the Column Statistics profile, which outputs a few basic statistics for a given column based on the data type. For numeric data types, the profile returns minimum, maximum, mean, and standard deviation values. For datetime data types, the profile returns the minimum and maximum values; the new Date and Time data types aren’t supported. The Column Value Distribution profile informs you of the distinct values in a given column and the percentages those unique values represent. The Column Value Distribution profile supports columns with numeric, character, and datetime data types.

Multicolumn, multitable profiles. There are three profile types in this category. The Candidate Key profile tells you a given column’s or combined columns’ key strength, which is a column’s (or group of columns’) ability to uniquely identify individual records within a given table. The Functional Dependency profile tells you the extent to which values in one column (i.e., dependent column) depend on another column’s (i.e., determinant column’s) values. For example, ZIP code is highly dependent on state. Finally, there is the Value Inclusion profile, which informs you of the potential for a column’s or combined columns’ ability to serve as a foreign key that could enforce a data integrity relationship between two tables. Value Inclusion supports columns with the integer, character, and datetime data types.

Configuring the Data Profiling Task

As with other SSIS tasks, the Data Profiling task is registered and installed during SQL Server 2008 setup. The task’s DLL is located at C:\Program Files\Microsoft SQL Server\100\DTS\Tasks. The Data Profiling task is a SSIS feature; consequently, to use it you must first start up Business Intelligence Development Studio (BIDS). Once BIDS has been launched you will need to either create a new or open an existing SSIS project. Once you open a SSIS package and display the toolbox you will notice a new item called Data Profiling Task. Drag and drop an instance of the Data Profiling Task from the Microsoft Visual Studio toolbox onto the package’s Control Flow designer, as shown in Figure 1.

Note that SQL Server 2008 Data Profiling can analyze only SQL Server 2000 and later systems, and you must use the ADO .NET Connection Manager.

Before you configure the Data Profiling task, you’ll need to create an ADO.NET connection manager to the source system to be profiled. With the Data Profiling task on the designer surface, right-click the task and select Edit. In the editor, you can optionally designate a timeout setting to dictate the length in seconds the task can run before timing out. Next are the destination settings—this is where usage comes into play. There are two primary uses of SQL Server Data Profiling:

• performing ad-hoc formal source system assessments
• performing recurring data quality checks as part of a larger extraction, transformation, and loading (ETL) process

If you’re performing ad-hoc assessments, simply designate a file destination for your profile’s output. However, if you want to perform recurring data profiling as part of an ETL (or other) process, you’ll want to load the output of your profiles into SSIS variables. If you specify a variable destination, simply select or create a new SSIS variable. If you select a file destination, you’ll create a new File Connection Manager and— optionally—specify whether to overwrite an existing file.

Now, to designate profiles, you can either select Profile Requests in the left pane of the editor or click the Quick Profile button in the General section of the editor. The Quick Profile route lets you select multiple profiles quickly that have default settings automatically applied.

As Figure 2 shows, the Profile Requests section has a Profile Type drop-down list from which you can select one of the eight available profiles. You can optionally provide a non-default Request ID for the profile and designate the profile’s settings in the Request Properties section. Note that you can also filter the selected profiles by expanding the View drop–down menu at the top of the dialog box. If you select a profile type in the View drop-down list, the Request Properties section displays the selected instances of the chosen profile type. After you designate the profiles to be used and their settings, click OK to complete the data profiling configuration process. Notice also that the editor’s treeview pane has a section called Expressions. As with the other SSIS tasks, you can optionally assign values for the Data Profiling task properties at runtime. For example, if you wanted to supply a single Data Profiling task with different configurations at runtime, you could apply an expression to the task’s ProfileInputXML expression.

Continue on Page 2

Building and Viewing Profile Output

To build the profiler’s output, simply execute the Data Profiling task. If you designated a file destination, you can view the profiler’s output by using a tool called Data Profile Viewer (DPV) that ships with SQL Server 2008 Community Technology Preview 5 (CTP5). For now, DPV is a standalone executable that analyzes the output of the Data Profiling task. BIDS will probably include support for launching the tool in the final release. DPV doesn’t automatically appear on the Windows Start menu as of CTP5, so I recommend creating a shortcut to the executable. The file is located at C: Program Files\Microsoft SQL Server\100\DTS\Binn and is called DataProfileViewer.exe.

After you launch DPV, you can open a profiler output .xml file by clicking Open, the only command on the menu bar, and navigating to the desired file. This file’s content will have been populated from a prior execution of a Data Profiling task. Figure 3 depicts DPV with output from a Column Value Distribution profile.

As you can see in Figure 3, the DPV utility has two main sections. You use the left pane to choose which profile to review. The executed profiles are arranged by the type of database objects they were applied against. The right side of the DPV window shows you the details of the selected profile. The output of each of the various profiles is unique to that profile, so you’ll see different output in the details pane for each profile you select.

The DataProfile.xsd Schema

The Data Profiling task output adheres to the Data- Profile.xsd schema, which you can see at schemas .microsoft.com/sqlserver/2008/DataDebugger/Data Profile.xsd. The schema has three core sections: data sources, profile input, and profile output. The data sources section contains all the data sources used by the Data Profiling task. The profile input section is where each selected profile is stored, along with the tables, views, and columns it’s applied to. The profile output section contains the results of each profile selected.

A Data Profiling Example

You now have a solid foundation in SQL Server 2008 data profiling. Let’s run through a sample of the feature in a test environment. Suppose that the AdventureWorks Company has charged you with building a new formal data warehouse system to support organizational decision making. As one of your first steps, you’ll need to assess the data quality of the AdventureWorks OLTP database. In this example, you’ll assess the data quality of the Production.Product table. Note that the AdventureWorks sample databases are no longer included as part of the SQL Server setup; the SQL Server sample databases have been relocated to the Microsoft CodePlex portal (www.codeplex.com/ MSFTDBProdSamples).

First, create a new SSIS package and drag an instance of the Data Profiling task onto the Control Flow designer. Next, create the ADO.NET connection manager to a local AdventureWorks OLTP database. Now configure the Data Profiling task. Because you’re performing an ad-hoc analysis, send the output to a local XML file called AW_ProductTableAnalysis.xml on the desktop.

Now you need to select individual profiles to be run against the Production.Product table. Use the quick profile option (you can assume the default settings for these samples). I don’t explicitly walk you though the Quick Profile dialog box, which Figure 4 shows, but it’s quite simple.

The Quick Profile dialog box configures the profiles to use all columns for the selected table or view. To review, click OK in the Quick Profile dialog box, then select Profile Requests in the left pane of the tasks editor. You can browse through the various profiles and review the default settings the Quick Profile process assigns. Click OK to complete the Data Profiling task configuration. You can’t select a Value Inclusion profile with the Quick Profile option; you need to use the Profile Requests section of the editor.

Now execute the package by pressing F5. The Data Profiling task turns green and a file containing the output from the profile operation is created on the desktop. Finally, use DPV to review the output of the Data Profiling task. First, launch the DPV, then click Open and select the new AW_ProductTableAnalysis .xml file. Figure 5 shows the task’s output in the viewer without any individual profile selected. Notice that because you only ran column-level profiles, you must navigate to an individual column before you see any of the profiles you ran in the results pane. You can now see the length, values, NULL ratio, and statistics for the various columns in the Production. Product table!

A Long-Overdue Feature

I don’t know if Microsoft and the Kimball Group (www.kimballgroup.com) have some form of partnership or not; however, the SQL Server BI stack continues to incorporate more and more concepts from Kimball’s formalized data warehouse lifecycle, and data profiling is a good example. Data profiling is a formal part of Kimball’s data warehousing methodology, and beyond Kimball, it’s a fairly common task for a data warehouse developer.

Data profiling was a much overdue feature in the SQL Server BI stack. For a first build, the feature is fairly impressive. Probably the biggest complaint will be that it can currently run only on SQL Server 2000 or later, but I expect this situation to change over time. In addition, I expect to see the ability to build your own custom profiles and plug them into the larger data-profiling framework. SQL Server 2008 has several key BI enhancements, and the SSIS data-profiling capability is definitely one such feature.

Comments

Plain text