SQL Server 2008 has a new feature called Change Data Capture (CDC), a component of the relational database engine that provides change data for consuming applications. CDC populates change tables by asynchronously reading a given database’s transaction log. CDC’s most notable use is for datawarehousing apps in which you need to detect and process changed data on a recurring basis in extraction, transformation, and loading (ETL) processes. With CDC, you can detect changed data in source OLTP systems that run on SQL Server 2008 without needing to implement expensive custom solutions such as timestamps, flags, triggers, and other one-off solutions for detecting changed data.
To begin learning about CDC, let’s enable CDC for a particular database. We’ll then designate the source tables to have their information captured, and thereafter any changed data will be recorded accordingly. To use CDC, you must have SQL Server 2008 Enterprise, Developer, or Evaluation editions.
What CDC Offers
When you enable a source table for CDC, you’re effectively creating a capture instance. You can have as many as two capture instances per source table. The actual capture process is implemented with SQL Server agent jobs calling other routines. Any change data detected is then inserted into a separate change table per table enabled. Once the change data has been made available for consumption purposes, a consumer (typically an ETL process) will query the change data using system-generated functions by either log sequence number (LSN) or time ranges. The generated CDC functions provide the change data in an easy-to-consume query format.
You might be wondering when the change tables are purged. After all, if the change tables are never purged, they can grow to huge proportions. By default, CDC purges data from the change data tables after it’s three days old. Every change table defaults to this three-day purging strategy, which you might not always consider optimal. If you need to deviate from the default purging strategy, system procedures are available for you to do so. If you back up and restore a CDC-enabled database, the restored database doesn’t have CDC enabled by default.
Finally, CDC offers the ability to capture and propagate Data Definition Language (DDL) changes. By default, CDC doesn’t propagate source-table DDL changes to its corresponding change table. The only exception to this rule is when you alter the underlying data type of a particular column in the source table; the change table will automatically have its corresponding column type changed. Removed columns simply get NULL values in their change table, and added columns are ignored in the change table.
To enable true DDL propagation, you need to create a second capture instance of the source table after the DDL statements are issued. This procedure will begin capturing change data using the new table schema. You will then need to update any consuming applications to use the new capture instance.
To configure CDC, you use T-SQL with a SQL Server 2008 instance. We’ll be using the AdventureWorks 2008 OLTP sample database for our configuration section. If you don’t have a copy of the AdventureWorks OLTP sample database, you can download a copy at http://www.codeplex.com/MSFTDBProdSamples. Also, you’ll need SQL Server Agent running.
The first step is to enable a given database for CDC. To do so, you use the sys.sp_cdc_enable_db system procedure. This procedure creates all of CDC’s related objects, jobs, and so on in the current database, as follows:
USE \[AdventureWorks2008\]; GO EXEC sys.sp_cdc_enable_db;
To determine whether a database has CDC enabled, we can interrogate the sys.databases catalog view. A new column called is_cdc_enabled in the sys.databases view is either a 0 (not enabled) or a 1 (is enabled). The following query determines whether the AdventureWorks sample database has been enabled for CDC:
USE AdventureWorks2008; GO SELECT \[is_cdc_enabled\] FROM sys.databases WHERE \[name\] = ‘AdventureWorks2008’
Next, we need to create our capture instance(s) by using the sys.sp_cdc_enable_table_change_data_capture system procedure. This system stored procedure enables CDC on a given source table. The procedure that Web Listing 1 shows creates the associated change tables and systemgenerated change-data query functions.
To determine whether a given table is tracked by CDC, you can interrogate the sys.tables catalog view. A new column called is_tracked_by_cdc in the sys.tables view contains either a 0 (not tracked) or a 1 (is tracked). The following query determines whether CDC has tracked the Sales.SpecialOffer table:
USE AdventureWorks2008; GO SELECT \[is_tracked_by_cdc\] FROM sys.tables WHERE \[name\] = ‘SpecialOffer’
Create the CDC-Consuming SSIS Package
Once you’ve configured CDC, you can create a CDCconsuming consuming SSIS package (i.e., one that consumes the change data). I disagree with the current SQL Server Books Online (BOL) documentation, which states that CDC is a new SSIS technology. CDC is a new feature of the relational database engine: Any tool that can make a connection to SQL Server 2008 can use the exposed change data.
First, let’s fire up Business Intelligence Development Studio (BIDS), which is accessible via Start, All Programs, Microsoft SQL Server 2008, SQL Server Business Intelligence Development Studio. Select File, New, Project from the BIDS menu. Next, choose the Integration Services Project and type ssis_CDC_Sample in the Name textbox, as Figure 1 shows. Click OK to create the new SSIS project.
Next, we need to set up our package before adding Control Flow components. Select View, Other Windows, Variables to display the Variables window. Use the Add Variable toolbar command to create the variables that you see in Web Table 1. Enter the following T-SQL code for the value of the SqlDataQuery variable:
SELECT \[SpecialOfferID\],\[Description\], \[DiscountPct\],\[Type\],\[Category\], \November 24, 2008 12:00 AM,\[EndDate\],\[MinQty\], \[MaxQty\],\[CDC_OPERATION\] FROM udf_Sales_SpecialOffer(null,null)
We need to create the package’s connection managers. Right-click in the Connection Managers pane (located beneath the Control Flow designer's surface), and select the New OLEDB Connection context-menu option. Create two new local OLEDB connections; the first should reference the AdventureWorks 2008 database, and the second should reference the Adventure- Works DW 2008 database. Name the new connection managers AdventureWorks and AdventureWorksDW, respectively.
Now that the package is set up, we can delve into the actual Control Flow elements. First, copy an instance of the Execute SQL Task onto the Control Flow designer's surface and rename the task Calculate Interval. This task will serve as the entry step toward obtaining a datetime range that the system will use to determine and process any change data within the range. Right-click and select the Edit option on the task. Configure the Calculate Interval task as Web Table 2 shows. Configure the result sets for the Calculate Interval task as Web Table 3 shows.
Click OK to complete the Calculate Interval task. Next, we need to place an instance of the For Loop container onto the Control Flow surface. Rename the For Loop container to Wait for Change Data. This For Loop container will serve as the looping mechanism when change data isn’t yet ready. CDC is an asynchronous process; sometimes, no change data will be available to be captured. Also, note that in my design we use a date interval of six minutes ago to one minute ago, so all change data we attempt to process will be at least a full minute old. Edit the Wait for Change Data container as Web Table 4 shows.
Click OK to complete the Wait for Change Data container. At this point, create a default success constraint between the Calculate Interval and the Wait for Change data tasks. Your package’s Control Flow should now look like Figure 2.
Next, drag an instance of the Execute SQL task inside the Wait for Change Data container. Rename the task Check for Data. This task will poll the cdc .lsn_time_mapping table to determine whether change data is ready. Configure the Check for Data task as Web Table 5 shows. Enter Web Listing 2’s T-SQL code for the SQLStatement property, and refer to Web Table 6 for the task’s parameter mappings.
Click OK to complete the Check for Data task. Next, drag an instance of the Script Task to the Control Flow designer's surface inside the Wait for Change Data container. Rename the task to Delay in Seconds. This is the task used for delaying package execution for the case when the change data isn’t yet ready for the calculated interval. Configure the Delay in Seconds task as Web Table 7 shows, using the code in Web Listing 3.
Next, we need to create a Control Flow constraint between the Check for Data and Delay in Seconds tasks. Click the Check for Data task and connect the task constraint to the Delay in Seconds script task. Now, double-click the constraint and configure it as Web Table 8 shows. Click OK to complete the constraint's configuration. Your package should now look like Figure 3.
Drag another instance of the Script Task to the Control Flow surface inside the For Loop container. Rename the Script Task Log Extract Error. This task serves as the point in the package’s Control Flow at which we log an error and stop package execution. This task will execute when no change data is present for our calculated interval. One possible design alternative would be to wrap this entire package’s functionality into a second For Loop container, then simply continue polling for change data instead of failing the package. You could insert a Script Task or Execute SQL Task to delay package execution for five minutes per outer iteration. Configure the Log Extract Error script task as Web Table 9 shows, using the script in Web Listing 4.
Next, we need to create a Control Flow constraint between the Check for Data and Log Extract Error tasks. Click the Check for Data task and connect the task constraint to the Log Extract Error script task. Now, double-click the constraint and configure it as Web Table 10 shows. Click OK to complete the constraint's configuration. We’ve now completed the work that needs to occur inside the Check for Data For Loop container.
Open SQL Server Management Studio (SSMS) and create a new query window connected to the Adventure- Works OLTP sample database. Execute Web Listing 5’s T-SQL code to create our custom user-defined function (UDF); this function gets called from the next task in our SSIS Package. This function returns our example’s change data by using the Net Changes function for a given time period. Notice that the function converts the input parameters for the time to their equivalent LSNs.
Confirm that the function has been created, and clear the query window’s contents by pressing Ctrl+Shift+Delete. (We’ll need to use SSMS again a bit later.) Now, go back to our SSIS package in BIDS and drag another instance of the Execute SQL Task onto the Control Flow. Rename the new Execute SQL task Prepare Query, Use UDF. This task will create the actual T-SQL query for obtaining the change data using our calculated datetime intervals. Configure the Prepare Query, Use UDF task as Web Table 11 shows, and enter Web Listing 6’s T-SQL code for the SQLStatement property. Configure Prepare Query, Use UDF parameters as Web Table 12 shows. Finally, configure the Prepare Query, Use UDF result set as Web Table 13 shows..
Next, create a success constraint between the Check for Data For Loop container and the Prepare Query, Use UDF task. Finally, drag an instance of the Data Flow task onto the Control Flow designer's surface. Rename the task Extract and Load Change Data. Create a basic success constraint between the Prepare Query task and the Extract and Load Change Data task. With the package’s Control Flow complete, it should look like Figure 4. Next, we’ll configure the Data Flow contained within the Extract and Load Change Data task.
Create the Data Flow
Right-click the Extract and Load Change Data Data Flow task, and select Edit. Doing so will open up the Data Flow task’s corresponding Data Flow designer's surface. Drag an instance of the OLEDB source from the toolbox to the Data Flow designer's surface. Rename the source component Change Data. Configure the Change Data source component as Web Table 14 shows.
Click OK to complete the OLEDB source component’s configuration. Next, copy an instance of the conditional split transformation onto the Data Flow designer's surface. Rename the conditional split transformation Filter Based on Operation. This transformation will redirect each row based on the CDC_OPERATION column that our custom UDF returns. Configure the conditional split transformation as Web Table 15 shows.
Click OK to complete the Filter Based on Operation configuration. Next, drag an instance of the OLEDB Destination component to the Data Flow designer's surface and connect the Data Flow from the Filter Based on Operation transformation to the new OLEDB destination. The source Data Flow should be Inserted Records. Rename the component Process Inserts. Configure the Process Inserts destination component as Web Table 16 shows. Web Table 17 shows the task's column mappings. (Destination columns that don't have any inputs from the Data Flow should be set to ignore for their corresponding input column.)
Click OK to complete the Process Inserts component. Next, drag an instance of the OLEDB command transformation to the Data Flow designer's surface. Rename the new OLEDB command component Process Updates. Connect the Data Flow path from the Filter Based on Operation transformation to the new Process Updates OLEDB command using the Updated Records output. Configure Process Updates as Web Table 18 shows, and enter Web Listing 7’s T-SQL code for the SqlCommand property. Web Table 19 shows the Process Updates column mappings.
Click OK to complete the Process Updates OLEDB command. Finally, drag over one more instance of the OLEDB command transformation to the Data Flow’s surface. Connect the Data Flow from the conditional split transformation to the new OLEDB command using the Deleted Records output. Next, rename the new OLEDB command Process Deletes. Configure the new Process Deletes OLEDB command as Web Table 20 shows. Enter the Web Listing 8's T-SQL code for the SqlCommand property. Configure Process Deletes column mappings as Web Table 21 shows.
Click OK to complete the Process Deletes OLEDB command configuration. The package’s Data Flow should now look like Figure 5.
Test the CDC-Consuming SSIS Package
First, build the SSIS package in BIDS. Go to Build, ssis_CDC_Sample. To test our new CDC-consuming package, we’ll submit a series of UPDATE statements to the Sales.SpecialOffer table. These T-SQL statements emulate a front-end application, letting users update the company’s various special offers. Web Listing 9’s T-SQL code is to be submitted against the AdventureWorks OLTP 2008 database inside your original SSMS query window used in the above tasks. (You can clear its old content by pressing Ctrl+Shift+Delete.)
Next, execute the package. If change data was found, the package’s Control Flow status should look like Figure 6. If you query the DimPromotion table in the AdventureWorksDW database, you’ll find that the corresponding record’s DiscountPct is updated as well. This confirms that we captured and processed our change data from the AdventureWorks OLTP 2008 database.
Looking to the Future
I’m impressed with the new CDC feature built into the SQL Server 2008 relational database engine. Overall, the feature is good; it replaces the need to create custom CDC solutions and provides increased performance over those custom solutions because it’s asynchronous. If your organization has SQL Server OLTP environments, CDC is a great candidate for your future incremental ETL solutions!