
Quick Data Synchronization Using Table Partitioning and SSIS

How to synchronize massive amounts of data between mixed environments

Downloads
139796.zip

I recently helped a client migrate an application’s back-end databases from Sybase to SQL Server. One of the databases used by this application, which I’ll call MyApp, was a shared database. The other applications that used this database weren’t being migrated from Sybase to SQL Server, so the client needed a data synchronization mechanism that would propagate changes made in the Sybase tables to the SQL Server tables.

The SQL Server users of MyApp read data from only 19 of the 100+ tables in the shared database. The data in those tables was modified by several other applications that were still using Sybase. Because the users of those Sybase applications were spread across multiple time zones around the globe, data changes occurred practically every minute, 24/7. For this reason, the client wanted no more than a 30-minute delay in delivering data changes to SQL Server.

Creating the Data Synchronization Mechanism

Typically, configuring replication is the simplest way to synchronize selected tables in two environments. However, SQL Server doesn’t support Sybase publishers. And using Sybase Replication Server or another third-party tool wasn’t an option for the client for budgetary reasons.

Another synchronization approach is modifying all the applications using the shared database so that updates are made to both the Sybase and SQL Server databases in one transaction. Unfortunately, that approach wasn’t possible because the client wanted to minimize the impact of MyApp’s SQL Server migration on the other applications.

So, using SQL Server Integration Services (SSIS) was my only option. I needed to create an SSIS package that would synchronize the 19 tables and run every 30 minutes. The data synchronization process needed to complete in 12 minutes or less so that at least 18 minutes would be left for other processes to execute during the 30-minute cycle.

After examining the 19 tables, I found that they could be split into two groups. One group consisted of 16 relatively small tables, each of which had 150,000 or fewer rows. The other group consisted of three large tables. Two of these tables had about 1 million rows, and the third had more than 20 million rows.

I also found that all of the tables allowed INSERT, UPDATE, and DELETE operations. However, they didn’t have any timestamps to indicate when a row had been inserted or updated last. More important, I discovered that the deleted rows weren’t “archived” or marked as deleted—the rows were physically removed from the tables when the DELETE command was used. As a result, the biggest obstacle for data synchronization was determining which rows were deleted. The only way to find them was by comparing the data in the Sybase and SQL Server tables.

Initially, I decided to implement a four-phase process for all the tables:

  1. Create stage tables in SQL Server (one for each Sybase table).
  2. Copy all the data in the Sybase tables to the stage tables.
  3. Merge the data in the stage tables with the data in their corresponding core tables (i.e., the production SQL Server tables).
  4. Use the TRUNCATE TABLE command to remove all the rows from the stage tables in preparation for the next run.

The merge phase turned out to be the trickiest. I knew that running the MERGE command against even a small table (150,000 rows) would hurt server responsiveness, so I devised an update operation that would run in batches of 50,000 records. The update operation consisted of four steps:

  1. Delete the rows in the core table that don’t exist in the stage table.
  2. Update the core table with the corresponding rows from the stage table.
  3. Delete all the rows that were used in the update in step 2 from the stage table.
  4. Insert the stage table’s remaining rows into the core table.

Figure 1 shows the results of these steps in a simple scenario.

Figure 1: Results of each step in a sample merge operation for small tables
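
The exact T-SQL the client used isn’t shown here, but a minimal sketch of the four steps, assuming an illustrative core/stage table pair named dbo.Small_Core and dbo.Small_Stage with a key column id and a data column val, might look like this:

-- Step 1: delete core rows that no longer exist in the stage table,
-- working in batches of 50,000 rows so the core table isn't locked for long.
-- (The same batching pattern can be applied to the other steps as needed.)
WHILE 1 = 1
BEGIN
  DELETE TOP (50000) c
  FROM dbo.Small_Core AS c
  WHERE NOT EXISTS (SELECT 1 FROM dbo.Small_Stage AS s WHERE s.id = c.id);
  IF @@ROWCOUNT = 0 BREAK;
END;

-- Step 2: update core rows from their matching stage rows.
UPDATE c
SET c.val = s.val
FROM dbo.Small_Core AS c
JOIN dbo.Small_Stage AS s ON s.id = c.id;

-- Step 3: remove the stage rows that were used in the update in step 2.
DELETE s
FROM dbo.Small_Stage AS s
JOIN dbo.Small_Core AS c ON c.id = s.id;

-- Step 4: insert the remaining (new) stage rows into the core table.
INSERT INTO dbo.Small_Core (id, val)
SELECT id, val FROM dbo.Small_Stage;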

Initially, I created an SSIS package that used the four-phase process to synchronize data in both the small and large tables. I discovered that the process worked fine for the small tables, but it took about 45 minutes to complete for the three large tables. The vast majority of that time was spent copying and merging the data. Just copying the data from the large Sybase tables to the stage tables took about 15 minutes on the hardware provided (a dedicated virtual machine with 2 CPUs, 3.5GB of RAM, and a 32-bit version of SQL Server 2008). I needed a different approach for the large tables to meet the requirement of completing the entire data synchronization process in 12 minutes or less.

Synchronizing Data in Large Tables

Instead of using the batch-based four-step update process for each large table, I decided to perform a data exchange operation on the entire dataset. The fastest way to exchange data between two SQL Server tables is to use table partitions, because no physical data transfer takes place; the operation deals with metadata only. So, I decided to try the partition approach with my large tables.

Listing 1 (below) shows the original structure of one of the large tables, which was named dbo.Users. (For simplicity, I included just the basic structure.) To speed up the data exchange, I decided to create several partitions and populate them in parallel. By default, the number of parallel threads that an SSIS package can run simultaneously is equal to the number of processors plus two. Thus, I created four partitions because the SSIS package would be running on a machine with two CPUs. (If you have a machine with more CPUs, you can create more partitions and run more parallel threads in the SSIS package.)

Listing 2 (below) shows the partition scheme, partition function, and structure of the partitioned dbo.Users table. I split the entire data range (alphanumerical characters in this case) into four intervals: [0;E), [E;L), [L;R), and [R;}). I chose these boundaries so that the number of rows in each interval was approximately the same. This was crucial to get the optimal time for loading the data into the stage tables.

Because there were four partitions, I created four stage tables (dbo.Users_Stage1 through dbo.Users_Stage4). Each stage table has the same structure (columns and indexes) as the partitioned table. In addition, each stage table has a CHECK constraint that matches the corresponding data condition defined in the partition function. Listing 3 (below) shows the structure of the dbo.Users_Stage1 table.

Figure 2 illustrates the partition approach for the dbo.Users table. On the right, note the four additional backup tables (dbo.Users_Old_Part1 through dbo.Users_Old_Part4). Prior to each new data load, the backup tables are truncated so that they don’t contain any data. Then, the old data in the partitions is moved (aka switched) to the corresponding backup tables so that the data from the stage tables can be moved to the corresponding partitions in the SQL Server destination table.

Figure 2: Partition approach for the dbo.Users table

The structure of the backup tables is the same as the structure of the corresponding stage tables, except they don’t have a CHECK constraint. The general requirements for switching partitions can be found on the MSDN web page “Transferring Data Efficiently by Using Partition Switching”. Switching the data takes only a few seconds, no matter how much data is moving in and out of the partitioned table.
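For example, based on the stage table structure shown in Listing 3 (below), the backup table for the first partition might look like the following sketch (the article’s listings don’t include this DDL, so treat it as illustrative):

-- Backup table for partition 1: same columns and indexes as the
-- corresponding stage table, but without the CHECK constraint.
CREATE TABLE dbo.Users_Old_Part1 (
  user_id varchar (32) NOT NULL PRIMARY KEY,
  Name varchar (80) COLLATE Latin1_General_CI_AS NULL
) ON [DBName_Group1]
GO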
 

Figure 3: SSIS package design for the dbo.Users table

In the simplest case (no other tables, no logging, no error handling, and so on), the SSIS package design for uploading data into the partitioned dbo.Users table would look like that in Figure 3. Truncate_Tbl is an Execute SQL task that truncates the stage tables and backup tables. Load Part 1 through Load Part 4 are Data Flow tasks that upload the data from the Sybase source table into the corresponding stage tables. As Figure 4 shows, the Load Part 1 task uses the command

SELECT * FROM dbo.Users WHERE user_id < 'E'

to load the dbo.Users_Stage1 table. The Load Part 2, Load Part 3, and Load Part 4 tasks use the following commands

SELECT * FROM dbo.Users WHERE user_id < 'L' AND user_id >= 'E'

SELECT * FROM dbo.Users WHERE user_id < 'R' AND user_id >= 'L'

SELECT * FROM dbo.Users WHERE user_id >= 'R'

respectively, to load their corresponding stage tables.

By creating four parallel threads for data loading, I was able to decrease the data load time to about 7 minutes, roughly half the time of the original process, which used one Data Flow task per table.

SwitchPart_Tbl 1 through SwitchPart_Tbl 4 are Execute SQL tasks that execute a stored procedure named dbo.SwitchPart_Tbl. This stored procedure first moves the old data from the partitions in the SQL Server destination table to their corresponding backup tables. Then, it moves the new data in the stage tables to their corresponding partitions. You can download the dbo.SwitchPart_Tbl stored procedure by clicking the 139796.zip hotlink at the top of this page.
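
The downloadable procedure isn’t reproduced here, but a minimal sketch of the switching logic, assuming (my assumption, not the article’s) that the procedure takes the partition number as a parameter, could look like this:

-- Hypothetical sketch of dbo.SwitchPart_Tbl; the real procedure is in 139796.zip.
CREATE PROCEDURE dbo.SwitchPart_Tbl
  @PartNum int   -- hypothetical parameter: which partition (1 through 4) to switch
AS
BEGIN
  SET NOCOUNT ON;
  DECLARE @n nvarchar(10) = CAST(@PartNum AS nvarchar(10));
  DECLARE @sql nvarchar(max) = N'
    -- Move last cycle''s data out of the partition into its backup table,
    -- then move the freshly loaded stage data into the partition.
    -- Both statements are metadata-only operations.
    ALTER TABLE dbo.Users SWITCH PARTITION ' + @n + N' TO dbo.Users_Old_Part' + @n + N';
    ALTER TABLE dbo.Users_Stage' + @n + N' SWITCH TO dbo.Users PARTITION ' + @n + N';';
  EXEC (@sql);
END
GO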

The Dummy Task in the SSIS package’s design is for simplification only. In a real package, after all four parallel threads complete successfully, you would perform other tasks, such as logging results, sending email notifications if an error occurs, or writing the backup tables to backup media.

Implementing the Partition Approach

Although I needed to synchronize Sybase and SQL Server tables, you can use the partition approach to synchronize other types of heterogeneous tables (e.g., Oracle and SQL Server tables) or two SQL Server tables. For each table that you need to synchronize, follow these steps:

  1. Create a partition scheme, partition function, and partitioned table, using the structure required in the destination environment.
  2. Create a set of stage tables with the same columns and indexes as the partitioned table. In each stage table, create a CHECK constraint that matches the corresponding data condition defined in the partition function.
  3. Create a set of backup tables with the same columns and indexes as the partitioned table. These tables will temporarily store the old data from the corresponding partitions.
  4. Create an SSIS package with a set of parallel threads. Each thread needs to:

  • Truncate the stage tables and backup tables.
  • Copy data from the source table into the stage table.
  • Run a stored procedure that moves the old data from the destination table’s partitions into the corresponding backup tables, then moves the new data from the stage tables into the corresponding partitions in the destination table.

You can use the partition approach with other data synchronization techniques in the same SSIS package. For example, in my SSIS package, I used the partition approach to synchronize the data in the three large tables and the update approach to synchronize the data in the 16 small tables. All 19 tasks ran in parallel.

Table Partitioning Saves Time

Table partitioning can significantly reduce the time it takes to load massive amounts of data into a SQL Server table. It allows an SSIS package’s Data Flow tasks to execute in parallel. As a result, it takes a lot less time to upload data from the source table to the stage tables and move data from the stage tables to the destination table’s partitions. I used table partitioning to reduce the time it took to synchronize the data in three large tables from 45 minutes to 12 minutes or less.

Listing 1: Original dbo.Users Table

CREATE TABLE dbo.Users (
  user_id varchar (32) NOT NULL PRIMARY KEY,
  Name varchar (80) COLLATE Latin1_General_CI_AS NULL
  -- DBName_Group1 is the name of the file group
  -- in which the table was created.
) ON [DBName_Group1]
GO

Listing 2: Partitioned dbo.Users Table

-- Remove the partition scheme and function left over from a previous run.
-- (The original, nonpartitioned dbo.Users table from Listing 1 must be
-- dropped or renamed before the table is re-created on the partition scheme.)
DROP PARTITION SCHEME myRangePS
GO
DROP PARTITION FUNCTION myRangePF
GO

CREATE PARTITION FUNCTION myRangePF (varchar(255))
  AS RANGE RIGHT FOR VALUES ('E', 'L', 'R');
GO
CREATE PARTITION SCHEME myRangePS
  AS PARTITION myRangePF
  TO (DBName_Group1, DBName_Group1,
    DBName_Group1, DBName_Group1);
GO
CREATE TABLE dbo.Users (
  user_id varchar (32) NOT NULL PRIMARY KEY,
  Name varchar (80) COLLATE Latin1_General_CI_AS NULL
) ON myRangePS (user_id);
GO

Listing 3: dbo.Users_Stage1 Table

CREATE TABLE dbo.Users_Stage1 (
  user_id varchar (32) NOT NULL PRIMARY KEY,
  Name varchar (80) COLLATE Latin1_General_CI_AS NULL
) ON [DBName_Group1]
GO

ALTER TABLE dbo.Users_Stage1
ADD CONSTRAINT [CK_Part1] CHECK ([user_id] < 'E')
GO

 

 
