Practicing Data Porting

Downloads
44682.zip

In my work, I routinely redesign and redeploy production databases, and to do so, I write programs that transfer data from the old database to the new one. Readers have asked me to share data-transfer tips and techniques, so this month, I respond to that request. To show how to move data from an old database to a new one, I refer to the keyword-array example in my December 2004 article "The Search Path Less Traveled," InstantDoc ID 44273. In that article, I explained how keyword arrays such as the one that Figure 1 shows can adversely affect database performance and data-retrieval accuracy. I also explained how to design a set of normalized tables that will let you easily search by and manage keywords, maintain high data-retrieval accuracy, and minimize duplicate data. This month, I walk you through porting the data from the keyword array into this new set of normalized tables.

Before we start this project, let's briefly analyze Figure 1's ASSET table and determine why we'd want to normalize it. (For a full discussion about data normalization, see my article "Normalization: Why You Need It," March 1999, InstantDoc ID 4887.) Remember this rule of normalization: one topic, one table. The ASSET table breaks this rule by storing data about images and keywords. But a worse problem is the embedded many-to-many (M:N) relationship between image and keyword. As I explained in "The Search Path Less Traveled," an image is described by one or more keywords; a keyword can describe one or more images. As the M:N relationship becomes more complex—more keywords describe each image, and the same set of keywords describes more images—the ASSET table becomes harder to maintain, so we need to normalize it.

You can download this article's Web listings at InstantDoc ID 44682. We'll use the code in Web Listing 1 to create and load the ASSET table. Then, we'll use the code in Web Listing 2 to create the new, normalized tables. Finally, we'll use the code in Web Listing 3 to transfer the data from ASSET into the three new tables, IMAGE, KEYWORD, and IMAGEKEYWORD.

Doing the Prep Work

Figure 2 is a physical data model (it looks suspiciously like an entity-relationship diagram) of the new, normalized tables. The IMAGE table contains a description of each image. The KEYWORD table contains a list of keywords. To associate keywords with images, we need a table like IMAGEKEYWORD, in which each image can be associated with zero or many keywords and each keyword can describe zero or many images.

When you're transferring data from one database schema to another, it helps to visualize what you need to do before you start writing code. Figure 3 shows a mapping technique I use when planning complex data transfers. Each field of the original table (the blue bar) is mapped to another field in one of the new tables (the tan bars). I retain the ID field from the ARRAY table in the IMAGE table's ImageID field because of relationships with other tables that aren't part of this discussion. When transferring data, you have to remember this rule: You must maintain the valid table-to-table associations and relationships of the original database. As Figure 3 shows, we'll transfer data from the ARRAY table's name and asset_image fields to the IMAGE table's ImageName and ImageFile fields, respectively. I renamed these fields to better identify their business function in the new database schema. We'll transform the keyword array, ASSET.keywords, from an array of terms to a list of single words in the KEYWORD table's KeywordDscr field.

Now we have images and keywords mapped, but how do we associate the two? That's the purpose of the IMAGEKEYWORD table. We created this table to retain the M:N relationship and make it work. The IMAGEKEYWORD table stores the ImageID field from the IMAGE table and the KeywordID field from the KEYWORD table as pairs; for each image that one keyword describes, IMAGEKEYWORD will contain one record.

We're almost ready to begin reviewing the code listings, but before we do, I want to mention something about backward compatibility. You might be thinking that when I talk about an older database schema, I'm talking about a legacy system or at least one that's several years old. That's not always true. Occasionally, a client asks me to restructure brand-new SQL Server databases whose design doesn't let a company expand or extend their business processes. In one case, the schema was only a year old when I started working on its replacement. So when you transfer data from one database schema to another, you have to keep in mind that the current UI might not be changing immediately. The decision-makers in your company might decide to restructure the database first and map the existing UI to the new database schema. Then over time, they'll introduce the new schema's enhanced functionality to their users and customers through new Web interfaces. Therefore, when you're designing a new database and transferring data, you have to be aware of how you're going to support this essential backward compatibility.

Creating the New Tables

In the case of these keyword arrays, we're looking at breaking up a string of terms into its component words. If you don't have access to the application code, or if the application code is written so that it's impossible to tell what its doing with the arrays, it's better to proceed conservatively with the data transfer. To determine the length of the longest array string in ASSET.keyword, execute the following code:

SELECT max(len(keywords)) FROM dbo.ASSET

If the longest array string is 55 bytes, for example, you'll create KEYWORD.KeywordDscr as a varchar(55).

Now, let's look at the three Web listings you downloaded. Web Listing 1 creates the original ASSET table and loads it with sample data, including the keyword array for each record. The data set is atypically small—only 33 rows. Production tables like this one typically contain many thousands of rows. Notice the comments scattered through the three code listings. When you're working on a data-transfer project and porting data from one database schema to another, you must document everything you do.

The code in Web Listing 2 creates the three new normalized tables, IMAGE, KEYWORD, and IMAGEKEYWORD. I set the length of KeywordDscr in the KEYWORD table to 40 characters based on the result I got from the SELECT max(len(keywords)) FROM dbo.ASSET query. I created on the KeywordDscr column a unique clustered index; it's indexed for fast searches and unique to avoid duplicate keywords.

Time to Transfer

The scripts for Web Listing 3, which are the most complex of the three listings we're working with, transfer data from ASSET into IMAGE, KEYWORD, and IMAGEKEYWORD. The first part of Web Listing 3 is a simple, direct transfer of data from field to corresponding field. I wrote the transfer script from the point of view of the target table (i.e., where we're transferring the data). But remember that you should always document where the data is coming from, as the comment at callout A in Web Listing 3 shows.

In addition, row counts are necessary to ensure that your transfer scripts are working correctly. Execute SELECT statements throughout a transfer script so that you maintain control over the data. No rule specifies how often to check rowcounts, but you should check in at logical intervals. For example, the SELECT statement at callout B in Web Listing 3 shows a rowcount of 33; by checking this count periodically, you can be sure you're not losing or adding rows. If you're doing a simple, direct transfer, generally the row count at the end of the script will equal the row count at the beginning. The more complicated the data transfer, the less true this statement. In this article's example, breaking apart the array string into a list of component words will vastly increase the row count from the source to the target. There's no way to retain absolute control over the source-to-target row count in this case, but checking it gives you a sense of whether you're on the right track.

Use the SET IDENTITY_INSERT <tablename> ON function (as the line at callout C in Web Listing 3 shows) when you need to transfer identity values from the old database to the new one. As I mentioned, in the case of the IMAGE table, you must carry forward the existing identity values.

Transferring data from the old ASSET table into the new KEYWORD table is complicated. The lines at callout D in Web Listing 3 show the start of the section of code that will break up the keyword array from the ASSET table and transfer the individual component words into the KEYWORD table. The first step in this section, at callout E, checks for the longest string in the source table to be sure the KeywordDscr column in the KEYWORD table is long enough to accommodate the longest string in the Keywords column of the ASSET table.

When I do data transfers, I use "working tables" to store intermediate results. In this example, the code at callout F creates the working table _TEMP_Keywords_Catalog to hold the component words from the keywords array. The code at callout G creates the worktable _TEMP_KeywordArray to hold a copy of the keywords array from the ASSET table, along with the unique identifier (UID) for each record. In this temporary table, you need just the columns from the ASSET.Keywords field—not the entire record—for data manipulation.

The next section of code is where the real work starts. If the code at callout H finds a space in a keywords array record, the code writes the record's unique ID and all characters up to the space into the _TEMP_KeywordsCatalog holding table. Then, the code removes from the keywords array everything it just copied up to and including the space. If the remainder of the array includes any leading spaces (which might happen if the original array record included two or more spaces between words), the code truncates those leading spaces. The code repeats the process until there's nothing left in the original keywords array record, then moves on to the next ASSET table record.

The next few sections of code, starting at callout I, are situation specific. Each data-transfer job is different, and you'll have to create custom solutions for each situation, but I've included code that addresses a few conditions you might encounter. For example, the code removes commas or tabs—which you'd expect if the original keywords array were a comma-delimited or tab-delimited list—from the ends of words. Or you might run into one or more line-feed carriage-returns after each word or, in an exceptional case, embedded in the middle of a word. In the latter condition, I've opted to simply remove these words from the list because tests I performed outside of this code listing show that words that have embedded line-feed carriage returns have always included comparable words that don't have embedded line-feed, carriage-returns. So, eliminating the problem words doesn't mean you've eliminated the useful entries.

The next section of Web Listing 3 loads the production table. Visual observation of the data will likely reveal duplicate data for nearly every keyword, so the code at callout J checks for and removes duplicate keywords. And you can avoid backward-compatibility problems by loading the original array data into the new table, as the code at callout K shows.

Finally, you're ready to use the code at callout L to transfer data into the associative table IMAGEKEYWORD, which you created to cross-reference images and keywords. Every M:N relationship in the database will be represented by an associative table. Note that I've included the section of code at callout M just in case. This part of the code looks for and removes "blank" keywords. The processing that the code has done so far wouldn't create blank keywords, but if an original record is NULL or contains an empty string, a line-feed, a carriage-return, or a strange character that doesn't print, the target table would still include those anomalies. As you can see, in this example, this section of code returns no rows.

As the code at callout N shows, you must always check for duplicates before loading data into a target table. In this case, a duplicate is the entire row: ID plus keyword. In this example, the code found only one duplicate row and eliminated it in the next step.

As you work through the example, keep checking the row counts. In this example, we started with 33 images from the original source data identified by 97 keywords. As the final check section at callout O shows, we end up with 141 image-key combinations. This difference is within the normal ratio of base to associative table. Finally, when you get to the end of your transfer, don't neglect the cleanup stage. Drop every working table, as the code at callout P shows.

It's one thing to understand how to properly structure tables for keyword lookups, it's harder to transfer data from an old database to a new one. In the process, you'll encounter simple, direct data transfers, you'll pull your hair out over complicated transfers that involve data transformations, and you'll create tables to relate one new data set to another. To get through the process successfully, you have to pay attention to detail, continuously monitor row counts, and visualize what data you need to transfer where.

Comments

Plain text