Data Transformation Unplugged

You, too, can convert sensitive records for testing and maintenance

Developers need to be able to realistically test the programs that they support not only during initial development but also during ongoing maintenance and enhancement. Access to data that's representative of the production database both in complexity and volume is especially valuable because the developer can test for query accuracy as well as for factors such as performance, index use, and data growth. Where the production data is of low sensitivity, management usually approves putting a snapshot of the data on the development server. However, when production data consists of sensitive material such as financial records, employee records, credit ratings, and trade secrets, management might declare that data too confidential to share. In this environment, management might permit developers to create only a few test records on their database systems. The patterns that result from this small volume of test data might not reflect the patterns of large data volumes in the production system, rendering the tests inaccurate and unhelpful.

How can you satisfy management's security and privacy requirements while meeting the developers' needs? You can meet those requirements if you create test data that has the characteristics of the production data—including the frequency of repeating data, the same volume of data, and patterns of foreign key relationships. By using the Internet as an alternative data source and employing some T-SQL tricks, a developer who has access to the production data can create unrecognizable but representative test data for the development system.

Setting Up the Problem


To establish a sample problem, let's create an imaginary book-publishing company based on the Pubs sample database. Management for this simulated company—let's call it the Manitoulin Conglomerate—is concerned about the confidentiality of financial arrangements, author information, and book titles. Therefore, you need to create transformed data for development and testing. Because the Pubs database is distributed with SQL Server and is loaded with production data, you'll be able to work through the transformation steps as I describe them. (If your Pubs database is missing or has been changed, you can recreate it by using the instpubs.sql script and the Pubimage.bat file located in the Install directory of your SQL Server installation.)

The first step is to create a Pubsdev database that exactly mirrors the Pubs database structure. You can use Query Analyzer or Enterprise Manager to create the database and tables yourself, or you can use the instpubsdev.sql executable script that accompanies this article. You should run all this article's scripts in the Pubsdev database to avoid corrupting the Pubs database. To download all the code in the article plus the executable scripts, go to http://www.tsqlsolutions.com, InstantDoc ID 25876, and click Download the code. To run all this article's examples on SQL Server 7.0, you must be a member of the sysadmin role. For SQL Server 2000, you need to be a member of at least the dbcreator and bulkadmin roles. Now you're set to create the Pubsdev database by running instpubsdev.sql in Query Analyzer. For this project, you'll use only the publishers, authors, titles, and titleauthor tables. Because you're supporting a production environment, the development database's table definitions, data types, and constraints must match those of the production system. Only the data in the tables needs to be transformed.

The next step is to find alternate sources of data to feed the transformations. These days, the natural place to look for test data is the Internet. When you choose a Web site as a data source, you need to make certain that the site is suitable for the needed conversions and that the site's data is available without copyright or other usage restrictions. Table 1 shows the sites I selected. One good place to look for book titles that don't run afoul of proprietary restrictions is the Gutenberg Project, a volunteer-supported and -funded source of freely available, simple electronic texts, which I listed first in Table 1.

Preparing Data to Drive the Transformation


At the Gutenberg Project, you can find catalog files of authors, titles, and other data. I've chosen the titles.txt file as the only file you'll download from the Gutenberg Project because it has the simplest format for programmatically extracting book titles. I've provided the bulkin.fmt format file to control how the titles.txt file is inserted into SQL Server. (The bulkin.fmt file and a scaled-down version of the Gutenberg Project's titles.txt file are included in this article's downloadable files.) To load the data from titles.txt into SQL Server, use the T-SQL statements in Listing 1. The BULK INSERT command is the fastest way to load data into SQL Server.
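
Listing 1 is in the download file; a minimal sketch of its shape, assuming a two-column working table (the column sizes here are my guesses, not necessarily those in the download), looks like this:

  CREATE TABLE XfmLoadUp (
    TextID     int IDENTITY(1,1) NOT NULL,
    TextString varchar(1000) NULL
  )

  -- Load the raw Gutenberg catalog lines through the format file.
  BULK INSERT Pubsdev..XfmLoadUp
    FROM 'C:\transform\titles.txt'
    WITH (FORMATFILE = 'C:\transform\bulkin.fmt')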

For the script in Listing 1 to run, the titles.txt file and the bulkin.fmt file must be in the C:\transform directory. If you place the files somewhere else, you need to edit the path statements in Listing 1 to point to the correct directory before running the code. Run the code in Listing 1, which fills the XfmLoadUp table with the contents of titles.txt. The IDENTITY column on the XfmLoadUp table lets you use mathematics later to decide how to use the titles data. The bulkin.fmt format file in Figure 1 causes BULK INSERT to skip the TextID column (thus letting SQL Server generate the identity values) and load each line of text into TextString. This format file is useful for any load of raw-text data. You need the file for this project because the titles.txt file is a list, not a database, so SQL Server reads it as raw data in one column. (If you're running SQL Server 7.0, change the version number in the first line of Figure 1 to 7.0 and delete the unsupported collation specification.) Table 2 explains the transformation techniques that you'll use, and Table 3 defines the transformation strategy for titles. I'll reference the numbered strategies (e.g., #1, #2, #3) in Table 3 when I explain the T-SQL code.
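
Figure 1 isn't reproduced here, but a bcp-style format file for this load generally takes the following shape (my reconstruction, not a verbatim copy of bulkin.fmt). The single host field maps to server column 2 (TextString), and because column 1 (TextID) isn't mentioned, SQL Server generates the identity:

  8.0
  1
  1  SQLCHAR  0  1000  "\r\n"  2  TextString  SQL_Latin1_General_CP1_CI_AS

The first line is the version number that the SQL Server 7.0 note above refers to, and the trailing collation specification is the part you'd delete for 7.0.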

The Manitoulin Conglomerate operates eight publishing companies, represented by the data in the publishers table. The conglomerate doesn't need to disguise the publishing company names, but to facilitate later distribution of publishers across titles, let's associate an integer identity with each publisher. Let's also extract usable data from the raw text in the XfmLoadUp table. The Gutenberg titles.txt file tends to follow a title, by author, birthyear-deathyear format. Occasionally, a subtitle follows the main title, contained between a colon (:) and the ", by" string that introduces the author's name. Because you're only retrieving test data and not attempting a formal data-conversion project, you'll use only rows that match these patterns. Listing 2's T-SQL script loads your development publishers table and associates an integer value with each publisher. It also parses the raw text from XfmLoadUp into the XfmNewTitle and XfmNewTitleNote working tables. To prepare the substitute data, run Listing 2. At the time I downloaded titles.txt from the Gutenberg Project, this T-SQL code selected 3778 properly formatted titles from the file—far more than the 18 titles supplied in the Pubs database. So, finding a replacement title for all 18 will be easy. (If you also wanted to increase the volume of titles data, titles.txt would let you add another 3760 titles.) Because I defined the subtitle strings in some titles' lines as notes, Listing 2 also retrieved 272 title notes.
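
Listing 2's parsing amounts to string surgery with CHARINDEX, LEFT, and SUBSTRING. Here's a simplified sketch of the title extraction, assuming XfmNewTitle has an identity column plus a NewTitle column (Listing 2 itself also splits out subtitles and birth years, which this sketch omits):

  -- Keep only rows that match the "title, by author" pattern and
  -- strip everything from ', by' onward to isolate the title text.
  INSERT INTO XfmNewTitle (NewTitle)
  SELECT LEFT(TextString, CHARINDEX(', by', TextString) - 1)
  FROM XfmLoadUp
  WHERE CHARINDEX(', by', TextString) > 0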

Let's use a math trick to associate a publisher with each Gutenberg title by joining the identity column in XfmLoadUp to the identity column in XfmPublisherMatch. The modulo operator (number % number) returns the remainder of the division of the first number by the second. Taking the modulo of each new title's NewTitleID by the number of publishers and adding 1 distributes Manitoulin Conglomerate's publishing companies evenly across the titles.
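
In T-SQL, that distribution is a one-line expression. Here's a sketch, assuming XfmNewTitle carries a PubMatchID column (my name) to hold the result:

  DECLARE @PubCount int
  SELECT @PubCount = COUNT(*) FROM XfmPublisherMatch

  -- With 8 publishers, NewTitleID % 8 + 1 cycles 1, 2, ..., 8, 1, 2, ...
  UPDATE XfmNewTitle
  SET PubMatchID = (NewTitleID % @PubCount) + 1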

The Gutenberg titles don't carry copyright dates, so let's appropriate the author's birth year and save it as an integer for a future generation step. Just as you associated an integer value with each publisher, you need to do the same with each of the original titles and title types so that you can finish preparing the data you need to run the major transformation. For this purpose, run the code in Listing 3.

Listing 3's code contains a couple of interesting details. First, the production titles data in the Pubs database seems to imply a business rule that ties the first two characters of the title_id to the type classification of the title. Therefore, if you choose to follow this assumed rule, you can insert the distinct set of types and type codes (the two-character title_id prefix) into the table that you'll use to transform types. The SELECT clause that provides data to the INSERT INTO command uses an ORDER BY to return the distinct rows in descending order. This simple example shows you one way to vary the order in which you assign identity values. Second, the additional statement that updates the XfmTypeMatch table creates a calculated begin and end range for each title type based on the count of production rows that used that type. Unlike the even distribution of publishers, the begin and end range lets you distribute the various title types proportionately across titles.
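
Here's a sketch of that range calculation; XfmTypeMatch is the article's table, but the TypeID, TypeName, RowCnt, BeginRange, and EndRange column names are my assumptions. Running sums over each type's production row count yield contiguous begin/end ranges:

  -- Count the production rows for each type.
  UPDATE t
  SET RowCnt = (SELECT COUNT(*)
                FROM pubs..titles p
                WHERE p.type = t.TypeName)
  FROM XfmTypeMatch t

  -- Turn the counts into contiguous ranges
  -- (e.g., a three-row type gets a range such as 10 through 12).
  UPDATE t
  SET BeginRange = ISNULL((SELECT SUM(t2.RowCnt)
                           FROM XfmTypeMatch t2
                           WHERE t2.TypeID < t.TypeID), 0) + 1,
      EndRange   = (SELECT SUM(t2.RowCnt)
                    FROM XfmTypeMatch t2
                    WHERE t2.TypeID <= t.TypeID)
  FROM XfmTypeMatch t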

Transforming the Titles


Now, let's create the transformed title data for the Manitoulin developers. Run Listing 4 to generate contents for the titles table. You need two arithmetic calculations—one to choose from the candidate new titles and the other to choose the rotation rows that provide alternate monetary values. Remember that the goal is to produce test data that reflects the shape of the source data. How closely the data distribution should match the original is a design decision; you have more options than those I offer here. Let's look at a detailed explanation of each numbered transformation strategy in Table 3 to see what this code accomplishes.

Strategy #1. The code forms the title_id from the two-character prefix associated with the title's type, plus four characters generated from the new title's identity column. The two-character prefix satisfies the type-code business rule that you inferred earlier; no business rule was identified for the numeric portion.
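
For example, the expression below (my rendering of the idea, not Listing 4's exact code) builds a title_id such as PC0012 from a type code and an identity value:

  DECLARE @TypeCode char(2), @Ident int
  SELECT @TypeCode = 'PC', @Ident = 12

  -- Pad the identity to four characters and prepend the type code.
  SELECT @TypeCode + RIGHT('0000' + CONVERT(varchar(4), @Ident), 4)
  -- Returns PC0012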

Strategy #2. Knitting together the original title_id with an identity value lets you choose how to substitute the title. By dividing the number of new titles by the number of original titles, you can calculate how far apart you can spread the new titles without breaching the substitute value range, then compute the new title's identity. (Because titles.txt is in alphabetical order, a simple join between the identities retrieves all new titles from the beginning of the alphabet. By multiplying the identity value for the production title by the results of the above division, you can gather new titles from across the alphabet, creating a more natural-looking distribution.)
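
In concrete numbers: 3778 candidates divided by 18 originals gives a spread of 209 (integer division), so production title n maps to new title n * 209, sampling titles from across the whole file. Here's a sketch, assuming the original-title match table is named XfmTitleMatch with an identity column TitleMatchID:

  DECLARE @Spread int
  SELECT @Spread = (SELECT COUNT(*) FROM XfmNewTitle) /
                   (SELECT COUNT(*) FROM XfmTitleMatch)

  -- Original title n picks up new title n * @Spread.
  SELECT t.title_id, n.NewTitle
  FROM XfmTitleMatch t
  JOIN XfmNewTitle n ON n.NewTitleID = t.TitleMatchID * @Spread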

Strategy #3. The title's type, which also supports the title_id creation, is distributed proportionately across the new titles. The earlier calculation of ranges for each type allows a join of the derived identity for the original title_id into the range for a type. For example, the identity value 12 fits into the range of 10 through 12 that you calculated from the three original rows belonging to popular_comp.
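
The join itself is a BETWEEN against the precalculated ranges (same assumed names as before):

  SELECT t.TitleMatchID, ty.TypeCode, ty.TypeName
  FROM XfmTitleMatch t
  JOIN XfmTypeMatch ty
    ON t.TitleMatchID BETWEEN ty.BeginRange AND ty.EndRange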

Strategy #4. The publisher_id was distributed evenly across all candidate new titles when you ran Listing 2 to fill the XfmNewTitle table. At that point, you chose not to maintain the original proportion of books to publishers.

Strategies #5, #6, #7, #8. The monetary columns need to be representative of Manitoulin Conglomerate's actual monetary arrangements, and they have a logical relationship to one another. By computing the modulo of the number of title identities with a random number in the range of title identities, you rotate in real monetary values from another source row for this transformed row. (Randomness does occasionally return the source row's own numbers unless you program your code to avoid that possibility.)
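
One way to express the rotation follows (a sketch, not Listing 4's exact arithmetic; note that RAND() is evaluated once for the whole statement):

  DECLARE @TitleCount int, @Offset int
  SELECT @TitleCount = COUNT(*) FROM XfmTitleMatch
  SET @Offset = CAST(RAND() * @TitleCount AS int)

  -- Each title takes its monetary values from the row @Offset positions
  -- away, wrapping around with modulo. If @Offset happens to be 0, rows
  -- rotate onto themselves, the possibility the parenthetical note mentions.
  SELECT t.title_id, src.price, src.advance, src.royalty, src.ytd_sales
  FROM XfmTitleMatch t
  JOIN XfmTitleMatch r
    ON r.TitleMatchID = ((t.TitleMatchID + @Offset - 1) % @TitleCount) + 1
  JOIN pubs..titles src ON src.title_id = r.title_id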

Manitoulin doesn't want the actual advances (#6) and year-to-date sales figures (#8) to be widely known. An arbitrary calculation generates new values based on the rotated value from the previous step and keeps them within realistic ranges that are comfortably below the higher figures in the production data.
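
The generating calculation can be as simple as a scale-down. The multipliers below are purely illustrative, not the article's formula:

  -- Shrink the rotated figures so they stay below production maximums.
  SELECT title_id,
         ROUND(advance * 0.75, 0) AS test_advance,
         CAST(ytd_sales * 0.60 AS int) AS test_ytd_sales
  FROM pubs..titles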

Strategy #9. The title note is substituted by a simple join between the identity column of the source title and the identity column of the title notes that you extracted from the Gutenberg titles. This join serially applies the substitute notes to your development titles independent of the selection strategy for the substitute titles.
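
Here's a sketch of that serial application, assuming NewTitleNoteID is the identity column on XfmNewTitleNote:

  -- The nth original title simply receives the nth extracted note.
  SELECT t.title_id, nn.NewTitleNote
  FROM XfmTitleMatch t
  JOIN XfmNewTitleNote nn ON nn.NewTitleNoteID = t.TitleMatchID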

Strategy #10. Finally, the copyright date is generated by using the author's birth year as a count of days to subtract from today's date, shifting the date back in time. Again, this strategy doesn't maintain the original copyright-date distribution, but you could perform a different transformation if that distribution is important to you.
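
In T-SQL, the shift is a single DATEADD call; assuming the birth year was saved in a BirthYear column on XfmNewTitle, the expression looks like this:

  -- A birth year such as 1850 pushes the date back about five years.
  SELECT n.NewTitle,
         DATEADD(day, -n.BirthYear, GETDATE()) AS pubdate
  FROM XfmNewTitle n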

How thorough is the transformation? Figure 2 displays the change in one row to show how complete the transformation is.

Applying the Techniques


How might you best apply these techniques to your data? Let's use a pharmaceutical firm as an example. Substitution from an alternate source of data is best for descriptive information such as names, titles, descriptions, and so forth. (From the Internet resources noted in Table 1, you can see that I've focused on replacement data from alternate sources for this article.) Your firm might not want the names of products—especially new products still awaiting government approval—in the test data because employees with access could draw inferences.

Generating new values is another way to replace existing values; it's most useful for changing identifying data such as numbers and dates. Your firm might prefer that no test-data dates match the real dates.

Redistribution of existing values keeps some connection to the source data and is usually a bit easier for testers to work with than totally altered values. Thus, your firm's products would still include pain relievers and antacids, but to disguise product details, redistribution would assign those designations to different products.

Proportional distribution is especially useful in cases where the data contains a large skew. If 90 percent of your products are pain relievers, duplicating that skew can help you do a better job of testing such processes as query performance and report formatting.

Rotation works when you need representative values that are related to one another. For example, a product might carry production costs as well as wholesale and retail prices. Rotation keeps the relationships among those monetary values in place. If the numbers are too revealing, you can use generation to transform the rotated data while keeping the relationships intact.

More Transformations to Come


Manitoulin's management is happier, but the transformation job isn't finished until you've also transformed the author information. An upcoming article will cover several aspects of transforming data about people. I'll also discuss persisting test transformations, using substitution sources that are difficult to parse, and generating test data when you have no Internet resource. In the meantime, go to the Internet and find the representative data that you need for your development environment. Use the most suitable techniques and turn that confidential production data into a resource for test data.
