T-SQL programmers often need to handle data that comes in a denormalized form. Certain tables might appear in a denormalized form because they store data that was imported from legacy systems, for example, or from text files. Also, many online transaction processing (OLTP) environments aren't properly normalized even though they should be. Let's look at an example problem that involves temperature measurement samples stored in a denormalized form. I'd like to thank SQL Server MVPs Linda Wierzbicki and Steve Kass, who came up with some of the tricks that I use in this article.
The Bottoms Up fictional winery has vineyards at several sites. The owner gathers temperatures from each site as often as 30 times a day and performs various statistical calculations based on the measured temperatures. He uses the results to determine how much to water and fertilize the grapes. A legacy system records the temperatures and produces a daily text file in which each record contains a site ID and a comma-delimited array of temperature measurements in Celsius. The system then loads the text file to a table in SQL Server.
The table holds one column containing the site ID and another, character-based column containing the comma-delimited list of temperatures. You need to write queries that calculate various statistics such as average, minimum, and maximum temperatures for each site. Run the script in Listing 1 to create the TempSamples table and populate it with sample data.
The current table format doesn't even meet the first normal form (1NF) requirement of having no repeating groups. A normalized form of the table, such as the one that the code in Listing 2 creates, would be much more convenient to code against. So, you decide to write some code that will populate a copy of the TempSamples table—let's call it NormTempSamples—with a normalized version of the data.
Solution That Uses a UDF
One way to approach the problem is to work out a way to extract a specific element from the comma-delimited list of elements. If you're working with SQL Server 2000, you can write a user-defined function (UDF) that performs the task. The function accepts two arguments: the ordinal position of the element that you want to extract and the string containing the comma-delimited list of elements. Run the script that Listing 3, page 18, shows to create the fn_getelement() UDF. Later, I discuss another solution that works in both SQL Server 2000 and 7.0.
Let's look at how the UDF works. First, the function validates the input. I'll discuss just one validity check that might need explanation. The logical expression
IF @ord > LEN(@str) - LEN(REPLACE(@str, ',', '')) + 1
returns true only if the specified element ordinal position is beyond the end of the list. The trick to correctly using this expression is to subtract the length of the string (without the commas) from the length of the original string (with the commas). In this case, you get the number of elements minus 1, so you need to add 1 back in. Next, the function finds the starting position of the desired element by using a WHILE loop that advances from one comma to the next until the number of iterations reaches the element in the desired position. Finally, the function extracts the desired element from the comma-delimited list, converts the character-based element to the integer data type, and returns the element.
After you've extracted a desired element from the list, you face another obstacle. You need to duplicate each site's row as many times as you have measurements for that site; then, you need to generate the appropriate sample number for each row so that you can invoke the function once for each sample. For example, if site A has five elements, you need to generate five rows with sample numbers ranging from 1 to 5. Then you invoke the function, supplying the list of measurements and the sample number as arguments. You can do all this by first creating an auxiliary table containing the numbers from 1 to the maximum possible number of measurements—30, in this case. Running the code in Web Listing 1 (available online) generates the Nums auxiliary table and populates it with the desired range of numbers.
The last step in the solution is writing a query that produces the desired normalized form of the measurement samples. You need to join the TempSamples table to the Nums table, but the JOIN condition is tricky. It should include the following expression:
ON n <= <num of elements in list>
but how do you find the number of elements? The fn_getelement() function demonstrated a trick that uses the LEN() and REPLACE() functions to count the number of elements. You can use this trick in writing your final query to get the number of elements. Listing 4 shows the complete query, with an ORDER BY clause added for better formatting. All you have to do is use Listing 4's query (without the ORDER BY clause) in an INSERT INTO statement to populate the NormTempSamples table with the normalized data, and you're done.
Solution That Doesn't Use a UDF
If you're working with SQL Server 7.0 or earlier, you can't use UDFs. You can try using an auxiliary table like the one I used in the previous solution, but without a function that can extract the element that's in the desired position, you can't do much with 30 numbers representing the possible element ordinal positions. However, if you insert more numbers into the Nums table, you can use Nums to locate the commas in the list of measurements.
First, run the code that Web Listing 2 shows to insert the numbers up to 8000. I chose that number because with 8000, the query will match each number with a character in the list and because the temperatures column is defined as varchar(8000). Now you can join TempSamples to Nums based on the following JOIN condition:
ON n < LEN(temperatures) AND SUBSTRING(',' + temperatures, n, 1) = ','
This condition returns true when the query finds a comma in the nth character. The string containing the temperatures is prefixed with a comma so that the JOIN condition will include the starting position of the first element. A JOIN query containing the above condition produces the desired number of rows per site, with each row containing the site ID and a list of measurements and with n representing the position of the first character of an element. Now all you have to do is write two expressions in the SELECT list that extract the element and calculate the element's index, as I demonstrated earlier in this article. Listing 5 shows the final query.
T-SQL gives you a powerful means of working with denormalized data. You can also use the tricks I showed in this article to migrate your data from a denormalized to a normalized form. And whenever you can, try to educate your customers about the benefits of using normalized data in their OLTP environments.