Adding Performance

Downloads
20131.zip

Editor's Note: Send your experts-only T-SQL tips to Itzik Ben-Gan at [email protected]. If we use your tip in the magazine, you'll recieve $100 and an exclusive T-SQL Black Belt shirt.

Adding Performance

How many ways can you solve a given business problem by using T-SQL? In most cases, you have several options. But how many times have you considered using mathematics to help you solve business problems? You probably don't often consider that approach if your problem isn't purely mathematical. However, with the aid of mathematics, you can sometimes reduce the amount of T-SQL code you need to solve a problem and optimize the code's performance considerably. In this article and the next, I'll present some business problems and show you how to use a mathematical approach to solve them.

Calculating a Group Product

This example, which Shelly Beckwith of Paternity Testing Corporation and Ted Malone (founder of SQLtrain.com) provided, comes from a biological testing facility where a table is created in a SQL Server database to store the results of individual biological tests. Each test has a specific test ID and a case number the tester assigns before the tests begin. Any case number can have multiple test results. You need to find a final result that is the product of all individual test results for a given case number. Listing 1 contains a script that creates the Testresults table and populates it with sample data.

This problem calls for an aggregate query because it requires the aggregate product of all individual test results for each case number. T-SQL provides aggregate functions such as SUM(), MIN(), MAX(), and COUNT(), but it doesn't include a PRODUCT() aggregate function. However, you can approach this problem from several other directions. For example, you can declare a cursor on a query that sorts the Testresults table's rows by case ID and test ID and assign the value 1 to a variable. Then, use T-SQL code to create a loop to fetch the cursor rows one by one, multiplying the variable by the current row's value. Every time the caseid column value changes, the T-SQL code inserts into a temporary table a row that contains the cumulative product of the test result values, which is stored in the variable, and resets that variable back to 1.

Another approach is to use a temporary table and a loop instead of cursors. The solution in Listing 2 demonstrates the temporary table approach, which the biological testing facility originally used. The code uses a SELECT INTO statement to copy all the rows from the Testresults table into the #T1 temporary table, then adds another column called used and populates it with zeros. The code processes the rows in #T1 in a loop, starting with the row that has the lowest test ID and continuing with the next test ID until it has processed all rows. In each iteration of the loop, the code updates the current row's result column to its current value multiplied by the test result value of the previous processed row that has the same case ID. If the code has processed no rows with the same case ID, it multiplies the value by 1. When the loop finishes, the row with the maximum test ID within each case ID holds the product of all the test result values with the same case ID. The test result values in this specific scenario can't be negative, so the maximum test result value for each case ID in the #T1 table is the desired product, and the GROUP BY query in Listing 2 returns the desired result, which Table 1 shows.

You could also use a mathematical approach to tackle this problem. If you could use one aggregate query to produce the desired result, you'd almost certainly have better performance than the other approaches I discussed. For example, T-SQL provides the SUM() aggregate function. If you could use SUM() to calculate a product, your work would be done. I'll save you the trouble of taking out your mathematics books to search for formulas that translate addition to multiplication: You can find one such formula in the world of logarithms.

Using Logarithms

First, let's review some logarithm concepts. Given the equation a^y = x, where x is a positive real number and a is a number other than 1, the logarithm of x to base a is y. You can represent this relationship as log_ax = y. Let's call this equation EQ1. One rule of exponents says that log_a (xy) = log_ax + log_ay. This equation, which I'll call EQ2, seems promising because it turns multiplication into addition and vice versa. T-SQL provides the LOG10() function, which calculates to base 10 the logarithm of an argument you pass to it. You can sum all the LOG10() result values for each case ID as the following query shows:

SELECT
	caseid,
	SUM(LOG10(result)) AS sum_log
FROM Testresults
GROUP BY caseid

Now, let's represent the test result values in a certain case ID as v₁, v₂, ..., v_n. If you place those values into EQ2, you get log₁₀(v₁*v₂* ... *v_n) = log₁₀v₁ + log₁₀v₂ +...+ log₁₀v_n. For each case ID, you already know the result of the equation's right side. According to EQ1, if you raise 10 to the power of the right side's result, you get the product of all values for that particular case ID. In T-SQL terms, you could revise the query as follows:

SELECT
	caseid,
	POWER(10., SUM(LOG10(result))) AS 'final result'
FROM Testresults
GROUP BY caseid

This query is sufficient if all the test result values are positive, but if at least one test result value is zero, the query returns a domain error because the LOG10() function doesn't accept zero or negative values. Let's insert the following zero result value row into the Testresults table:

INSERT INTO Testresults VALUES(3, 0)

Listing 3 shows the previous query as I revised it to handle zeros. It uses a CASE expression to calculate the number of zero-value occurrences in the result column for each case ID. If at least one zero test-result value exists in a certain case ID, the script returns a zero value as the final result; otherwise, it calculates the product of all test-result values.

In the biological testing facility scenario, test results can't be negative, but you can generalize the previous query to let it support negative test result values as well. First, insert the following negative-result value row into the Testresults table:

INSERT INTO Testresults VALUES(2, -1)

The script in Listing 4 uses a CASE expression to calculate the number of negative test-result values. If the number of negative test-result values is odd, the script multiplies the product of the absolute test-result values by -1; otherwise, it multiplies the product by 1.

Now that I've demonstrated how you can use mathematics to shorten your T-SQL code and improve its performance, you can apply this type of approach to many different scenarios. Next month, I'll discuss another example that can benefit from such an approach.

Comments

Plain text