Algorithms Still Matter

24146.zip

For some search problems, T-SQL provides a not-so-obvious solution: a binary search

If you're an experienced T-SQL programmer, you've spent a lot of time honing your skills with set-based solutions. Therefore, you might think you have little use for the procedural coding techniques that you learn in classic texts such as Donald Knuth's The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd edition (Addison-Wesley, 1998) and Jon Bentley's Programming Pearls, 2nd edition (Addison-Wesley, 1999). However, for some problems, you might need to look into these books again.

Why would you need to return to textbooks about searching and sorting? After all, you use SQL Server to manage data because it does the searching and sorting for you. However, you might encounter a class of search problems that SQL Server's search features don't directly handle.

Even with SQL, relational databases don't lend themselves readily to solving problems that require computing a solution on an ordered list of rows. Such a solution requires list processing, but SQL is a set-processing tool. These two processing methods are so different that Microsoft added the TOP operator in SQL Server 7.0 to handle the common need for a ranked list of a predetermined size. Solutions to list-based problems often require procedural code, either on the server or on the client. Let's look at three effective solutions you can use for list processing: a single statement, a cursor, and a binary search.

Defining the Problem

Imagine you need to find a row in which the sum of all the previous rows in an ordered list reaches a target value. For example, if you have 10 rows sorted by the values 1 through 10, the sum of rows 1 through 3 is 6 (1 + 2 + 3) and the sum of rows 1 through 10 is 55 (1 + 2 + ... + 10). If you want the row in which the sum is 15, you need to find row 5.

An order-fulfillment system for a product distributor might present this kind of problem. The distributor would have a high volume of orders for a broad range of products and would tightly manage stock to keep inventory costs low. Therefore, items would move in and out of stock rapidly. When a backordered item came back in stock, the distributor would want to get the merchandise to the customers who ordered it as quickly as possible.

Of the several standard models for filling backorders, the simplest model is first in/first out—providing the stock strictly according to the age of the orders. A hierarchy might modify this model by requiring that you fulfill preferred customers' orders first. Other models such as equal sharing (try to give one to everybody) and proportional sharing (the bigger the order, the bigger the share) often produce a remainder that the distributor sends out according to some version of the first in/first out method. A distributor will want to know how many orders it can fill before it runs out of the product again and whether the last order it can fill will be complete or partial. Thus, the search problem is how to quickly find the row in an ordered list that receives the last piece of stock.

In a database with a lot of activity and a lot of data, you need to examine the algorithm that you use to find the last-filled row and tune it for speed and efficiency. If the data set is small and the processing is simple, you need only a simple algorithm. (For example, the SQL Server query optimizer follows the principle of using the simplest method when it chooses to do a table scan instead of the more complex process of using an index.)

To limit the search problem for this article, I use the oversimplified table definition that Listing 1 shows and assert that the primary key column—OrderLineItemID—defines the sequence in which the line items were ordered. (A real OrderLineItem table would have such columns as foreign keys to the item ordered and the order header. Also, the order header would probably include the date of the order.)

To test the three solutions to the search problem, I've filled the table with 20,480 rows—a realistic number of orders for a product from a major distributor. The OrderLineItemID column's identity increment value of 3 lets me demonstrate gaps in the sequence, and the ToFillQuantity values range between 1 and 10 so that I avoid too much simplification. The sum of all quantities ordered is 112,640.

Computational Orders of Magnitude

When you're evaluating the performance of the three solutions to this problem, it helps to understand what computational orders of magnitude tell you about the solutions. (You might have been exposed to this information in a computer science course or in your own reading.)

The computational order of magnitude—which the O-notation of O((N)) represents—is a measure of the number of steps taken to achieve an answer, not a direct measure of performance. Reducing the order of magnitude (i.e., the number of steps) might require more complicated steps that each take more time. (A step isn't a line of code but is a single logical operation—such as an iteration through a loop—that might take several lines of code.) But fewer steps usually means a faster answer.

For example, a simple loop through a sorted array of N elements to find a particular element averages on the order of N/2 steps, or O(N/2). A binary search of the same array averages on the order of the natural logarithm of N steps, or O(Log2 N). (A binary search looks for the desired answer by selecting the midpoint of the array, then repeatedly halving the section of the array above or below that point—depending on where the answer lies—until it reaches the answer.) Although the binary-search code is more complex, it finds the answer much faster than the simple loop. As Table 1 shows, the computational order of magnitude makes an enormous difference in the number of steps and in performance. For an O(N) or O(N/2) algorithm, every time N doubles, so does the time to find the solution. For an O(N2) algorithm, every time N doubles, it squares the time needed to find a solution. However, for an O(Log2N) algorithm, you must multiply N by 10 before doubling the time to find a solution.

Compare the difference between the simple loop and the binary search with how SQL Server would search for a specific row in the sample OrderLineItem table that Listing 1 creates. Without the index that the PRIMARY KEY constraint creates, SQL Server would perform a table scan and process an average of 10,240 rows—O(N/2)—to find a primary key. However, with the index that the PRIMARY KEY constraint creates, SQL Server examines index entries across two or three index levels, then accesses one row from the table. SQL Server's algorithm, which searches the index tree to find the exact row in the leaf page, uses the extra complexity of the index to great advantage, making the index faster than either the O(N/2) table scan or an O(Log2N) binary search.

Because the solutions in this article must sum a value from all qualifying rows, I work with two different measures of magnitude: the number of comparisons needed and a summation of the number of rows processed for each comparison.

A Single-Statement Solution

One solution to this problem requires no looping code. Listing 2 shows a simple way to find the target row by using one T-SQL statement. This query uses a self join to get the sum of the ToFillQuantity for a row and every row that comes before it in the ordering sequence that OrderLineItemID defines. (Thus, row 1 joins with itself, row 2 joins with rows 1 and 2, row 3 joins with rows 1 through 3, and so forth.) The TOP and ORDER BY clauses instruct the query to process the rows as an ordered list. The first row that satisfies the HAVING clause (the TOP 1 of the SELECT statement) completes the query, so SQL Server won't process the rest of the table.

This is concise T-SQL code. Because the answer is frequently in the middle of the table, this algorithm involves O(N/2) comparisons. However, because every row joins with every previous row in the table, the query processes O((N/2)2) rows. Therefore, if you get the answer to a query in 10,240 comparisons, the query will still process 104,857,600 rows (10,240 * 10,240). This exponential increase in query processing time is unacceptable.

The Obvious Second Solution: A Cursor

Perhaps the most obvious solution is to use a cursor to find the target row. This type of problem is why cursors were invented. Server-side cursors are fast, and as Listing 3, page 4, shows, the code is simple and straightforward. The code in Listing 3 opens an ordered set of rows, processes each row until it finds the answer, then returns the result.

This sequential search through the rows of a result set executes an average of O(N/2) comparisons and accesses an average of O(N/2) rows. Therefore, if you get the answer to a query in 10,240 comparisons, the query would process only 10,240 rows. You can easily predict this algorithm's performance: Its runtime doubles with every doubling of the rows it must search. For a small set of rows, this solution is fine. However, for our sample, the query will read at least 10,240 rows as the cursor individually fetches them. Even before you measure this solution's performance, intuition tells you that the overhead of the cursor fetches adds up quickly.

The Less Obvious Solution: A Binary Search

Off you go to the bookshelf. The classic answer to the problem of finding a position in a sorted list is to use a binary search. Knuth and Bentley both discuss the difficulty of correctly coding a binary search. Therefore, the code that Listing 4 shows includes comments about the assertions it makes, which carefully follow Bentley's assertions in Programming Pearls.

As I mentioned earlier, a binary search performs O(Log2 N) probes into the data to find an answer. The binary search divides the range of keys to query in half, then adds or subtracts the sum of all the rows in the range, depending on whether the previous midpoint calculation is higher or lower than the desired answer. The binary search sums some rows several times, perhaps on every query, and might never sum other rows. The code in Listing 4 doesn't use ORDER BY because row order doesn't matter as long as only rows in the proper sequence range are in each loop. The SUM() function leverages the power of set-based operations to calculate the solution. All the sorting needed to support this search is handled by the MIN() and MAX() functions as they operate from the current midpoint of the data set. MIN() and MAX() also make the code indifferent to gaps in the sequence. You can see this indifference if you seed and increment the primary key by 3 instead of by 1 as Listing 1 shows. In a more complex system, a datetime or a date followed by a line-item sequence might define the search order. In such a system, calculating the midpoints for the binary search would still be possible, although the calculation would be more complex.

The performance of the binary search doesn't depend on which row has the answer but instead depends on the total size of the data set to be queried. Computing the percentages of the data set processed in each summation as 50 percent + 25 percent + 12.5 percent + 6.25 percent + ... + (1/N) percent shows that the binary search will process O(N) rows. Therefore, the search will perform about 10 comparisons (the code I wrote increases this number to about 15) and will process 20,480 rows. The condition that the final IF statement in Listing 4 handles could be handled inside the binary search loop, but the condition decreases performance in that position.

Gauging Performance

So, how well did these solutions perform? I ran these T-SQL solutions in SQL Server 2000 on a 266MHz Pentium II machine with 176MB of RAM. Table 2, page 6, compares the performance of the solutions.

When I perform a search to let the product distributor know how many orders it can fill before it runs out of the product, the binary search's performance is equal to the cursor code's performance when I process about 420 rows. How can this result be accurate when the binary search accesses all 20,480 rows in the ordered set? This result is a first-hand measure of the value of the set-based power that SQL Server provides. In this example, the binary search is about 24 times faster than the cursor solution when solving for row 10,240, and 53 times faster when solving for row 20,480. (The binary search's performance is equal to the self join when I process about 150 rows.)

On my test machine, the cursor solution processes a row about every 208 microseconds (i.e., 0.208ms), procedurally summing the quantity to fill from each row. The binary-search solution that uses the power of the set-based SUM() function sweeps through this table at about 4 microseconds per row—an increase of about 1.7 orders of magnitude. Thus, adding some complexity to the code to use the power of the relational database's set-based operations pays excellent dividends.

However, a faster solution exists. One way you can speed up standard searches is by paying careful attention to how you index a table. In solving the product distributor's stock-filling problem, you can spend a little extra time during inserts and updates to the order line items to keep some derived data that would assist you when filling stock. For example, for each item, you could maintain a grand total of all order quantities and a total of order quantities for each day. Logically, this derived data would be like a two-level index. When the stock is limited, the total of each day's quantities ordered helps to narrow down the search for the solution row. The grand-total summary value shows when enough stock is on hand to fulfill all the orders, thus eliminating the need for a search. The fastest executing code is code that never executes.

The classic algorithms for solving thorny programming problems can still be valuable in solving the database-search problems that you face today. T-SQL is a programming language, and algorithms still matter. You just need to recognize when you can use and improve on the classics.