Custom Aggregations: The Pivoting Technique

SQL Server's built-in aggregate functions (e.g., AVG, COUNT, MAX, MIN, SUM) are admittedly useful, but what if you need other types of aggregations that SQL Server doesn't provide? In "Custom Aggregations: The Cursor Technique," March 2006, I began a discussion of custom aggregations by introducing four classes of solutions and looking closely at one class in particular: the use of cursors. This time, I present a second class of solutions that rely on pivoting techniques. This article assumes that you're familiar with pivoting ver 2000 and SQL Server 2005 (with its new PIVOT operator). If you need a little background, check out my article "Pivot (or Unpivot) Your Data," June 2004.

As with the rest of the columns in this series, I'm using string concatenation as my example of a custom aggregation task. Here's a quick summary of the two sample concatenation tasks featured in the March column: The first task was to use the Northwind database to return each customer and, for each, a concatenated list of order IDs expressed as a comma-separated list of values and sorted by order ID, as Table 1 shows. The second task was to aggregate distinct, concatenated employee IDs for each customer, as Table 2 shows.

Remember that we're evaluating the usefulness of each solution based on several factors: Is the solution generic? Does the solution support an unlimited number of elements in a group, and does it have any special limitations? Can you apply the aggregation to the elements in a desired order? How well does it perform? Finally, is the solution set-based or is it iterative/procedural? Let's first take a close look at custom aggregate solutions based on pivoting, then evaluate the solution based on the aforementioned criteria, as I did with the cursor-based solutions.

Pivot-Based Custom Aggregates

The logic behind using pivoting to calculate custom aggregates is to rotate multiple items in a group (e.g., the group of rows you're aggregating) into one row, then apply the calculation as a linear calculation. For example, to perform string concatenation of multiple order IDs per customer, you can pivot the order IDs such that they end up in one row, then concatenate them by using the + operator. Similarly, you can achieve other types of custom aggregations, such as a product of elements (by using the * operator) or an aggregate bitwise operation (by using the relevant bitwise operator).

The primary problem of using the pivot operator is that you need to know exactly which values you want to rotate. In the case of this article's example, we want to rotate order IDs, but we don't know order IDs ahead of time. Worse, order IDs differ from customer to customer, so you can't simply specify all of them as the pivot operator's input list.To solve this problem, you can calculate row numbers based on a desired order (in our case, OrderID order). My article "Calculating Row Numbers in SQL Server 2005," April 2004, describes how to calculate row numbers in both SQL Server 2000 and SQL Server 2005. After you assign row numbers to the orders, you can use those row numbers in your pivot expressions.

String Concatenation in SQL Server 2000. Listing 1 shows the solution for SQL Server 2000, which concatenates order IDs for each customer to produce the result that Table 1 shows. The query that creates the derived table D calculates row numbers (in the result column n) for orders per customer, sorting based on OrderID.The outer query then groups the rows by CustomerID and uses a series of MAX(CASE...) expressions to pivot the order IDs—based on their row numbers— from the group into the same target row. The SELECT clause concatenates the pieces into one result string.

The main limitation of this solution is that it imposes a limit on the number of elements in the group. In the case of our example, I've assumed that you need to concatenate no more than five order IDs. If you need to support a large number of elements per group, you'll end up with a huge query string. Furthermore, the cost of using a subquery for the row-numbers calculation increases in an n2 manner based on the group size (where n is the number of rows in the group), even when you have a good index in place (i.e., the one on CustomerID, OrderID). So, this solution is probably most practical if you know ahead of time that you have a fairly small number of elements per group (e.g., a dozen).

String Concatenation in SQL Server 2005. Listing 2 shows the solution for SQL Server 2005. Note that SQL Server 2000's sample Northwind and pubs databases aren't available in SQL Server 2005. Before running the queries in this article, make sure you download and run the installation scripts for these sample databases, which are available at the Microsoft Download Center. Again, the query creating the derived table D assigns row numbers to the orders based on OrderID sort per customer. Now, you use the ROW_NUMBER function, which is both simpler and substantially faster than SQL Server 2000's row-numbers calculation. SQL Server needs to scan the data only once to calculate row numbers.

Using the new PIVOT operator, the outer query pivots the order IDs based on the row numbers that were assigned to them. In the SELECT list, you use the ISNULL function to substitute NULL order IDs with empty strings before concatenating them; if you attempt to concatenate a NULL order ID to a string, you'll end up with a NULL result. The new PIVOT operator allows for a shorter query string than the one in SQL Server 2000. So, you can support more items per group, but practically speaking, you're still limited to a fairly small number. Imagine what the query string would look like if you had to support thousands of items per group.

Distinct String Concatenation in SQL Server 2000. If you're interested in returning the concatenated list of employee IDs per customer, the solutions need to be revised only slightly. Listing 3 shows the solution for SQL Server 2000.The revisions are necessary because you can have multiple occurrences of each CustomerID, EmployeeID combination in the orders, but you care about distinct combinations. So, you add a DISTINCT clause to the query that creates the derived table D and use a COUNT(DISTINCT EmployeeID) instead of COUNT(*) to calculate the row numbers.

Distinct String Concatenation in SQL Server 2005. Similarly, you'll need to adjust the SQL Server 2005 solution, as Listing 4 shows. Instead of querying the Orders table directly, you query a derived table (D1) that contains the distinct combinations of CustomerID, EmployeeID.

Evaluating the Pivot Solution

One benefit of the pivot-based solution is that it's generic.You can treat it as a template and use similar code when you need to implement other custom aggregates. Second, you can apply the aggregation to the elements in a desired order (by creating row numbers in a certain order).Third, the solution is pretty fast: It doesn't involve a lot of I/O, and it utilizes a set-based approach. In SQL Server 2005, it's faster because the data is scanned only once. The ROW_NUMBER function is efficient and relies on one scan of the data to calculate the row numbers. A PIVOT query is a lot like a GROUP BY query behind the scenes; you're looking at a query whose costs are very similar to that of a regular GROUP BY query that scans the data only once. In SQL Server 2000, the solution is a bit more expensive, primarily because of the row-numbers calculation. However, remember that this solution is relevant to small groups only, and if you have a good index in place, row-numbers calculation will be quick. In our case, you'd want an index on CustomerID, OrderID.

The main disadvantage of this solution is that it supports only a limited number of elements in a group. It's not practical to have a very long list of elements in terms of the length of the query string—especially in SQL Server 2000, in which you have a lengthy MAX(CASE...) expression for each item. In SQL Server 2005, the query string is shorter, so you can support more elements per group. But it still isn't realistic to support more than a few dozen items per group. In the next column in this series, I'll continue exploring other classes of solutions to custom aggregates. In the meantime, check out this month's logical puzzle and test your troubleshooting skills in "Catch That Bug!"

Comments

Plain text