To analyze business data, you often need to get aggregate values from database tables. Typically, you can use a GROUP BY clause and an aggregation function such as COUNT() or SUM() to return the information. But what if you need multiple aggregate values for presentation or comparison in one report? For example, your boss might want to track and compare the sales of a product over time, so she requests the total number of orders for today, this month, this year, and last year.
The GROUP BY clause works when you can specify how the values should be grouped (e.g., by month, by day, by total number of orders). But specifying the basis for grouping values is possible only if you can obtain the group values directly from the column data or by applying a standard function or computation to one or more columns in the table. For example, you can apply the YEAR() function to a date column and get values grouped by year, or you can use the LAST_NAME column in the GROUP BY clause to get a grouping of names on the basis of last name.
But in many situations, GROUP BY provides no way to specify the basis for grouping and aggregating values. Say you want to perform quantity aggregation for time intervals such as today and this week. You can't use the GROUP BY clause to define a day and a week in one expression because you can't obtain today and this week by applying any one function or expression to any column data. Alternatively, you might try using a WHERE clause, which would be a slightly better choice because you could use two queries to filter the rowset for the two periods and get the desired aggregate values. However, many reports require tens or even hundreds of aggregates. When you use the WHERE clause for such reports, you have to write many queries and performance decreases. In addition, the process of extracting data from query functions and presenting the data becomes tedious as n queries return n result sets, which you then must typically convert for presentation in one comparison report.
Ideally, you need to be able to calculate multiple aggregations in one query. The more aggregations you can merge into one query, the more speed you gain because you need only one connection and because rowset iteration (reading, processing, or parsing of the table data) happens only once—regardless of the overhead you incur by creating a more complex query, as I'll explain in a moment. I call the technique I use for this kind of calculation inline conditional aggregation. This technique takes extra time for computing conditional expressions during aggregation, but the overall performance gains more than compensate for this added overhead.
The technique uses a conditional expression that you define in a CASE...END expression as the input for an aggregate function such as COUNT() or SUM() in the SELECT statement. The CASE expression lets you specify almost any valid T-SQL expression as a condition. CASE...END in T-SQL is similar to the If statement in general programming languages and to the switch...case statement in C because you use CASE...END for branching into a possible result based on evaluation of Boolean expressions. (For more information about CASE, see the "Transact-SQL" section of SQL Server Books Online—BOL). And by using inline conditional aggregation in one query, you can obtain a horizontal-row format without having to convert all your results from vertical format.
How It Works
Let's look at a simple example of how inline conditional aggregation works. For each row in a database table that a given query accesses, the code passes to the aggregate function the result of the CASE...END expression. For example, the code in Listing 1 shows a CASE...END expression that tells the query to use the value in the quantity column for aggregation only if the order date is August 10, 2003. Otherwise, the expression passes to the aggregate function a value of 0 for that row, as long as the table contains a date column (FLD_ORDER_DATE) and a quantity column (FLD_QTY).
The code in Listing 1 returns the total order quantity for the specified date without using GROUP BY to assist the aggregate function and without using a WHERE clause to filter the rowset from the table. For every row that the query processor iterates through and that reaches the SELECT clause (after being filtered through any WHERE clauses), the code evaluates the CASE...END expression. The SUM() function receives the resulting quantity value in the column only if the row data satisfies the condition. I'll walk through a more complex example in a moment. But first, let's look at some important points to remember when you're using this technique.
Inline conditional aggregation is extensible to multiple aggregates. Because the aggregation, filtering, and grouping happens within the scope of one SELECT statement, you can easily extend a SELECT statement to include more than one inline conditional aggregation column. For example, you can use one query containing multiple conditional aggregate columns to return the quantities for a week, a year, and any other periods you want to see—all at the same time and in separate columns. And you can use any aggregation function or condition as part of the SELECT statement's aggregation column. The conditions go in the CASE...END expression, and the aggregation function takes the entire CASE...END expression as a parameter. I used the SUM() function for the examples in this article. In addition, you can compute and filter within the CASE expression based on any data in the row you're evaluating.
You can combine inline conditional aggregation with the WHERE clause. You use WHERE to filter out rows that aren't relevant in any of the inline aggregations. This filtering makes your solution faster if you're computing multiple aggregates because multiple CASE...END expressions don't have to evaluate irrelevant rows. For example, if you're calculating aggregates for different periods in the current year, you could restrict the rowset to the current-year data by using the WHERE clause to eliminate data for previous years. You could instead use the CASE...END expressions to filter out data you don't need, but that approach is wasteful and slow because it ends up eliminating the rows for each CASE...END expression instead of eliminating them just once through the WHERE clause.
You can combine inline conditional aggregation with the GROUP BY clause. Although Listing 1's code works for multiple aggregates without using the GROUP BY clause, the code is also compatible with GROUP BY. Using GROUP BY to further segregate your inline aggregation, as I'll show in a moment, lets you aggregate in two dimensions or aspects at the same time. For example, say you want the total number of orders for today and this week for each product type. You can define the aggregation columns for today and this week by using inline aggregation, then use GROUP BY to further segregate the aggregates according to product type—as long as the date, quantity, and product-type columns are in the table you're working with.
As I mentioned before, inline conditional aggregation produces a table in which aggregation happens in columns such as total order quantity for today or total order quantity this week and expands horizontally as you add more aggregates. However, the GROUP BY clause produces aggregations in rows such as total order quantity for product type C1 and expands vertically. So combining the two techniques can also let you create complete summary tables from one query faster than using multiple queries for the same purpose. When I used this approach to create a production-summary report that showed total production for different product types across multiple time intervals, the query-processing time, including data collection and overall report generation, dropped from 30 seconds to 3 seconds.
A Real-World Example
Now that you understand the basics of inline conditional aggregation, let's look at a real-life example of how to use this technique to create a sales summary report for comparing product sales across time intervals. This example report includes grouping by aspects such as zone and product category. Consider the typical Order table that Figure 1 shows, which has columns for order date, product type, product ID, quantity, and other values. (You can download the script to create and populate Figure 1's table by entering InstantDoc ID 39394 at http://www.sqlmag.com.)
Say you need to produce a daily report that shows in table format the total number of orders your company has received during the following time intervals:
- This day (F_DAY)
- Up to this month in the year (F_UPTO_MONTH)
- This month (F_MONTH)
- Total this year (F_THIS_YR)
- Same period previous year (F_PREV_YR_SAME_PERIOD)
- Total in the previous year (F_PREV_YR_FULL)
The names in parentheses are the column names I used in the downloadable table-creation code. Now, assume that on August 10, 2003, you want to generate a report in the format that Table 1 shows. The table displays the format for a report that will contain data for four product categories: C1, C2, C3, and C4.
To produce the report that Table 1 shows, you need to combine inline conditional aggregation with the GROUP BY clause, as the query in Listing 2 shows. This query returns all the aggregates in a result set in Query Analyzer, which Figure 2, page 36, shows. Listing 2's code doesn't have a WHERE clause, but as I mentioned earlier, you could use a WHERE clause to filter irrelevant rows. Listing 2's query has an aggregate column for each time interval that Table 1 shows. The CASE...END expression evaluates each row of data for each column. The code adds the order quantity to each column's aggregation only if the condition in the CASE...END expression is satisfied. For example, if an order's date appears as August 10, 2003, the first aggregation column expression (F_DAY) is satisfied, the code passes the order quantity to the SUM() function, and that quantity becomes part of the aggregation for F_DAY. Similarly, the code sums the rows that contain orders for August 1 through August 10 as F_MONTH because those values satisfy the date comparison in the CASE...END expression for the F_MONTH column.
Listing 2's query also uses the GROUP BY clause to segregate product type. Understanding the GROUP BY clause's role might be easier if you knew what would happen if the clause wasn't there: The query would return only one row of data, which would include columns that have total order quantities for all the time intervals regardless of product type. In other words, without GROUP BY, you'd have aggregations for one aspect—time intervals—and you'd get one total-order-quantity value per time interval. But when you add the GROUP BY clause, the code further separates the aggregates according to product type, which means you get a total order quantity for product-type C1 for today, total order quantity for product-type C2 for today, and so on. Thus, you can aggregate for time interval and product type at the same time.
Notice that the result set in Figure 2 is laid out exactly like Table 1. I mention this because database developers often display aggregates in rows when they need the values displayed for comparison in column format. To produce the column format, I used an inline aggregation column for each time interval required (e.g., F_DAY, F_UPTO_MONTH, F_MONTH). I defined each interval by using explicit dates in the CASE expression. The GROUP BY clause works as it would for any other aggregation, automatically segregating the aggregation further for each product category.
For this example, I used dates that reflect a financial year beginning April 1 and ending March 31 of the next year. Table 1 shows the exact dates for each interval. For brevity, I won't show how the code calculates these dates, but you can use T-SQL or any other language to calculate the dates. If you pass the date that denotes today to a T-SQL stored procedure, for example, the stored procedure can then compute the remaining required dates. Given one date as a parameter, you can combine the date computation with the inline aggregation query in one stored procedure to generate all the aggregates for the report. Active Server Pages (ASP) developers, for example, can prepare the dates and the entire query by using a scripting language such as VBScript, then execute the query by using ADO in an ASP page. How you calculate the dates is up to you.
Figure 3 shows Listing 2's query plan in Query Analyzer. Query Analyzer lets you see how SQL Server will execute the operations in a query and how much time each operation will take. The query plan for Listing 2's query comprises four major operations. As you can see, most of the query's processing time (76 percent) is spent on the Clustered Index Scan operation, circled in black in Figure 3. This operation takes the most time because it iterates through the entire clustered index and retrieves the relevant rowset in which the remaining operations take place. In contrast, the Sort operation takes 24 percent of the total processing time, and grouping and aggregations take only about 1 percent of the total processing time. When you add inline aggregation to a query, the additional processing falls into this 1 percent time block.
If you compare one query that uses conditional inline aggregation with multiple queries that use GROUP BY and WHERE clauses, the queries' execution plans would be similar; each query would have Clustered Index Scan and Sort operations. But because conditional inline aggregation uses only one query, you reduce the overall processing time dramatically. For example, say one query producing one aggregate takes 3 seconds. A report that needs 10 aggregates would typically require 10 queries, so you'd need 30 seconds for the solution (10 * 3 = 30). But when you use inline aggregation, you can obtain all 10 (or more) aggregates in one query that requires roughly 3 seconds. So the solution now takes one-tenth of the time that the multiquery solution requires. This approximation might vary slightly because of factors such as caching. But in my experience, even when a data table has tens of thousands of rows, using inline conditional aggregation can reduce query time by 80 percent or more.
Inline conditional aggregation has the rare dual advantages of providing tangible gains in performance and reducing code size. This technique also demonstrates that T-SQL provides many ways to solve problems and proves that writing optimized queries isn't just a matter of knowing the SELECT statement's syntax; you also need practice and experience. When you try different methods, you can find the best technique for each situation you face.