Detailed Aggregates

In discussing what isn’t allowed within a view (as I did in my February column "Viewer Advisory," InstantDoc ID 37660), I'm often asked about the importance of COMPUTE and COMPUTE BY clauses. These aren't allowed within a view’s definition, yet they're important clauses to understand for detailed reporting. For summarizing data, many people use only GROUP BY when COMPUTE BY might be a better alternative. I’m partial to COMPUTE and COMPUTE BY, so I always take a minute to explain their use. When you want more than just the summary data (GROUP BY returns only one row for each grouping), using COMPUTE and COMPUTE BY is the best way to get the details and the aggregates.

To execute this article's examples, you can download the ComputeByExamples.sql commented script file at InstantDoc ID 38150. For simplified explanations of the results, I recommend running the examples in text mode (using Query Analyzer’s Query dropdown menu, choose the Results in Text option or Ctrl+T). Later, I explain how grid mode (the default option for Query Analyzer) works for these examples.

SQL Server supports a non-tabular way of returning aggregations over sets of data. Often, people use GROUP BY to produce summary data, but GROUP BY doesn’t always give enough detailed information. The GROUP BY clause is an ANSI-standard way of producing summary data, but it doesn’t return the details on which the summary data is based. For example, the code in Listing 1 produces one row for each ProductID. In this case, the grouping row shows the sum of items sold (quantity) for each ProductID in the Order Details table. With this query, SQL Server will return a sum of the number of products sold and produce only a sum—just one row—for each ProductID (and only for those that have rows in Order Details). For products without rows in the Order Details table, the query won’t return the ProductID and sum. For products with many sales, the query returns only one row—with the total sold (the sum of quantity).

This result set is extremely useful if you want only the total. However, if you need more details about these sales, the GROUP BY result set doesn’t provide them. For example, if you sold 580 of product ID 23, how would you know whether the total represented one order with a quantity of 580, 580 orders with a quantity of 1 each, or something in between? In some cases, you need the summary data and the details.

As an option, adding the count(), avg(), min(), and max() aggregates gives you more information about each product. But you can’t produce detailed information, such as the order on which the maximum number was sold or the customer to whom you sold it, without writing a separate query. Listing 2 shows how you can add more aggregates, but the results still give only limited information. If you’re trying to get an estimate, this is probably sufficient, but what if you want details about each order in addition to the sum? One might argue that you’re data mining here—looking for information about your data—and that programmatic data mining is better (and I’d probably agree). However, many of my customers still want paper-based or scrollable reports with the details and the sums. COMPUTE and COMPUTE BY offer a nice way to produce these reports.

COMPUTE and COMPUTE BY are T-SQL extensions whose result sets produce summary data while including all the details. The syntax mirrors this concept; instead of asking for aggregates in the SELECT list, you ask for them separately (in the COMPUTE clause) and SQL Server produces a separate result for each aggregate (a "logically" broken report). These reports might be several pages long, but you can flip through the pages and see summary data that’s easy to read because the report format breaks where the summary data exists. The summary data stands out from the rest of the report almost as if it were highlighted, but it’s really just another result set. Often, this report is called a control break report.

Technically, only COMPUTE BY breaks the report into groups, but even COMPUTE produces an extra result set for the grand aggregate (e.g., the grand total or overall average). Being able to quickly find the summary data and also have the details is the benefit of using the COMPUTE and COMPUTE BY clauses. You can combine COMPUTE and COMPUTE BY, even within the same query; COMPUTE offers grand totals and COMPUTE BY offers subtotals.

Consider the GROUP BY clause that Listing 1 shows. For each ProductID, the GROUP BY produces a sum. The result set doesn’t give you the breakdown of sales for each of these ProductIDs. In Listing 3, I removed the GROUP BY so that the query would return just the list of Product ID and Quantity for each order. In the result set, all you get is a ProductID with its quantity. However, if you’re using COMPUTE BY instead of GROUP BY, you can add attributes. Listing 4 adds the OrderDate, CompanyName, ProductName, and CategoryName to the SELECT list. Listing 4 shows a query that gathers the details; with a GROUP BY, this information wouldn't be displayed. Listing 4 gives you the details but not the sum, but what if you want both?

Here’s where COMPUTE and COMPUTE BY come in handy. For the columns you want summary data for, you can add the aggregation in the COMPUTE or COMPUTE BY clause. Before you can write this query, you must meet the first requirement of a COMPUTE or COMPUTE BY clause—including in the SELECT list the columns where you want summary data. (Note that in a GROUP BY clause, you don’t need to list the column defining the grouping.) You need to list each column only once for COMPUTE and COMPUTE BY, even if you want to see multiple aggregates of this data.

Listing 5 shows two additional clauses. The first COMPUTE clause is a COMPUTE BY, meaning that SQL Server will perform the computation for each grouping you defined in the BY portion. This query returns a sum of quantity for each Product ID, similar to the sum that the original GROUP BY in Listing 1 returned. However, this result set is significantly larger because it includes all the details about each order. The second COMPUTE clause is just a regular COMPUTE clause: It gives a grand total—in this case, the sum of all products.

The addition of the COMPUTE and COMPUTE BY columns was simple. However, I skipped a crucial requirement for using COMPUTE BY: You must supply an ORDER BY clause before the COMPUTE BY clause. For COMPUTE BY to work, the data must be ordered in such a way that the aggregate has a logical placement within the result set. For example, in Listing 5, you want to see the sum of quantity for each ProductID. To have a place to list the sum, you need to organize the rows so that all orders for a given ProductID appear together. By ordering the data, the ORDER BY logically organizes the data into groups. In the result set for Listing 5, the summary data appears physically with each grouping after each new ProductID (a sum appears after each ProductID’s details). To provide this logical ordering for COMPUTE BY, SQL Server requires the ORDER BY clause. In Listing 5, the only aggregates that you can add to the COMPUTE BY clause are those that include ProductID. Listing 6 shows the syntax for adding the average quantity ordered and the count of orders for each Product ID. If you want additional groupings and aggregations, you also need a more complex ORDER BY clause.

To produce summary data for a variety of groups, you need to state multiple attributes in the ORDER BY clause. For example, you might want a broader sum of items sold by product type (e.g., the sum of all products that have the same CategoryName) in addition to the sum of each product (by ProductID). To do this, you need to logically order the data to create groupings by category so that you can place the sum after the entire group. You could order all the data by Product and then Category or by Category and then ProductID. Both are acceptable syntactically, but the latter makes more logical sense because you’re really interested in the total sum of all products within each category and then each product individually. Listing 7 shows the changes necessary to create this COMPUTE BY statement.

Listing 7 uses multiple COMPUTE and COMPUTE BY clauses. With this query's ORDER BY clause, you can perform computations only in left-based subsets of the ORDER BY clause. The COMPUTE and COMPUTE BY clauses SQL Server supports for an ORDER BY CategoryID, ProductID statement are:

COMPUTE aggregate(quantity) BY CategoryID, ProductID
COMPUTE aggregate(quantity) BY CategoryID
COMPUTE aggregate(quantity) — for a grand total of quantity

These are left-based because they contain a subset of the ORDER BY starting from the left. Not only are all three of these supported, but all three appear in Listing 7. Listing 7's result set returns detailed data about an order (the OrderDate, CategoryName, ProductName, CompanyName, and Quantity ordered) as well as summary data about Categories and Products. It does so by using all three COMPUTE clauses possible for this ORDER BY clause.

Using text mode, review the output of this data. Next, run the query a second time using grid mode (Query, Results in Grid or Ctrl-D) to see how the results look different. In grid mode, SQL Server returns each of the aggregates in a separate grid from the details and lists them together horizontally (when you request multiple aggregates). In text mode, Query Analyzer returns the summary data vertically in the aggregate column. Grid mode collapses the data better if you’re quickly looking for the sums, but I find the sums harder to read horizontally, especially if I’m asking for multiple aggregates from multiple columns. Additionally, in both grid mode and text mode, SQL Server defines the column headers—for example, sum for sum(), cnt for count(), avg for avg()—and you can't change them. If you plan to handle the COMPUTE or COMPUTE BY result set in your own custom client application, you should work with the application thoroughly to understand how it returns this data. Some client APIs make retrieving values from nontabular result sets difficult.

In reviewing the result set from Listing 7’s code, you notice that some customers purchase the same products multiple times. You realize that what you’re really interested in is the sum of the items purchased by each customer (not individual orders by each customer), but you want to see the totals ordered by the combination of CategoryName and ProductName and by just CategoryName alone. This may seem complex, but it’s not really. You can easily combine GROUP BY with COMPUTE and COMPUTE BY. Group all the individual customer orders for a given product (removing the date of purchase), then list them with each category and product listing.

Logically, the GROUP BY clause should come before the ORDER BY and COMPUTE clauses; GROUP BY will require an aggregate in the SELECT list (to contain the sum of products a customer purchased). Finally, the COMPUTE clauses will produce an aggregate of the individual aggregates. Listing 8 shows the result of adding the GROUP BY clause to group the customer purchases. Note that the COMPUTE clauses must use SUM to gather the sums of all groups. In fact, you must use the complete expression SUM(SUM(OD.Quantity)) and not SUM('Total Purchased') because using column headers in the COMPUTE clause isn't allowed. Column headers (sometimes called aliases) are allowed only in the ORDER BY clause.

Finally, if you want date information, you can include the date in the SELECT list, as long as it’s an aggregated date such as the min or max date. Adding aggregates for date can be helpful in determining the earliest order (min(OrderDate)) or the most recent order (max(OrderDate)). Listing 9 shows a query that takes advantage of this information, showing the count of purchases, the first purchase date, and the most recent purchase date for each CompanyName.

Although COMPUTE and COMPUTE BY might appear challenging to use at first because they create multiple result sets, I recommend using them with the summary data, where other aggregates fail to provide all the necessary information. Overall, COMPUTE and COMPUTE BY are an excellent means of reporting when detailed printed reports are necessary.

Next time: Back to views… Can you modify data through a view?

Comments

Plain text