Optimizing Aggregates with DISTINCT

This came up in a conversation I had with an attendee in the TechEd conference in LA a couple of weeks ago.

Suppose that you have a GROUP BY query where you need to calculate various aggregates, including one aggregate with the DISTINCT option. As an example, consider the following query against the Sales.SalesOrderHeader table in the AdventureWorks database:

SELECT CustomerID,

COUNT(*) AS Cnt,

COUNT(DISTINCT OrderDate) AS NumDays,

SUM(TotalDue) AS TotalDue

FROM Sales.SalesOrderHeader

GROUP BY CustomerID;

The query groups the data by CustomerID, and returns for each customer the total count of orders, distinct count of order dates (number of days with order activity), and total amount due from the customer. The optimization of this query is not optimal. That is, the optimizer chooses to perform separate scans of the data for the non-distinct aggregates and the distinct aggregate. This can be seen in the showplan output for this query, shown here in abbreviated form:

|--Compute Scalar(DEFINE:(\[SalesOrder...

|--Hash Match(Inner Join, HASH:(\[Ex...

|--Compute Scalar(DEFINE:(\[Expr10...

| |--Compute Scalar(DEFINE:(\[Expr...

| |--Hash Match(Aggregate, HASH...

| |--Hash Match(Aggregate, HA...

| |--Clustered Index Scan(O...

|--Compute Scalar(DEFINE:(\[Expr10...

|--Compute Scalar(DEFINE:(\[Expr...

|--Hash Match(Aggregate, HASH...

|--Compute Scalar(DEFINE:(\[...

|--Compute Scalar(DEFINE:...

|--Clustered Index Scan...

Observe the two separate scans of the clustered index. STATISTICS IO reports a scan count of 2, and a total number of logical reads 2,792. The latter is equal to twice the number of pages in the leaf level of the clustered index.

You can achieve the same task with a query that requires only one scan of the data. To do so, write a query that groups the data by both CustomerID and OrderDate, and return for each group the count of rows (call the result column Cnt), and total amount due (call it TotalDue). Define a table expression based on this query, and then in the outer query, group the data by CustomerID alone, and calculate the final aggregates. Those would be the count of orders (calculated as the sum of Cnt), the distinct count of order dates (count of OrderDate), and total due (sum of TotalDue). Here’s the complete solution:

WITH C AS

(

SELECT CustomerID, OrderDate,

COUNT(*) AS Cnt,

SUM(TotalDue) AS TotalDue

FROM Sales.SalesOrderHeader

GROUP BY CustomerID, OrderDate

)

SELECT CustomerID,

SUM(Cnt) AS Cnt,

COUNT(OrderDate) AS NumDays,

SUM(TotalDue) AS TotalDue

FROM C

GROUP BY CustomerID;

This time, the optimizer generates a plan that scans the base data only once, groups the data first to calculate base aggregates, and then aggregates the base aggregates. Here’s the abbreviated showplan output I got for this query:

|--Compute Scalar(DEFINE:(\[Expr1005...

|--Hash Match(Aggregate, HASH:(\[S...

|--Compute Scalar(DEFINE:(\[Expr...

|--Hash Match(Aggregate, HASH...

|--Compute Scalar(DEFINE:(\[...

|--Compute Scalar(DEFINE:...

|--Clustered Index Scan...

Observe that there’s only one scan of the clustered index. Therefore, STATISTICS IO reports a scan count of 1, and a number of logical reads 1,396—half the number you got for the previous query.

Learn more from my post "Combining Distinct and Non-Distinct Aggregates," October 2012.

Cheers,

Comments

Plain text