Combining Distinct and Non-Distinct Aggregates

When a new version of SQL Server is released, we tend to focus on the significant improvements, which are usually well documented. However, Microsoft typically also adds small improvements in each new version of the product that don't always make it into the documentation. For example, some small but undocumented optimization improvements can make your code run faster. Some of SQL Server's optimization improvements prevent you from needing to use workarounds, as in previous versions of the product -- so it's beneficial to know about them. My friend and colleague Herbert Albert discovered an optimization improvement related to queries that involve both distinct and non-distinct aggregate computations. In this article, I cover the suboptimal default optimization for such queries prior to SQL Server 2012 and the workaround that's required to obtain an efficient plan. I then describe the optimization improvement in SQL Server 2012 that prevents you from needing to use the workaround.

Sample Data and Task

For sample data, I used a database called Performance. You can download the source code to create and populate the sample database from http://tsql.solidq.com/books/source_code/Performance.txt.

The table of interest for our purposes is dbo.Fact. This table represents a fact table in a data warehouse and currently has 1,000,000 rows. The table has several columns representing surrogate keys of dimensions (key1, key2, and key3) and several columns representing measures (measure1, measure2, and measure3).

Suppose that you need to query the Fact table, group the rows by the key1 column, and compute two aggregates for each group: the distinct count of the values in the measure1 column and the sum of the values in the measure2 column. An index that can help perform the computations efficiently is one created on key1 and measure1 as the index keys, and measure2 as an included column. You run the following code to create the supporting index:

USE Performance;
CREATE INDEX idx_k1_m1_i_m2 ON dbo.Fact(key1, measure1)
  INCLUDE(measure2);

To measure the I/O costs of your queries, run the following code to enable the STATISTICS IO option:

SET STATISTICS IO ON;

Next, I'll describe how to handle this task prior to SQL Server 2012. I created the plans that I present in the following section to run the code in SQL Server 2008 R2. Finally, I'll describe the optimization changes in SQL Server 2012.

SELECT key1,
  COUNT(DISTINCT measure1) AS cnt_distinct_m1,
  SUM(CAST(measure2 AS BIGINT)) AS sum_m2
FROM dbo.Fact
GROUP BY key1;

Figure 1 shows the execution plan for this query, generated by SQL Server 2008 R2.

Figure 1: Plan for Query with Distinct and Non-Distinct Aggregates in SQL Server 2008 R2

Observe that the index you created to support your solution is scanned twice -- once to compute the distinct aggregate (the top branch of the plan) and a second time to compute the regular aggregate (bottom branch). The leaf level of the index has 2,854 pages; the STATISTICS IO option reports a scan count of 2 and a total number of logical reads of 5,708.

Examine the two Stream Aggregate operators in the top branch of the plan, which together compute the distinct count aggregate of the values in the measure1 column for each key1 group. The first operator (the one on the right) has the Group By property [Performance].[dbo].[Fact].key1, [Performance].[dbo].[Fact].measure1. It's responsible for creating a group for each distinct combination of key1 and measure1 values. The second operator (the one on the left) has the Group By property [Performance].[dbo].[Fact].key1; because it operates on the result of the previous grouping, the count of rows for each key1 group is in fact the distinct count of values in the measure1 column for each key1 group. Then the bottom branch of the plan rescans the index to compute the sum of values in the measure2 column for each key1 group.

Can you see the inefficiency in this plan? There's an alternative strategy that can use one scan of the data to compute both aggregates. This can be achieved if the first aggregate operator that groups the rows by key1 and measure1 computes a partial sum of the values in the measure2 column for each key1 and measure1 group, and then the second operator computes both the distinct count and sum (of partial sums) for each key1 group. With this approach, the bottom branch of the current plan isn't necessary.

Prior to SQL Server 2012, you could use a workaround to achieve a query plan that implements this strategy. The first step is to write a query that groups the rows by key1 and measure1, then compute the partial sum of measure2 for each group, like so:

SELECT key1, measure1,
SUM(CAST(measure2 AS BIGINT)) AS sum_partial_m2
FROM dbo.Fact
GROUP BY key1, measure1;

Figure 2 depicts the plan for this query, showing one scan of the index. The STATISTICS IO option reports a scan count of 1 with a number of logical reads of 2,854.

Figure 2: Plan for Query with Partial Aggregate in SQL Server 2008 R2

The second step is to define a table expression like a CTE based on the previous query, and then in the outer query group the rows by key1 and compute the count of measure1 and the sum of measure2, like so:

WITH C AS
(
SELECT key1, measure1,
SUM(CAST(measure2 AS BIGINT)) AS sum_partial_m2
FROM dbo.Fact
GROUP BY key1, measure1
)
SELECT key1,
COUNT(measure1) AS cnt_distinct_m1,
SUM(sum_partial_m2) AS sum_m2
FROM C
GROUP BY key1;

Because the outer query operates on the result of the grouping by key1 and measure1, the count of measure1 for each key1 group is actually the distinct count. This solution implements the strategy that I mentioned previously. Figure 3 shows the plan produced for this solution in SQL Server 2008 R2.

Figure 3: Plan for Query with Workaround in SQL Server 2008 R2

The STATISTICS IO option reports a scan count of 1 and a number of logical reads of 2,854 -- half the number of reads compared with the first solution.

SQL Server 2012

As it turns out, SQL Server 2012 has an optimization enhancement that implements the aforementioned strategy for the original, simple form of the query, without requiring you to use a workaround. Run the following code in SQL Server 2012:

SELECT key1,
COUNT(DISTINCT measure1) AS cnt_distinct_m1,
SUM(CAST(measure2 AS BIGINT)) AS sum_m2
FROM dbo.Fact
GROUP BY key1;

Examine the execution plan produced for this query, which Figure 4 shows.

Figure 4: Plan for Query without Workaround in SQL Server 2012

Observe that this time the index is scanned only once. The Compute Scalar operator converts the INT measure2 column values to BIGINT. Then the first Stream Aggregate operator groups the rows by key1 and measure1 and computes a partial aggregate of the converted measure2 values per group. Next, the second Stream Aggregate operator groups the rows by key1 and computes the count of measure1 values (resulting in a distinct count) and the sum of partial sums (resulting in the final sum) for each key1 group. Notice how similar this plan is to the one generated for the query with the workaround in SQL Server 2008 R2.

When you're done, run the following code to disable the STATISTICS IO option:

SET STATISTICS IO OFF;

Finally, run the following code to remove the supporting index:

DROP INDEX idx_k1_m1_i_m2 ON dbo.Fact;

What vs. How

One of the goals of SQL's designers was to create a declarative, English-like language for data management and manipulation. This means that users are supposed to specify their requests in a declarative manner, focusing on the logical aspects of the request rather than the physical aspects. The database engine is then supposed to deal with the physical, performance-related aspects. In other words, users are supposed to focus on the "what" part of the request, and the database engine is supposed to focus on the "how" part.

However, if you have even a little experience with query tuning, you know that reality is different. You know that two solutions that perform the same logical task can produce different plans. Therefore, you need to become familiar with how the database engine tends to optimize certain logical constructs, and you need to use query revisions as part of your query tuning tools.

The optimization improvement that I describe in this article is fairly small, but it's a great example of Microsoft pursuing the goal of letting developers focus on writing their requests in a declarative, simple, and intuitive manner, and having a sophisticated database engine that tries to optimize requests in the most efficient way. Such improvements reduce the need to use workarounds that complicate your solutions and that make them more difficult to follow and maintain.

Comments

Plain text