MAX and MIN Aggregates Against Partitioned Tables

When requesting a MAX or MIN aggregate of a column in a nonpartitioned table, provided that you have an index on the aggregated column, the optimizer uses a plan that scans only one row in the leaf of the index. Consider the following query as an example:

SELECT MAX(col2) FROM dbo.T1;

Suppose that T1 is a nonpartitioned table and that you created a nonpartitioned index on T1(col2). The optimizer will generate a plan that scans the leaf of the index from the tail backwards and stops after one row. Similarly, if you ask for MIN(col2), the optimizer will generate a plan that scans the leaf of the index from the head forward and stops after one row. In both cases the plan is extremely efficient and involves very little work.

Partitioned tables and indexes are different; unless you apply the aggregate to the partitioning column, you’ll get an inefficient plan in both SQL Server 2008 and SQL Server 2005. In this article I explain how MAX and MIN aggregates against partitioned tables are optimized, and I provide a better-performing workaround.

Sample Data and Test Environment

Run the code in Listing 1 to generate the sample data that I use in my example. Note that this script will take a few minutes to run. The first part of the code in Listing 1 creates a database called TestMinMax. The second part creates a helper table function called GetNums, which returns a sequence of integers in the range 1 through @n, where @n is an integer input. The third part creates a partitioned table called T1, with the partitioning column being col1, and populates it with 10,000,000 rows in five partitions. The code creates a clustered index on col1, partitioned the same as the table by col1, as well as a nonclustered index on col2, also partitioned the same as the table by col1.

My test machine has an Intel Core i7 processor (quad core with hyperthreading—total of eight logical processors), 4GB of RAM at 1333MHz, and a single 7200RPM hard drive. The execution plans and performance measures that I present are for queries run against SQL Server 2008 SP1 but are similar for SQL Server 2005 SP3.

Querying the Partitioned Table

If you ask for a MIN or MAX aggregate of the partitioning column (col1 in our case), you get an efficient plan that scans only one row in the first or last nonempty partition. As an example, consider the following query (call it Query 1) and its execution plan, which Figure 1 shows.

Figure 1: Execution Plan for Query 1

 SELECT MAX(col1) AS mx

FROM dbo.T1;

If you examine the properties of the Clustered Index Scan operator, you’ll notice that only one partition of the index idx_col1 is accessed (partition 5), and only one row is scanned in that index. I got the following I/O and time statistics for this query on my test machine:

I/O: Table 'T1'. Scan count 1, logical reads 3

Time:  CPU time = 0 ms, elapsed time = 80 ms.

This plan is extremely efficient.

The results are different if you ask for a MIN or MAX of a column that isn’t the partitioning column. For example, there’s an index on col2 that’s partitioned by col1—the same way the table is partitioned. Consider the following query (call it Query 2) and its execution plan, which Figure 2 shows:

SELECT MAX(col2) AS mx

FROM dbo.T1;

Note in the execution plan that the leaf levels of all partitions of the index idx_col2 are scanned in full—a total of 10,000,000 rows. I got the following performance statistics for this query on my test machine:

I/O: Table 'T1'. Scan count 11, logical reads 13798

Time: CPU time = 3183 ms, elapsed time = 2739 ms.

This plan is very inefficient; it ends up scanning 10,000,000 rows.

A better strategy would be to calculate a local, or partial, MAX(col2) aggregate within each partition by scanning only one row from the tail of the local index leaf, then calculate the global, or final, aggregate on top. Note that SQL Server uses a similar strategy when parallelizing aggregate calculations in nonpartitioned cases but doesn’t use such logic with partitioned tables.

To work around this problem, add a filter to the aggregate query that restricts it to only one partition, using the following form:

WHERE $PARTITION.() =

In this case, the optimizer will produce an efficient plan that scans only one row from the edge of the local index. As an example, consider the following query (call it Query 3) and its execution plan, which Figure 3 shows:

Figure 3: Execution Plan for Query 3

SELECT MAX(col2) AS pmx

FROM dbo.T1

WHERE $PARTITION.PF1(col1) = 1;

Only one partition (partition 1) of the index idx_col2 is accessed. The Index Scan operator scans idx_col2 from the tail backwards, and the Top operator stops the scan after one row. This strategy is efficient. I got the following performance measures for this query:

I/O: Table 'T1'. Scan count 1, logical reads 3

Time: CPU time = 16 ms, elapsed time = 56 ms.

To address the original need to calculate the aggregate against the entire table, you can write a query against a table that holds all partition numbers (call it P) and use the APPLY operator to apply the logic presented in Query 3 to each row from P. Then you can apply a global aggregate of all local aggregates on top. If the set of partitions involved is small and static, you can construct P as a virtual table made of constants. In SQL Server 2008 this can be achieved with the enhanced VALUES clause like so:

SELECT MAX(A.pmx) AS mx

FROM (VALUES(1),(2),(3),(4),(5)) AS P(partition_number)

  CROSS APPLY ( SELECT MAX(T1.col2) AS pmx

                FROM dbo.T1

                WHERE $PARTITION.PF1(T1.col1) = P.partition_number ) AS A;

This query (call it Query 4) generates the execution plan in Figure 4.

Figure 4: Execution Plan for Query 4

The plan implements the strategy that we hoped to see originally; the Constant Scan operator scans the five constants representing the five partition numbers in our table P, and for each, applies the aforementioned logic that calculates the local aggregate efficiently with only one row scanned in each of the local indexes. The second Stream Aggregate operator represents the global aggregate on top of the local ones. This plan is very efficient. I got the following performance measures for this query:

I/O: Table 'T1'. Scan count 5, logical reads 15

Time: CPU time = 15 ms, elapsed time = 134 ms.

Using the VALUES clause to define a derived table is a new capability in SQL Server 2008. In SQL Server 2005, use a series of UNION ALL operations like so:

SELECT MAX(A.pmx) AS mx

FROM (SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3

      UNION ALL SELECT 4 UNION ALL SELECT 5) AS P(partition_number)

  CROSS APPLY ( SELECT MAX(T1.col2) AS pmx

                FROM dbo.T1

                WHERE $PARTITION.PF1(T1.col1) = P.partition_number ) AS A;

In many cases, the list of partitions isn’t static; instead, partitions are added and removed fairly frequently. So instead of using a constant list of partitions, you can simply query the sys.partitions view dynamically like so:

SELECT MAX(A.pmx) AS mx

FROM sys.partitions AS P

  CROSS APPLY ( SELECT MAX(T1.col2) AS pmx

                FROM dbo.T1

                WHERE $PARTITION.PF1(T1.col1) = P.partition_number ) AS A

WHERE P.object_id = OBJECT_ID('dbo.T1')

  AND P.index_id = INDEXPROPERTY( OBJECT_ID('dbo.T1'), 'idx_col2', 'IndexID' );

This query (call it Query 5) queries the sys.partitions view, filtering only the partitions associated with the index idx_col2 in the table dbo.T1. The rest is the same as in Query 4. Figure 5 shows the execution plan for Query 5.

As you can see, this plan is very similar to the one for Query 4, only instead of the Constant Scan operator, you see a Clustered Index scan of the clustered index on the system table sys.sysrowsets. I got the following performance measures for this query:

I/O: Table 'T1'. Scan count 5, logical reads 15, Table 'sysrowsets'. Scan count 1, logical reads 2

Time: CPU time = 0 ms, elapsed time = 118 ms.

You can calculate a MIN aggregate in a similar manner:

SELECT MIN(A.pmn) AS mn

FROM sys.partitions AS P

  CROSS APPLY ( SELECT MIN(T1.col2) AS pmn

                FROM dbo.T1

                WHERE $PARTITION.PF1(T1.col1) = P.partition_number ) AS A

WHERE P.object_id = OBJECT_ID('dbo.T1')

  AND P.index_id = INDEXPROPERTY( OBJECT_ID('dbo.T1'), 'idx_col2', 'IndexID' );

For Now, the Workaround Works

In this article I describe an optimization shortcoming in SQL Server 2008 SP1 and SQL Server 2005 SP3, related to MAX and MIN aggregate calculations against a partitioned table. With partitioned tables, the optimizer doesn’t use an index on the aggregate column efficiently unless the aggregated column also happens to be the partitioning column. I provide a workaround that results an efficient plan. Hopefully, Microsoft will enhance SQL Server in the future to address such calculations more efficiently without the need for a workaround.

Comments

Plain text