Grouping Sets, Part 1

Downloads
96805.zip

I want to devote this article to a cool concept that's been around for a while but has gained some new functionality in SQL Server 2008. I'm talking about grouping sets, which are very useful for analyzing aggregated data. They let you provide dataanalysis solutions against the relational database as alternatives to Analysis Services when appropriate. Now, I know many of you haven't even thought about checking out SQL Server 2008 yet, but I would urge that you do so. What better way to be prepared for the new version when it arrives? Check out the Web-exclusive sidebar "Getting Your Feet Wet in SQL Server 2008" (InstantDoc ID 96802) for information about downloading a Community Technology Preview (CTP) of SQL Server 2008.

With that in mind, let's examine grouping sets. All the code samples in this series of articles about grouping sets will be against an Orders table created in the tempdb database. (Assuming you have the latest CTP version of SQL Server installed, you can run all the code samples that I'll provide.) Run the code in Listing 1 to create the Orders table and populate it with sample data. As an aside, those of you with a keen eye will notice something special about the INSERT statement in Listing 1 - it utilizes a small but very handy new T-SQL feature in SQL Server 2008 called Row Value Constructors. That is, you can now insert multiple rows into a table with a single INSERT VALUES statement. You simply encapsulate each row's set of values in parentheses and separate the rows with commas.

Grouping Sets

Whenever learning about a new feature, a good way to start is to understand what kind of problems that feature helps you solve and how you solved such problems in the past. A grouping set isn't a new concept - it's nothing more than the set of elements/attributes that you group by. For example, the set of attributes in any traditional GROUP BY query is a grouping set. SQL Server 2008 introduces a new option/subclause called GROUPING SETS, which lets you define multiple grouping sets in the same query. Logically, a query with the GROUPING SETS option produces one result set as if unifying the result sets of multiple GROUP BY queries. It's not that you couldn't achieve the same thing in the past, but with the GROUPING SETS option, you can write much less code, your query will be much more efficient, and you'll have more flexibility than in the past.

As an example, suppose you need to query the Orders table and return total quantity values for each (customer, employee and order year), and for each (customer and year), and for each (employee and year), and also for (the grand total). Yes, you can obtain all result sets with four GROUP BY queries, as you see in Listing 2, page 26. But suppose you want to unify all four result sets into one. In other words, you want one result set unifying four grouping sets: (customer, employee, year), (customer, year), (employee, year), and (). One option is to use UNION ALL set operations between the sets. Because all four result sets originally have different structures/schema, and set operations require that all input sets have the same schemas, you can use NULLs as placeholders for the missing elements. Listing 3 shows the code that unifies all four result sets. The problem with this solution is that the code is lengthy and is very inefficient because the base data (the rows in the Orders table) is going to be fully scanned four times.

SQL Server 2008 lets you write much shorter and more efficient code by using the GROUPING SETS option in the GROUP BY clause, where you simply list your desired grouping sets. Listing 4 shows the new, shorter, and more efficient alternative to Listing 3's solution. Web Table 1 (http://www.sqlmag.com, InstantDoc ID 96805) shows the output of the query in Listing 4.

Note the use of empty parentheses, signifying a grouping set with no attributes. As an aside, I never felt quite comfortable with the fact that when I wanted to aggregate the entire input set of rows, I had to omit the GROUP BY clause altogether. For example, in the past, when you wanted to return the grand total quantity, you wrote a query such

SELECT SUM(qty) AS totalqty
FROM dbo.Orders;

It always felt a bit awkward to call such a query a GROUP BY query; it's a GROUP BY query, but the GROUP BY clause is implicit. As a byproduct of adding native support for multiple grouping sets, SQL Server 2008 lets you write the following explicit alternative to the above query:

SELECT SUM(qty) AS totalqty
FROM dbo.Orders
GROUP BY ();

Back to the query in Listing 4 using the GROUPING SETS option, it's evident that the code is much shorter than the code in Listing 3, with the UNION ALL set operations. As for efficiency, the beauty of the SQL Server 2008 solution is that the base data (the rows in the Orders table) needs to be scanned fewer times, because SQL Server can calculate higher-level aggregates on top of lower-level aggregates. I'll leave you with a full examination of the execution plans for both solutions as an exercise because I find them to be clear and straightforward. But I want to highlight a couple of key points. In the plan for the code in Listing 3, you'll find four branches (one for each GROUP BY query). Each branch starts with a clustered index scan, followed by one aggregate operator operating on the whole input set. (There will also be either sorting or hashing to support the aggregate operation.)

In the plan for the code in Listing 4, you'll find fewer clustered index scans (two in this case), and in some (or all) branches, you'll find multiple aggregate operators, where the non-first aggregate operator in the brunch aggregates the pre-aggregated data. If you compare the costs of the two plans, you'll find that the one for the code in Listing 4 is substantially cheaper than the one for the code Listing 3.

1 vs. Multiple Grouping Sets

Every casual GROUP BY query can be thought of as having a single grouping set. For example, the following two queries are logically equivalent:

-- Query 1
SELECT custid, empid, YEAR(orderdate)
AS orderyear, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY custid, empid,
YEAR(orderdate);

-- Query 2
SELECT custid, empid, YEAR(orderdate)
AS orderyear, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY GROUPING SETS( (custid, empid, YEAR(orderdate)) );

Note that you shouldn't confuse the use of GROUPING SETS in Query 2 with the following:

-- Query 3
SELECT custid, empid, YEAR(orderdate)
AS orderyear, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY GROUPING SETS( custid, empid,
YEAR(orderdate) );

Syntactically, all I did in Query 3 was remove one layer of parentheses; however, logically, I defined three grouping sets: (custid), (empid), and (YEAR(orderdate)) as opposed to one in Query 2: (custid, empid, YEAR(orderdate) ). As I mentioned earlier, Query 2 is logically equivalent to Query 1 (a single traditional GROUP BY query), whereas Query 3 is logically equivalent to UNION ALL operations between three traditional GROUP BY queries:

SELECT ... GROUP BY custid
UNION ALL
SELECT ... GROUP BY empid
UNION ALL
SELECT ... GROUP BY YEAR(orderdate);

Continued on page 2.

Run the three queries above and see for yourself. Query 1 and Query 2 produce the output that Web Table 2 shows, and Query 3 produces the output that Web Table 3 shows. In other words, within the parentheses of the GROUPING SETS option, commas are used as a separator between grouping sets. Within a particular grouping set, commas are used as a separator between the grouping set's elements or attributes. For clarity, you might want to use parentheses surrounding each grouping set even when the grouping set contains a single element. For example, the following is logically equivalent to Query 3:

SELECT custid, empid, YEAR(orderdate)
AS orderyear, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY GROUPING SETS
(
( custid ),
( empid ),
( YEAR(orderdate) )
);

Mixing Grouping Sets and Grouping Sets Algebra

You can have multiple GROUPING SETS subclauses in the same GROUP BY clause. If you specify multiple ones, something interesting will happen: You'll get a Cartesian product or a cross of the grouping sets in the various GROUPING SETS subclauses. This probably sounds very cryptic, and would be best explained through examples. Given grouping sets (A), (B), (C), (D), and (E), where each capital letter represents a set of elements in the grouping set (e.g., A stands for the set of elements a1, a2, ..., an), the following pseudo GROUP BY clause:

GROUP BY
GROUPING SETS( (A), (B), (C) ),
GROUPING SETS( (D), (E) )
is logically equivalent to
GROUP BY
GROUPING SETS( (A, D), (B, D), (C,
D), (A, E), (B, E), (C, E) )

How about a more tangible example? The code in Listing 5 has two GROUPING SETS subclauses. The first GROUPING SETS subclause has three grouping sets:

GROUPING SETS
(
( custid, empid ),
( custid ),
( empid )
)
The second GROUPING SETS subclause has two
grouping sets:
GROUPING SETS
(
( YEAR(orderdate),
MONTH(orderdate) ),
( YEAR(orderdate)
)
)

You get six grouping sets as a result of the Cartesian product between the three and the two. Listing 6 shows the logical equivalent to the code in Listing 5. All six grouping sets are specified in one GROUPING SETS subclause.

Finally, if all grouping sets have a common element, you can pull them outside all GROUPING SETS subclauses into their own GROUPING SETS subclause, and by doing so, shorten your code. For example, consider the following pseudo GROUP BY clause:

GROUP BY GROUPING SETS( (a, b, c, d),
(a, b, c), (a, b, d), (a, b) )

All grouping sets have the common elements "a, b,"- hence, this can also be expressed as

GROUP BY GROUPING SETS ( (a, b) ),
GROUPING SETS( (c, d), (c), (d), () )

Because GROUP BY GROUPING SETS ( (a, b) ) is logically equivalent to GROUP BY a, b, you can also express this as

GROUP BY a, b, GROUPING SETS( (c, d),
(c), (d), () )

As a more tangible example, the queries in Listings 5 and 6 are logically equivalent to the query in Listing 7; the common element YEAR(orderdate) was pulled out.

What's Next?

If grouping sets seem to be somewhat familiar, it's not by chance. You might have already realized that the GROUPING SETS option is similar to the CUBE and ROLLUP options in previous versions of SQL Server. In fact, you can consider CUBE and ROLLUP as special cases of the GROUPING SETS option. However, the CUBE and ROLLUP options were implemented in a rigid/non-flexible way prior to SQL Server 2008 and are also non-ISO compliant. SQL Server 2008 introduces new, ISO-compliant, more flexible CUBE and ROLLUP options to replace the older ones. In a later article, I'll tell you all about the new CUBE and ROLLUP options, and also about the new GROUPING_ID function that you'll probably find useful in conjunction with grouping sets.

In future articles, I'll cover other T-SQL enhancements in SQL Server 2008, including the new MERGE statement, Table-Valued Parameters, the HEIRARCHYID datatype, and temporal datatype enhancements (DATE, TIME, DATETIME2, DATETIMEOFFSET datatypes).

Download associated code.

Comments

Plain text