Grouping Sets, Part 2

Downloads
97007.zip

Last month, in “Grouping Sets, Part 1,” I described grouping sets and introduced the new SQL Server 2008 GROUPING SETS subclause of the GROUP BY clause. This month, I describe three more new SQL Server 2008 features related to grouping sets: the new CUBE and ROLLUP options and the GROUPING_ID function. In last month’s examples, I used the Orders table, which you create and populate by running the code in Web Listing 1, and this month I use the same Orders table in my examples. As always, be sure to read last month’s article as a prerequisite to this one.

CUBE and ROLLUP

You might be familiar with the CUBE and ROLLUP options in versions of SQL Server prior to SQL Server 2008. You specified these in a separate WITH clause following the GROUP BY clause, they were very rigid, and they weren’t ISO compliant. The old CUBE and ROLLUP options produced specific grouping sets that you couldn’t combine with other grouping sets in the same query. SQL Server 2008 introduces new, flexible CUBE and ROLLUP subclauses; that is, you can combine them with other grouping sets in the same query, as I’ll demonstrate shortly. Also, the new CUBE and ROLLUP subclauses are ISO compliant.

Think of the new CUBE and ROLLUP subclauses as abbreviations of lengthier GROUPING SETS subclauses. CUBE gives you something akin to (in set theory) a power set. Given a set of attributes as input, CUBE produces all possible permutations of grouping sets out of those attributes, including an empty grouping set. Out of n attributes, CUBE produces 2n grouping sets. For example, CUBE(a, b, c) is equivalent to GROUPING SETS ( (a, b, c), (a, b), (a, c), (b, c), (a), (b), (c), () ). So, the query

SELECT custid, empid, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY CUBE(custid, empid);

is equivalent to

SELECT custid, empid, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY GROUPING SETS( (custid, empid),
(custid), (empid), () );

Both generate the output that Web Table 1 shows.

ROLLUP is handy when the attributes given as input have a hierarchy—for example, the attributes country, region, and city. Suppose you want to find all permutations of grouping sets that have business value that can be produced out of these attributes. The grouping set (city) has no business value because there might be different cities in your data that have the same name but appear in different countries and regions. There’s no business value in grouping data by the city alone because different cities that happen to have the same name will be arranged in the same group. However, the grouping set (country, region, city) does have business value because it generates a group for each unique city(within a specific country and region). So, the subclause ROLLUP (country, region, city) is equivalent to GROUPING SETS ( (country, region, city), (country, region), (country), () ). Out of the eight possible permutations of grouping sets that CUBE would produce in this case, ROLLUP produces four. More generally, out of n attributes, CUBE produces 2n grouping sets, whereas ROLLUP produces n + 1 grouping sets—those that have business value in case of a hierarchy among the attributes. Similarly, ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)) is equivalent to GROUPING SETS( (YEAR(orderdate), MONTH(orderdate), DAY(orderdate)), (YEAR (orderdate), MONTH(orderdate)), (YEAR(orderdate)), () ).

To test the ROLLUP option, run the following query:

SELECT
YEAR(orderdate) AS orderyear,
MONTH(orderdate) AS ordermonth,
DAY(orderdate) AS orderday,
SUM(qty) AS qty
FROM dbo.Orders
GROUP BY ROLLUP(YEAR(orderdate),
MONTH(orderdate), DAY(orderdate));

You’ll get the output that Web Table 2 shows. Run the logical equivalent with the explicit GROUPING SETS subclause, and observe that you get the same output:

SELECT
YEAR(orderdate) AS orderyear,
MONTH(orderdate) AS ordermonth,
DAY(orderdate) AS orderday,
SUM(qty) AS qty
FROM dbo.Orders
GROUP BY GROUPING SETS(
(YEAR(orderdate), MONTH(orderdate),DAY(orderdate)),
(YEAR(orderdate), MONTH(orderdate)),
(YEAR(orderdate)),
() );

As I mentioned earlier, the new CUBE and ROLLUP subclauses are flexible in the sense that you can combine them with other grouping sets in the same query. In fact, you can cross multiple CUBE, ROLLUP, and GROUPING SETS subclauses in the same GROUP BY clause. You can have as many as you like, as long as you don’t exceed the hardcoded limit of 4096 grouping sets per query. The following query is an example:

SELECT custid, empid,
YEAR(orderdate) AS orderyear,
MONTH(orderdate) AS ordermonth,
DAY(orderdate) AS orderday,
SUM(qty) AS qty
FROM dbo.Orders
GROUP BY
CUBE(custid, empid),
ROLLUP(YEAR(orderdate),
MONTH(orderdate), DAY(orderdate));

To determine which grouping sets you get, first expand the CUBE and ROLLUP subclauses to the logically equivalent GROUPING SETS subclauses. The above query’s GROUP BY clause is equivalent to Figure 1.

After applying the Cartesian product between the two subclauses, you get 16 grouping sets, as Figure 2 shows.

The GROUPING_ID Function

As I mentioned earlier, the advantage of the new SQL Server 2008 features related to grouping sets—compared with unifying result sets of multiple GROUP BY queries— is the brevity of code and the efficiency of calculating the aggregates. Because a single query can now handle multiple grouping sets, you might want to easily and efficiently distinguish one grouping set from another in the result set. You might also want to materialize the result set of such a query in a table, then repeatedly query it for a different grouping set in each request. You need an identifier (for each grouping set) that you can use to isolate a grouping set, and that you can also index in the materialized table so that a request for each grouping set will be satisfied efficiently.

The answer lies in the new GROUPING_ID function. The function accepts a set of attributes as input—typically all attributes that participate in any of the grouping sets—and returns an integer bitmap in which each bit represents a different attribute. A bit is turned off if the corresponding attribute is part of the current grouping set and turned on if it isn’t. For example, consider the function GROUPING_ID(e, d, c, b, a). The attribute a is represented by the first bit (1), b by the second bit (2), c by the third bit (4), d by the fourth bit (8), and e by the fifth bit (16). All rows in the result set produced for the grouping set (a, c, e) will have the integer 10 returned from the function. Remember that a bit representing an attribute is turned off when the corresponding attribute is part of the grouping set; hence, the bits representing the attributes a, c, and e are turned off, and the bits representing the missing attributes b (2) and d (8) are turned on. You get 2 + 8 = 10. The following query should help clarify the output of the GROUPING_ID function:

SELECT
GROUPING_ID(e, d, c, b, a) as n,
COALESCE(e, 1) as \[16\],
COALESCE(d, 1) as \[8\],
COALESCE(c, 1) as \[4\],
COALESCE(b, 1) as \[2\],
COALESCE(a, 1) as \[1\]
FROM (VALUES(0, 0, 0, 0, 0)) AS D(a, b, c, d, e)
GROUP BY CUBE (a, b, c, d, e)
ORDER BY n;

Web Table 3 shows the output of this query.

The query produces all possible permutations of grouping sets out of five attributes: a, b, c, d, and e. With five attributes, you get 25 = 32 possible grouping sets. Because the source table is a virtual table with only one row (with zeros in all attributes), you get only one row in the result set for each grouping set. I used the COALESCE function to substitute a NULL representing the fact that the attribute isn’t part of the current grouping set with 1. You can see that the GROUPING_ID function returns an integer that is equal to the sum of the values of the bits representing the missing attributes in the current grouping set.

For a more tangible example, run the code in Web Listing 2 to materialize the result set of the query I showed in the previous section, along with the result of the GROUPING_ID function (call it grp_id), and cluster the table by the grp_id value and the attributes. Now, whenever you need to pull all rows produced for a specific grouping set, you simply need to query the table and filter the relevant grp_id value. The table was clustered first by the grp_id value, so the execution plan will be as efficient as it can get (clustered index seek plus partial ordered scan). For example, to pull the grouping set (custid, orderyear, ordermonth), you need to filter grp_id = 9 (non-participating attributes empid – bit 8 plus orderday – 1 = 9):

SELECT *
FROM dbo.MyGroupingSets
WHERE grp_id = 9;

As I mentioned earlier, the query is very efficient, and you get the expected result set, as Web Table 4 shows.

Incremental Updates

You’re probably wondering whether SQL Server 2008 also provides the means to handle incremental updates of the materialized aggregates stored in the MyGroupingSets table—that is, add the delta of changes since the last processing of aggregates to MyGroupingSets without reprocessing the whole thing. SQL Server 2008 does offer that means, as long as the aggregates are additive (e.g., SUM, COUNT, and AVG, which can be calculated based on SUM and COUNT). You can achieve incremental updates by using another new T-SQL feature in SQL Server 2008—the MERGE statement—which I’ll cover in a future article. But how about just a glimpse?

Suppose MyGroupingSets currently holds aggregates for all activity up to and including April 18, 2008. Run the code in Web Listing 3 to simulate new order activity added in April 19, 2008. Then, run the code in Web Listing 4 to apply the incremental load of April 19, 2008, to the MyGrouping- Sets table using the new MERGE statement. Don’t worry about the syntax; remember, this is only a glimpse. The basic concept of the code in Web Listing 4 is to add rows whose keys don’t exist in the target table and update rows whose keys do.

Less Code, Better Performance

SQL Server 2008 introduces several important features related to grouping sets: the GROUPING_SETS, CUBE and ROLLUP subclauses and the GROUPING_ID function. It also introduces an unrelated feature—the MERGE statement—that can help in handling incremental updates of additive aggregates, among other things. With these new features, you can support applications that need to provide analytical capabilities of aggregates with less code and better performance. These new features also let you provide a relational alternative to basic cubes in Analysis Services for customers who prefer to stick to the relational data warehouse/data mart alone. For information about that alternative, see the Web-exclusive sidebar “Grouping Sets in the Relational Database as an Alternative to Cubes in Analysis Services.” (see Associated .Zip File)

Comments

Plain text