Logical Query Processing Part 7: GROUP BY and HAVING

This article continues the series about logical query processing, which describes the logical, or conceptual, interpretation of queries. Part 1 provided an overview of the topic and also a sample database called TSQLV4. It also provided sample queries which I’ll referred to as simple sample query and complex sample query. I’ll use the same sample database and queries in this article. Parts 2, 3, 4 and 5 covered the logical query processing aspects of the first major query clause—the FROM clause. Part 6 covered the second major query clause—the WHERE clause. In this article I continue the series by describing the logical query processing aspects of the third and fourth major query clauses—GROUP BY and HAVING, respectively.

Logical query processing flow chart including the GROUP BY and HAVING clauses

The optional GROUP BY clause is processed in the third logical query processing step, and the optional HAVING clause in the fourth. The third step, which processes the GROUP BY clause, operates on the virtual table returned by the second step, which processes the WHERE clause. The third step arranges the rows from the input table in groups based on the grouping set that you define in the GROUP BY clause. Then the fourth step filters groups based on the predicate that you specify in the HAVING clause. Only groups for which the predicate evaluates to true are returned, whereas groups for which the predicate evaluates to false or unknown are discarded. Figure 1 shows a flow chart with the logical query processing of the FROM, WHERE, GROUP BY and HAVING clauses.

Figure 1: Logical query processing flow chart – GROUP BY and HAVING

Figure 01 - Logical query processing flow chart - GROUP BY and HAVING.jpg

I’ll use simple sample query and complex sample query from part 1 in the series to demonstrate the inputs and outputs of the steps that process the GROUP BY and HAVING clauses. The following code includes the GROUP BY and HAVING clauses of our simple sample query:

SELECT ...
FROM Sales.Customers AS C
  LEFT OUTER JOIN Sales.Orders AS O
    ON C.custid = O.custid
WHERE C.country = N'Spain'
GROUP BY C.custid
HAVING COUNT( O.orderid ) <= 3;

Following is the state of the data after handling the WHERE clause (step 2), and before handling the GROUP BY clause (step 3):

C.custid    O.orderid
----------- -----------
8       10326
8       10801
8       10970
22      NULL
29      10366
29      10426
29      10568
29      10887
29      10928
30      10303
30      10550
30      10629
30      10872
30      10874
30      10888
30      10911
30      10948
30      11009
30      11037
69      10281
69      10282
69      10306
69      10917
69      11013

This input set consists of 24 rows (showing here only the columns C.custid and O.orderid). The grouping set defined in our query is (C.custid), and since there are five distinct C.custid values in the input set, the result of the GROUP BY phase organizes the input rows in five groups, like so:


|---------|------------------|
| Groups  |       Rows       |
|---------|---------|--------|
|C.custid |C.custid |orderid |
|---------|---------|--------|
|     |8    |10326   |
|8    |8    |10801   |
|     |8    |10970   |
|---------|---------|--------|
|22       |22       |NULL    |
|---------|---------|--------|
|     |29       |10366   |
|     |29       |10426   |
|29       |29       |10568   |
|     |29       |10887   |
|     |29       |10928   |
|---------|---------|--------|
|     |30       |10303   |
|     |30       |10550   |
|     |30       |10629   |
|     |30       |10872   |
|30       |30       |10874   |
|     |30       |10888   |
|     |30       |10911   |
|     |30       |10948   |
|     |30       |11009   |
|     |30       |11037   |
|---------|---------|--------|
|     |69       |10281   |
|     |69       |10282   |
|69       |69       |10306   |
|     |69       |10917   |
|     |69       |11013   |
|---------|---------|--------|

The HAVING phase applies the predicate COUNT( O.orderid ) <= 3 to filter groups. Only two out of the five input groups qualify, so the output of this step consists of only those two groups:


|---------|------------------|
| Groups  |       Rows       |
|---------|---------|--------|
|C.custid |C.custid |orderid |
|---------|---------|--------|
|     |8    |10326   |
|8    |8    |10801   |
|     |8    |10970   |
|---------|---------|--------|
|22       |22       |NULL    |
|---------|---------|--------|

The following code has the GROUP BY and HAVING parts of our complex sample query:

SELECT ...
FROM Sales.Customers AS C
  LEFT OUTER JOIN
  ( Sales.Orders AS O
      INNER JOIN Sales.OrderDetails AS OD
        ON O.orderid = OD.orderid
        AND O.orderdate >= '20160101' )
    ON C.custid = O.custid
  CROSS APPLY ( VALUES( CONCAT(C.country, N'.' + C.region, N'.' + C.city),
            OD.qty * OD.unitprice * (1 - OD.discount) ) 
      ) AS A(custlocation, val)
WHERE A.custlocation IN (N'Spain.Madrid', N'France.Paris', N'USA.WA.Seattle')
GROUP BY C.custid, A.custlocation
HAVING COUNT( DISTINCT O.orderid ) <= 3;

The WHERE phase returned the following result, which consists of 27 rows:

C.custid  A.custlocation  O.orderid  OD.productid  A.val
--------- --------------- ---------- ------------- -------------
8     Spain.Madrid    10970      52        224.0000000
22    Spain.Madrid    NULL       NULL      NULL
57    France.Paris    NULL       NULL      NULL
69    Spain.Madrid    10917      30        25.8900000
69    Spain.Madrid    10917      60        340.0000000
69    Spain.Madrid    11013      23        90.0000000
69    Spain.Madrid    11013      42        56.0000000
69    Spain.Madrid    11013      45        190.0000000
69    Spain.Madrid    11013      68        25.0000000
74    France.Paris    10907      75        108.5000000
74    France.Paris    10964      18        375.0000000
74    France.Paris    10964      38        1317.5000000
74    France.Paris    10964      69        360.0000000
74    France.Paris    11043      11        210.0000000
89    USA.WA.Seattle  10861      17        1638.0000000
89    USA.WA.Seattle  10861      18        1250.0000000
89    USA.WA.Seattle  10861      21        400.0000000
89    USA.WA.Seattle  10861      33        87.5000000
89    USA.WA.Seattle  10861      62        147.9000000
89    USA.WA.Seattle  10904      58        198.7500000
89    USA.WA.Seattle  10904      62        1725.5000000
89    USA.WA.Seattle  11032      36        665.0000000
89    USA.WA.Seattle  11032      38        6587.5000000
89    USA.WA.Seattle  11032      59        1650.0000000
89    USA.WA.Seattle  11066      16        52.3500000
89    USA.WA.Seattle  11066      19        386.4000000
89    USA.WA.Seattle  11066      34        490.0000000

This set is used as the input to the GROUP BY phase, which in this query groups the rows by the grouping set (C.custid, A.custlocation). The step organizes the rows from the input set in the following six groups:

|---------|----------------------------------------------------------------|
| Groups  |                Rows                |
|---------|---------|---------------|----------|-------------|-------------|
|C.custid |C.custid |A.custlocation |O.orderid |OD.productid |A.val    |
|---------|---------|---------------|----------|-------------|-------------|
|8    |8    |Spain.Madrid   |10970     |52       |224.0000000  |
|---------|---------|---------------|----------|-------------|-------------|
|22       |22       |Spain.Madrid   |NULL      |NULL     |NULL     |
|---------|---------|---------------|----------|-------------|-------------|
|57       |57       |France.Paris   |NULL      |NULL     |NULL     |
|---------|---------|---------------|----------|-------------|-------------|
|     |69       |Spain.Madrid   |10917     |30       |25.8900000   |
|     |69       |Spain.Madrid   |10917     |60       |340.0000000  |
|69       |69       |Spain.Madrid   |11013     |23       |90.0000000   |
|     |69       |Spain.Madrid   |11013     |42       |56.0000000   |
|     |69       |Spain.Madrid   |11013     |45       |190.0000000  |
|     |69       |Spain.Madrid   |11013     |68       |25.0000000   |
|---------|---------|---------------|----------|-------------|-------------|
|     |74       |France.Paris   |10907     |75       |108.5000000  |
|     |74       |France.Paris   |10964     |18       |375.0000000  |
|74       |74       |France.Paris   |10964     |38       |1317.5000000 |
|     |74       |France.Paris   |10964     |69       |360.0000000  |
|     |74       |France.Paris   |11043     |11       |210.0000000  |
|---------|---------|---------------|----------|-------------|-------------|
|     |89       |USA.WA.Seattle |10861     |17       |1638.0000000 |
|     |89       |USA.WA.Seattle |10861     |18       |1250.0000000 |
|     |89       |USA.WA.Seattle |10861     |21       |400.0000000  |
|     |89       |USA.WA.Seattle |10861     |33       |87.5000000   |
|     |89       |USA.WA.Seattle |10861     |62       |147.9000000  |
|     |89       |USA.WA.Seattle |10904     |58       |198.7500000  |
|89       |89       |USA.WA.Seattle |10904     |62       |1725.5000000 |
|     |89       |USA.WA.Seattle |11032     |36       |665.0000000  |
|     |89       |USA.WA.Seattle |11032     |38       |6587.5000000 |
|     |89       |USA.WA.Seattle |11032     |59       |1650.0000000 |
|     |89       |USA.WA.Seattle |11066     |16       |52.3500000   |
|     |89       |USA.WA.Seattle |11066     |19       |386.4000000  |
|     |89       |USA.WA.Seattle |11066     |34       |490.0000000  |
|---------|---------|---------------|----------|-------------|-------------|

The query then filters groups based on the HAVING predicate COUNT( DISTINCT O.orderid ) <= 3, returning five out of the original six groups:

|---------|----------------------------------------------------------------|
| Groups  |                Rows                |
|---------|---------|---------------|----------|-------------|-------------|
|C.custid |C.custid |A.custlocation |O.orderid |OD.productid |A.val    |
|---------|---------|---------------|----------|-------------|-------------|
|8    |8    |Spain.Madrid   |10970     |52       |224.0000000  |
|---------|---------|---------------|----------|-------------|-------------|
|22       |22       |Spain.Madrid   |NULL      |NULL     |NULL     |
|---------|---------|---------------|----------|-------------|-------------|
|57       |57       |France.Paris   |NULL      |NULL     |NULL     |
|---------|---------|---------------|----------|-------------|-------------|
|     |69       |Spain.Madrid   |10917     |30       |25.8900000   |
|     |69       |Spain.Madrid   |10917     |60       |340.0000000  |
|69       |69       |Spain.Madrid   |11013     |23       |90.0000000   |
|     |69       |Spain.Madrid   |11013     |42       |56.0000000   |
|     |69       |Spain.Madrid   |11013     |45       |190.0000000  |
|     |69       |Spain.Madrid   |11013     |68       |25.0000000   |
|---------|---------|---------------|----------|-------------|-------------|
|     |74       |France.Paris   |10907     |75       |108.5000000  |
|     |74       |France.Paris   |10964     |18       |375.0000000  |
|74       |74       |France.Paris   |10964     |38       |1317.5000000 |
|     |74       |France.Paris   |10964     |69       |360.0000000  |
|     |74       |France.Paris   |11043     |11       |210.0000000  |
|---------|---------|---------------|----------|-------------|-------------|

If you’re wondering why the HAVING predicate computes a distinct count of order IDs, that’s because the query joins the Sales.Orders table with the Sales.OrderDetails table, resulting in a row per order line—not a row per order. The HAVING filter in our query is supposed to return only groups that have an order count—not an order line count—that is less than or equal to three.

Detail is discarded

The GROUP by phase organizes the rows in groups, and from that point in the logical processing of the query, your expressions don’t have direct access to the detailed state of the data anymore. This applies to all subsequent phases, which are the steps that process the HAVING (step 4), SELECT (step 5) and ORDER BY (step 6) clauses of the query. If in those clauses you want to refer to a column from the input tables, you can do so directly only if that column is part of the query’s grouping set. In order to refer to a column that is not part of the query’s grouping set, that column must be contained in a group aggregate function. For example, the following query is invalid since in the SELECT list you refer to the column orderid, and this column is neither part of your grouping set, nor is it contained in a group aggregate function:

SELECT custid, orderid
FROM Sales.Orders
GROUP BY custid;

You get the following error:

Msg 8120, Level 16, State 1, Line 48
Column 'Sales.Orders.orderid' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

The reasoning behind this restriction is that within a single group you can have rows with different order ID values, but the result will consist of only one row per group. However, applying an aggregate function to a column (or an expression based on a column) guarantees that there will be only one result value per group. For example, the following query computes the maximum order ID per customer, and therefore is a perfectly valid query:

SELECT custid, MAX(orderid) AS lastorder
FROM Sales.Orders
GROUP BY custid;

Theoretically SQL could allow exceptions to the above rule in cases where based on constraints it could infer that certain columns can have only one distinct value per group, even if they’re not part of the grouping set. However, SQL doesn’t support such implied inference. As an example, consider the following query:

SELECT C.custid, C.companyname, MAX(O.orderid) AS lastorder
FROM Sales.Customers AS C
  INNER JOIN Sales.Orders AS O
    ON C.custid = O.custid
GROUP BY C.custid;

There’s a primary key constraint defined on the custid column in the Sales.Customers table, enforcing the uniqueness of the custid values. This means that all other columns in the table are functionally dependent on the custid column. So, in the result of the join between Sales.Customers and Sales.Orders, all rows with the same custid value will always share the same values in all of the rest of the columns. So in the above query, SQL could theoretically infer that since C.custid is part of the grouping set, there will always be precisely one C.companyname value associated with each group. However, as mentioned, SQL (and the same applies to T-SQL) doesn’t support this sort of inference, and you get the following error when trying to execute this query:

Msg 8120, Level 16, State 1, Line 56
Column 'Sales.Customers.companyname' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

There are two main workarounds that SQL supports. One is to add the C.companyname column to the grouping set, like so (call this Query 1):

SELECT C.custid, C.companyname, MAX(O.orderid) AS lastorder
FROM Sales.Customers AS C
  INNER JOIN Sales.Orders AS O
    ON C.custid = O.custid
GROUP BY C.custid, C.companyname;

Another is to apply an artificial aggregate function like MIN or MAX to the C.companyname column, like so (call this Query 2):

SELECT C.custid, MAX(C.companyname) AS companyname, MAX(O.orderid) AS lastorder
FROM Sales.Customers AS C
  INNER JOIN Sales.Orders AS O
    ON C.custid = O.custid
GROUP BY C.custid;

If you’re wondering which solution is more efficient, examine the query plans shown in Figure 2:

Figure 2: Plans for Query 1 and Query 2

Figure 02 - Plans for Query 1 and Query 2.jpg

Observe that in the plan for Query 1, the optimizer applied a pushdown of the grouping and aggregation work right after scanning the data from the Orders table, prior to applying the join with the Customers table. It figures that since the join is an equijoin, grouping by O.custid is the same as grouping by C.custid, and then the pushdown can be applied while preserving the original query meaning. This leaves the join with less work to do compared to processing the grouping and aggregation after the join. In the plan for Query 2, the optimizer doesn’t realize that the aggregate against the C.companyname column is artificial, and handles the grouping and aggregation after the join. Personally, I find Query 1 also to be the clearer and more natural of the two. Since Query 1 is both more efficient and more natural, it’s generally my preferred choice between the two.

Aggregating data without discarding the detail

Compared to getting the raw, or detailed, state of the data, grouping gives you new insights in the form of aggregated calculations. However, at the same time grouping also discards the detail, as demonstrated earlier. For example, consider the following query:

SELECT custid, SUM(val) AS custtotal
FROM Sales.OrderValues
GROUP BY custid;

The source view has a row per order, but since the query groups the data by custid, the output contains a row per customer, with the customer’s total order values:

custid  custtotal
------- ----------
1       4273.00
2       1402.95
3       7023.98
...

(89 row(s) affected)

But what if you don’t want to discard the detail? For example, suppose that you want to return the detailed information about the order and in addition you want to compute the percentage of the current order value out of the customer total. You need the percentage calculation to divide the detailed order value by the aggregate of all order values for the same customer. You could achieve this by writing a query that groups the data by custid and computes the customer total, then define a table expression based on that query, and use an outer query to join the table expression with the Sales.OrderValues view to match the detail with the aggregates. But this adds complexity to the code. Alternatively, you can compute the aggregate using a window function instead of a group function, like so:

SELECT custid, orderid, val,
  val / SUM(val) OVER(PARTITION BY custid) AS pctcust
FROM Sales.OrderValues;

Unlike grouping, which discards the detail, windowing doesn’t discard the detail. Th aggregate is computed against a window of rows that is derived from the underlying query result set, and is defined by the function’s OVER clause. Since the underlying query result is established only when logical query processing reaches the SELECT phase (after processing the FROM, WHERE, GROUP BY and HAVING phases), window functions are allowed only in the SELECT and ORDER BY clauses of the query. If you specify an empty OVER clause, the function operates on the complete underlying query result, and you would get the grand total. If you add a window partition clause, like in the above query, the function operates on a restricted partition, which in our example restricts the rows to only the ones where the custid value is the same as in the current row. In other words, our window function computes the customer total. The window function returns its result without discarding the detail, and hence you can combine detail elements from the row with the result of the window aggregate function. Our query divides the current order value by the customer total to compute the percentage: val / SUM(val) OVER(PARTITION BY custid). This query generates the following output:

custid  orderid  val      pctcust
------- -------- -------- -----------------------------
1       10643    814.50   0.19061549262813011935408378
1       10692    878.00   0.20547624619705125204774163
1       10702    330.00   0.07722911303533816990404867
1       10835    845.80   0.19794055698572431546922536
1       10952    471.20   0.11027381230985256260238708
1       11011    933.50   0.21846477884390358062251345
2       10926    514.40   0.36665597491001104814854413
2       10759    320.00   0.22809080865319505328058733
2       10625    479.75   0.34195801703553227128550554
2       10308    88.80    0.06329519940126162728536298
3       10365    403.20   0.05740335251524064704056674
3       10507    749.06   0.10664324215046170404813225
3       10535    1940.85  0.27631770022124208781915666
3       10677    813.37   0.11579901992887223482982582
3       10573    2082.00  0.29641314468435274587911696
3       10682    375.50   0.05345971941833547362036907
3       10856    660.00   0.09396382108149510676283246
...

(830 row(s) affected)

Curiously, grouping and windowing aren't mutually exclusive. You can apply window functions against grouped data. Just remember that grouping happens in step 3 in logical query processing, and windowing happens post grouping, in the SELECT (step 5) or ORDER BY (step 6) phases. The basic rule that you want to remember is that unlike group aggregate functions which you can apply to detail elements as their inputs, window functions can only be applied to elements that would have otherwise normally been allowed in the SELECT clause. For instance, suppose that you want to write a grouped query where you compute customer total order values, and in addition, the percent of the customer total out of the grand total. You might attempt to use the following query:

SELECT custid, SUM(val) AS custtotal, SUM(val) / SUM(val) OVER() AS pct
FROM Sales.OrderValues
GROUP BY custid;

The reference to the val column in the group aggregate SUM functions is perfectly valid, but the reference to the val column in the window aggregate SUM function is invalid, just like such a reference would have been invalid directly in the SELECT clause. You get the following error:

Msg 8120, Level 16, State 1, Line 137
Column 'Sales.OrderValues.val' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

You might try an alternative where you apply the window function to the alias custtotal:

SELECT custid, SUM(val) AS custtotal, SUM(val) / SUM(custtotal) OVER() AS pct
FROM Sales.OrderValues
GROUP BY custid;

But this attempt also fails since such a reference would have been invalid directly in the SELECT clause due to the set-based treatment of expressions in the same logical phase. You get the following error:

Msg 207, Level 16, State 1, Line 145
Invalid column name 'custtotal'.

What is valid, surprisingly (or not), is to apply the window function to the group function, like so:

SELECT custid, SUM(val) AS custtotal, SUM(val) / SUM(SUM(val)) OVER() AS pct
FROM Sales.OrderValues
GROUP BY custid;

This time, the query runs successfully and you get the following output:

custid  custtotal  pct
------- ---------- ---------
1       4273.00    0.003375
2       1402.95    0.001108
3       7023.98    0.005549
...

(89 row(s) affected)

Similarly, suppose that you wanted to group the data from the Sales.OrderValues view by custid and orderdate, and in addition to computing the daily total, you want to compute the running total values from the beginning of the customer’s activity until the current date. As shown in the previous example, you can combine grouping and windowing to achieve this task, like so:

SELECT custid, orderdate, SUM(val) AS daytotal,
  SUM(SUM(val)) OVER(PARTITION BY custid
             ORDER BY orderdate
             ROWS UNBOUNDED PRECEDING) AS runtotal
FROM Sales.OrderValues
GROUP BY custid, orderdate;

This query generates the following output:

custid  orderdate  daytotal  runtotal
------- ---------- --------- ---------
1       2015-08-25 814.50    814.50
1       2015-10-03 878.00    1692.50
1       2015-10-13 330.00    2022.50
1       2016-01-15 845.80    2868.30
1       2016-03-16 471.20    3339.50
1       2016-04-09 933.50    4273.00
2       2014-09-18 88.80     88.80
2       2015-08-08 479.75    568.55
2       2015-11-28 320.00    888.55
2       2016-03-04 514.40    1402.95
3       2014-11-27 403.20    403.20
3       2015-04-15 749.06    1152.26
3       2015-05-13 1940.85   3093.11
3       2015-06-19 2082.00   5175.11
3       2015-09-22 813.37    5988.48
3       2015-09-25 375.50    6363.98
3       2016-01-28 660.00    7023.98
...

(823 row(s) affected)

HAVING versus WHERE

The HAVING clause serves a similar filtering purpose to the WHERE clause, where it filters data based on a predicate. The key difference between the two is that the WHERE clause filters rows before grouping, whereas the HAVING clause filters whole groups after grouping. So at the group level, the HAVING clause determines whether to keep the group or discard it based on the outcome of the predicate (if true, keep, if false or unknown, discard). As an example, suppose that you need to query the Sales.OrderValues view, and filter only orders that were placed on or after May 1^st, 2016. You want to group the remaining orders by employee, and filter only groups having three orders or fewer. For qualifying groups, you want to return the employee ID and the order count. Here, the filter against the orderdate column has to be applied in the WHERE clause because it’s supposed to be treated as a row filter: WHERE orderdate >= '20160501'. Since orderdate is not part of the query’s grouping set, you can’t refer to it in the HAVING clause, unless it is contained within an aggregate function. But then it’s not like you want to filter groups where the outcome of an aggregate against the orderdate column satisfies some condition; you want to filter rows representing orders placed on or after a certain date. Conversely, after grouping, you want to apply a group filter that keeps only employee groups having three or fewer orders. Since this filter involves an aggregate calculation of the count of orders and needs to be applied at the group level, it has to be specified in the HAVING clause: HAVING COUNT(*) <= 3. Here’s the complete solution query:

SELECT empid, COUNT(*) AS numorders
FROM Sales.OrderValues
WHERE orderdate >= '20160501'
GROUP BY empid
HAVING COUNT(*) <= 3;

This query generates the following output:

empid       numorders
----------- -----------
2       2
4       2
7       2
8       3

If you need to apply a filter based on a column that is part of the query’s grouping set, e.g., empid > 0 in our query, you can choose whether to apply it as a row filter in the WHERE clause or as a group filter in the HAVING clause. The outcome is the same. For most people, in such a case it is more natural to apply the filter in the WHERE clause. SQL Server’s optimizer knows this and very likely will create the same plan in both cases.

GROUP BY ALL

T-SQL supports a nonstandard feature called GROUP BY ALL. It’s a pretty interesting feature, but you should be aware that the official documentation for SQL Server has a note advising to refrain from using it since Microsoft is planning to drop support for it at some point. Here I wanted to describe the logical query processing aspects of this feature and provide a recommended alternative.

In essence, the GROUP BY ALL option preserves empty groups that were filtered out by the WHERE clause. Any aggregate that you apply to those groups operates on an empty set. A COUNT(*) aggregate against such groups returns 0. As an example, consider the following query (call it Query 3):

SELECT empid, COUNT(*) AS numorders
FROM Sales.OrderValues
WHERE orderdate >= '20160501'
GROUP BY ALL empid
HAVING COUNT(*) <= 3;

The filter in the WHERE clause eliminates completely all rows for customers 3, 5, 6 nd 9. Earlier you saw that the result of a similar query without the GROUP BY ALL option didn’t return the groups for these customers. But since the GROUP BY ALL option preserves empty groups, the output of this query includes the groups for these customers with a count of 0:

empid       numorders
----------- -----------
2       2
3       0
4       2
5       0
6       0
7       2
8       3
9       0

As mentioned, Microsoft announced their intention to drop support for this feature in the future. Here’s the alternative to the previous query that should keep working in the future (call this Query 4):

SELECT empid, COUNT(tokeep) AS numorders
FROM Sales.OrderValues
  CROSS APPLY ( VALUES( CASE WHEN orderdate >= '20160501' THEN 1 END ) )
    AS A(tokeep)
GROUP BY empid
HAVING COUNT(tokeep) <= 3;

Instead of filtering the rows in the WHERE clause, the query uses a CASE expression based on the same predicate to determine whether to keep the value (tokeep column is 1) or to ignore it (tokeep column is NULL). You then apply your aggregate calculation to the tokeep column. Not only that this solution uses only supported elements that are likely to stay supported, it is also more efficient than the GROUP BY ALL solution. Figure 3 shows the query plans for Query 3 (GROUP BY ALL solution) and Query 4 (solution with CASE expression) using SQL Sentry’s plan explorer.

Figure 3: Plans for Query 3 and Query 4

Figure 03 - Plans for Query 3 and Query 4.jpg

The plan for the GROUP BY ALL solution (Query 3) repeats the scanning of the input data twice—once (the top branch of the plan) to return all rows without applying the WHERE filter along with a constant flag NULL, and a second time (bottom branch of the plan) with applying the WHERE filter along with a constant flag 0. The plan then concatenates the results, groups the rows by empid, computes the aggregates COUNT(flag), and finally applies the HAVING filter. Conversely, the plan for the solution based on the CASE expression (Query 4) scans the input data only once.

Grouping sets

Traditional group queries compute aggregates for a single grouping set (a single set of expressions that you group by). However, sometimes you need to write queries that produce aggregates for multiple grouping sets, usually for reporting purposes. For example, suppose that you want to query the Sales.Orders table, and compute daily, monthly, yearly, and grand order counts. You could achieve this by writing a separate grouped query for each grouping set, and then unify the results, like so:

WITH C AS
(
  SELECT orderyear, ordermonth, orderday, COUNT(*) AS numorders
  FROM Sales.Orders
    CROSS APPLY ( VALUES(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)) )
  AS A(orderyear, ordermonth, orderday)
  GROUP BY orderyear, ordermonth, orderday

  UNION ALL

  SELECT orderyear, ordermonth, NULL AS orderday, COUNT(*) AS numorders
  FROM Sales.Orders
    CROSS APPLY ( VALUES(YEAR(orderdate), MONTH(orderdate)) )
  AS A(orderyear, ordermonth)
  GROUP BY orderyear, ordermonth

  UNION ALL

  SELECT orderyear, NULL AS ordermonth, NULL AS orderday, COUNT(*) AS numorders
  FROM Sales.Orders
    CROSS APPLY ( VALUES(YEAR(orderdate)) ) AS A(orderyear)
  GROUP BY orderyear

  UNION ALL

  SELECT NULL AS orderyear, NULL AS ordermonth, NULL AS orderday, COUNT(*) AS numorders
  FROM Sales.Orders
)
SELECT orderyear, ordermonth, orderday, numorders
FROM C
ORDER BY
  CASE WHEN orderyear IS NOT NULL THEN 0 ELSE 1 END, orderyear,
  CASE WHEN ordermonth IS NOT NULL THEN 0 ELSE 1 END, ordermonth,
  CASE WHEN orderday IS NOT NULL THEN 0 ELSE 1 END, orderday;

Observe that the code uses NULLs as place holders for elements that are not part of the current grouping set but are relevant in other grouping sets. The expressions in the ORDER BY clause guarantee hierarchical presentation order (daily counts followed by month total, monthly counts followed by year total, and finally the grand total). This query generates the following output, shown here in abbreviated form:

orderyear   ordermonth  orderday    numorders
----------- ----------- ----------- -----------
2014    7       4       1
2014    7       5       1
2014    7       8       2
2014    7       9       1
2014    7       10      1
2014    7       11      1
2014    7       12      1
2014    7       15      1
2014    7       16      1
2014    7       17      1
2014    7       18      1
2014    7       19      2
2014    7       22      1
2014    7       23      1
2014    7       24      1
2014    7       25      1
2014    7       26      1
2014    7       29      1
2014    7       30      1
2014    7       31      1
2014    7       NULL    22
...
2016    4       29      3
2016    4       30      4
2016    4       NULL    74
2016    5       1       3
2016    5       4       3
2016    5       5       4
2016    5       6       4
2016    5       NULL    14
2016    NULL    NULL    270
NULL    NULL    NULL    830

(507 row(s) affected)

SQL (and so does T-SQL) Supports a much more elegant solution for our need using a clause called GROUPING SETS, which you specify in the GROUP BY clause, and in which you list multiple grouping sets, like so:


SELECT orderyear, ordermonth, orderday, COUNT(*) AS numorders
FROM Sales.Orders
  CROSS APPLY ( VALUES(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)) )
    AS A(orderyear, ordermonth, orderday)
GROUP BY GROUPING SETS
(
  (orderyear, ordermonth, orderday),
  (orderyear, ordermonth),
  (orderyear),
  ()
)
ORDER BY
  GROUPING(orderyear), orderyear,
  GROUPING(ordermonth), ordermonth,
  GROUPING(orderday), orderday;

The GROUPING function returns a 0 for an input element if it’s part of the grouping set and 1 if it’s not (when it’s an aggregate). In our query it simplifies the expressions in the ORDER BY clause that guarantee hierarchical presentation ordering. From a logical query processing perspective, the GROUPING SETS clause allows you to define multiple grouping sets, and as a result associate each row returned from the WHERE pahse possibly with multiple groups instead of just one.

Back to the last query, when you need to compute all grouping sets that represent leading combination in a hierarchy, like our time hierarchy, instead of explicitly listing all those grouping sets in the GROUPING SETS clause, you can use an abbreviated syntax with the ROLLUP clause, like so:

SELECT orderyear, ordermonth, orderday, COUNT(*) AS numorders
FROM Sales.Orders
  CROSS APPLY ( VALUES(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)) )
    AS A(orderyear, ordermonth, orderday)
GROUP BY ROLLUP(orderyear, ordermonth, orderday)
ORDER BY
  GROUPING(orderyear), orderyear,
  GROUPING(ordermonth), ordermonth,
  GROUPING(orderday), orderday;

This ROLLUP clause is equivalent to the previous query’s GROUPING SETS clause. It defines all grouping sets that represent leading combination of the input expressions.

For more information about grouping sets, see Grouping Sets Part 1 and Grouping Sets Part 2.

Conclusion

In this article I covered the logical query processing aspects of the GROUP BY and HAVING clauses. I provided the logical query processing flow chart with the FROM, WHERE, GROUP BY and HAVING clauses. I explained that grouping discards the detail and provided an alternative in the form of windowing, which doesn’t discard the detail. I explained the difference between the WHERE and HAVING predicates. I also covered the proprietary GROUP BY ALL option and provided an alternative that is more efficient and that is likely to be supported in the future, unlike GROUP BY ALL. Finally, I showed how to define multiple grouping sets in a single query using the GROUPSING SETS and ROLLUP clauses. Next month I’ll continue the discussion about logical query processing, covering the SELECT and ORDER BY clauses.

Comments

Plain text