SQL Server internally processes GROUP BY and DISTINCT clauses in a similar manner. For both types of queries, SQL Server returns one row of output for each distinct value of a column or set of columns in the input. The difference is that with a GROUP BY query, you can optionally include an aggregation such as count(), sum(), or avg() to be performed on all the rows with matching values for the grouping column. For example, using the Northwind database's Orders table, suppose you want to know which customers have placed orders. In other words, you want to see a list of distinct customer IDs. In T-SQL, you can write the query like this:
-- Query 1: SELECT DISTINCT customerID FROM Orders
Alternatively, you can get the same output by using GROUP BY (without an aggregate function):
-- Query 2: SELECT customerID FROM Orders GROUP BY customerID
SQL Server 2000 and 7.0 process both queries in exactly the same way, and the results are identical. The queries return one row of output for each unique CustomerID value. The following plan shows that to process this query, SQL Server uses a technique called stream aggregation:
|--Stream Aggregate(GROUP BY:(\[Orders\] .\[CustomerID\])) |--Index Scan(OBJECT:(\[Northwind\] .\[dbo\].\[Orders\].\[CustomerID\]), ORDERED FORWARD)
I discuss some details of this plan later. But first, let's look at another example with GROUP BY.
The GROUP BY construct lets you perform further processing for each customer ID beyond simply listing the distinct values. By adding an aggregate function like sum() to the SELECT list, you're asking for summary information for each unique customer ID. The following query requests the sum of all the freight charges for each customer:
-- Query 3: SELECT customerID, sum(freight) FROM orders GROUP BY customerID
Again, the query returns only one row for each customer, no matter how many rows in the Orders table have that CustomerID value. And for each customer ID, you get summary information based on all the rows that have that CustomerID value.
However, as Figure 1 shows, the query plan for Query 3 is a bit different from the previous two; it uses a technique called hash aggregation, which doesn't imply any sorting. I'll forgo the discussion of exactly what hash aggregation involves until next time.
Stream aggregation and hash aggregation are the two ways that SQL Server can process both GROUP BY and DISTINCT queries. For stream aggregation, SQL Server sorts all the data first. For a GROUP BY query, the grouped columns are the sort key; for a DISTINCT query, the sort key comprises the columns in the SELECT list. After the data is sorted, isolating the distinct values and returning one row for each is relatively straightforward. When SQL Server uses stream aggregation to process GROUP BY or DISTINCT, the results are returned in sorted order. If you execute Queries 1 and 2 above, SQL Server automatically sorts the result set by CustomerID. When you execute Query 3, which uses hash instead of stream aggregation, the data comes back in seemingly random order.
Before SQL Server 7.0, SQL Server could perform GROUP BY and DISTINCT operations only by using stream aggregation, so the data always came back sorted. Consequently, many SQL Server programmers erroneously assumed that GROUP BY implied sorting and were confused when some grouped results in SQL Server 7.0 didn't come back sorted. But when the optimizer in SQL Server 2000 or 7.0 decides to use hash aggregation, the results aren't ordered. If you need data in a particular sorted order, you must use an ORDER BY clause in your query.
Because of the SQL Server 6.x applications that didn't work as expected in SQL Server 2000 and 7.0 and the difficulty of inserting the proper ORDER BY clause in all queries in the applications, Microsoft offered another solution. In SQL Server 2000 and 7.0, you can choose a compatibility level for your databases to operate in. Most of the differences between the compatibility levels affect which words SQL Server recognizes as keywords. Also, if your database is running in 65 (for SQL Server 6.5) or 60 (for SQL Server 6.0) compatibility mode, all GROUP BY queries return sorted results. You can see this behavior by running Query 3 after changing your compatibility level, as Listing 1 shows. Note that DISTINCT queries don't return sorted data when you're running in 65 compatibility level. For details about changing compatibility levels, see sp_dbcmptlevel in SQL Server Books Online (BOL).
What types of indexes can be useful for GROUP BY and DISTINCT operations? The Orders table has a nonclustered index on CustomerID, so Queries 1 and 2 are covered queries. The plan for those queries shows that SQL Server scans a nonclustered index—a common indicator of a covered query. The plan also includes the words "ordered forward," suggesting that the query is taking advantage of the leaf level's being stored in a particular order. The CustomerID values are already sorted in the leaf level of the nonclustered index, so stream aggregation is an obvious choice. SQL Server doesn't need to do any extra sorting, and the data is already in the right order. Query 3, which attempts to get data that isn't in the index, isn't a covered query.
Because many GROUP BY or DISTINCT queries involve only a few columns, such queries are relatively easy to cover with a nonclustered index. If you ran the examples in "The Big Cover Up," September 2001, InstantDoc ID 21729, you created an index that covers Query 4. That index is a nonclustered index on CustomerID and EmployeeID. If you don't already have an index on those columns, add the following code to create one:
--CREATE INDEX customer_employee on orders(customerID, employeeID) -- Query 4: SELECT CustomerID, count(EmployeeID) FROM Orders GROUP BY CustomerID
I won't show you the query plans for the remaining queries. You can see them by highlighting the query in Query Analyzer and selecting Display Estimated Execution Plan from the toolbar, or by executing SET SHOWPLAN_TEXT ON before running the queries.
In the Orders table, the clustering key is the OrderID column, so that column is included as an extra key in every nonclustered index row. Because of the inclusion of the OrderID values, Query 5 is also covered.
-- Query 5: SELECT CustomerID, count(OrderID) FROM Orders GROUP BY CustomerID
The biggest clue that a query is covered is that the query plan shows an index scan on a nonclustered index. The plans for both Query 4 and Query 5 indicate that type of scan.
Most of the examples in the rest of this column deal with GROUP BY operations, but almost all DISTINCT operations behave the same. In many cases, DISTINCT can even take better advantage of covering indexes because you're interested in fewer columns.
An interesting fact to be aware of when you're considering indexing a GROUP BY query that involves a count(*) aggregate is the difference between count(*) and count(<column>). Count(*) just returns the number of rows and is frequently a covered query because SQL Server can just count the rows in a nonclustered index leaf to find out "how many rows." So a simple query like Query 6 is covered by every nonclustered index on the Orders table because their leaf levels all have the same number of rows—one for each data row.
-- Query 6: SELECT count(*) FROM Orders
Query 7 is covered by any nonclustered index that contains CustomerID.
--Query 7: SELECT CustomerID, count(*) FROM Orders GROUP BY CustomerID
But if you need to compute count(<column>), SQL Server needs to know how many rows for that particular column have a value. So, you might think that the particular column needs to be part of any covering index. In SQL Server 7.0 and earlier, an index has to include the aggregated column to be used as a covering index. However, the SQL Server 2000 query optimizer can figure out that if the column doesn't allow NULLs, then count(<column>) is the same as count(*) because all the rows must contain a non-NULL value for the column.
The script in Listing 2 illustrates this point by creating a copy of the Order Details table, then building a clustered index on OrderID and a nonclustered index on ProductID. The count() aggregation is on the Quantity column, which initially doesn't allow NULLs. The query plan for a query involving count(*) is the same as the plan for a query involving count(Quantity) and includes stream aggregation because the data is already sorted by ProductID. Turning on STATISTICS IO shows the same number of logical reads for both queries. The script then drops the Quantity column and recreates it, allowing NULLs. The same query's plan now uses hash aggregation, and the number of logical reads increases.
Besides making use of a covering nonclustered index, a GROUP BY query could also use a clustered index on the grouping column because, just as in the covering case, the data is already in the proper order for stream aggregation. The query plan for Query 8 shows a clustered index scan and no sorting; the plan doesn't even include any aggregation operators:
-- Query 8: SELECT OrderID, sum(Quantity) FROM \[Order Details\] GROUP BY OrderID
As I mentioned, in SQL Server 2000 and 7.0, ordering of the output rows isn't guaranteed unless you include an ORDER BY clause (or you're running in 60 or 65 compatibility mode). The following examples show what happens to the query plan when you add an ORDER BY statement to sort by the grouping column.
The plan for Query 9 uses stream aggregation, so the data already comes back sorted by the grouping column:
-- Query 9: -- Stream aggregation, with or without ORDER BY SELECT ShipName, max(Freight) FROM Orders GROUP BY ShipName --ORDER BY ShipName
If you run the query with and without including the ORDER BY, the plans are the same. If the original query plan for a GROUP BY query uses hash aggregation, adding an ORDER BY clause forces a sort operation. However, the sort can occur either before or after the aggregation. I've found that because SQL Server can quickly sort small amounts of data, in most cases involving very small tables (less than 1000 rows or so), SQL Server performs the sort before the aggregation. The plan for Query 10 in Listing 3 shows hash aggregation; when Query 11 in Listing 3 adds ORDER BY, the plan changes to include the SORT operation before the aggregation. Because the data is sorted first, the aggregation is a stream aggregate.
Query 12 in Listing 4 also has a plan with a hash aggregation, but that query is based on the larger Order Details table. Query 13 in Listing 4 shows that the sort takes place after the hash aggregation.
In many cases, SQL Server can efficiently process queries that involve GROUP BY or DISTINCT by using either a stream aggregation on sorted data or a hash aggregation on unsorted data. You can tune these queries by either making sure that a covering nonclustered index exists or by creating a table's clustered index on the columns in the GROUP BY clause or on the columns in the SELECT list after the DISTINCT keyword.