
Optimizing TOP N Per Group Queries

Compare solutions based on ROW_NUMBER, APPLY, and concatenation

A querying task can have several solutions, and one solution won’t provide the best performance in all cases. Several factors can influence which solution performs best, including data density, the availability of indexes, and so on. To decide on a solution, you must know your data, understand how SQL Server’s optimizer works, and determine whether adding important indexes is worthwhile in terms of the overall impact on the system. In this article I present a simple, common querying task; I then show three approaches to solving the task and discuss when each solution works best.

Challenge

The task that’s the focus of this article is commonly referred to as Top N Per Group. The idea is to return a specified number of top rows for each group of rows. Suppose you have a table called Orders with attributes called orderid, custid, empid, shipperid, orderdate, shipdate, and filler (representing other attributes). You also have related tables called Customers, Employees, and Shippers. You need to write a query that returns the N most recent orders per customer (or employee or shipper). "Most recent" is determined by order date descending, then by order ID descending as the tiebreaker.

Writing a solution that works isn’t difficult; the challenge is determining the optimal solution depending on variable factors and what’s specific to your case in your system. Also, assuming you can determine the optimal index strategy to support your solution, you must consider whether you’re allowed to create such an index and whether you can afford to. Sometimes one solution would work best if the optimal index to support it were available, but another solution would work better without such an index in place. In our case, the density of the partitioning column (e.g., custid, if you need to return the top rows per customer) also plays an important role in determining which solution is optimal.

Sample Data

Run the code in Listing 1 to create a sample database called TestTOPN and within it a helper function called GetNums. The helper function accepts a number as input and returns a sequence of integers in the range 1 through the input value. The function is used by the code in Listing 2 to populate the different tables with a requested number of rows. Run the code in Listing 2 to create the Customers, Employees, Shippers, and Orders tables and fill the tables with sample data.

Listing 1: Code to Create Sample Database TestTOPN and Helper Function GetNums

SET NOCOUNT ON;
IF DB_ID('TestTOPN') IS NULL CREATE DATABASE TestTOPN;
GO
USE TestTOPN;
IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;
GO
CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
  L0   AS(SELECT 1 AS c UNION ALL SELECT 1),
  L1   AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
  L2   AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
  L3   AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
  L4   AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
  L5   AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
  Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
  SELECT TOP(@n) n FROM Nums ORDER BY n;
GO
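To verify that the helper function works as expected, you can run a quick sanity check (my own test, not part of the article’s listings); it should return the integers 1 through 5:

SELECT n FROM dbo.GetNums(5);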

The code in Listing 2 populates the Customers table with 50,000 rows, the Employees table with 400 rows, the Shippers table with 10 rows, and the Orders table with 1,000,000 rows (approximately 240-byte row size, for a total of more than 200MB). To test your solutions with other data distributions, change the variables’ values. With the existing numbers, the custid attribute represents a partitioning column with low density (a large number of distinct custid values, each appearing a small number of times in the Orders table), whereas the shipperid attribute represents a partitioning column with high density (a small number of distinct shipperid values, each appearing a large number of times).

Listing 2: Code to Create and Populate Tables with Sample Data

DECLARE
  @num_orders      AS INT     ,
  @num_customers   AS INT     ,
  @num_employees   AS INT     ,
  @num_shippers    AS INT     ,
  @start_orderdate AS DATETIME,
  @end_orderdate   AS DATETIME;
 
SELECT
  @num_orders      =    1000000,
  @num_customers   =      50000,
  @num_employees   =    400,
  @num_shippers    =     10,
  @start_orderdate = '20060101',
  @end_orderdate   = '20100531';
 
IF OBJECT_ID('dbo.Orders'   , 'U') IS NOT NULL DROP TABLE dbo.Orders;
IF OBJECT_ID('dbo.Customers', 'U') IS NOT NULL DROP TABLE dbo.Customers;
IF OBJECT_ID('dbo.Employees', 'U') IS NOT NULL DROP TABLE dbo.Employees;
IF OBJECT_ID('dbo.Shippers' , 'U') IS NOT NULL DROP TABLE dbo.Shippers;
 
-- Customers
CREATE TABLE dbo.Customers
(
  custid   INT     NOT NULL,
  custname VARCHAR(50) NOT NULL,
  filler   CHAR(200)   NOT NULL DEFAULT('a'),
  CONSTRAINT PK_Customers PRIMARY KEY(custid)
);
 
INSERT INTO dbo.Customers WITH (TABLOCK) (custid, custname)
  SELECT n, 'Cust ' + CAST(n AS VARCHAR(10)) FROM dbo.GetNums(@num_customers);
 
-- Employees
CREATE TABLE dbo.Employees
(
  empid   INT     NOT NULL,
  empname VARCHAR(50) NOT NULL,
  filler  CHAR(200)   NOT NULL DEFAULT('a'),
  CONSTRAINT PK_Employees PRIMARY KEY(empid)
);
 
INSERT INTO dbo.Employees WITH (TABLOCK) (empid, empname)
  SELECT n, 'Emp ' + CAST(n AS VARCHAR(10)) FROM dbo.GetNums(@num_employees);
 
-- Shippers
CREATE TABLE dbo.Shippers
(
  shipperid   INT     NOT NULL,
  shippername VARCHAR(50) NOT NULL,
  filler      CHAR(200)   NOT NULL DEFAULT('a'),
  CONSTRAINT PK_Shippers PRIMARY KEY(shipperid)
);
 
INSERT INTO dbo.Shippers WITH (TABLOCK) (shipperid, shippername)
  SELECT n, 'Shipper ' + CAST(n AS VARCHAR(10)) FROM dbo.GetNums(@num_shippers);
 
-- Orders
CREATE TABLE dbo.Orders
(
  orderid    INT       NOT NULL,
  custid     INT       NOT NULL,
  empid      INT       NOT NULL,
  shipperid  INT       NOT NULL,
  orderdate  DATETIME  NOT NULL,
  shipdate   DATETIME  NULL,
  filler     CHAR(200) NOT NULL DEFAULT('a'),
  CONSTRAINT PK_Orders PRIMARY KEY(orderid)
);
 
WITH C AS
(
  SELECT
    n AS orderid,
    ABS(CHECKSUM(NEWID())) % @num_customers + 1 AS custid,
    ABS(CHECKSUM(NEWID())) % @num_employees + 1 AS empid,
    ABS(CHECKSUM(NEWID())) % @num_shippers  + 1 AS shipperid,
    DATEADD(day,
        ABS(CHECKSUM(NEWID()))
      % (DATEDIFF(day, @start_orderdate, @end_orderdate) + 1),
        @start_orderdate) AS orderdate,
    ABS(CHECKSUM(NEWID())) % 31 AS shipdays
  FROM dbo.GetNums(@num_orders)
)
INSERT INTO dbo.Orders WITH (TABLOCK) (orderid, custid, empid, shipperid, orderdate, shipdate)
  SELECT orderid, custid, empid, shipperid, orderdate,
    CASE
      WHEN DATEADD(day, shipdays, orderdate) > @end_orderdate THEN NULL
      ELSE DATEADD(day, shipdays, orderdate)
    END AS shipdate
  FROM C;
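If you want to confirm the densities just described in your own copy of the sample data, the following aggregate query is a minimal sketch you can run after Listing 2 (the column aliases are mine). With the default variable values, it should report roughly 50,000 distinct custid values with about 20 orders each, and 10 distinct shipperid values with about 100,000 orders each:

SELECT
  COUNT(DISTINCT custid)    AS distinct_custids,
  COUNT(DISTINCT shipperid) AS distinct_shipperids,
  COUNT(*) / COUNT(DISTINCT custid)    AS avg_orders_per_custid,   -- low density: many small groups
  COUNT(*) / COUNT(DISTINCT shipperid) AS avg_orders_per_shipperid -- high density: few large groups
FROM dbo.Orders;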

Regardless of the solution you use, if you can afford to create optimal indexing, the best approach is to create an index on the partitioning column (e.g., custid, if you need top rows per customer), plus the ordering columns as the keys (orderdate DESC, orderid DESC), and include in the index definition all the other attributes you need to return from the query (e.g., filler). As I mentioned, I’ll use custid as the partitioning column to demonstrate queries with low density and shipperid for high density. To demonstrate solutions when an index is available, you must run the following commands to create the indexes before testing:

CREATE UNIQUE NONCLUSTERED INDEX idx_unc_sid_odD_oidD_Ifiller
  ON dbo.Orders(shipperid, orderdate DESC, orderid DESC)
  INCLUDE(filler);

CREATE UNIQUE NONCLUSTERED INDEX idx_unc_cid_odD_oidD_Ifiller
  ON dbo.Orders(custid, orderdate DESC, orderid DESC)
  INCLUDE(filler);

To demonstrate solutions when the optimal index isn’t available, use the following code to drop the indexes:

DROP INDEX
  dbo.Orders.idx_unc_sid_odD_oidD_Ifiller,
  dbo.Orders.idx_unc_cid_odD_oidD_Ifiller;

My test machine has an Intel Core i7 processor (quad core with hyperthreading, hence eight logical processors), 4GB of RAM, and a 7,200RPM hard drive. I tested three solutions, based on the ROW_NUMBER function, the APPLY operator, and grouping and aggregation of concatenated elements.
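The article doesn’t show the measurement harness itself; a simple way to collect comparable duration, CPU, and logical-read numbers on your own machine (a sketch based on standard session options, not the author’s exact setup) is to wrap each solution as follows:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- place one of the solutions from Listing 3, 4, or 5 here

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

The logical reads appear in the STATISTICS IO output, and the CPU and elapsed times appear in the STATISTICS TIME output; alternatively, you can read the same counters from sys.dm_exec_query_stats after running each solution.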

Solution Based on ROW_NUMBER

Listing 3 contains the solution based on the ROW_NUMBER function. This solution assumes that the request was for the most recent order for each customer. The concept is straightforward—assign row numbers partitioned by custid, ordered by orderdate DESC, orderid DESC, and filter only the rows in which the row number is equal to 1.

Listing 3: Solution Based on ROW_NUMBER

WITH C AS
(
  SELECT custid, orderdate, orderid, filler,
    ROW_NUMBER() OVER(PARTITION BY custid
          ORDER BY orderdate DESC, orderid DESC) AS rownum
  FROM dbo.Orders
)
SELECT custid, orderdate, orderid, filler
FROM C
WHERE rownum = 1;

Some benefits of this solution aren’t related to performance. For example, the solution is simple and can easily be tailored to work with N > 1: simply change the filter to rownum <= N (e.g., rownum <= 3), as the variant below shows.
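As a concrete example, the following variant (a sketch of mine, not one of the article’s listings) returns the three most recent orders per customer; only the filter differs from Listing 3:

WITH C AS
(
  SELECT custid, orderdate, orderid, filler,
    ROW_NUMBER() OVER(PARTITION BY custid
          ORDER BY orderdate DESC, orderid DESC) AS rownum
  FROM dbo.Orders
)
SELECT custid, orderdate, orderid, filler
FROM C
WHERE rownum <= 3;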

As for performance, Figure 1 shows the plan for the query in Listing 3 when the optimal index is available—in both low-density and high-density cases—and Figure 2 shows the plan when the optimal index isn’t available. When an optimal index exists (i.e., one that supports the partitioning and ordering needs and covers the query), the bulk of the cost of the plan is associated with a single ordered scan of the index.

Note that when partitioning is involved in the ranking calculation, the direction of the index columns must match the direction of the columns in the calculation’s ORDER BY list for the optimizer to consider relying on index ordering. Thus, if you create the index with the keylist (custid, orderdate, orderid) with both orderdate and orderid ascending, you’ll end up with an expensive sort in the plan. Apparently, the optimizer wasn’t coded to realize that a backward scan of the index could be used in such a case.
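For illustration only, such an index would look like the following (the index name is mine); if you created it in place of idx_unc_cid_odD_oidD_Ifiller, the plan for the query in Listing 3 would include an expensive Sort operator even though the key columns are otherwise identical:

CREATE UNIQUE NONCLUSTERED INDEX idx_unc_cid_od_oid_Ifiller
  ON dbo.Orders(custid, orderdate, orderid)
  INCLUDE(filler);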

Returning to the plan in Figure 1, very little cost is associated with the actual calculation of the row numbers and the filtering. No expensive sorting is required. Such a plan is ideal for a low-density scenario (7.7 seconds on my system, mostly attributed to the scan of about 28,000 pages). This plan is also acceptable for a high-density scenario (6.1 seconds on my system); however, a far more efficient solution based on the APPLY operator is available.

When an optimal index doesn’t exist, the ROW_NUMBER solution is very inefficient. The table is scanned only once, but there’s a high cost associated with sorting, as you can see in Figure 2.

Solution Based on APPLY

The second solution is based on the APPLY operator. Listing 4 shows an example for this solution, with custid as the partitioning column and 1 as the number of most recent orders to return per partition. This solution queries the table representing the partitioning entity (custid) and uses the APPLY operator to apply a query that returns the TOP(N) most recent orders for each outer row. This solution is straightforward and can easily be tailored to support multiple rows per partition: simply adjust the input to the TOP option, as in the variant shown after Listing 4.

Listing 4: Solution Based on APPLY

SELECT A.*
FROM dbo.Customers AS C
  CROSS APPLY (SELECT TOP (1) custid, orderdate, orderid, filler
       FROM dbo.Orders AS O
       WHERE O.custid = C.custid
       ORDER BY orderdate DESC, orderid DESC) AS A;
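For example, the following variant (my sketch, not one of the article’s listings) returns the three most recent orders per customer; only the TOP input differs from Listing 4:

SELECT A.*
FROM dbo.Customers AS C
  CROSS APPLY (SELECT TOP (3) custid, orderdate, orderid, filler
       FROM dbo.Orders AS O
       WHERE O.custid = C.custid
       ORDER BY orderdate DESC, orderid DESC) AS A;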

As for performance, Figure 3 shows the plan for the query in Listing 4 when the optimal index is available, for both low-density and high-density cases. Figure 4 shows the plans when the index isn’t available; note the different plans for the low-density (upper plan) and high-density (lower plan) cases.

You can see in the plan in Figure 3 that when an index is available, the plan scans the table representing the partitioning entity, then for each row performs a seek and partial scan in the index on the Orders table. This plan is highly efficient in high-density cases because the number of seek operations is equal to the number of partitions; with a small number of partitions, there’s a small number of seek operations. When I ran this solution for the high-density case (with shipperid instead of custid), when the index was available, I got only 39 logical reads—compared with about 28,000 reads in the previous solution, based on ROW_NUMBER. In the low-density case there will be many seek operations, and the I/O cost will become accordingly high, to a point where the ROW_NUMBER solution will be cheaper. As an example, when using custid as the partitioning column, the solution in Listing 4 involves more than 350,000 reads.

As for the efficiency of this solution when an optimal index doesn’t exist, observe the plans in Figure 4. In the low-density case (upper plan), for each partition, the optimizer scans the clustered index of the Orders table, creates an index (index spool), and applies a Top N Sort. In the high-density case (lower plan), the optimizer doesn’t create an index but follows the other steps. In both cases the plans involve an extremely high number of reads (more than 3,000,000 in the low-density case and more than 300,000 in the high-density case), as well as the sorting costs (although it’s not a full sort but rather a Top N Sort with the purpose of filtering).

Solution Based on Concatenation

Some solutions work well only when the right indexes are available, but without those indexes the solutions perform badly. Such is the case with both the solutions I presented. Creating new indexes isn’t always an option: Perhaps your organization has policies against doing so; or perhaps write performance is paramount on your systems, and adding indexes adds overhead to the writes. If creating an optimal index to support the ROW_NUMBER or APPLY solutions isn’t an option, you can use a solution based on grouping and aggregation of concatenated elements. Listing 5 presents such a solution.

Listing 5: Solution Based on Concatenation

WITH C AS
(
  SELECT
    custid,
    MAX((CONVERT(CHAR(8), orderdate, 112)
     + RIGHT('000000000' + CAST(orderid AS VARCHAR(10)), 10)
     + filler) COLLATE Latin1_General_BIN2) AS string
  FROM dbo.Orders
  GROUP BY custid
)
SELECT custid,
  CAST(SUBSTRING(string,  1,   8) AS DATETIME ) AS orderdate,
  CAST(SUBSTRING(string,  9,  10) AS INT      ) AS orderid,
  CAST(SUBSTRING(string, 19, 200) AS CHAR(200)) AS filler
FROM C;

The main downside of this solution is that it’s complex and unintuitive, unlike the others. The common table expression (CTE) query groups the data by the partitioning column. The query applies the MAX aggregate to the concatenated ordering + covered elements after converting them to a common form that preserves the ordering and comparison behavior of the ordering elements (fixed-size character strings with a binary collation). Of course, you need to make sure that ordering and comparison behavior is preserved, hence the use of style 112 (YYYYMMDD) for the order dates and the addition of leading zeros to the order IDs. For example, an order placed on May 31, 2010, with order ID 123 is represented as '20100531' followed by '0000000123' followed by the filler value. The outer query then breaks the concatenated strings into the individual elements and converts them back to their original types. Another downside of this solution is that it works only with N = 1.

As for performance, observe the plans for this solution, shown in Figure 5 (with the index in place) and Figure 6 (no index). With an index in place, the optimizer relies on index ordering and applies a stream aggregate. In the low-density case (upper plan), the optimizer uses a serial plan, whereas in the high-density case (lower plan), it utilizes parallelism, splitting the work into local (per-thread) and then global aggregates. These plans are very efficient, with performance similar to the solution based on ROW_NUMBER; however, the complexity of this solution and the fact that it works only with N = 1 make it less preferable.

Observe the plans in Figure 6 (no index in place). The plans scan the table once and use a hash aggregate for the majority of the aggregate work (it’s the only aggregate operator in the low-density case and the local aggregate operator in the high-density case). A hash aggregate doesn’t involve sorting and is more efficient than a stream aggregate when dealing with a large number of rows. In the high-density case, a sort and a stream aggregate are used only to process the global aggregate, which involves a small number of rows and is therefore inexpensive. Note that in both plans in Figure 6 the bulk of the cost is associated with the single scan of the data.

Benchmark Results

You can find benchmark results for my three solutions in Figure 7 (duration), Figure 8 (CPU), and Figure 9 (logical reads). Examine the duration and CPU statistics in Figure 7 and Figure 8. Note that when an index exists, the ROW_NUMBER solution is the most efficient in the low-density case and the APPLY solution is the most efficient in the high-density case. Without an index, the aggregate concatenation solution works best in all densities. In Figure 9 you can see how inefficient the APPLY solution is in terms of I/O, in all cases but one—high density with an index available. (In the low-density without index scenario, the bar for APPLY actually extends about seven times beyond the ceiling of the figure, but I wanted to use a scale that reasonably compared the other bars.)

Comparing these solutions demonstrates the importance of writing multiple solutions for a given task and carefully analyzing their characteristics. You must understand the strengths and weaknesses of each solution, as well as the circumstances in which each solution excels.
