
What You Need to Know about the Batch Mode Window Aggregate Operator in SQL Server 2016: Part 2

SQL Server 2016 introduces the batch mode Window Aggregate operator, which dramatically improves the performance of calculating window functions. Besides the general performance advantages of batch mode processing compared to row mode processing, this operator uses a dedicated code path for each window function. Many inefficiencies in the original row mode optimization are removed. For example, the need for an on-disk spool is eliminated by maintaining the window of rows in memory and using a multi-pass process over that window in a streaming fashion.

Last month I covered frameless aggregate window functions. This month I’ll focus on ranking and statistical window functions.

Sample Data

In this article I will use the same sample data that I used last month. If you don’t have it available already, use the following code to create it:

-- testwindow database

SET NOCOUNT ON;

IF DB_ID(N'testwindow') IS NULL CREATE DATABASE testwindow;

GO

USE testwindow;

GO


-- GetNums helper function

DROP FUNCTION IF EXISTS dbo.GetNums;

GO


CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE

AS

RETURN

  WITH

L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),

L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),

L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),

L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),

L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),

L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),

Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum

     FROM L5)

  SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n

  FROM Nums

  ORDER BY rownum;

GO


-- Transactions table with 10M rows (200 accounts x 50,000 transactions per account)

-- Traditional rowstore B-tree index

DROP TABLE IF EXISTS dbo.Transactions;


CREATE TABLE dbo.Transactions

(

  actid  INT   NOT NULL,

  tranid INT   NOT NULL,

  val    MONEY NOT NULL,

  CONSTRAINT PK_Transactions PRIMARY KEY(actid, tranid)

);

GO


DECLARE

  @num_partitions     AS INT = 200,

  @rows_per_partition AS INT = 50000;


INSERT INTO dbo.Transactions WITH (TABLOCK) (actid, tranid, val)

  SELECT NP.n, RPP.n,

(ABS(CHECKSUM(NEWID())%2)*2-1) * (1 + ABS(CHECKSUM(NEWID())%5))

  FROM dbo.GetNums(1, @num_partitions) AS NP

CROSS JOIN dbo.GetNums(1, @rows_per_partition) AS RPP;

GO

 

As a reminder, the sample data is created in a database called testwindow. There are three tables holding identical copies of the data, which consists of 10,000,000 rows representing bank account transactions (200 accounts x 50,000 transactions per account). The table Transactions uses only a rowstore representation of the data, with a clustered B-tree index defined on the key list (actid, tranid). Queries against this table currently use only row mode processing due to the requirement to have at least one columnstore index present to enable batch mode processing, so I’ll use this table to demonstrate row mode processing over rowstore. The table TransactionsCS has only a clustered columnstore index, so I’ll use it when I want to demonstrate batch mode processing over columnstore. The table TransactionsDCS uses the same organization as Transactions (a clustered B-tree index), but in addition has a dummy empty filtered columnstore index to enable batch mode processing. I’ll use this table when I want to demonstrate batch mode processing over rowstore.
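The listing above creates only the Transactions table. If you no longer have the other two copies from last month, a sketch along these lines should reproduce them (an assumption on my part: the exact definitions in last month’s article may differ slightly; the contradictory filter predicate keeps the dummy columnstore index empty):

-- Sketch; exact definitions from Part 1 may differ

-- TransactionsCS: clustered columnstore copy of the data
DROP TABLE IF EXISTS dbo.TransactionsCS;

SELECT * INTO dbo.TransactionsCS FROM dbo.Transactions;

CREATE CLUSTERED COLUMNSTORE INDEX idx_cs ON dbo.TransactionsCS;
GO

-- TransactionsDCS: same B-tree organization as Transactions, plus a dummy
-- empty filtered columnstore index to enable batch mode processing
DROP TABLE IF EXISTS dbo.TransactionsDCS;

SELECT * INTO dbo.TransactionsDCS FROM dbo.Transactions;

ALTER TABLE dbo.TransactionsDCS
  ADD CONSTRAINT PK_TransactionsDCS PRIMARY KEY(actid, tranid);

CREATE NONCLUSTERED COLUMNSTORE INDEX idx_cs_dummy
  ON dbo.TransactionsDCS(actid)
  WHERE actid = -1 AND actid = -2;
GO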

Ranking Window Functions

T-SQL supports four ranking window functions: ROW_NUMBER, RANK, DENSE_RANK and NTILE. I’ll discuss the first three as one group since they are optimized similarly, and NTILE separately since it is optimized differently. The following query (call it Query 1) demonstrates the use of the first three functions when there’s a supporting B-tree index:

SELECT actid, tranid, val,

  ROW_NUMBER() OVER(PARTITION BY actid ORDER BY tranid) AS rownum,

  RANK() OVER(PARTITION BY actid ORDER BY tranid) AS rnk,

  DENSE_RANK() OVER(PARTITION BY actid ORDER BY tranid) AS drnk

FROM dbo.Transactions;

The index helps both in covering the query elements and in avoiding the need for sorting. It follows the POC pattern, where P stands for partitioning, O for ordering and C for covering. The P and O parts form the index key list, and the C part just needs to be included. Since our B-tree index is a clustered index, it naturally covers the query. If the query had equality-based filters, you would place the filter elements as leading elements in the key list, giving you the FPOC pattern (with F for filtering). Since our sample query doesn’t have equality-based filters, there’s no F part in our index; it has just the POC parts.
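As a hypothetical illustration of the FPOC pattern (not needed for our sample queries, which the clustered index already supports), suppose a variant of Query 1 had an equality filter such as WHERE val = 1; the filter column would then lead the key list:

-- Hypothetical FPOC index: F = val (equality filter), P = actid, O = tranid;
-- all referenced columns are in the key, so the index also covers the query
CREATE NONCLUSTERED INDEX idx_fpoc ON dbo.Transactions(val, actid, tranid);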

The execution plan for Query 1 is shown in Figure 1. It uses row mode processing over the rowstore representation of the data.

Figure 1: Plan for Query 1 (ranking functions, row mode over rowstore)

The plan scans the POC index in order, and therefore there’s no need for a Sort operator. One Segment operator flags the Sequence Project operator when a new partition starts (used for all three functions). Another Segment operator flags the Sequence Project operator when an ordering value changes (used for RANK and DENSE_RANK). The Sequence Project operator handles the actual computation, advancing the rank value based on the function’s semantics. It calculates the ranking functions in a streaming fashion without the need for a spool, which makes it very efficient despite the fact that it uses row mode processing. If you can prepare a POC index to support your query, as in this example, the plan can rely on index order and avoid a performance penalty related to sorting.

The performance statistics for this query on my system are: duration: 12 sec, CPU: 12 sec, logical reads: 31K, writes: 0.
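If you want to collect similar numbers on your own system, one simple option (a sketch; the measurements could equally be collected with Extended Events or Query Store) is to use the session statistics settings:

-- Report duration/CPU (STATISTICS TIME) and logical reads (STATISTICS IO)
-- in the Messages tab for each query you run; turn off when done
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- ... run the query under test ...

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;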

The main drawback of the row mode processing of these ranking functions is that they are always computed in a serial zone in the plan, so you don’t benefit from having a strong machine with many CPUs.

If you don’t have a supporting POC index, the performance of the query degrades significantly due to the need for sorting. Even if the optimizer chooses parallel scanning and sorting of the data, it still needs to convert the parallel streams into one serial stream using a Gather Streams exchange operator. To demonstrate such optimization, change the ordering of the ranking functions to be based on the column val instead of tranid, like so (call it Query 2):

SELECT actid, tranid, val,

  ROW_NUMBER() OVER(PARTITION BY actid ORDER BY val) AS rownum,

  RANK() OVER(PARTITION BY actid ORDER BY val) AS rnk,

  DENSE_RANK() OVER(PARTITION BY actid ORDER BY val) AS drnk

FROM dbo.Transactions;

 

The plan for this query is shown in Figure 2.

 

Figure 2: Plan for Query 2 (ranking functions, row mode over rowstore, with sorting)

There’s no POC index present to support this query, and therefore the plan uses a Sort operator (which in this example also spills to tempdb). This example demonstrates the performance implications when a sort is required. Remember that in practice your queries often involve additional elements, like range filters and joins, that may prevent you from creating a supporting POC index. Sorting may then be unavoidable. The performance statistics that I got for this query are: duration: 28 sec, CPU: 53 sec, logical reads: 53K, writes: 13. Notice that it took more than twice as long to complete compared to Query 1.

To test the performance of batch processing over columnstore, test any of the above queries against TransactionsCS, like so (call it Query 3):

SELECT actid, tranid, val,

  ROW_NUMBER() OVER(PARTITION BY actid ORDER BY tranid) AS rownum,

  RANK() OVER(PARTITION BY actid ORDER BY tranid) AS rnk,

  DENSE_RANK() OVER(PARTITION BY actid ORDER BY tranid) AS drnk

FROM dbo.TransactionsCS;

 

The plan for this query is shown in Figure 3.

 

Figure 3: Plan for Query 3 (ranking functions, batch mode over columnstore)

Remember that when querying columnstore, sorting for the computation of window functions is unavoidable since columnstore data isn’t ordered; however, you do get the benefits of reduced I/O cost, batch mode processing and parallelism, with much better scaling for larger numbers of CPUs compared to row mode processing. I got the following performance statistics for this query: duration: 10 sec, CPU: 23 sec, logical reads: 6K, writes: 0. Mind you, my test laptop has four logical CPUs (two cores with hyperthreading). On a more serious machine with more CPUs, the performance would naturally be much better.

You will get a similar plan and similar performance statistics if you change the window ordering to be based on the column val instead of tranid. You can test this by using the following query (call it Query 4):

SELECT actid, tranid, val,

  ROW_NUMBER() OVER(PARTITION BY actid ORDER BY val) AS rownum,

  RANK() OVER(PARTITION BY actid ORDER BY val) AS rnk,

  DENSE_RANK() OVER(PARTITION BY actid ORDER BY val) AS drnk

FROM dbo.TransactionsCS;

 

To test the query performance with batch mode over rowstore optimization, issue the query against the table TransactionsDCS, like so (call it Query 5):

 

SELECT actid, tranid, val,

  ROW_NUMBER() OVER(PARTITION BY actid ORDER BY tranid) AS rownum,

  RANK() OVER(PARTITION BY actid ORDER BY tranid) AS rnk,

  DENSE_RANK() OVER(PARTITION BY actid ORDER BY tranid) AS drnk

FROM dbo.TransactionsDCS;

Recall that the optimizer can’t rely on a parallel ordered B-tree index scan to support a batch mode Window Aggregate operator. Since the row mode optimization of the above three ranking functions is pretty efficient when the data is preordered in a B-tree index, albeit with serial processing, the optimizer creates the same plan for Query 5 as the one it created for Query 1 (shown earlier in Figure 1), and you get similar performance.

If you query the TransactionsDCS table and don’t have a POC index to support your window functions, explicit sorting is needed anyway, so the optimizer uses the batch mode over rowstore strategy. The following query (call it Query 6) demonstrates this:


SELECT actid, tranid, val,

  ROW_NUMBER() OVER(PARTITION BY actid ORDER BY val) AS rownum,

  RANK() OVER(PARTITION BY actid ORDER BY val) AS rnk,

  DENSE_RANK() OVER(PARTITION BY actid ORDER BY val) AS drnk

FROM dbo.TransactionsDCS;

The plan for this query is shown in Figure 4.

Figure 4: Plan for Query 6 (ranking functions, batch mode over rowstore)

The plan is a parallel plan. Notice that the B-tree index is scanned using row mode in an Ordered: False fashion since its order is not useful for the window functions. The rest of the work is done using batch mode operators. The performance statistics that I got for this query are: duration: 12 sec, CPU: 30 sec, logical reads: 31K, writes: 0.

Figure 5 shows a summary of the run times for the different queries.

Figure 5: Graph summarizing performance of three ranking functions

As you can see, row mode processing over rowstore does reasonably well only when there’s a supporting POC index to avoid the need for sorting. Still, the plan is serial, so you won’t benefit from having a strong machine with many CPUs. Batch mode does well even when sorting is needed, and also scales very well with multiple CPUs.

The NTILE window function computes tile numbers for result rows. The tile size is the partition’s row count divided by the requested number of tiles; for example, with 50,000 rows per account and 10 requested tiles, each tile contains 5,000 rows. The following query (call it Query 7) against the table Transactions (row mode over rowstore) computes tile numbers based on a requested number of 10 tiles:

SELECT actid, tranid, val,

  NTILE(10) OVER(PARTITION BY actid ORDER BY tranid) AS ntile10

FROM dbo.Transactions;
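As a side note, when the row count doesn’t divide evenly by the number of tiles, the leading tiles get one extra row each. Here’s a minimal standalone illustration:

-- 7 rows and 3 tiles: tile sizes are 3, 2 and 2
SELECT v, NTILE(3) OVER(ORDER BY v) AS tile
FROM (VALUES(1),(2),(3),(4),(5),(6),(7)) AS D(v);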

 

The plan for Query 7 is shown in Figure 6.

 

Figure 6: Plan for Query 7 (NTILE, row mode over rowstore)

As you can see, the plan is quite elaborate, and a bit similar to the plans I showed last month for frameless aggregate window functions. That’s because the NTILE function needs to compute the count of rows in the partition before it can compute the tile size. The calculation of the row count is optimized similarly to the way the function COUNT(*) OVER(PARTITION BY actid) would get optimized. The plan reads the data in order from the B-tree rowstore index, segments it by the actid column, and stores it in an on-disk spool. For each segment, the middle and bottom branches of the plan read the data from the spool one time to compute the row count and another time to get the detail rows. The last part of the plan (at the top left) segments the data again by the actid column, and then the Sequence Project operator computes the tile number. This plan is not very efficient, mainly due to the use of the on-disk spool. The performance statistics that I got for this query are: duration: 47 sec, CPU: 47 sec, logical reads: 20M, writes: 40K.
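To see that count optimization in isolation, you can examine the plan for the equivalent standalone frameless aggregate:

-- The count part of NTILE is optimized like this frameless window aggregate
SELECT actid, tranid, val,
  COUNT(*) OVER(PARTITION BY actid) AS cnt
FROM dbo.Transactions;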

If you didn’t have a supporting POC index, matters would have been even worse because the plan would have had to also include a sort operation. For example, when I changed the ordering element in the last query from tranid to val, the query took 57 seconds to complete.

The count part of the calculation can be handled with parallel batch processing (the Window Aggregate operator) if a columnstore index is present. In my testing, I always get the calculation of the tile numbers in a serial zone. If the optimizer chooses row mode processing for the calculation of the tile numbers, it uses a combination of Segment and Sequence Project operators; if it chooses batch mode processing, it uses a Window Aggregate operator. Since the biggest performance penalty in the last query had to do with the computation of the count due to the use of the on-disk spool, batch processing here can result in significantly improved performance, even if a sort is needed. To demonstrate this, query the TransactionsCS table, like so (call it Query 8):

SELECT actid, tranid, val,

  NTILE(10) OVER(PARTITION BY actid ORDER BY tranid) AS ntile10

FROM dbo.TransactionsCS;

The plan for this query is shown in Figure 7.

Figure 7: Plan for Query 8 (NTILE, batch mode over columnstore)

Notice that the count aggregate is computed using a batch mode Window Aggregate operator in a parallel zone in the plan, and then the tile numbers are computed using row mode processing with the Segment and Sequence Project operators in a serial zone. I got the following performance statistics for this query: duration: 14 sec, CPU: 23 sec, logical reads: 6K, writes: 0. That’s a pretty big improvement compared to the previous run time of close to a minute.

To test batch mode processing over rowstore, without the need for sorting, use the following query (call it Query 9) against the table TransactionsDCS:

SELECT actid, tranid, val,

  NTILE(10) OVER(PARTITION BY actid ORDER BY tranid) AS ntile10

FROM dbo.TransactionsDCS;

The plan for this query is shown in Figure 8.

Figure 8: Plan for Query 9 (NTILE, batch mode over rowstore, no sort)

The plan is serial, but there’s no need for sorting since the data is pulled ordered from the B-tree rowstore index. Both the calculation of the count aggregate and the actual computation of the tile numbers are done using batch mode Window Aggregate operators. The performance statistics that I got for this query are: duration: 8 sec, CPU: 8 sec, logical reads: 31K, writes: 0. The query run time dropped to 8 seconds!

If you didn’t have a POC index to support the computation of the window functions, then there would have been the extra cost of sorting. To demonstrate this, change the ordering of the NTILE function from being based on the tranid column to the val column, like so (call this Query 10):

SELECT actid, tranid, val,

  NTILE(10) OVER(PARTITION BY actid ORDER BY val) AS ntile10

FROM dbo.TransactionsDCS;

The plan for this query is shown in Figure 9.

Figure 9: Plan for Query 10 (NTILE, batch mode over rowstore, with sort)

The plan scans the B-tree rowstore index in an Ordered: False fashion using row mode processing. It then uses batch mode Sort and Window Aggregate operators to compute the count, and switches to a serial zone to compute the tile numbers in row mode with the Segment and Sequence Project operators. I got the following performance statistics for this query: duration: 16 sec, CPU: 30 sec, logical reads: 31K, writes: 0.

Figure 10 has a graph summarizing the performance of the different queries that compute the NTILE function.

Figure 10: Graph summarizing the performance of the NTILE function

As you can see, the queries that use batch mode processing perform significantly better than the ones using row mode processing.

Statistical Window Functions

T-SQL supports two pairs of window distribution functions that perform statistical calculations. The inverse distribution functions PERCENTILE_CONT and PERCENTILE_DISC compute percentiles using continuous and discrete distribution models, respectively. The rank distribution functions PERCENT_RANK and CUME_DIST compute percentile ranks and cumulative distributions, respectively. For details about the meaning of these functions, see the section Window Distribution Functions in the article Microsoft SQL Server 2012: How to Write T-SQL Window Functions, Part 2.
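As a quick refresher on the semantics: PERCENT_RANK returns (rank - 1) / (count - 1), CUME_DIST returns the fraction of rows with an ordering value lower than or equal to the current one, and the two percentile functions differ in that PERCENTILE_CONT interpolates between values while PERCENTILE_DISC returns an actual stored value. Here’s a minimal illustration:

-- Median of 1, 2, 3, 4: CONT interpolates to 2.5; DISC returns the stored value 2
SELECT DISTINCT
  PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY v) OVER() AS cont_median,
  PERCENTILE_DISC(0.5) WITHIN GROUP(ORDER BY v) OVER() AS disc_median
FROM (VALUES(1),(2),(3),(4)) AS D(v);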

Traditional row mode over rowstore processing for these functions is very inefficient, mainly due to the use of an on-disk spool, in some cases multiple times in the query plan. I used the following queries to test the performance of these functions (performance characteristics are included as comments):

---------------------------------------------------------------------

-- PERCENTILE_CONT (PERCENTILE_DISC has similar optimization)

---------------------------------------------------------------------


-- Query 11, row mode over rowstore, no sort

-- Two rounds of writing to and reading from an on-disk spool

-- duration: 110 sec, CPU: 121 sec, logical reads: 41M, writes: 101K

-- If you change to ORDER BY val, will require a sort

SELECT actid, tranid, val,

  PERCENTILE_CONT(0.5)

WITHIN GROUP(ORDER BY tranid) OVER(PARTITION BY actid) AS mediantid

FROM dbo.Transactions;


-- Query 12, batch mode over columnstore, sort, no spool

-- duration: 14 sec, CPU: 13 sec, logical reads: 5K, writes: 0

SELECT actid, tranid, val,

  PERCENTILE_CONT(0.5)

WITHIN GROUP(ORDER BY tranid) OVER(PARTITION BY actid) AS mediantid

FROM dbo.TransactionsCS;


-- Query 13, batch mode over rowstore, no sort, no spool

-- duration: 11 sec, CPU: 11 sec, logical reads: 31K, writes: 0

SELECT actid, tranid, val,

  PERCENTILE_CONT(0.5)

WITHIN GROUP(ORDER BY tranid) OVER(PARTITION BY actid) AS mediantid

FROM dbo.TransactionsDCS;


---------------------------------------------------------------------

-- PERCENT_RANK and CUME_DIST

---------------------------------------------------------------------


-- Query 14, row mode over rowstore, no sort, on-disk spool

-- duration: 54 sec, CPU: 53 sec, logical reads: 20M, writes: 40K

SELECT actid, tranid, val,

  PERCENT_RANK() OVER(PARTITION BY actid ORDER BY tranid) AS pctrk

FROM dbo.Transactions;


-- Query 15, row mode over rowstore, no sort, on-disk spool

-- duration: 239 sec, CPU: 250 sec, logical reads: 125M, writes: 82K

SELECT actid, tranid, val,

  CUME_DIST() OVER(PARTITION BY actid ORDER BY tranid) AS cumedist

FROM dbo.Transactions;


-- Query 16, batch mode over columnstore, sort, no spool

-- duration: 17 sec, CPU: 17 sec, logical reads: 6K, writes: 0K

SELECT actid, tranid, val,

  PERCENT_RANK() OVER(PARTITION BY actid ORDER BY tranid) AS pctrk

FROM dbo.TransactionsCS;


-- Query 17, batch mode over columnstore, sort, no spool

-- duration: 11 sec, CPU: 10 sec, logical reads: 6K, writes: 0K

SELECT actid, tranid, val,

  CUME_DIST() OVER(PARTITION BY actid ORDER BY tranid) AS cumedist

FROM dbo.TransactionsCS;


-- Query 18, batch mode over rowstore, no sort, no spool

-- duration: 8 sec, CPU: 8 sec, logical reads: 31K, writes: 0K

SELECT actid, tranid, val,

  PERCENT_RANK() OVER(PARTITION BY actid ORDER BY tranid) AS pctrk

FROM dbo.TransactionsDCS;


-- Query 19, batch mode over rowstore, no sort, no spool

-- duration: 8 sec, CPU: 8 sec, logical reads: 31K, writes: 0K

SELECT actid, tranid, val,

  CUME_DIST() OVER(PARTITION BY actid ORDER BY tranid) AS cumedist

FROM dbo.TransactionsDCS;


 

Figure 11 shows a graph with the run times of the different queries that compute the PERCENTILE_CONT function. The performance characteristics for PERCENTILE_DISC are similar.

 

Figure 11: Graph summarizing the performance of the PERCENTILE_CONT and PERCENTILE_DISC functions

Figure 12 shows a graph with the run times of the different queries that compute the PERCENT_RANK and CUME_DIST functions.

Figure 12: Graph summarizing the performance of the PERCENT_RANK and CUME_DIST functions

What’s Next?

In this article I focused on the optimization of ranking and statistical window functions using the batch mode Window Aggregate operator. The performance improvements realized with the new operator are pretty impressive. They are especially significant for functions that were previously optimized with an on-disk spool, like NTILE, PERCENTILE_CONT, PERCENTILE_DISC, PERCENT_RANK and CUME_DIST. One of the important benefits of the optimized operator is that no changes are required to your code. All you need is a columnstore index present, even a dummy empty one, to enable the use of batch mode operators; it is likely that this requirement will be lifted in the future.

Next month I will complete this series by covering batch mode optimization of aggregate window functions with a frame, like the ones you use to compute running totals, previous/next row values, and other calculations.

 

 
