What You Need to Know about the Batch Mode Window Aggregate Operator in SQL Server 2016: Part 1



Editorial note: This month I was scheduled to cover Part 6 of the Logical Query Processing series, but now that SQL Server 2016 has been released, I feel like I have to interrupt that series with a three-part series about an exciting new feature: the batch mode Window Aggregate operator. Once I’m done covering this feature, I’ll get back to the Logical Query Processing series. Consider the coverage of the new feature as breaking news!

Some think that window functions are only useful for solving specialized data analysis tasks. In practice, they serve a much bigger and more important role: they provide an alternative to the traditional predicate-based paradigm for solving general querying tasks. In other words, instead of solving querying tasks with classic solutions based on joins and subqueries with predicates in the ON and WHERE clauses, you can use window functions. The main benefits of using window functions are that they often allow more elegant and more efficient solutions than the traditional tools do. I’ve been demonstrating such solutions for over a decade in my column, and I’ll continue doing so in the future. To me, it’s only a matter of time until more and more people realize their power, and until window functions become more and more abundant in people’s code. It’s natural evolution that better tools replace less efficient ones over time.

There are three main milestones in SQL Server’s support for window functions. SQL Server 2005 introduced frameless aggregate window functions and ranking window functions. SQL Server 2012 introduced aggregate window functions with a frame, as well as offset and statistical window functions. SQL Server 2016 introduces support for the batch mode Window Aggregate operator, which significantly improves the performance of most window functions. This operator is the focus of this series of articles.

There are still many more powerful window function capabilities in the SQL standard and in other database platforms that hopefully we will see SQL Server supporting in the future. For example, the standard nested window functions and RANGE window frame unit with INTERVAL-based delimiters, Teradata’s RESET WHEN phrase, Oracle’s MATCH_RECOGNIZE clause, and more…

Note that if you want to be able to run the code samples from this article you will need access to SQL Server 2016 Enterprise, Enterprise Evaluation, or Developer edition.

Many thanks to Milan Stojic, Milan Ruzic, and their colleagues from the SQL Server dev team for sharing information about the Window Aggregate operator, and for making such amazing improvements in the engine!

Batch mode processing and the backdoor to enable it over rowstore

Historically, SQL Server used a row-based execution model. This means that the operators in the query execution plan process one row at a time. SQL Server has to evaluate metadata information per row. Any internal functions that the operator invokes are applied per row. SQL Server 2012 introduced a new execution engine that, for operators that support it, processes a batch of rows at a time instead of a single row at a time. SQL Server 2014 and 2016 significantly improve batch processing capabilities. Figure 1 shows an illustration of a batch of rows.

Figure 1: Batch

A batch consists of 900 rows. There’s a vector of 900 elements for each applicable column, plus a qualifying rows vector that indicates which rows are still logically part of the batch and which have been removed, for example by a filter. Operators that support batch mode process a whole batch at a time. This means they evaluate metadata only once per batch, and use the batch as the input unit for functions that they invoke. Batch processing significantly reduces the number of CPU cycles needed and utilizes the CPU cache much better. When dealing with large numbers of rows, batch processing can improve query performance by orders of magnitude. When analyzing query execution plans, you can see whether an operator was supposed to use batch mode and whether it actually did in the operator’s properties Estimated Execution Mode and Actual Execution Mode, respectively.
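To make the batch layout concrete, here is a minimal sketch in plain Python (not T-SQL, and not SQL Server internals; the names and layout are illustrative assumptions) of a batch as per-column vectors plus a qualifying rows vector that a filter flips instead of copying surviving rows around:

```python
# Conceptual sketch of the batch layout described above: one vector per
# column plus a qualifying-rows vector marking which rows are still
# logically part of the batch.
BATCH_SIZE = 900

def make_batch(values):
    return {
        "col1": list(values),                # one vector per column
        "qualifying": [True] * len(values),  # all rows start as qualifying
    }

def apply_filter(batch, predicate):
    # A batch mode filter flips bits in the qualifying-rows vector in one
    # pass over the column vector; it doesn't copy surviving rows around.
    for i, v in enumerate(batch["col1"]):
        if batch["qualifying"][i] and not predicate(v):
            batch["qualifying"][i] = False

def batch_sum(batch):
    # Aggregate only the rows that are still logically in the batch.
    return sum(v for v, ok in zip(batch["col1"], batch["qualifying"]) if ok)

batch = make_batch(range(BATCH_SIZE))        # values 0..899
apply_filter(batch, lambda v: v % 2 == 0)    # keep even values only
print(batch_sum(batch))                      # prints 202050
```

Per-row metadata evaluation and per-row function invocation disappear; the filter and the aggregate each make a single pass over a contiguous vector, which is the source of the CPU-cycle and cache benefits mentioned above.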

So far, in SQL Server versions 2012, 2014 and 2016, there’s a requirement that at least one columnstore index be present on one of the queried tables to enable the use of batch processing. Also, so far, columnstore indexes are supported only in the Enterprise edition of SQL Server. The columnstore technology lends itself to batch processing since it stores the data in a compressed, column-oriented form, but batch processing can also improve query performance significantly when the data is pulled from traditional rowstore. SQL Server can convert individual rows that it pulls from a rowstore heap or B-tree into batches, and then use operators that support batch mode to process the data.

SQL Server 2016 introduces a number of batch processing related improvements that can significantly improve the performance of queries using window functions. Those include a superfast batch Window Aggregate operator, a batch Sort operator and even support for batch mode processing in serial plans.

If most of your queries benefit from a columnstore representation of the data anyway, the upgrade to 2016 will just make them run much faster. But what if your queries actually benefit from a rowstore representation of the data, and could do much better with batch processing? For example, if your queries perform ordered calculations such as the ones used by window functions, pulling the data ordered from a B-tree index can eliminate the cost of sorting, which is mandatory with a columnstore index. Admittedly, in many cases the queries are too complex, involving joins and other query elements, to make it even possible to rely on a B-tree index to preorder the data. But some cases are simple enough to make it possible.

One solution is to keep both rowstore and columnstore representations of the data, and let the optimizer choose the best for each query. In SQL Server 2016 you can have the main representation of the data be an updateable clustered columnstore index plus have nonclustered B-tree indexes on the same table; or, you can have the main representation of the data be a clustered B-tree rowstore index plus have an updateable nonclustered columnstore index on the same table. In other words, the two representations of the data can happily coexist.

But if you need to keep only a rowstore representation of the data, yet you still want to benefit from batch processing, there is a simple backdoor…

“The code is hidden in tumblers. One position opens a lock. Another position opens one of these doors.”

-- The Keymaker, The Matrix Reloaded

SQL Server 2016 introduces filtered nonclustered columnstore indexes. This means that you can create a dummy empty nonclustered columnstore index on your table simply to enable the use of batch mode operators in the plans for queries against that table. For example, suppose that you have a table called T1 with an integer column called col1. You can create the following index:

CREATE NONCLUSTERED COLUMNSTORE INDEX idx_cs_dummy ON dbo.T1(col1)
  WHERE col1 = -1 AND col1 = -2;

SQL Server will detect the contradiction between the expressions col1 = -1 and col1 = -2, and use a Constant Scan operator in the plan for the creation of the index, avoiding any actual data access. So the index costs you nothing to create, and has no effect on write performance. Furthermore, there’s no need to change your existing queries. When relevant and optimal, SQL Server will convert individual rows to batches and use batch mode operators in your query plans, and your queries will just start running faster, as I will demonstrate shortly!

I mentioned this backdoor to fellow Data Platform MVP Niko Neugebauer, who’s known for his columnstore expertise, and he said that he has been using another trick with a similar effect. He uses a left outer join with a false condition with an empty dummy table that has a columnstore index. For example, suppose that you had the following query against a table T1 which currently uses rowstore representation of the data:

SELECT col1, col2, col3,
  SUM(col3) OVER(PARTITION BY col1) AS col1total
FROM dbo.T1;

You create the following empty dummy table, Dummy, with a clustered columnstore index:

CREATE TABLE dbo.Dummy(dummy INT NOT NULL, INDEX idx_cs CLUSTERED COLUMNSTORE);

And then change your query as follows:

SELECT col1, col2, col3,
  SUM(col3) OVER(PARTITION BY col1) AS col1total
FROM dbo.T1
  LEFT OUTER JOIN dbo.Dummy ON 1 = 2;

The contradiction detection tells SQL Server it doesn’t need to actually access the table Dummy, but the presence of a table with a columnstore index enables the use of batch processing. The main problem with this trick is that it requires a query change.

What I like about the filtered-index trick is that the index costs nothing to create, doesn’t impact write performance, and doesn’t require any code changes to your queries.

Sample data

For sample data I’ll use a table called Transactions, which holds 10,000,000 rows with bank account transactions. Use the following code to create and populate the Transactions table:

-- testwindow database
SET NOCOUNT ON;
IF DB_ID(N'testwindow') IS NULL CREATE DATABASE testwindow;
GO
USE testwindow;
GO

-- GetNums helper function
DROP FUNCTION IF EXISTS dbo.GetNums;
GO

CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
    L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
             FROM L5)
  SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
  FROM Nums
  ORDER BY rownum;
GO

-- Transactions table with 10M rows (200 accounts x 50000 transactions per account)
-- Traditional rowstore B-tree index
DROP TABLE IF EXISTS dbo.Transactions;

CREATE TABLE dbo.Transactions
(
  actid  INT   NOT NULL,
  tranid INT   NOT NULL,
  val    MONEY NOT NULL,
  CONSTRAINT PK_Transactions PRIMARY KEY(actid, tranid)
);
GO

DECLARE
  @num_partitions     AS INT = 200,
  @rows_per_partition AS INT = 50000;

INSERT INTO dbo.Transactions WITH (TABLOCK) (actid, tranid, val)
  SELECT NP.n, RPP.n,
    (ABS(CHECKSUM(NEWID()) % 2) * 2 - 1) * (1 + ABS(CHECKSUM(NEWID()) % 5))
  FROM dbo.GetNums(1, @num_partitions) AS NP
    CROSS JOIN dbo.GetNums(1, @rows_per_partition) AS RPP;
GO

The actid column holds the account ID, the tranid column holds the transaction ID, and the val column holds the transaction amount (positive for deposits and negative for withdrawals). This table uses a traditional B-tree rowstore representation of the data, using a clustered index with the key list (actid, tranid). So whenever I want to demonstrate queries against rowstore with row mode processing, I’ll use this table.

To demonstrate queries against columnstore representation of the data, with full batch mode support, I’ll use a table called TransactionsCS (CS for columnstore) with a clustered columnstore index. Use the following code to create the table TransactionsCS and populate it with the data from the table Transactions:

-- TransactionsCS
-- Clustered columnstore index
DROP TABLE IF EXISTS dbo.TransactionsCS;
SELECT * INTO dbo.TransactionsCS FROM dbo.Transactions;
CREATE CLUSTERED COLUMNSTORE INDEX idx_cs ON dbo.TransactionsCS;

To demonstrate queries against rowstore, with batch mode operators enabled thanks to the trick with the dummy empty nonclustered columnstore index, I’ll use a third table called TransactionsDCS (DCS for dummy columnstore). Use the following code to create and populate this table:

-- TransactionsDCS
-- Traditional rowstore B-tree index
--   + dummy empty filtered nonclustered columnstore index
--     to enable using batch mode operators
DROP TABLE IF EXISTS dbo.TransactionsDCS;
SELECT * INTO dbo.TransactionsDCS FROM dbo.Transactions;
ALTER TABLE dbo.TransactionsDCS
  ADD CONSTRAINT PK_TransactionsDCS PRIMARY KEY(actid, tranid);
CREATE NONCLUSTERED COLUMNSTORE INDEX idx_cs_dummy ON dbo.TransactionsDCS(actid)
  WHERE actid = -1 AND actid = -2;

This table has the same B-tree clustered index as in the table Transactions, plus the dummy columnstore index.

As of this writing, when you query the table Transactions, SQL Server 2016 will consider only row mode processing since there’s no columnstore index present on the table. When you query TransactionsCS or TransactionsDCS, SQL Server will consider using batch mode processing. If, when querying TransactionsCS or TransactionsDCS, for troubleshooting purposes you want SQL Server not to consider batch processing, you can use query trace flag 9453 by adding the following at the end of the query: OPTION (QUERYTRACEON 9453).

Frameless aggregate window functions

Frameless aggregate window functions were introduced in SQL Server 2005. They allow you to compute similar aggregates to the ones you compute in grouped queries, but without hiding the detail. This means that you can apply calculations that mix detail and aggregates, such as the percent of the current row value out of the group total.

I’ll use the following example (call it Query 1) to test the performance of a frameless aggregate window function against the table Transactions (row mode processing over rowstore):

SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid) AS acttotal
FROM dbo.Transactions;

Note that in all of my performance tests, I enable the SSMS query option Discard results after execution, so as not to include the time it takes to print 10,000,000 output rows. To enable this option, right-click an empty area in your query window, choose Query Options in the context menu, and then check this option under Results, Grid, as the following screenshot shows:

Here you return both the detail about each transaction (the account ID, transaction ID, current transaction value) and the account total (which represents the current account balance). As mentioned, typically you would have more elaborate calculations that mix detail and aggregates, such as the ratio between the current transaction value and the average for the account and transaction type: 100.0 * val / AVG(val) OVER(PARTITION BY actid, SIGN(val)) - 100.00. I’m assuming that you’re already familiar with use cases and the computations required to achieve them. Since my focus is performance improvements in SQL Server 2016 I’ll be using simplistic, sometimes contrived, forms of window functions without embedding them in more elaborate calculations.
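As an aside, the windowed ratio mentioned above can be mimicked over a tiny in-memory dataset. The following is a conceptual Python sketch (not T-SQL; the sample rows are made up for illustration) of the two logical steps: compute the equivalent of AVG(val) OVER(PARTITION BY actid, SIGN(val)), then mix the detail value with that aggregate:

```python
from collections import defaultdict

# Hypothetical sample rows of (actid, val); the sign of val separates
# deposits from withdrawals.
rows = [(1, 10.0), (1, 30.0), (1, -5.0), (1, -15.0), (2, 20.0)]

def sign(v):
    return (v > 0) - (v < 0)

# Pass 1: per-partition average, partitioned by (actid, SIGN(val)),
# mirroring AVG(val) OVER(PARTITION BY actid, SIGN(val)).
sums, counts = defaultdict(float), defaultdict(int)
for actid, val in rows:
    sums[(actid, sign(val))] += val
    counts[(actid, sign(val))] += 1
avgs = {k: sums[k] / counts[k] for k in sums}

# Pass 2: mix the detail (val) with the aggregate, mirroring
# 100.0 * val / AVG(val) OVER(...) - 100.0.
ratios = [100.0 * val / avgs[(actid, sign(val))] - 100.0
          for actid, val in rows]
print(ratios)  # [-50.0, 50.0, -50.0, 50.0, 0.0]
```

The point of the sketch is the mixing of detail and aggregate in one expression per row, which is exactly what a grouped query cannot do without a self-join.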

The execution plan for Query 1 is shown in Figure 2.

Figure 2: Plan for Query 1 (row mode over rowstore)

The top branch of the plan performs an ordered scan of the clustered index, segments the data by the actid value (the partitioning element), and writes the rows into an on-disk spool (a temporary table). For each segment, the inner part of the top loop activates the middle and bottom branches of the plan. The middle branch of the plan reads the spool to compute the account total (the aggregate). The bottom branch of the plan also reads the spool to get the detailed level of the data. The middle Nested Loops operator joins the aggregate and detailed levels of the data.

The performance statistics I got for this query on my system are: duration: 47 sec, CPU: 57 sec, logical reads: 20M, writes: 40K. If there’s any doubt, this is an inefficient plan; 47 seconds is pretty long! The main inefficiency has to do with the writing and reading of the data into and from the on-disk spool. Notice there were 20M reads and 40K writes, mostly related to the spool. The two positive things about this plan are that it can rely on index order (no need for an explicit sort operation) and that it is parallel. But those are outweighed by the plan’s inefficiencies. Naturally, if the query were more complex, involving joins and other elements, and you couldn’t create a supporting index, there would have been the extra cost of sorting using row mode processing.

To test the performance of the same query with batch mode processing over columnstore, simply change the target table to TransactionsCS, like so (call it Query 2):

SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid) AS acttotal
FROM dbo.TransactionsCS;

The plan for this query is shown in Figure 3.

Figure 3: Plan for Query 2 (batch mode over columnstore)

Before we get into the details of this plan, you have to admire its simplicity compared to the previous one. Here you see for the first time the new batch mode Window Aggregate operator in action. The plan is a parallel one and it uses batch mode processing on top of columnstore data. It scans the data from the columnstore index, hence the small number of reads involved (only 6K). However, since columnstore data isn’t ordered, a Sort operator has to be used to order the data based on the partitioning column (actid). To soften the blow, SQL Server 2016 uses the new batch mode Sort operator.

Most of the magic happens in the new batch mode Window Aggregate operator. It replaces all of the older logic that used the Repartition Streams, Segment, Spool, aggregate and join operators. As described by the members of the SQL dev team, this operator runs in parallel (though it can also run serially, as I’ll demonstrate shortly), and does ordered computation in a “streaming fashion” as a multi-pass process (to maintain high parallelism). This technique is known from ordered processing on GPUs (such as video stream processing). One of the critical parts is removing the need for the old spool by maintaining the “window rows” in memory and doing superfast processing over that window with a dedicated code path for each function. To paraphrase The Matrix’s spoon boy: there is no spool! This operator demonstrates significantly improved scaling of parallelism with DOP n versus DOP 1 compared to the older row mode computations.
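To illustrate the no-spool idea in the simplest possible terms (this is a rough single-threaded Python sketch of the concept, not the actual multi-pass parallel operator), with the input ordered by the partitioning column, only the current partition’s rows need to be buffered in memory before the aggregate can be attached to each detail row:

```python
# Conceptual sketch: compute SUM(val) OVER(PARTITION BY actid) over input
# ordered by actid, holding only the current partition's rows in memory --
# no spool of the whole input.
def window_sum_ordered(rows):
    """rows: (actid, tranid, val) tuples, ordered by actid."""
    out, buf = [], []
    for row in rows:
        if buf and row[0] != buf[-1][0]:
            # Partition boundary: aggregate and emit the buffered rows.
            total = sum(r[2] for r in buf)
            out.extend(r + (total,) for r in buf)  # detail + aggregate
            buf = []
        buf.append(row)
    if buf:  # flush the final partition
        total = sum(r[2] for r in buf)
        out.extend(r + (total,) for r in buf)
    return out

rows = [(1, 1, 5.0), (1, 2, -2.0), (2, 1, 7.0)]
print(window_sum_ordered(rows))
# [(1, 1, 5.0, 3.0), (1, 2, -2.0, 3.0), (2, 1, 7.0, 7.0)]
```

The memory footprint is bounded by the largest partition rather than the whole input, which is why eliminating the spool pays off so dramatically compared to the plan in Figure 2.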

The plan then uses a Compute Scalar operator to return a NULL back from the aggregate function when the row count is 0, and finally a Gather Streams operator to gather the streams from the multiple threads into one stream of rows back to the caller.

The performance statistics for this query are: duration: 7 sec, CPU: 13 sec, logical reads: 6K, writes: 0. The query run time dropped from 47 seconds to only 7! CPU utilization dropped significantly. The number of reads dropped from 20M to only 6K! The number of writes dropped to 0 since the plan didn’t use an on-disk spool.

The most expensive part of this plan is the sort operation. As mentioned, since columnstore doesn’t keep the data ordered, a Sort operator has to be used to order it by the partitioning element (actid) for the Window Aggregate operator.

If you’re wondering whether it is possible to combine the benefits of a B-tree rowstore index to avoid the need for explicit sorting when the query is simple enough to allow it, and the benefits of batch processing, the answer is yes. But as a reminder, at least for now, SQL Server 2016 requires that at least one columnstore index be present on the queried tables. Recall that the sample data for this article includes a table called TransactionsDCS, which is similar to Transactions (same data, same B-tree rowstore clustered index), but also has a dummy empty filtered nonclustered columnstore index to allow the optimizer to consider batch mode operators. To test the performance of the same query with batch mode processing over rowstore, change the target table to TransactionsDCS, like so (call it Query 3):

SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid) AS acttotal
FROM dbo.TransactionsDCS;

The plan for this query is shown in Figure 4.

Figure 4: Plan for Query 3 (batch mode over rowstore)

If you thought the previous plan was simple, what will you think about this one?! The plan pulls the data using a row mode index order scan, converts it into batches, and then uses the batch mode Window Aggregate and Compute Scalar operators to do the rest of the work. There’s no need for explicit sorting.

The performance statistics for this query are: duration: 7 sec, CPU: 7 sec, logical reads: 31k, writes: 0.

Curiously, despite the lack of explicit sorting, this query takes the same time to complete as Query 2. Obviously, scanning a B-tree index is more expensive than scanning a columnstore index, as evident in the number of reads (31K versus 6K), but in both cases the scan takes a small portion of the query run time. The main inefficiency in this plan is the lack of parallelism. Currently, SQL Server isn’t able to pull the data ordered from a B-tree index with a parallel scan and distribute it gradually and dynamically over worker threads in the way that would be optimal for the Window Aggregate operator, without a mediator such as a batch mode Sort operator. Doing it without a mediator would require more work on Microsoft’s part in the SQL Server engine, and hopefully we will see such support in the future. For now, the optimizer has to choose between a serial plan without a Sort operator and a parallel plan with one. As you can see, it chose the former, deeming it more optimal. To see an example of the latter, for troubleshooting purposes, you can force a parallel plan with query trace flag 8649, like so (call it Query 4):

 

SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid) AS acttotal
FROM dbo.TransactionsDCS
OPTION (QUERYTRACEON 8649);

The plan for this query is shown in Figure 5.

Figure 5: Plan for Query 4 (batch mode over rowstore with forced parallelism)

As you can see, the plan uses a parallel scan of the B-tree index with an Ordered: False property (doesn’t care about index order) and then uses a batch mode Sort operator.

Figure 6 summarizes the run times of the different queries for frameless aggregate window functions.

Figure 6: Run times for frameless aggregate window functions

What’s next?

The new batch mode Window Aggregate operator in SQL Server 2016 significantly improves the performance of most window functions. Currently, batch mode operators are considered only when at least one columnstore index is present on one of the queried tables, but I showed how you can create a dummy empty filtered columnstore index as a backdoor. Still, for now you need Enterprise edition of SQL Server to allow creating columnstore indexes. Hopefully in the future, Microsoft will remove the requirement to have a columnstore index present to enable using batch mode operators. Then we will be able to benefit from the performance improvements of the batch execution engine without the need to create a dummy columnstore index, and perhaps it will also be available in Standard edition.

In this month’s article I focused on frameless aggregate window functions. In the upcoming articles I’ll cover ranking and statistical window functions, aggregate window functions with a frame, and offset window functions.
