SQL Server 2012: How to Write T-SQL Window Functions, Part 3

Part 1: How to Use Microsoft SQL Server 2012's Window Functions
Part 2: How to Write T-SQL Windows Functions

This article is the last in a three-part series about SQL Server 2012's enhanced window functions. In the first two articles in the series, I focused on the logical aspects of window functions. I described window aggregate, offset, and distribution functions. In this final article, I focus on optimization of window functions. I provide indexing guidelines for optimal performance, and I describe cases that fall under the "fast-track" category, with especially optimal treatment; cases that compute the difference between two cumulative values; cases that expand all frame rows; and cases in which the optimizer can use an in-memory spool versus an on-disk spool.

As a reminder, you need to use SQL Server 2012 (CTP3 or later) to run the sample code from this series. Download SQL Server 2012.

For the purposes of this article, I use bank account transactions as sample data. To create the sample data, you need a helper table called GetNums that accepts two integers as inputs and returns a sequence of integers in the requested range. Run the code in Listing 1 to create the GetNums function in the tempdb database (for test purposes).

USE tempdb;
IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;
GO
CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
    L0   AS (SELECT c FROM (VALUES(1),(1)) AS D(c)),
    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
        FROM L5)
  SELECT @low + rownum - 1 AS n
  FROM Nums
  ORDER BY rownum
  OFFSET 0 ROWS FETCH FIRST @high - @low + 1 ROWS ONLY;
GO

Run the code in Listing 2 to create the Accounts and Transactions tables and fill them with sample data (100 accounts x 20,000 transactions in each account).

-- create tables
SET NOCOUNT ON;
USE tempdb;
IF OBJECT_ID('dbo.Transactions', 'U') IS NOT NULL DROP TABLE dbo.Transactions;
IF OBJECT_ID('dbo.Accounts', 'U') IS NOT NULL DROP TABLE dbo.Accounts;
GO
CREATE TABLE dbo.Accounts
(
  actid   INT     NOT NULL,
  actname VARCHAR(50) NOT NULL,
  CONSTRAINT PK_Accounts PRIMARY KEY(actid)
);

CREATE TABLE dbo.Transactions
(
  actid  INT   NOT NULL,
  tranid INT   NOT NULL,
  val    MONEY NOT NULL,
  CONSTRAINT PK_Transactions PRIMARY KEY(actid, tranid),
  CONSTRAINT FK_Transactions_Accounts
    FOREIGN KEY(actid)
    REFERENCES dbo.Accounts(actid)
);
GO

-- populate tables
DECLARE
  @num_partitions     AS INT = 100,
  @rows_per_partition AS INT = 20000;

TRUNCATE TABLE dbo.Transactions;
DELETE FROM dbo.Accounts;

INSERT INTO dbo.Accounts WITH (TABLOCK) (actid, actname)
  SELECT n AS actid, 'account ' + CAST(n AS VARCHAR(10)) AS actname
  FROM dbo.GetNums(1, @num_partitions) AS P;

INSERT INTO dbo.Transactions WITH (TABLOCK) (actid, tranid, val)
  SELECT NP.n, RPP.n,
    (ABS(CHECKSUM(NEWID())%2)*2-1) * (1 + ABS(CHECKSUM(NEWID())%5))
  FROM dbo.GetNums(1, @num_partitions) AS NP
    CROSS JOIN dbo.GetNums(1, @rows_per_partition) AS RPP;

Let's start our discussion of window function optimization with indexing guidelines.

Indexing Guidelines

To compute the enhanced aggregate window functions, the optimizer needs the rows sorted first by the window partitioning elements, then by the window ordering elements. Also, the index needs to include in its leaf level all the rest of the columns in the query for coverage purposes. If the rows are ordered as needed and covered by an index, the optimizer can scan the index in order, use a Segment operator to handle one partition at a time, a Window Spool operator to expand the applicable frame rows per source row, and a Stream Aggregate operator to perform the actual aggregate calculations.

A simple way to remember what kind of index you need to create is to use the acronym POC for Partitioning, Ordering, and Coverage. The P and O columns need to be defined as the index key elements. The C columns need to be included in the index leaf, using an INCLUDE clause in a nonclustered index. (If the index is clustered, everything is implicitly covered.) In the absence of a POC index, the optimizer must sort the rows based on the P and O elements -- and sorting a large set can be expensive.

As an example, supposed that you need to compute bank account balances at each transaction point. This task can be achieved with a running sum of the transaction values, partitioned by the account ID and ordered by the transaction ID. A POC index already exists; that's the clustered index defined by the primary key constraint PK_Transactions in Listing 2 as part of the definition of the Transactions table.

Here's the query computing the bank account balances:

SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid
        ORDER BY tranid
        ROWS UNBOUNDED PRECEDING) AS runval
FROM dbo.Transactions;

Figure 1 shows the plan for this query. Observe in the plan that the clustered index PK_Transactions is scanned in an ordered fashion, and therefore no sorting is required.

To contrast, the following query computes a window function that doesn't have a POC index to support it:

SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid
        ORDER BY val
        ROWS UNBOUNDED PRECEDING) AS runval
FROM dbo.Transactions;

Figure 2 shows the plan for this query. Observe that in this plan the optimizer added a Sort operator that sorts the data by the P (actid) and O (val) elements.

Fast Track

Conceptually, for each row in the underlying query, a window function defines a frame of rows within the current window partition and applies the calculation against this frame to produce a scalar outcome. For example, consider a frame such as ROWS BETWEEN 11 PRECEDING AND CURRENT ROW. For each row, the window frame will have up to 12 applicable rows. SQL Server 2012 introduces a Window Spool operator that's responsible for expanding for each row the applicable frame of rows; then, a Stream Aggregate operator aggregates those rows to produce the scalar result. In the worst-case scenario, the Window Spool operator expands all frame rows. I discuss such cases later in the article. However, the optimizer can identify special cases and avoid such complete expansion.

The best-case scenario is what I refer to as the fast-track case. This case is applicable when the top (i.e., first) delimiter of the frame is UNBOUNDED PRECEDING. With such a top delimiter, conceptually, the number of rows per frame should keep increasing by one for each underlying row; so the total number of rows in all frames in the partition increases in a quadratic manner with respect to the number of underlying rows. But that's only conceptually. In practice, the optimizer keeps just two rows in each frame -- one with the aggregated values so far (apply the calculation between the previously aggregated values and the current value) and the current row (needed to return the detail elements). That's a huge savings!

Here's an example of a query that falls under the fast-track category because it uses UNBOUNDED PRECEDING as the top delimiter of the frame:

SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid
        ORDER BY tranid
        ROWS UNBOUNDED PRECEDING) AS runval
FROM dbo.Transactions;

This query ran for 7 seconds on my system. Figure 3 shows the execution plan for the query.

Observe in the query plan that the number of input rows to the Window Spool operator is 2,000,000 (the number of rows in the table), and the number of output rows is 4,000,000 (one row with aggregates, and current row for detail per each underlying row). Then the Stream Aggregate operator returns 2,000,000 rows that combine the detail and the aggregates.

Compute Two Cumulative Values

When using a delimiter other than UNBOUNDED PRECEDING, it doesn't necessarily mean that the optimizer has to expand all applicable frame rows for each underlying row. Some cases can be handled with a calculation of two cumulative values -- one for the bottom (i.e., last) row in the frame and one for the top (i.e., first) row, and then the result can be computed by subtracting the top aggregate from the bottom one. To qualify for this type of optimization, the function must be cumulative (e.g., SUM, COUNT, AVG), and the frame size needs to be greater than four rows. If the function isn't cumulative (e.g., MIN, MAX), the result can't be derived from the results of the aggregates for the points used as delimiters. If the frame has four rows or fewer, it isn't worthwhile to the optimizer to use this optimization.

The following query qualifies for this special optimization because it uses a cumulative aggregate (AVG) and a frame greater than four rows (25 rows in this case):

SELECT actid, tranid, val,
  AVG(val) OVER(PARTITION BY actid
        ORDER BY tranid
        ROWS BETWEEN 24 PRECEDING
             AND CURRENT ROW) AS avgval
FROM dbo.Transactions;

This query ran for 16 seconds on my system. Figure 4 shows its plan.

Figure 4: Plan computing two cumulative values

The plan starts by scanning the clustered index in order. It then segments the data by actid and computes row numbers to tell which rows are applicable to each frame. It then segments the data again by actid and uses the Window Spool and Stream Aggregate operators to compute the top aggregate. The plan then uses the next series of operators to compute the bottom aggregate and then finally computes the final result by subtracting the top aggregate from the bottom one.

Expanding All Frame Rows

When the frame has four rows or fewer or the function isn't cumulative (e.g., MIN, MAX), the Window Spool operator will expand all applicable frame rows for each underlying row.

Here's an example with four rows in the frame:

SELECT actid, tranid, val,
  AVG(val) OVER(PARTITION BY actid
        ORDER BY tranid
        ROWS BETWEEN 3 PRECEDING
             AND CURRENT ROW) AS avgval
FROM dbo.Transactions;

Figure 5 shows the plan for this query. Observe that the total number of rows returned by the Window Spool operator is close to 10,000,000. That's because there are four rows in almost all frames; in addition, the operator always keeps the current row for the detail elements. The first few rows in each partition naturally have fewer rows in the frame than four -- hence the number of rows returned by the Window Spool operator is slightly less than 10,000,000.

As an example for noncumulative function, the following query computes a MAX window aggregate, so even though it has 25 rows in each frame, the Window Spool operator has no choice but to expand them all:

SELECT actid, tranid, val,
  MAX(val) OVER(PARTITION BY actid
        ORDER BY tranid
        ROWS BETWEEN 24 PRECEDING
             AND CURRENT ROW) AS maxval
FROM dbo.Transactions;

Because of the large expanded frames, this query ran for 25 seconds on my system. Figure 6 shows the plan for the query. Observe in the plan that 2,000,000 rows are provided as input to the Window Spool operator, and more than 50,000,000 rows are produced as its output.

Figure 6: Plan when function isn't cumulative

In-Memory vs. On-Disk Spool

Understanding the last optimization aspect of window functions is crucial if you care about performance. The Window Spool operator typically tries to use an optimized in-memory spool. However, in certain circumstances it must use an on-disk spool that incurs the usual overhead of on-disk work tables (e.g., latches, IO).

The on-disk spool is used when the number of rows in the frame is greater than an internal threshold number (currently 10,000, but it could change in the future) or when using the RANGE window frame unit, as opposed to ROWS. Regarding the threshold number, what matters more precisely is the extreme delta between the row numbers of the current, top, and bottom rows in the frame.

You can use one of two options to determine whether you're getting the costly on-disk spool. One option is a new extended event that you can define and start by running the code in Listing 3. You can then find the information in the file xe_xe_window_spool.xel (Extended Events Log).

CREATE EVENT SESSION xe_window_spool ON SERVER
ADD EVENT sqlserver.window_spool_ondisk_warning
  ( ACTION (sqlserver.plan_handle, sqlserver.sql_text) )
ADD TARGET package0.asynchronous_file_target
  ( SET FILENAME  = N'c:\temp\xe_xe_window_spool.xel',
    metadatafile  = N'c:\temp\xe_xe_window_spool.xem' );

ALTER EVENT SESSION xe_window_spool ON SERVER STATE = START;

The other option is to simply enable STATISTICS IO in the session:

SET STATISTICS IO ON;

When the in-memory spool is used, STATISTICS IO reports zeros as the number of reads against the worktable. When the on-disk spool is used, the numbers are greater than zero.

I'll show a few cases to demonstrate in-memory versus on-disk spools. Here's the first example:

SELECT actid, tranid, val,
  MAX(val) OVER(PARTITION BY actid
        ORDER BY tranid
        ROWS BETWEEN 9999 PRECEDING
             AND 9999 PRECEDING) AS maxval
FROM dbo.Transactions;

This query uses the ROWS window frame unit. The extreme delta between the row numbers of the current, top, and bottom rows is 10,000 (10,000 - 1 + 1 is the total number of rows). So this case qualifies for the in-memory spool. This query ran for 9 seconds on my system. You won't find an extended event reported for this query. Figure 7 shows the output of STATISTICS IO.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Shift the frame from the previous query by only one row, and you get a frame with a delta of 10,001 rows (10,001 - 1 + 1):

SELECT actid, tranid, val,
  MAX(val) OVER(PARTITION BY actid
        ORDER BY tranid
        ROWS BETWEEN 10000 PRECEDING
             AND 10000 PRECEDING) AS maxval
FROM dbo.Transactions;

This time, the on-disk spool was used. The query ran for 56 seconds on my system. You'll find an extended event reported for this query in the file, and STATISTICS IO will show a number of reads greater than zero, as Figure 8 shows.

Table 'Worktable'. Scan count 2000100, logical reads 12086701, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Similarly, when using the RANGE window frame unit, an on-disk spool is used regardless of the frame size. Recall that when not indicating a window frame unit, RANGE UNBOUNDED PRECEDING is assumed by default. Hence, the following query uses an on-disk spool:

SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid
        ORDER BY tranid) AS runval
FROM dbo.Transactions;

This query ran for 66 seconds on my system. An extended event is reported in the file. Figure 9 shows the output of STATISTICS IO.

Table 'Worktable'. Scan count 2000100, logical reads 12044701, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

If you just use ROWS UNBOUNDED PRECEDING instead, the query runs for only 7 seconds because it uses an in-memory spool. So, be careful with the RANGE option. When you're done, run the following code to drop the event session:

DROP EVENT SESSION xe_window_spool ON SERVER;

Increasing Use of Window Functions

Window functions help solve a wide variety of business problems. Standard SQL recognizes this fact and provides extensive coverage of windowing functionality. SQL Server 2012 significantly enhances support for window functions by adding ordering and framing options for aggregates, as well as offset and distribution functions. It also introduces new and enhanced plan operators such as Window Spool and Stream Aggregate to efficiently optimize the functions.

In time, people will realize how profound these functions are and will use them much more often in code. Eventually, window functions will appear in most queries. For this reason, I hope that Microsoft will continue to invest in this area and will gradually implement additional functionality from standard SQL into future SQL Server versions, as well as in SQL Azure.

Comments

Plain text