Take Advantage of SQL Server 2012's Window Functionality

Simply put, a window function is one that performs calculations (e.g., aggregations, rankings) for a defined set of rows, known as a window. For some time, other database management systems (DBMSs) such as Oracle, Sybase, and DB2 have had full support for window functions. Until recently, SQL Server only partially supported their implementation. The support for window functions has greatly improved in SQL Server 2012, with the introduction of new window order and frame clauses, support for distribution functions, and support for offset functions. I won't go into the details of what has changed in SQL Server 2012 window functionality, because that information has already been covered in SQL Server Pro. (See the Learning Path for articles that discuss these changes.) Instead, I'll give you examples of how to use:

Window ranking functions
Window ranking functions with the PARTITION BY clause
Window ranking functions with the ROLLUP, CUBE, and GROUPING SETS operators
Window distribution functions
Window aggregate functions
Window offset functions

Along the way, I'll point out what's still needed, in the hope that Microsoft will provide the missing functionality in a future version of SQL Server.

If you'd like to follow along with the examples, you can run the Using_Window_Functions_in_SQL_Server_2012.sql script by clicking the Download button. This script creates and populates a sample table named total_sales. It contains five columns:

The year column, which stores the years the sales took place
The month column, which stores the months the sales took place, using the values 1 (January) through 12 (December)
The prd_type_id column, which stores the IDs representing the types of products
The emp_id column, which stores the IDs of the employees who handled the sales
The amount column, which stores the dollar amounts of the sales

The script also includes the queries presented here.

Using Window Ranking Functions

Two useful window ranking functions are RANK and DENSE_RANK, which have been available since SQL Server 2005. Both functions rank items in a group. The difference between them is that RANK leaves a gap in the sequence when there's a tie, whereas DENSE_RANK doesn't leave any gaps.

For example, suppose you need to get the ranking of sales by product type for the year 2012. The following query uses both the RANK and DENSE_RANK functions:

SELECT prd_type_id, SUM(amount),
RANK() OVER (ORDER BY SUM(amount) DESC) AS rank,
DENSE_RANK() OVER (ORDER BY SUM(amount) DESC) AS dense_rank
FROM total_sales WHERE year = 2012 AND amount IS NOT NULL
GROUP BY prd_type_id
ORDER BY prd_type_id;

Notice the use of the OVER clause when calling the RANK and DENSE_RANK functions. The OVER clause defines the window for the function.

Figure 1 shows the results for this query. There are no ties, so RANK and DENSE_RANK return the same ranks. As you can see, there aren't any results for prd_type_id 5. That's because the table's amount column contains nulls for those rows in which the prd_type_id value is 5. The query omits these rows because of the AND amount IS NOT NULL code in the WHERE clause. The following query leaves out that code so that these rows are included:

Figure 1: Results from using the RANK and DENSE_RANK functions with the AND amount IS NOT NULL code

SELECT prd_type_id, SUM(amount) AS SUM_Amount,
RANK() OVER (ORDER BY SUM(amount) ASC) AS rank,
DENSE_RANK() OVER (ORDER BY SUM(amount) ASC) AS dense_rank
FROM total_sales WHERE year = 2012
GROUP BY prd_type_id
ORDER BY prd_type_id;

The output is shown in Figure 2. Notice that the last row contains a null value for SUM_Amount, which is the sum of the values in the amount column. Also notice that both the RANK and DENSE_RANK functions return 1 for this row. By default, RANK and DENSE_RANK assign the lowest rank of 5 to null values in descending rankings (i.e., DESC is used in the OVER clause) and the highest rank of 1 to null values in ascending rankings (i.e., ASC is used in the OVER clause).

Figure 2: Results from using the RANK and DENSE_RANK functions without the AND amount IS NOT NULL co

The ability to control the ranking of null values is still missing in SQL Server's support for window functions. If you could use a "NULLS FIRST" or "NULLS LAST" clause when using a ranking function, you could explicitly control whether nulls are the highest or lowest ranking in a group. For example, the following hypothetical query uses "NULLS LAST" to specify that nulls are the lowest ranking:

SELECT prd_type_id, SUM(amount),
RANK() OVER (ORDER BY SUM(amount) DESC NULLS LAST) AS rank,
DENSE_RANK() OVER (ORDER BY SUM(amount) DESC NULLS LAST)
AS dense_rank
FROM total_sales WHERE year = 2012
GROUP BY prd_type_id
ORDER BY prd_type_id

Using the PARTITION BY Clause

When working with window ranking functions (as well as other types of window functions), you can use the PARTITION BY clause in the OVER clause to divide a group into subgroups. For example, the following query uses the PARTITION BY clause to subdivide the sales amounts by month:

SELECT prd_type_id, month, SUM(amount) AS SUM_Amount,
RANK() OVER (PARTITION BY month ORDER BY SUM(amount) DESC)
AS rank
FROM total_sales WHERE year = 2012 AND amount IS NOT NULL
GROUP BY prd_type_id, month
ORDER BY prd_type_id, month;

Figure 3 shows an excerpt of the results from this query.

Figure 3: Excerpt of the results from using the RANK function with the PARTITION BY clause

Using the ROLLUP, CUBE, and GROUPING SETS Operators

You can use the ROLLUP, CUBE, and GROUPING SETS operators with the window ranking functions. These operators are extensions of the GROUP BY clause.

With the ROLLUP operator, the GROUP BY clause returns a row containing a subtotal for each group of rows, plus a row containing a grand total for all the groups. For example, the following query uses the ROLLUP operator with the RANK function to get the sales rankings by the product type ID:

SELECT prd_type_id, SUM(amount) SUM_Amount,
RANK() OVER (ORDER BY SUM(amount) DESC) AS rank
FROM total_sales WHERE year = 2012
GROUP BY ROLLUP(prd_type_id)
ORDER BY prd_type_id;

As Figure 4 shows, the query returns a subtotal for each product type ID as well as a grand total for them. In the SUM_Amount column in this figure, rows 2 through 6 are the subtotals and row 1 is the grand total.

Figure 4: Results from using the RANK function with the ROLLUP operator

With the CUBE operator, the GROUP BY clause returns rows containing a subtotal for all combinations of columns, plus a row containing the grand total. For example, the following query uses the CUBE operator with the RANK function to get all rankings of sales by product type ID and employee ID:

SELECT prd_type_id, emp_id, SUM(amount)as SUM_Amount,
RANK() OVER (ORDER BY SUM(amount) DESC) AS rank
FROM total_sales WHERE year = 2012
GROUP BY CUBE(prd_type_id, emp_id)
ORDER BY prd_type_id, emp_id;

Figure 5 shows an excerpt of the results from this query.

Figure 5: Excerpt of the results from using the RANK function with the CUBE operator

With the GROUPING SETS operator, the GROUP BY clause returns just the subtotal rows. For example, the following query uses the GROUPING SETS operator with the RANK function to get just the sales amount subtotals for the prd_type_id and emp_id columns:

SELECT prd_type_id, emp_id, SUM(amount) SUM_Amount,
RANK() OVER (ORDER BY SUM(amount) DESC) AS rank
FROM total_sales WHERE year = 2012
GROUP BY GROUPING SETS(prd_type_id, emp_id)
ORDER BY prd_type_id, emp_id;

Figure 6 shows the results.

Figure 6: Results from using the RANK function with the GROUPING SETS operator

Using Window Distribution Functions

As mentioned previously, SQL Server 2012 introduces support for window distribution functions, which include the CUME_DIST and PERCENT_RANK functions. Short for cumulative distribution, CUME_DIST calculates the position of a specified value relative to a group of values. PERCENT_RANK calculates the percent rank of a value relative to a group of values. For example, the following query uses the CUME_DIST and PERCENT_RANK functions to get the cumulative distribution and percent rank of sales:

SELECT prd_type_id, SUM(amount) SUM_Amount,
CUME_DIST() OVER (ORDER BY SUM(amount) ASC) AS cume_dist,
PERCENT_RANK() OVER (ORDER BY SUM(amount) ASC)
AS percent_rank
FROM total_sales WHERE year = 2012
GROUP BY prd_type_id
ORDER BY prd_type_id

Figure 7 shows the results. Let's take a closer look at the cumulative distribution results. The CUME_DIST function returns a range of values that are greater than 0 and less than or equal to 1. In this case, there are five discrete variables, so each discrete variable has probability of 0.2. The probability that the discrete variable is less than 402751.16 (the total sales amount for prd_type_id 4) is the cumulative distribution of all the discrete variables. (Their total sales amounts are less than or equal to the total sales amount of prd_type_id 4.) In this case, the discrete variables are null for prd_type_id 5, 186381.22 for prd_type_id 2, and 402751.16 for prd_type_id 4. Thus, the cumulative distribution for prd_type_id 4 is 0.6 (3 ´ 0.2). PERCENT_RANK works in similar way.

Figure 7: Results from using the CUME_DIST and PERCENT_RANK functions

Still missing in SQL Server 2012 are "what if" distribution and rank functions to calculate the rank and percentile that a new row would have if you inserted it into a table. For example, the following hypothetical query uses a "WITHIN GROUP" clause to calculate the "what if" rank and percent rank of a sales amount of $500,000:

SELECT
RANK(500000) WITHIN GROUP (ORDER BY SUM(amount) DESC)
AS rank,
PERCENT_RANK(500000) WITHIN GROUP
(ORDER BY SUM(amount) DESC) AS percent_rank
FROM all_sales WHERE year = 2003 AND amount IS NOT NULL
GROUP BY prd_type_id
ORDER BY prd_type_id;

Using Window Aggregate Functions

Window aggregate functions perform calculations on a specified range of rows, range of values, or interval of time. The window describes the subset that's being processed by the aggregate functions, which return a single value. You define the start and end of the window.

You can use a window with the following aggregate functions: SUM, AVG, MAX, MIN, COUNT, VAR, and STDEV. For example, the following query demonstrates how to use the SUM function in a window to calculate the cumulative sales amount from January to December 2012:

SELECT month, SUM(amount) AS month_amount,
SUM(SUM(amount)) OVER
(ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS cumulative_amount
FROM total_sales WHERE year = 2012
GROUP BY month
ORDER BY month;

This query adds each month's sales amount to the cumulative total. Let's take a closer look at the code in red, which creates the window and calculates the cumulative total. In the SUM(SUM(amount)) expression, the inner SUM function totals each month's sales amount and the outer SUM function computes the cumulative total. The ORDER BY month clause orders the rows read by the query by month. TheROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW clause defines the start and end of the window. The start is set to UNBOUNDED PROCEEDING, which means the start of the window is fixed at the first row in the result set returned by the query. The end of the window is set to CURRENT ROW, which represents the current row in the result set being processed. The end of the window slides down one row in the result set as each row is processed.

As a result, the window initially contains only the first row. After the sales amount for January is added to the cumulative total, the end of the window moves down one row to the second row. The window now contains two rows. After the sales amount for February is added to the cumulative total, the end of the window moves down one row to the third row. The window now contains three rows. This continues until the row for December is processed and the window contains 12 rows. Figure 8 shows the results.

Figure 8: Results from the query calculating the cumulative total

If you just want the cumulative total for the last two quarters of the year (i.e., June through December), you can use a query such as:

SELECT month, SUM(amount) AS month_amount,
SUM(SUM(amount)) OVER
(ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS cumulative_amount
FROM total_sales WHERE year = 2012
AND month BETWEEN 6 AND 12
GROUP BY month
ORDER BY month;

If you want to calculate the moving average of the sales amount between the current month and the previous three months, you'd use a query such as

SELECT month, SUM(amount) AS month_amount,
AVG(SUM(amount)) OVER
(ORDER BY month ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
AS moving_average
FROM total_sales WHERE year = 2012
GROUP BY month
ORDER BY month;

Let's take a closer look at the code in red, which creates the window and computes the moving average. In the AVG(SUM(amount)) expression, the SUM function totals each month's sales amount and the outer AVG function computes the average. The ORDER BY month clause orders the rows read by the query by month.

The ROWS BETWEEN 3 PROCEEDING AND CURRENT ROW clause defines the start of the window as the three rows preceding the current row and the end of the window as the current row being processed. Both the start and the end of the window begin with the first row read by the query. The end of the window moves down after each row is processed. Initially, the start of the window moves down only after the fourth row has been processed; thereafter, it moves down one row after each row is processed, until the last row in the result is processed.

When there are fewer than three months, the moving average is based on only the months available. For example, if you look at the query's output in Figure 9, you'll notice that the moving average for February (month 2) is calculated from two months:

Figure 9: Results from the query calculating the moving average

(116671.60 + 95525.55)/2 = 106098.575000

For November (month 11), the moving average is calculated from four months:

(160221.98 + 264480.79 + 199609.68 + 212735.68)/4 = 209262.032500

You can use the window aggregate functions to perform calculations across groups and partitions within groups. Take, for example, the following query, which uses the PARTITION BY clause with the SUM function:

SELECT month, prd_type_id,
SUM(SUM(amount)) OVER (PARTITION BY month)
AS total_month_amount,
SUM(SUM(amount)) OVER (PARTITION BY prd_type_id)
AS total_product_type_amount
FROM total_sales WHERE year = 2012 AND month <= 3
GROUP BY month, prd_type_id
ORDER BY month, prd_type_id;

For the first three months of 2012, it returns the total sum of all sales for all three months (total_month_amount) and the total sum of all sales for all product types (total_product_type_amount). Figure 10 shows the results.

Figure 10: Results from using the SUM function with the PARTITION BY clause

Using Window Offset Functions

Window offset functions let you access values located in rows other than the current row. They include the FIRST_VALUE, LAST_VALUE, LAG, and LEAD functions.

For example, the following query uses the FIRST_VALUE and LAST_VALUE functions to get the sales amount for the previous month relative to the current month and the sales amount for the next month relative to the current month:

SELECT month, SUM(amount) AS month_amount,
FIRST_VALUE(SUM(amount)) OVER
(ORDER BY month ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
AS previous_month_amount,
LAST_VALUE(SUM(amount)) OVER
(ORDER BY month ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
AS next_month_amount
FROM total_sales WHERE year = 2012
GROUP BY month
ORDER BY month;

The FIRST_VALUE function gets the first row in the window defined by the ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING clause. In this clause, 1 PRECEDING refers to the row preceding the current row being processed and 1 FOLLOWING refers to the row after the current row.

The start of the window begins with the first row read by the query. The end of the window begins with the second row and moves down after each row is processed. The start of the window moves down only after the second row has been processed. Processing continues until the last row read by the query is processed.

The LAST_VALUE function works the same way, except it gets the last row in the defined window. Figure 11 shows the results.

Figure 11: Results from using the FIRST_VALUE and LAST_VALUE functions

You can use the LAG and LEAD functions to get a value in a row where that row is a certain number of rows away from the current row. For example, the following query uses LAG and LEAD to get the sales amount for the previous month relative to the current month and the sales amount for the next month relative to the current month:

SELECT month, SUM(amount) AS month_amount,
LAG(SUM(amount), 1) OVER (ORDER BY month)
AS previous_month_amount,
LEAD(SUM(amount), 1) OVER (ORDER BY month)
AS next_month_amount
FROM total_sales WHERE year = 2012
GROUP BY month
ORDER BY month;

In the LAG(SUM(amount), 1) expression, the value of 1 indicates to go back one row from the current row, which means getting the sales amount for the previous month. In the LEAD(SUM(amount), 1) expression, the value of 1 indicates to go forward one row from the current row, which means getting the sales amount for the next month. The results of this query are basically the same as the results of the previous query.

Daunting But Useful

The window functionality in SQL Server 2012 can seem daunting at first, especially if you're not familiar with it. I hope that these examples have given you some insights into how to take advantage of this useful functionality in your queries.

Learning Path

To learn more about SQL Server 2012's window functions:

"How to Use Microsoft SQL Server 2012's Window Functions, Part 1"

"Microsoft SQL Server 2012: How To Write T-SQL Window Functions, Part 2"

"SQL Server 2012: How to Write T-SQL Window Functions, Part 3"

Comments

Plain text