Skip navigation
blue and silver computer mouse

SQL Server 2005's APPLY, Part 2

Advanced uses for this table operator

Last month I started a two-part series about SQL Server 2005’s APPLY operator. I covered the fundamentals of APPLY, explaining the three modes in which you can use it: CROSS APPLY, OUTER APPLY, and implicit APPLY. Now that I’ve covered the fundamentals, this month I discuss more advanced uses for APPLY, including efficient parallelism of queries, reuse of column aliases, aggregate over columns, unpivoting, and inlining scalar functions.

In the examples in this article, I use sample databases called InsideTSQL2008 and Performance that I originally created for my books. To create and populate those databases, go and download the source code for the book Inside Microsoft SQL Server 2008: T-SQL Querying (Microsoft Press, 2009). Use the file InsideTSQL2008 to create and populate the sample database InsideTSQL2008. Use the first part in the source code file for Chapter 4 (Query Tuning) to create and populate the Performance database.

Efficient Parallelism of Queries

The APPLY operator can be used to address a wide variety of problems—typically, problems that involve a partitioning element (e.g., an employee, a customer, a shipper). One interesting use of APPLY is to utilize parallelism efficiently in a certain class of problems. I initially learned about this technique from Adam Machanic, a SQL Server MVP.

To demonstrate the technique, let’s use the Performance sample database. Run the following code to create an index on the Orders table:

USE Performance
CREATE INDEX idx1 ON dbo.Orders(empid, orderid);

Suppose that you need to calculate for each row in the Orders table two row numbers: one partitioned by empid and ordered by orderid and another partitioned by empid and ordered by orderid DESC (or any other order that’s different from the first function’s order). You can use the following query to achieve this task:

SELECT empid, orderid,
  ROW_NUMBER() OVER(PARTITION BY empid
            ORDER BY orderid) AS rownum_asc,
  ROW_NUMBER() OVER(PARTITION BY empid
            ORDER BY orderid DESC) AS rownum_desc
FROM dbo.Orders;

Recall that earlier you created an index called idx1 on (empid, orderid). The tricky part here is that the two functions have different ordering specifications, and in such a case, SQL Server can rely on an index ordering to support only one of them. For the other function, SQL Server has to sort the rows. Figure 1 shows the execution plan for this query.

Plan for two window functions without APPLY
Figure 1: Plan for two window functions without APPLY

SQL Server uses parallelism to sort one of the functions, but it sorts by both the partitioning element (empid) and the ordering element (orderid DESC). I ran this query on a machine with eight logical CPUs (if you have fewer logical CPUs, you can mimic a similar machine for test purposes by using the startup parameter -P8), and it took it 3 seconds to finish with hot cache and results discarded.

A more efficient strategy and utilization of parallelism is to filter each partition’s rows through the index and apply a sort only to one partition’s rows at a time. To achieve this, you can use the APPLY operator, like so:

SELECT A.*
FROM dbo.Employees AS E
  CROSS APPLY
    (SELECT empid, orderid,
   ROW_NUMBER() OVER(ORDER BY orderid) AS rownum_asc,
   ROW_NUMBER() OVER(ORDER BY orderid DESC) AS rownum_desc
     FROM dbo.Orders AS O
     WHERE O.empid = E.empid) AS A;

Figure 2 shows the execution plan for this query.

Plan for two window functions with APPLY
Figure 2: Plan for two window functions with APPLY

Observe that the plan scans the employees clustered index first, and then for each employee, performs a seek in the index on Orders to filter the current employee’s rows. Then the sort is applied only to one employee’s rows. You can also see here the use of an optimization technique called few outer rows, in which the optimizer realizes that a small number of rows are returned from the outer part of the Nested Loops join and uses a Repartition Streams exchange operator to evenly distribute the rows to the different threads that then handle the work in the inner part of the join. This query finished in 1 second on the same machine—one third of the runtime of the previous query.

When you’re done, run the following code for cleanup:

DROP INDEX idx1 ON dbo.Orders;

Reuse of Column Aliases

Suppose that you have a query with lengthy expressions that you need to repeat more than once. For example, you need to write a query against the Sales.Orders table in the InsideTSQL2008 database, and in this query filter only orders that were placed prior to the last day of the month, and return for each order the respective beginning-of-month and end-of-month dates. It would be nice if you could alias the expression that calculates the beginning-of-month date (call it beginning_of_month), then use that alias in the expression that calculates the end-of-month date (call it end_of_month), and then use that alias in the query filter. Making the following attempt

USE InsideTSQL2008;

SELECT orderid, orderdate, custid, empid,
  CAST(CONVERT(CHAR(6), orderdate, 112) + '01' AS DATETIME)
    AS beginning_of_month,
  DATEADD(day, -1, DATEADD(month, 1, beginning_of_month))
    AS end_of_month
FROM Sales.Orders
WHERE orderdate < end_of_month;

generates the error message that Figure 3 shows.

Msg 207, Level 16, State 1, Line 5
Invalid column name 'end_of_month'.
Msg 207, Level 16, State 1, Line 3
Invalid column name 'beginning_of_month'.

To understand the reasons for the errors, you need to understand certain aspects in the design of SQL. You can’t reuse aliases defined in the SELECT clause within the same SELECT clause because conceptually the language evaluates all expressions that appear in the same logical query processing phase (SELECT, in our case) at the same point in time. You also can’t reuse aliases defined in the SELECT clause in the WHERE clause, because the WHERE clause is conceptually evaluated prior to the SELECT clause. The conceptual evaluation order of the clauses is FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY.

One option is to repeat the logic in all expressions in the query, like so:

SELECT orderid, orderdate, custid, empid,
  CAST(CONVERT(CHAR(6), orderdate, 112) + '01' AS DATETIME)
    AS beginning_of_month,
  DATEADD(day, -1,
    DATEADD(month, 1,
  CAST(CONVERT(CHAR(6), orderdate, 112) + '01' AS DATETIME)))
     AS end_of_month
FROM Sales.Orders
WHERE orderdate <
  DATEADD(day, -1,
    DATEADD(month, 1,
  CAST(CONVERT(CHAR(6), orderdate, 112) + '01' AS DATETIME)));

This results in a lot of redundancy and increases the likelihood of errors.

Another option is to gradually build your solution query in steps, using CTEs in which you don’t need to repeat the expressions, as Listing 1 shows. However, this layered approach adds its own complexity to the solution and is a bit verbose even though you don’t need to repeat the expressions.

WITH BOM AS
(
  SELECT orderid, orderdate, custid, empid,
    CAST(CONVERT(CHAR(6), orderdate, 112) + '01' AS DATETIME)
  AS beginning_of_month
  FROM Sales.Orders
),
EOM AS
(
  SELECT orderid, orderdate, custid, empid, beginning_of_month,
    DATEADD(day, -1, DATEADD(month, 1, beginning_of_month))
  AS end_of_month
  FROM BOM
)
SELECT orderid, orderdate, custid, empid,
  beginning_of_month, end_of_month
FROM EOM
WHERE orderdate < end_of_month;

It turns out that you can use the APPLY operator to handle this need in an elegant and more concise manner than the alternatives. Remember that APPLY is a table operator that’s evaluated in the FROM clause of the query. If you have multiple table operators, they’re evaluated from left to right, and each has access to the table result produced by the previous table operator. Also, the FROM clause is the first query clause to be conceptually evaluated—before the WHERE clause and the others. This means that if you define a column alias in a table expression used by an APPLY operator, all subsequent table operators, as well as query clauses, have access to that alias. All this leads to the ability to handle our current querying task in the following elegant manner:

SELECT O.orderid, O.orderdate, O.custid, O.empid,
  B.beginning_of_month, E.end_of_month
FROM Sales.Orders AS O
  CROSS APPLY (VALUES(
    CAST(CONVERT(CHAR(6), orderdate, 112) + '01' AS DATETIME)))
  AS B(beginning_of_month)
  CROSS APPLY (VALUES(
    DATEADD(day, -1, DATEADD(month, 1, beginning_of_month))))
  AS E(end_of_month)
WHERE O.orderdate < E.end_of_month;

Aggregate Over Columns

Our next example for using the APPLY operator is to calculate aggregates over columns. To create the sample data for this example, use the code in Listing 2. This code creates and populates a table called Sales in the tempdb database for test purposes.

USE tempdb;
IF OBJECT_ID('dbo.Sales', 'U') IS NOT NULL DROP TABLE dbo.Sales;

CREATE TABLE dbo.Sales
(
  custid    VARCHAR(10) NOT NULL,
  salesyear INT NOT NULL,
  [01]      INT NULL,
  [02]      INT NULL,
  [03]      INT NULL,
  [04]      INT NULL,
  [05]      INT NULL,
  [06]      INT NULL,
  [07]      INT NULL,
  [08]      INT NULL,
  [09]      INT NULL,
  [10]      INT NULL,
  [11]      INT NULL,
  [12]      INT NULL,
  CONSTRAINT PK_Sales PRIMARY KEY(custid, salesyear)
);

INSERT INTO dbo.Sales
    (custid, salesyear, [01],[02],[03],[04],[05],[06],[07],[08],[09],[10],[11],[12])
  VALUES
    ('A', 2010, 90,41,75,9,85,6,65,5,30,90,11,71),
    ('A', 2011, 29,29,8,95,1,16,36,74,59,43,31,49),
    ('B', 2009, 29,51,92,15,2,45,26,90,34,14,25,9),
    ('B', 2010, 39,8,94,25,30,35,42,75,62,7,98,19),
    ('B', 2011, 39,22,41,56,5,27,2,22,32,52,74,26);

The Sales table holds the monthly sales quantities. The table has one row for each year and one column for each month. Why the table was designed like this isn’t really our concern—perhaps the data was imported from a spreadsheet in this manner. We have a very specific task: Write a query that calculates the minimum and maximum monthly quantities in each year.

The APPLY operator can be used to handle this task elegantly and very efficiently by combining the use of the enhanced VALUES clause available in SQL Server 2008 and later. The VALUES clause can be used to define a derived table like so:

(VALUES(),(),...) AS D();

Relying on this capability, you can use the APPLY operator with the VALUES clause to turn each source row to 12 target rows (one for each month) and have all monthly quantities in one result column (call it qty). Then you can perform any aggregates that you need on that result column. Here’s what the solution query looks like:

SELECT S.custid, S.salesyear,
  S.[01],S.[02],S.[03],S.[04],S.[05],S.[06],S.[07],S.[08],S.[09],S.[10],S.[11],S.[12],
  A.mn, A.mx
FROM dbo.Sales AS S
  CROSS APPLY
    (SELECT MIN(D.qty) AS mn, MAX(D.qty) AS mx
     FROM (VALUES(S.[01]),(S.[02]),(S.[03]),
         (S.[04]),(S.[05]),(S.[06]),
         (S.[07]),(S.[08]),(S.[09]),
         (S.[10]),(S.[11]),(S.[12])) AS D(qty)) AS A;

Figure 4 shows the output of this query. Although this approach isn’t a conventional way to use the VALUES clause, it’s certainly a valid, supported method.

Figure 4: Output of query calculating aggregates over columns

custid

salesyear

01

02

03

04

05

06

07

08

09

10

11

12

mn

mx

------

---------

--

--

--

--

--

--

--

--

--

--

--

--

--

--

A

2010

90

41

75

9

85

6

65

5

30

90

11

71

5

90

A

2011

29

29

8

95

1

16

36

74

59

43

31

49

1

95

B

2009

29

51

92

15

2

45

26

90

34

14

25

9

2

92

B

2010

39

8

94

25

30

35

42

75

62

7

98

19

7

98

B

2011

39

22

41

56

5

27

2

22

32

52

74

26

2

74

Unpivoting

The last example in which you needed to calculate aggregates over columns involves first unpivoting the monthly quantity columns into rows. You could achieve this with the UNPIVOT operator, but performance tests I’ve done show that the technique with the APPLY operator is faster. And if speed alone isn’t enough of a reason to switch to using the APPLY operator to handle unpivoting, perhaps the next example will be.

Suppose you needed to unpivot multiple sets of columns. To demonstrate this need, I’ll use a table called Sales (a different one than in the previous example) that you create and populate by running the code in Listing 3.

USE tempdb;
IF OBJECT_ID('dbo.Sales', 'U') IS NOT NULL DROP TABLE dbo.Sales;
GO

CREATE TABLE dbo.Sales
(
  custid    VARCHAR(10) NOT NULL,
  qty2009   INT   NULL,
  qty2010   INT   NULL,
  qty2011   INT   NULL,
  val2009   MONEY NULL,
  val2010   MONEY NULL,
  val2011   MONEY NULL,
  CONSTRAINT PK_Sales PRIMARY KEY(custid)
);

INSERT INTO dbo.Sales
    (custid, qty2009, qty2010, qty2011, val2009, val2010, val2011)
  VALUES
    ('A', 606,113,781,4632.00,6877.00,4815.00),
    ('B', 243,861,637,2125.00,8413.00,4476.00),
    ('C', 932,117,202,9068.00,342.00,9083.00),
    ('D', 915,833,138,1131.00,9923.00,4164.00),
    ('E', 822,246,870,1907.00,3860.00,7399.00);

The Sales table has a row for each customer, a set of three columns for yearly quantities (qty2009, qty2010, qty2011), and a set of three columns for yearly values (val2009, val2010, val2011). Your task is to unpivot each customer source row to three target rows (one for each year) and in each result row return the customer ID, year, quantity, and value. Try to do this with the UNPIVOT operator, and you’ll realize that you need to write two UNPIVOT queries—one for each set of columns to be unpivoted—and join the results. But with APPLY and the VALUES clause, you use a similar technique to the one used in the aggregate over columns example to construct three result rows out of each source row using row constructors, like so:

SELECT S.custid, A.salesyear, A.qty, A.val
FROM dbo.Sales AS S
  CROSS APPLY
    (VALUES(2009, S.qty2009, S.val2009),
       (2010, S.qty2010, S.val2010),
       (2011, S.qty2011, S.val2011)) AS A(salesyear, qty, val);

Figure 5 shows the output of this query. As you can see, this solution is elegant and concise, and it’s very efficient compared with the alternative using two UNPIVOT queries.

custid	salesyear	qty	val
-------	----------	----	--------
A	2009		606	4632.00
A	2010		113	6877.00
A	2011		781	4815.00
B	2009		243	2125.00
B	2010		861	8413.00
B	2011		637	4476.00

Inlining Scalar Functions

Developers love using functions to encapsulate logic—which is what programmability best practices encourage. Therefore, you see a wide use of user-defined functions (UDFs) in T-SQL code. Consider scalar UDFs. Suppose that the UDF is called in a query and a column from the table is passed as input, as in the following example:

SELECT col1, dbo.MyFunction(col1) AS expr1
FROM dbo.T1;

If the function has a body with flow that includes more than just a single scalar expression, you’d expect to pay whatever cost is involved with the UDF call for each of the rows involved in the query. But what if the UDF just encapsulates a return of a scalar expression—either a result of a query or just an expression? It turns out that SQL Server doesn’t currently attempt to inline such an expression in the calling query—and this fact has a major performance penalty when compared with embedding the original expression directly in the calling query.

I’ll demonstrate the performance problem with an example against the Performance database. Consider the following query returning orders placed on the last day of the year:

USE Performance;

SELECT orderid, orderdate, custid, empid, filler
FROM dbo.Orders
WHERE orderdate = DATEADD(year, DATEDIFF(year, '18991231', orderdate), '18991231');

This query finished in well under a second on the test computer where I ran it against hot cache, utilizing an efficient parallel scan.

Following programmability best practices, you want to hide the complexity of the calculation from the user; therefore, you encapsulate the logic in a UDF, like so:

USE Performance;
IF OBJECT_ID('dbo.EndOfYear') IS NOT NULL DROP FUNCTION dbo.EndOfYear;
GO
CREATE FUNCTION dbo.EndOfYear(@dt AS DATETIME) RETURNS DATETIME
AS
BEGIN
  RETURN DATEADD(year, DATEDIFF(year, '18991231', @dt), '18991231');
END;

You then revise the query to use the function, like so:

SELECT orderid, orderdate, custid, empid, filler
FROM dbo.Orders
WHERE orderdate = dbo.EndOfYear(orderdate);

The query is certainly clearer and more concise, but it takes 5 seconds to finish! The reason is two-fold. First, SQL Server unfortunately doesn’t inline the UDF, and therefore you pay some overhead for each of the function’s calls. Second, using a T-SQL scalar UDF prevents parallelism. But all hope is not lost; even though this might sound like a bizarre idea, you turn the function to an inline table-valued UDF, like so:

IF OBJECT_ID('dbo.EndOfYear') IS NOT NULL DROP FUNCTION dbo.EndOfYear;
GO
CREATE FUNCTION dbo.EndOfYear(@dt AS DATETIME) RETURNS TABLE
AS
RETURN
  SELECT DATEADD(year, DATEDIFF(year, '18991231', @dt), '18991231') AS result;

Then, use either explicit or implicit APPLY to invoke the UDF in the query. Here’s an example with implicit APPLY:

SELECT orderid, orderdate, custid, empid, filler
FROM dbo.Orders
WHERE orderdate = (SELECT result FROM dbo.EndOfYear(orderdate));

This time, the UDF got inlined prior to optimization because it’s an inline table-valued UDF. You basically get the same query plan as with the first query form that didn’t include a UDF call, and therefore the performance is the same. The query finishes in under a second using a parallel scan.

Give APPLY a Try

The APPLY operator was an underappreciated feature when it was first introduced in SQL Server 2005. But with time, people have found elegant, efficient, and creative ways to use APPLY. These days, I find myself using APPLY quite a lot in my code. If you’re not using APPLY extensively yet, I hope you’ll give it a try.

Hide comments

Comments

  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.
Publish