Logical Query Processing: The FROM Clause and APPLY

This article is the third in a series that covers logical query processing. In the first part, I provided an overview of the topic, and in the second part I started covering the FROM clause, focusing on joins. This third part continues the coverage of the FROM clause, focusing on the APPLY operator.

In this article I’ll use the same sample database called TSQLV4 that I used in the previous parts. You can download the source code to create and populate this database here. Run the following code to use the sample database in your session:

USE TSQLV4;

Make sure that you run all queries in this article while connected to this database.

Also make sure you remind yourself of the queries from Part I in the series that I referred to as simple sample query and complex sample query.

The APPLY operator is a T-SQL-specific feature that, like the JOIN operator, is designed as a table operator. You will find some similarities to a join, but also some differences. I’ll start the article by explaining how joins and subqueries are limited. I’ll then explain how APPLY addresses those limitations. I’ll conclude by explaining how you can use APPLY to define column aliases and allow their use from a very early stage in logical query processing.

Limitations of Joins and Subqueries

The motivation for Microsoft to introduce the APPLY operator in T-SQL was that subqueries and joins, which APPLY is considered an alternative to, have certain limitations. As an example, suppose that you need to write a query against the TSQLV4 database that returns the most-recent order for each customer. The tables involved are Sales.Customers and Sales.Orders. From Customers you’re supposed to return the customer ID and the company name, and from the qualifying rows from the Orders table you’re supposed to return the order ID, order date and employee ID.

This task is known generally as the top N per group task. The following code creates the recommended index on the Orders table for the typical query solutions for this task:

CREATE INDEX idx_poc

  ON Sales.Orders(custid, orderdate DESC, orderid DESC)

  INCLUDE(empid);

I like to think of this index as a POC index, which is an acronym for partitioning, ordering and covering. It defines the key-list based on the partitioning element (custid) followed by the ordering element (orderdate DESC, orderid DESC), and includes the rest of the elements from the query (empid) for coverage.

One of the limitations of subqueries in T-SQL is that they are limited to returning only one column. The exception is when you use the EXISTS predicate, but then the inner query’s SELECT list is meaningless anyway. If T-SQL supported returning multiple columns from a subquery, you could have used the following solution:

SELECT C.custid, C.companyname,

  ( SELECT TOP (1) O.orderid, O.orderdate, O.empid

FROM Sales.Orders AS O

WHERE O.custid = C.custid

ORDER BY O.orderdate DESC, O.orderid DESC )

FROM Sales.Customers AS C;

However, since T-SQL doesn’t support such a vector expression, but rather limits the subquery to returning only one expression, you get the following error:

Msg 116, Level 16, State 1, Line 24

Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.

You could use a separate subquery for each column, like so:

SELECT C.custid, C.companyname,

  ( SELECT TOP (1) O.orderid

FROM Sales.Orders AS O

WHERE O.custid = C.custid

ORDER BY O.orderdate DESC, O.orderid DESC ) AS orderid,

  ( SELECT TOP (1) O.orderdate

FROM Sales.Orders AS O

WHERE O.custid = C.custid

ORDER BY O.orderdate DESC, O.orderid DESC ) AS orderdate,

  ( SELECT TOP (1) O.empid

FROM Sales.Orders AS O

WHERE O.custid = C.custid

ORDER BY O.orderdate DESC, O.orderid DESC ) AS empid

FROM Sales.Customers AS C;

This solution is lengthy, with three repetitions of most of the subquery’s logic, making it difficult to maintain. Also, SQL Server’s optimizer currently doesn’t try to collapse the logic of the three subqueries internally into one physical access to the data. You can see this clearly in the query’s execution plan as shown in Figure 1 (using SQL Sentry’s Plan Explorer).

Figure 1: Execution plan for three subqueries

Figure 01 - Plan for three subqueries.jpg

Notice that there are three different branches that access the POC index per underlying customer row, each retrieving a different column from the qualifying order. This is clearly inefficient.

Some database platforms do support vector expressions, allowing a subquery to return multiple expressions. However, even if SQL Server had such support, it would only solve our problem when N in the task is equal to 1. What if you need to return the three most-recent orders for each customer?

Perhaps you’re thinking that this should be achievable with a cross join, like so:

SELECT C.custid, C.companyname, O.orderid, O.orderdate, O.empid

FROM Sales.Customers AS C

  CROSS JOIN ( SELECT TOP (3) O.orderid, O.orderdate, O.empid

       FROM Sales.Orders AS O

       WHERE O.custid = C.custid

       ORDER BY O.orderdate DESC, O.orderid DESC ) AS O;

However, a join treats its two inputs as a set, and a set has no order. This means that in one side of the join you’re not allowed to refer to the elements from the other side. Here, the derived table query against Orders has a correlation in the filter to an element from Customers (C.custid). Only subsequent steps in the logical processing of the query will have access to the elements from the two sides of the join. If you try to execute this query you will get the following error:

Msg 4104, Level 16, State 1, Line 52

The multi-part identifier "C.custid" could not be bound.

APPLY to the rescue

The APPLY operator was designed to solve the problems I described, with subqueries and with joins. There are two flavors of the operator: one called CROSS APPLY, which implements only one logical query processing step, and another called OUTER APPLY, which implements two.

The CROSS APPLY operator is pretty similar to the CROSS JOIN operator, only instead of treating the two inputs as a set, it evaluates the left input first, and then applies the right side to each left row. What this means is that if the right side is a table expression, like a derived table, within it you are allowed to refer to the left side’s elements. So if in the last query you replace CROSS JOIN with CROSS APPLY, the query becomes valid and solves our task. Here’s the code:

SELECT C.custid, C.companyname, O.orderid, O.orderdate, O.empid
FROM Sales.Customers AS C
  CROSS APPLY ( SELECT TOP (3) O.orderid, O.orderdate, O.empid
        FROM Sales.Orders AS O
        WHERE O.custid = C.custid
        ORDER BY O.orderdate DESC, O.orderid DESC ) AS O;

The applied query can return multiple columns. In fact, it can return multiple rows, as this example demonstrates. Plus, it can have references to elements from the left side unlike in a joined table expression.

The execution plan for this query is shown in Figure 2.

Figure 2: Execution plan for CROSS APPLY

Figure 02 - Plan for CROSS APPLY.jpg

Notice that, per customer, there’s only one access to the POC index.

As mentioned, the APPLY operator is a T-SQL-specific feature. Curiously, standard SQL has a similar feature called lateral derived table. The standard parallel to CROSS APPLY is to use CROSS JOIN, but to prefix the derived table with the keyword LATERAL to indicate that it’s a lateral derived table. The operator visits the left side first, and applies the lateral derived table to each left row. T-SQL doesn’t support lateral derived tables since it has its own solution using APPLY, but, if it did, the code would have looked like this. (Don’t run it since it’s not supported in SQL Server.):

SELECT C.custid, C.companyname, O.orderid, O.orderdate, O.empid

FROM Sales.Customers AS C

  CROSS JOIN LATERAL ( SELECT TOP (3) O.orderid, O.orderdate, O.empid

           FROM Sales.Orders AS O

           WHERE O.custid = C.custid

           ORDER BY O.orderdate DESC, O.orderid DESC ) AS O;

As mentioned, the CROSS APPLY operator implements only one logical query processing step: apply the right table expression to each row from the left side. If the right side returns an empty set for a given left row, CROSS APPLY won’t return that row. This is a bit similar to doing a cross join between a table with one row and a table with zero rows—you get an empty set back. If you want to preserve all left rows you use a second flavor of the operator called OUTER APPLY. This operator includes a second logical step that preserves outer rows. Similar to a LEFT OUTER JOIN, OUTER APPLY uses NULLs as placeholders in the right side since there’s no match.

The previous query which uses CROSS APPLY returns 263 rows in its result:

custid      companyname     orderid     orderdate  empid

----------- --------------- ----------- ---------- -----------

72      Customer AHPOP  10869       2016-02-04 5

72      Customer AHPOP  10804       2015-12-30 6

72      Customer AHPOP  10800       2015-12-26 1

58      Customer AHXHT  11073       2016-05-05 2

58      Customer AHXHT  10995       2016-04-02 1

58      Customer AHXHT  10502       2015-04-10 2

25      Customer AZJED  11012       2016-04-09 1

25      Customer AZJED  10929       2016-03-05 6

25      Customer AZJED  10859       2016-01-29 1

...


(263 row(s) affected)

This output doesn’t return two customers that appear in Customers but not in Orders.

The following code replaces the previous CROSS APPLY operator with OUTER APPLY:

SELECT C.custid, C.companyname, O.orderid, O.orderdate, O.empid

FROM Sales.Customers AS C

  OUTER APPLY ( SELECT TOP (3) O.orderid, O.orderdate, O.empid

        FROM Sales.Orders AS O

        WHERE O.custid = C.custid

        ORDER BY O.orderdate DESC, O.orderid DESC ) AS O;

This query returns 265 rows, which include the two customers that don’t have related orders:

custid      companyname     orderid     orderdate  empid

----------- --------------- ----------- ---------- -----------

72      Customer AHPOP  10869       2016-02-04 5

72      Customer AHPOP  10804       2015-12-30 6

72      Customer AHPOP  10800       2015-12-26 1

58      Customer AHXHT  11073       2016-05-05 2

58      Customer AHXHT  10995       2016-04-02 1

58      Customer AHXHT  10502       2015-04-10 2

25      Customer AZJED  11012       2016-04-09 1

25      Customer AZJED  10929       2016-03-05 6

25      Customer AZJED  10859       2016-01-29 1

...

22      Customer DTDMN  NULL    NULL       NULL

...

57      Customer WVAXS  NULL    NULL       NULL

...


(265 row(s) affected)

If you’re wondering what’s the standard parallel to OUTER APPLY, it’s to use the LEFT OUTER JOIN operator with a lateral derived table and a join predicate that is always true, like so. (Again, don’t run this code since it’s not supported in SQL Server.):

SELECT C.custid, C.companyname, O.orderid, O.orderdate, O.empid

FROM Sales.Customers AS C

  LEFT OUTER JOIN LATERAL ( SELECT TOP (3) O.orderid, O.orderdate, O.empid

            FROM Sales.Orders AS O

            WHERE O.custid = C.custid

            ORDER BY O.orderdate DESC, O.orderid DESC ) AS O

ON 1 = 1;

Figure 3 shows the logical query processing steps for the APPLY operator in addition to those shown last month for joins.

Figure 3: Logical query processing flow chart - APPLY

Figure 03 - Logical query processing flow chart - APPLY.jpg

If you want to simplify the code you can encapsulate the derived table query in an inline table valued function that accepts the customer ID and desired number of recent orders as inputs, like so:

CREATE FUNCTION Sales.GetTopOrders(@custid AS INT, @n AS BIGINT) RETURNS TABLE

AS

RETURN

  SELECT TOP (@n) O.orderid, O.orderdate, O.empid

  FROM Sales.Orders AS O

  WHERE O.custid = @custid

  ORDER BY O.orderdate DESC, O.orderid DESC;

GO

Then the outer query can use the function as the right input to APPLY instead of the derived table, and pass C.custid and the number of recent orders as inputs:

SELECT C.custid, C.companyname, O.orderid, O.orderdate, O.empid

FROM Sales.Customers AS C

  CROSS APPLY Sales.GetTopOrders( C.custid, 3 ) AS O;

The function gets inlined before optimization so the query plan that you get is identical to the one shown earlier in Figure 2.

Reuse of Column Aliases

One of the annoying aspects of SQL is when you define column aliases for expressions in the SELECT clause and then you try to refer to those aliases in other query clauses, like WHERE or GROUP BY. Consider the following query:

SELECT

  YEAR(orderdate) AS orderyear,

  MONTH(orderdate) AS ordermonth,

  COUNT(*) AS numorders

FROM Sales.Orders

GROUP BY orderyear, ordermonth;

Do you see any problem with it? If not, try running it. You get the following error:

Msg 207, Level 16, State 1, Line 127

Invalid column name 'orderyear'.

Msg 207, Level 16, State 1, Line 127

Invalid column name 'ordermonth'.

In terms of logical query processing, the SELECT clause is evaluated after the GROUP BY clause and not the other way around. The aliases that were defined by the SELECT clause are not available to expressions that appear in earlier logical query processing steps. Hence, you can’t refer to such aliases in the GROUP BY clause. For the same reason you can’t refer to such aliases in the WHERE clause.

I’ll elaborate on this in future articles when discussing the WHERE and GROUP BY clauses. For now, I want to focus on the APPLY operator. Since APPLY is a table operator, it’s evaluated as part of the FROM clause. The FROM clause is evaluated before all other clauses in the query. This means that if you define column aliases using APPLY, they will naturally be available to expressions in the rest of the query clauses. Here’s how you fix the problem in the previous query using APPLY and a derived table based on the VALUES clause (known as a table-value constructor):

SELECT orderyear, ordermonth, COUNT(*) AS numorders

FROM Sales.Orders

  CROSS APPLY ( VALUES( YEAR(orderdate), MONTH(orderdate) ) )

AS A(orderyear, ordermonth)

GROUP BY orderyear, ordermonth;

The VALUES clause defines a table made of one row with two columns. It assigns the column alias orderyear to the expression YEAR(orderdate) and ordermonth to the expression MONTH(orderdate). Note that APPLY enables access in the right side’s expressions to the column orderdate from the left side. As mentioned, since the aliases are created early in the logical processing of the query, they become available to the query clauses that are evaluated in subsequent steps, such as the GROUP BY clause in this query.

All expressions that appear in the same logical query processing step are treated as a set, so there’s no order between them. Consequently, you can’t refer to an alias defined in a given step in other expressions in the same step. Consider the following query as an example:

SELECT orderyear, ordermonth, COUNT(*) AS numorders

FROM Sales.Orders

  CROSS APPLY ( VALUES( YEAR(orderdate), MONTH(orderdate),

            DATEFROMPARTS(orderyear, 12, 31) ) )

AS A(orderyear, ordermonth, endofyear)

WHERE orderdate <> endofyear

GROUP BY orderyear, ordermonth;

Observe the attempt to refer to the orderyear alias in the expression that defines the endofyear column. Since orderyear is defined by the same logical query processing step that defines endofyear, the reference to orderyear in that expression is invalid. An alias created in a given step can only be referred to by expressions that appear in subsequent steps—not the same step, and certainly not preceding steps. If you try to run the above query you get the following error:

Msg 207, Level 16, State 1, Line 147

Invalid column name 'orderyear'.

To address this problem, when you have a dependency between expressions, define those using separate APPLY operators in the right order. In our case, you need to define the endofyear column in an APPLY operator that appears after the APPLY operator that defines orderyear, like so:

SELECT orderyear, ordermonth, COUNT(*) AS numorders

FROM Sales.Orders

  CROSS APPLY ( VALUES( YEAR(orderdate), MONTH(orderdate) ) )

AS A1(orderyear, ordermonth)

  CROSS APPLY ( VALUES( DATEFROMPARTS(orderyear, 12, 31) ) )

AS A2(endofyear)

WHERE orderdate <> endofyear

GROUP BY orderyear, ordermonth;

Now that you are familiar with the logical query processing aspects of the JOIN and APPLY operators, you should be able to figure out what the FROM clause in the complex sample query does. Here’s the part of the query with the complete FROM clause and a simplified SELECT clause:

SELECT

  C.custid,

  A.custlocation,

  O.orderid,

  OD.productid,

  A.val

FROM Sales.Customers AS C

  LEFT OUTER JOIN

  ( Sales.Orders AS O

      INNER JOIN Sales.OrderDetails AS OD

    ON O.orderid = OD.orderid

    AND O.orderdate >= '20160101' )

ON C.custid = O.custid

  CROSS APPLY ( VALUES( CONCAT(C.country, N'.' + C.region, N'.' + C.city),

            OD.qty * OD.unitprice * (1 - OD.discount) )

      ) AS A(custlocation, val);

The query uses an inner join between Orders and OrderDetails, matching order headers with their respective order lines and keeping only orders that were placed since the beginning of 2016. Since it’s an inner join, orders placed prior to 2016 aren’t preserved.

The query performs a left outer join between Customers and the result of the inner join between Orders and OrderDetails. Since the join is a left outer join, all customers are preserved—even ones who didn’t place orders since the beginning of 2016. Those will have NULLs in the attributes from the nonpreserved side of the join.

The query then uses the CROSS APPLY operator to create the column aliases custlocation and val. The former is a concatenated string made of the country, region and city attributes with separators between the elements. The CONCAT function concatenates the inputs, replacing NULLs with empty strings. The latter is the order line value computed as the quantity times the unit price, also taking into account the discount. Since these column aliases are created in the FROM clause, they’re available to all remaining clauses in the query.

The SELECT clause here simply returns some of the attributes, including the ones computed by the CROSS APPLY operator. Here’s the output of this query, shown here in abbreviated form:

custid  custlocation        orderid  productid   val

------- ----------------------- -------- ----------- ------------

55      USA.AK.Anchorage    10808    56      646.0000000

55      USA.AK.Anchorage    10808    76      765.0000000

88      Brazil.SP.Resende       10809    52      140.0000000

42      Canada.BC.Vancouver     10810    13      42.0000000

42      Canada.BC.Vancouver     10810    25      70.0000000

...

13      Mexico.México D.F.      NULL     NULL    NULL

22      Spain.Madrid        NULL     NULL    NULL

33      Venezuela.DF.Caracas    NULL     NULL    NULL

57      France.Paris        NULL     NULL    NULL

51      Canada.Québec.Montréal  NULL     NULL    NULL


(701 row(s) affected)

When you’re done, run the following code for cleanup.

DROP INDEX idx_poc ON Sales.Orders;

DROP FUNCTION Sales.GetTopOrders;

What’s Next?

In this article I focused on the logical query processing aspects of the APPLY table operator. I described the two flavors of the operator: CROSS APPLY and OUTER APPLY. The former applies the right table expression to each row from the left side, and the latter also preserves all left rows. I explained the advantages that APPLY has compared to joins and subqueries. I also explained how you can use the APPLY operator to create column aliases for expressions very early in the logical processing of a query, making those aliases available to all remaining query clauses. Next month I’ll continue the coverage of the FROM clause, discussing the logical query processing aspects of the PIVOT and UNPIVOT operators.

Comments

Plain text