Logical Query Processing: The FROM Clause and Joins

Last month I provided an overview of logical query processing. I explained that this term represents the logical interpretation of a query. I provided a sample database called TSQLV4, two sample queries that I referred to as simple sample query and complex sample query, and a graphical depiction of logical query processing as a flow chart. Starting with this article, I delve into the particulars of individual clauses--this time focusing on the FROM clause and the JOIN operator.

The FROM Clause

The FROM clause is the first major query clause that is logically processed, despite the fact that the SELECT clause is written first. There are two main things that you specify in this clause. One, this is where you specify the input tables of the query. Two, this is where you specify table operators like JOIN, APPLY, PIVOT and UNPIVOT, which operate on the input tables.

If you specify only one table in the FROM clause without any table operators, the output of this step is also the input of the step—the input table.

If you do have table operators involved, they are logically evaluated in written order—from left to right—with the output of one operator becoming the left input of the next. This logical order defines which elements are or aren’t available to any given step. What one step creates is visible to subsequent steps, but not the other way around.

Note that from a physical processing perspective, SQL Server can alter the logical processing order when it knows that the output will remain the same, and the plan with the altered order is estimated to have a lower cost than the one with the original order. In the next section about the JOIN operator I’ll get into more details.

The JOIN Operator

In the relational model, a join is not limited to being a binary operator but rather is an n-ary operator. In SQL, though, a join is defined as a binary operator with two inputs—the left and right inputs. SQL supports three fundamental join types: cross, inner and outer. Each join type implements a different subset of the following steps:

1-J1. Apply Cartesian product

1-J2. Identify matches based on the ON predicate

1-J3. Add outer rows

A cross join involves only step 1-J1, an inner join involves 1-J1 and 1-J2, an outer join involves 1-J1, 1-J2 and 1-J3.

To understand these, consider the FROM clause of the simple sample query from the overview article:

FROM Sales.Customers AS C

LEFT OUTER JOIN Sales.Orders AS O

ON C.custid = O.custid

1-J1. Apply Cartesian product

The first step in the join applies a Cartesian product between the two inputs of the join. Since there are 91 rows in the Customers table and 830 rows in the orders table, the result has 75,530 rows, shown here in abbreviated form:

C.custid  C.companyname  ... O.orderid  O.custid  O.orderdate  ...

--------- --------------     ---------- --------- ------------

72    Customer AHPO      10248      85    2014-07-04

72    Customer AHPO      10249      79    2014-07-05

72    Customer AHPO      10250      34    2014-07-08

...

72    Customer AHPO      10471      11    2015-03-11

72    Customer AHPO      10472      72    2015-03-12

72    Customer AHPO      10473      38    2015-03-13

...

83    Customer ZRND      10993      24    2016-04-01

83    Customer ZRND      10994      83    2016-04-02

83    Customer ZRND      10995      58    2016-04-02

...

83    Customer ZRND      11075      68    2016-05-06

83    Customer ZRND      11076      9     2016-05-06

83    Customer ZRND      11077      65    2016-05-06

Note that the conceptual treatment of the join starts with a Cartesian product, but the SQL Server query optimizer will often manage to avoid it in practice in the plan that it builds. Remember that the physical processing doesn’t have to follow the exact steps defined by the logical processing, rather only produce the same result. The whole point of optimization in the physical layer is to apply shortcuts when possible.

Also observe in the result of this step that the column names are prefixed by the table name, or alias, if assigned. This is done to allow referring to a column name in an unambiguous manner if both inputs have the same column name--like custid, in our case.

The connection to the relational model has to do with the fact that the result of each step in logical query processing is a virtual table, representing a relation in the relational model. A relation has two parts: a header, which is a set of attributes (what SQL calls columns), and a body, which is a set of tuples (what SQL calls rows). A set in mathematics has no order. So an attribute is not supposed to be identified by position, but rather by name, and therefore attribute names in a relation must be unique. Since both source tables have a column called custid, you get unique names by adding the prefixes.

What’s interesting is that if you assign table aliases, you rename the tables for the duration of the query. In all subsequent logical query processing phases if you try to refer to the full table name, as in Customers.custid, you’ll get an error. You need to use the alias, as in C.custid.

1-J2. Identify matches based on the ON predicate

The purpose of the ON clause is to identify matches in the result of the previous step based on the specified predicate, which in our case is ON C.custid = O.custid. SQL uses three valued predicate logic, where the outcome of a predicate is one of the three logical values: true, false and unknown. With an equality operator, when both sides of the predicate are not NULL you get either true or false. You get a true when they are the same--for example, in this row:

C.custid  C.companyname  ... O.orderid  O.custid  O.orderdate  ...

--------- --------------     ---------- --------- ------------

72    Customer AHPO      10472      72    2015-03-12

You get a false when they are different, for example in this row:

C.custid  C.companyname  ... O.orderid  O.custid  O.orderdate  ...

--------- --------------     ---------- --------- ------------

72    Customer AHPO      10248      85    2014-07-04

If you have a NULL in either side or both, the result is the logical value unknown. That’s the case with both equality and inequality operators. In our sample data, we don’t have any NULLs in the C.custid or O.custid columns, so the predicates evaluate to true or false in all cases. Here’s the result of the ON predicate applied to the rows in the result of the previous step (again, abbreviated):

C.custid  C.companyname  O.orderid  O.custid  O.orderdate  C.custid = O.custid?

--------- -------------- ---------- --------- ------------ --------------------

72    Customer AHPO  10248      85    2014-07-04   false

72    Customer AHPO  10249      79    2014-07-05   false

72    Customer AHPO  10250      34    2014-07-08   false

...

72    Customer AHPO  10471      11    2015-03-11   false

72    Customer AHPO  10472      72    2015-03-12   true

72    Customer AHPO  10473      38    2015-03-13   false

...

83    Customer ZRND  10993      24    2016-04-01   false

83    Customer ZRND  10994      83    2016-04-02   true

83    Customer ZRND  10995      58    2016-04-02   false

...

83    Customer ZRND  11075      68    2016-05-06   false

83    Customer ZRND  11076      9     2016-05-06   false

83    Customer ZRND  11077      65    2016-05-06   false

C.custid  C.companyname  O.orderid  O.custid  O.orderdate  C.custid = O.custid?

--------- -------------- ---------- --------- ------------ --------------------

72    Customer AHPO  10248      85    2014-07-04   false

72    Customer AHPO  10249      79    2014-07-05   false

72    Customer AHPO  10250      34    2014-07-08   false

...

72    Customer AHPO  10471      11    2015-03-11   false

72    Customer AHPO  10472      72    2015-03-12   true

72    Customer AHPO  10473      38    2015-03-13   false

...

83    Customer ZRND  10993      24    2016-04-01   false

83    Customer ZRND  10994      83    2016-04-02   true

83    Customer ZRND  10995      58    2016-04-02   false

...

83    Customer ZRND  11075      68    2016-05-06   false

83    Customer ZRND  11076      9     2016-05-06   false

83    Customer ZRND  11077      65    2016-05-06   false

If the join is an inner join, the result of the second step is also the result of the join. But if the join is an outer one, as is the case in our simple sample query, there’s a third step involved.

1-J3. Add outer rows

The third step is only applicable to outer joins. In an outer join, you mark which table or tables you want to preserve: left, right or both (full outer join). In our example, the left table (Customers) is the preserved one. The join needs to preserve all rows from the preserved table irrespective of whether the ON clause identified any matches from the right table (Orders). This step identifies rows from the preserved side that don’t have any matches and adds those to the result with NULLs as placeholders in the nonpreserved side’s columns. In our example, the customers with IDs 22 and 57 don’t have matching orders and therefore this step adds those to the result of the previous step:

C.custid  C.companyname   ...  O.orderid   O.custid  O.orderdate  ...

--------- ---------------      ----------- --------- ------------

85    Customer ENQZT       10248       85    2014-07-04

79    Customer FAPSM       10249       79    2014-07-05

34    Customer IBVRG       10250       34    2014-07-08

84    Customer NRCSK       10251       84    2014-07-08

76    Customer SFOGW       10252       76    2014-07-09

34    Customer IBVRG       10253       34    2014-07-10

14    Customer WNMAF       10254       14    2014-07-11

68    Customer CCKOT       10255       68    2014-07-12

88    Customer SRQVM       10256       88    2014-07-15

35    Customer UMTLM       10257       35    2014-07-16

...

22    Customer DTDMN       NULL    NULL      NULL

57    Customer WVAXS       NULL    NULL      NULL

Note that in SQL Server, the physical processing typically doesn’t split the work to the three steps I described here; like I said, its only responsibility is to return the same output.

If another JOIN operator exists in the FROM clause, the result of the first join becomes the left input of the second join, and the relevant subset of steps are then applied to that join depending on the join type. Figure 1 puts in all together as a flowchart.

Figure 1: Logical query processing flow chart - Joins

Figure 01 - Logical query processing flow chart - Joins.png

Controlling the Logical Processing Order of Joins

As mentioned, normally joins are logically processed in written order, from left to right. As an example, consider the following query:

SELECT C.custid, C.country, O.orderid, OD.productid, OD.qty, OD.unitprice

FROM Sales.Customers AS C

  INNER JOIN Sales.Orders AS O

ON C.custid = O.custid

  INNER JOIN Sales.OrderDetails AS OD

ON O.orderid = OD.orderid;

This query returns a total of 2,155 rows.

The first join that takes place here is the one between Customers and Orders based on the ON predicate C.custid = O.custid. The second join takes place between the result of the first join and the OrderDetails table based on the ON predicate O.orderid = OD.orderid. The ON predicates are actually the ones that define which units are joined together. They create what I like to think of as implied parentheses. You can actually add explicit parentheses. The above query implicitly defines the following parentheses:

SELECT C.custid, C.country, O.orderid, OD.productid, OD.qty, OD.unitprice

FROM ( Sales.Customers AS C

     INNER JOIN Sales.Orders AS O

       ON C.custid = O.custid )

  INNER JOIN Sales.OrderDetails AS OD

ON O.orderid = OD.orderid;

Since all joins in the query are inner joins, any nonmatches from either side are discarded. For example, there are two customers (with IDs 22 and 57) that don’t have any matching orders, and therefore they don’t appear in the output. Suppose that you wanted to preserve all customers in the result. The natural inclination is to change the join between Customers and Orders to a left outer join, like so:

SELECT C.custid, C.country, O.orderid, OD.productid, OD.qty, OD.unitprice

FROM Sales.Customers AS C

  LEFT OUTER JOIN Sales.Orders AS O

ON C.custid = O.custid

  INNER JOIN Sales.OrderDetails AS OD

ON O.orderid = OD.orderid;

Surprisingly, when you run this query you still get the same output as before with 2,155 rows, without the two extra customers that you expected. Can you explain why?

Logical query processing starts with the left outer join between Customers and Orders, preserving the two customers that don’t have orders. The result rows for those two customers are populated with NULLs as placeholders in the columns from the Orders table. Then, the second join is an inner join between the result of the first join and OrderDetails. When evaluating the predicate O.orderid = OD.orderid for the outer rows, the result is the logical value unknown since O.orderid is NULL in those rows. Consequently, the outer rows are discarded. This basically nullifies the outer join, effectively turning it into an inner join. You can generalize this and say that any left outer join that is subsequently followed by an inner join or a right outer join that compares something from the nonpreserved side of the join with anything, nullifies the outer join.

Interestingly, when SQL Server optimizes this query it applies contradiction detection, realizes that the outer join is meaningless, and converts the outer join to an inner join, as you can see in the query execution plan shown in Figure 2.

Figure 2: Query plan with inner join

Figure 02 - Contradiction detection.png

So, how do you fix this? The answer is, you don’t want a left join between Customers and Orders and then an inner join between the result and OrderDetails. You want a left join between Customers and the result of the inner join between Orders an OrderDetails. Remember that you control which units are joined with the ON clause, which must follow the two units that it connects. To join Orders and OrderDetails, you place the ON predicate O.orderid = OD.orderid right after this join. Let’s call this result OOD. To join Customers and OOD, you need the ON predicate C.custid = O.custid to appear after this join. Putting this all together, here’s how you’re supposed to write the query:

SELECT C.custid, C.country, O.orderid, OD.productid, OD.qty, OD.unitprice

FROM Sales.Customers AS C

  LEFT OUTER JOIN

  Sales.Orders AS O

    INNER JOIN Sales.OrderDetails AS OD

      ON O.orderid = OD.orderid

ON C.custid = O.custid;

Now the output contains 2,157 rows, correctly including the two customers that don’t have matching orders.

This query uses implied parentheses, but for clarity, it’s strongly recommended to specify them explicitly, like so:

SELECT C.custid, C.country, O.orderid, OD.productid, OD.qty, OD.unitprice

FROM Sales.Customers AS C

  LEFT OUTER JOIN

  ( Sales.Orders AS O

      INNER JOIN Sales.OrderDetails AS OD

    ON O.orderid = OD.orderid )

ON C.custid = O.custid;

The execution plan for this query now processes the join as an outer join as shown in Figure 3.

Figure 3: Query plan with outer join

Figure 03 - Plan with outer join.png

The above query represents part of the work involved in the complex sample query [BG: Link?]. In addition, the query is required to return only orders placed since the beginning of 2016. Be careful not to specify the predicate ordedate >= '20160101' in the WHERE clause, since doing so will filter out all customers who didn’t place orders during this period, and our query is supposed to preserve all customers. One way to achieve this is to add this predicate to the ON clause that connects Orders and OrderDetails, like so:


SELECT C.custid, C.country, O.orderid, OD.productid, OD.qty, OD.unitprice

FROM Sales.Customers AS C

  LEFT OUTER JOIN

  ( Sales.Orders AS O

      INNER JOIN Sales.OrderDetails AS OD

    ON O.orderid = OD.orderid

    AND O.orderdate >= '20160101' )

ON C.custid = O.custid;

In an inner join, the ON clause serves a filtering purpose, so the join between Orders and OrderDetails returns only the rows where the order date is on or after 2016. Then the outer join between Customers and the result of the join between Orders and OrderDetails preserves all customers.

Another way to achieve the same task is to add the predicate to the outer join’s ON clause, like so:

SELECT C.custid, C.country, O.orderid, OD.productid, OD.qty, OD.unitprice

FROM Sales.Customers AS C

  LEFT OUTER JOIN

  ( Sales.Orders AS O

      INNER JOIN Sales.OrderDetails AS OD

    ON O.orderid = OD.orderid )

ON C.custid = O.custid

AND O.orderdate >= '20160101';

The two queries are logically equivalent, and in fact, the optimizer produces identical plans for both. To me, the first is more natural. The important thing, like I said, is not to specify the predicate in the WHERE clause because doing so will nullify the outer part of the outer join. I’ll elaborate on the difference between ON and WHERE in a future article in the series that focuses on the WHERE clause.

Controlling the physical processing order of joins

The previous section described how to control the logical processing order of joins. As mentioned, the implementation (SQL Server in our case) doesn’t have to physically process the joins in the same order as the logical processing does, rather it’s only required to produce the same result. So the optimizer explores different join orders and picks the plan with the lowest estimated cost between the ones it explored.

The types of rearrangements that the optimizer can apply depend on the join type that you’re using. For outer joins, the only rearrangement allowed is to change A LEFT JOIN B to B RIGHT JOIN A, and the other way around. But it cannot completely change the order of the tables in a multi join query, since that could change the meaning of the query. With inner and cross joins it’s open season for the optimizer to do any kind of reordering that it likes. This includes the order of appearance of the tables, e.g., rearranging the following joins:

A INNER JOIN B INNER JOIN C INNER JOIN D

To the following joins:

B INNER JOIN C INNER JOIN A INNER JOIN D

This also includes rearranging the packing of units. For example, consider the following arrangement:

A INNER JOIN B

ON A.x = B.x

INNER JOIN C

ON B.y = C.y

INNER JOIN D

ON C.z = D.z

Remember the ON clause determine which units get connected. Written in this manner you get the following implied parentheses:

( ( A INNER JOIN B

ON A.x = B.x )

INNER JOIN C

ON B.y = C.y )

INNER JOIN D

ON C.z = D.z

The optimizer can change this arrangement, for example, to the following:

A INNER JOIN

B INNER JOIN C

ON B.y = C.y

ON A.x = B.x

INNER JOIN D

ON C.z = D.z

Where the implied parentheses are:

( A INNER JOIN

( B INNER JOIN C

ON B.y = C.y )

ON A.x = B.x )

INNER JOIN D

ON C.z = D.z

According to Benjamin Nevarez in the article Optimizing Join Orders, in combinatorics, the number of possible arrangements for N tables is (2N − 2)! / (N − 1)!, where the exclamation point (!) is factorial. For example, with 5 tables you get 1,680 arrangements. With 10 tables you get 17,643,225,600 arrangements! (here the exclamation point is shock). Since the optimizer has time and cost thresholds for optimization, you realize that with more tables in the query, the less likely it is that it will find the truly optimal order. If you suspect that the optimizer chose a grossly inefficient order, you can force it to use an order that you think that is more optimal. You do so by adding the query hint OPTION(FORCE ORDER) at the end of the query. If you want to force join order in all queries in the session, you can set the session option FORCEPLAN to ON. Note that forcing order is recommended mainly as a troubleshooting step, and if used in production, then only as a short term temporary solution. The reason for this is that you want the optimizer to be able to change its choices over time due to changing conditions like changes in indexing and data characteristics.

As an example for forcing physical join ordering, consider the following query:

SELECT C.custid, C.country, O.orderid, OD.productid,

  P.productname, OD.qty, OD.unitprice

FROM Sales.Customers AS C

  INNER JOIN Sales.Orders AS O

ON C.custid = O.custid

  INNER JOIN Sales.OrderDetails AS OD

ON O.orderid = OD.orderid

  INNER JOIN Production.Products AS P

ON Od.productid = P.productid;

On my system the optimizer chose the plan shown in Figure 4 for this query.

Figure 4: Query plan for multi-join query

Figure 04 - Plan for multi-join query.png

The optimizer physically reordered the joins to the following:


SELECT C.custid, C.country, O.orderid, OD.productid,

  P.productname, OD.qty, OD.unitprice

FROM Production.Products AS P

  INNER JOIN

  Sales.Customers AS C

      INNER JOIN

      ( Sales.Orders AS O

          INNER JOIN Sales.OrderDetails AS OD

        ON O.orderid = OD.orderid )

    ON C.custid = O.custid

ON Od.productid = P.productid;

If you think, for example, that the originally written order is the optimal, you can force it, like so:

SELECT C.custid, C.country, O.orderid, OD.productid,

  P.productname, OD.qty, OD.unitprice

FROM Sales.Customers AS C

  INNER JOIN Sales.Orders AS O

ON C.custid = O.custid

  INNER JOIN Sales.OrderDetails AS OD

ON O.orderid = OD.orderid

  INNER JOIN Production.Products AS P

ON Od.productid = P.productid

OPTION (FORCE ORDER);

Remember, the implied parentheses here are the following:


SELECT C.custid, C.country, O.orderid, OD.productid,

  P.productname, OD.qty, OD.unitprice

FROM ( ( Sales.Customers AS C

       INNER JOIN Sales.Orders AS O

     ON C.custid = O.custid )

    
     INNER JOIN Sales.OrderDetails AS OD

       ON O.orderid = OD.orderid )

     INNER JOIN Production.Products AS P

     ON Od.productid = P.productid

OPTION (FORCE ORDER);

The plan for this query is shown in Figure 5.

Figure 5: Query plan with forced order

Figure 05 - Query plan with forced order.png

What’s next?

This article is the second part in the series about logical query processing. Here I focused on the FROM clause, and within it more specifically on the JOIN table operator. I described the logical steps involved in a join, how to control logical join ordering, and how to control physical join ordering. Next month I’ll continue the discussion on the FROM clause looking at additional table operators.

Comments

Plain text