Logical query processing describes the conceptual interpretation of SQL queries. This article is the sixth part in a series on the topic. Part 1 provided an overview of the concept and a sample database called TSQLV4. It also provided two sample queries which I referred to as simple sample query and complex sample query. Parts 2, 3, 4 and 5 covered the logical query processing aspects of the table operators JOIN, APPLY, PIVOT and UNPIVOT, respectively—all of which are processed as part of the first major clause FROM. In this article I describe the logical query processing aspects of the second major clause—the WHERE clause.
Logical query processing flow chart including WHERE clause
Recall that the first major step in logical query processing handles the FROM clause. The second major step handles the optional WHERE clause. Figure 1 provides a graphical depiction of the first two major steps in logical query processing.
Figure 1: Logical query processing flow chart - FROM and WHERE
As its input, the second step operates on the virtual table that is returned by the first step. It applies a filter based on the predicate that appears in the WHERE clause. For each input row, the outcome of the predicate can be true, false or unknown (when a NULL is involved), and the second step returns only the rows for which the predicate evaluates to true. It discards the rows for which the predicate evaluates to false or unknown.
The following code has the FROM and WHERE parts of our simple sample query:
SELECT C.custid, O.orderid FROM Sales.Customers AS C LEFT OUTER JOIN Sales.Orders AS O ON C.custid = O.custid WHERE C.country = N'Spain';
The result of the first step before applying the WHERE clause is a virtual table with 832 rows. After applying the filter predicate in the WHERE clause, the result is a virtual table with the following 24 rows:
custid orderid ----------- ----------- 8 10326 8 10801 8 10970 22 NULL 29 10366 29 10426 29 10568 29 10887 29 10928 30 10303 30 10550 30 10629 30 10872 30 10874 30 10888 30 10911 30 10948 30 11009 30 11037 69 10281 69 10282 69 10306 69 10917 69 11013
The following code has the FROM and WHERE parts of our complex sample query
The result of the first step before applying the WHERE clause is a virtual table with 701 rows. After applying the filter predicate in the WHERE clause, the result is a virtual table with the following 27 rows:
custid custlocation orderid productid val ------- --------------- -------- ----------- ------------- 8 Spain.Madrid 10970 52 224.0000000 22 Spain.Madrid NULL NULL NULL 57 France.Paris NULL NULL NULL 69 Spain.Madrid 10917 30 25.8900000 69 Spain.Madrid 10917 60 340.0000000 69 Spain.Madrid 11013 23 90.0000000 69 Spain.Madrid 11013 42 56.0000000 69 Spain.Madrid 11013 45 190.0000000 69 Spain.Madrid 11013 68 25.0000000 74 France.Paris 10907 75 108.5000000 74 France.Paris 10964 18 375.0000000 74 France.Paris 10964 38 1317.5000000 74 France.Paris 10964 69 360.0000000 74 France.Paris 11043 11 210.0000000 89 USA.WA.Seattle 10861 17 1638.0000000 89 USA.WA.Seattle 10861 18 1250.0000000 89 USA.WA.Seattle 10861 21 400.0000000 89 USA.WA.Seattle 10861 33 87.5000000 89 USA.WA.Seattle 10861 62 147.9000000 89 USA.WA.Seattle 10904 58 198.7500000 89 USA.WA.Seattle 10904 62 1725.5000000 89 USA.WA.Seattle 11032 36 665.0000000 89 USA.WA.Seattle 11032 38 6587.5000000 89 USA.WA.Seattle 11032 59 1650.0000000 89 USA.WA.Seattle 11066 16 52.3500000 89 USA.WA.Seattle 11066 19 386.4000000 89 USA.WA.Seattle 11066 34 490.0000000 (27 row(s) affected)
WHERE for filtering, ON for matching
A common source for confusion in SQL is figuring out whether a predicate that you need to use in your query belongs in a join’s ON clause or in the WHERE clause. This can be confusing even for people with years of experience writing SQL queries. Let’s start with the fact that with inner joins, at least as far as standard SQL is concerned, both ON and WHERE serve the same filtering purpose; i.e., in both cases if for a given row the predicate evaluates to true, the row is returned, otherwise (if the predicate evaluates to false or unknown) the row is discarded. Consider the following query:
SELECT C.custid, O.orderid, O.orderdate FROM Sales.Customers AS C INNER JOIN Sales.Orders AS O ON C.custid = O.custid WHERE C.country = N'Spain' AND O.orderdate >= '20160101';
Here, since the join is an inner join, the logical meaning of the query is the same irrespective of whether you specify any of the three predicates in the ON clause or the WHERE clause. All predicates are considered filtering predicates. This query generates the following output:
custid orderid orderdate ----------- ----------- ---------- 30 10872 2016-02-05 30 10874 2016-02-06 29 10887 2016-02-13 30 10888 2016-02-16 30 10911 2016-02-26 69 10917 2016-03-02 29 10928 2016-03-05 30 10948 2016-03-13 8 10970 2016-03-24 30 11009 2016-04-08 69 11013 2016-04-09 30 11037 2016-04-21 (12 row(s) affected)
From a logical query processing perspective, the WHERE clause is evaluated after the FROM clause with all of its table operators like the join in our example. But from a physical query processing perspective, SQL Server can apply certain rearrangements as long as the query meaning (final result set) is preserved. For example, SQL Server will often apply what’s called predicate pushdown where instead of processing the predicates from the WHERE clause that are applied to a single table after the join, it processes them before the join. For one, if there are supporting indexes, they can be utilized to handle the filters, and even if not, reducing the input sets before the join leaves the join with less work to do. Figure 2 shows the execution plan for our query.
Figure 2: Predicate pushdown
Notice predicate pushdown took place. Currently there are no indexes to support the filters so the plan performs scans of the two input tables, but it still applies predicate pushdown where it evaluates the filters as part of the scans to reduce the sets that the join needs to operate on.
Unlike with inner joins, with outer joins the ON and WHERE clauses serve very different roles. The WHERE clause still serves the usual filtering role against the rows from the virtual table returned by the FROM clause. The ON clause serves a more sophisticated matching role. In an outer join you mark a table as preserved. From example, in Sales.Customers AS C LEFT OUTER JOIN Sales.Orders AS O ON C.custid = O.custid, you mark the Customers table as preserved, and consider the predicate C.custid = O.custid as a matching predicate. This means that you want all rows from the Customers table preserved irrespective of the outcome of the matching predicate. What the ON clause determines is which rows from the nonpreserved side (Orders) to match to rows from the preserved side (Customers). A customer without a single matching order is still returned with NULLs used as place holders for the attributes from Orders.
A common bug that results from confusion around matching versus filtering is applying what’s supposed to be a matching predicate in the ON clause as a filtering predicate in the WHERE clause. For example, suppose that our query needs to filter only customers from Spain, but match orders to customers only if the customer ID in both sides is the same and the order was placed on or after 2016. Intuitively, people are used to specifying predicates that compare elements from both sides in the ON clause and predicates that are applied to elements from one side in the WHERE clause. Applying this pattern in our query results in a bug:
SELECT C.custid, O.orderid, O.orderdate FROM Sales.Customers AS C LEFT OUTER JOIN Sales.Orders AS O ON C.custid = O.custid WHERE C.country = N'Spain' AND O.orderdate >= '20160101';
The predicate C.country = N'Spain' is supposed to be a filtering predicate and is used as such in our query. You want to keep only customers from Spain and discard the rest. However, the predicate O.orderdate >= '20160101' is supposed to be a matching predicate since you want to preserve customers who didn’t place orders during this period, but you currently apply it as a filtering predicate. So customers who did not place orders during this period are discarded. In fact, the join in this query effectively becomes an inner join. Any outer rows produced by the outer join for customers without orders have NULLs in the orderdate column, and the filter O.orderdate >= '20160101' discards those. The SQL Server optimizer applies contradiction detection and converts the outer join to an inner, producing the same plan shown earlier for the inner join query in Figure 2. To fix the bug you need to apply the predicate O.orderdate >= '20160101' as a matching predicate in the ON clause, like so:
SELECT C.custid, O.orderid, O.orderdate FROM Sales.Customers AS C LEFT OUTER JOIN Sales.Orders AS O ON C.custid = O.custid AND O.orderdate >= '20160101' WHERE C.country = N'Spain';
This query generates the following output:
custid orderid orderdate ----------- ----------- ---------- 8 10970 2016-03-24 22 NULL NULL 29 10887 2016-02-13 29 10928 2016-03-05 30 10872 2016-02-05 30 10874 2016-02-06 30 10888 2016-02-16 30 10911 2016-02-26 30 10948 2016-03-13 30 11009 2016-04-08 30 11037 2016-04-21 69 10917 2016-03-02 69 11013 2016-04-09 (13 row(s) affected)
Notice that customer 22 is a customer from Spain who did not place orders since the beginning of 2016. The plan for this query is shown in Figure 3.
Figure 3: Plan with outer join
This time the optimizer processes the join as an outer join.
Search arguments and equality versus distinctness
Even though this article’s focus is logical query processing, I do want to spend some time describing some aspects of physical query processing and the areas where it differs from logical processing. When it comes to query tuning, one of the most critical concepts to understand is what a search argument is, or SARG in short. A SARG is a filter predicate that enables the use of an index in a manner that relies on index ordering, such as applying a seek. Here’s the general form of a SARG:
The operator needs to be one that represents a consecutive range of qualifying rows in an index on the filtered column. It can be =, >, >=, <, <=, BETWEEN, > AND <=, >= AND <, and so on. It cannot be <>, for example. The filtered column must be unmanipulated. The expression on the other side can be manipulated. For example, the following is a SARG:
WHERE col1 > @p
If you had an index on col1, SQL Server can apply a seek in the index to handle the filter. The following is not a SARG since you apply manipulation to the filtered column:
WHERE col1 + 1 > @p
So here, SQL Server will have to scan the data instead of using a seek in the index. In this case, you can easily convert the filter to a SARG by subtracting 1 from @p instead of adding it to col1, like so:
This is a SARG:
WHERE col1 > @p - 1
If you wonder how come SQL Server’s optimizer doesn’t apply such internal rearrangements, it’s a good question, but the fact is that in most cases it doesn’t. It could be that Microsoft chose not to add such logic to the optimizer to not make the optimization process take too long and this way become counterproductive. In cases where they can give us simple best practices to follow, not including such logic in the optimizer itself allows a more efficient optimization process.
As another example, the following query (call it Query 1) is a sargable query (a query with a search argument) since the filter predicate doesn’t apply manipulation to the filtered column:
SELECT custid, country, region FROM Sales.Customers WHERE region = N'WA';
Before you run the query, create the following index to support the filter:
CREATE INDEX idx_rgn_i_cid_ctry ON Sales.Customers(region) INCLUDE(custid, country);
The execution plan for Query 1 is shown in Figure 4.
Figure 4: Plan for Query 1
Notice that the filter predicate is applied as a seek predicate since it is considered a SARG.
The following query (call it Query 2) filters only customers with a region that starts with the letter W:
SELECT custid, country, region FROM Sales.Customers WHERE LEFT(region, 1) = N'W';
This query is not sargable since it applies manipulation to the filtered column. Figure 5 shows the plan for this query.
Figure 5: Plan for Query 2
As you can see, the plan scans the covering index instead of applying a seek, despite the fact that the qualifying rows appear in a consecutive range in the index. To fix this problem, use the LIKE predicate instead of the left function, like so:
SELECT custid, country, region FROM Sales.Customers WHERE region LIKE N'W%';
This time the query is sargable as you can see in the query plan shown in Figure 3.
Figure 6: Plan for Query 3
Suppose that you need to write a query in a stored procedure or a user defined function that filters only customers from a region that is provided as input. The following code uses a local variable to emulate the routine’s parameter:
DECLARE @region AS NVARCHAR(40) = N'WA'; -- also try with NULL SELECT custid, country, region FROM Sales.Customers WHERE region = @region;
This query is sargable, and it returns correct results as long as the input is not NULL, like N'WA' in this example. For some customers the region is inapplicable and therefore is set to NULL. The NULL is SQL’s marker for a missing value—whether it’s missing and applicable or missing and inapplicable (our case). SQL uses three-valued predicate logic, which means that any comparison involving a NULL, either in one or both operands, results in neither the logical value true nor false, but rather in the logical value unknown. This is the case with both equality and inequality-based comparisons. So what if the above query needs to support a NULL input, in which case you want it to return all customers that have a NULL region? Try it. You get an empty set back since a comparison between two NULLs with an equality operator yields unknown, and a query filter returns only rows for which the filter predicate evaluates to true. It discards rows for which the predicate evaluates to false and unknown. As written, the query has a bug. A common way for people to fix the bug is to use the ISNULL or COALESCE function in both operands of the comparison to replace a NULL with a value that cannot normally appear in the data, like so (call this Query 4):
DECLARE @region AS NVARCHAR(40) = N'WA'; -- also try with NULL SELECT custid, country, region FROM Sales.Customers WHERE ISNULL(region, N'
') = ISNULL(@region, N' ');
Unfortunately, though, since you apply manipulation to the filtered column, the query is not sargable, as you can see in the plan shown in Figure 7:
Figure 7: Plan for Query 4
One way to make the query sargable is to check for the special case where both operands of the comparison are NULL using the IS NULL operator instead of the equality operator, like so (call this Query 5):
DECLARE @region AS NVARCHAR(40) = NULL; SELECT custid, country, region FROM Sales.Customers WHERE region = @region OR (region IS NULL AND @region IS NULL);
SQL Server knows to treat this form as a SARG as you can see in the plan for this query in Figure 8.
Figure 8: Plan for Query 5
Curiously, standard SQL supports a distinct predicate as an alternative to equality and inequality comparison. The form of the predicate is
WHERE region IS NOT DISTINCT FROM @region
Alas, T-SQL doesn’t support the distinct predicate, but as illustrated by Paul White in his article Undocumented Query Plans: Equality Comparisons, hope is not lost. Following is an elegant alternative that is supported in T-SQL (call this Query 6):
DECLARE @region AS NVARCHAR(40) = NULL; SELECT custid, country, region FROM Sales.Customers WHERE EXISTS (SELECT region INTERSECT SELECT @region);
This technique relies on the fact that set operators (UNION, EXCPET and INTERSECT) use distinctness—not equality—when comparing rows. When region is not distinct from @region, the INTERSECT operator returns one row, EXISTS returns true, and the row in the outer query is returned. When region is distinct from @region, the INTERSECT operator returns an empty set, EXISTS returns false, and the row in the outer query is returned. This is exactly the desired behavior. Remarkably, SQL Server’s optimizer considers this form as a SARG and the plan that you get for this query (Query 6) is the same as the one shown earlier for Query 5 in Figure 8. Since the predicate is considered a SARG it enables an index seek.
A similar construction can be used in a join when you want a distinctness based comparison, like so:
SELECT ... FROM dbo.T1 INNER JOIN dbo.T2 ON EXISTS (SELECT T1.col1 INTERSECT SELECT T2.col1);
As mentioned, the standard alternative to “not equal to,” but where you get true when one operand is NULL and the other isn’t, is IS DISTINCT FROM. With our region column and @region parameter you would use the following filter predicate:
WHERE region IS DISTINCT FROM @region
Again, this form is not supported in T-SQL, but the following elegant form is:
WHERE NOT EXISTS (SELECT region INTERSECT SELECT @region)
Or, instead of checking for an empty set intersection, you can check for a nonempty set difference, like so:
WHERE EXISTS (SELECT region EXCEPT SELECT @region)
One of the aspects of logical query processing that can be confusing is the fact that all expressions that appear in the same logical step are evaluated as a set, and since a set has no order, there’s no guarantee that SQL Server will evaluate the expressions in order of appearance. I’ll demonstrate this through an example.
Use the following code to create and populate a table called Properties:
SET NOCOUNT ON; USE tempdb; IF OBJECT_ID(N'dbo.Properties', N'U') IS NOT NULL DROP TABLE dbo.Properties; GO CREATE TABLE dbo.Properties ( name VARCHAR(128) NOT NULL CONSTRAINT PK_Properties PRIMARY KEY, datatype VARCHAR(128) NOT NULL, val VARCHAR(500) NOT NULL ); INSERT INTO dbo.Properties(name, datatype, val) VALUES ('property1', 'SMALLINT', '1759' ), ('property2', 'VARCHAR', 'abc' ), ('property3', 'INT', '43112609'), ('property4', 'DATE', '20110212');
The table holds various properties, with columns for the property name, datatype name and value, with the last stored as a character string. Suppose that you want to filter only integer properties that are greater than 10. You use the following query attempting to achieve this:
SELECT name, datatype, val FROM dbo.Properties WHERE datatype IN ('TINYINT', 'SMALLINT', 'INT', 'BIGINT') AND CAST(val AS BIGINT) > 10;
In this query you assume that SQL Server will evaluate the predicates in order of appearance from left to right, and that it will short-circuit when the property is not an integer one. But since all expressions in the same logical step are treated as a set, you don’t have an assurance that SQL Server will process them in written order. When I ran this query in my system, it failed with the following error:
Msg 8114, Level 16, State 5, Line 138 Error converting data type varchar to bigint.
Looking at the Predicate property of the Clustered Index Scan operator in the query execution plan reveals that SQL Server rearranged the filter predicates as follows:
CONVERT(bigint,[tempdb].[dbo].[Properties].[val],0)>(10) AND ( [tempdb].[dbo].[Properties].[datatype]='BIGINT' OR [tempdb].[dbo].[Properties].[datatype]='INT' OR [tempdb].[dbo].[Properties].[datatype]='SMALLINT' OR [tempdb].[dbo].[Properties].[datatype]='TINYINT')
Even though SQL Server actually supports a concept of a short-circuit, it does not guarantee that it will evaluate your filter predicates in written order. This is in line with the logical query processing design so it shouldn’t be surprising. But here’s where things do get surprising. You might think that you can circumvent the problem by encapsulating a query that filters only integer properties in a table expression (CTE, derived table, view), and applying a query with a conversion against the table expression, like so:
WITH C AS ( SELECT name, datatype, val FROM dbo.Properties WHERE datatype IN ('TINYINT', 'SMALLINT', 'INT', 'BIGINT') ) SELECT * FROM C WHERE CAST(val AS BIGINT) > 10;
From a logical query processing perspective, such code should not fail. However, for performance reasons, the SQL Server parser unnests, or inlines, the inner query’s code in the outer query, resulting in code that is equivalent to the original query without the table expression. Consequently, the code fails with the same error.
One solution to this problem is to use the TRY_CAST function instead of the CAST function to handle the conversion. The former attempts the conversion, and if successful, returns the converted value; but if unsuccessful, instead of failing it returns a NULL. This way, even if SQL Server evaluates the conversion before the other predicate, the code will not fail. A similar TRY_CONVERT function is provided as an alternative to the CONVERT function. Here’s the code that implements this solution:
SELECT name, datatype, val FROM dbo.Properties WHERE datatype IN ('TINYINT', 'SMALLINT', 'INT', 'BIGINT') AND TRY_CAST(val AS BIGINT) > 10;
Currently SQL Server supports TRY_% functions only for conversion purposes. It would be nice to have a similar function such as TRY_EXPRESSION for any type of scalar expression, since similar problems can arise due to other errors (divide by zero, overflow, input out of domain, and so on).
Another solution is to use a CASE expression that proceeds with the conversion only if the datatype is an integer, like so:
SELECT name, datatype, val FROM dbo.Properties WHERE CASE WHEN datatype IN ('TINYINT', 'SMALLINT', 'INT', 'BIGINT') THEN CAST(val AS BIGINT) END > 10;
SQL Server’s documentation of the CASE expression says: "You should only depend on order of evaluation of the WHEN conditions for scalar expressions (including non-correlated sub-queries that return scalars), not for aggregate expressions."
In our case we do not use an aggregate expression so we can rely on evaluation order and a short-circuit.
For more details on the topic, see the article Unwanted T-SQL Expression Failures.
WHERE and column aliases
Often people want to use column aliases that they created in the SELECT list for columns that result from computations in the WHERE clause, like so:
SELECT custid, CONCAT(country, N'.' + region, N'.' + city) AS custlocation FROM Sales.Customers WHERE custlocation IN (N'Spain.Madrid', N'France.Paris', N'USA.WA.Seattle');
However, remember that in terms of logical query processing, the WHERE clause (step 2) is evaluated before the SELECT clause (step 5). Consequently, aliases created in the SELECT clause are not visible to expressions in the WHERE clause. This code generates the following errors:
Msg 207, Level 16, State 1, Line 204 Invalid column name 'custlocation'. Msg 207, Level 16, State 1, Line 204 Invalid column name 'custlocation'. Msg 207, Level 16, State 1, Line 204 Invalid column name 'custlocation'.
The reason that you get three errors is that the predicate custlocation IN (N'Spain.Madrid', N'France.Paris', N'USA.WA.Seattle') is internally converted to a conjunction of three predicates: custlocation = N'Spain.Madrid' OR custlocation = N'France.Paris' OR custlocation = N'USA.WA.Seattle'.
An obvious workaround is to use a table expression such as a CTE or a derived table, where you create the alias in the inner query and use it anywhere you like in the outer query. A more elegant solution is to combine the use of the APPLY operator with the VALUES clause (table value constructor), and this way create the aliases that you need very early in logical processing, as part of the processing of the FROM clause. This will make the aliases available to clauses that are evaluated in subsequent phases, like the WHERE clause. Applied to our example, the solution code looks like this:
SELECT C.custid, A.custlocation FROM Sales.Customers AS C CROSS APPLY ( VALUES( CONCAT(C.country, N'.' + C.region, N'.' + C.city) ) ) AS A(custlocation) WHERE A.custlocation IN (N'Spain.Madrid', N'France.Paris', N'USA.WA.Seattle');
For more details on this technique and further examples, see Logical Query Processing Part 3: The FROM Clause and APPLY.
Would that it WHERE so simple
You would think that the WHERE clause is just a basic filter and that there shouldn’t be much to say about it. As it turns out, there’s a lot involved. Logical query processing explains why you cannot refer to aliases that were defined in the SELECT clause in the WHERE clause, and why there’s no assurance for the order in which the expressions in the WHERE clause will be evaluated. You also want to make sure you understand the complexities of NULL treatment, such as the difference between equality-based and distinctness-based comparisons. A good understanding of this topic helps you write correct and robust code. It’s also important to understand physical query processing considerations, such as which predicate forms constitute a search argument and which don’t, so that you can write optimal queries.