T-SQL query can use any of three clauses—ON, WHERE, and HAVING—for specifying logical expressions that SQL Server uses as filters. The three filters play different roles, but sometimes it can be difficult to know which clause you should specify a logical expression in. Most T-SQL programmers know when to specify a logical expression in the HAVING clause because SQL Server applies the HAVING clause after grouping the input data. But confusion around when to use the ON and WHERE clauses leads to some of the most frequently asked questions in the public SQL Server programming newsgroup (news://msnews.microsoft.com/microsoft.public.sqlserver.programming). Thanks to Lubor Kollar, a program manager with the SQL Server Engine team, for suggesting that I write an article to clarify this often perplexing subject.
Logical Order of Query Processing
To figure out where to specify filters, you need to understand the logical order that ANSI defines for processing a query's clauses. In this article, I consider only the logical aspects of query processing, not the physical ones. During physical processing, the optimizer takes shortcuts in generating an execution plan for a query if it knows that the result will be correct.
My examples use the Customers and Orders tables, which you create and populate by running the code that Listing 1 shows. The Customers table contains rows for four customers: customers A, B, and C from Seattle and customer D from Denver. The Orders table contains five orders. Customer A placed one order, and customer B placed three orders. An unknown customer placed one order, so this row has a NULL in the custid column.
In most programming environments, the server processes the code in the order you write it in. But SQL Server doesn't process the clauses in the order you write them in, mainly because you first specify (in the SELECT list) what you want to see in the result, but SQL Server has to access, join, filter, and group the data before it can generate the final result. Figure 1 shows the general syntax of a query that contains all ANSI clauses.
Let's first look at the logical processing steps of the three clauses in which you can specify logical expressions for filtering. I'll provide a short description for each step, then get into more details later by walking through a couple of examples. As defined by ANSI, the logical processing steps of any query that follows the form that Figure 1 shows are
1.Generate a Cartesian product (cross join) between T1 and T2, which matches all rows from T1 with all rows from T2. Call the resulting virtual table V1. The number of rows in V1 is the product of the number of rows in T1 and the number of rows in T2.
2.Apply the join condition (the filter specified in the ON clause) to the rows in V1. Only rows for which this filter returns TRUE continue to the next step. Call the virtual result table V2.
3.If the join type is outer, mark the tables as preserved or unpreserved. For a left outer join, SQL Server marks the left table (T1) as preserved; for a right outer join, the right table (T2) is marked as preserved; and for a full outer join, both tables are preserved. Add to the rows in V2 all rows from the preserved table that had no match in the unpreserved table, and call the result V3. Note that SQL Server returns NULLs in the columns from the unpreserved table. If the FROM clause specifies more than two tables, apply steps 1 through 3 repeatedly—join V3 to the third table, join that result to the fourth table, and so on.
4.Apply the WHERE filter. Only rows for which the filter returns TRUE can continue to the next step. Call the result V4.
5.Group the rows from V4 according to the GROUP BY list. Each unique combination of values in the GROUP BY list represents a group. Call the result V5.
6.Apply the HAVING filter to the rows from V5. Only rows for which the filter returns TRUE continue to the next step. Call the result V6.
7.Process the SELECT list; if DISTINCT is present, remove duplicates. Call the result V7.
8.Sort the rows in V7 according to the ORDER BY list. Call the result V8.
9.If a TOP clause exists, process it (TOP is non-ANSI). Call the result V9.
If a query doesn't contain all possible query clauses, SQL Server simply skips the steps related to those clauses. Now, let's look at two examples and process the queries by applying these steps.
Customers Who Placed No Orders
Suppose you need to write a join query that returns customers who have registered but not yet placed orders. According to the sample data, the expected result is customers C and D, as Figure 2 shows. The join query in Listing 2 generates the desired result. Why did I specify C.custid = O.custid in the ON clause and O.custid IS NULL in the WHERE clause and not both filters in the ON clause? If you try specifying both these logical expressions in the ON clause, you get all customers in the result, not just C and D. To understand why you must specify the first logical expression in the ON clause and the second in the WHERE clause, let's apply the logical processing steps to the query.
Generating a Cartesian product of Customers and Orders produces a result table (V1) that has 20 rows (4 customers x 5 orders). When you apply the join condition (C.custid = O.custid) to the rows in V1, the possible results from the join condition are TRUE, FALSE, and UNKNOWN. Figure 3 shows the rows in V1 and the result of the join condition for each row under the match? column. Only the rows where match? is TRUE appear in V2, which the next step uses. Notice that match? is UNKNOWN for rows where O.custid is NULL, and V2 doesn't include FALSE or UNKNOWN. Figure 4 shows V2's rows.
In case of an outer join, Step 3 says to add to the rows in V2 all rows from Customers (the preserved table) that had no match in Orders. In this case, customers C and D had no match, so SQL Server adds them to the rows from V2, with NULLs in the Orders table columns. The result is V3. Next, SQL Server applies the WHERE filter (custid IS NULL) to the rows in V3. Possible results from any logical expression are TRUE, FALSE, and UNKNOWN, but because you're using the IS NULL predicate, you can get only TRUE or FALSE. Figure 5 shows the rows in V3 and the result of the filter for each row under the match? column. Only the rows where match? is TRUE (the last two rows) move on to the next step.
The rest of the query processing is simply extracting the C.custid and C.city values from V4. You get the desired results: customers C and D, who made no orders.
Now we can discuss the difference between the ON clause and the WHERE clause and why I specified ON C.custid = O.custid and WHERE O.custid IS NULL. SQL Server applies the ON clause before adding the outer rows and applies the WHERE clause after. The outer join generates a NULL to identify a customer who placed no orders, so you must test for NULLs after applying the outer join and adding the outer rows—namely, in the WHERE clause.
I used the ANSI SQL-92 join syntax in Listing 2's query. SQL Server veterans might remember that before SQL Server adopted the ANSI SQL-92 join style, T-SQL supported (and still does for backward compatibility) an old-style syntax in which *= represents a left outer join. This old-style syntax doesn't have two separate clauses (ON and WHERE) for logical expressions; it has only the WHERE clause in which you specify both your join condition and your filter. An attempt to write Listing 2's query in the old-style syntax looks like this:
SELECT C.* FROM Customers AS C, Orders AS O WHERE C.custid *= O.custid AND O.custid IS NULL
You get all customers in the result instead of just those who placed no orders. You can now understand why: SQL Server treated the whole logical expression as the join condition, generating an empty set from Step 2. Then, it added back the customers who had no match (all customers).
Seattle Customers Who Placed Fewer Than Three Orders
To test your understanding of the difference between the filters and of logical query-processing steps, write a query that returns the Seattle customers who placed fewer than three orders. Figure 6 shows the desired result. When you're done, compare your query to Listing 3's solution.
Here are the logical processing steps. Steps 1 and 2 join Customers to Orders based on a C.custid = O.custid match in the join condition. Step 3 adds customers who made no orders to the result. In Step 4, SQL Server applies the WHERE filter C.city = 'Seattle'; only the rows in which the customer's city is Seattle remain. Step 5 groups rows from Step 4 by C.custid, so each group is a unique customer. Step 6 applies the HAVING filter COUNT (O.orderid) to the groups from Step 5, so only groups with fewer than three orders remain. Finally, Step 7 processes the SELECT list (C.custid, C.city) and returns the customer ID and city.
This solution has two tricky parts. One is the placement of the C.city = 'Seattle' filter in the WHERE clause. If you mistakenly placed this filter in the ON clause, you probably noticed that the query returned customer D even though customer D is from Denver. This happens because although Step 2 removes the row for customer D, Step 3 adds it back again when it adds the outer rows that had no match.
The other tricky part is to specify COUNT(O.orderid) and not COUNT(*). If you specified COUNT(*), you got a count of 1 (customer C) instead of 0 for customers who made no orders. The reason for the unexpected result is that in the result of Step 3, SQL Server added an outer row for customer C with NULLs in the columns from Orders. COUNT(*) counts the rows regardless of whether they resulted from a match or a non-match in the join condition and regardless of whether the columns have known values or NULLs. COUNT(O.orderid) correctly counts only rows where orderid isn't NULL, which can result only from a match in the join condition.
You're in Control
I hope this explanation clears up some of the confusion about where to specify query filters. Remember: When you know the basics, the advanced problems become simple. Once you've mastered the rules of logical query processing, you'll be in full control of when SQL Server logically applies your filters.