Take Control of Joins

Downloads
40621.zip

When you write a multitable join that uses inner joins exclusively (or for that matter, only cross joins or only full outer joins), the table order you specify in the FROM clause doesn't affect the query result. However, when you're using left or right outer joins, if you revise the table order in the FROM clause, you change the logical meaning of the query. When you're writing a multitable query in which you need to join different tables by using different join types (e.g., some using inner joins, others using left or right outer joins), finding a logically correct solution can be tricky. Most T-SQL programmers are familiar with the conventional manner of writing joins—for each pair of tables you're joining, you write the left table name, the join type, the right table name, and the JOIN condition, in that order.

However, most T-SQL programmers don't know about a couple of undocumented techniques you can use for writing multitable joins. The main advantage of these undocumented techniques is that both give you more power to control the logical order of join processing than the conventional techniques do. With more options to control the logical order of join processing, you can write queries that closely resemble the way you think. Although I discuss both techniques for the sake of completeness, I recommend that you use only the first technique because it's much more readable and easier to maintain than the second.

The Orders Case Study

To demonstrate the undocumented join syntaxes, I use a common order-application scenario that uses four tables: Customers, Orders, OrderDetails, and Products. First, run the code in Listing 1 to create and populate the tables with sample data.

Each customer in the Customers table might have zero or more orders in the Orders table. In the sample data, Active customer 1001 placed two orders, while Inactive customer 1002 placed no orders. Each order in the Orders table can have one or more lines in the OrderDetails table, each containing a different product. Each product in the Products table might appear in zero or more order lines in the OrderDetails table. In the sample data, a customer ordered Used product 301 and Used product 302 at least once, but no customers ordered Unused product 303.

If you want to return order-detail information that includes the customer's company name, the order ID, the product name, and the quantity without returning either customers who placed no orders or products that no one ordered, you need to use inner joins exclusively, as the code in Listing 2 shows. This code generates the result that Figure 1 shows.

Adding a slight twist to the previous requirement forces you to mix inner and outer joins. Say you now also need to return customers who placed no orders and products that no one ordered. Figure 2 shows the desired result. Handling this task using conventional join techniques is fairly simple, but I find this scenario useful to demonstrate the undocumented syntaxes. In practice, you might need to handle much more complex problems that might be simpler to solve by using the undocumented join methods. Once you learn the syntaxes of the techniques I present, you can implement them in any scenario.

Using Conventional Join Techniques

The query that Listing 3 shows is one way to solve the problem by using conventional join techniques. The first step is to use a left outer join between Customers and Orders based on matching values in the custid columns of both tables. The result of the first join is a hidden virtual table (call it VT1) that contains matching rows from both tables and shows NULLs in the columns from the Orders table for customers who made no orders. Next, the code joins VT1 to the OrderDetails table. Typically, you'd use an inner join between Orders and OrderDetails. However, using an inner join between VT1 and OrderDetails eliminates customers who placed no orders because you already returned those with NULLs in the orderid column (the join column).

To keep in the result customers who made no orders, you must use a left outer join to the OrderDetails table or join Orders to OrderDetails, then use a right outer join to Customers. This process returns another hidden virtual table (call it VT2). Next, the code performs a full outer join to return matching rows from VT2 and Products, customers who made no orders, and products that no one ordered.

Usually, the best solution represents the way you think so that you and others can more easily understand the code when you review it. But Listing 3's solution might not represent your way of thought. For example, because Orders and OrderDetails are closely related, I prefer a simple solution that performs a left outer join between Customers and the result of an inner join between Orders and OrderDetails, then performs a full outer join to the Products table. Using conventional join techniques, I'd revise the query as Listing 4 shows to match my way of thinking.

In Listing 4, I used a derived table called OOD to perform the inner join between Orders and OrderDetails as a separate step. However, the query became longer than the query in Listing 3. In more complex scenarios, you might end up nesting derived tables and producing code that's not easily readable, thus missing the whole point of creating a query that's easy to maintain. So with conventional join techniques, you might need to choose between two undesirable solutions: one that doesn't match the way you think and another that uses many derived tables and is overly long. The following undocumented join techniques can help in such situations.

Using Parentheses

The first technique I discuss is the one I recommend using because it lets you write highly readable queries and gives you full control of the logical order of join processing. Note that SQL Server's query optimizer can access the tables in a different order than the logical order you specify in the query if doing so will produce a more efficient plan while still returning the correct results. Similar to the way you use parentheses to determine evaluation order of regular expressions, you can use parentheses to determine join-evaluation order.

Consider the query in Listing 5, which provides a solution to our task. Note that I didn't use a derived table here; I just enclosed the inner join between Orders and OrderDetails in parentheses, forcing SQL Server to process the join in the parentheses as an independent step. You can even nest parentheses if necessary. This example doesn't require nested parentheses, but for illustration purposes, Listing 6's query encloses in parentheses the result of the left outer join between Customers and the result of the inner join between Orders and OrderDetails.

Using JOIN-Condition Order

Because the second undocumented technique isn't very intuitive, I recommend that you don't use it. The idea behind this technique is to control the logical order of join processing by the order in which you specify JOIN conditions. The difference between this technique and the conventional techniques is that with this one, you don't necessarily specify a JOIN condition right after the table names it refers to. You might specify several pairs of tables one after the other and several JOIN conditions one after the other. With this technique, the query in Listing 7 provides a solution to our task. By specifying the JOIN condition between Orders and OrderDetails first, as the code at callout A shows, and the one between Customers and Orders second, as the code at callout B shows, you determine the join-processing order to be Orders joined to OrderDetails first, then Customers joined to the result of joining Orders to OrderDetails.

The code in Listing 8 shows another variation of a query that uses this technique and provides a solution to our task. This code specifies all tables together and all JOIN conditions together. According to the JOIN-condition order, the code performs an inner join between Orders and OrderDetails first (call the result VT1). Next, it performs a left outer join between Products and VT1 (call the result VT2). Finally, the code performs a full outer join between Customers and VT2, returning the desired result.

Note that certain rules govern the relationship between the table order and the JOIN-condition order you specify. Because this technique is undocumented, I haven't found any source that explains the reason behind this logic, but it seems that the JOIN conditions must follow a chiastic relationship to the table order. That is, if you specify tables T1, T2, T3, and T4 in that order and the JOIN conditions match T1 with T2, T2 with T3, and T3 with T4, you must specify the JOIN conditions in the order opposite to the table order, like this:

T1 <join_type> T2
T2 <join_type> T3
T3 <join_type> T4
ON T4.key = T3.key
ON T3.key = T2.key
ON T2.key = T1.key

To look at this join technique in a different way, a given JOIN condition can refer only to the table names right above it or table names that earlier JOIN conditions already referred to and resolved. In the preceding inline pseudo code, the first JOIN condition can refer only to tables T3 and T4 because these are specified right above it. The second JOIN condition can refer to T3, T4, or T2 because the first JOIN condition already resolved T3 and T4 and because the join above T3 <join_type> T4 is between T2 and T3. I'll leave these rules with you to think about; maybe you can figure out the reasoning behind them.

Simpler Is Better

Discovering new T-SQL techniques is always exciting, especially when they make your life simpler. Again, I recommend using the first technique I discussed because it's much more intuitive than the second and usually lets you write short code that matches the way you think. If your thinking is complex, your code is bound to be complex, but if you think simple, now you have a tool to write simple code.

Comments

Plain text