Indexes and Joins

Indexes on a table can help you access one or many rows of data. Having a good index for your query is one of the best ways to improve performance. When your query is attempting to find just a few rows in a very large table, the performance difference between having no usable index on the table and having a good index can be enormous. When you're tuning queries that involve multiple tables (i.e., joins), indexes can also help SQL Server efficiently find the matching rows between tables.

The SQL Server query optimizer can choose from three strategies for performing joins: nested-loop join, merge join, and hash join. In this article, I describe the first two join strategies: nested loop and merge. (For more information about join strategies, see Itzik Ben-Gan and Kalen Delaney, "Advanced JOIN Techniques," December 1999.) In each method, the tables you want to join (or subsets of tables that you've restricted with conditions in the WHERE clause) are the join inputs. If your query involves WHERE conditions in addition to the join, SQL Server might apply the WHERE condition before trying to find the matching rows in the second table. Consider example Query 1 from the Northwind database:

-- Query 1:
SELECT LastName, FirstName, OrderID, OrderDate
FROM Employees e JOIN Orders o
 ON e.EmployeeID = o.EmployeeID

This join returns all the orders from the Orders table; for each order, it also returns the last name and first name of the employee who took the order and the order date. For Query 1, SQL Server examines all the rows in the Employees and the Orders tables, and the join inputs are the entire tables. However, if you qualify this SELECT statement with two WHERE conditions, as Query 2 shows, the join inputs are no longer complete tables:

-- Query 2:
SELECT LastName, FirstName, OrderID, OrderDate
FROM Employees e JOIN Orders o
 ON e.EmployeeID = o.EmployeeID
WHERE OrderDate < '1996-12-01'
AND LastName < 'D'

For Query 2, the join inputs are much smaller than in Query 1. Instead of joining the 9 rows in Employees with 830 rows in Orders, SQL Server must join only 2 rows from Employees with 121 rows from Orders. With a small number of input rows, the query optimizer frequently chooses a different join strategy than it would if the inputs were larger, and it might even choose to join the tables in a different order. As far as the optimizer's join strategy decisions are concerned, size does matter.

Nested Loops

Even if your query joins more than two tables, SQL Server performs the joins by joining only two inputs at a time, and each join in a query might use a different join strategy. The most straightforward type of join—and the type of join most people think about when planning join operations—is the nested-loop join. You can imagine SQL Server operating on the two inputs as if they were arrays in a development language like C or Basic. SQL Server compares each row in one of the inputs with all the rows in the other input to find the matches between the rows. Query 1 tries to find the rows that match on the EmployeeID column. So with the nested-loop join strategy, SQL Server must compare all the Employee ID values in one table with all the EmployeeID values in the other.

The worst-case scenario for a nested-loop join happens when no existing indexes can help SQL Server find the matching rows between the inputs and no indexes can help find the rows that satisfy any WHERE conditions. In this case, the inputs are the entire tables. The query optimizer chooses one table to be the outer table and accesses its rows first. Let's assume the outer table has P1 pages and R1 rows. The second table, the inner table, has P2 pages. SQL Server must read all the pages in the outer table; for each qualifying row on each page, it must then read all the pages from the inner table. To find the number of pages that SQL Server needs to read to produce the results, you can use the following formula:

P1 + R1 * P2

Even if the tables are relatively small, the resulting number of pages to be read gets large very quickly. For an outer table with only 200 pages and 4000 rows (i.e., 20 rows per page) and an inner table with 100 pages, the result is quite large. Tables with 100 or 200 pages aren't unusually big tables, but to process the join, SQL Server would need to access more than 400,000 pages if the tables have no usable indexes.

Indexes can help improve the performance of a nested-loop join in several ways. The biggest benefit often comes when you have a clustered index on the joining column in one of the tables. The presence of a clustered index on a join column frequently determines which table SQL Server chooses as the inner table. If the inner table has a clustered index, SQL Server doesn't have to look through every row of that table. The clustered index leads SQL Server directly to the rows in the inner table that have a join column value that matches the current row in the outer table. So in the formula, instead of the R1 * P2 term, which implies that SQL Server accesses all P2 pages, you can replace P2 with two or three page accesses, depending on how many levels the clustered index has. So the example with 200 pages and 4000 rows in the outer table and 100 pages in the inner table results in 200 + 4000 * 3, or 12,200, page reads—a big improvement over 400,000.

That 4000 in the calculation still makes the result bigger than I'd like. In this case, all 4000 rows in the outer table are part of the result, causing 4000 accesses of the inner table. Another way to reduce the number of page accesses is to reduce the size of the outer input. Besides checking for a clustered index on a join column, the optimizer tries to join the tables with the smaller input first. In Query 1, the Employees table has a clustered index on the join column, EmployeeID, but it's also dramatically smaller than the Orders table. Employees has only 9 rows; Orders has 830. For Query 1, if the optimizer chooses a nested-loop join, it would use the smaller Employees table as the outer input so that it has to access Orders only nine times.

If you have a WHERE condition involving the outer table, the number of qualifying rows decreases, and SQL Server needs to access the inner table fewer times. If you change Query 1 to include a WHERE clause on the Orders table, as Query 3 shows, the plan changes.

-- Query 3:
SELECT LastName, FirstName, OrderID, OrderDate
FROM Employees e JOIN Orders o
 ON e.EmployeeID = o.EmployeeID
 WHERE OrderDate < '1996-12-01'

Now, only 121 rows in Orders are part of the result. That smaller size, combined with the clustered index on the join column in Employees, means that the optimizer now chooses Employees as the inner table. SQL Server will use a nested-loop join because the clustered index lets SQL Server quickly find the matching rows in the inner table.

Figure 1 shows the query plan for Query 3. The first line of the plan tells the join type (Nested Loops) and specifies that the outer table will reference the EmployeeID column. The clustered-index scan on the outer table is the same as a table scan because no existing index can speed up access to that outer table. The WHERE condition on the OrderDate column reduces the number of rows to be returned and the number of times SQL Server must access the inner table. But SQL Server must access all the rows in the outer table to determine which ones have an acceptable OrderDate value. Finally, the query plan shows that SQL Server uses a clustered-index seek for the inner table because the index is on the column SQL Server uses to find the matching rows.

As I've noted in my previous columns, an index on the OrderDate column would be a good thing but wouldn't improve the query's performance nearly as much as a clustered index on the join column. A useful index on a search argument in the outer table means SQL Server doesn't have to access all the pages in the outer table, so the P1 value decreases. Usually, though, the P1 value is dwarfed by the value in the second term, R1 * P2, so a reduction in the P1 value yields only a minor improvement. An index on the outer table doesn't reduce the number of times SQL Server has to access the inner table because SQL Server still has to access the inner table for each qualifying row in the outer table, no matter how efficiently it found those qualifying rows. You can generalize the optimizer's choice of nested-loop join this way: The optimizer most often chooses a nested-loop join if one join input is much smaller than the other and the larger input has a clustered index on the join column.

Merge

In a nested-loop join, an index on the join column for the outer table is of no use. However, when you're tuning queries and tables, you might not always know which table will be the inner and which will be the outer, so you might end up with clustered indexes on the join columns for both input tables. SQL Server can use a merge join when the join inputs are both sorted on the join column, as in the case when both tables have clustered indexes on the join column.

You can think of merge join as combining two sorted lists of values. Suppose you have two piles of contractor information: one pile containing the master contract that each contractor has signed and the second containing descriptions of each project that a contractor worked on. Both piles are sorted by contractor name. You need to merge the two piles of paperwork so that each contractor's master contract is with all the project descriptions for that contractor. So, you basically need to make just one pass through each stack.

Consider Query 1 again: If both Employees and Orders had clustered indexes on the EmployeeID column, SQL Server could perform a merge join. The pseudocode for SQL Server's execution of the merge join would look something like this:

GET one Orders row and one Employees row
DO (until one input is empty);
 IF EmployeeID values are equal
  Return values from both rows
 GET next Orders row
 ELSE IF Orders.EmployeeID <> 
 Employees.EmployeeID
 GET next Employees row
 ELSE GET next Orders row

The query optimizer usually chooses the merge join strategy when both inputs are already sorted on the join column. If the inputs are both already sorted, little I/O is required to process a merge join if the join is one-to-many (1:M). A many-to-many (M:N) merge join uses a temporary table to store rows instead of discarding them as it usually does. If the data includes duplicate values from each input, the second input must return to the start of the duplicates in the temporary table as SQL Server processes each duplicate from the first input. However, in many cases, SQL Server won't use a merge join unless at least one of the join columns is unique.

Here's an example of joining two tables with and without a unique index. Listing 1 creates copies of the Orders and Order Details tables in the Northwind database and builds a clustered index on OrderID in both tables. When you initially join the tables and display the query plan, you'll see that SQL Server chooses a nested-loop join. Although the OrderID column is unique in the Orders table, if you don't specify unique in the index definition, the optimizer isn't aware that the key values are unique. So when you rebuild the clustered index in the Orders table, specifying that it must be unique, the revised query plan shows that SQL Server uses a merge join.

In some cases, SQL Server's optimizer might decide that it's cheap enough to sort one of the inputs before the join, then perform a merge join. If you change Listing 1 slightly so that the initial index built on the Order Details table is nonclustered, as Listing 2 shows, the query plan shows a sort operation before a merge join.

Know Your Tables

Usually, the query optimizer determines the type of join SQL Server will use. You can just write your join query and leave the choice of strategy to SQL Server. However, knowing how SQL Server performs the joins can help you decide which indexes will be most useful. The optimizer usually chooses nested-loop join if one of the join inputs is very small compared with the other input and the larger input has a clustered index on the join column. Merge join is the most likely strategy if both inputs are sorted on the join column and, in particular, if one of the inputs has a unique clustered index.

Both nested-loop and merge joins require appropriate indexes on the tables. However, if your application lets users build ad hoc queries, you might not know in advance what the best columns for indexes are. The third type of join, hash join, lets SQL Server achieve pretty good join performance even when your tables have no useful indexes. In my next column, I'll discuss hash joins.

Comments

Plain text