The Best Indexes for Joins

Downloads
24483.zip

Test and test again for optimization

You know by now that SQL Server can perform various types of joins (i.e., hash, merge, nested loop) and that different indexes can be useful to each join type in different ways. As I mentioned in "Hashing for Performance," April 2002, InstantDoc ID 24024, I've found no easy algorithm that's guaranteed to find the best indexes for any given query or table. However, I've devised a series of tests you can run to determine the best indexes for a join query. In this article's examples, you can run a test query against the Northwind sample database and experiment with indexes and performance. (Thanks to my colleague Kimberly Tripp-Simonnet for performing the initial testing of these indexes on much larger tables than those in Northwind.)

Most of the examples I've used in past columns have joined tables that have just one join column in common. If your join columns are composite, any index you explicitly build to support the join should include all the columns. But what if you have one join column and one or more search arguments? Is it better to index the join column or a column in a highly selective search argument? What happens if you have a composite index on both the join column and a search argument? In SQL Server 7.0 and earlier, I've never observed the optimizer choosing an index that supported both a join and a search argument for the same query. However, the SQL Server 2000 optimizer is much smarter, so I recommend that you add such indexes to your list of possibilities to consider and test.

Try It Out

Let's run a series of tests, using different indexes on two tables involved in a JOIN operation. The test query also has a search argument for each of the tables. The query runs against the data in the Northwind database's Orders and Order Details tables, but to avoid conflicting with the existing indexes on these tables, use the code that Listing 1 shows to make copies of the tables first.

How can you tell which indexes result in better performance? If the tables have millions of rows, you can use your clock or wristwatch. In Kimberly's original tests, the join with no indexes took more than 30 seconds, so she could easily recognize an improvement. But queries against the Northwind tables all seem to execute in just a second or two, so how can you really tell what works best?

In SQL Server 6.5 and earlier, one main way to compare the efficiency of one query against another is to look at the values that STATISTICS I/O returns—in particular, the value for logical page reads. However, as of SQL Server 7.0, that method is no longer sufficient. As I've mentioned, for hash joins, SQL Server might have to pass through each table only once, performing a minimal number of reads. However, the memory space required, plus any other internal processing, can make hash joins much more expensive than other types.

So for SQL Server 2000 and 7.0, to compare performance as I add different indexes, I usually look at the CPU time value in addition to or instead of looking at STATISTICS I/O. Make sure to check the CPU time reported after the query has executed. The CPU time that SQL Server reports before executing the query is the time required for parsing, optimization, and compilation. I also don't pay much attention to the elapsed-time value because factors other than the query's efficiency can affect that value. For example, other users on the server computer affect elapsed time, even if those users are running non—SQL Server applications.

Listing 2 shows the query that I want to tune and the statements that turn on the statistics measurements. In the STATISTICS I/O output, I'm interested in only the value for logical reads, which indicates how many pages need to be accessed. I'll leave it to you to use the SET STATISTICS PROFILE ON command to examine the query plans.

When I ran the test query with no indexes on the tables, I got the following output:

Table 'od2'. Scan count 1, logical
reads 10, physical reads 0, read-ahead reads 0.
Table 'o2'. Scan count 1, logical reads 21, physical reads 0, read-ahead reads 0.
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 16 ms.

Let's use the CPU time of 16 milliseconds (ms) as a base time and try to improve on it. First, let's use the code that Listing 3 shows to build some indexes on the join columns. When you run the test query again, the statistics output is

Table 'od2'. Scan count 32, logical 
reads 144, physical reads 0, read-ahead reads 0.
Table 'o2'. Scan count 1, logical reads 21, physical reads 0, read-ahead reads 0.
SQL Server Execution Times:
   CPU time = 8 ms, elapsed time = 8 ms.

This output shows that the CPU time decreased by half but the logical reads increased significantly. This result happened because for the original query (with no indexes), SQL Server used a hash-join plan, which results in few I/O operations but more internal-processing time.

Now let's see whether building indexes on the search arguments makes the query run even faster. Use the code that Listing 4 shows to drop the existing indexes and build the new indexes. Then, run the test query again. According to the following I/O output

Table 'od2'. Scan count 1, logical
reads 10, physical reads 0, read-ahead reads 0.
Table 'o2'. Scan count 1, logical reads 21, physical reads 0, read-ahead reads 0.
SQL Server Execution Times:
CPU time = 15 ms, elapsed time = 15 ms.

the performance didn't improve. Neither search argument was selective enough to cause SQL Server to use an index, so the plan is the same as when no indexes existed.

Next, let's try adding composite indexes that include both the join column and the search argument column for each table. You could try building these indexes by specifying either the join column first or the search argument column first. Listing 5 shows the code to build composite indexes on the example tables. Listing 5 actually contains two sets of index drops and rebuilds; the first set builds composite indexes giving priority to the JOIN condition, and the second set gives priority to the search argument. I'll let you try them out by running the test query after building each new set of indexes.

If you ran these tests, you probably found no significant performance advantage to building the composite indexes. Neither of these indexes was more useful than the ones we had already created. Finally, let's try building composite indexes that cover the query. Remember that a covered query is supported by an index that contains all the columns the query references. Again, you could give priority to either the join column or the search argument column, as the code in Listing 6 shows.

Just like the indexes we built for Listing 5, the indexes built for Listing 6 don't seem to help at all. But changing the column order in the CREATE INDEX statement, as the code in Listing 7 shows, produces a winning solution: a covering index on each table, in which the search argument column is the leftmost column in the index. This combination of indexes gives outstanding performance, with a CPU time of substantially less than the original 16ms. In fact, my machine reported the CPU time as 0ms, which means less than one millisecond. I had no way to predict this result, but knowing what possible combinations of indexes were worth trying led me to an ideal solution.

Even though no simple algorithm can always come up with the best indexes for any given query or table, you can make some smart guesses and test whether those choices give good performance. For example, you can start with an index on your primary keys (which should already exist if you define your primary keys as a constraint on the table). Your decision is whether to make the index on the primary key column (or columns) clustered or nonclustered. That's a topic I won't discuss in detail now, but try experimenting to find the difference in performance. Sometimes a clustered index performs better, and sometimes a nonclustered index does.

Before you make the final decision about whether the primary key should have a clustered or nonclustered index, test all the possible types of queries that will access the table. You'll probably also want to consider building indexes on your foreign key columns and the columns used for search arguments and for grouping and sorting. When testing possible indexes, make sure to include all your data-modification operations. In my next column, I'll look at how indexes can be both a help and a hindrance for achieving good data-modification performance.

Comments

Plain text