The cost-based optimizer in Microsoft SQL Server 2000 is charged with a difficult task: Find the most efficient way to execute a query. Success in this ambitious undertaking hinges on up-to-date information.
The optimizer gets its information from a variety of sources. You explicitly give it some information when you create tables, indexes, and constraints. This information won’t change unless you deliberately change it with Data Definition Language (DDL) statements that alter the database schema. The database also collects information on its own in the form of statistics. Statistics reflect the actual data stored in the database, including the cardinality and distribution of values in tables, columns, and indexes. This information is subject to change, so it’s only as good as the last time the statistics were collected. Fortunately, SQL Server collects these statistics automatically on an as-needed basis, provided you don’t disable this feature.
As the optimizer develops different plans, it associates a cost with each. These costs, which can ultimately be expressed as scalar values, are also compared against a server-wide setting called the cost threshold for parallelism. If the anticipated cost of a plan exceeds this threshold, SQL Server might evaluate plans that utilize parallel execution, which essentially divides the work among multiple processors. The default value for this setting is 5, which is considerably higher than the typical cost of most simple queries and single-row updates. To examine this setting, open Enterprise Manager, right-click the name of any database server, and select Properties. The Processor tab has a Parallelism section in which you can change the threshold or effectively disable parallelism by limiting SQL Server to 1 processor.
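If you prefer T-SQL to Enterprise Manager, you can examine or change the same setting with the sp_configure system stored procedure. Here’s a quick sketch, assuming you have permission to change server options; because cost threshold for parallelism is an advanced option, show advanced options must be enabled first:

-- Make advanced options visible to sp_configure
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
GO
-- Display the current threshold (the default is 5)
EXEC sp_configure 'cost threshold for parallelism'
GO
-- Example: raise the threshold so only costlier plans are candidates for parallelism
EXEC sp_configure 'cost threshold for parallelism', 10
RECONFIGURE
GO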
The ability to analyze plans is why the cost-based optimizer is such a valuable part of SQL Server. The optimizer can react to changes in data characteristics more quickly, efficiently, and accurately than any human. Some rule-based schemes in other database platforms require you to anticipate how tables will be populated and program that knowledge into every SQL statement submitted. Through the cost-based optimizer, SQL Server relieves you of that burden in almost every case.
Of course, there are limits. Your data might be constrained by application logic that restricts table entries according to complex business rules, and the optimizer has no visibility into these details. The optimizer still devises a desirable plan based on the information it can glean from the schema, the statistics, and the query itself, but an even better plan might exist, given inherent traits of the data that remain hidden from the optimizer.
In these situations, perhaps a subtle hint would help. I’m not referring to the explicit table, view, join, or query hints that you can include with any statement. Although these explicit hints are perfectly valid, supported syntax, you should avoid them if possible. SQL Server Books Online (BOL) cautions against them (see the entry for the FROM keyword), with good reason: Explicit hints rob the optimizer of its ability to react to changing data distribution, inhibiting it from doing what it does best. What I’m talking about can best be described as a subtle hint, like the kind you get from family members in the months leading up to holidays and birthdays.
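For reference, here’s what two of those explicit hints look like, so you know what I’m suggesting you avoid. The table hint below names titleidind, the nonclustered index that the Pubs installation script creates on Roysched’s title_id column (the name might differ in your copy); the query hint forces a hash join regardless of what the optimizer would otherwise choose:

-- A table hint forcing a specific index
SELECT royalty
FROM roysched WITH (INDEX(titleidind))
WHERE title_id = 'BU7832'

-- A query hint forcing a join strategy
SELECT t.title, r.royalty
FROM titles t
JOIN roysched r ON r.title_id = t.title_id
OPTION (HASH JOIN)

Both statements will run, but they lock the optimizer into those choices even if the data distribution later changes.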
The idea here is to embed enough information in your query to help the optimizer make a better decision. You’re not forcing it down any path—you’re just supplying more information than the optimizer currently has, then letting it do its job.
To illustrate, let’s use an example from the familiar Pubs database that’s included with SQL Server. The database includes a table called Roysched, which contains, as the name implies, a royalty schedule for publication sales. Each row contains a title, a low range and a high range for sales volume, and a corresponding royalty percentage if those volumes are hit.
If you examine the data in this table, a business rule quickly becomes apparent. No overlap exists between any two rows for a given title. In other words, the sales volume for any given title will map to only one royalty value by virtue of the ranges within a title being mutually exclusive. This rule isn’t enforced within the database schema at all. In all likelihood, the data entry process for this table would have some validation in place to guard against ambiguities.
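If you want to confirm the rule against the current data, a quick self-join can search for overlapping ranges. I’m offering this as an exploratory sketch you can run in Query Analyzer, not as something to add to the application:

-- Find any pair of rows for the same title whose ranges overlap;
-- an empty result means the business rule holds for the current data
SELECT a.title_id, a.lorange, a.hirange, b.lorange, b.hirange
FROM roysched a
JOIN roysched b
  ON a.title_id = b.title_id
 AND a.lorange < b.lorange    -- pair each row only with ranges that start later
WHERE b.lorange <= a.hirange  -- the later range starts before the earlier one ends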
Here’s a simple query to find the royalty percentage for a given title and sales volume:
SELECT royalty
FROM roysched
WHERE 22000 BETWEEN lorange AND hirange
  AND title_id = 'BU7832'
Due to the business rules just stated, you would expect only one row to be returned. However, the optimizer doesn’t know this in advance like you do, and statistics won’t help it identify the complex pattern in the data (i.e., the non-overlapping ranges within each title ID). Because of this blind spot, the optimizer must assume that any number of rows could satisfy the predicates, so it plans to keep reading even after the first match is found.
A simple addition to the statement can help—a subtle hint, if you will. Although you typically use the TOP operator to limit a large data set, you can use it here as well:
SELECT TOP 1 royalty
FROM roysched
WHERE 22000 BETWEEN lorange AND hirange
  AND title_id = 'BU7832'
From a human perspective, adding the TOP 1 operator looks a little redundant to anyone familiar with the data. After all, you’ve already stated that only one row will be returned by virtue of the validations on data entry. Furthermore, adding operators usually adds processing steps in the execution plan. Why, then, would you want to add this step to the query? To help the optimizer. The optimizer can capitalize on the knowledge that only one row will be returned. Granted, you’ve added a step to the query plan, but the reduced cost for reading the table is enough to offset the cost of that step.
Figure 1 and Figure 2 show the Results and Execution Plan tabs, respectively, in Query Analyzer for comparison purposes. To examine these results yourself, simply enable the Show Execution Plan option on the Query menu before running the query, or use the keyboard shortcut Ctrl+K. To generate the I/O costs, open the Current Connection Properties dialog box (which you can access from the Query menu) and select the Set statistics IO check box. Alternatively, within the query window, you can execute the following statement before running the queries:
SET STATISTICS IO ON
If you’re using the Query menu’s Results in Grid option, the I/O information will appear on a separate tab labeled Messages.
After you run the two queries, note the difference in I/O costs and compare the execution plans. The second query costs only half as much as the first.
In most cases, TOP is paired with an ORDER BY clause because without one, which rows are returned is unpredictable. In this example, however, you know that the outcome of the statement won’t change: only one row will ever meet the selection criteria.
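For contrast, here’s the typical usage against the same table, in which the ORDER BY makes the choice of rows well defined:

-- Typical TOP usage: the ORDER BY determines which five rows qualify
SELECT TOP 5 title_id, lorange, hirange, royalty
FROM roysched
ORDER BY royalty DESC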
The query in this example is relatively simple. The differences become even more significant in a complex statement that joins multiple tables together.
Joins are relatively expensive operations in query processing. One way to minimize that expense is to limit the amount of data from each table involved in a join. Think of the optimizer as breaking the query down into all the necessary processing components, including the application of filters or predicates, joins, bookmark lookups, and so on. Then, through cost estimation of each step, the optimizer creates the plan with the lowest overall cost. The order in which the steps execute is an important aspect of the plan. On the one hand, applying a filter before performing a join might dramatically reduce the amount of data the join processes. On the other hand, applying that filter might also require expensive bookmark lookups.

A bookmark lookup is a read against a specific row in a table that’s referenced by an index. When a nonclustered index is used in a query, bookmark lookups are required if the index doesn’t include all the table columns referenced in the column list and WHERE clause. Although SQL Server has some internal optimization tricks, such as reading a full page at a time to save potential additional reads of the same page, a significant number of bookmark lookups quickly becomes more costly than a straight table scan. Therefore, performing the join first might be more appealing if doing so would reduce the number of bookmark lookups.
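Another way to attack bookmark-lookup cost in our Roysched example is a covering index. The index name below is mine, and a wider index key does add storage and maintenance overhead, so treat this as a sketch rather than a blanket recommendation. Note that SQL Server 2000 has no separate clause for included columns, so every covered column must be part of the index key:

-- A hypothetical covering index for the sample query: because title_id,
-- lorange, hirange, and royalty are all in the index key, SQL Server can
-- satisfy the query from the index alone, with no bookmark lookups
CREATE NONCLUSTERED INDEX ix_roysched_covering
ON roysched (title_id, lorange, hirange, royalty)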
When more than two tables are involved, the permutations for ordering the steps increase exponentially. The decision to execute one join before the other might hinge on which of the two steps is going to limit the intermediate data set the most.
For these reasons, if you know something about your data that the optimizer doesn’t, it’s beneficial to find a way to make that knowledge apparent. The previous example did so by limiting the result set to the one row we knew would be returned—a subtle hint that let the optimizer make its own decisions, armed with knowledge that it didn’t have before. Limiting the result set to the one row would also work well with tables with date ranges (i.e., a table with from and to date columns), where the date ranges aren’t allowed to overlap. Such rules are also usually enforced in application logic and are invisible to the optimizer.
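The same subtle hint translates directly to that scenario. The table and column names below are hypothetical, standing in for any effective-dated design in which application logic guarantees that the ranges for a given key never overlap:

-- Hypothetical effective-dated lookup: validation elsewhere guarantees
-- non-overlapping date ranges per product, so at most one row qualifies
SELECT TOP 1 unit_price
FROM price_history
WHERE product_id = 'P1001'
  AND '20020615' BETWEEN effective_from AND effective_to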
If the phrase subtle hints doesn’t scream out documentation, or at least comments, to you, it should. Document and comment these assumptions in your code, design, requirements, and test cases. The business rules you’re exploiting won’t be obvious in an entity relationship diagram (ERD). If code appears to have extra or redundant steps without proper explanation, some future maintainer might remove your changes in a well-intentioned attempt to improve efficiency, ironically reverting the query to its original, suboptimal performance. Therefore, it’s important to leave yourself and your successors a guide to your decisions regarding any changed statements.
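In practice, a comment right beside the statement is usually enough. For the Roysched example, something like this records both the business rule and the reason for the extra operator:

-- TOP 1 is deliberate: data-entry validation guarantees non-overlapping
-- (lorange, hirange) ranges per title, so at most one row can qualify.
-- The hint lets the optimizer plan to stop after the first match.
SELECT TOP 1 royalty
FROM roysched
WHERE 22000 BETWEEN lorange AND hirange
  AND title_id = 'BU7832'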
Do you have any opportunities like this in your applications? Chances are, at least some business rules result in inherent predictability for your data. Can the optimizer realize that predictability on its own? If not, it’s up to you to find a way to help it along.