A table's columns usually represent the properties of real-life entities such as employee and product names. When your query contains a predicate—a WHERE clause containing search criteria—that includes a column, SQL Server's query optimizer can find the data that satisfies the query in two different ways: by using a table scan or index access. In a table scan, the query optimizer sequentially reads all rows of a table and compares each row with the search criteria in the WHERE clause. SQL Server typically decides to scan the entire table when the query selects a significant number of table rows. The query optimizer considers index access when a column index (clustered or nonclustered) exists. The decision about whether to use the existing index depends on many different factors. (For a description of factors that influence the use of index access, see "Which Is Faster: Index Access or Table Scan?" January 2000.)
If the query optimizer has to decide how to access a column that contains nonunique values, the distribution of distinct values for the column across a table's rows becomes an important consideration. Suppose your organization stores all customer names in the customers table, which includes two columns named gender and country. Also suppose your business is based mainly in North America and that the country column has 10 distinct values. If 70 percent of your customers live in the United States, 10 percent in Canada, 5 percent in Mexico, and 0.05 percent in Germany, the distinct values in the country column aren't uniformly distributed across the rows of the customers table. In this example, the country column's data is skewed because the frequency with which this column's existing values (United States, Canada, Mexico, Germany, and so on) appear in the rows of the customers table differs significantly. In contrast, the gender column lists roughly the same number of males and females, so the gender column's data isn't skewed. Let's look at different SELECT statements to see how the query optimizer decides to select the resultset in different situations in which the skewed-data column is in the WHERE predicate. In preparation, you need to first write a T-SQL statement that creates the customers table, which Listing 1, page 62, shows. You can then use the batch that Listing 2, page 62, shows to load 100,000 rows into this table.
Skewed Data Causes Plenty of Mischief
Skewed data can affect the query execution plan of queries containing a predicate and queries containing a join operation. Let's first explore the impact on queries containing a predicate.
Evaluating the predicate. Suppose you want to find out how many male customers you have in Germany. Because Germany accounts for only 0.05 percent of your customers in the earlier example, the predicate that Listing 3 shows will select at most 50 customers. The relationship between a predicate and the number of rows it chooses exemplifies the concept of selectivity. A predicate's selectivity is the ratio of the number of rows satisfying the WHERE clause to the total number of rows in the table. The selectivity of an index is said to be high, as in this scenario, when the index key value identifies only a few rows. If selectivity is high and if a corresponding index exists, the query optimizer typically uses index access because retrieving a relatively small number of rows is much faster with the sorted row order that the index access uses than sequentially scanning the whole table. Applying the index access strategy, the query optimizer uses the existing index i_country to satisfy the query, as Figure 1 shows. In contrast, Listing 4's predicate, which identifies US customers, selects significantly more rows. In this case, the best access strategy is a table scan, which Figure 2 shows, because reading all data pages sequentially is significantly faster than reading specific pages by using the existing index. So we can see that the query optimizer chooses a table scan if the I/O estimate for index access is greater than the number of pages that the table fills. Here, selectivity is low.
In Listing 3 and Listing 4, the query optimizer chooses the query plan you expect. However, if you try to write a generic SELECT statement that you can use to query any country your customers live in, the query optimizer's choice for data access might surprise you. Following the earlier line of reasoning, you might conclude that the query optimizer would choose the index scan for executing the batch that Listing 5 shows because the @country variable's value is set to Germany. Instead, the query optimizer uses a table scan—even if the corresponding index for the country column exists. The reason for this behavior is simple: The query optimizer can't use the existing statistics as the queries in Listing 3 and Listing 4 did because one of Listing 5's predicates contains a variable. And Listing 5's batch declares the @country variable in the same scope as the query. The query optimizer doesn't know the value of the variable when it chooses the access method because it doesn't perform multiple passes when it optimizes the query. Thus, Figure 3 shows that the query optimizer used the table scan because the Row count parameter has a value of 0. The query optimizer assigns this value when it doesn't know the number of selected rows.
Join operations. Columns containing skewed data can also influence the performance of join operations. Consider a table called orders, which contains 500,000 rows. Listing 6 shows the statement that creates the orders table, and Listing 7 contains the batch that loads the table's rows. Suppose 20 percent of your business comes from PC sales, and controllers account for only 0.05 percent of total sales. Also suppose that all relevant columns in the query (customers.customerid, orders.customerid, item_name) are indexed. If you ask for the average income of customers who bought PCs, as Listing 8 shows, SQL Server has to read all rows from the customers table and 80,000 rows from the orders table to perform the join. Because both tables contain so many rows, the query optimizer chooses not to use existing indexes and instead executes the join operation by using the hash join method that Figure 4, page 64, shows. A hash join performs best on tables that differ considerably in size, on unsorted rows, and on nonindexed columns. This join method is two-phased. First, SQL Server builds a hash structure consisting of the smaller (outer) table's values, which have been hashed, then probes each row on the larger (inner) table for matches in the hash table. The hash join scans the inner table only once. In contrast, to execute a join of the customers and orders tables, based on controller sales, as Listing 9, page 64, shows, SQL Server has to read only 200 rows from the orders table. For this reason, the query optimizer chooses to execute the join by using the nested loop join that Figure 5, page 64, shows. In a nested loop join, SQL Server accesses the outer table row by row. Then, looking only for matching rows, SQL Server scans the inner table, which must contain an indexed join column. This join method produces I/O for each index scan of the inner table (i.e., I/O for 200 scans).
Curbing Skewed Data's Misdoings
You can use several methods to enhance response time for queries that have to contend with skewed data:
- Use multiple threads to scan one table. If your SQL Server system runs on an SMP computer, the query optimizer applies parallel table scans.
- Divide the table by using local partitioned views.
- Avoid using variables in batches.
You can apply all these methods in situations where the query optimizer uses a table scan—that is, where the selectivity of a column's value is low.
If your computer has several processors, SQL Server might split a table's rows into different logical parts and process these parts on different processors, thereby permitting parallel table scans. If the query optimizer applies a parallel table scan, you can significantly enhance the response time of Listing 3's query, which accesses skewed data in the customers table. By using sp_configure, SQL Server autoconfigures the max degree of parallelism configuration option, which controls the use of parallel queries.
When you apply parallel scans, SQL Server groups the table's rows. Because this grouping is an internal process, a user can't directly influence it. However, if you use distributed partitioned views (available only in SQL Server 2000), you can specify how to partition the table's rows. The main advantage of distributed partitioned views is that the query optimizer knows which values of the partitioning column are stored in which partition and can omit the search in views that don't contain the value from a given predicate. For example, consider the query in Listing 4. If you create two distributed partitioned views, one with females and one with males, the query optimizer will search the rows of the all-male view to satisfy the query. (For more information about distributed partitioned views, see the three-part series by Kalen Delaney and Itzik Ben-Gan: "Distributed Partitioned Views in SQL Server 2000," August 2000, "Querying Distributed Partitioned Views," September 2000, and "Modifying Views with INSTEAD OF Triggers," October 2000.)
As for avoiding using variables in batches, let's look again at Listing 5. To execute the query in Listing 5, the query optimizer can't use the existing statistics because a variable in one of the predicates is joined by the AND operator. For this reason, avoid batches that contain a variable. If you have to use such a query in a batch and if the join column's value has high selectivity, consider using the INDEX query hint to maintain index access.
Accounting for Data Skew During Database Design
Because many columns in relational tables contain skewed data, carefully consider your data distribution when you design your database. First, take skewed data into account when you create indexes for your database tables. The way you use indexes for columns containing skewed data depends on the values of the column that you name in a predicate. Second, be aware that the size of tables containing skewed data directly impacts performance by index access and table scans. Index access becomes faster and table scans slow down as tables become larger, resulting in an increasing performance gap between the two access strategies. Hence, the effect of skewed data is highest in the data-warehousing field.
As you've seen in this article, the query optimizer accesses skewed data effectively. The only case in which the query optimizer doesn't recognize the selectivity of a column's value is when the predicate contains a variable. However, if your query's predicate contains a column that has skewed data, you can enhance your query's response time by using parallel table scans or local partitioned views and by avoiding variables in batches.