
Tame Bulky Inserts

Downloads
25521.zip

Test to find the best index-creation order

Modifying only one row through one INSERT, UPDATE, or DELETE statement is fairly straightforward. However, if the table has multiple indexes and you're modifying multiple rows through one statement or you're using the bulk copy program (bcp) or the BULK INSERT command, you have to make some tough decisions. The most important decision is whether to drop one or more indexes before performing the modification or to leave all the indexes on the table. To help you make this decision, let's look at what goes on when SQL Server manipulates multiple rows by insertion or update. Then, I'll show you some example tests you can run to find the best method of using your indexes to support those transactions.

Operations that modify all or most of the rows in a table or that add a large number of rows to a table are called batch operations. Although you can delete multiple rows at the same time, the details aren't as complex as with inserts or updates because removing a value from a table or index is much easier than inserting a new value. Remember that insertions can lead to page splits or forwarding pointers. SQL Server often implements an update as a DELETE followed by an INSERT, so for the INSERT phase of the update, split pages or forwarding pointers might result. Remember also that for each new value in a table, whether the new value arrived from an INSERT or an UPDATE statement, every nonclustered index must have a new row inserted that references the new value. So if you're inserting or updating thousands or millions of rows in a table and the table has multiple nonclustered indexes, each needing to be modified for each new row, SQL Server must do a huge amount of maintenance.
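To picture the scale of that maintenance, consider a table with one clustered and four nonclustered indexes (the table and index names below are hypothetical, not from the article's listings). Every row you insert into the table also requires a new row in each of the four nonclustered indexes, so loading 500,000 rows forces roughly 2 million additional index-row insertions:

CREATE TABLE orders_copy (
    OrderID int, ProductID int,
    UnitPrice money, Quantity smallint, Discount real
)
CREATE CLUSTERED INDEX cl_orderid ON orders_copy (OrderID)
CREATE INDEX nc_product ON orders_copy (ProductID)
CREATE INDEX nc_price ON orders_copy (UnitPrice)
CREATE INDEX nc_quantity ON orders_copy (Quantity)
CREATE INDEX nc_discount ON orders_copy (Discount)
-- Every INSERT INTO orders_copy now touches the data (the clustered
-- index) plus all four nonclustered index structures.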

Strategies for Batch Operations


SQL Server 2000 and 7.0 offer two strategies—table-level modification and index-level modification—for maintaining all the indexes that belong to a table during a batch operation. The query optimizer chooses between them based on its estimate of the likely execution costs for each strategy.

In table-level modification—sometimes called row-at-a-time—SQL Server maintains all indexes for each row as that row is modified. In most cases, SQL Server has to perform many random index accesses; at least one page per index per updated row needs modification. Also, you might have additional modifications at the upper (non-leaf) index level. For INSERT operations, if the rows being inserted are already in a sorted sequence, SQL Server can more efficiently modify any index that's sorted in the same sequence. However, because the input rows can be sorted in only one order, other indexes still require a lot of random access. SQL Server can use nonrandom index accesses for at most one index.

In index-level modification—sometimes called index-at-a-time—SQL Server stores all the rows to be modified in a work table, then sorts those rows once for each index, in that index's key order. In other words, you have as many sort operations as you have indexes. Then, for each index, SQL Server merges the inserts or updates into the index and never accesses an index leaf page more than once, even if multiple modified rows pertain to one leaf page.

If the update involves a tiny percentage of the table's rows and the table and its indexes are large, the query optimizer usually considers table-level modification the best choice. Most online transaction processing (OLTP) operations, which affect only one or a few rows, use table-level modification. Conversely, if the update is relatively large, table-level modifications require a lot of random I/O operations and might read and write each leaf page in each index multiple times. In that case, index-level modification offers much better performance. The amount of logging required is the same for both strategies.

The documentation for pre-7.0 releases of SQL Server recommends dropping all indexes before using a bulk copy operation to import a lot of rows into a table because those releases offer no index-level strategy for maintaining all the indexes. In those releases, inserting large numbers of rows when the table has no indexes for SQL Server to maintain, then building (or rebuilding) the indexes after loading the data, is always more efficient. However, if you're using SQL Server 2000 or 7.0, don't assume that you should drop all indexes before loading large amounts of data. When testing your applications' performance and index strategies, make sure to test your batch operations both before and after building indexes. In cases where SQL Server can effectively perform index-at-a-time batch operations, you can often substantially improve performance by building the indexes before loading or updating the data.
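For example, to test the drop-first strategy on the hypothetical orders_copy table from the earlier sketch, you might bracket the load like this (the file path is also a placeholder):

-- Drop the nonclustered indexes, load the data, then rebuild
DROP INDEX orders_copy.nc_product
DROP INDEX orders_copy.nc_price
DROP INDEX orders_copy.nc_quantity
DROP INDEX orders_copy.nc_discount

BULK INSERT orders_copy FROM 'C:\temp\orders.txt'

CREATE INDEX nc_product ON orders_copy (ProductID)
CREATE INDEX nc_price ON orders_copy (UnitPrice)
CREATE INDEX nc_quantity ON orders_copy (Quantity)
CREATE INDEX nc_discount ON orders_copy (Discount)

When you time this sequence, include the index rebuilds; they're part of the cost of the drop-first strategy.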

Testing Batch Operations


To demonstrate how you might organize index performance tests for your bulk operations, I've included two scripts. Listing 1 creates a copy of the Northwind database's Order Details table. By changing the number of times the script copies the table into itself, you can control the number of rows in the table; as written, Listing 1 creates a table with more than half a million rows. Listing 2 uses the T-SQL command BULK INSERT to perform three bulk-load tests, each of which you can run in two different ways. The script first copies the od table (which the script in Listing 1 created) to a text file. Each test creates a new table called load_od to hold the inserted rows. The first test loads the data, then builds one clustered index and four nonclustered indexes on load_od. The second test builds the clustered index first, loads the data, then builds the four nonclustered indexes. The third test builds all five indexes before loading the data.
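If you don't have the download handy, the following sketch captures the kind of setup Listing 1 performs. It isn't the actual listing, and the counter identity column is my inference from the WHERE clause in the UPDATE example later in this article:

SELECT IDENTITY(int, 1, 1) AS counter, OrderID, ProductID,
    UnitPrice, Quantity, Discount
INTO od
FROM Northwind..[Order Details]

-- Each pass doubles the row count; 8 passes turn Northwind's
-- 2,155 Order Details rows into 551,680 rows
DECLARE @i int
SET @i = 0
WHILE @i < 8
BEGIN
    INSERT INTO od (OrderID, ProductID, UnitPrice, Quantity, Discount)
    SELECT OrderID, ProductID, UnitPrice, Quantity, Discount
    FROM od
    SET @i = @i + 1
END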

You can perform each test with or without the TABLOCK hint, which I commented out in the code. I won't say much more about locking, but for understanding BULK INSERT operations and Listing 2, it's useful to know what the TABLOCK hint does. Be aware that another form of the TABLOCK hint is used in SELECT statements, but that form isn't the same as the TABLOCK hint you use with BULK INSERT operations. When you use it with BULK INSERT or the bcp utility, the TABLOCK hint instructs SQL Server to lock the whole table for the entire batch operation. Without this hint, each inserted row acquires a row lock, so if you're inserting tens of thousands of rows, SQL Server has to manage tens of thousands of locks. The TABLOCK hint acquires just one lock for the operation.

So TABLOCK can be a good thing—unless you can't afford to have the entire table unavailable while the data is being loaded. Without the TABLOCK hint, other processes can access and modify any rows in the table that the INSERT operation doesn't affect. With the TABLOCK hint, no other processes can read or modify any rows in the table, with one exception: other processes that are also performing BULK INSERT or bcp commands can simultaneously access the table when they also use the TABLOCK hint. This feature lets multiple processes insert into the table concurrently; in some cases, this concurrent insertion ability can give you a great performance boost.
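For reference, here's the shape of a BULK INSERT statement with the hint in place; the file path and terminators here are assumptions, not necessarily what Listing 2 uses:

BULK INSERT load_od
FROM 'C:\temp\od.txt'
WITH (
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR = '\n',
    TABLOCK    -- remove this line to test row-level locking
)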

I ran the script in Listing 2 four times. First, I created the od table with just over half a million rows and ran the three tests with TABLOCK, then I ran the same tests without TABLOCK. I then created a copy of the od table with more than a million rows and ran the tests again, with and without TABLOCK. Table 1 shows the time required in seconds for each of my tests.
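One simple way to capture elapsed seconds for each test (not necessarily how Listing 2 does it) is to bracket each run with GETDATE():

DECLARE @start datetime
SET @start = GETDATE()

-- ... run one load test here ...

SELECT DATEDIFF(second, @start, GETDATE()) AS elapsed_seconds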

This set of tests didn't show that building all the indexes before loading the data was faster, but my goal here is not to convince you that it is. I'm merely suggesting types of tests to run. You could run many other tests, and many factors could be different on another system, yielding vastly different results. For example, you could try running the tests with 5 million or 10 million rows, or try more or fewer indexes. You could build the indexes on different columns or create composite indexes.
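For instance, one variation worth trying is replacing two of the single-column indexes with a composite index; the column choices here are just an example:

CREATE INDEX nc_prod_qty ON load_od (ProductID, Quantity)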

You should base your tests on the data you'll actually be loading, including the types of indexes you expect to build. Other systems could give different results if they have better I/O systems, especially for the drive that contains tempdb, where most sorting happens. I performed the tests for this article on my Toshiba Tecra 8000 laptop, which has a 10GB hard drive. I wouldn't make any recommendations to my clients based on these results, other than my usual advice to thoroughly test whether dropping some of your indexes before executing batch operations will improve performance. Without thorough testing, it's almost impossible to figure out whether you should build one, none, or all of the indexes before loading the data.

Query Plans for Batch Operations


One limitation in trying to analyze my results for these BULK INSERT tests is that there's no way to see a query plan that shows whether SQL Server used row-at-a-time or index-at-a-time insertion. For a BULK INSERT operation, the plan tells you only that SQL Server performed a bulk insertion.

For batch UPDATE operations, however, you can determine whether they occurred at the table level (row-at-a-time) or at the index level by inspecting the query plan output. If SQL Server performs the update at the index level, the optimizer produces a plan that contains an UPDATE operator for each of the affected indexes. If SQL Server performs the update at the table level, the plan shows only one UPDATE operator.

If you ran the tests to create the load_od table and you still have the load_od table with all five indexes, you can demonstrate the difference for yourself. Copy the following query into Query Analyzer, then select the toolbar button to display the query plan (or press Ctrl-L):

UPDATE load_od
SET Quantity = Quantity + 1, Discount = Discount * 1.01 
-- WHERE counter < 250

If the WHERE clause is commented out, the optimizer needs to find the best plan for updating a large number of rows—all the rows in the table. The graphical plan output will show two branches. The first branch updates the clustered index (i.e., the data), then sorts the data and updates the nonclustered index on the Quantity column; the second branch sorts the data, then updates the nonclustered index on the Discount column. This plan shows index-level updating. If you remove the comment marker from the WHERE clause so that only a fraction of the rows in the table are updated, you'll get a much simpler query plan. This plan has only one INDEX UPDATE operator for the clustered index (the table data). As it updates each row in the table, SQL Server makes the corresponding changes to the appropriate nonclustered indexes. So this plan shows table-level updating.
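If you'd rather read the plan as text than interpret the graphical output, the SHOWPLAN_TEXT option shows the same operators; look for one update operator per affected index (index-level) versus a single update operator (table-level):

SET SHOWPLAN_TEXT ON
GO
UPDATE load_od
SET Quantity = Quantity + 1, Discount = Discount * 1.01
GO
SET SHOWPLAN_TEXT OFF
GO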

When you need to modify thousands or millions of rows in one operation, finding an efficient strategy is essential. However, the most efficient strategy isn't always obvious. Knowing what options SQL Server has available and what changes you can make to affect performance can help you create tests that reveal the best strategy for your applications when they're processing your data.
