SQL Server updates rows in multiple ways, automatically and invisibly choosing the fastest update strategy for each operation. In determining the strategy, SQL Server evaluates the number of affected rows, how it will access the rows (by a scan or an index retrieval, and through which index), and whether the index keys will change. In SQL Server 7.0, updates can happen in place or as a delete followed by an insert. (An in-place update is one in which SQL Server changes the bytes in the row with no movement of data necessary.) In addition, either the SQL Server 7.0 query processor or the storage engine can manage updates. The decision of whether the query processor or the storage engine controls the update is relevant to all data modification operations, not just updates.
Releases of SQL Server before 7.0 can perform either a direct or a deferred update. Many people mistakenly interpret direct to mean that the update is done in place, and then conclude that a deferred update is one that isn't done in place and requires the row to move to a new location. But deferred actually means that the update occurs in two passes. The first pass uses the transaction log as a holding area: SQL Server records all the changes it will make and marks them as NO-OP, meaning that no operation occurs at that time but an entry is made in the log. The second pass then rereads the log and applies the changes. Also, earlier releases of SQL Server update every nonclustered index any time a row moves, so that each index reflects the row's new location. Because all changes, including changes to the nonclustered indexes, must be recorded twice, deferred operations are log-intensive. Deferred updates are by far the slowest kind of SQL Server 6.x update.
SQL Server 7.0 has no deferred updates. SQL Server performs all updates in a direct mode, without using the transaction log as an intermediate holding area. Direct doesn't necessarily mean that the update happens in place, however. And because of the way SQL Server 7.0 maintains nonclustered indexes, the overhead of moving rows is inexpensive, even if the table has multiple nonclustered indexes.
What happens if a row in a table needs to move to a new location? With SQL Server 7.0, such a situation might arise when you update a row with variable-length columns to a larger size that no longer fits on its original page. Or, because SQL Server stores rows in order of the clustering key, a row might need to move when the clustered index column changes. For example, if you have a clustered index on lastname, a row with a lastname value of Abbot is stored near the beginning of the table. If you then update the lastname value to Zappa, this row will move to near the end of the table.
The leaf level of nonclustered indexes contains a row locator for every row in the table. If the table has a clustered index, the row locator in every nonclustered index is the clustering key for that row. So if—and only if—the clustered index key is updated, modifications are required in every nonclustered index. Keep this fact in mind when deciding which columns to build your clustered index on. Clustering on a nonvolatile column is a great idea. If a row moves because it no longer fits on the original page, it keeps the same row locator (in other words, the clustering key for the row stays the same), and no nonclustered indexes need modification.
If a table has no clustered index (it's stored as a heap), the row locator stored in the nonclustered index is the row's physical location. In SQL Server 7.0, if a row in a heap moves to a new page, the row leaves a forwarding pointer in the original location. The nonclustered indexes don't need changing; they still refer to the original location, and the pointer specifies the new location.
Let's look at an example. The code in Listing 1 creates a simple table with three columns. The table has no clustered index, so it's stored as a heap. After populating the table with five rows, which fill the page, the code updates one of the rows to make its third column much longer. The row no longer fits on the original page and must move. To find the address of the page to examine, the code selects the first column from sysindexes; the value you supply to DBCC PAGE depends on the value that this column returns. See the sidebar "Using DBCC PAGE," page 57, for a partial explanation of the DBCC PAGE command.
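Listing 1 itself isn't reproduced in this excerpt, but based on the description above, a minimal sketch might look something like the following. The column definitions and lengths are assumptions; only the table name, bigrows, comes from the DBCC CHECKTABLE example later in the article.

```sql
-- Hypothetical sketch of Listing 1: a heap whose five rows fill one 8KB page
CREATE TABLE bigrows
(
    a int,
    b varchar(10),
    c varchar(1600)
)
GO
-- Five rows of roughly 1,600 bytes each approximately fill one page
INSERT INTO bigrows VALUES (1, 'row 1', REPLICATE('a', 1600))
INSERT INTO bigrows VALUES (2, 'row 2', REPLICATE('b', 1600))
INSERT INTO bigrows VALUES (3, 'row 3', REPLICATE('c', 1600))
INSERT INTO bigrows VALUES (4, 'row 4', REPLICATE('d', 1600))
INSERT INTO bigrows VALUES (5, 'row 5', REPLICATE('e', 1600))
GO
-- Grow the third column so the row no longer fits and must move,
-- leaving a forwarding stub behind
UPDATE bigrows SET c = REPLICATE('x', 4600) WHERE a = 3
GO
-- The first column of sysindexes gives the address of the table's first page,
-- which you can then feed to DBCC PAGE
SELECT first FROM sysindexes
WHERE id = OBJECT_ID('bigrows') AND indid IN (0, 1)
```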
Here are the contents of a row that contains a forwarding pointer; you can see what appears in the slot where the row with a = 3 previously appeared:
Slot = 2  Offset = 0x1feb
Record Type = FORWARDING_STUB
Record Attributes =
11ef3feb:  0000f904 00000100 00 .........
In these results, you have to byte-swap each set of four bytes; each two digits make up one byte. So, after byte-swapping the first four bytes, you get 04 f9 00 00. The value of 4 in the first byte identifies the record as a forwarding stub. The next three bytes, f9 00 00, represent the page number the row moved to; converting this hex value to decimal, you get 249. The next group of four bytes tells you that the row is at slot 0 and the page is on file 1. If you then use DBCC PAGE to look at page 249, you can see what the forwarded record looks like. (For a more complete description of using DBCC PAGE and the output it returns, refer to my book, Inside SQL Server 7.0, Microsoft Press, 1999.)
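If you want to double-check the page-number conversion, you can let SQL Server do the hex-to-decimal arithmetic itself:

```sql
-- The three page-number bytes f9 00 00, byte-swapped to 0x000000f9,
-- convert to decimal 249
SELECT CONVERT(int, 0x000000f9) AS page_number
```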
Managing Forwarding Pointers
Forwarding pointers let you modify data in a heap without having to make drastic changes to the nonclustered indexes. If a forwarded row must move again, SQL Server updates the forwarding pointer to point to the new location, so you'll never have one forwarding pointer pointing to another. Also, if the forwarded row shrinks enough to fit in its original place, SQL Server eliminates the forwarding pointer, and the record moves back to its original position.
With SQL Server 7.0, when a forwarding pointer is created, it remains forever, with two exceptions. The first exception happens when a row shrinks enough to move back to its original location. The second exception happens when the entire database shrinks. When a file shrinks, SQL Server reassigns the row identifiers (RIDs), used as the row locators, so the shrink process never generates forwarded rows. If the shrinking process removes pages, any forwarded rows or stubs on those pages are effectively unforwarded because all the rows have new locations.
To see the total count of forwarded records in a table, you can enable trace flag 2509. Then execute the DBCC CHECKTABLE command, and you'll see the number of forwarded records in that table.
DBCC TRACEON (2509)
GO
DBCC CHECKTABLE (bigrows)
Your results will look something like this:
DBCC results for 'bigrows'.
There are 5 rows in 2 pages for object 'bigrows'.
Forwarded Record count = 1
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
Updating: In Place or Not
Updating a row in place is the rule rather than the exception in SQL Server 7.0, unlike in earlier releases. Updating in place means that the row stays in the same location on the same page and only the affected bytes change. In most cases of updating in place, the log contains one record for each updated row. The exceptions occur when the table has an update trigger or is marked for replication; in these cases, the update still happens in place, but the log contains a delete record followed by an insert record. Updates happen in place if you update a heap (as long as the modified row still fits on its page) or if you update a table with a clustered index without changing any of the clustering keys.
In cases where a row can't be updated in place, the cost of a not-in-place update is minimal because of the way SQL Server stores nonclustered indexes and because of the use of forwarding pointers. No nonclustered indexes need changing, which makes the update quite inexpensive.
If your update can't happen in place because you're updating clustering keys, it will occur as a delete followed by an insert. Sometimes when you're updating a variable-length column in many rows, you get a hybrid update: some of the rows are updated in place and some aren't. If you're updating index keys, SQL Server builds a list of rows that need to change as a delete plus an insert operation. If the list is small enough, SQL Server stores it in memory; if the list would take too much memory, SQL Server writes it to tempdb. SQL Server then sorts this list by key value and operator (delete or insert). If the index whose keys are changing isn't unique, SQL Server then applies the delete and insert steps to the table. If the index is unique, it collapses the delete and insert operations on the same key into a single update operation. Let's look at a simple example.
The code in Listing 2 builds a table with a unique clustered index on column X, then updates that column in both rows. Table 1 shows the operations that SQL Server generates internally for that update statement. This list of operations, the input stream, consists of the old and new values of every column and an identifier for each row to be changed. In this case, the updates must be split into delete and insert operations. If they weren't, the update to the first row, changing X from 1 to 2, would fail with a duplicate-key violation. So SQL Server generates a converted input stream that looks like Table 2. Note that for the insert half of the update, the RID isn't relevant in the input stream of operations to be performed. Before a row is inserted, it has no RID. SQL Server then sorts this input stream by index key and operation, as in Table 3.
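Listing 2 isn't reproduced in this excerpt; a sketch consistent with the description (the table, column, and index names are assumptions) could be:

```sql
-- Hypothetical sketch of Listing 2: a unique clustered index on column X
CREATE TABLE T1 (X int NOT NULL, Y int NOT NULL)
GO
CREATE UNIQUE CLUSTERED INDEX idx_X ON T1 (X)
GO
INSERT INTO T1 VALUES (1, 10)
INSERT INTO T1 VALUES (2, 20)
GO
-- Update the clustering key in both rows. Without the split into delete
-- and insert operations, changing X from 1 to 2 in the first row would
-- collide with the existing X = 2 row and fail with a duplicate key error.
UPDATE T1 SET X = X + 1
```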
Finally, if one key value has both a delete and an insert, the two rows in the input stream collapse into an update operation. So the final input stream looks like Table 4. You can see that, although the original query was an update to column X, after the split, sort, and collapse of the input stream, the final set of actions looks as if you're updating column Y! This method of carrying out the update prevents intermediate violations of the index's unique key. Screen 1 contains part of the graphical query plan for this update, showing the split, sort, and collapse phases.
Updates to nonclustered index keys also affect the index's leaf level by splitting the operation into deletes and inserts. Think of the nonclustered index's leaf level as a miniature clustered index; any modification of the key could potentially affect the sequence of values in the leaf level. As with the data in a clustered index, the index's uniqueness determines the update type. If the nonclustered index is nonunique, the update splits into delete and insert operators. If the nonclustered index is unique, the split is followed by a sort and an attempt to collapse any deletes and inserts on the same key back into an update operation.
Table-Level vs. Index-Level Data Modification
We've just considered the placement and index manipulation necessary for modifying one or a few rows with no more than one index. If you're modifying multiple rows in a single operation (insert, update, or delete) or by using the bulk copy program (bcp) or the BULK INSERT command, and the table has multiple indexes, you need to be aware of some other facts. SQL Server 7.0 offers two strategies for maintaining all of a table's indexes: table-level and index-level modification. The query optimizer chooses between them based on its estimate of the anticipated execution costs of each strategy. In a table-level modification, sometimes called row-at-a-time, SQL Server maintains all indexes for each row as it modifies that row. If the update stream isn't sorted, SQL Server needs to perform many random index accesses, one access per index per updated row. Even if the rows to be modified are supplied as an ordered stream, that stream can be sorted in only one order; therefore, there can be nonrandom index accesses for at most one index.
In index-level modifications, sometimes called index-at-a-time, SQL Server gathers all the rows to be modified and sorts them once for each index; in other words, there are as many sort operations as there are indexes. Then, for each index, SQL Server merges the updates into the index, accessing no index page more than once even if multiple updates pertain to a single leaf page.
Usually, if the update is small and the table and its indexes are large, the query optimizer considers table-level modification the best choice. Most online transaction processing (OLTP) operations use table-level modifications. But if the update is relatively large, table-level modifications require many random I/O operations and might read and write each leaf page in each index multiple times. In such cases, index-level modification performs much better. The amount of logging required is the same for both strategies.
Let's look at a specific example. Although this example deals with inserting new rows, you could perform a similar analysis if you were updating a large percentage of the rows in a table. Suppose you use BULK INSERT to increase the size of a table by 1 percent (which could correspond to 1 week in a 2-year sales history table). The table is stored in a heap (no clustered index), and the rows are simply appended to the end because the table's other pages don't have much available space. There's nothing to sort on for insertion into the heap. Assume also that you have two nonclustered indexes on columns a and b. SQL Server sorts the insert stream on column a and merges it into the index on a.
If an index entry is 20 bytes long, an 8KB page filled at 70 percent (which is the natural, self-stabilizing value in a B-tree after many random insertions and deletions) would contain 8KB * 70 percent ÷ 20 bytes = 280 entries (approximating 8KB as 8,000 bytes). Eight pages (one extent) would then contain 2240 entries, presuming leaf pages are laid out in extents. A 1 percent table growth implies, on average, 2.8 new entries per page, or 22.4 new entries per extent.
Table-level insertion would touch each page, on average, 2.8 times. Unless the buffer pool is large, table-level insertion reads, updates, and writes each page 2.8 times, which means 5.6 I/O operations per page, or 44.8 I/O operations per extent. Index-level insertion reads, updates, and writes each page (and extent, again assuming a contiguous layout) exactly once. In the best case, about 45 page I/O operations are replaced by two extent I/O operations per extent. Of course, the cost of the index-level strategy also includes the cost of sorting the input for each index, but the much-reduced I/O cost can easily outweigh the cost of the sorts.
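You can reproduce this back-of-the-envelope arithmetic with a simple query (note that 8KB is approximated as 8,000 bytes):

```sql
-- Verify the example's numbers: entries per 70-percent-full page,
-- entries per 8-page extent, and new entries per page at 1 percent growth
SELECT
    8000 * 0.70 / 20          AS entries_per_page,      -- 280
    8 * (8000 * 0.70 / 20)    AS entries_per_extent,    -- 2240
    0.01 * (8000 * 0.70 / 20) AS new_entries_per_page   -- 2.8
```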
For releases of SQL Server before 7.0, the standard recommendation is to drop all indexes before using bcp to import many rows into a table or before performing mass updates. That recommendation is sound for earlier releases because they have no index-level strategy for maintaining all the indexes. With the new index-at-a-time strategy in SQL Server 7.0, the recommendation is no longer valid.
You can find out whether SQL Server made your updates at the table level or the index level by inspecting the SHOWPLAN output. If SQL Server performs the update at the index level, you'll see a plan that contains an update operator for each of the affected indexes. If SQL Server performs an update at the table level, you'll see only one update operator in the plan.
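One simple way to inspect the plan without executing the update is SET SHOWPLAN_TEXT; the table and column names below are carried over from the earlier hypothetical example and are illustrative only:

```sql
-- Display the estimated plan instead of executing the statement.
-- One update operator in the plan means a table-level update;
-- one update operator per affected index means an index-level update.
SET SHOWPLAN_TEXT ON
GO
UPDATE bigrows SET c = 'changed' WHERE a = 3
GO
SET SHOWPLAN_TEXT OFF
GO
```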
Updates in earlier releases of SQL Server can be problematic. They can happen quickly if certain conditions are met, but in other cases, when you have deferred operations, updates take many times longer. In addition, more than a few nonclustered indexes on a table can greatly slow the update speed. In SQL Server 7.0, all updates can happen quickly. The new indexing structures significantly reduce the need for index maintenance during updates. And the new index-at-a-time update strategy greatly optimizes the performance of index maintenance, in cases where it's still necessary. Indexes in SQL Server are a great thing; don't let concerns about maintenance during updates keep you from adding all the indexes you need to your important tables.
(This article was adapted from Inside SQL Server 7.0, Chapter 8, "Modifying Data.")