SQL Server Read-Consistency Problems, Part 1

Life is full of incorrect assumptions, many of which concern things that we take for granted and never bother to check, just because they seem so trivial, or because we make wrong conclusions based on our observations and understanding of reality, or because we misinterpret other explanations. Try answering the following questions before you read on, and when you’re done reading the article, answer them again: When you use the Read Uncommitted isolation (e.g., when specifying the NOLOCK hint) to query data, what would change in SQL Server’s treatment of SELECT queries compared with the default Read Committed isolation? When you query data using the Read Uncommitted or Read Committed isolation, can you get the same row multiple times? When you query data using the Read Uncommitted or Read Committed isolation, is it possible that your query will skip a row? When you examine a query’s execution plan and see an Index Scan or Clustered Index Scan operator with Ordered: False, how will SQL Server scan the data?

In this article, I’ll try to shed some light on the way reads behave in different scenarios and the consistency level you should expect in each scenario. Most of this article’s findings are fairly recent—in the past year or so— and my answers to the questions above before those findings were wrong or incomplete. I found myself surprised by the true answers to these questions (at least true to the best of my current understanding), and my current understanding led me to revisit my past choices and recommendations regarding reading data. I described some of these findings in my blog, but here I’ll provide a more complete picture, as well as some new findings. But first, I’ll set some groundwork with a few important fundamentals.

Some Important Fundamentals

Bear with me if you’re already familiar with the concepts that follow. I believe they’re integral to understanding this article’s contents.

Heap table. A heap table is an organization of a table with no clustered index. SQL Server maps heap data with one or more Index Allocation Map (IAM) pages that map the table’s data in file order. No logical ordering is enforced in a heap. When a new row is inserted into a heap, SQL Server uses a special type of page called Page Free Space (PFS) to locate free space in which to insert the new row. When you update a variable-length column in a row in a heap with a longer value such that the row doesn’t have room to expand in the original page, SQL Server will move the row to another page, leaving a forwarding pointer in the row’s original location, pointing to the row’s new address. It’s important to note that neither inserts nor updates can cause page splits (which I’ll describe in a moment) in a heap.

Clustered table. A clustered table is an organization of a table with a clustered index in a B-Tree data structure. For the purpose of this discussion, I’ll focus on the leaf level of the index, which contains the full data rows. The pages in the leaf level are organized in a doubly linked list, maintaining logical index key ordering. Besides the linked list, SQL Server also maps the leaf level’s data with IAM pages in file order. The logical order of the pages maintained by the linked list (index key order) can be different from the order of the pages in the files. The discrepancy between index order of pages and file order of pages is called logical fragmentation. This fragmentation is measured as the percentage of out-of-order pages as you scan the pages in index order. An out-of-order page is one that appears in the file before the page that points to it in the linked list.

Nonclustered index. Like a clustered index, a nonclustered index is organized as a B-Tree. The leaf level contains as many rows as the number of rows in the table, each with a copy of a subset of columns from the corresponding data row, plus a row locator that’s used to locate the full data row. The index leaf level organizes the data in a doubly linked list, maintaining logical index key ordering. SQL Server also maps the nonclustered index leaf level’s data with IAM pages in file order. The aforementioned fragmentation is also relevant to nonclustered indexes.

Page splits. When a new row is inserted into an index (clustered or nonclustered), SQL Server identifies the target page based on the new row’s index key order. If the target page has no room to accommodate the new row, SQL Server will split the page—that is, it will allocate a new page (before or after the current page in the file or in another file), move half the rows from the original page to the new one, insert the row to one of the two pages depending on the index key order of the new row, and update the pointers in the linked list to preserve logical index key ordering among the pages. The only exception to this rule is when the new row has a higher index key value than the highest existing key, in which case SQL Server won’t move half the rows from the original page to the new one. A similar split process will take place if a row is expanded as a result of an update to a variable-length column when there’s no room for the row to expand in the page.

Allocation-order scan. This type of scan can be used to scan heap or index (clustered and nonclustered) data in file order based on IAM pages. With heap data, that’s the only type of scan that can be used. With index data, that’s one of the two possible alternatives. The performance of allocation- order scans isn’t affected by logical fragmentation because this scan is done in file order, not in index order. However, the performance of allocation-order scans is affected by file system fragmentation of the database data files. SQL Server Enterprise Edition supports Advanced Scanning (aka merry-go-round scans), which is applicable only to allocation-order scans (not to index-order scans, which I’ll describe next). Advanced scanning allows a request for an allocation-order scan to piggyback an already-running allocation-order scan, and when both processes being served by the same scan are done, the one that joined later will scan the remaining data (before the point it joined the scan).

Index-order scan. This type of scan can be used to scan index data in logical index key order—that is, scanning the leaf-level pages by following the linked list either from head to tail (forward scan) or from tail to head (backward scan). When this type of scan is used, the data is returned to the next operator in the plan (or to the caller, if that’s the last operator being processed) in logical index key order. The performance of index-order scans is affected by logical fragmentation since out-of-order pages require the disk arm to go back and forth. With zero fragmentation, the performance of an index-order scan should be similar to that of an allocation-order scan: The higher the fragmentation, the worse the performance of an index-order scan compared with an allocation-order scan—up to several times slower. The only exception is when the table is very small (up to 64 pages); an allocation-order scan can be more expensive than an index-order scan because of some overhead involved with this type of scan.

Consistency Problems with Read Uncommitted

The use of the Read Uncommitted isolation level (e.g., with the NOLOCK hint) is a common practice in attempt to overcome blocking problems under the default Read Committed isolation. However, with Read Uncommitted, there are several changes in the way SQL Server handles reads that you might not be aware of. This month and next, I’ll describe read-consistency problems under the Read Uncommitted isolation, resulting from inserts running while processes are reading data. In a later article, I’ll cover read-consistency problems under both Read Uncommitted and Read Committed isolations resulting from updates running while processes are reading data.

A common perception programmers have regarding behavioral changes of reads when working with Read Uncommitted is that the only change is that SQL Server won’t request shared locks when reading data. Therefore, readers can get “dirty data”—that is, uncommitted changes. But the reality is that there are other behavioral changes of readers besides not acquiring shared locks that can result in reading the same row multiple times or skipping rows that existed when the read started.

These read-consistency problems have to do with changes in the storage engine choices when instructed by the relational engine (specifically by the optimizer) to perform index scans (both clustered and nonclustered) with Ordered: False.

The optimizer will use an index scan whenever all the index leaf data is needed. When the data isn’t required to be scanned in order, the Index Scan operator would show Ordered: False in the tooltip box. When the data is required to be scanned in order (e.g., to satisfy an ORDER BY request, GROUP BY, DISTINCT, merge join), the Index Scan operator would show Ordered: True. As an example, examine the execution plans that Web Figure 1 shows for the following three queries:

USE AdventureWorks;
-- Query 1
SELECT * FROM Sales.SalesOrderHeader;
-- Query 2
SELECT SalesOrderNumber FROM Sales. SalesOrderHeader;
-- Query 3
SELECT * FROM Sales.SalesOrderHeader ORDER BY SalesOrderID;

Observe that the first two index scans (i.e., first clustered, second nonclustered) show Ordered: False, whereas the third shows Ordered: True. With the first two scans, as far as the optimizer is concerned, the data can be returned in any order, but with the third scan, it needs to be returned in index key order.

Ordered: False doesn’t necessarily mean that the storage engine in charge of the actual row operations (fetching the data) will utilize the more efficient allocation- order scans based on IAM pages. The storage engine needs to also take into account your consistency expectations based on the isolation level the query runs under. An allocation-order scan running while inserts are issued against the table/index might result in getting the same row multiple times or skipping rows.

As an example of getting the same row multiple times, suppose you have the following index keys in pages of an index shown in file order: Page#101(keys: 50, 60, 70, 80), Page#107(keys: 10, 20, 30, 40), Page#121(keys: 90, 100, 110, 120). The logical index order of the pages is Page#107(keys: 10, 20, 30, 40)-Page#101(keys: 50, 60, 70, 80)-Page#121(keys: 90, 100, 110, 120). Suppose an allocation-order scan starts and reads pages 101 and 107. So far, it has read the rows with the keys 50, 60, 70, 80, 10, 20, 30, 40. At this point (before the scan is finished), another process inserts a row with the key value 65. Suppose the target page 101 is full, so it undergoes a split.

It’s important to note that in a Read Committed isolation, shared locks aren’t kept until the statement is done but rather until SQL Server finishes reading the locked resource (row or page). So, even if SQL Server decides to lock pages when reading, once the page is read, the lock is released. The split process allocates a new page (can be either before or after the page that got split in the file)—say, page 135—and moves half the rows from the source page to the new one, and inserts the new row to one of the two pages based on its key value. The layout of the data in the file is now Page#101(keys: 50, 60, 65), Page#107(keys: 10, 20, 30, 40), Page#121(keys: 90, 100, 110, 120), Page#135(keys: 70, 80). The linked list will reflect the right logical index key order, so the layout in the index is Page#107(keys: 10, 20, 30, 40)-Page#101(keys: 50, 60, 65)- Page#135(keys: 70, 80)-Page#121(keys: 90, 100, 110, 120). The scan continues reading the rest of the pages—121 and 135, reading the rows with the keys 90, 100, 110, 120, 70, and 80. You realize that the rows with the keys 70 and 80 were read twice.

Similarly, if a page that wasn’t yet reached by the scan splits, and the new page is allocated before the scan position, the scan will end up skipping or not reading the rows that moved to the new page. For these potential consistency problems, in Read Committed isolation or higher, the storage engine would opt for an index-order scan when the optimizer requests an index scan with Ordered: False. There are exceptions, which I’ll discuss later.

The reason an index-order scan won’t allow these problems is that the pages are read in logical index key order following the linked list. If a page splits after the scan visited it, both pages (the one that got split and the new one) won’t be visited again because they’re behind in logical order. If a page splits before the scan reached it, the scan will read both pages in order.

An index-order scan won’t be as efficient as an allocation- order scan because fragmentation will negatively impact an index-order scan. However, these particular consistency problems won’t happen. Other consistency problems might happen, but I’ll discuss those in a later article.

What might surprise you is that when an index scan with Ordered: False is running under Read Uncommitted, the storage engine assumes that you have no consistency expectations whatsoever but rather that you’re aiming at the best possible performance. Therefore, the engine will use an allocation-order scan and these consistency problems might happen if rows are inserted while the scan is running. There’s an exception in SQL Server 2005: If the index is small (64 pages or fewer), the storage engine will use an index-order scan because it’s more efficient than an allocation-order scan with so little data. Unfortunately, when you see an index scan in the plan with Ordered: False, there’s no way to tell in the plan whether the storage engine will, in practice, use an allocation-order scan or an index-order scan.

It should be noted that table scans against a heap don’t incur the consistency problems that this article describes for the simple fact that heaps don’t incur page splits, as I explained earlier. However, keep in mind that clustered tables have some advantages over heaps—for example, they let you control the way inserts are distributed over disk drives. Also, even when the table is organized as a heap, if there are nonclustered indexes, those can incur splits, and scans of such indexes could incur the aforementioned read-consistency problems.

Ready for Examples?

Next month, I’ll provide tangible examples with code you can run to see the problems with your own eyes, and I’ll suggest alternatives to avoid these problems. Several people are due thanks for their part in shedding light on the read-consistency problems I’ve covered in this article: Lubor Kollar, Srikumar Rangarajan, Tony Rogerson, Davide Mauri, Andrea Benedetti, Gianluca Hotz, Maciej Pilecki, and Paul Randal.

Comments

Plain text