Disappearing Database Corruptions

Q: Every so often our system produces an 824 error and our weekly consistency-checking job fails. However, when I run DBCC CHECKDB during the day there are no errors. What’s going on?

A: This phenomenon is fairly common—I call it “disappearing corruptions,” although that’s a little bit of a misnomer. It can be very frustrating, especially when it happens repeatedly, to have your automated consistency checking job fail or to be alerted to an 823/824 corruption problem, but have nothing show up when you run DBCC CHECKDB manually. How can these corruptions just go away? There are two answers that explain this phenomenon: changes to the database and transient I/O subsystem problems.

When DBCC CHECKDB completes, it’s giving you information about all the data file pages that comprised the database at the time that DBCC CHECKDB started running. This is because it creates a database snapshot when it begins, in order to provide an unchanging, transactionally consistent view of the database, which makes the consistency checks much simpler to engineer. This means that if anything in the database changes between successive executions of DBCC CHECKDB, the set of database pages being checked might be different. If a page was corrupt on one execution of DBCC CHECKDB, it might have been deallocated, and therefore is no longer part of the database on the next execution of DBCC CHECKDB. This means it won’t be processed by DBCC CHECKDB, so the corruption won’t be reported any more—and will seem to have disappeared.

What causes pages to be deallocated? Common operations that can deallocate a large amount of database pages are index rebuilds, which are usually performed by a maintenance job. If DBCC CHECKDB runs before the index maintenance job, it’s entirely possible that some pages that were found to be corrupt are subsequently deallocated by index maintenance.

Is this a problem? Yes because the corruptions can’t be investigated any further, although there’s nothing that can be done about it. It’s not a bug or anything wrong with how SQL Server performs operations.

The second cause of the disappearing corruptions phenomenon is transient I/O subsystem problems. These do occur and are the reason read-retry functionality was added to SQL Server in SQL Server 2005. In SQL Server 2005 and later, if a page read fails, SQL Server will retry the read four times before raising an 823 or 824 error. You need to know that read-retry was needed because that’s an early warning that your I/O subsystem is failing. For more information, see my blog post “ A little-known sign of impending doom: error 825 .”

If a corruption problem is transient in the I/O subsystem, obviously it might not show up again the next time DBCC CHECKDB is executed. The best thing to do in this situation is to enable Agent alerts so that you know about these problems as soon as they occur. (For more information about Agent alerts, see my blog post “ Easy monitoring of high-severity errors: create Agent alerts.” )

Comments

Plain text