The Curious Case of: the read-retry error

Question: Occasionally I see messages in the error log stating that a read I/O had to be retried several times before it succeeded. What does this mean and what should I do about it?

Answer: These messages are letting you know that a mechanism called read retry had to be used to allow a read operation to complete.

SQL Server 2005 introduced this behavior as a way of working around transient problems in I/O subsystems. Basically, whenever a read operation completes with an 823 or 824 error, SQL Server will retry the read operation 4 more times to see if one of them will succeed.

As an aside, an 823 error is where the operating system returns a read failure to SQL Server, for a variety of possible reasons. An 824 error is where the operating system returns the data to SQL Server and SQL Server then determines there’s a problem with it (e.g. a page checksum failure).

If a read operation fails and all four of the retry operations also fail, then an 823 or 824 error will be raised by SQL Server, the query will be terminated, and the connection to SQL Server will be disconnected.

If a read retry operation is required and it succeeds, the query obviously continues unaffected, and a message like the following appears in the error log:

Msg 825, Level 10, State 2, Line 1.
A read of the file ‘H:\DBFiles\MCCHDB_FG2_File2.ndf’ at offset 0×000000021C4000 succeeded after failing 3 time(s) with error: incorrect checksum (expected: 0×E1544a10; actual: 0×23451ee2). Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be corrected. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

This feature was introduced as a way of avoiding immediate SQL Server downtime from transient I/O subsystem problems. It was introduced in Exchange Server a few years prior to its use in SQL Server and was found to be very effective.

Although the feature is very useful, I have a problem with it—unless you specifically have a SQL Agent alert on message number 825 or are periodically looking through the error log for unusual entries, you’ll never know that read retry was required. This is because the message is only severity 10 – informational – not classified as an actual error. It’s very important to know a read retry occurred because the message is really letting you know that your /O subsystem is creating transient corruption problems—and next time you might not be so lucky and read retry will not work.

To summarize: you should put an Agent alert on message 825 so you know when it occurs, and if it does, investigate the integrity of the I/O subsystem with the usual tools (e.g. SQLIOSim).

Comments

Plain text