Question: I’ve started seeing a problem where periodically the scheduled DBCC job fails and I get 823 errors in the error log. I also see errors from the file system too. Is this an I/O subsystem problem?
Answer: These errors are not caused by I/O subsystem issues, although outwardly they seem to be.
Under the covers all of the DBCC CHECK commands create a hidden database snapshot. When the database snapshot is created, any active transactions in the database are effectively rolled back into the database snapshot (not affecting the real transactions, of course) so that the database snapshot is transactionally consistent. Also, while the database snapshot exists, any data file pages in the real database that are going to change need to be copied into the database snapshot before they change so that their state at the time the database snapshot was created is preserved.
These two factors mean that there is the potential for the database snapshot to grow very large – potentially up to the size of the real database. If that happens, there is the possibility that the database snapshot may exceed the size that NTFS can cope with and error 665 will result, as below:
The operating system returned error 665(The requested operation could not be completed due to a file system limitation) to SQL Server during a write at offset 0x000005bd3dc000 in file 'Test.mdf:MSSQL_DBCC8'
This error comes from Windows Server 2008 – Windows Server 2003 will give error 1450. You can read more about this in this blog post from Product Support. You may see a variety of SQL Server errors too – including 823, 7928, 1823, and 3314. Check to make sure that the 823 errors are referencing a file with a name like ‘MSSQL_DBCC’ (the database snapshot files) rather than one of the real database data files. If the latter then you have a real I/O subsystem problem.
If you’re seeing these errors when you’re running your consistency checking job at a time when there shouldn’t be any user activity, the likely cause is that the consistency checking job has overlapped with another job that is causing a lot of activity in the database – maybe a data loading job or routine index maintenance.
We had one case recently that was caused by the index maintenance job running much longer than it should have been. When the database snapshot was created for the DBCC command, an entire index rebuild of a very large clustered index had to be rolled back, leading to the database snapshot size exceeding the NTFS limit and failing.
In this case, rather than relying on expected maintenance job run times, create a system where a potentially long-running job kicks off a job that cannot overlap with it – i.e. an index maintenance job that starts the consistency checking job.