Here’s a story that I’d like to share, about the infamous SQL Server 833. A while ago I was on a performance-tuning call at a client site; they’d been experiencing event 833 from some of their production SQL Servers, and they wanted to know if this was a SQL Server problem or a problem with the disk farm.
See also: The 833, What Is It?
To back up just a little, an event 833 (or error 833, as it’s sometimes called) is a signal from SQL Server that I/O requests it has issued are taking too long to come back: longer than 15 seconds. An 833 message reads like "SQL Server has encountered 49 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [...file full path...] in database [...database name...]…."
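If you want to tally these messages rather than eyeball them, the wording quoted above is regular enough to parse. Here’s a minimal Python sketch, assuming you’ve already exported the error log lines as plain text; the `parse_833` helper, the `Event833` type, and the sample file path and database name are all made up for illustration and aren’t from the client’s system.

```python
import re
from typing import NamedTuple, Optional

# Pattern follows the 833 wording quoted above; anything after the
# database name is ignored.
EVENT_833 = re.compile(
    r"SQL Server has encountered (?P<count>\d+) occurrence\(s\) of I/O requests "
    r"taking longer than (?P<seconds>\d+) seconds to complete on file "
    r"\[(?P<file>.*?)\] in database \[(?P<database>.*?)\]"
)


class Event833(NamedTuple):
    count: int      # number of slow I/O requests reported
    seconds: int    # threshold named in the message
    file: str       # data or log file path
    database: str   # database name


def parse_833(line: str) -> Optional[Event833]:
    """Return the parsed fields of an 833 message, or None if the line is not one."""
    m = EVENT_833.search(line)
    if m is None:
        return None
    return Event833(int(m["count"]), int(m["seconds"]), m["file"], m["database"])


# Hypothetical sample line; the path and database name are placeholders.
sample = (
    "SQL Server has encountered 49 occurrence(s) of I/O requests taking longer "
    "than 15 seconds to complete on file [D:\\data\\example.mdf] in database [ExampleDb]"
)

if __name__ == "__main__":
    print(parse_833(sample))
```

Once the messages are parsed, grouping them by file and by time of day is a quick way to see whether the 833s cluster around particular files or load periods.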
This message is often interpreted as a disk I/O issue, and in many cases the 833 can be resolved by optimizing disk operations or just replacing older, slower equipment with new, faster disks. But this one case didn’t resolve quite so easily. You see, the disk farm attached to the SQL Servers in question was composed of flash drives: a flash SAN.
Let’s complicate this situation:
The production SQL Servers that were getting the 833s are on the far right of this system diagram. The 833s were coming from different database data files and log files, sometimes during heavy transaction activity and other times during very light transaction activity. In other words, the 833s were sporadic and unpredictable.
Given all the "upstream" data stores, processes, and source data implications (no one on staff seemed to really understand how the production system worked), it’s not surprising that there would be some sort of issue in the transaction stream. The stream starts at point-of-sale terminals and data aggregators on the far left and proceeds through load balancers and "soft routers" (this is a distributed environment) and on to error-handling routines before finally arriving at a place where the SQL Servers can begin to do their job. The good news is that this system is able to handle many millions of transactions a day with only the occasional 833 hiccup. The 833 "wait" always resolves itself within a minute or two, but by then, the upstream congestion is significant.
The client set up a test environment with the intent of replicating the 833 situation to determine root cause. There was only one problem with the test setup…it bore little if any resemblance to production. And, while I was onsite, we were never allowed to use advanced analysis tools on the production systems.
Spoiler Alert! The exact root cause of the 833s was never determined…
So now the challenge to you, dear reader: Given what you know so far, what would you surmise might be the cause of the 833s which are being thrown by the SQL Servers? Remember, that SAN is full of flash drives…