Are Your DMVs Lying to You?

I was recently troubleshooting a problem with a critical production server running SQL Server. The server's CPU was pegged at 100 percent. At that time of day, the server's CPU usage was normally around 30 percent, so I knew something was wrong. A few weeks earlier, a similar problem had arisen, and I had a hunch that a stored procedure was probably causing the trouble. So, this time around, I was prepared to gather more information. I had, ready to go, queries to obtain the cached plan and to obtain statistics from the dynamic management view (DMV) sys.dm_exec_query_stats.
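A query along these lines retrieves both the runtime statistics and the cached plan in one pass. This is a sketch: joining sys.dm_exec_query_stats to sys.dm_exec_sql_text and sys.dm_exec_query_plan via CROSS APPLY is the standard pattern, but the procedure name in the filter is hypothetical.

```sql
-- Runtime statistics plus the cached plan, one row per cached statement.
SELECT
    qs.creation_time,         -- when the plan was cached (i.e., last compiled)
    qs.last_execution_time,
    qs.execution_count,
    qs.total_worker_time,     -- cumulative CPU time, in microseconds
    st.text AS query_text,
    qp.query_plan             -- XML showplan; open it to inspect join operators
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE st.text LIKE '%MyStoredProc%'   -- hypothetical procedure name
ORDER BY qs.total_worker_time DESC;
```

Saving this output during normal operation gives you the "good" baseline to compare against when trouble starts.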

Related: A SQL Server 2005 DMV Cleans Up Your Indexes

Just as I suspected, the creation_time column in sys.dm_exec_query_stats showed that the stored procedure had recompiled just prior to when the high CPU problem started. I then compared the plan cached during the high CPU cycle to a "good" plan—that is, a plan that I had saved from when the server's CPU was at its normal rate. The comparison revealed that during the high CPU cycle SQL Server was doing a nested loop join when it should have been doing a hash-match join.

Based on this information, I decided to recompile the stored procedure with sp_recompile. Within a minute, CPU usage returned to normal. I again ran the query to get the cached plan. This newest plan looked the same as the good plan I had saved. Everything supported my suspicion that the stored procedure had recompiled and cached a suboptimal plan, which caused the high CPU—that is, everything except for some of the other statistics captured by sys.dm_exec_query_stats.
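The recompile step itself is a one-liner: sp_recompile marks the procedure so its cached plan is discarded and a fresh plan is compiled on the next call. The procedure name here is hypothetical.

```sql
-- Invalidate the cached plan; the next execution compiles a new one.
EXEC sp_recompile N'dbo.MyStoredProc';  -- hypothetical name
```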

The high CPU problem spanned roughly 35 minutes, from around 7:35 a.m. to 8:10 a.m. I captured information from sys.dm_exec_query_stats at 8:08 a.m., just prior to when I recompiled the stored procedure. Here's what the DMV reported:

  • last_execution_time = 8:08 a.m.
  • execution_count = 25
  • total_worker_time = 5465822 microseconds

I became suspicious of these statistics for a couple of reasons. First, 25 executions in 35 minutes seemed really low. Second, how could 5.46 seconds of worker time cause a 4 × 2.3GHz processor to be pegged at 100 percent CPU usage for nearly 35 minutes? It just wasn't adding up, which left me wondering whether the information returned by sys.dm_exec_query_stats was wrong.

Well, the information in sys.dm_exec_query_stats wasn't wrong, just misleading. Here's what happened. The stored procedure was recompiled at about 7:35 a.m. In the next few minutes, it executed 25 times. But then the performance of the stored procedure became so bad that each call started timing out. (I was able to confirm this by working with the developers and digging through the logs on a few of the application servers.) When the stored procedure call is made, the last_execution_time column is updated. But columns like execution_count and total_worker_time aren't updated until the execution completes, which makes sense. For example, SQL Server can't possibly update the total_worker_time column until the query has completed. Until that time, it wouldn't know how long the execution took to run. The same holds true for columns like total_logical_reads and total_logical_writes.
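You can observe this behavior for yourself. The sketch below assumes a scratch database with a deliberately slow test procedure (the name and WAITFOR delay are hypothetical): start the slow call in one session, then inspect the DMV from a second session before it finishes.

```sql
-- Session 1: a deliberately slow call (hypothetical test procedure,
-- e.g., one containing WAITFOR DELAY '00:01:00').
EXEC dbo.SlowProc;

-- Session 2: run this while Session 1 is still executing.
-- last_execution_time already reflects the in-flight call, but
-- execution_count and total_worker_time won't include it until it completes.
SELECT qs.last_execution_time,
       qs.execution_count,
       qs.total_worker_time
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%SlowProc%';
```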
