Dissecting SQL Server’s Top Waits

Okay. You’ve read "Getting to Know Wait Stats" (October 2007). You know about wait stats—what they are, how to collect them, and how to report on them. Now that you have that foundational knowledge, I want to concentrate on a few wait stats that typically account for most overall system waits. While doing recent performance and scalability audits for clients, I noticed a common pattern regarding the most frequent waits I’m seeing. So, I felt the time was ripe to talk about each.

Let’s start the discussion by taking a look at Table 1, which displays a list of 14 distinct wait types, in descending order of total wait time. Of course, this table is merely a partial list of the total wait types; your top waits might or might not be the same as these. However, it’s likely that many of these wait types are indeed part of your top waits. For the purposes of this article, let’s group these 14 wait types into five unique categories: latches, locking, network, I/O, and parallelism.

Latches

The LATCH_xx and PAGELATCH_xx waits aren’t related to physical I/O activities. LATCH_xx waits are typical of contention for internal resources or structures other than the buffer pool, particularly when you’re using heaps or text datatypes. Addressing the problems associated with the other wait types will generally also help pure LATCH waits. PAGELATCH_xx waits are directly related to contention in the memory caches, including the buffer pool. High waits can indicate one of several problems. One potential problem is inadequate amounts or poorly optimized memory, in which there’s a lot of contention for the existing pages in the caches. Adding more memory or tuning your existing system usually helps. Another cause might be extremely high volumes of inserts in the same range of pages, particularly if it results in page splits. This problem can be further complicated if there are many selects on the same range. (You would need to have many thousands of transactions per second to see this problem occur in SQL Server 2005.) If this is the problem, some index tuning might be necessary. PAGELATCH_UP is an allocation type of wait and typically shows up in the tempdb database under heavy use. Adding more data files to tempdb can significantly reduce or eliminate these waits.

Locking

The LCK_xx waits are associated with locking and blocking. High volumes of these waits indicate that something is preventing the efficient execution of the storage engine’s locking mechanisms. This problem is mostly caused by transactions that are too long or by a lack of proper index usage, resulting in locking or reading more rows than necessary, thus prolonging the transaction and increasing the chances of further blocking. Poorly configured hardware can also contribute to high LCK_xx waits because each action simply takes longer to finish, often leading to blocking on multiuser systems.

Network

Most people assume that high network waits are the result of problems directly related to the physical network between the server and the clients. Although this problem can affect the number of NETWORKIO waits, the reality is that high network waits are most likely caused by poor client response. If the client can’t process the results as fast as SQL Server can send them, you’ll start to see these waits. Poorly written client applications and overburdened hardware on the client side can lead to high network waits.

I/O

The PAGEIOLATCH_xx, IO_COMPLETION, and WRITELOG waits fall into the I/O category. PAGEIOLATCH_ xx waits are specific to disk-to-memory transfers and almost certainly indicate problems with your disk subsystem. At the very least, they suggest that your subsystem is having trouble keeping up with the volume of I/O requests. IO_COMPLETION simply means that the system is waiting for I/O tasks to finish. Finally, WRITELOG indicates that the system is waiting for transaction log–related write requests to finish. You can reduce most of these waits by using one of several tactics:

Tune your queries so that they read less data to satisfy the requests
Change the placement of the data and log files so that they don’t compete with each other on any particular drive array (e.g., separate the files onto separate physical drives)
Change the RAID array types and configurations to improve performance (e.g., add more disks, change the array from RAID 5 to RAID 10)

Parallelism

CXPacket waits are a direct result of parallel processing— more specifically, inefficient parallel processing. This wait appears when one thread in a parallel plan is waiting on one or more other threads to finish before it can proceed. High totals of this wait are sure-fire indications that you might not be benefiting much from the use of multiple processors or many multiple processors in parallel. Consider lowering the MAXDOP setting either at the server level or by using query hints (if you can narrow down the culprits to particular queries). See "Boost Performance with Parallel Processing" for more details.

Further Study

The Microsoft article “Description of the waittype and lastwaittype columns in the master.dbo.sysprocesses table in SQL Server 2000 and SQL Server 2005” goes into much more detail about the different waits.

See associated Table

Comments

Plain text