Troubleshooting Slow Servers

When troubleshooting a performance problem within a SQL Server database instance, there are a variety of places that you can look to determine its root cause. Some of the biggest performance bottlenecks usually boil down to several causes:

Missing indexes
Poor index design
Poorly designed database schema
Inadequate storage I/O subsystem
Buffer pool too small
Slow network
Wrong technology used

Resolving these problems might seem difficult, but if you have the right tools available, they can be easy enough to fix so that performance will improve.

Missing Indexes

Disk performance problems and an increased number of deadlocks are some of the potential indicators that you might have indexes missing. One way to easily determine if there are missing indexes is to use two dynamic management views (DMVs): sys.dm_db_missing_index_details and sys.dm_db_missing_index_columns. Between these two DMVs, you can easily identify many of the indexes that need to be created to improve performance.

However, there's a catch: The system doesn't understand that indexes already in existence might only need another column added to their included columns list. So, SQL Server might recommend that another entire index be created when all that's needed is a new column in the existing index. Because of this, you shouldn't just take the information from the DMVs and create all the specified indexes. The reported indexes should be compared against the indexes that already exist in the database to ensure that duplicate indexes aren't being created. Duplicate indexes can lead to additional writes, updates, and deletes, causing performance problems.

A better approach is to use a query like that in Listing 1 to identify the indexes that are missing from the database on which the query is being run. This query uses the db_id() system function to restrict the output to that from the current database. So, to run it in your database, simply specify that database in the Available Databases drop-down box in SQL Server Management Studio (SSMS) or add a USE statement before the query.

SELECT a.avg_user_impact
  * a.avg_total_user_cost
  * a.user_seeks,
  db_name(c.database_id),
  OBJECT_NAME(c.object_id, c.database_id),
  c.equality_columns,
  c.inequality_columns,
  c.included_columns,
  c.statement,
  'USE [' + DB_NAME(c.database_id) + '];
CREATE INDEX mrdenny_' + replace(replace(replace(replace
  (ISNULL(equality_columns, '')
  + ISNULL(c.inequality_columns, ''), ', ', '_'),
  '[', ''), ']', ''), ' ', '') + '
  ON [' + schema_name(d.schema_id) + ']
  .[' + OBJECT_NAME(c.object_id, c.database_id) + ']
  (' + ISNULL(equality_columns, '') +
  CASE WHEN c.equality_columns IS NOT NULL
    AND c.inequality_columns IS NOT NULL THEN ', '
    ELSE '' END + ISNULL(c.inequality_columns, '') + ')
    ' + CASE WHEN included_columns IS NOT NULL THEN
    'INCLUDE (' + included_columns + ')' ELSE '' END + '
    WITH (FILLFACTOR=70, ONLINE=ON)'
FROM sys.dm_db_missing_index_group_stats a
JOIN sys.dm_db_missing_index_groups b
  ON a.group_handle = b.index_group_handle
JOIN sys.dm_db_missing_index_details c
  ON b.index_handle = c.index_handle
JOIN sys.objects d ON c.object_id = d.object_id
WHERE c.database_id = db_id()
ORDER BY DB_NAME(c.database_id),
  ISNULL(equality_columns, '')
  + ISNULL(c.inequality_columns, ''), a.avg_user_impact
  * a.avg_total_user_cost * a.user_seeks DESC

When reviewing the output from the query in Listing 1, look for queries that have a high value in the first column. The higher the number, the more savings that will be seen by adding an index.

Poor Index Design

Poor index designs for database tables will usually manifest as slow running queries and queries that have a high execution plan cost. The easiest way to identity these problems is to query the procedure cache because queries' execution plans remain in the cache as long as there is enough memory to store them. Listing 2 shows a sample query that looks for queries with a high total cost. When reviewing the output from Listing 2, look at the StatementSubTreeCost column. The higher the number, the more expensive the execution plan.

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
WITH XMLNAMESPACES
(DEFAULT
  'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
   query_plan AS CompleteQueryPlan,
   n.value('(@StatementSubTreeCost)[1]', 'VARCHAR(128)')
    AS StatementSubTreeCost,
   dm_ecp.usecounts
FROM sys.dm_exec_cached_plans AS dm_ecp
CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS dm_eqp
CROSS APPLY query_plan.nodes
  ('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple')
  AS qp(n)
ORDER BY n.value('(@StatementSubTreeCost)[1]',
  'VARCHAR(128)') DESC
GO

In SQL Server 2012, you can set up an Extended Events session to capture the execution plans for queries. However, it isn't recommended that Extended Events be used to capture execution plans in high-load systems as doing so could overload the system by 21% or more by capturing the showplan_xml column.

For more information on how to properly design indexes, see SQL Server 2008 Query Performance Tuning Distilled (APress, 2009) by Grant Fritchey and Sajal Dam. Another book you might want to check out is Expert Performance Indexing for SQL Server 2012 ( APress) by Jason Strate and Ted Krueger, which is scheduled to be released in May 2012.

Poorly Designed Database Schema

One of the biggest killers of database performance is a poorly designed database schema. Evidence of a poorly designed database schema can be seen when lots of I/O requests occur while queries are being run. The easiest way to look for this evidence is to use SQL Server Profiler to capture the queries that have the highest I/O load, then run those queries in SSMS with SET STATISTICS IO enabled. Between the table I/O outputs and the execution plans, you'll get a good indication of which table is having a performance problem.

Fixing Schema Problem

After the problematic table has been identified, fixing its schema problem is the next step. Sadly, fixing the problem usually isn't as simple as finding the problem. The table might need to be normalized or denormalized, depending on the table's design. Or perhaps the table needs to be vertically partitioned because there's too much data in it. Conversely, maybe the vertical partitioning was taken a little too far. In case you're unfamiliar with vertical partitioning, it's a technique in which a single table is broken into multiple tables, often using a date as the breakpoint. For example, instead of putting the sales history for every year into a single table, you could put the sales history for each year into a separate table.

Inadequate Storage I/O Subsystem

The most obvious way to determine whether a SQL Server database instance is having storage performance problems is to look in the SQL Server ERRORLOG file. A message about slow I/O in the ERRORLOG file is a good indicator that there's something wrong with the storage I/O subsystem.

However, not seeing any errors about slow I/O in the ERRORLOG file doesn't rule out that your storage I/O subsystem isn't the cause of a performance bottleneck. Not seeing these errors just means that your storage isn't ridiculously overtasked and not keeping up. A quick way to look for I/O problems is to download Adam Machanic's free sp_WhoIsActive stored procedure. When you run sp_WhoIsActive, it will give you the current wait type of all the currently running processes in the system. Look for wait types that report I/O as being the problem.

Another way to look for I/O problems is to query the sys.dm_os_wait_stats DMV with code such as

SELECT *

FROM sys.dm_os_wait_stats

ORDER BY wait_time_ms desc

If the first few rows returned from this query include some I/O wait types, then a slow I/O subsystem might be the problem.

You might have noticed that I said that these wait types might indicate a slow I/O problem and not that they do indicate a slow I/O problem. That's because a poorly indexed database server can also report I/O problems when the I/O subsystem is perfectly sized for the application, but you're asking the I/O subsystem to do a lot more work than it should be doing. In other words, the I/O problem is actually the result of an index problem. To resolve it, you need to add indexes to the system. Because of this, whenever I look at a new system in which the client wants to upgrade the hard drives, I always start with the indexes because that might very well fix the problem.

After the problem has been identified as actually being a storage I/O subsystem problem, there are a few directions that you can go to fix it. For large reporting databases, table partitioning might help. There are two different techniques that you can consider: partitioning for partition elimination and partitioning for CAPEX savings. No matter which approach is taken, a solid understanding of the data and how it's used is required.

Partitioning for partition elimination. With this technique, the goal is to reduce the amount of data that SQL Server needs to scan through when doing table scans. Typically, this is done by using a static window approach with many partitions created in the system and data. For example, you could put the data for each month into its own partition. This way, when users run a query, SQL Server needs to scan through only one partition instead of going through all the data.

Partitioning for CAPEX savings. With this technique, the goal is to put the most frequently used data on really fast disks, such as flash drives, and put the less frequently used data on slower, less expensive disks, such as Serial ATA (SATA) disks. This often makes flash drives a much more affordable option, because you only need a small number of them.

In either case, the same techniques are used to create the partition scheme and function, as discussed in "Data Warehousing: Horizontally Partitioning the Fact Table ." No matter which technique you use, you must have a complete understanding of how the data is used to successfully set up table partitioning.

Another way to fix an inadequate storage I/O subsystem is to simply buy new hardware that can handle the increased workload. Although this is probably the easiest solution to implement, getting management to sign off on the purchase of new hardware might not be that easy.

Several vendors offer storage solutions that can be configured to store data on multiple tiers on a single volume over different speed disks. This gives the same basic benefit as partitioning for CAPEX but without needing to dive into how the table needs to be partitioned. The storage array automatically moves the data that's most often used to the fastest disks, which leaves the stale or idle data on the slower, less-expensive disks.

Buffer Pool Too Small

The buffer pool is where SQL Server stores data pages that it has read or written to recently. A small buffer pool will generally cause performance problems, because it will put too much pressure on the disk subsystem. There are a couple of different performance monitor counters that you can look at to see if the buffer pool size is healthy. Both counters are located within the SQLServer:Buffer Manager performance counter object.

The first counter to look at is Database Pages. This counter tells you how much memory is currently allocated to the buffer pool. This counter is in pages, so to get it into gigabytes, multiply the value by 8, then divide by 1,048,576.

The second counter to look at is Page Life Expectancy. This counter tells you how long, in seconds, SQL Server expects that you'll be keeping the data in memory before releasing the page so that additional data can be loaded into memory. There is no magic number that says that the buffer pool is healthy. There's a myth floating around the Internet that says a Page Life Expectancy of 300 means that the system is healthy. However, this isn't the case. On a system that has 100MB of RAM, a Page Life Expectancy of 300 seconds might be a good number to go by. However, a Page Life Expectancy of 300 on a system that has 256GB of RAM would be a disaster.

Fixing a small buffer pool requires adding more memory to the server and configuring SQL Server to use the additional memory. The only other way to increase the amount of data that SQL Server can load from one database into the buffer pool is to move databases from one server to another so that the remaining databases have access to a higher percentage of the database engine's buffer pool.

Slow Network

Another potential server bottleneck is the network. The easiest way to identify this problem is to look at the wait statistics on the system. A wait type of ASYNC_NETWORK_IO indicates that the network between SQL Server and the client computers is having a performance problem. If the client computers are located at a remote site, this could be the cause of the ASYNC_NETWORK_IO wait types.

If all the client computers are local, ASYNC_NETWORK_IO wait types might be the result of an older version of SQL Server or an older network switch being used. A slow or misconfigured network port could also be the cause of ASYNC_NETWORK_IO wait types. For example, having a network port configured for 10/half duplex could cause performance problems. In addition, there could be performance problems if a network port is configured to auto-detect the network speed, but the network switch and server don't successfully negotiate the fastest possible network speed.

Fixing a slow network isn't something that a DBA will be able to handle alone. It will require a coordinated effort between the network administrator, the network provider if there's a WAN involved, and the DBA.

Wrong Technology Used

Some database performance problems can be traced back to simply using the wrong technology to get the job in question done. Usually this revolves around running large reports against production OLTP databases or databases that have an OLTP schema. Many of these reports end up doing large, complex aggregations. Although SQL Server can do the aggregations, it isn't the best technology to use for them. For large, complex aggregations, SQL Server Analysis Services (SSAS) is often the more appropriate tool because SSAS pre-aggregates the data when the databases are loaded. As a result, when a query asks for aggregated data, it has already been aggregated and the results can simply be returned instead of being processed.

The Cost of Poor Performance

There are many reasons why SQL Server database engines can slow down. It could be a database design problem, system design problem, infrastructure design problem, or code problem. No matter the reason, it's important to realize that slow SQL Server database engines can have more serious consequences than just people complaining. Customers might not get their orders placed. Application users might be spending minutes instead of seconds waiting for screens to load. Users might be waiting hours instead of minutes for reports to run. All of these things will cost the company money -- and nothing will cause a company to want to get its database performance problems resolved more quickly than being able to quantify the money being lost because of slow performance.

Comments

Plain text