Memory Matters

Analyzing SQL Server memory utilization is one of the techniques the SQL Server Customer Advisory team uses to help customers discover and rectify various performance problems. In "Minding Memory," September 2004, we looked at how SQL Server utilizes memory and how you can monitor that utilization. Now, let's learn how to use information about memory utilization to detect and correct performance problems.

Looking for Memory Pressure

Sometimes, application performance problems might first manifest themselves in ways that end users can see, such as poor response time or poor throughput. DBAs and developers then start investigating the problems to find the underlying causes. In other cases, an observant DBA might notice an off-balance system characteristic that end users don't see, such as high CPU usage or a high disk-response time, and start looking for causes. Regardless of when you start your investigation, one of your first steps should be to find out whether your system is under memory pressure. Because SQL Server is designed to dynamically manage memory, finding this answer isn't as straightforward as it might seem. For example, on a system devoted to SQL Server, the SQL Server process (sqlservr.exe) commonly acquires most of the memory on the system. This behavior is normal because SQL Server is designed to use as much system memory as possible, as long as no other applications are competing for memory and SQL Server memory growth isn't restricted by the max server memory configuration parameter. Thus, the amount of memory the SQL Server process has allocated doesn't tell you whether the system is under memory pressure. (For information about memory-utilization differences in 64-bit and 32-bit SQL Server, see the Web-exclusive sidebar "64-Bit vs. 32-Bit Memory Management.") And although you might think you can detect memory pressure by looking at whether the SQL Server process is paging, this method doesn't work either because SQL Server is designed to avoid paging.

So how do you detect memory pressure? You can start by looking at the Performance Monitor (PerfMon) counters SQL Server:Memory Manager:Target Server Memory and SQL Server:Memory Manager:Total Server Memory. Target Server Memory tells you how much memory SQL Server would allow itself to use based on external factors such as other applications' memory requirements, total system memory, and the max server memory configuration parameter. Total Server Memory tells you how much memory SQL Server is currently using. If over time the value of Total Server Memory is consistently lower than the value of Target Server Memory, there's no memory pressure on the system: SQL Server isn't using as much memory as it could if it needed to. However, if Total Server Memory is equal to or greater than Target Server Memory, memory pressure could be the cause of the problem. To determine where the memory pressure is coming from and resolve it, you need to examine each of the ways SQL Server uses memory, which we described in "Minding Memory." Let's look at what you can learn from memory utilization in the database page cache, the plan cache, the query workspace, and locks.

Database Page Cache

Counters under the SQL Server:Buffer Manager PerfMon object provide clues about memory pressure in the database page cache. First, look at the Database pages counter and compare it to the Total pages counter. If Database pages is a significant fraction (more than 50 percent) of Total pages, you know you have a data-intensive application. Two other counters, Buffer Cache Hit Ratio and Page life expectancy tell you how well the page cache is coping with the workload. The Buffer Cache Hit Ratio tells you what fraction of page requests SQL Server is servicing out of the cache and correspondingly, what fraction of page requests are going to the I/O system. Keep in mind that with a Buffer Cache Hit Ratio of 99 percent, you incur 10 times as much I/O as you would at 99.9 percent, and at 90 percent, you incur 10 times as much I/O as at 99 percent. The Page life expectancy counter tells you how long you can expect a page to reside in memory under current conditions. Low values (typically, less than 300 seconds) mean the system is "churning" the buffer pool at a high frequency, a potential sign that memory pressure could be causing performance problems.

If your system has a low Buffer Cache Hit Ratio or low Page life expectancy, check your I/O system to see how well it's coping with the I/O workload. In particular, a high value (typically more than 10ms) for the Avg. Disk sec/Read or Avg. Disk sec/Write PerfMon counters in the PhysicalDisk object might indicate that that the I/O system is struggling. Be careful to inspect these counters separately for each drive letter; if you look only at the averages, the numbers you see might mask a problem on a particular drive.

If you see low values for the Buffer Cache Hit Ratio and Page life expectancy counters, you might consider adding more memory to boost application performance. However, before you go cut that check for more RAM, a little investigation of the underlying causes is in order. Perhaps your buffer cache suffers because SQL Server is using too many table or index scans instead of using index seeks to restrict access to specific rows. Scans, especially of large tables or indexes, are likely to cause a drop in the Buffer Cache Hit Ratio because it's unlikely that all the pages of a large object can remain cached in memory. Scans might also cause a drop in Page life expectancy by forcing out other pages residing in the buffer cache. Index seeks restrict access to a particular range of rows and thus cause fewer page reads. The SQL Server:Access Methods:Full Scans/sec PerfMon counter lets you monitor the number of full scans SQL Server is performing in a second. If you determine that overuse of scans is a problem, you can use SQL Server Profiler to narrow down which SQL statements are causing the scans by monitoring the Showplan Statistics event under the Performance category in Profiler. Analysis of the Profiler data might show that you need to add an index to a table or perhaps reorder the columns of an existing index to turn an index scan into an index seek. For information about how to use SHOWPLAN output information, see the Microsoft article "HOW TO: Troubleshoot Slow-Running Queries on SQL Server 7.0 or Later." (You can find the URL for this article in Related Reading.)

Another factor that might reduce the performance of the buffer cache is excessive denormalization, which leads to redundant data and thus increases the overall number of pages in the database. When the buffer cache has to serve more data pages, it becomes less efficient. Another common cause of poor buffer cache performance involves form-based update programs, which display a row to users, who can choose the columns they want to update. Then, the application writes back all the columns to the database instead of updating just the ones the users changed. This kind of application not only wastes CPU cycles on the server by processing updates of unchanged columns, it also causes SQL Server to update all indexes to reflect the potentially changed column values. This behavior causes dozens of unnecessary page reads and writes.

Plan Cache

SQL Server requires a query plan to execute a SQL statement, but producing a query plan from a SQL statement (i.e., compilation) is a CPU-intensive process. SQL Server can cache query plans in the plan cache and reuse them, thus saving CPU resources, but because memory is limited, SQL Server might discard less frequently used plans. And in some special circumstances, as the Microsoft article "Query Recompilation in SQL Server 2000" (in Related Reading) describes, SQL Server might have to discard a plan it has cached for a particular SQL statement and compile a new plan, a process called recompilation. The SQL Server:Buffer Manager:Procedure Cache Pages counter tallies the total number of pages in the plan cache. If this number is a significant fraction (typically, greater than 25 percent) of the total number of pages in the buffer pool, your application is plan intensive. Counters under the SQL Server: Cache Manager object tell you how many plans of each kind (e.g., AdHoc, Prepared, Procedure) are cached and how many pages they occupy.

OLTP workloads typically comprise many similar statements, executed with different values for parameters such as customerID or orderID. In our experience, problems related to compilation, recompilation, and efficient use of the plan cache are the most frequent source of performance problems in OLTP applications running on SQL Server.

How do you tell if the plan cache is effectively avoiding unnecessary compilations? Recall from "Minding Memory" that the Cache Hit Ratio counter is cumulative from instance startup, so it isn't a good measure of the current effectiveness of the plan cache. However, you can look at the counters under the SQL Server:SQL Statistics object to get a pretty good picture of the plan cache's efficiency. In particular, if SQL Compilations/sec is a significant fraction (more than 10 percent) of SQL Batch Requests/sec and your application is an OLTP application, your application likely isn't making effective use of the plan cache. The Web-exclusive sidebar "Optimizing Parameterization for Query Plan Reuse," explains some common reasons why this might be the case and shows how to improve query-plan reuse.

Sometimes, plan reuse isn't appropriate. For example, sometimes the size of query result sets (or interim results) vary greatly depending on the parameters you pass to the WHERE clause. In such cases, the access strategies for different statements might need to be different. But in general, plan reuse reduces CPU utilization and can greatly improve the performance of a high-throughput OLTP system.

Another design choice that can lead to poor plan-cache performance is the creation of separate objects such as views or stored procedures for every user. Application designers sometimes use this technique to restrict the data that each user can access. However, since the views or stored procedures for each user would be different objects, there can be no plan reuse even though the objects are all similar. If you have hundreds or thousands of users, this design can put unnecessary memory pressure on the plan cache and cause plans to be discarded frequently. An alternative design that requires fewer objects uses the SUSER_SID() and SUSER_SNAME() functions to control the data each user sees. This setup yields better plan-cache behavior. You can read about this approach in Sean Maloney's article "For Your Eyes Only."

Memory pressure originating in any part of SQL Server can cause eviction of cached query plans and unnecessary compilations. One clue this might be happening is a sudden drop in the SQL Server:Buffer Manager:Page life expectancy counter. You can monitor stored procedure plan compilations, reuse, and evictions by using SQL Server Profiler to look for the cache events that Table 1 lists. For a particular stored procedure, an SP:CacheRemove event followed by the SP:CacheMiss event (which tells you that SQL Server didn't find the query plan it was looking for), followed in turn by SP:CacheInsert (which tells you that SQL Server created a new query plan in the plan cache) usually indicates memory pressure. In contrast, when you have plan reuse and little memory pressure, you'll see more SP:CacheHit and SP:ExecContextHit events in Profiler.

Workspace Memory

Complex decision-support-processing or data-warehouse applications typically generate a heavy demand for workspace memory, resulting in a high value for the SQL Server:Memory Manager:Granted Workspace Memory (KB) counter. In such cases, it's important to monitor the length of the memory grant queue (which we described in "Minding Memory") by checking the SQL Server:Memory Manager:Memory Grants Pending counter. A long queue—especially compared to the SQL Server:Memory Manager:Memory Grants Outstanding counter—means the system is overloaded with too many memory-consuming queries. You can use Profiler to understand which queries are using memory grants and why. The Integer Data column of Profiler trace event 28 shows you the size of a statement's memory grant in kilobytes. An analysis of the Showplan event for the statement might lead you to discover a missing index for a JOIN or SORT operation. Or your analysis might show that the query is too broad and needs the user or application to narrow the query's focus to process a smaller set of rows.

SQL Server relies heavily on statistics to estimate the amount of workspace memory a query will require. For example, for a hash join, the estimated number of rows and row size of the first input to the join (the build input) determine how much memory the join needs. Problems arise if the estimate isn't accurate. If the estimate is low, the hash table won't fit in the allocated memory and rows will overflow to disk, thus making the query execution slow. The Hash Warning and Sort Warning events in Profiler's Errors and Warnings category show you such overflows. If the estimate is too high, SQL Server thinks it needs to assign the query a larger grant than it actually requires, and the query might wait longer than necessary in the memory grant queue. You can use the Show Plan Statistics event to determine exactly where the query optimizer's estimate of row counts begins to diverge from the actual number of processed rows. In general, keeping statistics up-to-date gives the query optimizer the best chance of making accurate size estimates. However, sometimes the nature of the data distribution or the complexity of the query make accurate prediction of result sizes difficult. You might need to use join method hints (such as HASH JOIN or LOOP JOIN) or join order hints (such as FORCE ORDER) in such situations.

Lock Memory

All the uses of memory we've looked at so far—database page cache, plan cache, and workspace memory—are important to optimizing performance. You generally want SQL Server to devote as much system memory to these requirements as possible. Locks, on the other hand, represent overhead that's necessary for SQL Server to guarantee the transaction-isolation level you choose for your application. Therefore, you want to keep lock memory at the minimum necessary for application consistency. Excessive use of locks might tell you that your application is using a higher isolation level than it needs. For example, in REPEATABLE READ and SERIALIZABLE isolation levels, SQL Server acquires and holds until the end of the transaction a lock for each row (or page) the statement reads, which can add up to a lot of locks if your statement reads a lot of rows. In such cases, you should review your application design to determine whether a lower isolation level would suffice. However, in some cases SQL Server takes and holds read locks even in the READ COMMITTED isolation level, which might happen when SQL Server uses more than one access path (e.g., an index or base table heap) for a particular table in the same query. For example, if a query has two predicates in the WHERE clause on the same table (e.g., Customer.age > 35 and Customer.state = 'CA') and each can be answered by using a different index (one on age and the other on state), SQL Server might decide to use both indexes and combine the results with a JOIN operation to produce the requested result. In such a case, SQL Server would acquire locks on the two indexes to ensure that the query sees a consistent image of each row. If such locking is a problem, you can try switching to the READ UNCOMMITTED isolation level or use NOLOCK hints in the statement to avoid locking. If your application's design precludes using the NOLOCK option, try to alter the query plan by modifying indexes (e.g., adding a covering index) or by using hints to eliminate such locking.

Making Memory Work for You

SQL Server's memory usage has a profound effect on your system's performance. Understanding how SQL Server utilizes memory and how to monitor such usage will help you tell whether your system is under memory pressure, recognize common problems that lead to such pressure, and solve memory problems. While talking to customers with memory-related SQL Server performance problems, the SQL Server Customer Advisory Team always recommends trying to find and address the underlying causes of memory pressure before you decide to add more memory. We like to use the analogy of a leaky gas tank in an automobile. Sure, having a bigger tank will help delay a shutdown, but eventually the tank will still run out of gas unless you find the leak and fix it.

Comments

Plain text