
Are You Making the Right Cloud Storage Decisions for SQL Server tempdb?

Here's why selecting the right storage for SQL Server tempdb is critical, and how to determine what will work.

Tempdb is a critical system database in Microsoft SQL Server. It’s unique in that it’s a shared resource tapped by all users of a SQL Server instance for hosting temporary user objects such as tables, table variables, temporary indexes, table-valued function results and cursors. It’s also used for hosting internal system objects for worktables, hashes, sorting, temporary LOB storage, aggregation operations and hash joins, and other system needs. A SQL Server tempdb database can be extremely quiet or the subject of near-constant thrashing. This depends on many factors, including the size of the database, the quality--and efficiency--of application and database coding, table design, use of temp tables or table variables, reliance on cursors and table-valued functions, and load, to name just a few. Regardless, those tasked with building and administering SQL Server environments should plan to provision tempdb to be as high-performing as possible, weighing proposed use and design, scaling capabilities and budget. Ensuring adequate storage is key in this regard, but it may be one of the last things you think about.

In many cases, the database administrators tasked with keeping SQL Server performing at optimal levels are not the ones who provision the servers in the first place. If they're lucky, they have the clout to make recommendations for CPU, RAM and storage levels, and they may even have a say in selecting the hardware. However, with more and more workloads transitioning to cloud-based platforms like Microsoft Azure and Amazon Web Services (AWS), most of the key infrastructure decisions have been removed from the purview of the modern DBA. This article should serve as a reminder that moving to the cloud is not a license to ignore infrastructure entirely: even in cloud-based environments where the options are limited, there are still critical choices to make.

Configuration Considerations for tempdb

It’s of utmost importance to set up SQL Server tempdb for success from the minute you finish installing SQL Server. Tempdb figures prominently in sort, hash and merge operations, and it factors into any process involving temporary objects, which is why it’s a bottleneck for a significant percentage of the load on your SQL Servers. You want the correct number of files provisioned, you want them sized appropriately, and, finally, you want the files provisioned on the fastest disk available. All three of these factors are easy to configure--or adjust--particularly in cloud implementations. Oh, and don’t forget to always separate tempdb onto its own dedicated drive.

tempdb File Count Best Practices

The decision around the number of data files to provision for SQL Server tempdb is a function of CPU logical core count. For instances of SQL Server running with fewer than eight logical cores (do those still exist?), there should be a 1:1 relationship between logical core count and data files allocated in tempdb. When the logical core count is eight or higher, initially provision eight data files. If any allocation contention is noted in tempdb (manifesting as PAGELATCH waits, such as PAGELATCH_UP, with the wait resource hosted in tempdb), add four data files at a time until the contention abates.
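As a rough illustration, the following T-SQL sketch shows one way to look for current allocation-page latch waits in tempdb and to add another data file. The file name, path and sizes are placeholders--match them to your existing tempdb files.

-- Look for current PAGELATCH waits whose wait resource lives in tempdb (database_id = 2).
SELECT session_id, wait_type, wait_duration_ms, resource_description
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE N'PAGELATCH%'
  AND resource_description LIKE N'2:%';

-- If contention persists, add data files four at a time (placeholder name/path; size to match existing files).
ALTER DATABASE tempdb
ADD FILE (NAME = N'tempdev9', FILENAME = N'T:\tempdb\tempdev9.ndf', SIZE = 8GB, FILEGROWTH = 1GB);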

File Size Considerations

Sizing SQL Server tempdb depends on many factors, some of which I already touched upon: user database size, coding efficiency and relational modeling are all key. Other factors--such as performing index rebuild sorts in tempdb, using version stores to reduce contention and relying on certain types of functions--are also considerations. Initially, the key is to ensure all your data files are set to grow at the same interval. As you go through the development and testing process, you’ll collect metrics around projected tempdb usage that you can apply when releasing your database to production. Tempdb file sizes are a bit trickier to nail initially, but this is where the power of cloud provisioning makes life easier, since provisioning storage is typically an online process in the cloud. If you need more space, it can usually be allocated without downtime. My base goal with file size for tempdb (and most other databases) is to pre-grow the data and log files to the maximum size I expect to need. I don't rely on auto-growth, but I allow for it if needed by keeping additional space available--and alerting in place--so I have significant time to act should I need to provision space on demand.
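As a minimal sketch, assuming the default logical file names (tempdev, temp2 and so on, plus templog), pre-growing each data file to the same size with the same growth increment might look like this. Adjust the names and sizes to your instance.

-- Pre-grow each tempdb data file to the same size so proportional fill stays even.
-- Logical names below are the SQL Server defaults; sizes are placeholders.
ALTER DATABASE tempdb MODIFY FILE (NAME = N'tempdev', SIZE = 16GB, FILEGROWTH = 1GB);
ALTER DATABASE tempdb MODIFY FILE (NAME = N'temp2',   SIZE = 16GB, FILEGROWTH = 1GB);
-- ...repeat for the remaining data files, then size the log file explicitly as well.
ALTER DATABASE tempdb MODIFY FILE (NAME = N'templog', SIZE = 8GB,  FILEGROWTH = 1GB);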

Focusing on Storage Decisions for tempdb

That leaves the class of storage, the remaining consideration for SQL Server tempdb. It’s also the decision that is most easily overlooked, particularly for cloud-provisioned instances of SQL Server. Most cloud providers offer multiple tiers of storage performance. For example, Microsoft Azure and AWS each have multiple SSD options as well as HDD offerings. For most cloud-provisioned instances, you have two options for achieving a target level of IOPS (input/output operations per second). Depending on your cloud provider, you can provision a class of drive where the amount of IOPS is based on the size of the drive in GB, or you can purchase a class of drive that lets you provision a specific minimum level of IOPS. There are also allowances for “bursting” IOPS as necessary if the provisioned level is not suitable for the load--one more plus for cloud-provisioned SQL Server instances. Most changes that adjust the amount of IOPS provisioned for a SQL Server are online operations; scaling provisioned IOPS up or down usually does not require a service restart or a server reboot.
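If you're unsure what IOPS level to target, one rough way to baseline current demand from inside SQL Server is to sample the cumulative I/O counters twice and divide by the sample window. This is only a sketch--run it during representative load and lengthen the window as needed.

-- Capture cumulative read/write counts, wait a sample window, then estimate IOPS.
DECLARE @before TABLE (database_id int, file_id int, reads bigint, writes bigint);

INSERT INTO @before (database_id, file_id, reads, writes)
SELECT database_id, file_id, num_of_reads, num_of_writes
FROM sys.dm_io_virtual_file_stats(NULL, NULL);

WAITFOR DELAY '00:01:00';  -- 60-second sample window

SELECT SUM((vfs.num_of_reads - b.reads) + (vfs.num_of_writes - b.writes)) / 60.0
       AS approx_iops_over_window
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN @before AS b
  ON b.database_id = vfs.database_id
 AND b.file_id = vfs.file_id;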

Real-World Example

One of my recent clients was experiencing multiple performance problems with its production environment hosted in AWS. There was plenty to be concerned about, and all the technical debt I was discovering added up to months of work. In the discovery process, though, I noted a quick win: tempdb latency. Latency is a way to measure “drag” on storage calls to and from the disk subsystem. In a nutshell, latency measures the amount of time, in milliseconds (ms), that elapses between a request for a data element from disk and receipt of that data element. Latency is a function of the throughput a class of storage can achieve--whether limited by its physical characteristics or by throttling--as well as the number of requests queuing for items from disk.

The following were the findings on disk latency when I performed my initial assessment of the SQL instance:

Figure 1: Initial Latency of SQL Server Drives 

I was able to draw quite a few conclusions about items to resolve just from looking at the latency findings returned by a single query against the sys.dm_io_virtual_file_stats dynamic management function (a sketch of a similar query appears after the list below):

1.     Data and log files for user databases are on the same volume, mixing read and write access patterns that conflict with one another.

2.     System databases are hosted on the same drive as the operating system. I typically like to provision a dedicated drive for system databases (other than tempdb), but this is also a low-priority concern.

3.     The latency of the drive dedicated to tempdb compared to the other drives was staggering.
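For reference, here is a sketch of the kind of per-file latency query referenced above (the exact query I used differs, but the arithmetic is the same): average latency is the cumulative I/O stall divided by the number of operations, per file.

-- Average read/write latency per database file, derived from cumulative counters.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_reads = 0 THEN 0
            ELSE vfs.io_stall_read_ms / vfs.num_of_reads END AS avg_read_latency_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
ORDER BY avg_read_latency_ms DESC;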

Read latency on the main data file (32.0 ms on the D drive, compared to 20 ms for any of the tempdb data files on the E drive) was concerning until I also factored in the volume of reads and writes (not shown). Reads for the main user database were 11 times fewer than reads for tempdb over the same time period, and writes for that database were half those for tempdb. RAM factors into this: The user database could fit into the buffer pool 10 times over. The bulk of the user database’s pages would be read into the buffer pool once and would incur repeated I/O only for dirty pages. The access patterns for tempdb are completely different and more volatile because of how tempdb is used. While the latency findings for the user database were not within the range I’d target for reads on a healthy server (optimally 5 ms or less), resolving them was a much lower priority than fixing tempdb.
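As an aside, that buffer pool comparison can be sanity-checked with a query like the following sketch, which sums the 8 KB pages each database currently holds in the buffer pool.

-- Approximate buffer pool usage per database, in MB (each buffer page is 8 KB).
SELECT DB_NAME(database_id) AS database_name,
       COUNT(*) * 8 / 1024 AS buffer_pool_mb
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY buffer_pool_mb DESC;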

Upon closer examination, I discovered that the client had provisioned all of its disks on general-purpose AWS gp2 storage, disregarding the io1 storage that is designed for critical business applications--specifically databases. A five-minute operation to convert the disks to io1, with no other tuning changes to the client environment, produced the following before-and-after latency comparison:

Figure 2: Latency Improvements Due to Changing Drive Type Only

By changing only the class of drive and setting a higher provisioned IOPS level and bursting allowance, we improved read and write latency by more than 60 percent on the tempdb volume. This led to an overall reduction in blocking for queries that depend on operations requiring tempdb.

Have you taken the time to look at how you’ve provisioned tempdb? Perhaps there are some quick and easy fixes in your future, too.
