The idea behind data partitioning is really pretty simple: Rather than storing data in one really large table, you spread the data out among many smaller tables. When used properly, this technique can aid in the maintenance and scalability of storing data, and a good partitioning strategy lets you increase the efficiency of operations against that data. Partitioning lets you spread your data among filegroups on different drives as well and makes it easier to re-index smaller segments of the overall data set. However, although the idea is relatively simple, historically, the implementation has not been.
Before SQL Server 2005, partitioning was accomplished by spreading data among separate physical tables and joining the tables as needed through the use of views. This method lent itself to a slew of challenges. First, anytime a new table was added or an old one removed, definitions would need to be changed to accommodate the underlying table changes. Special logic would have to be added to the system to indicate into which tables new records should be inserted. Also, queries against partitioned views were typically slow, especially when data was distributed among different servers.
The Developer and Enterprise editions of SQL Server 2005 and SQL Server 2008 brought changes that made partitioning much easier. Instead of defining different tables and hashing them together, as in previous versions, SQL Server would let you define a single table as well as define how your data would be logically and physically distributed through the use of schemes and functions. In addition, SQL Server would manage the distribution in the background so you didn’t have to. Your data would still be in smaller tables, as it had been in previous versions, but SQL Server would handle those details in the background. In this article, my goal is to give you an overall high-level view of some of the challenges I’ve faced and the differences I’ve noticed between SQL Server 2005 and SQL Server 2008 performance when I run the same queries against nonpartitioned and partitioned tables.
Design and Implementation
In this article, I won’t cover partitioning design and implementation in extensive detail. Just know that you need to define a partition function to specify how you want your data logically distributed. With that definition, you specify your value barriers. You must then create a partition scheme to physically distribute your partitions among filegroups based on the logical barriers you defined in your partition function. When you create a clustered index on your table, you specify the partition scheme that will be used to partition the table (alternatively, you could use a unique constraint). You achieve many administrative and maintenance-related benefits by partitioning your tables, such as the ability to move data in and out of tables quickly by “switching” partitions. Partitioning also allows for rebuilding indexes on one or more partitions individually. Read a highly useful Microsoft white paper about partitioning.
When it comes to querying partitioned data, you face performance challenges, although current challenges pale in comparison to those encountered with pre–SQL Server 2005 versions. At first glance from a high level, it might seem that query performance for nonpartitioned and partitioned tables should be the same. But because partitioned data is spread out among partitions rather than being in the same spot, it becomes trickier for the database engine to easily locate data. However, if you design your partitioning strategy so that your queries always point to a particular partition, you can see query performance benefits similar to those for a nonpartitioned table. These performance benefits can grow dramatically if you’re dealing with huge amounts of partitioned data. If a majority of your queries will span partitions (e.g., data from different time ranges, different customers), then achieving good performance is trickier still. Note: If you find that your queries continually span partitions, you may need to reevaluate your partitioning strategy and consider whether this feature is the right choice for your environment.
As I compared SQL Server 2005 and SQL Server 2008 performance running the same queries against nonpartitioned and partitioned tables, I intentionally kept the tests as straightforward as possible. This approach lets you easily see (and lets me easily describe) the differences between the different methods. Results may—and probably will—vary when higher loads of data are examined and when partitioning strategies and query criteria change.
The code snippet presented in the first script, shown in Listing 1, sets up the tables and indicates how they’re partitioned. I’ve designed my partitioning function to distribute data across 10 different partitions, all of which are in my primary filegroup. In a perfect world, I would have these partitions on different disks or sets of disks, but for this scenario, placing all of the partitions in the primary filegroup should be fine.
In this setup, I load 1 million records into a table aptly named NonPartitionedTable. Each column in this table will play a different role in the examples presented. The ValueColumn contains a globally unique identifier (GUID), which I’ll later use to return a single record lookup. The PartitionID is a hashed value that will be part of my clustered index for my partitioned table and will serve as the partitioning key for data distribution. If you partition a table, the partition key must be included in the clustered index definition because partitioning physically distributes a table, and a clustered index is physically a part of the table. Note that I’m not designing this partition strategy on a date-based element. Instead, I mimic an environment where data is segmented based on customer-related data.
The GroupedValue is another hashed column that I’ll use to query values across partitions and return an aggregated set of data. This type of query is of particular importance because many queries run against partitioned tables can’t specify the partitioning key as part of the criteria because records from multiple partitions must be returned. The performance of such queries is critical and should play a significant role in your design strategy if you determine that you need to partition your database tables.
After the data is loaded to my nonartitioned table, I simply use a SELECT…INTO statement to make a copy of the table named PartitionedTable. From there, I build indexes on each table. When I build the clustered index on the PartitionedTable, I specify my partitioning scheme.
With data loaded, I can start digging in to see how queries against nonpartitioned and partitioned tables differ. I’ll run a series of queries against a nonpartitioned table and a partitioned table in a SQL Server 2005 instance followed by the same set of queries against a nonpartitioned table and a partitioned table in a SQL Server 2008 instance. I’ll compare and contrast the execution plans as well as the IO activity between the two environments.
In the example in Figure 1, I retrieve a single value from the NonPartitionedTable table and assign it to a variable. I then specify this variable as the predicate value in queries against my two tables.
You can see that the query against the NonPartitionedTable yields execution plans drastically different from the same statement against the PartitionedTable. The NonPartitionedTable query produces an expected single Index Seek operation while the query against the PartitionedTable produces an Index Seek, a Constant Scan, and an expensive Nested Loop join to compile the result set. The Constant Scan, although showing a cost of 0%, is really a scan of all of the available partitions to determine if a particular partition can help satisfy the query. This simple query against a partitioned table can be quite expensive.
In SQL Server 2005, if a partitioned table has only one partition involved in a query, all available schedulers could be used to satisfy the query. However, if the query spanned multiple partitions, only one thread would be used per partition regardless of how many schedulers the server had available to contribute to a parallel execution plan. SQL Server 2008 remedies this limitation by allowing all available threads to access partitions as necessary. This change is more of an advantage in OLAP environments, where read requests are intensive and can really benefit from parallel queries, than in OLTP environments, where queries designed to take advantage of parallelism may introduce a performance bottleneck and should be evaluated.
Before I continue, I’ll mention the command I’ll use to show the IO activity that results from a query. The command is SET STATISTICS IO ON, and it must be run in the same session as the queries that you run. This command tells SQL Server to display disk activity generated by the T-SQL statements executed in the current session. When I look at the output from this statement, I’m usually most concerned with the logical reads figure. This figure represents the number of pages read from the data cache. It should remain relatively the same for the same query given the same data set. As a general rule, when I am tuning queries, I try to get the number of logical reads as low as possible.
Figure 2 shows the output from SET STATISTICS IO ON for the previous query. You can see that the query against the NonPartitionedTable table incurred 3 logical reads, while the query against PartitionedTable table incurred 27. From an IO point of view, the query against the PartitionedTable table is nine times as expensive as the query against the NonPartitionedTable table.
When I add the partitioning key (PartitionID) to the query criteria, the performance against the PartitionedTable table improves. In fact, the result of the addition yields a performance improvement that mimics the performance of the query against the NonPartitionedTable table. This improvement occurs because SQL Server doesn’t incur the cost of determining which partition holds the record in query. Figure 3 shows the performance difference.
As Figure 4 shows, in terms of IO, the two queries are now identical because SQL Server no longer has to scan partitions. Both tables now incur the same number of index seeks, which results in the same number of pages accessed.
Now, let’s examine the results of the same set of queries that I ran against my SQL Server 2005 instance run against my SQL Server 2008 instance.
As Figure 5 shows, the execution plan for SQL Server 2008 is drastically different from the execution plan for SQL Server 2005. The Constant Scan is gone, as is the Nested Loop join operator. The visual representation of the execution plan for the two tables is now the same.
However, the queries still incur the same number of reads from the database as they did with the SQL Server 2005 version. Even though the visual Constant Scan is gone, from an IO perspective, the query against the PartitionedTable table in SQL Server 2008 remains the same. The relative costs of the queries are the same as with SQL Server 2005.
Figure 6 shows that when I run the same query against the SQL Server 2008 instance that I ran against the SQL Server 2005 instance (refer to Figure 2), I get the same result: a 9 to 1 ratio in terms of IO.
After the PartitionID is added to the query criteria, as Figure 7 shows, the performance of the two queries is the same. This change in behavior mimics that of the SQL Server 2005 instance when the PartitionID column was added.
In Figure 8, you can see that after I add the PartitionID criteria, the logical reads for the two queries are the same. Again, this equality occurs because SQL Server doesn’t have to determine which partition to query on its own. I’ve specified the partition by using the PartitionID in the query predicate.
You can see the significance of telling the query which partition to access to retrieve the data. If I don’t specify the partition, SQL Server has to determine it, which can be costly.
Aggregations are common types of queries executed against databases. Questions such as “How many cars did we sell each month for the past 12 months?” or “What is the average number of calls taken by active customer service representatives (CSRs) by day this month?” call for aggregations. When you deal with partitioned tables, you’re almost certainly going to be running queries that span multiple partitions. Spanning partitions to pull data together requires extra overhead for the database engine.
In Figure 9, you can see the execution plan for an aggregated query run against my SQL Server 2005 instance. Notice that Constant Scan comes into play for the SQL Server 2005 query.
Figure 10 shows the IO output for the above query. As expected, the query against the partitioned table still incurs more logical reads than the same query against the nonpartitioned table.
Figure 11 shows the same set of aggregation queries run against the SQL Server 2008 instance.
The execution plan indicates that for SQL Server 2008 the relative cost of the aggregation against the PartitionedTable table is still more expensive than the same query against the NonPartitionedTable table. However, although the difference is still present, in terms of relative cost it has improved over the SQL Server 2005 version.
Figure 12 shows the difference in IO between the two queries for SQL Server 2008. Again, in terms of logical reads, the query against the PartitionedTable is still more expensive.
Partitioning is a powerful feature, one that brings advantages to data manageability and administration. However, partitioning can bring certain challenges when it comes to query performance, and you must consider its implementation carefully for your environment.
Partitioning Query Performance
In this article, I’ve explored how the same queries run against partitioned and nonpartitioned tables on SQL Server 2005 and SQL Server 2008 perform. Queries against a partitioned table in SQL Server 2005 can perform differently than a partitioned table in SQL Server 2008 in terms of parallel plan generation. Most significantly, queries against a partitioned table perform significantly better when the partitioning key is supplied. As mentioned earlier, my goal was a high-level view of how partitioning tables may affect your query performances on SQL Server 2005 and SQL Server 2008. However, it’s clearly not an exhaustive view. If you’ve had different experiences from those I’ve presented, I would love to hear about them. Feel free to email me at [email protected].