DATA TALES #6: The Case of the Database Diet (Part 1)
Dr Greg Low – SQL Down Under (Australia)
I wanted to start by wishing everyone Happy New Year. One of the busiest places to be in the new year is any gym. So many people make New Year's resolutions, and in so many cases, they involve losing weight. So I thought I'd start the new year with a tale of a database that needed to go on a diet. The client's primary database started at 3.8 TB. This case will be described over a short series of articles. This is part 1 and describes phase 1 of the diet.
When it comes to databases, size really does matter, and for many reasons.
Clients often consider only the storage cost of the main database, but they should think about much more than that. I read recently that a typical organization holds more than 40 copies of its primary database in various guises: copies in other environments (UAT, Staging, Testing), many backup copies, and so on. There is also a larger management cost associated with a larger storage volume.
We have all seen database applications that perform worse and worse as more data is held in the database. The amount of data in a database never really concerns me, as long as it is being queried efficiently and appropriate indexing is in place. So this slowdown as data volumes increase should not occur, but it often does when applications haven't been designed or implemented well.
Larger databases often mean longer database backup and recovery times. I am not usually too concerned about longer backup times. Recovery times concern me as that’s when the client is typically waiting around for the process to complete. In many organizations, the data to be recovered will also need to be moved across corporate networks before the recovery can even start, and that takes even more time.
Operating System Compression
Whenever I start talking to clients about compression technologies, they immediately think that I’m talking about operating system level compression. Most clients have memories of poor experiences with this in the past.
Compressing database data files at the operating system level isn't something you should try to do. There would be no in-memory benefit from doing so. Because larger chunks of data need to be compressed at once for the compression to be effective, and because the compressed chunks would vary in size, this type of scheme would perform poorly for random access.
This means that based on their previous experiences, clients assume that compression will mean smaller and slower databases and they certainly don’t want slower databases. But that’s not the case with table compression in SQL Server. When applied appropriately, we find that table compression leads to smaller and faster databases. That is the best of both worlds.
The data in both transactional and data warehouse systems can usually be compressed significantly.
The SQL Server team's first attempt to reduce size was the addition of vardecimal storage in SQL Server 2005 SP2. A typical transactional system, such as a large SAP system, often has tables that contain a large number of decimal columns. A decimal value in SQL Server with the default precision of 18 digits occupies 9 bytes of storage, yet quite often these columns are NULL or hold simple values like zero or one. With vardecimal in Enterprise Edition, SQL Server stored the row contents differently and made decimal values smaller by storing them in a variable-length format. Vardecimal was an internal storage format, not a data type you could use: you couldn't define a column or a variable as vardecimal.
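For the record, vardecimal was enabled first at the database level and then per table. A sketch of how that looked (the database and table names here are hypothetical, and vardecimal has since been deprecated in favor of the compression options discussed below):

```sql
-- Enable vardecimal storage for a database (SQL Server 2005 SP2,
-- Enterprise Edition; deprecated in later versions)
EXEC sp_db_vardecimal_storage_format 'SalesDB', 'ON';
GO

-- Then opt individual tables in via sp_tableoption
EXEC sp_tableoption 'dbo.OrderTotals', 'vardecimal storage format', 1;
GO
```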
Making this change to decimal values often produced a significant reduction in database size, without negatively impacting performance. In fact, the performance was often slightly better, so why not continue down this path?
In SQL Server 2008, two types of table compression were introduced: ROW and PAGE. Unfortunately, they were only added to Enterprise Edition, but the client that I’m discussing today is using Enterprise Edition. I do wish that table compression was available in other editions.
With ROW compression, SQL Server makes changes to how the contents of a row are stored to ensure that more rows fit on a single page. Clients are often surprised that this can increase the performance of their applications, but there are several reasons:
If the database is currently I/O bound, anything that you do to decrease the amount of I/O will usually have a very positive impact on performance. For example, if you can reduce the size of the rows by 30%, that's 30% less data that needs to be moved during I/O operations. (Clearly this situation can change if the system is CPU bound instead of I/O bound, but most of my current clients have I/O bound systems.)
For disk-based tables, because SQL Server buffers store copies of database pages, more rows of data will now fit in those same buffers. That means that more rows fit in server memory, without having to increase the amount of memory available.
While the processing for each page involves more CPU, most servers that are I/O bound have plenty of available CPU. Interestingly, I've even seen situations where CPU load reductions have occurred with table compression. I can only attribute that to the fact that while there is a higher CPU load per page, you end up processing far fewer pages.
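One way to see the memory effect for yourself is to count the pages each object holds in the buffer pool before and after compressing; the same rows should occupy fewer cached pages afterwards. A sketch along these lines (the join through hobt_id covers in-row and row-overflow data for ordinary row-store tables):

```sql
-- Count buffer pool pages cached for each object in the current database
SELECT OBJECT_NAME(p.object_id) AS table_name,
       COUNT(*) AS cached_pages
FROM sys.dm_os_buffer_descriptors AS bd
INNER JOIN sys.allocation_units AS au
    ON bd.allocation_unit_id = au.allocation_unit_id
INNER JOIN sys.partitions AS p
    ON au.container_id = p.hobt_id
WHERE bd.database_id = DB_ID()
AND au.type IN (1, 3)  -- IN_ROW_DATA and ROW_OVERFLOW_DATA
GROUP BY p.object_id
ORDER BY cached_pages DESC;
```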
So how is ROW compression implemented?
The first change is that the amount of metadata associated with each row is reduced in size. This includes information such as the details of the columns and their lengths.
The next change recognizes that so many columns are NULL or holding a value of zero. The storage of these was optimized to the point that NULL and zero values occupy no space at all.
Another change takes the vardecimal story further. Vardecimal was deprecated, and variable-length storage formats were used for the numeric types (smallint, int, bigint, decimal, numeric, smallmoney, money, float, real, timestamp/rowversion). The storage of bit is improved by reducing metadata overhead, and some date-related values are reduced in size (datetime, datetime2, datetimeoffset).
String values are not affected in general, but trailing padding characters are removed from char and nchar values.
Trailing zeroes are removed from binary values.
We have found ROW compression to be surprisingly effective. On most large transactional system tables, we have seen a reduction of around 30%.
Applying ROW Compression
The type of compression to be used is a property of each table. (In fact, it is a property of each partition of each table and index.) It can be applied at the time you create each table but, more likely in our work, it will be applied once the table is already populated.
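Because compression is a per-partition property, it can be declared when a table is created, or applied to a single partition of a partitioned table. A sketch (the table, columns, and partition number here are illustrative; rebuilding a specific partition only works on a partitioned table):

```sql
-- Compression can be declared at creation time ...
CREATE TABLE dbo.SalesHistory
(
    SalesID int NOT NULL,
    SaleDate date NOT NULL,
    Amount money NOT NULL
)
WITH (DATA_COMPRESSION = ROW);
GO

-- ... or applied to one partition of an existing partitioned table
ALTER TABLE dbo.SalesHistory
REBUILD PARTITION = 3 WITH (DATA_COMPRESSION = ROW);
GO
```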
The important concept to understand is when the compression occurs. As an example, let's consider the Sales.SalesOrderDetail table from the AdventureWorks sample database:
Based on the data types, we can see that this table might be a good candidate for size reduction via ROW compression. Rather than modifying the AdventureWorks table, let's create another table in tempdb to work with:
USE tempdb;
GO

CREATE TABLE dbo.SalesOrderDetail
(
    SalesOrderID int NOT NULL,
    SalesOrderDetailID int NOT NULL,
    CarrierTrackingNumber nvarchar(25) NULL,
    OrderQty smallint NOT NULL,
    ProductID int NOT NULL,
    SpecialOfferID int NOT NULL,
    UnitPrice money NOT NULL,
    UnitPriceDiscount money NOT NULL,
    CONSTRAINT PK_dbo_SalesOrderDetail
        PRIMARY KEY (SalesOrderID, SalesOrderDetailID)
);
GO

INSERT dbo.SalesOrderDetail
    (SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber,
     OrderQty, ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount)
SELECT SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber,
       OrderQty, ProductID, SpecialOfferID, UnitPrice, UnitPriceDiscount
FROM AdventureWorks.Sales.SalesOrderDetail;
GO
SQL Server provides a procedure to let us estimate the amount of reduction that would occur. Let’s try that:
EXEC sp_estimate_data_compression_savings 'dbo', 'SalesOrderDetail',
                                          NULL, NULL, 'ROW';
GO
This procedure doesn't check the entire table; it takes a sample of the table's rows into tempdb, compresses those with the requested compression scheme, and reports the outcomes. Because it extrapolates the results for the entire table based upon the sample, the "real" outcomes can vary. However, we've found it to be a useful first step:
In this case, the sample size was basically the whole table. It reports a reduction from 6.9 MB to 3.7 MB, or a reduction of 46%. The suggestion is that if we apply ROW compression, the table will end up at 54% of its original size. Let's check the current table size, then make the change:
EXEC sp_spaceused N'dbo.SalesOrderDetail'; GO
ALTER TABLE dbo.SalesOrderDetail
REBUILD WITH (DATA_COMPRESSION = ROW);
GO

EXEC sp_spaceused N'dbo.SalesOrderDetail';
GO
At 3.9 MB, the table is now around 57% of the original 6.9 MB size.
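If you want to confirm which compression setting is now in place, the catalog records it per partition:

```sql
-- Check the compression setting for each partition of the table
SELECT p.index_id, p.partition_number, p.data_compression_desc
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID(N'dbo.SalesOrderDetail');
```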
Database Diet Outcomes: Phase 1
For our client whose database needed to go on a diet, by applying ROW compression to all tables, we saw a reduction in overall database size from 3.8TB to 2.6TB. As the client system had been totally I/O bound, we were not surprised to see a noticeable performance improvement as well.
Does it make sense to apply ROW compression to all tables? We have not yet encountered a table where ROW compression was anything but beneficial, at any site. For customers working with Enterprise Edition, I believe it should be the default setting for disk-based tables.
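When we do decide to compress every table, we typically generate the statements from the catalog rather than write them by hand. A sketch along these lines (it only produces the commands; review them before executing, as each rebuild takes locks and transaction log space):

```sql
-- Generate (but don't execute) a REBUILD statement per user table
SELECT N'ALTER TABLE ' + QUOTENAME(s.name) + N'.' + QUOTENAME(t.name)
     + N' REBUILD WITH (DATA_COMPRESSION = ROW);' AS rebuild_command
FROM sys.tables AS t
INNER JOIN sys.schemas AS s
    ON t.schema_id = s.schema_id;
```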
We can, however, do much better than this in reducing the size of the database, and improving the overall performance. Like all diets, the real issue is about sticking with the plan. In the next article, we’ll look at how we continued to improve the situation by implementing phase 2 of the database diet.
And before I forget, not everyone celebrates New Year at the same time of year, so I need to add a quick Happy New Year to any readers with Chinese heritage, who have New Year coming up on February 8th: 新年快乐 (Happy New Year)!