DATA TALES 7 - The Case of the Database Diet (Part 2).

In the last article I started discussing a large customer database that needed to go on a diet. We needed to drastically reduce the size of the database. I described why ROW compression was important, and showed how to estimate the savings from using it.

This month, we’ll start by looking at how ROW compression actually works, then look at the greater savings from PAGE compression. We’ll also look at how it works internally.

Next time, we’ll consider how to decide what to use, but first, let’s see how table compression actually works.

An Uncompressed Table

Let’s start with the same table as last time but we’ll recreate it in a permanent database, rather than in tempdb.

USE master;
GO

IF EXISTS (SELECT 1 FROM sys.databases WHERE name = N'CompressionTest')
BEGIN
    ALTER DATABASE CompressionTest SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE CompressionTest;
END;
GO

CREATE DATABASE CompressionTest;
GO

USE CompressionTest;
GO

CREATE TABLE dbo.SalesOrderDetail
(
    SalesOrderID int NOT NULL,
    SalesOrderDetailID int NOT NULL,
    CarrierTrackingNumber nvarchar(25) NULL,
    OrderQty smallint NOT NULL,
    ProductID int NOT NULL,
    SpecialOfferID int NOT NULL,
    UnitPrice money NOT NULL,
    UnitPriceDiscount money NOT NULL,
    CONSTRAINT PK_dbo_SalesOrderDetail
    PRIMARY KEY (SalesOrderID, SalesOrderDetailID)
);
GO

INSERT dbo.SalesOrderDetail
    (SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber,
     OrderQty, ProductID, SpecialOfferID, UnitPrice,
     UnitPriceDiscount)
SELECT SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber,
   OrderQty, ProductID, SpecialOfferID, UnitPrice,
   UnitPriceDiscount
FROM AdventureWorks2012.Sales.SalesOrderDetail;
GO

We can then check the space that is being occupied by the table:

EXEC sp_spaceused N'dbo.SalesOrderDetail';
GO

This command displays the following output:

Let’s take a look at the data that will be on the first page of that table. We know which rows that will be because we’ll order our SELECT by the clustered index:

SELECT TOP(3) *
FROM dbo.SalesOrderDetail
ORDER BY SalesOrderID, SalesOrderDetailID;
GO

These are the first three rows on the first page:

We’ll need them to refer back to later.

Next, let’s take a look at the contents of the data page:

DBCC TRACEON (3604);

DECLARE @DatabaseID int = DB_ID();
DECLARE @PageToDisplay int =
    (SELECT TOP(1) allocated_page_page_id
     FROM sys.dm_db_database_page_allocations
     (@DatabaseID, OBJECT_ID(N'dbo.SalesOrderDetail'), 1, NULL, 'DETAILED')
     WHERE page_type_desc = N'DATA_PAGE'
     AND previous_page_page_id IS NULL);

DBCC PAGE (@DatabaseID, 1, @PageToDisplay, 1);
GO

Turning on trace flag 3604 redirects DBCC output to the client instead of to the SQL Server logs. We’re then using sys.dm_db_database_page_allocations to locate all the pages for the table, and then filtering to retrieve only data pages. The data pages are double-linked, so to find the first page, I’ve then looked for a page that has no previous page.

Once, I’ve found a suitable page number to look at, I’ve used DBCC PAGE to display its contents. (It’s parameters are the database, the file (in this case 1), the page number, and the type of output required. The value 1 will give us the formatted output that we need here.

It contains 3 basic sections. The first section is the header:

While this contains lots of interesting information, the parts that are of interest to us now are:

m_type = 1 indicates that this is in fact a data page
m_level = 0 is the index level and in this case it’s the leaf level of the clustered index
pminlen = 38 indicates the number of fixed length bytes in each row
m_slotCnt = 114 shows the number of rows (slots) on the page
m_freeCnt = 2 indicates that there are only 2 free bytes on the page. The page is quite full.

The end of the page contains the third section which is the slot array. It is stored in reverse order:

(I removed many rows to keep this short).

But the main area of interest for us today is the 2^nd section, which is the actual data:

Each byte is represented as two hexadecimal digits. Values are stored in least significant byte first. In the first row of the first slot (ie: slot 0), you can see a record size of 69 bytes. You can see that the data for each row starts with the same marker value in each row.

The first 38 bytes are the fixed data region. As it’s in hexadecimal, it will help if we convert some of the values from the first row:

SELECT sys.fn_varbintohexstr(43659), sys.fn_varbintohexstr(776);
GO

This returns the following output:

If you look carefully at the value in the first row above, you can see it is 8baa0000 and that’s the same value with the bytes in the reverse order. If you wander through the other values, you’ll find that the int, and smallint values are stored first. Later, the nvarchar value for the carrier tracking number 4911-403C-98 is stored. You can see it later in the data. Keep in mind that it is stored as double-byte Unicode characters. That’s why there are dots shown in between the characters.

Adding Row Compression

Now that we know what the “normal” data rows looks like, let’s add row compression and check the size again.

ALTER TABLE dbo.SalesOrderDetail
    REBUILD WITH (DATA_COMPRESSION = ROW);
GO

EXEC sp_spaceused N'dbo.SalesOrderDetail';
GO

There is a significant size reduction:

The table is now 54% of its original size ie: a 46% overall reduction in size.

Let’s look into the page contents again:

DECLARE @DatabaseID int = DB_ID();
DECLARE @PageToDisplay int =
    (SELECT TOP(1) allocated_page_page_id
     FROM sys.dm_db_database_page_allocations
     (@DatabaseID, OBJECT_ID(N'dbo.SalesOrderDetail'), 1, NULL, 'DETAILED')
     WHERE page_type_desc = N'DATA_PAGE'
     AND previous_page_page_id IS NULL);

DBCC PAGE (@DatabaseID, 1, @PageToDisplay, 1);
GO

The page header is quite similar:

Note however, that pminlen (fixed data length) is now 6, and there are now 214 rows per page (m_slotCnt), even though the m_freeCnt is now 26 (slightly more free space).

The slot array at the end is the basically the same except it has more entries:

But look at the data in the rows. It has become a series of compressed data array entries:

The values have been stored in shorter locations, and note that the string value that was nvarchar now is using single bytes not double bytes. The overall sizes for the rows shown are now 36 bytes each.

This shows how effective ROW compression is, without really rearranging the page structure all that much. Rows can still be read/written much the same way as with uncompressed data. We generally find the performance of ROW compression indistinguishable from uncompressed data. On systems with I/O bottlenecks however, we usually see a notable performance improvement.

Applying PAGE Compression

Greater compression can be obtained by compressing the entire page, rather than each row one at a time. So if ROW compression is excellent, is PAGE compression better? Let’s look at how it works. Let’s start by applying PAGE compression, and checking the size again.

ALTER TABLE dbo.SalesOrderDetail
    REBUILD WITH (DATA_COMPRESSION = PAGE);
GO

EXEC sp_spaceused N'dbo.SalesOrderDetail';
GO

This returns the following:

That’s 31% of the original size or a 69% reduction in table size.

Let’s look at the page contents again:

DECLARE @DatabaseID int = DB_ID();
DECLARE @PageToDisplay int = 
    (SELECT TOP(1) allocated_page_page_id 
 FROM sys.dm_db_database_page_allocations
      (@DatabaseID, OBJECT_ID(N'dbo.SalesOrderDetail'), 1, NULL, 'DETAILED')
     WHERE page_type_desc = N'DATA_PAGE'
     AND previous_page_page_id IS NULL);

DBCC PAGE (@DatabaseID, 1, @PageToDisplay, 1);
GO

Once again, the page header and the offset table are similar:

Note that the m_slotCnt (number of slots) value has gone up to 479 slots in the page. We are now holding far more rows in the same number of pages.

It’s the data section that has changed the most. Note that the data row detail has changed:

Note how few bytes are used per column, and that a big part of that reduction is from the introduction of symbols. These symbols point to a dictionary that’s held on the page. A prefix reduction technique has been used, and duplicate values have been replaced by symbols.

For example, note that the column 3 value (which was the carrier tracking number string) has been replaced by a single byte token. Where did its value go? We can see it in the dictionary section of the page dump:

Summary

I’ve shown how effective ROW compression is and now you’ve seen how powerful PAGE compression is. But is it useful for all tables? In the next article, we’ll explore that question and show how to work out whether ROW or PAGE compression is appropriate. I hope you’ve enjoyed taking a short look under the covers of compression.

Comments

Plain text