DATA TALES #9: The Case of the Database Diet (Part 4)

The Final Cut

This is the 4th and final article in a series on a database that needed to go on a diet for the New Year.

In the first article, I described why size really does matter, both for the DB itself, and all the other copies of it that exist in most organizations. I then showed how to estimate the improvement using ROW compression. Our customer’s database that started at 3.8TB was reduced to 2.6TB by applying ROW compression without code changes. Better still, the performance of the I/O bound application improved significantly.

In the second article, I showed how both ROW and PAGE compression change the internal structure of the database pages, and showed why PAGE compression could achieve such great size reductions.

In the third article, I provided guidance on how to decide whether to use ROW or PAGE compression, on a table-by-table, index-by-index, and partition-by-partition basis. By applying the recommendations in that article, the database had dropped to 1.4TB and had even better performance than before.

In this final article, I’ll look at two other areas that were consuming a great deal of space within the database, and what I did about it.

 

Large String Values

 

When discussing table compression, I mentioned that the storage of string values was improved in only two basic ways:

  • When ROW compression was applied, char and nchar data was stored in a variable-length format, i.e. if the value “hello” was stored in a char(40) column, SQL Server did not occupy 40 bytes to store it. However, when the data was retrieved, 40 bytes were still returned as expected.
  • SQL Server 2008 R2 added Unicode compression into the mix. Many double-byte character values ended up occupying only half the space.

The problem with this is that systems often store large amounts of highly-compressible character data, and this is still stored in an essentially uncompressed format.

For the customer system that we are discussing, I found that a large volume of XML data (around 700GB) was being stored in columns of varchar data type. This data could have been stored using the xml data type, but the data was never queried with XQuery functions inside the database, so the customer had decided to avoid that overhead and simply store it all as varchar. If, on a rare occasion, they did need to use XQuery on the data, they were happy to CAST it to xml first.

If you have ever used ZIP-style compression on this type of data, you’ll realize how well it compresses, so I decided to pursue ZIP-style compression for the relevant columns. Applying ZIP compression is a CPU-intensive activity, though, and generally I would not want SQL Server to be the application performing the compression.

Recompressing the existing data would be a fairly intensive process, so I wanted to make sure that it could be carried out over a period of time, rather than expecting it all to be converted immediately. To support this, in each table where the data needed to be held either compressed or uncompressed, I added extra columns. For example, look at the following table design:

CREATE TABLE dbo.SystemEvents
(
    SystemEventID bigint IDENTITY(1,1)
        CONSTRAINT PK_dbo_SystemEvents PRIMARY KEY,
    LoggedWhen datetime NOT NULL,
    LogData varchar(max) NOT NULL
);
GO

 

I would change this table as follows:

CREATE TABLE dbo.SystemEvents
(
    SystemEventID bigint IDENTITY(1,1)
        CONSTRAINT PK_dbo_SystemEvents PRIMARY KEY,
    LoggedWhen datetime NOT NULL,
    LogData varchar(max) NULL,
    IsCompressed bit NOT NULL
        CONSTRAINT DF_dbo_SystemEvents_IsCompressed DEFAULT(0),
    LogDataCompressed varbinary(max) NULL,
    CONSTRAINT CK_dbo_SystemEvents_Either_LogData_Or_LogDataCompressed_Is_Required
          CHECK (LogData IS NOT NULL OR LogDataCompressed IS NOT NULL)
);
GO

 

I then had the application changed so that it could work with either the LogData or the LogDataCompressed column being returned. It was also changed so that, for all new rows, the data was GZip-compressed before being sent to SQL Server.

Finally, we created a small utility application that would progressively compress all existing data in the table during periods of low system activity.
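The core loop of such a utility can be sketched in T-SQL. This is a minimal sketch, not the customer's actual utility: it assumes a CLR dbo.CompressData function like the one shown later in this article, and the batch size and delay are illustrative only.

```sql
-- Compress existing rows in small batches, to limit lock duration
-- and transaction log growth during low-activity periods.
DECLARE @RowsUpdated int = 1;

WHILE @RowsUpdated > 0
BEGIN
    UPDATE TOP (1000) dbo.SystemEvents
    SET LogDataCompressed = dbo.CompressData(CAST(LogData AS varbinary(max))),
        LogData = NULL,
        IsCompressed = 1
    WHERE IsCompressed = 0
      AND LogData IS NOT NULL;

    SET @RowsUpdated = @@ROWCOUNT;

    WAITFOR DELAY '00:00:05'; -- pause between batches
END;
```

Working in small batches like this keeps each transaction short, so the process can be stopped and resumed at any point without leaving the table in an inconsistent state.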

The outcome of this is that the 700GB of data became 180GB of data, and the database had now dropped to a total of 880GB.

 

Compression in the Database?

 

If I had not been able to change the application (as is often the case), I would have replaced the table with a view of the same name and performed the compression and decompression within the database. Remember, though, that this work is generally best done outside the database. However, if it does need to happen in the database, this section applies.
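As a rough sketch of that approach, the base table could be renamed and hidden behind a view carrying the original name. The table name dbo.SystemEventsData is an assumption for illustration, and the sketch relies on the CLR functions shown later in this article:

```sql
-- A view with the original table's name keeps the application unchanged.
-- The base table is assumed to have been renamed to dbo.SystemEventsData.
CREATE VIEW dbo.SystemEvents
AS
SELECT SystemEventID,
       LoggedWhen,
       COALESCE(LogData,
                CAST(dbo.DecompressData(LogDataCompressed) AS varchar(max)))
           AS LogData
FROM dbo.SystemEventsData;
GO
```

INSTEAD OF triggers (or stored procedures) would then handle compressing the data on the write path.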

One of the great new features of SQL Server 2016 is the inclusion of native COMPRESS() and DECOMPRESS() functions. You’ll find details of the COMPRESS function here:
https://msdn.microsoft.com/en-us/library/mt622775.aspx and details of the DECOMPRESS function here: https://msdn.microsoft.com/en-us/library/mt622776.aspx. If the customer had already been running SQL Server 2016, this would have been easy, but sadly they were on an older version.
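On SQL Server 2016, the compressed column could be populated without any CLR code at all. A minimal example (the sample value is illustrative; COMPRESS uses the GZip algorithm and both functions return varbinary(max), so the result must be cast back to its original type):

```sql
-- SQL Server 2016+: built-in GZip compression.
DECLARE @Original varchar(max)
    = REPLICATE(CAST('<event><detail>example</detail></event>' AS varchar(max)), 1000);

DECLARE @Compressed varbinary(max) = COMPRESS(@Original);

SELECT DATALENGTH(@Original)   AS OriginalBytes,
       DATALENGTH(@Compressed) AS CompressedBytes;

-- DECOMPRESS returns varbinary(max), so cast back to the original type.
SELECT CAST(DECOMPRESS(@Compressed) AS varchar(max)) AS RoundTripped;
```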

For versions that support SQLCLR, it’s relatively straightforward to create functions to perform compression. Here is a rough example of the code required:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.IO;
using System.IO.Compression;

public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction(Name = "CompressData",
                                            DataAccess = DataAccessKind.None,
                                            SystemDataAccess = SystemDataAccessKind.None,
                                            IsDeterministic = true,
                                            IsPrecise = true)]
    public static SqlBytes CompressData(SqlBytes DataToCompress)
    {
        SqlBytes returnValue = SqlBytes.Null;

        if (!DataToCompress.IsNull)
        {
            using (var compressedData = new MemoryStream())
            using (var gZipOutputStream = new GZipStream(compressedData,
                                                         CompressionMode.Compress,
                                                         true))
            {
                gZipOutputStream.Write(DataToCompress.Buffer, 0, DataToCompress.Buffer.Length);
                gZipOutputStream.Flush();
                gZipOutputStream.Close();
                returnValue = new SqlBytes(compressedData.ToArray());
            }
        }

        return returnValue;
    }

    [Microsoft.SqlServer.Server.SqlFunction(Name = "DecompressData",
                                            DataAccess = DataAccessKind.None,
                                            SystemDataAccess = SystemDataAccessKind.None,
                                            IsDeterministic = true,
                                            IsPrecise = true)]
    public static SqlBytes DecompressData(SqlBytes DataToDecompress)
    {
        SqlBytes returnValue = SqlBytes.Null;

        if (!DataToDecompress.IsNull)
        {
            var copyBuffer = new byte[4096];

            using (var decompressedData = new MemoryStream())
            using (var gZipInputStream = new GZipStream(DataToDecompress.Stream,
                                                        CompressionMode.Decompress,
                                                        true))
            {
                int bytesReadFromInputStream;

                while ((bytesReadFromInputStream
                        = gZipInputStream.Read(copyBuffer, 0, copyBuffer.Length)) > 0)
                {
                    decompressedData.Write(copyBuffer, 0, bytesReadFromInputStream);
                }
                returnValue = new SqlBytes(decompressedData.ToArray());
            }
        }

        return returnValue;
    }
}
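Once compiled, the assembly would be registered in the database along these lines. This is a sketch only: the assembly name and file path are assumptions, and CLR integration must be enabled on the instance first.

```sql
-- Enable CLR integration (instance-level setting).
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO

-- Assembly name and path are assumed for illustration.
CREATE ASSEMBLY CompressionFunctions
FROM 'C:\Deploy\CompressionFunctions.dll'
WITH PERMISSION_SET = SAFE;
GO

CREATE FUNCTION dbo.CompressData (@DataToCompress varbinary(max))
RETURNS varbinary(max)
AS EXTERNAL NAME CompressionFunctions.UserDefinedFunctions.CompressData;
GO

CREATE FUNCTION dbo.DecompressData (@DataToDecompress varbinary(max))
RETURNS varbinary(max)
AS EXTERNAL NAME CompressionFunctions.UserDefinedFunctions.DecompressData;
GO
```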

 

PDF Image Files

Of the remaining 880GB of data in the database, 350GB was being used to store PDF files in varchar(max) columns. These files were also worthy of attention.

Working with PDF files is clearly outside the area of SQL Server, but it’s worth noting the changes we made to those files:

  • Reduced text resolution from 1200x1200 DPI to 150x150 DPI. Even at 75x75 DPI, I could not find a user who could see the difference.
  • Replaced existing TIFF images in new files with smaller (and higher-quality) PNG images. This not only reduced the size but improved the quality of scaled output. This change could not be made for older existing files. For those, I reduced the quality of the stored images using functions in the PDF library, until I achieved a good balance of size and quality.
  • Removed unnecessary fonts that were being stored in each and every PDF file.

The outcome of this was that the 350GB of PDF data was reduced to 120GB of data.

 

Summary

With most databases, there is a great deal of scope for reducing the overall size, and I discussed earlier in this series why doing so is so important.

For this particular customer database, applying the following changes:

  • Balanced approach of ROW and PAGE compression
  • Compressed large string data
  • Reduced PDF image sizes

took the database from 3.8TB total to 550GB of data. That’s a big outcome for the client. 

The final (almost unexpected) benefit was that the database now fitted almost entirely in memory (the customer system had 512GB of main memory). You can imagine how that improved the overall performance and manageability.

Next time, we’ll be back to another interesting case that came up during my consulting work.
