Don't Be Afraid of BLOBs and CLOBs

Manipulating binary large objects (BLOBs) and character large objects (CLOBs) has always been difficult in SQL Server. Fortunately, SQL Server 2005 provides three new data types for large object (LOB) storage. With these new data types, you can easily manipulate BLOBs and CLOBs, especially when you use Common Language Runtime (CLR) procedures and functions. With the compression algorithms in Microsoft .NET Framework 2.0, you can create CLR functions that seamlessly compress and decompress LOBs with minimal performance impact. Before I show you how to do so, let me introduce you to the three new data types that make this compression and decompression possible.

Related: VARBINARY(MSX) Tames the Blob

The New Kids on the Block

SQL Server 2005 provides three new data types to store and manipulate LOBs:

varbinary(max), which you use for BLOBs
varchar(max), which you use for CLOBs
nvarchar(max), which you use for CLOBS with Unicode values

The new varbinary(max), nvarchar(max), and varchar(max) data types behave much like the traditional varbinary(n), nvarchar(n), and varchar(n) data types, respectively. However, the maximum storage size of the three new data types is 2^31-1 bytes, or about 2GB.

The new data types replace the text and image data types from previous versions of SQL Server. SQL Server 2005 Books Online (BOL) states that you shouldn't use the text and image data types in new SQL Server 2005 applications and you should change any existing text and image data types to the new data types in legacy applications.

Unlike text and image values, varbinary(max), nvarchar(max), and varchar(max) values can be used for variables and function parameters. They can also be returned by CLR (or T-SQL) scalar-value functions. These traits make them great candidates for data manipulation.

CLR procedures and functions use the .NET data types of SqlBytes and SqlChars to receive and process BLOBs and CLOBs, respectively. SqlBytes is used to pass varbinary(max) parameters, where as SqlChars is used to pass nvarchar(max) and varchar(max) parameters. CLR is Unicode based, so passing varchar(max) as a parameter implies converting it to Unicode before parameters are passed. SQL Server 2005 does all this without your intervention, but it's important that you're aware of the differences concerning Unicode and nonUnicode strings.

Compressing LOBs

Before SQL Server 2005, you had to use extended stored procedures to compress data, which is a difficult and risky business. SQL Server 2005's CLR integration capabilities make compression less complicated, more secure, and more stable.

Because you can convert data back and forth between BLOBs and CLOBs, let's explore how to use varbinary(max) for simplicity sake. Listing 1 contains a function, written in C#, to compress BLOBs. Compressing a BLOB is as easy as passing in a SQLBytes parameter, reading the BLOB's data, and writing that data to a compression stream.

Callout A in Listing 1 shows where the compression function receives a varbinary(max) value as a SQLBytes parameter. SQLBytes represents a mutable type that wraps either an array or a stream. Assuming it wraps an array, the code uses the Buffer property to access the underlying array and retrieve the BLOB's data, as callout B shows. Note that for large BLOBs, this method could raise an out-of-memory exception if you're running SQL Server 2005 Service Pack 1 (SP1) or earlier. Fortunately, unlike extended-procedure exceptions, CLR exceptions don't crash SQL Server 2005 machines. This size limitation has been resolved in SQL Server 2005 SP2.

Finally, the compression function uses the DeflateStream class in .NET Framework 2.0 to compress the BLOB, as callout C shows. The compression stream writes to a MemoryStream object that's later used to create a new SQLBytes object that's returned to the calling process. There is only one caveat: Microsoft's implementation of DeflateStream requires the stream to be closed before it writes the last compressed bytes. Flushing alone isn't enough. So, the function flushes, then closes the compression stream.

Decompressing LOBs

Compressing data makes LOBs unreadable to other applications, so you need a function to decompress the data. Listing 2 contains a BLOB decompression function that's written in C#. This function follows the same principles used in the compression function, but instead of writing data to a stream and returning a compressed block, it reads from a stream and returns a decompressed block.

As callout A in Listing 2 shows, the code uses a loop to read data from the DeflateStream object. This loop is necessary because the deflator doesn't know how much data it needs to read or how much compression was achieved. The code loops until all bytes are read, then writes the bytes read in each pass to a MemoryStream object.

Using the Functions

The T-SQL code in Listing 3 loads the assembly in SQL Server 2005 (a process called cataloging during which the assembly is verified for security and reliability) and creates the compression and decompression functions. With the assembly registered and the functions created, you can compress and decompress data without having to modify your application layers or database objects.

The sample script in Listing 4 demonstrates how you might use the compression and decompression functions in T-SQL code. This script first creates a table and adds some values to it. The script then runs a compression update followed by a SELECT statement that returns the uncompressed data. Note that this script allows column-level compression but lacks consistency check functions.

You Don't Need to Be Afraid Any Longer

With the new varbinary(max), nvarchar(max), and varchar(max) data types, you can use SQL Server 2005's CLR to easily compress and decompress LOBs. All you need to do is write the compression and decompression functions, load the assembly, and create the functions. You're then ready to use those functions in your scripts. However, for large implementations or mission-critical applications, you might consider using third-party products, such as SQLCompress.NET. In these situations, third-party applications offer several advantages, including the use of algorithms rather than the Buffer property to get the data in and out of BLOBs (thereby avoiding the size limitation in pre-SP2 servers), verification of compressed data to help detect data corruption, configurable memory block sizes (which keeps memory utilization down without impacting compression performance), and extensibility.

Comments

Plain text