Question: I’m involved in a project to design a database schema and some of the tables are expected to have millions of rows and a large amount of LOB data (a mixture of binary and character). I know there are a bunch of options for storing this data but I’m struggling with how to choose between them. Can you help?
Answer: The answer to this question depends on the size of the data and how it will be used.
There are two kinds of data type for storing LOB data – true LOB data types that can store more than 8000 bytes and the limited data types that can store up to 8000 bytes – and of course there are pros and cons for each type.
If your data is going to be less than 8000 bytes then it makes sense to use one of the limited data types – (n)varchar (1-8000) or varbinary (1-8000). When the data type becomes really small (say less than 5 bytes), you need to decide whether to use a fixed-length type – (n)char – instead of a variable length type, to avoid the two-byte overhead that comes with variable-length columns. If you need to store 2-byte character set data, you’ll have to use nchar or nvarchar.
These data types are always stored ‘in-row’ – i.e. inline with the row they are part of on the same data file page (except for the special case when the row grows beyond 8060 bytes, but that’s beyond the scope of this answer). This means that the rows may become quite large and so very few rows can be stored per data file page. If the LOB data is not going to be used very often, this reduces the data density of the columns that are going to be used frequently – meaning more data file pages will have to be read and stored in memory to process the columns being used and lowering the efficiency of data processing operations.
You may decide, if this is the case, to store the seldom-used LOB columns in a separate table, and JOIN to them when they are needed, or to store them in true LOB columns that are stored off-row (i.e. in separate data file pages, but still in the same table). This increases the data density of the commonly-used data because the data rows are much smaller and many more can fit on each data file page – at the expense of extra overhead to access the off-row (or separate-table) LOB data. It’s a trade-off decision you need to make.
Of course, if the data is usually larger than 8000 bytes, you have no choice but to use one of the true LOB columns – (n)varchar (max) or varbinary (max) – (n)text and image have been deprecated and should not be used in new designs. You have the same data density dilemma here too as both of these types can be stored ‘in-row’ or ‘off-row’. There is the additional consideration of index maintenance operations with true LOB columns – in all releases until SQL Server 2012, the presence of a true LOB column in a table prevents online index operations on the table’s clustered index. This may force you to store the true LOB columns in a separate table and JOIN to them when needed.
If you need to store data larger than 2GB per value, or require very fast access to the data then you may consider the FILESTREAM data type as well. You can read about it in the whitepaper I wrote: FILESTREAM Storage in SQL Server 2008
I’ve really just touched the tip of the iceberg here. I go into more detail in a blog post I wrote a couple of years ago but this should have given you an idea of the considerations and trade-offs you need to make.