Are You Really Saving Anything By Normalizing?

How much are you saving on disk space by normalizing a dimension hierarchy instead of flattening it? Consider a table containing 1 million 300-byte rows. Rounding up to accommodate record headers and null bitmaps, this table would take up about 300MB of space.

From the original dimension, remove a 15-byte category and replace it with a 2-byte category key: You've just saved yourself about 6.35MB of space (1 million times 13 bytes).

But then you have to add space needed for the category table - let's say 1,000 15-byte rows for a total of about 15KB. Subtract this from your savings for the dimension, and your space savings is now down to 6.3MB, out of the 300MB you started with. That's not bad, but that's not the full story.

Every dimension is associated with at least one fact table, and the fact table in this scenario is probably sized in the tens of gigabytes. We have to factor in the fact table because the dimension and the category will be joined to it an uncountable number of times - it's all part of the same picture, really. Let's say, for the sake of argument, that the fact table is 25GB. Now you're looking at 6.3MB saved on a total of 25.3GB - at this point, it's not even worth doing the math. You just don't save significant space by normalizing hierarchies in a data warehouse. Nor should you normalize in order to protect data integrity - that's what the extraction, transformation, and loading (ETL) process is for. You should be denormalizing and laying out records so that users can navigate and query with ease and confidence.

Comments

Plain text