NULL in the Data Warehouse

The arguments for and against NULL in the data warehouse are complicated. The goal of the data warehouse is to represent a single version of the truth for the enterprise. This article examines situations in which using NULL is appropriate and situations in which it could confuse end users.

A database is more than a repository for raw data—it’s meant to hold the truth. I made that statement ten years ago when I wrote about NULL in OLTP databases. In an earlier article, “Data Warehousing: Dimension Basics” (October 2007), I commented that we scrub, decode, and transform data warehouse data so it can present a single version of the truth to the enterprise. Yet in the data warehouse community there are vigorous discussions about whether NULL, which is tightly aligned with truth, should be represented in a data warehouse.

NULL is a condition of the unknown. For instance, you don’t know John Smith’s marital status or eye color, but you do know that he has both. In your organization’s human resources database, entering any value other than NULL in answer to these questions would store an untruth. In the database world, NULL can have many meanings; “value is not relevant” and “value exists but is unknown” are the two most common and the two most easily understood by the user community.

NULL in OLTP and Nonrelational Databases

In an OLTP database it’s important to use NULL where and when it’s relevant; otherwise you’re masking the truth in the data. All major relational database management systems (RDBMSs) today represent NULL explicitly, but that doesn’t mean that database schemas have been properly architected to use it. Too often, DBAs who learned to build databases on nonrelational platforms use alternative methods to represent the unknown condition. In addition, programmers who don’t grasp the concept or importance of NULL, or who simply don’t want to be bothered coding for it, create tables during the development stage of an application’s life cycle using all manner of implicit representations of the unknown condition. These tables end up in production.

A surprising number of nonrelational databases are still in production today, and few (if any) of them support NULL explicitly. True legacy systems still in production contain data that implies NULL, such as a value of -1 in integer columns that should hold only positive values. Other tricks include using zero where the numeric values should be nonzero, using a fictitious date value such as 99999, and using variations on the character string unknown, such as n/a, na, not relevant, missing, and so forth.
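A quick profiling query can surface these implied NULLs before they reach the warehouse. Here’s a minimal sketch; the staging table stg_employee, its columns, and the specific sentinel values are invented for illustration:

-- Count implied-NULL markers in a hypothetical staging table.
-- Table, columns, and sentinel values are illustrative only.
SELECT
  SUM(CASE WHEN salary = -1 THEN 1 ELSE 0 END) AS implied_null_salary,
  SUM(CASE WHEN hire_date = '99999' THEN 1 ELSE 0 END) AS implied_null_hire_date,  -- raw character column in staging
  SUM(CASE WHEN LOWER(marital_status) IN ('n/a', 'na', 'not relevant', 'missing', 'unknown')
           THEN 1 ELSE 0 END) AS implied_null_marital_status
FROM stg_employee;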

The existence of nonrelational production databases makes a difference to the data warehouse DBA because legacy and OLTP databases are the source of data for the data warehouse. The data warehouse DBA has to deal with these explicit and implicit NULL conditions. But first the data warehouse DBA has to answer the most relevant question, “Should you allow NULL in the data warehouse?”

NULL in the Data Warehouse

The arguments for and against NULL in the data warehouse are complicated. The goal of the data warehouse is to represent a single version of the truth for the enterprise. An exception to this goal is when you use business intelligence (BI) on direct-loaded data to ferret out the gap between perception and reality, as I discussed in my opinion piece “BI Without the Data Warehouse” (September 2008).

Incorporating NULL into the relational data warehouse can vastly complicate querying and cause much end-user confusion. If IT specialists (programmers, developers, and DBAs) can’t grasp the idea or importance of NULL, how can we expect nontechnical end users to understand it? We must assume that business end users are querying the data warehouse for statistical information; how will NULL affect the answers they get from the data? If an end user queries the warehouse and filters on a nullable value, is it valid to leave the entire record out of the analysis? Would this matter? If a value is unknown at the time of measurement, which is a plausible condition with slowly changing dimensions, would it be better to omit the record from the analysis, or should you include it with some sort of zero or “temporarily unknown” condition? How would you do that? If a value exists in the source database but for some reason the extraction, transformation, and loading (ETL) process failed to bring it into the data warehouse, is it better to treat that value as NULL (the true condition), or should you calculate and assign a median or average value to it?

A NULL attribute can cause an entire record to be left out of an analysis, which, in a small data set, could skew results and give misleading information to business decision-makers. Worse, the presence of NULL during cube processing could cause SQL Server Analysis Services (SSAS) to throw an error and stop processing. On the other hand, substituting values for NULL (e.g., zero, median or average values, fictitious dates) can also produce misleading results. Data warehouse DBAs need to support BI users by giving them the cleanest data set with the clearest inherent meaning. You don’t want to confuse or encumber end users by embedding implications of NULL in the data, such as a -1 in place of a positive number to signal that the value is actually unknown, which forces end users to treat records containing -1 differently from the rest of the data set. So is there a “best way” to handle NULL in the data warehouse, one that satisfies both the truth condition and the ease-of-use condition? Well, maybe.

Consider Data Warehouse Business Requirements

Before I make my recommendations, remember: nothing is absolute. Your decisions must be based on the business requirements that the data warehouse was built to support. In general, I suggest leaving currency and percentage attributes NULL. A zero dollar amount and an unknown dollar amount are very different things, and a zero percent, when incorporated into a statistical evaluation, could skew results far more than an unknown percentage would.
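To see the skew, compare how AVG treats a NULL against a substituted zero. A minimal sketch with invented numbers:

-- AVG ignores NULL but counts a substituted zero.
CREATE TABLE #commission (pct decimal(5,2) NULL);
INSERT INTO #commission (pct) VALUES (10.00);
INSERT INTO #commission (pct) VALUES (20.00);
INSERT INTO #commission (pct) VALUES (NULL);

SELECT AVG(pct) AS avg_pct FROM #commission;  -- 15.00: the NULL row is excluded

UPDATE #commission SET pct = 0.00 WHERE pct IS NULL;
SELECT AVG(pct) AS avg_pct FROM #commission;  -- 10.00: the zero drags the mean down

DROP TABLE #commission;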

Date attributes are also best left NULL, because it’s very hard to enforce a fictitious date convention. In the OLTP source, where end-user applications restricted data access, it was relatively easy to filter out fictitious dates. But your BI users pretty much have the run of the data with their statistical and data mining tools, and those fictitious dates could distort mean and median values.

Leave numeric attributes NULL for reasons similar to those for currency, percentage, and date data types. Zero is too often a valid value to be used as an implicit NULL, and including negative values in a positive-number data set will only confuse the BI user, who has no idea that the negative number is meant to be NULL.

Lastly, for character data types I’d substitute a string such as “unknown” for NULL and “missing” for empty strings. That said, SQL Server 2005 Analysis Services and Microsoft Office Excel pivot tables deal easily with empty strings, and the presence of an empty string can be even more meaningful than a string that reads “missing.”
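The substitution itself is a one-liner per column. A sketch, assuming a hypothetical dim_customer table; whether you map the empty string to “missing” or leave it alone depends on which reading your users find clearer:

-- Replace NULL with 'unknown' and the empty string with 'missing'.
-- dim_customer and middle_name are invented names.
SELECT CASE
         WHEN middle_name IS NULL THEN 'unknown'
         WHEN middle_name = ''    THEN 'missing'
         ELSE middle_name
       END AS middle_name
FROM dim_customer;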

When to Do NULL Conversions

It’s easiest to do these conversions during the data transformation phase of the ETL process, before you load the data into the data warehouse. If you have data definitions or metadata (code books) that spell out the source’s implicit NULL conditions (e.g., -1 in an otherwise positive number data set, fictitious dates such as 99999), or if you can extract these implied conditions from the data itself, you can use those rules in the transformation logic, changing 99999 dates to NULL, for instance.
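In T-SQL, NULLIF and CASE cover most such rules. A sketch of transformation logic; the table, columns, and sentinel values are assumptions for illustration:

-- Convert implied NULLs to real NULLs during transformation.
SELECT NULLIF(unit_price, -1) AS unit_price,    -- -1 marked "unknown" in the source
       NULLIF(order_qty, 0) AS order_qty,       -- zero used as filler in the source
       CASE WHEN ship_date = '99999' THEN NULL
            ELSE ship_date END AS ship_date,    -- fictitious date, raw character column
       CASE WHEN LOWER(status) IN ('n/a', 'na', 'unknown') THEN NULL
            ELSE status END AS status
FROM stg_orders;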

If you do these conversions before loading data into the warehouse, you’ll save time and CPU cycles. As SQL Server 2005 Books Online explains, when SSAS processes a dimension and encounters a NULL, by default it converts a character NULL to an empty string and a numeric NULL to zero rather than throwing an error and stopping the processing. For more information, see the Books Online topic “Defining the Unknown Member and Null Processing Properties.”

If the default conversion behavior is going to skew results, as in rows where NULL would be converted to zero and swept into a calculation, I’d either modify the defaults in SSAS or pre-filter the input data set to exclude rows with NULL. The point is that SSAS won’t accept NULL in either the dimension or the fact table when it’s building a cube, and all this conversion-on-the-fly only adds time and CPU-cycle overhead to processing the cube. Not having to do this step is a plus for the data warehouse DBA and the BI analyst.
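Pre-filtering can be as simple as pointing the cube’s data source at a view that excludes the offending rows. A sketch with invented object names:

-- Feed SSAS from a view that excludes NULL keys and measures,
-- so no conversion-on-the-fly occurs during cube processing.
CREATE VIEW dbo.v_FactSalesClean
AS
SELECT customer_key, product_key, sales_amt
FROM dbo.FactSales
WHERE customer_key IS NOT NULL
  AND product_key IS NOT NULL
  AND sales_amt IS NOT NULL;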

Problems with Nullable Foreign Keys or Measures

Dimensions that have nullable non-key columns aren’t a problem, except for the overhead they could cause when you’re trying to process a cube. SQL Server won’t let you create a primary key on any column that’s nullable. Run the code in Listing 1, callouts A-D. Notice that when you try to define a primary key on the nullable column, colA, SQL Server throws the error “Cannot define PRIMARY KEY constraint on nullable column in table 'NULLtest'.”
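Listing 1 isn’t reproduced here, but you can demonstrate the behavior in a few lines. This sketch approximates callouts A-D; the table shape is my guess, not the article’s listing:

-- A nullable column can't participate in a primary key.
CREATE TABLE NULLtest
  (colA int NULL,
   colB varchar(10) NOT NULL);

-- Fails with "Cannot define PRIMARY KEY constraint on nullable
-- column in table 'NULLtest'."
ALTER TABLE NULLtest
  ADD CONSTRAINT PK_NULLtest PRIMARY KEY (colA);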

The fact table, which is full of foreign keys and measures, can be problematic during cube processing if either a foreign key or a measure is nullable. In an ANSI-compliant database such as SQL Server, NULL propagates. If you perform a string concatenation of multiple columns, as in the sample at callout E of Listing 1, you’ll see that the row containing a NULL in one of the participating columns returns NULL in the result set. Concatenating known values to an unknown value can only result in an unknown value.
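Using the NULLtest table from the previous sketch, something along the lines of callout E shows the propagation:

-- A NULL in any participating column makes the whole expression NULL.
INSERT INTO NULLtest (colA, colB) VALUES (1, 'alpha');
INSERT INTO NULLtest (colA, colB) VALUES (NULL, 'beta');

SELECT CAST(colA AS varchar(10)) + '-' + colB AS concatenated
FROM NULLtest;
-- Returns '1-alpha' for the first row and NULL for the second.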

If you don’t have SET ANSI_WARNINGS turned ON, you’ll never see the warning message “Null value is eliminated by an aggregate or other SET operation,” and you’ll never realize that the result returned only approximates the truth. See the code in callout F of Listing 1.

Does NULL truly propagate through every function in a SQL Server environment? Not really. T-SQL has extensions that circumvent NULL, letting you sum a column and get a result even when the column contains NULLs. If you run the code in callout G of Listing 1 and click the Messages tab, you’ll see that SQL Server issues the message “Warning: Null value is eliminated by an aggregate or other SET operation.”
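Continuing with the same test data, a sketch of what callouts F and G illustrate:

-- SUM skips the NULL row and returns 1; with ANSI_WARNINGS ON
-- you at least get told that NULLs were discarded.
SET ANSI_WARNINGS ON;
SELECT SUM(colA) AS total FROM NULLtest;
-- Message: "Warning: Null value is eliminated by an aggregate
-- or other SET operation."

SET ANSI_WARNINGS OFF;
SELECT SUM(colA) AS total FROM NULLtest;  -- same result, no warning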

You can have a NULL foreign key in a fact table, but the row will never be included in the output of an equi-join operation, simply because NULL can’t equal NULL (an unknown cannot equal another unknown). If you try to join the NULL foreign key from the fact table to a NULL column in a dimension, the comparison evaluates to unknown rather than true, and the row that contains NULL isn’t returned, as you’ll see when you run the code in callout H of Listing 1.
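A sketch along the lines of callout H, with a toy dimension and fact table (names and data invented):

-- NULL can't equal NULL, so the NULL-keyed fact row drops out of the join.
CREATE TABLE dimTest (dimKey int NULL, dimName varchar(20) NOT NULL);
CREATE TABLE factTest (dimKey int NULL, amount money NOT NULL);

INSERT INTO dimTest (dimKey, dimName) VALUES (1, 'Known member');
INSERT INTO dimTest (dimKey, dimName) VALUES (NULL, 'Null member');
INSERT INTO factTest (dimKey, amount) VALUES (1, 100.00);
INSERT INTO factTest (dimKey, amount) VALUES (NULL, 250.00);

SELECT d.dimName, f.amount
FROM factTest AS f
JOIN dimTest AS d ON f.dimKey = d.dimKey;
-- Returns only 'Known member', 100.00; the 250.00 row vanishes.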

What if the NULL foreign key wasn’t intentional? What if it was one of those rows that went through the ETL process and somehow had its foreign key corrupted or destroyed, so that where once there was a valid parent-child relationship, today there’s nothing? There’s no pkey-fkey association between the dimension and the fact table. Is there any way around this? The short answer is yes.

In the dimension, you can create a special “unknown member” row with a dimension key value of -1 and assign an empty string to as many of the attribute columns as you can, as shown in callout I of Listing 1. I’ve commented out the SET IDENTITY_INSERT commands because my little test table doesn’t have an identity column; let the commented-out commands serve as a reminder when you use this technique on your production systems. Then do a replacement in the fact table, substituting -1 for any NULL foreign key associated with the dimension you just altered, as I did in callout J of Listing 1. Finally, join the two tables, using the code in callout K of Listing 1. It’s not an ideal solution, but at least now you’ll be able to join the affected rows without having to compromise your ANSI-compliant database.
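Against the toy tables from the previous sketch, the fix along the lines of callouts I through K looks like this:

-- Callout I: create the special "unknown member" row.
-- SET IDENTITY_INSERT dimTest ON;   -- needed only if dimKey is an identity column
INSERT INTO dimTest (dimKey, dimName) VALUES (-1, '');
-- SET IDENTITY_INSERT dimTest OFF;

-- Callout J: repoint orphaned fact rows at the unknown member.
UPDATE factTest SET dimKey = -1 WHERE dimKey IS NULL;

-- Callout K: the join now returns every fact row.
SELECT d.dimName, f.amount
FROM factTest AS f
JOIN dimTest AS d ON f.dimKey = d.dimKey;
-- Both rows return; the formerly orphaned 250.00 row joins to dimKey -1.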

NULL in the Data Warehouse—Not Such a Good Idea

NULL in the data warehouse isn’t such a good idea because NULL interferes with cube-building operations and adds overhead to cube processing. In addition, NULL can propagate under specific circumstances, leaving you with unpredictable results. I suggest declaring most columns in your data warehouse NOT NULL. But remember, there are no absolutes when it comes to NULL: Consider the size of the data warehouse, how the business requirements are interpreted, and what works best for you.

 
