Designing for Performance: NULL or NOT NULL?

When you're gathering business rules and requirements for a database project and implementing them in the data model, you might need to let a data value, which eventually will become a column in the database, assume a value of unknown. Unknown is real; you work with unknowns every day.You should be able to store unknown in your databases. Relational databases, SQL Server included, use a placeholder called NULL to represent unknowns. NULL brings complications and overhead, so that some data modelers and database programmers avoid using it. But NULL has both supporters and detractors. The question when you're designing for performance is, is NULL a matter of choice? Or should you stay away from NULL altogether?

What Is NULL?

Before we answer those questions, though, let's define what NULL actually is and what it's good for. NULL is a condition. For example, suppose you need a date of birth to calculate retirement age.The DateOfBirth field must be a date data type, so you can't enter unknown or not applicable or n/a, and you shouldn't enter zero or a bogus value such as 1800-01-01. In such cases, you need a placeholder until you can enter the correct value. NULL fills this need.

NULL means that some value is unknown, missing, or irrelevant.This three-valued logic is an important feature of the SQL language. It's defined in the ANSI SQL standard and requires the presence of a NULL condition. In most programming languages, a Boolean expression (or predicate, in SQL), such as haircolor = brown and gender = F, can be only true or false. In ANSI SQL, you have a third option, unknown (NULL), so that the expression haircolor = brown and gender = F can equate to true, false, or unknown.

If you have an SQL query such as

SELECT * FROM employee 
  WHERE haircolor = 'brown' 
  AND gender = 'F'

and you test against a row in the employee table in which the value of haircolor is brown and the value of gender is F, the test is true and that row is returned in the result set. If you test against a second row, which contains values of blonde and F for haircolor and gender, respectively, the test is false and that row isn't returned in the result set. If you test against a row that contains NULL for haircolor and F for gender, this test also fails because you don't know what color the hair is; hence the NULL haircolor.This third row also isn't part of the result set.

Because NULL is a placeholder and not a value (such as zero or a space), the ANSI specification states that the NULL condition must be the same for all data types (number, character, and date). So you can expect equivalent behavior when you use any data type in a comparison expression that includes NULL. SQL Server implements this ANSI requirement by including a NULL bitmap in each record, one bit for each column; a value of 1 indicates that the corresponding column is nullable. The NULL bitmap is present in every record of every user table.

The ANSI specification also states that, at the time of creation, a column is nullable by default. Not all database management system (DBMS) vendors implement the ANSI specification, nor do they implement it consistently. Internally, the SQL Server 2005 and SQL Server 2000 engine default behavior doesn't conform to the ANSI standard—that is, columns created without a declaration of NULL or NOT NULL are created as NOT NULL. I suspect that SQL Server was optimized for pure speed in the beginning, and now it has to maintain these settings for backward compatibility.

However, you'll never see this behavior if you execute CREATE TABLE code in Query Analyzer. Tools like Query Analyzer are ODBC-based, and they automatically turn the ANSI_NULL_DFLT_ON option to true, which means that columns created without a declaration of NULL or NOT NULL are created as nullable. It's easy to switch back and forth between ANSI compliance and noncompliance, either by setting the ANSI options in database properties or by including a session SET command—such as SET ANSI NULLS (ON|OFF)—with your code. Because this situation can be so confusing, your best bet is simply to declare NULL or NOT NULL for each column when you create a table.

To NULL or NOT to NULL?

I've read in many articles and in SQL Server Books Online (BOL) that you should minimize when and how you use NULL. Some authors have even gone as far as telling readers to never declare a column nullable. There are valid reasons for concern: Not only does making columns nullable impose some minor additional overhead on SQL Server, you also must ensure that you've accounted for the NULL condition in any program that will access the database. (For more information about the database-engine overhead associated with making columns nullable, see the sidebar "NULL Overhead.") Many programming languages don't natively recognize a NULL result set—they'd flag this as an error condition—so it's often necessary to add special code to your programs to account for NULL in the result. The Common Language Runtime (CLR) in Visual Studio (VS) 2005 makes using nullable columns much easier; for more information about this topic, see the sidebar "NULL and the CLR."

However, I don't believe these problems offer enough reason to avoid NULL. NULL was created for a purpose: to express the unknown condition, which can be entirely valid, depending on the circumstances. ANSI has clearly advocated the use of NULL instead of other mechanisms such as codes or empty strings when appropriate.

Is there room for compromise when you're designing for performance and dealing with the NULL condition? I think so. Look at the Customer table in Figure 1, which is a conceptual data model. If you have columns that are meant to contain-street address, city, state, and postal code for your customers, and you know that for each customer you'll always have these values, it doesn't make sense to make these columns nullable when you create them. Instead, create them with the property NOT NULL. The notation <M> to the right of each column in the Customer table means mandatory. When this conceptual model is converted to a physical data model, all columns marked <M> will be generated with the NOT NULL property. (For guidelines to follow when designing a customer table to handle data in a Web-based retail environment, see the Web-exclusive sidebar "Design Tip: Defining a Customer," http://www.sqlmag.com, Instant Doc ID 49297.)

On the other hand, if, for the sake of argument, we say that the CUSTOMER table will hold both Web customers and walk-in customers at the brick-and-mortar store, would it be reasonable to make CustEmailAddr (the customer's email address) mandatory also? A walk-in customer might not even have an email account (hard to believe, but true!), so in fact you should make that column nullable. (See the Web-exclusive sidebar, "Design Tip: Choosing a Primary Key for a Web-Customer Table," http://www.sqlmag.com, InstantDoc ID 49298, to learn about the drawbacks of using an email address as a primary key.) If, instead, you make the email address column NOT NULL and use codes or phrases to indicate that the customer doesn't have an email address, you're requiring that all programs and programmers know how to filter out valid email address values from the table.

A Matter of Choice
When dealing with the concept of NULL, you have some choices. Circumventing NULL—that is, making all columns in all tables NOT NULL and using codes and phrases to represent the unknown condition—means placing the responsibility of filtering for an accurate result set on your programming staff. It also means storing anomalous or misleading data in the database, as in the case of DateOfBirth, which I mentioned earlier.

Embracing NULL means that your programming staff will have to write code that checks for the NULL condition. Using NULL offers one benefit, though. The bit data type is no longer just true or false. Making a bit data type nullable now means that it can have in essence three values— true, false, or NULL—a handy method for describing reality.

With either approach, you'll incur a little overhead. Each time SQL Server reads a record, the processor decodes the NULL bitmap. In SQL Server 2005, 2000, and 7.0, the NULL bitmap is always present in each record. If the bitmap is present, the processor will decode it, so even if all columns in the table are declared NOT NULL, you'll still incur some processor overhead.

I believe that there's room for compromise, that you can make this a NULL-win situation. Don't make all columns nullable, and don't make them all NOT NULL. Use NULL only when necessary, in situations where unknown is a valid condition. Everywhere else, use NOT NULL.

The choice is yours, to NULL or NOT to NULL. Either methodology incurs overhead. Either methodology requires that the programming staff make accommodations to get accurate results. However, if you find yourself in a mixed environment, with data exchanges happening between SQL Server and other DBMSs, I encourage you to adopt the ANSI standard and go with NULL.

Comments

Plain text