SQL by Design: The Reason for NULL

When you're gathering business rules and requirements for a database project and implementing them in the data model, you might need to let a data value, or attribute, assume a value of unknown. Unknown is real. So you can expect to be able to store unknowns in your databases. Relational databases use a placeholder called NULL to represent unknowns. But NULL brings complications and overhead, so some data modelers and database administrators avoid using NULL. As I explain the use and limitations of NULL, keep in mind that one goal of a relational database is to express accurate data.

What Is NULL?

NULL is a condition. For example, suppose you need a date of birth to calculate retirement age. The DateOfBirth field must be a date data type, so you can't enter unknown, not applicable, zero, or a bogus value such as 1800-01-01. In such cases, you need a placeholder until you have the correct value. NULL fills this need. NULL means that some value is unknown, missing, or irrelevant. This three-valued logic is an important feature of the SQL language as defined by the American National Standards Institute (ANSI) and requires a NULL condition. In most programming languages, a Boolean expression (or predicate, in SQL), such as haircolor = 'brown' and gender = 'F', can be only true or false. In ANSI SQL, you have a third option, unknown (NULL), so the expression haircolor = 'brown' and gender = 'F' can equate to true, false, or unknown. If you have a query such as

   SELECT * FROM employee WHERE haircolor = 'brown'
   AND gender = 'F'

and you test against a table row in which the value of haircolor is brown and the value of gender is F, the test is true and that row is returned in the result set. If you test against a row that contains values of blonde and F for haircolor and gender, the test is false and that row isn't returned in the result set. Finally, if you test against a row that contains NULL for haircolor and F for gender, this test also fails, because you don't know what color the hair is. This third row is not part of the result set.

Because NULL is a placeholder and not a value (such as zero or a space), the ANSI specification states that the NULL condition must be the same for all data types (number, character, date). So you can expect equivalent behavior when you use any data type in a comparison expression that includes NULL.

The ANSI specification states that a column is nullable by default. Not all database management system (DBMS) vendors implement the ANSI specification, nor do they implement it consistently. For instance, SQL Server 4.2 behavior was NULL by default, the ANSI standard; but starting with SQL Server 6.0, the NULL property switched to NOT NULL by default. So to be sure, always specify whether a column can be NULL when you're writing CREATE TABLE statements. (You can download an example DDL at from the "Download the Code" icon at the top of the page.)

Why Use NULL?

A NULL condition is neither a space (for character or date data types) nor a zero value (for number data types). When you're developing and implementing a data model, you might find that you need to specify a NULL condition for a column. Figure 1 is a detailed entity-relationship diagram (ERD) of a simplified workplace environment with three tables: Project, Employee, and Paycheck. Project contains attributes describing the kinds of workplace projects an employee might be assigned to, Employee contains attributes describing individual employees, and Paycheck contains attributes describing employee compensation. To demonstrate a point, I've simplified the relationships and restricted the business rules. Each Project can have many Employees assigned to it, but an Employee is associated with only one Project at a time. Each Employee receives many Paychecks, but each Paycheck goes to only one employee.

I made some assumptions about the attributes in the Project table. The primary-key attribute (ProjNo), the project title (ProjName), and the starting date of the project (StartDate) all must have values when a row is inserted into the table; the end date of the project (EndDate) may have a value. ProjNo, ProjName, and StartDate all are created with a NOT NULL property; EndDate is created with the NULL property.

In the Employee table, the primary key (EmpNo), the employee's family name (EmpLname), and the employee's Social Security number (EmpSSN) must be filled in for each row in the table. The employee's given names (first plus middle, if known) can be NULL, and so can the associated project number.

Each time a paycheck is issued, an entry is made in the Paycheck table. The two-part (concatenated, or composite) primary key (WeekNbr plus EmpNo), the employee's Social Security number (EmpSSN), and the paycheck amount (PayAmt) must have values. The employee's job code (JobCode) may be NULL.

Figure 2 illustrates this design. I populated the tables with test data to demonstrate the concept of NULL. The Project table contains three projects, one of which has a firm end date; the others are open-ended. The decision to make EndDate nullable was based on the business environment. Not every project has a firm end date, and for some projects the end date is unknown or unimportant because a project might be continual and ongoing. Suppose the project manager wants to track project duration in days and months. If you make EndDate NOT NULL and use some sort of code to indicate no end date or an ongoing project, you can't make EndDate a datetime data type. And you can't use date functions such as DateDiff, which gives the project duration, as you can see in Listing 1.

NULL Logic Limitations

An unfortunate limitation of the ANSI three-valued logic specification is that NULLs can propagate. For instance, if you have employee age in the Employee table and want to determine the average age of your employees, you might write a query such as

   SELECT AVG(age) FROM employee

and expect it to return some result. But according to some implementations of three-valued logic, when you're computing aggregate values, if a single age is NULL, the result set must be NULL. Any NULL age is unknown, and you can't add real numbers to an unknown value and get a real number for the result.

The SQL Server implementation of NULL is consistent with the ANSI specification, announcing when a NULL value is involved in the computation and using only the real values to get the result. You can see examples of these warnings in the results for Listings 2 and 3. The warning messages that return with these queries are visible in SQL Server 6.5 only if you SET ANSI_WARNINGS ON from the ISQL/w window before running the code. With SQL Server 7.0, ansi_warnings are on by default, so you'll probably see the warnings if you run these examples.

NULL Overhead

The cost of using NULL offsets the benefit of being able to express accurate information. The ANSI specification doesn't suggest how to manage NULL columns; ANSI leaves the administration to the developers.

SQL Server 6.5 handles NULL columns as variable-length data types (varchar and varbinary). Even in a table of all fixed-length data types, such as Paycheck, the row layout and overhead are variable-length format because JobCode is nullable. If a row contains all fixed-length data types and no NULL column properties, 4 bytes are added to each row for administrative overhead. If a row contains variable-length data types or if any column is nullable, each row has at least 9 bytes added to it. Five more bytes per row taken out of each page doesn't sound like much. However, 63 fixed-length Paycheck records can fit on one SQL Server 6.5 page, whereas only 54 variable-length records fit on that same page. To get the most out of these I/O operations, you want the greatest possible number of records per page read and write.

To determine the number of records that will fit on a page for SQL Server 6.x, calculate the blocking factor. The blocking factor is the number of logical records that can fit in a physical record, or page, and thus, the number of logical records that a single I/O operation can return. To determine the blocking factor for a SQL Server 6.5 table with all fixed-length columns and no NULL, start with a 2016-byte page. Total the number of bytes in each column, and add 4 bytes for overhead. Divide 2016 by the summed total; round down for the final answer. If the Paycheck table consists of fixed-length logical records of 28 bytes each, a maximum of 63 fixed-length records can fit in one SQL Server 6.5 page: 2016/(28+4).

However, if a table contains variable-length columns or if any column (including fixed-length columns) is nullable, fewer logical records are stored on a page. To calculate this number for SQL Server 6.5, find the sum of the fixed-column width, fixed-column overhead, the average variable-column length, and the variable-length overhead, and divide that value into the page size. In the Paycheck table, which has one column with a NULL property, you can fit only 54 records on one SQL Server 6.5 page: 2016/(24+4+4+5). This difference doesn't sound like much, and for small data sets you probably won't notice any performance degradation. However, with large data sets involved in sequential processing, you might see a substantial difference in performance, based on physical file I/O, cache hits, or both.

File I/O isn't the only function NULL affects. SQL Server 6.5 has to do more lookup and computation for variable-length rows than for fixed-length rows, thus putting more load on the processor. Whereas the columns of a fixed-length row are stored in the order in which they are created, the variable-length record is not stored in creation order. Fixed-length columns are first in storage, followed by 2 bytes to indicate the total row width. Then comes each variable-length (or NULL) column, followed by a couple of bytes of column offset information, followed by a one-byte entry per variable-length column, pointing to the column's starting location in the row.

When you issue a SELECT * query against a table with only fixed-length columns, the operations required are 1) read the rows and 2) present the result set. When you issue a SELECT * query against a table with variable-length or nullable columns, the operations required are 1) compute the length of each variable-length column, 2) rearrange the order of the columns from storage to creation order, and 3) present the result set.

As if this additional processor load weren't enough, NULL columns affect indexes. SQL Server 6.5 treats each index built on a column that might contain NULL in much the same way it treats tables with variable-length columns. You create an index on a column to enhance retrieval and shorten response time. But an index you create on a column that contains instances of NULL doesn't perform as well as an index that you create on a NOT NULL column.

NULL has a different effect on SQL Server 7.0 file I/O. SQL Server 7.0 treats a fixed-length column with NULL like a fixed-length field, not a variable-length field. Also, SQL Server 7.0 stores a special NULL bitmap in every record to identify which columns contain NULLs, even in rows that have no nullable columns. The SQL Server 7.0 engine must reference this bitmap for every row it accesses. With NULL, you lose some of the gain in file I/O that SQL Server 7.0 gives you, because of the additional processor time necessary to decode the bitmap with every operation.

One good reason to disallow NULL relates to update handling (i.e., how the SQL Server engine modifies the record during an update operation). Because a NULL property causes SQL Server 6.5 to treat a column as a variable-length field, rarely can you perform an update in-place when a table contains variable-length columns or NULLs. For most updates, SQL Server 6.5 performs an insert operation followed by a delete operation. The exception is the direct update in-place, which is the quickest, most efficient method of performing an update. The direct update in-place doesn't relocate the data, unlike deferred update and direct update not-in-place, and thus incurs the least overhead of all the update methods.

SQL Server 7.0 doesn't have anything comparable to the SQL Server 6.5 deferred update, so the impact of NULL properties in an updated record is not as profound in SQL Server 7.0 as in 6.5. SQL Server 7.0 has enhanced NULL handling, which makes NULLs less cumbersome and incurs less performance degradation than in previous releases.

NULL helps express a true condition of data. Therefore, you can use NULL and view performance overhead as a cost of doing business.

Comments

Plain text