Don’t Avoid the UNKNOWN

Downloads
47010.zip

NULL is a tough and seemingly contradictory concept to understand. NULL is a nonvalue but sometimes is a value. That is, NULL represents an unknown or irrelevant value, or a way to represent a value that can't be expressed as an explicitly known value. Even after working with SQL Server for many years, programmers still have a hard time dealing with NULL because human minds have a hard time dealing with concepts that aren't measurable. In this respect, dealing with NULL in T-SQL is similar to dealing with infinity or even zero in mathematics.

Another problem with NULL is that logical expressions involving it can yield an UNKNOWN logical result instead of the familiar TRUE or FALSE. T-SQL contributes to the confusion by treating NULL values and UNKNOWN logical results differently in different contexts of the language.

If you think you're knowledgeable about NULLs, try to answer this question: Given two sets U \{2, 7, 9\} and V \{3, 7, NULL\}, represented by the tables U and V, both with an integer column col1, what would the following query return?

SELECT * FROM U WHERE
  col1 NOT IN(SELECT col1 FROM V);

Most programmers would say that the query will return 2 and 9, but the truth is different because of the way NULLs are treated in logical expressions. To shed some light on this truth and reduce the confusion, let's look at the inconsistencies in T-SQL when working with UNKNOWN logical results and NULL values.

Beware of the UNKNOWN

In most programming languages, logical expressions can yield one of two possible results: TRUE or FALSE. ANSI SQL is unique in this respect by supporting a concept called three-valued logic. In SQL, logical expressions can yield one of three possible results: TRUE, FALSE, or UNKNOWN. You basically get an UNKNOWN logical result whenever the answer is unknown. This happens when a NULL is involved in the expression because NULL represents an unknown value. For example, 5 > 3 is known to be TRUE, and 3 > 5 is known to be FALSE. However, 3 = NULL yields the UNKNOWN logical result. Even NULL = NULL yields the UNKNOWN logical result (unless the ANSI_NULLS set option is set to OFF, which isn't standard and should be avoided).

If you want to check whether a column value is NULL, instead of specifying

col1 = NULL

make sure you specify

col1 IS NULL

If you want to check whether a column isn't NULL, instead of specifying

col1 <> NULL

NOT(col1 = NULL)

make sure you specify

col1 IS NOT NULL

A confusing aspect of UNKNOWN is that when you apply the NOT logical operator, it still yields UNKNOWN. In other words, unlike NOT(TRUE), which yields FALSE, and NOT(FALSE), which yields TRUE, NOT(UNKNOWN) yields UNKNOWN. That's why the following Northwind query doesn't return all rows from the Customers table:

SELECT CustomerID, 'TRUE' AS logic, Region
FROM dbo.Customers WHERE Region = 'WA'
UNION ALL
SELECT CustomerID, 'NOT(FALSE)', Region
FROM dbo.Customers WHERE NOT(Region = 'WA');

The query returns all customers that are known to be from the WA region and all customers that are known not to be from the WA region. But the query doesn't return customers from unknown regions. To include those customers, you need to run the query

SELECT CustomerID, 'TRUE' AS logic, Region
FROM dbo.Customers WHERE Region = 'WA'
UNION ALL
SELECT CustomerID, 'NOT(FALSE)', Region
FROM dbo.Customers WHERE NOT(Region = 'WA')
UNION ALL
SELECT CustomerID, 'Is UNKNOWN', Region
FROM dbo.Customers WHERE Region IS NULL;

How to Treat the UNKNOWN

Although in real life it's fair game to say that the answer to 3 = NULL is unknown, in T-SQL it would be more accurate to say that 3 = NULL is known to be UNKNOWN (i.e., the UNKNOWN logical result). It's a subtle but important distinction between saying that you don't know what the result is and saying that you know that the result is UNKNOWN. T-SQL treats UNKNOWN logical results deterministically. That is, in the same context, UNKNOWN will always be treated the same way. The confusing part is that in different contexts UNKNOWN is treated differently. However, as long as you know what to expect in T-SQL code, there should be no problems.

For example, in all query filters (e.g., ON, WHERE, HAVING), UNKNOWN is always treated as FALSE. In other words, when the filter yields UNKNOWN, the row isn't returned. For example, the following query returns all customers that are known to be from the WA region:

SELECT * FROM dbo.Customers WHERE Region = 'WA';

Technically speaking, queries only return rows for which the logical expression in the filter returns TRUE. Rows aren't returned when the logical expression returns FALSE or UNKNOWN.

A logical expression used in a CHECK constraint is treated differently. Rows for which the CHECK constraint expression returns either TRUE or UNKNOWN are accepted; rows for which the expression returns FALSE are rejected. For example, if you had the following CHECK constraint in the Customers table

CHECK(Region = 'WA')

both WA and NULL in the Region column would qualify and all known regions that aren't WA (e.g., NY, CA) would be rejected.

Now that you know how an UNKNOWN logical result is treated in a filter, try to answer the question I posed at the beginning of this article. Run the code in Listing 1 to create and populate the tables U and V, then try to guess which rows would be returned from U by the query

SELECT * FROM U WHERE
  col1 NOT IN(SELECT col1 FROM V);

Finally, run the query to see what you really get. This query returns all members of table U that are known not to be members of table V. The member 7 is known to be in V, so you don't want to see it. However, you can't tell for sure whether member 2 or member 9 isn't in V. The NULL in V is an unknown value, which means that it might well represent a 2 or 9. In other words, when you have a NULL in V, you can never know for sure whether a value in U isn't in V. Hence, you get an empty set back.

The description I just gave is a logical explanation of why you get an empty set back from the query. For a more technical explanation, let's evaluate the expression for a value that's known to appear in both sets (e.g., 7), and one that isn't (e.g., 2). For the value of 7, the expression would be

7 NOT IN(3, 7, NULL) =
NOT(7=3 OR 7=7 OR 7=NULL) =
NOT(FALSE OR TRUE OR UNKNOWN) =
NOT(TRUE) =
FALSE

Because the result is FALSE, you don't get 7 back. You know for sure it appears in V, so you don't want to see it. Now let's evaluate the expression for 2, which appears in U only:

2 NOT IN(3, 7, NULL) =
NOT(2=3 OR 2=7 OR 2=NULL) =
NOT(FALSE OR FALSE OR UNKNOWN) =
NOT(UNKNOWN) =
UNKNOWN

Because the logical expression evaluates to UNKNOWN—and in a filter UNKNOWN is treated as FALSE—you don't get 2 back. In other words, if there's a NULL in V, the values that appear in U and V will yield a FALSE, whereas the other values in U will yield UNKNOWN, which is treated as FALSE. If you keep this in mind, you'll realize that you'll always get an empty set back.

If you want to return values from U that are known not to appear in V, you need to run a different query because that's a different type of request. One option is to use NOT IN to query only the known values in V:

SELECT * FROM U WHERE col1 NOT IN
  (SELECT col1 FROM V WHERE col1 IS NOT NULL);

This query returns 2 and 9 as expected. Another option is to use NOT EXISTS:

SELECT * FROM U WHERE NOT EXISTS
  (SELECT * FROM V WHERE V.col1 = U.col1);

This query also returns 2 and 9. The reason why you don't need to test for known values explicitly here is that EXISTS can only return TRUE or FALSE. It has no UNKNOWN state. Either the subquery returns rows or it doesn't. The trick is that an unknown state in the subquery's filter is treated as FALSE; therefore, the outer query yields the desired TRUE (NOT(FALSE)). You get values that appear in U but are known not to appear in V.

NULLs and Scalar Expressions

Any arithmetic operation involving NULL, such as

SELECT 5+NULL, 5-NULL, 5*NULL, 5/NULL

yields a NULL because the result is unknown. Similarly, an attempt to concatenate a string with a NULL, such as

SELECT 'abc' + NULL

yields a NULL (unless the CONCAT_NULL_YIELDS_NULL option is set to OFF, which isn't standard and should be avoided).

ISNULL and COALESCE are two handy functions that let you substitute a NULL with another value. ISNULL is T-SQL specific. It accepts two inputs and returns the first one that isn't NULL. Coalesce is ANSI compliant. It accepts any number of inputs and returns the first one that isn't NULL.

All NULLs Are Not Equal

You might be tempted to think that all NULLs are equal because they all represent an unknown value. However, as I mentioned previously, NULL = NULL yields UNKNOWN in logical expressions. Depending on whether the logical expression is used in a filter or in a CHECK constraint, SQL Server will determine whether to treat UNKNOWN as FALSE (in filters) or TRUE (in CHECK constraints).

Grouping, distinct, sorting, uniqueness, and set operations can also affect whether NULLs are treated as equal. In grouping and distinct operations, NULLs are treated as equal. That is, NULLs are grouped together or considered duplicates by the DISTINCT clause. For example, the following Northwind query produces only one group from all rows in which the Region value is NULL:

SELECT Region, COUNT(*) AS cnt
FROM dbo.Customers
GROUP BY Region;

This query returns 19 rows, one of which represents the group of unknown regions. That group contains 60 members in the count.

Sorting operations also treat NULLs as equal, but ANSI doesn't specify whether NULLs should be sorted lower or higher than known values. ANSI leaves this determination to the database platform implementation. Either approach is fine, as long as it's consistently applied throughout the database product. In SQL Server, NULLs are sorted lower than known values. Hence, the following query returns all rows with a NULL in the Region column first:

SELECT Region, CustomerID FROM dbo.Customers
ORDER BY Region;

If you want NULLs to sort last, you'd use the query

SELECT Region, CustomerID FROM dbo.Customers
ORDER BY CASE WHEN Region IS NULL THEN 1 ELSE 0 END, Region;

The ORDER BY clause calculates a primary sort value, which is 0 for a known region and 1 for an unknown one. This guarantees that NULLs sort last. Within known regions, the Region value itself is used as the secondary sort value, ensuring a correct sort among regions.

In uniqueness operations, ANSI provides two nuances of a UNIQUE constraint. One nuance treats NULLs as equal, which means there's only one occurrence of a NULL. The other nuance treats NULLs as different values, which means there can be multiple occurrences of NULLs. T-SQL implements only the former, so only one occurrence of a NULL is allowed in a column with a UNIQUE constraint. If you want to enforce uniqueness in a column for only known values, you can use a trigger. Alternatively, you can use an indexed view created for the non-NULL values in the column, as the following example shows:

CREATE TABLE dbo.T1(c1 INT);
GO
CREATE VIEW dbo.VT1 WITH SCHEMABINDING
  AS SELECT c1 FROM dbo.T1 WHERE c1 IS NOT NULL
GO
CREATE UNIQUE CLUSTERED INDEX idx1 ON dbo.VT1(c1);

Because the view filters out the NULL values, you can enter as many as you like to into T1. However, if you attempt to enter duplicates of a known value, they're rejected by the indexed view because it enforces uniqueness.

The set operations UNION, EXCEPT, and INTERSECT treat NULLs as equal. (Note that EXCEPT and INTERSECT are in SQL Server 2005 only.) For example, when the UNION set operation compares two rows to eliminate duplicate rows, it considers NULLs as equal. Similarly, when EXCEPT looks for rows in the first set that don't appear in the second set or when INTERSECT tries to identify which rows appear in both sets, they treat NULLs as equal.

To NULL or Not to NULL?

I find the ability to express NULL values and UNKNOWN logical results very useful in T-SQL, so I don't recommend that you try to avoid using them altogether. The way I see it, the important thing is to be familiar with the subtleties of NULL and UNKNOWN treatment in T-SQL. Knowing how they are treated enables you to anticipate and plan for the behavior so you can make sure you handle all cases in your code.

Comments

Plain text