IS NOT DISTINCT FROM

Treatment of NULLs adds a fair amount of complexity to SQL. Here I’d like to point out a specific issue that has to do with NULL treatment and point out a standard SQL language element that addresses this issue and was not yet implemented in SQL Server. At the end of this article I’ll provide a URL where you can vote for adding it to SQL Server.

When you compare columns/expressions, say in the ON or WHERE clauses of a query, predicates follow what’s called the 3-valued-logic. For example:

SELECT ...
FROM dbo.T1 JOIN dbo.T2
  ON T1.col1 = T2.col1;

The predicate can evaluate to TRUE, FALSE or UNKNOWN. The predicate evaluates to UNKNOWN whenever NULLs are involved. For example (T1.col1 = 5, T2.col1 = NULL), (T1.col1 = NULL, T1.col2 = 10) and even (T1.col1 = NULL, T2.col1 = NULL). Some T-SQL language elements will handle UNKNOWN differently than others. For example, ON, WHERE, HAVING ,IF and CASE “accept TRUE”, and will treat UNKNOWN somewhat similar to FALSE, but not exactly like FALSE.

For example, in the above example, when T1.col1 = T2.col1 evaluates to UNKNOWN, like with FALSE, this will be considered a nonmatch. However, the reason that I said that the treatment here is not exactly like with FALSE is that if T1.col1 = T2.col1 evaluates to UNKNOWN, NOT T1.col1 = T2.col1 also evaluates to UNKNOWN. Other language elements like a CHECK constraint “reject FALSE”, and treat UNKNOWN more similar to TRUE, but again, not exactly like TRUE. For example, an expression such as CHECK(col1 = col2) will reject a row where col1 = 5 and col2 = 10, but will accept col1 = 5 and col2 = NULL. But as I mentioned earlier, if col1 = col2 evaluates to UNKNOWN, NOT col1 = col2 still evaluates to UNKNOWN. So in this case, when col1 = 5 and col2 = 10 the row will be accepted this time (because NOT FALSE is TRUE), but col1 = 5 and col2 = NULL will also be accepted.

T-SQL already provides an answer for cases where you want to explicitly check when the expression is or is not a NULL via the IS NULL and IS NOT NULL operators. However, there are cases where you want to compare columns/expressions from both sides, and use 2-valued-logic; that is, you want the predicate to evaluate to either TRUE or FALSE.

For example, suppose that in certain circumstances you want the predicate T1.col1 = T2.col1 to evaluate to TRUE when both sides have known values and are equal, or when both sides are NULL—in other words, you want to treat NULLs just like you do known values. For now, you have to add some logic such as:

(T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL))

This makes the expression awkward, and also you would probably cross your fingers in hope that this will not hurt optimization. Note that this expression will “accept” only the cases you want to treat as a match and “reject” all others, but it still has 3-valued-logic, e.g., T1.col1 = 10, T2.col1 = NULL evaluates to UNKNOWN. This means
that:

NOT(T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL))

Will evaluate to UNKNOWN in those cases as well and not to TRUE. So if you want all non matches of the positive predicate to be accepted, you won’t use the positive predicate and apply NOT on top, rather rewrite your
logic, e.g.,

T1.col1  T2.col1
  OR (T1.col1 IS NULL AND T2.col1 IS NOT NULL)
  OR (T1.col1 IS NOT NULL AND T2.col1 IS NULL)

In short, there are cases where you want predicates to use 2-valued-logic, and for now, the solutions are not pretty, and we can only hope that they are optimized well.

SQL Server 2008 introduces new language elements that add scenarioswhere NULLs are involved and you would want predicates to use 2-valued-logic, e.g., grouping sets related features and the MERGE statement.

I will cover those in detail in my October and November columns. For now, even though you may not be familiar with grouping sets and the MERGE statement, focus on the ON clause of the following MERGE statement
handling incremental updates of aggregates:

WITH LastDay AS
(
  SELECT 
    GROUPING_ID(
      custid, empid,
      YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ) AS grp_id,
    custid, empid,
    YEAR(orderdate) AS orderyear,
    MONTH(orderdate) AS ordermonth,
    DAY(orderdate) AS orderday,
    SUM(qty) AS qty
  FROM dbo.Orders
  WHERE orderdate = '20080419'
  GROUP BY
    CUBE(custid, empid),
    ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate))
)
MERGE INTO dbo.MyGroupingSets AS TGT
USING LastDay AS SRC
  ON     (TGT.grp_id = SRC.grp_id)
     AND (TGT.orderyear = SRC.orderyear
          OR (TGT.orderyear IS NULL AND SRC.orderyear IS NULL))
     AND (TGT.ordermonth = SRC.ordermonth
          OR (TGT.ordermonth IS NULL AND SRC.ordermonth IS NULL))
     AND (TGT.orderday = SRC.orderday
          OR (TGT.orderday IS NULL AND SRC.orderday IS NULL))
     AND (TGT.custid = SRC.custid
          OR (TGT.custid IS NULL AND SRC.custid IS NULL))
     AND (TGT.empid = SRC.empid
          OR (TGT.empid IS NULL AND SRC.empid IS NULL))
WHEN MATCHED THEN
  UPDATE SET
    TGT.qty += SRC.qty
WHEN NOT MATCHED THEN
  INSERT (grp_id, orderyear, ordermonth, orderday, custid, empid)
  VALUES (SRC.grp_id, SRC.orderyear, SRC.ordermonth, SRC.orderday, SRC.custid, SRC.empid);

Notice how awkward it is.
I had a discussion on the subject with fellow MVPs in a private forum. Steve
Kass did some research and discovered that ANSI SQL has an answer to
this problem in the form of operators:

IS \[NOT\] DISTINCT FROM

It does require getting used to, but once you do, it makes perfect sense.
DISTINCT treats NULLs like known values. That is, one NULL is not
distinct from another NULL, but a NULL is distinct from known values. 5 is
distinct from 10, but 5 is not distinct from 5.
Now let’s go back to the example I mentioned earlier:

T1.col1 = T2.col1

To have this predicate use 2-valued-logic, and have NULLs treated like
known values, instead of using:

(T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL))

You would use:

T1.col1 IS NOT DISTINCT FROM T2.col1

This is ultimately simpler than the awkward forms used nowadays and has much better potential for good optimization in terms of utilizing indexes efficiently. As for the inverse; take the following predicate:

T1.col1  T2.col1

In order to treat NULLs like known values, instead of using:

T1.col1  T2.col1
  OR (T1.col1 IS NULL AND T2.col1 IS NOT NULL)
  OR (T1.col1 IS NOT NULL AND T2.col1 IS NULL)

You would use:

T1.col1 IS DISTINCT FROM T2.col1

Besides being simpler and lending itself to better optimization, the IS \[NOT\] DISTINCT FROM operator give you pure 2-valued-logic, namely, apply NOT on top and you will get the exact inverse.

I find this standard language element very important, and this will become more evident in SQL Server 2008. Unfortunately, it was not implemented in T-SQL. Steve Kass who researched and discovered it submitted a
suggestion for this enhancement in connect.microsoft.com. I encourage you to vote for this suggestion here; the more votes it gets, the better are the chances you will see it in a future version of SQL Server.

Cheers,
--
BG

Comments

Plain text