Skip navigation

Finding Nonmatching Keys

Downloads
24907.zip

I have two tables called T1 and T2. To find out which keys in T1 didn't exist in T2, I used the following query:

SELECT * FROM T1
WHERE key_col NOT IN(SELECT key_col 
FROM T2)

However, the query returned an empty result set. How can I generate the data I need?

The query returns an empty result set when the T2 table that the query's IN predicate refers to has NULLs in the key_col column. To see why the query returned an empty result, let's look at an example that uses the Employees and Orders tables that Listing 7, page 14, shows. From those tables, the following query attempts to fulfill the request "return employees who made no orders" (in this case, employees 1 and 4):

SELECT * FROM Employees
WHERE empid NOT IN(SELECT empid FROM Orders)

However, the result is an empty set. To understand why the result is an empty set instead of employees 1 and 4, let's analyze the way SQL Server "thinks" when it processes the query logically one step at a time. Note that the following three steps aren't necessarily the way SQL Server processes the query physically, but the logical simplification of the query processing will help you understand the unexpected results.

Step 1: The subquery in the above IN predicate returns the list that Figure 2, page 14, shows. You can think of SQL Server replacing the list of values that the subquery's IN predicate returned, as the following query shows:

SELECT * FROM Employees
WHERE empid NOT IN(3, 2, 2, 3, 3, NULL, 2)

Step 2: You can think of SQL Server simplifying the list by removing duplicates because duplicates don't affect the result of the IN predicate:

SELECT * FROM Employees
WHERE empid NOT IN(2, 3, NULL)

Step 3: You can think of SQL Server logically replacing the IN predicate with a list of logical expressions that compare employee ID to the values in the list, concatenated by the OR logical operator:

SELECT * FROM Employees
WHERE NOT (empid = 2 OR empid = 3 OR empid = NULL)

Keep in mind that a comparison of any value to NULL renders an unknown result. A query returns a row only when the logical expression in the WHERE clause evaluates to TRUE. For employee ID 2, who placed an order, SQL Server evaluates the logical expression in the previous WHERE clause as follows:

NOT (2 = 2 OR 2 = 3 OR 2 = NULL) =>
NOT (TRUE OR FALSE OR UNKNOWN) =>
NOT (TRUE) =>
FALSE MONTH

When the logical expression in the WHERE clause evaluates to FALSE, the query doesn't return the row. For employee ID 1, who placed no orders, SQL Server evaluates the logical expression in the WHERE clause as follows:

NOT (1 = 2 OR 1 = 3 OR 1 = NULL) =>
NOT (FALSE OR FALSE OR UNKNOWN) =>
NOT (UNKNOWN) =>
UNKNOWN

When the logical expression in the WHERE clause evaluates to UNKNOWN, the query doesn't return the row.

In this example, you can't accurately say that employees 1 and 4 placed no orders. A NULL exists in the Orders table in the empid column, and because NULL represents an unknown or irrelevant value, you can say only that you don't know whether employees 1 and 4 placed orders. This ambiguity explains the empty result set. The problem is in the way you asked the question rather than in an inaccurate result set. When you change your request to "return all employees who aren't known to have made orders," the following query returns employees 1 and 4:

SELECT * FROM Employees
WHERE NOT empid IN(SELECT empid FROM Orders
      WHERE empid IS NOT NULL)

You can also return employees 1 and 4 by using other solutions, such as the EXISTS() predicate that the following code example includes:

SELECT *
FROM Employees AS E
WHERE NOT EXISTS(SELECT *
               FROM Orders AS O
               WHERE O.empid = E.empid)

An EXISTS()predicate returns TRUE only if the subquery supplied to it as an argument returns at least one row. For employees 2 and 3, the query would find at least one order ID, so the EXISTS() predicate would return TRUE. But for employees 1 and 4, the subquery would return no rows, so the EXISTS() predicate would return FALSE. A NOT EXISTS() predicate's returns are the opposite of an EXISTS() predicate. Because the logical expression in the example's WHERE clause contains NOT EXISTS(...), for employees 2 and 3, the expression evaluates to FALSE and the query doesn't return rows. A NOT EXISTS() predicate returns TRUE, so the query returns the rows for employees 1 and 4. A query that uses the EXISTS() or NOT EXISTS() predicate typically performs either the same or better than a query that uses an IN predicate.

To return employees 1 and 4, you can also use a third option—an outer join. By left-joining the Employees table to the Orders table, you can return all matching rows—as well as employees who have no matching rows in Orders—with NULLs for the nonmatching column values from the Orders table. Note that SQL Server doesn't return the row in the Orders table that has a NULL in the empid column because the row doesn't match any row in the Employees table. The following query shows that you can filter all rows in the Employees table that have no matches in the Orders table by checking for NULLs in the orderid column that result from the outer join:

SELECT E.*
FROM Employees AS E
  LEFT OUTER JOIN Orders AS O
    ON E.empid = O.empid
WHERE O.orderid IS NULL

The outer join solution returns the required results but performs less efficiently than the solution that uses the EXISTS() predicate. SQL Server can provide good performance when using the EXISTS() predicate because this predicate returns TRUE if the subquery returns at least one row. When SQL Server finds at least one row, it doesn't need to continue processing the subquery. In contrast, a query containing a join must process all rows, then filter the nonmatches.

TAGS: SQL
Hide comments

Comments

  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.
Publish