Missing Numbers

Downloads
25103.zip

Test your T-SQL to solve the unsolvable puzzle

Imagine you're at a job interview for an SQL programming position when the interviewer challenges you to find the solution to a puzzle. You give it your best shot, attacking the problem from several angles, and when the time is up, you find yourself right where you started—with no solution. Years later, after giving more thought to the problem, you decide that the puzzle probably doesn't have a solution after all! That's what happened to Isaac Blank, a talented SQL programmer who kindly gave me permission to share his experience with you.

Here's the puzzle Isaac faced. Given the table that the code in Listing 1 creates, find one SELECT statement that retrieves all the missing numbers in the table. The output should be in the format that Figure 1 shows. Note that you're not allowed to create or use any additional tables, and the SELECT statement should be static (i.e., not built as an SQL string and run dynamically).

Isaac's interviewer claimed that the puzzle solution was a single SELECT query that involved a complex (five-way or so) self-join. Isaac thinks now that no such solution exists. I gave the puzzle my best shot, and after trying several solutions, I realized that they're all limited in some way. I also think that the puzzle probably doesn't have a solution, but I might be missing something. Let's look at how I went about trying to solve the puzzle. You can try to solve it first on your own or read on to see my conclusions. If you think that Isaac and I missed something and that you have a solution, I'd love to hear it.

Laying the Groundwork

Before you start a puzzle like this, you need to make sure that you have all the information required to solve it. It's important not to leave any room for ambiguity because a missing piece of information, however subtle, might determine the solution's complexity or even whether the puzzle is solvable. This puzzle is missing several pieces of information. First, must the solution be ANSI-compliant, or can it be any valid T-SQL query? In this case, the solution must be ANSI-compliant.

Second, the puzzle doesn't specify the lower and upper boundaries. Are they 1 and 1000? Or 1 and the maximum value in column a? Or the minimum and maximum values in column a? Let's presume just the minimum and maximum values in a, which happen to be 1 and 1000 in this case. The minimum and maximum values might be different next time you run your query if the table undergoes modifications in the meantime. For example, if the row with a value of 1 in column a is deleted and a new row with a value of 2000 is inserted, the minimum and maximum values become 2 and 2000. Your query needs to work correctly without your knowing what the minimum and maximum values will be. Furthermore, let's deal only with natural numbers (positive integers).

Third, what's the highest possible upper boundary? A certain specific value? The maximum integer SQL Server 2000 allows (the bigint data type allows a maximum of 2⁶³-1)? Or maybe any theoretical natural number that's not constrained by the limitations of the existing data types in the product you're implementing the solution on? Any theoretical natural number means that the solution would work without any changes even in future releases of SQL Server that might support larger integer data types. Isaac and I decided to assume any theoretical nonfinite natural number.

Now that you have all the missing pieces of information, you can start working on a solution. Note that you're not allowed to revise any of the requirements. For example, my initial thought was to provide the output not in the exact format specified but as ranges of missing numbers. You can try this approach as an exercise (I'll even get to it later in the article), but be aware that you're solving a different puzzle. Before you continue reading, I urge you to first give the puzzle some thought and try to find a solution.

Digging In

One approach to solving this puzzle is to first isolate the primary obstacles that prevent you from providing a solution. By ignoring certain requirements, you can create a simple solution, then focus on finding an alternative that meets the puzzle's requirements. The two most obvious obstacles are that you're allowed to refer to only Test1 and that the upper boundary can theoretically be any natural number. Let's remove these two obstacles for now by assuming that the upper boundary is finite—say 1000—and that you can create and use tables other than Test1 in your query. At this point, the solution is simple: Use an auxiliary table that contains all integer numbers in the range 1 to 1000. You can use the script in Listing 2 to create and populate a table called Nums that has one integer column called num.

The following query is a solution to the simplified puzzle (make sure you first run the code in Listings 1 and 2 to create the tables):

SELECT num
FROM Nums
WHERE num BETWEEN (SELECT MIN(a) FROM Test1)
              AND (SELECT MAX(a) FROM Test1)
  AND NOT EXISTS(SELECT * FROM Test1
                 WHERE a = num)
ORDER BY num

The query returns from the Nums table all numbers that are between the minimum and maximum values in Test1 but that don't exist in Test1—in other words, all the missing numbers you're looking for.

Now you can revisit the requirements you omitted earlier. First, you can try getting rid of the Nums table. If you could write a query that returns a table similar to Nums without referring to any table other than Test1, you could use it as a derived table as follows:

SELECT num
FROM (<derived table query>) AS Nums
WHERE num BETWEEN (SELECT MIN(a) FROM Test1)
              AND (SELECT MAX(a) FROM Test1)
  AND NOT EXISTS(SELECT * FROM Test1
                 WHERE a = num)
ORDER BY num

This way, you meet the requirement of referring only to Test1.

In T-SQL, generating such a derived table is easy because T-SQL allows a SELECT query that doesn't have a FROM clause (for example, SELECT 1). You can use queries with no FROM clause to generate the derived table by using binary calculations. If you perform a UNION ALL between SELECT 0 and SELECT 1, you get a table with two rows as a result. You can perform a cross join on a series of 10 such tables, each representing a different bit in a binary value that contains 10 bits. In all, you get 1024 rows in the result of the 10-way CROSS JOIN query, representing the 1024 possible values in a binary string of 10 bits. In each of the participating tables, you multiply the column value by 2 to the power of the represented bit - 1 and sum all the values each bit represents (one from each table) plus 1 to generate the result column value. Look at the query in Listing 3, which implements the calculation I just described. Execute it, then reread the explanation. You can add ORDER BY num to the end of the query to verify that all the values are in the query result. If you use the query in Listing 3 as a derived table, make sure to remove the ORDER BY clause because you can't have an ORDER BY clause in derived tables.

The problem with Listing 3's query is that it's not ANSI-compliant, and ANSI compliance is one of the puzzle requirements. According to ANSI, a SELECT query must have a FROM clause and refer to at least one table or view. At first glance, this rule might look like a serious obstacle, but it's not. To make the query ANSI-compliant, instead of SELECT 0, you can write SELECT MIN(0) FROM Test1 or SELECT MAX(0) FROM Test1. You can handle SELECT 1 similarly.

If you aren't comfortable with the binary approach, you can use a decimal approach in which each digit in a table has a value ranging from 0 to 9; you just cross-join three such digits tables to generate 1000 different numbers. Listing 4 contains the ANSI form of a query that uses a decimal approach to create the Nums derived table. Again, you can add an ORDER BY clause here to verify that all the values are in the query result. Listing 5, page 18, contains the full query that returns all missing numbers from Test1, with a limited boundary of 1000.

Proof Negative

Now you're left with one obstacle—the theoretical upper boundary. The solution in Listing 5 assumes the upper boundary is 1000. Any other known finite upper boundary still lets you create a solution. All you need to do is adjust the number of cross-joined digits tables. Even "the maximum integer that SQL Server 2000 supports" is a finite upper boundary. Sure, the query would be very long, but it's possible. Because the solution must be a static query, you must know the upper boundary in order to know how many cross joins to write in the query. However, the requirement is to support any theoretical natural number. In fact, any other approach to solving the puzzle and producing enough result rows in the specified format would require that you know the upper boundary. Why? Bear with me while I try to prove my point.

Problem: Let n be the number of rows in the table Test1. If n = 0 (empty set) or n = 1 (minimum value in Test1 = maximum value in Test1), the result should be an empty set. Therefore, you need to address only situations where n >=2. Let m be the maximum value in Test1. The maximum number of rows that you might need to generate in the result is m - 2 (accounting for two rows in Test1 with the values 1 and m). Prove that a static SQL query (an SQL statement that wasn't built as a string and executed dynamically) that produces m - 2 rows can't be written if m can be any nonfinite natural number.

Proof: The maximum number of rows that can be generated by a single operation (e.g., join, union, intersection) that refers to Test1 only, where n>=2, is the Cartesian product n² when you use a cross join of Test1 and itself. By definition, a static query with an infinite number of cross joins can't be written. For any given query that you write, the maximum number of rows that the query can generate is n to the power of the number of cross-joined tables. Because m can be any natural number, there's always a chance that m - 2 is greater than n to the power of the number of cross-joined tables. Therefore, unless I missed something obvious in my logic, the puzzle doesn't have a solution.

Could any form of this puzzle still have a solution if the upper boundary isn't considered to be finite? Yes. For example, if you could represent the missing numbers in any form you wanted, you might represent them as ranges of numbers. The query in Listing 6 does just that, calculating the begin_range values as follows:

Given that n is a value in Test1, begin_range =
n + 1 if no row exists in Test1 where a = n + 1.

Similarly, the query calculates the end_range values as follows:

Given that n is a value in Test1, end_range = 
n - 1 if no row exists in Test1 where a = n - 1.

To match begin_range values with their corresponding end_range values, the query selects for each begin_range value the minimum end_range value that's greater than or equal to the begin_range value. Although Listing 6's query gives a solution to the missing numbers puzzle with an unknown boundary, it doesn't meet the format requirement for the query output.

Worth the Effort?

Knowing how talented Isaac Blank is with SQL, I knew I was facing a tough puzzle. I also knew that I might run into a dead end. You might wonder whether trying to solve a problem that might not have a solution is worth the effort. I think it is.

Pierre de Fermat, a seventeenth-century mathematical genius, wrote notes in the margins of a book. In his notes, he claimed that he could prove that the equation xⁿ + yⁿ = zⁿ has no solution where x, y, and z are natural numbers and n > 2. However, the book's margins weren't wide enough for him to write down the solution. Fermat never published his proof, so some mathematicians are still skeptical that he had one. Mathematicians tried to find a proof for more than 350 years, until British mathematician Andrew Wiles discovered one in 1995.

My father is one of the mathematicians who spent years trying to solve Fermat's Last Theorem. He thought about it wherever he went. If you ask my father how he feels about having spent so much time on a problem without finding the proof, he says it was well worth the study. Although he didn't find the solution, he polished his skills in Number Theory and invented many new techniques of his own. The puzzle in this article is nowhere close in scale to Fermat's Last Theorem, but whenever you try to solve an SQL puzzle, even if it has no solution, you polish your skills. And the attempt can be well worth the effort.

Comments

Plain text