Get Rid of Gaps in Numerical Sequences

It’s common for a table to contain an integer column that’s used as the table’s primary key or part of the primary key. An integer column can be a simple sequence of numbers (e.g., 1, 2, 3, …) or a more complex sequence that includes numbers representing specific information such as a year (e.g., 2008001, 2008002, 2008003, …). One of my clients uses an integer column that falls into the latter category. This company maintains a database of customers who are identified by a numeric ID. Each customer ID reflects the year in which the customer record was added to the database, the numeric ID of the office that added the customer record, and an integer. Each year, all the offices are assigned a non-overlapping range of integers to use in the customer IDs.

Believing that a gapless sequence of customer IDs helps instill a high level of confidence in the customer list, the company’s CEO requested that the numbers be maintained so that there are no gaps. In this case, a gap is said to exist when the difference between two consecutive integers in a sorted sequence is more than one. For example, the sequence 1, 2, 4, 5 has a gap between 2 and 4.

Although companies might want integer sequences in database applications to be gapless, gaps occur frequently. Gaps can occur because of the assignment process. An automated process might produce gaps because of the logic used or an untrained user generates more numbers than needed. Manual assignment systems are notorious for generating gaps due to user errors. Gaps can also occur because of inadequate table maintenance. For example, a number might be assigned to a record in a table, but later the record is deleted.

To meet the CEO’s request, I began working on a system to check for and eliminate gaps. I first created a script, GenerateGaps.sql, that generates a temporary table (##TempTbl) containing around 1,000 integers with gaps. GenerateGaps.sql begins by generating a simple table that has as an integer identity column and a tiny integer column. Next, the script adds one record to the table, then adds more records through recursion. In the first recursion, it copies the existing record and adds that record to the table. In the second recursion, the script copies the two existing records and adds them to the table. The recursion occurs 10 times, so the table is populated with 1,025 records (1 through 1025). The script then deletes three records—the records whose identity column values are 250, 500, and 750. Finally, the script creates and populates a table (##TempRange) with the two range values of 300 and 600. (The purpose of the range values will be discussed shortly.)

In GenerateGaps.sql, you can adjust how many records are generated if more or fewer records are desired. The number of records generated is the sum of a progression of powers of 2 (21 + 22 + 23 + 24 and so on), so 20 recursions produce more than a million records. You can also adjust how many records are deleted and change the range values in the ##Temp- Range table.

With the integers generated, my next task was to write a script that would search a specified range of integers in a sequence to find the first gap (i.e., the lowest missing integer) or the next available integer. FindGap.sql in Listing 1 is the result. Here’s how this script works. In the code at callout A, FindGap.sql retrieves the range values from the ##TempRange table, then generates a sequence of integers within that range. So, in this example, FindGap.sql creates 301 integers that start at 300 and end at 600.

Next, FindGap.sql compares the newly created gapless sequence of integers with the same range of integers from the ##TempTbl table. As callout B shows, one of several actions will occur based on the result of this comparison. When a gap is found, the script returns the missing integer. When no gaps are found, the script uses the MAX function to obtain the maximum value in the range, then adds 1 to that value. The resulting value becomes the next available integer unless one of the following conditions is met:

If the resulting value is null, no integers from the ##TempTbl sequence are in use, so the script returns the first integer in the range.
If the resulting value is the last number in the range, the script returns the negative version of that number.
If the resulting value is greater than the highest value in the range, all the integers from the ##TempTbl sequence are in use, so the script returns a 0.

By varying the range values in GenerateGaps.sql, you can see how this part of FindGap.sql works. For example, GenerateGaps.sql provides the range values of 300 and 600, which means FindGap.sql will return a value of 500 because that’s the first gap. If you change the range values in GenerateGaps.sql to 1500 and 2000, FindGap.sql will return a value of 1500 because no integers in the range have been used so far. If you change the range values in GenerateGaps .sql to 800 and 1025, FindGap.sql returns a value of -1025, which means the last integer in the range is the next available integer in the sequence. If you change the range values in GenerateGaps.sql to 800 and 1000, FindGap.sql returns a value of 0 because all the numbers in the range have been used.

As callout C shows, once determined, the next available integer is returned using the sp_executesql stored procedure with an output parameter. You can send a message to a DBA based on the value returned. For example, if a 0 or a negative integer is returned, the message can inform the DBA that the ranges in the table need to be updated so that new numbers can be created. If a positive integer is returned, there is no need to send a message.

You can download GenerateGaps.sql and FindGap.sql by clicking the 99797.zip hotlink at the top of this page. These scripts work on SQL Server 2005 SP2 and SQL Server 2000 SP4. I tested FindGap.sql’s performance on SQL Server 2005. With a range containing around 100,000 integers, FindGap.sql found the first gap and the next available integer in about 1 second. When the range of integers was increased to around a million, FindGap .sql found the first gap and the next available integer in about 12 seconds. These results indicate that a range in the low hundreds of thousands is the most efficient.

There are many ways you can adapt GenerateGaps.sql and FindGap.sql to meet your needs. For example, you could adapt them to use sequences in which the integers have differences larger than one (e.g., even or odd integers) by increasing the difference between consecutive integers. You could use a numbering scheme that includes duplicate integers by incorporating a summary view that changes the duplicate integers to unique integers.

Instead of using FindGap.sql as separate script, you could use the code as part of another script or convert it to work as a stored procedure with an output parameter. If you don’t want to use a table to provide the range values, you could assign @n1 and @n2 values that reflect the desired range and remove the statement that selects the range values from the range table. As you can see, the possibilities are virtually endless.

—Roy Byrd, principal consultant, Byrd Associates

Comments

Plain text