Build the Numbers Table You Need

In the database world, a numbers table, or tally table, is simply a table of unique integers. These one-column tables usually start with a value of 1, increment by 1, and end at some fixed integer. Developers can often use them to eliminate cursors and loops, parse strings, optimize queries, identify number sequence gaps, and more. A Google search for "tally table" or "numbers table" shows many different ways to build them, but the techniques generally have drawbacks. The techniques might themselves involve cursors, loops, or identity columns; cursors and loops usually run slowly, and identity columns can become a hassle. The techniques might need temporary tables or actual database tables, which could place expensive demands on database server resources, and the techniques lack the flexibility for a developer who needs a different kind of numbers table: one that has starting—and maybe ending—negative values or increment values greater than 1.

To get around almost all these drawbacks, I wrote CreateIntegersTable, the multistatement table-valued function shown in Listing 1.

USE [master]
GO

IF OBJECTPROPERTY(OBJECT_ID('dbo.CreateIntegersTable'),'IsTableFunction') = 1
  BEGIN
    DROP FUNCTION dbo.CreateIntegersTable
  END
GO

CREATE FUNCTION [dbo].[CreateIntegersTable]
(
  @start_int  BIGINT = 1,
  @step_int  BIGINT = 1,
  @end_int  BIGINT
)

RETURNS @FinishedIntegersTable TABLE
(
  ints  BIGINT
)

AS
BEGIN

/*  Table variable @IntegersTable holds finished Integer list. */

  DECLARE @IntegersTable TABLE 
  (
    ints BIGINT
  )
/*  If the user picked an @start_int value less than 1, shift the @end_int value. */
  IF (@start_int < 1)
    SET @end_int = (@end_int + ABS(@start_int) + 1);


/*  Common table expression to build list of Integers less than 32,767. */

  WITH IntegersTableFill (ints) AS
  (
    SELECT
  CAST(1 AS BIGINT) AS 'ints'
    UNION ALL

SELECT  (T.ints + @step_int) AS 'ints'
    FROM  IntegersTableFill T
    WHERE  ints <= (
      CASE
        WHEN  (@end_int <= 32767) THEN @end_int
        ELSE  32767
      END
        )
  )

  INSERT INTO @IntegersTable
  SELECT ints
  FROM IntegersTableFill
  OPTION ( MAXRECURSION 32767 )
	A
/*  This block handles @end_int > 32,767. */

  IF (@end_int > 32767)
  BEGIN
	B
    DECLARE @last_int_inserted  BIGINT
    DECLARE @int_row_groups    INT
    DECLARE @current_row_group  SMALLINT
/*    Figure out how many row groups the function needs for @end_int rows. */

SELECT @int_row_groups = CEILING((LOG((@end_int * 1.0)/65534))/LOG(2)) + 1
	C
/*    The WHILE loop uses variable @int_row_groups. */
    SET @current_row_group = 1
    WHILE (@current_row_group <= @int_row_groups)
    BEGIN
  SELECT @last_int_inserted = MAX(ints) FROM @IntegersTable
	D
/*
  In each group, make row inserts into @IntegersTable,
  starting at @last_int_inserted and ending at
  @end_int in the last group.
 */

  INSERT INTO @IntegersTable
  SELECT  ints + @last_int_inserted + (@step_int - 1)
  FROM  @IntegersTable
  WHERE  (ints + @last_int_inserted) <= @end_int
  SET @current_row_group = @current_row_group + 1
	E
    END
  END
/*  Reverse the Integer shift in @IntegersTable, then in @end_int. */
  IF (@start_int <> 1)

UPDATE  @IntegersTable
    SET    ints = (ints + (SIGN(@start_int) * ABS(@start_int)) - 1)
  IF (@start_int < 1)
    SET @end_int = (@end_int - ABS(@start_int) - 1)
	F
/*  Trim @IntegersTable, then set up the function return value. */
  DELETE FROM @IntegersTable
  WHERE  ints < @start_int OR
  ints > @end_int
  INSERT @FinishedIntegersTable SELECT ints FROM @IntegersTable
  RETURN
END

Written for SQL Server 2005, CreateIntegersTable takes input parameters @start_int, @step_int, and @end_int and returns a single-column table variable of type BIGINT. CreateIntegersTable relies on a common table expression (CTE) and simple table inserts. Although it uses one WHILE loop, the loop is executed infrequently—a 60-million–row result set, for example, loops at most 11 times. CreateIntegersTable avoids cursors, temporary tables, database tables, and identity columns. It handles both positive and negative start and end values as well as step values greater than 1. It can even make full use of the BIGINT space, both in the data type of the numbers themselves and in the total number of rows returned. Almost all the variable names in function CreateIntegersTable and ActualEndIntegerCalculator, a related function explained later, include "int," to clarify that these functions deal only with integers and big integers.

Understand It

Although CreateIntegersTable starts with USE \[master\], the function can go in any database. The script drops the function if it already exists in the target database, then creates a fresh instance of the function. The @start_int and @step_int parameters default to 1.

To simplify the code, the function assumes that the first integer in the integer range is at least 1. The function handles @start_int values greater than 1 later on but a @start_int value less than 1 needs special handling. When the start value is less than 1, the IF test adds ABS(@start_int) to the @end_int value. Because the function builds ranges starting at 1, not 0, it also adds 1 to @end_int to compensate by shifting the end value, as Figure 1 shows.

Figure 1: Handling a start value of less than 1

The IntegersTableFill CTE (callout A) inserts rows into @IntegersTable, the table variable that holds the generated integers, up to a maximum @end_int value of 32,767. Variable @end_int specifies the maximum value in the table; because the function has a default @step_int value of 1, the @end_int value of 32,767 applies to both the maximum value in the table and the number of rows in the table. If the function receives a @step_int value greater than 1, however, the number of rows in the table at this point will be less than 32,767 while the maximum value in the table remains 32,767. Using a CTE is fast, and, for this initial set of rows it has high performance—and the maximum number of rows involved, 32,767, makes the function's other features possible. Parameter @step_int increments the integers correctly. If @end_int is less than 32,767, the CTE WHERE clause uses @end_int instead of 32,767. This behavior improves efficiency, as I explain later.

The CTE OPTION (MAXRECURSION 32767) limit forces an @end_int maximum of 32,767. The MAXRECURSION option allows only 32,767 levels of recursion, which would mean a maximum of 32,767 integers. We could specify (MAXRECURSION 0) or even leave this line out of the function, but then nothing would protect the function from infinite recursion, which would crash it.

The next IF test (callout B) deals with @end_int values that exceed 32,767. Although the IntegersTable CTE that I just described works for values less than or equal to 32,767, numbers larger than 32,767 require something else because of the MAXRECURSION limit. For such end values, the function uses an INSERT INTO with a SELECT (callout E) that doubles the number of integers already in @IntegersTable. In other words, this particular INSERT INTO can handle row insertions only in successive groups of 2ⁿ power (2 ^ n) multiples of 65,534 (64KB) rows. Figure 2 shows these calculations.

Figure 2: Calulating row-insertion groups

The function inserts as many integers as we want, up to the BIGINT limit, but starting at integer 32,768, it must do so in steps, using a SELECT statement to double the number of rows already in @IntegersTable before it inserts new rows. Therefore, the function must figure out how many row groups it should handle. The SELECT statement in callout E does just that. Say that @end_int is 5,000,000. From Figure 2, the function should then go up to row group 8, the group that includes row 5,000,000 and ends at row 8,388,352. Based on the way the math works, which I described earlier, using the last number of group 8 as an example, we start with the equation

65534 x 2ⁿ = 8388352

and try to get a formula for n, the number of row groups. With natural logs and some algebra, we get

n = (ln(128))/(ln(2))

which yields n = 7. In Figure 2, the row group numbers start at 1, not 0, so add 1 to n, and you get n = 8. A general formula for n would look like this:

n = (ln(x/65534))/(ln(2))

To make the numbers work the way we need, round n up to the next integer if it's not an exact integer, and then add 1. The T-SQL statement at callout C does all this; we multiply @end_int by 1.0 because the math works reliably with real numbers. With this expression, for @end_int = 5000000, @int_row_groups = 8. The earlier number shift is important here. If @start_int is less than 1, the natural log function at best would have returned the wrong value and at worst would have crashed. Although this statement returns 0 for @end_int less than 32,768, @end_int will never drop below 32,768 at this point in the function, so we're safe.

The function is almost ready to make the row insertions from integers 32,767 to @end_int. At callout D, it initializes variable @current_row_group to 1 and uses this variable as the loop counter for the WHILE loop. Variable @num_row_groups has the number of row groups the function handles. The WHILE loop loops once for every row group. Before the loop starts its work, variable @last_int has the last, or highest, value already inserted into table variable @IntegersTable. With @last_int_inserted, the loop knows the integer at which it should start inserting rows. Row group 1 is a special case, and @last_int_inserted helps there, too. Row group 1 covers rows 1 to 65,534, but the IntegersTableFill CTE already inserted at most 32,767 rows. Variable @last_int_inserted equals 32,767, so in the loop, the first row group should start at integer 32,768. Variable @last_int_inserted makes this possible. If the loop is at the last row group, it uses parameter @end_int to stop. Otherwise, it just keeps going. Setting @end_int this way makes the function more efficient, as I mentioned earlier. For example, if the user picked 16,776,706 for @end_int, the function would end at row group 10. But if the SELECT statement didn't stop at @end_int, the function would insert integers16,776,70 to 33,553,408 and throw them away later. Variable @end_int prevents this waste.

Now the function uses the INSERT INTO statement (callout E) to insert the integers. Earlier, I mentioned that the INSERT INTO statement doubles the number of integers already in @IntegersTable. Look closely at the SELECT statement to see why. As a separate SELECT, it adds @last_int_inserted (the latest, largest integer in @IntegersTable) and possibly @step_int, to every integer already in @IntegersTable. Then the INSERT INTO statement inserts this new row set into @IntegersTable. Thus, for each pass of the WHILE loop, the largest integer inserted is (@last_int_inserted + @last_int_inserted) plus maybe @step_int, doubling the number of integers in @IntegersTable. Only in the last row group does the last result set integer exceed @end_int. The WHERE clause tests for this condition, optimizing the INSERT statement. The SELECT statement uses @step_int almost exactly the way the IntegersTableFill CTE used it, except that here, it subtracts 1 to account for the fact that it operates on a "base" set of integers that starts at 1, not 0. Finally, the loop increments @current_row_group.

When the loop ends, table variable @IntegersTable has a "raw" list of integers, but input parameter @start_int might differ from 1, which is the value the function assumed, and the function needs to compensate. First, if @start_int is less than 1, the UPDATE statement (callout F) shifts the integers back to that @start_int value. The function recalculates @end_int, and the DELETE statement trims @IntegersTable accordingly. The function inserts everything into @FinishedIntegersTable, which it returns to the T-SQL statement that originally called it, and then it ends.

If @step_int is greater than 1 and @end_int is greater than 32,767, the last integer the CTE inserts could have a value greater than 32,767 and the largest integers in the row groups would exceed the integers in the row groups that Figure 2 shows. If CreateIntegersTable handled this scenario, it might become even more efficient, but it would also become more complex.

Put It to Work

Listing 2 shows how to use CreateIntegersTable. Declare a table variable and run an INSERT INTO statement with a SELECT statement that calls the function and specifies parameters. That's it.

USE [master]
GO
DECLARE @IntegersTable TABLE
(
  ints BIGINT
)
INSERT INTO @IntegersTable
SELECT  ints
  FROM  CreateIntegersTable (-40, 45, 22401)
SELECT ints
FROM @IntegersTable
ORDER BY ints

Unfortunately, CreateIntegersTable adds entries to the tempdb log file, and for a very large integer list, this could become a big problem because the log file could use a lot of system resources and affect performance. SQL Server 2005 has no way to prevent this situation; if it did, the function would have even better performance and use even fewer resources. I researched for a solution but found nothing. If you know how to prevent this problem in either SQL Server 2008 or 2005, please send me an email to tell me about it.

Function CreateIntegersTable returns a set of rows that starts at the value of parameter @start_int—each time, every time. However, given specific @start_int, @step_int, and @end_int parameters, the final integer in the row set that CreateIntegersTable returns might differ from the @end_int parameter. This situation would happen because of mod function (i.e., division-remainder) math and could occur when the @step_int parameter is greater than 1. You can use scalar value function ActualEndIntegerCalculator, which Listing 3 shows, to find the actual, largest integer that CreateIntegersTable will return given a specific set of start, step, and end parameters.

USE [master]
GO
IF OBJECTPROPERTY(OBJECT_ID('ActualEndIntegerCalculator'),'IsScalarFunction') = 1
  BEGIN
    DROP FUNCTION ActualEndIntegerCalculator
  END
GO
CREATE FUNCTION [ActualEndIntegerCalculator]
(
  @start_int  BIGINT,
  @step_int  BIGINT,
  @end_int  BIGINT
)
RETURNS BIGINT
AS
BEGIN
  DECLARE @ActualEndInteger BIGINT
  IF (@start_int <= 0) AND (@start_int <= @end_int)
  BEGIN
    SET @ActualEndInteger = @end_int - ((@end_int - @start_int) % @step_int)
  END
  ELSE IF (@start_int > 0) AND (@start_int <= @end_int)
  BEGIN
    SET @ActualEndInteger = (@end_int - @start_int) - ((@end_int - @start_int) % @step_int) + @start_int
  END
  RETURN @ActualEndInteger
END

This function can go in any database, and this example shows how to use it:

SELECT  dbo.ActualEndIntegerCalculator (1455, 45, 22401)

If parameter @start_int is less than or equal to 0, ActualEndIntegerCalculator starts at @start_int and finds the multiple of @step_int closest to @end_int. If @start_int is greater than 0, the calculator function shifts the @start_int and @end_int parameters so that @start_int is 0, calculates @end_int based on this shift, as before, and then adds back @start_int. If @start_int is greater than @end_int, the function returns NULL.

With the CreateIntegersTable and ActualEndIntegerCalculator functions, you have efficient, flexible, high-performance tools to build the numbers tables you want, when you want.

Comments

Plain text