Sequences, Part 2

Last month I introduced sequences ("Sequences, Part 1"), which are a new feature in the upcoming version of SQL Server 2012 (code-named Denali). I explained how to create and alter sequences, I discussed how to obtain new values using the NEXT VALUE FOR expression, and I compared sequences with the IDENTITY column property. This month I discuss caching of sequence values, obtaining a range of sequence values, and producing multiple unique sequence values in the same target row.

Caching

SQL Server supports defining a caching option for sequences, with the purpose of improved performance by minimizing disk access. The caching option is an extension to standard SQL. When you create or alter a sequence, you can indicate the CACHE option along with a cache value or NO CACHE if you want to prevent caching. The current default option (as of press time) is to use a cache value of 50; this default might change in the future. (For more information about cache values, see the sidebar “How to Determine the Default Cache Value in SQL Server Denali.”)

If you choose the NO CACHE option (remember, the default is CACHE 50), every time a new sequence value is generated, SQL Server has to write the change to the system tables on disk. This, of course, can have a negative performance effect. When using the CACHE option, SQL Server writes to the system tables only once every request. What SQL Server records on disk every time a new block of cached values is allocated is the first value to be used in the next block. In the meantime, SQL Server keeps in memory only two members with the same type as the sequence type, holding the current value and how many values are left in the current block of cached values before the next write to disk needs to occur. Every time a request for a new sequence value is made, SQL Server updates those two members in memory.

Compared with not caching values, you can observe significant performance improvements when caching with even fairly small blocks of values (e.g., 50); the bigger the cache value, the better the performance. So why not simply use as big a value as possible in the cache option? Because if SQL Server shuts down uncleanly (e.g., power failure), you basically lose the remaining range up to the value currently written to disk. Given the fact that new sequence values generated in a transaction that doesn’t commit are eventually lost anyway, sequences can’t guarantee no gaps to begin with.

Note that if SQL Server shuts down cleanly, it writes the current value plus one to disk; then upon restart, when the next request for a sequence value is made, the new sequence value starts from that point. As an example, suppose the cache size is 50. The stored value in the system tables is 151 (first noncached value), the current sequence value is 142, and the number of values remaining to use is 8 before the next request will cause a write to disk. So if you shut down the system cleanly at this point, SQL Server will write the number 143—the next value to be used—to the system tables. When you restart the system and a request for a new sequence value is made, SQL Server will allocate 50 values by writing to disk 193.

If you’re curious about whether SQL Server caches values for IDENTITY as well, it does. Currently (at press time), the cache size of IDENTITY is a hard-coded value of 10, but this size might change in the future. The internals of the IDENTITY property and sequences are quite similar—but because the default cache size is 50 for sequences and 10 for IDENTITY, you should see a small performance advantage of sequences compared with IDENTITY.

There’s another interesting difference in the handling of caching between the two features. As I mentioned, with sequences, if the system shuts down unexpectedly, you lose the remaining cached values in the current block and simply end up with a gap in your values. With IDENTITY, when SQL Server restarts, during recovery it scans the log records to determine the last used IDENTITY value, so you end up not losing the remaining cached values (except in certain scenarios with low likelihood because their time window is very small). If you’re wondering why SQL Server can’t recover the last used sequence value from the log in a similar manner, it’s because sequence values can be generated by SELECT queries without inserting them into a table—and therefore no record exists in the log for those values. So SQL Server simply doesn’t attempt to recover the lost cache values.

Let’s examine some performance numbers from a test I ran on my laptop. It’s not really a thorough or exhaustive test, but rather a fairly simple test just to get a general sense of performance. Use the code in Listing 1 to create the objects used in this performance test.
The code in Listing 1 creates a helper function called GetNums that returns a sequence of integers of a requested size. The code also creates a table called T1 with an IDENTITY property, a sequence called Seq1, and a table called T2 with a default expression that obtains its value from Seq1.

SET NOCOUNT ON;
USE tempdb;
-- cleanup
IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;
IF OBJECT_ID('dbo.T1'     , 'U' ) IS NOT NULL DROP TABLE    dbo.T1;
IF OBJECT_ID('dbo.T2'     , 'U' ) IS NOT NULL DROP TABLE    dbo.T2;
IF OBJECT_ID('dbo.Seq1'   , 'SO') IS NOT NULL DROP SEQUENCE dbo.Seq1;
GO
-- helper function
CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
  L0   AS(SELECT 1 AS c UNION ALL SELECT 1),
  L1   AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
  L2   AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
  L3   AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
  L4   AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
  L5   AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
  Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
  SELECT n FROM Nums ORDER BY n OFFSET 0 ROWS FETCH FIRST @n ROWS ONLY;
GO
-- T1 with IDENTITY, T2 with sequence
CREATE TABLE dbo.T1
(
  col1 INT       NOT NULL IDENTITY,
  col2 BINARY(1) NOT NULL
);
CREATE SEQUENCE dbo.Seq1 AS INT START WITH 1 INCREMENT BY 1;
CREATE TABLE dbo.T2
(
  col1 INT       NOT NULL DEFAULT(NEXT VALUE FOR dbo.Seq1),
  col2 BINARY(1) NOT NULL
);

First I checked how long it took to fill T2 with 1,000,000 rows without generating any sequence or IDENTITY values, using the following code:

TRUNCATE TABLE dbo.T2;
INSERT INTO dbo.T2(col1, col2)
  SELECT n, 0x42
  FROM dbo.GetNums(1000000);

This code ran for 7 seconds on my laptop. I ran the code more than once with each test to ensure that if tempdb needed to grow, it didn’t do so while I was measuring performance.

To test the performance of IDENTITY, I used the following code:

TRUNCATE TABLE dbo.T1;
INSERT INTO dbo.T1(col2)
  SELECT 0x42
  FROM dbo.GetNums(1000000);

This code ran for 9 seconds. So, roughly 2 seconds can be attributed to creating IDENTITY values. As a reminder, SQL Server currently uses a hard-coded cache value of 10 for IDENTITY; you can’t change this value.

Next, I used the following code to test the performance of the sequence object with the default caching option (currently CACHE 50):

TRUNCATE TABLE dbo.T2;
INSERT INTO dbo.T2(col2)
  SELECT 0x42
  FROM dbo.GetNums(1000000);

This code ran for 8 seconds. As you can see, the performance of a sequence with the default caching is better than that of IDENTITY, which can be attributed to the greater cache value.

Finally, I used the following code to test the sequence performance with no caching:

ALTER SEQUENCE dbo.Seq1 NO CACHE;
TRUNCATE TABLE dbo.T2;
INSERT INTO dbo.T2(col2)
  SELECT 0x42
  FROM dbo.GetNums(1000000);

This code ran for 19 seconds.

The conclusion from this test is that the performance of a sequence object with default caching is slightly better than the performance of IDENTITY; however, without caching, the sequence object performs much slower.

I also wanted to test different cache values. The problem with the first test is that it’s difficult to easily isolate the part of the runtime that’s purely caused by the generation of the sequence values, because such a large portion of the runtime is caused by the insert operation itself. With IDENTITY, there’s not much choice because you have to insert the data into a table. Fortunately, though, with sequences you can generate new values in a query without writing them to a table.

I therefore queried the helper function GetNums to generate 10,000,000 rows—10 times the number used in the previous test—and used the NEXT VALUE FOR expression against Seq1 to generate the sequence values. In the Query Options dialog box, I enabled the Discard results after execution option so that the query runtime wouldn’t include producing the output in SQL Server Management Studio (SSMS).

I then measured the runtime of just producing 10,000,000 rows with the function, without generating sequence values, like so:

SELECT n
FROM dbo.GetNums(10000000);

This code ran for 11 seconds. To get a sense of the performance effect of generating the sequence values alone, you need to subtract this number from the runtime of the code with the sequence generation.

I used the following code to recreate the sequence with the default cache value:

IF OBJECT_ID('dbo.T1', 'U') IS NOT NULL DROP TABLE dbo.T1;
IF OBJECT_ID('dbo.T2', 'U') IS NOT NULL DROP TABLE dbo.T2;
IF OBJECT_ID('dbo.Seq1', 'SO') IS NOT NULL DROP SEQUENCE dbo.Seq1;
CREATE SEQUENCE dbo.Seq1;

Then I used the following code to query the function and generate a new sequence value per row:

SELECT NEXT VALUE FOR dbo.Seq1
FROM dbo.GetNums(10000000);

I ran this code several times after altering the sequence definition with the different cache values I wanted to test. I tested NO CACHE, 10, 20, 30, 40, 50 (default), and 10,000,000. Figure 1 shows the benchmark results.

Figure 1: Sequence cache values benchmark

Note in the figure that the Y axis representing the runtime starts at 11—the runtime of the query without generating the sequence values—to give you a better sense of the net difference in performance between the different cache values. What’s interesting is that a very large difference exists between no caching and caching, even with small values—but as you increase the cache values, the improvement is nonlinear. So if you’re concerned with losing a large range of sequence values if an unexpected shutdown occurs, you’re not really sacrificing a lot of performance when using a fairly small cache value. Also, to tell roughly where IDENTITY would fit in this graph (at least for now), look at the bar for cache value 10.

When you’re done, run the following code for cleanup:

IF OBJECT_ID('dbo.T1', 'U') IS NOT NULL DROP TABLE dbo.T1;
IF OBJECT_ID('dbo.Seq1', 'SO') IS NOT NULL DROP SEQUENCE dbo.Seq1;

Obtaining a Range of Sequence Values

SQL Server supports the ability to ask for a whole range of sequence values in one shot, updating the sequence only once for the entire range, and having the application distribute the individual values. To request a new range of sequence values, use the stored procedure sp_sequence_get_range. If you provide the sequence name and the range size as inputs, you’ll get back as output parameters the first value in the range, the last value, the number of times the sequence cycled (if allowed), the sequence increment, and the minimum and maximum values. If cycling isn’t allowed, and when asking for a new range you end up exceeding the maximum (or minimum if the increment is negative), you’ll get error 11732; you can use error handling code to deal with this case.

Let’s take a look at a couple of examples. First, run the following code to create a sequence:

IF OBJECT_ID('dbo.Seq1', 'SO') IS NOT NULL DROP SEQUENCE dbo.Seq1;
CREATE SEQUENCE dbo.Seq1 AS INT
  START WITH 1 INCREMENT BY 1 MINVALUE 1 MAXVALUE 100;

Next, run the following code twice to demonstrate two requests for a range of size 11:

DECLARE @first AS SQL_VARIANT;
EXEC sys.sp_sequence_get_range
  @sequence_name     = N'dbo.Seq1',
  @range_size        = 11,
  @range_first_value = @first OUTPUT;
SELECT @first;

On the first run, you should get the first value 1; on the second run, you should get the first value 12. Of course, in the application you’ll typically distribute each range of 11 values—in the first case the values 1 through 11, and in the second case the values 12 through 22.

Observe in the sequence definition that cycles aren’t allowed (by default). To demonstrate an error when you exceed a boundary point, run the following code:

DECLARE @first AS SQL_VARIANT;
EXEC sys.sp_sequence_get_range
  @sequence_name     = N'dbo.Seq1',
  @range_size        = 1759,
  @range_first_value = @first OUTPUT ;
SELECT @first;

You’ll get the error message that Figure 2 shows.

Figure 2: Error message from exceeding a boundary point

Next, alter the sequence definition to allow cycles:

ALTER SEQUENCE dbo.Seq1 CYCLE;

Then ask for a range such that the maximum value is exceeded:

DECLARE
@first       AS SQL_VARIANT, 
  @cycle_count AS INT;
EXEC sys.sp_sequence_get_range
  @sequence_name     = N'dbo.Seq1',
  @range_size        = 1759,
  @range_first_value = @first       OUTPUT,
  @range_cycle_count = @cycle_count OUTPUT;
SELECT @first AS first_value, @cycle_count AS cycle_count;

This time the operation doesn’t fail, and the code returns the first value and cycle count that Figure 3 shows.

Figure 3: Output from exceeding the maximum value

Multiple Sequence Values in the Same Target Row

There’s an intriguing aspect of the NEXT VALUE FOR function that some people might find surprising. Consider the following sequence definition:

IF OBJECT_ID('dbo.Seq1', 'SO') IS NOT NULL DROP SEQUENCE dbo.Seq1;
CREATE SEQUENCE dbo.Seq1 AS INT START WITH 1 INCREMENT BY 1;

Without running the following code, what do you expect its output to be?

SELECT
NEXT VALUE FOR dbo.Seq1 AS val1, NEXT VALUE FOR dbo.Seq1 AS val2;

Most people would guess 1 and 2—but when you run the code, you get 1 and 1. Now that you know what the output is, can you explain why you get this output?

The answer is that this behavior has to do with a very unique concept to SQL called all-at-once, which means that all expressions that appear in the same logical query processing phase are evaluated conceptually at the same point in time. That’s why, for example, you can use an UPDATE statement such as the following to swap column values:

UPDATE dbo.T1
SET col1 = col2, col2 = col1;

Because NEXT VALUE FOR is an expression, it must follow the same rules. Therefore, no matter how many times you call the expression in the same query, for the same row it will produce the same value in all of its occurrences.

The question is then, What if you have a table T1 with two columns called key1 and key2 that need to obtain their values from the same sequence? If you assign both with default expressions that invoke the NEXT VALUE FOR expression against the same sequence, both columns will obtain the same value in each row. If you need to issue a single row INSERT statement against the table, you can simply invoke the NEXT VALUE FOR expression twice in two separate statements, store those values in variables, then use those variables in your INSERT statement.

Things become far trickier if you need to handle a multi-row insert. How many ways can you think of to solve this puzzle? (I covered this puzzle in my blog a few months ago. Check out the puzzle at the and its solutions at TSQL Challenge with Denali's Sequence Object" and its solutions at "Solutions to TSQL Challenge with Denali's Sequence Object".)

More Flexibility

Because sequences have so many advantages over IDENTITY, their introduction in SQL Server Denali is a blessing. Sequences are much more flexible than IDENTITY, which gives you much greater control. Last month I covered working with sequences, and I compared sequences with IDENTITY. This month I covered caching, obtaining a range of values, and generating multiple values for the same target row.

I’d like to thank Tobias Ternstrom, Umachandar Jayachandran, and Ian Kosseff from the SQL Server dev team for answering my questions and helping clarify what happens beneath the covers in SQL Server Denali.

Comments

Plain text