Change in Behavior of RAND and NEWID in SQL Server 2005

I’m writing this blog entry as a result of a customer query regarding a change in behavior related to invocations of the RAND and NEWID functions in SQL Server 2005.

Suppose you need to write an expression in a query invoking the RAND or NEWID function (say for randomization purposes) and you need the function to be invoked only once. For example, suppose you need to make a random choice out of three options (call them ‘option one’, ‘option two’ and ‘option three’), and you write the following code:

select
  case rnd
    when 1 then 'option one'
    when 2 then 'option two'
    when 3 then 'option three'
    else 'oops'
  end
from (select cast(rand()*3 as int)%3 + 1 as rnd) as d;

Remember that the RAND function returns a float value in the range 0 through 1, inclusive (0 and 1 are possible result values). Casting the expression rand()*3 as an integer will truncate the fraction part of the value.

Even though the probability to get exactly 3 back after casting is very low, it is still a possibility. This is why I used %3 (modulo 3)—to ensure the expression will return an integer in the range 0 through 2. By adding 1, the expression is guaranteed to return an integer in the range 1 through 3.

Internally, SQL Server rearranges a simple form of a CASE expression such as the above to the searched form, namely, it expands the WHEN clauses to incorporate the full predicates. As an example, the above CASE expression is internally evaluated as follows:

select
  case 
    when rnd = 1 then 'option one'
    when rnd = 2 then 'option two'
    when rnd = 3 then 'option three'
    else 'oops'
  end
from (select cast(rand()*3 as int)%3 + 1 as rnd) as d;

SQL Server 2000 evaluates each reference to the alias rnd assigned in the table expression d separately; therefore, it actually invokes the RAND function three times. You realize that this means that in SQL Server 2000 it is possible that none of the WHEN clauses of the CASE expression will evaluate to TRUE, and you might end up getting ‘oops’ back. Try running this code several times in SQL Server 2000 and you will be able to verify this.

SQL Server 2005 changes the behavior of outer references to aliases assigned in table expressions, where the aliased expression invokes the RAND or NEWID function. SQL Server 2005 will invoke the function only once, therefore it is guaranteed that one of the WHEN clauses in the above query will evaluate to TRUE, and you will never get ‘oops’ back.

You can test this with a similar example that invokes the NEWID function. To return a random value in the range 1 through n, instead of using the expression:

cast(rand()*n as int)%n + 1

You can use the expression:

abs(checksum(newid()))%n + 1

CHECKSUM(NEWID()) returns a random integer. Applying ABS on top ensures you get a nonnegative integer. Applying %n (modulo n) ensures that the value is >= 0 and < n. By adding 1 you ensure that the value is >= 1 and <= n. In short, this is just another way to get a random integer value in the range 1 through n. So the above query can be rewritten as follows:

select 
  case rnd
    when 1 then 'option one'
    when 2 then 'option two'
    when 3 then 'option three'
    else 'oops'
  end
from (select abs(checksum(newid()))%3 + 1 as rnd) as d;

Try running this query several times in both SQL Server 2000 and in SQL Server 2005. In SQL Server 2000 you will occasionally get ‘oops’ back, while in SQL Server 2005 you will never get ‘oops’ back. This is due to the same change in behavior I described earlier.

A simple way to test the difference in behavior between the versions is by running the following code:

select rnd, rnd
from (select rand() as rnd) as d;

Run it in SQL Server 2000 and you will get two different invocations of RAND, hence most probably two different values back. Run it in SQL Server 2005, and you’re guaranteed to get the same value back twice since RAND will be invoked only once.

This change is described in SQL Server 2005’s Books Online under the section “Behavior Changes to Database Engine Features in SQL Server 2005,” but it’s very easy to overlook it.

Note that the change in behavior has nothing to do with multiple invocations of RAND or NEWID in the same query, as opposed to being invoked once in a table expression and then referenced multiple times in the outer query. For example, the following code can return ‘oops’ in both SQL Server 2000 and in SQL Server 2005:

select 
  case abs(checksum(newid()))%3 + 1
    when 1 then 'option one'
    when 2 then 'option two'
    when 3 then 'option three'
    else 'oops'
  end;

In SQL Server 2005 you can now use a table expression as demonstrated earlier as a workaround. In SQL Server 2000 (and also in 2005), you can use a variable as a workaround:

declare @rnd as int;
set @rnd = abs(checksum(newid()))%3 + 1;

select 
  case @rnd
    when 1 then 'option one'
    when 2 then 'option two'
    when 3 then 'option three'
    else 'oops'
  end;

You can experience similar problems in less obvious scenarios; for example, consider the predicate:

where abs(checksum(newid()))%3 + 1 between col1 and col2

Also here, the predicate is expanded internally and NEWID is invoked multiple times:

where col1 >= abs(checksum(newid()))%3 + 1 
  and col2

So you end up getting two independent invocations of the NEWID function. If you want to rely on a single invocation, as demonstrated earlier in SQL Server 2005 you can use a table expression, and in both versions you can store the result of the expression in a variable and then refer to the variable.

As the last example of problematic use of RAND and NEWID, I recently saw code written by a programmer that was supposed to populate a temporary table with a set of unique random integers (say, 50 random integers in the range 1 through 100). The code looked similar to the following:

set nocount on;
create table #random_values(rnd int not null);
create index idx1 on #random_values(rnd);
 
declare @i as int;
set @i = 1;
while @i <= 50
begin
insert into #random_values
select abs(checksum(newid()))%100 + 1
where abs(checksum(newid()))%100 + 1
not in (select rnd from #random_values);
if @@rowcount = 1 set @i = @i + 1;
end
 
select rnd from #random_values order by rnd;
 
drop table #random_values;

You realize that the two invocations of the NEWID function (in the SELECT and WHERE clauses) are independent of each other. Therefore, this code may very well populate the temporary table with duplicate values in all versions of SQL Server. For example, here’s a subset of the output I got after running this code:

rnd
-----------
2
3
4
4
...

I’m not saying that this particular solution is the optimal way to get a set of unique random values rather just explaining the logical problems with this solution and the workarounds.

In SQL Server 2005 you can now use a table expression as a workaround, revising the SELECT query to the following:

...
  insert into #random_values
    select rnd
    from (select abs(checksum(newid()))%100 + 1 as rnd) as d
    where rnd not in (select rnd from #random_values);
...

In both SQL Server 2000 and 2005 you can use a variable as a workaround:

...
  set @rnd = abs(checksum(newid()))%100 + 1;
  insert into #random_values
    select @rnd
    where @rnd not in (select rnd from #random_values);
...

To summarize, care should be taken when using functions such as RAND and NEWID in queries. If you need to rely on a single invocation of the function, SQL Server 2005 will give you the desired behavior as long as you

encapsulate the invocation of the function in a table expression and then in the outer query refer to the alias of the expression as many times as you like.

In SQL Server 2000 (or any other version), you can use a variable as a workaround.

Learn more from my article "Exploring Changes Made to SQL Server 2005's UDF."

Cheers

--

BG

set nocount on;

create table #random_values(rnd int not null);

create index idx1 on #random_values(rnd);

declare @i as int;

set @i = 1;

while @i <= 50

begin

insert into #random_values

select abs(checksum(newid()))%100 + 1

where abs(checksum(newid()))%100 + 1

not in (select rnd from #random_values);

if @@rowcount = 1 set @i = @i + 1;

end

select rnd from #random_values order by rnd;

drop table #random_values;

Comments

Plain text