The Perils of T-SQL Randomization

Randomization is a common need in programming languages such as T-SQL. Prime examples of T-SQL randomization needs include generating a random integer value in a range 1 through n, returning randomly sorted data, returning random n rows, and returning random n rows per group. Let's walk through each of those randomization needs and determine how to provide solutions in T-SQL. In particular, I'll focus on how to improve the solutions by using some of the new T-SQL features in SQL Server 2005, including the use of TOP with input expressions, the APPLY operator, and the TABLESAMPLE option.

Generating a Random Integer Value

Given an input integer value @n, you're tasked with generating a random value in the range 1 through @n. This classic request is probably one of the most common randomization needs. However, the solution in T-SQL isn't particularly straightforward.

One common solution among programmers is to use the RAND function, as follows:

DECLARE @n AS INT; 
SET @n = 10; 
SELECT CAST(RAND()*@n AS INT) + 1;

RAND generates a float value in the range 0 through 1. By multiplying the result of RAND by @n, you get a float value in the range 0 through @n. Converting a float value to an integer doesn't round the result but rather floors it (i.e., gets rid of the fraction). Adding 1 ensures that the minimum result value will be 1 and not 0.

This solution has two problems. One problem is that the RAND function isn't truly random.The function accepts an integer seed as input, then calculates a float value based on the input in a deterministic manner. For example, when you run

SELECT RAND(42);

several times, you'll always get the 0.71435594503451 float value back.

Even when you invoke the RAND function without specifying a seed as input, SQL Server calculates a new seed value based on the previous RAND invocation. For example, if you run

SELECT RAND(42); 
SELECT RAND(); 
SELECT RAND();

several times, you'll always get the same results back: 0.71435594503451, 0.04100 99860282736, and 0.649384188470534.

Sometimes, you might want to rely on such behavior. For example, if you need to generate repeatable values for a demonstration or for performance tests, all you need to do is invoke the RAND function with a known seed at the beginning of your script and later invoke it without a seed. However, if you expect truly unpredictable random values, RAND without an input won't give you what you need.

If you need to generate an unpredictable random float value in the range 0 through 1, you need to provide a random integer seed to the RAND function. At first, this situation appears to be a catch-22, but it isn't.You can generate a random integer seed with the expression CHECKSUM(NEWID()).NEWID() generates a new globally unique identifier (GUID) with each invocation, and the CHECKSUM function generates an integer checksum value based on the input.Thus, the following expression generates a random float value in the range 0 through 1:

SELECT RAND(CHECKSUM(NEWID()));

Another problem with the original solution I provided for generating a random integer value in the range 1 through @n is that RAND generates a float value in the range 0 through 1— inclusive.That is, the float values 0E0 and 1E0 are included in the range of values that RAND can return. Therefore, the original solution will return @n + 1 if RAND returns exactly 1E0. Of course, the probability to get exactly 1E0 back from RAND is very low—so low that you might be willing to take the risk. But if you want to ensure that the result will be in the range 1 through @n, you can subtract a very small fraction from @n.

Here's the revised solution for returning a random integer value in the range 1 through @n, addressing both problems:

DECLARE @n AS INT; 
SET @n = 10; 
SELECT CAST(RAND(CHECKSUM(NEWID())) *
  (@n - 0.00000000001) AS INT) + 1;

This expression is lengthy, and you might consider such a solution awkward. Fortunately, I've learned a simpler solution from SQL Server MVP Steve Kass:

DECLARE @n AS INT; 
SET @n = 10; 
SELECT 1 + ABS(CHECKSUM(NEWID())) % @n;

The CHECKSUM(NEWID()) expression generates a random integer value in the range of values supported by a four-byte integer datatype. The ABS(CHECKSUM (NEWID())) % @n expression returns a random integer in the range 0 through (@n - 1). Add 1, and you get a random integer in the range 1 through @n.

Returning Randomly Sorted Data

Another common randomization request is to return randomly sorted data. For example, suppose you want to return the rows from the Employees table in the Northwind database, sorted randomly. Programmers making their first attempt at solving the problem might come up with the following solution:

USE Northwind; 
SELECT EmployeeID, FirstName, 
  LastName 
FROM dbo.Employees 
ORDER BY RAND();

However, if you run this query multiple times, you'll find that you keep getting the results back in the same order.To understand why you don't really get random sorting, run the following query, which invokes the RAND function as part of the SELECT list:

SELECT RAND() AS Rnd, EmployeeID, 
  FirstName, LastName 
FROM dbo.Employees;

Now, examine the output that Table 1 shows. Like most non-deterministic functions, RAND is invoked once per query, not once per row.You get the same value in all rows. Hence, specifying RAND in the ORDER BY clause has no effect on sorting.

There's only one nondeterministic function that is evaluated once per row and not once per query—the NEWID function. Many programmers who realized the problem with RAND use the following solution to return the data in random order:

SELECT EmployeeID, FirstName, LastName 
FROM dbo.Employees 
ORDER BY NEWID();

Note, however, that even though NEWID will generate a different value in each row, it isn't completely random. NEWID generates unique values—not exactly a random distribution of values. On some OS platforms (e.g., Windows NT 4.0), it's extremely apparent that it isn't random. On later platforms, it's less apparent, but NEWID still generates clustered or even incremental values. The expression CHECKSUM(NEWID()) gives better random distribution than NEWID, so I recommend the following solution for returning data in random order:

SELECT EmployeeID, FirstName, LastName 
FROM dbo.Employees 
ORDER BY CHECKSUM(NEWID());

Run this query multiple times and observe that you get a different random order with each invocation.

Returning n Random Rows

The next randomization request is to return @n random rows from a table, where @n is an integer input value—for example, return @n random rows from the Employees table. Now that you're familiar with the technique to sort rows in a random manner, the following solution should be straightforward:

DECLARE @n AS INT; 
SET @n = 3; 
SELECT TOP (@n) EmployeeID,
  FirstName, LastName 
FROM dbo.Employees 
ORDER BY CHECKSUM(NEWID());

The ability to provide the TOP option with an expression as input is new to SQL Server 2005. By contrast, SQL Server 2000 allows only a constant as input, as follows:

SELECT TOP 3 EmployeeID, 
  FirstName, LastName 
FROM dbo.Employees 
ORDER BY CHECKSUM(NEWID());

Note that with large tables (hundreds of thousands or millions of rows), this solution is inefficient. This query causes a full table or index scan—which is itself expensive— but the scan amounts to a small portion of the query cost. Most of the query cost is associated with sorting a large number of rows, which is very expensive. For example, run the following query against the Sales-OrderHeader table in the AdventureWorks database to return one random row, and request the graphical execution plan and STATISTICS IO information:

SET STATISTICS IO ON; 
USE AdventureWorks; 
SELECT TOP (1) SalesOrderID, 
   OrderDate, CustomerID, ContactID 
FROM Sales.SalesOrderHeader 
ORDER BY CHECKSUM(NEWID());

Figure 1's execution plan shows a full table scan (i.e., clustered index scan) and a sort operator, with 20 percent of the cost associated with the scan, and 80 percent associated with the sort. As the table grows larger, the sort-cost portion of the query cost increases because sorting has O(n log n) complexity. The cost of the query (i.e., the Estimated Subtree Cost in the leftmost operator) is about 2.7. STATISTICS IO reports 703 logical reads because of the full table scan.This table is fairly small, so the query cost isn't extremely high and the results come back fairly quickly. But if you run it against a table that contains millions of rows, it would take quite a while.

One way to reduce the query cost with large tables is to scan—and more important, sort—less data.To do so, you can use the new TABLESAMPLE option in SQL Server 2005. (For details about TABLESAMPLE, see "Query Data Samples in a Table," InstantDoc ID 49572.)TABLESAMPLE lets you request a sampling of a certain number or percentage of rows, which SQL Server translates to scanning a random sample of pages.As I describe in "Query Data Samples in a Table," SQL Server randomly chooses which pages to scan, and you're not guaranteed to get the exact number of rows you request. So, the higher the number of rows you request, the higher the probability that you'll get at least the number of rows you request.

With that in mind, here's how you can request one random row from Sales-OrderHeader, with a substantial reduction of the cost of the query:

SELECT TOP (1) SalesOrderID, 
  OrderDate, CustomerID, 
  ContactID 
FROM Sales.SalesOrderHeader 
  TABLESAMPLE(1000 ROWS) 
ORDER BY CHECKSUM(NEWID());

Examine Figure 2's execution plan, produced for this query, and notice that the query cost is substantially reduced—it's now about 0.048. STATISTICS IO reports only 23 logical reads. Much fewer rows were scanned, but more important, much fewer rows were sorted. Note that there's a small chance you won't get any rows back from the query, but in such a case, you can introduce-logic into your application to invoke the query again.

Figure 2: Execution plan for Top (1) query against SalesOrderHeader with the TABLESAMPLE option

Returning n Random Rows per Group

A variation of returning n random rows is to return n random rows per group—for example, given @n as input, returning @n random orders for each employee. In SQL Server 2005, you can use the new APPLY operator to apply a table expression with the aforementioned logic per each employee, as Listing 1 shows. Table 2 shows one possible output of this query, but of course, every time you run it, you'll get a different output.

For details about the APPLY operator, see "SQL Server 2005's APPLY Operator." An attempt to use somewhat similar logic in SQL Server 2000 proves futile. For example, initially I wrote the query that Listing 2 shows, believing it to be logically equivalent to the APPLY query. However, the query returns a different number of rows every time I run it. Can you see the bug in the query? This query invokes the subquery once per each order (as opposed to once per each employee in the APPLY solution). Besides being much more expensive than the APPLY solution, this solution reevaluates the three random order IDs every time the subquery is run (i.e., once per order).

Therefore, you might get fewer than three matches for each employee, and you might get more than three.In short,this solution isn't correct.

Get to Know TOP, APPLY, and TABLESAMPLE

Randomization is a common need in T-SQL, but T-SQL introduces its own unique challenges. As we've seen, RAND returns a random float in the range 0 through 1, inclusive, and RAND is evaluated only once per query and not once per row. In the process of demonstrating techniques to tackle common randomization needs in T-SQL, I hope I've shown you how you can improve these solutions by using some of the new T-SQL features in SQL Server 2005.

Comments

Plain text