Skip navigation

Solution to Catch That Bug! Random Row per Group

Last week I presented the following puzzle:
Catch That Bug! Random Row per Group
(http://www.sqlmag.com/Article/ArticleID/93895/93895.html).
As some of you figured, the reason that the query returns varying numbers of
rows in different invocations is because the subquery is reevaluated once per
order (can be observed in the execution plan), while you want the subquery to
reevaluate once per employee. Every time the subquery is invoked a new
random OrderID is produced, and it may or may not be equal to the OrderID
in the outer row. In some cases no order is found for a given employee, and
in some cases more than one order is found. In short, you need to come up
with a solution where either the execution plan evaluates a random order only
once per employee, or one where random sort values are assigned only once
(in one shot) for the set of Orders.
As an aside, the expression CHECKSUM(NEWID()) generates values with
better random distribution than NEWID() does, but that’s a different story.

SQL Server 2005 has language elements that allow elegant and efficient
solutions that overcome the problem including the APPLY table operator and
the ROW_NUMBER function. But I’ll start with solutions that work in SQL
Server 2000 first.

The following query produces a random OrderID per employee:

SELECT (SELECT TOP 1 OrderID
        FROM dbo.Orders AS O
        WHERE O.EmployeeID = E.EmployeeID
        ORDER BY CHECKSUM(NEWID())) AS RandomOrder
FROM dbo.Employees AS E;

Now filter the orders from the Orders table where the OrderID is IN the set
of OrderIDs returned by the previous query:

SELECT EmployeeID, OrderID, OrderDate, CustomerID
FROM dbo.Orders
WHERE OrderID IN
    (SELECT (SELECT TOP 1 OrderID
             FROM dbo.Orders AS O
             WHERE O.EmployeeID = E.EmployeeID
             ORDER BY CHECKSUM(NEWID())) AS RandomOrder
     FROM dbo.Employees AS E)
ORDER BY EmployeeID;

You get a plan where the innermost subquery (returning a random OrderID
for the outer employee) is evaluated only once per employee. You might not
feel comfortable with such a solution, since its correctness relies on the
execution plan that the optimizer chooses. The execution plan might be
different with different data distribution, indexes, etc. To be on the safe side,
you can first calculate a random sort value for each order (in one shot) and
materialize the result set in a temporary table:

SELECT EmployeeID, OrderID, CHECKSUM(NEWID()) AS Rnd
INTO #OrdersRnd
FROM dbo.Orders;

Then you can safely use the logic that I presented with the original query:

SELECT EmployeeID, OrderID, OrderDate, CustomerID
FROM dbo.Orders AS O1
WHERE OrderID IN
    (SELECT TOP 1 OrderID
     FROM #OrdersRnd AS O2
     WHERE O2.EmployeeID = O1.EmployeeID
     ORDER BY Rnd)
ORDER BY EmployeeID;

Not an efficient solution, but a safe one.

As for SQL Server 2005, things are simpler. You can use the APPLY
operator to return a random order per employee like so:

SELECT A.*
FROM dbo.Employees AS E
  CROSS APPLY
    (SELECT TOP (1) EmployeeID, OrderID, OrderDate, CustomerID
     FROM dbo.Orders AS O
     WHERE O.EmployeeID = E.EmployeeID
     ORDER BY CHECKSUM(NEWID())) AS A
ORDER BY EmployeeID;

The APPLY operator pretty much lends itself to being evaluated only once
per outer row.

Another solution is to calculate row numbers partitioned by employee, sorted
by randomly generated values (CHECKSUM(NEWID()))) . Encapsulate the
query that generates random values in a CTE, and have the outer query filter
only rows where the row number is equal to 1:

WITH Orders_RN AS
(
  SELECT EmployeeID, OrderID, OrderDate, CustomerID,
    ROW_NUMBER() OVER(PARTITION BY EmployeeID
                      ORDER BY CHECKSUM(NEWID())) AS RowNum
  FROM dbo.Orders
)
SELECT *
FROM Orders_RN
WHERE RowNum = 1
ORDER BY EmployeeID;

Here the random values are generated only once. The last two solutions that
use APPLY and ROW_NUMBER are also more flexible than the original
solution in the sense that you can request more than one random order per
employee:

SELECT A.*
FROM dbo.Employees AS E
  CROSS APPLY
    (SELECT TOP (3) EmployeeID, OrderID, OrderDate, CustomerID
     FROM dbo.Orders AS O
     WHERE O.EmployeeID = E.EmployeeID
     ORDER BY CHECKSUM(NEWID())) AS A
ORDER BY EmployeeID;

WITH Orders_RN AS
(
  SELECT EmployeeID, OrderID, OrderDate, CustomerID,
    ROW_NUMBER() OVER(PARTITION BY EmployeeID
                      ORDER BY CHECKSUM(NEWID())) AS RowNum
  FROM dbo.Orders
)
SELECT *
FROM Orders_RN
WHERE RowNum 

Cheers,
--
BG
 
Hide comments

Comments

  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.
Publish