Bug with NEWID Function

This bug was first described by Thomas Glörfeld here.

The related Microsoft Connect item can be found here.

The bug has to do with invoking the NEWID function in a query defining a table expression (derived table, CTE, view, inline table-valued function), and then joining the table expression with another table. If the relationship between the table expression and the other table is 1:M, each row from the table expression can appear multiple times in the result of the join. The bug is that depending on the optimizer’s choices in terms of physical processing, SQL Server may end up evaluating the NEWID function once per each target row instead of once per each source row.

This bug was tested on SQL Server versions 2000/SP4, 2005/SP2 and 2008/RC0.

As an example, the following code creates the tables t1 and t2 that are related in a 1:M relationship, and a view that invokes the NEWID function per each row from t1:

set nocount on;
use tempdb;
go
if object_id('dbo.v1', 'V') is not null drop view dbo.v1;
if object_id('dbo.t2', 'U') is not null drop table dbo.t2;
if object_id('dbo.t1', 'U') is not null drop table dbo.t1;
go
create table dbo.t1(id int not null primary key);
insert into dbo.t1(id) values(1);
insert into dbo.t1(id) values(2);

create table dbo.t2(id int not null references dbo.t1);
insert into dbo.t2(id) values(1);
insert into dbo.t2(id) values(1);
insert into dbo.t2(id) values(2);
insert into dbo.t2(id) values(2);
go
create view dbo.v1
as
select id, newid() as newid_val
from dbo.t1;
go

Consider the following query:

select a.id as a_id, b.id as b_id, a.newid_val
from dbo.v1 as a
  join dbo.t2 as b
    on a.id = b.id;

You would expect the NEWID function to be evaluated per each source row; namely, per each row from t1. However, the function ends up being evaluated per each target row; namely, per each row in the result of the join. Here’s the output I got from one of the invocations of this query:

a_id        b_id        newid_val
----------- ----------- ------------------------------------
1           1           5232647A-0279-4F3C-9B7F-EA88B062A1AF
1           1           B87D2681-26C4-4B9A-8895-35EF138BC964
2           2           F69FE4D3-26AE-4642-9622-C8416C8FFC0C
2           2           0ED5950C-12BF-4778-90A3-2AC55CD59AAD


Instead of getting two distinct GUIDs (one per each source row from t1), you get four (one per each target row in the result of the join).

I posted the bug on Microsoft Connect (FeedbackID=350485), and after consideration, Microsoft decided to close the item and mark it as “Won’t Fix”. The reasoning behind the decision is that in the vast majority of cases, the optimization choices that lead to the bug yield better performance without sacrificing the correctness of the query. If you do fall into one of the unusual cases where correctness is compromised, you can consider alternatives (e.g., physically materializing the data along with the NEWID values in a table).

Here’s the response from Microsoft:

“Closing the loop . . . I've discussed this question with the Dev team. And eventually we have decided not to change current behavior, for the following reasons:

1) The optimizer does not guarantee timing or number of executions of scalar functions. This is a long-established tenet. It's the fundamental 'leeway' that allows the optimizer enough freedom to gain significant improvements in query-plan execution.

2) This "once-per-row behavior" is not a new issue, although it's not widely discussed. We started to tweak its behavior back in the Yukon release. But it's quite hard to pin down precisely, in all cases, exactly what it means! For example, does it a apply to interim rows calculated 'on the way' to the final result? - in which case it clearly depends on the plan chosen. Or does it apply only to the rows that will eventually appear in the completed result? - there's a nasty recursion going on here, as I'm sure you'll agree!

3) As I mentioned earlier, we default to "optimize performance" - which is good for 99% of cases. The 1% of cases where it might change results are fairly easy to spot - side-effecting 'functions' such as NEWID - and easy to 'fix' (trading perf, as a consequence). This default to "optimize performance" again, is long-established, and accepted. (Yes, it's not the stance chosen by compilers for conventional programming languages, but so be it).

So, our recommendations are:

a) Avoid reliance on non-guaranteed timing and number-of-executions semantics.

b) Avoid using NEWID() deep in table expressions.

c) Use OPTION to force a particular behavior (trading perf)

Hope this explanation helps clarify our reasons for closing this bug as "won't fix".

Thanks,

Jim”

I wanted to point out the bug so that you would be aware of it and of the fact that a fix is not planned. If you need the NEWID function to be evaluated once per each source row, make sure you materialize the data in a table first.
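For example, here's a minimal sketch of the materialization approach using a temporary table (the name #t1_newids is just an illustrative choice):

-- materialize the source rows together with their NEWID values
select id, newid() as newid_val
into #t1_newids
from dbo.t1;

-- join the materialized rows with t2;
-- each source row now carries a single, fixed GUID
select a.id as a_id, b.id as b_id, a.newid_val
from #t1_newids as a
  join dbo.t2 as b
    on a.id = b.id;

drop table #t1_newids;

Because the GUIDs are generated and stored before the join takes place, each row from t1 appears in the join result with the same newid_val, regardless of the plan the optimizer chooses.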

Cheers,

BG
