Home Dashboard Directory Help
Search

Bug with NEWID and Table Expressions by Itzik Ben-Gan


Status: 

Closed
 as Won't Fix Help for as Won't Fix


30
1
Sign in
to vote
Type: Bug
ID: 350485
Opened: 6/11/2008 8:36:42 AM
Access Restriction: Public
0
Workaround(s)
view
2
User(s) can reproduce this bug

Description

Originally described by Thomas Glörfeld here: http://www.glorf.it/blog/2008/05/16/sql-talk/sql-server-is-not-aware-of-nondeterministic-functions.
Bug description:
Suppose you define a table expression (derived table, CTE, view, inline TVF) based on a query that invokes the NEWID function (call it T2). You join the table expression with another table (call it T1). The relationship between T2 and T1 is 1:M. You join T1 and T2. Depending on the join algorithm chosen by the optimizer, you might get the NEWID function to be invoked once per each result row instead of once per each T2 row.
This bug appears in SQL Server 2005, and seems like a regression from SQL Server 2000.
Details
Sign in to post a comment.
Posted by Jon Seigel on 12/23/2011 at 11:40 AM
Interestingly, in 2008+, the default plan generated by the original query is different than for 2005 (hash join instead of nested loops), but the nested loops plan still exhibits the buggy/different behaviour if you force it using a hint.
Posted by Paul White NZ on 7/19/2011 at 4:50 PM
Diksta,

Did you read *any* of the preceding discussion? From the last comment from Microsoft:

"The optimizer does not guarantee timing or number of executions of scalar functions"

So now you are asking why the timing of number of executions of the RAND function varies?

Yes you only changed an INNER JOIN to FULL JOIN, but that's enough to change the optimizer rules and transformations applied to reach a final plan. It so happens that the final plan shape differs, and so you get different results.
Posted by Diksta on 7/19/2011 at 5:41 AM
Why does this work:

WITH RawData AS (
SELECT 1 AS ProductId, 2 AS AnotherId
UNION ALL
SELECT 1, 3
UNION ALL
SELECT 1, 4),
GroupedData AS (
SELECT DISTINCT ProductId FROM RawData),
GUIDs AS (
SELECT ProductId, RAND() AS ProductGUID FROM GroupedData)
SELECT rd.ProductId, rd.AnotherId, g.ProductGUID FROM RawData rd FULL OUTER JOIN GUIDs g ON g.ProductId = rd.ProductId;

...but this doesn't work:

WITH RawData AS (
SELECT 1 AS ProductId, 2 AS AnotherId
UNION ALL
SELECT 1, 3
UNION ALL
SELECT 1, 4),
GroupedData AS (
SELECT DISTINCT ProductId FROM RawData),
GUIDs AS (
SELECT ProductId, RAND() AS ProductGUID FROM GroupedData)
SELECT rd.ProductId, rd.AnotherId, g.ProductGUID FROM RawData rd INNER JOIN GUIDs g ON g.ProductId = rd.ProductId;

(and all I changed there was the INNER JOIN to a FULL OUTER JOIN)
Posted by Microsoft on 7/7/2008 at 9:27 AM
Hi,

Closing the loop . . . I've discussed this question with the Dev team. And eventually we have decided not to change current behavior, for the following reasons:

1) The optimizer does not guarantee timing or number of executions of scalar functions. This is a long-estabilished tenet. It's the fundamental 'leeway' tha allows the optimizer enough freedom to gain significant improvements in query-plan execution.

2) This "once-per-row behavior" is not a new issue, although it's not widely discussed. We started to tweak its behavior back in the Yukon release. But it's quite hard to pin down precisely, in all cases, exactly what it means! For example, does it a apply to interim rows calculated 'on the way' to the final result? - in which case it clearly depends on the plan chosen. Or does it apply only to the rows that will eventually appear in the completed result? - there's a nasty recursion going on here, as I'm sure you'll agree!

3) As I mentioned earlier, we default to "optimize performance" - which is good for 99% of cases. The 1% of cases where it might change results are fairly easy to spot - side-effecting 'functions' such as NEWID - and easy to 'fix' (trading perf, as a consequence). This default to "optimize performance" again, is long-established, and accepted. (Yes, it's not the stance chosen by compilers for conventional programming languages, but so be it).

So, our recommendations are:

a) Avoid reliance on non-guaranteed timing and number-of-executions semantics.
b) Avoid using NEWID() deep in table expressions.
c) Use OPTION to force a particular behavior (trading perf)

Hope this explanation helps clarify our reasons for closing this bug as "won't fix".

Thanks,

Jim
Posted by Frommon on 6/20/2008 at 4:47 AM
--Itzik Ben-Gan query
select *
from (select 1 as id
union all select 1
union all select 2
union all select 2) as t1
join
(select 1 as id, newid() as new_id
union all select 2, newid()) as t2
on t1.id = t2.id;

--could be fixed by following one

select *
from (select 1 as id
union all select 1
union all select 2
union all select 2) as t1
join
(select distinct * from
(select 1 as id, newid() as new_id
union all select 2, newid()) as tt
) t2
on t1.id = t2.id

-- or this way:

select *
from (select 1 as id
union all select 1
union all select 2
union all select 2) as t1
join
(select * from
(select id,new_id
, Row_number() over(order by new_id) as rn
from
(select 1 as id, newid() as new_id
union all select 2, newid()
) tt
) ttt
) t2
on t1.id = t2.id

Posted by Microsoft on 6/17/2008 at 11:34 AM
Hi Itzik,

Thankyou for this bug report. One of the most interesting discussions around.

This hits to the very heart of the issue - is optimization allowed to change a program's semantics? Ie: if a program yields certain answers, but runs slowly, is it legitimate for a Query Optimizer make that program run faster, yet also change the results given?

Before shouting "NO!" (my own personal inclination too :-), consider: the good news is that, in 99% of cases, the answers ARE the same. So Query Optimization is a clear win. The bad news is that, if the query contains side-effecting code, then different plans CAN indeed yield different results. And NEWID() is one such side-effecting (non-deterministic) 'function' that exposes the difference. [Actually, if you experiment, you can devise others - for example, short-circuit evaluation of AND clauses: make the second clause throw an arithmetic divide-by-zero - different optimizations may execute that second clause BEFORE the first clause] This reflects Craig's explanation, elsewhere in this thread, that SqlServer does not guarantee when scalar operators are executed.

So, we have a choice: if we want to guarantee a certain behavior in the presence of non-deterministic (side-effecting) code - so that results of JOINs, for example, follow the semantics of a nested-loop execution - then we can use appropriate OPTIONs to force that behavior - as UC points out. But the resulting code will run slow - that's the cost of, in effect, hobbling the Query Optimizer.

All that said, we are moving the Query Optimizer in the direction of "as expected" behavior for NEWID() - trading off performance for "results as expected".

[This bug has an interesting parallel in optimizing compilers for languages like C++. In such cases, the optimizer NEVER changes the observable behavior of a program. If it don't know what an extern function does, it assumes, conservatively, that the function can be side-effecting - changing global state. If it CAN see the IR for an extern function, the best compilers will alias-analyze, and perform safe optimizations. They also skip optimization around variables tagged as volatile - another expression of side-effecting. So, in a way, conventional language optimizers take the opposite stance from database optimizers. But there's less at stake! - lost perf might be a factor of 20% for C++, but a factor of ten-fold for database]

Anyhow, this bug is now assigned to the QO Dev team for a deeper look.

Thanks,

Jim
Posted by Itzik Ben-Gan on 6/12/2008 at 8:01 AM
Here's another repro of the bug (2005/SP2):

with t1 as(select 1 as col1, newid() as col2)
select *
from t1 as a join t1 as b
on a.col1 = b.col1;

col1        col2                                 col1        col2
----------- ------------------------------------ ----------- ------------------------------------
1         B85E63EF-C24F-4C77-954E-6752B85A1A4D 1
6E2AEB76-B230-481C-BBF8-46DE228AC181

Interestingly, without the join the bug doesn't appear:

with t1 as(select 1 as col1, newid() as col2)
select col1, col2, col2 from t1;

col1        col2                                 col2
----------- ------------------------------------ ------------------------------------
1         651D3B26-73DF-448B-AE4A-E0BD44D80E11
651D3B26-73DF-448B-AE4A-E0BD44D80E11
Posted by Tony Rogerson SQL on 6/12/2008 at 12:10 AM
There is a general problem here - ROW_NUMBER() is also effected when used in CTE's when ordering by a non-deterministic function; ROW_NUMBER() when used in a CTE or View suffers from being run multiple times probably due to the way the inline expansion of the View and CTE work...

In the script below rn should be the same; as you can see on the CTE with the rn being calculated in the anchor it isn't...

select 1 as id, 1 as d
into #t
union all select 1, 2 as d
union all select 1, 3 as d
union all select 1, 4 as d
union all select 1, 5 as d
union all select 1, 6 as d
union all select 1, 7 as d
union all select 2, 1 as d
union all select 2, 2 as d
union all select 2, 3 as d
union all select 2, 4 as d
union all select 2, 5 as d
union all select 2, 6 as d
union all select 2, 7 as d

select *, row_number() over( order by newid() ) as rn
into #x
from #t

;with t2 ( id, d, rn )
as (
    select *, row_number() over( order by newid() ) as rn
    from #t
    )
select *
from t2 a
    inner join t2 b
        on b.id = a.id
     and b.d = a.d
order by a.rn

;with t2 ( id, d, rn )
as (
    select *
    from #x
    )
select *
from t2 a
    inner join t2 b
        on b.id = a.id
     and b.d = a.d
order by a.rn

The performance side of this I've documented here: http://sqlblogcasts.com/blogs/tonyrogerson/archive/2008/05/17/non-recursive-common-table-expressions-performance-sucks-2-row-number-is-executed-number-of-cte-references-x-number-of-rows-from-the-anchor.aspx
Sign in to post a workaround.