Parallelism bug with SQL Server 2012 giving incorrect results for cetain degrees of parallelism - by Paul McLoughlin

Status : 

  Fixed<br /><br />
		This item has been fixed in the current or upcoming version of this product.<br /><br />
		A more detailed explanation for the resolution of this particular item may have been provided in the comments section.


7
0
Sign in
to vote
ID 804376 Comments
Status Closed Workarounds
Type Bug Repros 5
Opened 10/2/2013 2:49:54 PM
Access Restriction Public

Description

I've discovered a bug in SQL Server 2012 where certain queries can give different - and incorrect - results at certain degrees of parallelism.  

Essentially with the reproduction script provided the query will give 8 rows (correct) at MAXDOP 1;  16 rows (duplicates of each row) at MAXDOP 4; and 16 rows (duplicates with two incorrect rows) at MAXDOP 8.  

I've attached a setup script and reproduction script.  The setup script creates a database called ParallelismBugDB, creates a set of tables, populates those tables with data, and loads statistics for those tables from our production environment (so that the optimizer will choose the offending plan).  The reproduction script runs the query that displays these inconsistencies with MAXDOP 1, MAXDOP 4 and MAXDOP 8 (freeing the plan cache between each run).

It would appear that there are two bugs here:  the first being the differences between the number of rows returned for MAXDOP 1 and MAXDOP 4 cases; the second (and more concerning) being the logically incorrect results returned in the MAXDOP 8 case.
Sign in to post a comment.
Posted by Tarakpatel on 5/14/2014 at 11:26 AM
Hi,
Recently we have migrated our database from SQL 2008 R2 SP2 to 20012 SP1. One of our stored procedure is calling another stored procedure and it executes many update statements, out of which 2 update statements which were working fine in SQL 2008R2 Standard edition (Prod - 8 processors) as well as developer edition (Dev and QA - 2 processors) are not working as expected in Prod environment (SQL 2012 standard edition -- 8 processors). It works fine in SQL 2012 developer edition (DEV and QA -- 2 processors).
Only two differences we have between SQL 2012 QA and Prod are:

1) SQL Edition is different.
2) Number of processors are different. Prod has 8 processors, while QA has 2 processors.
What I have found so far is, if I run 2 update statements without setting up Maxdop, I am not getting out put as expected. But if I apply maxdop 1 or maxdop 2 settings at query level than I am getting data as expected. It does not work with Maxdop 3 or 4 etc. (row counts are same for update with and without parallelism but data getting updated is different in both cases).

This is strange behavior. Looks like this is one of the bug with SQL 2012 parallelism.


Code:

DECLARE @Temp1 DECIMAL(18,6)

DECLARE @Temp2 DECIMAL(18,6)

DECLARE @Temp3 VARCHAR(20)

DECLARE @fsttime bit

DECLARE @Temp4 decimal(18,6)

DECLARE @CURRENT_YEARTIME VARCHAR(6)

SET @CURRENT_YEARTIME='201410'

set @Temp4=0

SET @Temp1 = 0

SET @Temp2 = 0

SET @fsttime = 1

set @Temp3 = ''

Update T

Set @fsttime = case when T.D1 = @Temp3 then 0 else 1 End

, @Temp2= Case when @fsttime=1 then IsNull(P.[ABC $(Rtl)],0) else @Temp1 End

, T.[A1 $(Rtl)] = IsNull(@Temp2,0)

, [A2 $ $ (Rtl)]=case when cast(T.Year as varchar) + Right(T.Time,2)<@CURRENT_YEARTIME then T.[A2 $ $ (Rtl)] else T.[Sell Thru %]/100 * isnull(nullif(@Temp2 + isnull(T.[A3 $ (Rtl)],0),0),T.[A6 $ (Rtl)]) end

, [Adj Store Total End Inv $ (Rtl)]=case when cast(T.Year as varchar) + Right(T.Time,2) <@CURRENT_YEARTIME then (Isnull(@Temp2,0) + IsNull(T.[Adj Gross Shipment $ (Rtl)],0) + IsNull(T.[Inv Adj $ (Rtl)],0) - IsNull(T.[Inv Liq $ (Rtl)],0) + IsNull(T.[A3 $ (Rtl)],0) + IsNull(T.[A4 $ (Rtl)],0) + IsNull(T.[A5 $ (Rtl)],0) + IsNull(T.[PCR $ (Rtl)],0) - IsNull(case when cast(T.Year as varchar) + Right(T.Time,2)<@CURRENT_YEARTIME then T.[A2 $ $ (Rtl)] else T.[Sell Thru %]/100 * isnull(nullif(@Temp2 + isnull(T.[A3 $ (Rtl)],0),0),T.[A6 $ (Rtl)]) end,0)) - (Isnull(nullif(T.[B1],0),T.[B2]) * T.C1) else 0 end

, @Temp1 = T.[Store Total End Inv $ (Rtl)] = IsNull(@Temp2,0) + IsNull(T.[Adj Gross Shipment $ (Rtl)],0) + IsNull(T.[Inv Adj $ (Rtl)],0) - IsNull(T.[Inv Liq $ (Rtl)],0) + IsNull(T.[A3 $ (Rtl)],0) + IsNull(T.[A4 $ (Rtl)],0) + IsNull(T.[A5 $ (Rtl)],0) + IsNull(T.[PCR $ (Rtl)],0) - IsNull(case when cast(T.Year as varchar) + Right(T.Time,2)<@CURRENT_YEARTIME then T.[A2 $ $ (Rtl)] else T.[Sell Thru %]/100 * isnull(nullif(@Temp2 + isnull(T.[A3 $ (Rtl)],0),0),T.[A6 $ (Rtl)]) end,0) - IsNull(case when cast(T.Year as varchar) + Right(T.Time,2) <@CURRENT_YEARTIME then (Isnull(@Temp2,0) + IsNull(T.[Adj Gross Shipment $ (Rtl)],0) + IsNull(T.[Inv Adj $ (Rtl)],0) - IsNull(T.[Inv Liq $ (Rtl)],0) + IsNull(T.[A3 $ (Rtl)],0) + IsNull(T.[A4 $ (Rtl)],0) + IsNull(T.[A5 $ (Rtl)],0) + IsNull(T.[PCR $ (Rtl)],0) - IsNull(case when cast(T.Year as varchar) + Right(T.Time,2)<@CURRENT_YEARTIME then T.[A2 $ $ (Rtl)] else T.[Sell Thru %]/100 * isnull(nullif(@Temp2 + isnull(T.[A3 $ (Rtl)],0),0),T.[A6 $ (Rtl)]) end,0)) - (Isnull(nullif(T.[B1],0),T.[B2]) * T.C1) else 0 end,0)

, @Temp3 = T.D1

From Test2.dbo.BOMEOMForUDDSuper_BOM_EOM T INNER Test2.dbo.BOMEOMForUDDTmppldata P

on T.D1 = P.D1

And T.D2 = P.D2

And T.D3 = P.D3
Posted by Microsoft on 4/23/2014 at 9:03 AM
Sorry, my last statement below is not correct.
The change that originally caused the issue in SQL Server 2012 was back ported to SQL Server 2008 SP3 CU5 and later ported to SQL Server 2008 R2 CU14 and R2 SP1 CU7. So the issue will also exist in SQL Server 2008 and 2008 R2, depending on which CU is installed. We will be considering if we can include the fix to the next CU of SQL Server 2008 R2 SP2.

Alexey
Posted by Microsoft on 4/23/2014 at 7:58 AM
Hi Paul,

Yes, this problem is fixed in SQL Server 2012 SP2 (to be released later this year), but not in SQL Server 2014. However, the fix is also going to be available in SQL Server 2012 SP1 CU10, and then will be ported to a next CU for SQL Server 2014.
This problem should not exist in SQL Server 2008 or 2008 R2.

Thanks,
    Alexey, SQL Development
Posted by Paul McLoughlin on 4/17/2014 at 10:48 AM
I can confirm that the bug still exists in the latest version of SQL Server 2014 (12.0.2000.8). Hoping that this gets resolved in a cumulative update sometime soon.
Posted by Matthias Seelhofer on 4/17/2014 at 9:05 AM
Tanks for the post, it was very helpful since I was struggeling with wrong results on some reports. After setting MAXDOP to 1 in the server properties dialog, the result was correct.

Is there any information about when the problem will be fixed? With the latest update of SQL Server 2012 (CU9, 11.0.3412), the problem still exists.

And: According to my tests, the problem did not exist in V2008 (10.0.5500), is that correct? What about 2008 R2?

Thank you!
Posted by Microsoft on 1/5/2014 at 6:56 PM
Thanks for submitting this feedback. This issue has been resolved, and a fix will be available in an upcoming servicing release of SQL Server 2012. We will update you when the fix has been shipped. Thanks for your continued support in improving SQL Server.
Posted by Microsoft on 1/5/2014 at 6:56 PM
Thanks for submitting this feedback. This issue has been resolved, and a fix will be available in an upcoming servicing release of SQL Server 2012. We will update you when the fix has been shipped. Thanks for your continued support in improving SQL Server.
Posted by Microsoft on 11/7/2013 at 6:12 PM
Hi Paul,

We have found the root cause of the problem. There's an issue in the query plan when stream aggregate operator does not request repartitioning of the rows below it when all the values are constant across those rows. This happens to be the case for your query due to predicates that limit ID, object_type, description to constant values.
We will be working on possible ways to fix this problem, but it may impact some other scenarios, so the risk is currently high.
To reproduce this issue, a number of specific conditions must be met, such as number of CPUs, available memory, query shape, and constant predicates I mentioned above. Also, as we know, there's a workaround. So depending on the risk we may or may not be able to have a fix for this in a servicing release. We will update the issue when we have a resolution for it.

Thank you,
    Alexey, SQL Development
Posted by Paul McLoughlin on 10/8/2013 at 8:31 AM
Hi Alexey,

Thanks for the prompt response. I can see how we can work around this issue by adding the distinct to the second subquery, but I'm slightly concerned about the phantom rows that are being seen in MAXDOP 8 case in my example. This would feel like a bug to me (it being possible for incorrect results to be returned simply because of the number of degrees of parallelism being used). Do you expect this to be something that will be resolved in a future cumulative update or service pack for SQL Server 2012?

Cheers

Paul
Posted by Microsoft on 10/4/2013 at 4:15 PM
Hi Paul,

Thank you for reporting your issue and for very detailed repro steps.
I agree that the observed behavior is not obvious, but I’ll try to explain what is going on.
When you declare the sub-query 'ua1', it can provide distinct rows for the union. Inside the Query Optimizer, you can think of it as delivered property of this sub-query. However, nobody on the top using this sub-query really requires that distinct property, so the optimizer does not feel “obligated” to keep that distinctness. When a parallel plan is used, depending on how the rows are spread among worker threads, you may or may not be getting duplicate rows across multiple workers. If all the duplicate rows coming from concatenation happen to be in one thread, the steam aggregate will eliminate them. If some of them are spread among the threads, however, those duplicates will participate in other joins within their respective worker threads and will generate more rows out of the query. Depending on rows distribution for the result of concatenation and other tables, you can even see some phantom rows that you identified as incorrect results. The reason all this happens, again, is because the Query Optimizer does not make any efforts to ensure that all the rows coming out of concatenation are actually distinct across the worker threads. Even that the union may provide distinctness, it is not required on top of it (of course, if you just execute the sub-query alone, you will see a distinct sort in the query plan that ensures the final results are distinct, since this query would be the top level query and thus distinctness is required).
So in order to fix the query, you would need to add 'distinct' to subquery 'ua’ (i.e. make it 'select distinct * from ua1 …'). I verified that with this modification the query returns 8 rows consistently regardless of the degree of parallelism.

Thank you,
Alexey, SQL Development