row_number () over (order by ...) dropping rows - by T.L.A. _

Status : 

  Not Reproducible<br /><br />
		The product team could not reproduce this item with the description and steps provided.<br /><br />
		A more detailed explanation for the resolution of this particular item may have been provided in the comments section.


0
3
Sign in
to vote
ID 483540 Comments
Status Closed Workarounds
Type Bug Repros 0
Opened 8/18/2009 7:45:12 AM
Access Restriction Public

Description

Having recently tried (a slightly modified version of) Joe Celko's query to compute the median (see below) I noticed inconsistencies in the results. 

Given a small, say, 10 row, table it works as expected. Given a larger table, approx. 100,000 rows, it seem like the row_number () ... function is applied before the where clause has filtered the virtual table in the CTE. Then when the where clause is applied there will be gaps in the row_number () numbers for each row that is filtered out. Remove the where clause and it works as expected regardless of the number of rows in the table. 

I don't have the SQL standard(s) available so I can't tell if this is the correct behaviour but it seem very counter-intuitive.

create table [dbo].[t] (
   [Id] [bigint] identity (1,1) NOT NULL,
   [a] [int] NOT NULL, -- values between 1 and 100,000
   [b] [smallint] NOT NULL, -- values between 1 and 15, mostly 1 and 2
   [c] [bigint] NOT NULL, -- typical values between -500,00 and 200,00
   [d] [bigint] NOT NULL,
   [e] [datetime] NOT NULL default (getdate ()),
   constraint [pk_t] primary key nonclustered (
      [Id] asc
   ) 
) on [primary]

create index ix_a on dbo.t;
create clustered index ix_e on dbo.t;

-- add 10 rows and it works as expected
-- add 100,000 rows and the result 

with SortedData (x, hi, lo) 
as (
   select c, -- integer, positive or negative
      row_number () over (order by c asc), 
      row_number () over (order by c desc), 
   from t 
   where t.b = 1 
) 
select avg (x * 1.0) as median 
from SortedData 
where hi in (lo, lo+1, lo-1); 
Sign in to post a comment.
Posted by Microsoft on 11/4/2010 at 8:42 PM
Hello,

We're closing this Connect item as Not Repro since the problem was related to duplicate values in the input. Please see earlier comments for a full explanation of the observed behavior and code to generate the desired result.

If you have any questions or if you feel that this item has been closed prematurely, please contact us again.

Thanks again for helping to improve SQL Server by providing us with feedback.

Susan Price
Program Manager
SQL Server Database Engine
Posted by Microsoft on 8/20/2009 at 12:19 PM
Hello,

Thank you for filing your concerns with the SQL Server database engine. I believe Adam (see Community Discussion) has correctly diagnosed the problem (thanks Adam!) as duplicate c values. If the row numbers in hi and lo are not appropriately sorted within the rows for duplicate c values, then the where clause (WHERE hi IN (lo, lo+1, lo-1)) may not be true for any row. When I ran Adam's code, the median value was -155. Without explicitly sorting by id within rows with duplicate c values, the (partial) output is:
c        hi     lo
-154    3322    3308
-154    3323    3309
-154    3324    3310
-155    3303    3311
-155    3304    3312
-155    3305    3313
-155    3306    3314
-155    3307    3315
-155    3308    3316
-155    3309    3317
-155    3310    3318
-155    3311    3319
-155    3312    3320
-155    3313    3321
-155    3314    3322
-155    3315    3323
-155    3316    3324
-155    3317    3325
-156    3293    3326
You can see that hi and lo are never equal. When I run the code the breaks the ties (orders by c, id), the output is:

-154    3320    3308
-154    3319    3309
-154    3318    3310
-155    3317    3311
-155    3316    3312
-155    3315    3313
-155    3314    3314
-155    3313    3315
-155    3312    3316
-155    3311    3317
-155    3310    3318
-155    3309    3319
-155    3308    3320
-155    3307    3321
-155    3306    3322
-155    3305    3323
-155    3304    3324
-155    3303    3325
-156    3302    3326
You can see that hi and lo are consistently ordered, even for duplicate values of c. Hi = lo = 3314 and so a value for median is returned by the query.

Since Adam has provided an explanation of the behavior you observed and code to generate the result you want, I will close this Connect item. If you have any questions or comments, or if you feel I have closed this item prematurely, please contact us again.

Thank you,

Susan Price
Program Manager
SQL Server Database Engine
Posted by Adam Machanic on 8/18/2009 at 8:04 AM
The problem is duplicate C values. There is no guarantee that each row will be ordered the same way, so if you have duplicates you can get different row numbers. The following, for example, is totally valid:

c lo hi
- -- --
1 1 1
1 2 4
1 3 2
1 4 3

... you need to break the ties by ordering by the PK.

--

Here's a query to show all of this in more detail. I've fixed errors in your code to make it work:

use tempdb
go

create table [dbo].[t] (
[Id] [bigint] identity (1,1) NOT NULL,
[a] [int] NOT NULL, -- values between 1 and 100,000
[b] [smallint] NOT NULL, -- values between 1 and 15, mostly 1 and 2
[c] [bigint] NOT NULL, -- typical values between -500,00 and 200,00
[d] [bigint] NOT NULL,
[e] [datetime] NOT NULL default (getdate ()),
constraint [pk_t] primary key nonclustered (
     [Id] asc
)
) on [primary]

create index ix_a on dbo.t (a);
create clustered index ix_e on dbo.t (e);
go

insert dbo.t
([a],[b],[c],[d])
select top (100000)
    convert(int, rand(checksum(newid())) * 100000) AS [a],
    convert(smallint, rand(checksum(newid())) * 15) AS [b],
    convert(bigint, rand(checksum(newid())) * 700) - 500 AS [c],
    convert(bigint, a.number) AS [d]
from master..spt_values a, master..spt_values b
go

--fails -- old way
with SortedData (x, hi, lo)
as (
select c, -- integer, positive or negative
     row_number () over (order by c asc),
     row_number () over (order by c desc)
from t
where t.b = 1
)
select avg (x * 1.0) as median
from SortedData
where hi in (lo, lo+1, lo-1);
go

--works -- break the ties
with SortedData (x, hi, lo)
as (
select c, -- integer, positive or negative
     row_number () over (order by c asc, id asc),
     row_number () over (order by c desc, id desc)
from t
where t.b = 1
)
select x --avg (x * 1.0) as median
from SortedData
where hi in (lo, lo+1, lo-1);
go

drop table t
go