Having recently tried (a slightly modified version of) Joe Celko's query to compute the median (see below) I noticed inconsistencies in the results.
Given a small table of, say, 10 rows, it works as expected. Given a larger table of approx. 100,000 rows, it seems as if the row_number () function is applied before the where clause has filtered the virtual table in the CTE. When the where clause is then applied, there are gaps in the row_number () values wherever a row has been filtered out. Remove the where clause and it works as expected regardless of the number of rows in the table.
I don't have the SQL standard(s) available, so I can't tell whether this is the correct behaviour, but it seems very counter-intuitive.
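For reference, the pairing trick the query relies on can be sketched outside SQL. This is a minimal Python illustration (function name and structure are mine, not Celko's): number the sorted values once ascending (hi) and once descending (lo); the row(s) where the two numberings meet are the middle of the set, and averaging them gives the median for both odd and even counts.

```python
# Sketch of the paired-row_number median technique, assuming the
# numbering is assigned AFTER any filtering (the behaviour the CTE
# should exhibit).
def median(values):
    s = sorted(values)
    n = len(s)
    # hi = row_number ordered ascending, lo = row_number ordered descending
    numbered = [(x, i + 1, n - i) for i, x in enumerate(s)]
    # keep the middle row(s): hi in (lo, lo+1, lo-1)
    mid = [x for x, hi, lo in numbered if hi in (lo, lo + 1, lo - 1)]
    return sum(mid) / len(mid)
```

With an odd count only the single middle row satisfies hi = lo; with an even count the two middle rows satisfy hi = lo ± 1 and are averaged.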
create table [dbo].[t] (
[Id] [bigint] identity (1,1) NOT NULL,
[a] [int] NOT NULL, -- values between 1 and 100,000
[b] [smallint] NOT NULL, -- values between 1 and 15, mostly 1 and 2
[c] [bigint] NOT NULL, -- typical values between -500,000 and 200,000
[d] [bigint] NOT NULL,
[e] [datetime] NOT NULL default (getdate ()),
constraint [pk_t] primary key nonclustered ([Id])
) on [primary];
create index ix_a on dbo.t (a);
create clustered index ix_e on dbo.t (e);
-- add 10 rows and it works as expected
-- add 100,000 rows and the result is incorrect
with SortedData (x, hi, lo) as (
select c, -- integer, positive or negative
row_number () over (order by c asc),
row_number () over (order by c desc)
from dbo.t
where t.b = 1
)
select avg (x * 1.0) as median
from SortedData
where hi in (lo, lo+1, lo-1);
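To show what the suspected misbehaviour would do to the result, here is a small Python simulation (names and structure are mine, for illustration only): the row numbers are assigned over the *unfiltered* rows first, and the filter is applied afterwards, leaving gaps in the numbering exactly as described above.

```python
# Simulates row_number() being evaluated BEFORE the where clause:
# rows are numbered over the full set, then filtered, so the kept
# rows carry gapped hi/lo values and the pairing test can miss.
def median_with_gaps(rows, keep):
    # rows are (c, b) pairs; order by c, number over ALL rows
    s = sorted(rows, key=lambda r: r[0])
    n = len(s)
    numbered = [(c, i + 1, n - i, b) for i, (c, b) in enumerate(s)]
    # filter applied only after numbering (the suspected behaviour)
    filtered = [(c, hi, lo) for c, hi, lo, b in numbered if keep(b)]
    mid = [c for c, hi, lo in filtered if hi in (lo, lo + 1, lo - 1)]
    return sum(mid) / len(mid) if mid else None
```

For example, with c values 1..4 where only b = 1 rows (c = 1 and c = 4) survive the filter, the gapped hi/lo pairs never satisfy the condition and no median row qualifies at all, whereas numbering after the filter would correctly yield 2.5.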