Home Dashboard Directory Help
Search

QUALIFY - Request for a New Filtering Clause by Itzik Ben-Gan


Status: 

Active


310
9
Sign in
to vote
Type: Suggestion
ID: 532474
Opened: 2/10/2010 12:20:01 AM
Access Restriction: Public
0
Workaround(s)
view

Description

This entry describes a request for a new query filtering clause called QUALIFY in SQL Server.

As of SQL Server 2008 (including R2) T-SQL supports three standard filtering clauses based on predicates (ON, WHERE, HAVING), and one nonstandard filtering option based on a number/percent of rows and ordering (TOP). Window calculations (e.g., ranking and aggregate calculations that use the OVER clause, and hopefully others in the future) are only allowed in the SELECT and ORDER BY clauses of a query, and therefore you cannot refer to those directly in the filtering clauses that are based on predicates. For example, the following is not a valid statement:

SELECT col1, col2, col3
FROM dbo.T1
WHERE ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY col2) BETWEEN 11 AND 20;

There is logic behind this restriction. Consider a table T1 with the following sample data and the query that follows:

col1
-----
100
150
300
350
900

SELECT col1
FROM dbo.T1
WHERE col1 >= 300 AND ROW_NUMBER() OVER(ORDER BY col1) <= 2;

Should the filter based on the ROW_NUMBER function be evaluated after col1 >= 300 or before it? If the former, you should get the output:

col1
-----
300
350

If the latter, you should get an empty set. But in SQL things get processed logically in an all-at-once fashion and order of evaluation of predicates shouldn’t matter. Due to this ambiguity SQL allows window calculations only in the SELECT and ORDER BY clauses of a query, making it clear that they are evaluated after the FROM, WHERE, GROUP BY and HAVING clauses.
Sure, you can achieve the filtering task indirectly by using a table expression:

WITH C AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY col2) AS rownum,
    col1, col2, col3
FROM dbo.T1
)
SELECT col1, col2, col3
FROM C
WHERE rownum BETWEEN 11 AND 20;

However, this adds a layer of complexity that if possible would be nice to avoid.
I propose implementing a new filtering clause called QUALIFY similar to the one implemented already by Teradata. In terms of logical query processing, QUALIFY fits after the SELECT clause, allowing it to refer to window calculations:

1. FROM
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT
6. QUALIFY
7. ORDER BY

If implemented, you will be able to refer to window calculations directly in the QUALIFY clause, like so:

SELECT col1, col2, col3
FROM dbo.T1
QUALIFY ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY col2) BETWEEN 11 AND 20;

Or indirectly, like so:

SELECT ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY col2) AS rownum,
col1, col2, col3
FROM dbo.T1
QUALIFY rownum BETWEEN 11 AND 20;

But this time without the need for a table expression.
Support for the QUALIFY clause would also help writing clearer code by avoiding the need to use the TOP option. TOP involves a lot of confusion surrounding the TOP ordering vs. presentation ordering. People use TOP to try to circumvent restrictions regarding the use of ORDER BY in table expressions, not realizing that unless the outermost query has an ORDER BY clause, presentation ordering is not guaranteed. The standard doesn’t support TOP, but does support TOP-like filtering clauses. However, those standard features have the same confusion aspects related to ordering.
In the past I proposed revising the TOP option to support an OVER clause with its own ordering specification instead of relying on the ORDER BY clause that traditionally serves a presentation purpose. And while we’re at it, why not also support a PARTITION BY clause. You can find my proposal here:

https://connect.microsoft.com/SQLServer/feedback/details/254390/over-clause-enhancement-request-top-over

I still feel that such a feature would be great since it would avoid the existing ambiguity and confusion, plus support a partitioning element that TOP doesn’t currently support.
I find that supporting the QUALIFY clause is even more important than the proposed revision to TOP since:
•    It gives you all the functionality that a TOP OVER (and the top-like standard features) would give you.
•    There’s no confusion around the ordering aspect.
•    You have a partitioning element in window calculations and naturally it will be reflected in the filter.
•    You can do much more than what TOP and the TOP-like standard features allow you to do, e.g., if SQL Server adds support in the future for other window calculations and enhance existing ones, those will be supported in the QUALIFY clause as well. For example, you will be able to filter by a running total or a sliding total.

If you also feel that this feature could be useful, please support it by casting your vote.

Cheers,
BG
Details
Sign in to post a comment.
Posted by Saeid Hasani on 2/19/2013 at 11:29 PM
Thanks Itzik,
This feature would be so helpful.
Posted by Microsoft on 3/16/2011 at 3:36 PM
Hello Itzik,
Thanks for your feedback. Given the relative priority of this item vs other standard compliant or missing features, I am closing this suggestion as "won't fix". Morover the workaround for getting the same behavior as QUALIFY is not that hard and functionally it provides the same benefit. Going the QUALIFY route will force us to redefine semantics and binding behavior for the query language constructs. It is also difficult to handle in terms of backward compatibility. Hope this helps explain our decision and keep your feedback coming.

--
Umachandar, SQL Programmability Team
Posted by --CELKO-- on 9/15/2010 at 12:32 PM
I would rather that MS spend the time and effort to add the RANGE/ROW sub-clause to the OVER() clause. The ability to do running stats is more important for mot programming. The work-around for it is self-joins which get ugly.

Adding this will only cause the product to drift away from ANSI/ISO Standards and become more a dialect.

Now, if you want to submit a paper to INCITS H2 with a proposal, that is another matter.

As an aside, if you see this as like a "GROUP BY.. HAVING.." clause, then shouldn't it be "PARTITION BY .. HAVING.."? And we do not have to add more keywords. For example:

SELECT col1, col2, col3,
     ROW_NUMBER()
     OVER(PARTITION BY col1
             HAVING VALUE BETWEEN 11 AND 20
                ORDER BY col2) AS rownum
FROM T1;

The VALUE keyword is used in the ANSI/ISO CREATE DOMAIN statement as a place holder in a CHECK() constraint for the actaul column name when the domain is used.
Posted by Dieter Noeth on 3/26/2010 at 1:18 PM
Hi Itzik,
i just found your blog post in SQLMag.

Hi Steve,
it's answer b)

Just ask me for any question regarding Teradata :-)

The order of evaluation is
1. FROM
2. WHERE
3. GROUP BY
4. HAVING
5. Calculate the OLAP function
6. filter the result of the OLAP function using QUALIFY
(7. SAMPLE clause)
8. SELECT just displays the results of step 5, but does no calculation
9. ORDER BY

If the manuals sometimes show a different order it's because Teradata was founded 1979 when there was no SQL yet.
TEQUEL didn't require the right order of query clauses and that's why you still can write it in any order as long as it starts with SELECT. Of course this was deprecated a long time ago :-)
Another extension to ANSI is that a column alias might be used in any query clause (similar to Access).
Nice to have (you get used to it very fast), because you don't have to cut&paste complex calculations or use Derived Tables.
Of course there are some scoping rules and you better don't rename an expression to an existing column name.
If fact it's just like search & replace in a text editor.

OLAP functions/QUALIFY are not allowed in subqueries, probably because they might be correlated (If you put it into a Derived Table within that subquery it's ok)

In views it's allowed.

Btw, there are so-called "language opportunities" in ANSI SQL, i remember one of them as "there should be a way to filter the results of an OLAP function similar to HAVING for GROUP BY". This is exactly what QUALIFY is ment for.

Dieter
Posted by Steve Kass on 3/8/2010 at 12:49 PM
Itzik,

This Teradata SQL reference gives examples where QUALIFY comes both before and after ORDER BY! See page 7-4 and page 7-21. http://www.info.teradata.com/eDownload.cfm?itemid=023430009.

Evaluation after SELECT sounds safer to me, but I haven't thought it through. I'd also be interested to know if Teradata documents the meaning of QUALIFY sufficiently well that how it works in subqueries, views, etc., can be ascertained. It still worries me...

Posted by Itzik Ben-Gan on 3/7/2010 at 11:10 AM
Steve, these are very good points, and I can see how logically processing QUALIFY between SELECT and ORDER BY could lead to confusion (identical expressions in SELECT and ORDER BY could give different results with and without the qualify clause).
I searched in Teradata’s documentation as well is in the web for examples of queries with both QUALIFY and ORDER BY, and in all examples that I found, QUALIFY appears after ORDER BY, not before it (contrary to what I thought originally). If this is also the case in terms of logical query processing, then contrary to my initial thoughts, logical query processing order in Teradata is actually:

1. FROM
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT
6. ORDER BY
7. QUALIFY

I think that it’s a better idea than SELECT -> QUALIFY -> ORDER BY because now there’s no issue with identical expressions giving different results in the SELECT and ORDER BY steps. If this was the design, would this address at least some of your concerns, Steve?

Also, could someone who is familiar with Teradata please verify logical query processing order with QUALIFY, as well as provide some info regarding any kinds of restrictions that Teradata imposes on various clauses in case QUALIFY is used, as well as any other relevant info to the discussion?
Posted by Steve Kass on 3/5/2010 at 5:08 PM
Itzik,

It's not the same as group by, because with grouping, no expression changes meaning as a result of the grouping. Aggregates are only allowed after grouping, and other expressions are required to be constant on groups. With qualify, identical expressions are allowed before and after the qualify step, but they give different results with and without the qualify clause.

Your explanation of what the query with qualify means isn't clear. Windowed functions aren't allowed in the WHERE clause because they can lead to ambiguity or at least surprise (row numbers not starting at 1, for example), so defining QUALIFY in terms of a disallowed WHERE clause doesn't resolve the problem. What if the * includes the same rank() expression used in the qualify clause? Does it's presence in your definition's where clause mean it gets evaluated separately for the Where clause, then again later for the Select clause, which in a query without qualify (like your inner query) should be evaluated after the Where filter? Or would the presence of Qualify change when all analytic functions are evaluated? Or just those used in the qualify clause? Unless you can write an ANSI-like explanation of an arbitrary Qualify query, there may be unnoticed problems.
Posted by Itzik Ben-Gan on 3/5/2010 at 1:54 PM
To add to my previous message, here’s a simple way to think of how I envision QUALIFY…

Let <query specification> be the specification of a query without a presentation ORDER BY clause. Then the following:

<query specification> [ QUALIFY <qualify predicate> ] [ <order by clause> ]

Is equivalent to:

WITH
C1 AS
(
    <query specification>
),
C2 AS
(
    SELECT *
    FROM C1
    [WHERE <qualify predicate>] -- assuming window calculations allowed here
)
SELECT *
FROM C2
[ <order by clause> ]
Posted by Itzik Ben-Gan on 3/5/2010 at 7:29 AM
Hi Steve :)

To me the answer is simple since what I had in mind was logically evaluating QUALIFY after SELECT and before ORDER BY. With this in mind, the answer to your question is a clear b.

Yes, if the feature will be implemented new crazy options will be available for programmers that are focused on trouble-making instead of writing productive code ;) but thinking practically, the common use would most likely be sensible and prevent the need for table expressions.

Let me try to respond the arguments against b:

>> 1. Now it is impossible to add a row-number column to query Q, and <<

True, assuming you mean a row number starting with 1. But that’s likely the less common case. Here if you need a result row number starting with 1 you should use a table expression. It’s important to address the common need as simply as possible, and allow the less common need to be achieved with a bit more effort.

>> 2. The behavior of all ranking functions in the SELECT list becomes very confusing <<

As long as the documentation is clear about logical query processing order (FROM->WHERE->GROUP BY->HAVING->SELECT->QUALIFY->ORDER BY), I don’t see any room for confusion.

>> they rank rows neither before nor after filtering, but after some (WHERE and HAVING) filtering, but before some (QUALIFY) filtering. <<

This is no different than talking about GROUP BY. It is logically processed after the WHERE filter and before the HAVING filter. Is it so confusing? So why should it be any more confusing if the SELECT was processed after HAVING and before QUALIFY?

>> If QUALIFY is evaluated between SELECT and ORDER BY, A ranking function in the ORDER BY clause will evaluate to something different than the same ranking function in the SELECT list. It will be impossible to display the values used for ordering in this case. <<

Again, that’s correct, but I don’t find it to be confusing as long as the documentation is clear about logical query processing order. And as mentioned, I believe that the problematic cases are not the common ones, and with those, it’s reasonable to expect to put a bit more effort in using table expressions.

BTW, think of the benefits in QUALIFY beyond window calculation. You can refer to target column aliases assigned to any expression, e.g.,

SELECT custid, orderyear,
CASE orderyear
    WHEN 2006 THEN [2006]
    WHEN 2007 THEN [2007]
    WHEN 2008 THEN [2008]
END AS val
FROM dbo.PvtOrders
CROSS JOIN
    (VALUES(2006),(2007),(2008)) AS Y(orderyear)
QUALIFY val IS NOT NULL;

Yeah, like myudkin said, the same need is applicable to other pre-SELECT clauses like GROUP BY, but it’s a far more common need for filtering purposes.

>> It takes a large international committee a long time to work out all the details of a new language feature. Microsoft already got TOP wrong <<

So true, and you know I’m all in favor of adding TOP OVER and deprecating the existing TOP, but I also want to be practical. The likelihood that this will happen is very low, especially due to the fact that the standard SQL:2008 adds a TOP-like feature that (are you sitting?) suffers from the same design flows that the existing TOP does. Check out FETCH FIRST in SQL:2008:

“A <query expression> can also optionally contain a <fetch first clause>, which may limit the cardinality of the derived table. If a <query expression> contains both an <order by clause> and a <fetch first clause>, then the rows in the derived table are first sorted according to the <order by clause> and then limited to the number of rows specified in the <fetch first clause>.”

So it appears that the standard can get it wrong too. They added this FETCH FIRST option after ORDER BY, legalizing the use of ORDER BY in table expressions when FETCH FIRST is used. Same debacle as the one with TOP. Same confusion around the nature of the result and presentation ordering when this option appears in a table expression and the top-level query doesn't have an ORDER BY clause.
Now, given the fact that the standard has a TOP-like feature, do you really think that there’s a chance that they will implement TOP OVER? I put my money on “no” and would like to invest my efforts in a feature—standard or not—that will solve similar problems and the likelyhood for its implementation is higher.

We sure miss you and your insights in the MVP group, Steve. :(
Posted by Steve Kass on 3/4/2010 at 7:01 PM
Hi Itzik,

Despite all the up votes, I have to argue against it.

Consider these queries similar to your example:

-- Query Q
SELECT col1, col2, col3
FROM dbo.T1
QUALIFY ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY col2) BETWEEN 11 AND 20;

-- Query Q_RN
SELECT col1, col2, col3, ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY col2) AS rn
FROM dbo.T1
QUALIFY ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY col2) BETWEEN 11 AND 20;


Multiple choice question:
For query Q_RN, what values appear in column [rn] of the result set?
a) 1, 2, ..., 10
b) 11, 12, ..., 20
c) Query Q_RN is not permitted
d) QUALIFY is not added to T-SQL

Of course, if QUALIFY is implemented in T-SQL 20xx, we can wait and look in Books Online, but what do you think?Microsoft has to decide which it is before they can implement the new nonstandard language feature, so let's think about it.

Argument for a): We can currently add any row_number() expression as a column in the outermost select list of a query expression to get rows numbered starting at 1. Adding an additional row_number() column should always number the result set rows starting at 1.

Argument against a): It will be impossible to include the QUALIFY expression in the result set in the obvious way. Of course, T-SQL programmers will try some very crazy things, and some might even work. Add 0+ in the select list?

SELECT col1, col2, col3, 0+ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY col2)
FROM dbo.T1
QUALIFY ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY col2) BETWEEN 11 AND 20;

Add a dummy item to the OVER clause to change the ranking expression in the SELECT list?

SELECT col1, col2, col3, 0+ROW_NUMBER() OVER(PARTITION BY col1, dummy ORDER BY col2)
FROM dbo.T1
QUALIFY ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY col2) BETWEEN 11 AND 20;

Even better, surround the qualified query with TOP (100) PERCENT ... ORDER BY ROW_NUMBER() ..., because we know ROW_NUMBER is allowed in ORDER BY and maybe this will work. (G-d help us here!)

This is all very bad. We should choose answer b), I supposed.

Arguments for b):
1. Prevents complications from choosing answer a)
2. It agrees with Itzik's articles, which show SELECT before QUALIFY in the logical query processing order :)

Arguments against b):
1. Now it is impossible to add a row-number column to query Q, and
2. The behavior of all ranking functions in the SELECT list becomes very confusing: they rank rows neither before nor after filtering, but after some (WHERE and HAVING) filtering, but before some (QUALIFY) filtering. If QUALIFY is evaluated between SELECT and ORDER BY, A ranking function in the ORDER BY clause will evaluate to something different than the same ranking function in the SELECT list. It will be impossible to display the values used for ordering in this case.

Argument for c): Avoids figuring out whether a) or b) is correct.
Argument against c): Makes QUALIFY much less useful. Your "indirect" use of QUALIFY would have to be prohibited also. Other things would be prohibited (like QUALIFY and ORDER BY that both use ranking functions).

Arguments for d):

QUALIFY does not provide any new capabilities (as myudkin already explained)
QUALIFY will be very hard to document and whatever is decided, some results will be unintuitive (should row numbers start at 11, or should row numbers start at 1 when you asked for row numbers 11 through 20?), or many things will be prohibited, or Microsoft will not solve all the problems, and when QUALIFY is used in a view (if allowed), it will behave inconsistently, and various undocumented behaviors will have to be preserved in compatibility levels < 135.

It takes a large international committee a long time to work out all the details of a new language feature. Microsoft already got TOP wrong, and had to deprecate GROUP BY ALL and SET ROWCOUNT. Still T-SQL is not a fully defined language. Getting QUALIFY right now and in combination with future features is too hard. I believe TOP..OVER is much safer, because there is no column materialized with it. It will not be easy, but TOP .. OVER repairs a badly designed proprietary extension - it doesn't add a new and questionable one.


We have a good situation now, where we must decide whether each ranking function expression belongs inside
the CTE or outside the CTE. Clumsy, but precise. No chance of misinterpretation. We can get the 1..10 numbers or the 11..20 numbers from Q_RN, whichever we want.

By the way, I was unable to find full documentation from Teradata, and in fact I found a couple of complaints that the documentation was insufficient. Perhaps there are Teradata users of Connect who can lend some insight into how Teradata answered the multiple choice question I asked.

Steve Kass
Posted by Mark Yudkin on 3/2/2010 at 3:14 AM
With so much ANS standard SQL functionality missing, is it really worth spending time implementing a non-standard and functionally rather limited enhancement (QUALIFY doesn't permit GROUP on the ROW_NUMBER() result, but I do this frequently), for which a full-function ANS-conformant workaround (nested table expression), that can be readily readable (WITH clause) already exists?
Posted by Microsoft on 2/17/2010 at 4:09 PM
Hi Itzik,

Thank you for your suggestion! This does sound like a very useful addition to T-SQL. We will be triaging the request and see if it can make it into a future release of SQL Server.

Again, thanks for your feedback!
- Tobias, SQL Server Engine
Posted by Eric Strom on 2/16/2010 at 7:12 AM
This would be extremely useful. The OVER clause was really handy, but the CTE is a necessary evil if you want to get the most use out of row_number(). You have my vote.

Thanks,
Eric
Posted by Tom Winter on 2/15/2010 at 7:53 AM
I do these kinds of ROW_NUMBER() queries ALL the time. This would be a wonderful option. Please give this serious consideration. It would make my code much cleaner.
Posted by MatthewRoche on 2/10/2010 at 1:31 PM
What a great suggestion - this would make writing clearer and more self-documenting code so much easier.
Sign in to post a workaround.