Query Filters Based on Number of Rows and Logical Ordering

Last month I started a two-part series about query filters. I discussed filters based on predicates, including filters that T-SQL supports, such as ON, WHERE, and HAVING, and filters that T-SQL doesn’t support but that are defined by standard SQL or other database platforms, such as the standard FILTER clause for aggregates and the QUALIFY clause that Teradata supports. This month I focus on a different kind of query filter—one that isn’t based on a predicate, but rather on a requested number of rows and logical ordering.

As a reminder, in my examples I use a sample database called InsideTSQL2008. You can download the source code for this sample database from InsideTSQL.com.

TOP

The TOP filter is a nonstandard filter that was added in SQL Server 7.0 to address a common filtering need—the need to filter rows based on some logical ordering and a requested number (or percent) of rows. You specify the requested number of rows right after the TOP clause, and the ordering is defined by the query’s traditional ORDER BY clause. For example, to return the three orders with the highest order values, you’d use the following query:

USE InsideTSQL2008;

SELECT TOP (3) orderid, orderdate, custid, empid, val
FROM Sales.OrderValues
ORDER BY val DESC;

On the surface, the TOP feature’s design seems straightforward. It’s clear that this query will return the three orders with the highest order values, and it seems obvious that the query will return those rows from highest to lowest order values in terms of presentation ordering. But in practice this design is a source of much confusion and trouble. To see why, we need to go back to T-SQL’s foundations.

T-SQL is based on standard SQL (both ISO and ANSI standards), which in turn is based on the relational model, which in turn is based on two mathematical branches—set theory and predicate logic. Last month’s discussion revolved around predicate-based filters, which gave us a glimpse of the relevance of predicate logic to T-SQL. As for set theory, understanding some of the key concepts from this mathematical branch can give you better insight to the meaning of the code you write using T-SQL.

What’s most relevant to our discussion is the fact that there’s no order to the elements of a set. This concept might sound simple, but it’s a great source of confusion in writing T-SQL code. A table in a SQL Server database represents a relation from the relational model, which in turn represents a set from set theory. You need to keep in mind that in logical terms there’s no order to a table’s rows—never mind what you know about the physical layer that implements the table (e.g., indexes). Similarly, the result of a query without a presentation ORDER BY clause is a table result; therefore, there’s no guarantee for the order of the rows in the output. If you add a presentation ORDER BY clause to a query, the nature of the result is inherently different than without one—you get what standard SQL conceptually considers to be a cursor. The idea behind a cursor is that it represents a result with guaranteed order, meaning that it’s nonrelational.

If you’re wondering why you should care whether the result is conceptually a relation or a cursor, you should note that SQL Server doesn’t typically let you define a table expression, such as a view, inline table-valued function, derived table, or common table expression (CTE), based on a query with an ORDER BY clause. For example, if you try defining the following view:

CREATE VIEW Sales.MySortedOrders
AS

SELECT orderid, orderdate, custid, empid, val
FROM Sales.OrderValues
ORDER BY val DESC;
GO

you get the error in Figure 1.

The error message doesn’t go into a great detail of explanation but simply says the ORDER BY clause isn’t allowed. The reason for disallowing an ORDER BY clause in the view’s query is that a view is supposed to represent a table; a table is supposed to represent a relation; a relation is supposed to represent a set; and a set is unordered. As an aside, Oracle supports the concept of ordered views, which is one of the reasons people try to define them in SQL Server. But you should realize that support for such a construct is in complete contradiction with the relational model.

Going back to TOP, this feature’s design doesn’t let you indicate the logical ordering for the TOP filter separately from and independently of presentation ordering; instead, it overloads the existing ORDER BY clause that was designed originally just for presentation ordering purposes with this extra meaning for TOP. The two orders are therefore intertwined and inseparable. And what if you want to use TOP in a query that’s used to define a table expression? You need an ORDER BY clause to give meaning to the TOP filter. But there’s a conflict with the restriction that disallows ORDER BY in table expressions. The way SQL Server’s designers resolved this conflict was by making an exception regarding ORDER BY in table expressions when TOP is specified. So for example, the following is allowed:

CREATE VIEW Sales.MyTop3Orders
AS

SELECT TOP (3) orderid, orderdate, custid, empid, val
FROM Sales.OrderValues
ORDER BY val DESC;
GO

Because SQL Server allows an ORDER BY clause with TOP in table expressions, some people try to abuse this fact and create a “sorted” table expression by using TOP (100) PERCENT—for example:

CREATE VIEW Sales.MySortedOrders
AS

SELECT TOP (100) PERCENT orderid, orderdate, custid, empid, val
FROM Sales.OrderValues
ORDER BY val DESC;
GO

But here’s the part many people aren’t aware of—in accord with the relational model, a query against a table expression doesn’t guarantee presentation ordering unless the outermost query has a presentation ORDER BY clause. Presentation ordering is guaranteed only in the immediate level, but not in outer levels of the code. So when you issue the following query:

SELECT * FROM Sales.MySortedOrders;

you don’t get a guarantee for presentation ordering. You might or might not get the data in val DESC ordering—there’s no guarantee. For example, when I ran this code, I got the data in the order shown in Figure 2.

When the TOP query is the outermost query, the ORDER BY clause serves both the presentation and TOP filtering-related purposes. For example, consider the following query:

SELECT TOP (3) orderid, orderdate, custid, empid, val
FROM Sales.OrderValues
ORDER BY val DESC;

Here you’re guaranteed both to get the three orders with the highest order values and that they’ll be presented from highest to lowest values.

Besides this confusing aspect of the TOP design, there’s also the fact that because TOP doesn’t have its own ordering specification that’s separate from the presentation ordering, you can’t directly filter with TOP using different ordering than the presentation one. What if you wanted to filter the three orders with the highest order values and present them in order date ordering? You’d need to use a table expression, like so:

WITH C AS
(
  SELECT TOP (3) orderid, orderdate, custid, empid, val
  FROM Sales.OrderValues
  ORDER BY val DESC
)
SELECT *
FROM C
ORDER BY orderdate;

Now that the TOP query is used in a table expression definition, that query’s ORDER BY clause serves only the TOP related ordering, whereas the outer query’s ORDER BY clause serves the presentation ordering.

SQL Server 2005 introduced support for TOP with modification statements, which led to even more trouble because the design of the TOP feature relies on the presentation ORDER BY clause to also give meaning to the TOP filtering, and because modification statements don’t have a presentation ORDER BY clause. The solution was to simply not allow an ORDER BY clause. So you can add TOP to modification statements—for example:

DELETE TOP (50) FROM Sales.Orders;

However, because you can’t indicate an ORDER BY clause, you can’t really control which 50 rows will be deleted—it’s a pure matter of optimization. SQL Server will delete the 50 rows it physically happens to stumble onto first. I’m not saying there are no uses for this capability with modifications, but the uses are limited.

You can use a table expression as a workaround. For example, to delete the 50 oldest orders, use:

WITH C AS
(
  SELECT TOP (50) *
  FROM Sales.Orders
  ORDER BY orderdate, orderid
)
DELETE FROM C;

Of course, if the TOP clause supported its own ordering specification, there wouldn’t be any confusion and you wouldn’t need to use a workaround.

One final thing that’s missing from the TOP design, other than supporting its own ordering, is the ability to define partitioning. For example, it would be nice if you could request the three orders with the highest order values for each customer.

FETCH FIRST

I kept hoping that if standard SQL ever came up with a design for a filter based on number of rows and logical ordering, it would be one that supported its own ordering specification, as well as a partitioning concept. The SQL Server 2008 standard indeed introduced a new filtering clause called FETCH FIRST that’s based on number of rows and logical ordering, but it’s very similar to TOP. The FETCH FIRST clause also overloads the existing ORDER BY clause with this extra filtering-related meaning instead of supporting its own ordering specification. Like TOP, FETCH FIRST doesn’t support a partitioning concept. In terms of syntax, you specify it after the ORDER BY clause, like so:

SELECT orderid, orderdate, custid, empid, val
FROM Sales.OrderValues
ORDER BY val DESC
FETCH FIRST 3 ROWS ONLY;

Because this filtering clause is tied to the traditional ORDER BY clause, the standard had to legalize the use of ORDER BY in table expressions when the FETCH FIRST clause is also specified. If doubt ever existed regarding guarantees for presentation ordering when a table expression query has an ORDER BY clause but the outer query against the table expression doesn’t, here’s what the standard has to say:

“A derived table is a table derived directly or indirectly from one or more other tables by the evaluation of an expression, such as a <query expression> or <table expression>. A <query expression> can contain an optional <order by clause>. The ordering of the rows of the table specified by the <query expression> is guaranteed only for the <query expression> that immediately contains the <order by clause>.”

Unless the outermost query has an ORDER BY clause, no presentation ordering guarantee exists. Even though the behavior is well defined by the standard, you need to be aware of it to know what to expect. This design is just as confusing as TOP—even if SQL Server adds support for it in the future, users will have the same trouble and confusion as with TOP. Most people won’t know what the standard says about presentation ordering guarantees and will therefore keep trying to abuse such features to create illogical constructs such as sorted views.

TOP OVER

What kind of design for a filtering clause based on number of rows and logical ordering wouldn’t be confusing? The standard addressed similar needs with window functions, designing them with an OVER clause in which you can indicate partitioning and ordering that’s specific to the window function. So why not design the filtering clause similarly? For example, imagine you could use the following query to return the three orders with the highest values for each customer:

SELECT TOP (3) OVER(PARTITION BY custid ORDER BY val DESC)
  orderid, orderdate, custid, empid, val
FROM Sales.OrderValues;

Wouldn’t that be great? There’s no confusion between the filter-related ordering and presentation ordering, and you can define partitions. In addition, this design lets you define filter-related ordering with modification statements.

Perhaps the filtering clause should use a different keyword than OVER, because OVER has a very specific meaning in the context of window functions. A better alternative might be the following:

SELECT TOP (3, PARTITION BY custid ORDER BY val DESC)
  orderid, orderdate, custid, empid, val
FROM Sales.OrderValues;

But the idea remains the same—no ambiguity; no confusion. If you think this feature would be useful in SQL Server, you can vote for it at the Microsoft Connect page.

In the meantime, you can get similar functionality by using the ROW_NUMBER function like so:

WITH C AS
(
  SELECT ROW_NUMBER() OVER(PARTITION BY custid ORDER BY val DESC) AS rownum,
    orderid, orderdate, custid, empid, val
  FROM Sales.OrderValues
)
SELECT *
FROM C
WHERE rownum <= 3;

Of course, the outer query isn’t limited to being a SELECT query. It can also be a DELETE or an UPDATE statement.

Write Better Code

Filtering is a common and fundamental need when querying data in SQL Server. You’d think that filtering would be a simple and straightforward task, but as I discussed last month and this month, filtering is actually a deep and involved subject. I described both predicate-based filters and filters based on number of rows and logical ordering, and I described filters that SQL Server supports, as well as filters it doesn’t support. To write better code, you must first understand T-SQL’s foundations, and you need to appreciate the subtleties and complexities of filters.

Comments

Plain text