T-SQL Foundations: Thinking in Sets

If you’re a database practitioner, you’ve probably heard people talking about the importance of thinking in terms of sets when addressing querying tasks as opposed to thinking in iterative or cursor terms. You’ve also probably heard people using the term set-based to describe solutions, and perhaps you’ve even used the term yourself. But what does thinking in sets really mean? And why is it so important? I’ll try to answer these questions.

Related: T-SQL Best Practices, Part 1

Why Sets?

Before I discuss what thinking in sets means, I want to talk about the philosophical aspects of the importance of thinking in terms of sets, or more specifically in relational terms. IT is a field that advances rapidly. New technologies keep being introduced, while existing technologies keep changing or disappear because they’ve become obsolete. However, there’s at least one computing technology that remains stable: the relational model. This profound mathematical model for database management was created by Edgar F. Codd in 1969. It hasn’t remained completely unchanged since its creation. Codd revised it, and it evolved thanks to efforts by Chris Date, Hugh Darwen, and others. But compared with other computing technologies, it’s much more stable. Similarly, the languages based on the relational model, such as SQL and its dialects (e.g., T-SQL), keep evolving but not as rapidly as other computing technologies.

One reason for the longevity of the relational model and the related languages is that the model is based on two strong mathematical foundations: set theory and predicate logic. If you learn the basic principles behind set theory and predicate logic, you’ll better understand the code you’re writing and hence write much better code.

The relational model isn’t perfect; perhaps someday a different model will replace it. But as long as it’s the dominant model, you want to flow with it and not against it. This means that you should use set-based solutions and not iterative or cursor solutions.

SQL (and its dialects) isn’t perfect either—in fact, SQL deviates from the relational model in many ways. It, too, might be replaced by a different language in the future. But at the moment, SQL is the de facto industry language for relational database management systems (RDBMSs). So, as long as you understand relational principles, you can use SQL in a relational way.

What Is a Set?

To understand what a set is, it’s helpful to look at how Georg Cantor, the creator of the mathematical set theory, defined a set. According to Joseph W. Dauben’s book Georg Cantor (Princeton University Press, 1990), Cantor defined a set as follows: “By a ‘set’ we mean any collection M into a whole of definite, distinct objects m (which are called the ‘elements’ of M) of our perception or of our thought.”

Observe the italicized words whole and distinct. Also observe that nowhere in the definition of a set does Cantor refer to the order of the elements. Let’s analyze these three aspects of a set in more detail.

A Set As a Whole

When you interact with a set, you need to think of it as a whole—not as individual elements. For example, when writing SQL queries, you interact with the input table as a whole and not with its individual rows. That’s set-based. When using a cursor, you deal with one row at a time. That’s not set-based.

Some people think that if you don’t explicitly use a cursor, the solution must be set-based. Consider, for example, a solution that implements the following logic, expressed in pseudo code:

retrieve top (1) row order by key
while current key is not null
begin
  process row with current key
  retrieve top (1) row
  where key > current key order by key
end

Although this iterative process doesn’t use a T-SQL CURSOR object, it deals with one row at a time. It’s far more expensive than an actual cursor in terms of its I/O footprint. At least with the cursor, you perform one pass over the data, so its I/O imprint is much less.

I already discussed the philosophical reasons for using set-based solutions. There’s also a more practical reason—performance. SQL Server’s implementation of T-SQL iterative constructs such as the WHILE loop is very inefficient. If you run a WHILE loop for a million iterations in T-SQL and a WHILE loop for a million iterations in a procedural language such as C#, you’ll see a striking difference in performance. On my laptop, the T-SQL WHILE loop took more than 100 seconds to complete, whereas the C# WHILE loop took under 10 seconds. That’s an order of magnitude difference. So, iterations in T-SQL are slow.

Cursor processing is also implemented inefficiently in T-SQL. You pay extra overhead for each record processing of the cursor that you don’t pay for when processing the same number of rows with a set-based query. If you want to see how much overhead, you can perform this basic test: Issue a simple SELECT * query against a table that’s large enough (say, a few million rows), measuring the runtime of the set-based query (with results discarded). Then, repeat this test, except this time use a cursor instead of a set-based query. You can find such a performance comparison with code samples in the free sample chapter for the book Inside Microsoft SQL Server 2005: T-SQL Programming. You can download “Chapter 03 - Cursors”. You’ll see that the cursor code runs several dozen times slower compared with the set-based code.

To understand why this is true, suppose that you call the cost of SQL Server’s implementation of scanning n rows with a set-based query n. Then, taking the extra cost for each record processing with the cursor (call it o for cursor overhead per row), the cost of scanning n rows with a cursor can be described as follows:

n + (n ´ o)

At the moment, the o part is quite high. As I previously mentioned, the cursor code runs several dozen times slower than the set-based code. This means that using a cursor as your starting point puts your code at a disadvantage.

Distinct

According to Cantor’s definition, a set has distinct elements. Codd amusingly described the distinct requirement by saying that if a fact is true, saying it twice doesn’t make it truer. SQL deviates from set theory (and hence the relational model) because it allows duplicates in a table if you don’t enforce a key. And even if you do have a key in a table, a query can still return duplicates. For example, consider the query

USE AdventureWorks2008R2;
SELECT TerritoryID
FROM Sales.SalesOrderHeader;

The query is looking for territories where orders were placed. SQL (or in this case, T-SQL) doesn’t attempt to remove duplicates, and that isn’t relational. A truly relational language doesn’t return duplicates.

SQL does provide tools to get closer to writing code in a relational way. So it’s more precise to say that SQL is based on multi-set theory as opposed to set theory. If duplicates are possible in the result, you can eliminate them by adding a DISTINCT clause, as in

SELECT DISTINCT TerritoryID
FROM Sales.SalesOrderHeader;

Some people recommend adding a DISTINCT clause to all queries. However, in cases where duplicates don’t exist to begin with, adding a DISTINCT clause might cause SQL Server to perform extra work in an attempt to eliminate duplicates. So, I recommend adding DISTINCT only in cases in which duplicates can appear in the result.

Order

One of the most important aspects of sets is implied in Cantor’s definition. He doesn’t mention the order of the elements in a set because the order isn’t important. That’s one of the most difficult concepts for people to grasp when querying a table—namely, understanding that there are no assurances that the data being queried will be consumed in a particular order. For example, some people assume that when they query a table that has a clustered index, they’ll get the data back in clustered index order. From the language perspective, that’s not guaranteed because what you’re querying is a set and what you’re getting back is a set. SQL Server knows that the language doesn’t provide any ordering assurances, so it scans the data in any order that it likes—including not in clustered index order. People often use techniques that violate set-based and relational concepts in the sense that the results’ correctness relies on the data being consumed in index order, which SQL Server has never guaranteed. I cover a couple of classic examples in my web-exclusive articles “Multi-Row Variable Assignment and ORDER BY” and “Ordered UPDATE and Set-Based Solutions to Running Aggregates”.

So, if a set has no order to its elements, why does SQL supports an ORDER BY clause? The answer is that traditionally the ORDER BY clause served a presentation ordering role. Later it got overloaded with other roles, which I’ll discuss shortly. A query with an ORDER BY clause truly doesn’t return a relational result but rather what standard SQL calls a cursor. Conceptually, a cursor is a result that supports an ordering property. If you want, you can iterate through its records one at a time.

If you’ve ever wondered why T-SQL doesn’t allow defining a table expression (e.g., view, derived table, inline table function, common table expression—CTE) based on a query with an ORDER BY clause, the answer should now be clear: A table expression is supposed to represent a table. A table is supposed to represent a relation. (If you’re unfamiliar with relational terminology, see the sidebar “Commonly Misused Data Management Terms.”) A relation is a set. A set has no guaranteed order to its elements.

The addition of the TOP filter to T-SQL has brought a lot more confusion. TOP lets you filter data based on a number of rows and ordering. Nothing in this filter conceptually violates relational concepts; it’s just a filter. However, the implementation of the TOP filter doesn’t allow specifying its own ordering but rather overloads the already existing ORDER BY clause with this extra role. So, when you have a query such as

SELECT TOP (3) SalesOrderID,
  OrderDate, CustomerID
FROM Sales.SalesOrderHeader
ORDER BY OrderDate DESC,
  SalesOrderID DESC;

the ORDER BY clause provides two assurances:

You’ll get the three most recent orders.
The rows in the result will be presented from most recent to least recent.

Unfortunately, you can’t separate the TOP filter ordering from the presentation ordering directly. The same limitation applies to the standard OFFSET/FETCH filter that’s being introduced in the next version of SQL Server, which is code-named Denali, when it’s used in code such as

SELECT SalesOrderID, OrderDate,
  CustomerID
FROM Sales.SalesOrderHeader
ORDER BY OrderDate DESC,
  SalesOrderID DESC
OFFSET 0 ROWS
  FETCH FIRST 3 ROWS ONLY;

How do you define a table expression based on a query that has such a filter if the filter forces presentation ordering and by doing so violates the relational properties that the table expression is supposed to possess? Both T-SQL and standard SQL decided to resolve this problem by allowing you to use an ORDER BY clause in a query that defines a table expression when TOP or OFFSET/FETCH is also specified. But what many people don’t realize is that an outer query against the table expression doesn’t guarantee presentation ordering unless you add an ORDER BY clause in that outer query. If you understand relational concepts, this makes perfect sense. However, it would’ve been much better if the design of TOP and OFFSET/FETCH had allowed you to specifying ordering that’s specific to the filter and not tied to the presentation ordering.

A great example of a beautifully designed language element that has ordering as part of the calculation’s specification, without relying on the data being consumed in certain order or without tying the calculation’s ordering to the presentation ordering, is window functions. (Window functions are a subset of the ordered set functions in standard SQL.) For example, to calculate a rank attribute for rows based on quantity ordering, you’d use

RANK() OVER(ORDER BY qty) AS rnk

The input to the query is a set, and the output is a set. You aren’t forced to add presentation ordering to the query. If you decide to add presentation ordering, it doesn’t have to be the same as the ordering in the window function. That’s a great design feature. If TOP and OFFSET/FETCH had been designed in a similar way, there would be a lot less confusion about their use.

Make a Conscious Effort

Why do most people tend to think in cursor or iterative terms and not in set-based terms? Probably because they have experience with procedural programming and using cursor or iterative solutions feels like an extension of what they already know (i.e., open a file, fetch the first record, iterate through the rest of records until the end of the file is reached). Thinking in sets is different, but it’s the way you’re supposed to think when using a relational language. Therefore, you need to make a conscious effort to think in sets. When you do, SQL starts to make more sense.

Comments

Plain text