Logical Query Processing: What It Is And What It Means to You

If I had to name a foundational topic in T-SQL that I felt that was the most important for T-SQL practitioners to know, without a doubt I would say “logical query processing.” I should say from the get-go that this term doesn’t originate in any formal source but rather is a term that I coined for my students and readers to illustrate the logical, or conceptual, interpretation of a query. A solid understanding of this topic is key to writing correct and robust code. Whenever I teach about T-SQL--whether it’s a fundamentals class or an advanced one--I like to start with this topic since I feel that it is critical to beginners and experienced T-SQL practitioners alike.

There’s so much to say about logical query processing that a single article won’t be enough. This article is the first in a series, and here I’ll explain what the term means to me and provide an overview. In future articles I’ll delve into more details about individual query clauses.

What does logical query processing mean?

In order to understand what I mean by logical in logical query processing, I need to start with the foundations of T-SQL. Figure 1 illustrates those foundations.

Figure 1: Foundations of T-SQL

Figure 01 - Foundations of T-SQL.jpg

T-SQL, or Transact-SQL, is the dialect of standard SQL that Microsoft uses in a number of its products that are part of the Microsoft data platform, like Microsoft SQL Server and Microsoft Azure SQL Database. Using T-SQL, you manage and manipulate the data in those products. T-SQL—the Microsoft-specific dialect—is mainly based on standard SQL with some extensions. SQL (for structured query language) is defined by both the International Organization for Standardization (ISO) and by the American National Standards Institute (ANSI).

SQL, in turn, is based mainly on the relational model—a semantic model representing data that was created by Edgar F. Codd in 1969. The relational model, in turn, is based on two main mathematical branches: set theory and predicate logic.

The relational model defines an important principle called the “physical data independence principle.” What it means is that the model and the language based on it define the logical aspects of the data and its manipulation—in other words, the meaning. The model intentionally avoids getting into physical implementation details--like how the data is physically organized, stored and accessed, and how queries are physically processed (optimized and executed). The physical part is the responsibility of the implementation (the database platform you’re using). You are never supposed to derive meaning based on physical data considerations, rather only based on the logical model.

A good example for a violation (meaning, incorrect expectations) of the physical data independence principle is when people query a table without an ORDER BY clause and assume that the data will be returned in clustered index order. As far as the relation model is concerned, a relation’s body is a set of tuples (SQL calls a relation a table and a tuple a row), and a set in mathematics has no order. When you query a relation, you get a relation back, and, as such, there’s no guarantee that you will get the result back in any particular order.

Microsoft’s implementation follows the physical data independence principle, and therefore does not guarantee that you will get the data back from a query in any particular order unless you add an ORDER BY clause in the outer query. A similar violation of the principle is when people update data and the solution’s correctness relies on the data being updated in clustered index order (do a Web search on “quirky update” to see what I mean).

So, in logical query processing, I refer to the logical, or conceptual, interpretation of a query based on standard SQL, which describes how the data from the input tables transforms through a series of steps into the final query result. This description is independent of any physical implementation considerations. As such, it’s important that while you learn about this topic you do not concern yourself with performance considerations since that’s not part of the responsibility of SQL and the relational model.

Logical query processing order

The first curious thing to observe about logical query processing is that the order in which you type the main query clauses is different than the order in which they get logically interpreted. This is illustrated in Figure 2.

Figure 2: Typed order of query clauses with logical query processing step numbers

5 SELECT 5.2 DISTINCT 7 TOP 
    5.1 
1 FROM 
2 WHERE 
3 GROUP BY 
4 HAVING 
6 ORDER BY 
7 OFFSET  FETCH

The main query clauses appear in the figure in the order that you are required to type them, and the step numbers represent their logical query processing order. Observe that, curiously, even though you type the SELECT clause first, the FROM clause is the one that is logically processed first. In fact, logical query processing gets to the SELECT clause only in the fifth step.

You have to wonder what’s behind this design. The answer lies in the fact that the designers of SQL wanted the language to resemble English. In fact, the original name of the language was SEQUEL (for Structured English QUEry Language), but due to a trademark dispute, the language name was changed to SQL. Now, think about how instructions in English are structured. For example, consider the request “Bring me the T-SQL Querying book from the shelf in my office.” Notice that the instruction doesn’t start with the location of the object (the office), but rather with the object itself (the book). But if you think about the order in which the instruction needs to be carried out, first the person following it needs to go to the room, then find the shelf in the room, then pick up the book and bring it to you.

In a similar manner, the typed order of the query clauses starts with the SELECT clauses indicating the columns that you want to return, then the FROM clause indicating the input tables and table operators, then the WHERE clause with the row filter, and so on. However, logical query processing has to start by identifying the tables involved (the FROM clause), then proceed to handle any row filtering (WHERE), grouping (GROUP BY), group filtering (HAVING), and only then can it get to the columns you want to return (SELECT).

Once you understand the difference between the typed order of the query clauses and their logical query processing order, many things about SQL that would have otherwise seemed strange or wrong start making sense.

Figure 2 depicts the required typing order of the query clauses. Figure 3 depicts the same clauses based on logical query processing order.

Figure 3: Logical query processing order of query clauses

1 FROM 
2 WHERE 
3 GROUP BY 
4 HAVING 
5 SELECT
    5.1 SELECT list
    5.2 DISTINCT
6 ORDER BY 
7 TOP / OFFSET-FETCH

Figure 4 depicts logical query processing as a detailed flow chart.

Figure 4: Logical query processing flow chart

Figure 04 - Logical query processing flow chart.jpg

The inputs are the tables that you specify in the FROM clause. Each step operates on the table or tables that are provided to it as inputs and returns a (virtual) table as its output. The very last step returns the final query result, either to the application or to an outer query.

If you like challenges, feel free to try and figure out the different steps in the flow chart, but note that I will explain much of the detail in upcoming articles in the series. For now, it is provided to you as a reference. Make sure you have Figures 2, 3 and 4 in front of you when you read the future articles in the series. It would be a good idea to print those.

Same database and sample queries

In my examples I will use a sample database called TSQLV4. You can download the source code that creates and populates it here: http://tsql.solidq.com/SampleDatabases/TSQLV4.zip. This database represents a simple order entry system with tables called Sales.Customers, Sales.Orders (with order headers), Sales.OrderDetails (with order lines) and others. This database is compatible with all versions of SQL Server 2008 through 2016, as well as with Azure SQL Database. Make sure that you have this database installed and available in order to run the code samples in the series. Once installed, use the following code to switch context to this database:

USE TSQLV4;

To explain logical query processing I’ll use two sample queries—a simple and a complex one. I’ll refer to them as simple sample query and complex sample query. Listing 1 has the simple sample query.

Listing 1: Simple sample query

SELECT C.custid, COUNT( O.orderid ) AS numorders
FROM Sales.Customers AS C
  LEFT OUTER JOIN Sales.Orders AS O
    ON C.custid = O.custid
WHERE C.country = N'Spain'
GROUP BY C.custid
HAVING COUNT( O.orderid ) <= 3
ORDER BY numorders;

This query returns customers from Spain who placed three or fewer orders. For qualifying customers, it returns the customer ID and the order count. The query presents the rows sorted by the order count, ascending.

You’re probably accustomed to “read” the query following the typed order of the clauses. Try to read it now based on logical query processing order, like so:

1. FROM: The query joins the Customers and Orders tables based on a match between the customer’s customer ID and the order’s customer ID. The query uses a left outer join in order to preserve customers who didn’t place any orders.

2. WHERE: The query filters only rows where the customer’s country is Spain.

3. GROUP BY: The query groups the remaining rows by the customer’s customer ID.

4. HAVING: The query filters only customer groups that have three or fewer orders.

5. SELECT: For the remaining groups, the query returns the customer’s customer ID and the order count, naming the column numorders.

6. The query presents the result rows sorted by numorders.

Listing 2 has the complex sample query.

Listing 2: Complex sample query

SELECT TOP (4) WITH TIES
  C.custid,
  A.custlocation,
  COUNT( DISTINCT O.orderid ) AS numorders,
  SUM( A.val ) AS totalval,
  SUM( A.val ) / SUM( SUM( A.val ) ) OVER() AS pct
FROM Sales.Customers AS C
  LEFT OUTER JOIN
  ( Sales.Orders AS O
      INNER JOIN Sales.OrderDetails AS OD
        ON O.orderid = OD.orderid
        AND O.orderdate >= '20160101' )
    ON C.custid = O.custid
  CROSS APPLY ( VALUES( CONCAT(C.country, N'.' + C.region, N'.' + C.city),
            OD.qty * OD.unitprice * (1 - OD.discount) ) 
      ) AS A(custlocation, val)
WHERE A.custlocation IN (N'Spain.Madrid', N'France.Paris', N'USA.WA.Seattle')
GROUP BY C.custid, A.custlocation
HAVING COUNT( DISTINCT O.orderid ) <= 3
ORDER BY numorders;

Wanna try and explain what the query does? It will all make sense by the end of the series! For now, I suggest that you make a printout of both the simple sample query and the complex sample query, and make sure they’re also in front of you when you read the upcoming articles in the series.

What’s next?

Understanding logical query processing is key to writing correct and robust SQL code. It’s also key to understanding certain limitations of the language and coming up with workarounds. Also, unlike the popular belief, logical query processing is the answer to the ultimate question of life, the universe and everything. This article is the first in a series on the topic where I provided an overview of the subject and resources that you will use throughout the series. In upcoming articles we’ll delve into the gory details, dissecting individual clauses. In the meanwhile, as homework, see if you can answer the following questions:

1. What’s the difference between the ON clause and the WHERE clause?

2. Is there a guarantee that the expressions in the WHERE clause will be evaluated in a specific order; in other words, can you rely on a short-circuit?

3. What’s the purpose of NULLs, and what kind of complexities do they add to the language?

4. When you join a table with a derived table, can the derived table query refer to columns from the other table in the join, and why?

5. Can you use a column alias that you define in the SELECT clause in the WHERE clause, and why?

6. Can you use an alias that you define in the SELECT clause in other expressions in the SELECT clause, and why?

7. Can you use an alias that you define in the SELECT clause in the ORDER BY clause, and why?

8. How do you make a column alias available to clauses like WHERE, GROUP BY, HAVING, SELECT?

9. What’s the difference between the nature of a query result when you have a presentation ORDER BY clause in the query and when you don’t?
10. What’s the difference between a group aggregate function and a window aggregate function?

11. If you have an ORDER BY clause in an inner query, is the presentation order of the outer query guaranteed?

12. If you don’t have a presentation ORDER BY clause in the query, are there any circumstances where the query presentation order is guaranteed?

Comments

Plain text