T-SQL for Starters: Simple and Correlated Subqueries

As a follow-on to last month's discussion of joins ("Data Retrieval with Joins," June 1999), this month I examine subqueries. You will see how, in some cases, you can replace subqueries with joins for more efficient code. Also, I examine simple and correlated subqueries and explain why a correlated subquery can drastically affect query performance.

What Is a Subquery?

A subquery is a query that SQL Server must evaluate before it can process the main query. Consider an example from the Pubs database. If someone asks, "Which books cost more than the average price?" the first step in finding the answer is to determine the average price of all the books. Only then can you look through the Titles table and find the books that cost more than the average. Screen 1 shows this query as a SQL statement and the Results Grid.

You can use a subquery anywhere you can use an expression. Always enclose a subquery in parentheses, so that the query parser can tell where the subquery begins and ends. In the SQL statement in Screen 1, the subquery is

(SELECT AVG(price) FROM titles)

The parser evaluates the subquery and substitutes the result into the main query. Running the subquery for this example gives you a value of $14.77. Therefore, the main query is asking

SELECT title, price
FROM titles
WHERE  price > $14.77

In this simple subquery, the query optimizer evaluates the subquery once and uses the result to evaluate the outer query. In other words, the result is one value. For a slightly more complex example, say you want a list of books sold on a particular date. The Sales table refers only to the title_id, so you first have to get the list of title_ids for books sold on that date and then relate those title_ids to the titles in the Titles table. Screen 2 shows the SQL statement and the Results Grid for this query.

If you run the subquery

(SELECT DISTINCT title_id FROM SALES WHERE ord_date  = 'Sep 14 1994')

separately, you get three title_id values—BU1032, MC3021, and PS2091—for books sold on the date specified. In other words, those books have entries in the Sales table for that date. So now the main query evaluates the SQL statement as if it reads:

SELECT title_id, title
FROM titles
WHERE title_id IN ('BU1032', 'MC3021', 'PS2091')

Although the subquery returns three values, they are all from the title_id column. This happens because a SELECT statement within another SELECT statement cannot return values to the outer SELECT statement from more than one column. It must return either a single value, a list of values from a single column, or a true or false (which is itself a single value). Running this query as a JOIN statement might be more efficient. This JOIN statement would be

SELECT DISTINCT titles.title_id, title
FROM titles JOIN sales ON titles.title_id  =  sales.title_id 
WHERE ord_date  = 'Sep 14 1994'

Screen 3 shows the SQL statement and the Execution Plan output for the subquery. The Execution Plan shows how the optimizer processes the query. If you compare the Execution Plan for the subquery and JOIN statements, you will see that the optimizer processes these queries in an identical manner. Keep in mind that these examples are simple queries against a very small database. The optimizer might handle the queries differently when dealing with a large database.

Correlated Subqueries

From a programmer's perspective, think of a subquery as a subroutine that the optimizer must run before the main routine can complete. A correlated subquery is one that depends on a value in the outer query. In programming terms, you pass the subroutine an argument, then the subroutine evaluates the query and returns a result. You move on to the next record and repeat the process. The net effect is that the subquery runs once for every row in the main query; this situation is inefficient.

A possible alternative is to rewrite the correlated subquery as a join. However, some situations require a subquery.

Suppose you want a list of all the authors in the Pubs database who have written books. Because you're not concerned with how many books each author has written or what the titles are, you can use the WHERE EXISTS command to see whether the author has an entry in the Titleauthor table. Screen 4 shows this SQL statement, which returns a list of 19 of 23 authors who have written books. Notice that in the subquery, the SELECT statement uses the asterisk, which ordinarily means return all columns. But the convention with EXISTS is that you use an asterisk, rather than specifying a column, because you do not expect SQL Server to return any data values—EXISTS returns only Boolean values (i.e., true or false). What makes this SELECT statement a correlated subquery is that the optimizer evaluates it for each au_id in the Authors table. The optimizer passes each au_id value in the outer query to the inner query, where the optimizer uses that value to run the SELECT statement. The correlated subquery will not run alone, unlike the simple subqueries you looked at earlier. If you try to run a correlated subquery alone, SQL Server tells you that you have referenced the authors.au_id column without specifying the Authors table in the query.

Rather than incur the overhead of this correlated subquery, you can write it as a join. The SQL statement for this join operation is

SELECT distinct authors.au_id, au_lname, au_fname 
FROM authors INNER JOIN titleauthor 
ON authors.au_id = titleauthor.au_id

The INNER JOIN, as I discussed last month, gives you the rows in which a match occurs between the two tables. So the results will show only the 19 published authors. What about the four authors who don't have published books? You can easily get a list of those authors by adding a NOT condition to the correlated subquery:

SELECT au_id, au_lname, au_fname 
FROM authors
WHERE NOT EXISTS (SELECT * FROM titleauthor    	WHERE titleauthor.au_id = authors.au_id)

But any time you see a NOT in a query, think twice before accepting it as the best approach. NOT conditions are usually inefficient. You get the required results, though.

Writing this query as a join is a little trickier:

SELECT DISTINCT authors.au_id, au_lname, au_fname, titleauthor.au_id
FROM authors LEFT OUTER JOIN titleauthor
ON authors.au_id = titleauthor.au_id  
WHERE titleauthor.au_id IS NULL

This query retrieves a list of all the authors and then looks for the rows where the inner table—in this case Titleauthor—shows a NULL value. Be careful with this query; you must run it using the ANSI (SQL-92) JOIN syntax rather than including the JOIN in the WHERE clause.

Also, take care that what you are writing is a correlated subquery and not a simple subquery. For example, this query returns all 23 authors, which you know is not correct:

SELECT au_id, au_fname, au_lname
FROM authors
WHERE EXISTS
	(SELECT DISTINCT au_id 
	FROM titleauthor)

This incorrect result occurs because the programmer did not pass the author ID into the subquery. This is a simple subquery, which can run alone without reference to the outer query. The simple subquery runs once, and as long as it finds even one row in the Titleauthor table, it evaluates as true. So for each author, the query simply asks for the three columns. It runs quickly, because it does not evaluate the subquery multiple times. If a correlated subquery runs a lot faster than you expected, check to see whether the subquery is correct.

It All Bad News?

So, are correlated subqueries always inefficient? No, their practicality depends on how you use them. For example, suppose the Pubs database is very large, with thousands of authors. If you want to find out who has written books, the correlated subquery is inefficient because it would run thousands of times. An INNER JOIN might work better based on the comparative results of the estimated execution plans.

But what if you want to search for just one author and see whether that author has written a book? Screen 5 shows the SQL statement for that query. In this case, you have a WHERE clause in the outer query, and you have to evaluate only one row in the correlated subquery. That's acceptable. Nothing is wrong with a correlated subquery when you are performing a specific search. Problems arise only when you are looking at many or all of the rows. So don't think that you always have to avoid correlated subqueries.

Subqueries in UPDATE Statements

You can use subqueries in SELECT statements for searching on specific records and for UPDATE statements. You might need to update data based on the results of a query, but typically you need to update only one row at a time. Often, you can control the update (insert or delete) based on data values from another table. For example, suppose you want to drop from the database the authors who have not delivered on their book contracts. As Listing 1 shows, first identify the authors, and then delete the authors who have not written a book.

This DELETE statement changes the Authors table based on data from the Titleauthor table. DELETE works because you have specifically requested authors who do not have an entry in the Titleauthor table.

If you tried to delete an author who had written a book, the referential integrity (RI) constraints would prevent you from deleting the author and leaving orphan entries in the Titleauthor table.

Remember This

If you keep in mind the basics of when to use (and when not to use) subqueries, you will find them a useful tool in your SQL Server programming toolkit. You can often write subqueries as joins, which typically results in better performance. Simple subqueries are relatively fast. Correlated subqueries evaluate once for each row of the outer query, so use these queries only for single-row results.

Comments

Plain text