T-SQL for Starters: Data Retrieval with Joins

Databases are not just for storing data, but for retrieving information. Unless you can retrieve relevant information from your database, a powerful server running the latest software is about as much use as a shoebox full of old receipts. You probably normalize your data thoroughly when building your application (for information about normalization, see Michelle Poolet, "Why You Need Data Normalization," Premiere issue), so that you have many small tables, each referring to some entity such as a student or a class. Now, when you need to retrieve information, you usually have to join multiple tables together, so that you can, for example, generate a list of which students are in each class.

Join Syntax

SQL Server supports two different ways to write a query with a join. One is the Transact-SQL (T-SQL) syntax, which SQL Server has supported since version 1. This method is familiar to users of programs such as Microsoft Access. But SQL Server also supports the ANSI-92 SQL syntax for writing joins. In both SQL Server 6.x and 7.0, you may use either syntax, but the results will differ. In a few cases involving NULLs, only the ANSI syntax gives the correct result; the T-SQL syntax query runs, but the output is not correct. Several other facts support a preference for the ANSI syntax, not the least of which is that it is now the preferred syntax for SQL Server 7.0. I will mention some other reasons as I explain the various types of join. As usual, I will use the PUBS database for this discussion. The examples shown here will work on SQL Server 6.5 and 7.0, so use whichever version you have available for testing.

Inner Joins

The simplest form of a join is the INNER JOIN. In English, you are saying, "Give me all the data that matches in both tables." You have to specify which column, or columns, in one table match the corresponding columns in the second table. Suppose you want a list of publishers and the titles they publish, as Screen 1 shows. In the old T-SQL style, you write this join as:

SELECT pub_name, title 
FROM publishers, titles 
WHERE publishers.pub_id  = titles.pub_id

The SELECT clause says which columns to output, the FROM clause lists the tables, and the WHERE clause tells SQL Server how to join the tables. In this case, the publisher ID, which the pub_id column represents, is the common link between the publishers table and the titles table. Now here's the same query in the ANSI syntax:

SELECT pub_name, title 
FROM publishers INNER JOIN titles
ON publishers.pub_id = titles.pub_id

Notice the differences. First, you type the word JOIN in the query, between the names of the tables you're joining. The word INNER is optional, because this is the default join type. Second, you do not need a WHERE clause for the join, so you can reserve the WHERE clause to restrict the rows you want displayed. Instead, you use the ON keyword to specify which columns to base the join on. In either syntax, the order of the tables is not critical. So

SELECT pub_name, title 
FROM publishers INNER JOIN titles
ON publishers.pub_id = titles.pub_id

works the same way as

SELECT pub_name, title 
FROM titles INNER JOIN publishers
ON titles.pub_id = publishers.pub_id

which reverses the order of the tables in the FROM clause.

My preference is to list the columns in the order in which I want them to appear in the output data, then list the tables in the FROM clause in the order in which I need them to retrieve the specified columns.

Equijoins and Natural Joins

An equijoin is an INNER JOIN in which you ask for the join columns from both tables. You may think that this join will return redundant data, and of course, you are correct. The whole premise of this join is that the data in the join column is the same in both tables. An equijoin would look like this:

SELECT pub_name, publishers.pub_id, titles.pub_id,  title
FROM publishers INNER JOIN titles 
ON publishers.pub_id = titles.pub_id

Screen 2 shows the results. Although this topic may seem of only theoretical interest, it does bring up an important point. The INNER JOIN examples did not request the pub_id column to be included in the output data. This example does. And if you request a column that exists in both tables, you must specify which table's column you want. Therefore, the query shows publishers.pub_id and titles.pub_id, to distinguish the two pub_id columns.

A more reasonable request would be to have the publisher ID in the output, but you need to see it only once. This approach is a natural join and it looks like this:

SELECT pub_name, publishers.pub_id,  title
FROM titles INNER JOIN publishers
ON titles.pub_id = publishers.pub_id

Does it matter which pub_id you select, as they are the same? No, there's no difference in the results.

I wrote the equijoin and natural join in the ANSI syntax, but I could have written them in the T-SQL syntax. I recommend the ANSI syntax because it is the preferred syntax for SQL Server 7.0 and has a better probability of working on other relational database management systems (RDBMSs).

Cross Joins

I recommend you avoid the CROSS JOIN, or the Cartesian product. First, the output is almost always useless, and second, this join can bring the server to a grinding halt. As an example, if you run two queries on the titles and publishers tables, you find that these tables have 18 and 8 rows, respectively. The cross join gives you 8*18, or 144, rows, as you see in Screen 3. It shows every possible combination of publisher and title, implying that every publisher published every book. Here are the queries to run to test this for yourself. Here's the ANSI join syntax:

SELECT pub_name, title
FROM titles CROSS JOIN publishers
(gets 8x18 = 144 rows)

and the T-SQL join syntax:

SELECT pub_name, title
FROM titles, publishers  
(gets 8x18 = 144 rows)

Imagine that you ran a CROSS JOIN query on a table with 1000 rows, such as a list of the top 1000 best-selling authors, with a 2000-row table containing a list of best-selling books. You might expect 2000 rows of output, each listing a book and its author, but instead you will see 2 million rows.

Here's one reason why the ANSI syntax is gaining favor. In the ANSI syntax, users have to type CROSS to run a CROSS JOIN. In the T-SQL syntax, they simply omit the WHERE clause specifying how to join the tables. This omission can happen accidentally, if the users select only the first two lines of a query before running it. A programmer can make a similar mistake, intending to test the query first, and then specify the WHERE clause to restrict the output rows.

Should you always avoid CROSS JOINS? No, they can be useful. If you need to generate a lot of test data quickly, these joins work well. For example, suppose you build a table with 100 last names and another table with 100 first names. Now do a CROSS JOIN. Instantly, you have 10,000 customers with every possible combination of first and last names. Just be careful when you use CROSS JOINS.

Outer Joins

You create an OUTER JOIN when you ask for all the data from one table, and then the corresponding records from a second table. The titles and publishers example would work here, because each title has a publisher, but not every publisher has published at least one title. But let's consider another example, where some books are selling well and others are not selling at all. The business needs to know which books are not selling. So in this example, consider the titles and sales tables, as shown in Screen 4. Here's the ANSI syntax:

SELECT substring (title,1,40) AS Book, stor_id,
ord_num, qty, ord_date
FROM titles LEFT OUTER JOIN sales
ON titles.title_id = sales.title_id

and the T-SQL syntax (note the *= for the left OUTER JOIN):

SELECT substring (title,1,40) AS Book, stor_id,
ord_num, qty, ord_date
FROM titles, sales
WHERE titles.title_id *= sales.title_id

These examples use the SUBSTRING function (which I discussed in "Working with Character Data," May 1999) to shorten the displayed output for the title table, and I relabeled the column as Book. The ANSI syntax is clearer. It says LEFT OUTER JOIN. You can just specify LEFT JOIN. It means, "Give me all the data from the table to the left of the JOIN (in this case, titles) and the matching data from the table to the right of the JOIN (in this case, sales)." So you see all the titles, including those with no sales. Where sales data exists for a title, it is output. Where no sales data exists, you see NULLs.

Here you run into something called a RIGHT OUTER JOIN. In the example above, no sales data could appear if no entries were in the titles table. Thus, if you just change LEFT to RIGHT and run the query, you see what is really an INNER JOIN. But if you flip the tables

SELECT substring (title,1,40) AS Book, stor_id,
ord_num, qty, ord_date
FROM sales RIGHT OUTER JOIN titles
ON sales.title_id = titles.title_id

you have a RIGHT OUTER JOIN. In other words,

titles LEFT JOIN sales

produces the same results as

sales RIGHT JOIN titles

In fact, internally, SQL Server doesn't process RIGHT JOINS. It just flips them around and handles them as LEFT JOINS.

The T-SQL syntax is subtler. Note the asterisk before the equals sign in the WHERE clause. In a SELECT clause, the asterisk means, "Get everything," so the *= means, "Get everything from the table on the left of the equals sign and the matching data from the right": a LEFT OUTER JOIN, in other words. What does the =* mean? It's a RIGHT OUTER JOIN, which returns all the rows from the table to the right of the join and only the matching rows from the table on the left.

ANSI join syntax lets you specify a type of join that wasn't possible in the old-style syntax, the full OUTER JOIN. It does exactly what you would expect if you combined the LEFT and RIGHT OUTER JOINS. It returns all the rows in each table, showing the matching data where appropriate, and filling in with NULLS whenever no match occurs in the other table. An example is joining the titles and sales tables to find all book sales, plus all books that are not selling, and all stores that have not sold any books. Note that you must write the full OUTER JOIN using the ANSI syntax. T-SQL has no equivalent.

Nulls in Outer Joins

Here's one more reason to adopt the ANSI syntax for your joins. When you use an OUTER JOIN, some rows show null for the data from the inner table. The question is whether this null is really data from the inner table or whether the join inserted it. So when you put in a condition WHERE xxx IS NULL, the T-SQL query returns contradictory or inaccurate results. Run the first query below, but leave out AND qty IS NULL.

SELECT Book = substring (title,1,40), stor_id,
ord_num, qty, ord_date
FROM titles, sales
WHERE titles.title_id *= sales.title_id  
AND qty IS NULL

You will see a couple of books with no sales. Now try to query for just those books by including the search condition. T-SQL syntax produces unexpected results, as you see in Screen 5. ANSI syntax works correctly:

SELECT Book = substring (title,1,40), stor_id,
ord_num, qty, ord_date
FROM titles LEFT OUTER JOIN sales
ON titles.title_id = sales.title_id  
WHERE qty IS NULL

These two queries illustrate the fundamental problem with the T-SQL query syntax, which the ANSI syntax resolves. The WHERE clause is doing double duty. It specifies the JOIN columns, and it acts as the search condition. In the ANSI syntax, you specify the join as a JOIN, and the WHERE clause acts only as a search condition to limit which rows are returned. In the examples in this article, the ANSI query usually doesn't even have a WHERE clause, but the T-SQL query always has this clause whenever two or more tables are joined.

Self Joins

Do you ever get multiple copies of a mailing, and the only difference is a small change in the spelling of your name? How can you prevent this scenario from happening in your database? Look for similar records by using a SELF JOIN, which joins the table to itself and looks for similar data in each copy of the table. As Screen 6 shows, the query finds authors with the same last name. Here's the ANSI join syntax:

SELECT au1.au_fname, au1.au_lname,
au2.au_fname, au2.au_lname
FROM authors au1 INNER JOIN authors au2
ON au1.au_lname = au2.au_lname
WHERE au1.city = 'Salt Lake City'
AND   au1.state = 'UT'
AND au1.au_id < au2.au_id

T-SQL syntax (the JOIN condition is mixed in with the WHERE conditions):

SELECT au1.au_fname, au1.au_lname, au2.au_fname,
au2.au_lname
FROM authors au1, authors au2 
WHERE au1.city = 'Salt Lake City'
AND au1.state = 'UT'
AND au1.au_lname = au2.au_lname 
AND au1.au_id < au2.au_id

The last line, stating that the author IDs must be different, simply stops matching each author's name with itself. I used a less-than symbol, rather than a not-equals symbol, because otherwise the results show both matches—Albert Ringer to Anne Ringer and vice versa. You must alias the table, so that the query can handle references to what now appear to be two distinct tables.

Comments

Plain text