T-SQL Best Practices, Part 1

Tips for producing healthy and efficient code

Every programming language has best and worst practices, and T-SQL is no exception. Some best practices in T-SQL are similar to those that apply to procedural languages, but most actually have to do with the fact that T-SQL is based on the relational model. This article is the first in a two-part series that highlights several important T-SQL best practices. (However, please don't consider my list complete.)

Related: T-SQL Best Practices, Part 2

Think in Sets

Thinking in sets, or more precisely, in relational terms, is probably the most important best practice when writing T-SQL code. Many people start coding in T-SQL after having some background in procedural programming. Often, at least at the early stages of coding in this new environment, you don't really think in relational terms, but rather in procedural terms. That's because it's easier to think of a new language as an extension of what you already know than to treat it as something genuinely different, which requires adopting a new mindset.

If that's not enough, there are two levels of adaptation you need to go through. The first is not thinking in procedural terms, but rather in T-SQL terms. The second is realizing that even T-SQL isn't perfect as a relational language and understanding how to use it relationally.

Here I'll just lay out the key elements you need to keep in mind. For deeper coverage of this topic, see "T-SQL Foundations: Thinking in Sets."

When I say "not thinking in procedural terms," I'm talking about generally avoiding iterative constructs such as cursors and loops, in which you process data one row at a time in specific order. Remember that tables in T-SQL are supposed to represent sets of rows; sets need to be handled as a whole, and they have no particular order to their elements. I won't tell you to never touch a cursor, because I recognize that there are cases in which, for practical reasons, it's difficult to avoid them -- however, you should consider a set-based approach by default.
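
To make the contrast concrete, here's a minimal sketch, assuming the TSQL2012 sample database used in the examples later in this article. Both snippets compute the number of orders per customer; the cursor version processes one customer at a time, whereas the set-based version describes the desired result as a single query and lets the optimizer decide how to process the rows:

-- Procedural approach: a cursor iterates over customers one row at a time
DECLARE @custid INT, @cnt INT;
DECLARE cust_cursor CURSOR FAST_FORWARD FOR
  SELECT custid FROM Sales.Customers;
OPEN cust_cursor;
FETCH NEXT FROM cust_cursor INTO @custid;
WHILE @@FETCH_STATUS = 0
BEGIN
  SELECT @cnt = COUNT(*) FROM Sales.Orders WHERE custid = @custid;
  PRINT CONCAT(@custid, N': ', @cnt);
  FETCH NEXT FROM cust_cursor INTO @custid;
END;
CLOSE cust_cursor;
DEALLOCATE cust_cursor;

-- Set-based approach: one declarative query over the whole set
SELECT C.custid, COUNT(O.orderid) AS numorders
FROM Sales.Customers AS C
  LEFT OUTER JOIN Sales.Orders AS O
    ON O.custid = C.custid
GROUP BY C.custid;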

As I mentioned, a set has no order, so you need to remember that neither tables nor views are ordered. This sounds easy, but in reality it's very difficult for some people to accept this fact. You need to realize that regardless of what kind of physical structures exist, such as indexing, when you query a table or a view, you aren't supposed to assume any particular physical order of processing the rows. Also, unless you add an ORDER BY clause to the outermost query, the order of the rows returned isn't guaranteed. For these reasons, attempts to create a "sorted view" -- for example, by adding TOP (100) PERCENT to the query -- are wrong from a relational perspective and aren't guaranteed to return rows from a view in any particular order when the outer query against the view doesn't have an ORDER BY clause.
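
As an illustration, here's a sketch of such a "sorted view" and of the only construct that actually guarantees presentation order -- an ORDER BY on the outermost query. The view name is hypothetical, for illustration only; the table is from the TSQL2012 sample database:

-- The TOP (100) PERCENT ... ORDER BY trick doesn't guarantee output order
CREATE VIEW Sales.MySortedCustomers  -- hypothetical view name
AS
SELECT TOP (100) PERCENT custid, country, region, city
FROM Sales.Customers
ORDER BY country;
GO

-- Only an ORDER BY on the outermost query guarantees the order of the returned rows
SELECT custid, country, region, city
FROM Sales.MySortedCustomers
ORDER BY country;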

You must understand that the T-SQL concepts of table, row, and column have very different meanings from file, record, and field -- terms that people often use in their place by mistake.

After you stop thinking in iterative, ordered, one-row-at-a-time terms and start thinking in terms of sets, you've taken the most important leap. A second level of maturity in programming with T-SQL is to realize that T-SQL isn't perfect as a relational language, but it provides you with enough tools that you can use it relationally if you understand the relational model. Chris Date does a great job explaining this topic in his book SQL and Relational Theory, which I strongly recommend that all T-SQL practitioners read. You need to understand the difference between the T-SQL concepts of table, column, and row and their relational counterparts: relation, attribute, and tuple. (A full discussion of this topic is beyond the scope of this article.) 

Think of NULLs and the Three-Valued Logic

NULLs are a source of quite a lot of complexity and confusion in T-SQL. In essence, a NULL is supposed to represent a missing value that could be either applicable (exists but we don't know what it is) or inapplicable (doesn't exist, or is irrelevant). To start with terminology, NULL is a mark for a missing value -- not a value -- therefore, saying "NULL value" is incorrect. The terminology you want to use is either just a NULL or a NULL mark. Also, a predicate that compares something with a NULL -- even another NULL -- yields neither true nor false, but instead unknown.
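
A quick way to see three-valued logic in action is to compare NULL with itself (a minimal sketch, relying on the default ANSI_NULLS ON behavior):

-- NULL = NULL yields unknown, not true, so both CASE expressions take the ELSE branch
SELECT
  CASE WHEN NULL = NULL THEN 'true' ELSE 'not true' END AS equality_test,
  CASE WHEN NULL <> NULL THEN 'true' ELSE 'not true' END AS inequality_test;
-- Only IS NULL and IS NOT NULL are reliable tests for NULLs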

One of the most important best practices in T-SQL is that whenever you write T-SQL code, you should always be mindful of NULLs, as well as the fact that interaction with NULLs is special and sometimes actually counterintuitive. So, for every piece of code you write with T-SQL, you should ask yourself whether NULLs are possible in the data you're interacting with (e.g., filter, sort, group); if they are, make sure to include specific tests for NULLs in your QA and test environments. Sometimes your findings about the behavior of NULLs will require you to change the code in order for it to behave the way you need it to.

For example, you need to remember that query filters such as ON, WHERE, and HAVING return a row only if the predicate evaluates to true and don't return the row if the predicate evaluates to either false or unknown. Consider the following example query against a sample database called TSQL2012:

USE TSQL2012;

SELECT custid, country, region, city
FROM Sales.Customers
WHERE region <> N'WA';

(You can download the source code for the TSQL2012 sample database.)

Some places don't have three levels in the location hierarchy -- just two (country and city) -- in which case the region is NULL, meaning that it's inapplicable. The example query attempts to return all customers that aren't from the region WA, which is supposed to also include locations where the region is simply inapplicable. However, any comparison with NULL (including the operator <>) yields unknown, and a query filter filters out unknowns. So this query will return only customers where the region is known to be different than WA and not NULL. To fix this problem and include customers from locations where the region is inapplicable, you need to add logic to the query, like so:

SELECT custid, country, region, city
FROM Sales.Customers
WHERE region <> N'WA'
   OR region IS NULL;

Also, you need to keep in mind that some language constructs filter out unknowns (e.g., query filters), whereas some don't (e.g., a CHECK constraint).
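
To see the difference, consider this sketch with a hypothetical table. A CHECK constraint rejects a row only when its predicate evaluates to false, so a NULL, which makes the predicate unknown, is accepted:

-- Hypothetical table, for illustration only
CREATE TABLE dbo.MyProducts
(
  productid INT   NOT NULL PRIMARY KEY,
  unitprice MONEY NULL,
  CONSTRAINT CHK_MyProducts_unitprice CHECK (unitprice > 0)
);

INSERT INTO dbo.MyProducts (productid, unitprice)
  VALUES (1, NULL); -- accepted: the predicate evaluates to unknown, not false
-- A row with unitprice = -1 would be rejected: the predicate evaluates to false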

If you always think about NULLs and ensure that your tests always include NULLs, you'll quickly learn that using a NOT IN predicate to look for non-matches is very dangerous when one of the elements in the subquery can be NULL. For example, suppose that you're writing the following query to return customers who didn't place orders:

SELECT custid
FROM Sales.Customers
WHERE custid NOT IN
  (SELECT custid FROM Sales.Orders);

If any order has a NULL custid value, this query will return an empty set, even when there are customers who didn't place orders. That's because, as far as the language is concerned, the NULL could still represent any custid value that doesn't appear in the set of present values, so there's no assurance that the outer custid isn't in the inner set. This tricky behavior is a direct consequence of three-valued logic. After you identify the problem, you can determine a solution. For example, you can explicitly eliminate the NULLs in the subquery with a filter saying WHERE custid IS NOT NULL, so that you end up comparing the outer custid value only with present custid values from the inner set. Another option is to use NOT EXISTS instead, because it relies on two-valued logic.
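
Here's a sketch of both fixes:

-- Option 1: explicitly eliminate NULLs in the subquery
SELECT custid
FROM Sales.Customers
WHERE custid NOT IN
  (SELECT custid FROM Sales.Orders WHERE custid IS NOT NULL);

-- Option 2: use NOT EXISTS, which relies on two-valued logic
SELECT C.custid
FROM Sales.Customers AS C
WHERE NOT EXISTS
  (SELECT * FROM Sales.Orders AS O WHERE O.custid = C.custid);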

The specific example is less important than the general recommended best practice: Be mindful of NULLs for every piece of code you write in T-SQL. If NULLs are possible in the data, make sure you include them in your tests. And if the default language treatment isn't the one you need, make sure you make the correct adjustments to make it behave the way you need.

Avoid NOLOCK

The use of the NOLOCK query hint (equivalent to using the read uncommitted isolation level) is one of the most common practices in T-SQL yet one of the worst. The practice is so common because it makes queries run faster. Concurrency improves because with NOLOCK, your queries don't request shared locks and therefore run into fewer blocking conflicts because of lock incompatibility. But at what cost?

Most people's perception is that the effect of using NOLOCK is the possibility of getting uncommitted reads (aka dirty reads), which is true. Some consider that in certain circumstances it's OK to get uncommitted reads. However, there's more to NOLOCK than just uncommitted reads. As it turns out, when the storage engine gets an instruction from the query processor to perform some activity (e.g., an index scan with an Ordered: False property), sometimes the storage engine has multiple ways to carry out the instruction. Typically, the storage engine prioritizes the correctness expectations of the request over performance. But when you use the NOLOCK option, the storage engine assumes your main focus is speed and therefore prefers the faster option. This can result in reading the same row twice or skipping rows in cases that otherwise wouldn't be allowed to happen. In short, with NOLOCK you get data consistency problems well beyond uncommitted reads.

If you have performance problems related to concurrency when using the read committed isolation level, and you're using SQL Server 2005 or later, you should consider using the newer read committed snapshot isolation. It has some overhead for writes, but it doesn't suffer from the aforementioned consistency problems -- and it typically solves the concurrency problems.
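
If you decide to go that route, read committed snapshot is enabled at the database level. A minimal sketch, assuming the TSQL2012 sample database used in the earlier examples:

-- Enable read committed snapshot isolation for the database
-- (the statement waits for exclusive access to the database unless you add
--  a termination option such as WITH ROLLBACK IMMEDIATE)
ALTER DATABASE TSQL2012 SET READ_COMMITTED_SNAPSHOT ON;
-- From this point, readers under READ COMMITTED use row versions instead of shared locks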

Adopt Good Coding Practices

There are some coding best practices that experienced database practitioners have heard a thousand times, but they're still so important that I want to make sure I mention them.

Terminating statements with a semicolon. In the past, T-SQL didn't require terminating all statements with a semicolon. However, standard SQL requires it, and with each new version of SQL Server, the terminator becomes a requirement in more and more specific cases to avoid parsing ambiguity. SQL Server 2008 already added "not terminating statements with a semicolon" to the list of deprecated features. So do yourself a favor: if you haven't done so already, start getting used to terminating all of your statements.
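
One case where the missing terminator already hurts today is a statement that precedes a common table expression; unless the preceding statement is terminated, the WITH keyword is ambiguous and you get a parsing error. A sketch, assuming the Sales.Orders table from the TSQL2012 sample database with the empid column shown in the style example later in this article:

-- Without the semicolon after the DECLARE statement, this batch fails to parse
-- ("...the previous statement must be terminated with a semicolon")
DECLARE @empid INT = 1;

WITH EmpOrders AS
(
  SELECT custid, orderid
  FROM Sales.Orders
  WHERE empid = @empid
)
SELECT custid, COUNT(*) AS numorders
FROM EmpOrders
GROUP BY custid;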

Avoid *. Using * in the SELECT list is a bad practice in most cases. Changes in the underlying table definition might cause problems for applications that don't expect them, as well as for table expressions such as views defined based on the query. You also want to be able to rely efficiently on covering indexes when you're truly interested in only a subset of the columns.
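
A minimal sketch of the explicit alternative:

-- Explicit column list: unaffected by columns added to the table later,
-- and a narrower index that contains just these columns can cover the query
SELECT orderid, custid, orderdate
FROM Sales.Orders;
-- Avoid: SELECT * FROM Sales.Orders;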

Schema-qualified object names. Always use schema-qualified object names (Sales.Orders as opposed to just Orders), both when creating objects and when querying them. Different users can have different default schemas. Also, even though the savings are very minor and probably negligible, explicitly indicating a schema spares the cost of the instructions involved in implicit name resolution. Furthermore, when object names aren't schema-qualified, a query plan produced for a user with one default schema can't be reused by a user with a different default schema, so you get multiple optimizations and multiple unnecessary execution plans in cache.
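
A short sketch of the habit, using the sample database's Sales schema:

-- Avoid the ambiguous form, which is resolved against the caller's default schema first:
--   SELECT orderid, custid, orderdate FROM Orders;

-- Prefer the explicit form, which always resolves to the same object
-- and lets the plan be reused across users with different default schemas:
SELECT orderid, custid, orderdate
FROM Sales.Orders;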

Have a style and use correct casing. I find it very important to use the correct case for identifiers of objects and columns. I've seen a lot of code in which people weren't mindful of this rule; in case-insensitive environments, it typically doesn't matter much. However, what if at some point in the future you need to make your databases case-sensitive? Why not write in a way that won't get you into trouble if such changes are required in the future? Also, code that uses different casing for the same identifiers in different places doesn't look good. Consistency can go a long way in code readability and maintainability.

For similar reasons, I find it important to develop some style for your code. Have you seen queries that look like this one?

SELECT CUSTID,EMPID, YEAR(ORDERDATE) ORDERYEAR,SUM(QTY) QTY FROM ORDERS GROUP BY CUSTID, EMPID, YEAR(ORDERDATE) 

As opposed to:

SELECT custid, empid, YEAR(orderdate) AS orderyear, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY custid, empid, YEAR(orderdate);

There's not a specific style that's better than the rest, but it's important to have a style that's readable and natural, and to be consistent with it. Doing so improves the readability and maintainability of your code.

Avoid the old-style join syntax. Some people find it difficult to adopt the ANSI SQL-92 syntax for joins, with the JOIN keyword and the ON clause (e.g., FROM T1 INNER JOIN T2 ON T1.keycol = T2.keycol), as opposed to listing the table names separated by commas and putting all predicates in the WHERE clause (e.g., FROM T1, T2 WHERE T1.keycol = T2.keycol). I encourage you to use the newer syntax, although contrary to popular belief, the old-style syntax for cross and inner joins is actually standard and is fully supported by SQL Server. Outer joins, however, never had a standard old-style syntax; their only standard syntax is the ANSI SQL-92 syntax, and they're the reason the standard bothered creating a new syntax in the first place. For consistency's sake, the standard also adopted the newer syntax for cross and inner joins, although it didn't deprecate their old-style syntax.

At any rate, there are two main reasons why you should stick to the newer syntax for all joins. One is consistency; it simply doesn't look good when you try mixing the different styles. Another is the fact that if you use the old-style syntax and forget to indicate the WHERE clause with the join predicate, you get an unintentional cross join. With the newer syntax, if you use an inner join and forget the ON clause, you get a parsing error that forces you to fix the code.
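
A sketch of the hazard and of the safer form, again using the sample database's tables:

-- Old style: forgetting the join predicate in the WHERE clause silently
-- produces a cross join (every customer paired with every order)
SELECT C.custid, O.orderid
FROM Sales.Customers AS C, Sales.Orders AS O;

-- New style: forgetting the ON clause is a parsing error, so the mistake can't slip through
SELECT C.custid, O.orderid
FROM Sales.Customers AS C
  INNER JOIN Sales.Orders AS O
    ON O.custid = C.custid;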

More to Come

In this article, I covered some of the most important T-SQL best practices for producing healthy and efficient code. I discussed the importance of thinking in a relational way, always remembering NULLs and the three-valued logic, realizing the full implications of using the NOLOCK hint and therefore avoiding it, and adopting good coding practices. Next month, I'll introduce additional T-SQL best practices.
