T-SQL, probably more than any other programming language, is prone to misuse. Last month I started a two-part series about following T-SQL best practices and avoiding bad ones (see "T-SQL Best Practices, Part 1"). I provided recommendations related to the mathematical foundations of the language as well as coding best practices. This month I continue the coverage by discussing cloud compatibility, date and time best practices, the importance of writing standard code, and performance-related best practices.
Think Cloud
Microsoft is investing heavily in cloud technology, making it richer and more powerful. For some applications, it's inevitable that the database will reside in the cloud at some point. A lot of effort is being spent on maintaining one code base for the box (on-premises) and cloud (SQL Azure) versions, but there are differences you should be aware of. Understanding those differences can affect the way you write your code and make for a smoother migration, if and when you migrate.
SQL Server Books Online details the language features that are or aren't available in SQL Azure. Here I just want to mention a couple of examples to give you a sense of what it means to code with T-SQL against a box while thinking cloud.
SQL Azure doesn't support heaps, only B-trees. This has to do with the internal replication system that SQL Azure uses to maintain multiple copies of the database for disaster recovery purposes. The lack of heap support has implications. For example, the SELECT INTO command isn't supported. Remember that SELECT INTO doesn't copy indexes from the source, including the clustered index; in other words, SELECT INTO always creates a heap and hence isn't supported in SQL Azure. When you write new code, if you use a CREATE TABLE statement followed by INSERT SELECT instead of SELECT INTO, your code will be cloud-ready in this respect.
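As a sketch of this pattern, here's a cloud-ready alternative to SELECT INTO. The table and column names (dbo.MyOrders, Sales.Orders, and so on) are illustrative, not from any particular database:

```sql
-- Instead of SELECT INTO (which always creates a heap):
--   SELECT orderid, orderdate, custid INTO dbo.MyOrders FROM Sales.Orders;

-- Cloud-ready alternative: CREATE TABLE with a clustered index,
-- followed by INSERT SELECT.
CREATE TABLE dbo.MyOrders
(
  orderid   INT      NOT NULL,
  orderdate DATETIME NOT NULL,
  custid    INT      NOT NULL,
  CONSTRAINT PK_MyOrders PRIMARY KEY (orderid) -- clustered by default
);

INSERT INTO dbo.MyOrders (orderid, orderdate, custid)
  SELECT orderid, orderdate, custid
  FROM Sales.Orders;
```

As a bonus, the explicit CREATE TABLE gives you full control over data types, nullability, and constraints, rather than inheriting whatever the source query happens to produce.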
Another example is that in SQL Azure you can't switch between databases within the same connection. The USE statement is supported, but only if you're already connected to the target database. Try switching to a different database, and you'll get an error. With this in mind, avoid code that switches between databases if possible.
Direct access to tempdb isn't allowed in SQL Azure. You're allowed to create temporary tables and declare table variables, but you aren't allowed to connect directly to tempdb and create user objects. Therefore, you should avoid direct access to tempdb.
Another difference between the box and cloud versions of SQL Server is that in a box version the default isolation level is read committed, whereas in the cloud it's read committed snapshot. Understanding the difference between these isolation levels is very important. I believe that most environments using read committed in a box version today could benefit greatly from transitioning to read committed snapshot. Therefore, it's recommended that you learn the value of read committed snapshot and plan to transition at some point. Then, of course, migration to the cloud will be easier.
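If you decide to make that transition in a box environment, the switch is a single database option. This is a minimal sketch; the database name MyDB is illustrative, and the statement needs exclusive access to the database to complete (no other active connections):

```sql
-- Align a box database with SQL Azure's default isolation behavior:
-- readers get the last committed version of a row instead of blocking
-- behind writers.
ALTER DATABASE MyDB SET READ_COMMITTED_SNAPSHOT ON;
```

Before flipping the option in production, test the workload; row versioning changes blocking behavior and adds tempdb version-store activity.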
These examples illustrate what I mean by "think cloud" when coding with T-SQL against a box. For more information about T-SQL in SQL Azure, see "Transact-SQL Reference (SQL Azure Database)."
Date and Time Best Practices
There are many best practices related to working with date and time data. However, I want to mention two particularly common bad habits that I recommend avoiding.
Use language-neutral literals. When expressing date literals, a common practice is to use a form that's language-dependent because that's what the developer is used to (e.g., the form '01/02/11'). However, depending on the language of the login connected to SQL Server and running the code, this form can be interpreted in several different ways. For a login with US English, this form would mean month/day/year; for one with British English, it would mean day/month/year; and for one with Japanese, it would mean year/month/day. As a developer, you don't want to use a form that will be interpreted differently depending on the language of the login running your code. The best practice, then, is to use a form that's language-neutral. For example, SQL Server will always interpret '20110102' as YYYYMMDD because this form is language-neutral. You need to be especially careful with the form '2011-01-02': for the older date and time types (DATETIME and SMALLDATETIME), it's language-dependent (read as YYYY-MM-DD for some languages and YYYY-DD-MM for others), whereas for the newer types (DATE, DATETIME2, and DATETIMEOFFSET), it's always interpreted as YYYY-MM-DD. Imagine using this form in your code and then at some point altering the type of your attribute; your code could suddenly change its meaning. Therefore, I prefer to stick to the form YYYYMMDD for all types -- old and new -- where it's always interpreted the same way.
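You can see the problem for yourself by changing the session language and casting the same literal. A small demonstration (the language names are the standard ones SQL Server ships with):

```sql
-- The same language-dependent literal means different dates
-- depending on the session language.
SET LANGUAGE British;
SELECT CAST('01/02/11' AS DATETIME);  -- interpreted as February 1, 2011 (d/m/y)

SET LANGUAGE us_english;
SELECT CAST('01/02/11' AS DATETIME);  -- interpreted as January 2, 2011 (m/d/y)

-- The language-neutral form is always read as YYYYMMDD.
SELECT CAST('20110102' AS DATETIME);  -- January 2, 2011, regardless of language
```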
Be careful about rounding errors. Suppose you're using the DATETIME data type for your column, and your data has both date and time values that aren't necessarily midnight. You want to filter a period of date and time data such as January 2012. Some people use the following filter form:
WHERE col BETWEEN '20120101' AND '20120131 23:59:59.999'
The problem is that 999 as the millisecond unit isn't a multiple of the precision unit for DATETIME, which is three and a third milliseconds. Therefore, the value gets rounded up to the next midnight, and your range might end up including rows it isn't supposed to include. Some people "fix" this problem by using 997 in the millisecond unit, but what if at some point in the future you alter the type to one with finer precision? Therefore, the best practice with date and time ranges is to avoid BETWEEN and to always use the form:
WHERE col >= '20120101' AND col < '20120201'
This form works with all types and all precisions, regardless of whether the time part is applicable.
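Here's a quick way to see the rounding behavior and the recommended filter side by side. The table and column names (Sales.Orders, orderdate) are illustrative:

```sql
-- DATETIME precision is 1/300 second (.000, .003, .007), so .999 is
-- rounded up -- in this case to the next midnight, outside the
-- intended range.
SELECT CAST('20120131 23:59:59.999' AS DATETIME);
-- returns 2012-02-01 00:00:00.000

-- Recommended range filter: correct for all date and time types
-- and precisions.
SELECT orderid, orderdate
FROM Sales.Orders
WHERE orderdate >= '20120101'
  AND orderdate <  '20120201';
```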
Write Standard Code
T-SQL has several proprietary language elements. Some have counterparts that are supported by standard SQL (ISO and ANSI). I find that it's important to make an effort to write standard code unless there's a substantial reason to do otherwise, such as extra functionality or better performance. Following the standard lets you write code that's more portable between platforms. But it's not just your code that becomes more portable; your knowledge does as well.
In "T-SQL Best Practices, Part 1," I recommended terminating your T-SQL statements with a semicolon. You should also try to stick to standard functions and operators. For example, the CAST function is standard, whereas the CONVERT function isn't. If you're not relying on the optional style argument that CONVERT supports but CAST doesn't, use CAST. COALESCE is standard, whereas ISNULL isn't. COALESCE is more flexible than ISNULL, so using it is a win-win situation. The <> operator is standard, whereas the != operator isn't -- so, the obvious recommendation is to use <>.
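These preferences are easy to apply mechanically. A small illustrative query (the table and column names are hypothetical) showing each standard element in place of its proprietary counterpart:

```sql
-- Standard elements preferred over their proprietary counterparts:
SELECT orderid,
       CAST(qty AS NUMERIC(12, 2)) AS qty,       -- CAST instead of CONVERT
       COALESCE(shippeddate, orderdate) AS eventdate  -- COALESCE instead of ISNULL
FROM Sales.Orders
WHERE custid <> 5;                               -- <> instead of !=
```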
These examples simply give you a sense of standard coding practices. It's important to learn what's standard and what isn't, and to follow the standard when possible.
Beware of Rules of Thumb
People like rules of thumb, but valid ones don't always exist. I've seen many such rules that were probably based on very limited observations and then generalized into a rule. Have you heard rules of thumb such as "Use joins because they're more efficient than subqueries," or "If you have more than n tables in a single query, break it into multiple queries"? The reality is that sometimes joins are more efficient than subqueries, sometimes it's the other way around, and sometimes they're optimized the same way. Sometimes a single query with 20 joins will perform just fine, and a query with 4 joins will actually perform better if split into steps that fill temporary tables that are then joined. Optimization of T-SQL code is such a dynamic and sophisticated process that rules of thumb like these won't help you; instead, they'll mislead you. If you insist on adopting a rule of thumb, make it "Keep an open mind, and be careful about rules of thumb." This should always be true...maybe....
Use Search Arguments
When optimizing a query and considering whether and how to use indexes, one of the key elements the optimizer evaluates is the predicates in the query filters. However, you need to be aware that in order to rely on index ordering (e.g., perform an index seek or ordered scan), the predicate must be of a particular form. Specifically, with very few exceptions, you should try to avoid manipulating the index key columns. In most cases, after you apply manipulation, the optimizer can't trust that the outcome of the manipulation preserves index ordering. Sometimes there's not much you can do to avoid manipulation, but sometimes you can. When the optimizer can rely on index ordering for a given predicate, the predicate is said to be a search argument, or SARG for short. Here are a few examples:
- The predicate COALESCE(T1.col1, -1) = COALESCE(T2.col1, -1) isn't a SARG, whereas the predicate T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL) is.
- The predicate col1 - 1 <= @n isn't a SARG, whereas the predicate col1 <= @n + 1 is.
- The predicate LEFT(col1, 3) = 'abc' isn't a SARG, whereas the predicate col1 LIKE 'abc%' is.
As you can see, sometimes simple revisions to your predicates can retain the meaning and allow efficient use of indexing.
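Another very common case of the same problem is applying a function to a date column. A before-and-after sketch (Sales.Orders and orderdate are illustrative names, assuming an index on orderdate):

```sql
-- Not a SARG: the function on the column hides index ordering
-- from the optimizer, typically forcing a scan.
SELECT orderid
FROM Sales.Orders
WHERE YEAR(orderdate) = 2012;

-- SARG: the equivalent range predicate leaves the column
-- unmanipulated, so the optimizer can perform an index seek.
SELECT orderid
FROM Sales.Orders
WHERE orderdate >= '20120101'
  AND orderdate <  '20130101';
```

Note that this revision also uses the open-ended range form recommended earlier for date and time filters, so it stays correct for any type and precision.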
Practices to Use or Avoid
In this two-part series, I discussed several important areas in T-SQL where you often see people implementing bad practices. I provided what I find to be best practices that can help you write maintainable, correct, and efficient code. Some best practices involve understanding the mathematical foundations of the language and require time and effort on your part, whereas others are very simple and straightforward and only require you to be aware of them. The set of best practices I provided isn't exhaustive, but it does include what I find to be particularly important.