Debunking the Myths of Temporary Objects, Part 1

Downloads
96288.zip

T-SQL programmers commonly ask, "What's the difference between temporary tables and table variables? How about table expressions such as derived tables, common table expressions (CTEs), views, and inline functions? When is it appropriate to use those and not temporary tables or table variables?" Unfortunately, there's a lot of confusion concerning these temporary objects and constructs, as well as some false myths.

I've heard people say that table variables have no physical representation, that they're just memory-resident, whereas temporary tables have physical representation in the tempdb database—one false myth. Some people believe that table expressions (e.g., derived tables, CTEs) are always physically materialized—another false myth. Unfortunately, these myths are so common that it's inevitable that programmers listening to such advice will end up writing poor performing code in their production systems.

So, let's try to uncover the truth about temporary objects and debunk those myths. I'll start by giving you a general overview of each temporary object type (i.e., local temporary table, table variable, table expression), then provide some general recommendations for when to use each. Next month, I'll provide more concrete and tangible examples—analyzing specific cases, queries, and execution plans—to make everything more concrete.

Local Temporary Tables

Local temporary tables (name prefixed with #) are created in the tempdb database—specifically, they have physical representation.

CREATE TABLE #T(col1 INT, …);

As with permanent tables, if the pages of a temporary table are accessed frequently by SQL Server, they have the potential to remain in cache and permit access there. A local temporary table is owned by the session that created it and is visible only to that session. If your code creates it in an ad-hoc batch, the temporary table is automatically destroyed when the creating session is terminated. It's visible as of the creation point throughout the session—namely, it's visible to stored procedures and triggers that the session invokes. If your code creates a local temporary table within a stored procedure or a trigger, it's automatically destroyed when the creating routine terminates, not when the session disconnects. In such a case, inner levels in the call stack will have access to the temporary table but not to outer levels.

SQL Server creates and maintains distribution statistics (histograms) on the data stored in temporary tables. SQL Server's optimizer will consult those distribution statistics when, for example, it needs to estimate selectivity of filters to figure out whether to use an index and in what manner.

You can apply data definition language (DDL) changes to a temporary table after you create the table (e.g., create an index). Note that some activities against temporary tables might cause recompilations of stored procedures—for example, data changes, refresh of statistics, DDL changes. SQL Server considers any changes you apply to a local temporary table (including DDL changes) to be part of an explicit transaction if one exists. If the transaction rolls back, all changes to the temporary table also roll back.

You can use a SELECT INTO statement to create and populate a temporary table from the resultset of a query. SELECT INTO is a bulk operation and is minimally logged by SQL Server when the recovery model of the database isn't set to FULL. Because tempdb's recovery model is SIMPLE and can't be changed, a SELECT INTO operation that creates and populates a temporary table is minimally logged.

I won't discuss global temporary tables (name prefixed with ##) here. You use them in very special cases, and they're irrelevant to a discussion concerning the choice among temporary tables, table variables, and table expressions.

Table Variables

Table variables are declared like other local variables, with the DECLARE statement:

DECLARE @T TABLE(col1 INT, …);

It's important to understand that the common myth that table variables have no physical representation in the tempdb database is false. They do have physical representation in tempdb, much like local temporary tables. As proof, you can run the code in Listing 1 and see for yourself.

Figure 1 shows the output I received when I ran this code. You can see that no temporary table existed in tempdb before the table variable was declared by the first batch of code, and one did exist after it was declared by the second batch of code. Also, the myth that there's no difference between temporary tables and table variables is false. In fact, as I'll describe shortly, there are many differences.

Figure 1: Output of code in Listing 1

As with local temporary tables, if the pages of a table variable are accessed by SQL Server frequently, they have the potential to remain in cache and allow access there. Also as with a local temporary table, a table variable is owned by the session that created it and is visible only to that session. But its scope is more limited: It's visible only to the local batch, as with any other local variable. The table variable is automatically destroyed as soon as the creating batch is terminated. For example, if you run the following code, attempting to refer to a table variable from a different batch than the creating one, you'll get an error:

DECLARE @T TABLE(col1 INT);
GO 
SELECT * FROM @T;

Figure 2 shows the output of that code. A table variable is visible in neither the outer nor inner levels of the call stack. If your code declares a table variable in a routine (stored procedure, trigger, user defined function), it isn't visible to routines that the declaring routine invoked—not the case with local temporary tables.

Figure 2: Table variable error message

SQL Server doesn't create or maintain distribution statistics about table variables. Therefore, SQL Server's optimizer can't make good selectivity estimates that rely on distribution statistics. You can't apply DDL changes to table variables after they're declared by your code. You can't create indexes explicitly on table variables, but by defining primary key and unique constraints when the table is defined, you get indexes created implicitly, and those indexes can be used by SQL Server for performance purposes and not just to enforce uniqueness.

The fact that SQL Server does maintain distribution statistics with temporary tables, and not with table variables, is one of the main factors you should consider when contemplating which of the two to use. Without distribution statistics, SQL Server can't make good selectivity estimations in some cases; therefore, in this respect, temporary tables have an advantage. Bear in mind that in some cases, making good selectivity estimations isn't so important—for example, if the amount of data is very small to begin with (e.g., several dozens of rows residing in a couple of pages). Also, the optimizer doesn't always need distribution statistics to figure out which plan is optimal (e.g., trivial plans, such as obtaining a range from a clustered or covering index). The fact that table variables don't have distribution statistics has advantageous aspects, as well: You don't get recompilations associated with data change and statistics updates. In this respect, table variables have an advantage.

Changes applied to a table variable aren't part of an explicit transaction if one exists; rather, they're only part of the single atomic operation. So, if the individual statement terminates in the middle, the changes that statement made are undone. However, if the explicit transaction rolls back, the changes to the table variable don't roll back. In some cases, you can rely on this characteristic—for example, when you want to audit an attempt for data modification in trigger code, and you want to reject the change as well. You can copy the contents of the deleted and inserted trigger tables aside to table variables, roll back the change, and then—in a new transaction in the trigger—audit the change attempt by copying data from inserted and deleted to the audit table.

Auditing events you want to fail isn't possible with temporary tables because once you roll back the change, the inserted and deleted tables are emptied, and changes made to temporary tables are undone. Logging needs to support only a rollback of an individual statement accessing the table variable in case the statement is terminated before completion; this fact combined with the narrower scope of visibility of the table variable compared with a temporary table gives you a performance advantage—less locking and logging than with temporary tables. There is an exception in terms of logging: When populating a table variable, you can't use the SELECT INTO statement. Therefore, you'll get more logging when inserting data into a table variable compared with using SELECT INTO against a temporary table.

To summarize the performance side of table variables versus temporary tables, I'd say that when you're dealing with very small amounts of data (e.g., a few pages), I'd recommend using table variables. When you're dealing with large amounts of data, I'd investigate the types of queries involved. If the queries have trivial plans in which the optimizer doesn't need to consult distribution statistics, table variables are safe and can yield performance improvements over temporary tables. If the queries are such that the optimizer needs to consult distribution statistics to determine which of several possible plans is ideal, I'd recommend using temporary tables. I'll provide a tangible example next month.

Table expressions

Programmers often contemplate whether there's a performance advantage to using a table expression such as a derived table or a CTE versus a temporary table (or a table variable). The important thing to understand is that temporary tables and table variables are physically materialized as a table in tempdb, whereas all four types of table expressions (i.e., derived table, CTE, view, inline UDF) are virtual constructs. When processing a query that refers to a table expression, SQL Server expands its definition and ultimately optimizes and executes a query directly against the underlying objects as if the table expression wasn't there to begin with. SQL Server won't materialize the table expression physically. The optimizer might choose to physically materialize some intermediate sets aside (spool operators) in some cases, but spooling has nothing specific to do with table expressions. In short, table expressions aren't really a performance tool; rather, they're more of a simplification/abstraction tool (unlike temporary tables, which you can use as a performance tool). The only exception in this case is indexed views, which do materialize the view's result set physically in a B-Tree.

The information I've provided so far might not be sufficient to answer the dilemma of whether to use a table expression or a temporary table/table variable. Perhaps an example will clarify the situation. Suppose you have a big table with 20 million rows. You need to aggregate the data and end up with 50,000 aggregated rows. You then need to query the aggregated data with multiple queries, and the queries are such that the optimizer needs to make selectivity estimates to choose a plan from among several possible options. It doesn't make sense to use a table expression in this case because every time you query the table expression, you pay for scanning all 20 million rows and aggregating them. In this example, it makes sense to use a temporary table. You materialize the 50,000 rows, index the table to support those queries, then issue the set of queries against the temporary table. Another option is to use an indexed view, which also materializes the aggregated data. Bear in mind, however, that an indexed view will have a negative performance impact on modifications running against the base table.

If the number of aggregated rows were very small (e.g., 50), I'd use a table variable and not a temporary table—provided that the set of queries doesn't need to cross batches. If there were only one query that needed to access the aggregated data—and only once—I'd use a table expression.

First Theory, Then Practice

Now that I've provided an overview of temporary objects, I look forward to really delving into some concrete examples. I hope you can wait until next month, when I'll dive into those examples head first.

Comments

Plain text