What's New in Yukon T-SQL

Editor's note: The features described here aren't final, nor are they guaranteed to be in the final product release in the same form or at all.

You've probably heard about the new ability in the upcoming release of SQL Server, code-named Yukon, to develop programmable objects through Common Language Runtime (CLR)-based languages such as Visual Basic .NET and C#. (Randy Holloway and Andrej Kyselica cover this much-anticipated CLR integration in "Ready or Not, Here Comes the CLR.") You might have thought that as a result, Microsoft would put less effort into enhancing T-SQL, but that's far from the case. CLR-based programming complements T-SQL in its weak areas. T-SQL has always been strong in data-manipulation and set-based operations and still has no match in these areas. However, CLR languages are stronger in areas that involve complex algorithms, iterative processing, and so on. T-SQL has a long life ahead of it, as the effort that Microsoft put into these T-SQL enhancements shows.

Microsoft has improved many aspects of T-SQL in Yukon, providing long-awaited responses to programmers' needs and requests. Relational enhancements in Yukon Beta 1 include Common Table Expressions (CTEs) that have recursive querying capabilities, PIVOT and UNPIVOT operators, a more functional TOP clause, and CROSS APPLY and OUTER APPLY operators. INSERT, UPDATE, and DELETE statements can now return output, and the WAITFOR command lets you wait for a Data Manipulation Language (DML) action to affect at least one row or wait for a notification to arrive at a queue before control passes to the next command. Triggers now let you trap Data Definition Language (DDL) events such as dropping an object, and you can even get notifications when DML events take place. A new error-handling mechanism based on Try/Catch blocks lets you manage errors much more effectively than before. Important data-type enhancements include a new XML data type, separate date and time data types (finally!), and the MAX option, which allows much more elegant manipulation of large objects in dynamic columns such as varchar, nvarchar, and varbinary. And finally, a new rowset provider called BULK lets you access files more easily than ever. (A rowset provider lets you access a data source in relational form, meaning that you get a table as a result of a query against the provider.)

I can't go into great technical depth or specify syntax details in this preview article, but I can give you a taste and provide highlights of the new features. If you're enrolled in the Yukon beta-testing program, you can find more details in the accompanying Books Online (BOL) and in the "Yukon T-SQL Enhancements" white paper available with the Yukon Beta Readiness Kit. In future articles closer to the release of the product, I'll go into more implementation details.

Common Table Expressions

Have you ever wanted to combine the capabilities of views and derived tables? Need to write recursive queries in T-SQL? With CTEs, you can do both! A CTE is a named table expression followed by a query. CTEs come in two forms: nonrecursive and recursive. A nonrecursive CTE blends the characteristics of derived tables and views. As with a derived table, you have a query to which you give an alias, potentially also providing aliases to the result columns, and the CTE doesn't persist after the outer query finishes. As with a view, you can refer to the table expression more than once in the outer query. Typically, you'd use a nonrecursive CTE when you need to refer to a table expression more than once and you don't want it to persist in the database.

For example, suppose that for each year, you want the total number of orders for that year and the previous year from the Northwind database's Orders table. The code in Listing 1 shows how you can get this result by using a CTE. Following the WITH keyword are the CTE's alias—YearlyOrders—and the result column aliases, as callout A shows. The body (at callout B) holds the query you provided an alias for, which returns the count of yearly orders. The outer query performs an outer join between two instances of YearlyOrders—CurYear and PrevYear—matching current and previous yearly orders. Figure 1 shows the result of running Listing 1's code.

To achieve the same result in pre-Yukon releases of SQL Server, you have two options. One option is to create a view and refer to it twice in the query, a method that forces you to create a persistent object in the database. The other option is to use two derived tables, which forces you to duplicate code.

Recursive CTEs introduce recursive capabilities to T-SQL queries. The body of a recursive CTE contains at least two queries (also called members) separated by a UNION ALL operator. The Anchor member is a query that SQL Server invokes once. The Recursive member references the CTE's name, which represents the result set that the previous step returned. SQL Server repeatedly invokes the Recursive member until the query returns an empty set. As with a nonrecursive CTE, the outer query refers to the CTE's name, but here, the reference represents a UNION ALL of the result of the Anchor member's invocation and the results of all the Recursive member's invocations.

Listing 2 shows an example of a recursive CTE. The purpose of this code is to return a given employee, along with all direct and indirect subordinates of that employee from the Employees table in the Northwind database. The existing Employees table contains nine employees. Each employee reports to a manager, whose employee ID is stored in the ReportsTo column. Andrew (employee ID 2) has NULL in the ReportsTo column because he is the top manager. If you request Andrew and his direct and indirect subordinates, you should get all nine employees in the result.

Let's go over the code in Listing 2 to understand how it works. SQL Server invokes the Anchor member at callout A, returning Andrew's row plus a pseudo column called lvl that contains a value of 0 because he's the starting point. The Recursive member at callout B joins the Employees table to the result of the previous step, returning subordinates of the employees from the previous step. The lvl column in the Recursive query returns the value of the lvl column from the previous step plus 1 to show how far each employee is from Andrew in the management chain.

The first invocation of the Recursive member returns Andrew's subordinates Nancy, Janet, Margaret, Steven, and Laura, with 1 in their lvl column. The second invocation returns the subordinates of the previously returned employees (those with 2 in the lvl column): Michael, Robert, and Anne. The third invocation returns an empty set because none of the last-returned employees has subordinates, and recursion terminates.

The outer query at callout C that references the CTE's name returns the unified result of all invocations, including that of the Anchor member. Figure 2 shows the result you get by running Listing 2's code. To achieve the same result in earlier releases of SQL Server, you have to either store redundant data in your database to describe the employee hierarchy or use cursors and temporary tables and write iterative code, which is time-consuming and harder to maintain.

PIVOT and UNPIVOT

The new PIVOT operator lets you rotate rows into columns, optionally performing aggregations along the way. For example, suppose you want to return from the Orders table the count of yearly orders for each employee. You could write a simple GROUP BY query to get the information, but you want to return only one row for each employee, with the count of each year's orders in a separate column, as Figure 3 shows.

The PIVOT operator lets you easily get what you want, provided that you know which years you're looking for, as the following query shows:

WITH EmpOrders(orderid, empid, 
  orderyear)
  AS (SELECT OrderID, EmployeeID, 
  YEAR(OrderDate)  FROM Orders)
SELECT * FROM EmpOrders
  PIVOT(COUNT(orderid) FOR 
  orderyear IN(\[1996\], \[1997\], 
  \[1998\])) AS P
ORDER BY empid

I use a CTE in this query to isolate the columns I want to use with the PIVOT operator. You could use a derived table instead of the CTE; there's not much difference between the methods in this case because the query references the CTE only once.

The FOR clause within the PIVOT operator's body specifies the name of the column, orderyear, whose values you want to rotate into result columns. The IN clause specifies the values from orderyear that will become result columns. Before the FOR clause, you specify the aggregate function you want to use to calculate the result column values—in this case, the count of order IDs for each order year.

Behind the scenes, an implicit GROUP BY is taking place based on the columns that the PIVOT operator doesn't specify—empid in this case. So for each distinct employee ID, you get three columns—\[1996\], \[1997\], and \[1998\]—and for each intersection of employee ID and year, you get the count of orders.

The UNPIVOT operator does the opposite of the PIVOT operator—it lets you rotate columns into rows. As an example, create a temporary table called #EmpOrders by running the preceding PIVOT query and adding INTO #EmpOrders just before the FROM clause. Now, suppose you want to return the count of yearly orders for each employee, with each combination of employee and year in a separate row, as Figure 4 shows in abbreviated form. You run the following UNPIVOT query:

SELECT empid, CAST(orderyear AS
  int) AS orderyear, numorders
FROM #EmpOrders
  UNPIVOT(numorders FOR orderyear 
  IN(\[1996\], \[1997\], \[1998\]))
  AS P
ORDER BY empid, orderyear

Both orderyear and numorders will be result columns. The orderyear column will contain the order years 1996, 1997, and 1998 derived by the IN clause from the respective column names. The numorders column will contain the values that are currently stored in three columns: \[1996\], \[1997\], and \[1998\], each holding the number of orders for the respective year. I use the CAST() function to convert to integer the strings holding the column names for the order years.

Once you get used to the new syntax, writing PIVOT and UNPIVOT queries is simple. You can achieve the same results with earlier releases of SQL Server, but you have to write longer, more complex code.

TOP Enhancements

Microsoft modified the TOP option in a couple of important ways in Yukon. One addition is the long-awaited ability to specify an expression rather than a constant as an argument to TOP. For example, you can use a variable to limit the number of rows a query returns:

DECLARE @n AS int
SET @n = 5
SELECT TOP(@n) * FROM Orders ORDER BY OrderID

You can specify a more complex expression within the parentheses after TOP, including a self-contained subquery. When you don't use the PERCENT option, the result of the expression is a bigint value that specifies a number of rows, and when you use the PERCENT option, the result is a float value between 0 and 100 (specifying a percentage of rows).

Another enhancement lets you use TOP with INSERT, UPDATE, and DELETE statements. For example, the following code deletes historical data from a large table in batches of 10,000 rows:

WHILE 1=1
BEGIN
  DELETE TOP(10000) FROM Sales 
  WHERE dt < '20000101'
  IF @@rowcount <10000 BREAK
END

If you don't split a large DELETE into smaller chunks like this code shows, your transaction log might expand considerably, and row locks held during the operation might escalate to a full table lock. When you split the DELETE operation into chunks of n rows, each deletion of n rows is considered a separate transaction. After the transaction finishes, SQL Server can reuse that transaction log space, assuming you perform log backups during the operation. Furthermore, with a small number of rows per chunk, SQL Server usually has enough resources to manage individual row-level locks without having to escalate to a full table lock. Note that you can delete rows in chunks in SQL Server 2000 and earlier releases by using the SET ROWCOUNT option, but the new TOP option performs better.

CROSS APPLY and OUTER APPLY

CROSS APPLY and OUTER APPLY are new relational operators that let you invoke a table-valued function for each row of an outer query, optionally using the outer row's columns as the function's arguments. For example, the following code creates a user-defined function (UDF) called fn_cust-orders() that accepts a customer ID and a number as arguments and returns a table containing the requested number of most-recent orders for the given customer:

CREATE FUNCTION fn_custorders
  (@custid AS nchar(5), @n AS
   int) RETURNS TABLE
AS
RETURN
  SELECT TOP (@n) * FROM Orders
  WHERE CustomerID = @custid
  ORDER BY OrderDate DESC,
  OrderID DESC

The following query uses the CROSS APPLY operator to return, for each customer in the Customers table, the three most-recent orders:

SELECT C.CustomerID, C.CompanyName, O.*
FROM Customers AS C
  CROSS APPLY fn_custorders
  (C.CustomerID, 3) AS O
ORDER BY C.CustomerID, O.OrderDate DESC, O.OrderID DESC

This query returns 263 rows for 89 out of the 91 existing customers in the result table. The query eliminates the customers who made no orders (FISSA and PARIS) because the fn_custorders() function returns an empty set for them.

If you want to include rows from the outer query for an empty set that the function returns, you'd use the OUTER APPLY operator instead of CROSS APPLY. Rows that have no matches would hold NULLs in the columns that the function returns.

With earlier versions of SQL Server, you can't invoke a table-valued function in one query for each row of an outer query. In those earlier releases, to return the @n most-recent orders for each customer, where @n is an argument, you have to write more complex code that uses dynamic execution and subqueries, and you end up with a solution that performs worse and is harder to maintain.

DML with Results and WAITFOR

In Yukon, your modification statements can return data instead of just modifying it. The INSERT, UPDATE, and DELETE statements let you include an OUTPUT clause in which you specify the data you want to return. You specify the data you're interested in by referring to the inserted and deleted tables similar to how you use them in triggers.

The WAITFOR command has changed, too. In SQL Server 2000 and earlier, the only options available with the WAITFOR command are specifying a delay period to expire or a datetime value to reach before control passes to the next command. Now, you can specify a statement and wait until it processes at least one row or until it reaches a certain timeout value before passing control to the next command.

You can combine these two features so that you can wait for a modification to process rows and get data back from it. For example, suppose you maintain a queue of messages in a table called Queue:

CREATE TABLE Queue(
  keycol int NOT NULL 
  IDENTITY PRIMARY KEY,
  dt datetime NOT NULL
  DEFAULT(GETDATE()),
  processed bit NOT NULL 
  DEFAULT(0),
  msg varchar(50) NOT NULL)

Different processes insert new messages into the Queue table, which you can simulate by running the following code from several different connections:

WHILE 1 = 1
BEGIN
  INSERT INTO Queue(msg)
    VALUES('msg' + 
  CAST(CAST(RAND()*9999999 AS 
  int) AS varchar (7)))
  WAITFOR DELAY '00:00:01'
END

I used the WAITFOR command here in the traditional manner to add a delay of 1 second between inserts.

You need to have several processes handle the streaming messages and mark them as processed. You can simulate this by running several instances of the code that Listing 3 shows. This code uses three new features. The WAITFOR command waits until its argument (the UPDATE statement) has processed at least one row before passing control to the following command, COMMIT. The UPDATE statement returns the new image of the affected rows as output and skips locked rows because I specified the READPAST hint.

In earlier releases of SQL Server, you can use the READPAST hint only with SELECT statements. In Yukon, you can use the READPAST hint with UPDATE and DELETE statements, letting multiple parallel processes handle nonlocked rows and skip locked ones that other sessions are processing. Similarly, you can have multiple simultaneous processes delete nonlocked, processed messages:

WHILE 1 = 1
  WAITFOR(DELETE FROM Queue WITH
  (READPAST)
      OUTPUT DELETED.*
      WHERE processed = 1)

If you want to provide queuing capabilities in earlier versions of SQL Server, you have to implement the logic outside SQL Server. These are just simple examples of using the new WAITFOR and DML-with-results features. Yukon also introduces a whole new queuing infrastructure and implements a queuing platform called Service Broker, which is too big a topic to discuss in this article; look for Service Broker coverage in future articles.

DDL Triggers and DML Event Notifications

Earlier releases of SQL Server don't let you trap DDL events and fire a trigger as a result. DBAs have been asking for such ability for a long time, mainly for auditing and to prevent schema changes by even privileged users. With Yukon, you can create triggers on DDL events such as creating or dropping objects at the server or database level. Within the trigger, you can access information related to the event (e.g., the process ID issuing the event, the time it took place, the statement issued) by invoking the new EventData() function, which returns the information in XML format. For example, to trap login-related events at the server level, you can create the following trigger:

CREATE TRIGGER audit_ddl_logins ON ALL SERVER
  FOR CREATE_LOGIN, ALTER_LOGIN, 
  DROP_LOGIN
AS
PRINT 'DDL LOGIN took place.'
PRINT EventData()

To test the trigger, run the following code, which creates a login, alters it, then drops it:

CREATE LOGIN login1 WITH PASSWORD = '123'
ALTER LOGIN login1 WITH PASSWORD = 'xyz'
DROP LOGIN login1

You can examine the event data inside the trigger and take appropriate action, including rolling back the operation. To drop the trigger, issue the following code:

DROP TRIGGER audit_ddl_logins 
ON ALL SERVER

Similarly, you can use triggers to trap specific or all DDL events at the database level. For example, the following trigger traps all DDL events issued against the database you created the trigger in:

CREATE TRIGGER audit_ddl_events ON DATABASE
  FOR DDL_DATABASE_LEVEL_EVENTS
AS
PRINT 'DDL event took place in database
 ' + DB_NAME() + '.'
PRINT EventData()

To test the trigger, run the following code, which creates a table and a view, then drops them:

CREATE TABLE T1(col1 int)
GO
CREATE VIEW V1 AS 
SELECT * FROM T1
GO
DROP TABLE T1
DROP VIEW V1

Note that, besides returning event information, the code invokes the trigger and prints a message indicating that a DDL event took place. To drop the trigger, run the following code:

DROP TRIGGER audit_ddl_events 
ON DATABASE

Triggers work synchronously, meaning that control doesn't pass back to the application that caused the trigger to fire until the trigger's code finishes. Yukon also introduces asynchronous event consumption. Several different applications can subscribe to be notified when an event issued by any process, even a DML event, takes place. And the application that issued the code that activated the event doesn't need to wait for all subscribing applications to finish their activity to continue with its activity.

Error Handling

Another long-awaited T-SQL enhancement involves error management. Yukon introduces a TRY/CATCH construct that lets you implement error handling similar to that in other development environments. You can now trap errors that would terminate the connection in previous releases of SQL Server and deal with them in an elegant, organized way. As an example, create the T1 table:

CREATE TABLE T1(col1 int NOT NULL PRIMARY KEY)

The code in Listing 4 shows how you can use TRY/CATCH to trap errors such as primary key violations or conversion errors when you're inserting data into the table.

The TRY construct works for transaction-abort errors only, meaning that you have to set the XACT_ABORT option to ON and write your code within a transaction. In the TRY block, you include the transaction for which you want to trap errors. On an error, the transaction enters a "failed" state. You remain in the transaction's context—SQL Server holds the locks and doesn't reverse the transaction's work. Control passes to the nearest CATCH block, where you can examine the error and take corrective measures after rolling back the failed transaction.

When you run Listing 4's code the first time, the CATCH block isn't activated because no error occurs. When you run the code the second time, the CATCH block traps a primary key violation. But see what happens when you comment out the first INSERT statement and uncomment the second, then run the code again. You've trapped a conversion error.

Data Types and the BULK Rowset Provider

Yukon changes data types in several interesting ways. For one thing, it introduces a new XML data type that lets you store and manipulate XML data in variables and in table columns. In addition, you can define dynamic columns such as varchar, nvarchar, and varbinary by using the new MAX option instead of specifying an explicit size, so you have a more natural alternative to the large object (LOB) data types text, ntext, and image. The MAX option lets you store up to 2GB of data in a column. By "more natural," I mean that with the new MAX option, you can use regular DML to manipulate these large dynamic columns instead of using the cumbersome WRITETEXT, UPDATETEXT, and READTEXT commands as you have to in previous versions to manipulate LOBs. Also, Yukon introduces a new BULK rowset provider that lets you manipulate files as rowsets easily and elegantly.

To illustrate these changes, run the code that Listing 5 shows. This code creates the TestLargeObjects table and inserts a row into it. To update the XML column x with the XML result of a query, run the following code:

UPDATE TestLargeObjects
  SET x = (SELECT * FROM Customers FOR XML AUTO)
WHERE keycol = 1

To load a text file into the varchar(MAX) column vc, you use the OPENROWSET() function with the BULK provider, specifying the file path and the SINGLE_CLOB option as follows:

UPDATE TestLargeObjects
  SET vc = (SELECT vc 
  FROM OPENROWSET(
    BULK 'c:\temp\textfile1.txt', 
    SINGLE_CLOB) AS CLOB(vc))
WHERE keycol = 1

The SINGLE_CLOB option means that you're manipulating a single text file that will be returned as a single row with a single column. Similarly, you can load a Unicode-formatted file into an nvarchar(MAX) column by specifying the SINGLE_NCLOB option or load a binary file to a varbinary(MAX) column by specifying the SINGLE_BLOB option.

Probably the most exciting change is the separation of date and time data types. The following code creates a table with separate date and time columns and inserts a new row into it:

CREATE TABLE DateTimeTest(datecol date, timecol time)
INSERT INTO DateTimeTest
  VALUES(CAST('2003-11-01' AS 
  date), CAST('10:30:59.999999' 
  AS time))

The new date and time data types are CLR-based, meaning that they were developed based on the .NET infrastructure that Yukon supports. As a result, the separate date and time data types provide many useful methods and properties. For example, applying the ConvertToString() method to both date and time lets you format the date and time values according to a specified format string, as the following query shows:

SELECT
datecol::ConvertToString(
'MM/dd/yyyy')  AS thedate,
timecol::ConvertToString(
'HH:mm:ss.fff') AS thetime
FROM DateTimeTest

Figure 5 shows the result of running this query. The result column thedate contains a formatted date value in the form MM/dd/yyyy, and the result column thetime contains a formatted time in the form HH:mm:ss.fff.

The Future of T-SQL

As you can see, Yukon brings many exciting T-SQL—related enhancements and new features. With Yukon, you need to write less code to achieve the same results as in earlier releases, and you get better-performing solutions. As a set-based relational language, T-SQL has always beem strong, but now it's even stronger.

Comments

Plain text