Pivot (or Unpivot) Your Data

Data integrity, data consistency, and avoiding anomalies are the most important considerations when you store data in your database. However, when you query data, you usually think about your application's users and the information that you provide them. You often want to let them look at the data in a different format than the raw form in which it's stored. When you query data for analysis purposes, you typically want to produce either a small result set that fits on a screen or an intermediate result set that can be more easily manipulated by another query than your source data can. For these purposes, you might want to pivot your data. When you pivot data, you rotate rows to columns or columns to rows. That is, you can rotate data from multiple rows to multiple columns in a single row or unpivot data from multiple columns of a single row into result rows. You can use the pivot technique to rotate data in columns and rows for analysis purposes. Many analysis tools provide pivot capabilities, such as Microsoft Excel's PivotTable, Microsoft Access's Transform command, and the Pivot Table ActiveX Control. However, if you want to pivot or unpivot data in SQL Server 2000, you have to write complex T-SQL expressions. SQL Server 2005 provides native T-SQL PIVOT and UNPIVOT operators that simplify these operations. In this article, I discuss some techniques for pivoting data in SQL Server 2000 and introduce you to SQL Server 2005's PIVOT operator.

Pivoting Data in SQL Server 2000

To demonstrate pivoting techniques, let's use the Orders table that Listing 1 creates and populates. Say you want to analyze order quantities by customer and order year. You can group the data according to custid and YEAR(orderdate), then return the SUM(qty) value, but the total quantity for each combination of customer and order year will appear in a separate result row. For example, customer A ordered in 2002, 2003, and 2004. Hence, customer A has three rows in the result set--one row for each year. Assume that for each customer, you want one result row that has one column for each order year--2002, 2003, and 2004--and total quantities for each year in separate columns, as Table 1 shows. Writing a query that produces these results in SQL Server 2000 requires some T-SQL acrobatics. But I can show you some techniques that will solve the problem. The first solution uses an aggregate query that contains CASE expressions:

SELECT custid,
  SUM(CASE WHEN orderyear = 2002 THEN qty END) AS [2002],
  SUM(CASE WHEN orderyear = 2003 THEN qty END) AS [2003],
  SUM(CASE WHEN orderyear = 2004 THEN qty END) AS [2004]
FROM (SELECT custid, YEAR(orderdate) AS orderyear, qty FROM dbo.Orders) AS D
GROUP BY custid

This query defines the derived table D and extracts from the Orders table only the columns of interest for the pivot operation (custid, orderyear, and qty). To return one row for each customer, the query groups the data by custid. The SELECT list performs the magic of extracting quantities from multiple source rows and creating multiple result columns within a single row. The query contains a CASE expression for each order year. This example involves three order years: 2002, 2003, and 2004. The CASE expressions return the order quantity only if the base row contains the order year that the expression contains. Otherwise, the CASE expression returns NULL because you didn't specify an ELSE clause. You can explicitly specify ELSE NULL if you want, but because you want to get a NULL when the CASE doesn't find a match, you might prefer to use the shorter syntax. Then the SUM() function summarizes all the order quantities and ignores the NULLs. In the query, the first SUM(CASE) expression summarizes the order quantities for order year 2002, the second expression summarizes for 2003, and the third expression summarizes for 2004. So you get all the yearly order quantities for a particular customer in one result row.

This example uses only 3 years, so the query is short. A larger number of years will make the query much longer. However, you can shorten the query by applying a different pivoting technique--create a matrix table that contains a row and a column for each order year. A matrix table is a helper table that contains a row for each source attribute that you're about to rotate and a column for each target attribute that you're about to generate. In this example, we're rotating source year rows into target year columns so you have three rows (for the years 2002, 2003, and 2004) and three columns (for the same years). You get nine row and column intersections in a three-by-three table. The matrix table will help you isolate the column values that you want to aggregate into the different result columns. Listing 2 creates and populates the Matrix table with three rows for the order years 2002, 2003, and 2004. The Matrix table has an orderyear column and a column that uses the naming format yorderyear for each order year that you want to handle. In our example, the table contains the columns orderyear, y2002, y2003, and y2004. The table has a row for each year--three rows in this case--and stores the order year in the orderyear column, a 1 in the column whose value is the same year as its name, and NULLs in all other columns. For example, the row for order year 2002 has 2002 in the orderyear column, a 1 in the column y2002, NULL in the column y2003 and NULL in the column y2004. Issuing a SELECT * statement against the Matrix table produces the results that Table 2 shows.

Using the JOIN condition D.orderyear = M.orderyear to join the derived table D with the Matrix table (aliased as M) gets the desired results. Each order in table D gets one matching row in the matrix table that represents the same order year as the one in table D. You group the result by custid as you did in the previous query and you again write an expression for each result column, but the expressions are much shorter this time. For example, to calculate the total quantity for order year 2002, you use the expression SUM(qty*y2002). Out of all the base rows that belong to a particular customer, only the rows that have the order year value 2002 return quantities; all others return NULLs--thus you get the correct total quantity for order year 2002. Here's the complete solution query:

SELECT custid,
  SUM(qty*y2002) AS [2002],
  SUM(qty*y2003) AS [2003],
  SUM(qty*y2004) AS [2004]
FROM (SELECT custid, YEAR(orderdate) AS orderyear, qty FROM dbo.Orders) AS D
  JOIN dbo.Matrix AS M ON D.orderyear = M.orderyear
GROUP BY custid

These techniques let you pivot data, but they aren't easy or obvious. Once you're familiar with the new PIVOT operator in SQL Server 2005, you'll find it's much simpler and more intuitive.

Pivoting Data in SQL Server 2005

Before you start writing the solution for pivoting in SQL Server 2005, run the code in Listing 1 in your SLQ Server 2005 server to create and populate the Orders table. The following SQL Server 2005 query uses the PIVOT operator to return the pivoted yearly order quantities for each customer:

SELECT *
FROM (SELECT custid, YEAR(orderdate) AS orderyear, qty FROM dbo.Orders) AS D
  PIVOT(SUM(qty) FOR orderyear IN([2002],[2003],[2004])) AS P

The query's FROM clause contains the derived table D and the PIVOT operator, followed by parentheses in which you specify arguments for the PIVOT operation. You specify columns from D as arguments to indicate which source column holds the target column names and which column contains the values you want to aggregate. The PIVOT operator includes three parts: an aggregate operator, the source column, and a list of order year values. First, you specify the aggregate you want to calculate--in this case, SUM(qty). Second, the FOR clause indicates the source column name (orderyear) that contains the order years that will appear as result columns. And finally, the IN clause contains a list of values ([2002],[2003],[2004])--the order years that you want to turn into result columns.

An astute reader will notice that the custid column that appeared in the SQL Server 2000 GROUP BY clause is missing in the 2005 query. The PIVOT operator uses the columns that appear in the derived table D but not in the operator's arguments to figure out which column is the grouping column. In this case, the aggregate operator refers to qty and the FOR clause refers to orderyear. The only column that isn't specified in any of PIVOT's arguments is custid, so PIVOT uses custid as the grouping column. Honestly, I find the implicit grouping confusing and think it would have been clearer if the PIVOT operator required you to specify the grouping columns.

More to Come

Pivoting techniques rotate data from rows to columns and give you valuable flexibility when you analyze data. The new PIVOT operator in SQL Server 2005 uses a native T-SQL construct to let you pivot data using simpler and shorter code. Now that you're familiar with the basics of pivoting, you're ready to dive into the more advanced techniques that I'll discuss next month--dynamic pivoting and string concatenation.

Comments

Plain text