Reporting on Non-Existent Data

Downloads
125130.zip

In the Reader to Reader article "T-SQL Statement Tracks Transaction-Less Dates," Saravanan Radhakrisnan presented a challenging task that I refer to as reporting on non-existent data. He had to write a T-SQL query that would determine which stores didn't have any transactions during a one-week period, but the table being queried included only the stores' IDs and the dates on which each store had a transaction. Listing 1 shows his solution.

Listing 1: The Original Query

SELECT st1.Store_ID, st2.NoTransactionDate
  FROM (SELECT DISTINCT Store_ID FROM MySalesTable (NOLOCK)) AS st1,
    (SELECT DISTINCT TransactionDate AS NoTransactionDate
    FROM MySalesTable (NOLOCK)) AS st2
  WHERE st2.NoTransactionDate NOT IN
    (SELECT DISTINCT st3.TransactionDate FROM MySalesTable st3
    (NOLOCK) WHERE st3.store_id = st1.store_id)
  ORDER BY st1.store_id, st2.NoTransactionDate
GO

Although this query works, it has some shortcomings:

As Radhakrisnan mentions, if none of the stores have a transaction on a certain day, the query won't return any results for that particular day for all the stores. So, for example, if all the stores were closed on a national holiday and therefore didn't have any sales, that day won't appear in the results.
If a store doesn't have any sales for all the days in the specified period, that store won't appear in the results.
The query uses T-SQL functionality that isn't recommended because of the poor performance it can cause. Specifically, the query uses three derived queries with the DISTINCT clause and the NOT IN construction. You can encounter performance problems when derived tables get too large to use indexes for query optimization.

I'd like to call attention to several different ways to work around these shortcomings.

How to Include All the Days

If you want the results to include all the days in the reporting period, even those days without any transactions, you can use the code in Listing 2. Here's how this code works. After creating the dbo.MySalesTable2 table, the code populates it with data that has a "hole" in the sales date range. (This is the same data that Radhakrisnan used for his query, except that the transaction for October 03, 2009, isn't inserted.) Then, the query in callout A runs. The first statement in callout A declares the @BofWeek variable, which defines the first day of the reporting period (in this case, a week). The query uses the @BofWeek variable as a base in the table constructor clause to generate the seven sequential days needed for the reporting period.

In addition to including all the days, the revamped query in Listing 2 performs better than the original query in Listing 1 because it minimizes the use of T-SQL functionality that's not recommended.

Listing 2: Query That Includes All Days

CREATE TABLE dbo.MySalesTable2 ( Store_ID INT, TransactionDate SMALLDATETIME ) GO -- Populate the table. INSERT INTO dbo.MySalesTable2 SELECT 100, '2009-10-05' UNION SELECT 200, '2009-10-05' UNION SELECT 200, '2009-10-06' UNION SELECT 300, '2009-10-01' UNION SELECT 300, '2009-10-07' UNION SELECT 400, '2009-10-04' UNION SELECT 400, '2009-10-06' UNION SELECT 500, '2009-10-01' UNION SELECT 500, '2009-10-02' UNION -- Transaction for October 03, 2009, not inserted. -- SELECT 500, '2009-10-03' UNION SELECT 500, '2009-10-04' UNION SELECT 500, '2009-10-05' UNION SELECT 500, '2009-10-06' UNION SELECT 500, '2009-10-07' GO -- BEGIN CALLOUT A DECLARE @BofWeek datetime = '2009-10-01 00:00:00' SELECT st2.Store_ID, st2.Day_of_Week   FROM     (SELECT st.Store_ID, DATES.Day_of_Week       FROM (         VALUES           (CONVERT(varchar(35),@BofWeek ,101)),           (CONVERT(varchar(35),dateadd(DD,1,@BofWeek),101)),           (CONVERT(varchar(35),dateadd(DD,2,@BofWeek),101)),           (CONVERT(varchar(35),dateadd(DD,3,@BofWeek),101)),           (CONVERT(varchar(35),dateadd(DD,4,@BofWeek),101)),           (CONVERT(varchar(35),dateadd(DD,5,@BofWeek),101)),           (CONVERT(varchar(35),dateadd(DD,6,@BofWeek),101))           ) DATES(Day_of_Week)           CROSS JOIN           (SELECT DISTINCT Store_ID FROM dbo.MySalesTable2 ) st         ) AS st2       LEFT JOIN dbo.MySalesTable2 st3         ON st3.Store_ID = st2.Store_ID AND           st3.TransactionDate = st2.Day_of_Week   WHERE st3.TransactionDate IS NULL   ORDER BY st2.Store_ID, st2.Day_of_Week GO -- END CALLOUT A

As you can see, the revamped query uses only one derived query with the DISTINCT clause instead of three. Plus, the NOT IN construction is replaced with a LEFT OUTER JOIN and a WHERE clause. In my testing environment, the revamped query was more than 15 percent faster than the original query.

How to Include Every Store

If you want the results to include all the stores, even those stores without any transactions during the reporting period, you need to change the query’s internal logic. Instead of using transactional data (in OLTP systems) or a fact table (in data warehouse/OLAP systems) as a source for obtaining the list of stores, you need to introduce a look-up table (OLTP) or dimension table (data warehouse/OLAP). Then, to get the list of stores, you replace

(SELECT DISTINCT Store_ID
   FROM dbo.MySalesTable2) st

with

(SELECT Store_ID
   FROM dbo.MyStoresTable) st

in the revamped query.

The code in Listing 3 shows this solution. It creates a new table named dbo.MyStoresTable. This table is populated with the five stores in dbo.MySalesTable2 (stores with IDs 100, 200, 300, 400, and 500) and adds two new stores (stores with IDs 600 and 700) that don't have any transactions. If you run the code in Listing 3, you'll see that the results include all seven stores, even though stores 600 and 700 don't have any transactions during the reporting period.

Listing 3: Query That Includes All Stores

CREATE TABLE \\[dbo\\].\\[MyStoresTable\\](
  \\[Store_ID\\] \\[int\\] NOT NULL,
CONSTRAINT \\[PK_MyStores\\] PRIMARY KEY CLUSTERED (\\[Store_ID\\] ASC) )
INSERT INTO dbo.MyStoresTable (Store_ID)
  VALUES (100),(200),(300),(400),(500),(600),(700)

DECLARE @BofWeek datetime = '2009-10-01 00:00:00'
SELECT st2.Store_ID, st2.Day_of_Week
  FROM
    (SELECT st.Store_ID, DATES.Day_of_Week
      FROM (
        VALUES
          (CONVERT(varchar(35),@BofWeek ,101)),
          (CONVERT(varchar(35),dateadd(DD,1,@BofWeek),101)),
          (CONVERT(varchar(35),dateadd(DD,2,@BofWeek),101)),
          (CONVERT(varchar(35),dateadd(DD,3,@BofWeek),101)),
          (CONVERT(varchar(35),dateadd(DD,4,@BofWeek),101)),
          (CONVERT(varchar(35),dateadd(DD,5,@BofWeek),101)),
          (CONVERT(varchar(35),dateadd(DD,6,@BofWeek),101))
          ) DATES(Day_of_Week)
          CROSS JOIN
          (SELECT Store_ID FROM dbo.MyStoresTable ) st
        ) AS st2
      LEFT JOIN dbo.MySalesTable2 st3
        ON st3.Store_ID = st2.Store_ID AND
          st3.TransactionDate = st2.Day_of_Week
  WHERE st3.TransactionDate IS NULL
  ORDER BY st2.Store_ID, st2.Day_of_Week
GO

How to Optimize Performance

Sometimes when you have a large number of rows (e.g., 5,000 to 10,000 rows) in the derived tables and large fact tables, implementing indexed temporary tables can increase query performance. Listing 4 shows a version of the revamped query that uses an indexed temporary table. Note that this code uses Radhakrisnan's original data (i.e., data that includes the October 03, 2009, transaction), which was created with MySalesTable.Table.sql.

Listing 4: Query That Uses Indexed Temporary Tables

DECLARE @BofWeek datetime = '2009-10-01 00:00:00' CREATE TABLE #StoreDate   ( Store_ID int, TransactionDate datetime ) INSERT INTO #StoreDate ( Store_ID, TransactionDate ) SELECT st.Store_ID, DATES.Day_of_Week   FROM (     VALUES       (CONVERT(varchar(35),@BofWeek ,101)),       (CONVERT(varchar(35),dateadd(DD,1,@BofWeek),101)),       (CONVERT(varchar(35),dateadd(DD,2,@BofWeek),101)),       (CONVERT(varchar(35),dateadd(DD,3,@BofWeek),101)),       (CONVERT(varchar(35),dateadd(DD,4,@BofWeek),101)),       (CONVERT(varchar(35),dateadd(DD,5,@BofWeek),101)),       (CONVERT(varchar(35),dateadd(DD,6,@BofWeek),101))       ) DATES(Day_of_Week)       CROSS JOIN       (SELECT DISTINCT Store_ID from dbo.MySalesTable) st CREATE INDEX ix_tmp_MySalesTable   ON #StoreDate (Store_ID, TransactionDate) SELECT st2.Store_ID, st2.TransactionDate   FROM #StoreDate AS st2     LEFT JOIN dbo.MySalesTable st3       ON st3.Store_ID = st2.Store_ID AND       st3.TransactionDate = st2.TransactionDate   WHERE st3.TransactionDate IS NULL ORDER BY st2.Store_ID, st2.TransactionDate DROP TABLE #StoreDate GO

Like the queries in Listings 2 and 3, the query in Listing 4 creates the @BofWeek variable, which defines the first day of the reporting period. Next, it uses the CREATE TABLE command to create the #StoreDate local temporary table, which has two columns: Store_ID and Transaction_Date. Using the INSERT INTO…SELECT clause, the code populates the #StoreDate temporary table with all possible Store_ID and Transaction_Date combinations. Finally, the code uses a CREATE INDEX statement to create an index for the #StoreDate temporary table.

You need to be cautious with solutions that use indexed temporary tables. A solution might work well in one environment but timeout in another, killing your application. For example, I initially tested the query in Listing 4 using a temporary table with 15,000 rows. When I changed number of rows for the temporary table to 16,000, the query's response time increased more than four times—from 120ms to 510ms. So, you need to know your production system workload types, SQL instance configuration, and hardware limitations if you plan to use indexed temporary tables.

Another way to optimize the performance of queries is to use the EXCEPT and INTERSECT operators, which were introduced in SQL Server 2005. These set-based operators can increase efficiency when you need to work with large data sets.

Listing 5 shows a version of the revamped query that uses the EXCEPT operator. (Once again, this code uses Radhakrisnan's original data.) The query in Listing 5 provides the fastest and most stable performance. It ran five times faster than Radhakrisnan's original query. (A table with a million rows was used for the test.)

Listing 5: Query That Uses the EXCEPT Operator

DECLARE @BofWeek datetime = '2009-10-01 00:00:00' 
SELECT st2.Store_ID, st2.Day_of_Week
  FROM
    (SELECT st.Store_ID, DATES.Day_of_Week
      FROM (
        VALUES
          (CONVERT(varchar(35),@BofWeek ,101)),
          (CONVERT(varchar(35),dateadd(DD,1,@BofWeek),101)),
          (CONVERT(varchar(35),dateadd(DD,2,@BofWeek),101)),
          (CONVERT(varchar(35),dateadd(DD,3,@BofWeek),101)),
          (CONVERT(varchar(35),dateadd(DD,4,@BofWeek),101)),
          (CONVERT(varchar(35),dateadd(DD,5,@BofWeek),101)),
          (CONVERT(varchar(35),dateadd(DD,6,@BofWeek),101))
          ) DATES(Day_of_Week)
          CROSS JOIN
          (SELECT DISTINCT Store_ID FROM dbo.MySalesTable) st
        ) AS st2
EXCEPT
SELECT Store_ID, TransactionDate
  FROM dbo.MySalesTable
ORDER BY 1,2
GO

A Few Last Words

Although I presented several ways to report on non-existent data, there are no doubt more ways. For example, if you needed to provide this type of report regularly, you could probably use a Microsoft business intelligence (BI) toolset. It would be interesting to get feedback about other possible solutions.

You can download the solutions I discussed (as well as MySalesTable.Table.sql) by clicking the Download the Code Here button near the top of the page. I've provided two versions of the code. The first set of listings is compatible with SQL Server 2008. (These are the listings you see here.) The second set can be executed in a SQL Server 2005 environment.

Comments

Plain text