Aggregating Time Series Data

Time and date data is very common in SQL Server databases. The datetime and smalldatetime data types in SQL Server make it easy to store date and time data for a variety of purposes. But manipulating this data in SQL Server isn't always easy. Let's look at a difficult date problem and how to solve it.

Divide and Conquer

Consider a table of magazine subscriptions that has the following columns: subscriber_name, magazine_name, subscription_start (date), and subscription_end (date), as callout A in Listing 1 shows. The table holds each subscription renewal in a separate row. If you look at the sample input data at callout B in Listing 1, can you determine the continuous periods of time in which Phil subscribed to Car and Driver? Phil's apparent renewal of his subscription in 1998, 1999, and 2000 isn't relevant to this problem. You need to determine when each subscription began and when the last renewal of that particular subscription lapsed. You need a query that will return the results that Figure 1 shows; those results eliminate renewal rows while retaining the lapse in Phil's subscription between 1991 and 1997.

How do you consolidate sets of related date intervals and still preserve the gaps in the date sequence? This problem isn't easy to solve in SQL Server; however, you can break it into a series of manageable pieces. First, you need to find the start dates of Phil's subscriptions. These start dates have a particular property in common: They don't fall within the date interval of any other relevant record (i.e., a record that has the same values in subscriber_name and magazine_name). Note that for our purposes a date that falls within an interval either falls between the interval's start and end dates or is equal to that interval's start or end date.

Consider Phil's July 1, 1999, renewal of Car and Driver at callout C in Listing 1. The subscription_start in this record, July 1, 1999, falls within the previous record's date interval, which runs inclusively from July 1, 1998, through July 1, 1999. However, Phil's July 1, 1997, subscription_start doesn't fall within the date range of any of Phil's other subscription records. So, the query you're looking for should give you the July 1, 1997, record but not the July 1, 1999, record. The query that Listing 2 shows is such a query and uses NOT EXISTS with a correlated subquery to ensure that the subscription start date doesn't fall within any relevant date interval.

Next, to find the end dates of related date intervals, you can use the same approach that you used to find their start dates. The end date of a related sequence of records is that end date that doesn't fall within another relevant date interval. Listing 3 shows the query that determines Phil's subscriptions' end dates. The start- and end-date queries should always return the same number of rows. If they do, you can match each start date with its corresponding end date to create your final result set. If they don't, you have an error—either in the query or the data.

For a given subscriber, magazine, and start date, the correct end date is the next end date. To determine the next end date, choose any start-date record. Among the end-date records that share the same subscriber and magazine as that start-date record, find the record that has the earliest end date after the chosen start date. This end date is the next end date that you need to find to reunite the start-and end-date rows. Listing 4 shows how to implement this approach in SQL Server with the MIN() function. This listing contains two undefined fields: \{Start Date Results\} and \{End Date Results\}. To transform Listing 4 into a legitimate query, replace \{Start Date Results\} with the code in Listing 2, then replace \{End Date Results\} with the code in Listing 3. Listing 5 shows the final query.

Style and Performance

You could use an OUTER JOIN to rewrite the NOT EXISTS construct in the start- and end-date queries that Listings 2 and 3 show. In some cases, changing this construct might make the query faster, but I suggest that you test it both ways to be sure. Listing 6 shows how you'd use an OUTER JOIN construct to rewrite the start-date query. As in Listing 2, the alias m2 represents rows in magazine_subs that shouldn't exist for start-of-subscription records. The key to this OUTER JOIN construct is the condition m2.subscriber_name IS NULL. Because the code in Listing 1 defines the subscriber_name column as non-nullable (i.e., NOT NULL), the only way subscriber_name can be NULL in the WHERE clause is if m2 contains no matching rows.

The situation that inspired this article was a data-mart extract involving tables with millions of rows and client pressure to shorten load times. The algorithm in Listing 5, even with the OUTER JOIN enhancement, is too slow. The queries in Listings 2 and 3 are also too slow for large tables because they involve joining a table to itself. When you use either the NOT EXISTS or the LEFT JOIN method, the query mentions magazine_subs twice, so a join must occur. Internally, the query engine in SQL Server 2000 and 7.0 can choose among three basic types of joins: nested loop, hash, and merge. Microsoft added hash and merge joins in SQL Server 7.0 to improve performance in situations like the one we're discussing: joins involving entire tables or large quantities of data. Indexes just aren't very helpful when you're dealing with tables of large proportions. The optimizer takes into account table size, distribution statistics, and indexes in selecting a join method, so the optimizer often makes the best selection.

In a data-mart extract, adding an index to the table slowed my query down. The index led SQL Server to choose a merge join rather than a hash join. You can tell by looking at the graphical query plans in Query Analyzer. A merge join would have been fine, except that the inequality conditions in the JOIN clause forced the merge to be many-to-many (M:N), which is much slower than one-to-many (1:M). In this case, leaving the indexes off made the merge join less attractive to the SQL Server optimizer. SQL Server chose a hash join, and the query ran faster. If you implement this algorithm for your project, you should test and examine the query plans before you add or remove any indexes.

Tweaking the indexes can be helpful, but reducing the quantity of data before you join the tables is better. One approach to reducing the quantity of data to join is to use the start-date results to find end dates more quickly. Depending on the data, the start-date table might be a fraction of the size of the original table. If you can join the start-date table to the original table instead of joining the original table to itself, the query's performance should improve.

You're really looking for two kinds of end dates. One kind of end date falls at the start of a temporary lapse in the subscription: for example, Phil's Car and Driver expiration of October 1, 1991. The other kind of end date is the last one for a particular subscriber and magazine—for example, Phil's final Car and Driver expiration of July 1, 2001, and Andrea's subscription expirations of January 10, 2000, and March 7, 2000. The end dates that you want in your final results fall into one of these two end-date categories.

If you've already generated a table containing the results of Listing 2 (the start-date query), you can use this table to find the first category of end dates. (Callout A in Listing 7 creates the start-date table, and callout B in Listing 7 populates the table.) The first category of end dates all fall before some start date. In fact, the end dates you want are the "last," "maximum," or "most recent" ones before a given start date. Callout C in Listing 7 shows the query that finds these end dates, those that begin a temporary lapse of subscription.

Finding the second kind of end dates doesn't require a join at all:

SELECT m.subscriber_name, m.magazine_name, MAX(m.subscription_end) AS 
  subscription_end
FROM magazine_subs m
GROUP BY m.subscriber_name, m.magazine_name

A faster alternative for finding end dates is to combine these two queries by using a UNION ALL, as Listing 8 shows. A simple UNION would also work, but it would not perform as well. UNION alone implicitly adds DISTINCT to a query. In our case, the DISTINCT is unnecessary and would force the query engine to do extra work. In real life, the data-mart equivalent of magazine_subs contained 16 million rows. Using the query in Listing 3 to find the end dates took about 2 hours; the query in Listing 8 took 40 minutes.

This algorithm works just as well when you're grouping columns other than subscriber_name and magazine_name. For example, if you're only interested in the periods of time during which a subscriber was receiving any magazine, you can omit magazine_name from Listing 4 to get the code that Listing 9 shows. The result still reflects the gap in Phil's subscriptions, but now Andrea's Poodle Patrol phase is swallowed up in her long Cat Fancy subscription. (To see the periods when anyone was subscribing to any magazine, you would remove both magazine_name and subscriber_name.)

More Than One Application

The continuous time and date problem comes up more often than you might think—for example, collecting Web tracking data (Web site hits by site URL and caller IP address) into sessions. This problem looks quite different from the continuous time and date problem, but it isn't. Rather than focusing on overlapping intervals, this problem needs to group points in time. The sidebar "Grouping Web Site Hits into Sessions," shows how you could use a variation of the magazine-subscription problem to solve the problem of identifying Web sessions.

This T-SQL solution avoids loops and cursors. Developing such a set-oriented algorithm can be challenging, but the result is more elegant than an iterative approach and, in my test, performed better. The more you look around at various problems you need to solve, the more applications for this algorithm you're likely to find.

Comments

Plain text