Identifying Groups

Set-based wizardry foils cursors again


Jake Ashcraft, a systems engineer at Progressive Software Solutions in Albany, Oregon, contacted me about a problem to which he could find no set-based solution, just a solution that uses cursors and temporary tables. The problem is representative of a larger group of problems that deal with identifying groups of elements, or identifying subsets within sets. Let's look at the problem, then explore four set-based solutions that approach the problem from different angles. Thanks to Dieter Nöth—a true SQL wizard—for providing one of the solutions I present. As always, I encourage you to try your hand at solving the problem before looking at the solutions.

The Problem: Identifying Emission-Level Groups

The problem relates to a gas turbine­driven power station, where the emission levels of the turbine exhaust gases are monitored by the US Environmental Protection Agency (EPA). Gas emission levels are measured periodically and recorded in a table called Samples. You can run the script that Listing 1 shows to create the Samples table and populate it with some test data. The Samples table includes two columns: sample_ts, which is the sample's timestamp, and NOx, which is the emission level recorded.

The EPA needs to know when the level of NOx emissions from the gas turbines equals or exceeds 70. For each breach of this limit, a report needs to display the exact date and time when the level first exceeded 70 and when it returned to a value of less than 70. The report must also display the maximum NOx level during the intervening period. Table 1 shows the desired results, assuming the Samples table contains the test data that Listing 1 provides. Note that the elapsed time should be returned in the format hh:mm:ss.

You can use Figure 1, page 14, to help you visualize the desired results. The X axis shows the sample timestamps, and the Y axis shows the emission levels. The red elements in the graph highlight the groups of samples where the emission levels exceeded the prescribed value. The green symbols represent the maximum emission levels in each group. The challenge: From each group of interest, return the start timestamp, end timestamp, elapsed time, and maximum level measured.

Solution 1: Using Views

For the first solution, I took a modular approach—using views to break the solution into steps. First, I used the code that Listing 2, page 14, shows to create a view called VCurPrev that returns, for each sample timestamp, both the current NOx value and the previous one. The subquery returns the NOx value of the most recent sample of all the samples that were taken before the outer query's sample. So each sample retrieves the previous sample's NOx value. A SELECT * query against the VCurPrev view returns the output that Table 2 shows.

The second step is to create a view called VStartEnd that returns each group's start and end timestamps, as the code in Listing 3 shows. The query in the view joins an instance of VCurPrev called V1 to another instance of VCurPrev called V2. V1 represents starts of groups, and V2 represents ends of groups. The JOIN condition matches each row in V1 with all rows in V2 that have a more recent timestamp. The filter leaves only the rows that have a current value in V1 that's greater than or equal to 70 and a previous value in V1 that is less than 70 or is NULL. The filter also verifies that the current value in V2 is less than 70 and the previous value in V2 is greater than or equal to 70. In other words, the current value in V1 marks the start of a group, and the current value in V2 marks the end of a group. The SELECT list returns the timestamp of the start of a group; each start of a group might have several matching ends, but the SELECT list returns only the end that has the minimum timestamp. A SELECT * query against the VStartEnd view returns the output that Table 3 shows: three matching periods and their start and end times.

The third and final step in this solution is to write a query that joins the VStartEnd view with the Samples table. The code in Listing 4 matches each group in the VStartEnd view with all samples in the Samples table that fall between the group's starttime and endtime. For each group, the query uses the DATEDIFF() function to calculate the elapsed time in seconds between starttime and endtime, then calculates the maximum NOx value.

Note that the query returns the elapsed time only in seconds. If you want to format the elapsed time as hh:mm:ss, you need to write an expression that extracts the hours, minutes, and seconds; converts them to character strings; and concatenates them. You can write a user-defined function (UDF) that accepts a value in seconds as an argument and returns a formatted string. Running the script in Listing 5 creates the dbo.fn_fmtelapsed() UDF. The function uses integer division and modulo operations to extract the hours, minutes, and seconds. Then, it prefixes each part with a 0 if the part contains only one digit. After creating the UDF, you can execute the query that Listing 6 shows to get the desired results that Table 1 shows. This solution is good because it's based on the step-by-step modular approach. Such a solution is usually less prone to errors and easier to maintain than a solution that uses one gigantic query. On the other hand, this solution isn't optimal in terms of performance, as I explain later.

Solution 2: Using Joins and GROUP BY

Although the first solution works, I usually try to solve each problem in more than one way to see if an alternative solution would work better. The difference in complexity and performance of the various solutions can sometimes be striking. The query in Listing 7 uses the UDF I created earlier to provide an alternative solution to the problem.

This query uses a 3-way join between three instances of the Samples table called S, E, and M. S represents starts of groups, E represents ends of groups, and M (which stands for middle) represents samples that fall between S and E. To be more accurate, S represents the samples right before the starts of groups. The JOIN condition simply verifies that the end's timestamp is greater than the start's timestamp and that the middle's timestamp falls between the start's timestamp and the end's timestamp. The filter in the WHERE clause verifies that the samples before the starts of the groups are less than 70 and that the ends of the groups are also less than 70.

The main trick in this query happens in the HAVING clause. The code groups the rows by the start and end timestamps, and the HAVING clause leaves only those groups in which the minimum NOx value is greater than or equal to 70. If you're not sure how the query produces the groups, think about the requirements again—the start of a group is a sample that has a value that's greater than or equal to 70 while the previous sample has a value that's less than 70. All following samples belong to the same group until and including the first sample whose value falls below 70.

Note that you can find the start of a group only by locating a previous value that's less than 70. So if the first sample that appears in the table (the one with the oldest timestamp) is greater than or equal to 70, it won't be counted as a start of a group because no samples come before it. If you want to treat such a sample as a beginning of a group, you can use a simple trick: Insert a sentinel sample with a timestamp that's older than all existing timestamps—for example, the base date January 1, 1900:

INSERT INTO Samples VALUES('1900-01-01 00:00:00.000', 0)

This second solution is more elegant than the first—it uses one small, simple query. However, it performs much worse. Obviously, you need more than 16 rows in the table to run realistic performance tests. You can use the code in Listing 8 to clear and repopulate the Samples table with 65,536 sample rows that contain random values between 0 and 100 in the NOx column. Now try the queries we produced in the first two solutions. On my laptop, the first solution's query took about 7 minutes; solution 2's query kept running until I decided to cancel it after more than 3 hours. Neither solution is optimal, to say the least. At this point, you might decide to give up and use cursors—or you can keep trying other avenues.

Solution 3: Using Subqueries and TOP

Listing 9's query uses subqueries and TOP to provide the desired results. The query defining the derived table S4 returns the start and end times of each group. The filter in the WHERE clause checks whether the sample's value is greater than or equal to 70 and that the previous sample's value is less than 70. The previous value is retrieved by a subquery that uses TOP 1 to get the value from the most recent sample whose timestamp is earlier than the current sample's timestamp. Similarly, the SELECT list inside the derived table S4 contains another subquery that uses TOP 1 to return the group's end time, which is the timestamp of the oldest sample whose value is less than 70 and whose timestamp is greater than the current sample's timestamp. Then, the outer query simply has to return the start and end times of each group, the elapsed time (which it calculates by using the DATEDIFF() function), and the maximum NOx value (which it finds by using a simple subquery).

Solution 3 performs considerably better than the previous two solutions. On my laptop, the query took about 1 second of CPU time and less than 2 seconds elapsed time. Using the dbo.fn_fmtelapsed() function to format the elapsed time added about 4 seconds. From my experience, the TOP option is a good way to improve query performance, and it's certainly true in this solution. But sometimes tuning is simply a matter of writing several different queries and choosing the one that performs best, even if you don't know why it works best. Now we have a practical solution, but it uses the T-SQL­specific feature TOP. If you're looking for an ANSI-compliant solution, keep reading.

Solution 4: Using Subqueries and GROUP BY

Dieter Nöth, an SQL wizard who frequently comes up with neat solutions to my puzzles, provided the ANSI-compliant query that Listing 10 shows. The query in the derived table T returns all rows that have a value greater than or equal to 70, meaning that each returned row is potentially a group start time. The SELECT list uses the timestamp as the potential start time and a subquery to return the current group's end time. The subquery retrieves the minimum timestamp of all the samples that were taken after the current sample and have a value less than 70. So, the derived table T contains group end times and potential group start times. The outer query groups the rows in T by end time and returns the minimum start time for each end time. The minimum start time is the real start time of a group of all rows with the same end time. The outer query also calculates the elapsed time and maximum NOx value. Amazingly, this solution performs slightly better than solution 3, taking about 1 second of CPU time and 2 seconds elapsed time.

Almost every T-SQL problem you face has many solutions. Always be aware of that fact and try to attack the problem from several different directions. And be sure to check your code's performance with sufficient amounts of sample data. Your next solution might work some real SQL magic.

Hide comments


  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.