Many industries use statistical process control (SPC) to monitor the quality or consistency of a product. Technicians or monitoring devices test the product at regular intervals, and data analysts process the resulting sample data to identify variations and trends. Historically, analysts manually transcribed data onto control charts such as the one that Figure 1 shows, performed statistical calculations, and made visual inspections. Today, process control can be highly automated; analysts use database management systems (DBMSs), either alone or with specialized software, to manage and analyze the data.
Suppose you're monitoring the concentration of antimony in Manhattan drinking water by testing water from various locations every hour. Fifteen sequential measurements from two locations might produce the data that Figure 1 shows. Patterns in this data might suggest that the antimony concentration is increasing and that water department officials should perform additional testing or take corrective action.
You expect to find some fluctuation in the sample measurements, but certain patterns will rarely occur unless an underlying trend exists. One such pattern, or trend marker, is the occurrence of seven consecutive above-average measurements; another is two consecutive measurements in the top 5 percent of recorded measurements. SPC analysis uses these and other criteria to identify potential trends. Commercial SPC software packages provide a variety of trend markers to choose from, and which ones you use depends on various aspects of your process. For example, some trend markers are better indicators than others when you know your measurements are prone to error; other trend markers are better at detecting or ignoring periodic variations. However, commercial software packages don't typically include one useful trend marker—the presence of an increasing subsequence of a given length. To make this helpful SPC measure available to a SQL Server application, I developed some T-SQL code to query data for the length of the longest increasing subsequence.
To understand what an increasing subsequence is and why it's a good trend marker, try to answer the following question: In what month of the year do you have the best chance of being able to take a noontime walk outside nine times during the month on days that the outside temperature is warmer than it was during the previous walk? If you live in New York City, you might say March or April. If you live in Sydney, you might say October. And if you live in Guam, you might say the question is impossible to answer because the noontime temperature is between 84 and 86 degrees Fahrenheit almost every day of the year.
Noontime temperature is a physical quantity that can be measured, and noontime temperatures for a one-month period are a sequence of measurements. The noontime temperatures on the days you walked form a subsequence of those measurements. The question above asked you to find an increasing subsequence of nine measurements; this challenge shows how trends in a physical quantity can reveal an increasing subsequence of measurements. Thus, a long increasing subsequence is a good trend marker.
One increasing subsequence in the data that Figure 1 shows comprises the Midtown water readings numbered 3, 4, 6, and 9, which have values of 3.939, 4.050, 4.140, and 4.143, respectively. Can you find the length of the longest increasing subsequence for Midtown? For Uptown? The answers will say something about possible trends in the data; later I describe a general solution in T-SQL.
The original problem and solution resulted from a question that John Winterbottom posted to the Microsoft newsgroup microsoft.public.sqlserver.programming. Winterbottom wanted to process sequences of seven measurements to identify which sequences contained increasing subsequences of length 4. He asked how to enumerate the subsequences of length 4 within a sequence of length 7 so that he could inspect each subsequence to see whether the values were increasing. A sequence of length 7 contains only 35 subsequences of length 4, so generating those 35 subsequences and inspecting each one is a practical approach. Listing 1 shows the T-SQL code that generates the increasing subsequences of length 4 that exist in a sequence of seven measurements.
I proposed one improvement to Winterbottom's solution before I discovered a more efficient algorithm that I describe in a moment. I observed that the existence of an increasing length-4 subsequence depended only on the relative order of the seven measurements. For example, if the seven measurements, in the order they were made, are M1, M2, M3, M4, M5, M6, and M7 and of these values, if M1 is the third largest, M2 sixth largest, M3 fifth largest, M4 second largest, M5 seventh largest, M6 fourth largest, and M7 first largest, then M2, M3, M6, M7 is one increasing subsequence of length 4, regardless of the exact measurements, because these values are in the order sixth, fifth, fourth, and first largest of all seven.
With seven measurements, you can rank the measurements (first through seventh) in only 5040 ways, so it's possible to calculate the particular rank orderings that contain increasing subsequences of length 4 one time and store the results in a permanent table. Then, the next time you need to process a sequence of seven measurements, you can compute the rank order for that sequence and use the permanent table to look up the order. Creating the permanent table is a tedious process, but this method provides a reasonable solution to the problem, and being able to reuse the information in the table makes the process simpler. (If duplicate measurements exist among the seven, the earlier one must be ranked higher for this method to give correct results.)
I might have been satisfied and set the problem aside at this point. But as a mathematician, when I see two digits such as 4 and 7, I think m and n. I try to generalize problems, and I'm not satisfied with a solution that works only in a few specific cases.
On the surface, this appears to be a difficult problem. The two solutions I've mentioned—storing rank orders and testing all subsequences—become intractable for sequences that contain more than about 15 to 20 measurements. Winterbottom's sequences of 7 measurements had 5040 possible rank orders and 35 subsequences of 4 measurements. But if you look at sequences of 15 measurements and examine subsequences that contain just 8 measurements, the number of rank orders increases to 15! (15 factorial), or more than 1012, and the number of subsequences increases to more than 6000. The numbers become ridiculously large for longer sequences.
You can sometimes solve problems like this—in which you have far more cases than you can possibly check individually—by using dynamic programming techniques. A dynamic programming solution keeps track of partial results—the longest increasing subsequences up to each point in the original sequence of measurements—and uses these partial results to build the final solution. Such algorithms work well in this particular case, but you can do better with an entirely different approach that happens to translate very nicely into T-SQL. Because an analyst might need to consider any number of measurements at the same time, I developed a query to find the longest increasing subsequence within a sequence of any length. The code in Listing 2 creates tables containing the data that Figure 1 shows, if you want to try your hand at devising a solution.
Mathematics to the Rescue
Mathematicians have a remarkable habit of developing solutions to practical problems before those problems arise. For example, the mathematics that underlies present-day cryptography was developed long before cryptography. And the inverse problem that must be solved for CAT scanning to work had been stated and solved decades before the first CAT scan. So, not surprisingly, when I searched mathematics literature, I discovered an efficient algorithm for finding the longest increasing subsequences in a given sequence.
Mathematical research in this area is concerned with finding the probability distribution of the longest increasing subsequence of a random permutation of the first N positive integers. For someone who is mathematically inclined, the problem is intriguing and contains many surprising connections.
If you have a sequence S of length n, how can you find the largest value of m for which there exists an increasing subsequence of S having length m? Figure 2 shows the algorithm I developed to find this value; the algorithm is adapted and simplified from an algorithm in David Aldous and Persi Diaconis's paper "Longest increasing subsequences: from patience sorting to the Baik-Deift-Johansson theorem," Bulletin of the American Mathematical Society 36 (1999): 413-432. (Patience is the British name for the card game called solitaire in the United States, so I chose the name Solitaire for a table in the T-SQL solution.)
Note that whereas the final size of L is the length of a longest increasing subsequence, L doesn't necessarily contain the elements of a longest increasing subsequence. Table 1 shows the contents of the list L after each iteration of the algorithm when it's applied to the Uptown readings.
Translating this algorithm into T-SQL isn't difficult. The list L becomes a small table called Solitaire. Finding the smallest number in Solitaire that's greater than or equal to a given number is an ideal task for T-SQL. Because numbers are added to the list only if they're larger than all other numbers in the list, I added an identity column called pos to Solitaire to make finding that smallest number even simpler. The identity column keeps the elements of L in order from smallest to largest, so the smallest element is first in order by the identity column pos. You might balk at using a cursor to perform the algorithm's loop, but I found no set-based alternative. Listing 3 shows the algorithm, which assumes the tables that the code in Listing 2 creates.
After solving the problem of finding sequences of seven measurements that contain an increasing subsequence of at least four measurements, I wanted to find a practical solution for arbitrary values of m and n. I not only found a solution, I found a nice bit of mathematics that I hadn't previously encountered. The general application to a practical problem was a pleasant added bonus.