Identifying a Subsequence in a Sequence, Part 2

In "Identifying a Subsequence in a Sequence, Part 1," I presented a challenge involving identifying the locations of a subsequence in a sequence. I covered three iterative solutions, using a recursive query, a loop with one temporary table, and a loop with two temporary tables (what I referred to as Divide and Conquer Halloween). The performance I got for the three iterative solutions was 11, 7, and 3.5 seconds, respectively. I had hoped to find set-based solutions that were more efficient than the iterative ones, but I was only partially successful. In this article I cover three set-based solutions and their performance. Two of the solutions are slower than the fastest iterative solution, and one is as fast.

The Challenge

First, let me quickly remind you of the challenge. Use the code in Listing 1 to create and populate the table T1 with a small set of sample data to check the correctness of the solutions.

SET NOCOUNT ON;
USE tempdb;
GO

IF OBJECT_ID(N'dbo.T1', N'U') IS NOT NULL DROP TABLE dbo.T1;
GO
CREATE TABLE dbo.T1
(
  keycol INT NOT NULL
    CONSTRAINT PK_T1 PRIMARY KEY,
  val INT NOT NULL,
    CONSTRAINT UNQ_T1_val_keycol UNIQUE(val, keycol)
);

-- Small set of sample data to check correctness
INSERT INTO dbo.T1(keycol, val) VALUES
  (1, 1),(2, 1),(3, 7),(4, 5),(5, 9),(6, 1),(7, 7),(8, 1),(9, 7),(10, 5),(11, 9);

The table T1 holds a sequence of values in the column val; the order of the elements is defined by the column keycol. Suppose you're given an input subsequence in the form of a table variable, like so:

-- Input table variable @P
DECLARE @P AS TABLE
(
  keycol INT NOT NULL
    PRIMARY KEY,
  val INT NOT NULL,
    UNIQUE(val, keycol)
);

-- sample input; try with other inputs as well
INSERT INTO @P(keycol, val) VALUES
  (1, 1),(2, 7),(3, 1),(4, 7);

--

Your task is to write code that returns the start and end locations (in terms of keycol values) of the input subsequence @p.val in the sequence T1.val.

minkey		maxkey
-----------	-----------
6		9

Figure 1 shows the desired result against the small set of sample data for the input subsequence 1,7,1,7, and Figure 2 shows the desired result for the subsequence 1,7,5,9.

minkey		maxkey
-----------	-----------
2		5
8		11

Use the code in Listing 2 to create a helper function called GetNums and to populate T1 with a large set of sample data to test the performance of the solutions.

IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;
GO
CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
    L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
        FROM L5)
  SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
  FROM Nums
  ORDER BY rownum;
GO

-- Large set of sample data to check performance
TRUNCATE TABLE dbo.T1;

INSERT INTO dbo.T1(keycol, val)
  SELECT n AS keycol, ABS(CHECKSUM(NEWID())) % 10 + 1 AS val
  FROM dbo.GetNums(1, 10000000) AS Nums;

The code fills the table with 10,000,000 rows with random values in the range 1 through 10. This means that statistically a sequence of one value will be found in 1/10 cases, two values in 1/100 cases, and n values in 1/10^n cases. For example, with an input sequence with four values you should get about 1,000 matches (1/10,000 × 10,000,000).

Hopefully you'll come up with some solutions of your own. I'll cover three solutions I came up with.

Solution 1: Using FOR XML PATH

Listing 3 contains the first solution I'll cover; Figure 3 shows its query plan. The main theme of the solution is to use the FOR XML PATH option to concatenate and compare the subsequence in @P.val with subsequences in T1.val.

SELECT
  T1.keycol AS minkey,
  T1.keycol + M.cnt - 1 AS maxkey
FROM (SELECT MAX(keycol) AS cnt FROM @P) AS M
  CROSS JOIN dbo.T1
  INNER JOIN @P AS P
    ON P.keycol = 1
    AND P.val = T1.val
    AND (SELECT P2.val AS [data()]
     FROM @P AS P2
     WHERE P2.keycol > 1
     ORDER BY P2.keycol
     FOR XML PATH('')) =
    (SELECT T1B.val AS [data()]
     FROM dbo.T1 AS T1B
     WHERE T1B.keycol BETWEEN T1.keycol + 1 AND T1.keycol + M.cnt - 1
     ORDER BY T1B.keycol
     FOR XML PATH(''));

The derived table M in the outer query's FROM clause computes the value cnt representing the number of elements in the input subsequence in @P. Using a CROSS JOIN, you match the one row in M with the rows from the other tables in the FROM clause, making cnt available in subsequent predicates.

The next part is joining T1 and @P (aliased as P) based on the predicate:

P.keycol = 1 AND P.val = T1.val

This predicate filters only values in T1.val that are potential starts of the input subsequence (same as the first value in @P.val). With a statistical probability of 1/10 matches, this part should scan and return about 1,000,000 rows from the index T1.UNQ_T1_val_keycol. This work is represented in the plan by the third operator from the bottom right, going up. Observe that the operator returns 1,000,545 rows in the plan I obtained on my system. (Your numbers might vary, because the code that creates the sample data uses randomization.)

The next predicate in the FROM clause compares the results of two subqueries, each constructing a concatenated string from the remaining elements of the respective sequence. The first subquery constructs the string from the second till the last elements in @P.val. The second subquery constructs the string for each potential start of the sequence in T1.val from the succeeding cnt-1 elements. If the two strings are the same, you have a match. The ISNULL function is used in case the subsequence is made of only one element. In such a case, both subqueries return a NULL and the ISNULL function substitutes the NULL with an empty string so that the comparison still works.

The second subquery causes the majority of the work in this plan. The work related to it is represented by the bottom right branch with the Clustered Index Seek of the T1.PK_T1 index and the XML UDX operator. This branch is executed 1,000,545 times. Once the seek reaches the leaf of the index, a range scan filters the first cnt-1 rows (three in our case) that are found (hence the total number of rows 3,001,635). Then the XML UDX operator concatenates those cnt-1 values into one string (hence we're back to 1,000,545 rows total).

This plan took 8 seconds to complete on my system, performing about 4 million logical reads. Unfortunately, this solution is slower than the fastest iterative solution from "Identifying a Subsequence in a Sequence, Part 1."

Solution 2: Using COUNT

Listing 4 contains the second set-based solution I'll discuss; Figure 4 shows its execution plan. The first part of the solution is actually very similar to the previous solution. The code first defines a derived table called M with a computation of a column called cnt, representing the count of elements in the subsequence in @P. Then, the code uses a cross join to match the row in M to the remaining tables in the FROM clause, making the cnt column available to expressions in subsequent predicates.

SELECT
  T1.keycol AS minkey,
  T1.keycol + M.cnt - 1 AS maxkey
FROM (SELECT MAX(keycol) AS cnt FROM @P) AS M
  CROSS JOIN dbo.T1
  INNER JOIN @P AS P
    ON P.keycol = 1
    AND P.val = T1.val
    AND M.cnt = 
  (SELECT COUNT(*)
   FROM @P AS P2
     INNER JOIN dbo.T1 AS T1B
       ON P2.keycol > 1
       AND T1B.keycol = T1.keycol + P2.keycol - 1
      AND T1B.val = P2.val) + 1;

Like in the previous solution, the code joins T1 and @P based on the predicate P.keycol = 1 AND P.val = T1.val. Remember that this predicate filters only values in T1.val that are potential starts of the input subsequence. The plan represents this part with an Index Seek operator against the index T1.UNQ_T1_val_keycol, returning 1,000,545 matches. (Again, your numbers might vary because the code that populates the table uses randomization.)

The next part is where this solution differs from the previous one. The code compares cnt with 1 plus the count of rows returned by a join between @P (aliased as P2) and T1 (aliased as T1B), based on the following predicate:

    P2.keycol > 1
AND T1B.keycol = T1.keycol + P2.keycol - 1
AND T1B.val = P2.val

In essence this predicate filters the second to last elements from P2, matching those elements with elements that have corresponding positions in T1B after the potential stats of the subsequence. Besides matching corresponding positions, the predicate also matches the values. If the number of rows in the result of the join plus 1 is equal to cnt, you have a match.

The problem with this last part is the work involved in computing it. It's represented by the bottom branch in the plan. The Clustered Index Seek in @P is executed 1,000,545 times, returning 3,001,635 rows. Then, to look for matches, the Index Seek against T1.UNQ_T1_val_keycol is executed 3,001,635 times! The total number of logical reads in this plan is 11 million logical reads, and it took this query 9 seconds to complete on my system. This solution is less efficient than the previous one.

Solution 3: Using NOT EXISTS x2

Listing 5 contains the third set-based solution; Figure 5 shows its query plan. As in the previous solutions, this solution starts by filtering potential starts of the subsequence in the sequence by joining @P (aliased as P) with T1 based on the predicate P.keycol = 1 AND P.val = T1.val. Like before, this is achieved in the plan with an Index Seek operator against the index T1.UNQ_T1_val_keycol, returning 1,000,545 matches.

SELECT
  T1.keycol AS minkey,
  T1.keycol + (SELECT MAX(keycol) FROM @P) - 1 AS maxkey
FROM dbo.T1
  INNER JOIN @P AS P
    ON P.keycol = 1
    AND P.val = T1.val
    AND NOT EXISTS
  (SELECT *
   FROM @P AS P2
   WHERE P2.keycol > 1
     AND NOT EXISTS
       (SELECT *
        FROM dbo.T1 AS T1B
        WHERE T1B.keycol = T1.keycol + P2.keycol - 1
      AND T1B.val = P2.val));

The very same join has an additional predicate applying a double-negative approach:

AND NOT EXISTS
  (SELECT *
   FROM @P AS P2
   WHERE P2.keycol > 1
     AND NOT EXISTS
   (SELECT *
    FROM dbo.T1 AS T1B
    WHERE T1B.keycol = T1.keycol + P2.keycol - 1
      AND T1B.val = P2.val))

The logic in this predicate is: Produce a match if you can't find an element after the first in @P for which you can't find an element with the same value in the corresponding position in T1. So, if no element in @P is not matched by a corresponding element in T1, all elements in @P are matched by corresponding ones in T1. It's like using the term "not uncommon" to say that something is common. The nice part is the way the plan handles this predicate.

For each of the 1,000,545 rows returned by the previous activity, the plan performs a seek in the clustered index in @P to get the second to last elements in the input subsequence. However, as those second to last elements are scanned in @P, the plan performs a seek in the T1.UNQ_T1_val_keycol to check whether a matching element can't be found in the corresponding position in T1. As soon as a match can't be found for an element, the execution short-circuits because there's no need to continue applying seeks in T1 for the rest of the elements in @P. That's why you get 1,110,338 seeks against T1.UNQ_T1_val_keycol and not more than 3 million.

The total number of logical reads is about 5 million, but there are no extra costs like in the other set-based solutions related to the XML-based string concatenation and string comparisons or count aggregates. This solution finished in 3 seconds on my system. That's more like it!

Best of the Best

In this article and in "Identifying a Subsequence in a Sequence, Part 1," I presented a challenge to identify the locations of a subsequence within a sequence. I presented three iterative solutions in "Identifying a Subsequence in a Sequence, Part 1"; the most optimized solution completed in 3.5 seconds. In this article, I tried to find a set-based solution that's as good as or faster than the optimized iterative solution. I covered three different solutions. The fastest solution provided similar performance to the fastest iterative solution, completing in 3 seconds. However, the relational solution is superior because in addition to performing well, it's an elegant declarative solution that requires much less code than the iterative solution requires.

Comments

Plain text