Identifying a Subsequence in a Sequence, Part 1

I recently got the chance to work on a T-SQL challenge that involved identifying a subsequence within a sequence. I found the challenge to be interesting and therefore decided to dedicate this month's and next month’s columns to it. I'll focus on iterative solutions this month and set-based solutions next month. As always, after you read about the challenge, I urge you to try solving it yourself before looking at my solutions.

Related: "Divide and Conquer Halloween"

The Challenge

Our challenge involves finding the locations of a subsequence in a sequence. For example, consider the sequence 1, 1, 7, 5, 9, 1, 7, 1, 7, 5, 9, and the subsequence 1, 7, 5, 9. The subsequence can be found in positions 2 through 5 and 8 through 11 in the original sequence.

An example for a practical use case is identifying a certain subsequence in a DNA sequence representing a possible mutation. In such a case the four DNA nucleobases (G, A, T, and C) make the elements of the sequence instead of the integers in our example, but the principle is the same. Use the code in Listing 1 to create a table called T1 and populate it with a small set of sample data representing the full sequence.

SET NOCOUNT ON;
USE tempdb;
GO

IF OBJECT_ID(N'dbo.T1', N'U') IS NOT NULL DROP TABLE dbo.T1;
GO
CREATE TABLE dbo.T1
(
  keycol INT NOT NULL
    CONSTRAINT PK_T1 PRIMARY KEY,
  val INT NOT NULL,
    CONSTRAINT UNQ_T1_val_keycol UNIQUE(val, keycol)
);

-- Small set of sample data to check correctness
INSERT INTO dbo.T1(keycol, val) VALUES
  (1, 1),(2, 1),(3, 7),(4, 5),(5, 9),(6, 1),(7, 7),(8, 1),(9, 7),(10, 5),(11, 9);

Suppose you're given the input subsequence in a table variable. Your task is to implement code that identifies the start and end positions of the occurrences of the input subsequence in the full sequence. The following code declares such a table variable and populates it with a sample input sequence with a placeholder to add your solution code:

-- Input table variable @P
DECLARE @P AS TABLE
(
  keycol INT NOT NULL
    PRIMARY KEY,
  val INT NOT NULL,
    UNIQUE(val, keycol)
);

-- sample input; try with other inputs as well
INSERT INTO @P(keycol, val) VALUES
  (1, 1),(2, 7),(3, 1),(4, 7);

--

Of course, you can change the sample input as you wish to test your solution with different subsequences. Figure 1 shows the desired result against the small set of sample data for the input subsequence 1,7,1,7.

minkey		maxkey
-----------	-----------
6		9

Figure 2 shows the desired result for the subsequence 1,7,5,9. Observe in Figure 2 that if there are multiple occurrences, you need to identify all of them.

minkey		maxkey
-----------	-----------
2		5
8		11

Use the small set of sample data to validate the correctness of your solution. To test your solution’s performance, you need to fill the table with a larger set of sample data. You can use the code in Listing 2 for this purpose.

IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;
GO
CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
    L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
        FROM L5)
  SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
  FROM Nums
  ORDER BY rownum;
GO

-- Large set of sample data to check performance
TRUNCATE TABLE dbo.T1;

INSERT INTO dbo.T1(keycol, val)
  SELECT n AS keycol, ABS(CHECKSUM(NEWID())) % 10 + 1 AS val
  FROM TSQL2012.dbo.GetNums(1, 10000000) AS Nums;

The code first defines a helper function called GetNums and then uses it to populate the table T1 with 10 million rows. Good luck!

Solution 1: Using a Recursive Query

As I mentioned, this month I’ll focus on iterative solutions. The first solution uses a recursive query to handle the task. Listing 3 contains the complete solution code minus the previously supplied code to declare and populate the input table variable @P.

WITH C AS
(
  SELECT
    T1.keycol AS minkey,
    P.keycol AS P_keycol,
    T1.keycol AS T1_keycol
  FROM @P AS P
    INNER JOIN dbo.T1
  ON P.val = T1.val
  WHERE P.keycol = 1

  UNION ALL

  SELECT
    PRV.minkey,
    P.keycol AS P_keycol,
    T1.keycol AS T1_keycol
  FROM C AS PRV
    INNER JOIN @P AS P
  ON P.keycol = PRV.P_keycol + 1
    INNER JOIN dbo.T1
  ON T1.keycol = PRV.T1_keycol + 1
  AND P.val = T1.val
)
SELECT minkey, T1_keycol AS maxkey
FROM C
WHERE P_keycol = (SELECT MAX(keycol) FROM @P);

The first query in the CTE body is the anchor member:

SELECT
  T1.keycol AS minkey,
  P.keycol AS P_keycol,
  T1.keycol AS T1_keycol
FROM @P AS P
  INNER JOIN dbo.T1
    ON P.val = T1.val
WHERE P.keycol = 1

The query filters only the first element from the input subsequence in the table variable @P, joining @P and T1 to identify all occurrences of the first subsequence value in the sequence. The key of this first value is identified as the potential minimum key (aliased minkey). The keys of the subsequence and the sequence are also returned and aliased as P_keycol and T1_keycol, respectively.

The second query in the CTE body is the recursive member:

SELECT
  PRV.minkey,
  P.keycol AS P_keycol,
  T1.keycol AS T1_keycol
FROM C AS PRV
  INNER JOIN @P AS P
    ON P.keycol = PRV.P_keycol + 1
  INNER JOIN dbo.T1
    ON T1.keycol = PRV.T1_keycol + 1
    AND P.val = T1.val

This query joins the previous round’s result (aliased as PRV) with the table variable @P (aliased as P) and the table T1. The purpose of these joins is to identify the next value in both the subsequence and the sequence and produce a match only if the next value in both is the same (P.val = T1.val). This recursive query runs repeatedly until no more matches are found.

Finally, the outer query filters only the rows from the CTE that represent surviving last subsequence elements (P_keycol equals maximum key in @P), meaning the entire subsequence was found:

SELECT minkey, T1_keycol AS maxkey
FROM C
WHERE P_keycol = (SELECT MAX(keycol) FROM @P);

This solution is pretty straightforward, but unfortunately it doesn’t perform very well. The reason is that the query execution performs a lot of I/O operations against the internal worktable (spool) that it generates to store the intermediate result sets. You don’t have any control over this process and no say about the indexing of this worktable or any of its other aspects. It took 11 seconds for this solution to complete on my system, applying 13 million logical reads.

Solution 2: Using a Loop

The second solution implements the same approach as in the previous solution, but instead of using a recursive CTE I use an explicit loop. Listing 4 contains the complete solution code.

CREATE TABLE #T
(
  minkey    INT NOT NULL,
  P_keycol  INT NOT NULL,
  T1_keycol INT NOT NULL,
  PRIMARY KEY(P_keycol, T1_keycol)
);

DECLARE
  @i AS INT = 1,
  @cnt AS INT = (SELECT MAX(keycol) FROM @P);

INSERT INTO #T(minkey, P_keycol, T1_keycol)
  SELECT
    T1.keycol AS minkey,
    P.keycol AS P_keycol,
    T1.keycol AS T1_keycol
  FROM @P AS P
    INNER JOIN dbo.T1
  ON P.val = T1.val
  WHERE P.keycol = 1;

WHILE @@rowcount > 0
BEGIN
  SET @i += 1;

  INSERT INTO #T(minkey, P_keycol, T1_keycol)
    SELECT
  PRV.minkey,
  P.keycol AS P_keycol,
  T1.keycol AS T1_keycol
    FROM #T AS PRV
  INNER JOIN @P AS P
    ON P.keycol = PRV.P_keycol + 1
  INNER JOIN dbo.T1
    ON T1.keycol = PRV.T1_keycol + 1
    AND P.val = T1.val
    WHERE PRV.P_keycol = @i - 1
    /* OPTION(RECOMPILE) */;
END;

SELECT minkey, T1_keycol AS maxkey
FROM #T
WHERE P_keycol = @cnt;

DROP TABLE #T;

The code starts by defining a temporary table called #T to store the intermediate results. Then the code declares and initializes a couple of variables: @i is an iteration counter initialized with 1 and later will be incremented by 1 in each iteration of the loop, and @cnt holds the number of elements in the subsequence.

Next, the INSERT statement handles the task that in the previous solution was handled by the anchor member—namely, filters the first element in the subsequence and joins to T1 to find all values in the sequence representing the potential start of the subsequence. The result is stored in #T.

Then the code runs a loop that keeps going as long as the previous INSERT returned some rows. In each iteration, the INSERT handles the task that in the previous solution was handled by the recursive member—namely, matches to the elements from the previous round the next element from the sequence if it’s the expected next element according to the subsequence.

Finally, the last query returns the start and end keys of all occurrences of the subsequence in the sequence (those where P_keycol = @cnt).

When I ran the code in Listing 4 as provided (without the RECOMPILE query option), I got the same plan for all executions of the INSERT in the body of the loop. Figure 3 shows the resulting plan.

Figure 3: Plan for Queries in Listing 4 Without RECOMPILE

The first plan was generated for an estimate of about 1 million matches; it was then reused in all subsequent iterations. The problem is that all subsequent iterations return a far smaller number of matches, and the existing plan isn't the most efficient one for them. This query took 10 seconds to complete on my system, applying 3.5 million logical reads.

You can improve the performance of the solution by using the RECOMPILE query option in the INSERT statement in the loop’s body. This will force SQL Server to create a new plan for each iteration of the statement, one that's more suitable based on new estimates. I uncommented the RECOMPILE query option in Listing 4 and ran the code again.

I got different plans for the first iteration of the INSERT statement (suitable for a large number of matches) and for all remaining iterations (suitable for a small number of matches), as Figure 4 shows.

Figure 4: Plans for Queries in Listing 4 With RECOMPILE

This time the query completed in 7 seconds, doing a bit more logical reads (3.85 million), but overall it performed better.

Solution 3: Using the Divide and Conquer Halloween Approach

Even with the RECOMPILE option added in the previous solution, there’s still room for improvement. Notice in the plans in both Figure 3 and Figure 4 the Table Spool (Eager Spool) operators. The optimizer added those operators to the plans for Halloween Protection purposes so as not to handle the same rows more than once. You can find more information on Halloween Protection in an excellent series of articles by Paul White about Halloween Protection.

In "Divide and Conquer Halloween," I covered the generic form of the iterative algorithm used in this article—the divide and conquer algorithm. I explained the need for Halloween Protection when the source and target of the iterative INSERT statement are the same. I provided a solution that implements the same algorithm but avoids the need for Halloween Protection simply by using two temporary tables instead of one, alternating between them so that in each round the source and target are different. I named this approach Divide and Conquer Halloween. Listing 5 contains the solution code that implements this approach.

CREATE TABLE #T1
(
  minkey    INT NOT NULL,
  P_keycol  INT NOT NULL,
  T1_keycol INT NOT NULL,
  PRIMARY KEY(P_keycol, T1_keycol)
);

CREATE TABLE #T2
(
  minkey    INT NOT NULL,
  P_keycol  INT NOT NULL,
  T1_keycol INT NOT NULL,
  PRIMARY KEY(P_keycol, T1_keycol)
);

DECLARE
  @i AS INT = 1,
  @cnt AS INT = (SELECT MAX(keycol) FROM @P);

INSERT INTO #T1(minkey, P_keycol, T1_keycol)
  SELECT
    T1.keycol AS minkey,
    P.keycol AS P_keycol,
    T1.keycol AS T1_keycol
  FROM @P AS P
    INNER JOIN dbo.T1
  ON P.val = T1.val
  WHERE P.keycol = 1;

WHILE @@rowcount > 0
BEGIN
  SET @i += 1;

  IF @i % 2 = 0
  BEGIN
    TRUNCATE TABLE #T2;

INSERT INTO #T2(minkey, P_keycol, T1_keycol)
  SELECT
    PRV.minkey,
    P.keycol AS P_keycol,
    T1.keycol AS T1_keycol
  FROM #T1 AS PRV
    INNER JOIN @P AS P
      ON P.keycol = PRV.P_keycol + 1
    INNER JOIN dbo.T1
      ON T1.keycol = PRV.T1_keycol + 1
      AND P.val = T1.val
  OPTION(RECOMPILE);
  END;
  ELSE
  BEGIN
    TRUNCATE TABLE #T1;

INSERT INTO #T1(minkey, P_keycol, T1_keycol)
  SELECT
    PRV.minkey,
    P.keycol AS P_keycol,
    T1.keycol AS T1_keycol
  FROM #T2 AS PRV
    INNER JOIN @P AS P
      ON P.keycol = PRV.P_keycol + 1
    INNER JOIN dbo.T1
      ON T1.keycol = PRV.T1_keycol + 1
      AND P.val = T1.val
  OPTION(RECOMPILE);
  END;
END;

IF @i % 2 = 0
  SELECT minkey, T1_keycol AS maxkey
  FROM #T1
  WHERE P_keycol = @cnt;
ELSE
  SELECT minkey, T1_keycol AS maxkey
  FROM #T2
  WHERE P_keycol = @cnt;

DROP TABLE #T1, #T2;

Figure 5 shows the plans for the different executions of the INSERT statements in the body of the loop, with no special treatment added for Halloween Protection.

Figure 5: Plans for Queries in Listing 5

This solution completed in 3.5 seconds on my system, applying only 370,000 logical reads.

Until Next Month

The performance of the iterative solution improved quite dramatically with the Divide and Conquer Halloween approach. This is the best performance I managed to achieve with an iterative solution in T-SQL. The next challenge is to find a set-based solution that's as efficient as possible. You have a month to work on solutions of your own, before I present several possible solutions. Good luck, and have fun!

Comments

Plain text