Identifying a Subsequence in a Sequence, Part 3

A few months ago I covered a challenge involving identifying a subsequence in a sequence. In "Identifying a Subsequence in a Sequence, Part 1," I covered iterative solutions to the challenge, and in "Identifying a Subsequence in a Sequence, Part 2," I covered set-based solutions. Since those articles published, I've received additional solutions from readers. Two of the solutions, created by Peter Larsson and Dwain Camps, are so beautiful that they deserve recognition. In addition, Peter's and Dwain's solutions employ patterns that can be used to solve other T-SQL challenges, so it's certainly worthwhile to become familiar with them.

I'll start by providing a quick reminder of the challenge and the fastest set-based solution that I presented previously. I'll then cover Peter's and Dwain's solutions. If your memory of the challenge and the previously presented solutions is still fresh, feel free to skip straight to the new solutions.

The Challenge

You have a sequence of values stored in the table T1. Use the code in Listing 1 to create the table T1 and populate it with a small set of sample data to test the correctness of the solutions.

SET NOCOUNT ON;
USE tempdb;
GO

IF OBJECT_ID(N'dbo.T1', N'U') IS NOT NULL DROP TABLE dbo.T1;
GO
CREATE TABLE dbo.T1
(
  keycol INT NOT NULL
    CONSTRAINT PK_T1 PRIMARY KEY,
  val INT NOT NULL,
    CONSTRAINT UNQ_T1_val_keycol UNIQUE(val, keycol)
);

-- Small set of sample data to check correctness
INSERT INTO dbo.T1(keycol, val) VALUES
  (1, 1),(2, 1),(3, 7),(4, 5),(5, 9),(6, 1),(7, 7),(8, 1),(9, 7),(10, 5),(11, 9);

The column keycol holds consecutive integers starting with 1, representing the order of the elements in the sequence, and the column val holds the values of those elements. You're given as input a table variable @P with a subsequence of elements represented by the pair of columns keycol and val similar to those in T1. Your task is to identify all locations of the subsequence from @P in the sequence in T1, returning the start and end keys for each occurrence. The following sample code defines the input subsequence 1, 7, 1, 7:

-- Input table variable @P
DECLARE @P AS TABLE
(
  keycol INT NOT NULL
    PRIMARY KEY,
  val INT NOT NULL,
    UNIQUE(val, keycol)
);

-- sample input; try with other inputs as well
INSERT INTO @P(keycol, val) VALUES
  (1, 1),(2, 7),(3, 1),(4, 7);

--

With the small set of sample data in T1, Figure 1 shows the desired result for the input subsequence 1, 7, 1, 7.

minkey      maxkey
----------- -----------
6       9

Figure 2 shows the desired result for the input subsequence 1, 7, 5, 9.

minkey      maxkey
----------- -----------
2       5
8       11

To test the performance of the solutions, you need a bigger set of sample data. You can achieve this by using the code in Listing 2.

IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;
GO
CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
    L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
        FROM L5)
  SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
  FROM Nums
  ORDER BY rownum;
GO

-- Large set of sample data to check performance
TRUNCATE TABLE dbo.T1;

INSERT INTO dbo.T1(keycol, val)
  SELECT n AS keycol, ABS(CHECKSUM(NEWID())) % 10 + 1 AS val
  FROM dbo.GetNums(1, 10000000) AS Nums;

Solution Using NOT EXISTS x2

As a reminder, the following is the fastest set-based solution that I presented for the challenge in "Identifying a Subsequence in a Sequence, Part 2":

SELECT
  T1.keycol AS minkey,
  T1.keycol + (SELECT MAX(keycol) FROM @P) - 1 AS maxkey
FROM dbo.T1
  INNER JOIN @P AS P
    ON P.keycol = 1
    AND P.val = T1.val
    AND NOT EXISTS
  (SELECT *
   FROM @P AS P2
   WHERE P2.keycol > 1
     AND NOT EXISTS
       (SELECT *
        FROM dbo.T1 AS T1B
        WHERE T1B.keycol = T1.keycol + P2.keycol - 1
      AND T1B.val = P2.val));

Here are the performance statistics that I got for this solution on my machine:

Duration: 3.5 seconds, CPU: 3.5 seconds, logical reads: 5 million

Now let's examine Peter's and Dwain's solutions.

Solution by Peter Larsson

Peter had an ingenious insight that led to his solution. In cases in which the subsequence appears in the sequence, the difference between the keys in both sides has the following important properties:

It's constant because the keys in both sides keep incrementing by the same interval of 1 integer.
It's unique per occurrence of the subsequence in the sequence because each occurrence in the sequence starts at a different key, whereas the first key in the subsequence is always 1.

The easiest way to understand these properties is to look at the output of the following query:

SELECT T1.keycol AS T1_key, P.keycol AS P_key, T1.val, T1.keycol - P.keycol AS diff
FROM dbo.T1
  INNER JOIN @P AS P
    ON P.Val = T1.Val
ORDER BY diff, P.keycol;

Figure 3 shows the output of this query for the input subsequence 1, 7, 1, 7 against the small set of sample data.

T1_key      P_key       val     diff
----------- ----------- ----------- -----------
1       3       1       -2
2       3       1       -1
3       4       7       -1
1       1       1       0
2       1       1       1
3       2       7       1
6       3       1       3
7       4       7       3
6       1       1       5
7       2       7       5
8       3       1       5
9       4       7       5
8       1       1       7
9       2       7       7

Observe that for each distinct occurrence of some or all of the elements of the subsequence in the sequence, the difference between the keys is the same. When the count of rows with the same difference is equal to the cardinality of the subsequence (4 in this example), it means that the subsequence appears in its entirety in the sequence. In our example, observe that the subsequence appears in the range of keys 6 through 9. That's just beautiful!

Based on this idea, to complete the solution, simply group the result of the join by the difference between the keys, filter only the groups having a count of rows that's equal to the count of rows in @P, and for each qualifying group return the minimum and maximum keys from T1.

Here's the complete solution:

SELECT MIN(T1.keycol) AS minkey, MAX(T1.keycol) AS maxkey
FROM dbo.T1
  INNER JOIN @P AS P
    ON P.Val = T1.Val
GROUP BY T1.keycol - P.keycol
HAVING COUNT(*) = (SELECT COUNT(*) FROM @P);

Peter's solution is quite amazing in its brevity and simplicity. It's also quite efficient. Figure 4 shows the plan for the solution against the large set of sample data.

Figure 4: Plan for Solution by Peter Larsson

For each value in @P, the plan performs a seek and a range scan in the index on T1(val, keycol) to identify matches. Compared with the other solutions, this plan results in significantly fewer reads. Most of the work is done by the aggregate operator, which in this case uses a hash match algorithm. This operator mainly uses CPU resources, resulting in a plan with a comparatively high measure of CPU time. Here are the performance statistics that I got for this plan:

3 seconds, CPU: 10 seconds, logical reads: 10750

Observe that this solution is a bit faster than the fastest set-based solution I showed previously. It consumes more CPU time, but it generates significantly fewer reads.

Solution by Dwain Camps

Dwain's solution applies an intriguing short-circuiting theme. You can find the complete solution in Listing 3 and the query plan for the solution in Figure 5.

DECLARE
  @val1    AS INT = (SELECT val FROM @P WHERE keycol = 1),
  @val2    AS INT = (SELECT val FROM @P WHERE keycol = 2),
  @val3    AS INT = (SELECT val FROM @P WHERE keycol = 3),
  @val4    AS INT = (SELECT val FROM @P WHERE keycol = 4),
  @numvars AS INT = 4, -- number of variables used for short-circuiting
  @numrows AS INT = (SELECT COUNT(*) FROM @P); -- cardinality of subsequence

DECLARE @trailingrows AS INT = -- num elements remaining to check
  CASE WHEN @numrows > @numvars THEN @numrows - @numvars ELSE 0 END;

WITH ReducedSet AS
(
  SELECT A.keycol
  FROM dbo.T1 AS A
  WHERE A.val = @val1 -- filter rows that match first sub sequence element
    AND ( @val2 IS NULL -- when @val2 is NULL, EXISTS is short-circuited
      OR EXISTS
        ( SELECT * FROM dbo.T1 AS B
      WHERE B.val = @val2 AND B.keycol = A.keycol + 1 ) )
    AND ( @val3 IS NULL -- short-circuit for @val3
      OR EXISTS
        ( SELECT * FROM dbo.T1 AS B
      WHERE B.val = @val3 AND B.keycol = A.keycol + 2 ) )
    AND ( @val4 IS NULL -- short-circuit for @val4
      OR EXISTS
        ( SELECT * FROM dbo.T1 AS B
      WHERE B.val = @val4 AND B.keycol = A.keycol + 3 ) )
)
SELECT RS.keycol AS minkey, RS.keycol + @numrows - 1 AS maxkey
FROM ReducedSet AS RS

  OUTER APPLY ( SELECT TOP (@trailingrows) keycol, val
        FROM dbo.T1 AS S
        WHERE S.keycol >= RS.keycol + @numvars
        ORDER BY S.keycol ) AS TS -- trailing sequence elements

  OUTER APPLY ( SELECT 1 AS c
        FROM @P AS P
        WHERE P.keycol = 1 + TS.keycol - RS.keycol
          AND P.val = TS.val ) AS TSS -- trailing subsequence elements

GROUP BY RS.keycol
HAVING COUNT(TSS.c) = @trailingrows;

Figure 5: Plan for Solution by Dwain Camps

The first part of the solution assigns values to variables. The variables @val1, @val2, @val3, and @val4 are assigned with the respective values of the first four elements in the subsequence. If the subsequence has fewer than four elements, variables representing nonexisting element positions are set to NULL. The variable @numvars is set to 4, representing the number of variables used for short-circuiting purposes. The variable @numrows is set to the cardinality of the subsequence. As for the @trailingrows variable, when the cardinality of the subsequence is greater than the number of variables used for short-circuiting, it's set to the difference between the two—otherwise, it's set to zero. This variable represents how many trailing elements in the subsequence are left for you to check to know whether you have a match beyond the ones you check using the variables @val1, @val2, @val3, and @val4.

The solution query defines a CTE called ReducedSet, which reduces the set in T1 to only those elements representing a potential start of the subsequence based on the first four elements. The first part in the CTE filters only the rows where val is equal to @val1:

SELECT A.keycol
FROM dbo.T1 AS A
WHERE A.val = @val1

Then, for each of the next three elements, there's a predicate that looks like this (this example handles the second element):

AND ( @val2 IS NULL -- when @val2 is NULL, EXISTS is short-circuited
  OR EXISTS
    ( SELECT * FROM dbo.T1 AS B
      WHERE B.val = @val2 AND B.keycol = A.keycol + 1 ) )

The nice thing in terms of the physical processing of the code is that if a variable is NULL, the EXISTS predicate is short-circuited. Furthermore, only rows that get a true back from one predicate need to be tested against the next predicate, so the set keeps gradually reducing. You can clearly see this in the plan in Figure 5 by observing the reducing numbers from the original set in T1, starting with the index seek against the index on T1.UNQ_T1_val_keycol, and flowing to the left to the series of nested loops operators implementing the left semi joins.

Then the outer query uses two OUTER APPLY operators to collect the trailing elements from both the sequence and the subsequence. Finally, the outer query groups the rows by the key from the sequence and identifies matches only when the number of trailing elements found is equal to the number of expected trailing elements (stored in the variable @trailingrows).

What's amazing about this solution is its speed. It completed on my machine in 1.5 seconds—faster than all the other solutions I've tested. Here are the full performance statistics that I got for this solution:

Duration: 1.5 seconds, CPU: 5 seconds, logical reads: 3 million

The Theory of T-SQL Challenge Evolution

One of the interesting things about T-SQL challenges is that on one hand there are infinite possible solutions, but on the other hand the discovered solutions usually employ patterns that can be identified and reused in solving other challenges. People's creativity and knack for identifying patterns is the result of human evolution, which is what advancements in science and technology can typically be attributed to. You could say that, among other things, we've evolved to efficiently solve T-SQL puzzles! This article focused on the beautiful patterns in the solutions by Peter and Dwain, which once identified and shared can now be reused in solving other problems.

Comments

Plain text