Puzzled By T-SQL Blog

Solutions to Related Tables Challenge

Last week I provided a T-SQL Challenge involving writing a recursive query that returns tables related to an input table directly or indirectly through foreign key relationships. You can find the challenge details here. The tricky part wasn’t really to come up with any solution that works, rather to come up with a solution using a recursive query. With this constraint the challenge became much more difficult—especially coming up with a solution that performs reasonably.

I’d like to thank all those who sent solutions, including Peter Larsson (Peso), Geri Reshef and Rubén Garrigós. Special thanks go to Peso for his help in benchmarking solutions in his environment and in sending corrections to other solutions. I know that many others tried to solve the puzzle and would like to thank you for your efforts—it is a difficult challenge!

The sample data I provided last week is far too simple to be used to test the performance of the solutions. Peso tested some of the solutions in his data warehouse with 160 tables and this way we got more realistic measures. I created a new set of tables—smaller than Peso’s environment, yet more complex than last week’s—and this was sufficient to show the performance differences between the solutions. Here’s the new sample data:

USE testfk;

CREATE TABLE dbo.T2(c2 INT PRIMARY KEY);

CREATE TABLE dbo.T3(c3 INT PRIMARY KEY);

CREATE TABLE dbo.T4(c4 INT PRIMARY KEY);

CREATE TABLE dbo.T5(c5 INT PRIMARY KEY);

CREATE TABLE dbo.T1(c1 INT PRIMARY KEY,

c2 INT REFERENCES dbo.T2,

c3 INT REFERENCES dbo.T3,

c4 INT REFERENCES dbo.T4,

c5 INT REFERENCES dbo.T5);

CREATE TABLE dbo.T6(c6 INT PRIMARY KEY,

c2 INT REFERENCES dbo.T2,

c3 INT REFERENCES dbo.T3);

CREATE TABLE dbo.T7(c7 INT PRIMARY KEY,

c3 INT REFERENCES dbo.T3,

c4 INT REFERENCES dbo.T4);

CREATE TABLE dbo.T8(c8 INT PRIMARY KEY,

c4 INT REFERENCES dbo.T4,

c5 INT REFERENCES dbo.T5);

CREATE TABLE dbo.T9(c9 INT PRIMARY KEY,

c2 INT REFERENCES dbo.T2,

c5 INT REFERENCES dbo.T5);

CREATE TABLE dbo.T10(c10 INT PRIMARY KEY,

c6 INT REFERENCES dbo.T6,

c3 INT REFERENCES dbo.T3);

CREATE TABLE dbo.T11(c11 INT PRIMARY KEY,

c3 INT REFERENCES dbo.T3,

c7 INT REFERENCES dbo.T7);

CREATE TABLE dbo.T12(c12 INT PRIMARY KEY,

c7 INT REFERENCES dbo.T7,

c4 INT REFERENCES dbo.T4);

CREATE TABLE dbo.T13(c13 INT PRIMARY KEY,

c4 INT REFERENCES dbo.T4,

c8 INT REFERENCES dbo.T8);

CREATE TABLE dbo.T14(c14 INT PRIMARY KEY,

c5 INT REFERENCES dbo.T5,

c8 INT REFERENCES dbo.T8);

CREATE TABLE dbo.T15(c15 INT PRIMARY KEY,

c5 INT REFERENCES dbo.T5,

c9 INT REFERENCES dbo.T9);

CREATE TABLE dbo.T16(c16 INT PRIMARY KEY,

c2 INT REFERENCES dbo.T2,

c9 INT REFERENCES dbo.T9);

CREATE TABLE dbo.T17(c17 INT PRIMARY KEY,

c2 INT REFERENCES dbo.T2,

c6 INT REFERENCES dbo.T6);

And here’s a graphical depiction of the relationships:

These tables are created in the testfk database used last week in addition to the other tables already there (tables A, B, C, D, E, F, G).

The performance measures I’ll provide are against this data model, using warm cache, not including parse and compile time. I used the following input for the performance test:

DECLARE @table AS NVARCHAR(261) = N'dbo.T1';

The expected output is (not necessarily in this order):

obj_schema_name  obj_name

---------------- ---------

dbo              T15

dbo              T16

dbo              T17

dbo              T2

dbo              T3

dbo              T4

dbo              T5

dbo              T1

dbo              T6

dbo              T7

dbo              T8

dbo              T9

dbo              T10

dbo              T11

dbo              T12

dbo              T13

dbo              T14

First, I’d like to emphasize that the constraint to come up with a solution based on a recursive query was just to increase the puzzle’s level of difficulty. In practical terms, it makes sense to address such need using a loop-based solution. Loop-based solutions in this case are far simpler and more efficient than recursive solutions. Here’s the one I showed last week (call it Itzik Loop) with a minor revision to return the input table when it has no related tables (remember to include in the code the declaration and initialization of the input @table provided earlier):

DECLARE @T AS TABLE ( id INT NOT NULL PRIMARY KEY );

INSERT INTO @T(id)

SELECT OBJECT_ID(@table) AS id

WHERE OBJECT_ID(@table) IS NOT NULL

UNION

SELECT referenced_object_id

FROM sys.foreign_keys

WHERE parent_object_id = OBJECT_ID(@table)

UNION

SELECT parent_object_id

FROM sys.foreign_keys

WHERE referenced_object_id = OBJECT_ID(@table);

WHILE @@ROWCOUNT > 0

INSERT INTO @T(id)

SELECT referenced_object_id AS id

FROM sys.foreign_keys AS FK

JOIN @T

ON FK.parent_object_id = [@T].id

UNION

SELECT parent_object_id

FROM sys.foreign_keys AS FK

JOIN @T

ON FK.referenced_object_id = [@T].id

EXCEPT

SELECT id FROM @T;

SELECT OBJECT_SCHEMA_NAME(id) AS obj_schema_name, OBJECT_NAME(id) AS obj_name

FROM @T;

The statistics for this solution on my system with the new sample data are: Duration: 4 ms, Reads: 587.

Here’s another loop-based solution provided by Peso (call it Peso Loop):

DECLARE @Tables TABLE

(

TableID INT PRIMARY KEY

)

INSERT     @Tables

(

TableID

)

SELECT     [object_ID] AS TableID

FROM SYS.TABLES

WHERE [object_ID] = OBJECT_ID(@table)

WHILE @@ROWCOUNT > 0

INSERT     @Tables

(

TableID

)

SELECT     DISTINCT TableID

FROM (

SELECT          fk.REFERENCED_OBJECT_ID AS TableID

FROM       SYS.FOREIGN_KEYS AS fk

INNER JOIN @Tables AS t ON t.TableID = fk.PARENT_OBJECT_ID

UNION ALL

SELECT          fk.PARENT_OBJECT_ID AS TableID

FROM       SYS.FOREIGN_KEYS AS fk

INNER JOIN @Tables AS t ON t.TableID = fk.REFERENCED_OBJECT_ID

) AS w

WHERE NOT EXISTS(SELECT * FROM @Tables AS t WHERE t.TableID = w.TableID)

SELECT

OBJECT_SCHEMA_NAME(TableID) AS obj_schema_name,

OBJECT_NAME(TableID) AS obj_name

FROM @Tables;

The stats I got for this solution are: Duration: 4 ms, reads: 728. In short, both loop-based solutions are pretty efficient as you can see.

As for recursive solutions, I’ll start with the ones I came up with, then I’ll provide also the ones by Peso, Geri and Rubén.

My first attempt at a solution using a recursive query is the following (call it Itzik Recursive 1):

WITH C AS

(

SELECT OBJECT_ID(@table) AS obj,

'.' + CAST(OBJECT_ID(@table) AS VARCHAR(MAX)) + '.' AS objpath

WHERE OBJECT_ID(@table) IS NOT NULL

UNION ALL

SELECT D.obj, D.objpath + CAST( D.obj AS VARCHAR(MAX) ) + '.'

FROM ( SELECT

CASE WHEN C.obj = F.parent_object_id

THEN F.referenced_object_id

ELSE F.parent_object_id

END AS obj, objpath

FROM C

JOIN sys.foreign_keys AS F

ON C.obj IN (F.parent_object_id, F.referenced_object_id) ) AS D

WHERE objpath NOT LIKE '%.' + CAST(obj AS VARCHAR(10)) + '.%'

)

SELECT DISTINCT

OBJECT_SCHEMA_NAME(obj) AS obj_schema_name,

OBJECT_NAME(obj) AS obj_name

FROM C;

The anchor member returns the input table. The recursive member returns a row for each counterpart of the tables that appeared in the previous round as either a referencing or a referenced table. For each edge, the code constructs a dot separated path with all of the object ids of the objects leading to the current. In each recursive round, the code filters out nodes that already appear in their parent node’s path.

This solution is short (in terms of amount of code) and elegant, but isn’t very fast. It creates all possible paths starting with the input node and made of unique nodes. As you can imagine, the number of paths can explode quickly to a large number with additions of new related tables.

The statistics I got for this solution are: Duration: 7,812, Reads: 2,949,749.

In order for a recursive solution to be really fast you need to avoid constructing multiple different paths and visit each node only once. The two main challenges are: 1. How to apply DISTINCT logic in the recursive member without using the DISTINCT clause (not allowed in recursive member). 2. Avoid building multiple different paths when multiple nodes are counterparts in a foreign key relationship with a single node from the previous round.

Both challenges have a solution. Challenge 1 can be handled by calculating a row number partitioned by the node and then filtering only rows with row number 1. Challenge 2 can be handled by using a FOR XML PATH query to concatenate multiple nodes into one. Here’s the complete solution code (call it Itzik Recursive 2):

WITH C AS

(

SELECT

CAST('.' + STR(OBJECT_ID(@table), 11) + '.' AS VARCHAR(MAX)) AS entirepath,

CAST('~' + STR(OBJECT_ID(@table), 11) + '~' AS VARCHAR(MAX)) AS lastpath

WHERE OBJECT_ID(@table) IS NOT NULL

UNION ALL

SELECT

REPLACE(entirepath, '~', '.') + '.',

SUBSTRING(entirepath, CHARINDEX('~', entirepath), LEN(entirepath)) + '~'

FROM (

SELECT CAST(entirepath AS VARCHAR(MAX)) AS entirepath

FROM (

SELECT element AS [text()]

FROM (

SELECT

CASE n WHEN 1 THEN entirepath ELSE obj END AS element,

ROW_NUMBER() OVER(PARTITION BY CASE n WHEN 1 THEN entirepath ELSE obj END

ORDER BY (SELECT NULL)) AS rownum

FROM (

SELECT

LEFT(entirepath, LEN(entirepath) - 1) AS entirepath,

'~' + STR(obj, 11) AS obj

FROM (

SELECT

C.entirepath,

CAST(

CASE WHEN C.lastpath LIKE '%~' + STR(F.parent_object_id, 11) + '~%'

THEN F.referenced_object_id

ELSE F.parent_object_id

END AS VARCHAR(10)) AS obj

FROM C

JOIN sys.foreign_keys AS F

ON C.lastpath LIKE '%~' + STR(F.parent_object_id, 11) + '~%'

OR C.lastpath LIKE '%~' + STR(F.referenced_object_id, 11) + '~%'

) AS D1

WHERE entirepath NOT LIKE '%.' + STR(obj, 11) + '.%'

) AS D2

CROSS JOIN (SELECT 1 UNION ALL SELECT 2) AS Nums(n)

) AS D3

WHERE rownum = 1

ORDER BY element

FOR XML PATH('')

) AS D4(entirepath)

WHERE entirepath IS NOT NULL

) AS D5

),

Split AS

(

SELECT

CAST(SUBSTRING(lastpath, 2, 11) AS INT) AS obj,

STUFF(lastpath, 1, 12, '') AS lastpath

FROM C

UNION ALL

SELECT

CAST(SUBSTRING(lastpath, 2, 11) AS INT) AS obj,

STUFF(lastpath, 1, 12, '') AS lastpath

FROM Split

WHERE lastpath '~'

)

SELECT

OBJECT_SCHEMA_NAME(obj) AS obj_schema_name,

OBJECT_NAME(obj) AS obj_name

FROM Split;

Observe in the CTE C that only one path is built in each iteration of the recursive query, concatenating the path built by the previous round and the qualifying items from the current round (concatenation done with FOR XML PATH). The code uses a trick to distinguish between the part of the path representing nodes accessed prior to current round and nodes added in this round (could be more than one). The trick is to use different separators for the two sections in the layer of the code that concatenates all parts—old (. as separator) and new (~ as separator). The reason to include both sections in the same string and not as different strings is that a FOR XML PATH query can return only one string. Then a layer of code on top can extract the last part of the path (section starting with ~) giving it the attribute name lastpart, and replace all ocurrences of the character ~ with . to return the entire path, giving it the attribute name entirepath.

Finally, another recursive query called Split extracts the node ids from the lastpath attribute calculated in each round. First round (anchor member) handles first node in each lastpath, second round (already recursive query) handles second node in each last path, and so on, until no more nodes exist.

Since this solution visits each node only once it is very fast. The performance stats I got for it are: Duration: 17, Reads: 2,129.

Following are solutions using recursive queries by Peso, Geri and Ruben and their performance statistics.

Here’s Peso’s solution (Call it Peso Recursive):

WITH cteTables (ChildTableName, ParentTableName)

AS (

SELECT          CASE x.Permutation

WHEN 0 THEN c.TABLE_NAME

ELSE p.TABLE_NAME

END AS ChildTableName,

CASE x.Permutation

WHEN 0 THEN p.TABLE_NAME

ELSE c.TABLE_NAME

END AS ParentTableName

FROM       (

SELECT     CONSTRAINT_SCHEMA + '.' + CONSTRAINT_NAME AS CONSTRAINT_NAME,

UNIQUE_CONSTRAINT_SCHEMA + '.' + UNIQUE_CONSTRAINT_NAME AS UNIQUE_CONSTRAINT_NAME

FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS

) AS rc

INNER JOIN (

SELECT     CONSTRAINT_SCHEMA + '.' + CONSTRAINT_NAME AS CONSTRAINT_NAME,

CAST(TABLE_SCHEMA + '.' + TABLE_NAME AS VARCHAR(MAX)) AS TABLE_NAME

FROM INFORMATION_SCHEMA.CONSTRAINT_TABLE_USAGE

) AS c ON c.CONSTRAINT_NAME = rc.CONSTRAINT_NAME

INNER JOIN (

SELECT     CONSTRAINT_SCHEMA + '.' + CONSTRAINT_NAME AS CONSTRAINT_NAME,

CAST(TABLE_SCHEMA + '.' + TABLE_NAME AS VARCHAR(MAX)) AS TABLE_NAME

FROM INFORMATION_SCHEMA.CONSTRAINT_TABLE_USAGE

) AS p ON p.CONSTRAINT_NAME = rc.UNIQUE_CONSTRAINT_NAME

CROSS JOIN (

SELECT     0 UNION ALL

SELECT     1

) AS x(Permutation)

WHERE      c.TABLE_NAME p.TABLE_NAME

), cteHierarchy(TableName, TablePath)

AS (

SELECT     TableName,

CHAR(2) + TableName + CHAR(2) AS TablePath

FROM (

SELECT     CAST(TABLE_SCHEMA + '.' + TABLE_NAME AS VARCHAR(MAX)) AS TableName

FROM INFORMATION_SCHEMA.TABLES

) AS d

WHERE TableName = @table

UNION ALL

SELECT          t.ParentTableName AS TableName,

h.TablePath + t.ParentTableName + CHAR(2) AS TablePath

FROM       cteHierarchy AS h

INNER JOIN cteTables AS t ON t.ChildTableName = h.TableName

WHERE      h.TablePath NOT LIKE '%' + CHAR(2) + t.ParentTableName + CHAR(2) + '%'

)

SELECT          TableName

FROM       cteHierarchy

ORDER BY   TablePath;

Here’s Geri’s solution (call it Geri Recursive):

With T1 As

(Select    Distinct Schema_Name(P.schema_id)+'.'+P.name P,

Schema_Name(R.schema_id)+'.'+R.name R

From sys.foreign_key_columns F

Inner Join sys.objects P

On F.parent_object_id=P.object_id

Inner Join sys.objects R

On F.referenced_object_id=R.object_id),

T2 As

(Select    ROW_NUMBER() Over(Order By Case When @table =P Then 1 When @table =R Then 2 Else 3 End,P,R) Mispar,

*

From T1),

T3 As

(Select Cast(0 As Int)  Mispar,

P,

R,

P PR,

(-1) Idcun,

Cast(','+P+','+R+',' As Varchar(Max)) S_in

From T2

Where Mispar=1

And @table In (P,R)

Union

Select     Cast(1 As Int) Mispar,

P,

R,

R PR,

(-1) Idcun,

Cast(','+P+','+R+',' As Varchar(Max)) S_in

From T2

Where Mispar=1

And @table In (P,R)

Union All

Select     Cast(Mispar As Int),

P,

R,

Case When Idcun=1 Then R When Idcun=2 Then P Else '' End PR,

Case When Idcun=2 Then 1 Else Idcun End Idcun,

Cast(S_in+Case When Idcun=1 Then R+',' When Idcun=2 Then P+',' Else '' End As Varchar(Max)) S_in

From (Select    T2.Mispar,

T2.P,

T2.R,

Case When T3.S_in Like '%,'+T2.P+',%' And T3.S_in Not Like '%,'+T2.R+',%' Then 1

When T3.S_in Not Like '%,'+T2.P+',%' And T3.S_in Like '%,'+T2.R+',%' Then 2

Else 0 End Idcun,

T3.S_in

From    T3

Inner Join T2

On T3.Mispar+1=T2.Mispar

And T3.Mispar>0

Where    T3.Idcun0) T

Union All

Select     Cast(Mispar As Int),

P,

R,

Case When Idcun=1 Then R When Idcun=2 Then P Else '' End PR,

Case When Idcun=2 Then 1 Else Idcun End Idcun,

Cast(S_in+Case When Idcun=1 Then R+',' When Idcun=2 Then P+',' Else '' End As Varchar(Max)) S_in

From (Select    T2.Mispar,

T2.P,

T2.R,

Case When T3.S_in Like '%,'+T2.P+',%' And T3.S_in Not Like '%,'+T2.R+',%' Then 1

When T3.S_in Not Like '%,'+T2.P+',%' And T3.S_in Like '%,'+T2.R+',%' Then 2

Else 0 End Idcun,

T3.S_in

From    T3

Inner Join T2

On T3.Idcun=1

And T2.Mispar=1) T)

Select     Distinct PR

From T3

Where PR''

OPTION(MAXRECURSION 0);

Statistics: Duration: 680, Reads: 156,951. Quite fast as you can see.

And here’s Ruben’s solution (call it Ruben Recursive):

WITH referenced(id,level)

AS

(

SELECT referenced_object_id AS id, (select count(*) from sys.foreign_keys) level

FROM sys.foreign_keys FK

WHERE referenced_object_id = OBJECT_ID(@table) or parent_object_id = OBJECT_ID(@table)

UNION ALL

SELECT

case

when (parent_object_id = id) then referenced_object_id

else parent_object_id

end

AS id,

referenced.level-1 level

FROM sys.foreign_keys FK

JOIN referenced ON (referenced_object_id = id or parent_object_id = id)

AND (referenced_object_idparent_object_id)

AND referenced.level >= 1

)

SELECT DISTINCT OBJECT_SCHEMA_NAME(id) AS obj_schema_name, OBJECT_NAME(id) AS obj_name

FROM referenced;

Statistics: I stopped it after it ran for 30 minutes.

Cheers,

BG