Graph Matching with T-SQL Part 3: Maximum Matching

"The way forward is sometimes the way back…” --Labyrinth 1986

This article is the third part in a series about graph matching tasks. Part 1 introduced what graph matching is and related concepts. Part 2 covered solutions to the maximal matching task and left you with the challenge to handle the maximum matching task. This third part covers a solution to maximum matching. Make sure that you are familiar with the topics covered in the first two parts before proceeding to this article.

As a reminder, suppose you have a set X (say applicants), a disjoint set Y (say jobs), and a bipartite graph G connecting nodes from the two sets (applicant-job pairs where the applicant is qualified to perform the job). A matching M in graph G represents a subset of the edges in G where no two edges are incident (share a node). A maximal matching is a matching to which you cannot add any more edges of the graph. A maximum matching is a maximal matching with the maximum possible number of edges from the graph. Figure 1 shows graph matching examples.

Figure 1 - Graph matching examples

In our applicants-jobs example, you would want to find a maximum matching to maximize the utilization of applicants and fulfillment of jobs.

Sample data

To demonstrate and test the solution for maximum matching, I’ll use the same sample data that I used in the previous articles. Use the following code to create a small set of sample data in order to test the correctness of the solution:

SET NOCOUNT ON;
USE tempdb;

-- Cleanup
DROP TABLE IF EXISTS dbo.M, dbo.G, dbo.X, dbo.Y;
GO

-- Set X, e.g., Applicants
CREATE TABLE dbo.X
(
x VARCHAR(10) NOT NULL CONSTRAINT PK_X PRIMARY KEY,
moreonx VARCHAR(20) NOT NULL
);

INSERT INTO dbo.X(x, moreonx) VALUES
('A', 'Applicant A'),
('B', 'Applicant B'),
('C', 'Applicant C'),
('D', 'Applicant D');

-- Set Y, e.g., Jobs
CREATE TABLE dbo.Y
(
y INT NOT NULL CONSTRAINT PK_Y PRIMARY KEY,
moreony VARCHAR(20) NOT NULL
);

INSERT INTO dbo.Y(y, moreony) VALUES
(1, 'Job 1'),
(2, 'Job 2'),
(3, 'Job 3'),
(4, 'Job 4');

-- Graph G = (X, Y, E), e.g., possible applicant-job connections
CREATE TABLE dbo.G
(
x VARCHAR(10) NOT NULL
CONSTRAINT FK_G_X FOREIGN KEY REFERENCES dbo.X,
y INT NOT NULL
CONSTRAINT FK_G_Y FOREIGN KEY REFERENCES dbo.Y,
CONSTRAINT PK_G PRIMARY KEY (x, y)
);

INSERT INTO dbo.G(x, y) VALUES
('A', 1),
('A', 2),
('B', 1),
('B', 3),
('C', 3),
('C', 4),
('D', 3);
GO

-- M is a matching in G
CREATE TABLE dbo.M
(
x VARCHAR(10) NOT NULL
CONSTRAINT UQ_M_x UNIQUE,
y INT NOT NULL
CONSTRAINT UQ_M_y UNIQUE,
CONSTRAINT PK_M PRIMARY KEY (x, y),
CONSTRAINT FK_M_G FOREIGN KEY (x, y) REFERENCES dbo.G(x, y)
);

Use the following code to create a large set of sample data (~1000000 edges) to test the performance of the solution:

-- Cleanup
SET NOCOUNT ON;
USE tempdb;

DROP TABLE IF EXISTS dbo.M, dbo.G, dbo.X, dbo.Y;
DROP FUNCTION IF EXISTS dbo.GetNums;
GO

-- Helper function dbo.GetNums
CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
WITH
L0 AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
FROM L5)
SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
FROM Nums
ORDER BY rownum;
GO

CREATE TABLE dbo.X
(
x VARCHAR(10) NOT NULL CONSTRAINT PK_X PRIMARY KEY,
moreonx VARCHAR(20) NOT NULL
);

CREATE TABLE dbo.Y
(
y INT NOT NULL CONSTRAINT PK_Y PRIMARY KEY,
moreony VARCHAR(20) NOT NULL
);

CREATE TABLE dbo.G
(
x VARCHAR(10) NOT NULL
CONSTRAINT FK_G_X FOREIGN KEY REFERENCES dbo.X,
y INT NOT NULL
CONSTRAINT FK_G_Y FOREIGN KEY REFERENCES dbo.Y,
CONSTRAINT PK_G PRIMARY KEY (x, y)
);

CREATE TABLE dbo.M
(
x VARCHAR(10) NOT NULL
CONSTRAINT UQ_M_x UNIQUE,
y INT NOT NULL
CONSTRAINT UQ_M_y UNIQUE,
CONSTRAINT PK_M PRIMARY KEY (x, y),
CONSTRAINT FK_M_G FOREIGN KEY (x, y) REFERENCES dbo.G(x, y)
);
GO

DECLARE @n AS INT = 10000000; -- ~ total num rows
SET @n = SQRT(@n * 2); -- num members of arithmetic sequence

INSERT INTO dbo.X(x, moreonx)
SELECT
RIGHT('000000000' + CAST(N.n AS VARCHAR(10)), 10) AS x,
'Applicant ' + RIGHT('000000000' + CAST(N.n AS VARCHAR(10)), 10) AS moreonx
FROM dbo.GetNums(1, @n) AS N;

INSERT INTO dbo.Y(y, moreony)
SELECT
N.n AS y,
'Job ' + CAST(N.n AS VARCHAR(10)) AS moreonx
FROM dbo.GetNums(1, @n) AS N;

INSERT INTO dbo.G WITH (TABLOCK) (x, y)
SELECT RIGHT('000000000' + CAST(X.n AS VARCHAR(10)), 10) AS x, Y.n AS y
FROM dbo.GetNums(1, @n) AS X
CROSS APPLY (SELECT TOP (@n - X.n + 1) n
FROM dbo.GetNums(1, @n)
ORDER BY NEWID()) AS Y;

If you’re using SQL Server 2016 or later, make sure to create a dummy columnstore index to enable optimization with the batch mode Window Aggregate operator:

CREATE NONCLUSTERED COLUMNSTORE INDEX idx_cs_dummy
ON dbo.G(x) WHERE x = 'yes' AND x = 'no';

Algorithm for finding a maximum matching

The algorithm for finding a maximum matching involves the concept of an augmenting path, which is a special case of an alternating path. As a reminder, given a matching M in graph G, an alternating path, or M-alternating path in G, is a path made of edges in G that alternates between being in the matching and not being in the matching. An augmenting path is an alternating path that starts and ends with free nodes. Figure 2 illustrates the types of paths.

Figure 2 - Types of paths

Following is an algorithm for finding a maximum matching M in G:

Start with any matching M in G (empty or nonempty)
While true
- Set M = M ⊕ P (symmetric difference of M and P), where P = M-augmenting path in G
- If augmenting path was not found, break

Details on this algorithm, as well as proofs for claims it relies on, and its complexity, can be found in notes on an MIT lecture on bipartite matching.

Figure 3 illustrates applying this algorithm with the small set of sample data, starting with an empty matching.

Observe how the matching’s cardinality grows by exactly one with each step. This means that if your starting point is a matching with cardinality n (0 for empty matching as a starting point), and the cardinality of the maximum matching is m, the algorithm will go through m - n iterations of augmentations.

Last month I presented different solutions for finding a maximal matching. With most of the solutions that I presented, the average time it takes to find each edge in the maximal matching is much lower than that of finding an augmenting path and applying an augmentation. Therefore, it’s a good idea to start with as big of a maximal matching as possible as the starting point for finding a maximum matching.

Maximal matching solution

As a reminder, Solution 5 to finding a maximal matching, which I presented last month, generated a matching with a greater cardinality than all other solutions. Therefore, we’ll use this solution to find a maximal matching, which we will use as our starting point for finding a maximum matching.

If you haven’t done so already, first make sure to run the following code to create a dummy columnstore index to enable optimization with the batch mode Window Aggregate operator:

CREATE NONCLUSTERED COLUMNSTORE INDEX idx_cs_dummy
ON dbo.G(x) WHERE x = 'yes' AND x = 'no';

Next, run the following code to create the MaximalMatching procedure, which implements Solution 5 from last month:

CREATE OR ALTER PROC dbo.MaximalMatching
AS

SET NOCOUNT, XACT_ABORT ON;

DECLARE @n AS BIGINT, @x AS VARCHAR(10), @y AS INT;

TRUNCATE TABLE dbo.M;

CREATE TABLE #G
(
n BIGINT NOT NULL,
x VARCHAR(10) NOT NULL,
y INT NOT NULL,
INDEX idx_n_y CLUSTERED(n, y)
);

WITH C AS
(
SELECT x, y,
COUNT(*) OVER(PARTITION BY x) AS cnt
FROM dbo.G
)
INSERT INTO #G (n, x, y)
SELECT DENSE_RANK() OVER(ORDER BY cnt, x) AS n, x, y
FROM C;

SELECT TOP (1) @n = n, @x = x, @y = y
FROM #G
ORDER BY n, y;

WHILE @@ROWCOUNT > 0
BEGIN
INSERT INTO dbo.M(x, y) VALUES(@x, @y);

SELECT TOP (1) @n = n, @x = x, @y = y
FROM #G AS G
WHERE n > @n
AND NOT EXISTS ( SELECT * FROM dbo.M WHERE M.y = G.y )
ORDER BY n, y;
END;

DROP TABLE #G;
GO

The MaximalMatching procedure populates the table M with a maximal matching in G.

Algorithm for finding an augmenting path

Given a matching M in bipartite graph G, which connects nodes from set X with nodes from set Y, this section describes an algorithm for finding an M-augmenting path in G. The algorithm involves creating a residual directed graph, G’. You form the residual graph like so:

Add a source node s, and create directed edges pointing from s to all free x nodes (x nodes that are not incident to any edges in M)
Add a sink (target) node t, and create directed edges from all free y nodes (y nodes that are not incident to any edges in M) pointing to t
For every edge in M, create a directed edge in G’, pointing from y to x
For every edge in G that is not in M, create a directed edge in G’, pointing from x to y

Apply either a depth-first search (DFS) or breadth-first search (BFS) algorithm to find the shortest path between s and t. This path, minus the first and last edges from s and to t, respectively, is an M-augmenting path in G. Figure 4 has a graphical depiction of this algorithm.

Figure 4 - Finding an augmenting path

To the left is some matching M in graph G. At the center is the residual graph G’, created from M and G based on the aforementioned instructions. To the right is a highlighted shortest path from s to t: s->D->3->B->1->A->2->t. After removing the first and last edges of the path, you get D->3->B->1->A->2, which is an M-augmenting path in G. What we need for our purposes is to return the augmenting path as a subset of edges from G: { (D, 3), (B, 3), (B, 1), (A, 1), (A, 2) }.

Implementation of algorithm for finding an augmenting path

Our solution for maximum matching applies a series of augmentations, each setting M to the symmetric difference of M and an augmenting path in M. For this reason, it’s convenient to implement the solution for finding an augmenting path as a multi-statement TVF (called AugmentingPath) that returns a table variable @P with the set of edges forming the path. The function implements the following steps:

Try to identify an augmenting path of size 1 and write it to @P. That’s any edge in G with a free x and a free y. If such a path is found, return.
If an augmenting path of size one doesn’t exist, find all potential beginnings of longer augmenting paths (free nodes x from G that are connected to matched nodes y) and write them to a table variable @X. If none is found, an augmenting path doesn’t exist, return.
While an augmenting path was not found:
- Add to a table variable @Y distinct matched nodes y from G that have a parent node x in @X from the previous iteration, and that are not already in @Y. Use MIN(x) per y as parent since we eventually need only one (any) augmenting path. So in the case of multiple parents x, it doesn't matter from which we got to y. We do have all handled parents in @X, so we avoid revisiting paths in later iterations. If no such y nodes are found, an augmenting path doesn’t exist, return.
- If new nodes y exist, look for corresponding nodes x in M. They must exist, and they must be distinct (since connected y nodes from previous level are matched).
- Add to @P any edge from G with a free y and an x from the current level in @X. If such an edge exists, we found an augmenting path.
If got here, augmenting path must exist, and its length is > 1 segment. Construct the path from @P, @X and @Y.

Run the following code to create the complete AugmentingPath function:

CREATE OR ALTER FUNCTION dbo.AugmentingPath()
RETURNS @P TABLE
(
x VARCHAR(10) NOT NULL,
y INT NOT NULL,
PRIMARY KEY (x, y)
)
AS
BEGIN

DECLARE
@i AS INT = 0, -- Iteration level
@pathfound AS INT = 0; -- Flag indicating if augmenting path found

-- Set of nodes x to be considered in each iteration
-- Has parent y that lead to x
DECLARE @X AS TABLE
(
x VARCHAR(10) NOT NULL PRIMARY KEY,
parent_y INT NULL,
i INT NOT NULL
);

-- Set of nodes y to be handled in each iteration
-- Has parent x that lead to y
DECLARE @Y AS TABLE
(
y INT NOT NULL PRIMARY KEY,
parent_x VARCHAR(10) NOT NULL,
i INT NOT NULL
);

-- Try to find an augmenting path of size 1
-- An edge from G with a free x and a free y
-- (free means not incident to any edge in M)
INSERT INTO @P(x, y)
SELECT TOP (1) x, y
FROM dbo.G
WHERE NOT EXISTS ( SELECT * FROM dbo.M WHERE M.x = G.x )
AND NOT EXISTS ( SELECT * FROM dbo.M WHERE M.y = G.y );

IF @@ROWCOUNT > 0 RETURN;

-- If an augmenting path of size 1 was not found initialize @X
-- with distinct free nodes x from G
-- We already know that if they exist, they point to matched nodes y
INSERT INTO @X(x, parent_y, i)
SELECT DISTINCT x, NULL AS parent_y, @i AS i
FROM dbo.G
WHERE NOT EXISTS ( SELECT * FROM dbo.M WHERE M.x = G.x );

IF @@ROWCOUNT = 0 RETURN;

-- While augmenting path not found
WHILE @pathfound = 0
BEGIN

SET @i += 1;

-- Add to @Y distinct matched nodes y from G that have
-- a parent node x in @X from the previous iteration
-- and that are not already in y
-- Use MIN(x) per y as parent since we eventually need
-- only one (any) augmenting path so in case of multiple
-- parents x, doesn't matter from which we got to y
-- we do have all handled parents in @X, so we avoid
-- revisiting paths in later iterations
INSERT INTO @Y(y, parent_x, i)
SELECT G.y, MIN(G.x) AS parent_x, @i AS i
FROM dbo.G
INNER JOIN @X AS X
ON G.x = X.x
AND X.i = @i - 1
WHERE NOT EXISTS ( SELECT * FROM @Y AS Y WHERE Y.y = G.y )
GROUP BY G.y;

IF @@ROWCOUNT = 0 RETURN;

-- If new nodes y exist, look for corresponding nodes x in M
-- They must exist, and they must be distinct
-- (since connected y nodes from previous level are matched)
INSERT INTO @X(x, parent_y, i)
SELECT M.x, Y.y AS parent_y, @i AS i
FROM dbo.M
INNER JOIN @Y AS Y
ON M.y = Y.y
AND Y.i = @i;

-- Add to @P any edge from G with a free y
-- and an x from the current level in @X
-- If such an edge exists, we found an augmenting path
INSERT INTO @P(x, y)
SELECT TOP (1) G.x, G.y
FROM dbo.G
INNER JOIN @X AS X
ON G.x = X.x
AND X.i = @i
WHERE NOT EXISTS ( SELECT * FROM dbo.M WHERE M.y = G.y );

IF @@ROWCOUNT > 0
SET @pathfound = 1;

END;

-- If got here, augmenting path must exist, and its length is > 1 segment
-- Construct path from @P, @X and @Y
WHILE @i > 0
BEGIN
SET @i -= 1;

INSERT INTO @P(x, y)
SELECT X.x, X.parent_y AS y
FROM @P AS P
INNER JOIN @X AS X
ON P.x = X.x
AND X.i = @i + 1

INSERT INTO @P(x, y)
SELECT Y.parent_x AS x, Y.y AS y
FROM @P AS P
INNER JOIN @Y AS Y
ON P.y = Y.y
AND Y.i = @i + 1
END;

RETURN;

END;
GO

Observe that this solution doesn’t build the residual graph G’ in practice, but rather considers it as a virtual one. It skips the part with source s and sink t; they’re useful at the conceptual level. It starts with free x nodes, and ends with free y nodes, or otherwise aborts if an augmenting path cannot be found.

Use the following code to test the function (use the small set of sample data first):

SELECT * FROM dbo.AugmentingPath();

If you run this code when M is empty, you’ll get back a path with one edge. Feel free to manually populate M with any initial matching and then run the function to see that it works well.

Implementation of algorithm for finding a maximum matching

As discussed earlier, the algorithm for finding a maximum matching goes like this:

Start with any matching M in G (empty or nonempty)
While true
- Set M = M ⊕ P (symmetric difference of M and P), where P = M-augmenting path in G
- If augmenting path was not found, break

Our solution will use a maximal matching as the starting point by executing the MaximalMatching procedure you created earlier.

Then in a loop with a condition that is always true (e.g., WHILE 1 = 1), the body of the loop will implement two steps:

1. A MERGE statement that sets M to the symmetric difference of M and the result of the AugmentingPath function. This statement will define M as the target, and the AugmentingPath function (aliased as P) as the source. The merge predicate is M.x = P.x AND M.y = P.y. When a match is found, the target row is deleted; when a match is not found, the source row is inserted into the target.

2. If the number of rows affected by the MERGE statement is 0, we’re done, hence break from the loop.

Here’s the implementation of this solution, with a few more lines of code to measure the run time of each of the two main parts: first part finding a maximal matching, and second part applying the remaining augmentations:

DECLARE @i AS INT = 0, @dt AS DATETIME2 = SYSDATETIME(), @cnt AS INT;

-- Start by finding maximal matching
EXEC dbo.MaximalMatching;

SET @cnt = (SELECT COUNT(*) FROM dbo.M);

PRINT 'Maximal matching with '
+ CAST(@cnt AS VARCHAR(10))
+ ' matches found in '
+ CAST(DATEDIFF(second, @dt, SYSDATETIME()) AS VARCHAR(10))
+ ' seconds.';

SET @dt = SYSDATETIME();

-- While augmenting path found in last round, look for another and augment,
-- i.e., set M = M ⊕ P (symmetric difference)
WHILE 1 = 1
BEGIN
MERGE INTO dbo.M
USING dbo.AugmentingPath() AS P
ON M.x = P.x AND M.y = P.y
-- Below added to circumvent bug curtesy of Paul White
WHEN MATCHED AND 0 = 1 THEN UPDATE SET x = 'Thanks, Paul White!', y = 1759
WHEN MATCHED THEN DELETE
WHEN NOT MATCHED THEN INSERT(x, y) VALUES(P.x, P.y);

IF @@ROWCOUNT = 0 BREAK;

SET @i += 1;
END;

PRINT 'Processed ' + CAST(@i AS VARCHAR(10))
+ ' augmenting paths in '
+ CAST(DATEDIFF(second, @dt, SYSDATETIME()) AS VARCHAR(10))
+ ' more seconds.';

-- Return maximum matching
SELECT x, y
FROM dbo.M;

This code is mostly straightforward, but you’re probably wondering what’s the purpose of the dummy clause with the UPDATE action in the MERGE statement, since it can never be activated in practice:

WHEN MATCHED AND 0 = 1 THEN UPDATE SET x = 'Thanks, Paul White!', y = 1759

Without this clause, as a result of a bug, the MERGE statement may fail due to an intermediate duplicate key violation if it internally handles the insertions before the deletions. Paul’s solution adding a clause with an UPDATE action causes the optimizer to produce a wide (per-index) update plan with split, sort, and collapse, which prevents the intermediate duplicate keys. More in this in his blog on the topic.

Running this solution against the small set of sample data produces the following output:

Maximal matching with 3 matches found in 0 seconds.
Processed 1 augmenting paths in 0 more seconds.
x y
---------- -----------
A 2
B 1
C 4
D 3

Running this solution against the large set of sample data produces the following output (abbreviated):

Maximal matching with 4472 matches found in 36 seconds.
Processed 0 augmenting paths in 6 more seconds.
x y
---------- -----------
0000000001 4472
0000000002 4471
...
0000004471 532
0000004472 2275

As it turns out, the MaximalMatching procedure found a maximal matching that is also a maximum matching, which is ideal, so the first iteration of the loop couldn’t find an augmenting path and broke from the loop.

Conclusion

In this series, I covered graph matching challenges, starting with laying the foundations by describing related concepts, then providing solutions to maximal matching, and finally a solution to maximum matching. In this series I just scratched the surface of this subject. There are many related challenges, with additional layers of complexity, such as including weights (e.g., priority) and others. If you want to further investigate this subject, look up topics such as graph matching and flow network. Hopefully, as the built-in SQL Graph feature matures, it will add tools that will aid in solving such challenges more easily and more efficiently.

Comments

Plain text