Solutions to Proximity Puzzle

Last week I provided a T-SQL puzzle involving matching to each row from one table the row from another table with the closest value in a certain attribute from each side. You can find the puzzle details as well as the sample data used here. I got many solutions and would like to thank all who sent those: Herbert Albert, Marcello Poletti (Marc), Peter Larsson (Peso), Adam Machanic, Brad Schulz, Francisco A González Díaz (Paco), Maciej Pilecki, cwestervelt, Geri Reshef, Bhudev, Steve Kass, kevindockerty, and malpashaa. I’m not going to discuss all solutions here, rather only certain ones, but I urge you to visit the original puzzle’s comments section and check out the different solutions. The solutions I chose to present here are the ones that I would consider actually using. Others are not as efficient, or overly complicated, or use similar principals to the ones I will present.

Straightforward but Slow…

First, here’s what’s probably the most straightforward solution to the puzzle, though unfortunately a very inefficient one:

SELECT

t1.col1 AS t1_col1, t1.col2 AS t1_col2,

A.col1 AS t2_col1, A.col2 AS t2_col2

FROM dbo.t1

OUTER APPLY ( SELECT TOP (1) *

FROM dbo.t2

ORDER BY ABS(t2.col1 - t1.col1), t2.col1 ) AS A;

The reason for the inefficiency of the solution can be seen in the query’s execution plan:

Observe that for each row in t1, there’s a full scan of t2 plus a Top N sort. With the given sample data, you’re looking at 10,000 + 10,000 x 1,000,000 rows processed. That’s a lot! The query cost is 926,975; that’s very high! There’s no point even measuring the run time of this solution since it would take it so long to finish.

APPLY and UNION

Credit for the strategy I’ll describe in this section goes to Herbert, myself, Adam, and especially to Marc who’s solution based on this strategy is the most efficient.

The solution involves using APPLY and UNION ALL, like so:

SELECT

t1.col1 AS t1_col1, t1.col2 AS t1_col2,

a.col1 AS t2_col1, a.col2 AS t2_col2

FROM dbo.t1

OUTER APPLY ( SELECT TOP (1) *

FROM ( SELECT *

FROM ( SELECT TOP (1) *

FROM dbo.t2

WHERE t2.col1 t1.col1

ORDER BY t2.col1 DESC) AS PRV

UNION ALL

SELECT *

FROM ( SELECT TOP (1) *

FROM dbo.t2

WHERE t2.col1 > t1.col1

ORDER BY t2.col1 ) AS NXT ) AS U

ORDER BY ABS(t1.col1 - U.col1), U.col1 ) AS A;

Having indexes on t1.col1 and t2.col1 (created implicitly in our tables due to the definition of primary key constraints) enables the following plan:

The table t1 is scanned once (10,000) rows, then for each of the rows in t1 the bottom branch of the Nested Loops operator performs two seek operations in the index on t2.col1. The seek operations fetch the rows with the closest t2.col1 values in respect to t1.col1. The two rows are then unified, and then a Top N Sort operator fetches the closest. This solution is far more efficient than the previous one, mainly because there’s no need to scan t2 in its entirety for every row in t1, rather perform two seek operations in an index and scan one row in the leaf for each. The Top N Sort operator deals with two rows in each iteration, with a total of as many iterations as the number of rows in t1. Observe that this Top N Sort operator is the most expensive part of the plan.

Here are the stats I got for this solution on my computer (all solutions tested with warm cache, on a computer with a Core i5 processor with 8GB DDR 3 memory):

CPU time = 531 ms, elapsed time = 575 ms.
Table 't2'. logical reads 60020
Table 't1'. logical reads 70

Marc used a very similar strategy to this one, yet brilliantly, he found a way to get rid of the Top N Sort operator. Here’s Marc’s solution:

SELECT

t1.col1 AS t1_col1, t1.col2 AS t1_col2,

a.col1 AS t2_col1, a.col2 AS t2_col2

FROM dbo.t1

OUTER APPLY ( SELECT TOP (1) *

FROM ( SELECT *

FROM ( SELECT TOP (1) *, t1.col1 - t2.col1 AS d

FROM dbo.t2

WHERE t2.col1 t1.col1

ORDER BY t2.col1 DESC) AS PRV

UNION ALL

SELECT *

FROM ( SELECT TOP (1) *, t1.col1 - t1.col1 AS d

FROM dbo.t2

WHERE t2.col1 > t1.col1

ORDER BY t2.col1 ) AS NXT ) AS U

ORDER BY U.d, U.col1 ) AS A;

Here’s the query’s execution plan:

Observe that the Top N Sort operator is replaced by a much more efficient strategy using Merge Join (Concatenation) and Top operators, relying on the order in which the data is streaming to them from the two branches below. Here are the stats I got for this solution:

CPU time = 141 ms, elapsed time = 370 ms.
Table 't2'. logical reads 60020
Table 't1'. logical reads 70

As you can see, CPU time is reduced to about a third, and run time to about a half.

Two APPLYs

Credit for the strategy I’ll describe in this section goes to Adam, Maciej, and especially Peso who sent this solution first. Malpashaa’s solution used somewhat similar principals, though a bit less efficiently.

The solution at hand queries t1 and uses two APPLY operators to get the rows with the closest values from t2:

SELECT

t1.col1 AS t1_col1, t1.col2 AS t1_col2,

a.col1 AS t2_col1, a.col2 AS t2_col2

FROM dbo.t1

OUTER APPLY ( SELECT TOP (1) *

FROM ( SELECT *

FROM ( SELECT TOP (1) *, t1.col1 - t2.col1 AS d

FROM dbo.t2

WHERE t2.col1 t1.col1

ORDER BY t2.col1 DESC) AS PRV

UNION ALL

SELECT *

FROM ( SELECT TOP (1) *, t1.col1 - t1.col1 AS d

FROM dbo.t2

WHERE t2.col1 > t1.col1

ORDER BY t2.col1 ) AS NXT ) AS U

ORDER BY U.d, U.col1 ) AS A;

The logic of figuring out which of the two rows to return the values from is handled with CASE expressions. As you can imagine, this strategy eliminates the need to concatenate and sort the values. Here’s the execution plan for this query:

With the given sample data the plan is serial, but with larger volumes it becomes parallel. The stats I got for this solution are:

CPU time = 140 ms, elapsed time = 330 ms.
Table 't2'. logical reads 63824
Table 't1'. logical reads 25

Adam suggested a variation of this solution that showed in his tests 10% improvement (Adam’s test involved lowering the cost threshold for parallelism to get a parallel plan for the given sample data):

select

q.*

from t1

cross apply

(

select

t1.col1 as t1_col1,

t1.col2 as t1_col2,

case

when lower_val.diff higher_val.diff then lower_val.col1

else coalesce(higher_val.col1, lower_val.col1)

end as t2_col1,

case

when lower_val.diff higher_val.diff then lower_val.col2

else coalesce(higher_val.col2, lower_val.col2)

end as t2_col2

from (select null) as n(m)

outer apply

(

select top(1)

abs(t2.col1 - t1.col1) as diff

from t2

where

t2.col1 t1.col1

order by

t2.col1 desc

) as lower_val

outer apply

(

select top(1)

abs(t2.col1 - t1.col1) as diff

from t2

where

t2.col1 >= t1.col1

order by

t2.col1

) as higher_val

) as q

Steve’s Half and Half

Both previous strategies provide good performance mainly when t1 is fairly small and t2 is large. Steve came up with a very clever solution that seems to be more efficient than the others when the situation is reversed; namely, t2 is small and t1 is large. Here’s Steve’s solution:

declare

@small decimal(18,1) = -2147483648.1,

@large decimal(18,1) = 2147483647.1;

with Tbounds as

(

select

t2.col1, t2.col2,

0.5*t2.col1+0.5*

isnull(

(select top 1 col1 from t2 as B

where B.col1 t2.col1 order by B.col1 desc), @small) as below,

0.5*t2.col1+0.5*

isnull(

(select top 1 col1 from t2 as B

where B.col1 > t2.col1 order by B.col1 asc), @large) as above

from t2

Tmatched as

(

select

t1.col1, t1.col2,

Tbounds.col1 as t2c1, Tbounds.col2 as t2c2

from t1

join Tbounds

on t1.col1 > below and t1.col1 above

)

select * from Tmatched;

Here’s Steve’s explanation:

“My strategy is as follows: before looking at t1, scan t2 and compute the smallest and largest [col1] values for which that row would be the right match. A row in t2 will match anything up to half way down and up to the next col1 values in t2. This range goes from t2.col1+next smallest t2.col1)/2 to col1+next largest col1)/2, with strict inequality for the lower bound.”

Here’s the execution plan for this query:

This solution is slower than the previous two with the given sample data (small t1, large t2). But when I filled t1 with 1,000,000 rows and t2 with 10,000 rows, it completed in 1 second, while the previous solutions completed in 4 seconds or more.

Actually credit for this logic should also go to Brad Schulz, Brad ‘s solution is based on similar principals to Steve’s, but the implementation wasn’t as simple and efficient.

Again, thanks everyone for sharing your solutions!

Cheers,

Comments

Plain text