Solutions to the Travel Puzzle

In my previous post I provided a T-SQL challenge called Travel Puzzle that involved calculating the total miles travelled domestically in each country by each traveler. You can find the details of the puzzle here: http://www.sqlmag.com/blogs/puzzled-by-t-sql/tabid/1023/entryid/13023/TSQL-Travel-Puzzle.aspx. Here I’m going to cover the solutions. Thanks to all those who posted solutions including Jermy, Peter Larsson (Peso), Marcello Poletti (Marc), Regan Wick, and Colleen Morrow.

The solutions use four different strategies:

1. Identifying islands (Itzik 1, Peso)

2. Using a subquery that retrieves information from the “next” row (Regan, Colleen)

3. Using a function that returns the “next” row and a recursive query (Marc)

4. Matching adjacent rows using row numbers (Jermy, Itzik 2)

Now for the details…

1. Identifying islands (Itzik 1 and Peso)

The key part of the solution is to calculate two row numbers that are partitioned by travelerid—one ordered by travelstart, and the other ordered by destination, travelstart. The difference between the two is constant and unique per traveler and destination. You can then group the rows by traveler, destination and that diff, calculate the difference between the maximum miles traveled and minimum miles traveled, and this way get the total miles traveled in each leg of the itinerary. Finally, group the result by traveler and calculate the total miles traveled in all legs.

Here’s my solution implementing this approach along with performance measures:

-- Itzik 1

-- CPU time = 9501 ms, elapsed time = 8554 ms, logical reads 4978

WITH C1 AS

(

SELECT travelerid, destination, milestraveled,

ROW_NUMBER() OVER(PARTITION BY travelerid

ORDER BY travelstart) -

ROW_NUMBER() OVER(PARTITION BY travelerid

ORDER BY destination, travelstart) AS grp

FROM dbo.Travel

C2 AS

(

SELECT travelerid, destination,

MAX(milestraveled) - MIN(milestraveled) AS miles

FROM C1

GROUP BY travelerid, destination, grp

)

SELECT travelerid, destination, SUM(miles) AS totalmiles

FROM C2

GROUP BY travelerid, destination;

Peso implemented a very similar idea, only in the second row number he specified the destination attribute as a partitioning (as opposed to ordering) element. The effect is the same, though. Here’s Peso’s solution:

-- Peso

-- CPU time = 10950 ms, elapsed time = 8347 ms, logical reads 4978

WITH cteTravel(TravelerID, Destination, TotalMiles)

(

SELECT TravelerID,

Destination,

MAX(MilesTraveled) - MIN(MilesTraveled) AS TotalMiles

FROM (

SELECT TravelerID,

Destination,

MilesTraveled,

ROW_NUMBER() OVER (PARTITION BY TravelerID ORDER BY TravelStart) AS Yak,

ROW_NUMBER() OVER (PARTITION BY TravelerID, Destination ORDER BY TravelStart) AS Yik

FROM dbo.Travel

) AS x

GROUP BY TravelerID,

Destination,

Yak - Yik

)

SELECT TravelerID,

Destination,

SUM(TotalMiles) AS TotalMiles

FROM cteTravel

GROUP BY TravelerID,

Destination;

The benefit in these solutions is that the data is scanned only once. One of the row number calculations can benefit from an ordered scan of an index. However, there’s no avoiding of one sort operation in the plan.

2. Using a subquery that retrieves information from the “next” row (Regan, Colleen)

The solutions in this category use a subquery retrieving the interesting information (milestraveled) from the “next” row; that is, in respect to the current row, the first row with the same travelerid value, a greater travelstart value, based on travelstart ordering, provided that the destination is the same. Regan implemented this approach using APPLY and a TOP query, like so:

-- Regan

-- CPU time = 11372 ms, elapsed time = 4552 ms, logical reads 3197603

SELECT

dbo.Travel.travelerid

,dbo.Travel.destination

,SUM(nextstop.milestraveled - dbo.Travel.milestraveled) as totalmiles

FROM

dbo.Travel

CROSS APPLY

(

SELECT TOP 1

milestraveled

,destination

FROM

dbo.Travel Travel2

WHERE

Travel2.travelerid = dbo.Travel.travelerid

AND Travel2.travelstart > dbo.Travel.travelstart

ORDER BY

Travel2.travelstart

) nextstop

WHERE

dbo.Travel.destination = nextstop.destination

GROUP BY

dbo.Travel.travelerid

,dbo.Travel.destination;

As you can see, this solution is surprisingly fast despite the very high number of logical reads. The reason for this is that most of the reads are only logical in a test system where this is the only query running, and the plan for this query utilizes parallelism very efficiently. In an environment with multiple concurrent queries, a query with a high number of reads will often tend to incur a bigger performance hit than one with a low number. Nevertheless, it is surprising to see how fast it runs in an isolated environment albeit the high number of reads.

Colleen’s solution is based on similar principals only using a join, subquery and aggregate, instead of APPLY, subquery and TOP. Here’s Colleen’s solution:

-- Colleen

-- CPU time = 18798 ms, elapsed time = 7188 ms, logical reads 3202667

select travelerid, departed, sum(miles) from

(select t1.travelerid, t1.destination as departed, t2.destination as arrived, t2.milestraveled - t1.milestraveled as miles

from travel t1 join travel t2 on t1.travelerid = t2.travelerid

and t2.travelstart = (select min(t3.travelstart) from travel t3 where t3.travelstart > t1.travelstart

and t3.travelerid = t1.travelerid)

) a

where departed = arrived

group by travelerid, departed;

In this category Regan’s solution performs better.

3. Using a function that returns the “next” row and a recursive query (Marc)

The third category is based on Marc’s phantasy of having a PREVIUS operator in the language; right Marc? :)

The implementation of a currently supported alternative to this construct is to create an inline table function that returns the “next” or “previous” row using TOP logic based on travelstart ordering for the same traveler, and then use a recursive query that keeps iterating through adjacent rows. Here’s the complete solution:

-- Marc

-- CPU time = 45427 ms, elapsed time = 49588 ms, logical reads 5999999 (Worktable) + logical reads 3009928 (Travel)

create function fi_NextTravel(@travelerid int, @travelstart datetime2(0))

returns table as return

select top 1 *

from dbo.Travel

where travelerid=@travelerid and travelstart>@travelstart

order by travelstart

with r as(

select *, Ml=0 from dbo.Travel where milestraveled=0

union all

select n.*, case n.destination when r.destination then n.milestraveled-r.milestraveled else 0 end

from r

cross apply dbo.fi_NextTravel(r.travelerid, r.travelstart) n

)

select travelerid, destination, milestraveld=sum(Ml) from r

group by travelerid, destination

option(maxrecursion 0)

As you can see, this solution is very slow mainly due to the high overhead of the recursive query.

4. Matching adjacent rows using row numbers (Jermy, Itzik 2)

Solutions based on this category start by assigning row numbers partitioned by travelerid, ordered by travelstart. You then join two instances of the set with the row numbers, matching those by travelerid and a difference of 1 between the row numbers. You filter only those rows where the destination is the same in both sides. Here’s Jermy’s implementation of the solution:

-- Jermy

-- no perf info--stopped query after a couple of hours

WITH CTE_Ord AS (

SELECT ROW_NUMBER() OVER(PARTITION BY t.travelerid ORDER BY t.travelstart) AS Ord

,t.travelerid

,t.destination

,t.travelstart

,t.milestraveled

FROM dbo.Travel AS T

)

SELECT COFrom.travelerid,COFrom.destination,SUM(COTo.milestraveled - COFrom.milestraveled) milestraveled

FROM CTE_Ord AS COFrom

INNER JOIN CTE_Ord AS COTo

ON COFrom.travelerid = COTo.travelerid

AND COFrom.destination = COTo.destination

AND COFrom.Ord = COTo.Ord - 1

GROUP BY COFrom.travelerid,COFrom.destination;

You would expect this solution to perform reasonably, but when running it on my machine it just kept going and going, and eventually I stopped it after a few hours. I discussed the optimization of this query with fellow MVPs and would like to thank Peter Larsson and Rob Farley for their inputs. The plan shows a merge join algorithm used to process the join. The culprit seems to be the merge operator’s split of processing of predicates to Where (join columns) and Residual. The data is served to the merge operator after sorting ordered by destination and travelid. The Where (join) property of the operator therefore indicates filtering that is based on this order: ([tempdb].[dbo].[Travel].destination, [tempdb].[dbo].[Travel].travelerid) = ([tempdb].[dbo].[Travel].destination, [tempdb].[dbo].[Travel].travelerid). This merge join is processed as a many-to-many one, where for each row in one side, the operator matches all rows with the same travelerid and destination doing very massive rewinds. It then processes the predicate appearing in the Residual property of the operator on top doing the remaining filtering: [tempdb].[dbo].[Travel].[travelerid] as [T].[travelerid]=[tempdb].[dbo].[Travel].[travelerid] as [T].[travelerid] AND [tempdb].[dbo].[Travel].[destination] as [T].[destination]=[tempdb].[dbo].[Travel].[destination] as [T].[destination] AND [Expr1002]=([Expr1005]-(1)). The last part is the row number matching that does the majority of the filtering. The difference between the Where (join columns) and Residual predicates seems in a sense similar to the difference between a Seek predicate and Predicate in index operations. BTW, when forcing a hash join algorithm with a join hint, the query runs for about 5 seconds on my machine.

I used a similar solution to Jermy’s, but when observing the performance issue, I moved the predicate that matches the destination in both sides of the join to a CASE expression within the aggregate function, and this trick paid off big time. The optimizer still used a merge join algorithm, but did both destination and row number matching in the Where (join columns) predicate. The plan did not incur any sort operations either, but rather relied on the ordered scan of the index used to calculate the row numbers also for the merge join, and then handled the aggregate using hashing. Here’s the solution with the performance numbers:

-- Itzik 2

-- CPU time = 11515 ms, elapsed time = 3958 ms, logical reads 9956

WITH C AS

(

SELECT *,

ROW_NUMBER() OVER(PARTITION BY travelerid ORDER BY travelstart) AS rownum

FROM dbo.Travel

)

SELECT Cur.travelerid, Cur.destination,

SUM(CASE

WHEN Cur.destination = Prv.destination

THEN Cur.milestraveled - Prv.milestraveled

END) AS totalmiles

FROM C AS Cur

JOIN C AS Prv

ON Cur.travelerid = Prv.travelerid

AND Cur.rownum = Prv.rownum + 1

GROUP BY Cur.travelerid, Cur.destination;

Wishful Thinking

I’d like to point out one more solution that is not supported at the moment in SQL Server (as of 2008 R2). But if support for window functions is enhanced in a future version; specifically for the LAG and LEAD functions, you will be able to achieve this task like so:

SELECT travelerid, destination,

SUM(CASE

WHEN destination = LAG(destination) OVER(PARTITION BY travelerid ORDER BY travelstart)

THEN milestravled - LAG(milestraveled) OVER(PARTITION BY travelerid ORDER BY travelstart)

END) AS totalmiles

FROM dbo.Travel

GROUP BY travelerid, destination;

Such a solution has all the potential for good optimization, but naturally until we see it in SQL Server we can’t tell.

Cheers, and happy querying!

Comments

Plain text