Order data with NULLs last, with Merge Join (Concatenation)

SQL Server: Avoiding a Sort with Merge Join Concatenation

There are cases when SQL Server doesn’t normally realize it can rely on index order, but you can make the platform realize it with some extra manipulation and creativity. This article is the first in a two-part series in which I describe such tasks and solutions. This month I’ll talk about avoiding a sort with the Merge Join (Concatenation) operator. Next month I’ll describe avoiding a sort when needing the data in descending order.

One of the advantages of B-tree-based indexes is that they organize the data ordered by some index key. SQL Server can then rely on index order to present or process data ordered (for operations such as ORDER BY, GROUP BY and JOIN), without the penalty of an actual sort operation. Pulling the data preordered from an index has linear scaling since the cost of an ordered scan is simply proportional to the number of rows involved. Puling the data unordered and then applying an actual sort operation scales in an extra-linear fashion--more specifically, in an n log n fashion. So, if you often need to rely on ordered data, it’s beneficial to have supporting indexes. However, there are cases when SQL Server doesn’t normally realize it can rely on index order, but you can make the platform realize it with some extra manipulation and creativity. This article is the first in a two-part series in which I describe such tasks and solutions. This month I’ll talk about avoiding a sort with the Merge Join (Concatenation) operator. Next month I’ll describe avoiding a sort when needing the data in descending order.

Order Data with NULLs Last

Our first example for specialized ordering needs is a classic. I’ll use the Sales.Orders table in the TSQLV4 database for this example. The task is to return orders (orderid and shippeddate columns) ordered by shippeddate ascending, orderid ascending, but with NULLs last. Normally, SQL Server orders NULLs first. A common way for people to handle the task is by using a CASE expression as the first ordering expression returning a flag with higher ordering value for NULLs, and then shippeddate and orderid, like so:

 

USE TSQLV4;
 
SELECT orderid, shippeddate
FROM Sales.Orders
ORDER BY
  CASE WHEN shippeddate IS NOT NULL THEN 0 ELSE 1 END,
  shippeddate, orderid;

 

This solution returns the desired output with NULLs ordered last, as the query output shows:

 

orderid     shippeddate
----------- -----------
10249       2014-07-10
10252       2014-07-11
10250       2014-07-12
10251       2014-07-15
10255       2014-07-15
...
11073       NULL
11074       NULL
11075       NULL
11076       NULL
11077       NULL
 

There is an index called idx_nc_shippeddate defined on shippeddate and orderid (implied due to the presence of a clustered index on orderid). The problem is that since the first ordering expression manipulates columns, similar to breaking sargability of filters, it prevents SQL Server from being able to rely on index order in this example. This results in explicit sorting in the plan for this query as shown Figure 1, and therefore n log n scaling. 

Figure 1 – Order data with NULLs last, with Sort

 Order data with NULLs last, with Sort

 

There is a trick that you can apply as a workaround. Write two queries—one returning rows where shippeddate is not NULL, with a flag sort column (call it sortcol) based on the constant 0, and another returning rows where shippeddate is NULL, with the flag sort column based on the constant 1. Apply a UNION ALL operator between them, and place the whole thing in a CTE. Have the outer query against the CTE return the desired columns only, ordering the rows by sortcol, shippeddate and orderid. Here’s the complete solution’s code:

 

WITH C AS
(
  SELECT orderid, shippeddate, 0 AS sortcol
  FROM Sales.Orders
  WHERE shippeddate IS NOT NULL
 
  UNION ALL
 
  SELECT orderid, shippeddate, 1 AS sortcol
  FROM Sales.Orders
  WHERE shippeddate IS NULL
)
SELECT orderid, shippeddate
FROM C
ORDER BY sortcol, shippeddate, orderid;

 

The plan for this query is shown in Figure 2.

Figure 2 –

 

 

Notice in the plan that there are two index seeks used to obtain the data—one for the rows where shippeddate is not NULL (emitted ordered by shippeddate and orderid) and another where shippeddate is NULL (also emitted ordered by shippeddate and orderid). The two Compute Scalar operators produce the flag based on the constant—0 for rows where shippeddate is not NULL and 1 for rows where shippeddate is NULL. The Merge Join (Concatenation) operator than merges the two ordered inputs based on the order of: flag column, shippeddate and orderid. This operator is similar to the Merge Join operator, only instead of joining, it unifies two sorted inputs. Figure 3 illustrates how this operator works with a simplified example of two sorted integer inputs.

Figure 3 - Illustration of Merge Join (Concatenation)

 

 

 

The algorithm is pretty simple:

·       Fetch the first row in each of the sorted inputs.

·       While not reached the end of any of the inputs:

o   If the left sort key is less than or equal to the right sort key, emit the left row and read the next row from the left input. Otherwise, emit the right row and read the next row from the right input.

·       If any rows are left in any of the inputs, emit those.

The beauty of this algorithm is that it can be applied to preordered inputs based on index order scans (or seeks and range scans), like in the plan shown in Figure 2, meaning that in such a case the plan scales linearly.

Window function with NULLs last

The Merge Join (Concatenation) operator can be used to produce ordered unified rows for any purpose—not just for presentation ordering purposes. For example, window functions, such as ROW_NUMBER, need the data sorted by the window partitioning and window ordering columns, or just by the window ordering columns if there’s no window partition clause. For example, suppose you wanted to compute rows numbers for orders based on shippeddate ascending, orderid ascending ordering, but again, with NULLs last. Just like with the previous example, here in the window order clause you can start with a CASE expression that produces a flag with a higher ordering value for NULLs than for non-NULL values, and then continue with shippeddate and orderid, like so: 

SELECT orderid, shippeddate,
  ROW_NUMBER() OVER(ORDER BY
                     CASE WHEN shippeddate IS NOT NULL THEN 0 ELSE 1 END,
                     shippeddate, orderid) AS rownum
FROM Sales.Orders;

 

You get the following output showing row numbers starting with 1 for the non-NULL shipped dates, and with the last range of row numbers assigned to rows with a NULL shipped date:

 

orderid     shippeddate rownum
----------- ----------- --------------------
10249       2014-07-10  1
10252       2014-07-11  2
10250       2014-07-12  3
10251       2014-07-15  4
10255       2014-07-15  5
...
11073       NULL        826
11074       NULL        827
11075       NULL        828
11076       NULL        829
11077       NULL        830

 

The plan for this query is shown in Figure 4.

Figure 4 – Window function with NULLs last, with Sort

As in Figure 1, due to the fact that the window order clause starts with an expression that manipulates columns, the optimizer doesn’t rely on index order and applies explicit sorting. Just like in the previous example, you can avoid explicit sorting by using unioned queries with a sort flag (sortcol), only here it’s the window order clause—not the presentation order clause—that is based on sortcol, shippeddate and orderid. Here’s the complete solution query:

 

WITH C AS
(
  SELECT orderid, shippeddate, 0 AS sortcol
  FROM Sales.Orders
  WHERE shippeddate IS NOT NULL
 
  UNION ALL
 
  SELECT orderid, shippeddate, 1 AS sortcol
  FROM Sales.Orders
  WHERE shippeddate IS NULL
)
SELECT orderid, shippeddate,
  ROW_NUMBER() OVER(ORDER BY sortcol, shippeddate, orderid) AS rownum
FROM C;

 

The plan for this query is shown in Figure 5. 

Figure 5 – Window function with NULLs last, with Merge Join (Concatenation)

 Observe the use of the Merge Join (Concatenation) applied to the two preordered inputs, emitting the unified rows sorted like the window function needs; hence, no explicit sorting is used.

In other articles I cover solutions to tasks involving intervals where I rely on this trick to avoid explicit sorting. You can find examples in a series I wrote about Calculating Concurrent Sessions (Part 1, Part 2, Part 3), and in an article on Packing Intervals. In those solutions I unify interval start times with interval end times to one chronologically ordered sequence of events, and, using the Merge Join (Concatenation) operator, avoid the need for explicit sorting.

Unpivoting

Another example where the Merge Join (Concatenation) operator can be used to avoid explicit sorting is in ordering of unpivoted data. To demonstrate the technique, I’ll use a table called CustSales, which you create and populate with 1,000,000 rows by running the following code:

USE tempdb;
DROP TABLE IF EXISTS dbo.CustSales;
GO
CREATE TABLE dbo.CustSales
(
  custid INT NOT NULL
    CONSTRAINT PK_CustSales PRIMARY KEY,
  val2016 NUMERIC(12, 2) NULL,
  val2017 NUMERIC(12, 2) NULL,
  val2018 NUMERIC(12, 2) NULL,
);
 
INSERT INTO dbo.CustSales WITH (TABLOCK) (custid, val2016, val2017, val2018)
  SELECT custid,
    NULLIF(val2016, 0) AS val2016,
    NULLIF(val2017, 0) AS val2017,
    NULLIF(val2018, 0) AS val2018
  FROM ( SELECT N.n AS custid,
           ABS(CHECKSUM(NEWID())) % 1000 AS val2016,
           ABS(CHECKSUM(NEWID())) % 1000 AS val2017,
           ABS(CHECKSUM(NEWID())) % 1000 AS val2018
         FROM TSQLV4.dbo.GetNums(1, 1000000) AS N ) AS D;
 
CREATE INDEX idx_val2016 ON dbo.CustSales(val2016, custid);
CREATE INDEX idx_val2017 ON dbo.CustSales(val2017, custid);
CREATE INDEX idx_val2018 ON dbo.CustSales(val2018, custid);

 The CustSales table has a row per customer (custid) and a column per order year (valyyyy), with the total order values for the current customer and year. Observe the code creates an index on (valyyyy, custid) for each of the three order years. The task is to unpivot the data so that you get a row per customer and year, with one result column holding the year (call it orderyear) and another holding the value (call it val). In addition, you want the result to be sorted by the val column, ascending.

There are two main techniques people use to handle such unpivoting tasks—one using the UNPIVOT operator and another using the APPLY operator. Here’s the classic solution using the UNPIVOT operator:

SELECT custid, CAST(RIGHT(valyear, 4) AS INT) AS orderyear, val
FROM dbo.CustSales
  UNPIVOT(val FOR valyear IN (val2016, val2017, val2018)) AS U
ORDER BY val;

 

The plan for this query is shown in Figure 6.

Figure 6 – Plan for UNPIVOT query

 Observe the explicit Sort operator in the plan. With 1,000,000 rows in the source table, it took this query 4 seconds to complete on my machine (with results discarded to focus on processing time). Due to the need for explicit sorting, this plan scales in an n log n fashion.

Here’s the solution using the APPLY operator:

SELECT custid, orderyear, val
FROM dbo.CustSales
  CROSS APPLY ( VALUES(2016, val2016),
                      (2017, val2017),
                      (2018, val2018) ) AS A(orderyear, val)
WHERE val IS NOT NULL
ORDER BY val;

 

The plan for this query is shown in Figure 7. 

Figure 7 – Plan for APPLY query

 As you can see, the plan is very similar to the one used for the solution with the UNPIVOT operator, and also here the plan applies explicit sorting. This query too took 4 seconds to complete on my machine, and has n log n scaling.

If you’re unhappy with the plans that we got for both the UNPIVOT- and the APPLY-based solutions, you’re in the same boat as me. Your intuition should tell you that there may be a way to leverage the indexes that we have on the (valyyyy, custid) combinations. Indeed, such a solution exists. You write a separate query for each of the years, returning custid, the applicable yyyy constant as orderyear, and the respective valyyyy column as val, for rows where valyyyy is not NULL. You apply UNION ALL operators between the queries, and order the result by the column val, like so:

SELECT custid, 2016 AS orderyear, val2016 AS val
FROM dbo.CustSales
WHERE val2016 IS NOT NULL
 
UNION ALL
 
SELECT custid, 2017 AS orderyear, val2017 AS val
FROM dbo.CustSales
WHERE val2017 IS NOT NULL
 
UNION ALL
 
SELECT custid, 2018 AS orderyear, val2018 AS val
FROM dbo.CustSales
WHERE val2018 IS NOT NULL
 
ORDER BY val;

 

Figure 8 shows the execution plan for this solution.

Figure 8 – Plan for UNION ALL query

 

This technique leverages the order-preserving Merge Join (Concatenation operator), which relies on index order, twice—once between the queries that handle the years 2016 and 2017, and another time between the result and the query that handles the year 2018. This query completed in 1 second on my machine and has linear scaling.

Finally, just for fun, suppose that you want to unpivot the data, sorting the rows by val ascending, only this time keeping rows with NULLs in the val column, and having NULLs sorted last! This can be achieved by mixing the two techniques I described in the article—the one that orders NULLs last without sorting, and the one that unpivots data without sorting. This means that you need to unify six queries—one for each of the three years times each of the two NULL | non-NULL cases producing the sort flag (sortcol), like so:

WITH C AS
(
  SELECT custid, 2016 AS orderyear, val2016 AS val, 0 AS sortcol
  FROM dbo.CustSales
  WHERE val2016 IS NOT NULL
 
  UNION ALL
 
  SELECT custid, 2016 AS orderyear, val2016 AS val, 1 AS sortcol
  FROM dbo.CustSales
  WHERE val2016 IS NULL
 
  UNION ALL
 
  SELECT custid, 2017 AS orderyear, val2017 AS val, 0 AS sortcol
  FROM dbo.CustSales
  WHERE val2017 IS NOT NULL
 
  UNION ALL
 
  SELECT custid, 2017 AS orderyear, val2017 AS val, 1 AS sortcol
  FROM dbo.CustSales
  WHERE val2017 IS NULL
 
  UNION ALL
 
  SELECT custid, 2018 AS orderyear, val2018 AS val, 0 AS sortcol
  FROM dbo.CustSales
  WHERE val2018 IS NOT NULL
 
  UNION ALL
 
  SELECT custid, 2018 AS orderyear, val2018 AS val, 1 AS sortcol
  FROM dbo.CustSales
  WHERE val2018 IS NULL
)
SELECT custid, orderyear, val
FROM C
ORDER BY sortcol, val;

 

The plan for this solution (using SentryOne’s Plan Explorer) is shown in Figure 9.

Figure 9 – Plan for UNION ALL query, keeping NULLs

 

 

Amazingly, five Merge Join (Concatenation) operators are used in this plan, without the need for explicit sorting, hence this plan scales linearly.

Summary

This article demonstrates the use of query rewrites to enable the use of optimal optimization capabilities. The focus of this article was avoiding explicit sorting by using techniques that can leverage the Merge Join (Concatenation) order-reliant and order-preserving operator. Next month I’ll continue the coverage of explicit-sort avoidance by discussing queries that need to handle data in descending order.

 

Hide comments

Comments

  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.
Publish