Ordered Set Functions: What's New--and Missing--in SQL Server vNext

Back in 2012, I wrote an article introducing ordered set functions and mentioned the lack of support for those in SQL Server at that time. Starting with SQL Server vNext (2017/2018?) CTP 1.1, we finally have support for the first ordered set function—STRING_AGG. This function is also already supported in Azure SQL Database. In this article, I’ll provide a reminder of what ordered set functions are, what’s new in vNext, and what’s still missing. This is a true cause for celebration, so it’s perfectly appropriate to have a nice glass of Champagne while reading this article.

Example of Ordered Set Function

Aggregate functions are very common in use in T-SQL code, mainly for analytical purposes. So far, T-SQL has supported only grouped aggregate functions that have no ordering relevance. For example, consider the following grouped query, which computes the maximum order value for each customer:

SET NOCOUNT ON;
USE TSQLV4; -- TSQLV4 database: http://tsql.solidq.com/SampleDatabases/TSQLV4.zip

SELECT custid, MAX(val) AS maxval
FROM Sales.OrderValues
GROUP BY custid;

This code generates the following output:

custid      maxval
----------- --------
1       933.50
2       514.40
3       2082.00
4       4441.25
5       3815.25
...

The MAX aggregate function is a general set function that has no ordering relevance--i.e., it doesn’t matter in what order within the group the maximum value is computed. Irrespective of order, you’ll always get the same maximum. Conversely, an ordered set function is an aggregate function that has ordering relevance. For instance, suppose that instead of returning the maximum order value per customer you wanted to return the last order value per customer, based on orderdate and orderid ordering. The SQL standard defines a clause called WITHIN GROUP that defines the ordering for an ordered set function. Had T-SQL supported a LAST_VALUE grouped ordered set function (it currently doesn’t, but suppose that it did for the sake of the example), you would have achieved the task like so:

SELECT custid,
  LAST_VALUE(val) WITHIN GROUP(ORDER BY orderdate, orderid) AS lastorderval
FROM Sales.OrderValues
GROUP BY custid;

This query would have returned the following output:

custid      lastorderval
----------- -------------
1       933.50
2       514.40
3       660.00
4       491.50
5       1835.70
...

As of SQL Server 2016, T-SQL doesn’t support any grouped ordered set functions, but this changes in vNext …

String Aggregation

What was probably the most sought after ordered set function in T-SQL prior to vNext is one that applies grouped string concatenation. A connect item requesting it as part of more general support for ordered set functions got lots of positive votes. Grouped string concatenation is useful for many purposes, such as constructing pivoting and unpivoting queries for execution with dynamic SQL, concatenating individual items such as order IDs per group such as a customer based on recency ordering, and others.

Prior to vNext, the most popular solution for ordered string concatenation used the FOR XML PATH option. I’ll demonstrate how to apply it to query the Sales.Orders table in the TSQLV4 database and construct for each customer a comma separated string with all of its order IDs concatenated from most recent to least recent.

Create the following index to support the solution:

CREATE UNIQUE INDEX idx_cid_odD_oidD
  ON Sales.Orders(custid, orderdate DESC, orderid DESC);

Here’s the solution code, applied to a single input customer first:

DECLARE @custid AS INT = 1;

SELECT
  STUFF(
    (SELECT ',' + CAST(orderid AS VARCHAR(20)) AS [text()]
     FROM Sales.Orders
     WHERE custid = @custid
     ORDER BY orderdate DESC, orderid DESC
     FOR XML PATH('')), 1, 1, NULL) AS orderids;

The query returns the order IDs of the input customer sorted based on recency. It converts the integer order IDs to character strings and concatenates a comma in front of each ID. Using FOR XML PATH('') and returning the nodes as text nodes results in simple string concatenation. Finally, using the STUFF function the code removes the superfluous leading comma. This code generates the following output:

orderids
------------------------------------
11011,10952,10835,10702,10692,10643

To apply this logic per customer, query the Customers table, and apply the above logic in a correlated subquery that correlates the inner orders with the outer customer, like so:

SELECT C.custid,
  STUFF(
    (SELECT ',' + CAST(orderid AS VARCHAR(20)) AS [text()]
     FROM Sales.Orders AS O
     WHERE O.custid = C.custid
     ORDER BY orderdate DESC, orderid DESC
     FOR XML PATH('')), 1, 1, NULL) AS orderids
FROM Sales.Customers AS C;

This code generates the following output:

custid      orderids
----------- ---------------------------------------------------
1       11011,10952,10835,10702,10692,10643
2       10926,10759,10625,10308
3       10856,10682,10677,10573,10535,10507,10365
4       11016,10953,10920,10864,10793,10768,10743,10741...
...

Note that since order IDs are made of only digits; you don’t need to worry about handling characters that have special meaning in XML, such as >, which will normally show up in the result as >. But If such special characters are possible in your data and you want to return the strings with the original characters, there is a workaround. You add the TYPE directive and then extract the value from the XML instance as a character string using the .value method. To achieve this, in the above solution instead of (SELECT ... FOR XML PATH('')) use (SELECT ... FOR XML PATH(''), TYPE).value('.[1]', 'VARCHAR(MAX)').

The execution plan for this query is shown in Figure 1.

Figure 01: Plan for string aggregate with XML

The plan scans the index idx_nc_region on Customers to pull the customer IDs, and using a Nested Loops operator, applies a seek per customer in the supporting index that you created on Orders to collect the current customer’s order IDs. The UDX operator then forms the XML string with the concatenated order IDs. I tested this solution against a Customers table with 20,000 customers and an Orders table with 1,000,000 orders (~50 orders per customer). I got the following performance statistics on my laptop: CPU time: 1406 ms, elapsed time: 1549 ms, logical reads: 64147. As you can see, the run time is about 1.5 seconds.

The solution performs significantly worse when you do not create a supporting index. I got the following performance statistics without an index: CPU time: 9125 ms, elapsed time: 10491 ms, logical reads: 3079626. The IO cost jumps to millions of reads and the run time to 10.5 seconds.

In SQL Server vNext life is good. Microsoft introduces the function STRING_AGG, as an ordered set function. Its first input is the element you wish to concatenate, and its second input is the separator. Using the standard optional WITHIN GROUP clause you specify the ordering for the concatenation. Here’s how you use this function to handle our task:

SELECT custid,
  STRING_AGG(CAST(orderid AS VARCHAR(20)), ',')
    WITHIN GROUP(ORDER BY orderdate DESC, orderid DESC) AS orderids
FROM Sales.Orders
GROUP BY custid;

If you don’t need to guarantee concatenation order, you can omit the WITHIN GROUP clause.

The function supports both Unicode and regular character strings, and the input type dictates the result type. If the input is not a character string—say, an integer—by default SQL Server will implicitly convert it to a Unicode character string. That’s why I explicitly convert the orderid column to a regular character string in this solution.

The plan for this solution is shown in Figure 2.

Figure 02: Plan for string aggregate with STRING_AGG

The plan scans the supporting index on the Orders table in key order and then applies a Stream Aggregate operator, which relies on group order, to handle the concatenation. Clearly, this solution is much simpler and more efficient than the one based on FOR XML PATH. I got the following performance statistics for this solution on my laptop against a table with 1,000,000 orders: CPU time: 578 ms, elapsed time: 673 ms, logical reads: 3973. I got the following statistics without the supporting index: CPU time: 6890 ms, elapsed time: 3711 ms, logical reads: 4511. Use the following code to drop the supporting index:

DROP INDEX idx_cid_odD_oidD ON Sales.Orders;

FIRST_VALUE, LAST_VALUE

Adding STRING_AGG is an important first step in supporting ordered set functions in T-SQL. Hoping not to sound ungrateful, we’re still missing a few other important ordered set functions. For instance, currently T-SQL supports the FIRST_VALUE and LAST_VALUE functions only as windowed functions, not as grouped ones. For instance, suppose that you wanted to compute for each customer the values of the first and last orders. Here’s the desired output:

custid      firstorderval  lastorderval
----------- -------------- -------------
1       814.50     933.50
2       88.80      514.40
3       403.20     660.00
4       480.00     491.50
5       1488.80    1835.70
...

Had these functions been supported as grouped ones, you would have been able to achieve the task like so (this example is currently not supported):

SELECT custid,
  FIRST_VALUE(val) WITHIN GROUP(ORDER BY orderdate, orderid) AS firstorderval,
  LAST_VALUE(val) WITHIN GROUP(ORDER BY orderdate, orderid) AS lastorderval
FROM Sales.OrderValues
GROUP BY custid;

Instead, currently one solution is to invoke these functions as the supported windowed functions, like so:

SELECT custid,
  FIRST_VALUE(val)
    OVER(PARTITION BY custid
     ORDER BY orderdate, orderid
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS firstorderval,
  LAST_VALUE(val)
    OVER(PARTITION BY custid
     ORDER BY orderdate, orderid
     ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS lastorderval
FROM Sales.OrderValues;

However, recall that unlike grouped queries, windowed queries do not hide the detail. Hence, the result of the function is repeated for all rows that are associated with the same partition (custid in our example). This code generates the following output:

custid      firstorderval  lastorderval
----------- -------------- --------------
1       814.50     933.50
1       814.50     933.50
1       814.50     933.50
1       814.50     933.50
1       814.50     933.50
1       814.50     933.50
2       88.80      514.40
2       88.80      514.40
2       88.80      514.40
2       88.80      514.40
...

To remove the duplicates, add a DISTINCT clause, like so:

SELECT DISTINCT custid,
  FIRST_VALUE(val)
    OVER(PARTITION BY custid
     ORDER BY orderdate, orderid
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS firstorderval,
  LAST_VALUE(val)
    OVER(PARTITION BY custid
     ORDER BY orderdate, orderid
     ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS lastorderval
FROM Sales.OrderValues;

Another supported solution is to compute both ascending and descending row numbers for the individual orders within each customer partition; filter only the first and last orders per customer, and then use a pivoting technique to return the first and last order values in the same row. Here’s a solution implementing this logic:

WITH C AS
(
  SELECT custid, val,
    ROW_NUMBER()
  OVER(PARTITION BY custid
       ORDER BY orderdate, orderid) AS rownumasc,
    ROW_NUMBER()
  OVER(PARTITION BY custid
       ORDER BY orderdate DESC, orderid DESC) AS rownumdesc
  FROM Sales.OrderValues
)
SELECT custid,
  MAX(CASE WHEN rownumasc = 1 THEN val END) AS firstorderval,
  MAX(CASE WHEN rownumdesc = 1 THEN val END) AS lastorderval
FROM C
WHERE rownumasc = 1 OR rownumdesc = 1
GROUP BY custid;

Clearly, the two supported solutions are awkward and more complex than they should to be.

PERCENTILE_CONT, PERCENTILE_DISC

The situation with the PERCENTILE_CONT and PERCENTILE_DISC functions, which compute percentiles like median using continuous and discrete distribution models, respectively, is similar to that with FIRST_VALUE and LAST_VALUE. If you need a refresher about these functions, see How to Write T-SQL Window Functions, Part 2. These functions are currently not supported as grouped ordered set functions, rather only as windowed functions. For example, suppose you wanted to compute the median order values per customer, using both continuous and discrete distribution models. Here’s the desired output:

custid      medianvalcont  medianvaldisc
----------- -------------- --------------
1       830.15     814.50
2       399.875    320.00
3       749.06     749.06
4       480        480.00
5       1277.6     1096.20
...

Had there been support in T-SQL for PERCENTILE_CONT and PERCENTILE_DISC as grouped ordered set functions, you would have used a grouped query to achieve the task, like so (this example is currently not supported):

SELECT custid,
  PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY val) AS medianvalcont,
  PERCENTILE_DISC(0.5) WITHIN GROUP(ORDER BY val) AS medianvaldisc
FROM Sales.OrderValues
GROUP BY custid;

For now, you would need to use a similar trick to the one you used with FIRST_VALUE and LAST_VALUE with the DISTINCT clause, like so:

SELECT DISTINCT custid,
  PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY val)
    OVER(PARTITION BY custid) AS medianvalcont,
  PERCENTILE_DISC(0.5) WITHIN GROUP(ORDER BY val)
    OVER(PARTITION BY custid) AS medianvaldisc
FROM Sales.OrderValues;

With recent window function optimization improvements using the batch mode Window Aggregate operator, it becomes much more appealing to rely on PERCENTILE_CONT and PERCENTILE_DISC to compute percentiles, as oppose to resorting to alternative solutions. Hopefully, if/when Microsoft adds support for such functions as grouped functions, we’ll see similar optimization capabilities.

Conclusion

SQL Server makes the first big step in supporting ordered set functions by introducing the long-awaited STRING_AGG function, which concatenates strings in a group based on optional desired order. There are still a few other types of ordered set functions missing, like offset functions, inverse distribution, and others. Moreover, we don’t yet have support for CLR ordered set functions. It would be great to be able to control the order of appliance of a CLR user defined aggregate (UDA) using the WITHIN GROUP clause. If you find such improvements important to you, add your vote to the connect item requesting those. Back to the Champagne glass … Cheers!

Comments

Plain text