T-SQL Tiebreakers

Downloads
45235.zip

A tiebreaker is a rule used to determine precedence between candidates when there's a tie. For example, when soccer teams have the same number of points in the league, the one with the higher goal difference takes precedence. A tiebreaker in T-SQL is a rule that determines which row your query will return if multiple rows have the same values for a given set of attributes. For example, suppose you were asked to return the most recent order for each employee from the Orders table. Returning the most recent order date for each employee requires a simple GROUP BY query with MAX(orderdate), but you also need to return other attributes besides the employee ID and the order date.

For this example, suppose you need to return the order ID, customer ID, and shipper ID of the most recent order for each employee. This request by itself is problematic because an employee row might have multiple orders with the same order date, yet you need to return one order for each employee. The request is non-deterministic because SQL Server can consider multiple rows to be the most recent for an employee according to the order date alone. Here's where you need a tiebreaker: a rule identifying a unique row for each employee. Let's look at four solutions to this problem; as we go along, I compare them in terms of their complexity and performance and recommend when to use each.

To test your solutions for logical correctness, you can use the Northwind database's Orders table. But you'll need a larger table to test performance. Run the code in Listing 1 to create and populate a large Orders table in tempdb. Listing 1 creates a table with 830,000 orders, handled by 100 different employees over a period of 365 days.

The solutions' performance might vary depending on the table size and the density of the employee IDs and dates. You can adapt the numbers in Listing 1 to generate density that more closely resembles your needs and test the performance of the different solutions.

Solutions That Use ANSI Subqueries

Before we examine different tiebreaking solutions, let's look at what happens when you don't specify a tiebreaker. Listing 2's ANSI-compliant solution uses a correlated subquery to filter orders that have an order date equal to the highest order date among the orders with the same employee ID as the one in the outer row. Figure 1 shows the result of this query against Orders in Northwind.

Notice that, in the output for employee 2, two orders have the same most-recent order date (1998-05-05). To return only one order for each employee, you need to choose a tiebreaker that, together with the order date, uniquely identifies a row for a given employee. For example, among the orders with the latest order date for employee 2, you could return the row with the highest order ID (11073). Listing 3 shows the revised query, which uses MAX(OrderID) as the tiebreaker; Figure 2 shows the query results.

The query uses two nesting levels of correlated subqueries to filter the desired orders. The outer query against Orders (O1) returns the rows with an order ID equal to the highest order ID from Orders (O2) among the rows with the same employee ID as in O1, and an order date equal to the highest order date from Orders (O3) among the rows with the same employee ID as in O1.

The recommended index for this solution, as well as most other solutions for this task, is a covering index on correlation column, sort column, tiebreaker column, and covered columns. For this example, the index would be on EmployeeID, OrderDate, OrderID, CustomerID, and ShipVia. EmployeeID is the correlation (group) column, OrderDate is the sort column, OrderID is the tiebreaker column, and I included CustomerID and ShipVia in the index for covering purposes. Remember that a covering index is an index that contains all columns referenced in the query, including attributes beyond the search keys.

This solution performs well (if you created the recommended index), it's ANSI-compliant, and its complexity level is low—so far. However, adding tiebreakers degrades performance and makes the query too complex. For example, suppose you had to use MAX(CustomerID) as the first tiebreaker, MAX(ShipVia) as the second, and MAX(OrderID) as the third and last. You can't use MAX(CustomerID) alone as a tiebreaker because EmployeeID, OrderDate, and CustomerID don't uniquely identify a row. The same applies when you add MAX (ShipVia) as the second tiebreaker. Only when you add the third—MAX (OrderID)—can you isolate one row for each employee. To revise the query, simply replace the filter in the query against O2 at callout A in Listing 3 with the filter in Listing 4, which specifies three tiebreakers. For each tiebreaker, you specify another correlated subquery, and in each such subquery, you have to correlate by the employee ID and all preceding tiebreakers.

The recommended index is on EmployeeID, OrderDate, CustomerID, ShipVia, and OrderID. EmployeeID is the correlation (group) column, OrderDate is the sort column, and CustomerID, ShipVia, and OrderID are the tiebreaker columns. The revised query is the slowest solution among the ones I present here. It ran on my laptop for more than 2 minutes against the sample data. All the other solutions ran for less than 10 seconds. Also, its complexity level is high. Unless there's a compelling advantage, I try to avoid using highly complex queries in production code.

Solutions That Use TOP 1

The TOP 1 solution uses only one correlated subquery regardless of the number of tiebreakers you have. The subquery returns the order ID of the row representing the desired order for the employee ID in the outer row. If the outer row's key is equal to the key the subquery returned, the outer row is returned. Listing 5 shows a solution that uses MAX(OrderID) as the tiebreaker.

The trick here is that the subquery filters all rows belonging to the employee ID specified in the outer row, sorts them by the sort column (OrderDate DESC) and the list of tiebreakers (OrderID DESC in this case), and returns the key of the first row. Of course, in practice, with the recommended index, the subquery won't really access all rows for the employee, nor will it sort anything. It will perform a seek within the index and might even hash or spool the result in order to save additional I/O operations for processing that employee's other orders.

This query performs very well. Even better, it scales well in terms of performance and complexity when you add tiebreakers. For example, to use MAX(CustomerID), MAX(ShipVia), and MAX(OrderID) as tiebreakers, all you need to do is revise the ORDER BY list in Listing 5's subquery to

ORDER BY OrderDate DESC, CustomerID DESC, ShipVia DESC, OrderID DESC

As I mentioned, query performance will vary according to the density of employee IDs, among other things. If the number of employees is small, you can further optimize the TOP 1 technique by querying the Employees table and using a join. Instead of invoking the subquery in the filter of the query against Orders, invoke it in the SELECT list of a query against Employees:

SELECT (SELECT TOP 1 OrderID
  FROM Orders AS K
  WHERE K.EmployeeID =
    E.EmployeeID
    ORDER BY OrderDate DESC, OrderID DESC) AS topkey
FROM Employees AS E

Join this table (call it D) to the Orders table (call it O) on O.OrderID = D.topkey.

The query in Listing 6 shows the revision of Listing 5's query. Here again, enhancing the tiebreaker list is merely a matter of revising the ORDER BY list. Investigate the execution plan and I/O stats for this query, and compare them to the previous ones. You'll find that the technique in Listing 6 performs one seek operation within the covering index for each employee to get the key from the subquery, then a seek operation within the index on order ID to grab the whole row. This plan is more efficient than other plans when the number of employees is small. As an exercise, you can experiment with the number of unique employees in Orders and see when these queries become slower than previous ones.

Solutions That Use Concatenation

You might be satisfied with the solutions you have so far. However, the solutions I've described will perform well only if a good index (preferably a covering index) exists for them. However, if no good index exists and you can't create one, you might want to consider another solution.

The idea is to first concatenate the sort column, tiebreakers, and covered columns. Then, group the data by employee ID, calculate the maximum value of the concatenated string, and in an outer query break the concatenated string into individual attributes. The query in Listing 7 shows a solution that uses MAX(OrderID) as the tiebreaker. To use MAX(CustomerID), MAX(ShipVia), and MAX(OrderID) as tiebreakers, simply revise the order of attributes you specify accordingly.

You can accomplish the concatenation through several methods. The example in Listing 7 concatenates the binary representation of the attributes (I find this concatenation method the simplest), assuming that the sort order of the binary data reflects the original sort order of the attributes. In both examples, I use positive integers, fixed-length character strings, and dates, which keep their original sort behavior. The binary representation for character-based data is case-sensitive, but this doesn't generate any logical problems. Another thing to keep in mind is that if you need to concatenate variable-length data (e.g., varchar, nvarchar), make sure you convert it to a fixed-length binary value, which will pad it with zero bits. Otherwise, SQL Server won't calculate the MAX(concatenated string) value correctly, and you won't be able to figure out its location in the concatenated string.

The main benefit of this solution is its performance—it relies on a single scan of the data, even when no adequate indexes exist. However, it has drawbacks. First, it's complex, not intuitive or trivial. Second, you can use this solution only when you're dealing with data types for sort columns and tiebreakers that keep their original sort behavior when converted to a different type (binary/character) for concatenation purposes.

Finally, all other solutions let you control the sort direction separately for each attribute participating as sort column or tiebreaker. In the ANSI subqueries solutions, you can simply specify MIN or MAX for each attribute. With the TOP 1 solutions, you control the sort direction by specifying ASC or DESC in the ORDER BY clause. But the concatenation solution allows only one sort direction for all attributes—either MAX for all or MIN for all. The only exception is numeric data. For example, if you want to return for each employee the order with the latest order date and use the lowest order ID as the tiebreaker, you make two revisions to the query in Listing 7. First, instead of concatenating OrderID, concatenate MAXINT - OrderID (2147483647 - OrderID). Second, in the outer query, instead of just extracting the OrderID, subtract the value you extract back from MAXINT to return it to its original value (2147483647 - OrderID). MAXINT - MAX(MAXINT - value) effectively gives you MIN(value).

Performance vs. Simplicity

For me, an ideal solution performs well, scales well, is ANSI-compliant, and is simple enough for others to understand and maintain. If you find such a solution to a problem, you're lucky. With most problems, I have to compromise on something. But if you're familiar with many techniques, you have more options. In different situations, I might compromise on different aspects of the solution, depending on the customer's priorities. But until I find the ideal solution, I don't consider myself finished with the problem. I stash it in a drawer and try it again when I learn a new technique.

Comments

Plain text