Calculating the Median Gets Simpler in SQL Server 2005

SQL Server 2005 introduces several new T-SQL enhancements that let you simplify and optimize solutions to existing problems. Here, I’ll show you first how to calculate the median in SQL Server 2000, then show you a solution that uses T-SQL enhancements in SQL Server 2005. You'll see that the T-SQL solution running under SQL Server 2005 is both simpler and faster.

Median in the Middle

Median is a statistical calculation that's commonly used for analytical purposes to determine the middle value(s) in a distribution. Median is calculated differently depending on whether the input set of values contains an odd or even number of elements. When you have an odd number of elements and you sort the values in chronological order, median is the middle value. For example, if you have this set of values \{30, 10, 40\} and you sort them in chronological order, the median is 30 (middle value of \{10, 30, 40\}). In an even number of elements (also sorted in chronological order), the median is the average of the two middle values. For example, given the set of values \{30, 10, 40, 10\}, the median is 20 (average of 10 and 30).

To demonstrate different solutions for median calculations, I’ll use the VOrders view in my examples. Run the code in Listing 1 to create the VOrders view for the Northwind database. The view contains a row for each order; each order lists the ID of the employee who handled the order (EmployeeID column), the total monetary value of each order after discount (Value column), and additional information that's irrelevant for this example. Your task is to calculate the median Value for each employee; you should get the results that Table 1 shows. Remember to treat an employee with an odd number of orders differently than an employee with an even number of orders.

Round One: SQL Server 2000

You might find it helpful to start by tackling a similar, yet simpler, example that closely resembles the actual problem. For example, you can try to devise a solution to calculate the median for all values in the source table. First handle a case with an odd number of orders and then with an even number of orders. Then you can add a layer of complexity by merging the odd and even cases and apply the calculation for each employee. Here’s my proposed solution, handling an odd number of orders, which uses this query:

SELECT MAX(Value) FROM
  (SELECT TOP 50 PERCENT Value FROM dbo.VOrders ORDER BY Value) AS H1;

The query creates the derived table H1 and returns 50 percent of the values from the VOrders view, based on the order of the Value column. Note that the PERCENT option in the TOP clause rounds up the number of rows to return if the result number isn't a whole number. For example, if the view contained 11 rows, 50 PERCENT would return six rows. This means that the median value is included in the result set, and in fact, it's the maximum value in the set. So the outer query extracts the maximum value by requesting MAX(Value).

Now tackle a case with an even number of elements. You need to extract both the maximum value in the first half of the rows (based on the order of the Value column, of course) and the minimum value in the second half of the rows. Once you've extracted both the maximum and minimum values, you add them up and divide by two to get their average. If you think about it, applying the above query to an even number of rows actually returns the maximum value in the first half (946.00 in our case). To extract the minimum value in the second half (940.50 in our case), you simply reverse the ORDER BY clause in the derived table query and request MIN(Value) in the outer query, as the following code shows:

SELECT MIN(Value) FROM
  (SELECT TOP 50 PERCENT Value FROM dbo.VOrders ORDER BY Value DESC) AS H2;

To complete this example, add the two resulting values and divide that result by two (as Listing 2 shows), producing the result 943.25.

Now merge the odd and even cases so that your solution is invariant to parity (odd/even number of rows). Interestingly, the solution in Listing 2 already meets this requirement. We already know that the solution in Listing 2 handles the even case correctly. As for the odd case, think about it: Both subqueries are going to return the very same middle value. Thus, the average of two values that are equal is the same value.

Finally, let's add another complexity layer by applying the median calculation per employee. This layer is very simple to handle; you query the Employees table and invoke the calculation you used in Listing 2 in a subquery. To apply the calculation from the Employees table to the employee in the current row, you just need to add a correlation in the derived table queries that associates the inner EmployeeID with the outer one, as Listing 3 shows. Table 1 shows the results—though not necessarily in the same order, of course, since no ORDER BY clause was specified.

Round Two: SQL Server 2005

Now let's look at a solution in SQL Server 2005 that uses three T-SQL enhancements that I've described in past articles: common table expressions (CTEs), the ROW_NUMBER function, and the OVER clause for aggregate functions. (For more information, see "Get in the Loop with CTEs," May 2004; "Calculating Row Numbers in SQL Server 2005," April 2004; and "OVER Clause Simplifies Aggregate Window Calculations," November 2005.) First, create a CTE (call it OrdersRN) that calculates, for each order, a row number partitioned by EmployeeID that's based on the order of the Value column. The row number represents the position of the value, in a sorted set of values, for the employee. The CTE query also calculates a COUNT(*) by using the OVER clause partitioned by EmployeeID. This means that the result will show the count of orders that have the same EmployeeID as the one in the current row. In short, for each order you get the position of the value in a sorted list of values for the current employee (call it RowNum) and also the count of orders for that employee (call it Cnt). In the outer query, you filter only the rows that are relevant to the median calculation—the middle value in an odd case, and the two middle values in an even case. Use the following filter expression to achieve this level of filtering:

WHERE RowNum IN((Cnt + 1) / 2, (Cnt + 2) / 2)

Listing 4 shows the code that filters only the relevant rows for the median calculation, yielding the output that Table 2 shows. Observe that for an odd number of rows, the result is one row. For example, for employee 1 who has 123 orders, the result is the row with the row number 62. If the number of rows is even, the result is the two middle values; for employee 2 who has 96 orders, the results are the rows with row numbers 48 and 49. The remaining tasks are to group the rows by EmployeeID and request the average of the Value column, as Listing 5 shows. Doing so produces the results that Table 1 shows.

And the Winner is?

I find that obtaining the solution in SQL Server 2005 is simpler and more intuitive compared with using SQL Server 2000. Simplicity is achieved because the row numbers intuitively correspond to the median concepts, such as middle point or two middle points. Moreover, the SQL Server 2005 solution is substantially faster because it involves only one scan of the source data. If you examine the execution plan for both solutions, you'll find that the ratio of the cost of the SQL Server 2000 solution to the SQL Server 2005 solution is 86 percent to 14 percent—that is, the SQL Server 2000 solution uses about six times estimated cost than the SQL Server 2005 solution.

Comments

Plain text