Inverse Distribution Functions

Last month, I covered the concept of ordered set functions (see "Ordered Set Functions"). I described the support of the ISO and ANSI SQL standards for ordered set functions, as well as the SQL Server 2012 workarounds for the missing functionality. I described inverse distribution functions (percentiles), hypothetical set functions, and other types of calculations that can benefit from a design based on the ordered set function concept. Regarding inverse distribution functions, I described the implementation of the PERCENTILE_DISC and PERCENTILE_CONT functions as window functions in SQL Server 2012. I explained how you can use these functions to compute percentiles. I also included a challenge for readers to come up with their own solutions to both computations that work in SQL Server versions prior to 2012 -- namely, solutions that don't use the PERCENTILE_DISC and PERCENTILE_CONT functions.

This article starts with a brief reminder of the meaning of the computation of percentiles. I then provide solutions to the computations that are supported in SQL Server versions starting with SQL Server 2005. I hope you had a chance to try to come up with your own solutions first.

I use the same sample data in this article that I used last month -- the ExamScores table, which represents student exam scores. Use the code in Listing 1 to create and populate the table.

SET NOCOUNT ON;
USE tempdb;
GO

IF OBJECT_ID('dbo.ExamScores') IS NOT NULL DROP TABLE dbo.ExamScores;
CREATE TABLE dbo.ExamScores
(
  examid    INT         NOT NULL,
  studentid VARCHAR(10) NOT NULL,
  score     INT         NOT NULL,
  CONSTRAINT PK_ExamScores PRIMARY KEY(examid, studentid),
  CONSTRAINT CHK_ExamScores_score CHECK(score BETWEEN 0 AND 100)
);

CREATE INDEX idx_examid_score ON dbo.ExamScores(examid, score);

INSERT INTO dbo.ExamScores(examid, studentid, score) VALUES
  (1, 'E', 90),
  (1, 'I', 80),
  (1, 'M', 85),
  (1, 'P', 50),
  (1, 'R', 50),
  (2, 'B', 75),
  (2, 'E', 90),
  (2, 'M', 70),
  (2, 'N', 40),
  (2, 'R', 90),
  (2, 'U', 45);

Inverse Distribution Calculations

As a reminder, inverse distribution calculations compute percentiles. Loosely speaking, given a percent @pct (a float value in the range 0 through 1) and an ordering element o, the percentile computation returns a value below which @pct percent of the o values fall. For example, consider the ExamScores table. Suppose that you want to find the 50th percentile (aka the median) score for each exam. In this case @pct is 0.5, the ordering element is score, and the computation needs to be applied for each exam group. So for each exam, you're looking for the middle score -- the score below which 50 percent of the scores fall.

There are two distribution models used to compute percentiles: discrete and continuous. I'll cover each separately, starting with the standard's description of the computation, followed by a suggested pre-SQL Server 2012 compatible solution. Again, I hope you had a chance to try to come up with your own solutions first.

Percentile Discrete Distribution

Standard SQL defines the PERCENTILE_DISC function as an ordered set function that computes percentiles assuming a discrete distribution model. The standard describes the computation like so:

If PERCENTILE_DISC is specified, by treating the group as a window partition of the CUME_DIST window function, using the specified ordering of the value expression as the window ordering, and returning the first value expression whose cumulative distribution value is greater than or equal to the argument.

As a reminder, the CUME_DIST distribution function is a window function implemented in SQL Server 2012 that computes a relative rank of a row. The computation is defined as NP/NR, where NP is the number of rows that precede or peer with the current row and NR is the number of rows in the respective partition. Because your solution is supposed to work prior to SQL Server 2012, you can't use the CUME_DIST function.

You can compute NR with the COUNT window function. Computing NP is more complex. You can compute it as one more than the minimum RANK that's greater than the current rank. The implementation of this logic in T-SQL is expensive. What's interesting is that for our purposes of computing percentile discrete distribution, we can use the ROW_NUMBER function instead of NP and still get the correct result. (Many thanks to Adam Machanic for discovering this while reviewing one of my solutions to this task.) It can be a bit tricky to see how the logic works, so to help simplify things, first examine the information in Figure 1. This figure shows for each row in the ExamScores table a number of values: the row number (rownum), the number of rows that precede or peer with the current row (np), the number of rows in the partition (nr), the cumulative distribution (cd), and the result of the computation rownum/nr as an alternative to np/nr to compute cd. Remember that we're looking for "the first value expression whose cumulative distribution value is greater than or equal to the argument." Peers in terms of score have row numbers in the range <previous np + 1> through np. So "the first value expression whose np/nr value is greater than or equal to the argument" will always be the same as "the first value expression whose rownum/nr value is greater than or equal to the argument." In other words, for our computation's purposes we can use rownum/nr instead of np/nr and get the correct result. With this in mind, here's the pre-SQL Server 2012 compatible solution to computing percentile discrete distribution implementing this logic:

DECLARE @pct AS FLOAT = 0.5;
WITH C AS
(
SELECT examid, score,
ROW_NUMBER() OVER(PARTITION BY examid ORDER BY score) AS np,
COUNT(*) OVER(PARTITION BY examid) AS nr
FROM dbo.ExamScores
)
SELECT examid, MIN(score) AS percentiledisc
FROM C
WHERE 1.0 * np / nr >= @pct
GROUP BY examid;

Most of the complexity here is in figuring out the logic behind the computation and realizing that rownum can be used instead of np. As you can see from the solution, after you get this logic, the translation to T-SQL is very straightforward. Figure 2 shows the output of this solution with the input @pct = 0.5.

Percentile Continuous Distribution

If following the explanation of the previous solution didn't give you a headache, I promise you that the next solution will. But my assumption is that if you continued reading this far, you really need to perform such calculations in your environment, or you're a true geek -- or both. Either way, I admire you for continuing to read what to some people might seem like a dense subject (but to me is a fascinating one).

The standard describes the PERCENTILE_CONT function, which computes percentiles assuming a continuous distribution model, like so:

If PERCENTILE_CONT is specified, by considering the pair of consecutive rows that are indicated by the argument, treated as a fraction of the total number of rows in the group, and interpolating the value of the value expression evaluated for these rows.

The basic idea is that if there's no value in the population that falls exactly in the desired percentile, you interpolate it assuming continuous distribution. You interpolate the value from the two values that appear immediately before and after the desired one. For example, suppose that you want to find the 50th percentile (the median) score for each exam. Five students took exam 1. Because it's an odd number of scores, the median is an actual existing score -- the third one (assuming score ordering). But six students took exam 2, and because it's an even number of scores, there's no actual score as the 50th percentile. So the computation assumes continuous distribution and interpolates the result from the two points surrounding the desired one -- namely, the third and fourth. With the 50th percentile, the interpolation is simply the average of the two middle points. With other percentiles, it's a bit more involved.

For a given percent @pct, the percentile computation is as follows:

Let rownum be the row number of the row with a scale of 0 (ROW_NUMBER - 1).
Let cnt be the count of rows in the partition.
Let a be @pct * (cnt - 1).
Let factor be a - FLOOR(a).
Let row0 be FLOOR(a) and row1 be CEILING(a).
Let val0 be the ordering value (score, in our example) associated with row0.
Let val1 be the ordering value associated with row1.

· The desired percentile is then val0 + factor * (val1 - val0).

To test the logic, you can try using the 50th percentile. In such a case, factor will be 0.5, and the computation will simply give you the average of the two middle points.

Again, the complexity here is mainly figuring out the logic of the computation. After you've done that, the translation to T-SQL is quite straightforward:

DECLARE @pct AS FLOAT = 0.5;
WITH C1 AS
(
SELECT examid, score,
ROW_NUMBER() OVER(PARTITION BY examid ORDER BY score) - 1 AS rownum,
@pct * (COUNT(*) OVER(PARTITION BY examid) - 1) AS a
FROM dbo.ExamScores
),
C2 AS
(
SELECT examid, score, a-FLOOR(a) AS factor
FROM C1
WHERE rownum IN (FLOOR(a), CEILING(a))
)
SELECT examid, MIN(score) + factor * (MAX(score) - MIN(score)) AS percentilecont
FROM C2
GROUP BY examid, factor;

Figure 3 shows the output of this solution with the input @pct = 0.5.

Alternative Solutions

This article covered solutions to computing percentiles using both discrete and continuous distribution models. SQL Server 2012 provides window functions called PERCENTILE_DISC and PERCENTILE_CONT to compute those percentiles, but the challenge was to come up with solutions that are compatible with SQL Server versions prior to SQL Server 2012. The solutions I presented are compatible with SQL Server 2005 and later.

Comments

Plain text