Compute a Trimmed Mean

My good friend Andy Kelly sent me a puzzle concerning the computation of an average aggregate, but with a twist. The puzzle involves data representing job durations. To try the puzzle, use the code in Listing 1 to create the Jobs table and fill it with some sample data.

SET NOCOUNT ON;
USE tempdb;
IF OBJECT_ID(N'dbo.Jobs', N'U') IS NOT NULL DROP TABLE dbo.Jobs;
GO
CREATE TABLE dbo.Jobs
(
  jobid INT NOT NULL,
  starttime DATETIME2(0) NOT NULL,
  endtime DATETIME2(0) NOT NULL,
  duration AS DATEDIFF(s, starttime, endtime),
  CONSTRAINT PK_Jobs PRIMARY KEY(jobid, starttime),
  CONSTRAINT CHK_start_end CHECK (endtime >= starttime)
);
GO

INSERT INTO dbo.Jobs(jobid, starttime, endtime) VALUES
  (1, SYSDATETIME(), DATEADD(s, 60, SYSDATETIME())),
  (1, DATEADD(s, 100, SYSDATETIME()), DATEADD(s, 165, SYSDATETIME())),
  (1, DATEADD(s, 200, SYSDATETIME()), DATEADD(s, 254, SYSDATETIME())),
  (1, DATEADD(s, 300, SYSDATETIME()), DATEADD(s, 362, SYSDATETIME())),
  (1, DATEADD(s, 400, SYSDATETIME()), DATEADD(s, 400, SYSDATETIME())), -- failure
  (1, DATEADD(s, 500, SYSDATETIME()), DATEADD(s, 570, SYSDATETIME())),
  (1, DATEADD(s, 600, SYSDATETIME()), DATEADD(s, 663, SYSDATETIME())),
  (1, DATEADD(s, 700, SYSDATETIME()), DATEADD(s, 750, SYSDATETIME())),
  (1, DATEADD(s, 800, SYSDATETIME()), DATEADD(s, 2000, SYSDATETIME())), -- stuck
  (1, DATEADD(s, 2100, SYSDATETIME()), DATEADD(s, 2158, SYSDATETIME()));

Query the Jobs table as follows to see its contents:

SELECT jobid, duration
FROM dbo.Jobs;

Figure 1 shows the output you get.

jobid       duration
----------- -----------
1       60
1       65
1       54
1       62
1       0
1       70
1       63
1       50
1       1200
1       58

The sample data represents 10 executions of the same job. In reality, you'll have multiple jobs with many more rows per job. The puzzle involves computing the average duration per job. The problem is that outliers exist that need to be excluded. For example, consider a case in which a problem causes a job to run for an abnormally short or long duration. You want to exclude the outliers from consideration.

Different statistical methods are available for excluding outliers. The method you should use depends on your exact needs. Dejan Sarka goes into a lot of detail regarding such computations in his chapter "T-SQL for BI Practitioners" in the book T-SQL Querying (Microsoft Press, 2015).

I'd like to describe one of the classic methods used in such cases, called trimmed mean. The idea in this method is to exclude a certain percent of the lowest and highest values. For example, suppose you want to exclude 10 percent of the lowest and 10 percent of the highest values from consideration. For our sample data, this would mean excluding the first and last samples out of the 10 samples we have, as Figure 2 shows.

Figure 3 shows the desired result for our sample data.

jobid  avgduration
------ ------------
1      60.250000

One way to achieve the task is to arrange the samples in 10 tiles, using the NTILE(10) function. The first 10 percent of the rows will be assigned with tile number 1 and the last 10 percent with tile number 10. Then you filter only the rows with the tile numbers 2 through 9, excluding 1 and 10. After filtering, you group the remaining rows by the jobid column and return the average duration. Here's the complete solution query:

WITH C AS
(
  SELECT jobid, duration,
    NTILE(10) OVER(PARTITION BY jobid ORDER BY duration) AS tile
  FROM dbo.Jobs
)
SELECT jobid,
  AVG(1. * duration) AS avgduration
FROM C
WHERE tile BETWEEN 2 AND 9
GROUP BY jobid;

Another option is to use the PERCENT_RANK window function. This function computes the relative rank of a row and expresses it as a percent. More specifically, if rk is the rank of the row computed with the RANK function based on job duration order and cnt is the count of rows (job executions), PERCENT_RANK is computed as (rk - 1) / (cnt - 1).

For our trimmed mean calculation, you start by computing the percentile rank of the row (call the column pctrnk). Then, similar to the previous solution, you filter only the rows in the desired percentile range (e.g., between 0.1 and 0.9). Finally, you group the remaining rows by the jobid column and compute the average duration. Here's the complete solution query:

WITH C AS
(
  SELECT jobid, duration,
    PERCENT_RANK() OVER(PARTITION BY jobid ORDER BY duration) AS pctrnk
  FROM dbo.Jobs
)
SELECT jobid,
  AVG(1. * duration) AS avgduration
FROM C
WHERE pctrnk BETWEEN 0.1 AND 0.9
GROUP BY jobid;

Depending on the exact semantics you're looking for, instead of using PERCENTILE_RANK you can use the alternative function CUME_DIST. This function computes the percent with the formula np / cnt, where np is the number of rows that precede or are peers of the current row.

Comments

Plain text