Generating Histograms

Downloads
38251.zip

A histogram is a statistical report that shows you the frequency of values within steps or ranges of values that fall between a certain minimum and maximum. Let's take student exam results as an example. Suppose that of 30 students taking an exam, the lowest score is 51 and the highest is 100. You want to generate steps ranging between the lowest and highest scores and count the number of results within each step to get a sense of score distribution in the class. If you wanted to generate five consecutive steps of similar range sizes, the steps and ranges would be 50 to 60, 60 to 70, 70 to 80, 80 to 90, and 90 to 100, lower bound excluded and upper bound included. The histogram would contain the steps and the number of results that fall within each step.

Similarly, you might use histograms to analyze a sampling of values from performance counters (e.g., CPU utilization, memory) set on servers in your network. For example, suppose you record CPU utilization of a certain network server every minute and the measured values for a particular day ranged from 21 percent to 100 percent. You could generate a histogram with four steps—20 to 40, 40 to 60, 60 to 80, and 80 to 100—to find the number of samples that fell within each step. If your server was overloaded during that day, most of the samples would fall in the fourth step.

Generating a Performance-Counter Histogram

Here's a problem involving histograms; see if you can solve it before looking at my solutions. A scheduled SQL Server Agent job periodically records a sampling of performance counters for a certain network server in a table called Samples. Run the code that Listing 1 shows to create the Samples table and populate it with data. Each row contains the measurement ID (measid), when the sample was taken (dt), the measured value (value), and a filler column of 100 bytes that represents other columns in the table. (For example, you'd usually have a serverid column so that you could record measurement samples of multiple servers. For the sake of this problem, this table contains only one server's data.)

Suppose that measid 1 is CPU utilization in percent and measid 2 is memory usage in megabytes. Your users need a histogram to help them analyze the performance data of a certain measurement over a certain period of time. Your users give you the following as arguments: the number of steps (@numsteps), the measurement ID (@measid), and a date range (@fromdt—inclusive— and @todt—exclusive). Your task is to generate a histogram for the given arguments. Note that you don't have to include steps with 0 occurrences in the result. For example, say a user gives you the following arguments:

DECLARE @numsteps int, @measid
  int, @fromdt datetime, @todt
  datetime
SELECT @numsteps=5, @measid=1,
  @fromdt='20030101', @todt='20030102'

In the Samples table, the minimum measured value for measid 1 in the given period is 26 and the maximum is 50. The number of requested steps in the @numsteps argument is 5. First, you need to calculate the lower and upper bounds of the ranges within the five steps. Because the lower bounds of the ranges aren't inclusive, your calculations should result in the following ranges: 25 to 30, 30 to 35, 35 to 40, 40 to 45, and 45 to 50. Figure 1 shows the steps and ranges for the given arguments.

You need to devise code that will tell you how many measures are in each step—in this case, your code should result in one measure matching Step 1 (26), two matching Step 2 (33, 35), and two matching Step 5 (47, 50). Figure 2 shows how your code's output should look with this specific set of arguments. Following are some solutions I came up with for this problem.

Solution 1: Using a Table of Steps

The first solution involves writing a query that generates a derived table, Steps, containing the step numbers and the range of values for each step. After you generate such a table, the solution is easy. All you have to do is join the Steps derived table to the Samples table based on the value column that falls in the step's range, group the result by step number, and count the number of rows within each group. The tough part in this solution is writing the query that generates the derived table. To generate the step numbers, you can use an auxiliary table called Nums, which you populate with a sequence of integers in the range 1 to <max_number_of_possible_steps>. Running the script in Listing 2 creates the Nums auxiliary table and populates it with 1000 integers.

To calculate the lower and upper bounds of each step's range, you need to cross-join the results of two queries: one against Nums to return the step numbers, and the other against Samples to return the minimum and maximum measurement values. The resulting query might look like Listing 3. Run this query, and note in the results, which Figure 3 shows, that the query returns the same general minimum and maximum measurement values with each step number. This is only an intermediate step in the solution; you'll use the general minimum and maximum values together with the step number to calculate the step's minimum and maximum values later.

Now you need to replace the asterisk in the SELECT list with expressions that return, besides the step number, the lower boundary (exclusive) and upper boundary (inclusive) of the step's range. The step number is easy—it's n.

The following code is the basis for constructing the expression that calculates the lower boundary:

mn + <step_size>*(n-1) - 1

where mn is the general minimum measurement value, n is the step number, and step_size is the size of the range that each step covers. You calculate step_size as follows:

(mx-mn+1)/@numsteps

By combining the two, you get

mn + (mx-mn+1)/@numsteps*(n-1) - 1

This expression would be sufficient if you always had a range of values that divides evenly by the number of steps, but this isn't always the case. The following expression uses numeric division and rounding to accommodate ranges of values that don't divide evenly by the numbers of steps:

mn + CAST(round(1.0*
(mx-mn+1)/@numsteps*(n-1), 0) AS int) - 1

You force numeric division (instead of integer) by multiplying the first operand of the division operation by the numeric value 1.0 (read, one point zero), which means that SQL Server won't truncate the fraction. You then round the result to get whole range boundaries. SQL Server then converts the result of the rounding to an integer to remove nonsignificant zeros after the decimal point. You calculate the upper boundary in a similar manner:

mn + CAST(round(1.0*
(mx-mn+1)/@numsteps*n, 0) 
AS int) - 1

The only difference here is that you multiply the step size by the step number (n) and not by n-1. To test the complete query that calculates the derived Steps table, run the code in Listing 4; note that the results are identical to Figure 1.

Feel free to experiment with the arguments in this solution's code, testing how changing the number of steps affects the result, and so on. Recall that the query that returns the table of steps is only part of the solution. You now have the missing part denoted as <query that calculates steps> in the following pseudo code:

SELECT step, count(*) AS cnt
FROM Samples JOIN (<query that calculates
steps>) AS Steps
  ON value > f AND value <= t
WHERE measid = @measid
  AND dt >= @fromdt AND dt < @todt
GROUP BY step

Replace the missing part with the query that Listing 4 shows. Listing 5 shows the complete solution, which produces the desired histogram that Figure 2 shows. Of all the measures for the requested counter and period, divided into five steps, one measure (26) fell within the first step's range, two measures (33, 35) fell in the second, and two measures (47, 50) fell in the fifth.

If, like me, you prefer the modular development approach to simplify your code and its maintenance, you could write a user-defined function (UDF) that accepts as arguments the number of steps, measurement ID, and date range and returns a steps table. Run the code that Listing 6 shows to create the fn_steps function.

To test the function, run the following code and verify that you get the results that Figure 1 shows:

SELECT * FROM fn_steps(1, 5, '20030101', '20030102')

Now you can use the fn_steps table instead of the derived table Steps, as the code in Listing 7 shows. You get the histogram that Figure 2 shows.

Solution 2: Calculating the Step Number On the Fly

This puzzle has another solution that doesn't involve generating a table of steps; it calculates the step number on the fly. With this solution, you don't need an auxiliary numbers table. Notice the FROM clause in callout A of Listing 8, page 18. You cross-join two derived tables—one called S, containing the rows from Samples that match the provided arguments, and the other called R, containing the minimum measurement value and the whole range size. The following expression, which you write in the SELECT list, calculates the step number:

floor((value-mn) / (1.0*range/@numsteps)) + 1
AS step

This expression uses logic similar to what you used to calculate the boundaries in the previous solution. The expression (value-mn) gives you the position of the value within the range, and (1.0*range/@numsteps) gives you the step size. By dividing the first operand by the second, flooring the result, and adding 1, you get the step number. Now that you have attached to each row a step number matching the criteria from Samples, you can use this query in a derived table called RS so that you can group the result by the calculated step number. Listing 8 shows the complete solution, in which the outer query groups the result by step number and counts the rows for each step.

About Performance

Of the two solutions—one using an auxiliary table and a modular approach and the other calculating step numbers on the fly—the second solution is shorter and more elegant. To find out which performs better, you can use the script that Listing 9 shows to populate the Samples table with 1 million rows for testing. The rows contain measurement IDs in the range 1 to 10, with one sample per counter recorded every minute (1440 a day) and dates in the range January 1, 2003, to March 11, 2003. I compared the two solutions from Listing 5 and Listing 8, using periods ranging from 1 day to 1 month. In my tests, both solutions performed well, with the second performing a bit better both in terms of duration and in terms of I/O, so I'd probably use that one.

Comments

Plain text