Alternative Aggregations

When people think of the structure of an OLAP cube, they usually think of a group of hierarchical dimensions and a set of measures that use summation to determine the measures' values. Often, cubes include measures that are aggregated by using a summation. A variety of other aggregation types exist, but Analysis Services doesn't natively support all of them. A classic example of nontraditional aggregation is inventory analysis. When you analyze inventory, you don't want to determine the quarterly inventory by summing the inventory numbers from the 3 months; you want to use only the inventory quantity from the last month in the quarter. You can easily implement this type of aggregation in Analysis Services, but it's not one of the native aggregation types you can choose from. Let's look at the native types, then learn how to implement a couple of non-native aggregation types.

Native Aggregations

First, let's establish some terminology. A cube cell is a location in the cube where the dimensions intersect. Just as you can use numeric values on X, Y, and Z axes to identify a point in three-dimensional space, you can use a member from each dimension to identify a cell in a multidimensional cube. Because each dimension has a default member, you don't have to explicitly list a member from each dimension for a cube cell to be uniquely identified. For example, the tuple (USA, Drinks) is enough to identify a unique cell in the FoodMart 2000 Sales cube, even though the Sales cube has 12 dimensions, because the other 10 dimensions are positioned on their default members (i.e., they're assumed, so you don't need to explicitly name them).

Every cube cell has a value for each of the cube's measures. Some cube cells' measure values are loaded (i.e., come directly from the source data), and some cube cells' measure values are determined by an aggregation. An aggregation is a value that Analysis Services determines by performing some math function on other cube cells. To put it most simply, any leaf-level cell's values are loaded values, meaning that the values are calculated based on the source data, not based on the values in other cells. The nonleaf-level cell values are determined by aggregations. Leaf-level cells are cells that are identified entirely by the bottom-most dimension members—dimension members that have no children.

Table 1 shows Analysis Services' native aggregation types (math functions) and their descriptions. Distinct count is a special type and warrants an explanation. Analysis Services allows only one distinct-count measure in a cube, and the values of a distinct-count measure are never aggregated (determined from other cell values). Analysis Services derives distinct-count measure values directly from the fact table when the cube is processed. A distinct-count value is determined by the number of distinct types of fact-table records that a particular cube cell identifies. For example, if the cube cell is (1997, Beverages), then the distinct count is the number of different types of fact-table records that occurred during 1997 in the Beverages product category. When you create the distinct-count measure, you specify a column in the fact table that determines the type of fact-table record. Distinct-count measures are different from typical count measures because a distinct count isn't the count of fact-table records; it's the count of distinct values of a particular column in the fact table. DBAs typically use a distinct count to determine the number of unique transactions that exist when more than one fact-table record comes from the same transaction.

One characteristic of the native aggregation types is that Analysis Services treats them the same across all cube dimensions. The non-native inventory aggregation doesn't have this characteristic. When aggregating inventory quantities, you use summation for all dimensions except the Time dimension. You aggregate the Time dimension by using the last child (i.e., the most recent value). Figure 1 shows how the Product dimension aggregates inventory totals as a sum, whereas the Time dimension uses the last child.

Going Non-Native

Because Analysis Services doesn't offer native support for a last-child aggregation, you have to create your own aggregation support. You have several choices for how to do this. You can use a calculated measure, a calculated cell, or a custom rollup formula in the Time dimension. I prefer to use the method that limits performance effects to the smallest percentage of the cube. For example, the FoodMart Sales cube contains other measures besides the inventory quantity measure, so I suggest picking a method that doesn't add extra processing to the other measures. Analysis Services evaluates a custom rollup formula for every measure, so a custom rollup isn't a good choice for this problem. You can use either the calculated cell or calculated measure technique to isolate the aggregation processing so that it affects only the inventory quantity measure. You need to consider a couple of facts when you're choosing between a calculated cell and a calculated measure. Note that calculated cells are available only with the Enterprise Edition of SQL Server. But calculated measures can be problematic in cubes that support drillthrough because the drillthrough operation doesn't work when you specify a calculated measure as part of the tuple (i.e., the list of members) that identifies the cube's cell.

For inventory analysis, I suggest using a calculated measure simply because most OLAP front-end tools offer the ability to create calculated measures. For example, when you use the following calculated measure formula in the FoodMart Sales cube, it returns \[Unit Sales\] as if it were an inventory quantity:

IIF( Time.CURRENTMEMBER.LEVEL
.NAME <> "Month",
   Time.CURRENTMEMBER.LASTCHILD,
   (Time.CURRENTMEMBER, \[Unit Sales\]) )

This recursive formula determines whether the current Time member's level name is the lowest level in the Time dimension, and if not, it returns the last child of the current member. The LASTCHILD portion of the formula causes the formula to reevaluate recursively until the level's name is Month; then, the formula returns the Unit Sales value for the current Month.

Some cubes don't contain loaded values for every leaf-level Time member. In those cubes, you can't use a formula as simple as the one above to determine the most recent value for a measure. I ran into such a situation while building a cube to track software defects. The cube was built from a fact table that contained defect state transitions. The software quality-assurance process tracks defects through states, which have names like found but not assigned to a developer, assigned to a developer, fixed but awaiting the next product build, fixed but not verified, or closed. Defect state transitions are the occurrences of a defect moving from one state to another. For example, each entry would contain a defect ID, description, date the change occurred, the original state, and the new state. The fact table I used contained a date that the defect changed state, an origination state, and a destination state. To create in the cube a measure that returns the defect state for any time period, I had to include a formula that looks back to the most recent date that the defect changed state:

IIF( Time.CURRENTMEMBER.LEVEL
.NAME  "Month",

  Time.CURRENTMEMBER.LASTCHILD,
  TAIL( FILTER( \[Time\].\[1997\]
.\[Q1\].\[1\] : Time
.CURRENTMEMBER,
        NOT ISEMPTY( \[Unit Sales\] ) ) * 
        \{ \[Unit Sales\] \}, 1 ).ITEM(0) )

This MDX expression is significantly more complex than the previous one. The principle behind this expression is the same as for the previous expression, but instead of just using (Time.CURRENTMEMBER, \[Unit Sales\]) as the last parameter to the IIF() function, you have an expression that identifies the most recent non-empty month. First, take the set that includes all months from the first month ( \[Time\].\[1997\].\[Q1\].\[1\] ) to the current month—the colon operator (:) identifies a range of members—and filter out the empty months. The expression then uses the asterisk operator (*) to cross-join Unit Sales with the set of non-empty months and uses the TAIL() function to reduce the set to its last item. The ITEM(0) operator converts the result of the TAIL() function (a set) to its first item.

In the two MDX examples above, I used Unit Sales from the FoodMart Sales cube so that you could experiment with the MDX. In a real-world situation, you'd replace Unit Sales with a loaded measure. Because you have to use one of the native Analysis Services aggregation types to create the loaded measure, the measure won't return valid information when you use it with nonleaf-level Time dimension members. Because of the possibility of returning invalid data, you need to set the visibility of the loaded measure to false and allow your application's end users to see and use only the calculated measure. After you've selected the measure, you can change the visibility of a loaded measure in the Properties pane of Analysis Manager's Cube Editor.

Many other aggregation types exist, and some are quite simple. For example, a common application of OLAP is to model a business income statement. At the highest level of the income statement, you want to subtract expenses from income to create net income. At all other levels in this dimension, you want to aggregate account categories by using a summation. In this situation, a custom rollup function might be the most appropriate choice.

Calculated measures, calculated cells, and custom rollups are all tools that can help you customize aggregation formulas. Analysis Services is flexible when you know how to use its advanced features. I hope reading through these examples and explanations will help you think more creatively about how to create your OLAP cubes so that they aggregate measure values in a way that's appropriate for the problem you're trying to solve.

Comments

Plain text