Distinct Count Queries

Have you ever needed to answer a question such as, Which products have the most customers? or, How many customers bought these two products together? or, What percentage of our customers buy fewer than 10 items at a time? These questions are examples of distinct count queries, which are tricky to answer with an OLAP product because OLAP products usually store only summarized information.

Distinct count queries differ from normal summations in that queries about items at a dimension-hierarchy level always require the lowest-level hierarchy details to determine the result. For example, if you're counting the number of distinct products purchased in the first quarter, you can't just add the distinct products from January, February, and March. You must know all the products purchased in each of these months to determine the total value for the first quarter because you have to eliminate all the duplicate products. In this article, I explore MDX solutions that either SQL Server 7.0 OLAP Services or SQL Server 2000 Analysis Services can use for problems such as these. I also point out a new Analysis Services feature that can make solving these problems more straightforward.

Let's give credit where credit is due. This article is based in part on a white paper that Amir Netz (Microsoft OLAP Services architect) wrote about basket analysis and OLAP Services. Basket analysis typically refers to retail sales analysis, in which the basket is the group of products purchased in one transaction. If basket analysis is relevant to your work, I highly recommend reading Netz's white paper; he covers some of the same material I do, but in more depth.

Related: OLAP for Developers

Market Basket Analysis

I've heard the term basket analysis, or market basket analysis, used in a variety of ways. The most common use refers to a data-mining technique that discovers relationships among products. The classic discovery example is the relationship between diapers and beer. Someone, somewhere, used market basket analysis of the purchasing habits of male customers to show a strong correlation between buying diapers and buying beer. The theory is that men use a run to the store for diapers as an opportunity to replenish their beer supply. Although I'm a father of young twins, I will neither confirm nor deny this theory. Nevertheless, a store manager armed with this information might reorganize shelf space to encourage this behavior and potentially increase beer sales.

The basket analysis technique that I discuss here isn't really data mining because it's not an automated or computer-guided search for correlations. I concentrate on distinct count queries, which are the basis for basket analysis. You must be able to ask distinct count questions to perform manual or automatic searches for correlations.

Distinct count queries are useful in a variety of applications, not just retail sales or marketing applications. In fact, distinct count queries might be useful in any situation in which you have a many-to-many (M:N) relationship among entities. For example, in a call-center analysis application, you could determine how many customer calls were requests for both an address change and a new product purchase. Or in a manufacturing application, you might try to find correlations between types of product failures and types of product components.

Which Products Have the Most Customers?

Now let's examine the following question in detail: Which products have the most customers? A standard cube for sales analysis has Customer, Product, and Time dimensions, and the fact table contains Sales transactions. An individual transaction might say, Customer A bought Product B at Time C for some amount of money. If you create a count measure for this cube, the count might tell you the total number of transactions for any Customer, Product, and Time element combination. This count is useful, but it isn't what I'm looking for. I don't care if the same customer purchased a particular product many times; I want to count unique customers, not unique transactions. To get the number of unique customers for a product, you can use the following MDX expression:

Count( Filter( Customers.Name.Members, Not
  IsEmpty( (Product.CurrentMember, 
  Customers.CurrentMember) ) )

This expression filters all customer names by eliminating those that don't have transactions for the currently selected product. A shorter way to express the same concept is

Count( \{Unit Sales\} *
 Customers.Name.Members, ExcludeEmpty )

This formula cross-joins—denoted with the asterisk—Unit Sales and Customers and counts the non-empty items.

Asking distinct count questions about a subset of customers might also be useful. For example, a variation of the question might be, Which products have the most California customers? You can generalize the count formula to deal with a selected customer subset by replacing the full set of customers with only the descendants of the currently selected customer:

Count( \{Unit Sales\} * Descendants
( Customers.CurrentMember,
 Customers.Name ), ExcludeEmpty )

To create the full query that answers the question, Which products have the most California customers? you might use an MDX expression to create a calculated member. In the MDX query that Listing 1 shows, I limited customers to those in California so that the query wouldn't take too long to execute. On my 600MHz laptop, the query takes about 2.5 minutes to execute.

In Analysis Services, distinct count is now a built-in aggregation type. Thus, you could make a Unique Customers measure when you build the cube. You set the measure type to distinct count and identify the fact-table source column whose contents the measure should count. In this case, you want Unique Customers to count customer names.

You need to be aware of some limitations that Analysis Services puts on distinct count measures. One such limitation is that the cube with the distinct count measure can't contain a dimension that has custom rollup operators or custom rollup formulas. Another limitation is more significant: A cube can have only one distinct count measure. Fortunately, this limitation applies only to nonvirtual cubes. In fact, Microsoft recommends that you make as small as possible any cube that has a distinct count measure, then combine that cube with another cube to form a virtual cube. The smallest cube possible would contain no measures other than the distinct count measure and only the dimensions relevant for distinct count queries.

If you isolate the distinct count measure in a small cube, Analysis Services will likely use less memory and perform distinct count queries much more quickly. This approach improves performance because Analysis Services can determine that a distinct count requires only the information from the smaller distinct count cube. Therefore, Analysis Services accesses much less information to determine the result. This cube structure optimization is particularly important with distinct count because all distinct count queries access the dimensions' most detailed information. Analysis Services usually accesses only summarized information to execute non-distinct count queries.

Which Products Are Purchased Together?

Classic market basket analysis also deals with questions such as, Which products are purchased together? Before getting into the MDX you need for a basket analysis query, let's discuss the definition of together. Netz's white paper contains a good discussion of this concept. To summarize, Sales cubes don't typically retain invoice information such as sales receipts, so a cube can't use an invoice to decide whether two products were purchased together.

Another point Netz makes is that in decision making, treating two transactions that occur within a certain elapsed time as the same transaction can be important. For example, a department store customer might buy something in the store, then continue shopping, find something else, and make another purchase. This customer's transactions would be on the same day, but perhaps not on the same cash-register receipt. Netz's point is that for decision-making purposes, using a time period—such as a day or a week—to bind a transaction to a customer is adequate and probably preferable. This approach works well with OLAP because multidimensional databases almost always have Time dimensions, and MDX is good at dealing with time periods. In the FoodMart 2000 Sales cube, the Time dimension doesn't break down into days; so for these MDX examples, I use the smallest possible time unit represented in the Sales cube—months.

Now let's look at a specific example of a basket analysis question: What beer product is purchased most frequently with pizza? I can imagine a store manager wanting to know the answer to this question to decide whether to put a special beer display next to the pizza freezer. To solve this problem, I broke the question into pieces to make it easier to understand.

First, I figured out the number of qualifying transactions that a particular customer had. The following MDX statement calculates the qualifying transactions by filtering out all the time periods that don't include both pizza and the current product. The number of remaining time periods equals the number of transactions.

Count(Filter(Time.Month.Members, Not
 IsEmpty((Unit Sales, Pizza)) and Not
 IsEmpty((Unit Sales, Product.CurrentMember))))

Second, I needed to determine the total number of transactions for all customers. Cross-joining the selected customers with the number of transactions per customer and summing the results accomplished this step:

Sum( \{Tx per Customer\} *
 Descendants(Customers.CurrentMember,
  Customers.Name))

Listing 2 shows how you can combine the previous two formulas to create a full MDX query that answers the question. In a real basket analysis application, limiting queries, as I did here with California, might be useful because the total number of transactions isn't important; finding the relationships among products is. You can find the relationships by sampling only part of the data. So, the example that Listing 2 shows is valid (as long as you believe that customers in California form a representative sample). Figure 1 shows the answer to this query. The results are disappointing because they aren't what you'd expect. If FoodMart contained actual purchase data, I'm sure that a strong correlation between pizza and beer would exist!

You can easily extend the solution for the purchased-together problem to check for occurrences of more than two products together. Just change the formula for Tx per Customer to include more products. Unfortunately, Analysis Services' new distinct count aggregation type doesn't help with this type of query. The distinct count aggregation works only if you want the distinct number of items in a fact-table column. However, the purchased-together example searches for the distinct number of a particular combination of items in a column.

Cube Analysis

The kinds of questions that distinct count queries can answer are invaluable in analyzing retail product sales. The example I used—What beer product is purchased most frequently with pizza?—is a good example of this type of question. However, this type of analysis is also useful in any cube. Look at the human resources cube in the FoodMart 2000 database as an example. The Number of Employees measure is a distinct count measure that you can use to determine how many employees work at a particular store or under a certain manager. And because the underlying fact table contains paycheck transactions, you can't determine these answers with a typical count measure. Note that you can use the techniques I demonstrated in this article to ask distinct count questions in your applications as well. Test your distinct count query skills by trying to answer the "May MDX Puzzle." "April MDX Puzzle Solution Revealed," shows the solution to last month's puzzle.

Comments

Plain text