Cross-Join Performance

More companies are using OLAP to analyze large sets of data for their business. When I say large, I mean terabytes of data. After you build an OLAP cube from that much data, you should quickly figure out where the slowest part of your application is, because that part—the bottleneck—will determine the speed at which you can execute MDX queries. Sometimes the bottleneck is caused by the speed of a slow network or the performance of the client application. But frequently, the performance of the MDX query execution in Analysis Services is what determines overall performance. I've run into several enterprise deployments of Analysis Services that slowed because of the same problem: the performance of the MDX CROSSJOIN() function. Knowing how the CROSSJOIN() function works can help you improve the performance of your MDX queries when you're using very large OLAP cubes.

The CROSSJOIN() function combines two sets of dimension members into one set. The members of each of the two sets are from a single dimension. The CROSSJOIN() function combines the sets by matching every member from the first set with every member of the second set. The result is a set of items in which each item includes two members from different dimensions. These compound items are called tuples. For example, if you cross-join \{ USA, France \} with \{ Sales, Profit \}, you get the following set of four tuples: \{ (USA, Sales), (USA, Profit), (France, Sales), (France, Profit) \}. If the two sets each have 100 items, the resulting set will contain 10,000 tuples. If you cross-join medium-sized or large-sized sets (e.g., sets that contain more than 100 items each), you can end up with a result set that contains many thousands of items—enough to seriously impair performance.

In most OLAP cubes, the result set of a large cross join is sparse. A sparse result set is one that includes very few tuples that have non-empty measures associated with them (i.e., it contains mostly empty measures). For example, say a retail electronics chain has 10,000 products and 300 sales representatives. If you're an executive responsible for the California region, you might want to determine whether any of the sales representatives are generating a lot of revenue from a single product in California. To find this information, you can cross-join all the products with all the sales representatives, then determine the top 10 combinations. The MDX code for this expression would look like the following statement:

TOPCOUNT( CROSSJOIN(
 Products.Name.Members,
SalesReps.Name.Members ), 10,
(California, Sales) )

Unfortunately, this cross join generates 3 million combinations of products and sales representatives because the query must create every combination. This volume of combinations is inefficient in a sparse cube because many of the combinations will have no associated sales. In real terms, no sales representative has sold every item in the store. More likely, each representative has sold only a fraction of the unique products. And many sales representatives have probably never sold products in California.

The MDX function NONEMPTYCROSSJOIN() makes cross joins more efficient. NONEMPTYCROSSJOIN() cross-joins two or more sets and eliminates items that have no data associated with them. You can also specify a set of dimension members to further refine your determination of empty items. The following MDX expression uses NONEMPTYCROSSJOIN() but is nearly functionally equivalent to the previous example:

TOPCOUNT( NONEMPTYCROSSJOIN
( Products.Name.Members, 
SalesReps.Name.Members, \{ California \},
2 ), 10, Sales )

In this query, NONEMPTYCROSSJOIN() cross-joins the products and sales representatives and eliminates the items that have no data for California. The TOPCOUNT() function then returns the top 10 remaining items based on the Sales measure.

The parameters that the query passes in to the NONEMPTYCROSSJOIN() function deserve an explanation. NONEMPTYCROSSJOIN() accepts a variable number of sets followed by an optional numeric value. This numeric value, called crossjoin set count in SQL Server Books Online (BOL), determines the number of sets that will be in the result. If you choose to include a crossjoin set count, the query will use all the sets to determine whether the resulting items are empty, but only the number of sets that the crossjoin set count specifies will be included in the result. In the preceding query, I included three sets but specified that only the first two be cross-joined to create the result.

The second query executes significantly faster than the first query. I was recently involved in a customer deployment in which the NONEMPTYCROSSJOIN() function changed a query's performance from 15 minutes to about 3 seconds. That's compelling! Notice that I said the second query is nearly equivalent to the first. The difference between the queries introduces another challenge for your data analysis. The NONEMPTYCROSSJOIN() function doesn't include calculated members in the result set, as a typical CROSSJOIN() function would. Because of this difference, the second query might return a different set of top 10 items than the first query does. My company has had several clients whose cubes were large enough to require the NONEMPTYCROSSJOIN() function for acceptable performance, but the fact that NONEMPTYCROSSJOIN() removes calculated members was a serious problem for those clients because they risked losing calculated dimension members that were added after the cube was created.

As BOL describes it, the NONEMPTYCROSSJOIN() function in the second query is conceptually equivalent to the following MDX expression:

FILTER( CROSSJOIN( Products.Name.Members,
SalesReps.Name.Members ), NOT ISEMPTY(
California ) )

However, the two methods aren't exactly equivalent. The syntax of this example implies that the cross join occurs first, then the query eliminates empty items by specifying that the items contain California sales data. But what actually happens is that Analysis Services retrieves the non-empty portion of the cube for California, then cross-joins the products and sales representatives that exist in that portion of the cube. Performing the NONEMPTYCROSSJOIN() function in this order gives the function a significant performance advantage over filtering the items after the cross join. Because cubes are generally sparse, Analysis Services stores only the non-null values; and when the client-side driver, PivotTable Service, requests data from the server, it receives only non-null blocks of cube data. NONEMPTYCROSSJOIN() takes advantage of this method of storing and communicating data in Analysis Services. If you think about it, creating a long list of empty entries with the CROSSJOIN() function only to turn around and eliminate them with the FILTER() function is a giant waste of CPU time.

So what if you're developing an OLAP application and you want the improved performance of NONEMPTYCROSSJOIN(), but you don't want to risk losing any calculated members that might be in the cube? I spoke with the architect of Analysis Services, Amir Netz, about the problem, and he identified a couple of solutions.

The first approach is to determine whether the sets identify any calculated members before the NONEMPTYCROSSJOIN() function. If the sets include calculated members, fall back to using CROSSJOIN() with the FILTER() function. For example, consider the following generic function:

NONEMPTYCROSSJOIN( <set1>, <set2>, 
<filter criteria>, 2 )

First, you can execute the following two queries and compare the counts they return:

COUNT( <set1> )
COUNT( STRIPCALCULATEDMEMBERS( <set1> ) )

Then, repeat the two queries for <set2>. If the counts are different for either set, the result of your cross-join query might include calculated members, and you must use the following filter syntax with the CROSSJOIN() function to guarantee the correct result:

FILTER( CROSSJOIN( <set1>, <set2> ),
NOT ISEMPTY(
<filter criteria> ) )

This approach does the best job of determining when you can use NONEMPTYCROSSJOIN(), but it also incurs the overhead of running four extra queries. These queries on a single dimension should execute quickly, but always performance-test your application before putting it into production. If running four extra queries for every query that includes a cross join makes you nervous, a more conservative approach exists.

The more conservative approach is to see whether any calculated members exist in the dimensions that the cross join references. You can check for calculated members programmatically by using the OpenSchema() function call in ADO MD. Request the list of dimension members from the MDSCHEMA_MEMBERS rowset by using the member type MDMEMBER_TYPE_FORMULA. If the request returns any members, you know the dimension contains calculated members and the set definitions included in the cross join might include a calculated member. This approach isn't exact and might cause you to avoid using the NONEMPTYCROSSJOIN() function more often than you need to. But the determination that uses OpenSchema is faster than using four extra queries and thus adds less overhead.

I recommend choosing your approach based on which situation you believe will occur more frequently in your environment. For example, if you know that most cross-join situations in your application will benefit from a change to a NONEMPTYCROSSJOIN(), you might want to use the first approach. However, if you believe that the large dimensions that can benefit from the NONEMPTYCROSSJOIN() optimization don't (and won't) have any calculated members, the second approach is more efficient; and the second approach still works correctly if a calculated member is added.

I hope that the next release of Analysis Services includes a NONEMPTYCROSSJOIN() function that doesn't strip calculated members. Meanwhile, if you're developing analysis applications based on large, sparse data sets, consider using one of these approaches to optimize your cross joins.

Finally, if you're looking for a challenge, check out the October MDX Puzzle on the Web. Go to http://www.sqlmag.com and enter 26327 in the InstantDoc ID text box.

Comments

Plain text