For many IT organizations, especially those with larger databases, optimizing database performance is a mystery, a discipline that relies more on anecdotal remedies than on proven methodologies. We recently conducted an extensive series of performance optimization tests against an internal large-scale database; our goal was to develop reliable methodologies for improving OLAP cube performance. We designed the tests to evaluate various optimization methods, and our results reveal concrete steps you can take to enhance the performance of multidimensional databases.
When creating cubes with SQL Server 7.0 OLAP Services, your two most important design decisions are the storage mode and the aggregation level. For designers making basic choices about data warehouse design, Microsoft SQL Server 7.0 documentation provides adequate guidelines and explains the trade-offs. But little of the available information is based on direct experience of how storage modes and aggregation levels affect processing and query performance.
We describe the tests we conducted to isolate problems with cube performance. We review the study constraints, explain how we constructed the OLAP cubes, and examine the query process. Our analysis of cube processing times and cube storage, including a comparison of MDX language and SQL queries, illustrates the range of possible methods. Finally, we give you performance optimization tips derived from the tests. We have fully described our testing and our results in the white paper "Optimizing OLAP Performance for Very Large Databases" (which you can find at http://www.unisys.com/sql7/).
The Challenge of Cube Performance
We began by setting up representative SQL Server and OLAP databases as test beds. We identified sample business questions to run against an existing very large database (VLDB), then isolated the tables and fields that contained the information.
Next, we created and populated a SQL Server data mart based on a star schema design. We then used three OLAP storage modes and multiple aggregation levels to create various representative OLAP cubes. Finally, we created SQL and MDX queries based on the business questions and ran the queries to compare the data mart with the OLAP cubes.
During the study, we investigated several variables that influence performance optimization, including different storage modes and various aggregation levels. We've included comparative test results for cube processing times, disk space requirements, CPU usage on a relational database management system (RDBMS) server versus OLAP Server while querying, MDX query execution times, query execution times with warm and cold caches, disk space required to build OLAP cubes versus the star schema data mart, and MDX versus SQL query execution times.
For its ongoing research and development efforts, Unisys maintains the Unisys e-@ction Customer Profitability SQL Server 7.0 VLDB for retail banking data. At more than 2TB, this is the largest commercial database deployed with SQL Server 7.0 in an enterprise-class environment using Windows NT. The database contains data representing 2 years of transactions by 4 million banking customers with 10 million accounts.
We conducted our study by using the bank VLDB's development version database (approximately 130GB) and designed a star schema for the business process represented in the OLAP cubes. This model let users query a cube based on a bank account profitability fact table containing approximately 13 million records, covering information about economic income earned from January 1996 through December 1997 (in 24 periods) from various products and customer segments.
The test system hardware consisted of two similarly configured Unisys e-@ction ES5045 R servers, each with four 450MHz Pentium III Xeon CPUs. We configured each server with a 512MB cache, 8GB of RAM, and Fibre Channel OSM7700 data storage on a RAID 10 volume (five mirrored sets with data striped across those sets). We connected the servers through a 100Mbps Ethernet network.
One server housed the relational database and the star schema called VLDBMart. We configured the second server to store the OLAP cubes and multidimensional data.
Setting Up the Study
We took several steps to build our test configuration: defining questions and describing the bank data, creating a data mart, populating the data mart, and building OLAP cubes.
Define questions and describe bank data. First, we defined the appropriate and relevant business questions, including economic issues such as income, spread, and moving averages for various combinations of time periods across products and customer segments. We chose seven questions that this type of customer profitability system would typically answer, such as "What is the average economic income per household for the past 2 years (1996 and 1997) from each of the products?" Next, we examined the existing VLDB entity relationship diagram (ERD) to determine which tables held the information required to answer the questions.
Create a data mart. We then created a data mart with the table information required to answer the profitability questions. We chose a star schema because this design closely matched the design of the OLAP cubes we planned to construct.
To construct the required data mart, we defined a fact table called Account_Prof_Fact to hold the data from the account profitability tables in the banking database and to contain data for all 24 periods in the study (January 1996 through December 1997). The fact table records account profitability information such as economic income and expenses for various products at a monthly snapshot level. This snapshot contains monthly totals for all measures. Having defined the fact table, we identified five relevant dimensions: product, time, region, household, and customer segments. Figure 1 shows the resulting star schema.
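As a rough sketch, the fact table and one of its dimensions might be declared as follows. The column names and data types here are illustrative assumptions, not the actual VLDBMart definitions (the real fact table carries eight measures and five dimension keys):

```sql
-- Illustrative star schema fragment; all names and types are assumptions.
CREATE TABLE Product_Dim (
    Product_Key   int         NOT NULL PRIMARY KEY,
    Product_Name  varchar(50) NOT NULL
)

CREATE TABLE Account_Prof_Fact (
    Time_Key         int   NOT NULL,  -- one of 24 monthly snapshot periods
    Product_Key      int   NOT NULL,
    Region_Key       int   NOT NULL,
    Household_Key    int   NOT NULL,
    Cust_Segment_Key int   NOT NULL,
    Economic_Income  money NOT NULL,  -- two of the eight measures
    Economic_Expense money NOT NULL
)
```

Each fact row is a monthly snapshot for an account, keyed to all five dimensions, which is what lets the cube roll the measures up along any combination of them.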
Populate the data mart. After completing the data mart design, we used Data Transformation Services (DTS) to populate the Account_Prof_Fact fact table and the associated dimension tables in the data mart. After we populated the fact table, it held approximately 13 million records. Table 1 shows the records and size for the fact and dimension tables in the data mart.
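We used the DTS designer for the load, but the heart of each DTS task is a transformation you could also express directly in Transact-SQL. A simplified, hypothetical version (the source table and all column names are assumptions):

```sql
-- Hypothetical simplified load: copy two years of monthly profitability
-- snapshots from an assumed source banking table into the fact table.
INSERT INTO Account_Prof_Fact
       (Time_Key, Product_Key, Region_Key, Household_Key,
        Cust_Segment_Key, Economic_Income, Economic_Expense)
SELECT src.Time_Key, src.Product_Key, src.Region_Key, src.Household_Key,
       src.Cust_Segment_Key, src.Economic_Income, src.Economic_Expense
FROM   Source_Account_Profitability src   -- assumed source table name
WHERE  src.Snapshot_Date BETWEEN '19960101' AND '19971231'
```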
Build OLAP cubes. To construct the OLAP cubes, we created a multidimensional OLAP database, which we named AccountProfitabilityOLAPDatabase. Within that database, we generated 12 cubes with identical structure but varying storage types and aggregation levels.
We chose aggregation levels of 0, 30, 60, and 90 percent for the cubes because these levels represent the range used in most applications. The aggregation percentage represents the expected improvement in query performance compared with performance without precalculated aggregations.
Figure 2 shows the cube structure for one of the cubes we designed, AccountProfitabilityCubeM90. Next, we developed the fact measures for the cubes. To accurately measure each account's true profitability, we defined eight fact measures for the cubes. To complete the cube structures, we selected five cube dimensions corresponding to the dimension tables selected for the data mart's star schema. Table 2 provides details about the defined OLAP dimensions.
Cube Processing and Storage Results
Before testing our queries, we compared processing times and disk space requirements for each storage type—i.e., Multidimensional OLAP (MOLAP), Hybrid OLAP (HOLAP), and Relational OLAP (ROLAP). And we compared disk space requirements for MOLAP cubes with space requirements for the star schema.
Processing times for each storage type. We first determined the differences in cube processing times for the three storage modes. To obtain processing comparisons, we processed identically structured cubes with different data storage methods and aggregation levels. Table 3 presents detailed processing time results.

When you use Microsoft OLAP Services to create cubes, your two most important design decisions are the storage mode and the level of aggregation. Table 4 describes the storage modes in OLAP Services. In OLAP Services, aggregations are precalculated sums of fact table data at various levels of each dimension; OLAP Services uses these aggregations to answer queries and to derive additional aggregates. When choosing how much aggregation to include in a cube, you must balance storage space against query response time. At one extreme, precalculating all possible aggregates greatly increases a database's storage requirements; at the other, calculating all aggregates at query time slows query responses.
The tests revealed significant differences in processing times across the three storage types and for various levels of aggregation within each type. ROLAP cubes at 0 percent aggregation had the fastest processing times. ROLAP achieves this speed because OLAP Services doesn't copy the fact and dimension tables for ROLAP cubes, and OLAP Services doesn't calculate aggregations at 0 percent aggregation. However, as the aggregation level increases, processing ROLAP cubes takes longer than processing MOLAP and HOLAP cubes.
We recorded some surprising results at higher aggregation levels. The difference in processing times between MOLAP and HOLAP cubes at 30 and 60 percent aggregation wasn't substantial, but processing time increased substantially between 60 and 90 percent for both. For ROLAP cubes, we measured a dramatic, disproportionate increase in processing time between 60 and 90 percent aggregation.
Disk space requirements for each storage type. Table 5 presents disk space requirements for the three processing modes. We found that the storage type and aggregation level significantly affect storage requirements.
We determined that MOLAP cubes require more space than HOLAP or ROLAP cubes because OLAP Services copies the original fact and dimension tables into the OLAP database for MOLAP cubes. Although a MOLAP cube takes more space, it no longer needs the star schema at query time. HOLAP cubes use the least space because OLAP Services doesn't copy the original fact and dimension tables into the OLAP database and stores the aggregates in an efficient multidimensional format.
Disk space usage for all storage types doesn't increase significantly between 0 and 60 percent aggregation but jumps substantially for all types as the aggregation level reaches 90 percent. The increase for ROLAP is especially significant.
Disk space requirements for MOLAP cubes compared with the star schema. Table 6 compares the space that MOLAP cubes require with the size of the original RDBMS star schema fact table and dimension tables. Our tests determined that building the aggregations typically required only about 20 percent more space than the source star schema tables occupy.
Table 6 shows that rather than producing a data explosion, OLAP Services actually performs data compression while building OLAP cubes. Surprisingly, even at an aggregation level of 90 percent, the compression level was nearly 79 percent. Your results might differ because the additional space that is required to build MOLAP cubes varies with the number of levels in the dimension, the number of measures, and the type of data.
MDX vs. SQL Queries
To test query response times, we ran sets of similar queries in different environments: SQL queries against SQL Server, and MDX queries against the MOLAP cube stored on the OLAP server. Although you can't exactly compare execution times for SQL and MDX queries, the results clearly show that using MDX for OLAP Services can significantly boost query performance because OLAP Services precalculates and stores aggregates in the OLAP cubes.
For questions 2, 3, 4, 6, and 7, MDX query responses took only a few seconds, whereas SQL query responses took on the order of a hundred seconds. For questions 1 and 5, MDX queries took somewhat longer to execute, probably because these queries reference a distinct count calculated measure in the cube. (A distinct count calculated measure determines, for each member of one dimension, how many distinct, lowest-level members of another dimension share rows in the fact table; a member that occurs multiple times is counted only once.)
Queries 1 and 5 use a distinct count to determine the number of unique households that generated economic income for a given product. The MDX statement in the MDX query then uses this measure to calculate an average economic income. Listing 1 shows a sample.
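The relational equivalent of this calculation is a COUNT(DISTINCT ...). The following sketch, with hypothetical dimension and column names, shows the shape of the average-income calculation that the MDX calculated member performs:

```sql
-- Illustrative only: average economic income per unique household,
-- by product. Dimension and column names are assumptions.
SELECT p.Product_Name,
       SUM(f.Economic_Income) / COUNT(DISTINCT f.Household_Key)
           AS Avg_Income_Per_Household
FROM   Account_Prof_Fact f
JOIN   Product_Dim p ON p.Product_Key = f.Product_Key
GROUP  BY p.Product_Name
```

The COUNT(DISTINCT ...) is exactly the expensive part: like the MDX calculated member, it forces a scan of all lowest-level household keys rather than reading a precalculated aggregate.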
In this SQL Server 7.0 test scenario, we used an MDX expression to implement a distinct count with a calculated member, which can greatly affect the performance of queries that use the distinct count calculated member. Because calculated members are not preaggregated, OLAP Services must perform these calculations at query execution time. Additionally, OLAP Services must read all lowest-level members to calculate the distinct count. MDX queries 1 and 5 reference a distinct count of the HouseholdDim dimension, which contains 200,000 members. Accessing this number of members, as well as computing the distinct count, can explain the increased query response time with a cold cache. (A cold cache contains either no data or no relevant data.)
In SQL Server 2000, distinct count is implemented as a standard aggregate function defined for a measure, like sum, count, min, and max. Implementing distinct count this way yields much better performance than implementing it as a calculated measure, and it performs at levels similar to the other aggregate functions. We tested processing times by executing a set of queries that included both SQL queries on SQL Server and MDX queries against a MOLAP cube on OLAP Services. Table 7 displays the results. Listing 2 shows an MDX query that is typical of those we used in the testing. In this example, we formatted the MDX query to answer the business question "What is the economic income for the first quarters of 1996 and 1997, and how does it compare between these two periods by each customer segment and product?" Table 8 shows the results. Despite its complexity, the query executed in less than 1 second.
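For comparison, the relational form of that same business question is a grouped query with two filtered sums. This sketch is hypothetical; the table and column names (including Calendar_Year and Calendar_Quarter on the time dimension) are illustrative, not the actual schema:

```sql
-- Illustrative SQL counterpart: economic income for Q1 1996 vs. Q1 1997,
-- by customer segment and product. All names are assumptions.
SELECT s.Segment_Name,
       p.Product_Name,
       SUM(CASE WHEN t.Calendar_Year = 1996
                THEN f.Economic_Income ELSE 0 END) AS Q1_1996_Income,
       SUM(CASE WHEN t.Calendar_Year = 1997
                THEN f.Economic_Income ELSE 0 END) AS Q1_1997_Income
FROM   Account_Prof_Fact f
JOIN   Time_Dim         t ON t.Time_Key         = f.Time_Key
JOIN   Product_Dim      p ON p.Product_Key      = f.Product_Key
JOIN   Cust_Segment_Dim s ON s.Cust_Segment_Key = f.Cust_Segment_Key
WHERE  t.Calendar_Year IN (1996, 1997)
  AND  t.Calendar_Quarter = 1
GROUP  BY s.Segment_Name, p.Product_Name
```

Against the 13-million-row fact table, this query must aggregate at run time, which is why the precalculated cube answers the MDX version so much faster.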
MDX query execution times for MOLAP, ROLAP, and HOLAP. Having established the relative processing times for MDX versus SQL queries, we compared the execution times for the three OLAP storage modes. To conduct the test, we developed MDX queries for the seven identified business questions. We executed the queries against various levels of aggregation by using MOLAP, ROLAP, and HOLAP. We also conducted tests against cold and warm caches. (A warm cache contains relevant data from previously executed queries.) We created a cold cache condition by restarting the server before executing the query. To achieve a warm cache, we executed a query and then immediately executed it again.
The average query processing time for all storage types changes significantly for warm versus cold caches. Also, after the first execution, subsequent queries take significantly less time because the query data is already in cache.
As expected, the tests against a cold cache showed that ROLAP and HOLAP processed queries more quickly as the aggregation level increased from 0 to 30 to 60 percent. But as the aggregation level increased from 60 to 90 percent, ROLAP and HOLAP queries didn't perform significantly better. MOLAP was fastest against a cold cache, though increasing MOLAP's aggregation level had little effect on performance.
The warm cache tests produced significantly different results. Whereas query times improved substantially for all storage types as the aggregation level increased from 0 to 30 percent, increasing aggregation levels from 30 to 60 to 90 percent had little effect on total query performance.
CPU usage during MDX querying. In addition to execution time, CPU usage is often a crucial factor for large databases. We monitored the CPU on each server; Table 9 shows the results. Our tests show that MOLAP does most of its processing on the machine running OLAP Services, making little or no use of the machine holding the SQL Server RDBMS data mart. The small amount of usage in Table 9 could have resulted from background activity on that server.
However, ROLAP and HOLAP use more processor time from the RDBMS server and a small amount from the OLAP Server machine, perhaps because our test queries involved a distinct count. (Calculating a distinct count requires OLAP Services to read all lowest-level household dimension member information from the RDBMS.) Also, additional system activity might have occurred on the RDBMS during the test.
Performance Optimization Tips
During the study, we experimented with various database components and characteristics and uncovered some techniques that significantly improve OLAP cube processing and query times. First, we recommend using a dimensional schema for the data mart. A star schema dovetails naturally with the dimensional design of OLAP cubes and supports good processing and query performance.
You can also improve cube processing time by using keys and indexes in different ways. We recommend that, besides associating a key with each dimension and fact table in the data mart, you declare foreign key relationships between fact and dimension tables, create a composite index on all foreign keys in the fact table, and create indexes on individual foreign keys.
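Applied to the data mart described earlier, those recommendations look roughly like this (the constraint and index names, and the exact key columns, are illustrative):

```sql
-- Declare a foreign key relationship between the fact table and a
-- dimension table (repeat for each dimension). Names are illustrative.
ALTER TABLE Account_Prof_Fact
  ADD CONSTRAINT FK_Fact_Product
      FOREIGN KEY (Product_Key) REFERENCES Product_Dim (Product_Key)

-- Composite index covering all foreign keys in the fact table.
CREATE INDEX IX_Fact_AllKeys
  ON Account_Prof_Fact
     (Time_Key, Product_Key, Region_Key, Household_Key, Cust_Segment_Key)

-- Index on an individual foreign key (repeat for each key as needed).
CREATE INDEX IX_Fact_Product
  ON Account_Prof_Fact (Product_Key)
```

The foreign key declarations and indexes speed up the joins OLAP Services issues against the star schema while reading dimension and fact data during cube processing.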
Use the Optimize Schema feature in the OLAP Services Cube editor to minimize the number of joins that OLAP Services performs while processing a cube. Also specify the maximum process buffer size in the OLAP Server Properties dialog box so that OLAP Services can build aggregations in memory rather than spilling to temporary files.
You can optimize MDX query execution times in three ways. First, put OLAP Services (the OLAP cubes) and SQL Server (the data mart) on separate machines, and be sure that the OLAP machine has adequate memory for OLAP Services. Second, if you use large data sets, partitioning can significantly improve query execution results. Finally, our tests show that MOLAP produces the best query performance.
The aggregation level you select affects both cube processing time and query performance. Monitor your system's performance and adjust aggregation levels accordingly. We suggest that you start at about 25 percent, then increase the aggregations as needed. If you need to build additional aggregations, use the Usage-Based Optimization Wizard, which designs them from the history of queries previously sent to the cube.
By carefully applying the knowledge and techniques presented here to create and use OLAP cubes, you can build a database of almost any size that doesn't explode when you add data, that can be maintained efficiently, and that users can query quickly to produce meaningful results. (For more information about evaluating your database, see the sidebar "Using the FASMI Test.") In short, you can achieve the success that has until now largely eluded large-scale database designers.