Skip navigation

Incremental Updates in OLAP Cubes

Here's how to incrementally update cubes—manually, programmatically, and with DTS

Download the Code iconThe capacity for an OLAP cube to store and organize colossal quantities of data is the attribute that makes OLAP more valuable for reporting than a standard relational database. Queries that take 20 minutes to return results in a relational database can return the same results in less than 1 minute when the system uses previously calculated aggregate tables. But if you have a cube containing more than a year's worth of sales measures, which are represented by more than 20 million records, adding new records to the sales fact table and processing the data cube can take several hours. It seems strangely inefficient to have to reprocess huge stores of old data that haven't changed just so you can incorporate a few records to the cube. You need a way to add those records to the existing cube in the few minutes it takes to process only the new records.

To make matters more complicated, let's say you need to add daily sales records to the cube automatically in a batch program every night. You need to incrementally and automatically update the cubes. This capability is critical to the successful implementation of an enterprise-level cube, yet it is one of the most under-documented and most easily misunderstood processes of OLAP Services. Using the example in which daily sales records are added to the cube, I'll show you how to set up the relational tables to facilitate incremental updates. Then I'll show how to manually update cubes incrementally and how to programmatically update cubes through the Decision Support Objects (DSO) COM interface. Finally, I'll explain how to do incremental updates with Data Transformation Systems (DTS).

Setting Up the Relational Tables

OLAP Services' ability to identify records in the relational tables that haven't yet been entered into the cube makes the incremental update possible. Without this ability, OLAP Services couldn't differentiate the new data from the old data. When you apply an incremental update without this step, all the cube's measures are counted twice because OLAP Services sums the new measures with existing ones.

To successfully implement incremental updates, OLAP Services needs to compare elements in the fact tables to data in the OLAP cube in a way that excludes data already in the cube. To demonstrate this method, I'll use a unique incrementing field in the fact table that will be aggregated with the MAX() column function. By using MAX(), you can write a simple MDX query that will return the largest ID number stored inside the cube. Any record containing a larger ID number in the underlying relational tables is a candidate for inclusion the next time the cube is processed. If your fact tables are populated with batch programs that can supply a unique batch number for every set of inserted records, then you could use that batch number instead. To use the following techniques, you just need to ensure that the records in the fact table contain a sequential numeric column that can be stored in the cube as a measure.

Adding a measure in the cube only because you need to manage incremental updates internally might not seem to be a best-practices strategy. But I find that if you don't add a measure, the relational tables will continually need to keep track of which records in the fact table were used to populate the cube. I explored two methods of using only the relational tables to differentiate the new records in the fact table from the ones already used to populate the cube. You could establish one fact table per partition, and write lengthy procedures that ensure the fact tables are populated correctly, but then you would need to do a full refresh on the individual partitions. Another, perhaps better, option is to represent the fact table with a view that contains a WHERE clause that limits the rows returned to those rows that are not present in the cube. You could set up this view by maintaining a separate ID tracking table that contains the last ID sent to the cube. Each time users updated the cube, they would also have to update the table to reflect the last ID sent to the cube. The view's results would be based in part on the last ID on that table. The problem with this method is that it's nearly impossible to ensure that the cube update succeeded before you update the ID management table on the relational database. Another problem is that the cube could get updated properly but the ID management table update could fail, which means that OLAP Services might repopulate the cube with existing data. I've found no acceptable way to wrap two, separate processes around a common transaction when one occurs in SQL Server and the other within OLAP Services. To ensure that even interrupted transactions leave both the OLAP cube and the underlying relational tables in a consistent state, you need to let the cube store the information about the records it contains compared to the records in the fact tables. You can best accomplish that by using the previously mentioned identity key as the measure.

Incrementally Updating a Cube Through the OLAP Manager

A cube is a structure that contains a hierarchy of smaller structures called partitions, which contain portions of aggregated data. Incremental updates need to interact with the cube at the partition level. Partitions are often described as slices of the cube because they represent parallel subsets of data. For example, you can divide a sales cube into partitions, each containing sales data for a quarter of the year. But because these partitions are part of the same cube, client queries don't have to account for this configuration. A cube comprises at least one partition, and might comprise tens or hundreds of partitions to optimize data access and processing. The key to successfully implementing multiple partitioned cubes is ensuring that each partition contains only the data pertaining to that partition. Otherwise, the cube might return incorrect data.

You can use one of two methods to ensure that multiple partitions never contain overlapping data. One method is to associate a partition with a separate fact table or a view that contains only the subset of data needed to populate that partition. This is a good method when you're creating a new partition for new data and you're designating the current partition to hold only archived data. With this method, the cube needs to reprocess only the new partition to account for any new data coming in. The second method is to place a filter on the partition, which means that the OLAP engine will process only the data it was meant to contain even though the engine uses the same fact table each time it processes the data. The latter method is most useful when you're doing incremental updates.

For the daily sales update example, you need to place a filter that returns only those records that have an ID value greater than the largest ID value in the cube. You can find this value in the cube's data browser. Because in this example, OLAP Services uses the MAX() function to aggregate the ID field, you can find the highest ID value by taking the highest-level aggregate of all the dimensions, which is the default level shown when you open the OLAP Manager Cube Editor, as Screen 1 shows. To place a filter on the current partition, select the cube you will be using and expand the tree under it to expose several subfolders, including the Partitions subfolder. When you click on Partitions, you'll see a list of the cube's partitions. Right-click on the partition that, by default, has the same name as the cube. From the dialog box, select Edit, and type in the WHERE clause, as Screen 2 shows.

From the Process a Cube dialog box, select Incremental update, as Screen 3 shows. When you select this option, OLAP Services will add to the partition only the records that match the filter. Note that if a user selects a full reprocess or a refresh while the filter is still in place, then the partition will be emptied of all records and OLAP Services will add to the cube only the records that correspond to the filter. So if you increment data manually, make sure you remove the filter from the partition after OLAP Services processes the cube.

Automatic Incremental Updates

Although you can use manual cube updates for occasional data refreshes, cubes often need to reflect daily or hourly data, which requires the ability to automate incremental updates. This process isn't difficult if you understand DSO and the details of cube structure. Here's some basic information: DSO is a COM interface that comes with OLAP Manager. In theory, you can access DSO from almost any scripting language you choose, such as VBScript, JavaScript, or Perl. But Microsoft developed DSO by using VB interfaces that require strongly typed variables, so using widely available scripting languages is difficult, requiring you to use many undocumented method calls and properties that Microsoft might not support in the future. For this reason, I recommend simply using VB to create an ActiveX DLL that encapsulates the functionality you need to manipulate the cube. Then, you can use scripting to make calls to this DLL.

You can learn more about DSO's basic structure by walking through portions of VB code, described below, that you need to accomplish each step of the tasks. Note that for simplicity, I've omitted error handling and hard-coded values in these examples.

The topmost object is the server, which you instantiate in the following manner :

Dim dsoServer as DSO.Server
Set dsoServer = new DSO.Server

Then, connect the server to the OLAP server:

dsoServer.Connect('SALESSERVER')

After you create the server connection, you can access the next level in the hierarchy through the MDStores interface, which is a generic interface to access the level below for any hierarchical DSO object. So, you use this method to access the WIDGET database in the server:

Dim dsoDb as DSO.MDStore
Set dsoDb = new 
  dsoServer.MDStores('Widgets')

You can use the same interface to gain access to the cube that will be processed, as in:

Dim dsoCube as DSO.MDStore
Set dsoCube  = new 
  dsoDb.MDStores('Sales')

Now you need to find the cube's partition. By default, when the cube has only one partition, it adopts the same name as the cube, as in this example:

Dim dsoPartition as DSO.MDStore
Set dsoPartition = new 
  dsoCube.MDStores('SALES')

Based on the manual procedure for incrementally updating a partition, you could logically conclude that now you only need to apply a filter to the partition and run an IncrementalUpdate() function, as in the method that used the OLAP Manager interface. But the process isn't that simple. DSO does provide a process method within the partition object that you call with

DsoPartition.Process 

But the process types available are limited to those in Table 1. So where is the Incremental flag? There is none. The OLAP Manager made it look as if there was one by offering Incremental update as one of the processing choices, along with Full Process and Refresh. But OLAP Services actually uses only the DSO methods available in Table 1. OLAP Services creates a new temporary partition that contains the filter you entered and adds the new data to this partition. Then, OLAP Services merges the temporary partition with the original one, which is possible only as long as the structures of the two partitions are the same.

To create a temporary partition, you use the cube object, as in:

Dim dsoTmpPartition as dsoCube.MDStores
Set dsoTmpPartition = dsoCube.MDStores.AddNew('~tmpPartition')

Note that I placed a tilde character before the temporary partition name. DSO handles any partition that begins with a tilde as a temporary partition, and destroys the partition after the connection to it is released. This is an important point because OLAP Services will add any remaining partitions to the parent cube measures whether they were intended to be temporary or not. Using the tilde character ensures that the partition will disappear even after an interruption of the merge.

To be merged, the temporary and original partitions need to have the same dimension and level structure. Instead of programmatically creating the structures one by one, you can use a method designed for that purpose, as in:

dsoPartition.Clone(dsoTmpPartition)

This method ensures that the temporary partition is structurally the same as the original. To limit the amount of data that goes into this partition, you need to apply a filter, as in the manual process. To apply the filter, you need to find the highest ID value in the cube. You need to use ADO multidimensional (ADO MD) to extract the information from the cube, as Listing 1 shows.

After recovering the value of the highest ID in the cube, you can create the filter for the partition. DSO uses the filter as a WHERE clause for the fact table, so you need to structure the filter as such, as Listing 2 shows.

After you process the partition, you need to remove the existing filter before merging the temporary and original partitions. Otherwise, OLAP Services will add the temporary filter to the original partition's filter, which usually isn't the result you want. Also, keep in mind that attempting to merge an empty partition will cause a run-time error, so you need to test the temporary partition to see whether any rows were added before you attempt the merge, as in Listing 3. And with that step, you've automatically processed the cube.

Using DTS to Incrementally Update Cubes

Recently Microsoft introduced a new DTS add-in task called the OLAP Services Processing Task, which lets you apply incremental updates to cube partitions. Because it lets you control task flow in a GUI environment, the OLAP Services Processing Task provides a convenient way to schedule these updates and integrate them with scheduled fact table tasks. You can download the add-in for free from http://www.microsoft.com/sql/bizsol/DTSKit.htm.

To create a package without adding any custom DTS code, I created a SQL Server table that serves only to store the last highest ID in the cube. This method lets you create a generic filter on the partition. A SQL query in the data pump will gather the information relating to the highest ID present in the cube. This SQL query gets the maximum value of the ID field from the fact table immediately before OLAP Services processes the cube. This information will later help filter the fact table without your having to modify the WHERE clause in the partition before each update. In this case, I called the table SALESID and created two fields, as Listing 4 shows.

As the name suggests, the ID field holds the highest ID found in the cube and the BATCHDATE field helps determine which is the last record entered in the table. You can also use the BATCHDATE field as an internal audit to see when OLAP Services attempted to gather that value.

You need to create one connection for the cube and another for the database where the SALESID table was created. To create the OLAP connection, open the OLE DB connection, select the Microsoft OLE DB Provider for OLAP Services data source, and enter the name of the OLAP server in the Server field, as Screen 4 shows. The connection icon will automatically change to a cube-like object, as Screen 5 shows.

Then, you need to add the connection to the SQL Server database and add a transformation task that adds a record to the SALESID table containing the highest ID value in the cube at that time. Again, at processing time, the filter in the partition will use this value to determine which records are needed to add measures to the cube. You need to use an MDX query as the source of the transformation to get the highest ID value, as Screen 6 shows. Be sure to link the time-related dimension information to the ID field, as Screen 7 shows. In this case, the TIMEDATA column contains the cell that needs to be sent to the ID field.

If the cube update process enters a measure into the cube that depends on dimensions that don't yet exist, you will get an error that prevents the cube from processing. To avoid this error, create a task that updates the dimension data before updating the cube partitions. Successful partition processing depends on the success of this step. To set up a dimension processing task, choose the OLAP Services Processing Task icon represented by a solid yellow cube, as Screen 8 shows.

After OLAP Services successfully updates the dimension, the partition processing can begin. In this case, you need to add to the partition portion of the task a filter that brings only records from the fact table that have a higher ID value than the highest one in the cube. After you choose the partition and then choose incremental update as the processing option, you'll see fields that let you change the fact table and the filter. In this case, change only the filter, as in Screen 9. Unlike the filter that's applied directly to the partition through the OLAP Manager, this filter exists only while the task is running and no longer exists in the partition after the task is run. Screen 10 shows the executing DTS package.

You can run the task on a scheduled basis, but remember that if no new records need processing, the partition processing portion of the task will fail with an error. In the DSO code from the previous section, I avoided processing the partition if no new records existed in the temporary partition. Because DTS merges a temporary partition with the original one to accomplish incremental processing, you might get an error when you use DTS because it doesn't check for cases in which the temporary partition is empty. If you want to avoid this error, simply include some checks in the DTS package that prevent that portion of the task from beginning if no new records are in the fact table.

Automating Data Integration

The ability of cubes to provide meaning to data becomes even more valuable when the data is up to date. Understanding how OLAP Services treats incoming data and how you can take advantage of that knowledge to automate your data integration process is crucial to managing servers containing large, multiple cube structures that are refreshed on a regular basis.

Hide comments

Comments

  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.
Publish