Editor's Note: "Using the Microsoft Repository" and "The Open Information Model," page 42, by Patrick Cross and Saeed Rahimi, and "Incremental Updates in OLAP Cubes," page 47, by Claude Seidman, continue our series on data warehousing and Incremental Updates.
In recent years, data warehousing best practices have included the use of a repository to store information about the warehouse. The repository information helps people assess the impact of changes, aids in tracking down problems, and helps warehouse users better understand the data they use to make key business decisions. The Microsoft Repository is an important part of Microsoft's data warehousing strategy. By following these examples, you'll understand and use the repository more effectively.
What Is a Repository?
A repository is a storage place for meta data, or data about data. Several vendors have created repository products in the past 20 years, starting with data dictionaries in the 1960s. More recent products have evolved from these simple dictionaries into complex systems for storing data from hundreds of applications and environments. Most recently, Microsoft included its repository technology as part of its data-warehousing framework (a collection of tools provided with SQL Server), in recognition of the important role repositories play in warehousing. Microsoft uses Extensible Markup Language (XML) to integrate support for exchanging information in the repository. XML provides a simple way for tools to interchange meta data. Interest in repositories has recently surged, primarily because of the emergence of data warehousing, knowledge management, and enterprise application integration.
What Is Meta Data?
According to Bill Inmon, meta data is "the description of the structure, content, keys, indexes, etc. of data," (Managing the Data Warehouse, John Wiley & Sons, 1996). More specifically, in a data-warehouse environment, meta data can be information about warehouse data, information about how to get a piece of data out of the warehouse, or information about the quality of data in the warehouse. Meta data can even give information about how to run warehouse tools to perform different tasks. Meta data about a warehouse includes information about systems, processes, source and target databases, data transformations, data cleansing, data access, data marts, and OLAP tools.
For example, a warehouse's meta data can tell you how an OLAP tool calculated a report's Total Sales column. Without knowing anything else about the warehouse, a user looking at a regional sales report might think that Total Sales includes all discounts and shipping and handling charges. But Total Sales might not include any or all of these elements. To help the end user, the warehouse needs meta data that ties the reports' columns to data transformations, data queries, field calculations, and source database tables and columns. The warehouse can provide this help if it has meta data about the source databases (such as tables and columns), what information was extracted, how the information was transformed, the target (or warehouse) database, data warehouse reporting tools, and database design and modeling tools.
Many warehouse implementations today use a simple spreadsheet to capture the source and target mappings and conversions. But beyond the project's initial requirements phase, this information quickly becomes outdated and inaccessible to most users who interact with the warehouse.
Why Is Meta Data Important?
Meta data acts as a road map to the information in the warehouse. It helps users find out what type of data is in the warehouse, what a piece of data means, how to access the data, who owns the data, and who created it and when and how. Without meta data, data warehouse users can access data but not information in context that helps them make business decisions with confidence. Currently, the best practices for warehouse and data-mart development mandate having a meta data strategy that makes the warehouse easy to update and use.
Meta data can provide complementary information about the warehouse's contents. For example, an analyst can use meta data to better understand what information is available and how it is calculated. Meta data provides a detailed analysis of where the information came from and can give a confidence factor for describing the data's validity.
Meta data can also maintain a history of activity before the current processes. This history is important in understanding how the structure and processes change over time. If a user wants to compare sales by product category during the past 2 years, but the product category was added this year, the results of the comparison are meaningless.
Also, meta data provides a common frame of reference, acting as the organization's central knowledge base. For example, everyone who looks at the regional annual sales report can arrive at the same understanding of what Total Sales means.
The ability to analyze the impact of changes in the warehouse can drastically reduce the cost of ongoing maintenance. If you need to change a source table, you can simply look at the table's other columns to determine which parts of the warehouse use those columns. With this usage information, you can make a quick list of items that you'll need to modify to accommodate the source system changes. Likewise, you can more easily fulfill requests for additional information in the warehouse if you can see where information is already coming from. Before adding new data to a table, review all the data sources for the table.
Meta Data Services
Repository technology is the core of the services that SQL Server provides for managing meta data. These services' features bring significant value to information consumers and people who maintain the repository data. The most important features are information models, the Microsoft Repository engine, XML interchange, and extensibility.
Information models. Information models define the meta data in terms of object types and their relationships. The information model is thus the language for describing the information the repository will store. Microsoft's data-warehousing consortium (for more information, see "The Open Information Model," page 42) chose the Unified Modeling Language (UML) for documenting and communicating the information model the Microsoft Repository uses. Although users can define their own information models to describe meta data, Microsoft created the Open Information Model (OIM) as a common specification for storing information about systems, warehouses, etc.
Repository engine. The Microsoft Repository engine handles meta data storage and retrieval. The engine uses the information model and stores meta data instances that the information model describes in SQL Server tables and columns. And the engine uses caching to optimize access to the meta data. Versioning is an important feature of the Microsoft Repository, and is fairly new to repositories generally. Versioning means that as updates happen to objects in the repository, those changes are captured, and the repository maintains a history of how the repository data changed. By letting you view objects as they're updated, the repository can answer questions about changes over time. For example, if new columns are added to the warehouse, you can track and query this addition later. This ability helps you solve problems related to warehouse alterations that cause sudden changes to data the users receive.
XML interchange. Besides describing the data types that the repository can store, the information model includes an XML format for interchanging meta data between the repository and other tools. And users can easily import information into the repository from custom sources or other tools. One example in "Sample Repository Applications" (which you can download at the link to this article at http://www.sqlmag.com) contains a sample XML file that you can use to load data into the repository.
Extensibility. Extensibility is an important repository feature that lets users add information to the models to cover topics specific to their organization or tool. Using the Microsoft Repository software development kit (SDK), which is available at the Repository Web site (for this and other resources, see "For More Information"), you can extend the information models that the Repository uses. If your organization needs to store information about locations that aren't part of the base OIM, you can extend the model to add a new object and corresponding relationships to track that information. Sites can also add new properties to track information for existing objects (e.g., adding a new attribute to a table).
Integration with the Microsoft Data-Warehousing Framework
The Repository is a component of Microsoft's data-warehousing framework and an important part of its SQL Server strategy. Microsoft adds value for its database customers by bundling warehouse tools with SQL Server 7.0, such as Data Transformation Services (DTS), OLAP services, English Query, and the Repository. You can also access repository data through the SQL Server Enterprise Manager, as Screen 1 shows, by clicking the Metadata node under Data Transformation Services. From there, you can access the relational schema information and package and lineage information. You can explore the relationships between columns and the packages that update them.
Data Transformation Services. DTS is the extraction, transformation, and loading (ETL) tool that comes with SQL Server 7.0. Its flexibility and range of features help you populate a warehouse. You can save packages directly into the Repository. One of DTS's more compelling features, lineage, requires the Repository. Lineage lets you track how data in the warehouse was calculated and when it entered the warehouse.
To save DTS information into the Microsoft Repository, choose SQL Server Repository as the location for saving the package. Use the Advanced tab on the Package Properties to set the scanning options, which Screen 2 shows. Doing so causes DTS to call the OLE DB scanner to load all source and target catalogs into the Repository. If you don't set the scanning options, DTS creates DTS Local Catalogs as the reference for all source and target catalogs, which can make locating the databases impossible. Each subsequent save replicates this reference, so you can't keep comments and other descriptive information updated.
You can run into problems when you try to save certain DTS transformations to a repository. If you use a script to perform a simple transformation and you choose the source columns explicitly (not from a query), all the transformation data is captured, as you can see in the transformation model in "The Open Information Model." If you choose a query as the transformation source, that source becomes objects that aren't part of the OLE DB imported data. This choice makes following the connection back to the true source objects difficult. Also, the query isn't parsed to create a connection between the query columns and the columns you select the data from. So in many cases, the connection between source and target is available, but in some, it isn't. You can solve these problems by writing a program to resolve the references in a repository or by using a custom model along with the DTS model to store the source target mappings.
DTS also uses versioning for the package object, but it replicates all the subordinate objects with each save. You can then go back to any version of a package to see exactly how the data was transformed. This capability is important for tracking down problems with data months (or years) after it entered the warehouse. This versioning scheme won't cause many problems if you use the appropriate scanning options to import the relational schemas, which will then be appropriately versioned for continuity.
OLAP Services. OLAP Services gives you multi-dimensional analysis for warehouse data. By using a utility in the Repository SDK, you can import all the definitions for the OLAP data into a repository. With SQL Server 2000, OLAP Services stores meta data directly in the repository. One problem with the utility in SQL Server 7.0 is that it doesn't connect the measures and dimensions to the underlying columns they're based on. Making these connections can be a tedious manual task, or you can write a program to make the connections automatically. The information about which columns are used is available in the repository model, so you can easily retrieve the data.
English Query. English Query lets you define a semantic model for a database, then translates English phrases into SQL. The semantic model provides information about a system and how it is used, and can be valuable as you use the warehouse. The Repository SDK contains a utility to import data from English Query into a repository. SQL Server 2000 has an option within English Query to export and import models from a repository.
OLE DB (relational schema). Importing relational schemas into a repository gives you the base set of information necessary to begin documenting the warehouse. You can import database schemas from any OLE DB- or ODBC-compliant data source. To run the import from Enterprise Manager, right-click the Metadata tab under Data Transformation Services and choose Import Metadata. This import utility uses full versioning as it loads the data, so you can easily understand the changes to the database schemas over time. Because the utility uses versioning, it preserves all the descriptive information, such as comments, and rescanning the catalog doesn't affect the information.
Example Usage Scenarios
The following usage scenarios demonstrate how end users and warehouse analysts can benefit from the data in the Microsoft Repository. For most organizations, the end users benefit most from the assurance that they have the appropriate information and an understanding of the data they're viewing.
Following a data path from OLAP cube to source system. A business user is using an OLAP client to view data and needs to better understand how the data was calculated and where it came from. The user is familiar with the company's current operational systems, but isn't familiar with the warehouse. The organization has undergone numerous mergers and acquisitions, so several systems that populate the warehouse originated in different companies.
The user starts with the OLAP client that is displaying sales data from a cube. To get more information about what the data means and where it came from, the user could obtain from the Repository descriptive information about the cube and its measures and dimensions. This description would be more informative than the simple labels assigned to each item. For Total Sales, the description might note that returns don't appear until the end-of-quarter processing, or that the regions changed in 1997, so region comparisons before that year will return invalid results. The user can then follow the path from a particular measure or dimension to the warehouse, then through the transformations to the source systems. If users are interested in data from a specific source system, they can use the Repository to determine which cubes or reports display data from that system. Descriptions of the source systems, the transformations that moved the data, and related information are also available. This information gives users a frame of reference in terms of the current business, not just the warehouse.
Using DTS lineage to find source data and transformations. In another case, a user is viewing data from a report or cube and wants to know exactly where the data came from. This environment has seven source systems, each of which uses a different package to populate the warehouse. Using the lineage feature of DTS, the user can determine which package a system used to populate the warehouse with a given row. From a particular report row, the user can determine the lineage IDs for that warehouse data, then query the Repository to find out which packages from which source systems populated that data. This process gives the user a detailed analysis of the data source. An experienced warehouse analyst can also use this information to see the exact transformation package that populated that data, even if the package changed after populating the data.
Make the Best Decision
The Microsoft Repository and its integration with other warehousing tools give you the kind of information that's impossible to obtain in a typical warehousing environment, which stores designs and transformations primarily in spreadsheets. The goal of warehousing is to give as much information to as many users as possible, so that they can make business decisions based on the best information.
But without a means to understand and describe that data, users might get too much information or not trust the data and could base decisions on a faulty interpretation of the data's meaning. Repository technology can play a significant role in the delivery of warehouse information to end users, supplying consistent descriptions and a road map of the data being viewed. Microsoft and other vendors provide significant integration with data warehousing tools to make meta data easier to use.