Before Microsoft's standardization effort, vendors and the IT community couldn't agree on what type of meta data a warehouse must maintain. Then Microsoft brought together leading vendors and end-user companies to determine the types of data-warehousing meta data. Similar to an object model for an application, the information model for a repository defines what kinds of meta data the warehouse stores and how it organizes that data.
This effort resulted in the Open Information Model (OIM), which contains definitions of approximately 300 object types and the relationships between them. Documenting these types was a large undertaking; the group needed to clearly define and document each type and its necessary characteristics and relationships. Because the information model defines the meta data in terms of object types and their relationships, looking for a suitable object modeling technology was the obvious solution. The group chose Unified Modeling Language (UML) to document and communicate the information model. The OIM is the common language for describing the information in a repository.
In July 1999, Microsoft transferred ownership of the OIM to the Meta Data Coalition (MDC), an independent organization dedicated to creating common standards for the exchange of information about systems (meta data). As the MDC continues to evolve the model over time, it will encourage and help other vendors to support the OIM in their tools. The MDC published Extensible Markup Language (XML) formats as a method of exchanging data from various tools that use the OIM. This standard reduces the complexity of getting data in and out of Microsoft Repository and other vendors' tools.
The currently accepted OIM version (originally submitted by Microsoft and its partner vendors) is 1.0; a proposal for version 1.1 was submitted in November 1999 and will probably be accepted in early 2000. The examples and descriptions that follow are from the 1.1 model. These examples require Microsoft Repository version 3, which is available in SQL Server 2000. Repository version 2 doesn't support inherited collections or properties, which the MDC OIM uses extensively.
The OIM covers many topic areas, including:
- Analysis and design of UML, UML extensions, generic elements, common data types, and entity relationship modeling
- Object and components: component descriptions
- Database and warehousing: relational database schemas, data transformations, OLAP schemas, record-oriented database schemas, XML schemas, and report definitions
- Business engineering: business goals, rules, and processes; organizational elements
- Knowledge management: knowledge descriptions and semantic definitions
From a data-warehousing perspective, the relational database schemas, data transformations, and OLAP schema models are the most relevant. You can download each model as an .mdl file from the MDC Web site, then use a design tool such as Visual Modeler or Rational Rose to view it.
Each object in a repository can have three properties: a name (255-character maximum), a short description (255-character maximum), and comments (virtually unlimited length). These fields, which provide much of the documentation for repository data, describe what the item is used for, what it means, and any special considerations. Much of this data might already exist in design tools or printed documentation, which you can use as a starting point for putting the information into a repository. These fields are often the most important because business users will consult them as they try to determine what a piece of data means or how it applies to the business objective they're trying to measure.
Relational Database Schema
Most warehouses use relational databases, and much of a warehouse's source data comes from relational sources. So this relational model serves as a description of the warehouse and of many source systems that feed the warehouse, the most important aspects of which are the tables, columns, and server or catalog objects.
The data in the SQL Server relational database schema is usually populated by the OLE DB scanner, which comes with SQL Server 7.0. Figure 1 shows the basic components for retrieving or updating information about a relational database. The data source represents a server in SQL Server terms, usually the name of the server machine. For other databases, the data source is the name the OLE DB provider returns, usually the data source name. On each data source is a set of deployed catalogs, or databases in SQL Server. To get to the catalogs from the data source, follow the DeployedCatalogs collection. Catalogs contain schemas, which you reach by following the Schemas collection. Schemas contain tables, accessible through the Tables collection. Next are columns on tables, which you reach through the inherited collection, Columns. Because the OIM is based on the UML standard, interfaces can inherit from other interfaces by using the inheritance concept from object modeling and UML—the interface inherits all the properties and relationships of the inherited interface. Tables inherit from ColumnSet, which contains the Columns collection. You retrieve the column's data type from the DataType collection.
The diagram also shows some inheritance for other items. Although you don't need to know this information to use the models, advanced users might need to become familiar with it to extend the model, for example. The MDC documentation includes a complete specification of this model, including additional items such as index, referential integrity, and stored procedures.
The data transformation model defines the transformation and cleansing process, which converts the data from the source systems and brings it into the warehouse. The most important objects in this complex area are related to linking the warehouse columns to their original source columns. Specifically how the data is converted is important to warehouse analysts, but end users typically need textual descriptions of the process, not code or script, to understand how the data is transformed or cleansed.
Like the OLAP model, the transformation model has a submodel to store the specific objects for the DTS transformation packages. DTS automatically populates the transformation model when you save the packages to a repository.
Figure 2 shows how to trace an element from the target (the warehouse column) back to its source (the source system columns). Here, inheritance within the model becomes important for understanding how this model functions. The ModelElement object is one of the base objects in the UML model; most other objects inherit (indirectly) from it. In this case, the relevant objects that inherit from it are column, table, and record field, which are the sources and targets for most of the transformations. The model doesn't show them explicitly because the inheritance chain is so long. Starting at a ModelElement that represents the warehouse column, the supplierDependency collection gets a TransformableObjectSet. The set represents a collection of objects that are the source or target of a transformation. Moving through the TargetOf collection brings you to the transformations that this column is the target of. Going back to the TransformableObjectSet through the Source collection gives you the set that is the transformation source. Follow the TransformObjects collection to find the source objects (columns, in the case of DTS).
This process is somewhat complicated to follow, but the model's complexity allows for modeling of scripts or code and other common cleansing and transformation processes that many tools use. For an example of how to navigate this model, see the sample Visual Basic (VB) code in Listing 1. It starts at a warehouse column (roWHCol), then navigates back to the TransformableObjectSet on the warehouse side (roWHTransObjSet). Then it follows the path back to the transformation (roTransform), and from there to the TransformableObjectSet on the source side (roSrcTransObjSet). The final step is to attach the columns (in roSrcCol) to the source transformable object set.
Many warehouse users access warehouse data through OLAP cubes. The primary objects for the OLAP model, which Figure 3 shows, are the server, catalog, cube, dimension, dimension hierarchy, and measure. The server is the machine that OLAP Services runs on. The OLAPDatabase is the database on the server, which you access through the DeployedDatabases collection. Catalogs have a collection of Cubes, whose measures you can access through the inherited Measures collection on the Store object. The dimensions are stored in the database's Dimensions collection. You can get from the cube to the dimensions it uses, but the objects are part of the OLAP Services submodel. This separation can create problems with providing complete information to the end user about the available dimensions. But most client tools get the available dimensions from the cube, and you can use the cube data to filter the list from the database to the appropriate subset.
Another submodel (the Microsoft OLAP Services model) that inherits from the OLAP model stores specific information for OLAP Services. It contains all the objects that the generic OLAP model doesn't cover.