You've heard the terms—data mart, dimensional database, star schema—but many people find that the how-to of data warehousing remains a mystery. If you haven't been involved in developing an enterprisewide data warehouse or a departmental data mart, you might have a hard time wrapping your mind around these concepts and how they translate into practical business solutions. So let's start the data warehousing discussion again by looking at a common business problem and walking through a simple data mart solution that uses the controlled environment of SQL Server's Northwind sample database.
Let's say I work for Northwind Traders Inc., as a DBA, developer, and overall get-it-done person. And let's say that one day, Dr. Andrew Fuller, vice president of sales, comes to me asking for some information about the products Northwind Traders sells. "I want to see how much of each product U.S. customers ordered, by quarter, in 1997," he demands. "And I need to know who our top three sales representatives and their managers are."
"Yes, sir," I reply. I know the company's order-processing system contains all this information—orders, customers, products, suppliers, shippers, and employees—in a database in third normal form, as Screen 1 shows. But I had just finished tuning the database so that it would process order entries as fast as possible, and now the boss wanted to run reports that would likely contend for resources with other functions and drag down order-entry processing again.
I decided I needed details about how Dr. Fuller and his sales team defined and planned to use this information. In talking with the sales manager and representatives, I discovered that "how much product U.S. customers ordered" boiled down to the number and price of ordered items—information in the Order Details table's line items. Dr. Fuller wanted to see this data "by quarter, in 1997," which meant he wanted to filter the information by time, and he wanted to see the "top three sales reps," which meant he wanted to see sales data relative to employee information. The sales manager also told me Dr. Fuller wanted to see this information at every Tuesday morning sales meeting, which meant he didn't need up-to-the-minute reporting from the operational system—good news for order-entry performance.
Designing a Solution
Because the solution involved looking at aggregations of specific data over time and in relationship to other entities, such as sales representative, I began thinking about building another data structure to meet this new reporting need. I spent some time reviewing my collection of data warehousing resources, including Ralph Kimball's classic, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses (John Wiley & Sons, 1996).
In the end, I decided to model a new data mart type of database. A data mart, which lets a business unit access and analyze department-specific data, was particularly suited to meeting the sales department's business needs. The data mart would include only sales data, and the sales department would be the only data mart consumer. In addition, with a data mart, I could load data from the operational system to eliminate contention between report queries and order-entry activity, and I could limit the number of joins the report queries would require and build indexes to deliver fast query performance.
Based on Dr. Fuller's and the sales staff's business needs, I had identified measures and dimensions for the data mart database. Measures are points of business data, such as dollar amount of sales, and dimensions are the entities the sales team uses to query the measures, such as time and sales representative. When I hear "I want to see something by something else," the "something" is typically a measure candidate and the "something else" is a possible dimension member. The operational system's dimension hierarchies were fairly straightforward. For example, the Product dimension contained product category and product name, and the Customer dimension contained country and city. But the data for suppliers' regions was sparsely populated, so I included only the supplier name and country in the new database. From these basic operational database relationships, I created the dimensional design you see in Screen 2.
To provide flexible reporting and reasonable query performance, I joined the fact table to the dimension tables and used queries to filter and group the dimension data. For even better performance, I indexed each foreign key in the Orders_Fact table, and for entity integrity, I built a primary key on the combination of all the foreign keys. These design decisions yielded a database that could serve as the foundation for a sales department data mart based on operational order information.
The dimensional database still had relational characteristics: foreign-key constraints between the fact and dimension tables and primary keys for each table. However, I denormalized the database by combining some tables from the operational system; this denormalization created a star schema database, which gets its name from its entity-relationship diagram's star-like appearance. For example, the schema for my dimensional database in Screen 2 looks like a five-point star. The star schema's primary benefit is the ease with which you can use dimensions to query measures.
To denormalize the database, I combined the Orders and Order Details tables into the Orders_Fact table, which holds information about each order line item. The Orders_Fact table contains the lowest level of order-data granularity to provide maximum reporting flexibility. Likewise, I combined the Products and Categories tables into the Product dimension table. Had I preserved the normalized relationship between these two tables for my Product dimension table, I would have produced a partial snowflake schema, which gets its name from the snowflake appearance of normalized dimension tables. However, I decided on the star schema instead of the snowflake schema because the snowflake would require more joins to query the fact table.
Another design consideration involved the keys for both the dimension and fact tables. The operational system's Customers table defined the CustomerID column as a character data type. In the dimensional model, I changed the CustomerID data type to integer for two reasons. First, for query performance, I prefer to index and join integer values. And second, the more independent my dimensional database is from the operational system, the less the possibility that changes to the operational system will force changes to the dimensional database structure. For demonstration purposes, I changed the data type value only for the CustomerID and TimeID dimensions. The other dimension keys are numeric, as they are in the operational system.
Loading the Data Mart
I now had to figure out how to load the operational data to my dimensional database. Getting data from one database structure to another is called data extraction, transformation, and loading (ETL). My database design—which combined data from more than one table to populate the fact and product tables and which turned the character CustomerID into an integer value—would drive part, but not all, of the ETL process.
To move the data, I created a seven-part Data Transformation Services (DTS) package, as Screen 3 shows. Because the operational system's source data was in good shape, I didn't have much data cleansing or transforming to do. But the sidebar "Data Transformation in the Real World" gives you a glimpse at the data quality you can expect in real-life data mart and data warehousing solutions. Note that the arrows you see in Screen 3 show the workflow between tasks, and the green-and-white striped arrows represent actions the package takes when the previous task succeeds. This DTS package contains two SQL Server data connections: one for the operational Northwind database and one for the dimensional Northwind_MD data mart database. Let's step through this package's seven tasks.
Step 1. To start, I created a T-SQL task that performed two functions: the removal of foreign-key constraints so that I could truncate all the tables in the dimensional database and the removal of fact table indexes to speed up loading later on. You could develop another package for ongoing incremental data mart updates based on order date, for example.
Step 2. I then created another T-SQL task, which loaded the Time dimension table by inserting date information for every day between two dates. This task, as Listing 1 shows, breaks the date into year, quarter, month, and day to represent the hierarchy Dr. Fuller and his sales team will use to look at sales information.
Step 3. Next, I created a T-SQL task that loaded a cross-reference table with the CustomerID character values. The cross-reference table has an integer column with the identity property set to generate the data mart's surrogate customer key values, as Listing 2 shows.
Step 4. This step includes four data-pump tasks, marked by gray lines in Screen 3, to load the other dimension tables from Northwind data. These tasks use the SQL Server connections to move data between the Northwind and Northwind_MD databases. You can view each data-pump task from the DTS package, which is part of the accompanying source code (available at the top of the page). I cover just the task for populating the Customer dimension and generating the surrogate customer key. Listing 3 shows the query that serves as the source for loading the Customer dimension. Remember that the Customer_XRef table has both the original CustomerID character value (SourceCustomerID) from the operational system and a system-generated integer value (DestCustomerID). So the source for the data-pump task is a query that joins the Northwind Customers table with the Northwind_MD Customer_XRef table, moving the integer DestCustomerID value to the Northwind_MD Customer table. Staging tables such as Customer_XRef can be very useful in transforming and integrating data during the ETL process. This is a simple example; I've built entire databases to hold staging tables for more complex integration.
The other data-pump tasks also use queries as their source, and for demonstration purposes, they use ActiveX scripts or Copy Column transformations to format the data for the dimensional database. Copy Column transformations are faster than ActiveX scripts, which require row-by-row processing. If you use an ActiveX transformation, using ordinal positions to identify columns is faster than using names, as you can see in the Employees data-pump task, which Listing 4 shows.
Step 5. After loading the relational data to the dimensional tables, I used a T-SQL task to rebuild the foreign-key constraints.
Step 6. I then used another data-pump task to load the fact table. Listing 5 shows the source for this task: a query that joins the Orders and Order Details tables from the Northwind database to Northwind_MD's dimension and cross-reference tables. As I noted earlier, I cheated a bit with the product, supplier, and employee keys by simply taking the values from the Northwind database and loading them directly to Northwind_MD. Notice the join between the Northwind Orders table and the Northwind_MD Time table to match the integer TimeID key values.
Step 7. The last step was to rebuild the foreign key indexes in the Orders_Fact table to support queries. My dimensional database was now complete.
Trying It Out
With the dimensional database loaded, I tried to answer one of Dr. Fuller's requests: "I want to see how much of each type of product U.S. customers ordered, by quarter, in 1997." Listing 6 shows the query I ran against the Northwind_MD data mart to answer this question; Listing 7 shows the query against the operational Northwind database to get the same results.
Although both queries returned the same answers, I performed one less table join to get the results from the data mart, and the T-SQL in the query against the dimensional database was a little cleaner because I designed the dimensional database to answer business questions. Another benefit of the data mart was that I didn't have to query the operational system that was already busy processing transactions. These advantages of a dimensional, or star, schema are even more pronounced in complex environments.
This simple data mart demonstrates an effective way to pull operational data together so that decision makers can quickly and easily access business information, but the solution doesn't have to stop here. For example, you could use Microsoft OLAP Services or other OLAP or reporting tools to access the data mart. An OLAP tool can improve query response time by preaggregating data, and many OLAP tools support full-featured user interfaces that let business decision makers analyze and report on data. You can also tap in to the real power of the star schema structure, which lets you easily build a cube that end users can interactively query. In an upcoming article, I'll show you how to use the OLAP Services Cube Wizard to build a cube out of the dimensional database, then use the DTS Processes Cube add-in to refresh the cube on a scheduled basis. You can then let end users answer their own questions by using OLAP tools or running queries from Excel's PivotTable Services.