Step-by-Step Data Warehousing

Build a data warehouse from the ground up

In the February issue of SQL Server Magazine, we introduced the "7 Steps to Data Warehousing." In this follow-up article, we'll demonstrate more in-depth data-warehousing practices by focusing on a single business process: training. Keep in mind that we can add other processes to the data warehouse later. The first step is to verify that data describing this process is available. Then we'll choose the key performance indicators that characterize the process and perform dimensional analysis to generate the star schema. In future articles, we'll populate the star schema tables, create cubes from the star schema, and use front-end tools to analyze the data.

The company in this example has many lines of business, including development, staffing, consulting, and training, which have overlapping customer bases. Although the processes for these business activities are very different, they share many common dimensional entities. The same employees who consult often train. Clients who purchase development use staffing services, too. To keep the system manageable, each star schema structure should focus on a single business subject (e.g., training sales, development hours, consultant utilization). This method will result in many individual star schemas. The data from multiple stars can be merged, however, as long as the stars share common dimension tables. Thus, if there is one common dimension for customers, we can merge data about their training, development, and consulting activities drawn from distinct star schema structures. This approach requires some careful planning up front, but the end result quickly justifies the investment.
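
For example, here is a minimal Transact-SQL sketch of such a merge, assuming two hypothetical fact tables, FactTraining and FactConsulting, that both carry a CustomerKey into a shared customer dimension; none of these table or column names come from the actual system:

-- Aggregate each star separately, then join through the shared customer
-- dimension so that rows from the two fact tables never multiply each other.
SELECT cu.CompanyName,
       COALESCE(t.TrainingRevenue, 0)   AS TrainingRevenue,
       COALESCE(k.ConsultingRevenue, 0) AS ConsultingRevenue
FROM DimCustomer cu
LEFT JOIN (SELECT CustomerKey, SUM(Revenue) AS TrainingRevenue
           FROM FactTraining GROUP BY CustomerKey) t
       ON t.CustomerKey = cu.CustomerKey
LEFT JOIN (SELECT CustomerKey, SUM(BilledAmount) AS ConsultingRevenue
           FROM FactConsulting GROUP BY CustomerKey) k
       ON k.CustomerKey = cu.CustomerKey;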

The training line of business provides an outstanding example of how to implement data-warehousing and decision-support systems. The training market has seen rapid and substantial change over the past few years, and the regional market for training across several other product lines has expanded significantly. Therefore, the data warehouse must let the training managers quickly identify trends and assess their impact and longevity. It must provide a basis for modifying existing business practices and creating new ones. And the data warehouse needs to make relevant data as accessible as possible to answer future questions that we couldn't predict during the design phase.

Step 1: Define the Processes

The processes in the training line of business include marketing, sales, class scheduling, student registration, attendance, instructor evaluation, and billing. To choose a manageable subset of these processes, we conducted interviews with the managers. From the interviews, we found that the most crucial information fell into four main areas: student demographics, payments, the reasons students chose this company, and the correlation between ratings on instructor evaluations and repeat registrations. To answer the questions about these defining processes, we captured data that the company had previously collected from the student registration, attendance, instructor evaluation, and billing processes. For later phases of the project, we collected information from other processes to answer questions about subjects such as profitability. Because the company shares resources among different lines of business, determining each resource's cost component to assign to training requires integration with the parts of the data warehouse that will be developed for the other lines of business. Also, note that assigning such costs will be complex and prone to error if you do it in multiple phases. The cost components must be mutually exclusive and collectively exhaustive; otherwise, profitability calculations will be meaningless.

Step 2: Define the Data Sources

After identifying a process, you must identify appropriate data sources. In our case, this step was easy because three FoxPro tables stored all the student registration data: student, class, and registration. Furthermore, the company originally designed these tables with efficient decision support in mind, so we found columns in the tables describing invoicing, payment, hotel information, and so on. A separate SQL Server database stored the instructor evaluations, which we could relate back to the FoxPro tables.

The student table, mename, contains complete information about each student, including name, address, and company. The class table, meclas, contains information about each occurrence of a class, including the location, the start and end dates, how many seats were originally available, how many seats were occupied, and which instructor taught the class. The registration table, meregis, acts as a resolver table between the student table and the class table and contains additional information about each time a student attended a class. Figure 1 shows a diagram of the three student registration tables.
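
In place of Figure 1, here is a rough Transact-SQL sketch of how the three source tables relate. The table names come from the text, but because the tables are FoxPro structures and the figure isn't reproduced here, the column names and data types are only assumptions for illustration:

-- mename: one row per student
CREATE TABLE mename (
  StudentID   int         NOT NULL PRIMARY KEY,
  FirstName   varchar(30) NULL,
  LastName    varchar(30) NULL,
  CompanyName varchar(60) NULL,
  City        varchar(30) NULL,
  State       char(2)     NULL,
  Zip         varchar(10) NULL
);

-- meclas: one row per occurrence of a class
CREATE TABLE meclas (
  ClassID      int         NOT NULL PRIMARY KEY,
  CourseName   varchar(60) NULL,
  Location     varchar(40) NULL,
  StartDate    datetime    NULL,
  EndDate      datetime    NULL,
  SeatsOffered int         NULL,
  SeatsFilled  int         NULL,
  Instructor   varchar(40) NULL
);

-- meregis: resolver table, one row per student per class attended
CREATE TABLE meregis (
  StudentID        int      NOT NULL REFERENCES mename (StudentID),
  ClassID          int      NOT NULL REFERENCES meclas (ClassID),
  RegistrationDate datetime NULL,
  InvoicedAmount   money    NULL,
  HotelStay        bit      NULL,
  PRIMARY KEY (StudentID, ClassID)
);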

Step 3: Define the Dimensions

Now that we have a picture of the objects to model, based on the questions we were tasked to answer and the tables we'll extract the data from, we must refine this picture. Because the company sells training in the form of classes, we must clearly describe what a class is. Let's use the term course to describe a particular curriculum, such as the System Administration for Microsoft SQL Server 7.0 course. The term class describes a specific event: a group of students and an instructor in a room on a specific day covering specific material. Next, identify the entities that work together to create the key performance indicators (KPIs). For example, if the KPIs are gross revenue and expenses, the dimensions that generate those facts might be the student, the instructor, the course, the location, and the date. Each of these entities is represented as a dimension table.

Step 4: Define the Grain

Next, we select the grain of the fact table. The grain specifies the level of detail to provide for each dimension, and you can use it to decide whether you can aggregate the data before loading it into the data warehouse. The art of determining grain is one of learning how the users view the data. If there is too much detail, the structure becomes so large that it is unwieldy. If the grain is too coarse, there isn't enough detail to see the underlying cause of trends and make informed decisions. Our rule of thumb is to go a little deeper than appears necessary. It is easier to roll up data to a higher level of aggregation later than to go back to the sources and extract more detail. Our grain is one course, one student, one instructor, one start date, and one location.

The Student dimension requires a bit more thought because it's possible to aggregate information about students to the company level. You could simply record how many students from a particular company attended a particular class and how much the company paid. This way of recording information would certainly save disk space; however, it would also obscure the answers to certain questions. For example, the managers want to know whether students continue to attend the courses when they leave one company and go to another. To reveal this information, the grain for Student is set to the individual student.
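
As a sketch of the kind of question this student-level grain makes answerable, the following query lists students who have attended classes under more than one company. It is written against the Student dimension described in Step 7 and assumes the load process adds a new dimension row when a student's company changes; that assumption, and the names used, are illustrative only:

-- Students whose registrations span more than one company
SELECT FirstName, LastName, COUNT(DISTINCT CompanyName) AS Companies
FROM DimStudent
GROUP BY FirstName, LastName
HAVING COUNT(DISTINCT CompanyName) > 1;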

The Instructor and Location dimensions both have a small number of distinct values, called members of the dimension. Because we are looking at revenue for the class, the instructor might not necessarily be a dimension. Instructors train the students after the students have paid for instruction; instructors don't directly contribute to revenue generation. But instructors do influence repeat students. Including the instructor as a dimension and linking him or her to revenue on repeat sales might be one way to evaluate instructor performance. The company has two primary locations, each with several classrooms. Also, many classes take place in other locations. It's possible to set the grain for Location to the individual room the class is held in. However, the location information would be irrelevant (and often unavailable) for any site other than the two primary locations. Because the company didn't ask us to retrieve this information, we set the grain of the Location dimension to the building in which the class is held.

The class starting date and ending date will be stored in two dimensions. It's possible to have a single time dimension and record a fact for each day of a class; however, the individual class days carry no distinct information of their own. Because the individual days of the class don't directly affect the facts (e.g., the company doesn't earn more or less on the second or third day), there is no need to increase the size of the structure with redundant data that won't enhance analysis. We also need to store a third date dimension, the registration date, which will help the marketing staff and the managers analyze the success of advertising campaigns.

You might have noticed that we haven't discussed a dimension to store information about the evaluations that students use to rate the instructors. The company collects these evaluations daily in some classes and only once for the whole class in others. The evaluation ratings are grouped into two areas: the instructor and everything else (e.g., environment, textbook). The question becomes, then, should evaluation ratings be stored as facts or as a dimension? In general, you include the information you know beforehand in the dimensions and record the information that you uncover through the business process in the fact table. The evaluation ratings are always uncovered during the delivery of the class, so we store them as part of the fact table.

If the evaluations are collected daily, we might want to consider storing each day of the class individually, which would let us analyze the course content for a particular day and determine which materials are more or less effective than others. To make that method work, we would also need to change the course dimension to break down the content presented each day. Because that information isn't currently tracked, we would need to enhance the source data structures to let instructors report which material they covered each day. Courses are broken into individual modules, so a time dimension grain of one day might not be effective, either. We could expand into start and stop hours. But this method would lead to an analysis of whether modules are more or less effective before or after lunch, and so on. As you can see, the problem of choosing the right grain rapidly escalates.

How do you determine the right grain in this case? You need to estimate the cost of collecting and storing the additional information. Then you need to project the potential cost benefits of knowing what material generates more student satisfaction. The bottom line is to use common sense. Although a high level of detail would provide some potential benefit if the information were scrupulously used to improve courseware, collecting and maintaining the information probably doesn’t offer enough benefit to justify the expense. Therefore, we’ll add two facts to our fact table—an aggregated instructor satisfaction score and an aggregated class environment score.
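
A minimal sketch of how those two scores could be derived during the load, assuming a hypothetical Evaluations table in the instructor-evaluation database with one row per response, a Category column that flags instructor questions, and a numeric Rating column (none of these names are confirmed by the source system):

-- One pair of averaged scores per student per class. AVG ignores the NULLs
-- produced by the CASE expressions, so each average covers only the
-- responses in its own category.
SELECT StudentID,
       ClassID,
       AVG(CASE WHEN Category = 'Instructor'
                THEN CAST(Rating AS decimal(4, 2)) END) AS InstructorScore,
       AVG(CASE WHEN Category <> 'Instructor'
                THEN CAST(Rating AS decimal(4, 2)) END) AS EnvironmentScore
FROM Evaluations
GROUP BY StudentID, ClassID;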

Step 5: Create a Star Schema

The star schema in Figure 2 has only seven dimensions, which might seem to be too few. However, this star schema is typical. If you find that you have too few dimensions (only two or three) or that you're creating a star schema with 15 or 20 dimensions, you might want to reconsider your design. It would generally be better either to create fewer, larger dimensions or to split the design into multiple star schemas. Whenever possible, design dimensions that other cubes can use; a well-planned customer dimension, for example, can serve many star-schema structures. Such shared dimensions are known as conformed dimensions. They help simplify analysis by letting you create smaller star schemas, each one focused on a single business question. Later, the conformed dimensions form a bridge that combines data from multiple cube structures into virtual cubes.
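
In Transact-SQL terms, the skeleton of the star in Figure 2 might look like the following sketch. The table and key names are assumptions for illustration, not the article's actual DDL; the measure columns are added in Step 6:

-- Fact table: one row per student per class, keyed by all seven dimensions.
-- Each *Key column is a foreign key to the corresponding dimension table
-- (the dimension tables themselves are sketched in Step 7).
CREATE TABLE FactRegistration (
  CourseKey           int NOT NULL,
  StudentKey          int NOT NULL,
  InstructorKey       int NOT NULL,
  LocationKey         int NOT NULL,
  StartDateKey        int NOT NULL,
  EndDateKey          int NOT NULL,
  RegistrationDateKey int NOT NULL,
  -- measure columns and the degenerate dimension are added in Step 6
  PRIMARY KEY (CourseKey, StudentKey, InstructorKey, LocationKey,
               StartDateKey, EndDateKey, RegistrationDateKey)
);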

Step 6: Specify Data Points

Now that we've made initial decisions about which dimensions we have and the grain of the fact table we'll use, we need to specify the data points to store in the fact table. The data points are the key performance indicators that are captured when a class event occurs (e.g., the student evaluation, the revenue). Note that most students take only a few of the many classes that the company offers; including a row in the fact table for a class that a student didn't attend usually isn't necessary, although you might find cases in which including such a row is desirable. Each row will then contain a foreign key that points to each dimension and additional columns for the data you collect. These additional columns are called measures.

We've already introduced some measures: the evaluation ratings. As we mentioned, you aggregate these ratings to store two measures for each student per class. When the OLAP cube is processed, the measures will be aggregated to match the hierarchies in the dimensions. Measures such as revenue are additive: You can roll up all the students in a company and determine the total revenue by adding the revenue for each student. Evaluation scores, however, are not additive; in this case, the aggregation function should be set to average. We store the average rating for Instructor and the average rating for Other in each row of the fact table. We also need to store the invoiced amount that each student is charged. We then set a flag in a binary measure to track, for example, whether a student stayed at a hotel, which adds to the student's overall cost. Finally, we store the student's reason for attending the class. This last value isn't numeric, so it's not a measure; instead, it's called a degenerate dimension. Degenerate dimensions contain only one column of data, so they don't need to be in their own distinct table. Thus, the fact table consists of a composite primary key built from a foreign-key value from each of the dimension tables, a column for each performance indicator (instructor rating, other rating, revenue, hotel stay), and a column for the degenerate dimension.
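
To make the additive/non-additive distinction concrete, here is a hedged sketch of the kind of roll-up the cube will perform, written against the fact-table and dimension names assumed in the earlier sketches (Revenue, InstructorRating, and OtherRating stand in for the measure columns described above):

-- Revenue rolls up by addition; evaluation scores roll up by averaging
SELECT s.CompanyName,
       c.CourseName,
       SUM(f.Revenue)          AS TotalRevenue,
       AVG(f.InstructorRating) AS AvgInstructorRating,
       AVG(f.OtherRating)      AS AvgOtherRating
FROM FactRegistration f
JOIN DimStudent s ON s.StudentKey = f.StudentKey
JOIN DimCourse  c ON c.CourseKey  = f.CourseKey
GROUP BY s.CompanyName, c.CourseName;

Note that averaging the stored per-student scores weights every student equally, regardless of how many evaluation questions each student answered.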

Step 7: Select the Columns

The final task is to select the columns to include in each dimension table. The Course dimension will include the course name, the Microsoft course number if one exists, a version indicator, the length of the course, and the maximum number of students. You could calculate the course's length from the start and end dates in the fact table. However, in a data-warehousing environment, it's almost always better to look up the course's length than to calculate it because you're working with so much data (in data warehouses, remember that you're exchanging space for speed). The Student dimension will contain first name, middle name, last name, city, ZIP code, state, country, and company name as columns. At first, the Instructor dimension will contain only the instructor name. We don't store the instructor as a degenerate dimension in the fact table, however, because we plan to store more information about instructors in the future, such as certification level and years of experience. The Location dimension will contain the building name, city, ZIP code, state, and country. The StartDate, EndDate, and RegistrationDate dimensions will each contain the full date in a single column, plus columns that break out the components of the date: day of the week, month, quarter, and year.
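
Pulling those column lists together, a Transact-SQL sketch of the dimension tables might look like this. The attribute columns come from the text; the surrogate key columns and the data types are assumptions:

CREATE TABLE DimCourse (
  CourseKey    int         NOT NULL PRIMARY KEY,
  CourseName   varchar(60) NOT NULL,
  MSCourseNo   varchar(10) NULL,   -- Microsoft course number, if any
  VersionInd   varchar(10) NULL,
  LengthInDays tinyint     NULL,
  MaxStudents  tinyint     NULL
);

CREATE TABLE DimStudent (
  StudentKey  int         NOT NULL PRIMARY KEY,
  FirstName   varchar(30) NULL,
  MiddleName  varchar(30) NULL,
  LastName    varchar(30) NULL,
  City        varchar(30) NULL,
  Zip         varchar(10) NULL,
  State       char(2)     NULL,
  Country     varchar(30) NULL,
  CompanyName varchar(60) NULL
);

CREATE TABLE DimInstructor (
  InstructorKey  int         NOT NULL PRIMARY KEY,
  InstructorName varchar(60) NOT NULL
);

CREATE TABLE DimLocation (
  LocationKey  int         NOT NULL PRIMARY KEY,
  BuildingName varchar(60) NULL,
  City         varchar(30) NULL,
  Zip          varchar(10) NULL,
  State        char(2)     NULL,
  Country      varchar(30) NULL
);

-- The StartDate, EndDate, and RegistrationDate dimensions can all use this
-- same structure, one table per role.
CREATE TABLE DimStartDate (
  DateKey    int        NOT NULL PRIMARY KEY,
  FullDate   datetime   NOT NULL,
  DayOfWeek  varchar(9) NULL,
  MonthName  varchar(9) NULL,
  CalQuarter tinyint    NULL,
  CalYear    smallint   NULL
);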

In these seven steps, we chose a business process and identified the objects relevant to the process. Then we modeled the interaction of these objects with a star schema, which required us to define the dimensions, the grain of the fact table, and the measures to store in the fact table. The next step is to create a Data Transformation Services (DTS) package to extract the data from the sources, manipulate the data, and populate the data warehouse. If you follow these steps, you’ll be on your way to creating a well-designed data warehouse.
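
As a small preview of that extraction step, the core of such a DTS package could be a query shaped like this sketch, which joins the registration and class source tables into the fact-table grain (it reuses the assumed column names from the Step 2 sketch and leaves out the dimension-key lookups a real load would perform):

-- One source row per student per class, ready to be keyed against the dimensions
SELECT r.StudentID,
       r.ClassID,
       c.StartDate,
       c.EndDate,
       r.RegistrationDate,
       r.InvoicedAmount AS Revenue,
       r.HotelStay
FROM meregis r
JOIN meclas c ON c.ClassID = r.ClassID;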
