Discover the Star Schema

Author's Note: If you've been reading SQL Server Magazine since the beginning, you know that my column has focused mainly on design fundamentals for transactional databases. This month, I switch gears to cover design fundamentals for data warehouses and data warehousing. In subsequent columns, I'll cover topics such as dimensional table physical design, fact table physical design, fact table partitioning design, summary table design, and slowly changing dimension design. In addition, I'll discuss indexing techniques that are optimized for use with data warehouses and partitioned tables, and I'll look at how to gather requirements for data warehouse design, which can differ from the requirements-gathering techniques used for transactional database design.

The data warehouse is one of the foundational structures of a business intelligence (BI) solution. Like transactional databases, data warehouses require a schema design. The most basic schema design for a data warehouse is a star schema. If you want to create multidimensional cubes for BI analysis, using a star schema for the data warehouse is a good solution. In this article, I explain why the star schema is preferable over other schema designs, and I use an example star schema to illustrate this design's benefits. (To learn more about BI and data warehouses, see "Data Warehousing: The Foundation of BI". For more information about data warehouses and star schemas, see the Learning Path.)

Several reasons exist for using a star schema rather than a conventional normalized design. First, you must use a star schema if you want to build and use OLAP cubes. The cube dimensions are the axes of analysis—the "by" items (e.g., by time period, by product line, by region). The fact table defines the cube and its purpose; you analyze the facts by or through the different dimensions. And, perhaps most importantly, the star schema provides fast response time when implemented as an OLAP cube.

Another reason for using a star schema for data warehousing is that the star schema parallels the way that people tend to think about and use data. No one except data modelers, DBAs, and some database programmers think of data the way it's structured in a transactional database. The star schema, when implemented as an OLAP cube, lets both developers and end users more easily understand and navigate the metadata. In addition, you can modify and build upon a star schema as your organization's BI needs expand. Unlike with conventional transactional database schemas, you don't have to worry about storing a non-key attribute only one time. And last but not least, the star schema broadens your choice of front-end BI tools because some tools work only against OLAP cubes.

Figure 1 shows an example of a star schema that's modeled after the AdventureWorksDW sample data warehouse that ships with SQL Server 2005. The schema features only one fact table, Reseller_Sales. A fact table is a collection of keys and measures. The keys relate each row in the fact table to an associated row in a dimension table. As in a transactional database schema design, the primary key of the Product dimension becomes a foreign key in the Reseller_ Sales fact table. The measures (effectively, anything that's not a key column) are the operational data that the statisticians have been waiting for and are all nicely packaged and ready for analysis.

The example star schema in Figure 1 is meant to support decision-making and BI software tools. If you implemented the schema, it would be populated from comparable tables and columns in the transactional version of the AdventureWorks database.

The schema dimensions (i.e., Time, Product, Reseller, and Sales_Territory) can be mapped to tables or views in the AdventureWorks database, which facilitates transferring the operational data into the data warehouse and ultimately into the Reseller_Sales cube. Each dimension is an axis of analysis in the cube, so you could analyze the data in the cube by month, by region, or by business type.

Notice the simple dimensional hierarchy in this schema—from Product_Category to Product_Subcategory to Product. This structure reduces redundancy and makes the star schema a snowflake schema—albeit a rather lopsided snowflake in this case. You can add as many dimensions as necessary to the star/snowflake schema. You can also implement more complicated structures, such as a geography dimension that is parent to both Reseller and Sales_Territory—but I'll save that discussion for when I cover dimensional table design.

The star schema design that Figure 1 shows has several notable characteristics:

Every table in the star schema has an identity primary key, which prevents quibbling about natural primary keys versus surrogate primary keys. In a data warehouse, the primary key assignment goes to the surrogate; if you have a natural key that you want to retain for querying, you designate it as an alternate key.
Most of the columns in both the dimension tables and the fact table are nullable. Only the primary and alternate keys are mandatory, and only the primary key is unique. You need to assume that data will be loaded into this data warehouse structure from various sources, even from multiple generations of operational databases. Therefore, the constraints that you would normally apply to enforce business rules in the transactional database must be relaxed in the data warehouse. Most of the columns must be nullable because you might not have data for them.
Relationships are optional on the parent side (i.e., zero-to-one to zero-to-many). If you have no control over the source data's referential integrity, you must allow for orphan records in the fact table and in lower levels of a dimensional hierarchy.
Most columns in the fact table (i.e., the measures) are numbers. The fact table is the focus of a BI investigation, and BI analysts are looking for numbers and facts.
Data redundancy is rampant throughout the star schema design; this redundancy is necessary for the data warehouse to reach an acceptable level of performance. The amount of data in a data warehouse is typically enormous compared with the amount of data in a transactional database. When you write a T-SQL query against the data warehouse's star schema, the redundancy minimizes the number of joins required to return the data, yielding much better performance than if you issued the same query against the source transactional database.

Schema design for a data warehouse need not be much different than schema design for a conventional transactional database. Because the data warehouse is an historical archive, you can retain some semblance of normalcy in the data warehouse schema design. In addition, you can create summary tables or columns, and you'll certainly want to add timestamps and identity values to individual records in each of a relational data warehouse's tables. A benefit of retaining a relational or near-relational data warehouse is that the warehouse is a readily available reporting database that you can use conventional T-SQL to query. With no programming learning curve and no special tool requirements, the data warehouse is an immediate ROI.

LEARNING PATH

SQL SERVER MAGAZINE RESOURCES:
"OLAP: Resistance Is Futile!"
"Solutions by Design: Adding a Questionnaire to a Design"
"Step-by-Step Data Warehousing"

Comments

Plain text