It’s often said that the best way to learn is from your mistakes, but mistakes made in a data warehouse/business intelligence (BI) environment tend to be very expensive and possibly career-killers. Building a data warehouse isn’t a simple task; it takes a village to build one out of which you can extract viable information.
A data warehouse and the operations that build and maintain it are a combination of business considerations and design best practices. It can be a huge challenge to balance these requirements and still be able to deliver valuable content to users at the end of the day. Make no mistake—not everyone is suited to work in a data warehouse environment, but the work is stimulating and the rewards are satisfying. Here are some of the lessons I’ve learned from designing data warehouses and BI environments.
Get Buy-in from Business and Technical Groups
When designing a data warehouse, business alignment must come first because if the product that’s delivered at the end of the day doesn’t reflect the goals and purposes of the enterprise or meet users’ needs, then the data warehouse is a failure. To give your data warehouse/BI project any chance of success you need to get buy-in from the business and technical sides of your organization, from the top-level executives to the front-line worker bees. Don’t believe that “if you build it they will come.” An IT-driven, IT-sponsored, IT-centric data warehouse/BI project is leaving out the most important facet of a BI solution—alignment to business needs and requirements. Involving business decision makers throughout the project ensures that the data warehouse will deliver meaningful, useful information to business users.
At the beginning of your data warehouse project, establish a steering committee that’s composed of both business decision makers and technical decision makers, and consider establishing a BI competency group that reports to a non-technical C-level manager, such as the COO or CFO. Use these people as a resource throughout the lifetime of the project and even once the system is in production.
Maintain Data Integrity
When designing your data warehouse environment, it’s important to minimize data replication in the table design (by using conformed dimensions, for example) and data inconsistency by vetting source data and adhering to Master Data Management (MDM) standards. (For more information about MDM, see "Master Data Management," January 2007, or "Master Data Management Challenges.")
Poor data quality is one of the biggest hurdles when it comes to data warehouse adoption rates. You can do the following to mitigate poor data quality:
- Identify which business units own specific data and are responsible for its quality.
- Appoint a data steward in each business unit and include “data quality” in the job description.
- Make sure your extraction, transformation, and loading (ETL) routines are scouring for data quality problems and either correcting the data before it’s loaded into the warehouse or blocking it entirely.
- Audit and verify data as necessary by setting up back-end routines that periodically reconcile warehouse data with source data, taking into account any inaccuracies that are inherent to the source data store.
- Determine what level of data quality is considered to be “good enough” by involving the business units and finding out what their data analysis needs are. A business unit that’s using only highly summarized data might not need data to be cleansed to the same level as a unit that’s drilling down into the detail level.
Provide User-Friendly Interfaces
Strive for user-friendly interfaces because a data warehouse without users is a failure, and a BI system that’s too complicated to learn and incorporate into everyday use is a disaster. Providing highly summarized visual information to the business decision makers is one of the things that every data warehouse/BI system must be able to do. Dashboards, which are highly visual real-time reflections of operational activities, and balanced scorecards, which are non-real-time reports that blend the operational, marketing, developmental, and financial aspects of an organization, are both highly summarized and symbolic of a data warehouse. However, without an overarching strategy, dashboards and balanced scorecards can easily become flashy facades for silo-specific “islands of influence” that aren’t connected to enterprise objectives.
To avoid these islands of influence, get your strategy map in place. A strategy map links the long-term goals and objectives of an enterprise with its operational activities, illustrating the cause-and-effect relationship between different Key Performance Indicators (KPIs) that you’ll see on a balanced scorecard and possibly on dashboards. Then, make reports as visual as possible—a picture is truly worth 1000 words.
Reduce Operational Costs
It’s important to make your data warehouse environment easy and cost-effective to support and maintain. Design implementation costs can quickly get out of control. It’s not unusual to have an 80 percent/20 percent division in cost between the back-end ETL and table building processes and the front-end applications and reports. Reduce design implementation costs by creating a set of reusable designs and repeatable ETL processes so that you don’t have to start each data warehouse/BI project with a clean slate. Simplify the design by conforming dimensions and standardizing the granularity of the fact tables wherever possible; do not over-engineer the tables. Overly complex designs can quickly drive costs to the point where the entire project fails.
Make Your Data Warehouse Scalable
Your business is going to change over time, so you need to make sure your data warehouse/BI environment can, too. According to the Gartner Group (Gartner RAS Core Research Note G00161231, James Richardson & Bill Hostmann, 12 September 2008), during the first year of a data warehouse/BI implementation, you should expect users to request changes that will affect 35 percent to 50 percent of the application functions. Anticipating and managing change will be one of your biggest challenges. A good change-enhancement request system should be at the top of your wish-list so that you can better manage the changes that are inevitable with a data warehouse/BI implementation.
In addition, hardware and software costs need to be balanced with scalability. The enterprise will change and so will your data warehouse/BI environment. If the data warehouse isn’t a core competency or the primary line of business, then you should seriously consider running your data warehouse in the cloud. At the time of this writing, Microsoft doesn’t have a viable cloud computing solution for large systems, but there are other options available, including GoGrid, Amazon/Pentaho, and VMware’s newly-released vSphere 4 cloud OS. With the right choice of cloud vendor, you can end up hiring expertise in business continuity, system availability and security, data storage and archiving, and scaling the data warehouse/BI environment as needed, all for a pay-as-you-go price that generally equates to one or two magnitudes decrease in implementation and maintenance costs compared with doing it yourself. Your time-to-terabyte will be measured in days rather than weeks or months.
Comply with Internal and External Standards
Having an enterprise overview of how and where data is used makes figuring out if you’re in compliance much easier. Compliance is another instance in which MDM becomes a “must-have” component in your organization. Although complying with both internal and external standards seems like a big headache prior to a security breach or event, you’ll be glad you did in the event of an attack. Every organization has de-centralized data management, so I recommend using compliance regulations such as Sarbanes-Oxley to level the playing field. An outside influencer can often do more to predicate cooperation than internal directives. Measuring compliance can be a tough and time-consuming job, so you might want to look into risk management software such as RiskWatch, which was originally built to track HIPAA compliance but now includes features that evaluate total enterprise risk.
Data Warehouse Design Tips
It’s not possible to describe or even list all the technical design issues that one could encounter when building a data warehouse. However, the following are some data warehouse design tips that can help you minimize problems in your data warehouse:
- Don’t confuse measures with context. Measurements (measures) are usually numeric, can be aggregated, and are stored in fact tables, whereas context is descriptive and open-ended. Divide the context into logical groupings such as customers, store locations, and time, and store them as dimensions. (For more information about measures, see "Data Warehousing: Measuring the Facts," September 2007.)
- Define the grain of the fact table. To do so, start by determining exactly what a fact table represents: For example, is it an individual insurance policy transaction or a weekly inventory count? Define the grain of a fact table in terms of a clear business objective and a set of business rules. Don’t be afraid to store too fine a level of detail if you need to because although you can always aggregate the details, you can’t extrapolate details from summarized data.
- Don’t snowflake when a simple star schema will do. The purpose of data warehouse design isn’t to normalize the data (as you attempt to do when you snowflake) but rather to organize the data in such a way to be intuitively meaningful to end users. That often means you need to de-normalize the dimensions. (You can learn more about dimensions by reading "Data Warehousing: Dimension Basics," October 2007.)
- Partition fact tables on a date key. Partitioning fact tables on a date key makes data maintenance much easier. If you need to delete or archive historical data you can “retire” the oldest partitions. Index maintenance is faster if the data is partitioned, as is data retrieval because SQL Server has to look only in relevant partitions for data. In addition, always align the partitioned indexes by using the same partition schema that you used to partition the base table. If you partitioned the base table on a date column, then index the date column and use the same partition scheme for the index. Keep empty partitions at both ends of the table so that it’s easy to add and remove partitions, and choose the partition grain carefully because there’s a maximum of 1000 partitions in SQL Server 2008 and 2005. (To find out more about partitioning fact tables, see “Data Warehousing: Horizontally Partitioning the Fact Table,” April 2008.)
- Cluster fact tables on a date key. Doing so lets you efficiently retrieve historical slices of data and supports queries that are used to populate cubes—most of which have a major time component. (For more information, see “Indexing the Data Warehouse,” August 2008.)
Query the star schema using WHERE on the partitioning key. For instance, if the fact table (Sales) has been partitioned by date (dateKEY), your query would be look like
SELECT… WHERE Sales.dateKEY between ‘2009-01-01’ and ‘2009-01-31’
Because SQL Server 2008 and 2005 Developer Edition and Enterprise Edition are partition-aware, you’ll want to use literals, not variables, to point the query analyzer to the correct partition and to optimize SQL Server 2005’s idiosyncrasies regarding multi-partition query scans.
Data Warehouse Design Best Practices
There are so many lessons still to be learned about designing data warehouses. From these lessons we glean both business and technical best practices. As you discover new, and possibly better, ways to perform these tasks, spread the wealth by sharing your solutions with your peers.