Data warehouses are evolving from the traditional information archive that supports business intelligence (BI) users and reporting tools, becoming instead true analytical repositories that support mission-critical functions of the enterprise. My definition of mission-critical is any system that supports revenue generation or cost control. Should such a system fail, users would have to manually perform these formerly automated tasks to prevent loss of revenue or increased cost. In light of the mission-critical nature of the data warehouse, it’s necessary to optimize the design for maximum performance. You begin this optimization effort by identifying what type of workload your warehouse is dealing with.
In the study “Magic Quadrant for Data Warehouse Database Management Systems, 2007,” the Gartner Group identified six distinct workload types that typify the way organizations use their data warehouses. The following are descriptions of the workloads:
1. Continuous (near real-time) data loading for dynamic data elements. This workload creates an environment that’s very similar to an OLTP database workload because indexes and other optimization structures have to be continuously updated. Continuous updating can have a profound effect on summary and aggregate data, which is used to support dashboards and prebuilt reports.
2. Batch data loading, as organizations realize that not all data needs to be refreshed continuously.
3. Warehouses that support reporting systems and large numbers of standard, prebuilt reports (thousands per day), which require indexing and tuning for specific SQL queries, new types of storage partitioning, and other types of optimization structures in the data warehouse.
4. Tactical business analytics (as opposed to the traditional strategic uses that data warehouses were once limited to) in which business professionals with little or no programming experience use analytical applications to drill down and drill through multidimensional objects. Business users rely on BI architects to build commonly used cubes for their work.
5. Warehouses that support users of true ad hoc queries (i.e., data miners), who use the data in random, unpredictable ways, making it impossible to optimize and tune their queries.
6. Analytic and BI-oriented functions in conventional OLTP databases (a re-emerging trend).
In my experience, there are also five distinct data warehouse use cases (i.e., stereotypes of real conditions in which data warehouses are used). Each use case includes one or more of the Gartner Group’s workloads and has specific infrastructure needs and requirements. By identifying which use case most closely matches your data warehouse, you can better tune for performance. On the downside, if your data warehouse is trying to be all things to all people, you might consider splitting the data warehouse to optimize for the desired functionality. Here are my five data warehouse use cases and the workloads they typically support. Once you’ve identified your data warehouse among them, you can then look at optimizing it.
The Traditional Data Warehouse
The type of structure you think of when you hear the name “Bill Inmon” is the traditional data warehouse. (For more information about Inmon, see the webexclusive sidebar “Who is Bill Inmon?”) It usually contains very large data volumes with daily or weekly data refreshes. The scope and breadth of data might be very wide and deep because this data warehouse typically serves a very large audience. This type of data warehouse is what most organizations start with or migrate to after their proof-of-concept data mart project. It’s the conventional data store used for BI and will support vast reporting systems. Its content can help the management team determine answers to strategic questions such as “Are we doing the best job positioning ourselves in the marketplace?” However, if your users need to use the data warehouse for tactical or data mining purposes, I recommend spinning off individual data marts designed for those purposes. Workloads 2 and 3 are consistent with what you’d expect to find in a traditional data warehouse.
The Data Mart
The data mart is typically characterized by its size and focus. Data marts tend to house less data than data warehouses and to be narrow in scope. They hold data for a particular department or function of a company. Data refreshes and retention are based on the needs of that department or function. It’s not unusual to have both real-time feeds and daily or weekly refreshes in a data mart because key decisions will be made based on the most up-to-date data possible. Data marts are small and include any and all kinds of data that’s needed to service the data mart consumers, so they’re ideal for sourcing dashboards and for supporting analytics and data mining because the data can be processed quickly. You might find workloads 1, 2, 3, 4, and 5 associated with a data mart.
The Real-Time Data Warehouse
A real-time data warehouse has continuous data feeds and can be used for near-real-time reporting. The development and popularity of this type of data warehouse has been driven by organizations’ increased desire to have the most current, integrated information possible at their fingertips. Some of the problems you might encounter in a real-time data warehouse include constant resource contention between real-time feeds and (often ad hoc) queries being made against the same set of data objects, daily or even hourly increases in data storage, purging of unneeded data, and a community of users that might be very broad or very narrow, depending on the type of data contained within the data warehouse. Workload 1 is often related to a realtime data warehouse.
The Historical Data Warehouse
In the past couple of decades, the government has mandated that many types of businesses have to keep large amounts of historical information (financial information, especially) for compliance reasons. These mandates have caused an increase in the popularity of the historical data warehouse. This type of data warehouse contains data volumes that are many times larger than that of traditional data warehouses, requiring you to pay close attention to storage structures and partitioning. Data refreshes are semi-frequent (i.e., weekly or even monthly), and there’s usually less query traffic than with traditional data warehouses, data marts, or real-time data warehouses. Queries that are run against historical data warehouses are usually well-defined, which means you can optimize them for performance. Workloads 2 and 3 are consistent with what you might find in a historical data warehouse.
The Analytic OLTP Warehouse
An analytic OLTP warehouse is one that has been created to be all things to all people. In the old days—when hardware was more expensive than a programmer’s time—one database was used for both rapid transactional updates and resource-intensive analytical queries. Mixing these two workload types slowed applications down and made them non responsive. According to the Gartner Group, there seems to be a resurgence of this type of architecture, fueled by cheap, powerful hardware and lack of experience on the part of both DBAs and business decision-makers as data warehousing works its way down the economic ladder into the small-to-midsized business (SMB) market. This architecture always has been and always will be a worst practice because it’s neither scalable nor extendable. Workload 6 is often associated with the analytic OLTP warehouse.
Workload Mix and Match
You need to be aware of what types of workloads your data warehouse is supporting. According to the Gartner Group, “through 2010, mixed workload performance will remain the single most important performance issue in data warehousing.” As a direct result of the need for real-time data loading and automated transactions (e.g., sourcing dashboards), “the transactional DBMSs may be able to erode the performance edge that was formerly attributed to specialized data warehouse DBMS solutions.” In other words, for certain applications, such as dashboards, your best datasource might be one or more transactional databases, not your data warehouse.
Data warehouses are growing in size and complexity as more businesses develop and use them. The mix of workloads in a single data warehouse is creating more problems than the size of the data warehouse because one type of workload will have different resource needs than other types of workloads. Optimizing for these various and contradictory requirements might not be possible; if you can’t split the workloads into separate data warehouses or data marts, you might have to settle for “good enough” or less. By becoming aware of what types of data warehouses you have and what types of workloads they perform, you can determine how to optimize them.