Building a data warehouse isn’t a simple task and it shouldn’t be done by one person working alone. Because a data warehouse combines the best of business practices and information systems technology it requires the cooperation of both business and IT, continuously coordinating in order to align all the needs, requirements, tasks and deliverables of a successful data warehouse implementation. I’d like to share the approach I use when planning and managing any database project—this includes transactional databases, data warehouses, and hybrid databases. I live in the world of relational databases and data warehouses and the extraction, transformation, and loading (ETL) processes that support them, so I’ll focus my approach in this realm. However, you can extend this approach to the entire stack—OLAP cubes and information delivery applications such as reports, ad-hoc analysis, scorecards, and dashboards.
I’m not presuming how to tell a bona-fide project manager (PM) how to do his or her job. Rather, I’m writing this for DBAs and developers who don’t have the good fortune to be working with an experienced PM, or for those IT pros who have been summarily ordered to “build a data warehouse” and act as their own PM. My discussion won’t be complete, but I hope that it’ll give you enough information to start the project ball rolling.
A data warehouse project has three tracks—the data track, the technology track, and the application layer track, as illustrated in Figure 1. When you’re putting together any database project plan I recommend using these three tracks as a template to manage and synchronize your activities. You can also use Figure 1 as a high-level overview when explaining the plan to technical decision makers (TDMs), business decision makers (BDMs), and all other participants in the data warehouse project.
Using a Lifecyle Management Method
I encourage you to take advantage of the resources that your organization may offer, such as techniques and methodologies for designing, developing, and deploying systems and software. If your company hasn’t adopted any formalized methodology for doing this work, then go ahead and use the technique that I’ve developed for my own database projects, namely, the 7D Database Lifecycle Management Method™, familiarly known as the “7D Method™.”
My 7D Database Lifecycle Management Method™ addresses lifecycle management of the database, not the lifecycle of hardware or software (applications) that touch the database. I’ve included both hardware and software tracks in Figure 1, but I won’t be expanding on the management of either. It’s necessary to align and synchronize milestones in the database lifecycle with both hardware and applications in order to successfully implement the database lifecycle methodology.
You never truly finish building a data warehouse. Unlike a traditional database, which often remains relatively static for some extended period of time after deployment, a data warehouse is constantly in a state of flux, responding to changes in the business conditions for which it was created. Today’s business environment is more complex and deals with a faster rate of change than ever before in recorded history. Managing nearly constant change is one of the greatest challenges of the enterprise. That’s why it’s so important that everyone on the data warehouse team, BDMs and TDMs alike, should be on the same book and page, using the same type of lifecycle approach, so that they’re totally aligned in their thinking. Only by doing this is there any hope of aligning the implemented data warehouse with the vision and purpose of the enterprise. In Figure 1 I’ve laid out the seven steps of my 7D Method™, and I’ll walk you through each step in this article.
Step 1: Discover
I guarantee that any database project of any appreciable size and scope will fail if you don’t start with a discover stage. Also known as “requirements analysis and definition,” the Discover step requires a business-centric approach, especially in data warehousing projects, since the output from a data warehouse needs to support the organizational goals. The Discover step is essentially an investigation, and you should be constantly asking six basic questions (what, how, where, who, when, and why), recording the answers, and incorporating these answers into the solutions you craft.
In the first three steps (Discover, Design, and Develop) there must be concentrated coordination between thebusiness owners and the technology specialists; the PM should be enabling this process. As an independent professional who’s primarily concerned that the project stays on time, within budget, and works as promised, the PM is in charge of establishing critical paths, milestones, and success metrics, after getting feedback from all parties. If there’s no PM on the project these will be your tasks.
In the Discover stage the PM has to collect information for all three tracks shown in Figure 1, the technology track, the data track, and the application layer track. Among other tasks, the PM has to specify stakeholders and users, and must understand each of their roles and their data/visualization needs. The PM has to be aware of the organization’s performance management strategy: What are the objectives, initiatives and supporting metrics/KPIs used to track the health of the business and the project? If any of these portions of the strategy are missing, then there’s a high probability that the project will miss its mark with end-users, which would result in low adoption rates and no future funding. In other words, the project wouldfail, no matter how flawlessly the project tasks were executed.
Step 2: Design
The major activity of the Design step is the development of the semantic and schematic models that describe the data warehouse. These models need to address business users’ management information systems (MISs) and business intelligence (BI) analytical needs. For a data warehouse project you can create conceptual and logical data models for the relational data warehouse and dimensional models to represent the multi-dimensional cubes. You can use a decision matrix to help determine what needs to be included in each dimensional model; list key business processes that will be supported by the data warehouse along the Y axis and the proposed dimensions along the X axis. This matrix will serve as a guide for current development and for future extension and cross-organisational integration. The models you create for this Design stage have to reflect the answers to the six interrogatives you collected in Step 1.
It’s a good idea to identify all the data sources (internal and external), operational/transactional databases, and flat file structures that will be part of the data warehouse. You should also specify which data will be imported into the data warehouse and which will simply be referenced as an external data source.
Typically the Technology Track will have its own PM, but you might need to fill that role, also. A data warehouse can grow to be very large in content and very broad in scope, so it’s necessary to properly size a data warehouse before its deployment..Estimate the size on paper first, so you can get an idea of how much processing power and disk capacity you’ll need when you take the data warehouse into production. You’ll need to estimate the number of business end-users on day one and what kind of applications they’ll be using (for example, doing an ad-hoc analysis against cubes, or pulling cached reports from the relational data warehouse) and the amount of data that will be stored in year one. You’ll need a two-year and a five-year projection, because just as a data warehouse is a work in progress, so too its processing and data storage needs will continuously increase over time. Data warehouse installations include a variety of hardware, communication, and software solutions, all of which must work together to deliver to a working data warehouse to the end users. You’ll want plenty of time to plan and test how you’ll integrate all these different components.
Like the Technology Track, the Application Track may have its own PM or a lead software developer who fills that role. If you’re that fortunate, your job will be to coordinate with this person to synchronize tasks. If not, your job description just expanded. The Application layer includes output gleaned from the data warehouse, which is generally MIS reports and BI analysis results. MIS reports are the screen displays, dashboards, and printed copy that help business managers make the tactical decisions needed to run the day-to-day business. These outputs are relatively easy to define, codify, and capture into a set of standardized processes that can be run in a scheduled environment. The BI portion of the application layer is that set of queries and responses that help executive management make the strategic decisions that drive a business. The BI solutions are often unstructured and hard to pre-define since they tend to explore the data in an ad-hoc manner. Scoreboards, graphs, and pivot tables are examples of BI applications that stimulate additional data investigation which can result in strategic directional shifts within a company.
At this stage many methodologies call for a prototype or pilot project. The 7D Method™ does not. At most, as part of the design activities for the Application layer, you can do a “clickable model” – a quickly-rendered mockup of an input/output screen that involves no or extremely little code, but gives the stakeholders a visualization of the concept while not eating up precious time and resources. If a pilot or prototype is necessary, then choose a slice of the whole and walk the pilot through the steps of the 7D Method™. The 7D Method™ doesn’t distinguish between pilots, prototypes, and production systems – they’re all treated as projects.
If you engineer a prototype through the 7D Method™ and it ends up in production (as so many prototypes do), then you’ll have to choose the second slice even more carefully than the first. If the slices don’t successfully integrate with each other, if they don’t support the enterprise purpose and intent you uncovered during the Discover step, then you’ll be creating “stovepipes,” which are islands of information that integrate with each other only with difficulty, and in some cases, not at all.
Step 3: Develop
The Data Track Develop step has two major components: The first involves mapping the data models to their equivalent physical design counterparts (essentially, blueprints for the relational data warehouses and the OLAP cubes), sizing the database and partitioning the tables where necessary, setting up naming conventions for data warehouse objects that both business users and technology users can live with, and developing strategies for indexing and identifying index candidates. The second component involves developing schemes for the ETL of data from external sources into the data warehouse. Included in but not limited to this portion of the project is Data Transformation Services (DTS)/ SQL Server Integration Services (SSIS) package development and testing, import/export and T-SQL script development and testing, and data integration testing for those external data source components that won’t be imported into the data warehouse.
The Technology Track Develop step includes reviewing, testing and selecting products that will deliver their piece of the system architecture design. This has to be done for the various layers that make up the communications links—the physical layer; the data link and network layers; and the transport, session and presentation layers. While many products seamlessly fold multiple layers into a solution, it’s necessary to be aware of future load requirements and performance demands on each of these layers in turn, and to prepare for these demands in advance. You should select the data warehouse servers and storage solutions, and new, end-user-facing hardware that you’ll need to deliver data from the new data warehouse. You’ll do this for the production data warehouse and for the staging database—the place where the DTS/SSIS packages and T-SQL scripts will be executing, importing data from external data sources, and exporting the manipulated and massaged data into the relational data warehouses and OLAP cubes. Depending on the requirements gathered during the Discover stage, you may also be supporting data marts, snapshots, and reporting databases as part of your data warehouse environment, so be prepared to plan the environments for these, also.
The Application Track Develop step sounds simple: Just develop the end-user applications. However, this might be the most complex and time-consuming task in the entire process, and could be the most costly if success metrics aren’t carefully developed and respected. It’s in this stage that scope creep (continuously adding features and functionality without regard to the impact on design and development of the other two tracks) can torpedo the project. In addition to developing the end-user applications, you’ll also have to develop a scheme for testing these applications. You’ll need to create end-user training programs on how to use these applications. At every milestone you’ll have to ensure that you get sign-off from the relevant parties involved.
This may sound obvious, but it’s surprising how many development projects are staged and tested in a production environment. Don’t, just don’t do this! Create a separate physical environment for developing, testing, and staging the components. You do this for your line of business (transactional) systems—do the same for your BI/data warehouse.
Step 4: Deploy
You don’t deploy a data warehouse the same way you deploy a transactional database. Typically, you deploy a transactional database in a swift, all-encompassing fashion—Friday evening end-users are in the legacy system and Monday morning they’re logging into the new database. Data warehouses are usually deployed incrementally, to various groups of users throughout the enterprise. The pace at which this happens, and the order in which the data warehouse is made available to individual groups is part of the rollout plan contained within the Deploy stage.
Ideally, data warehouse deployment happens in rapidly-cascading tiers. First the technology is put into place—servers, storage, comm links, etc. Systems software is installed, tested and made ready for production. Then the data track components are rolled out—the data warehouse databases (relational and OLAP) are built, and the ETL processes are brought online. There’s usually a break in time before the final application layers are added, while you let the data flow from the external sources, through the ETL processes, and into the various data warehouse databases and cubes, testing and adjusting as needed. Then the application layer is deployed. You’ll probably want to plan to deploy the application layer gradually, because different audiences within the enterprise adopt at different rates.
As PM, you play a vital role. With your guidance and shepherding, the three tracks will arrive at the Deploy stage when they’re scheduled, avoiding weeks or months of “running behind schedule” angst. As soon as the Technology and the Data tracks are ready and tested and ready to go, start rolling out the application layer. A data warehouse without a user interface (UI) is of no use to anyone, and a data warehouse with an undersized and weakly-engineered systems architecture will perform so badly that it will never be adopted by the business users.
Step 5: Day to Day
Managing Day-to-Day operations is very important; this management is often overlooked in the planning and development process. Not only do you have to ensure that regular (daily, weekly, etc.) maintenance is performed, both at the hardware and software levels, but you have to continuously monitor performance and growth of all systems. As I said at the start, a data warehouse is never finished; it will continue to grow and expand as more users discover the value inherent in the data and create new and sometimes challenging ways to query the data warehouse. Some of the PM tasks you must be prepared to undertake include ensuring that all systems (hardware, communications links, systems software) are fully operational and up-to-date on patches and upgrades; when operational bottlenecks occur diagnosing and resolving them as quickly as possible; ensuring that all systems needing to be backed up do, in fact, have backup jobs defined and scheduled, and that all backups are tested by subsequently restoring to test, development, or reporting databases.
Businesses don’t stand still. They must constantly reinvent themselves in order to stay competitive. It’s the job of the data warehouse data administrator to track data use, to gauge the importance of the data, and to detect when business needs start to shift. As the business model changes, so will the need for newer, better, more flexible, possibly more complex user applications, and the data administrator should be congnizant of these requirements. In time there will be enough change in the direction and focus of the business to initiate a re-do of the Discover stage, and the lifecycle will have come full circle. Wash, rinse, repeat.
Step 6: Defend
Defending your data warehouse involves more than just taking regular backups or making sure that none of the applications include SQL queries that might be open to SQL injection attacks. You have to plan for the entire scope and breadth of defense because the data warehouse contains the enterprise’s most valuable asset—its data, in a compiled, cleansed, and (in some cases) informational format.
Threats to the data warehouse usually fall into two categories, physical and logical. Physical threats can be external (tornado, flood, fire, earthquake) or internal (intentional, accidental). You can protect against physical threats by employing practices as simple as restricting access to the computer and communications rooms, or as complicated (and expensive) as mirrored servers at a failover site that’s geographically distant. The steps you take for physical defense depend on your recovery time and recovery point objectives, that is, how much time you can afford to have your data warehouse offline, and how much data you afford to lose, respectively.
Logical threats are more complicated, simply by the nature of the data warehouse environment. The operating system might fail, the database management system might crash, one or more applications might intentionally or accidentally corrupt, destroy, or misinterpret data (this is especially true in the ETL processes that feed the data warehouse). Browser UIs have exposed embedded SQL calls to the threat of SQL injection attacks. Each potential threat has to be identified and addressed; it’s far better to develop remedies for threats before they occur, rather than after they happen. Your job as PM is to develop a comprehensive defense for your entire data warehouse installation. If you’re lucky enough to have a security administrator, leverage this person’s expertise and experience.
Step 7: Decommission
There may come a day when the data warehouse, or a component thereof (a staging database, a data mart, a reporting database, a cube) no longer meets requirements, and it’s time to decommission it. Not every database can be constantly refactored or upgraded to meet new requirements. Sometimes you simply have to scrap and rebuild, especially if the database instance was “built on spec”, that is, never properly architected to fully reflect the goals and intent of the enterprise. When this happens, as PM, you have to synchronize the process.
Typically, the Decommission step happens in one of three ways: Decommission with no replacement; decommission with cutover; and decommission with phase in/phase out. “Decommission with no replacement” means that the function the database used to perform is no longer needed. Not only is the database retired, the functions it used to perform are also retired. “Decommission with cutover” indicates that another database will replace the one being retired, and that the functions performed will be quickly transferred from the old database to the new. One day the users will be pointing to the old database, the next day they’ll be pointing to the new. “Decommission with phase in/phase out” indicates that the old and new databases will be run side-by-side for some period of time while functions and users are incrementally transferred from the old onto the new, until at last there are no users or functions running on the old database and it can be retired. Each scheme has its risks and rewards; as PM you have to determine when the risks outweighs the rewards and which scheme makes most sense for your situation. Then you have to plan and execute, working with others in the Technology Track and the Applications Track to ensure a seamless conversion.
The Virtuous Cycle
As you work with these data warehouse components there will be subsequent rounds of discovery, during which you’ll assess the new requirements that have evolved over time. This can happen as a result of the information gleaned from the data stored in the data warehouse. These new requirements might lead to enhanced designs and expanded solutions for one or more tracks. You’ll need to integrate the changes in the existing data warehouse so you can deploy the newer, better solutions to the business users eager to take advantage of them. Some new requirements might cause changes in the day-to-day operations in order to keep the data warehouse running like a well-oiled machine.
Over time, multiple iterations of this lifecycle process will cause the data warehouse to be ingrained into the fabric of the enterprise, until the data warehouse and the business become a seamless unit. The PM part of this puzzle ensures that all activities and tasks are done according to specification, are accepted according to the success metrics established, and deployed synchronously. Even if your data warehouse project has no formal PM, even if you’re the only person listed under “resource, human” on this project, you should still do some sort of PM plan and keep it up to date. Then, when management asks “how’s the data warehouse project going?” you can show them exactly what stage the project is at and what’s been done.