How Data Warehouse Automation Tools Do (and Don't) Ease Cloud Moves

In most scenarios, a data warehouse automation (DWA) tool will assist an organization as it moves an on-premises data warehouse system to a cloud service. Before that happens, however, an organization must standardize on the DWA tool as the primary means of designing, deploying and maintaining its data warehouse system. Because the DWA tool captures and manages all of the work that is performed on the data warehouse, it can bring this work forward to the new warehouse target platform. If this condition is met, the use of a DWA tool can simplify the migration process. This is an ideal on-premises-to-cloud migration scenario. However, if an organization has not already standardized on DWA, a DWA tool is less helpful--at least at first. In this less-than-ideal on-premises-to-cloud migration scenario, a DWA tool can still be helpful, albeit primarily in the context of re-engineering the existing data warehouse system for the new target environment.

Data Warehouse Automation and the Many Modes of Cloud

All IT people grok the distinction between the different types of cloud data warehouse services. To quickly recap: The IaaS data warehouse essentially replicates on-premises infrastructure in a virtualized cloud context; the PaaS data warehouse is a managed cloud service that exposes ease-of-getting-started and ease-of-use features. It likewise abstracts much of the complexity involved in configuring, managing, administering and, for a subset of tasks, maintaining the cloud data warehouse.

Almost all conventional on-premises data warehouse systems are also available as PaaS offerings. Several cloud providers also market PaaS-only data warehouses, the most visible of which are Amazon Redshift, Microsoft Azure Synapse Analytics and Snowflake. In most cases, even if a cloud data warehouse provider bills its service as a “software-as-a-service” offering, it is PaaS. An exception to this is a query-as-a-service (QaaS) platform such as Google BigQuery. Although BigQuery is not the only QaaS platform, it is the only one that is officially supported by commercial DWA tools.

Some quick points:

Generally, a multi-target DWA tool that supports one vendor’s on-premises data warehouse will support that vendor’s IaaS data warehouse. This is not always the case, however.
It does not always make sense to “forklift” a warehouse based on one vendor’s software to that same vendor’s software running in the cloud; see the last section of this article for more.
Most multi-target DWA vendors also support cloud PaaS solutions such as Redshift and Snowflake. Some also support Microsoft’s Azure Synapse Analytics service. Official support for the PaaS data warehouse continues to lag among commercial DWA tools. This is changing, however.

Ideal Migration Scenario: Prior Standardization on a Data Warehouse Automation Tool

The steps involved in using a DWA tool are mostly the same irrespective of “where” the target data warehouse lives. In other words, using a DWA tool to move an on-premises data warehouse system to the IaaS cloud is a lot like using a DWA tool to assist with an on-premises-to-on-premises migration.

PaaS data warehouse migrations can be more complicated, chiefly because DWA tools officially support only a few PaaS data warehouse environments--usually, Azure Synapse Analytics, Redshift and Snowflake. We will say more about this topic in the third article in this series.

Basically, the DWA tool regenerates the existing data warehouse--along with all its assets--for the new target platform. To understand what this means, think of this as analogous to using a compiler to recompile code so that it runs on a completely different instruction set architecture (ISA), to take advantage of the new instructions supported by a specific revision of an ISA.

In “regenerating” the warehouse, the DWA tool automates the following tasks:

Regenerate SQL code to create data structures optimized for the target platform
Move all data from the old to the new data warehouse target
Bring forward (updating and/or deprecating as necessary) all native objects
Port and schedule ETL jobs, data conditioning jobs and associated validation logic
Related: Regenerate and/or re-sequence ELT code for optimal parallel execution
Update, delete and/or create metadata and documentation
Test and validate steps 1-6 in the migration process

From the user’s perspective, the DWA tool automatically regenerates existing views, indexes, models, stored procedures, UDFs, metadata, documentation, and so on for the cloud data warehouse target. The DWA tool also brings forward all the data integration--that is, data engineering--pipelines that feed the warehouse. This gets at an important point: In DWA, data engineering is usually performed in the target data warehouse itself. (This extract, load, transform [ELT] technique centralizes the work of managing and maintaining data engineering logic, as well as scheduling data engineering jobs. It also simplifies migration to other targets.) The DWA tool schedules, tests and validates ETL jobs in the destination IaaS or PaaS data warehouse, too.

This is just to scratch the surface, of course. In a very real sense, once you have used a DWA tool to perform the steps described above, the real work of migration begins. See the final section, below, for more on DWA and the challenges posed by on-premises-to-cloud data warehouse migration.

Less-Than-Ideal Migration Scenario: New to DWA, No Prior Standardization on a Data Warehouse Automation Tool

If an organization has not standardized on a DWA tool as the primary means of designing, deploying and maintaining its data warehouse system, it cannot expect to use that tool to regenerate its existing data warehouse for the new cloud warehouse target. In this scenario, the DWA tool functions as a kind of RAD environment for data warehouse design, developing and testing.

For example, a DBA could load an existing logical data model into the DWA tool and use it to generate the SQL code that instantiates native structures (tables, primary and foreign keys, and so on) in the cloud warehouse target. The DBA could use the DWA tool to replicate data from the existing data warehouse to the new target. In addition, some DWA tools can identify and import assets (indexes, views and UDFs, among others) to assist with recreating the warehouse in the new cloud target.

(Note: Most PaaS data warehouse services expose migration tools that support some of these capabilities, especially in same-vendor to same-vendor migrations.)

Unfortunately, the data warehouse automation tool cannot be used to replicate the data engineering logic that loads fresh data into the data warehouse system--for the simple reasons that (a) the tool was not instrumental in creating this logic, so it is not used to manage and maintain it, and (b) this logic is not instantiated (for example, as ETL jobs) in the existing data warehouse system. Instead, it is executed, managed and maintained in a separate environment, such as an ELT platform or an operational data store.

So, can DWA be useful even if an organization does not already use it? The simple answer is that all migration efforts involve re-engineering and custom-fitting. A DWA tool can assist with these tasks, supporting activities such as exploring and remapping existing data engineering processes, or designing, testing and scheduling new ones; modeling data to improve performance and/or create business views of data; or creating, modifying and maintaining business rules.

No less important, a data warehouse automation tool provides a development environment for changing the data warehouse itself. It exposes features that aim to simplify maintenance tasks, such as adding new types of fact tables, adding new dimensions (such as columns) to a fact table and re-keying surrogate keys. It likewise generates and manages metadata (for example, automatically creating data lineage metadata) and documentation.

The main takeaway of this article is that data warehouse migration is not--and probably cannot be made--a turnkey process, no matter what cloud infrastructure providers promise. A DWA tool can assist with a portion of these non-turnkey bits. Moreover, a multi-target DWA tool helps put the organization on a more sustainable path going forward, not just by centralizing the management and maintenance of a data warehouse system, but by potentially simplifying migration to dissimilar data warehouse target platforms. The next section discusses these non-turn-key problems in more detail.

General-Purpose Cloud Migration Considerations

To repeat: This article is just scratching the surface of DWA in the context of on-premises-to-cloud data warehouse migration. However, whether you use a DWA tool or not, you should be mindful of a few common problems that tend to obtrude during on-premises-to-cloud migration. If you do use a DWA tool, the tool itself provides built-in features that assist in troubleshooting these problems—for example, detecting incompatible structures (such as indices or constraints) and recreating compatible structures, if practicable. A data warehouse automation tool cannot automatically correct all of these problems, however.

Here are some things to consider in the context of general-purpose cloud migration:

Data engineering is not necessarily portable. Pay careful attention to the performance and reliability of your ETL jobs. Most data engineering jobs were designed with access to on-premises data sources in mind. The problem is that shifting the warehouse (also the site of ELT processing) to the cloud introduces new constraints. At a minimum, organizations should expect to have to redesign some (perhaps a majority of) ELT jobs to troubleshoot performance and reliability issues. Most interventions will require diagnosing problems and designing/testing new data flow, error control, data validation, etc. logic.

Mind your models. Organizations may opt to modify their existing warehouse data models to exploit the features (or quirks) of a cloud data warehouse. A data model that’s designed for a specific platform (for example, Teradata) may not perform the same in a dissimilar environment; moreover, virtualized cloud infrastructure has a tendency to exacerbate these performance asymmetries. In some cases, too, organizations may opt to break up large enterprise data warehouses (for example, into several subject-specific data marts) to move them to the cloud.

Less may be more. Maybe. Anecdotally, some PaaS data warehouse services tend to perform better using relatively simple data models, as with denormalized one-big-table (OBT) schemas. This is a big change, however: Software developers, data engineers, BI developers, and so on must instantiate modeling logic in queries that fetch data from the warehouse, or in a semantic layer that sits atop the warehouse and facilitates access for apps and services. (This is a feature, not a bug, of QaaS platforms such as BigQuery.) To cite just one example, implementing a change like this means redesigning applications, services and workflows to incorporate new modeling logic, or refactoring these assets to query against a (new?) semantic layer. One way or another, in one context or another, data must be modeled.

Test. Test. Test. Test. Test your ETL using your own data from your own upstream sources/producers. Test your most important (even if infrequent) queries against your own data in the new target warehouse environment. Test the performance of real-world user-initiated ad hoc queries, as well as of queries that are initiated via operational applications and their workflows. Test the behavior of apps or workflows that initiate multiple dependent queries. Depending on the complexity of your queries and the performance characteristics of the cloud data warehouse service you plan to use, you may need to explore different kinds of workarounds to ensure acceptable query response times or to prioritize queries for certain users/use cases. You may need to refactor application workflows, too.

Don't be constrained. Test the ability of the new data warehouse to accommodate and enforce different types of constraints. Some PaaS data warehouses (such as Snowflake) do not enforce most ANSI SQL constraints--viz., UNIQUE, PRIMARY KEY and FOREIGN KEY--even if they do support them for data modeling purposes. They likewise do not enforce table, column or in-/out-of-line constraints. This means you must employ workarounds (for example, the NOT NULL constraint, which Snowflake does support) in cases where you either use or might expect to use constraints. A DWA tool should identify and correct these issues.

Don't just forklift the old into the new. Scaling the data warehouse-in-the-cloud is easier--in the PaaS cloud, this is certainly the case--although it also poses challenges of its own. In the case of poorly performing workloads, it is usually possible to provision sufficient cloud resources to equal or surpass performance in the on-premises environment. However, this is inefficient--and, ultimately, costly. And even if cloud PaaS and QaaS providers purport to automate some or all of the nitty gritty involved in tuning and optimizing the data-warehouse-in-the-cloud, you should not reflexively relinquish responsibility for these tasks. Whether it is modifying or radically redesigning the warehouse's data model, breaking up a complex data model into discrete subject areas, creating different kinds of indexes (or clustering keys, for databases, such as Snowflake, that do not support indices), pre-calculating aggregates, and so on, focus on engineering the warehouse for the new cloud environment in which it lives.

Conclusion

Strive to balance ease of maintenance against optimal responsiveness, while focusing on keeping costs down. (Remember, the cardinal rule of cloud infrastructure is that it is not expensive … until it is.) This is not necessarily an insoluble three-body problem. Moreover, this is precisely the kind of problem data warehouse automation tools evolved to address. The end result will be a cloud warehouse that performs better, is easier to maintain and is less costly to operate.

Comments

Plain text