This is the fourth article in a series of five explaining the data lakehouse. The first article introduced the data lakehouse and explored what is new and different about it. The second article explored the data lakehouse from the standpoint of cloud-native design, a radial break from classic data warehouse architecture. The third article explored the viability of the lakehouse -- and of data lakehouse architecture -- as replacements for the data warehouse. The final article assesses the differences (and, just as important, the surprising similarities) between the lakehouse and the platform-as-a-service (PaaS) data warehouse.
This article looks at the role of data modeling in the context of designing, maintaining and using the lakehouse. It assesses the claim that the lakehouse is a light alternative to the data warehouse.
Data Lakehouse vs. Data Warehouse: Death, Taxes and Data Modeling
In making the case for the lakehouse as a data warehouse replacement, proponents usually point to a few extra benefits. The first benefit is that the lakehouse is supposed to simplify data modeling, which (in turn) is supposed to simplify ETL/data engineering. A second claimed benefit goes to the reduced cost of managing and maintaining ETL code. A third claimed benefit is that the elimination of data modeling makes the lakehouse less apt to “break” -- that is, the routine alterations and vicissitudes of business, such as merger-and-acquisition activity, expansion into (or retreat from) new regions, and the launching of new services, do not break the lakehouse’s data model because there is no data model to break.
How a Bill Becomes a Law, or How Data Is, or Isn’t, Modeled for the Data Lakehouse
To understand what this means, let’s look at a best-case scenario for modeling in the data lakehouse.
- Data is ingested by the data lake’s landing zone.
- Some/all raw data is optionally persisted in a separate zone for archival storage.
- Raw data, or predefined extracts of raw data, are moved into one of the data lake’s staging zones. The lake may maintain separate staging zones for different types of users/practices.
- Raw OLTP data may undergo immediate data engineering (for example, scheduled batch ETL transformations) after which it gets loaded directly into the data lake’s curated zone.
- Data in the lake’s staging zones is made available to different kinds of jobs/expert users.
- A subset of data in the lake’s staging zones is engineered and moved into the curated zone.
- Data in the curated zone is lightly modeled -- for example, it is stored in an optimized columnar format.
- The data lakehouse is a modeling overlay (akin to a semantic model) that is superimposed over data in the lake’s curated zone or, optionally, over select data in its staging zones.
- Data in the lake’s curated zone is unmodeled. In the data lakehouse, per se, data modeling is instantiated in application- or use case-specific logical models, akin to denormalized views.
So, for example, rather than engineering data so it can be stored in and managed by a data warehouse (usually an RDBMS), data is lightly engineered -- put into a columnar format -- prior to its instantiation in the data lake’s curated zone. This is where the data lakehouse is supposed to take over.
How much data must be instantiated in the lakehouse’s curated zone?
The simple answer is as little or as much as you want. The pragmatic answer is it depends on the use cases, practices and consumers the data lakehouse is intended to support. The nuanced answer: Modeling data at the level of a historical data store (such as the warehouse or the lakehouse) has an essential strategic purpose, too.
Before we unpack this claim, let’s look at what happens to data once it gets loaded into the data lake’s curated zone. The data in the curated zone is usually persisted in a columnar format, such as Apache Parquet. This means, for example, that the volume of data that comprises the curated zone is distributed across hundreds, thousands, even millions of Parquet objects, all of which live in object storage. This is one reason the curated zone usually eschews a complex data model in favor of a flat or one-big-table (OBT) schema -- basically, a scheme in which all data is stored in a single denormalized table. (The idea is to maximize the advantages of object storage -- high-bandwidth and sustained throughput -- while minimizing the cost of its high and/or unpredictable latency.) A claimed benefit of this is that the flat-table or OBT schema eliminates the need for the logical data modeling that is usually performed in 3NF or Data Vault modeling, as well as the dimensional data modeling performed in Kimball-type data warehouse design. This is a significant timesaver, lakehouse proponents say.
But wait, isn’t this how data is also modeled in some data warehouse systems?
One problem with this is that data warehouse systems commonly run flat-table and OBT-type schemas, too. In fact, OBT schemas were used with the first data warehouse appliance systems in the early 2000s. Today, OBT schemas are commonly used with cloud PaaS data warehouses, such as Amazon Redshift and Snowflake. The upshot is that if you do not want to perform heavy-duty data modeling for the data warehouse, you do not have to. For good or ill, plenty of organizations opt not to model.
This gets at a more perplexing problem, however: Why do we model data for the warehouse in the first place? Why do data management practitioners place such great store in data modeling?
The reason is, like it or not, data modeling and engineering are bound up with the core priorities of data management, data governance and data reuse. We model data to better manage, govern and (a function of both) reuse it. In modeling and engineering data for the warehouse, we want to keep track of where the data came from and what has been done to it, when, and (not least) by whom or by what. (In fact, the ETL processes used to populate the data warehouse generate detailed technical metadata to this effect.) Similarly, we manage and govern data so that we can make it available to, and discoverable by, more and different types of consumers -- and, especially, by non-expert consumers.
To sum up, we model data so we can understand it, so we can impose order on it, and so we can productionize it in the form of managed, governed, reusable collections of data. This is why data management practitioners tend to be adamant about modeling data for the warehouse. As they see it, this emphasis on engineering and modeling data makes the warehouse suitable for a very wide range of potential applications, use cases and consumers. This is in contrast to alternatives that focus on engineering and modeling data for a semantic layer, or encapsulating data engineering and modeling logic in code itself. These alternatives tend to focus on specific applications, use cases and consumers.
The Unbearable Fragility of Data Modeling
A final problem is that the typical anti-data modeling frame is misleading. To eschew modeling at the data warehouse/lakehouse layer is to concentrate data modeling in another layer. You are still modeling and engineering data; you are just doing it in different contexts, such as in a semantic layer or in code itself. You still have code to maintain. You still have things that can (and will) break.
Imagine, for example, that a business treats Europe, the Middle East and Africa (EMEA) as a single region, then suddenly decides to create separate EU, ME and Africa divisions. Yes, a change of this kind will require it to make changes to its data warehouse’s data model. However, it will also affect the denormalized views instantiated in its semantic layer. At minimum, modelers and business subject-matter experts must refactor these views. They may also opt to rebuild some of them from scratch.
The essence of the claim is that it is easier, faster and cheaper to fix problems in a semantic layer or in code than to make changes to a central repository, be it a data warehouse or a data lakehouse. This claim is not wrong, exactly, just tendentious. At the very least, it arises out of a distorted sense of how and why data gets modeled, be it for the old-school data warehouse or for the data lakehouse.
Both sides have valid concerns and make good points. It is a question of balancing costs and benefits.
To assume that the lakehouse reduces or eliminates data modeling -- and, with it, the complexity of ETL engineering -- is to ignore the purpose of data modeling in data management. It is likewise to play a kind of shell game with ETL. As a friend who works as an ML solutions architect likes to remind me: “You can never escape the work of ETL; you can only ever push it somewhere else.”
It is never easy to accommodate business change. To change something about the business is to break the correspondence between a data model that represents events and phenomena in the business’s world and reality itself. This correspondence is never identical. At best, it serves a purpose: It makes it possible for the business to see, understand and manipulate its structure, operations, and so on.
It is probably easier, on the whole, to shift most data modeling logic to a BI/semantic layer. In the scenario above, for example, modelers and SMEs must design a new warehouse data model; repopulate the data warehouse; and fix and identify broken queries, stored procedures, UDFs, and so on. However, they must also fix the modeling logic that is instantiated in the BI/semantic layer. This is extra work, no doubt about it.
As I have shown, however, this is not a problem that is specific to the data warehouse. In fact, it has just as much salience for any organization that implements a data lakehouse system. The idea of a lightly modeled historical repository for business data is not new; ergo, if you want to eschew modeling for the data lakehouse or the data warehouse, that is an option -- and has been for quite some time.
On the other hand, an organization that opts to model data for its lakehouse should have less modeling to do in its BI/semantic layer. Perhaps much less. Yet, the data in this lakehouse should be lucid and understandable to, as well as reusable by, a larger pool of potential consumers.
For this reason, these consumers are also more likely to trust the data in the data lakehouse.
By the way, this is another case in which a less loosely coupled data lakehouse implementation (for example, Databricks’ Delta Lake or Dremio’s SQL Lakehouse Platform) has an advantage relative to what I call an “ideal” data lakehouse implementation -- that is, an implementation in which the data lakehouse is cobbled together out of loosely coupled services, such as a SQL query service, a data lake service or a cloud object storage service. It makes more sense to model and govern data in a tightly coupled data lakehouse implementation in which the data lakehouse has sovereign control over business data.
It is not clear how this is practicable in an implementation in which a SQL query service does not have sovereign control over the objects that live in the curated zone of the underlying data lake.