This article is the first in a five-part series on the data lakehouse. Its aim is to introduce the data lakehouse and to describe what is new and/or different about it. Subsequent articles will assess the strengths and weaknesses of data lakehouse architecture as a complement to the data warehouse.
The second article in this series explores this architecture as an attempt to adapt (or to retrofit) the core requirements of data warehouse architecture to accord with the priorities of cloud-native software design.
The third article assesses the viability of the lakehouse for typical data warehouse workloads.
The fourth article explores the use of data modeling with the data lakehouse. It evaluates the claim that the data lakehouse comprises a lightly modeled alternative to the conventional enterprise data warehouse.
The Data Lakehouse Explained
The term “lakehouse” is a pun, or portmanteau, combining the names of the two foundational technologies from which the concept itself derives: the data lake and the data warehouse.
At a high level, the data lakehouse consists of the following components:
- Data lakehouse
- Data lake
- Object storage
The data lakehouse describes a data warehouse-like service that runs against a data lake, which sits on top of an object storage service. These services are distributed in the sense that they are not consolidated into a single, monolithic application, as with a relational database. They are independent in the sense that they are loosely coupled -- that is, they expose well-documented interfaces that permit them to communicate and exchange data with one another. (Loose coupling is a foundational concept in distributed software architecture and a defining characteristic of cloud services and cloud-native design. The second article in this series explores the cloud-native underpinnings of the lakehouse.)
How Does the Data Lakehouse Work?
From the top to the bottom of the data lakehouse stack, each constituent service is more specialized than the service that sits “underneath” it.
- Data lakehouse: The titular data lakehouse is a highly specialized abstraction layer -- basically, a semantic layer -- that exposes data in the lake for operational reporting, ad hoc query, historical analysis, planning and forecasting, and other data warehousing workloads.
- Data lake: The data lake is a less specialized abstraction layer that schematizes and manages the objects contained in an underlying object storage service (e.g., AWS S3, Google Cloud Storage, Azure Blob Storage) and schedules operations to be performed on them. The data lake can efficiently ingest and store data of every type, including not only relational data (which it persists in a columnar object format), but also semi-structured (text, logs, documents) and multi-structured (files of any type) data.
- Object storage: As the foundation of the lakehouse stack, object storage comprises an even more basic abstraction layer: a performant and cost-effective means of provisioning and scaling storage on demand.
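The layering described above can be sketched in a few lines of code. The following is a minimal, purely illustrative model -- every class and method name here is hypothetical, not any vendor's API -- in which each layer talks only to the layer beneath it through a narrow interface, which is the loose coupling the architecture depends on:

```python
# Hypothetical sketch of the lakehouse stack: each layer communicates with
# the layer beneath it only through a narrow, well-documented interface.

class ObjectStore:
    """Foundation: a flat namespace of keyed objects."""
    def __init__(self):
        self._objects = {}
    def put(self, key, data):
        self._objects[key] = data
    def get(self, key):
        return self._objects[key]

class DataLake:
    """Middle layer: schematizes objects in the store and catalogs them."""
    def __init__(self, store):
        self.store = store
        self.catalog = {}  # table name -> (object key, schema)
    def ingest(self, table, key, rows, schema):
        self.store.put(key, rows)
        self.catalog[table] = (key, schema)
    def scan(self, table):
        key, _ = self.catalog[table]
        return self.store.get(key)

class Lakehouse:
    """Top layer: a semantic layer exposing warehouse-style queries."""
    def __init__(self, lake):
        self.lake = lake
    def query(self, table, predicate):
        return [row for row in self.lake.scan(table) if predicate(row)]

store = ObjectStore()
lake = DataLake(store)
house = Lakehouse(lake)
lake.ingest("sales", "curated/sales.parquet",
            [{"region": "EMEA", "amount": 100},
             {"region": "APAC", "amount": 250}],
            schema={"region": str, "amount": int})
print(house.query("sales", lambda r: r["amount"] > 150))
# [{'region': 'APAC', 'amount': 250}]
```

Because each layer holds only a reference to the one below it, any layer could in principle be swapped for an equivalent service -- the property the "ideal" implementation discussed below exploits.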
Again, for this to work, data lakehouse architecture must be loosely coupled, at least in comparison with the classic data warehouse, which depends on the RDBMS to provide most of its functions.
So, for example, several providers market cloud SQL query services that, when combined with cloud data lake and object storage services, can be used to create the data lakehouse. Think of this as the “ideal” data lakehouse -- ideal in the sense that it is a rigorous implementation of a formal, loosely coupled architectural design. The SQL query service runs against the data lake service, which sits on top of an object storage service. Subscribers instantiate prebuilt queries, views and data modeling logic in the SQL query service, which functions like a semantic layer. Voila: the data lakehouse.
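The idea of instantiating prebuilt views as a semantic layer can be illustrated with any SQL engine. In the sketch below, SQLite stands in for a cloud SQL query service and an ordinary table stands in for columnar objects in the lake; all table and view names are invented for the example:

```python
import sqlite3

# SQLite stands in for a cloud SQL query service; the raw table stands in
# for columnar objects in the data lake (all names are illustrative).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (id INTEGER, region TEXT, amount REAL)")
con.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                [(1, "EMEA", 100.0), (2, "EMEA", 50.0), (3, "APAC", 75.0)])

# The "semantic layer": a prebuilt view that encodes modeling logic once,
# so subscribers need not re-implement it in every downstream tool.
con.execute("""
    CREATE VIEW revenue_by_region AS
    SELECT region, SUM(amount) AS revenue
    FROM raw_orders GROUP BY region
""")

print(con.execute("SELECT * FROM revenue_by_region ORDER BY region").fetchall())
# [('APAC', 75.0), ('EMEA', 150.0)]
```

Consumers query the view, not the raw objects; the modeling logic lives in one place, in the query service itself.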
This implementation is distinct from the data lakehouse services that Databricks, Dremio and others market. These implementations are usually coupled to a specific data lake implementation, with the result that deploying the lakehouse means, in effect, deploying each vendor’s data lake service, too.
The formal rigor of an ideal data lakehouse implementation has one obvious benefit: It is notionally easier to replace one type of service (for example, a SQL query service) with an equivalent commercial or open source service. As we shall see, however, there are advantages to a less loosely coupled data lakehouse implementation, especially in connection with demanding data warehousing workloads.
What Is New And/or Different About the Data Lakehouse?
It all starts with the data lake. Again, the data lakehouse is a higher-level abstraction superimposed over the data in the lake. The lake usually comprises several zones, the names and purposes of which vary according to implementation. At a minimum, these consist of the following:
- one or more ingest or landing zones for data;
- one or more staging zones, in which experts work with and engineer data; and
- one or more "curated" zones, in which prepared/engineered data is made available for access.
In most cases, the data lake is home to all of an organization’s useful data. This data is already there. So, the data lakehouse begins with an innocuous idea: Why not query against this data where it lives?
It is in the curated zone of the data lake that the data lakehouse itself lives, although it is also able to access and query against data that is stored in the lake’s other zones. In this way, its proponents claim, the data lakehouse is able to support not only traditional data warehousing use cases, but also novel use cases such as data science and machine learning/artificial intelligence engineering.
The following is a mostly uncritical summary of the claimed advantages of the data lakehouse.
Subsequent articles in this series will explore and assess the validity of these claims:
More agile and less fragile than the data warehouse
Advocates argue that querying against data in the lake eliminates the multi-step process -- closely associated with extract, transform, load (ETL) software -- of moving data, engineering it, and moving it again prior to loading it into the warehouse. (In extract, load, transform [ELT], data is engineered in the warehouse itself, which obviates a second data movement operation.) With the data lakehouse, instead of modeling data twice -- first during the ETL phase, and second to design denormalized views for a semantic layer or to instantiate data modeling and data engineering logic in code -- experts need perform only the second modeling step.
The end result is less complicated (and less costly) ETL, as well as a less fragile data lakehouse.
Query against data in situ in the data lake
Proponents say that querying against the data lakehouse makes sense because all of an organization’s business-critical data is already there -- that is, in the data lake. Data gets vectored into the lake from sensors and other signalers, from cloud apps and services, from online transaction processing systems, from subscription feeds, and so on.
The weak claim is that the ability to query against data in the whole of the lake -- that is, its staging and/or non-curated zones -- can accelerate data delivery for time-sensitive use cases. A related claim is that it is useful to query against data in the lakehouse, even if an organization already has a data warehouse, at least for some time-sensitive use cases or practices.
The strong claim is that the lakehouse is a suitable replacement for the data warehouse. The third article in this series assesses the case for the lakehouse as a warehouse replacement.
Query against relational, semi-structured and multi-structured data
The data lakehouse sits atop the data lake, which ingests, stores and manages data of every type. Moreover, the lake’s curated zone need not be restricted solely to relational data: Organizations can store and model time series, graph, document and other types of data there. Even though this is possible with a data warehouse, lakehouse proponents concede, it is not usually cost-effective.
More rapidly provision data for time-sensitive use cases
Expert users -- say, scientists working on a clinical trial -- can access raw trial results in the data lake’s non-curated ingest zone, or in a special zone created for this purpose. This data is not provisioned for access by all users; only expert users who understand the clinical data are permitted to access and work with it. Again, this and similar scenarios are possible because the lake functions as a central hub for data collection, access and governance. The necessary data is already there, in the data lake’s raw or staging zones, “outside” the data lakehouse’s strictly governed zone. The organization is just giving a certain class of privileged experts early access to it.
Better support for DevOps and software engineering
Unlike the classic data warehouse, the lake and the lakehouse expose a variety of access APIs, in addition to a SQL query interface.
For example, instead of relying on ODBC/JDBC interfaces and/or ORM techniques to acquire and transform data from the lakehouse -- or using ETL software that mandates the use of its own tool-specific programming language and IDE design facility -- a software engineer can use her preferred dev tools and cloud services, so long as these are also supported by her team’s DevOps toolchain. The data lake/lakehouse, with its diversity of data exchange methods, its abundance of co-local compute services, and, not least, the access it affords to raw data, is arguably a better “player” in the DevOps universe than is the data warehouse. In theory, it supports a larger variety of use cases, practices and consumers -- especially expert users.
True, most RDBMSs, especially cloud PaaS RDBMSs, now support access via RESTful APIs and/or language-specific SDKs. This does not change the fact that some experts, particularly software engineers, are not -- at all -- enamored of the RDBMS.
Another consideration is that the data warehouse, especially, is a strictly governed repository. The data lakehouse imposes its own governance strictures, but the lake’s other zones can be less strictly governed. This makes the combination of the data lake + data lakehouse suitable for practices and use cases that require time-sensitive, raw, lightly prepared, etc., data (such as ML engineering).
Support more and different types of analytic practices
For expert users, the data lakehouse simplifies the task of accessing and working with raw or semi-/multi-structured data.
Data scientists, ML and AI engineers, and, not least, data engineers can put data into the lake, acquire data from it, and take advantage of its co-locality with an assortment of intra-cloud compute services to engineer data. Experts need not use SQL; rather, they can work with their preferred languages, libraries, services and tools (notebooks, editors and/or favorite CLI shells). They can also use their preferred conceptual vocabularies. So, for example, experts can build and work with data pipelines, as distinct from designing ETL jobs. In place of an ETL tool, they can use a tool such as Apache Airflow to schedule, orchestrate and monitor workflows.
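An orchestrator such as Airflow expresses a pipeline as a directed acyclic graph (DAG) of tasks. The scheduling idea itself can be sketched with the Python standard library's graphlib; the task names below are hypothetical, and Airflow's actual API (DAG objects and operators) is richer than this:

```python
from graphlib import TopologicalSorter

# A data pipeline as a DAG: each task maps to the set of tasks it depends on.
# Airflow expresses the same idea with DAG objects and operators; graphlib
# (stdlib) illustrates only the dependency-ordering logic.
pipeline = {
    "ingest_raw":      set(),
    "validate":        {"ingest_raw"},
    "enrich":          {"validate"},
    "publish_curated": {"enrich"},
    "train_model":     {"enrich"},  # independent branches may run in parallel
}

# static_order() yields a valid execution order: every task appears
# after all of its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

In a real orchestrator, each task would invoke a compute service co-located with the lake; the DAG abstraction is what lets engineers reason about pipelines rather than ETL jobs.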
It is impossible to disentangle the value and usefulness of the data lakehouse from that of the data lake. In theory, the combination of the two -- that is, the data lakehouse layered atop the underlying data lake -- outstrips the usefulness, flexibility and capabilities of the data warehouse. The discussion above sometimes refers separately to the data lake and to the data lakehouse. What is usually implied, however, is the co-locality of the data lakehouse with the data lake -- the “data lake/house,” if you like.
So much for the claimed advantages of the data lakehouse. The next article in this series makes the case that data lakehouse architecture comprises a radical break with classic data warehouse architecture.