This is the third article in a five-part series that explores the advantages and disadvantages of the data warehouse -- both as its own thing and as a replacement for the data warehouse. This article considers the requirements the data lakehouse must address if it is to completely replace the warehouse.
The first article introduces the data warehouse and explored what is new and different about it. The second article explores data lakehouse architecture as an attempt to adapt (or to retrofit) the core requirements of data warehouse architecture to accord with the priorities of cloud-native software design. The fourth article looks at the role of data modeling in designing, maintaining and using the lakehouse. The final article assesses the differences and, just as important, the similarities between the lakehouse and the platform-as-a-service (PaaS) data warehouse.
A Quick Recap of Data Lakehouse Architecture
Ideal data lakehouse architecture is loosely coupled in precisely the way that classic data warehouse architecture is not. By “ideal,” I mean the formal specification of an architecture -- that is, its ideal implementation. So, for example, ideal data lakehouse architecture decomposes the capabilities of the data warehouse into a set of basic software functions (see the second in this series) that it implements as discrete services.
These services are “loosely coupled,” in that they communicate via well-designed, well-documented APIs. In this way, they do not need to know anything about the internal definitions, interfaces, classes, etc., of the other services with which they interact. Loose coupling is not a new concept. However, as I explained the previous articles, it is a bedrock principle of cloud-native software design.
The ideal lakehouse is implemented by layering these services one on top of the other such that, in theory, it is possible to replace the functions provided by one (for example, AWS Glue) with those of another (Azure Data Factory), to replace one query service (AWS Athena) with another (Presto), and so on.
An alternative, pragmatic implementation couples the data lake and data lakehouse services to one another. This is what two of the most prominent data lakehouse providers, Databricks and Dremio, have done with their combined data lake/house implementations. As we shall see, this pragmatic approach has several advantages relative to the ideal implementation of data lakehouse architecture.
The thing to keep in mind is that even if the tightly coupled character of a classic (RDBMS-based) data warehouse implementation has its drawbacks, it has certain advantages, too. The upshot is that loose coupling constitutes a point of failure, especially when it comes to coordinating concurrent, transaction-like operations in a distributed software architecture composed of independent services.
The Formal, Technical Requirements of Data Warehouse Architecture
First, let’s explore what the data lakehouse must be able to do if it is fully to replace the data warehouse. At minimum, the set of capabilities provided by its software functions must also comply with the formal, technical requirements of data warehouse architecture.
From the point of view of data warehouse architecture, it is less important that a query platform should provide fast results than that these results should be uniform and replicable.[i] In practice, the trick is to balance these requirements against one another, to achieve query results that are fast enough and also uniform and replicable.
This is actually much easier said than done. It is the reason why Hive + Hadoop consistently failed when used as a data warehouse replacement. It is the reason why distributed NoSQL systems almost always have problems when tapped for use as would-be RDBMS or data warehouse replacements.
Against this backdrop, let’s review the formal, technical requirements of data warehouse architecture.
The data warehouse is the following:
- A single, central repository for topical and historical business data
- A system that permits a panoptic view across the business and its function areas
- A system that permits a monitoring/feedback loop into the performance of the business
- A system that consumers can use to pose common and/or unpredictable (ad hoc) questions
- A system that delivers consistent, uniform query results (i.e., everybody has the same data)
- A system capable of hosting concurrent jobs/users and highly demanding mixed workloads
- A system capable of enforcing strict, invariant data management and data processing controls
- A system capable of anticipating and resolving conflicts that occur between requirements 5, 6 and 7
Does the data lakehouse fit the bill? It depends on how you implement data lakehouse architecture.
If you architect your lakehouse by pointing a SQL query service at the curated zone of a data lake, you will have an implementation that is almost certainly able to scale to address requirements 1 through 4. However, this implementation will probably struggle with requirements 5 through 8, inasmuch as each demands an engine that has the ability to enforce strict consistency, uniformity and replicability guarantees while also performing multiple, simultaneous operations on data. This is because requirements 5 through 8 go to the wicked problem of managing (and resolving) the conflicts that occur as a result of concurrency.
Reality Check: It Is Not OK To Drop ACID-like Safeguards
In a classic, tightly coupled data warehouse implementation, the warehouse is usually instantiated in a relational database, or RDBMS. Almost all RDMBSs enforce ACID safeguards that enable them to perform concurrent operations on data while at the same time maintaining strong consistency.
In the popular imagination, ACID safeguards are associated with online transaction processing (OLTP), which is also strongly associated with the RDBMS. It must be emphasized, however, that a data warehouse is not an OLTP system: You do not need to deploy a data warehouse on an RDBMS.
Reduced to a primitive technology prescription, the database engine at the heart of the data warehouse requires two things: a data store that has the ability to create and manage tables, as well as to append records to them, and logic to resolve the conflicts that tend to occur between concurrent operations on data. So, it is possible to design the data warehouse as an append-only data store and to commit new records in a timeline -- for example, as new rows. If you can only append new records (that is, you cannot change or delete existing ones), concurrency conflicts cannot occur. You can likewise design coordination logic to address data uniformity requirements, to ensure that if n users/jobs query the warehouse at the same time, each will query against exactly the same records in the timeline.
In practice, however, the easiest way to address these requirements is by using an RDBMS. The RDBMS is also optimized to quickly and correctly perform the relational operations (such as different kinds of joins) that are essential in analytical work. These are but two of the reasons the on-premises data warehouse is usually identical with the RDBMS -- and why a procession of would-be replacements, such as Hadoop + Hive, failed to displace the conventional warehouse.
It is also the reason almost all PaaS data warehouse services are designed as RDBMS-like systems. As I wrote in a previous article, “if you eschew in-database ACID database safeguards, you must either roll your own ACID enforcement mechanisms or accept data loss as inevitable.” This means that you have a choice among building ACIDic logic into your application code, designing and maintaining your own ACID-compliant database, or delegating this task to a third-party database.
Data Warehouse Workloads Require Consistency, Uniformity and Replicability Safeguards
Like it or not, consistency, uniformity and replicability are common requirements for production data warehouse workloads. For example, core operational business workflows routinely query the warehouse. In a production use case, then, the data lakehouse that replaces it might be expected to service hundreds of such queries per second.
Let’s consider what this entails by looking closely at a representative workload -- for example, a credit-application process that queries the lakehouse for credit-scoring results dozens of times each second. For statutory and regulatory reasons, queries processed at the same time must return correct results (“correct” as in they all use the same scoring model). This model also uses the same point-in-time data, albeit adjusted for customer-specific variations.
But what if a concurrent operation attempts to update the data used to feed one or more of the model’s parameters? The RDBMS’ ACID safeguards would prevent this update from being committed until after the results of the (dependent) credit-scoring operations had first been committed.
Can a SQL query service enforce the same safeguards? Can it do this even if objects in the data lake’s curated zone are accessible to other services (say, an AWS Glue ETL service), which are also able to update data at the same time?
The example above is by no means uncommon. It is, rather, the routine way of doing things. In other words, if you need to ensure consistent, uniform and replicable results, you need ACIDic safeguards. The upshot is that data warehouse workloads require these safeguards.
Can Data Lakehouse Architecture Enforce These Safeguards?
It depends. The first issue has to do with the fact that it is difficult to coordinate dependent operations across loosely coupled services. So, for example, how does an independent SQL query service restrict access to records in an independent data lake service? This is necessary to prevent concurrent users from modifying objects in the lake’s curated zone. In the tightly coupled RDBMS, the database kernel manages this. In the credit-scoring example above, the RDBMS locks the rows in the table(s) in which dependent data is recorded; this prevents other operations from modifying them. It is just not clear how to manage this in the data lakehouse architecture, with its layered stack of decoupled services.
A fit-for-purpose data lakehouse service should be able to enforce ACID-like safeguards -- if it is its own data lake and can control concurrent access to (and modification of) the objects in its data lake layer. Again, this is what Databricks and Dremio have done in their implementations of data lakehouse architecture. They solve the problem of coordinating concurrent access to (and operations on) resources that are shared across services by less loosely coupling these services to one another.
By contrast, this is much more difficult if the data lakehouse is implemented as a layered stack of loosely coupled, independent services -- for example, a discrete SQL query service that sits atop the curated zone of a discrete data lake service, which itself sits atop a discrete object storage service. It cannot achieve strong consistency because it cannot control access to the objects in the data lake.
In any distributed topology, the challenge is to coordinate concurrent access to shared resources while at the same time managing multiple dependent operations on these resources across space and time. This is true whether software functions (and the resources they operate on) are tightly or loosely coupled. So, for example, the way distributed processing is handled in a classic data warehouse implementation is by instantiating the warehouse as a massively parallel processing (MPP) database. The MPP database kernel identifies, schedules and coordinates dependent operations across the nodes of the MPP cluster, as well as manages (and resolves) the conflicts that occur between dependent operations. In other words, the MPP database kernel is able to enforce strict ACID safeguards while also performing concurrent operations in a distributed topology. This is no mean feat.
By contrast, a loosely coupled distributed software architecture, such as data lakehouse architecture, is confronted with the problem of coordinating access to resources and managing dependencies across what are, in effect, independent services. This is a wicked problem.
This is one reason the data lakehouse (like the data lake itself) usually functions as what is called an eventually consistent platform rather than as a strongly consistent platform.
On the one hand, it is able to enforce ACID-like safeguards; on the other hand, it may lose data and be unable to deliver uniformly replicable results. To enforce strict ACID safeguards would require combining the data lakehouse and the data lake into one platform: to tightly couple both services to one another.[ii] For what it is worth, this is the likely tendency of data lake/lakehouse evolution: Each will converge into the other, provided the idea of the data lakehouse actually has staying power.
However, to implement the data lakehouse as its own data lake is (in effect) to recapitulate the phylogeny of the data warehouse: It is to put the data lakehouse on the same evolutionary path as the warehouse itself. At a minimum, it is to tightly couple the lakehouse and the lake, to introduce (and to consolidate) a dependency on a single software platform and, just as important, on a single provider.
The next article in this series explores the use of data modeling with the data lakehouse.
[i] For more on the distinction between uniformity and replicability, read this article. The nickel summary: If n consumers query the same data at the same time, they should all receive exactly the same results, even if another consumer (e.g., an ETL job) attempts to update this data. This is what is meant by uniformity. If the data warehouse were to re-sequence and re-run these operations n times, it would produce exactly the same results each time. This is what is meant by replicability.
[ii] I will not attempt to vet the claims to strong consistency made by prominent data lakehouse + data lake providers. It seems to me that theirs is an especially wicked problem indeed. Instead, I say with Augustine: “I believe that I may understand.”