I have technology marketers to thank for the theme of this article -- that is, “the end of ETL.”
By which I mean what, exactly? The end of … extract, transform, and load?
Does it mean the end of traditional ETL tools? Of traditional ETL development practices? Of the data warehouse-focused data modeling programs that matured on top of these practices? If so, the end of ETL in this sense is news to exactly … nobody. It’s a fait accompli.
Or does it mean the end of the data movement and data transformation operations that are commonly designated by the term ETL? Because, like it or not, “ETL” routinely gets used as shorthand for any sequence of data movement and data transformation operations. This usage is well attested today among data scientists, data engineers and others -- even if what these experts actually mean is any sequence of data acquisition, data movement and/or data transformation operations that is orchestrated between data sources and compute engines. You know, a data-engineering pipeline.
All this is to say that the term ETL is a very different thing from the acronym ETL. When technology marketers talk about the “end of ETL,” which ETL do they mean? The term or the acronym?
Of Technology-marketing Strawmen -- Or Piñatas
Unfortunately, marketing people are not always clear about this. As Exhibit A, permit me to paraphrase an excerpt from a promotional email that I received last December. (In this case, I am paraphrasing to protect the not-so-innocent.) To wit: “Data transformation is an inefficient process that has long bedeviled IT teams, most of which continue to depend on ETL and ELT to establish connections to data sources and to acquire data for analytics.”
In fairness, when marketing people speak of “ETL,” they usually mean the traditional -- i.e., the data warehouse-focused -- ETL tools model. But they are vague, perhaps deliberately so, on this point.
For example, if I were to paraphrase the rest of the message, I would do so as follows:
“A new trend we have seen is for IT teams to employ a different approach to data engineering -- what we call Data-Analytic Fracking (DAF) -- to provide rapid access to data without the inefficiencies of ETL or ELT. Because DAF makes data (in a database or any other data source) available on an ongoing basis for analytics, it wholly changes how companies work with their data.”
Ask yourself: What is the rhetorical effect of this message? Consider the following:
- The message problematizes something called “ETL,” along with something else called “ELT.”
- The message does not distinguish between “ETL” as an acronym -- i.e., a practice nominally associated with specific use cases and a specific category of tools -- and “ETL” as a generic term routinely used by data engineers and other experts to describe data engineering.
- The message problematizes the terms “extract,” “transform” and “load” -- operations integral to the engineering and delivery of data -- without making the distinction described in the previous point.
- The message touts a “different approach to data engineering” that permits access to data “without the inefficiencies of ETL or ELT” -- without extracting, loading or transforming data.
- The message explicitly problematizes data transformation, citing unspecified “inefficiencies.”
- Data transformation is one of the most costly/labor-intensive aspects of data engineering.
- Is there not also a sense in which the message promises to obviate data engineering as such?
At a minimum, the claims this vendor makes in its marketing invite a few obvious questions. First, how does DAF, or “data-analytic fracking,” differ from the E and the L operations that are ubiquitous in data engineering? That is, does DAF not also entail a sequence of data extraction and loading operations of some kind? Because if it does not, that would be -- how should I put it? -- impossible. Yes, impossible.
Data Engineering Always Involves E and L
When a data engineer creates a data pipeline, she codifies a sequence of data extraction, data movement and (if applicable) data transformation operations. At a minimum, the data engineer needs to access and move -- i.e., to Extract -- the data she needs, as well as to Load it into a staging area of some kind. In many cases, this is still a local client (e.g., a laptop). In other cases, it could be local or remote object storage. The upshot is that data engineers, data scientists, machine learning (ML) engineers, and others routinely perform EL operations to acquire the data necessary to do their work.
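To make this concrete, here is a minimal EL sketch in Python. The source records and table names are hypothetical, and SQLite stands in for a local staging area; a real pipeline would extract from an API, a database or a file.

```python
import sqlite3

def extract():
    """E: acquire raw records from an upstream source (hypothetical data)."""
    return [
        {"id": 1, "region": "us-east", "sales": 1200.0},
        {"id": 2, "region": "eu-west", "sales": 840.0},
    ]

def load(records, db_path=":memory:"):
    """L: land the extracted records in a local staging area (SQLite here)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging_sales (id INTEGER, region TEXT, sales REAL)"
    )
    conn.executemany(
        "INSERT INTO staging_sales VALUES (:id, :region, :sales)", records
    )
    conn.commit()
    return conn

conn = load(extract())
count = conn.execute("SELECT COUNT(*) FROM staging_sales").fetchone()[0]
print(count)  # 2
```

Note that no T appears anywhere in this sketch: the records land in staging exactly as extracted, which is the point of the bullet list that follows.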
We can draw several conclusions from this.
- EL operations are ubiquitous in data engineering.
- It is not always necessary to transform data.
- Even if data transformation is required, the constituent E, L, and T operations need not occur in some inviolable order. ETL and ELT are just two of the more common sequences.
- Data often undergoes multiple E, L and, yes, T operations as it gets engineered for use. We could as easily speak of ELTTEL or TELTTEL, to say nothing of other permutations.
- This is not a vestige of the “legacy” ETL tools paradigm. Multi-stage extraction, loading, and transformation operations are no less common in the paradigms that have supplanted it.
For example, prior to extracting data from an upstream source, a data engineer might opt to transform the data in situ -- e.g., in a temporary table in the cloud database in which it lives -- to reduce the volume of data she must move across the WAN. (This has implications for data egress pricing, too.) Similarly, the data engineer might opt to take advantage of available intra-cloud compute services (such as AWS Glue or Amazon Athena) to transform and/or query against data her organization stores in a multi-gigabyte Parquet data set in Amazon S3 storage. In both scenarios, the engineer performs transformation operations at the source prior to extracting the data and loading it locally for use.
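The push-down pattern described above can be sketched as follows. SQLite stands in for the remote cloud database, and the table and column names are illustrative; the key move is that aggregation happens inside the source system, so only the reduced summary crosses the network.

```python
import sqlite3

# SQLite stands in for the remote cloud database (illustrative schema).
remote = sqlite3.connect(":memory:")
remote.execute("CREATE TABLE events (user_id INTEGER, bytes INTEGER)")
remote.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 100), (1, 250), (2, 75), (2, 25), (3, 500)],
)

# T (in situ): aggregate inside the source via a temporary table, shrinking
# five raw rows down to three summary rows before any data moves.
remote.execute(
    "CREATE TEMP TABLE daily_summary AS "
    "SELECT user_id, SUM(bytes) AS total_bytes FROM events GROUP BY user_id"
)

# E + L: only the reduced summary crosses the "WAN" into local staging.
summary = remote.execute(
    "SELECT user_id, total_bytes FROM daily_summary ORDER BY user_id"
).fetchall()
print(summary)  # [(1, 350), (2, 100), (3, 500)]
```

Read as a sequence, this is a TEL pipeline -- one of the non-ETL, non-ELT permutations the bullet list above mentions.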
These operations are not necessarily simple, however. The data engineer must either design her own logic to manage the operations’ constituent steps and validate that the steps are successfully completed (e.g., by writing dependency control, error correction and validation logic) or entrust this task to a tool of some kind.
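A hand-rolled version of that orchestration logic might look something like this sketch. The step names and validation rules are invented for illustration; real pipelines typically delegate this bookkeeping to a workflow tool.

```python
# Minimal hand-rolled orchestration: run steps in order, validate each step's
# output against a post-condition, and retry a failed step a limited number
# of times before giving up.
def run_pipeline(steps, retries=1):
    results = {}
    for name, func, validate in steps:
        attempts = 0
        while True:
            try:
                out = func(results)            # step can read upstream results
                if not validate(out):          # post-condition check
                    raise ValueError(f"validation failed for step {name!r}")
                results[name] = out
                break
            except Exception:
                attempts += 1
                if attempts > retries:
                    raise
    return results

steps = [
    ("extract", lambda r: [3, 1, 2], lambda out: len(out) > 0),
    ("transform", lambda r: sorted(r["extract"]), lambda out: out == sorted(out)),
]
print(run_pipeline(steps)["transform"])  # [1, 2, 3]
```

Even this toy version has to answer the hard questions -- what counts as success, what to do on failure, which steps depend on which -- which is precisely why tools exist for the job.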
Like It or Not, Data Transformations Are Inevitable
To return to the “data-analytic fracking” example I discussed above: In the background, DAF is essentially identical to a scheme in which E and L operations get orchestrated at runtime, which itself is conceptually like the modus operandi of data virtualization (DV) technology.
However, data virtualization does not purport to eliminate the requirement to transform data. In fact, most DV implementations permit developers, modelers, etc., to specify and apply different types of transformations to data at runtime. Does DAF? That is, how likely is it that any scheme can eliminate the requirement to transform data?
Not very likely at all. Data transformation is never an end unto itself. It is rather a means to the end of using data, of doing stuff with data. The essence of the problem has to do with optimization -- with the need to transform, to optimize, data so it can be prepared for use in different kinds of scenarios.
For example, what if a data engineer needs to convert from one encoding scheme (ASCII) to another (Unicode)? Or to convert between locales (en_US to en_AU) within the same encoding scheme? What if a data engineer needs to concatenate data in the source system? Or convert from one numeric base (decimal) to another (binary)? What if it is necessary to change from one system of measurement (Imperial) to another (Metric)? Or from one unit of measurement (Pascals) to another (Gigapascals)? These are just a handful of the hundreds of transformations a skilled data engineer might apply.
In a scheme such as DAF, what happens if the data in a source system needs to be transformed before it can be used? Basically, the same thing that happens in the data virtualization model: The DAF software exposes features that a developer, modeler or expert can use to specify and/or design different kinds of transformation functions that get applied to data at runtime.
My point is that if you look at what a scheme such as DAF actually does, it generates basically the same abstractions -- e.g., directed acyclic graphs (DAGs) -- that get generated in every mode of data engineering. In SQL, for example, EL operations are expressed as a SELECT-INSERT -- i.e., a simple DAG. In fact, the separate SELECT and INSERT operations can each be abstracted as DAGs. (This speaks to the suitability of declarative languages for data engineering. So, as an example, a relational database management system parses SQL and generates a DAG, whereupon the optimizer sequences the DAG operations for efficiency. By contrast, if you hand-code these operations, you must explicitly specify each step. However, you are still specifying a finite sequence of atomic operations that, as a totality, is equivalent to a DAG.)
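As a sketch of that equivalence, the atomic operations of a SELECT-INSERT can be modeled as a small DAG and sequenced with a topological sort. The node names are invented for illustration; a real query plan would be considerably richer.

```python
from graphlib import TopologicalSorter

# A SELECT-INSERT as a tiny DAG: each node is an atomic operation, each
# entry maps an operation to the operations it depends on.
dag = {
    "scan_source": set(),               # read rows from the source table
    "filter_rows": {"scan_source"},     # apply the WHERE clause
    "project_cols": {"filter_rows"},    # evaluate the SELECT list
    "insert_target": {"project_cols"},  # INSERT into the target table
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['scan_source', 'filter_rows', 'project_cols', 'insert_target']
```

Whether a database optimizer derives this ordering from declarative SQL or a data engineer spells it out in code, the underlying structure is the same.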
Conceptually, then, what this vendor is doing with what it calls “DAF” is straight from basic data engineering. Both deal with the same concepts, the same techniques, the same abstractions. The acronym itself has changed, but the constituent operations have not. DAF is not ETL in the sense of the legacy, data warehouse-focused ETL-tools model. It is ETL in the generic sense of the term.
The Ecstasy -- Or Agony? -- Of an All-in-One Approach to Data Engineering and Analytics
In fact, this vendor is attempting to force a distinction between, on the one hand, the types of operations that get performed as part of a typical data engineering scenario and, on the other hand, the types of operations its software performs as part of a so-called “data-analytic fracking” scenario. This is a distinction without merit.
However, even if this vendor is not reinventing data engineering as such, it is doing something different. Depending on a customer’s needs, the integrated experience it claims to achieve with its “data-analytic fracking” scheme could be useful. This is a distinction with merit.
What the vendor is indeed delivering is a kind of consolidated data engineering and analytics platform. It’s a software ecosystem that incorporates and synthesizes features and functions that are otherwise implemented across multiple categories of products, then exposes them in the context of a unified user experience with pre-built ease-of-use features. This platform aims to accelerate the acquisition, engineering and presentation of data in support of a wide gamut of analytics use cases, from core operational reporting, ad hoc query and OLAP-based analysis to the types of more advanced use cases associated with analytic discovery and data science. To this end, the platform also makes it possible for experts to write custom code to address novel or complex data engineering requirements.
This is a different argument, a different model -- one that has its own pluses and minuses.
I discussed the pluses of this model in the previous paragraph. As for its minuses, a platform of this kind creates and reinforces a dependency on itself -- i.e., on its own features and services, as well as on the software development lifecycle tools and practices for which it is best suited. (Keep in mind that not just the mode and cadence but also the evolution of these practices is, to an extent, dictated by the platform’s vendor.) This platform, like all platforms, achieves a trade-off between convenience and flexibility. As an integrated platform that promises to simplify data engineering and accelerate the delivery of analytics, it exposes pre-built ease-of-use features that function to formalize some of the more common operations associated with acquiring, moving, modifying and modeling data for analysis. However, in addition to shunting developers, data engineers and scientists, and other experts into a certain way of doing things -- e.g., by using in-platform tools, features, and services -- the platform likewise functions as a kind of lockbox for the data engineering logic, models and other assets these experts create.
How portable -- i.e., reusable -- is this logic? How portable is the modeling logic associated with it? Or the views (analogous to reports, dashboards, etc.) that expert users create? How about the different types of metadata that are generated and/or managed by the platform?
It is one thing clearly and unequivocally to champion the merits of this tradeoff. It is another thing to claim to obviate a practice (viz., data engineering) that -- because it is integral to the acquisition, preparation and delivery of data for analytic use cases -- is an inescapable cause of pain and frustration for all organizations. At the very least, this claim is confusing. At worst, it is misleading.
It is not my style to pick on any one vendor, and that is not my aim here. After all, this unnamed vendor and its overly optimistic messaging are not outliers. The essence of its marketing pitch -- viz., that technology alone can eliminate the complexity, costs, etc., that are byproducts of a cluster of complex sociotechnical problems -- is boilerplate for the data and analytics space in late 2021.
This vendor’s pitch is also consistent with the familiar trope whereby an upstart technology player unilaterally declares null and void the set of rules, constraints, laws and so forth that have historically determined what is practicable in a given domain. Or, more precisely, declares these rules, constraints and laws to be null and void with regards to itself.
Because this trope is so common, technology buyers should be savvy enough not to succumb to it. Yet, as the evidence of four decades of technology buying demonstrates, succumb to it they do. This problem is exacerbated in any context in which (as now) the availability of new, as-yet-untested technologies fuels optimism among sellers and buyers alike. Cloud, ML and AI are the dei ex machina of our age, contributing to a built-in tolerance for what amounts to utopian technological messaging. That is, people not only want to believe in utopia -- who wouldn’t wish away the most intractable of sociotechnical problems? -- but are predisposed to do so.
In this article, I have highlighted a representative vendor’s marketing to make the point that a core set of concepts, techniques and abstractions is baked into the ground truth of the world of data engineering.
To the degree that this world reliably corresponds to a larger, richer world -- i.e., to reality itself -- this ground truth does not and, in fact, will not change. Today, right now, no combination of cloud, ML, AI and serverless technologies is sufficient to nullify this ground truth and its constraints.
Technological innovation does hold the promise of relaxing these constraints -- of making them less constraining -- but cannot yet eliminate the necessary work of engineering data for use in analytics (or any use case). Not today, not next year and probably not in five years’ time, either.
A technology marketer’s job is to convince the IT buyer that each new “disruptive,” “game-changing” and “paradigm-shifting” technology has the potential to solve most if not all their problems. Yet, paradigm-nullifying shifts do not tend to occur. Like, ever. If or when a paradigm shift does occur, the constraints that determined what was possible in the predecessor paradigm do not cease to apply in the new one, even if they do get interpreted differently.
The same is true, more or less, of what marketers like to call the “disruptive” effects of technological innovation. Innovations that fundamentally alter the ground truth that undergirds a given technological and/or usage paradigm are relatively rare. The IT buyer’s job is to be mindful of this.
To sum up: “ETL” as a synonym for data engineering is alive and well. As a generic term, it encompasses a set of essential data engineering operations and reduces to the same basic abstractions. In this generic sense, ETL by any other name is still the same.