In a discussion I had with a spokesperson for a cloud services vendor several months ago, the representative said something that stuck with me. “One thing you don’t get with the cloud is a history of how your data changes over time,” the spokesperson said. “There’s no way you can look at a record [at a point in time] and compare it to other time periods. What we’re doing is we’re preserving the historical data that is [otherwise] lost in the cloud.”
The spokesperson was not affiliated with a cloud data lake, data warehouse, database or object storage vendor. It seemed that the company hadn’t previously considered that cloud subscribers could use one of these services to collect and preserve the historical data produced by cloud applications. Or, if the company had previously considered this, it was rejected as an undesirable, or nonviable, option.
In my conversations specific to the data mesh architecture and data fabric markets, the terms “history” and “historical” tend to come up infrequently. Instead, the emphasis is on (a) enabling business domain experts to produce their own data and (b) making it easier for outsiders -- experts and non-experts alike -- to discover and use data. And yet, discussion of the technologies that underpin both data mesh architecture and the data fabric -- viz., data virtualization, metadata cataloguing and knowledge discovery -- focuses on connectivity to operational databases, applications and services; i.e., resources that do not preserve data history.
Historical data is of crucial importance to machine learning (ML) engineers, data scientists and other experts, of course. It is not an afterthought. And there are obvious schemes you can use to accommodate historical data in data mesh architecture -- a historical repository that is instantiated as its own domain, for example.
But these two data points got me wondering: Are we forgetting about data history? In the pell-mell rush to the cloud, are some organizations poised to reprise the mistakes of past decades?
Data History and the Cloud
Most cloud apps and services do not preserve historical data. That is, once a field, value or record changes, it gets overwritten with new data. Absent a routinized mechanism for preserving it, this data is lost forever. That said, some cloud services do give customers a means to preserve data history.
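To make the overwrite problem concrete, consider the difference between updating a record in place (the cloud app's default) and appending each change to a history table. The following is a minimal sketch in Python with SQLite; the table, columns and account states are hypothetical:

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical history table: every change is appended, never overwritten,
# so point-in-time comparisons remain possible.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account_history (
        account_id TEXT,
        status     TEXT,
        valid_from TEXT   -- timestamp at which this version became current
    )
""")

def record_change(account_id, status):
    """Append a new version of the record instead of updating in place."""
    conn.execute(
        "INSERT INTO account_history VALUES (?, ?, ?)",
        (account_id, status, datetime.now(timezone.utc).isoformat()),
    )

# A typical cloud app would simply UPDATE the record, destroying the old
# value. Appending preserves the full sequence of states:
record_change("acct-1", "trial")
record_change("acct-1", "active")
record_change("acct-1", "churned")

versions = conn.execute(
    "SELECT status FROM account_history WHERE account_id = ? ORDER BY rowid",
    ("acct-1",),
).fetchall()
print([v[0] for v in versions])  # all three states are retained
```

An app that only keeps the current row would answer “churned” and nothing else; the append-only table can answer “what was this account's status a year ago?”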
This option to preserve data history might seem convenient, at least so far as the customer is concerned. But there are many reasons why organizations should consider taking on and owning the responsibility of preserving historical data themselves.
The following is a quick-and-dirty exploration of considerations germane to the problem of preserving, managing and enabling access to historical data produced by cloud applications and services. It is not in any sense an exhaustive tally, but it does aspire to be a solid overview.
What if your provider does not offer services/features to preserve data history?
Then you need a plan to preserve, manage and use the data produced by your cloud apps and services.
What if you do not have an existing source of data history?
The good news is that it should be possible to recover historical data from extant sources. Back in the early days of decision support, for example, recreating data history for a new data warehouse project usually involved recovering data from backup archives, which, in most cases, were stored on magnetic tape.
In the cloud, this legacy dependency on tape may go away, but the process of recreating data history is still not always straightforward. For example, in the on-premises environment, it was not unusual for a backup archive to tie into a specific version of an application, database management system (DBMS) or operating system (OS). This meant that recovering data from an old backup would entail recreating the context in which that backup was created.
Despite the software-defined nature of cloud services, virtual abstraction on its own does not address the problem of software dependencies. In infrastructure as a service (IaaS), for example, you have the same dependencies (OS, DBMS, etc.) as you did in the on-premises data center. With platform as a service (PaaS) and software as a service (SaaS), changes to newer versions of core cloud software (e.g., deprecated or discontinued APIs) could also complicate data recovery.
The lesson: Develop a plan to preserve and manage your data history sooner rather than later.
But what if your provider does offer services/features to preserve data history?
You should still have a plan. Relying on your provider’s offerings to preserve data history creates an unnecessary dependency. After all, do you really “own” your data if it lives in the provider’s cloud services?
Moreover, your access to your own data is mediated by the tools and APIs -- and the terms of service -- that are specified by your cloud provider. But what if the provider changes its terms of service? What if you decide to discontinue use of the provider’s services? What if the provider is acquired by a competitor or discontinues its services? How much will it cost you to move your data out of the provider’s cloud environment? What formats can you export it in?
In sum: Are you comfortable with these constraints? This is why it is incumbent upon customers to own and take responsibility for the historical data produced by their cloud apps and services.
What historical data should you preserve?
Even in the era of data scarcity -- scarcity of data volumes on the one hand and of storage capacity on the other -- savvy data warehouse architects preferred to preserve as much raw historical data as possible, in some cases using change data capture (CDC) technology to replicate all deltas to a staging area. They did this because having raw online transaction processing (OLTP) data on hand made it relatively easy to change or maintain the data warehouse -- for example, to add new dimensions or rekey existing ones.
Today, this is more practicable than ever, thanks to the availability (and cost-effectiveness) of cloud object storage. It is likewise more necessary than ever, due to the popularity of disciplines such as data science and machine learning engineering. These disciplines, along with traditional practices such as data mining, typically require raw, unconditioned data.
A caveat, however: If you use CDC to capture tens of thousands of updates an hour, you will ingest tens of thousands of new, time-stamped records each hour. Ultimately, this adds up.
The lesson is that not all OLTP data is destined to become “historical.” If for some reason you need to capture all updates -- e.g., if you are using a data lake to centralize access to current cloud data for hundreds of concurrent consumers -- you do not need to persist all these updates as part of your data history. (Few customers could afford to persist updates at this volume.) What you should do is persist a sample of all useful OLTP data at a fixed interval.
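The fixed-interval sampling described above can be sketched as follows. The idea is to retain only the last version of each record within each interval, discarding intermediate updates; the record layout and one-hour interval are hypothetical:

```python
def sample_updates(updates, interval_seconds=3600):
    """Keep only the last update per key within each fixed interval.

    `updates` is an iterable of (timestamp, key, value) tuples -- e.g.,
    raw CDC deltas. The result is the sampled history worth persisting.
    """
    latest = {}  # (interval bucket, key) -> (timestamp, value)
    for ts, key, value in updates:
        bucket = int(ts // interval_seconds)
        prev = latest.get((bucket, key))
        if prev is None or ts >= prev[0]:
            latest[(bucket, key)] = (ts, value)
    return [(bucket, key, val)
            for (bucket, key), (ts, val) in sorted(latest.items())]

# Many intra-hour updates collapse to one row per key per hour:
raw = [(10, "order-1", "created"), (900, "order-1", "paid"),
       (3500, "order-1", "shipped"), (4000, "order-1", "delivered")]
print(sample_updates(raw))
# hour 0 keeps only "shipped"; hour 1 keeps "delivered"
```

At tens of thousands of updates an hour, this kind of downsampling is the difference between an affordable history and an unbounded one.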
The historical data produced by a cloud app or service is part of a bigger picture.
On its own, the data produced by a cloud application or service can be queried to establish a history of how it has changed over time. Data scientists and ML engineers can trawl historical data to glean useful features, assuming they can access the data. But data is also useful when it is combined with (historical) data from other services to create different kinds of multidimensional views: you know, analytics.
For example, by combining data in Salesforce with data from finance, logistics, supply chain/procurement, and other sources, analysts, data scientists, ML engineers and others can produce more useful analytics, design better (more reliable) automation features and so on.
By linking sales and marketing, finance, HR, supply chain/procurement, logistics, and other business function areas, executive decision makers can obtain a complete, synoptic view of the business and its operations. They can make decisions, plan and forecast on that basis.
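As a minimal illustration of combining histories, the sketch below joins historical sales data with finance data on a shared month key to produce a simple cross-functional view. The tables and figures are entirely hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical histories preserved from two different cloud services.
    CREATE TABLE sales_history   (month TEXT, bookings REAL);
    CREATE TABLE finance_history (month TEXT, expenses REAL);
    INSERT INTO sales_history   VALUES ('2023-01', 120.0), ('2023-02', 150.0);
    INSERT INTO finance_history VALUES ('2023-01',  80.0), ('2023-02', 110.0);
""")

# A simple multidimensional view: bookings, expenses and margin per month.
rows = conn.execute("""
    SELECT s.month, s.bookings, f.expenses,
           s.bookings - f.expenses AS margin
    FROM sales_history s
    JOIN finance_history f ON s.month = f.month
    ORDER BY s.month
""").fetchall()
for row in rows:
    print(row)
```

Neither source on its own can answer “how did margin trend over time?”; the joined history can.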
This barely scratches the surface of what historical data makes possible.
The purpose of this article was to introduce and explore the problem of capturing and preserving the data that is produced by cloud apps and services -- specifically, the “historical” operational data that typically gets overwritten when new data gets produced. There are several reasons organizations will want to preserve and manage this data, including the following:
- Support core BI reporting and analytics. A derived subset of raw operational data is essential grist for reporting, ad hoc query and analysis, dashboards, scorecards, and other types of BI-analytics. Without this data, it is impossible to situate what is happening “now” in a useful comparative context -- for example, what happened a year ago, three years ago, etc.
- Develop new (core) BI reporting and analytics. Data modelers, BI developers, business analysts and other experts can use this raw data to develop and introduce new BI reports and analytics. For example, preserving all raw operational data in a separate repository makes it easier to add new dimensions to the data warehouse or to design data modeling logic that consumes new data for specific use cases. Data modeling logic can be instantiated as views in a database, a data lake or a data warehouse, as well as in a separate BI/semantic layer. (Models in a semantic layer are metadata constructs that, for example, a BI tool might use to translate and generate SQL queries. Conceptually, then, they are similar to SQL views.) This also makes it easier to modify the data warehouse -- for instance, to rekey a fact table.
- Support data science, machine learning and data engineering practices. In its raw form, this data may also be useful to data scientists, ML and data engineers, and other expert users. In most cases, data scientists and ML engineers require data that is just not available via the data warehouse. Similarly, software engineers may also be interested in this data: The software engineer’s common lament is that she cannot get the data she needs from the warehouse -- because it is not there. Instead of querying a database or a semantic layer, developers sometimes prefer to build modeling logic into their code to retrieve and use data.
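The second point above -- data modeling logic instantiated as views over preserved raw data -- can be sketched briefly. The raw table stays intact, and new views (in effect, new dimensions) can be added without reloading anything; all names and values here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical raw operational data preserved from a cloud app.
    CREATE TABLE raw_orders (order_id TEXT, region TEXT, amount REAL);
    INSERT INTO raw_orders VALUES
        ('o1', 'EMEA', 100.0), ('o2', 'EMEA', 50.0), ('o3', 'APAC', 75.0);

    -- Modeling logic as a view: the raw data is untouched, and further
    -- views can be layered on later for new use cases.
    CREATE VIEW revenue_by_region AS
        SELECT region, SUM(amount) AS revenue
        FROM raw_orders
        GROUP BY region;
""")

result = dict(conn.execute("SELECT region, revenue FROM revenue_by_region"))
print(result)  # revenue aggregated per region
```

A semantic-layer model plays a conceptually similar role: it is metadata that a BI tool expands into SQL at query time, rather than an object stored in the database.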
There is another reason that organizations will want to capture and preserve all the data their cloud apps produce, however. Most SaaS apps (and even many PaaS apps) are not designed for accessing, querying, moving and/or modifying data. Rather, they are designed to be used by different kinds of consumers who work in different types of roles. The apps likewise impose constraints, such as API rate limits or per-call charges, that can complicate the process of accessing and using data in the cloud.
In a follow-up article, I will delve into this problem, focusing specifically on API rate limits.