Digital transformation is not just a matter of digitizing workflows and processes. It’s also a matter of retrofitting legacy and proprietary systems, along with other siloed sources of data, to participate in an ecosystem of connected systems, applications and services. It is, in essence, a problem of facilitating data exchange among all the resources that undergird a business’s essential workflows and processes.
Data fabric architecture has emerged as a promising solution to this problem. The data fabric is used to knit together distributed resources irrespective of where they are located (cloud or on-premises; local or remote), or of the APIs they expose for data exchange. This is quite useful, so far as it goes.
Like anything, however, data fabric architecture has pluses and minuses, costs and benefits. This article will explore these issues.
Three Modes of Data Fabric Architecture
Broadly speaking, there seem to be at least three prevailing conceptions of data fabric architecture.
The first sees the data fabric as a strictly decentralized architecture—that is, a means of getting at data that is otherwise distributed without first consolidating it into a central repository, such as a data lake or a data warehouse. At its most anodyne, a scheme like this de-emphasizes the role of centralized access in data architecture; at its most radical, it completely rejects the need for centralized access.
By contrast, a second, more inclusive take on data fabric sees these centralized repositories as non-privileged participants in a distributed data architecture: Data in the lake or the warehouse gets exposed for access much like other sources--via the data fabric. This take on data fabric architecture is inclusive of centralized data resources, but it nonetheless privileges decentralized access.
A third take on data fabric sees it as underpinning a hybrid data architecture. This scheme actually mandates a key role for the data lake and/or the data warehouse. It is biased in favor of centralized, as against decentralized, access: The data fabric gives data architects a way both to tie together otherwise dispersed data resources and to accommodate the unpredictable data access needs of specialized consumers, such as data scientists, ML/AI engineers and software engineers.
The Technology Components of Data Fabric Architecture
One result of these competing conceptions is that a kind of terminological vagueness has developed around the concept of the data fabric: At its most generic, the term has something for everybody; at its most concrete, it describes a very specific kind of distributed data architecture. To cut through this vagueness, let’s explore the core technologies that undergird data fabric architecture. This should give us a better sense of how it actually works.
- Data virtualization
Data virtualization (DV) does several useful things.
First, it simplifies access to data resources irrespective of their physical location. DV provides a virtual abstraction layer for dispersed resources. So far as a machine or human consumer is concerned, the cloud-based resources exposed via DV behave just like resources in the on-premises data center. DV can be used to knit together dispersed resources into a unified view, not unlike a virtual database.
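The "virtual database" idea can be sketched with standard-library SQLite: one connection plays the role of the DV engine, and an attached database stands in for a second, remote source. All table names and data below are invented for illustration; a real DV engine would federate live sources rather than in-memory stores.

```python
import sqlite3

# One connection acts as the "DV engine"; an attached in-memory database
# stands in for a second, physically separate source.
dv = sqlite3.connect(":memory:")
dv.execute("ATTACH DATABASE ':memory:' AS onprem")

# "Cloud" source: an orders table from a SaaS sales app (illustrative).
dv.execute("CREATE TABLE main.orders (order_id INTEGER, cust TEXT, amount REAL)")
dv.executemany("INSERT INTO main.orders VALUES (?, ?, ?)",
               [(1, "Jane Doe", 120.0), (2, "John Roe", 75.5)])

# "On-premises" source: a customer master table (illustrative).
dv.execute("CREATE TABLE onprem.customers (name TEXT, region TEXT)")
dv.executemany("INSERT INTO onprem.customers VALUES (?, ?)",
               [("Jane Doe", "EMEA"), ("John Roe", "APAC")])

# A single query spans both "locations" -- the way a DV layer presents
# one logical schema over dispersed physical sources.
rows = dv.execute("""
    SELECT o.order_id, o.cust, c.region, o.amount
    FROM main.orders AS o
    JOIN onprem.customers AS c ON o.cust = c.name
    ORDER BY o.order_id
""").fetchall()
print(rows)  # each row combines fields from both sources
```

The consumer sees one schema and one query; where each table physically lives is the engine's problem, not the consumer's.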
Second, DV enables versatile, API-based access to dispersed data resources. Modern DV is ecumenical with respect to data access interfaces; in addition to SQL access, DV technologies now access data via SOAP, RESTful and GraphQL endpoints. Experts can also use their preferred tools (Java and JDBC; Python and ODBC or JDBC) to acquire data via the fabric.
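This interface-agnosticism amounts to an adapter pattern: different connectors expose the same access method regardless of the protocol behind them. The sketch below fakes the REST side with a canned JSON payload (no network call) and the SQL side with in-memory SQLite; the class and field names are hypothetical.

```python
import json
import sqlite3

class RestConnector:
    """Fronts a REST-style endpoint; here the payload is canned JSON
    rather than a live HTTP response (illustrative only)."""
    def __init__(self, payloads):
        self.payloads = payloads  # resource name -> JSON string

    def fetch(self, resource):
        return json.loads(self.payloads[resource])

class SqlConnector:
    """Fronts a SQL source behind the same fetch() interface."""
    def __init__(self, conn):
        conn.row_factory = sqlite3.Row
        self.conn = conn

    def fetch(self, resource):
        cur = self.conn.execute(f"SELECT * FROM {resource}")
        return [dict(row) for row in cur.fetchall()]

# A consumer calls fetch() the same way whichever interface sits behind it.
rest = RestConnector({"orders": '[{"id": 1, "amount": 120.0}]'})
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
db.execute("INSERT INTO orders VALUES (1, 120.0)")
sql = SqlConnector(db)

print(rest.fetch("orders"))
print(sql.fetch("orders"))  # same shape, different protocol underneath
```

The point is not the connectors themselves but the uniform `fetch()` surface: consumers are insulated from whether a resource speaks SQL, REST, or something else.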
Third, data virtualization makes it possible to construct and expose different types of pre-built views of data. This is useful for running common queries against data in dispersed resources--whether in the cloud or on-premises. Another DV use case is as an enabling technology for frequently refreshed reporting. This involves integrating data from upstream resources into different kinds of composite “views,” which are functionally equivalent to reports, dashboards, and so on. In this way, DV can support basic end user-oriented practices (such as decision support and BI analysis), as well as expert practices (such as data science or ML/AI engineering) that tend to require significant data conditioning. (In the latter case, the DV engine could be used to transform and integrate the data destined to populate an ML training data set.) In this sense, DV incorporates several otherwise discrete data integration capabilities--for example, data profiling, ETL processing and data cleansing--into one engine.
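The "data conditioning" work mentioned above can be sketched as a small cleanse-and-transform routine that turns raw upstream rows into records fit for a training set. The field names and cleansing rules here are invented for illustration; real pipelines would be far richer.

```python
# A toy conditioning step: drop incomplete records, normalize casing,
# coerce types, and deduplicate -- the kind of work a DV engine might
# perform before data lands in an ML training set.

def condition(rows):
    """Return cleansed, deduplicated records from raw source rows."""
    seen, out = set(), []
    for row in rows:
        if not row.get("customer") or row.get("amount") is None:
            continue  # cleanse: require the key fields
        key = (row["customer"].strip().title(), float(row["amount"]))
        if key in seen:
            continue  # deduplicate after normalization
        seen.add(key)
        out.append({"customer": key[0], "amount": key[1]})
    return out

raw = [
    {"customer": "jane doe", "amount": "120.0"},
    {"customer": "JANE DOE", "amount": 120.0},  # duplicate once normalized
    {"customer": "", "amount": 10},             # incomplete record
]
print(condition(raw))  # one cleansed record survives
```

Note how profiling (inspecting fields), cleansing (dropping bad rows) and transformation (normalizing, coercing types) collapse into one routine; this is the "several capabilities in one engine" point in miniature.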
- Data cataloging
The data catalog uses metadata—that is, data that describes data--to discover, identify and classify useful data. If data lacks helpful metadata, data cataloging uses technologies (such as data profiling) to generate new metadata: Is it customer data? Product data? Sales data?
Within limits, advanced data cataloging technologies can discover and/or generate other types of metadata, such as data lineage. (Where did the data come from? What has been done to it? When? By whom?) Above all, the catalog is an essential tool for data discovery. For example, a business analyst can interactively query the data catalog (ideally using natural language) to discover useful data. Potentially valuable sources include not only applications, services and databases, but file data: CSV, spreadsheet, PDF, even PowerPoint files exposed via SMB and NFS network shares, or persisted to an object storage layer, such as Amazon S3.
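The metadata-generation side of cataloging can be illustrated with a toy profiler that infers a classification for a column from its sample values. The patterns and labels below are illustrative, not any real catalog's rules.

```python
import re

# Value patterns a profiler might use to classify unlabeled columns.
# These three are invented examples, not a production rule set.
PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def profile_column(values):
    """Return the first label whose pattern matches every non-empty
    sample value; otherwise leave the column unclassified."""
    samples = [v for v in values if v]
    for label, pattern in PATTERNS.items():
        if samples and all(pattern.match(v) for v in samples):
            return label
    return "unclassified"

print(profile_column(["123-45-6789", "987-65-4321"]))  # classified as ssn
print(profile_column(["foo", "123-45-6789"]))          # unclassified
```

A real catalog layers much more on top (sampling strategies, confidence scores, lineage capture), but the basic move is the same: generate metadata where none exists.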
- Knowledge graph
This is where the magic happens. The knowledge graph identifies and establishes relations between the entities it discovers across different data models. At a formal level, the knowledge graph attempts to “fit” its discoveries into an evolving ontology. In this way, it generates a schema of interrelated entities, both abstract (“customer”) and concrete (“Jane Doe”), groups them into domains, and, if applicable, establishes relations across domains.
So, for example, the knowledge graph determines that “CSTMR” and “CUST” are identical to “CUSTOMER,” or that a group of numbers formatted in a certain way (xxx-xx-xxxx) relates to the entity “SSN,” or that this SSN correlates with this CUSTOMER. It is one thing to achieve something like this in a single database with a unified data model; it is quite another to link entities across different data models: for example, “CUSTOMER” in a SaaS sales and marketing app = “CUST” in an on-premises sales data mart = “SSN” in an HR database = “EMPLOYEE Jane Doe who has this SSN is also a CUSTOMER.” This last is completely new knowledge.
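A minimal sketch of that cross-model linking: canonicalize field names with a synonym map, then join records from two models on a shared identifier (here, SSN) to emit graph-style triples. All names, mappings and data are fabricated for illustration; production entity resolution is probabilistic and far more involved.

```python
# Step 1: map variant field names onto a canonical entity name.
SYNONYMS = {"CSTMR": "CUSTOMER", "CUST": "CUSTOMER"}

def canonical(field):
    return SYNONYMS.get(field.upper(), field.upper())

# Step 2: records from two different data models (illustrative).
sales_mart = [{"CUST": "Jane Doe", "SSN": "123-45-6789"}]   # sales data mart
hr_db = [{"EMPLOYEE": "Jane Doe", "SSN": "123-45-6789"}]    # HR database

def link_on_ssn(customers, employees):
    """Emit (subject, predicate, object) triples for employees whose SSN
    also appears in a customer record -- 'new knowledge' in graph form."""
    customer_ssns = {rec["SSN"] for rec in customers}
    return [(emp["EMPLOYEE"], "is_also_a", "CUSTOMER")
            for emp in employees if emp["SSN"] in customer_ssns]

print(canonical("cstmr"))              # resolves to CUSTOMER
print(link_on_ssn(sales_mart, hr_db))  # Jane Doe, employee, is also a customer
```

The deterministic join here stands in for what a knowledge graph does statistically: real systems score candidate matches rather than requiring exact identifier equality.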
The Built-in Limits of Data Fabric Architecture
Proponents tend to present a best-case take on data fabric architecture. This best-case view emphasizes simplified data access, irrespective of interface or location, via abstraction. Proponents likewise emphasize the benefits of federated, as distinct from centralized, access. For example, an organization neither moves nor duplicates data; business units, groups, practices, etc. own and control the data that they produce. But the technologies that underpin the data fabric have costs and benefits of their own.
It is worth briefly exploring these to grasp the limits of data fabric architecture.
No data history: The data fabric uses DV to connect directly to business applications and services, including the OLTP systems that support finance, sales and marketing, HR, and other critical business function areas. These systems do not retain a history of transactional data; rather, they overwrite existing transactions as new transactions occur. As a result, the DV platform must incorporate some kind of persistent store to preserve and manage historical transaction data. At a certain point, this begins to look suspiciously like a DV platform with a data warehouse at its core. Nor does the data warehouse itself preserve raw transaction data; rather, it ingests and manages a derived subset of this data. The problem is that this raw or “detail” data--that is, the chaff that is not preserved by the warehouse--is potentially useful grist for business analysts, data scientists, ML engineers and other expert users. So, the DV platform must incorporate some kind of data lake-like repository to capture this data, too.
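The history problem can be made concrete with a few lines: an OLTP-style store overwrites state in place, so the fabric needs an append-only log (or similar persistent store) to answer "what did this record look like before?" The structures below are invented for illustration.

```python
# OLTP-style store: only the latest state survives; updates overwrite.
current = {}
# Append-only log the DV platform must maintain itself to preserve history.
history = []

def apply_txn(txn_id, state, version):
    current[txn_id] = state                   # overwrite, as OLTP systems do
    history.append((txn_id, version, state))  # preserve every version

apply_txn("T1", {"status": "ordered"}, 1)
apply_txn("T1", {"status": "shipped"}, 2)

versions = [v for v in history if v[0] == "T1"]
print(current["T1"])  # the source retains only the latest state
print(len(versions))  # the log retains both versions
```

Once you scale this log up, give it retention policies, and make it queryable, you have rebuilt a good part of a data warehouse inside the DV platform, which is exactly the point above.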
A different kind of labor intensiveness: In the DV model, IT technicians and expert users configure different types of pre-built connections for non-expert users. This work involves exposing individual data sources (for example, SaaS finance, HR and sales/marketing apps), as well as building and maintaining the pre-built views used to replicate the functionality of reports, dashboards, and so on. It also involves building and maintaining the complex data engineering pipelines used to acquire, cleanse, and transform the data used in SQL analytics or ML data processing.
This is true of data catalog technologies, too. On the one hand, catalogs are premised on the idea of human-directed search and discovery. On the other hand, they expose tools that permit users to identify, classify, annotate and share data. Most catalogs also expose tools that experts can use to alter or transform data, as well as to track changes to this data. Data catalogs automate the building and maintenance of metadata dictionaries and business glossaries, but, in practice, human experts usually end up curating these resources themselves.
The same holds for knowledge graphing technologies. The knowledge graph is useful as a means of discovering entities, along with the relationships that obtain between entities. It is a powerful tool for surfacing new knowledge. But its discoveries are irreducibly probabilistic. For sensitive applications and use cases, then, both the entities and relationships it discovers, along with the new knowledge it surfaces, must be reviewed and approved by human experts.
Location matters: The data fabric masks the physical location of distributed data sources. But data is most valuable when it is integrated into different kinds of useful combinations. This is the basic function of the SQL query. Data warehouse architecture addresses this problem by integrating and consolidating data and then moving it into a single place: the warehouse. On top of this, the data warehouse uses persistent data structures (indexes, pre-aggregated roll-ups, etc.) to accelerate queries. Most of these acceleration schemes involve caching data.
In a data fabric, data is accessed at dispersed locations and physically moved into the DV platform, where it is integrated and consolidated. Once again, the DV platform must take over at least some of the functions of the warehouse. To this end, it caches and pre-aggregates data as well as creates indexes to accelerate the performance of common queries. For truly ad hoc queries, or for processing analytic/ML models that require data from dispersed sources (such as sensors at the enterprise edge), data cannot be cached or pre-aggregated; instead, it must be fetched on demand--irrespective of its location. At a minimum, this introduces significant latency; at worst--as when the DV layer must access edge data via a high-latency connection--it results in non-responsive jobs. The upshot is that the data fabric tends to perform unpredictably (in comparison to the data warehouse) as a data processing engine.
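The caching trade-off can be sketched with simple memoization: repeated common queries skip the (possibly high-latency) federated fetch, while genuinely ad hoc queries always pay full price. The `fetch_remote` function and its call counter are invented stand-ins for a real federated call.

```python
import functools

calls = {"count": 0}  # counts trips to the "remote" source

def fetch_remote(query):
    """Stands in for an expensive federated call to a dispersed source."""
    calls["count"] += 1
    return f"result of: {query}"

@functools.lru_cache(maxsize=128)
def cached_query(query):
    # Common queries hit the cache; only a cache miss goes remote.
    return fetch_remote(query)

first = cached_query("SELECT region, SUM(amount) FROM orders GROUP BY region")
second = cached_query("SELECT region, SUM(amount) FROM orders GROUP BY region")
print(first == second, calls["count"])  # same result, a single remote call
```

The catch, as the paragraph above notes, is that a cache only helps queries you have seen before; every genuinely novel query is a miss, which is why DV performance is predictable for common workloads and unpredictable for ad hoc ones.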
These are just a few of the trade-offs that, like the reverse side of a coin, offset the positive benefits of the data fabric. Neither they nor others (for example, the increased complexity of data governance) are show-stoppers; they are, however, issues that would-be adopters need to be aware of.
Another problem has to do with the essential bias of the data fabric--namely, its bias in favor of data access, as against data management. To cite one example, Gartner’s notion of the data fabric is of a technology infrastructure used to access and move data. This bias is a feature, not a bug, of the data fabric: it is a useful means of simplifying access to data--for example, data that is dispersed across multiple resources and accessed via APIs. It is especially useful as an integration technology for distributed application workflows, as in application modernization or digital transformation efforts.
However, this usefulness is always in tension with the priority of managing data.
The thing is, we do not manage data solely to govern or control it; we manage data when we design schemes to preserve data history, or when we optimize data structures in order to improve performance for different types of workloads. We manage data whenever we create replicable, reusable data flows, as well as replicable, reusable data cleansing and conditioning routines.
We manage data when we implement data versioning capabilities, or when we define objective standards for the production of the cleansed, consistent data used to support decision-making, planning, forecasting and other activities.

The data fabric is frequently positioned as a disruptive, zero-sum architecture--a means of eliminating centralized repositories, or of shrugging off onerous data management tools, policies and practices. It is more helpful to conceive of it as a both-and proposition--a complement to, not a replacement for, data management tools, practices and concepts.