Recognizing the value proposition of cloud computing, many organizations are expending significant effort to modernize their existing on-premises data warehouse environments by migrating both data and applications to the cloud.
Cloud environments provide a great degree of flexibility, especially when it comes to managing different types of data assets. This benefits data warehouse architects because they are not limited to using a relational database management system (RDBMS) to hold the data warehouse’s data; rather, they can take advantage of a collection of data storage and management paradigms that can be virtualized and used for downstream reporting and analytics.
Enterprise data lakes are particularly valuable for cloud-based data warehouse modernization. A data lake provides a place for collecting data sets in their original format, making those data sets available to different consumers and allowing data users to consume the data in ways specific to their own needs. However, blindly dumping data assets into an enterprise data lake without exercising architectural oversight--as well as some degree of governance over the ingestion, storage and use of its data--may render some of that data effectively unusable.
The objective of using enterprise data lakes in support of an overall data warehousing/reporting/analytics strategy is to enable data analysts and data scientists to access and use shared data assets while simultaneously enforcing data controls that guard against unauthorized access to protected information.
Developing practical methodologies to maximize data accessibility and deliver scalable performance--while maintaining compliance with data protection policies--requires understanding the different roles that participate in the data lake, reviewing the factors that influence architectural decisions, and selecting technologies to meet requirements defined by data lake use cases.
Here we examine three different types of roles that interact with enterprise data lakes:
A data contributor is an entity--an individual, organization, system or external provider--that supplies data to the data lake. In the best-case scenario, the data contributor is responsible for providing:
- Data description: A high-level description of the data asset, including its creator, when (or with what cadence) it is produced, when and how it is delivered, and data model(s) if available
- Business rules: Specifications of any assertions, rules or calculations for the data within the data asset that a data consumer would need to know prior to using the data
- Obligations: Constraints and requirements associated with the data asset; examples include rules about data protection or contractual restrictions on the number of data consumers that may access the data
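The contributor-supplied metadata above can be captured in a simple registration record. A minimal sketch in Python; the field names and the example asset are illustrative assumptions, not taken from any particular catalog product:

```python
from dataclasses import dataclass, field

@dataclass
class DataAssetRegistration:
    """Metadata a data contributor supplies when registering an asset (hypothetical schema)."""
    name: str
    description: str      # high-level description of the data asset
    creator: str          # who produces the asset
    cadence: str          # when, or with what cadence, it is produced (e.g. "daily")
    delivery_method: str  # how it is delivered (e.g. "batch file drop", "stream")
    business_rules: list[str] = field(default_factory=list)  # assertions/calculations consumers must know
    obligations: list[str] = field(default_factory=list)     # data protection or contractual constraints

    def is_restricted(self) -> bool:
        # Treat any asset with declared obligations as requiring access controls.
        return len(self.obligations) > 0

# Example registration (fabricated for illustration)
sales = DataAssetRegistration(
    name="retail_sales",
    description="Daily point-of-sale transactions by store",
    creator="POS platform team",
    cadence="daily",
    delivery_method="batch file drop",
    business_rules=["net_amount = gross_amount - discounts - returns"],
    obligations=["customer_id is PII and must be masked for non-privileged consumers"],
)
```

A lightweight record like this keeps the contributor's burden low while still feeding the data catalog the metadata that downstream consumers and governance processes depend on.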
Because data contributors are supplying data for shared enterprise use, it is worthwhile to encourage parties to become data contributors. This suggests reducing the burden on data contributors by developing procedures and embracing tools that can help to produce this necessary metadata.
A data consumer is an entity--an individual, organization, system, application or external consumer--that uses (“consumes”) data from the data lake. Data consumers are expected to provide:
- Descriptions: A list of the required data assets, with a description and model(s) for each
- Use cases: Use cases that describe how the data lake assets are to be used and the methods to be used to access the data
- Performance expectations: Performance information--such as how many users and how much data they will use--necessary to influence the ways in which data assets are stored (for example, storage orientation and partitioning)
- Transformations: The types of transformations to be applied to data in preparation for its use
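Consumer-declared expectations can feed directly into storage decisions such as orientation and partitioning. A hedged sketch, using made-up workload categories and a deliberately simplified decision rule:

```python
from dataclasses import dataclass

@dataclass
class ConsumerProfile:
    """What a data consumer declares about intended use (hypothetical fields)."""
    use_case: str
    access_method: str  # e.g. "sql", "api", "nosql"
    expected_users: int
    workload: str       # "analytical" (scans/aggregations) or "operational" (point lookups)

def storage_hint(profile: ConsumerProfile) -> dict:
    """Derive a storage recommendation from a consumer profile.

    Rule of thumb only: analytical scans favor columnar layout, operational
    lookups favor row orientation, and high concurrency favors partitioning
    to spread I/O across files.
    """
    orientation = "columnar" if profile.workload == "analytical" else "row"
    partitioned = profile.expected_users > 50
    return {"orientation": orientation, "partitioned": partitioned}

analyst = ConsumerProfile(
    use_case="quarterly sales trend reporting",
    access_method="sql",
    expected_users=120,
    workload="analytical",
)
print(storage_hint(analyst))  # {'orientation': 'columnar', 'partitioned': True}
```

In practice the decision inputs would be richer (data volumes, query patterns, latency targets), but the principle is the same: consumer expectations captured up front drive how assets are physically stored.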
The more data consumers there are, the more likely it is that the data lake will succeed. Therefore, as with data contributors, it is beneficial all around to encourage use of the data lake while reducing the burden on consumers.
The directive to reduce burdens on both data contributors and data consumers is at odds with the need for a data lifecycle in enterprise data lakes: There must be processes for data ingestion, preparation, curation, storage, access and control. This suggests the need for a third data lake role: the DataOps practitioner. People in this role are responsible for the processes, tools and techniques for supporting the data lake lifecycle, including:
- Customer enablement: Processes ranging from soliciting information about use cases, to identifying data assets that can be used, to exploring access methods, and beyond
- Data ingestion: All types of data assets, including batch-loaded data sets and streaming data sets, as well as support for data sets with intermittent updates (for example, Change Data Capture)
- Data curation: Assessing data asset metadata, enforcing rules for conformance with reference model standards, applying transformations for standardization, reorganizing data (for example, storage in a columnar layout) and documenting data asset metadata in the data catalog
- Data access: Focusing on the ways that data consumers want to access data and configuring methods for access (such as direct SQL query access, APIs, or NoSQL query languages)
- Security management: Analyzing restrictions, constraints and data protection obligations, and configuring the right methods of protection
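One of these responsibilities--configuring protection for fields carrying data protection obligations--can be sketched as a masking step applied during curation. The protected field names and the pseudonymization rule here are illustrative assumptions:

```python
import hashlib

# Fields flagged by contributor obligations as protected (illustrative).
PROTECTED_FIELDS = {"customer_id", "email"}

def mask_value(value: str) -> str:
    """Replace a protected value with a stable pseudonym (first 12 hex chars of its SHA-256)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

def curate_record(record: dict, privileged: bool = False) -> dict:
    """Return a copy of the record, masking protected fields for non-privileged consumers."""
    if privileged:
        return dict(record)
    return {
        key: mask_value(str(val)) if key in PROTECTED_FIELDS else val
        for key, val in record.items()
    }

row = {"customer_id": "C-1001", "store": "042", "net_amount": 19.99}
masked = curate_record(row)
# masked["customer_id"] is now a pseudonym; "store" and "net_amount" pass through unchanged
```

Because the hash is deterministic, the pseudonym is stable across records, so analysts can still join and count by customer without ever seeing the protected identifier.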
In the best scenarios, DataOps practitioners leverage good practices and a suite of tools that simplify data contribution and make data use transparent. These processes encompass end-to-end data pipelines: ingesting data from a variety of source types, preparing and formatting the data, storing it, and producing data access methods that streamline reporting and analytics workflows.
However, determining the right set of tools to accommodate these processes requires identifying the different entities touching enterprise data lakes, the actions and events they generate, and their expectations for accessibility, performance and data security.