The concept of “bring your own compute” (or BYOC) is the foundation of a modern approach to supporting a data warehousing architecture in the cloud. In essence, BYOC is based on the loose coupling of computation and storage resources. A shared cloud-based data lake pools enterprise data assets in a common storage environment that is kept separate from computing resources. This model allows data consumers to use their own computing resources to perform desired analyses and subsequent reporting.
As organizations evaluate the opportunities for data asset sharing through a cloud-based data lake environment, they will see how BYOC ultimately enables shared self-service reporting and analytics. BYOC also has a benefit that may not be immediately obvious because it is not about the technical environment: it shifts operational management responsibilities and budget accountability from the data producer to the data consumers.
In a typical legacy on-premises reporting and analytics data warehousing architecture, one team is responsible both for producing the data for the data warehouse and for managing the data warehouse environment itself. Often called a “data warehouse group,” a “data warehouse center of excellence” or some other designation, this team is often also expected to provide, support and pay for the platform that others use to access shared reporting and analytics.
In the typical enterprise, the data warehouse is not just the storage framework for data assets; it is also the system through which requests are made, queries are executed and analyses are performed. For example, the enterprise data warehouse team often extracts the data from the sources. It also often provides the platforms for data staging, validation and transformation; the hardware platform for the data warehouse itself; the processes for loading the data into the target data warehouse environment; and end user access technologies (such as business intelligence, reporting and analysis front ends) that data consumers employ to use the data.
Unfortunately, though, this model’s sustainability is inversely related to the data warehouse’s success. As the number of data warehouse consumers increases, so does the investment required to support scalability, concurrency and a growing array of required end user capabilities. Yet, as an enterprise service, the data warehouse team often struggles to acquire the budget necessary to support this expansion.
This lack of sustainability contributes to the continuing trend of organizations migrating their data and applications to the cloud. Clearly, there are practical factors driving cloud migration, such as simplified operations, lowered costs, or the ability to take advantage of effectively unlimited scalable computing and storage resources. And at the same time, the fact that cloud providers compartmentalize their services and separate the storage from the compute allows the data warehouse team to focus on the production and subsequent management of shared data resources without having to also concentrate on end user accessibility.
In other words, the organization that is responsible for managing the data does not have to be the same organization responsible for providing the computing resources for processing and analyzing the data. Instead, data consumers can supply (and pay for) their own computing resources and bring those resources to the shared data, which is the essence of “bring your own compute.” Pushing the cost and management of the computing resources to the consumer communities is not only more equitable (since it no longer imposes the costs on the data producers), but it also frees data consumers from the platform constraints that typically linger in underbudgeted on-premises data warehouse system environments. This effectively allows a variety of different consumers to process and analyze shared data on their own schedules and budgets.
That being said, some confusion about the practical implementation of BYOC remains. The notion suggests that all data consumers have to do is conjure up computing resources and magically “bring those resources” and align them with the data to accomplish their reporting and analytics processes. The idea implies that the data does not move, but instead remains within the shared data lake. This is not actually the case: Even when data consumers bring their own computing resources, data must still be streamed from the data lake to the computational environment to be locally loaded, processed and analyzed.
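To make that data movement concrete, here is a minimal Python sketch of a consumer streaming records from the lake into its own compute in small batches and aggregating them locally. Everything named here is an assumption for illustration: `LAKE_OBJECT` is an in-memory stand-in for a file in the shared data lake, and `stream_batches` and `total_by_region` are hypothetical helpers. In practice the file object would come from whichever object store client the consumer uses.

```python
import csv
import io
from collections import defaultdict
from itertools import islice

# Hypothetical stand-in for one object stored in the shared data lake.
LAKE_OBJECT = """region,amount
east,10.0
west,4.5
east,2.5
north,7.0
west,1.5
"""


def stream_batches(fileobj, batch_size):
    """Yield lists of row dicts, at most batch_size rows per list."""
    reader = csv.DictReader(fileobj)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


def total_by_region(fileobj, batch_size=2):
    """Aggregate amounts per region on the consumer's own compute."""
    totals = defaultdict(float)
    for batch in stream_batches(fileobj, batch_size):
        for row in batch:
            totals[row["region"]] += float(row["amount"])
    return dict(totals)


if __name__ == "__main__":
    print(total_by_region(io.StringIO(LAKE_OBJECT)))
```

The batch size is the knob the consumer controls: it bounds how much data is held in local memory at once, independent of how large the shared object is.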
The trick is for each data consumer to figure out the optimal way to use the shared data. In some cases this might mean temporarily instantiating a relational database system on a computing instance and loading it from the data lake. Other options would be to use an in-memory NoSQL database or to rely on the pipelined processing provided by programming environments like Apache Spark. BYOC lends a degree of freedom to the enterprise by dissociating the operational costs of data access from the data producer. At the same time, it enables data consumers to manage their own resource scalability without relying on a constrained enterprise shared services budget.
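As one illustration of the temporary relational database option, the following Python sketch loads an extract into an in-memory SQLite database, runs a query and then discards the database. This is only a sketch under stated assumptions: `LAKE_EXTRACT` is an in-memory stand-in for data pulled from the lake, the `orders` table and its columns are invented for the example, and SQLite stands in for whatever database engine a consumer would actually choose.

```python
import csv
import io
import sqlite3

# Hypothetical extract pulled from the shared data lake.
LAKE_EXTRACT = """customer,product,quantity
acme,widget,3
acme,gadget,1
globex,widget,5
"""


def load_into_temp_db(fileobj):
    """Create a throwaway in-memory database and load the extract into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer TEXT, product TEXT, quantity INTEGER)"
    )
    rows = [
        (r["customer"], r["product"], int(r["quantity"]))
        for r in csv.DictReader(fileobj)
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def quantity_by_product(conn):
    """Run the consumer's own analysis against the local copy."""
    cur = conn.execute(
        "SELECT product, SUM(quantity) FROM orders "
        "GROUP BY product ORDER BY product"
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = load_into_temp_db(io.StringIO(LAKE_EXTRACT))
    try:
        print(quantity_by_product(conn))
    finally:
        conn.close()  # the database vanishes with the connection
```

Because the database lives only as long as the connection, the consumer pays for compute and memory only while the analysis runs, and the shared copy in the data lake is never modified.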