Data lakes are having a moment. According to one recent report, they are expected to grow by about 30% over the next five years. Tomer Shiran, CEO of data-as-a-service vendor Dremio, explains what they are, why they make sense in the cloud and how to make them work best for your company. Dremio has its own cred: Founded just five years ago, the company is routinely named to "best of" lists, including Pagan Research's list of "Most Promising Big Data Startups in the World" and CRN's "10 Hottest Big Data Startups of 2019."
What is a data lake?
A data lake provides a unified source for all of an organization's data, replacing the many siloed file and object stores that tend to spring up inside organizations. Data lakes are also inherently open: they cleanly separate storage from compute and processing, and they offer a non-proprietary storage alternative to ingesting data into proprietary systems such as data warehouses. This gives organizations the flexibility to bring best-of-breed processing to their data as needed, while maintaining full control of the data itself.
What type of data do data lakes typically include?
Everything from structured data from relational databases (rows and columns), to semi-structured data such as CSV and JSON files, to unstructured data like documents and binary data like images or video. All of these data types can then be transformed, analyzed and processed together.
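To make the three categories concrete, here is a minimal sketch, using only the Python standard library, of how structured, semi-structured and unstructured objects can sit side by side in the same store. The object names and contents are made up for illustration; a real data lake would hold these as files or objects in storage such as S3.

```python
import csv
import io
import json

# Hypothetical mini "lake": three objects of different types in one store.
lake = {
    # Structured: rows and columns, e.g. exported from a relational database.
    "sales/2023/orders.csv": "order_id,amount\n1,19.99\n2,5.00\n",
    # Semi-structured: a nested JSON event.
    "events/2023/clicks.json": '{"user": "a1", "page": "/home", "tags": ["promo"]}',
    # Unstructured: raw binary, e.g. the first bytes of an image file.
    "images/logo.png": b"\x89PNG\r\n\x1a\n",
}

# Each type is read with a reader suited to its format.
orders = list(csv.DictReader(io.StringIO(lake["sales/2023/orders.csv"])))
click = json.loads(lake["events/2023/clicks.json"])
raw = lake["images/logo.png"]

print(orders[0]["amount"])  # structured field access
print(click["tags"])        # semi-structured nested access
print(len(raw))             # unstructured: just bytes
```

The point is that nothing forces these objects into a single schema up front; each consumer picks the reader it needs.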
How is a data lake different from a data warehouse?
Data lakes are a more modern approach to storing data for analytics. Even if your data has no structure at the time you ingest it into a data lake, you can apply structure after ingestion. A data lake is also more flexible in what you can do with the data: if you have data in cloud data lake storage, you could use Dremio to run SQL queries and business intelligence tools on that data, and then use Spark to perform processing and ETL jobs. With a data warehouse, you have to structure the data at the time of ingestion, which makes it harder to get the data in. A warehouse is also more restricted in what you can do; essentially, you are just running SQL queries on tables.
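The "structure it after you ingest it" idea is often called schema-on-read. A rough sketch of the pattern, with made-up field names: raw events land in the lake exactly as produced, and a schema is imposed only when the data is queried.

```python
import json

# Raw events are ingested as-is; no schema is enforced at write time.
raw_lines = [
    '{"user": "a1", "amount": "19.99", "country": "US"}',
    '{"user": "b2", "amount": "5.00"}',  # a missing field is fine at ingest
]

# Schema-on-read: structure is applied only when the data is consumed.
def read_orders(lines):
    for line in lines:
        rec = json.loads(line)
        yield {
            "user": rec["user"],
            "amount": float(rec["amount"]),          # type cast at read time
            "country": rec.get("country", "UNKNOWN"),  # fill gaps at read time
        }

orders = list(read_orders(raw_lines))
total = sum(o["amount"] for o in orders)
```

A warehouse would have rejected the second record (or forced a schema decision) at load time; here that decision is deferred to the query.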
What types of organizations or use cases are best for data lake storage?
If there is strategic value in your data today, or there may be in the future, you'll want to hold onto it and treat it as a strategic asset. A data lake is also a good choice for companies that collect large volumes of highly varied data at a high rate. Consider it, too, if you value maintaining control of your data at all times so you can apply best-of-breed approaches to analyzing it. Finally, consider a data lake if you want to streamline and simplify your data architecture to support your lines of business and improve overall business agility.
Should a data lake be built in the cloud or on-premises?
We recommend organizations build new data lakes in the cloud. Cloud data lake storage can easily be leveraged by many cloud services for processing, analytics and reporting. From a scalability point of view, you can start with a few small files and grow your data lake to exabytes in size, without the worries that come with expanding storage and maintaining data internally. Cloud storage providers also offer multiple storage classes and pricing options, which lets organizations pay only for what they actually need, instead of planning for an assumed cost and capacity as they would when building a data lake locally. Cloud data lake storage has also proven to be highly durable and available – for example, eleven 9's of durability for Amazon S3. Finally, all companies have a responsibility to protect their data, and because data lakes are designed to store all types of data, including sensitive information like financial records or customer details, security becomes even more important. Under the shared responsibility model, cloud providers secure the underlying infrastructure, while customers remain responsible for securing the data they store in it.
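The eleven 9's figure can be made concrete with a back-of-envelope expected-value calculation (the object count below is an arbitrary example, not a claim about any particular deployment):

```python
# What 99.999999999% (eleven 9's) of annual durability implies, roughly.
durability = 0.99999999999
annual_loss_prob = 1 - durability    # chance a given object is lost in a year

objects = 10_000_000                 # e.g. ten million stored objects
expected_losses_per_year = objects * annual_loss_prob  # roughly 1e-4

# Average years between losing any single object across all ten million.
years_per_loss = 1 / expected_losses_per_year          # roughly 10,000 years

print(expected_losses_per_year)
print(years_per_loss)
```

In other words, at that durability level, losing even one object out of ten million is an event expected about once every ten thousand years.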
What about organizations that maintain some combination of cloud and on-premises data lake solutions? How can they make sure everything is synced up?
Most of the time, some data lives on premises because that's where it's generated, while other data is generated and lives in the cloud. IoT [internet of things] data, for example, is generated in many different places and then aggregated in the cloud, while the company may have some business data stored in an on-premises source. In both cases, the data is not copied from one location to another, so there is no need to keep it synced. Data consumers access the data wherever it lives.
What if the organization already has an on-premises data lake and is considering moving it to the cloud? What advice do you have for a migration?
Migrating an on-premises data lake to the cloud can be challenging, because data consumers are connected to all of the existing data sources, and any changes to those sources during migration break those connections and force data engineers to rebuild them. So, organizations need a way to abstract the underlying migration to the cloud away from data analysts, business intelligence users and data scientists.
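One common way to provide that abstraction (a generic pattern, not necessarily Dremio's implementation; all names and URIs below are hypothetical) is a catalog that maps stable logical dataset names to physical locations, so a migration only updates the mapping while consumers keep using the same names:

```python
# Illustrative sketch: consumers address datasets by logical name; the
# catalog resolves names to physical locations behind the scenes.

class DatasetCatalog:
    def __init__(self):
        self._locations = {}

    def register(self, logical_name, physical_uri):
        """Point a logical dataset name at a physical storage location."""
        self._locations[logical_name] = physical_uri

    def resolve(self, logical_name):
        """Return the current physical location for a logical name."""
        return self._locations[logical_name]

catalog = DatasetCatalog()
catalog.register("sales.orders", "hdfs://onprem-cluster/warehouse/orders")

# Consumers only ever see the logical name "sales.orders".
before = catalog.resolve("sales.orders")

# During migration, engineers repoint the dataset; consumer code is unchanged.
catalog.register("sales.orders", "s3://company-lake/warehouse/orders")
after = catalog.resolve("sales.orders")
```

Queries written against "sales.orders" survive the move from HDFS to S3 untouched, which is exactly the decoupling the migration needs.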
If you choose to standardize on cloud data lake storage, do you have to stick with one cloud provider?
Not necessarily. In the multicloud model, you use more than one set of cloud data lake storage, but each set serves a separate and distinct group of applications or workloads. In this way, an organization can spread its workloads across multiple cloud providers. In the hybrid cloud model, by contrast, a single workload uses more than one set of data lake storage, joining data from multiple data lake storage services; there could be multiple such workloads, each joining data from more than one set of storage. Both models hold whether some of the data lake storage is on-premises (private cloud) or entirely in public clouds.
How can businesses get the most value from a cloud-based data lake?
To take full advantage of data processing and analytics on your data, you need technologies that were built for the type of platform you are using. That's because the latency and performance aren't the same as what you would get with local NVMe on a single server. Because you're going over a network to another service, it can slow down the workflow if you aren't using a technology designed specifically for data lake storage.
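One reason purpose-built engines matter is that they avoid pulling whole rows over the network when a query touches only one column. A toy calculation with made-up sizes (the row count and per-column byte widths below are illustrative, not benchmarks):

```python
# Toy illustration: bytes fetched over the network for a scan of one column,
# comparing a row-oriented full scan with a columnar (column-pruned) scan.

rows = 1_000_000
column_bytes = {"order_id": 8, "amount": 8, "description": 200}  # per value

# Row-oriented storage: every query fetches every row in full.
row_oriented_scan = rows * sum(column_bytes.values())

# Columnar storage: a query on "amount" fetches only that column.
columnar_scan = rows * column_bytes["amount"]

print(row_oriented_scan // 1_000_000)  # MB fetched, row-oriented
print(columnar_scan // 1_000_000)      # MB fetched, columnar
```

With these numbers the columnar scan moves 8 MB instead of 216 MB, and over a network link to remote storage that difference dominates query latency, which is why formats and engines designed for data lake storage read only what the query needs.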