On his LinkedIn profile, Mark McQuade describes himself as a "knowledge addict." That's pretty appropriate, considering his line of work. McQuade is a lead solutions architect at Onica, an Amazon Web Services (AWS) consultancy and managed services provider. In that role, McQuade has broadened his knowledge of everything from Docker and Kubernetes to artificial intelligence and deep learning. Here, McQuade shares his perspective about all things data lakes.
Why do organizations use data lakes?
Use cases range from feeding machine learning algorithms developed by data scientists to building statistical visualizations and using the generated insights to guide business decisions.
Why are data lakes so complicated?
With data expanding by 10 times every five years, data platforms need to scale 1,000 times to be sufficient for 15 years of storage and processing requirements. Data lakes exist to ease this burden, but the process of building data lakes involves a series of steps that can become quite cumbersome, lasting months on end due to the complexities of data cleansing, data preparation and security configurations. Additionally, over the life of the data lake, further manual steps are involved, such as managing and monitoring ETL [extract, transform, load] jobs, updating metadata based on data changes, maintaining cleansing scripts and more.
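The scaling arithmetic above can be checked directly: growth of 10 times every five years compounds to 1,000 times over fifteen years.

```python
# Compound data growth: 10x every 5-year period.
GROWTH_PER_PERIOD = 10
PERIOD_YEARS = 5

def required_scale(years: int) -> int:
    """Scale factor a data platform must support after `years` of growth."""
    periods = years // PERIOD_YEARS
    return GROWTH_PER_PERIOD ** periods

print(required_scale(15))  # 10 ** 3 = 1000
```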
How long does it take to build a data lake?
Building a full-fledged data lake manually is hard and time consuming; the process can take upward of three to six months. But it doesn't have to be that complicated or take so long. Using AWS Lake Formation to simplify the build takes much of that manual work out and can reduce the time to build your data lake to weeks.
What are the benefits of simplifying data lakes?
Businesses save a massive amount of time and hassle. By streamlining how your organization builds and maintains its data lakes, you cut back on the in-house expertise and resources needed to keep everything running smoothly, freeing up IT teams to focus on more pressing projects and saving your organization money in the long run.
Data can also help businesses anticipate customer behavior, make a variety of predictions or forecasts, automate processes to improve efficiency, and enhance product offerings with speed and availability, in addition to automating customer service. These use cases require that the data is secure and available in real time, and with growing numbers of people accessing data, it is important that data platforms are flexible and scalable. AWS Lake Formation can address all of these concerns.
How can organizations simplify data lakes?
We recommend AWS Lake Formation, which eliminates a lot of the manual work and can reduce the time to build the data lake to weeks. It also allows organizations to simplify data lakes in three ways:
- Use Blueprints to ingest data: Data can be ingested in bulk loads or incremental loads. If you choose incremental loads, you can specify which tables and columns to load, set bookmark keys and choose a sort order for those keys. Once these parameters are set, you can monitor the incremental import to confirm that ingestion succeeds.
- Grant permissions to share data securely: Once the data has been ingested, you can grant users access permissions on the tables that hold the data in the database. Permissions can be set per user, with individually selectable options such as create, select, insert, alter, drop or delete.
- Run queries: After the data has been ingested and security permissions have been defined, queries can be run using Amazon services, such as Amazon Athena, that utilize the data present in the tables in the data lake. Creating and managing data lakes with AWS Lake Formation is a process that is much simpler, more intuitive and dramatically faster than manual efforts.
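The bookmark-keyed incremental loading described in the first step can be sketched generically in plain Python. This is an illustration of the idea, not Lake Formation's actual implementation, and the column name used as the bookmark key is hypothetical.

```python
# Sketch of bookmark-keyed incremental ingestion: each run picks up only the
# rows whose bookmark column advanced past the value stored on the last run.

def incremental_load(rows, bookmark_key, last_bookmark):
    """Return rows newer than the stored bookmark, plus the updated bookmark."""
    new_rows = sorted(
        (r for r in rows if r[bookmark_key] > last_bookmark),
        key=lambda r: r[bookmark_key],  # ascending sort order on the bookmark key
    )
    new_bookmark = new_rows[-1][bookmark_key] if new_rows else last_bookmark
    return new_rows, new_bookmark

# A run that has already ingested up to id=1 only picks up ids 2 and 3.
source = [{"id": 1}, {"id": 2}, {"id": 3}]
batch, bookmark = incremental_load(source, "id", last_bookmark=1)
print([r["id"] for r in batch], bookmark)  # [2, 3] 3
```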
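The per-user table permissions in the second step can also be granted programmatically through the AWS SDK's Lake Formation client. A minimal sketch of the request shape follows; the database, table and principal names are hypothetical, and the live call is left commented out because it requires AWS credentials.

```python
def build_grant(principal_arn: str, database: str, table: str, permissions: list):
    """Assemble the parameters for lakeformation.grant_permissions()."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": permissions,
    }

grant = build_grant(
    "arn:aws:iam::123456789012:user/analyst",  # hypothetical principal
    "sales_db",                                # hypothetical database
    "orders",                                  # hypothetical table
    ["SELECT", "INSERT"],
)
# import boto3
# boto3.client("lakeformation").grant_permissions(**grant)
print(grant["Permissions"])
```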
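Likewise, the queries in the third step can be submitted via the Athena API. The sketch below assembles the parameters for `start_query_execution`; the table name and results bucket are hypothetical, and the live call is commented out since it requires AWS credentials.

```python
def build_query(sql: str, database: str, output_s3: str):
    """Assemble the parameters for athena.start_query_execution()."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

params = build_query(
    "SELECT region, SUM(amount) FROM orders GROUP BY region",  # hypothetical table
    "sales_db",                  # hypothetical Glue/Lake Formation database
    "s3://my-athena-results/",   # hypothetical results bucket
)
# import boto3
# boto3.client("athena").start_query_execution(**params)
print(params["QueryExecutionContext"])
```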
Are there other ways that organizations could reduce the complexity of their data lakes that perhaps don't involve these specific steps or Amazon?
While all three of the hyperscalers offer ways to manage data lakes, it’s always important for enterprises to ask themselves what problem they’re looking to solve before adopting new technologies. While simplifying data lakes might be key for some enterprises, there may be underlying circumstances that can only be solved with another solution.
What are some things that organizations definitely shouldn't do when simplifying their data lakes?
Avoid on-premises deployment and stick with serverless data lakes. A serverless data lake allows IT teams to scale effectively, while on-premises deployments require frequent software upgrades and attention to physical hardware.
Also, never cut corners when it comes to your data lake. Building one takes time and effort, and you may be tempted to take a shortcut here or there, but this platform will power your organization's data for years to come, so avoid the shortcuts.
How can organizations ensure they are factoring in the future when building their data lakes?
Make sure your data platform is built for long-term success, not just your immediate needs.
For example, you may not be interested in machine learning at this moment in time, but two to three years down the road you will most likely want to derive predictions or forecasts from your data. It's also good practice to make sure you have a robust, scalable and secure data platform in place that will allow your business and your data to power through for many years to come.