Big Data, by definition, implies lots of data. For many applications, the more data you have, the better. The question companies have to answer is how best to store their data. Part of the answer depends on what they plan to do with it. Training a machine learning model, for example, in most cases requires the compute muscle doing the training to be physically close to where the training data is stored.
Most companies don't have in-house computing resources for a large machine learning problem. That's where cloud providers like Amazon and Microsoft come in. Both address this problem by offering data storage and large amounts of CPU horsepower on demand. Once you’ve decided to use one of them, the next problem is getting your data into a public cloud without breaking the bank.
Big Data Storage
Both Amazon Web Services and Microsoft Azure offer services called “data lakes.” The term has come to mean a large repository of data, most often stored in raw format. AWS announced general availability of its data lake offering, called AWS Lake Formation, only recently. It uses the cloud provider’s S3 cloud storage service, which, when linked with any of Amazon’s machine learning services, can provide the foundation for a machine learning infrastructure. Amazon also offers several other tools to help with data import and cleansing.
Microsoft's Azure Data Lake has been in production for a while and provides similar functionality to that of AWS Lake Formation. Microsoft's HDInsight offering brings the power of the open source Hadoop toolset to Big Data processing. Microsoft uses the Hadoop Distributed File System (HDFS) as the primary data lake storage format, since it's compatible with most open source Big Data tools.
Microsoft publishes pricing for its Gen1 data lake storage: with a monthly commitment, 100 terabytes of storage will cost you $2,900 per month, and a petabyte will run $26,000 per month. Using the service on a pay-as-you-go basis costs more, starting at $0.039 per GB per month and dropping by a tenth of a cent in each of the next two tiers. Amazon's basic S3 pricing starts at $0.023 per GB per month for the first 50TB and goes down from there. AWS’s infrequent-access storage tiers cost as little as $0.01 per GB, and S3 Glacier storage goes as low as $0.004 per GB.
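To make those figures easier to compare, here is a rough back-of-the-envelope sketch in Python. It uses only the prices quoted above and rests on two assumptions that are not from the published price lists: that 1 TB equals 1,000 GB, and that a single per-GB rate applies to the whole volume. Real bills are tiered and include request and transfer charges, so treat the output as a ballpark, not a quote. The function and variable names are illustrative.

```python
# Assumption: decimal units. Providers may bill on 1,024 GB per TB instead.
GB_PER_TB = 1_000

# First-tier per-GB monthly rates quoted in the article.
RATES_PER_GB = {
    "Azure Gen1 pay-as-you-go": 0.039,
    "S3 standard": 0.023,
    "S3 infrequent access": 0.01,
    "S3 Glacier": 0.004,
}

# Azure Gen1 monthly commitment packages quoted in the article:
# terabytes committed -> dollars per month.
AZURE_COMMITMENTS = {100: 2_900, 1_000: 26_000}


def flat_monthly_cost(terabytes, rate_per_gb):
    """Monthly cost if one per-GB rate applied to the entire volume."""
    return round(terabytes * GB_PER_TB * rate_per_gb, 2)


def effective_rate_per_gb(terabytes, monthly_price):
    """Per-GB rate implied by a fixed monthly package price."""
    return round(monthly_price / (terabytes * GB_PER_TB), 4)


if __name__ == "__main__":
    for tb, price in AZURE_COMMITMENTS.items():
        rate = effective_rate_per_gb(tb, price)
        print(f"Azure commitment, {tb} TB: ${rate}/GB effective")

    # 50 TB is the largest volume for which the article gives an S3 rate.
    for name, rate in RATES_PER_GB.items():
        print(f"{name}, 50 TB: ${flat_monthly_cost(50, rate):,.2f}/month")
```

Under these assumptions, the 100 TB commitment works out to about $0.029 per GB, noticeably below the $0.039 pay-as-you-go starting rate but still above the S3 standard rate quoted for the first 50TB.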
Making It Work
Customer references abound for both Amazon and Microsoft from a wide variety of industries. Based on published pricing alone, Amazon appears to have an edge, but there's more to the story than the storage cost. Microsoft has a broad offering on the compute side and has what it calls Data Factory to integrate data from disparate sources. Like AWS Lake Formation, it provides the extract, transform, and load (ETL) functions necessary to pull data from existing databases.
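For readers unfamiliar with ETL, the pattern can be sketched in a few lines of Python using only the standard library. This is a generic illustration, not how Data Factory or AWS Lake Formation work internally: the orders table, its columns, and the function names are all hypothetical, and an in-memory SQLite database stands in for an existing production database.

```python
import json
import sqlite3


def build_demo_source():
    """Create a stand-in for an existing database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, " Alice ", 1999), (2, "BOB", 500)],
    )
    conn.commit()
    return conn


def extract(conn):
    """Extract: pull raw rows out of the source database."""
    cursor = conn.execute("SELECT id, customer, amount_cents FROM orders ORDER BY id")
    return cursor.fetchall()


def transform(rows):
    """Transform: tidy customer names and convert cents to dollars."""
    return [
        {
            "id": row_id,
            "customer": customer.strip().lower(),
            "amount_usd": amount_cents / 100,
        }
        for row_id, customer, amount_cents in rows
    ]


def load(records):
    """Load: serialize to JSON Lines text, ready to write to a storage target."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


if __name__ == "__main__":
    print(load(transform(extract(build_demo_source()))))
```

In a real pipeline, the load step would write to cloud storage and the extract step would connect to a production system. The managed services exist largely to handle that plumbing, along with scheduling and cataloging.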
Turning large amounts of data into usable and actionable information is where the biggest focus lies. Companies are hiring data scientists in large numbers to bring their tools and techniques to bear on specific business problems. Finding value in detailed production information or customer purchasing habits could quickly pay for the investments made. Simplifying the process of gathering and categorizing the data is what Amazon and Microsoft are trying to do.
Making machine learning work for a specific application is not necessarily a straightforward task. It takes someone who’s part detective, part mathematician, and part computer programmer to pull it off. Having easy access to the data makes the development process much easier.