Thanks to its performance, convenience and cost effectiveness, object storage has emerged as the go-to mechanism for managing storage in the cloud.
Behind the scenes, many (or perhaps most) cloud services depend on object storage as a persistence layer. Elsewhere, enterprises have long used cloud-based object storage as a place to put “stuff” -- archival data, for one, along with backups and files of any kind.
But is cloud-based object storage by itself a suitable destination for cloud data? Yes and no. Even though object storage is a useful place to put cloud data, it is less useful for accessing and using it, at least for data management and analytics use. For these purposes, cloud services that use object storage as a persistence layer -- e.g., a cloud data lake, data warehouse or database -- are superior options.
Cloud Object Storage Pros and Cons
The following is a summary of the advantages and disadvantages of cloud-based object storage.
Object storage is not a database, nor is it meant to be. Object storage is ideally suited for storing large files, as opposed to hundreds of thousands of small files or -- as with relational database management systems -- records.
Note: On the one hand, large files reduce the number of GET requests a consumer has to issue, and each request can be slow because of network delay. Large file size also keeps data in memory. On the other hand, large files cause problems in query scenarios (SQL and otherwise) because they force the query or compute engine to load and process extraneous data. Behind the scenes, cloud data lakes, data warehouses and so forth optimize for this in different ways. Generally, most query and compute engines load data from object storage (such as S3) into memory or into a local cache and keep it there.
In fact, object storage does not expose APIs that a “consumer” -- e.g., a machine or a human being -- could use to create, read, update, or delete records.
When data is persisted in object storage, it is usually stored in an optimized columnar file format, such as Apache Parquet or Apache ORC (Optimized Row Columnar) files. These files can efficiently store and compress analytic data. But, by themselves, neither of these formats approximates the features of a database system.
Imagine, for example, that you store the contents of a database table in a Parquet file set. Depending on the size of the table and the minimum size of the Parquet files, the table could be distributed across multiple files. Now imagine that you store the contents of a separate database table in another Parquet file set. On their own, these file sets do not provide a means to manage and join data across tables. You need another mechanism -- e.g., a serverless ETL (extract, transform, load) service such as AWS Glue. A database, by contrast, uses a data dictionary to manage tables, along with primary key definitions and foreign key constraints. The database engine itself performs join operations on data in separate tables.
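To make the gap concrete, here is a minimal Python sketch of the join logic that a database engine would otherwise perform on your behalf. The two lists stand in for rows already decoded from two separate file sets. The table names, column names and the hash_join helper are hypothetical, and a real engine would add a catalog, key constraints and query optimization on top of this.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, key):
    """Inner-join two lists of dict rows on a shared key column."""
    # Build phase: index the right-hand rows by join key.
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)

    # Probe phase: look up each left-hand row and merge any matches.
    joined = []
    for row in left_rows:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical rows, as if decoded from two separate file sets.
orders = [
    {"customer_id": 1, "order_total": 250.0},
    {"customer_id": 2, "order_total": 75.5},
    {"customer_id": 1, "order_total": 19.99},
]
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
]

result = hash_join(orders, customers, "customer_id")
```

Nothing in the files themselves knows that customer_id links the two data sets. That knowledge lives in whatever code or service sits on top of object storage.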
A database offers just one example. Data lake and data warehouse services provide equivalent functions. An organization could also design its own data engineering, data modeling and transactional logic and use a compute engine (such as Spark) to provide similar functions.
This last approach may be ideal in specific use cases (and is a favorite of software engineers). It is difficult to scale in support of typical BI-analytics workloads, however.
API rate limits work differently for cloud-based object storage. On balance, Amazon, Google and Microsoft enforce access-friendly API rate limits in connection with their cloud object storage services. That said, these providers do still limit API access to object storage. To cite one example, Amazon documents a baseline of at least 5,500 GET or HEAD requests per second per prefix in an S3 bucket. If this seems insufficient, the tools and methods used to store data (as distinct from files) in object storage help to mitigate the effects of this limit.
Note: Owing to how object storage works, this limit is usually sufficient, although another solution is to distribute data across multiple prefixes or across multiple S3 buckets in the same region. Another catch is that some providers, such as Amazon, also bill on a per-API-request basis, such that each GET, HEAD, PUT, etc., request is billed at a fixed rate. So, for example, 1,000 S3 GET requests cost “just” $0.0004 at typical S3 Standard list prices. However, a customer that issues 100 million GET requests will incur $40.00 in costs.
The cost(s) of API calls to object storage can add up, at least for BI and analytic workloads. One thing to keep in mind is that Amazon, Google, Microsoft and other cloud providers charge customers a fixed rate per GET, HEAD, PUT, COPY, POST, DELETE, etc., request to object storage. Even for read-intensive analytic workloads, these costs can quickly add up, especially if multiple consumers access object storage concurrently.
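A rough way to see how these charges accumulate is to multiply request counts by a per-request rate. The sketch below does exactly that. The request_cost helper, the request volumes and the prices are assumptions chosen for illustration; substitute the figures from your provider's current price list.

```python
def request_cost(requests_by_type, price_per_1000):
    """Estimate request charges in USD for a mix of API calls."""
    total = 0.0
    for verb, count in requests_by_type.items():
        total += (count / 1000) * price_per_1000[verb]
    return total


# ASSUMED prices in USD per 1,000 requests; placeholders, not quoted rates.
assumed_prices = {"GET": 0.0004, "PUT": 0.005}

# One consumer issuing 2 million GETs and 100,000 PUTs in a billing period.
single_consumer = {"GET": 2_000_000, "PUT": 100_000}
one = request_cost(single_consumer, assumed_prices)

# Fifty consumers with the same access pattern, none of them sharing a cache.
fifty = 50 * one
```

The per-consumer figure is small. The multiplier is what matters: every consumer that reads object storage directly repeats the same requests.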
Another thing to keep in mind is that it can also be costly (i.e., as a function of both time and, especially, cloud resource requirements) to access and process data in object storage. So, for example, a plain GET request to an S3 bucket retrieves the entire contents of a compressed data set -- say, a 1-GB Parquet file -- even if the consumer that requests it requires only a few kilobytes or megabytes of data. Moreover, if a data set -- e.g., a database table -- is distributed across several dozen 1-GB Parquet files, the consumer must issue a separate GET request to retrieve each of these files.
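The whole-file access pattern described above can be put into numbers. This sketch assumes that every file is retrieved in full with one GET apiece. The fetch_plan helper, the file count and the 50-MB figure are hypothetical.

```python
def fetch_plan(file_sizes_bytes, needed_bytes):
    """Summarize whole-file retrieval of a data set split across files."""
    transferred = sum(file_sizes_bytes)
    return {
        "get_requests": len(file_sizes_bytes),  # one GET per file
        "bytes_transferred": transferred,
        "useful_fraction": needed_bytes / transferred,
    }


GB = 1024 ** 3
MB = 1024 ** 2

# Hypothetical table stored as 36 files of 1 GB each,
# queried by a consumer that needs only about 50 MB of it.
plan = fetch_plan([GB] * 36, 50 * MB)
```

In this scenario the consumer issues 36 requests and moves 36 GB, of which well under 1% is data it actually needs.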
In a common access scenario, a “consumer” -- in this case, the Spark compute engine -- GETs data from object storage, decompresses it on the fly and loads it into memory. At this point, it can be operated on.
Again, it is difficult to scale this model to support numerous concurrent users. From the perspective of both query performance and cost effectiveness, it is suboptimal.
Consider using a fit-for-purpose cloud service to support BI and analytics workloads. Unlike online transaction processing (OLTP) workloads, analytics workloads tend to be read-intensive. With read-intensive analytic workloads, it becomes especially practicable to use a fit-for-purpose cloud service to host them. Data is read once, cached in memory and fetched in response to queries. This scheme is advantageous vis-à-vis provider-imposed API rate limits and per-API pricing, too.
Even though you could use a third-party data warehouse or database to access data in object storage, it is probably a good idea to use these services to also store and manage your data.
For example, imagine that you mount a Parquet data set in object storage as an external table from a cloud database. Now imagine that a human user initiates an ad hoc query against this data. Behind the scenes, the database GETs the data from object storage, decompresses it and loads it into memory, where (assuming sufficient resources) it remains. In this way, data in object storage can be read once, by a single requestor -- the cloud database -- and made available for access by authorized consumers.
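The read-once pattern can be sketched in a few lines of Python. Everything here is a stand-in: FakeObjectStore is an in-memory dictionary that counts GET requests, ReadOnceTable is a hypothetical wrapper, and a gzip-compressed CSV takes the place of a Parquet data set so that the example needs only the standard library.

```python
import gzip


class FakeObjectStore:
    """In-memory stand-in for an object store that counts GET requests."""

    def __init__(self, objects):
        self._objects = objects
        self.get_count = 0

    def get(self, key):
        self.get_count += 1
        return self._objects[key]


class ReadOnceTable:
    """Fetches and decompresses an object on first use, then serves it from memory."""

    def __init__(self, store, key):
        self._store = store
        self._key = key
        self._rows = None  # populated lazily by the first query

    def query(self, predicate):
        if self._rows is None:
            raw = gzip.decompress(self._store.get(self._key))
            self._rows = raw.decode("utf-8").splitlines()
        return [row for row in self._rows if predicate(row)]


store = FakeObjectStore({"sales.csv.gz": gzip.compress(b"east,10\nwest,20\neast,5\n")})
table = ReadOnceTable(store, "sales.csv.gz")

east = table.query(lambda row: row.startswith("east"))
west = table.query(lambda row: row.startswith("west"))
```

Two queries run, but store.get_count stays at 1. The second query is answered entirely from memory, which is why a single intermediary can shield many consumers from rate limits and per-request charges.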
This scheme would perform much better, however, if the data were stored in (and managed by) the cloud database itself. For one thing, what if the user also needed to update or delete a record in the dataset? In most cases, this would be impossible in this scheme.
Similarly, an organization could design its own software to retrieve and modify data in object storage, and human consumers could also manually retrieve data from object storage. For typical use cases, however, it makes more sense to use a separate service -- a data lake, data warehouse, database or query service -- to support BI and analytics workloads.
Cloud data lakes and data warehouses offer friendlier API rate limits and data-transfer pricing. Essentially, all cloud data lake, data warehouse and database services use object storage as an underlying storage substrate.
In almost all cases, however, cloud data lake, data warehouse and database providers use composite billing metrics (e.g., billing on a total-volume-of-storage, volume-of-data-scanned, per-query or per-successful-query basis) as distinct from the API-based metrics used with object storage.
A SQL query service may be a viable option, as well -- with caveats. Customers can use a SQL query service to query against and perform operations on data in object storage. A basic SQL query service should support most common data warehouse workloads. There are a few important limitations worth noting.
To start with the strengths, a basic SQL query service supports the ad hoc query use case. It can also run reports and power dashboards, visualizations and other analytics. Its API can likewise be called from a data pipeline or application workflow. Some cloud-based SQL query services also expose a semantic layer that expert users can employ to instantiate data modeling logic. This logic can support a wide range of business uses.
All of this is extremely helpful. In most cases, however, a SQL query service can’t enforce ACID guarantees at the level of the object storage layer (e.g., one or more concurrent users -- say, a routine ETL process -- could access and modify data even as other concurrent users are querying against it). For this reason, a SQL query service is best suited for uses in which workloads do not require strong consistency. It is unsuitable for workloads (such as financial reporting) that do. And, on balance, it will probably prove more difficult to manage large volumes of data using a scheme like this.
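The consistency problem is easy to reproduce in miniature. In this hypothetical sketch, a dictionary stands in for a bucket that holds two files whose values should always sum to 100, and an ETL job rewrites both files in between a reader's two reads. The function names and figures are assumptions for illustration only.

```python
# Hypothetical data set: two "files" whose values should always sum to 100.
bucket = {"part-0": 60, "part-1": 40}


def etl_rebalance(store):
    """Rewrite both files, moving 30 from part-0 to part-1."""
    store["part-0"] -= 30
    store["part-1"] += 30


def unsynchronized_scan(store, interleave=None):
    """Read the files one at a time, optionally letting a writer run in between."""
    total = store["part-0"]
    if interleave is not None:
        interleave(store)  # the ETL job lands between the reader's two reads
    total += store["part-1"]
    return total


quiet_total = unsynchronized_scan(dict(bucket))
racy_total = unsynchronized_scan(dict(bucket), interleave=etl_rebalance)
```

The undisturbed scan returns 100. The interleaved scan returns 130, because it combines the old version of one file with the new version of the other. A database prevents this with transaction isolation; plain files in object storage offer nothing equivalent.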
There is one well-known “gotcha,” however. This has to do with outbound data-transfer, or “data egress,” charges.
Cloud apps and services usually impose data egress charges. These would include SaaS (e.g., Salesforce) and PaaS (e.g., Snowflake) products as well as core cloud object storage and provider-specific data lake, data warehouse, or database-like services that run on top of it. So, on the one hand, it is usually free to move data between and among services that live in the same intra-cloud environment in the same region -- i.e., the AWS, Google Cloud Platform or Azure ecosystems. On the other hand, it can be costly to move data between separate regions (e.g., U.S. Northeast to E.U. West) in the same cloud infrastructure environment, or to transfer data from one provider’s cloud infrastructure environment to another’s (e.g., to move data from AWS S3 to Azure Blob storage).
To reduce these charges, the onus is on customers to minimize inter-cloud data movement and govern the amount of data that consumers transfer out of the cloud provider’s environment. A customer might address this by, say, deploying a data fabric scheme with a large, persistent on-premises cache.
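The egress rules described above reduce to a simple decision: same provider and region, same provider but different region, or different provider. The following sketch encodes that decision. The egress_cost helper and both per-GB rates are placeholder assumptions, since real rates vary by provider, route and volume tier.

```python
def egress_cost(gb_moved, src, dst, rate_per_gb):
    """Estimate a transfer charge in USD; src and dst are (provider, region) tuples."""
    if src == dst:
        return 0.0  # same provider and region: treated as free here
    if src[0] == dst[0]:
        return gb_moved * rate_per_gb["cross_region"]
    return gb_moved * rate_per_gb["cross_cloud"]


# ASSUMED placeholder rates in USD per GB; not quoted from any price list.
assumed_rates = {"cross_region": 0.02, "cross_cloud": 0.09}

same_region = egress_cost(500, ("aws", "us-east-1"), ("aws", "us-east-1"), assumed_rates)
cross_region = egress_cost(500, ("aws", "us-east-1"), ("aws", "eu-west-1"), assumed_rates)
cross_cloud = egress_cost(500, ("aws", "us-east-1"), ("azure", "westeurope"), assumed_rates)
```

Running the same 500-GB transfer through all three cases makes it plain why keeping consumers close to the data is the main lever for controlling these charges.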
Ultimately, most data in the cloud is stored in object storage. All cloud data lake, data warehouse and database services depend on cloud object storage as an enabling substrate. Enterprises, too, depend on object storage, in both the cloud and on-premises contexts, to host a growing share of their data. It makes sense: Object storage is a cost-effective means of storing arbitrary data structures.
But object storage is not in any sense a panacea: If it is useful for storing files, it is less useful for managing data. On its own, object storage does not expose APIs for creating, reading, updating or deleting records, as a database would. Instead, it puts everything into files (such as Apache Avro or Parquet) that are optimized for different types of data storage uses. The upshot is that even though you can put data into a cloud-based object storage service, you should not expect to use that service’s built-in facilities to manage this data. A separate data lake, data warehouse or database is a better option.
That said, cloud object storage is a great place to store archival data -- for example, “cold” or infrequently accessed data or different types of backup archives. (In fact, most hyperscale cloud providers offer a separate, cheaper object storage service for this purpose.)
Customers can use SQL query services, such as Amazon Athena or Google BigQuery, to query archival data in situ. Alternatively, most databases now permit customers to mount data in object storage as one or more external tables.
A final caveat, however: Depending on how you architect your cloud data management and analytics services, some architectural schemes could prove to be cost prohibitive. Consider, for example, Amazon’s duo of SQL query services: Athena and Redshift Spectrum. The former permits consumers to query against data in S3 object storage using pooled (shared) cloud infrastructure resources. The latter is an S3 SQL query facility powered by Amazon’s Redshift massively parallel processing data warehouse. In both cases, Amazon charges customers for the amount of data each service has to scan in order to process the query. If a customer stores its data in large Parquet files, and if a data set is distributed across multiple Parquet files, Athena or Redshift Spectrum must scan several gigabytes of data. For this reason, Amazon recommends using smaller, columnar-optimized files.
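Pay-per-scan pricing can be estimated the same way as any other metered charge. The sketch below compares a full scan of a data set with a scan that reads only the columns a query needs. The scan_cost helper, the price per terabyte and the shape of the data set are all assumptions for illustration.

```python
TB = 1024 ** 4
GB = 1024 ** 3


def scan_cost(bytes_scanned, price_per_tb):
    """Estimate a pay-per-scan query charge in USD."""
    return (bytes_scanned / TB) * price_per_tb


# ASSUMED price in USD per TB scanned; check the provider's current price list.
assumed_price = 5.00

# Hypothetical data set: 40 files of 1 GB each, with 20 equally sized columns.
full_scan = scan_cost(40 * GB, assumed_price)

# A columnar reader that touches only 2 of the 20 columns scans about a tenth.
pruned_scan = scan_cost(40 * GB * 2 // 20, assumed_price)
```

A single query costs little either way. The tenfold difference matters once the same query runs thousands of times a day against a much larger data set.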