When Is It OK to Drop ACID Database Safeguards?

In data management, the term ACID seems to have the power of Holy Writ. ACID database safeguards comprise the four properties that are said to guarantee the consistency of a database that performs operations on data:

Atomicity. Each transaction is an atomic unit--that is, indivisible; the database treats its constituent operations as a single unit of work. Each transaction either succeeds or fails completely. If it succeeds completely, the database records the transaction; however, if even one of its constituent operations fails (for example, the database crashes, power fails, the network splits, or the operation violates a rule or constraint), the transaction does not get recorded.
Consistency. Each transaction takes the database from one valid state to another. That is, the database commits the results of a transaction only if they satisfy all of its defined rules and constraints, such as uniqueness and referential integrity. If even one operation violates a rule, the database rejects the transaction and its state remains unchanged. Once the transaction successfully completes, the database commits any changes and updates its state.
Isolation. A database processes more than one transaction at once. It’s possible for concurrent transactions to conflict with one another--for example, by instructing the database to update data in the same table, row and column at the same time. If only one of these updates gets recorded, the database loses data. An ACID-compliant database implements concurrency control logic to isolate transactions, enforce serializability, and safeguard against conflict, data loss, and so on.
Durability. An ACID-compliant database does not lose committed data: once it is committed, a transaction stays committed, even if the system subsequently crashes or loses power. To this end, the database does not record transactions in a volatile context (for example, physical memory), but, rather, in a non-volatile context, such as disk. The database maintains a persistent log of all transactions to permit the recovery of its state.
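
To make atomicity and consistency concrete, here is a minimal sketch that uses Python's built-in sqlite3 module. The accounts table, the names in it, and the transfer helper are hypothetical, invented for illustration. A CHECK constraint stands in for a database rule; when the second update violates it, the database rolls back the first update as well.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a rule: balances may not go negative."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        # The connection context manager commits if the block succeeds and
        # rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit above
            # is undone too.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

In this sketch, an overdraft attempt such as transfer(conn, "alice", "bob", 500) returns False and leaves both balances exactly as they were, even though the credit to bob had already been applied inside the transaction.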

Thus, ACID. But why do you need ACID-like safeguards? More precisely, when do you need them?

Why you need it: ACID is version control for data.

We need ACID because we require databases to schedule and process multiple transactions at the same time. A transaction consists of one or more logical operations, each of which involves reading or making changes to data. If a database were to host just one user and perform just one read or change operation at a time, and if these operations could not violate database rules, ACID would not matter.

Think of ACID as analogous to a version control system (VCS) for data. Like a database, a VCS supports multiple, concurrent users. It enables concurrent users to access and read the same baseline code (or one of its branches) at the same time; it manages changes (that is, “commits”) to branches, merges these changes into the baseline code and performs other transaction-like functions. In this capacity, it must arbitrate between and among different kinds of conflicts, such as concurrent commits.

A system that enforces strict ACID database safeguards is said to achieve strong consistency. A system that relaxes or does not enforce these safeguards is said to achieve weak or eventual consistency.

Just because a database is ACID-compliant does not mean it strictly enforces ACID safeguards. Databases usually employ locking mechanisms to enforce isolation. For example, if a transaction needs to update a row, the database first locks that row to prevent access by other concurrent users. It does this to forestall different kinds of consistency issues, such as dirty and non-repeatable reads. Sometimes, however, DBAs will relax a database’s transaction isolation levels to improve performance, especially for concurrent users. In such cases, other problems, such as duplicate rows, can occur.
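
The kind of conflict that isolation guards against can be sketched in a few lines of plain Python. This is a simulation, not a real database: the dictionary stands in for a single row, and the two function names are hypothetical. In the first function, two transactions both read the balance before either writes, so the second write silently overwrites the first (a lost update). In the second, each transaction finishes before the next begins, which is the effect a row lock has.

```python
def interleaved_deposits(start, first, second):
    """Two deposits with no isolation: both read before either writes."""
    row = {"balance": start}
    seen_by_a = row["balance"]           # transaction A reads
    seen_by_b = row["balance"]           # transaction B reads the same value
    row["balance"] = seen_by_a + first   # A writes its result
    row["balance"] = seen_by_b + second  # B overwrites A's write
    return row["balance"]


def serialized_deposits(start, first, second):
    """The same deposits, run one complete transaction at a time."""
    row = {"balance": start}
    for amount in (first, second):
        current = row["balance"]          # read
        row["balance"] = current + amount  # write before anyone else reads
    return row["balance"]
```

Starting from 100 and depositing 10 and 20, the interleaved version ends at 120 (the deposit of 10 is lost), while the serialized version ends at 130.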

When you need it: ACID database safeguards are necessary if you cannot afford to lose data.

An ACID-compliant database is essential for usage scenarios that require strong consistency. For example, if a use case requires concurrent users to read or make changes to data at the same time, ACID database safeguards prevent data loss and ensure that the database returns consistent, correct results.

Think of it this way: If you eschew in-database ACID safeguards, you must either roll your own ACID enforcement mechanisms or accept data loss as inevitable. This means, in effect, that you have a choice among (a) building ACIDic logic into your application code; (b) designing and maintaining your own ACID-compliant database; or (c) delegating this task to a third-party database. Unless you are Linus Torvalds and have a set of utterly unique requirements, (c) is the most pragmatic approach.

Is it ever OK to drop ACID database safeguards?

So, when is it OK to drop ACID database safeguards? Is it ever OK? Try asking yourself the following questions:

1. Does your usage scenario require that changes take place in a definite (for example, serialized) order?

2. If n consumers read the same data at the same time, do they need to get the same results?

3. Is replicability a requirement? That is, if you replay the changes, should you get the same results?

Most important:

4. Is it OK to lose data? To record duplicate data? If so, under what conditions?
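
Questions 1 and 3 come down to whether your changes commute. A minimal sketch, assuming a hypothetical change log of ("add", n) and ("mul", n) operations: if reordering the log changes the final state, you need a definite order, and replaying the log only reproduces the result if that order is preserved.

```python
def replay(initial, log):
    """Apply each (operation, value) pair in the order given."""
    state = initial
    for op, value in log:
        if op == "add":
            state += value
        elif op == "mul":
            state *= value
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


def order_matters(initial, log):
    """True if replaying the log in reverse order gives a different state."""
    return replay(initial, log) != replay(initial, list(reversed(log)))
```

With an initial state of 5, the log [("add", 10), ("mul", 2)] yields 30 in order and 20 in reverse, so order matters. A log of additions alone yields the same state either way, which is one sign that a use case may tolerate weaker ordering guarantees.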

Not all usage scenarios require strong consistency. For example, if your use case involves ingesting streaming data from sensors in support of an ML engineering practice, you probably are OK with eventual consistency. That is, you are OK with losing at least a portion of relevant data; your concern is to capture a representative sample of data that your ML engineers can use to train their models.

The same is true of something like next-best-action analysis in ecommerce: The analytics that power this usage scenario expect to work against the clickstream data generated by consumers as they browse. Ideally, a retailer would capture all of the data that pertains to the consumer’s browsing experience; in practice, however, it is probably OK if some data gets lost, overwritten, or recorded more than once.

Why would you want to drop ACID database safeguards?

Even though it confers important benefits, strict ACID compliance imposes costs, too.

ACID’s isolation property can constrain performance in usage scenarios that involve a large number of concurrent users. To wit: When the database goes to update the data in a page, it locks that page. (Database pages are analogous to the pages in a book, database rows to individual sentences.) The upshot is that other users cannot access the locked page until the transaction completes. Relaxing a database’s isolation level is one way to improve performance with a large number of concurrent users.
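
The cost of locking is easy to observe with Python's built-in sqlite3 module. The sketch below rests on some assumptions: SQLite locks the whole database file for writes rather than a single page, so it exaggerates the effect described above, and the file name, table, and 0.1-second timeout are arbitrary choices for illustration. While the first connection holds an open write transaction, the second cannot begin one; it waits for the timeout and then gives up.

```python
import os
import sqlite3
import tempfile


def lock_contention_demo():
    """Return (second_writer_was_blocked, final_counter_value)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")  # hypothetical file name

        setup = sqlite3.connect(path)
        setup.execute("CREATE TABLE counters (id INTEGER PRIMARY KEY, n INTEGER)")
        setup.execute("INSERT INTO counters VALUES (1, 0)")
        setup.commit()
        setup.close()

        # isolation_level=None means we issue BEGIN and COMMIT ourselves.
        first = sqlite3.connect(path, isolation_level=None)
        second = sqlite3.connect(path, timeout=0.1, isolation_level=None)
        try:
            # The first writer takes the write lock and holds it.
            first.execute("BEGIN IMMEDIATE")
            first.execute("UPDATE counters SET n = n + 1 WHERE id = 1")

            # The second writer cannot get the lock within its timeout.
            try:
                second.execute("BEGIN IMMEDIATE")
                second.execute("ROLLBACK")
                blocked = False
            except sqlite3.OperationalError:
                blocked = True

            # Once the first transaction completes, the second can proceed.
            first.execute("COMMIT")
            second.execute("BEGIN IMMEDIATE")
            second.execute("UPDATE counters SET n = n + 1 WHERE id = 1")
            second.execute("COMMIT")
            final = second.execute(
                "SELECT n FROM counters WHERE id = 1"
            ).fetchone()[0]
        finally:
            first.close()
            second.close()
    return blocked, final
```

Both updates are eventually recorded, so nothing is lost; the price is that the second writer had to wait and retry. With many concurrent users, that waiting is what constrains throughput.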

If your use case is tolerant of lost updates, dirty reads, duplicate data, etc., this may be a viable option.

This gets at a problem that is often elided by the distinction between strong and eventual consistency: How much consistency do you actually need? Strong consistency equals absolute consistency--that is, the database state is always absolutely consistent. That said, strong consistency comes at a cost, especially vis-à-vis scalability. After all, the easiest way to scale a database is by distributing it--by breaking up its data set or its data-processing workload--and distributing the pieces across two or more clustered nodes. The problem is that it is extremely difficult to achieve ACID-like guarantees in a distributed architecture. It is not impossible: a massively parallel processing (MPP) relational database management system (RDBMS), for example, can enforce strict ACID safeguards. However, this comes at a comparative cost premium: It costs more to buy or license, and, in certain usage scenarios, it can prove prohibitively costly to scale.

BASE as a scalable, albeit potentially lossy, alternative to ACID

For some usage scenarios, databases that follow the BASE model (basically available, soft state, eventual consistency) scale better and approach the reliability achieved by databases that enforce strict ACID guarantees. However, eventual consistency does not equal absolute consistency: A BASE database can lose data; dirty and non-repeatable reads can occur; concurrent operations that attempt to read data from a database page as it is being updated can return duplicate or non-existent data.
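
As a rough illustration of what "eventual" means, and of how data can be lost along the way, here is a toy simulation of two replicas that reconcile with a last-write-wins rule. Everything in it is an assumption made for the sketch: the Replica class, the sync function, and the caller-supplied integer timestamps. Real BASE databases use a variety of reconciliation strategies, and this is only one of them.

```python
class Replica:
    """A toy replica that stores key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        """Keep the write only if it is newer than what we already hold."""
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions; the newest timestamp wins."""
    for src, dst in ((a, b), (b, a)):
        for key, (timestamp, value) in list(src.data.items()):
            dst.write(key, value, timestamp)


def demo():
    """Return what each replica reads before and after synchronization."""
    east, west = Replica(), Replica()
    east.write("cart", "book", timestamp=1)  # accepted locally by east
    west.write("cart", "pen", timestamp=2)   # accepted locally by west
    before = (east.read("cart"), west.read("cart"))
    sync(east, west)
    after = (east.read("cart"), west.read("cart"))
    return before, after
```

Before synchronization, the two replicas return different answers for the same key. Afterward they agree, but the earlier write ("book") is gone: the system converged by discarding it.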

Thanks to Teradata fellow Mark Madsen for suggesting the data-versus-code-versioning analogy. Also, thanks to Kyle Kingsbury for his work with Jepsen.io. Jepsen is an invaluable resource, and it is free.
