Concept of Transaction
A transaction is a way for an application to group several reads and writes together into a logical unit. Conceptually, all the reads and writes in a transaction are executed as one operation: either the entire transaction succeeds (commit) or it fails (abort, rollback).
Not every application needs transactions, and sometimes there are advantages to weakening transactional guarantees or abandoning them entirely (for example, to achieve higher performance or higher availability).
ACID
Atomicity
For a group of operations: if no fault occurs, all operations execute and the state changes (commit); otherwise, none of them take effect and the original state is kept (abort).
This is different from atomicity in multi-threaded programming.
In multi-threaded programming, if one thread executes an atomic operation, that means there is no way that another thread could see the half-finished result of the operation. The system can only be in the state it was before the operation or after the operation, not something in between.
ACID atomicity, however, does not describe what happens if several processes try to access the same data at the same time (that is the job of isolation).
Consistency
The idea of ACID consistency is that you have certain statements about your data (invariants) that must always be true. If a transaction starts with a database that is valid according to these invariants, and any writes during the transaction preserve that validity, then you can be sure the invariants are always satisfied.
This idea of consistency depends on the application’s notion of invariants, and it’s the application’s responsibility to define its transactions correctly so that they preserve consistency. This is not something that the database can guarantee.
This consistency is not the same as consistency in distributed systems (as in CAP). ACID consistency is a user-defined rule, while CAP consistency means that a request to any node gets the same response.
Isolation
Reads and writes in one transaction are not affected by other concurrently running transactions.
For details, please refer to “Concurrent Operations & Isolation Levels” section.
Possible Problems

- Dirty Read → A Dirty read is a situation when a transaction reads data that has not yet been committed.
- Non Repeatable read → Non Repeatable read occurs when a transaction reads the same row twice and gets a different value each time.
- Phantom Read → Phantom Read occurs when two same queries are executed, but the rows retrieved by the two, are different.
The difference between non-repeatable read and phantom read: a non-repeatable read focuses on a particular row and the value returned for that row; a phantom read focuses on the same query returning a different set of rows.
Isolation levels
Levels from loose to strict:
- Read Uncommitted (no isolation) → In this level, one transaction may read not-yet-committed changes made by other transactions, thereby allowing dirty reads.
- Read Committed → This isolation level guarantees that any data read was committed at the moment it is read, so it does not allow dirty reads. In a lock-based implementation, the transaction holds a read or write lock on the current row, preventing other transactions from reading, updating, or deleting it.
- Repeatable Read → The transaction holds read locks on all rows it references and write locks on rows it updates or deletes. Since other transactions cannot read, update, or delete these rows, non-repeatable reads are avoided.
- Serializable (as if transactions ran one at a time) → A serializable execution is one in which concurrently executing transactions appear to be executing serially, one after another.
Durability
Durability is the promise that once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or the database crashes.
- single-node → write data to disk, or to a write-ahead log (WAL), before reporting success, so it can be recovered after a crash.
- multi-node → a commit is only reported once the data has been successfully copied to some number of nodes.
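The single-node case can be sketched as a minimal write-ahead log: append the transaction's writes and fsync before acknowledging the commit, then replay the log on recovery. (An illustrative sketch; the JSON-lines layout and helper names are made up, not how any particular database stores its WAL.)

```python
import json
import os

def commit(log_path, writes):
    # Append the transaction's writes to the log, then fsync: only after
    # the data is durably on disk do we report the commit as successful.
    with open(log_path, "a") as f:
        f.write(json.dumps(writes) + "\n")
        f.flush()
        os.fsync(f.fileno())

def recover(log_path):
    # Replay the log from the beginning to rebuild the committed state.
    state = {}
    if os.path.exists(log_path):
        with open(log_path) as f:
            for line in f:
                state.update(json.loads(line))
    return state
```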
BASE
Systems that do not meet the ACID criteria are sometimes called BASE, which stands for Basically Available, Soft state, and Eventual consistency.
Single-Object & Multi-Object Operations
single-object writes
A single-object write means multiple threads may access the same value (row, document, etc.). Atomicity can be implemented using a log for crash recovery, and isolation can be implemented using a lock on each object.
Atomicity for a single object may seem hard to understand because only one operation happens. Imagine writing a very large data chunk to disk: there should be only two possible outcomes: (1) all the data is on disk; (2) none of it is.
Some databases also provide more complex atomic operations, such as an increment operation, which removes the need for a read-modify-write cycle. Similarly popular is a compare-and-set operation, which allows a write to happen only if the value has not been concurrently changed by someone else. We will cover this later.
These single-object operations are useful, as they can prevent lost updates when several clients try to write to the same object concurrently. However, they are not transactions in the usual sense of the word, because a transaction is usually understood to group operations on multiple objects.
multi-object transactions
Some distributed databases abandon multi-object transactions because they are difficult to implement across partitions and can hurt performance and availability.
But there are some situations that still require multi-object transactions, for example:
- foreign key constraints
- secondary indexes
Handling errors and aborts
ACID databases are based on this philosophy: if the database is in danger of violating its guarantee of atomicity, isolation, or durability, it would rather abandon the transaction entirely than allow it to remain half-finished.
Not all systems follow that philosophy, though. In particular, datastores with leaderless replication work much more on a “best effort” basis, which could be summarized as “the database will do as much as it can, and if it runs into an error, it won’t undo something it has already done”—so it’s the application’s responsibility to recover from errors.
Retrying an aborted transaction is a simple and effective error handling mechanism, but it isn't perfect:
- If the transaction succeeds but the network fails while delivering the response to the client, a retry may execute the same transaction twice.
- If the error is due to overload, retrying will make the problem worse.
- It's only worth retrying after transient errors, such as a deadlock or a temporary network interruption. For permanent errors such as a constraint violation, retrying is pointless.
- If the transaction has side effects outside the database (e.g., sending an email), those side effects may have happened even if the transaction is aborted.
- If the client process fails while retrying, any data it was trying to write is lost.
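A retry loop that honors these caveats might look like the following sketch. `TransientError` and the backoff parameters are assumptions for illustration, not a real driver API:

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for retryable errors: deadlock, brief network outage."""

def run_with_retry(txn, max_retries=3):
    # Retry only transient errors; permanent errors (e.g., constraint
    # violations) propagate immediately, since retrying them is pointless.
    for attempt in range(max_retries):
        try:
            return txn()
        except TransientError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter, so retries under overload
            # don't all hammer the database at the same moment.
            time.sleep((2 ** attempt) * 0.05 * random.random())
```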
Weak Isolation Levels
Read Committed
Dirty Read
No dirty read → When reading from the database, you will only see data that has been committed.
Why do we need to avoid dirty reads?
- If a transaction needs to update several objects, a dirty read means that another transaction may see some of the updates but not others.
For example, the user sees the new unread email but not the updated counter: a dirty read of the email. Seeing the database in a partially updated state is confusing to users and may cause other transactions to make incorrect decisions.
- If a transaction aborts, any writes it has made need to be rolled back. If the database allows dirty reads, that means a transaction may see data that is later rolled back.
Dirty Write
No dirty write → When writing to the database, you will only overwrite data that has been committed.
We normally assume that the later write overwrites the earlier write. However, what happens if the earlier write is part of a transaction that has not yet committed, so the later write overwrites an uncommitted value? This is called a dirty write.
Implement Read Committed
Most commonly, databases prevent dirty writes by using row-level locks: when a transaction wants to modify a particular object (row or document), it must first acquire a lock on that object. It must then hold that lock until the transaction is committed or aborted. Only one transaction can hold the lock for any given object; if another transaction wants to write to the same object, it must wait until the first transaction is committed or aborted before it can acquire the lock and continue.
What about preventing dirty reads?
- One option is to make readers acquire the same lock, but it's not a good idea: one long-running write transaction can force many read-only transactions to wait until the long-running transaction has completed. This harms the response time of read-only transactions and is bad for operability.
- Another option is keeping both committed and uncommitted values. For every object that is written, the database remembers both the old committed value and the new value set by the transaction that currently holds the write lock. While the transaction is ongoing, any other transactions that read the object are simply given the old value. Only when the new value is committed do transactions switch over to reading the new value.
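The second option can be sketched as follows. The class and its structure are hypothetical, just to show the "old committed value vs. new uncommitted value" bookkeeping:

```python
import threading

class ReadCommittedStore:
    """Sketch of read committed: writers take a per-object lock and stash
    their uncommitted value; readers are always given the committed value."""
    def __init__(self):
        self._committed = {}    # object -> last committed value
        self._uncommitted = {}  # object -> (txn_id, value) of the lock holder
        self._locks = {}        # object -> row-level write lock

    def read(self, key):
        # No lock needed: readers simply see the old committed value,
        # so no dirty read is possible.
        return self._committed.get(key)

    def write(self, txn_id, key, value):
        lock = self._locks.setdefault(key, threading.Lock())
        lock.acquire()  # held until commit, preventing dirty writes
        self._uncommitted[key] = (txn_id, value)

    def commit(self, txn_id):
        for key, (tid, value) in list(self._uncommitted.items()):
            if tid == txn_id:
                self._committed[key] = value  # readers now switch over
                del self._uncommitted[key]
                self._locks[key].release()
```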
Snapshot Isolation and Repeatable Read

If a transaction reads the same row twice and sees different committed values, because another transaction committed a change in between, this is called a non-repeatable read.
Snapshot isolation is the most common solution to this problem. The idea is that each transaction reads from a consistent snapshot of the database—that is, the transaction sees all the data that was committed in the database at the start of the transaction. Even if the data is subsequently changed by another transaction, each transaction sees only the old data from that particular point in time.
Implement Snapshot Isolation (MVCC)
Like read committed isolation, implementations of snapshot isolation typically use write locks to prevent dirty writes.
However, reads do not require any locks. From a performance point of view, a key principle of snapshot isolation is readers never block writers, and writers never block readers. This allows a database to handle long-running read queries on a consistent snapshot at the same time as processing writes normally, without any lock contention between the two.
The database must potentially keep several different committed versions of an object, because various in-progress transactions may need to see the state of the database at different points in time. Because it maintains several versions of an object side by side, this technique is known as multiversion concurrency control (MVCC).

Each row in a table has a created_by field, containing the ID of the transaction that inserted this row into the table.
Moreover, each row has a deleted_by field, which is initially empty. If a transaction deletes a row, the row isn’t actually deleted from the database, but it is marked for deletion by setting the deleted_by field to the ID of the transaction that requested the deletion. At some later time, when it is certain that no transaction can any longer access the deleted data, a garbage collection process in the database removes any rows marked for deletion and frees their space.
An update is internally translated into a delete and a create.
Visibility rules
When a transaction reads from the database, transaction IDs are used to decide which objects it can see and which are invisible.
- At the start of each transaction, the database makes a list of all the other transactions that are in progress (not yet committed or aborted) at that time. Any writes that those transactions have made are ignored, even if the transactions subsequently commit.
- Any writes made by aborted transactions are ignored.
- Any writes made by transactions with a later transaction ID (i.e., which started after the current transaction started) are ignored, regardless of whether those transactions have committed.
- All other writes are visible to the application’s queries.
In other words, an object is visible if both of the following conditions are true:
- At the time when the reader’s transaction started, the transaction that created the object had already committed.
- The object is not marked for deletion, or if it is, the transaction that requested deletion had not yet committed at the time when the reader’s transaction started.
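The visibility rules above can be sketched as a single predicate. The row fields `created_by`/`deleted_by` come from the text; the way the reader's snapshot is represented here is an assumption:

```python
def visible(row, reader):
    """Decide whether one MVCC row version is visible to a reader.

    `row` carries created_by / deleted_by transaction IDs (deleted_by may
    be None). `reader` carries its own txid plus the sets of transactions
    that were in progress or aborted when it started (an illustrative
    snapshot representation, not a real database's format).
    """
    def committed_before_reader(txid):
        # Counts only if that transaction had already committed when the
        # reader started: not in progress, not aborted, not started later.
        return (txid is not None
                and txid not in reader["in_progress"]
                and txid not in reader["aborted"]
                and txid <= reader["txid"])

    if not committed_before_reader(row["created_by"]):
        return False  # creator hadn't committed yet (or aborted)
    if committed_before_reader(row["deleted_by"]):
        return False  # deletion had already committed, so the row is gone
    return True
```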
Indexes
- The index simply points to all versions of an object, and the query must filter out any object versions that are not visible to the current transaction. When garbage collection removes old object versions that are no longer visible to any transaction, the corresponding index entries can also be removed.
- Use an append-only/copy-on-write variant that does not overwrite pages of the tree when they are updated, but instead creates a new copy of each modified page. Parent pages, up to the root of the tree, are copied and updated to point to the new versions of their child pages.
Preventing Lost Updates
There are several interesting kinds of conflicts that can occur between concurrently writing transactions. The best known of these is the lost update problem.

The lost update problem can occur if an application reads some value from the database, modifies it, and writes back the modified value (a read-modify-write cycle). If two transactions do this concurrently, one of the modifications can be lost, because the second write does not include the first modification.
Atomic write operations
Many databases provide atomic update operations, which remove the need to implement read-modify-write cycles in application code.
This means each new operation only starts after the previous operation has finished.
Atomic operations are usually implemented by taking an exclusive lock on the object when it is read so that no other transaction can read it until the update has been applied. Another option is to simply force all atomic operations to be executed on a single thread.
Unfortunately, object-relational mapping frameworks make it easy to accidentally write code that performs unsafe read-modify-write cycles instead of using atomic operations provided by the database.
Although databases provide thread-safe SQL commands, an application can still use the database in a way that violates atomicity :(.

```python
# thread-safe: the increment happens atomically inside the database
MyDB.execute("UPDATE counters SET value = value + 1 WHERE key = 'foo';")

# violates atomicity: an unsafe read-modify-write cycle in application code
value = MyDB.execute("SELECT value FROM counters WHERE key = 'foo';")
new_value = value + 1
MyDB.execute("UPDATE counters SET value = ? WHERE key = 'foo';", (new_value,))
```
Explicit locking
Add a lock in your own application code:

```python
with resource.get_lock() as r:
    ...  # do the read-modify-write cycle while holding the lock
```
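One concrete way such a `resource.get_lock()` might be implemented (the `Resource` class is hypothetical, built on a plain mutex):

```python
import threading

class Resource:
    """Hypothetical resource that guards its value with an explicit lock."""
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def get_lock(self):
        # threading.Lock is a context manager, so it works with `with`.
        return self._lock

resource = Resource()
with resource.get_lock():
    # The whole read-modify-write cycle happens under the lock, so a
    # concurrent writer cannot sneak in between the read and the write.
    resource.value = resource.value + 1
```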
Automatically detecting lost updates
Transactions are allowed to run in parallel; if the transaction manager detects a lost update, it aborts the offending transaction and forces it to retry its read-modify-write cycle.
Compare and Set
A compare-and-set operation avoids lost updates by allowing an update to happen only if the value has not changed since you last read it. If the current value does not match what you previously read, the update has no effect, and the read-modify-write cycle must be retried.
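A minimal in-memory sketch of compare-and-set and the retried read-modify-write cycle (illustrative only; in a real database this would typically be a conditional UPDATE):

```python
import threading

class Register:
    """A single value supporting an atomic compare-and-set (sketch)."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def compare_and_set(self, expected, new):
        # Succeeds only if the value is still what the caller last read.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True

def safe_increment(reg):
    while True:  # retry the read-modify-write cycle on CAS failure
        old = reg.read()
        if reg.compare_and_set(old, old + 1):
            return old + 1
```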
Conflict resolution and replication
Locks and compare-and-set operations assume that there is a single up-to-date copy of the data. However, databases with multi-leader or leaderless replication usually allow several writes to happen concurrently and replicate them asynchronously, so they cannot guarantee that there is a single up-to-date copy of the data. Thus, techniques based on locks or compare-and-set do not apply in this context.
Solution → allow concurrent writes to create several conflicting versions of a value (also known as siblings), and to use application code or special data structures to resolve and merge these versions after the fact.
Write Skew and Phantoms
Write Skew
It is neither a dirty write nor a lost update, because the two transactions are updating two different objects.
You can think of write skew as a generalization of the lost update problem. Write skew can occur if two transactions read the same objects, and then update some of those objects (different transactions may update different objects). In the special case where different transactions update the same object, you get a dirty write or lost update anomaly.
Phantoms
A write in one transaction changes the result of a search query in another transaction, is called a phantom. Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions, phantoms can lead to particularly tricky cases of write skew.
The general steps that cause phantoms are:
- A SELECT query checks whether some requirement is satisfied by searching for rows that match some search condition.
- Depending on the result of the first query, the application code decides how to continue.
- If the application decides to go ahead, it makes a write (INSERT, UPDATE, or DELETE) to the database and commits the transaction.
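The three steps can be sketched with a meeting-room booking example (the schema and names are made up for illustration; SQLite here just demonstrates the check-then-write pattern, not snapshot isolation itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (room_id INTEGER, start_t INTEGER, end_t INTEGER)")

def book_room(conn, room_id, start_t, end_t):
    # Step 1: a SELECT checks the requirement (no overlapping booking).
    overlaps = conn.execute(
        "SELECT COUNT(*) FROM bookings "
        "WHERE room_id = ? AND start_t < ? AND end_t > ?",
        (room_id, end_t, start_t)).fetchone()[0]
    # Step 2: application code decides how to continue.
    if overlaps:
        return False
    # Step 3: go ahead, write, and commit. Under snapshot isolation, two
    # concurrent transactions can both pass step 1 (neither sees the
    # other's INSERT: a phantom) and double-book the room.
    conn.execute("INSERT INTO bookings VALUES (?, ?, ?)",
                 (room_id, start_t, end_t))
    conn.commit()
    return True
```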
Materializing conflicts
Materializing conflicts → takes a phantom and turns it into a lock conflict on a concrete set of rows that exist in the database.
It can be hard and error-prone to figure out how to materialize conflicts, and it’s ugly to let a concurrency control mechanism leak into the application data model.
Serializability
Serializable isolation is usually regarded as the strongest isolation level. It guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially, without any concurrency.
Most databases that provide serializability today use one of three techniques:
- actual serial execution
- two-phase locking
- optimistic concurrency control (serializable snapshot isolation)
Actual Serial Execution
The simplest way of avoiding concurrency problems is to remove the concurrency entirely: to execute only one transaction at a time, in serial order, on a single thread.
It seems very straightforward, so why has this approach only become popular recently?
- RAM became cheap enough that, for many use cases, it is now feasible to keep the entire active dataset in memory.
- OLTP transactions are usually short and only make a small number of reads and writes. Long-running analytic queries are typically read-only, so they can run on a consistent snapshot outside the serial execution loop.
A system designed for single-threaded execution can sometimes perform better than a system that supports concurrency, because it can avoid the coordination overhead of locking.
However, its throughput is limited to that of a single CPU core. In order to make the most of that single thread, transactions need to be structured differently from their traditional form.
Encapsulating transactions in stored procedures
If a database transaction needs to wait for input from a user, the database needs to support a potentially huge number of concurrent transactions, most of them idle. Most databases cannot do that efficiently, and so almost all OLTP applications keep transactions short by avoiding interactively waiting for a user within a transaction.
In this interactive style of transaction, a lot of time is spent in network communication between the application and the database. If you were to disallow concurrency in the database and only process one transaction at a time, the throughput would be dreadful because the database would spend most of its time waiting for the application to issue the next query for the current transaction.

Systems with single-threaded serial transaction processing don’t allow interactive multi-statement transactions. Instead, the application must submit the entire transaction code to the database ahead of time, as a stored procedure.
The normal process runs multiple queries one by one, like typing commands into a console; a stored procedure packs multiple queries together and sends the batch over the network in one go, like running a script.
Pros and cons of stored procedures
Some cons:
- Each database vendor has its own language for stored procedures
- Hard to debug, monitor, test, version control, etc
- A badly written stored procedure in the database can cause much more trouble than bad code in an application, because one database may be shared by several applications
Some pros:
- With stored procedures and in-memory data, executing all transactions on a single thread becomes feasible. As they don’t need to wait for I/O and they avoid the overhead of other concurrency control mechanisms, they can achieve quite good throughput on a single thread.
- Some databases use stored procedures for replication: instead of copying a transaction’s writes from one node to another, they execute the same stored procedure on each replica (no need to transfer the data, just transfer the script for how to produce it)
Partitioning
Executing all transactions serially makes concurrency control much simpler, but limits the transaction throughput of the database to the speed of a single CPU core on a single machine. Read-only transactions may execute elsewhere, using snapshot isolation, but for applications with high write throughput, the single-threaded transaction processor can become a serious bottleneck.
In order to scale to multiple CPU cores, and multiple nodes, you can potentially partition your data.
- If you can find a way of partitioning your dataset so that each transaction only needs to read and write data within a single partition, then each partition can have its own transaction processing thread running independently from the others. In this case, you can give each CPU core its own partition, which allows your transaction throughput to scale linearly with the number of CPU cores.
- However, for any transaction that needs to access multiple partitions, the database must coordinate the transaction across all the partitions that it touches. The stored procedure needs to be performed in lock-step across all partitions to ensure serializability across the whole system.
Summary
Serial execution of transactions has become a viable way of achieving serializable isolation within certain constraints:
- Every transaction must be small and fast, because there is only one thread and a slow transaction stalls everything behind it.
- It is limited to use cases where the active dataset fits in memory, so no transaction has to wait for data to be loaded from disk.
- Write throughput must be low enough to be handled on a single CPU core.
- Cross-partition transactions must be rare.
Two-Phase Locking (2PL)
NOTE: TWO-PHASE LOCKING IS DIFFERENT FROM TWO-PHASE COMMIT!!!
Several transactions are allowed to concurrently read the same object as long as nobody is writing to it. But as soon as anyone wants to write (modify or delete) an object, exclusive access is required:
- If transaction A has read an object and transaction B wants to write to that object, B must wait until A commits or aborts before it can continue.
- If transaction A has written an object and transaction B wants to read that object, B must wait until A commits or aborts before it can continue.
Implementation
There are 2 lock types: shared lock (read lock) and exclusive lock (write lock).
- Readers → acquire and hold a shared lock; multiple readers can share the lock on the same object. If the object already has an exclusive lock, readers must wait.
- Writers → acquire and hold an exclusive lock. An object can only have one exclusive lock, so if there is any existing lock on the object, the writing transaction must wait.
- Lock Upgrade → If a transaction first reads and then writes an object, it may upgrade its shared lock to an exclusive lock.
- After a transaction has acquired the lock, it must continue to hold the lock until the end of the transaction (commit or abort).
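A minimal sketch of the shared/exclusive lock itself (no lock upgrade, deadlock detection, or fairness; the API is my own):

```python
import threading

class SharedExclusiveLock:
    """Readers share the lock; a writer needs exclusive access."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self, blocking=True):
        with self._cond:
            while self._writer:       # wait for the exclusive holder
                if not blocking:
                    return False
                self._cond.wait()
            self._readers += 1
            return True

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_exclusive(self, blocking=True):
        with self._cond:
            while self._writer or self._readers:  # any lock blocks a writer
                if not blocking:
                    return False
                self._cond.wait()
            self._writer = True
            return True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```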
One of the problems of 2PL is deadlock. The most common solution is to detect deadlocks between transactions and abort one of them to break the tie.
Performance
Why is 2PL not an ultimate solution? Transaction throughput and query response times are significantly worse under two-phase locking than under weak isolation:
- the overhead of acquiring and releasing all those locks;
- reduced concurrency: some transactions must wait for others to finish first;
- deadlocks: if we abort transactions to resolve them, the retries are costly.
Locks
There are several choices of lock, with different granularities.
Row-based locks
This lock locks particular rows. It has the finest granularity, but performance suffers because a transaction may have to hold a huge number of locks.
Condition-based locks (Predicate locks)
It works similarly to the shared/exclusive lock described earlier, but rather than belonging to a particular object (e.g., one row in a table), it belongs to all objects that match some search condition.
The key idea here is that a predicate lock applies even to objects that do not yet exist in the database, but which might be added in the future (phantoms). If two-phase locking includes predicate locks, the database prevents all forms of write skew and other race conditions, and so its isolation becomes serializable.
The problem of predicate locks is its performance: if there are many locks by active transactions, checking for matching locks becomes time-consuming.
Index-range locks
For example, if you have a predicate lock for bookings of room 123 between noon and 1 p.m., you can approximate it by locking bookings for room 123 at any time, or you can approximate it by locking all rooms (not just room 123) between noon and 1 p.m.
Imagine there are indexes on time and on room: an index-range lock locks an entire range of one index (all times for room 123, or all rooms from noon to 1 p.m.).
It’s not very precise, but since it has much lower overheads, it’s a good compromise.
Concurrency Control
There are two types of Concurrency Control policy:
- Pessimistic → if anything might possibly go wrong, it’s better to wait until the situation is safe again before doing anything.
- Optimistic → instead of blocking if something potentially dangerous happens, transactions continue anyway, in the hope that everything will turn out all right. When a transaction wants to commit, the database checks whether anything bad happened.
Serializable Snapshot Isolation (SSI) is one of the best-known optimistic algorithms. It is based on snapshots: every transaction runs on its own snapshot, and before committing, the database checks for conflicts.
Why wait until commit time? Why not abort a transaction immediately when the conflict is detected?
Because the other transaction might still abort (in which case there is no conflict after all), or the current transaction might turn out to be read-only (in which case no anomaly is possible).
We call the snapshot at the start of a transaction its premise. A typical transaction is read-modify-write, and the snapshot isolates it from changes made by other transactions, so the premise may be out of date by the time the transaction commits (for example, another transaction may have modified some of those values and committed first).
Any changes to the results of the read or query may invalidate the writes in the transaction. That is, the database must know that the writes in this transaction are based on an outdated premise, and then abort the transaction.
How does the database know whether the query results may have changed? There are two situations to consider.

When the transaction wants to commit, the database checks whether any of the ignored writes have now been committed. If so, the transaction must be aborted, because it was based on an outdated (changed) premise.
In this example, Tx 42 has modified “Alice” but not yet committed, so from the view of Tx 43 this modification is ignored: it still treats “Alice” = true. Then Tx 43 makes its change to “Bob”. Once Tx 42 commits, the premise of Tx 43 is outdated (“Alice” changed from true to false), so the writes of Tx 43 may be invalid (they were based on a premise that has since changed) and Tx 43 must abort.

In this example, the premises of Tx 42 and Tx 43 start out the same. After Tx 42 modifies “Alice”, Tx 43 modifies “Bob”. Because neither has committed yet, each is working against its own snapshot and there is no conflict so far. When Tx 42 commits, its premise has not changed, so it is approved. When Tx 43 commits, its premise is now outdated (“Alice” changed from true to false), so it is rejected.
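A toy sketch of the premise check, using the Alice/Bob example above (the structure and names are my own, not a real SSI implementation):

```python
class SSIManager:
    """Abort a committing transaction if any key it read was overwritten
    by another transaction that committed after its snapshot was taken."""
    def __init__(self):
        self.commit_seq = 0
        self.last_write = {}  # key -> commit sequence number of last writer

    def begin(self):
        return {"snapshot": self.commit_seq, "reads": set(), "writes": {}}

    def read(self, tx, key, store):
        tx["reads"].add(key)
        # A transaction sees its own writes, otherwise its snapshot.
        return tx["writes"].get(key, store.get(key))

    def write(self, tx, key, value):
        tx["writes"][key] = value  # buffered until commit

    def commit(self, tx, store):
        # Premise is outdated if any key this transaction read was
        # committed by someone else after this transaction's snapshot.
        for key in tx["reads"]:
            if self.last_write.get(key, 0) > tx["snapshot"]:
                return False  # abort: writes were based on an outdated premise
        self.commit_seq += 1
        for key, value in tx["writes"].items():
            store[key] = value
            self.last_write[key] = self.commit_seq
        return True
```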
Performance
Compared to 2PL, the big advantage of SSI is that one transaction doesn’t need to block waiting for locks held by another transaction. Like under snapshot isolation, writers don’t block readers, and vice versa. This design principle makes query latency much more predictable and less variable. In particular, read-only queries can run on a consistent snapshot without requiring any locks, which is very appealing for read-heavy workloads.
Compared to serial execution, SSI is not limited to the throughput of a single CPU core. Even though data may be partitioned across multiple machines, transactions can read and write data in multiple partitions while ensuring serializable isolation.
The rate of aborts significantly affects the overall performance of SSI. For example, a transaction that reads and writes data over a long period of time is likely to run into conflicts and abort, so SSI requires that read-write transactions be fairly short (long-running read-only transactions may be okay).