DDIA cookbook - (9)Distributed Transactions and Consensus
Kexin Tang

Intro

Consensus → get several nodes to agree on something.

Some situations:

  • Leader Election → If some nodes can’t communicate with others due to a network fault, the leadership position may become contested. Consensus is important here to avoid a bad failover that results in a split-brain situation, in which two nodes both believe themselves to be the leader
  • Atomic Commit → In a database that supports transactions spanning several nodes or partitions, a transaction may fail on some nodes but succeed on others. We have to get all nodes to agree on the outcome: either they all abort/roll back, or they all commit

Consensus Vs Data Consistency

The two concepts are closely related and are often used interchangeably, but there are subtle differences:

  • Data Consistency → describes final outcomes and goals; it is the desired state of the system, but does not define how that state is achieved
  • Consensus → a general algorithm or method for reaching such a state (like voting), and sometimes it concerns more than data

For example, a consensus algorithm can be used in leader election, where no application data is involved; data consistency, on the other hand, always appears in replication, where the concern is that the data is the same on different nodes.

The Impossibility of Consensus

FLP result → there is no algorithm that is always able to reach consensus if there is a risk that a node may crash.

The result is correct, but it has a prerequisite: it applies in the asynchronous system model, which cannot use any clocks or timeouts! If the algorithm is allowed to use timeouts, or some other way of identifying suspected crashed nodes (even if the suspicion is sometimes wrong), then consensus becomes solvable.

Note: sync/async here is not about data consistency; it refers to a system model, which specifies whether the system has timing bounds.


Two-Phase Commit (2PC)

On a single machine, atomicity is achieved by the storage engine.

When the client asks the database node to commit the transaction, the database makes the transaction’s writes durable (typically in a write-ahead log) and then appends a commit record to the log on disk. If the database crashes in the middle of this process, the transaction is recovered from the log when the node restarts: if the commit record was successfully written to disk before the crash, the transaction is considered committed; if not, any writes from that transaction are rolled back.
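The recovery rule above can be sketched in a few lines. This is not a real storage engine, just an in-memory illustration of the principle: a transaction’s writes take effect only if its commit record reached the log before the crash (names like `wal` and `recover` are made up for this sketch).

```python
def recover(wal):
    """Replay the log after a crash: apply a transaction's writes only
    if its commit record made it to disk before the crash."""
    committed = {txid for kind, txid, _ in wal if kind == "commit"}
    state = {}
    for kind, txid, payload in wal:
        if kind == "write" and txid in committed:
            key, value = payload
            state[key] = value  # commit record present: apply the write
        # writes of uncommitted transactions are ignored (rolled back)
    return state

# The crash happened after tx1's commit record but before tx2's:
wal = [
    ("write", "tx1", ("x", 1)),
    ("commit", "tx1", None),
    ("write", "tx2", ("y", 2)),  # no commit record -> rolled back
]
print(recover(wal))  # {'x': 1}
```

The key point: the single disk write of the commit record is the atomic boundary between “aborted” and “committed”.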

In a distributed system, if some nodes commit the transaction but others abort it, inconsistency results, because a committed transaction cannot be reverted. So a node must only commit once it is certain that all other nodes in the transaction are also going to commit.

Two Phases


Instead of a single commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two phases (hence the name).

  • Coordinator → 2PC uses a new component that does not normally appear in single-node transactions: a coordinator or transaction manager. The coordinator is often implemented as a library within the same application process that is requesting the transaction, but it can also be a separate process or service.
  • Participant → A distributed transaction begins with the application reading and writing data on multiple database nodes, as normal. We call these database nodes participants in the transaction.

When the application is ready to commit, the coordinator begins phase 1: it sends a prepare request to each of the nodes, asking them whether they are able to commit. In phase 2, the coordinator keeps track of the responses and makes the decision:

  • If all participants reply “yes,” indicating they are ready to commit, then the coordinator sends out a commit request in phase 2, and the commit actually takes place.
  • If any of the participants replies “no,” the coordinator sends an abort request to all nodes in phase 2.

Details

  1. When the application wants to begin a distributed transaction, it requests a transaction ID from the coordinator. This transaction ID is globally unique.
  2. The application begins a single-node transaction on each of the participants, and attaches the globally unique transaction ID to the single-node transaction. All reads and writes are done in one of these single-node transactions. If anything goes wrong at this stage (for example, a node crashes or a request times out), the coordinator or any of the participants can abort.
  3. When the application is ready to commit, the coordinator sends a prepare request to all participants, tagged with the global transaction ID. If any of these requests fails or times out, the coordinator sends an abort request for that transaction ID to all participants.
  4. When a participant receives the prepare request, it makes sure that it can definitely commit the transaction under all circumstances. This includes writing all transaction data to disk and checking for any conflicts or constraint violations. By replying “yes” to the coordinator, the node promises to commit the transaction without error if requested.
  5. When the coordinator has received ack responses to all prepare requests, it makes a definitive decision on whether to commit or abort the transaction. The coordinator must write that decision to its transaction log on disk so that it knows which way it decided in case it subsequently crashes. This is called the commit point.
  6. Once the coordinator’s decision has been written to disk, the commit or abort request is sent to all participants. If this request fails or times out, the coordinator must retry forever until it succeeds. There is no more going back: if the decision was to commit, that decision must be enforced, no matter how many retries it takes. If a participant has crashed in the meantime, the transaction will be committed when it recovers—since the participant voted “yes”, it cannot refuse to commit when it recovers.
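The steps above can be condensed into a minimal sketch. This runs in memory with no real network, disk, or retry loop; the `Participant` class and `log` list are illustrative stand-ins, not any real API.

```python
class Participant:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit
        self.state = "active"

    def prepare(self, txid):
        # Phase 1: write everything durably, then promise (or refuse).
        if self.will_commit:
            self.state = "prepared"  # from now on it cannot abort alone
            return "yes"
        return "no"

    def finish(self, txid, decision):
        self.state = decision  # "commit" or "abort"

def two_phase_commit(txid, participants, log):
    # Phase 1: collect votes from every participant
    votes = [p.prepare(txid) for p in participants]
    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    log.append((txid, decision))  # commit point: decision made durable
    # Phase 2: enforce the decision (a real coordinator retries forever)
    for p in participants:
        p.finish(txid, decision)
    return decision

ps = [Participant("db1"), Participant("db2", will_commit=False)]
print(two_phase_commit("tx42", ps, []))  # abort (db2 voted "no")
```

Note how the log append sits between the two phases: that single write is the commit point described in step 5.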

Thus, the protocol contains two crucial “points of no return”, which ensure the atomicity of 2PC:

  1. Once a participant votes “yes,” it promises that it will definitely be able to commit later (although the coordinator may still choose to abort)
  2. Once the coordinator decides, that decision is irrevocable.

Coordinator Failure

If the coordinator fails before sending the prepare requests, a participant can safely abort the transaction via timeout or other mechanism.

If the participant has received a prepare request and voted “yes,” it can no longer abort unilaterally—it must wait to hear back from the coordinator whether the transaction was committed or aborted. If the coordinator crashes or the network fails at this point, the participant can do nothing but wait. A participant’s transaction in this state is called in doubt or uncertain. In principle, the participants could communicate among themselves to find out how each participant voted and come to some agreement, but that is not part of the 2PC protocol.

A participant cannot time out and abort on its own after voting “yes”, because the coordinator may already have made its final decision; if the participant simply hasn’t received that decision (due to a network problem, for example), aborting would cause inconsistency.

2PL Vs 2PC

2PL is used to achieve serializable isolation, whereas 2PC is used to achieve atomic commit.

They both have 2 phases:

  • 2PL → acquire lock / upgrade lock - release lock
  • 2PC → prepare - commit

Three-Phase Commit (3PC)

Two-phase commit is called a blocking atomic commit protocol because 2PC can become stuck waiting for the coordinator to recover.

3PC can achieve nonblocking commit, but:

  1. assumes a network with bounded delay and nodes with bounded response times
  2. requires a perfect failure detector, like a reliable mechanism for telling whether a node has crashed or not

These are hard to achieve, so 3PC is not common.


Distributed Transactions in Practice

XA transactions

The coordinator and participants may be built with different, heterogeneous technologies, so making sure the messages are understandable by all of them is important.

XA is short for eXtended Architecture; it’s a standard for implementing two-phase commit across heterogeneous technologies. XA is not a network protocol: it is merely a C API for interfacing with a transaction coordinator.

With the help of XA, systems can use generic APIs to communicate: what a prepare request looks like, what an ack response looks like, how to use the same callbacks across different components, etc.
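To give a feel for this kind of generic interface: XA itself is a C API, but the following Python analogue (class and method names are hypothetical, not XA’s actual signatures) shows the shape of the contract each participant driver implements so a generic coordinator can drive 2PC without knowing the database’s internals.

```python
from abc import ABC, abstractmethod

class XAResource(ABC):
    """Hypothetical analogue of an XA participant interface."""

    @abstractmethod
    def prepare(self, xid: str) -> bool:
        """Phase 1: make the transaction durable; True means vote 'yes'."""

    @abstractmethod
    def commit(self, xid: str) -> None:
        """Phase 2: finalize the transaction."""

    @abstractmethod
    def rollback(self, xid: str) -> None:
        """Phase 2: undo the transaction."""

class InMemoryResource(XAResource):
    """Toy implementation, standing in for a real database driver."""
    def __init__(self):
        self.states = {}
    def prepare(self, xid):
        self.states[xid] = "prepared"
        return True
    def commit(self, xid):
        self.states[xid] = "committed"
    def rollback(self, xid):
        self.states[xid] = "aborted"

r = InMemoryResource()
r.prepare("tx1")
r.commit("tx1")
print(r.states["tx1"])  # committed
```

Any coordinator that speaks this interface can coordinate a MySQL node, a message broker, and anything else, as long as each provides a driver implementing it.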

Holding locks while in doubt

Database transactions usually take a row-level exclusive lock on any rows they modify, to prevent dirty writes. In addition, if you want serializable isolation, a database using 2PL would also have to take a shared lock on any rows read by the transaction.

And if the participants are in doubt state, they cannot abort by themselves, so they will constantly hold the lock. While those locks are held, no other transaction can modify those rows. Depending on the database, other transactions may even be blocked from reading those rows. Thus, other transactions cannot simply continue with their business.

Recovering from coordinator failure

In practice, orphaned in-doubt transactions do occur—that is, transactions for which the coordinator cannot decide the outcome for whatever reason (e.g., because the transaction log has been lost or corrupted due to a software bug). These transactions cannot be resolved automatically, so they sit forever in the database, holding locks and blocking other transactions.

Even rebooting your database servers will not fix this problem, since a correct implementation of 2PC must preserve the locks of an in-doubt transaction even across restarts (otherwise it would risk violating the atomicity guarantee).

The only way out is for an administrator to manually decide whether to commit or roll back the transactions. Many XA implementations have an emergency escape hatch called heuristic decisions: allowing a participant to unilaterally decide to abort or commit an in-doubt transaction without a definitive decision from the coordinator.

The heuristic decisions may violate atomicity rule!

Limitations

  • If the coordinator is not replicated but runs only on a single machine, it is a single point of failure for the entire system.
  • Many server-side applications are developed in a stateless model. But the coordinator’s logs become a crucial part of the durable system state—as important as the databases themselves, since the coordinator logs are required in order to recover in-doubt transactions after a crash. Such application servers are no longer stateless.
  • Since XA needs to be compatible with a wide range of data systems, it is necessarily a lowest common denominator (it cannot, for example, be tailored to a particular language or technology).

Fault-Tolerant Consensus

In this formalism, a consensus algorithm must satisfy the following properties:

  • Uniform agreement → No two nodes decide differently.
  • Integrity → No node decides more than once.
  • Validity → If a node decides value v, then v was proposed by some node.
  • Termination → Every node that does not crash eventually decides some value.

Termination is a liveness property, whereas the other three are safety properties.

  • liveness → something good eventually happens
  • safety → nothing bad happens

The system model of consensus assumes that when a node “crashes,” it suddenly disappears and never comes back. In this system model, any algorithm that has to wait for a node to recover is not going to be able to satisfy the termination property (e.g., alive nodes waiting for a crashed coordinator, or a coordinator waiting for a crashed participant to ack).

Thus, the termination property is subject to the assumption that fewer than half of the nodes (quorum) are crashed or unreachable. However, most implementations of consensus ensure that the safety properties—agreement, integrity, and validity—are always met, even if a majority of nodes fail or there is a severe network problem. Thus, a large-scale outage can stop the system from being able to process requests, but it cannot corrupt the consensus system by causing it to make invalid decisions.

Consensus algorithms and total order broadcast

In practice, one transaction or action may contain multiple values. We need to make sure the sequence of values is the same on all nodes.

The total order broadcast requires messages to be delivered exactly once, in the same order, to all nodes. This is equivalent to performing several rounds of consensus: in each round, nodes propose the message that they want to send next, and then decide on the next message to be delivered in the total order.

So from a high-level perspective, total order broadcast == multiple rounds of single-value consensus.

consensus([B, A, C]) == consensus([consensus(B), consensus(A), consensus(C)])
  • Due to the agreement property of consensus, all nodes decide to deliver the same messages in the same order.
  • Due to the integrity property, messages are not duplicated.
  • Due to the validity property, messages are not corrupted and not fabricated out of thin air.
  • Due to the termination property, messages are not lost.
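The equivalence can be sketched as a loop of consensus rounds. Here `decide` stands in for one round of single-value consensus (for the sketch it just picks the smallest proposal deterministically, so every node would compute the same answer); the function names are illustrative.

```python
def decide(proposals):
    # One consensus round: all nodes agree on a single proposed message.
    # Any deterministic rule works; min() is the simplest.
    return min(proposals)

def total_order_broadcast(pending_by_node):
    """Deliver every node's pending messages in one agreed total order
    by running consensus once per delivered message."""
    delivered = []
    pending = {n: list(msgs) for n, msgs in pending_by_node.items()}
    while any(pending.values()):
        # each node proposes the next message it wants to send
        proposals = [msgs[0] for msgs in pending.values() if msgs]
        winner = decide(proposals)   # agreement: same value everywhere
        delivered.append(winner)     # hence the same order on every node
        for msgs in pending.values():
            if msgs and msgs[0] == winner:
                msgs.pop(0)
    return delivered

print(total_order_broadcast({"n1": ["B"], "n2": ["A", "C"]}))  # ['A', 'B', 'C']
```

Each iteration of the loop is one “round” in the sense of the bullet points above: agreement fixes the winner, integrity delivers it once, termination guarantees the loop makes progress.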

Single-leader replication and consensus

If the leader is selected manually, this can work well and satisfies the first three properties, but it does not satisfy the termination property because of the need for human intervention.

For automatic leader election and failover, it will promote a follower to be the new leader if the old leader fails. The protocols define an epoch number (called the ballot number in Paxos, and term number in Raft) and guarantee that within each epoch, the leader is unique. Every time the current leader is thought to be dead, a vote is started among the nodes to elect a new leader. This election is given an incremented epoch number, and thus epoch numbers are totally ordered and monotonically increasing. If there is a conflict between two different leaders in two different epochs, then the leader with the higher epoch number prevails.

For every decision that a leader wants to make, it must send the proposed value to the other nodes and wait for a quorum of nodes to respond in favor of the proposal.
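The epoch-and-quorum mechanics can be sketched as follows. This is a simplified, in-memory illustration (the `Acceptor`/`propose` names are made up, and real protocols like Paxos and Raft involve considerably more than this): a stale leader’s proposals are rejected because acceptors remember the highest epoch they have seen, and a decision only stands once a quorum has accepted it.

```python
class Acceptor:
    def __init__(self):
        self.promised_epoch = 0

    def accept(self, epoch):
        # Reject proposals from leaders of older epochs.
        if epoch >= self.promised_epoch:
            self.promised_epoch = epoch
            return True
        return False

def propose(epoch, value, acceptors):
    """A would-be leader of the given epoch proposes a value."""
    acks = sum(a.accept(epoch) for a in acceptors)
    quorum = len(acceptors) // 2 + 1
    return value if acks >= quorum else None  # None = proposal rejected

acceptors = [Acceptor() for _ in range(3)]
print(propose(2, "v-new", acceptors))    # 'v-new' (quorum reached)
print(propose(1, "v-stale", acceptors))  # None: epoch 1 < promised epoch 2
```

This is exactly the conflict-resolution rule from the text: once the acceptors have promised epoch 2, a leader still operating in epoch 1 can no longer get anything decided.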

Difference with 2PC

  1. 2PC requires all participants to reply “yes”, while a fault-tolerant consensus algorithm only requires a quorum.
  2. 2PC’s coordinator cannot be elected automatically, while fault-tolerant consensus algorithms include leader election.

Membership and Coordination Services

Projects like ZooKeeper or etcd are often described as “coordination and configuration services”. They are designed to hold small amounts of data that can fit entirely in memory (although they still write to disk for durability)—so you wouldn’t want to store all of your application’s data here. That small amount of data is replicated across all the nodes using a fault-tolerant total order broadcast algorithm.

ZooKeeper has some features:

  • Linearizable atomic operations
  • Total ordering of operations
  • Failure detection
  • Change notifications (new node join / old node exit)

Allocating work to nodes

One example in which the ZooKeeper works well is if you have several instances of a process or service, and one of them needs to be chosen as leader or primary. If the leader fails, one of the other nodes should take over. This is of course useful for single-leader databases, but it’s also useful for job schedulers and similar stateful systems.

Another example arises when you have some partitioned resource and need to decide which partition to assign to which node. As new nodes join the cluster, some of the partitions need to be moved from existing nodes to the new nodes in order to rebalance the load. As nodes are removed or fail, other nodes need to take over the failed nodes’ work.

Normally, the kind of data managed by ZooKeeper is quite slow-changing. ZooKeeper is not intended for storing the runtime state of the application, which may change thousands or even millions of times per second.

Service discovery

Service discovery → to find out which IP address you need to connect to in order to reach a particular service.

Membership service

Membership service → which nodes are currently active and live members of a cluster.

Due to unbounded network delays it’s not possible to reliably detect whether another node has failed. However, if you couple failure detection with consensus, nodes can come to an agreement about which nodes should be considered alive or not.

No matter whether the node is actually alive or crashed: if the quorum thinks it is dead, then it is dead.