DDIA cookbook - (8)The Trouble with Distributed Systems
Kexin Tang

Faults and Partial Failures

A single-computer system is usually either fully functional or entirely broken. This is a deliberate design choice: it is better for the software to crash completely than to return a wrong result.

In a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts of the system are working fine. This is known as a partial failure. The defining difficulty of distributed systems is that partial failures are nondeterministic.

Supercomputing

Supercomputing (high-performance computing, HPC) → focus on intensive scientific computing tasks with the help of thousands of CPUs and powerful machines.

In a supercomputer, a job typically checkpoints the state of its computation to durable storage from time to time. If one node fails, a common solution is to simply stop the entire cluster workload. After the faulty node is repaired, the computation is restarted from the last checkpoint. Thus, a supercomputer is more like a single-node computer than a distributed system: it deals with partial failure by letting it escalate into total failure—if any part of the system fails, just let everything crash.

Cloud Computing

Cloud computing → focus on multi-tenant datacenters, commodity computers connected with an IP network and elastic/on-demand resource allocation.

If we want to make distributed systems work, we must accept the possibility of partial failure and build fault-tolerance mechanisms into the software.


Unreliable Networks

The distributed systems we focus on here use a shared-nothing architecture, so the network is the only way the machines can communicate.

The internet and most internal networks in datacenters are asynchronous packet networks. In this kind of network, one node can send a message (a packet) to another node, but the network gives no guarantees as to when it will arrive, or whether it will arrive at all.

  1. Your request may have been lost;
  2. Your request may be waiting in a queue and will be delivered later;
  3. The remote node may have failed;
  4. The remote node may have processed your request, but the response has been lost;
  5. The remote node may have processed your request, but the response has been delayed due to traffic;

The usual way of handling this issue is a timeout: after some time you give up waiting and assume that the response is not going to arrive. However, when a timeout occurs, you still don’t know whether the remote node got your request or not.
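A minimal sketch of this timeout behavior, using a raw TCP connection (the host, port, and payload are hypothetical; a real client would also retry and handle partial reads):

```python
import socket

def send_request(host, port, payload, timeout=1.0):
    """Send a request and wait up to `timeout` seconds for a reply.

    Returns the response bytes, or None on timeout/failure. Note that None
    is ambiguous: the request may have been lost, the node may be down, or
    the node may have processed the request and only the response was lost.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)   # also bound the wait for the response
            sock.sendall(payload)
            return sock.recv(4096)
    except OSError:                    # includes socket.timeout
        return None
```

The key point is in the return value: a `None` tells you only that you gave up waiting, not what actually happened on the other side.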

Timeout and Unbounded Delays

There are two metrics:

  1. Accuracy → every detected failure corresponds to a crashed process (no mistakes)
  2. Completeness → every process failure is eventually detected (no misses)

A short timeout detects faults faster (high completeness), but carries a higher risk of incorrectly declaring a node dead when in fact it has only suffered a temporary slowdown (low accuracy). A long timeout has the opposite trade-off.

Suppose we could guarantee two bounds for a system:

  • the maximum delay for packets — every packet is either delivered within some time d or it is lost;
  • the maximum processing time for the service — a non-failed node always handles a request within some time r.

Then a reasonable timeout value would be 2d + r (d for the request to arrive, r for processing, d for the response to return). In practice, however, neither bound can be guaranteed; delays with no guaranteed upper bound are called unbounded delays.

Possible Problems

  1. When a node is declared dead, its responsibilities need to be transferred to other nodes, which places additional load on other nodes and the network. If the system is already struggling with high load, declaring nodes dead prematurely can make the problem worse. In particular, it could happen that the node actually wasn’t dead but only slow to respond due to overload; transferring its load to other nodes can cause a cascading failure.
  2. Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of performing some action, and another node takes over, the action may end up being performed twice.

Unbounded Delays

There are 4 network delays:

  • Queuing Delay → if several different nodes simultaneously try to send packets to the same destination, the network switch must queue them up and feed them into the destination network link one by one;
  • Processing Delay → the time it takes a switch, router, or the receiving node’s operating system to process the packet;
  • Transmission Delay → if the request is large, the sender splits it into several packets, and it takes time to push all of those bits onto the link (packet size divided by link bandwidth);
  • Propagation Delay → the time taken for a single bit to traverse the physical medium from one end to the other;

Sync Vs Async

A synchronous network works like a telephone circuit: delays are bounded (e.g. a known maximum round-trip time and no queueing delays).

Even as data passes through several routers, it does not suffer from queueing, because the 16 bits of space for the call have already been reserved in the next hop of the network.

In an asynchronous network, delays are unbounded.

Why don’t distributed systems use circuits (synchronous networks)? Because their traffic is bursty: we don’t know in advance how much bandwidth should be allocated. A circuit is good for an audio or video call, which needs to transfer a fairly constant number of bits per second for the duration of the call. On the other hand, requesting a web page, sending an email, or transferring a file doesn’t have any particular bandwidth requirement—we just want it to complete as quickly as possible. TCP is good at dynamic allocation, so we choose TCP over circuits.

TCP has traffic and congestion control, it dynamically adapts the rate of data transfer to the available network capacity.


Unreliable Physical Clocks

Clocks and time are important, for example:

  1. Has the request timeout yet?
  2. What’s the 99th percentile response time?
  3. How many queries per second (QPS)?
  4. When does the cache expire?
  5. What is the timestamp for logging?

In a distributed system, time is a tricky business, because communication is not instantaneous: it takes time for a message to travel across the network from one machine to another. What’s more, each machine’s clock may run slightly faster or slower than the others’.

Monotonic Vs Time-of-Day Clocks

  • Time-of-Day → returns the elapsed time since the epoch (1970-01-01 00:00:00 UTC). The absolute value is meaningful and comparable across machines, but due to clock skew it may be inaccurate, and NTP adjustments can make it jump backwards.
  • Monotonic → a single reading is meaningless, but the difference between two readings gives the elapsed time. It assumes no synchronization between different nodes’ clocks and is not sensitive to slight inaccuracies of measurement.
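Python exposes both clocks directly, which makes the distinction concrete:

```python
import time

# Time-of-day clock: seconds since the epoch. Roughly comparable across
# machines, but NTP may reset it backwards - unsafe for measuring durations.
wall_now = time.time()

# Monotonic clock: the absolute value is meaningless, but differences are
# reliable elapsed durations on this one machine.
mono_start = time.monotonic()
time.sleep(0.05)
elapsed = time.monotonic() - mono_start  # never negative, never jumps
```

Measuring a timeout or a response time with `time.time()` is a classic bug: a backwards NTP step can make the "elapsed" time negative.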

Clock Sync and Accuracy

Clock Sync is a pretty hard task:

  1. The quartz clock in a computer is not very accurate: it drifts (runs faster or slower than it should).
  2. If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the local clock will be forcibly reset, which may influence ongoing tasks.
  3. If a node is accidentally firewalled off from NTP servers, the misconfiguration may go unnoticed for some time.
  4. NTP synchronization can only be as good as the network delay.

Relying on Sync Clocks

If you use software that requires synchronized clocks, it is essential that you also carefully monitor the clock offsets between all the machines. Any node whose clock drifts too far from the others should be declared dead and removed from the cluster.

Last write wins (LWW)

It is widely used in both multi-leader replication and leaderless databases. Its logic: when multiple writes conflict, keep the write with the largest timestamp; it is treated as the newest (last) write and overwrites the older (previous) ones.

But it still has some problems:

  1. Database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite values previously written by a node with a fast clock until the clock skew between the nodes has elapsed.
  2. LWW cannot distinguish between writes that occurred sequentially in quick succession and writes that were truly concurrent.
  3. Causality tracking mechanisms, such as version vectors, are needed in order to prevent violations of causality.
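Problem 1 is easy to demonstrate with a minimal LWW merge (a sketch, not any particular database's implementation; the timestamps are hypothetical):

```python
def lww_merge(writes):
    """writes: iterable of (timestamp, value) pairs; the largest timestamp wins."""
    return max(writes, key=lambda w: w[0])[1]

# "a" was written first, by a node whose clock runs fast;
# "b" was written afterwards (in real time), by a node with a lagging clock.
writes = [(1700000010.0, "a"),
          (1700000005.0, "b")]
assert lww_merge(writes) == "a"  # the later write "b" silently disappears
```

The database has no way to notice the loss: from its perspective, "b" is simply an older write being overwritten.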

Confidence Interval

The most common implementation of snapshot isolation requires a monotonically increasing transaction ID. However, when a database is distributed across many machines, potentially in multiple datacenters, a global, monotonically increasing transaction ID (across all partitions) is difficult to generate, because it requires coordination.

Can we use time-of-day timestamps as transaction IDs instead? Only if the clock uncertainty is known and small enough that the timestamp ordering reflects causality.

So instead of treating a time-of-day reading as a precise value, we can treat it as a range of time, [earliest possible timestamp, latest possible timestamp]; this is called a confidence interval.

Google Spanner uses this “confidence interval” concept (via its TrueTime API) to implement its distributed transaction semantics, relying on GPS receivers and atomic clocks in each datacenter to keep the interval tightly bounded.
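A Spanner-style "commit wait" can be sketched as follows. Everything here is an assumption for illustration: `EPSILON` stands in for the measured clock uncertainty, and `now_interval` imitates a TrueTime-like API; Spanner's real implementation differs.

```python
import time

EPSILON = 0.007  # assumed clock uncertainty (~7 ms); real systems measure this

def now_interval():
    """Return an (earliest, latest) uncertainty interval around 'now'."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit(commit_work):
    """Commit, then wait out the clock uncertainty before acknowledging."""
    _, latest = now_interval()   # latest possible true commit time
    commit_work()
    # Wait until even the earliest possible current time is past `latest`:
    # any transaction starting after this wait gets a strictly later interval,
    # so timestamp order is guaranteed to respect causality.
    while now_interval()[0] <= latest:
        time.sleep(0.001)
```

The cost is latency (roughly twice the uncertainty per commit), which is exactly why Spanner invests in keeping the uncertainty small.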


Knowledge, Truth and Lies

Quorum

A distributed system cannot exclusively rely on a single node, because a node may fail at any time, potentially leaving the system stuck and unable to recover. Instead, many distributed algorithms rely on a quorum, that is, voting among the nodes: decisions require some minimum number of votes from several nodes in order to reduce the dependence on any one particular node.
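The most common quorum is a strict majority. The arithmetic below shows why it works: any two majorities of the same n nodes must share at least one node, so two conflicting decisions cannot both gather a quorum.

```python
def quorum_size(n):
    """Smallest strict majority of n nodes."""
    return n // 2 + 1

def quorums_overlap(n):
    """Any two majority quorums of n nodes must intersect: 2 * q > n."""
    return 2 * quorum_size(n) > n

# With 5 nodes, 3 votes decide, and the cluster tolerates 2 failed nodes.
assert quorum_size(5) == 3
assert all(quorums_overlap(n) for n in range(1, 100))
```

This is also why clusters are usually sized with an odd number of nodes: going from 5 to 6 nodes raises the quorum from 3 to 4 without tolerating any additional failures.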

Fencing Tokens


Let’s assume that every time the lock server grants a lock or lease, it also returns a fencing token, which is a number that increases every time a lock is granted (e.g., incremented by the lock service). We can then require that every time a client sends a write request to the storage service, it must include its current fencing token. This mechanism requires the resource itself to take an active role in checking tokens by rejecting any writes with an older token than one that has already been processed.
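A minimal sketch of the storage side of this check (the class and method names are hypothetical; a real service would persist the token and apply it per object):

```python
class FencedStorage:
    """Storage that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.max_token_seen = -1
        self.value = None

    def write(self, token, value):
        if token < self.max_token_seen:
            return False             # a newer lock holder exists: reject
        self.max_token_seen = token  # equal tokens = same holder retrying
        self.value = value
        return True

store = FencedStorage()
assert store.write(33, "a")       # client 1 holds the lease with token 33
assert store.write(34, "b")       # client 2 acquired the lease later, token 34
assert not store.write(33, "c")   # client 1's delayed write is fenced off
```

Note the division of labor: the lock service only hands out increasing tokens; it is the resource itself that enforces them.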

Byzantine Faults

Byzantine fault → a node may claim to have received a particular message when in fact it didn’t.

A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network. This concern is relevant in certain specific circumstances.

Most Byzantine fault-tolerant algorithms require a supermajority of more than two-thirds of the nodes to be functioning correctly.

System Model

System model → formalize the kinds of faults that we expect to happen in a system.

  • Timing assumptions
      • Synchronous model → assumes bounded network delay, bounded process pauses, and bounded clock error. This does not imply exactly synchronized clocks or zero network delay; it just means you know that network delay, pauses, and clock drift will never exceed some fixed upper bound.
      • Partially synchronous model → the system behaves like a synchronous system most of the time, but sometimes exceeds the bounds for network delay, process pauses, and clock drift.
      • Asynchronous model → an algorithm is not allowed to make any timing assumptions—in fact, it does not even have a clock.
  • Fault assumptions
      • Crash-stop faults → a node may suddenly stop responding at any moment, and thereafter it is gone forever—it never comes back.
      • Crash-recovery faults → nodes may crash at any moment, and perhaps start responding again after some unknown time.
      • Byzantine (arbitrary) faults → nodes may do absolutely anything, including trying to trick and deceive other nodes.

Safety & Liveness

  • Safety → Nothing bad happens. For example: a transaction with a smaller timestamp must take effect before a transaction with a larger timestamp.
  • Liveness → Something good eventually happens.