DDIA cookbook - (9)Ordering
Kexin Tang

Ordering and Causality

Causality imposes an ordering on events: cause comes before effect; a message is sent before that message is received; the question comes before the answer. And, like in real life, one thing leads to another: one node reads some data and then writes something as a result, another node reads the thing that was written and writes something else in turn, and so on.

Causal order ≠ Total order

Total order → ordering by time, every pair of events can be placed in some order.

Causal ordering is only a partial order - sometimes events happen with order, sometimes are concurrently.

In a linearizable system, we have a total order of operations: if the system behaves as if there is only a single copy of the data, and every operation is atomic, this means that for any two operations we can always say which one happened first. It’s just like there is a single timeline along which all operations are totally ordered.

It maybe very hard to understand the difference. Let’s use twitter as an example. If you post a twitte, then A, B and C comment your post.

  • Causal order (some events have causality restriction / time order, but some may be concurrent, which leads to unkonwn time order)
    • ABC need to see your post first, then comment, so “your post” is the cause, and “comments from A, B and C” is the effect (can be either concurrent or sequential).
    • Causal order only guarantees “post→A comment”, “post→B comment” and “post→C comment”.
    • That means in A’s phone, maybe the comments order is ABC, but in B’s phone, the comments order is BCA.
  • Total order (in any nodes/views, events have the same time order, no need to have causality)
    • It only guarantees that in ABC’s phones, all events have the same order, even they aren’t reasonable.
    • For example, the events are “A comment→C comment→post→B comment”.

Causal ordering doesn’t imply total ordering, and total ordering doesn’t imply causal ordering as well.

image
image

Linearizability is stronger than causal consistency

Linearizability implies causality.

Linearizability is not the only way of preserving causality. A system can be causally consistent without incurring the performance hit of making it linearizable. In fact, causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures.

In order to maintain causality, you need to know which operation happened before which other operation. This is a partial order: concurrent operations may be processed in any order, but if one operation happened before another, then they must be processed in that order on every replica. Thus, when a replica processes an operation, it must ensure that all causally preceding operations (all operations that happened before) have already been processed.


Sequence Number Ordering

Although causality is an important theoretical concept, actually keeping track of all causal dependencies can become impracticable. In many applications, clients read lots of data before writing something, and then it is not clear whether the write is causally dependent on all or only some of those prior reads. Explicitly tracking all the data that has been read would mean a large overhead.

We can use sequence numbers or logic timestamps to order events. These sequence numbers also provide total order: because every event has its own sequence number, so we can compare any two of them.

Lamport timestamps

image

Each node has a unique identifier, and each node keeps a counter of the number of operations it has processed. The Lamport timestamp is then simply a pair of (counter, node ID). Two nodes may sometimes have the same counter value, but by including the node ID in the timestamp, each timestamp is made unique.

A Lamport timestamp bears no relationship to a physical time-of-day clock, but it provides total ordering: if you have two timestamps, the one with a greater counter value is the greater timestamp; if the counter values are the same, the one with the greater node ID is the greater timestamp.

Lamport timestamps vs Version vector

Although there are some similarities, they have a different purpose:

  • Version vectors can distinguish whether two operations are concurrent or whether one is causally dependent on the other
  • Lamport timestamps always enforce a total ordering

From the total ordering of Lamport timestamps, you cannot tell whether two operations are concurrent or whether they are causally dependent. The advantage of Lamport timestamps over version vectors is that they are more compact.

Timestamp ordering is not sufficient

The problem here is that the total order of operations only emerges after you have collected all of the operations. If another node has generated some operations, but you don’t yet know what they are, you cannot construct the final ordering of operations: the unknown operations from the other node may need to be inserted at various positions in the total order.

In order to implement something meet uniqueness constraint, it’s not sufficient to have a total ordering of operations—you also need to know when that order is finalized. This idea of knowing when your total order is finalized is captured in the topic of total order broadcast.


Total Order Broadcast

Total order broadcast is usually described as a protocol for exchanging messages between nodes. Informally, it requires that two safety properties always be satisfied:

  • Reliable delivery → No messages are lost: if a message is delivered to one node, it is delivered to all nodes.
  • Totally ordered delivery → Messages are delivered to every node in the same order.

A correct algorithm for total order broadcast must ensure that the reliability and ordering properties are always satisfied, even if a node or the network is faulty. Of course, messages will not be delivered while the network is interrupted, but an algorithm can keep retrying so that the messages get through when the network is eventually repaired.

Use case

Total order broadcast is exactly what you need for database replication: if every message represents a write to the database, and every replica processes the same writes in the same order, then the replicas will remain consistent with each other.

Similarly, total order broadcast can be used to implement serializable transactions: if every message represents a deterministic transaction to be executed as a stored procedure, and if every node processes those messages in the same order, then the partitions and replicas of the database are kept consistent with each other.

An important aspect of total order broadcast is that the order is fixed at the time the messages are delivered, a node is not allowed to retroactively insert a message into an earlier position in the order if subsequent messages have already been delivered. This fact makes total order broadcast stronger than timestamp ordering.

Implementing linearizable storage using total order broadcast

Imagine we have a system for register unique username. For every possible username, you can have a linearizable register with an atomic compare-and-set operation.

You can implement such a linearizable compare-and-set operation as follows by using total order broadcast as an append-only log:

  1. Append a message to the log, tentatively indicating the username you want to claim.
  2. Total order broadcast, and wait for the message you appended to be delivered back to you.
  3. Check for any messages claiming the username that you want. If the first message for your desired username is your own message, then you are successful: you can commit the username claim (perhaps by appending another message to the log) and acknowledge it to the client. If the first message for your desired username is from another user, you abort the operation.

Because messages are delivered to all nodes in the same order (total order broadcast), if there are several concurrent writes, all nodes will agree on which one came first (who get the username).

Imagine A, B and C wants to use username “hello world”. No matter who sends the message first, because we use total order broadcast, so if the total order is B->A->C, then all nodes will have the same order, then B wins.

Implementing total order broadcast using linearizable storage

The algorithm is simple: for every message you want to send through total order broadcast, you increment-and-get the linearizable integer, and then attach the value you got from the register as a sequence number to the message. You can then send the message to all nodes (resending any lost messages), and the recipients will deliver the messages consecutively by sequence number.

Note that unlike Lamport timestamps, the numbers you get from incrementing the linearizable register form a sequence with no gaps. Thus, if a node has delivered message 4 and receives an incoming message with a sequence number of 6, it knows that it must wait for message 5 before it can deliver message 6. The same is not the case with Lamport timestamps—in fact, this is the key difference between total order broadcast and timestamp ordering.

Because the linearizable storage is no gap, so if I receive message 4 and message 6, which means message 5 also be used by someone else, but I havn’t receive it, I need to wait message 5 then deliver message 6.