Intro

The main reason for wanting to partition data is scalability. Different partitions can be placed on different nodes in a shared-nothing cluster. Thus, a large dataset can be distributed across many disks, and the query load can be distributed across many processors.

For queries that operate on a single partition, each node can independently execute the queries for its own partition, so query throughput can be scaled by adding more nodes.

Partition & Replication

Partitioning is usually combined with replication so that copies of each partition are stored on multiple nodes. This means that, even though each record belongs to exactly one partition, it may still be stored on several different nodes for fault tolerance.

Each partition’s leader is assigned to one node, and its followers are assigned to other nodes. Each node may be the leader for some partitions and a follower for other partitions.

Partitioning of Key-Value Data

The goal of partition is to evenly distribute dataset among machines. If the partitioning is unfair, so that some partitions have more data or queries than others, we call it skewed. A partition with disproportionately high load is called a hot spot.

Partitioning by Key Range

One way of partitioning is to assign a continuous range of keys to each partition.

For example, ‘a’-‘c’ in machine1, ‘d’-‘f’ in machine2, and so on.

The ranges of keys are not necessarily evenly spaced, because your data may not be evenly distributed. In order to distribute the data evenly, the partition boundaries need to adapt to the data.

For example, if ‘a’ and ‘e’ have more data, we can let ‘a’ in machine1, ‘b’-‘d’ in machine2, ‘e’ in machine3, and so on.

Advantage

The advantage of partitioning by key range is we can keep keys in sorted order (just like in SSTable). This has the advantage that range scans are easy, and you can treat the key as a concatenated index in order to fetch several related records in one query (just like yyyy-mm-dd).

Disadvantage

However, the downside of key range partitioning is that certain access patterns can lead to hot spots (because the order). For example, it we store the timestamp, and we always care about up-to-date value, then the machine stores new values will be hot spot.

Partitioning by Hash

Because of this risk of skew and hot spots, many distributed datastores use a hash function to determine the partition for a given key. A good hash function takes skewed data and makes it uniformly distributed.

Once you have a suitable hash function for keys, you can assign each partition a range of hashes (rather than a range of keys), and every key whose hash falls within a partition’s range will be stored in that partition.

The partition boundaries can be evenly spaced, or they can be chosen pseudorandomly (in which case the technique is sometimes known as consistent hashing).

Downside

Unfortunately however, by using the hash of the key for partitioning we lose a nice property of key-range partitioning: the ability to do efficient range queries.

Cassandra achieves a compromise between the two partitioning strategies. A table in Cassandra can be declared with a compound primary key consisting of several columns. Only the first part of that key is hashed to determine the partition, but the other columns are used as a concatenated index for sorting the data in Cassandra’s SSTables.

Skewed Workloads and Relieving Hot Spots

In the extreme case where all reads and writes are for the same key, you still end up with all requests being routed to the same partition. (for example, on a social media site, a celebrity user with millions of followers may cause a storm of activity when they do something)

A simple way to solve this problem is: if one key is known to be very hot, a simple technique is to add a random number to the beginning or end of the key. Just a two-digit decimal random number would split the writes to the key evenly across 100 different keys, allowing those keys to be distributed to different partitions.

However, having split the writes across different keys, any reads now have to do additional work, as they have to read the data from all 100 keys and combine it. You also need some way of keeping track of which keys are being split.

Partitioning and Secondary Indexes

A secondary index usually doesn’t identify a record uniquely but rather is a way of searching for occurrences of a particular value. The problem with secondary indexes is that they don’t map neatly to partitions.

Document Index (local)

In this indexing approach, each partition is completely separate: each partition maintains its own secondary indexes, covering only the documents in that partition. It doesn’t care what data is stored in other partitions.

Advantage → when write, just add the seconday index in local partition.
Disadvantage → when read, because we only konw the local state, so need to iter all partitions to gather same seconday index information.

Term Index (global)

Rather than each partition having its own secondary index, we can construct a global index that covers data in all partitions. However, we can’t just store that index on one node, since it would likely become a bottleneck and defeat the purpose of partitioning. A global index must also be partitioned, but it can be partitioned differently from the primary key index.

Advantage → when read, just read one partition.
Disadvantage → when write, a write to a single document may now affect multiple partitions of the index (every term in the document might be on a different partition, on a different node).

Rebalancing Partitions

The process of moving load from one node in the cluster to another is called rebalancing.

The requirements are:

After rebalancing, the load (data storage, read and write requests) should be shared fairly between the nodes in the cluster.
While rebalancing is happening, the database should continue accepting reads and writes.
No more data than necessary should be moved between nodes, to make rebalancing fast and to minimize the network and disk I/O load.

Strategies for Rebalancing

The simplest way to do partition is hash by mod n, which means key % N = partition id. But the problem is: when we do rebalancing due to some nodes fail or change, the N will change, and most of the keys will need to be moved.

Fixed number of partitions

Create many more partitions than there are nodes, and assign several partitions to each node.

For example, we have 10 nodes and 100 partitions, then every node stores 10 partitions.

Only entire partitions are moved between nodes. The number of partitions does not change, nor does the assignment of keys to partitions (we don’t change content of the partition, we just move it as a whole group). The only thing that changes is the assignment of partitions to nodes.

This change of assignment is not immediate, so the old assignment of partitions is used for any reads and writes that happen while the transfer is in progress.

In this configuration, the number of partitions is fixed when the database is first set up and not changed afterward. Although in principle it’s possible to split and merge partitions (see the next section), a fixed number of partitions is operationally simpler. Thus, the number of partitions configured at the outset is the maximum number of nodes you can have (make sure every nodes have at least more than one partition).

Disadvantage → You need to choose the fixed number high enough to accommodate future growth (because the number cannot be changed). However, each partition also has management overhead, so it’s counterproductive to choose too high a number. It’s pretty hard to configure in advance especially when the data may change drasticly.

Dynamic partitioning

When a partition grows to exceed a configured size, it is split into two partitions so that approximately half of the data ends up on each side of the split. Conversely, if lots of data is deleted and a partition shrinks below some threshold, it can be merged with an adjacent partition.

Each partition is assigned to one node, and each node can handle multiple partitions, like in the case of a fixed number of partitions. After a large partition has been split, one of its two halves can be transferred to another node in order to balance the load.

Advantage → The number of partitions adapts to the total data volume.
Disadvantage → An empty database starts off with a single partition, since there is no a priori information about where to draw the partition boundaries. So all read and write will hit the same node while other nodes are idle.

Partitioning proportionally to nodes (Consistent Hash)

For fixed partition, the size of each partition is proportional to the size of the dataset. For dynamic partition, the the number of partitions is proportional to the size of the dataset. In both of these cases, the number of partitions is independent of the number of nodes.

A third option is to make the number of partitions proportional to the number of nodes—in other words, to have a fixed number of partitions per node. In this case, the size of each partition grows proportionally to the dataset size while the number of nodes remains unchanged, but when you increase the number of nodes, the partitions become smaller again. Since a larger data volume generally requires a larger number of nodes to store, this approach also keeps the size of each partition fairly stable.

When a new node joins the cluster, it randomly chooses a fixed number of existing partitions to split, and then takes ownership of one half of each of those split partitions while leaving the other half of each partition in place. The randomization can produce unfair splits, but when averaged over a larger number of partitions, the new node ends up taking a fair share of the load from the existing nodes.

Picking partition boundaries randomly requires that hash-based partitioning is used.

When an old node leave the cluster, it also rebalances its partitions to other nodes, and do partition merging process.

Operations: Automatic or Manual Rebalancing

Fully automated rebalancing can be convenient, because there is less operational work to do for normal maintenance. However, it can be unpredictable. Rebalancing is an expensive operation, because it requires rerouting requests and moving a large amount of data from one node to another. If it is not done carefully, this process can overload the network or the nodes and harm the performance of other requests while the rebalancing is in progress.

For that reason, it can be a good thing to have a human in the loop for rebalancing. It’s slower than a fully automatic process, but it can help prevent operational surprises.

Request Routing

When a client wants to make a request, how does it know which node to connect to? As partitions are rebalanced, the assignment of partitions to nodes changes.

This is an instance of a more general problem called service discovery. There are several ways:

Allow clients to contact any node. If that node coincidentally owns the partition to which the request applies, it can handle the request directly; otherwise, it forwards the request to the appropriate node, receives the reply, and passes the reply along to the client.
Send all requests from clients to a routing tier first, which determines the node that should handle each request and forwards it accordingly. This routing tier does not itself handle any requests; it only acts as a partition-aware load balancer.
Require that clients be aware of the partitioning and the assignment of partitions to nodes. In this case, a client can connect directly to the appropriate node, without any intermediary.

In all cases, the key problem is: how does the component making the routing decision (which may be one of the nodes, or the routing tier, or the client) learn about changes in the assignment of partitions to nodes?

Many distributed data systems rely on a separate coordination service such as ZooKeeper to keep track of this cluster metadata. Each node registers itself in ZooKeeper, and ZooKeeper maintains the authoritative mapping of partitions to nodes. Other actors, such as the routing tier or the partitioning-aware client, can subscribe to this information in ZooKeeper. Whenever a partition changes ownership, or a node is added or removed, ZooKeeper notifies the routing tier so that it can keep its routing information up to date.