Log
Here, we define log as an append-only sequence of records.
Index
Index is an additional data structure, which doesn’t affect the contents of database, but affect the operation performance.
Index can speed up the read, but slow down the write cuz we need to maintain the index.
Hash Index
Key-Value Store
Hash index is commonly used in Key-Value Storage. We can keep an in-memory hash map where every key is mapped to a offset in the data file. And we also store the k-v in log file which locates in disk. Every time we write a new k-v pair, we append the k-v in log file (notice log file in disk), and modify the index in memory.
However, it may run out of disk space. A good way to solve this problem is Compaction.
Compaction

We can store the index information in the segment, when the segment reach its size limitation, we make subsequent write in new segment. Then we can perform compaction process for these frozen segments. If same key appears many times in these segment, we just keep the newest value. It will generate the new compacted segments in new files and delete old segments to save space.
The whole process can happen in background thread, so it will not affect the service.
To find a value for a key, we just check the most recent segment, if the key is not present, then second-most-recent and so on.
Considerations
- Deleting Records → we need to keep the delete operation in logs (sometimes use special mark called tombstone)
- to mark the pervious operations for this key are useless, so if there is no new record for this key, this key shouldn’t appear in compacted segment.
- Crash Recovery → when database is restarted, in-memory hashmap is erased. We can iter all segment k-v info to rebuild the hashmap, but it’s costy. We can choose to store the k-v info and snapshots of the in-memory hashmap on disk, so that we just load the snapshots to rebuild hashmap.
- Concurrency Control → as writes are appended to the log in a strictly sequential order, a common implementation choice is to have only one writer thread. Data file segments are append-only and otherwise immutable, so they can be read concurrently by multiple threads.
log-structured vs update-in-place
log-structured > update-in-place
- append (sequential access) is always more efficient than random access
- concurrency control and crash recovery are simpler → thanks for immutable log (or say it’s append-only)
- compaction can reduce fragment problem
log-structured < update-in-place
- key must in memory
- key is good for point searching but is bad for range searching
SSTable & LSM-Tree
If we let the key are sorted in segment, we call it Sorted String Table or SSTable.
Why SSTable is useful?

- Merging/Compacting segments is simple and efficient. It use merge sort algorithm to achieve high efficiency. If one key appears in multiply segments, just use the most up-to-date segment value.

- Sparse index. We don’t need to keep all keys in memory, instead, just keep sparse index.
For example, create key for “A”, and key for “C”, then key “B” must between these two.
- Compression. Since read requests need to scan over several key-value pairs in the requested range, it’s possible to group those records into a block and compress it before writing it to disk.
Construct and maintain SSTable
- When write comes in, add it to a in-memory data structure (red-black tree, AVL tree, etc). This in-memory tree is called memtable.
This data structure can insert unordered data, and dump ordered data, for simplicity, just imagine binary search tree, we can insert data in any order, then read them via pre-order traversal to get ordered output.
- When the memtable reachs its threshold, dump it into disk as a new SSTable (the data structure ensures when dump, it must be ordered). After SSTable is being written in disk, writes can continue to a new memtable.
- When read comes in, it check memtable first, then the most up-to-date SSTable, and so on.
- From time to time, a background process is running for compacting.
The only problem is: if crash happens, the memtable will lose all fresh data. To deal with it, we can add a log file in disk for memtable, if crash happens, just recover it from log file; if memtable dumps to disk, then delete the log file cuz it’s useless.
Making LSM-Tree
Log-Structured Merge Tree is based on SSTable and memtable priciple. It will compact SSTable according to level or size.

Optimization
- Bloom Filter to aviod not exist keys.
- size-tiered and level-tiered compaction.
Bloom Filter → it can judege that an element must not exist or possible exist
For a given input
x, apply multiple hash functions to map its output in different place (you can imagine we have avector<bool>), e.g.hash1(x),hash2(x),hash3(x). Then when we have a new inputy, ifhash1(y),hash2(y)andhash3(y)all have value (true in vector), which meansyis possible exist; if any of these output don’t have value (false in vector), which meansymust miss.
The basic idea of LSM Tree is: keeping a cascade of ordered SSTables that are merged in the background.
Because the data is sorted and sparse index can narrow the possible range, range query is also efficient, cuz you can use like binary search to boost query speed.
B-Tree
Different with LSM Tree who uses variable size of segments, B Tree breaks the database down into fixed size block or page, which aligns with disk design. So for any operations, the B-Tree uses page as unit.
For read, it’s just like find a value in binary search tree, whereas here is N-ary search tree, and the N called branching factor.
For update, it needs to load the whole page, update, then write the whole page back.
For write and delete, it may need to split / merge node to make sure the tree is balanced.
Making tree reliable
For LSM Tree, the modifications will not change data in-place, instead, it will append new data and write to new place when compaction happens. But for B Tree, it wants to modify data in-place, which means modification doesn’t change the location of page.
When the tree structure is adjusted, many pages may be modified in cascade. For example, after a leaf node is split, two new leaf nodes and a parent node need to be written (updating the leaf pointer).
- crash → WAL (write ahead log), which means write the operation in log before execute it
- concurrency control → latch (lighweight lock)
Optimization
- Instead of WAL, use Copy-On-Write mechanism → instead of write page back to original place, create a new page in new place and modify parent’s pointer
- Instead of keep entire key, we can share prefix → for example, if the key is YYYYMMDD, the parent keeps YYYY, then its children keep MM, then DD. When we query, we must start from root, which means from left to right for key
- Add pointers between leaf nodes to boost range query
LSM-Tree vs B-Tree
Write Amplification → one write in database resulting in multiple writes to the disk.
For SSD, if there is a “delete” for some data, system cannot delete it immediately, system will give it a marker to indicate it’s useless. Remember for disk, the operation unit is page (or you can imagine it’s an area which contains many data), so when we execute a write:
- move all useful data out of the page, and store them in somewhere
- erase all data in this page
- move useful data back and add new data
It’s obviously writing more data than what we actually want, and this is the amplification.
| | B-Tree | LSM-Tree |
| — | — | — | — |
| R/W | read faster | write faster |
| Write Amplification | 1. data + WAL
2. massive data may cover different pages | 1. data + WAL
2. compaction needs to write in a new file |
| Write Throughput | low → random access | high → lower amplification; sequential access; compression makes smaller SSTable |
| Storage performance | lots of fragments | compaction save space |
| Bandwidth | predictable | compaction will take some bandwidth, which may infer service |
| Storage Amplification | some pages have unused space due to alignment | the same key exist in multiple segments |
| Concurrency | can add latch in parent nodes | one key exist in many places → MVCC |
Other Index
Primary vs Secondary
1 | | Id | Name | Age | |
Primary Index → index key is unique (such as primary key in MySQL), such as id
Secondary Index → index key may be mapped to several values, such as age
Cluster vs Non-Cluster
1 | | Id | Name | Age | |
Cluster → index value is data, such as index[id=1] -> {id: 1, name: "Tom", age: 18}
Non-Cluster → index value is pointer of data, such as index[age=18] -> [id=1, id=2]
For MySQL
- primary index is always cluster index → the leaf node of B-Tree is
<primary_key, row>- secondary index is always non-cluster index → the leaf node of B-Tree is
<secondary_index, primary_key>
In-Memory DB
The basic property for memory is: its data will lose after we restart the system. (except Non-Volatile Memory, aka NVM)
We can categorize in-memory DB into two categories:
- Cache only → no persistency
- Persistency → use WAL, snapshot, replica for recovery and reload, but execute all operations in memory
Why in-memory DB is so efficient? Many people may think the biggest reason is it doesn’t need to communicate with disk. However, the truth is in-memory DB doesn’t need to decode/encode data structures to fit into disk.