OLTP & OLAP
OLTP → Online Transaction Processing.
OLAP → Online Analytics Processing.
A transaction needn’t have ACID. Transation processing just means allowing client to make low-latency reads and writes, which is opposed to batch processing.
Compare
| Property | OLTP | OLAP |
|---|---|---|
| read | small number of records per query, fetched by key | aggregate large amount of records |
| write | random access, low-latency | bulk ELT or streaming |
| primary used by | uesr, client | internal analyst for decision making |
| what data represents | latest data state | history of events that happened over time |
| data size | small or medium | large |
Data Warehouse

Warehouse will Extract-Transform-Load (ELT) data from multiple TP database and keep a read-only version data, so that when run AP in data warehouse, there is no inference with TP’s tasks.
Stars / Snowflakes Model
Stars model → center is the main table, and connect with several second level tables via foreign key
Snowflakes model → similar to star model, but it may have more levels like third or fourth level to represent more detailed information
Column-oriented
Storage Layout

For data warehouse, its main table may have 100+ even 1000+ columns, but for certain query, we just need roughly 2 ~ 3 columns for calculation. If we use row-oriented storage layout, it needs to fetch one row with 1000+ fields, then get only 2 ~ 3 data, then iter to next row. It has bad performance.
If we use column-oriented storage layout, it just need to fetch the required columns from disk and do aggregations.
Compression
Another advantage of column-oriented storage is: it can be compressed.
- for column, it represents the same concept, so its domain may be small, for example, if column represents country, then it can only have 200 or 300 possible values
- for column, it always has the same data type, like integer column, string column or boolean column, etc
Due to these two properties, we can compress column via compression algorithm, e.g. Bitmask for bool, Run Length Encoding, etc.
Memory bandwidth
Because column can be compressed, and for query we just fetch required columns without any useless fields, we can utilize memory bandwidth efficiently.
Vectorized processing
Single-Instruction-Multi-Data (a.k.a SIMD) technology can achieve vectorized processing, especially for bitwise operator, such as AND, OR, etc.
1 | v1 = ["tom", "bob", "jack"] |
Sort in column storage
- Most cases, we focus on aggregation of certain column, so the order is unnecessary
- Some cases, order can help us compress the column and do aggregation
For example, if we sort “age” column, then we can store the column as “(18, 100), (20, 50)” which means 100 rows have age=18, 50 rows have age=20.
- It wouldn’t make sense to sort each column independently, because we will lose the information about which fields are in the same row
- Different queries benefit from different sort orders, so it can store the same data sorted in several ways
For example, in a cluster we have 3 machines, all of them can provide service. Then we can store data sorted by age in machine 1, sorted by name in machine 2, sorted by sex in machine 3.
Writing to Column Storage
An update-in-place approach, like B-trees use, is not possible with compressed columns.
We can use LSM-Tree like method: All writes first go to an in-memory store, where they are added to a sorted structure and prepared for writing to disk. It doesn’t matter whether the in-memory store is row-oriented or column-oriented. When enough writes have accumulated, they are merged with the column files on disk and written to new files in bulk.
Aggregation: Data Cube & Materialized View
Data warehouse queries often involve aggregations, such as COUNT, SUM, AVG, MIN, or MAX in SQL. If the same aggregates are used by many different queries, it can be wasteful to crunch through the raw data every time. So we can cache some of the counts or sums that queries use most often in disk.
The concept to persistent store something from memory into disk called materialize.
The materialized view is an actual copy of the query results, written to disk. When the underlying data changes, a materialized view needs to be updated, because it is a denormalized copy of the data.
A common special case of a materialized view is known as a data cube, which is a grid of aggregates grouped by different dimensions.

Column familiy ≠ Column storage
Column family is a new concept for some databases like HBase and Cassandra.
Its schema may looks like
| Id | Name | Work | Personal | ||
|---|---|---|---|---|---|
| Work.Phone | Work.Address | Personal.Phone | Personal.Address | ||
| 1 | Tom | xxx-xxx-xxxx | xxxx | yyy-yyy-yyyy | yyyy |
| 2 | Bob | xxx-xxx-xxxx | xxxx | yyy-yyy-yyyy | yyyy |
The column family means the Work.Phone and Work.Address are under Work family, the storage still uses row-oriented.