DDIA cookbook - (3)OLTP, OLAP and Columnar Store
Kexin Tang

OLTP & OLAP

OLTP → Online Transaction Processing.

OLAP → Online Analytics Processing.

A transaction needn’t have ACID. Transation processing just means allowing client to make low-latency reads and writes, which is opposed to batch processing.

Compare

Property OLTP OLAP
read small number of records per query, fetched by key aggregate large amount of records
write random access, low-latency bulk ELT or streaming
primary used by uesr, client internal analyst for decision making
what data represents latest data state history of events that happened over time
data size small or medium large

Data Warehouse

image

Warehouse will Extract-Transform-Load (ELT) data from multiple TP database and keep a read-only version data, so that when run AP in data warehouse, there is no inference with TP’s tasks.

Stars / Snowflakes Model

Stars model → center is the main table, and connect with several second level tables via foreign key

Snowflakes model → similar to star model, but it may have more levels like third or fourth level to represent more detailed information


Column-oriented

Storage Layout

image

For data warehouse, its main table may have 100+ even 1000+ columns, but for certain query, we just need roughly 2 ~ 3 columns for calculation. If we use row-oriented storage layout, it needs to fetch one row with 1000+ fields, then get only 2 ~ 3 data, then iter to next row. It has bad performance.

If we use column-oriented storage layout, it just need to fetch the required columns from disk and do aggregations.

Compression

Another advantage of column-oriented storage is: it can be compressed.

  1. for column, it represents the same concept, so its domain may be small, for example, if column represents country, then it can only have 200 or 300 possible values
  2. for column, it always has the same data type, like integer column, string column or boolean column, etc

Due to these two properties, we can compress column via compression algorithm, e.g. Bitmask for bool, Run Length Encoding, etc.

Memory bandwidth

Because column can be compressed, and for query we just fetch required columns without any useless fields, we can utilize memory bandwidth efficiently.

Vectorized processing

Single-Instruction-Multi-Data (a.k.a SIMD) technology can achieve vectorized processing, especially for bitwise operator, such as AND, OR, etc.

1
2
3
4
v1 = ["tom", "bob", "jack"]
bitmask = 0b010

v1 SIMD bitmask = [None, "bob", None]

Sort in column storage

  • Most cases, we focus on aggregation of certain column, so the order is unnecessary
  • Some cases, order can help us compress the column and do aggregation

    For example, if we sort “age” column, then we can store the column as “(18, 100), (20, 50)” which means 100 rows have age=18, 50 rows have age=20.

  • It wouldn’t make sense to sort each column independently, because we will lose the information about which fields are in the same row
  • Different queries benefit from different sort orders, so it can store the same data sorted in several ways

    For example, in a cluster we have 3 machines, all of them can provide service. Then we can store data sorted by age in machine 1, sorted by name in machine 2, sorted by sex in machine 3.

Writing to Column Storage

An update-in-place approach, like B-trees use, is not possible with compressed columns.

We can use LSM-Tree like method: All writes first go to an in-memory store, where they are added to a sorted structure and prepared for writing to disk. It doesn’t matter whether the in-memory store is row-oriented or column-oriented. When enough writes have accumulated, they are merged with the column files on disk and written to new files in bulk.

Aggregation: Data Cube & Materialized View

Data warehouse queries often involve aggregations, such as COUNT, SUM, AVG, MIN, or MAX in SQL. If the same aggregates are used by many different queries, it can be wasteful to crunch through the raw data every time. So we can cache some of the counts or sums that queries use most often in disk.

The concept to persistent store something from memory into disk called materialize.

The materialized view is an actual copy of the query results, written to disk. When the underlying data changes, a materialized view needs to be updated, because it is a denormalized copy of the data.

A common special case of a materialized view is known as a data cube, which is a grid of aggregates grouped by different dimensions.

image


Column familiy ≠ Column storage

Column family is a new concept for some databases like HBase and Cassandra.

Its schema may looks like

Id Name Work Personal
Work.Phone Work.Address Personal.Phone Personal.Address
1 Tom xxx-xxx-xxxx xxxx yyy-yyy-yyyy yyyy
2 Bob xxx-xxx-xxxx xxxx yyy-yyy-yyyy yyyy

The column family means the Work.Phone and Work.Address are under Work family, the storage still uses row-oriented.