DDIA cookbook - (2)Data Models and Query Languages
Kexin Tang

Data Model

A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities.

Morden applications are built by layering one data model on top of another. Each layer hides the complexity of the layers below it by providing a data model.


Relational Vs Document Model

One-to-Many

image

For relational model, is hard to represent One-to-Many relationship, like one person may have 0 to infinite work experience.

  • Relational → create multiple table to store company info, school info, then use foreign key to JOIN several tables
  • Document → can use JSON-like structure, easy to read by human, and better locality (store these info in one place)

Many-to-One & Many-to-Many

When we store the region, we use ID rather than pure text. This is because ID has no meaning, it never needs to change. For example, if we want to update “Greater Seattle Area” to “Seattle”, we just need to modify the text in region_table.

Document model is good at One-to-Many because you can imagine it as a tree, but it’s not good at Many-to-X, because it looks like a graph.

If document model doesn’t support JOIN, then we need to use iteration to mock JOIN in application level. Even if it supports JOIN, we still need to use document reference (just like foreign key), which is similar to relational model.

locality

For document database, it always store the whole document as a single object.

For read, it need to load the whole document from disk to memory, so if we need most of parts inside the document, it’s fine; otherwise, its performance is poor.

For write, it also need to rewrite the whole document from memory to disk, and only modifications that don’t change the total encoded size of a document can easily be performed in place; otherwise, system need to assign new space for new document.

schema-on-read vs schema-on-write

Document Database is not schemaless. Actually, it has implicit constrain, like when we write service code to read something from DB, we assume we can get some fields, so schema-on-read is a more accurate term.

schema-on-read check when we READ poor efficiency, cuz we cannot do any optimizations when we write it
schema-on-write check when we WRITE good efficiency cuz we can check the type then do optimization

Summary

document relation
relation map tree, one-to-many can use foreign key to achieve many-to-X
JOIN :( :)
flexibility flexible, can add fields easily schema, hard to change
locality if operate the whole doc, performance is good; but if only operate partial doc, performance is not good scatter in tables

Query Language

Declarative vs Imperative Language

Declarative Imperative
Concept declare the logic rather than actual execution define the execution plan
Example SQL, CSS C++, Python, …
Abstraction high low
Parallel good, cuz we let system do the optimization poor, cuz we already defined the steps
What u want? How to do that?

Here are some advantages for declarative language:

  • More concise and easily use
  • Hide implementation details
  • Good support for parallelism

MapReduce Query

MapReduce is neither a declarative language nor a imperative language.

  • declarative → we don’t need to specify how to iter or shuffle dataset
  • imperative → we need to implement map and reduce functions

It requires the map and reduce are pure function, which means they only use input data, they cannot do anything else like query database.

And they cannot have any side effects, which means no matter when we run the function for a given input, the output should be the same.

What’s more, mapreduce is a very low-level model for distributed execution, so engineers can implement higher-level query language base it, like SQL can be implemented as a pipeline of mapreduce.


Graph Data Model

Suitable for Many-to-Many relationships.

  • vertice / node
  • edge / relation
  • attribute

Graph can store both homogeneous and heterogeneous data. For example, node can represents people, city, animal, activity, etc.