Data Model

A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities.

Morden applications are built by layering one data model on top of another. Each layer hides the complexity of the layers below it by providing a data model.

Relational Vs Document Model

One-to-Many

For relational model, is hard to represent One-to-Many relationship, like one person may have 0 to infinite work experience.

Relational → create multiple table to store company info, school info, then use foreign key to JOIN several tables
Document → can use JSON-like structure, easy to read by human, and better locality (store these info in one place)

Many-to-One & Many-to-Many

When we store the region, we use ID rather than pure text. This is because ID has no meaning, it never needs to change. For example, if we want to update “Greater Seattle Area” to “Seattle”, we just need to modify the text in region_table.

Document model is good at One-to-Many because you can imagine it as a tree, but it’s not good at Many-to-X, because it looks like a graph.

If document model doesn’t support JOIN, then we need to use iteration to mock JOIN in application level. Even if it supports JOIN, we still need to use document reference (just like foreign key), which is similar to relational model.

locality

For document database, it always store the whole document as a single object.

For read, it need to load the whole document from disk to memory, so if we need most of parts inside the document, it’s fine; otherwise, its performance is poor.

For write, it also need to rewrite the whole document from memory to disk, and only modifications that don’t change the total encoded size of a document can easily be performed in place; otherwise, system need to assign new space for new document.

schema-on-read vs schema-on-write

Document Database is not schemaless. Actually, it has implicit constrain, like when we write service code to read something from DB, we assume we can get some fields, so schema-on-read is a more accurate term.

schema-on-read	check when we READ	poor efficiency, cuz we cannot do any optimizations when we write it
schema-on-write	check when we WRITE	good efficiency cuz we can check the type then do optimization

Summary

	document	relation
relation map	tree, one-to-many	can use foreign key to achieve many-to-X
JOIN	:(	:)
flexibility	flexible, can add fields easily	schema, hard to change
locality	if operate the whole doc, performance is good; but if only operate partial doc, performance is not good	scatter in tables

Query Language

Declarative vs Imperative Language

	Declarative	Imperative
Concept	declare the logic rather than actual execution	define the execution plan
Example	SQL, CSS	C++, Python, …
Abstraction	high	low
Parallel	good, cuz we let system do the optimization	poor, cuz we already defined the steps
	What u want?	How to do that?

Here are some advantages for declarative language:

More concise and easily use
Hide implementation details
Good support for parallelism

MapReduce Query

MapReduce is neither a declarative language nor a imperative language.

declarative → we don’t need to specify how to iter or shuffle dataset

imperative → we need to implement map and reduce functions

It requires the map and reduce are pure function, which means they only use input data, they cannot do anything else like query database.

And they cannot have any side effects, which means no matter when we run the function for a given input, the output should be the same.

What’s more, mapreduce is a very low-level model for distributed execution, so engineers can implement higher-level query language base it, like SQL can be implemented as a pipeline of mapreduce.

Graph Data Model

Suitable for Many-to-Many relationships.

vertice / node
edge / relation
attribute

Graph can store both homogeneous and heterogeneous data. For example, node can represents people, city, animal, activity, etc.