DDIA cookbook - (4)Encoding
Kexin Tang

Compatibility

Morden software development will use rolling upgrade or staged rollout, which means deploying the new version to a few nodes at a time, checking whether the new version is bug free or not, then gradually upgrade all nodes.

And also we need to consider that our client may not install new version for some time.

So the new and old data formats, codes, policies will coexist in our system.

  • Backward compatibility → newer data can read data that written by older code
  • Forward compatibility → older data can read data that written by newer code

Encoding

In memory, we have various data structures like hash map, dictionary, vector, etc. But when we want to write data to a file or send it over network, we have to transform the complex data structure to simple sequence of bytes.

The process to transform data structure to bytes called encoding / serialization; the process to translate bytes to data structure called decoding / deserialization.

The problems are:

  • how can we encode / decode data to save time and space
  • how can we make the format be compatible

thrift and protobuf

They are binary encoding libraries that are based on the same principle. Both Thrift and Protocol Buffers require a schema for any data that is encoded.

1
2
3
4
5
struct Person {
1: required string userName,
2: optional i64 favoriteNumber,
3: optional list<string> interests
}
1
2
3
4
5
message Person {
required string user_name = 1;
optional int64 favorite_number = 2;
repeated string interests = 3;
}

Note: this format called Interface Defination Language(IDL).

Field tags and schema evolution

Each field is identified by its tag number (the numbers 1, 2, 3 in the sample schemas) and annotated with a datatype (like int64, string, etc). You can change the name of a field in the schema, since the encoded data never refers to field names, but you cannot change a field’s tag, since that would make all existing encoded data invalid.

You can add new fields to the schema, provided that you give each field a new tag number and make it optional or has default value.

  • forward: If old code (which doesn’t know about the new tag numbers you added) tries to read data written by new code, including a new field with a tag number it doesn’t recognize, it can simply ignore that field. The datatype annotation allows the parser to determine how many bytes it needs to skip
  • backward: The new code can always read old data, because the tag numbers still have the same meaning. The only detail is that if you add a new field, you cannot make it required. If you were to add a field and make it required, that check would fail if new code read data written by old code, because the old code will not have written the new field that you added.

Removing a field is just like adding a field, you can only remove a field that is optional and you can never use the same tag number again.

The merits of schemas

  • They can be much more compact than the various “binary JSON” variants, since they can omit field names from the encoded data.s
  • The schema is a valuable form of documentation, and because the schema is required for decoding, you can be sure that it is up to date.
  • Keeping a database of schemas allows you to check forward and backward compatibility of schema changes, before anything is deployed.
  • For users of statically typed programming languages, the ability to generate code from the schema is useful, since it enables type checking at compile time.

Modes of Dataflow

Dataflow through Databases

For database, it may be accessed by several processes, some requests are old, some are new, so compatibility is very important. Sometimes DBMS may alter the table schema to add or delete fields, which may cause problem:

image

different values written at different times

In database, you may have some values that were written five milliseconds ago, and some values that were written five years ago. When you change your service code (e.g. change the encoding policy), the five-year-old data will still be there, in the original encoding, unless you have explicitly rewritten it since then. This observation is sometimes summed up as data outlives code.

Rewriting (migrating) data into a new schema is certainly possible, but it’s an expensive thing to do on a large dataset.

Most relational databases allow simple schema changes, such as adding a new column with a null default value, without rewriting existing data. When an old row is read, the database fills in nulls for any columns that are missing from the encoded data on disk.

archive data

When you take a snapshot of your database from time to time, say for backup purposes or for loading into a data warehouse. In this case, the data dump will typically be encoded using the latest schema, even if the original encoding in the source database contained a mixture of schema versions from different eras.

Since you’re copying the data anyway, you might as well encode the copy of the data consistently.

Dataflow through service calls

Microservices → make the application easier to change and maintain by making services independently deployable and evolvable.

For example, each service should be owned by one team, and that team should be able to release new versions of the service frequently, without having to coordinate with other teams. In other words, we should expect old and new versions of servers and clients to be running at the same time.

RESTful

REST (Representational state transfer) is not a protocol, but rather a design philosophy that builds upon the principles of HTTP.

TL;DR → use URL to locate resources, use HTTP verbs to describe actions, use HTTP status codes to indicate results

www.myblog.com/introduce is a webpage related to introduce myself; www.myblog.com/blog/python/python_intro is a blog to introduce python

GET means fetch data from server, POST means submit form from client, DELETE means delete resources in server, etc (although POST can also be achieved by GET, but we need to clarify our actions)

Reference → What is RESTful API

Uniform interface

It indicates that the server transfers information in a standard format. The formatted resource is called a representation in REST. This format can be different from the internal representation of the resource on the server application. For example, the server can store data as text but send it in an HTML representation format.

Uniform interface imposes four architectural constraints:

  1. Requests should identify resources. They do so by using a uniform resource identifier (URI).
  2. Clients have enough information in the resource representation to modify or delete the resource if they want to. The server meets this condition by sending metadata that describes the resource further.
  3. Clients receive information about how to process the representation further. The server achieves this by sending self-descriptive messages that contain metadata about how the client can best use them.
  4. Clients receive information about all other related resources they need to complete a task. The server achieves this by sending hyperlinks in the representation so that clients can dynamically discover more resources.

Statelessness

In REST architecture, statelessness refers to a communication method in which the server completes every client request independently of all previous requests. Clients can request resources in any order, and every request is stateless or isolated from other requests. This REST API design constraint implies that the server can completely understand and fulfill the request every time.

Layered system

In a layered system architecture, the client can connect to other authorized intermediaries between the client and server, and it will still receive responses from the server. Servers can also pass on requests to other servers. You can design your RESTful web service to run on several servers with multiple layers such as security, application, and business logic, working together to fulfill client requests. These layers remain invisible to the client.

Cacheability

RESTful web services support caching, which is the process of storing some responses on the client or on an intermediary to improve server response time. For example, suppose that you visit a website that has common header and footer images on every page. Every time you visit a new website page, the server must resend the same images. To avoid this, the client caches or stores these images after the first response and then uses the images directly from the cache. RESTful web services control caching by using API responses that define themselves as cacheable or noncacheable.

Code on demand

In REST architectural style, servers can temporarily extend or customize client functionality by transferring software programming code to the client. For example, when you fill a registration form on any website, your browser immediately highlights any mistakes you make, such as incorrect phone numbers. It can do this because of the code sent by the server.

RPC

Problems

  • A local function call is predictable and either succeeds or fails, depending only on parameters that are under your control. RPC is unpredictable: the request or response may be lost due to a network problem, or the remote machine may be slow or unavailable, and such problems are entirely outside of your control.
  • A local function call either returns a result, or throws an exception, or never returns. RPC has another possible outcome: it may return without a result, due to a timeout.
  • If you retry a failed network request, it could happen that the requests are actually getting through, and only the responses are getting lost. In that case, retrying will cause the action to be performed multiple times.
  • Every time you call a local function, it normally takes about the same time to execute. A network request is much slower than a function call, and its latency is also wildly variable.
  • When you call a local function, you can efficiently pass it references (pointers) to objects in local memory. When you make a network request, all those parameters need to be encoded into a sequence of bytes that can be sent over the network.
  • The client and the service may be implemented in different programming languages.

Evolution

For services dataflow, it is reasonable to assume that all the servers will be updated first, and all the clients second. Thus, we only need backward compatibility on requests, and forward compatibility on responses.

Service compatibility is made harder by the fact that RPC is often used for communication across organizational boundaries, so the provider of a service often has no control over its clients and cannot force them to upgrade.

For RESTful APIs, common approaches are to use a version number in the URL or in the HTTP Accept header.

Dataflow through asynchronous message passing

  • Service → one process sends a request over the network to another process and expects a response as quickly as possible
  • Database → one process writes encoded data, and another process reads it again sometime in the future

The asynchronous message-passing systems, which are somewhere between RPC and databases.

They are similar to RPC in that a client’s request (usually called a message) is delivered to another process with low latency. They are similar to databases in that the message is not sent via a direct network connection, but goes via an intermediary called a message broker (a.k.a. message queue), which stores the message temporarily.

one database → store message in queue + two RPC → sender with queue, queue with receiver

Using a message broker has several advantages compared to direct RPC:

  • It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system reliability.
  • It can automatically redeliver messages to a process that has crashed, and thus prevent messages from being lost.
  • It avoids the sender needing to know the IP address and port number of the recipient.
  • It allows one message to be sent to several recipients.
  • It logically decouples the sender from the recipient because the sender just publishes messages and doesn’t care who consumes them.

However, a difference compared to RPC is that message-passing communication is usually one-way: a sender normally doesn’t expect to receive a reply to its messages. This communication pattern is asynchronous: the sender doesn’t wait for the message to be delivered, but simply sends it and then forgets about it.

message broker

Message brokers are used as follows: one process sends a message to a named queue or topic, and the broker ensures that the message is delivered to one or more consumers of or subscribers to that queue or topic.

  • several consumers share one topic (mutual exclusive)
  • every consumer own its topic

A topic provides only one-way dataflow. However, a consumer may itself publish messages to another topic like a chain, or to a reply queue that is consumed by the sender of the original message. So by combining several topics together, we can achieve complex topology.