DDIA cookbook - (12) The Future of Data Systems
Kexin Tang

Data Integration

Appropriate choice of software tool depends on the circumstances.

Combining Specialized Tools by Deriving Data

As the number of different representations of the data increases, the integration problem becomes harder.

Reasoning about dataflows

When copies of the same data need to be maintained in several storage systems in order to satisfy different access patterns, you need to be very clear about the inputs and outputs.

Imagine that a system receives new input, and that input must be processed to build multiple derived datasets. If multiple clients send conflicting writes, and the different processors consume those writes in different orders, the derived datasets may become inconsistent with each other.

If it is possible for you to funnel all user input through a single system that decides on an ordering for all writes, it becomes much easier to derive other representations of the data by processing the writes in the same order. This is “Total Order Broadcast”.
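A minimal sketch of this idea (all class and variable names are illustrative): a single append-only log decides one global order for all writes, and every derived view applies events in that same order, so the views converge to the same state.

```python
# A single append-only log assigns one global order to all writes; every
# derived view applies the events in that order, so all views converge.

class Log:
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)      # the single log decides the order
        return len(self.events) - 1    # offset of the appended event

class DerivedView:
    def __init__(self):
        self.state = {}
        self.offset = 0

    def catch_up(self, log):
        for event in log.events[self.offset:]:
            key, value = event
            self.state[key] = value    # last write wins, in log order
        self.offset = len(log.events)

log = Log()
search_index, cache = DerivedView(), DerivedView()

log.append(("user:1", "Alice"))
log.append(("user:1", "Alicia"))       # conflicting write, ordered by the log

for view in (search_index, cache):
    view.catch_up(log)

assert search_index.state == cache.state == {"user:1": "Alicia"}
```

Because both views replay the same log, it does not matter when each one catches up; they cannot permanently disagree about which write won.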

Derived data vs. distributed transactions

The classic approach for keeping different data systems consistent with each other involves distributed transactions, which decide the order of writes by using locks and ensure that changes take effect exactly once through atomic commit.

In a derived data system, a log-based approach also provides ordering, and deterministic processing ensures that retries produce the same output.

The biggest difference is that transaction systems usually provide linearizability, which implies useful guarantees such as reading your own writes. On the other hand, derived data systems are often updated asynchronously, and so they do not by default offer the same timing guarantees.

The limits of total ordering

In large systems, total ordering has some limits:

  • In most cases, constructing a totally ordered log requires all events to pass through a single leader node that decides on the ordering. If the throughput of events is greater than a single machine can handle, you need to partition the log across multiple machines. The order of events in two different partitions is then ambiguous.
  • In geographically distributed deployments, each datacenter typically has its own leader, which implies an undefined ordering of events that originate in two different datacenters.
  • Microservices are usually deployed as stateless, independent units. When two events originate in different services, there is no defined order between them.
  • Some applications apply updates immediately on the client side, even while offline. Clients and servers are then likely to see events in different orders.

Total order broadcast is equivalent to consensus. Most consensus algorithms are designed for situations in which the throughput of a single node is sufficient to process the entire stream of events, and they do not provide a mechanism for multiple nodes to share the work of ordering the events.

Ordering events to capture causality

In cases where there is no causal link between events, the lack of a total order is not a big problem, since concurrent events can be ordered arbitrarily. Some other cases are easy to handle: for example, when there are multiple updates of the same object, they can be totally ordered by routing all updates for a particular object ID to the same log partition.
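The routing idea above can be sketched as follows (the hash scheme and partition count are illustrative assumptions): hashing the object ID picks a partition deterministically, so every update to the same object lands in the same partition and is totally ordered there.

```python
# Route all updates for one object ID to the same log partition, so updates
# to that object are totally ordered within their partition.
import hashlib

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(object_id: str) -> int:
    # A stable hash of the object ID chooses the partition deterministically.
    digest = hashlib.md5(object_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def publish(object_id: str, event: str):
    partitions[partition_for(object_id)].append((object_id, event))

publish("user:42", "create")
publish("user:42", "update")

# Both updates for user:42 land in the same partition, in order:
p = partition_for("user:42")
assert partitions[p] == [("user:42", "create"), ("user:42", "update")]
```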

However, things become difficult when there are multiple derived systems updating different objects. For example, a user modifies something in system A and then, based on that change, modifies system B. If the causal dependency between the two writes is not captured, a consumer may process the events in the wrong order and produce incorrect output. One solution is to join system B’s events with the relevant state from system A before taking the next step. Even so, difficulties remain:

  • Logical timestamps can provide total ordering without coordination. However, they still require recipients to handle events that are delivered out of order, and they require additional metadata to be passed around.
  • Conflict resolution algorithms help with processing events that are delivered in an unexpected order. They are useful for maintaining state, but they do not help if actions have external side effects.
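The first bullet can be illustrated with a small Lamport-clock sketch (class and node names are hypothetical): each node keeps a counter, and `(counter, node_id)` pairs give a total order without central coordination, while recipients must still cope with events arriving out of timestamp order.

```python
# A minimal Lamport (logical) clock: timestamps are (counter, node_id) pairs,
# giving a total order without any central coordinator. Causally related
# events get increasing timestamps; concurrent events are ordered arbitrarily
# by the tiebreak on node_id.

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return (self.clock, self.node_id)

    def receive(self, timestamp):
        # Advance past the sender's counter, so the effect of a message is
        # always timestamped after its cause.
        self.clock = max(self.clock, timestamp[0]) + 1
        return (self.clock, self.node_id)

a, b = Node("A"), Node("B")
t1 = a.local_event()   # (1, "A")
t2 = b.receive(t1)     # (2, "B") -- causally after t1
t3 = a.local_event()   # (2, "A") -- concurrent with t2

assert t1 < t2         # causal order is respected by the timestamps
```

Note the extra metadata: every event now carries a timestamp, and a consumer that sees `t2` before `t1` has to buffer or reorder, which is exactly the burden the bullet describes.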

Batch and Stream Processing

The goal of data integration is to make sure that data ends up in the right form in all the right places. Doing so requires consuming inputs, transforming, joining, filtering, aggregating, training models, evaluating, and eventually writing to the appropriate outputs. Batch and stream processors are the tools for achieving this goal.

In principle, one type of processing can be emulated on top of the other, although the performance characteristics vary.

For example, Spark performs stream processing on top of its batch processing engine by cutting the stream of events into microbatches.
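A toy version of microbatching (the window size and function names are illustrative, not Spark’s API): group timestamped events into fixed time windows and hand each window to a batch function.

```python
# Cut a stream of (timestamp, value) events into fixed-size time windows,
# so each window can be processed as a small batch.

def microbatches(events, window=1.0):
    """Group (timestamp, value) events into time windows of `window` seconds."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // window), []).append(value)
    # Return the batches in time order.
    return [batches[k] for k in sorted(batches)]

events = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (2.5, "d")]
assert microbatches(events) == [["a", "b"], ["c"], ["d"]]
```

The trade-off is latency: no event in a window can be processed until the window closes, so smaller windows mean lower delay but more per-batch overhead.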

Maintaining derived state

In principle, derived data systems could be maintained synchronously, just like a relational database updates secondary indexes synchronously within the same transaction as writes to the table being indexed.

However, asynchrony is what makes systems based on event logs robust: it allows a fault in one part of the system to be contained locally, whereas distributed transactions abort if any one participant fails, so they tend to amplify failures by spreading them to the rest of the system.

What’s more, secondary indexes often cross partition boundaries. A partitioned system with secondary indexes either needs to send writes to multiple partitions (if the index is partitioned by term) or send reads to all partitions (if the index is partitioned by document). Such cross-partition communication is also most reliable and scalable if the index is maintained asynchronously.
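The read-side case can be sketched as a scatter/gather query (the partition contents and field names are made up): with a document-partitioned secondary index, a read by secondary key must be sent to every partition and the partial results combined.

```python
# Scatter/gather over a document-partitioned secondary index: each partition
# only knows about its own documents, so a query by secondary attribute must
# be fanned out to all partitions and the results merged.

partitions = [
    {"1": {"color": "red"}, "2": {"color": "blue"}},
    {"3": {"color": "red"}},
]

def query_by_color(color):
    results = []
    for part in partitions:              # scatter the read to every partition
        results += [doc_id for doc_id, doc in part.items()
                    if doc["color"] == color]
    return sorted(results)               # gather and combine the answers

assert query_by_color("red") == ["1", "3"]
```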

Reprocessing data

Stream processing allows changes in the input to be reflected in derived views with low delay, whereas batch processing allows large amounts of accumulated historical data to be reprocessed in order to derive new views onto an existing dataset.

In particular, reprocessing existing data provides a good mechanism for maintaining a system and evolving it to support new features and changed requirements: for example, adding a new column to a schema, or changing a data type or layout.

Derived views allow gradual evolution. If you want to restructure a dataset, you do not need to perform the migration as a sudden switch. Instead, you can maintain the old schema and the new schema side by side as two independently derived views onto the same underlying data.
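As a sketch (the event log and view functions are hypothetical), the old and new schema can be two independent transformations of the same underlying events, maintained side by side during a gradual migration:

```python
# Two independently derived views over the same underlying event log: one in
# the old schema, one in the new. Both can be served during a migration.

events = [
    {"user_id": 1, "name": "Alice Smith"},
    {"user_id": 2, "name": "Bob Jones"},
]

def old_view(log):
    # Old schema: a single "name" column.
    return {e["user_id"]: {"name": e["name"]} for e in log}

def new_view(log):
    # New schema: name split into first/last columns, derived by reprocessing
    # the same events -- no destructive migration of the old view is needed.
    return {e["user_id"]: dict(zip(("first", "last"),
                                   e["name"].split(" ", 1)))
            for e in log}

assert old_view(events)[1] == {"name": "Alice Smith"}
assert new_view(events)[1] == {"first": "Alice", "last": "Smith"}
```

Once all readers have moved to the new view, the old one can simply stop being maintained and be dropped.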

The lambda architecture

The lambda architecture can combine batch and stream processing (reprocess historical data and process recent updates). The core idea of the lambda architecture is that incoming data should be recorded by appending immutable events to an always-growing dataset, just like event sourcing.

The lambda architecture proposes running two different systems in parallel. In the lambda approach, the stream processor consumes the events and quickly produces an approximate update to the view; the batch processor later consumes the same set of events and produces a corrected version of the derived view.

The reasoning behind this design is that batch processing is simpler and thus less prone to bugs, while stream processors are thought to be less reliable and harder to make fault-tolerant. Moreover, the stream process can use fast approximate algorithms while the batch process uses slower exact algorithms.
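The read path of this design can be sketched as follows (the view contents and key names are illustrative): the batch layer supplies an exact view over historical data, the speed (stream) layer an approximate view over recent events, and reads merge the two.

```python
# Lambda-architecture read path: merge the exact batch view with the
# approximate speed-layer view that covers events since the last batch run.

batch_view = {"page/a": 100, "page/b": 40}   # exact counts up to last batch run
speed_view = {"page/a": 3, "page/c": 1}      # counts of events since then

def merged_count(key):
    # Recent increments from the speed layer are added on top of the exact
    # batch value; the next batch run will fold them in and correct any drift.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

assert merged_count("page/a") == 103
assert merged_count("page/c") == 1
```

This merge step is exactly the operational burden listed among the drawbacks below: every query must combine two views that are maintained by separate systems.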

However, lambda architecture also has its own drawbacks:

  • Two processing logics with the same goal, one batch and one stream, must be developed and kept in sync.
  • The outputs of the batch and stream pipelines need to be merged before a view can be served.
  • To avoid reprocessing the whole history on every run, the batch layer needs a mechanism for incremental batches.