Designing Data-Intensive Applications

CH9 Consistency and Consensus

  • Linearizability: make a system appear as if there were only one copy of the data and all operations on it are atomic
    • Linearizability vs Serializability
    • CAP theorem
  • Ordering Guarantees
    • Linearizability: total order of operations
    • Causality defines partial order
    • Linearizability is stronger than causal consistency
    • Total order broadcast
  • Distributed Transactions and Consensus
    • Atomic commit
    • two-phase commit (2PC): a coordinator collects prepare votes from all participants, then commits or aborts (see the sketch after this list)
    • fault-tolerant consensus algorithms: VSR, Paxos, Raft and Zab, which are total order broadcast algorithms
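
A minimal sketch of the two-phase commit flow, assuming in-memory participants and a coordinator that never crashes between phases; the Participant class and its method names are hypothetical, not the book's or any real library's API:

```python
# Two-phase commit sketch: the coordinator asks every participant to prepare,
# and commits only if all of them vote yes; otherwise it aborts everywhere.
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical participant that can stage writes (prepare) and later commit/abort."""
    name: str
    data: dict = field(default_factory=dict)
    staged: dict = field(default_factory=dict)

    def prepare(self, txn_id: str, writes: dict) -> bool:
        # Stage the writes (a real participant would make this durable) and vote yes.
        self.staged[txn_id] = dict(writes)
        return True

    def commit(self, txn_id: str) -> None:
        self.data.update(self.staged.pop(txn_id))

    def abort(self, txn_id: str) -> None:
        self.staged.pop(txn_id, None)


def two_phase_commit(txn_id: str, writes: dict, participants: list[Participant]) -> bool:
    # Phase 1: collect votes; any "no" (or, in practice, a timeout) forces an abort.
    votes = [p.prepare(txn_id, writes) for p in participants]
    if all(votes):
        # Phase 2: the decision is final, so the coordinator must see it through.
        for p in participants:
            p.commit(txn_id)
        return True
    for p in participants:
        p.abort(txn_id)
    return False


if __name__ == "__main__":
    nodes = [Participant("db1"), Participant("db2")]
    print(two_phase_commit("txn-1", {"k": "v"}, nodes), nodes[0].data, nodes[1].data)
```

In a real 2PC the coordinator writes its commit/abort decision to a durable log before phase 2, because a participant that has voted yes is blocked (in doubt) until it learns the outcome.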

CH10 Batch Processing

  • Unix tools: awk, sed, grep, sort, uniq, xargs
  • MapReduce and distributed filesystems
    • Reduce-Side join
    • Map-Side join
      • broadcast hash join (see the sketch after this list)
      • partitioned hash join
      • map-side merge join
    • Batch Process Output
      • Build search index
      • Key-value stores
    • MapReduce vs MPP (massively parallel processing) databases: differ in their handling of faults and use of memory/disk
  • Beyond MapReduce
    • MapReduce materializes intermediate state; newer dataflow engines (Spark, Tez and Flink) avoid this
    • Pregel processing model for graph data
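
To make the map-side broadcast hash join concrete, here is a toy version under the assumption that the small table fits in memory on every mapper; the record fields (user_id, url, dob) are invented for illustration:

```python
# Broadcast hash join sketch: load the small table into a hash table,
# then stream the large table and join each record against it in the mapper.
from typing import Iterable, Iterator


def broadcast_hash_join(
    big: Iterable[dict],    # large input, streamed record by record
    small: Iterable[dict],  # small table, assumed to fit in memory
    key: str,
) -> Iterator[dict]:
    lookup = {row[key]: row for row in small}  # the "broadcast" copy each mapper holds
    for record in big:
        match = lookup.get(record[key])
        if match is not None:
            yield {**record, **match}


if __name__ == "__main__":
    activity = [{"user_id": 1, "url": "/a"}, {"user_id": 2, "url": "/b"}]
    profiles = [{"user_id": 1, "dob": "1990-01-01"}, {"user_id": 2, "dob": "1985-06-15"}]
    for joined in broadcast_hash_join(activity, profiles, "user_id"):
        print(joined)
```

The partitioned hash join applies the same idea within each partition, while the map-side merge join instead relies on both inputs being partitioned and sorted in the same way.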

CH11 Stream Processing

  • Messaging Systems: publish/subscribe model
    • direct messaging from producers to consumers
    • Message brokers
      • AMQP/JMS-style message broker
      • log-based message brokers: Kafka, Kinesis, etc., borrowing ideas from databases (see the sketch after this list)
    • Change Data Capture: extract changes to a mutable database as a stream of events
    • Event Sourcing: record immutable events
  • Processing Streams
    • Event time vs Processing Time
    • Stream-stream join, stream-table join, table-table join
    • fault-tolerance: exactly-once semantics
      • Microbatching: Spark Streaming; checkpointing: Apache Flink
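
A minimal in-memory sketch of the log-based broker idea (Kafka/Kinesis-style): messages are appended to partitioned logs and each consumer tracks its own offsets, so reads are non-destructive and can be replayed. The class and method names are illustrative, not any real client API:

```python
# Log-based broker sketch: an append-only log per partition; consumers
# advance an offset instead of the broker deleting delivered messages.
class PartitionedLog:
    def __init__(self, num_partitions: int = 2):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key: str, value: str) -> int:
        # All messages with the same key go to the same partition,
        # so ordering per key is preserved.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p


class Consumer:
    def __init__(self, log: PartitionedLog):
        self.log = log
        self.offsets = [0] * len(log.partitions)  # one offset per partition

    def poll(self, partition: int) -> list[tuple[str, str]]:
        # Read everything past our current offset; the log itself is untouched,
        # so another consumer (or a replay) can read the same messages again.
        new = self.log.partitions[partition][self.offsets[partition]:]
        self.offsets[partition] += len(new)
        return new


if __name__ == "__main__":
    log = PartitionedLog()
    p = log.append("user-1", "clicked /home")
    log.append("user-1", "clicked /cart")
    c = Consumer(log)
    print(c.poll(p))  # both events, in order
    print(c.poll(p))  # [] -- the offset has advanced
```

This is the main contrast with AMQP/JMS-style brokers, which delete a message once the consumer acknowledges it.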

CH12 The Future of Data Systems

  • Data Integration
    • Unifying batch and stream processing
  • Unbundling Databases
  • Ensuring that processing is correct (see the sketch after this list)
  • Predictive analytics: bias and privacy
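
One concrete form of ensuring correct processing is end-to-end duplicate suppression: the client attaches a unique operation ID so that a request retried after a lost response is applied at most once. A toy in-memory sketch, assuming the set of processed IDs would in practice be persisted atomically with the data it protects:

```python
# End-to-end duplicate suppression sketch: the client generates a unique
# request ID, and the processor applies each ID at most once, so retries
# after a timeout or crash do not double-apply the operation.
import uuid


class AccountStore:
    def __init__(self):
        self.balance = 0
        self.processed_ids: set[str] = set()  # would be stored durably with the data

    def apply_payment(self, request_id: str, amount: int) -> None:
        if request_id in self.processed_ids:
            return                             # duplicate retry: ignore
        self.balance += amount
        self.processed_ids.add(request_id)     # recorded together with the write


if __name__ == "__main__":
    store = AccountStore()
    req = str(uuid.uuid4())                    # generated once, on the client side
    store.apply_payment(req, 100)
    store.apply_payment(req, 100)              # retried after a lost response
    print(store.balance)                       # 100, not 200
```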