Designing Data-Intensive Applications
CH9 Consistency and Consensus
- Linearizability: make a system appear as if there were only one copy of the data and all operations on it are atomic
- Linearizability vs Serializability
- CAP theorem
- Ordering Guarantees
- Linearizability: total order of operations
- Causality defines a partial order; Lamport timestamps give a total order consistent with it (see the sketch after this list)
- Linearizability is stronger than causal consistency
- Total order broadcast: messages delivered reliably to all nodes, in the same order
- Distributed Transactions and Consensus
- Atomic commit
- two-phase commit: uses a coordinator (see the 2PC sketch after this list)
- fault-tolerant consensus algorithms: VSR (Viewstamped Replication), Paxos, Raft, and Zab; they implement total order broadcast
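
A minimal Lamport-timestamp sketch (node names and events here are hypothetical): each node keeps a counter, and `(counter, node_id)` pairs give a total order that respects causality.

```python
# Minimal Lamport clock sketch. If event A happened-before event B,
# then timestamp(A) < timestamp(B); concurrent events are ordered arbitrarily
# but consistently by the (counter, node_id) tuple comparison.

class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        """Increment the counter for an event on this node."""
        self.counter += 1
        return (self.counter, self.node_id)

    def send(self):
        """Timestamp attached to an outgoing message."""
        return self.local_event()

    def receive(self, msg_timestamp):
        """On receipt, move the counter past the sender's timestamp."""
        self.counter = max(self.counter, msg_timestamp[0])
        return self.local_event()

a, b = LamportClock("A"), LamportClock("B")
t1 = a.send()          # (1, "A")
t2 = b.receive(t1)     # (2, "B") -- causally after t1, so it compares larger
t3 = b.local_event()   # (3, "B")
assert t1 < t2 < t3
```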
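
A minimal two-phase commit coordinator sketch (the `Participant` class and its method names are hypothetical, not any specific library's API): phase 1 asks every participant to prepare, phase 2 commits only if all voted yes.

```python
# Two-phase commit sketch: a single "no" vote (or a crash/timeout) in phase 1
# forces the coordinator to abort everywhere.

class Participant:
    def __init__(self, name):
        self.name = name
        self.state = "init"

    def prepare(self):
        # In a real system this durably writes the transaction data and returns
        # whether the participant can guarantee a later commit.
        self.state = "prepared"
        return True

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (voting)
    votes = [p.prepare() for p in participants]
    if all(votes):
        # The coordinator must log the commit decision durably before phase 2;
        # once decided, it must keep retrying until every participant acknowledges.
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


print(two_phase_commit([Participant("db1"), Participant("db2")]))  # True
```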
CH10 Batch Processing
- Unix tools: awk, sed, grep, sort, uniq, xargs
- MapReduce and distributed filesystems
- Reduce-Side join
- Map-Side join
- broadcast hash join (see the join sketch after this list)
- partitioned hash join
- map-side merge join
- Batch Process Output
- Build search index
- Key-value stores
- MapReduce vs MPP databases: differ in handling of faults and in use of memory/disk
- Beyond MapReduce
- MapReduce materializes intermediate state; dataflow engines such as Spark, Tez, and Flink avoid this
- Pregel model for processing graph data
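
A minimal broadcast hash join sketch (the toy "users"/"events" data is hypothetical), assuming the small table fits in memory on every mapper: load it into a dict and stream the large input past it, so the large side needs no sort or shuffle.

```python
# Broadcast (map-side) hash join sketch: join clickstream events to user records.

users = {1: "alice", 2: "bob"}          # small table, broadcast to every mapper

events = [                              # large table, streamed record by record
    {"user_id": 1, "url": "/home"},
    {"user_id": 2, "url": "/cart"},
    {"user_id": 1, "url": "/checkout"},
]

def map_side_join(event_stream, small_table):
    for event in event_stream:
        user = small_table.get(event["user_id"])
        if user is not None:
            yield {**event, "user": user}

for joined in map_side_join(events, users):
    print(joined)
```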
CH11 Stream Processing
- Messaging Systems: publish/subscribe model
- direct messaging from producers to consumers
- Message brokers
- AMQP/JMS-style message broker
- log-based message brokers: Kafka, Kinesis, etc.; combine durable, log-structured storage ideas from databases with messaging
- Change Data Capture: capture changes to a mutable database as a stream of events
- Event Sourcing: record immutable events
- Processing Streams
- Event time vs Processing Time
- Stream-stream join, stream-table join, table-table join (see the stream-table join sketch after this list)
- fault-tolerance: exactly-once semantics
- Microbatching: Spark Streaming; checkpointing: Apache Flink
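
A minimal stream-table join sketch (the changelog and event records are hypothetical): a change data capture stream keeps a local copy of the table up to date, and each incoming activity event is enriched by a lookup against that copy.

```python
# Stream-table join sketch: maintain a local materialized view from a changelog,
# then enrich activity events against it.

user_table = {}   # local copy of the table, maintained from the CDC changelog

def apply_change(change):
    """Apply one changelog record (upsert, or delete via a None value) to the table."""
    if change["value"] is None:
        user_table.pop(change["key"], None)
    else:
        user_table[change["key"]] = change["value"]

def join_event(event):
    """Enrich an activity event with the current table state."""
    return {**event, "user": user_table.get(event["user_id"])}

apply_change({"key": 1, "value": {"name": "alice", "plan": "free"}})
apply_change({"key": 1, "value": {"name": "alice", "plan": "pro"}})   # later update wins

print(join_event({"user_id": 1, "action": "login"}))
# {'user_id': 1, 'action': 'login', 'user': {'name': 'alice', 'plan': 'pro'}}
```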
CH12 The Future of Data Systems
- Data Integration
- Unifying batch and stream processing
- Unbundling Databases
- Ensuring processing is correct: end-to-end operation identifiers for exactly-once effects (see the sketch after this list)
- Predictive analytics: bias and privacy concerns
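
A minimal sketch of end-to-end duplicate suppression (names and in-memory storage are hypothetical; a real system would record the processed IDs and the data change in the same atomic write): the client attaches a unique request ID, so a retried request does not take effect twice.

```python
# Idempotent processing via unique request IDs: retries are safe because a
# request that was already applied is recognized and ignored.

import uuid

processed_ids = set()
balances = {"alice": 100}

def transfer(request_id, account, amount):
    if request_id in processed_ids:
        return "duplicate ignored"        # retry of an already-applied request
    balances[account] += amount
    processed_ids.add(request_id)
    return "applied"

req = str(uuid.uuid4())                   # generated once by the client
print(transfer(req, "alice", -30))        # applied
print(transfer(req, "alice", -30))        # duplicate ignored (safe retry)
print(balances["alice"])                  # 70
```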