Design Data Intensive Applications Note 1

Design Data-Intensive Applications

ORM: Object-relational mapping frameworks (ActiveRecord and Hibernate), reduce translation layer between OOP and SQL data model
self-contained document, a JSON representation, simpler than XML (Document-oriented DB, MongoDB, REthinkDB, CouchDB and Espresso)
CODASYL model: the network model IMS (different from graph model)
Relational(schema-on-write) vs Document (schema-on-read)
Document databases are schemaless
SQL is a declarative query language; IMS and CODASYL quiried DB using imperative code(tells the computer to perform certain operations)
Graph-like model
- Property graph model: Neo4j, Titan and InfiniteGraph (Cypher)
- triple-store model: turtle format, Neptune (SPARQL)
- sematic web, RDF data
- Datalog: older language

Index
- Hash Indexes: e.g. Bitcask(Riak), must fit in memory, range query not efficient
- SSTable: e.g. LevelDB and RocksDB, Sorted-String Table, red-black trees/AVL trees (memtable). LSM-tree (Log-Structured Merge-Tree)
- B-Tree: WAL (write-ahead log to restore B-tree)
- Other Indexes: clustered-index(store the indexed row directly within an index), Multi-column index, Lucene for full-text search
Transaction Processing vs Analytics: OLTP vs OLAP, Data Warehousing, Star and Snowflakes
Column-Orientied Storage
- Column Compression: bitmap encoding (column family are stored together, Bigtable model is still row-oriented)
- Sort Order in Column Storage, having multiple sort orders in a column-oriented store is similar to having multiple secondary indexes in a row-oriented store
- Writing
- Aggregation: data cube is a common special case of materialized view

Encoding (serialization or marshalling): the translation from the in-memory representation to a byte sequence
Data Encoding Formets: JSON, XML, Protocol Buffers, Thrift, Avro
Textual formats such as JSON, XML and CSV
Apache Thrift and Protocol Buffers are binary encoding libraries
Avro: encoding doesn’t contain fields/datatypes, need writer’s schema and reader’s schema. Friedlier to dynamically generated schemas
Modes of Dataflow
- Via databases
- Via service calls: two popular approached to web services (REST, SOAP and RPC)
  - RPC modes tries to make a request to a remote network service look the same as calling a function/method in the programming language, within the same process. Current main focus of RPC framwork is on requests between services owned by the same organization.
- Via asyschronous message passing: between RPC and databases by using a message broker.
  - The actor model is a programming model for concurrency in a single process. Distributed actor framework integrates a message broker and the actor programming model.