Papers on Database Management Systems
This post summarizes the key dependencies and evolution among several influential distributed systems papers: GFS, MapReduce, DryadLINQ, CIEL, and Spark RDD. See assets/papers_to_read/.
1. The Google File System (GFS, 2003)
- Role: Foundational distributed storage layer that powered MapReduce.
- Dependency: None.
- Influenced: MapReduce, the Hadoop Distributed File System (HDFS), and Spark’s support for HDFS.
2. MapReduce: Simplified Data Processing on Large Clusters (2004)
- Depends on: GFS (for reading/writing input and output data).
- Influenced:
  - Hadoop MapReduce
  - Spark RDD (as a reaction and evolution)
  - DryadLINQ and CIEL (as generalizations of the dataflow idea)
3. DryadLINQ (2008)
- Depends on:
  - MapReduce (as inspiration)
  - LINQ query expression model
- Adds: DAG-based dataflows and general-purpose parallelism beyond map/reduce.
- Influenced: CIEL and, in part, Spark’s DAG scheduler.
4. CIEL (2011)
- Depends on:
  - Dryad’s dataflow DAG model
- Adds:
  - Dynamic task generation (tasks can spawn new tasks at runtime)
- Influenced: Dynamic execution models in later systems, including Spark’s adaptive planning.
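Dynamic task generation is easiest to see in a toy scheduler. The sketch below is not CIEL's API: `run`, `make_sum_task`, and the `spawn`/`emit` callbacks are hypothetical names chosen for illustration. It only shows the core idea that the set of tasks is not fixed up front, because a running task can enqueue more tasks.

```python
from collections import deque


def run(root_task):
    """Run tasks from a queue; a running task may enqueue further tasks."""
    queue = deque([root_task])
    results = []
    executed = 0
    while queue:
        task = queue.popleft()
        executed += 1
        # Each task receives a `spawn` callback, so the task graph grows at runtime.
        task(spawn=queue.append, emit=results.append)
    return results, executed


def make_sum_task(numbers, chunk_size=2):
    """Hypothetical task: sum a small list directly, otherwise split into two new tasks."""
    def task(spawn, emit):
        if len(numbers) <= chunk_size:
            emit(sum(numbers))  # small enough: produce a partial result
        else:
            mid = len(numbers) // 2
            spawn(make_sum_task(numbers[:mid], chunk_size))
            spawn(make_sum_task(numbers[mid:], chunk_size))
    return task


partials, executed = run(make_sum_task(list(range(10))))
total = sum(partials)
print(total, executed)
```

Only one task exists when `run` starts; the rest are created while the job is executing, which a statically planned DAG cannot express.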
5. Resilient Distributed Datasets (Spark RDD, 2012)
- Depends on:
  - MapReduce’s fault tolerance model
- Adds:
  - Lineage-based fault recovery
  - In-memory computation and coarse-grained transformations
- Influenced: Spark’s later APIs (DataFrames, Datasets) and subsequent dataflow engines such as Flink, although Flink recovers from checkpoints rather than from lineage.
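Lineage-based recovery can be sketched in a few lines. `MiniRDD` below is a hypothetical toy class, not Spark's implementation or API; it assumes the source partitions live in durable storage and that every transformation is coarse-grained (applied to a whole partition). Each derived dataset remembers its parent and the function that produced it, so a lost partition is recomputed rather than restored from a replica.

```python
class MiniRDD:
    """Toy dataset that records how it was derived (its lineage)."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent  # lineage: upstream dataset
        self.fn = fn          # lineage: per-partition transformation
        if partitions is not None:
            self.cache = dict(enumerate(partitions))
            self.num_partitions = len(partitions)
        else:
            self.cache = {}   # in-memory results, filled lazily
            self.num_partitions = parent.num_partitions

    def map(self, f):
        return MiniRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return MiniRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute(self, i):
        """Return partition i, rebuilding it from the parent if it is missing."""
        if i not in self.cache:
            if self.parent is None:
                raise RuntimeError("source partition lost and no lineage to rebuild it")
            self.cache[i] = self.fn(self.parent.compute(i))
        return self.cache[i]

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


source = MiniRDD(partitions=[[1, 2, 3], [4, 5, 6]])
squares = source.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

before = evens.collect()

# Simulate a node failure: partition 1 of both derived datasets is gone.
evens.cache.pop(1)
squares.cache.pop(1)

after = evens.collect()  # partition 1 is recomputed from `source` via lineage
print(before, after)
```

Only the lost partition is recomputed; partition 0 is still served from memory, which is the property that makes coarse-grained lineage cheaper than replicating every intermediate result.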
Summary Dependency Graph
GFS (2003)
  ↓
MapReduce (2004)
  ↓              ↘
DryadLINQ (2008) → Spark RDD (2012)
  ↓                     ↑
CIEL (2011) ------------┘
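The same graph can be written down as data, which makes it easy to derive a reading order. This is a small sketch using Python's standard `graphlib` module (Python 3.9+); the `depends_on` mapping is my transcription of the arrows above, with each paper mapped to the papers it builds on.

```python
from graphlib import TopologicalSorter

# Each paper maps to the set of papers it depends on (edges taken from the graph above).
depends_on = {
    "GFS": set(),
    "MapReduce": {"GFS"},
    "DryadLINQ": {"MapReduce"},
    "CIEL": {"DryadLINQ"},
    "Spark RDD": {"MapReduce", "DryadLINQ", "CIEL"},
}

# A topological order lists every paper after all of its dependencies.
reading_order = list(TopologicalSorter(depends_on).static_order())
print(" -> ".join(reading_order))
```

Because the graph contains the chain GFS, MapReduce, DryadLINQ, CIEL, Spark RDD, there is only one valid order, and it matches the numbering of the sections in this post.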