Papers on Database Management Systems

This post summarizes the key dependencies among several influential distributed systems papers (GFS, MapReduce, DryadLINQ, CIEL, and Spark RDD) and how their ideas evolved from one to the next. See assets/papers_to_read/.

1. The Google File System (GFS, 2003)

Role: Foundational distributed storage layer that powered MapReduce.
Dependency: None.
Influenced: MapReduce, the Hadoop Distributed File System (HDFS), and, through HDFS, Spark’s storage support.

2. MapReduce: Simplified Data Processing on Large Clusters (2004)

Depends on: GFS (for reading/writing input and output data).
Influenced:
- Hadoop MapReduce
- Spark RDD (as a reaction and evolution)
- DryadLINQ and CIEL (as generalizations of the dataflow idea)
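
To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word count as the example. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative assumptions and not the paper's API; a real deployment would read input splits from GFS and spread each phase across many machines.

```python
from collections import defaultdict

def map_fn(_doc_id, text):
    """Emit (word, 1) for every word in the document."""
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Sum all counts emitted for one word."""
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            # Shuffle: group intermediate values by key.
            groups[out_key].append(out_value)
    # Reduce phase: one reduce_fn call per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

docs = [("d1", "the quick fox"), ("d2", "the lazy dog the end")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

The user only writes `map_fn` and `reduce_fn`; everything in `run_mapreduce` stands in for what the framework handles (partitioning, grouping, scheduling).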

3. DryadLINQ (2008)

Depends on:
- Dryad (2007), the DAG execution engine that DryadLINQ compiles queries to
- MapReduce (as inspiration)
- LINQ query expression model
Adds: DAG-based dataflows and general-purpose parallelism beyond map/reduce.
Influenced: CIEL and, in part, Spark’s DAG scheduler.
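
The difference between a fixed map/reduce pipeline and a DAG-based dataflow can be sketched as follows. This is a toy, single-process illustration and not Dryad's or DryadLINQ's API: the `dag` layout and `run_dag` helper are hypothetical, and the ordering relies on the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Each vertex is (function, list of upstream vertex names).
# The shape is a diamond, which a single map -> reduce chain cannot express.
dag = {
    "read":   (lambda: [3, 1, 4, 1, 5, 9, 2, 6], []),
    "evens":  (lambda xs: [x for x in xs if x % 2 == 0], ["read"]),
    "odds":   (lambda xs: [x for x in xs if x % 2 == 1], ["read"]),
    "joined": (lambda evens, odds: (sum(evens), sum(odds)), ["evens", "odds"]),
}

def run_dag(dag):
    # Map each vertex to its predecessors, then visit in topological order.
    order = TopologicalSorter({name: deps for name, (_, deps) in dag.items()})
    results = {}
    for name in order.static_order():
        fn, deps = dag[name]
        # Feed each vertex the outputs of its upstream vertices.
        results[name] = fn(*(results[d] for d in deps))
    return results

dag_results = run_dag(dag)
```

Here the whole graph is known before execution starts, so a scheduler can plan it up front.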

4. CIEL (2011)

Depends on:
- Dryad’s dataflow DAG model
Adds:
- Dynamic task generation (tasks can spawn new tasks at runtime)
Influenced: Dynamic execution models in later systems. Spark’s adaptive query execution follows a similar idea of adjusting the plan at runtime.
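
Dynamic task generation can be illustrated with a small worklist scheduler in which a task either returns a value or spawns more tasks. This is only a sketch under simplifying assumptions: `sum_range` and `run_dynamic` are hypothetical names, not CIEL's interface, and a real system would also track which task produces which output, whereas this version simply adds leaf results together.

```python
from collections import deque

def sum_range(lo, hi):
    """Sum the integers in [lo, hi), splitting large ranges at runtime."""
    if hi - lo <= 4:
        return ("value", sum(range(lo, hi)))
    mid = (lo + hi) // 2
    # Data-dependent decision: spawn two child tasks instead of finishing.
    return ("spawn", [(sum_range, (lo, mid)), (sum_range, (mid, hi))])

def run_dynamic(root_fn, root_args):
    queue = deque([(root_fn, root_args)])
    total, executed = 0, 0
    while queue:
        fn, args = queue.popleft()
        executed += 1
        kind, payload = fn(*args)
        if kind == "spawn":
            queue.extend(payload)  # the task graph grows while the job runs
        else:
            total += payload
    return total, executed

total, executed = run_dynamic(sum_range, (0, 16))
```

The scheduler does not know how many tasks the job contains until it has finished, which is the key contrast with a statically planned DAG.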

5. Resilient Distributed Datasets (Spark RDD, 2012)

Depends on:
- MapReduce’s fault tolerance model
Adds:
- Lineage-based fault recovery
- In-memory computation and coarse-grained transformations
Influenced: Spark’s later APIs (DataFrames, Datasets) and other in-memory dataflow engines. Flink is often listed here, though it recovers from checkpoints rather than from lineage.
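
Lineage-based recovery can be sketched with a toy dataset that records how it was derived and recomputes itself when its in-memory copy is lost. The `LineageDataset` class is a hypothetical illustration, not Spark's RDD API; among other simplifications, it has a single partition and always caches results, whereas Spark only keeps data in memory when asked to.

```python
class LineageDataset:
    """A toy dataset that remembers how it was derived instead of replicating data."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # base data (stands in for stable storage)
        self.parent = parent        # upstream dataset in the lineage chain
        self.transform = transform  # coarse-grained function over the whole parent
        self.cache = None           # in-memory copy, may be lost

    def map(self, fn):
        return LineageDataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LineageDataset(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        if self.cache is None:
            # Cache miss (first use or after a failure): recompute from lineage.
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = self.transform(self.parent.collect())
        return self.cache

base = LineageDataset(source=range(10))
squares_of_evens = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
first = squares_of_evens.collect()
squares_of_evens.cache = None  # simulate losing the in-memory partition
recovered = squares_of_evens.collect()
```

Because each transformation applies to the whole dataset, the lineage is just a short chain of functions, which is far cheaper to store than a replica of the data.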

Summary Dependency Graph

GFS (2003)
  ↓
MapReduce (2004)
  ↓               ↘
DryadLINQ (2008)  → Spark RDD (2012)
      ↓                  ↑
     CIEL (2011) --------┘
