Consolidated list of papers on distributed database systems and parallel computing from Google Research
The MapReduce programming model fundamentally changed the data processing world. The map-and-reduce pattern itself is older, but Google’s seminal MapReduce paper and the open-source Hadoop distribution enabled programmers to analyze large data sets without expertise in distributed servers and cluster management.
Google has published many such papers over the years.
Some of the systems they describe run at planet scale internally, and some are available to external customers via the Google Cloud Platform.
The following is a list of papers on distributed systems and parallel computing published by Google. They offer a wealth of information for anyone working with the modern cloud, distributed databases, and distributed systems.
- Google File System
– 2021: Colossus is our cluster-level file system, successor to the Google File System (GFS)
– 2003: Google File System
- MapReduce
– 2004: MapReduce: Simplified Data Processing on Large Clusters
– HTML Slides
- Distributed databases & Query Engines
– 2021: Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google
– 2020: Monarch: Google’s Planet-Scale In-Memory Time Series Database
– 2019: Procella: Unifying serving and analytical data at YouTube
– 2014: Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing
– 2011: Megastore: Providing Scalable, Highly Available Storage for Interactive Services
– 2009: Pregel: A System for Large-Scale Graph Processing
- Dremel (BigQuery)
– 2020: Dremel: A Decade of Interactive SQL Analysis at Web Scale (how these ideas became the foundation for BigQuery)
– 2016: Inside Capacitor, BigQuery’s next-generation columnar storage format
– 2010: Dremel: Interactive Analysis of Web-Scale Datasets
- Spanner
– 2017: Spanner: Becoming a SQL System
– 2017: Spanner, TrueTime and the CAP Theorem
– 2012: Spanner: Google’s Globally-Distributed Database
- Bigtable
– 2006: Bigtable
- F1
– 2018: F1 Query: Declarative Querying at Scale
– 2013: F1: A Distributed SQL Database That Scales
– 2013: Online, Asynchronous Schema Change in F1
– 2012: F1 - The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business
- Cluster Services
– 2020: Borg: The Next Generation
– 2016: Borg, Omega, and Kubernetes
– 2015: Large-scale cluster management at Google with Borg
– 2013: Omega: flexible, scalable schedulers for large compute clusters
- Stream Processing
– 2015: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
– 2013: MillWheel: Fault-Tolerant Stream Processing at Internet Scale
– 2013: Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams
– 2010: FlumeJava: Easy, Efficient Data-Parallel Pipelines
Disclaimer: All the opinions expressed are personal independent thoughts and not to be attributed to my current or previous employers.