Reading list · 23 papers

Reading.

Papers I keep coming back to for doing AI over 1 TB–1 PB datasets — from the systems that store and move the data, to training across thousands of accelerators, to feeding the pipeline and choosing what data is even worth keeping.

Foundations — storing & processing data at scale

The substrate every later system assumes.

  1. The Google File System
    Ghemawat, Gobioff, Leung · SOSP 2003

    Commodity-hardware, fault-tolerant storage for data-intensive workloads.

  2. MapReduce: Simplified Data Processing on Large Clusters
    Dean, Ghemawat · OSDI 2004

    The programming model that made petabyte batch processing routine.

  3. Resilient Distributed Datasets
    Zaharia et al. · NSDI 2012

    Spark's in-memory abstraction — the basis of most large-scale ETL feeding ML.

  4. Dremel: Interactive Analysis of Web-Scale Datasets
    Melnik et al. · VLDB 2010

    Columnar + tree execution over trillion-row tables; ancestor of BigQuery and Parquet.

  5. Spanner: Google's Globally-Distributed Database
    Corbett et al. · OSDI 2012

    Externally consistent transactions across datacenters, built on the TrueTime API's bounded clock uncertainty.
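
To make the MapReduce entry above concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word count as the job. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own illustration, not an API from the paper, and everything the real system adds (partitioning across machines, fault tolerance, spilling to disk) is left out.

```python
from collections import defaultdict


def map_phase(docs, mapper):
    """Apply the user's mapper to every input record, yielding (key, value) pairs."""
    for doc in docs:
        yield from mapper(doc)


def shuffle(pairs):
    """Group all values by key. A real cluster does this across the network."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the user's reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(doc):
    """Emit (word, 1) for every whitespace-separated word."""
    for word in doc.lower().split():
        yield word, 1


def count_reducer(key, values):
    """Sum the counts emitted for one word."""
    return sum(values)


def word_count(docs):
    return reduce_phase(shuffle(map_phase(docs, word_mapper)), count_reducer)
```

Calling `word_count(["a b a", "b c"])` returns a dict with counts 2, 2, and 1 for `a`, `b`, and `c`. The user only writes the mapper and the reducer; the framework owns everything in between, which is the point of the model.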

The lakehouse & streaming — moving PB into ML

Where analytics data and training data finally converge.

  1. Lakehouse: A New Generation of Open Platforms
    Armbrust, Ghodsi, Xin, Zaharia · CIDR 2021

    The architecture argument for unifying warehousing and ML on open formats.

  2. Delta Lake: ACID Table Storage over Cloud Object Stores
    Armbrust et al. · VLDB 2020

    ACID transactions and time travel over object storage.

  3. The Dataflow Model
    Akidau et al. · VLDB 2015

    The what / where / when / how framework behind Beam and Google Cloud Dataflow, and a strong influence on Flink's streaming model.

  4. Apache Flink: Stream and Batch Processing in a Single Engine
    Carbone et al. · IEEE Data Engineering Bulletin 2015

    Treats batch as a special case of streaming.
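
A rough sketch of the first two Dataflow Model questions, in plain Python rather than the Beam or Flink APIs: the "what" is a sum, and the "where" is fixed 60-second windows in event time. The helper names and the sample events are hypothetical. The "when" (watermarks and triggers) and "how" (accumulating versus discarding refinements) are deliberately left out.

```python
from collections import defaultdict


def window_start(event_time_s, size_s):
    """Start of the fixed (tumbling) window that contains this event time."""
    return event_time_s - (event_time_s % size_s)


def sum_by_event_time_window(events, size_s):
    """Sum values per fixed window, keyed by event time, not arrival order.

    events: iterable of (event_time_s, value) pairs in the order they arrive.
    """
    sums = defaultdict(int)
    for event_time_s, value in events:
        sums[window_start(event_time_s, size_s)] += value
    return dict(sums)


# Hypothetical arrivals: events with timestamps 12 and 59 show up late,
# after events that belong to the next window.
arrivals = [(3, 1), (65, 1), (12, 1), (61, 1), (59, 1)]
result = sum_by_event_time_window(arrivals, 60)
```

Here `result` maps window start 0 to a sum of 3 and window start 60 to a sum of 2, even though the events arrived out of order. Deciding when it is safe to emit the sum for window 0 is exactly the problem the "when" part of the model addresses.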

Distributed systems for training AI on big data

How models scale across thousands of accelerators.

  1. Ray: A Distributed Framework for Emerging AI Applications
    Moritz et al. · OSDI 2018

    A unified task and actor model that now underpins much large-scale AI training and serving.

  2. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
    Rajbhandari et al. · SC20

    Shards optimizer state, gradients, and parameters across data-parallel devices so the trainable model size grows with device count.

  3. Efficient Large-Scale LM Training on GPU Clusters (Megatron-LM)
    Narayanan et al. · SC21

    Tensor, pipeline, and data parallelism combined.

  4. Pathways: Asynchronous Distributed Dataflow for ML
    Barham et al. · MLSys 2022

    Single-controller orchestration over thousands of accelerators.
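
The ZeRO entry is easier to appreciate with the memory arithmetic written out. This sketch assumes mixed-precision Adam as accounted for in the paper: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and K = 12 for the fp32 optimizer state. It ignores activations, temporary buffers, and fragmentation, and the function name is my own.

```python
def zero_bytes_per_device(params, devices, stage, k=12):
    """Approximate model-state bytes per device under ZeRO.

    stage 0: plain data parallelism, everything replicated
    stage 1: optimizer state sharded across devices
    stage 2: optimizer state and gradients sharded
    stage 3: optimizer state, gradients, and parameters sharded
    k is the optimizer-state bytes per parameter (12 assumes mixed-precision Adam).
    """
    weights = 2 * params  # fp16 parameters
    grads = 2 * params    # fp16 gradients
    opt = k * params      # fp32 master weights, momentum, variance
    if stage >= 1:
        opt /= devices
    if stage >= 2:
        grads /= devices
    if stage >= 3:
        weights /= devices
    return weights + grads + opt


# A 7.5B-parameter model on 64 devices, in GB (1e9 bytes).
for stage in range(4):
    print(stage, zero_bytes_per_device(7.5e9, 64, stage) / 1e9)
```

For that configuration the four stages come out to roughly 120 GB, 31.4 GB, 16.6 GB, and 1.9 GB per device. Only the last stage divides the full 16 bytes per parameter by the device count, which is why model size can grow with the cluster.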

Feeding the accelerators — the real 1 TB–1 PB bottleneck

Often the actual limit on training throughput.

  1. tf.data: A Machine Learning Data Processing Framework
    Murray et al. · VLDB 2021

    Input pipelines as the thing that gates end-to-end training time.

  2. Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
    Zhao et al., Meta · ISCA 2022

    Serving training data at tens of TB/s; ingestion as the dominant constraint.

  3. Data Validation for Machine Learning
    Breck et al., Google/TFX · MLSys 2019

    Catching data errors across petabyte pipelines before they poison models.
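
A back-of-envelope sketch of why ingestion dominates at this scale. The numbers below (10 GB/s of aggregate read bandwidth, 50,000 samples per second, 200 KB per sample) are hypothetical and chosen only to show the arithmetic; they are not taken from any of the papers above. Units are decimal.

```python
TB = 10**12
PB = 10**15
GB = 10**9


def epoch_hours(dataset_bytes, read_bytes_per_s):
    """Hours to stream the whole dataset once at a sustained read rate."""
    return dataset_bytes / read_bytes_per_s / 3600


def required_read_rate(samples_per_s, bytes_per_sample):
    """Bytes per second storage must sustain so the accelerators never starve."""
    return samples_per_s * bytes_per_sample


# Hypothetical cluster: 10 GB/s aggregate read bandwidth.
print(epoch_hours(1 * TB, 10 * GB))  # one pass over 1 TB, in hours
print(epoch_hours(1 * PB, 10 * GB))  # one pass over 1 PB, in hours

# Hypothetical job: 50,000 samples/s at 200 KB each.
print(required_read_rate(50_000, 200_000) / GB)  # GB/s needed
```

At 10 GB/s a pass over 1 TB takes under two minutes, while a pass over 1 PB takes about 28 hours, before any decoding or augmentation. The same hypothetical job needs the full 10 GB/s just to keep up, so any preprocessing stall shows up directly as idle accelerators.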

How much data, and which data

The economics of scale and the shift to data-centric AI.

  1. Scaling Laws for Neural Language Models
    Kaplan et al. · 2020

    Loss as a power law in model size, dataset size, and compute.

  2. Training Compute-Optimal LLMs (Chinchilla)
    Hoffmann et al. · 2022

    Scale parameters and training tokens in roughly equal proportion; most large models of the time were under-trained for their size.

  3. DataComp: In Search of the Next Generation of Multimodal Datasets
    Gadre et al. · NeurIPS 2023

    Treats the dataset itself as the thing you optimize.

  4. Scaling Laws for Data Filtering
    Goyal et al. · CVPR 2024

    Data curation cannot be compute-agnostic: the best filtering threshold depends on how much compute you plan to spend.
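
A rough way to turn the scaling-law entries into numbers. This sketch leans on two approximations, both assumptions rather than exact results: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and the roughly 20 tokens per parameter commonly read off the Chinchilla results. The paper's fitted estimates are more nuanced, and `compute_optimal` is a hypothetical helper name.

```python
import math


def compute_optimal(flops, tokens_per_param=20.0):
    """Split a FLOP budget into parameters and tokens.

    Assumes C = 6 * N * D and D = tokens_per_param * N, so
    N = sqrt(C / (6 * tokens_per_param)).
    Both relations are rules of thumb, not exact laws.
    """
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# Budget of 5.88e23 FLOPs, i.e. 6 * 70e9 parameters * 1.4e12 tokens.
n, d = compute_optimal(5.88e23)
print(f"{n:.3g} parameters, {d:.3g} tokens")
```

With that budget the split comes out to about 70 billion parameters and 1.4 trillion tokens, which is the Chinchilla configuration. Under these assumptions, quadrupling compute doubles both the model and the dataset.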

Retrieval over billions of vectors

Serving RAG and similarity search at scale.

  1. Billion-scale similarity search with GPUs (FAISS)
    Johnson, Douze, Jégou · 2017

    The workhorse vector-search library.

  2. DiskANN: Billion-point Nearest Neighbor Search on a Single Node
    Subramanya et al. · NeurIPS 2019

    Billion-vector approximate nearest neighbor search from SSD.

  3. SPANN: Highly-efficient Billion-scale ANN
    Chen et al. · NeurIPS 2021

    A memory/disk hybrid index for huge vector sets.

Suggestions welcome — this list grows as the work does.