Reading list · 23 papers

Reading.

Papers I keep coming back to for doing AI over 1 TB–1 PB datasets: from the systems that store and move the data, to training across thousands of accelerators, to feeding the pipeline and choosing which data is even worth keeping.

Foundations — storing & processing data at scale

The substrate every later system assumes.

The Google File System
Ghemawat, Gobioff, Leung · SOSP 2003

Commodity-hardware, fault-tolerant storage for data-intensive workloads.

MapReduce: Simplified Data Processing on Large Clusters
Dean, Ghemawat · OSDI 2004

The programming model that made petabyte batch processing routine.

Resilient Distributed Datasets
Zaharia et al. · NSDI 2012

Spark's in-memory abstraction, and the basis of much of the large-scale ETL that feeds ML.

Dremel: Interactive Analysis of Web-Scale Datasets
Melnik et al. · VLDB 2010

Columnar storage plus tree-structured query execution over trillion-row tables; ancestor of BigQuery and Parquet.

Spanner: Google's Globally-Distributed Database
Corbett et al. · OSDI 2012

Externally consistent distributed transactions at global scale, built on the TrueTime API's bounded clock uncertainty.
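
For anyone who has not met the MapReduce model listed above, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) on a word count. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are mine, chosen for illustration, not the paper's API; a real system runs each phase across many machines and handles failures, which this sketch ignores.

```python
from collections import defaultdict

def map_fn(_doc_id, text):
    """Map phase: emit (word, 1) for every word in one document."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce phase: sum all counts emitted for one key."""
    return word, sum(counts)

def run_mapreduce(inputs, mapper, reducer):
    """Run map, shuffle, and reduce in one process (illustrative only)."""
    # Shuffle: group every intermediate value by its key.
    groups = defaultdict(list)
    for key, value in inputs.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Each key is reduced independently, which is what lets this phase parallelize.
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

docs = {"d1": "the quick brown fox", "d2": "the lazy dog and the fox"}
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)
```

The user only writes the two small functions; partitioning, grouping, and scheduling belong to the framework.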

The lakehouse & streaming — moving PB into ML

Where analytics data and training data finally converge.

Lakehouse: A New Generation of Open Platforms
Armbrust, Ghodsi, Xin, Zaharia · CIDR 2021

The architecture argument for unifying warehousing and ML on open formats.

Delta Lake: ACID Table Storage over Cloud Object Stores
Armbrust et al. · VLDB 2020

ACID transactions and time travel over object storage, via a transaction log kept alongside the data.

The Dataflow Model
Akidau et al. · VLDB 2015

The what / where / when / how framework for out-of-order, event-time stream processing, behind Beam, Flink, and Kinesis-style streaming.

Apache Flink: Stream and Batch Processing in a Single Engine
Carbone et al. · 2015

Treats batch as a special case of streaming.
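
The "where in event time" question from the Dataflow Model is easiest to see with fixed windows. The sketch below assigns out-of-order events to 60-second event-time windows and sums per window. The function names and the window size are assumptions for illustration, and the sketch leaves out triggers, watermarks, and late-data handling, which are the "when" and "how" parts of the model.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed window size, for illustration only

def window_start(event_time, size=WINDOW_SECONDS):
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % size)

def sum_by_window(events):
    """Sum values per event-time window.

    events: iterable of (event_time_seconds, value) in arrival order.
    """
    totals = defaultdict(int)
    for event_time, value in events:
        totals[window_start(event_time)] += value
    return dict(totals)

# Arrival order differs from event-time order; the totals do not depend on it.
arrivals = [(65, 2), (10, 5), (119, 1), (59, 3), (120, 7)]
totals = sum_by_window(arrivals)
print(totals)
```

Grouping by the timestamp carried in the event, rather than by arrival time, is what makes the result reproducible when data shows up late or out of order.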

Distributed systems for training AI on big data

How models scale across thousands of accelerators.

Ray: A Distributed Framework for Emerging AI Applications
Moritz et al. · OSDI 2018

A task + actor model that now underpins a lot of large-scale AI.

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Rajbhandari et al. · SC20

Shards optimizer, gradient, and parameter state across data-parallel devices, so the trainable model size scales with the device count.

Efficient Large-Scale LM Training on GPU Clusters (Megatron-LM)
Narayanan et al. · SC21

Tensor, pipeline, and data parallelism combined.

Pathways: Asynchronous Distributed Dataflow for ML
Barham et al. · 2022

Single-controller orchestration over thousands of accelerators.
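
The ZeRO entry above comes down to simple arithmetic, sketched here. It assumes the mixed-precision Adam accounting as I recall it from the paper: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. The function name is mine, and activations, buffers, and fragmentation are ignored.

```python
def zero_model_state_gb(params, devices, stage):
    """Per-device model-state memory in GB (1e9 bytes), excluding activations.

    stage 0 = plain data parallelism; stages 1-3 progressively shard
    optimizer state, then gradients, then weights across devices.
    """
    weights = 2 * params   # fp16 weights
    grads = 2 * params     # fp16 gradients
    optim = 12 * params    # fp32 weights copy, momentum, variance
    if stage >= 1:
        optim /= devices
    if stage >= 2:
        grads /= devices
    if stage >= 3:
        weights /= devices
    return (weights + grads + optim) / 1e9

# 7.5B parameters over 64 data-parallel devices.
for stage in range(4):
    print(stage, round(zero_model_state_gb(7.5e9, 64, stage), 1))
```

Under these assumptions the per-device footprint falls from 120 GB to roughly 31.4, 16.6, and 1.9 GB as each stage is enabled.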

Feeding the accelerators — the real 1 TB–1 PB bottleneck

Often the actual limit on training throughput.

tf.data: A Machine Learning Data Processing Framework
Murray et al. · VLDB 2021

Input pipelines as the thing that gates end-to-end training time.

Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
Zhao et al., Meta · ISCA 2022

Serving training data at tens of TB/s; ingestion as the dominant constraint.

Data Validation for Machine Learning
Breck et al., Google/TFX · MLSys 2019

Catching data errors across petabyte pipelines before they poison models.
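
One reason input pipelines gate training time is that loading and compute run back to back unless something overlaps them. The sketch below shows the basic fix, prefetching into a bounded buffer from a background thread, which is the idea tf.data exposes as its `prefetch` transformation. This is a toy with names of my own choosing, not tf.data's implementation. It does not propagate loader exceptions, and a consumer that stops early leaves the daemon thread blocked.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream

def prefetch(batches, buffer_size=2):
    """Yield items from `batches`, loading up to buffer_size ahead in a thread."""
    buf = queue.Queue(maxsize=buffer_size)

    def producer():
        for batch in batches:
            buf.put(batch)  # blocks while the buffer is full
        buf.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()  # blocks only if the loader has fallen behind
        if item is _DONE:
            return
        yield item

def load_batches(n):
    """Hypothetical stand-in for slow I/O and decoding."""
    for i in range(n):
        yield [i, i + 1]

# The "training step" here is just a sum over each batch.
seen = [sum(batch) for batch in prefetch(load_batches(4))]
print(seen)
```

If the consumer ever waits on `buf.get()`, the loader is the bottleneck, and adding accelerators will not help.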

How much data, and which data

The economics of scale and the shift to data-centric AI.

Scaling Laws for Neural Language Models
Kaplan et al. · 2020

Loss as a power law in model size, dataset size, and compute.

Training Compute-Optimal LLMs (Chinchilla)
Hoffmann et al. · 2022

For a fixed compute budget, scale parameters and training tokens in roughly equal proportion; most large models of the time were under-trained on data.

DataComp: In Search of the Next Generation of Multimodal Datasets
Gadre et al. · NeurIPS 2023

Treats the dataset itself as the thing you optimize.

Scaling Laws for Data Filtering
Goyal et al. · CVPR 2024

The right filtering threshold depends on the compute budget: data curation cannot be compute-agnostic.
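
A back-of-the-envelope version of the Chinchilla result, as a sketch under two assumptions that are approximations rather than the paper's fitted laws: training compute is about 6 FLOPs per parameter per token (C ≈ 6·N·D), and the compute-optimal ratio is about 20 tokens per parameter. The constant and function names are mine.

```python
import math

TOKENS_PER_PARAM = 20.0       # assumed rule-of-thumb ratio
FLOPS_PER_PARAM_TOKEN = 6.0   # assumed C ≈ 6 * N * D approximation

def compute_optimal(flops_budget):
    """Return (params, tokens) that spend the budget at the assumed ratio."""
    # Solve C = 6 * N * (20 * N) for N.
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params

# A budget equal to training a 70B-parameter model on 1.4T tokens.
budget = 6.0 * 70e9 * 1.4e12
params, tokens = compute_optimal(budget)
print(f"{params:.3g} params, {tokens:.3g} tokens")
```

Because parameters and tokens both grow with the square root of compute here, a 100x larger budget buys about 10x more of each.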

Retrieval over billions of vectors

Serving RAG and similarity search at scale.

Billion-scale similarity search with GPUs (FAISS)
Johnson, Douze, Jégou · 2017

The workhorse vector-search library.

DiskANN: Billion-point Nearest Neighbor Search on a Single Node
Subramanya et al. · NeurIPS 2019

Billion-vector approximate nearest neighbor search from SSD.

SPANN: Highly-efficient Billion-scale ANN
Chen et al. · NeurIPS 2021

A memory/disk hybrid index for huge vector sets: centroids stay in memory, posting lists live on disk.
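
The inverted-file idea behind several of these indexes fits in a few lines: partition the vectors by nearest centroid, then at query time scan only the partitions closest to the query. The sketch below is a pure-Python toy with hand-picked centroids and function names of my own. It is not the FAISS, DiskANN, or SPANN API, and DiskANN in particular is graph-based rather than partition-based.

```python
import math

def build_index(vectors, centroids):
    """Assign each vector id to the posting list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: math.dist(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists

def search(query, vectors, centroids, lists, nprobe=1):
    """Scan only the nprobe posting lists whose centroids are closest to query."""
    order = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))
    candidates = [vid for c in order[:nprobe] for vid in lists[c]]
    return min(candidates, key=lambda vid: math.dist(query, vectors[vid]))

vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.8, 0.1)]
# Hand-picked for the example; normally learned with k-means.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
lists = build_index(vectors, centroids)
best = search((4.9, 5.0), vectors, centroids, lists, nprobe=1)
print(lists, best)
```

Raising `nprobe` trades speed for recall: more lists are scanned, so a true neighbor sitting just across a partition boundary is less likely to be missed.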

Suggestions welcome — this list grows as the work does.