Reading.
Papers I keep coming back to for doing AI over 1 TB–1 PB datasets — from the systems that store and move the data, to training across thousands of accelerators, to feeding the pipeline and choosing what data is even worth keeping.
Foundations — storing & processing data at scale
The substrate every later system assumes.
- The Google File System
Commodity-hardware, fault-tolerant storage for data-intensive workloads.
- MapReduce: Simplified Data Processing on Large Clusters
The programming model that made petabyte batch processing routine.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Spark's in-memory abstraction — the basis of most large-scale ETL feeding ML.
- Dremel: Interactive Analysis of Web-Scale Datasets
Columnar storage plus a multi-level serving tree for queries over trillion-row tables; the basis of BigQuery and the inspiration for Parquet's nested encoding.
- Spanner: Google's Globally-Distributed Database
Externally consistent distributed transactions at global scale, built on the TrueTime API.
The lakehouse & streaming — moving PB into ML
Where analytics data and training data finally converge.
- Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
The architecture argument for unifying warehousing and ML on open formats.
- Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores
ACID transactions and time travel over object storage.
- The Dataflow Model
The what / where / when / how framework behind Apache Beam and Google Cloud Dataflow, and an influence on Flink's event-time streaming.
- Apache Flink: Stream and Batch Processing in a Single Engine
Treats batch as a special case of streaming.
Distributed systems for training AI on big data
How models scale across thousands of accelerators.
- Ray: A Distributed Framework for Emerging AI Applications
Task + actor model now underpinning a lot of large-scale AI.
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Shards optimizer, gradient, and parameter state across data-parallel devices, so the model size you can train grows with the number of devices.
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Tensor, pipeline, and data parallelism combined.
- Pathways: Asynchronous Distributed Dataflow for ML
Single-controller orchestration over thousands of accelerators.
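A rough sketch of the memory arithmetic behind the ZeRO entry above, assuming mixed-precision Adam as in the paper's running example: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. The function name and signature are hypothetical, and activations, buffers, and fragmentation are ignored.

```python
def zero_bytes_per_device(num_params, num_devices, stage):
    """Approximate model-state bytes held by each data-parallel device.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer state sharded
    stage 2: optimizer state + gradients sharded
    stage 3: optimizer state + gradients + parameters sharded
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")

    # Assumed bytes per parameter under mixed-precision Adam.
    param_bytes, grad_bytes, optim_bytes = 2, 2, 12

    # Each stage divides one more component across the devices.
    optim = optim_bytes / num_devices if stage >= 1 else optim_bytes
    grad = grad_bytes / num_devices if stage >= 2 else grad_bytes
    param = param_bytes / num_devices if stage >= 3 else param_bytes

    return num_params * (param + grad + optim)


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_bytes_per_device(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per device")
```

For a 7.5B-parameter model on 64 devices this gives about 120 GB, 31.4 GB, 16.6 GB, and 1.9 GB for stages 0 through 3, which is why only the fully sharded stage lets model size keep growing as devices are added.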
Feeding the accelerators — the real 1 TB–1 PB bottleneck
Often the actual limit on training throughput.
- tf.data: A Machine Learning Data Processing Framework
Input pipelines as the thing that gates end-to-end training time.
- Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
Serving training data at tens of TB/s; ingestion as the dominant constraint.
- Data Validation for Machine Learning
Catching data errors across petabyte pipelines before they poison models.
How much data, and which data
The economics of scale and the shift to data-centric AI.
- Scaling Laws for Neural Language Models
Loss as a power law in model size, dataset size, and compute.
- Training Compute-Optimal Large Language Models (Chinchilla)
Scale data and parameters together; the large models of the time were under-trained for their compute budget.
- DataComp: In Search of the Next Generation of Multimodal Datasets
Treats the dataset itself as the thing you optimize.
- Scaling Laws for Data Filtering
Data curation cannot be compute-agnostic: how aggressively to filter depends on how much compute you will train with.
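For orientation, the two scaling results at the top of this section can be written down in a few lines. This is a sketch of their general form only: N is the parameter count, D the dataset size in tokens, C the training compute, and the constants (N_c, D_c, C_c and the exponents) are fitted empirically in each paper and not reproduced here.

```latex
% Scaling Laws for Neural Language Models: when the other factors are not
% the bottleneck, loss falls as a power law in each one.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}

% Chinchilla: for a fixed compute budget C, the loss-minimizing model size
% and token count grow at roughly the same rate.
N_{\mathrm{opt}}(C) \propto C^{a}, \qquad
D_{\mathrm{opt}}(C) \propto C^{b}, \qquad a \approx b \approx 0.5
```

The second line is the "scale data and parameters together" point: doubling compute calls for growing both by roughly the same factor, not putting everything into parameters.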
Retrieval over billions of vectors
Serving RAG and similarity search at scale.
- Billion-scale similarity search with GPUs (FAISS)
The workhorse vector-search library.
- DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node
Billion-vector approximate nearest neighbor search from SSD.
- SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search
A memory/disk hybrid index for huge vector sets.
Suggestions welcome — this list grows as the work does.