The data wall
“We have achieved peak data.” — Ilya Sutskever
The scaling laws assume unlimited training data. The Chinchilla prescription demands roughly 20 tokens per parameter. For a model with 10 trillion parameters, the recipe calls for 200 trillion tokens. Where do these tokens come from?
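Written out, the Chinchilla prescription \(D \approx 20N\) for a hypothetical 10-trillion-parameter model:

\[
D \;\approx\; 20 \times 10^{13} \;=\; 2 \times 10^{14} \;\text{tokens} \;=\; 200 \text{ trillion tokens}.
\]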
The finite supply
Villalobos and colleagues at Epoch AI published the most careful analysis of this question in 2024, presented at ICML. Their estimate: the total stock of publicly available, high-quality text on the internet amounts to roughly 300 trillion tokens. This is not a conservative guess; it includes web pages, digitized books, scientific papers, code repositories, social media, forums, and news archives. The “high-quality” qualifier matters enormously: the FineWeb dataset, one of the most careful web-scraping efforts, discards roughly 85% of raw web text during quality filtering. The actual supply of text that meets the quality threshold for training frontier models is a fraction of the raw total.
To appreciate the scale of the constraint, consider the components:
All books ever printed: approximately 170 million distinct titles. At a rough average of 70,000 words per book, this yields roughly 12 trillion words, or about 16 trillion tokens. All scientific papers ever published: roughly 100 million papers at an average of 5,000 words each, yielding 500 billion words or approximately 650 billion tokens. The entire scientific output of humanity across all disciplines, from the first journal in 1665 to today, would not fill a single training run for a frontier model.
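As a sanity check, here is that arithmetic spelled out, assuming roughly 1.3 tokens per English word (a common rule of thumb; the conversion factor is an assumption, not something this chapter specifies):

```python
TOKENS_PER_WORD = 1.3  # rough rule of thumb for English text (assumption)

# All printed books: ~170 million titles at ~70,000 words each
book_tokens = 170e6 * 70_000 * TOKENS_PER_WORD
# All scientific papers: ~100 million papers at ~5,000 words each
paper_tokens = 100e6 * 5_000 * TOKENS_PER_WORD

print(f"books:  ~{book_tokens / 1e12:.0f} trillion tokens")  # ~15-16 trillion
print(f"papers: ~{paper_tokens / 1e9:.0f} billion tokens")   # ~650 billion
```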
Current frontier models train on roughly 13 to 18 trillion tokens. The data supply is growing, but not fast: new high-quality text is generated at perhaps 2 to 3 trillion tokens per year, while model appetite grows at roughly 2.5x per year. The curves cross. Villalobos and colleagues estimate that the data wall, the point where demand for training data exceeds the supply of high-quality human-generated text, arrives by approximately 2028 at the median estimate.
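A back-of-envelope version of that crossing, taking demand at roughly 15 trillion tokens per training run today, growing 2.5x per year against a fixed 300-trillion-token stock (and ignoring the comparatively small annual additions):

\[
15 \times 2.5^{\,t} \;\ge\; 300
\quad\Longrightarrow\quad
t \;\ge\; \frac{\log 20}{\log 2.5} \;\approx\; 3.3 \text{ years},
\]

which, counting from 2024, lands around 2027-2028 and is consistent with the Epoch AI median.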
The implications for the scaling laws are direct. The power law \(L(D) = a_D \cdot D^{-\alpha_D}\) keeps delivering gains only while more data is available to train on. When the supply is exhausted, the curve hits a ceiling. Additional compute or parameters cannot compensate: the Chinchilla result demonstrated that, at a fixed compute budget, models with too many parameters for the data they see (undertrained models) perform worse, not better, than properly balanced ones. At the data wall, making models bigger actively degrades performance.
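A minimal sketch of what the ceiling means for the loss curve, using illustrative coefficients loosely in the spirit of published Chinchilla-style fits (the values of \(a_D\) and \(\alpha_D\) below are assumptions for illustration, not figures from this chapter):

```python
# Data-scaling term L(D) = a_D * D^(-alpha_D), with assumed coefficients.
a_D, alpha_D = 410.0, 0.28  # illustrative values (assumption)

def data_loss(tokens: float) -> float:
    """Loss contribution that only more training data can reduce."""
    return a_D * tokens ** (-alpha_D)

for d in (1e12, 1e13, 1e14, 3e14):  # 1T, 10T, 100T, 300T tokens
    print(f"D = {d:.0e} tokens -> L(D) = {data_loss(d):.3f}")

# With D capped at the ~300T-token stock, this term can never fall below
# data_loss(3e14); adding parameters or compute does not change that floor.
```

The particular numbers do not matter; the shape does. Each further drop in this term requires multiplying \(D\), and \(D\) stops multiplying at the wall.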
The multimodal extension
“Multimodal data extends the wall,” the argument goes. Adding images, video, and audio to the training data increases the total supply beyond text alone. Epoch AI estimates this provides roughly a 3x multiplier in effective tokens. Video is particularly data-rich: a single hour of video at modest resolution contains more raw information than a large book. But three considerations limit the impact.
First, a 3x multiplier against a data wall measured in hundreds of trillions of tokens delays the wall by one to two years, not one to two decades (a back-of-envelope calculation follows the third consideration below).
Second, multimodal data is still passively observed: the model watches video; it does not interact with the physical world that the video depicts. The gap between observed and embodied experience, quantified in Chapter 3, is not closed by adding more observation.
Third, the scaling laws for multimodal models have not been established with the same rigor as for text. It is an assumption, not an empirical finding, that vision-language scaling follows the same power law.
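The first consideration is just growth arithmetic: with demand multiplying by roughly 2.5x per year, a one-time 3x increase in effective supply buys

\[
\frac{\log 3}{\log 2.5} \;\approx\; 1.2 \text{ years}
\]

of additional headroom before demand catches up again.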
Code: the special case
Code repositories represent a massive corpus of structured, high-quality text. GitHub alone hosts over 300 million repositories, containing trillions of tokens of code across hundreds of programming languages. Unlike natural language, code has formal semantics: it must compile, it must run, and its behavior is (in principle) verifiable. This makes code an attractive training target.
Models trained on code (GitHub Copilot, Code Llama, GPT-4 with code capabilities) show impressive performance on standard programming tasks. They autocomplete functions, translate between languages, fix common bugs, and generate boilerplate with high accuracy. Code generation has become one of the most economically valuable applications of language models.
But code faces the same data wall as text, with additional constraints:
Finite supply. While GitHub grows daily, the growth rate is linear or sublinear, not exponential. New code is generated at perhaps 100 billion to 1 trillion tokens per year (estimating from public GitHub commits). Model appetite grows at 2.5x per year. The curves cross. By 2026-2027, model training will exhaust the supply of high-quality public code.
Quality degradation. Not all code is equally valuable for training. Code repositories contain bugs, deprecated patterns, security vulnerabilities, copy-pasted boilerplate, and abandoned projects with poor practices. The signal-to-noise ratio is lower than for curated text like books or scientific papers. Aggressive quality filtering discards perhaps 50-70% of raw code, reducing the effective supply.
Copyright and licensing. Much valuable code is proprietary or restrictively licensed. Training on copyrighted code without permission has triggered lawsuits (GitHub Copilot, Stable Diffusion, and others). Even if legal barriers are overcome, proprietary code represents information that models cannot access. The public code commons is smaller than the total code produced.
AI-generated code pollution. As code generation tools become widely adopted, repositories increasingly contain AI-generated code. This creates the same ouroboros problem as text: models trained on AI-generated code ingest the biases and limitations of previous models. Stack Overflow already reports a significant fraction of answers are AI-generated. Within a few years, distinguishing human-written code from AI-generated code in public repositories may become difficult.
Diminishing returns from code. While code teaches models formal reasoning, syntax, and algorithm implementation, it does not address the broader gaps documented in earlier chapters. Code does not provide embodied grounding, causal understanding, or the sensory bandwidth of physical experience. A model trained on all the code in the world will be an excellent code generator but no closer to general intelligence.
Code extends the data supply by perhaps 1-2x in effective tokens compared to text alone. This delays the data wall by a year or two. It does not eliminate it.
Data quality: can better curation extend the wall?
If raw data supply is constrained, perhaps higher-quality data can compensate. The hypothesis: one token of textbook-quality, carefully curated data is worth ten tokens of random web scraping. If true, better curation could effectively extend the data supply.
There is evidence for this. Phi-2, a small model (2.7B parameters) trained on carefully curated “textbook-quality” data, outperformed much larger models on reasoning benchmarks. The Chinchilla paper itself emphasized data quality, not merely quantity. Training on high-quality data allows models to reach target performance with fewer tokens.
But quality curation does not create new information. It filters existing information, selecting the highest-value subset. This is valuable for efficiency, but it does not extend the frontier of what can be learned. If the total supply of high-quality data is 300 trillion tokens, aggressive curation might extract 50-100 trillion tokens of genuinely excellent data. That is a 3-6x smaller corpus, one that trains better models faster per token. But it does not make the data wall disappear; if anything, it makes the wall arrive sooner.
Consider the extreme: suppose we could perfectly curate the highest-quality data and achieve a 10x efficiency improvement (one curated token equals ten random tokens). The data wall moves from 2028 to perhaps 2030. Then what? The perfect curation has been applied. There is no further efficiency to extract. The wall still stands.
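The delay from a one-time efficiency gain follows from the same growth arithmetic: at roughly 2.5x annual growth in demand, a 10x gain buys

\[
\frac{\log 10}{\log 2.5} \;\approx\; 2.5 \text{ years}
\]

of headroom, which is where the move from 2028 to roughly 2030 comes from.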
Quality curation is an efficiency optimization, not a solution to scarcity. It helps, but it does not change the fundamental constraint: high-quality human-generated data is finite.
When exactly do we run out?
The timeline depends on model appetite (parameters, training tokens) and data supply growth. We can project based on current trends.
Current state (2024):
- Frontier models: 1-2 trillion parameters, trained on 13-18 trillion tokens
- Available high-quality data: ~300 trillion tokens (text + code)
- Annual new data generation: ~2-3 trillion tokens
Projected scaling (optimistic):
- Models double in parameter count every 12-18 months
- Training data scales proportionally (Chinchilla ratio: 20 tokens per parameter)
- Data generation grows linearly at 2-3 trillion tokens/year
GPT-5 equivalent (2025): ~5 trillion parameters, trained on ~100 trillion tokens. Data supply is sufficient but reserves are depleting.
GPT-6 equivalent (2026-2027): ~10-20 trillion parameters, requiring ~200-400 trillion tokens. Data supply is exhausted. Training at this scale requires reusing the same data multiple times (overtraining) or incorporating lower-quality data, both of which degrade performance.
Beyond 2027: Models cannot scale further without synthetic data or radically different data sources. As Chapter 6 documents, synthetic data leads to model collapse unless carefully mixed with fresh human data, which is not available at the required scale.
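A minimal sketch of this projection, under one simple reading of the assumptions above: demand for training tokens multiplies by 2.5x each year from roughly 15 trillion today, while the usable stock starts at about 300 trillion tokens and gains 2.5 trillion of new high-quality text per year (the milestone figures above assume somewhat larger year-to-year jumps):

```python
demand = 15e12         # tokens per frontier training run, 2024 (13-18T midpoint)
stock = 300e12         # high-quality human-generated tokens available
GROWTH = 2.5           # annual multiplier on demand
NEW_PER_YEAR = 2.5e12  # fresh high-quality tokens per year (2-3T midpoint)

for year in range(2024, 2031):
    status = "ok" if demand <= stock else "DATA WALL"
    print(f"{year}: demand {demand / 1e12:6.1f}T vs stock {stock / 1e12:6.1f}T  {status}")
    demand *= GROWTH
    stock += NEW_PER_YEAR
```

Under these inputs the crossing lands in 2028; using the larger per-generation jumps implied by the milestones above (~100 trillion tokens in 2025, 200-400 trillion in 2026-2027) pulls it into 2026-2027, consistent with the 2026-2028 window.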
Villalobos and colleagues at Epoch AI estimated the median data wall arrival at 2028, with uncertainty spanning 2026-2030. Their analysis is the most careful published estimate, and it aligns with the projection above. Industry insiders (Ilya Sutskever’s “peak data” comment, internal discussions at labs) suggest awareness of this constraint.
The wall is not a distant theoretical concern. It is a near-term practical constraint that labs are already encountering.
The industry’s response
What are companies doing about the data wall?
Licensing deals: OpenAI, Google, and others are signing deals with publishers (News Corp, Associated Press, Stack Overflow, Reddit) to license previously inaccessible data. These deals provide perhaps 1-5 trillion additional tokens, delaying the wall by months, not years.
Scraping expansion: Labs are scraping less-common languages, historical archives, multimedia transcripts, and other previously untapped sources. This provides marginal gains but lower-quality data.
Synthetic data augmentation: Explored extensively in Chapter 6. The consensus: synthetic data can regularize and augment but cannot extend the frontier without causing collapse.
Multimodal expansion: Training on images, video, and audio increases data supply by 2-3x in effective information content. This delays the wall but does not eliminate it, and multimodal data faces its own quality and copyright constraints.
Test-time compute: As discussed in Chapter 4, inference-time search can improve performance without new training data, but only for verifiable tasks within the learned distribution.
None of these approaches solves the fundamental problem. They are optimizations that delay the inevitable. The data wall is not a problem that can be engineered away. It is a conservation law: you cannot learn more information than the data contains, and the data is finite.
When demand meets supply
The data wall is not a software problem awaiting an engineering solution. It is a hard constraint: the finite supply of high-quality human-generated text meets exponentially growing demand. The industry’s response has been to search for alternatives: synthetic data, data augmentation, curriculum learning, data quality filtering. The next chapter examines whether these approaches can extend the frontier or merely delay the inevitable.
Chapter summary
- High-quality human-generated text totals approximately 300 trillion tokens (books, web, papers, code)
- Current frontier models train on 13-18 trillion tokens; demand grows at ~2.5x per year while supply grows linearly at 2-3 trillion tokens/year
- The curves cross around 2026-2028, marking the data wall where demand exceeds supply
- Code repositories (300M+ on GitHub) provide trillions of additional tokens but face finite supply, quality issues, copyright constraints, and AI pollution
- Quality curation (textbook-quality data) improves efficiency but does not create new information; it filters existing supply, making the wall arrive sooner
- Timeline projection: GPT-5 (2025) barely fits; GPT-6 (2026-2027) exhausts supply; beyond 2027 requires overtraining or synthetic data
- Industry responses (licensing deals, scraping expansion, multimodal data, test-time compute) delay the wall by months to 1-2 years, not decades
- The Chinchilla result demonstrates that undertrained models (too many parameters for available data) perform worse, making the data wall a hard constraint
- This is not a software problem but a conservation law: you cannot extract more information from a corpus than it contains, and the corpus is finite