The stagnation thesis
“The low-hanging fruit has been picked.” — Tyler Cowen
The previous six chapters have established a series of hard constraints. Each constraint alone would slow progress. Together, they compound into something more decisive: a ceiling. This chapter argues that frontier model capabilities will plateau within the next 3-5 years, not because we stop trying, but because the constraints converge into limits that cannot be overcome on the current path.
The convergence of constraints
The gaps and bottlenecks are not independent problems. They interlock, and each attempted solution runs into another wall.
The learning instances gap. Chapter 1 established that biological intelligence required \(10^{25}\) learning instances under the optimistic, vertebrate-only estimate. Current training runs perform roughly \(10^7\) gradient updates. The gap is eighteen orders of magnitude.
The architectural bottleneck. Chapter 2 demonstrated that one-shot learning with catastrophic forgetting cannot replicate the continuous consolidation cycle that enabled biological knowledge accumulation. Even if we could generate \(10^{25}\) learning instances, the architecture cannot process them the way biology did. The path matters, not just the quantity.
The sensory bandwidth gap. Chapter 3 showed that text is a \(10^{-8}\) compression of embodied experience. Language models train on the fraction of knowledge that survived articulation, not the full sensory bandwidth that grounded biological learning. The training signal lacks the information content.
Diminishing returns. Chapter 4 proved mathematically that each doubling of performance under the observed power law requires a million-fold increase in compute. Progress does not stop, but it decelerates into an asymptotic approach toward a floor.
The data wall. Chapter 5 documented that high-quality human-generated text is finite at roughly 300 trillion tokens. The wall arrives around 2028. Models cannot scale beyond the data supply, and undertrained models perform worse, not better.
The ouroboros trap. Chapter 6 showed that synthetic data cannot extend the frontier. Models trained on their own outputs undergo collapse. The information conservation law holds: you cannot create knowledge from itself. Meanwhile, AI-generated content pollutes the web, degrading future training data quality.
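To keep the scale of these numbers in view, here is a back-of-the-envelope sketch in Python that puts the headline figures from Chapters 1-5 side by side. The scaling exponent \(\alpha = 0.05\) is an illustrative assumption chosen so that one performance doubling costs roughly a million-fold more compute, matching Chapter 4's claim; the other values are the estimates quoted above.

```python
import math

# Headline numbers from Chapters 1-5 (all rough, order-of-magnitude estimates).
biological_learning_instances = 1e25   # vertebrate-only estimate (Chapter 1)
gradient_updates_per_run = 1e7         # typical frontier training run (Chapter 1)
text_compression_factor = 1e-8         # text vs. embodied experience (Chapter 3)
scaling_exponent_alpha = 0.05          # assumed power-law exponent (Chapter 4)
high_quality_tokens = 300e12           # high-quality human text supply (Chapter 5)

# Learning-instances gap, in orders of magnitude.
gap = math.log10(biological_learning_instances / gradient_updates_per_run)
print(f"learning-instances gap: {gap:.0f} orders of magnitude")           # 18

# If performance scales as C^alpha, one doubling costs a factor of 2**(1/alpha) in compute.
print(f"compute multiplier per doubling: {2 ** (1 / scaling_exponent_alpha):.1e}")  # ~1.0e+06

print(f"fraction of sensory bandwidth surviving in text: {text_compression_factor:.0e}")
print(f"high-quality text supply: {high_quality_tokens:.0e} tokens")
```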
These constraints compound. You cannot solve the data wall with synthetic data because of model collapse. You cannot brute-force the learning instances gap because of diminishing returns and the data ceiling. You cannot replicate embodied grounding because text lacks the information content, and multimodal observation is not interaction. You cannot overcome the architectural bottleneck without redesigning the learning paradigm, which would require starting from scratch with no guarantee of success.
The walls close in from all sides.
What stagnation looks like
Stagnation is not a hard stop. It is a deceleration to asymptotic improvement. The trajectory is visible in the historical record.
GPT-2 (2019): 1.5 billion parameters, impressive text generation, clearly superhuman at next-token prediction but with limited reasoning ability.
GPT-3 (2020): 175 billion parameters, emergent few-shot learning, surprising breadth, but still fragile on tasks requiring robust reasoning.
GPT-4 (2023): estimated 1+ trillion parameters, multimodal, passes professional exams, writes working code, carries on sophisticated conversations. A large jump from GPT-3.
The jump from GPT-3 to GPT-4 took three years and required roughly 100x more compute. The improvement was real, but it was a change of degree, not of kind. GPT-4 is a better predictor than GPT-3; it is not a different kind of system.
GPT-5, when it arrives, will be better still. It will saturate more benchmarks, pass more exams, write cleaner code. But the improvement will be smaller than the GPT-3 to GPT-4 jump, because we are further up the power law curve where gains are expensive. Each subsequent model will show diminishing improvements.
By GPT-6 or GPT-7, the gains will be imperceptible to most users. The model will have approached the asymptote: the best predictor possible given text training data, within the constraints of one-shot learning, under the power law that governs scaling.
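A toy calculation illustrates why each generation feels smaller than the last. Assume, purely for illustration, a saturating power law \(L(C) = L_\infty + a C^{-\alpha}\) and give each generation 100x the compute of its predecessor; the constants below (\(L_\infty = 1.7\), \(a = 9\), \(\alpha = 0.05\), a GPT-3-scale starting budget of \(3 \times 10^{23}\) FLOP) are assumptions chosen for readability, not fitted values from any published scaling study.

```python
# Illustrative only: per-generation gains under a saturating power law
# L(C) = L_inf + a * C**(-alpha). All constants are assumptions for illustration.

L_inf = 1.7      # hypothetical irreducible loss (entropy floor of text)
a = 9.0          # hypothetical scale constant
alpha = 0.05     # hypothetical compute-scaling exponent
C0 = 3e23        # hypothetical GPT-3-scale training compute, in FLOP

prev = None
for i, name in enumerate(["GPT-3", "GPT-4", "GPT-5", "GPT-6", "GPT-7"]):
    C = C0 * 100 ** i                    # assume ~100x more compute per generation
    L = L_inf + a * C ** (-alpha)
    delta = "" if prev is None else f"  (improvement: {prev - L:.3f})"
    print(f"{name}: {C:.0e} FLOP -> loss {L:.3f}{delta}")
    prev = L
```

Under these assumptions each generation's gain is \(100^{-0.05} \approx 0.79\) times the previous one: compute grows a hundredfold per step while the visible improvement shrinks by roughly a fifth.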
The benchmark saturation cycle
Benchmarks will continue to improve, but benchmarks measure what is measurable, not what matters. When GPT-4 saturated undergraduate-level exams, the community created graduate-level benchmarks. When those saturate, we will create expert-level benchmarks. When those saturate, we will create adversarial benchmarks specifically designed to expose model weaknesses.
This is Goodhart’s law applied to AI: when a measure becomes a target, it ceases to be a good measure. Models optimize for benchmarks, and benchmarks become proxies for capabilities rather than measurements of them. A model that scores 95% on a reasoning benchmark has not necessarily learned to reason; it has learned the statistical regularities of reasoning-like text.
The saturation cycle is already visible. MMLU, a broad knowledge benchmark, was considered challenging when introduced. Frontier models now exceed 85%. The response: create harder benchmarks. But harder is not the same as more meaningful. Eventually, every benchmark becomes a game that models learn to play through pattern matching on training data.
The underlying question is whether performance on text-based benchmarks, at any level of difficulty, constitutes the capabilities we actually care about: robust reasoning, causal understanding, novel problem-solving, grounded common sense. The stagnation thesis says no. Benchmarks will improve asymptotically, but the gap between “excellent text predictor” and “general intelligence” remains.
The capability plateau
Stagnated frontier models will plateau at a level best described as “impressively competent within distribution, fragile outside it.”
Where they excel:
- Text generation and summarization: models are already near-human on these tasks and will approach indistinguishability.
- Code completion for common patterns: standard libraries, well-documented APIs, conventional algorithms.
- Question answering on well-documented topics: anything in the training corpus with sufficient examples.
- Translation between languages: statistical regularities are strong, performance is already high.
- Classification and pattern recognition: given labeled examples, models generalize well.
- Style matching and tone adaptation: mimicking writing styles is a pattern-matching task.
Where they remain weak:
- Novel reasoning requiring grounding not present in training data: models cannot deduce from first principles what they have not seen.
- Physical intuition: a model that has never interacted with objects cannot reliably predict “what happens if I drop this?”
- Causal understanding: correlation is in the data, causation is not. Models confuse the two.
- Genuine creativity: true novelty requires generating patterns not present in training data. Models recombine seen patterns.
- Robust common sense in unfamiliar situations: common sense is grounded in embodied experience. Text captures some of it but not the substrate.
- Out-of-distribution robustness: adversarial examples, distribution shift, novel contexts all expose brittleness.
This is not a temporary limitation awaiting more scale. It is the consequence of training on text, using one-shot learning, without embodied grounding. Scaling makes the within-distribution performance better, but it does not close the gap to capabilities requiring information not present in text.
The “good enough” threshold
Stagnation is not failure. For many applications, even a plateau at current frontier capability levels delivers enormous value.
Most enterprise tasks are well-documented and within-distribution. Code completion for standard libraries is useful even if the model cannot invent new algorithms. Email drafting and report summarization are useful even if the model cannot generate truly novel insights. Customer service for common questions is useful even if the model fails on edge cases. Translation is useful even if the model occasionally makes errors on idioms.
The economic value of “GPT-4 level but no better” is measured in trillions of dollars of productivity gains. Legal document review. Medical literature summarization. Software development acceleration. Content generation at scale. Personalized tutoring for standard curricula. These applications do not require AGI. They require competent text manipulation, and stagnated frontier models deliver that.
The stagnation thesis is not “AI is useless.” It is “AI will not reach AGI on the current path, but will deliver enormous value at sub-AGI capability levels.” The hype promised AGI by 2030. The reality delivers impressive, economically transformative, but decidedly non-general intelligence.
When do we hit each wall?
We can project timelines for each constraint based on current trajectories.
Data wall: 2026-2028. High-quality text is finite, model appetite grows exponentially, and the curves cross within this window. Villalobos and colleagues estimated 2028 at the median. Multimodal data offers some relief by extending the supply, but it delays the wall by 1-2 years, not a decade.
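A simple crossing calculation reproduces this window. The ~300 trillion-token supply figure is Chapter 5's estimate; the 2023 baseline of roughly 15 trillion tokens per frontier run and the 2-3x annual growth in data appetite are assumptions for illustration.

```python
# Rough projection of when training-token demand crosses the high-quality text supply.
# Baseline demand and growth rates are illustrative assumptions; the supply estimate
# is the ~300 trillion tokens documented in Chapter 5.

SUPPLY_TOKENS = 300e12
baseline_year, baseline_demand = 2023, 15e12

for growth in (2.0, 3.0):                 # annual growth in tokens consumed per run
    year, demand = baseline_year, baseline_demand
    while demand < SUPPLY_TOKENS:
        year += 1
        demand *= growth
    print(f"{growth:.0f}x/year -> frontier runs exhaust the supply around {year}")
```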
Compute ceiling: 2028-2030. The power law requires \(10^6\) times more compute for each doubling of performance. Current frontier models consume roughly \(10^{26}\) FLOP. To double performance again would require \(10^{32}\) FLOP, which exceeds the global AI compute capacity projected for the next 5 years. Compute growth is slowing as Moore’s Law decelerates and as energy and manufacturing constraints bind.
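The arithmetic, spelled out as a sketch. The hypothetical 100,000-GPU cluster and its assumed sustained throughput of \(10^{15}\) FLOP/s per GPU are reference points for scale, not figures for any specific deployment.

```python
# Compute-ceiling arithmetic from the paragraph above. The cluster size and per-GPU
# throughput below are hypothetical reference points, not real deployment figures.

current_frontier_flop = 1e26            # order of magnitude of today's largest runs
per_doubling_multiplier = 1e6           # from the Chapter 4 power law
required_flop = current_frontier_flop * per_doubling_multiplier
print(f"FLOP for the next performance doubling: {required_flop:.0e}")   # 1e+32

cluster_flops_per_sec = 1e5 * 1e15      # 100,000 GPUs at ~1e15 FLOP/s sustained each
seconds = required_flop / cluster_flops_per_sec
print(f"wall-clock time on that cluster: {seconds:.0e} s (~{seconds / 3.15e7:.0f} years)")
```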
Capability saturation: 2027-2032. As models hit the data wall and compute ceiling, capability improvements decelerate. The compounding constraints converge. By 2030, frontier models will be measurably better than GPT-4, but not categorically different. By 2032, the improvements become marginal.
These are not precise predictions. They are informed projections based on observed trends and established constraints. The timeline could shift by a few years in either direction. But the qualitative outcome, stagnation within the next decade, is robust to uncertainty in the details.
Historical parallels
Technological progress often follows an S-curve: slow initial growth, rapid exponential improvement, then deceleration to a plateau as fundamental limits bind. AI is not the first technology to encounter this pattern.
Flight speed. The sound barrier was broken in 1947. By 1960, aircraft routinely flew at Mach 2. The SR-71, cruising above Mach 3, set speed records in the 1960s and 1970s that still stand. Hypersonic flight exists but remains experimental. Why? Because physics imposes hard limits. Air resistance scales with the square of velocity, and aerodynamic heating scales roughly with its cube. Beyond Mach 3, the engineering challenges become extraordinary and the returns diminish. We did not stop trying. We hit a wall.
Moore’s Law. Transistor density doubled roughly every two years from 1970 to 2010, driving exponential growth in compute. Around 2010, the cadence began to slow; by 2020, the doubling time had stretched to 3-4 years. Why? Because quantum mechanics imposes limits at atomic scales. Critical features are now a few nanometers wide, only tens of atoms across. Further miniaturization faces physical barriers. We have not stopped trying. We are hitting a wall.
Energy efficiency of computation. The Landauer limit, derived from thermodynamics, sets a floor on the energy required to erase a bit: \(kT \ln 2 \approx 3 \times 10^{-21}\) joules at room temperature. Current DRAM operates roughly \(10^9\) times above this limit. Progress toward the limit has been exponential for decades, but the limit is absolute. No technology can violate thermodynamics. We will approach the Landauer limit asymptotically but never breach it.
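The limit itself is a one-line calculation; the DRAM energy-per-bit figure used for the ratio is an order-of-magnitude assumption for illustration.

```python
import math

# Landauer limit at room temperature and the rough gap to current memory hardware.
# The DRAM energy-per-bit value is an assumed order of magnitude, for illustration.
k_B = 1.380649e-23                      # Boltzmann constant, J/K
T = 300.0                               # room temperature, K
landauer_j_per_bit = k_B * T * math.log(2)
print(f"Landauer limit: {landauer_j_per_bit:.1e} J per bit erased")     # ~2.9e-21

dram_j_per_bit = 3e-12                  # assumed: DRAM access energy ~ picojoules/bit
print(f"DRAM energy / Landauer limit: {dram_j_per_bit / landauer_j_per_bit:.0e}")  # ~1e+09
```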
AI scaling faces analogous limits: finite data, power-law diminishing returns, and architectural bottlenecks in the learning paradigm itself. The pattern is familiar: rapid progress during the exponential phase, then deceleration as the limits bind.
The emergence argument
Perhaps the strongest counterargument to stagnation is emergence: the observation that capabilities appear suddenly at scale that were not present or predictable at smaller scales. Wei and colleagues documented numerous examples in their 2022 paper: chain-of-thought reasoning, in-context learning, multi-step arithmetic, instruction following. These capabilities were not explicitly programmed or trained; they emerged from scaling.
The emergence phenomenon is genuine and was surprising. It suggests that scaling unlocks latent structure in the training data that smaller models cannot access. This is important: it demonstrates that scale is not merely quantitative improvement but can produce qualitative capability jumps.
But emergence has limits, and understanding those limits is critical to assessing whether it can overcome the gaps documented in earlier chapters.
Emergence within the training distribution. Every documented emergent capability can be traced to patterns present in the training data. In-context learning emerged because the training corpus contains examples of learning from context (few-shot examples in documentation, tutorials that build on previous concepts). Chain-of-thought reasoning emerged because the training data contains worked examples with explicit reasoning steps (textbooks, Stack Overflow answers, math forums). Arithmetic emerged because numerical patterns appear throughout text.
These are not trivial pattern-matching tasks. The model had to learn abstractions, generalizations, and compositional structure to exhibit these capabilities. But they are still recognition of patterns present in text, not generation of capabilities absent from text.
The “sparks of AGI” debate. Bubeck and colleagues at Microsoft Research published a provocative paper in 2023 titled “Sparks of Artificial General Intelligence: Early experiments with GPT-4.” They documented impressive performance on novel tasks: drawing unicorns, solving theory-of-mind problems, generating Python code to visualize concepts. They argued that GPT-4’s breadth and flexibility suggested early signs of general intelligence.
The paper sparked intense debate. Critics pointed out that all demonstrated capabilities, while impressive, involved recombination of patterns in the training data. Drawing a unicorn requires understanding “unicorn” (fantasy creature, horse-like body, single horn, often depicted in specific styles) and “drawing” (generating vector graphics code or describing visual appearance). Both concepts appear extensively in training data. The task requires creative synthesis, which the model achieves, but not generation of concepts genuinely absent from training.
Theory-of-mind tasks (understanding that others have beliefs different from one’s own) appeared more challenging. But Sally-Anne tests and similar tasks appear in psychology literature, education materials, and discussions of cognitive development—all in the training corpus. The model likely learned the structure of these problems from seeing many examples, not by developing actual theory of mind through social interaction.
The “sparks” paper is valuable because it documents the frontier of what scaling has achieved. But it does not demonstrate that further scaling will produce capabilities qualitatively beyond what patterns in text can support. Emergence unlocks latent structure in training data; it does not create structure absent from training data.
Can embodied grounding emerge? This is the critical question. Chapters 1-3 documented that biological learning operated over \(10^9\) bits per second of sensory bandwidth, grounded in physical interaction with consequences (survival, reproduction). Text captures perhaps \(10^{-8}\) of this experience. Can the missing 99.999999% emerge from scaling text prediction?
There is no evidence for this. The failures documented in Chapter 1 (novel physical reasoning, causal inference, compositional generalization in unfamiliar contexts, robust common sense) persist in GPT-4 despite its scale. These failures cluster precisely where embodied grounding is required. Scaling from GPT-3 to GPT-4 reduced the failure rate but did not eliminate the failure mode.
The hypothesis that embodied grounding will emerge from text alone requires believing that text contains the information needed for physical intuition, even though that information was explicitly compressed away when experience was articulated into language. This is not impossible, but it is a strong claim requiring evidence. The current evidence suggests the opposite: the gaps persist despite scaling.
GPT-5 and beyond: falsifiable predictions
The stagnation thesis makes predictions that can be tested against future model releases. If GPT-5 (or its equivalent from other labs) appears in 2025-2026, we can check whether it conforms to the predicted trajectory.
Prediction 1: Benchmark saturation. GPT-5 will achieve higher scores on standard benchmarks (MMLU, HumanEval, etc.) than GPT-4, but the improvement will be smaller than the GPT-3 to GPT-4 improvement. Expect gains of roughly 5-10 percentage points on MMLU (from GPT-4’s ~86% toward the low 90s), not another jump on the order of the 27 points from GPT-3 to GPT-3.5.
Prediction 2: Persistent failure modes. The failures documented in Chapter 1 will persist. GPT-5 will still struggle with:
- Novel physical reasoning not present in training data (e.g., predicting outcomes of unfamiliar mechanical configurations)
- Robust causal inference beyond memorized patterns
- Compositional generalization in truly novel contexts
- Out-of-distribution adversarial robustness
The failure rate will decrease, but the failure mode will remain. The model will be more often correct but not qualitatively more robust.
Prediction 3: Diminishing qualitative improvement. GPT-4’s release felt like a capability jump: it could pass professional exams, write complex code, handle multimodal inputs. GPT-5 will feel like refinement: better writing quality, fewer hallucinations, faster inference, but not a new category of capability. Users will struggle to articulate what GPT-5 can do that GPT-4 could not, beyond “it’s better.”
Prediction 4: Training data exhaustion. GPT-5 will either (a) train on a similar token count to GPT-4 (~15-20 trillion tokens) but with better curation and architectural improvements, or (b) attempt to scale beyond 100 trillion tokens and encounter quality degradation from data reuse or lower-quality sources. If (b), performance on some benchmarks may plateau or even regress slightly.
Prediction 5: Economic value plateau. GPT-5 will be economically valuable (better coding assistants, better writing tools, better customer service bots) but not transform additional industries beyond what GPT-4 already enabled. The economic impact curve is flattening as the technology reaches saturation in applications where text manipulation suffices.
How to falsify the stagnation thesis: If GPT-5 demonstrates genuinely novel capabilities not derivable from patterns in text—robust physical reasoning, reliable causal inference, zero-shot compositional generalization to structures it has never seen, stable lifelong learning without catastrophic forgetting—then stagnation is falsified. If GPT-5 delivers qualitative jumps comparable to GPT-3 to GPT-4, and if subsequent models continue delivering such jumps without hitting the data wall, then the thesis is wrong.
But if GPT-5 conforms to the predictions above—incremental benchmark gains, persistent failure modes, diminishing qualitative improvement, data constraints binding—then the stagnation thesis is supported. The null hypothesis should be the trend: deceleration along a power law toward an asymptote.
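One minimal way to operationalize that null hypothesis: extrapolate the trend in successive benchmark gains and ask whether a new release beats it. The scores below are placeholders standing in for MMLU-style results across three past releases, not reported numbers.

```python
# Placeholder scores for three successive frontier releases on an MMLU-style
# benchmark; the specific values are illustrative, not reported results.
scores = [0.44, 0.70, 0.86]             # model N-2, N-1, N

gains = [b - a for a, b in zip(scores, scores[1:])]     # successive improvements
shrink = gains[1] / gains[0]                            # how much the last gain shrank
expected_next_gain = gains[1] * shrink                  # geometric extrapolation

print(f"past gains: {[round(g, 2) for g in gains]}")    # [0.26, 0.16]
print(f"expected next gain if the trend holds: {expected_next_gain:.2f}")  # ~0.10
# A release that beats this extrapolation by a wide margin is evidence against the
# thesis; one at or below it is consistent with deceleration along the trend.
```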
The case against
“Emergent capabilities suggest we are on the cusp of a phase transition. More scale will unlock qualitatively new behaviors.”
Emergent capabilities are real, but they are not magic. Wei and colleagues documented that certain capabilities appear suddenly at scale: chain-of-thought reasoning, in-context learning, arithmetic. These were surprising, and they demonstrated that scale unlocks latent structure in the training data.
But emergence within text does not imply that capabilities not present in text will also emerge. In-context learning emerged because the training data contains examples of learning from context. Arithmetic emerged because the training data contains numerical patterns. These are still pattern matching, just at higher abstraction. There is no evidence that capabilities requiring embodied grounding, causal reasoning not present in text, or true novelty will emerge from scaling text prediction, because the training signal does not contain them.
“Industry leaders are confident. They are building toward AGI.”
Industry leaders have strong financial incentives to project confidence. Their companies are valued on the assumption of continued exponential progress toward AGI. Admitting stagnation would collapse valuations. This does not mean they are lying; it means their incentives are not aligned with dispassionate assessment.
Some leaders genuinely believe AGI is near. Belief is not evidence. The constraints documented in this book are empirical: finite data, power law diminishing returns, architectural bottlenecks, information-theoretic limits. Optimism does not override mathematics.
Why this diverges from the narrative
The dominant narrative in 2024-2025 is exponential progress toward AGI within a decade. This narrative serves many purposes: attracting investment, recruiting talent, justifying massive compute expenditures, generating media attention.
But stagnation is not failure. It is a realistic assessment of what the current path delivers. Frontier models plateau at a capability level that is genuinely impressive and economically transformative, even if it is not AGI. The sooner we accept this, the faster we can redirect resources toward what actually works: making stagnated-but-useful models radically cheaper, faster, and more accessible.
That is the subject of the next chapter.
Chapter summary
- Frontier capability gains will plateau within 3-5 years (2027-2032) due to compounding constraints from Chapters 1-6
- The constraints interlock: data wall + ouroboros trap + diminishing returns + architectural bottleneck + sensory bandwidth gap
- Stagnation is not a hard stop but deceleration to asymptotic improvement; GPT-3 to GPT-4 was a larger jump than GPT-4 to GPT-5 will be
- Benchmark saturation cycle: as models saturate existing tests, harder benchmarks are created, but this measures test-taking ability not robust intelligence
- Stagnated models will excel within distribution (text generation, code completion, Q&A on documented topics, translation) but remain fragile outside distribution
- The “good enough” threshold: GPT-4 level capability delivers enormous economic value (trillions in productivity) even without reaching AGI
- Historical parallels (flight speed, Moore’s Law, computational energy efficiency) show S-curve patterns where fundamental limits cause deceleration
- Emergent capabilities (chain-of-thought, in-context learning, arithmetic) are real but operate within training distribution; no evidence that embodied grounding will emerge from text alone
- Bubeck et al.’s “Sparks of AGI” documented impressive GPT-4 performance but all demonstrated capabilities involve recombination of training data patterns
- Falsifiable predictions for GPT-5: benchmark saturation (5-10pp gains, not 27pp), persistent failure modes, diminishing qualitative improvement, data constraints binding
- The stagnation thesis is falsified if GPT-5+ demonstrates genuinely novel capabilities not derivable from text patterns
- Industry incentives (valuations depend on AGI narrative) create pressure to project confidence despite empirical constraints