Diminishing returns
“It is difficult to get a man to understand something when his salary depends upon his not understanding it.” — Upton Sinclair
Chapter 1 established the target: \(10^{25}\) learning instances at the optimistic vertebrate-only estimate, against roughly \(10^7\) gradient updates in frontier AI training. The gap is eighteen orders of magnitude. The question this chapter asks is whether scaling, the strategy of simply making models bigger and training them longer, can close it.
The power law promise
The case for scaling rests on a genuine empirical discovery. Kaplan and colleagues at OpenAI demonstrated in 2020 that language model performance, measured as cross-entropy loss on held-out text, follows a power law in three variables: the amount of compute \(C\), the number of model parameters \(N\), and the size of the training dataset \(D\). Specifically:
\[L(C) = a_C \cdot C^{-\alpha_C}, \quad \alpha_C \approx 0.050\] \[L(N) = a_N \cdot N^{-\alpha_N}, \quad \alpha_N \approx 0.076\] \[L(D) = a_D \cdot D^{-\alpha_D}, \quad \alpha_D \approx 0.095\]
These are not theoretical predictions. They are fits to experimental data spanning five orders of magnitude of compute, from small models trained on modest datasets to the largest systems available at the time. The fits are remarkably clean: the power law holds with minimal deviation across the entire range.
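To make the empirical character of these fits concrete, here is a minimal sketch of how such an exponent is recovered from (compute, loss) measurements: a power law is a straight line in log-log space, so a least-squares line fit recovers \(\alpha\). The data points below are synthetic, generated from an assumed power law with noise purely for illustration; they are not Kaplan's measurements.

```python
import numpy as np

# Synthetic (compute, loss) pairs from an assumed power law
# L(C) = a * C^(-alpha) with multiplicative noise -- illustration only,
# not the actual Kaplan et al. data.
rng = np.random.default_rng(0)
true_a, true_alpha = 10.0, 0.05
compute = np.logspace(15, 24, 30)   # nine orders of magnitude of FLOP
loss = true_a * compute**(-true_alpha) * rng.lognormal(0.0, 0.01, size=compute.size)

# log L = log a - alpha * log C, so fit a line in log-log space.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
print(f"fitted alpha = {-slope:.4f} (true {true_alpha})")
print(f"fitted a     = {np.exp(intercept):.3f} (true {true_a})")
```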
Hoffmann and colleagues refined this picture in 2022 with the Chinchilla study, which demonstrated that Kaplan’s original scaling prescription was suboptimal. Kaplan had suggested scaling parameters faster than data; Hoffmann showed that compute-optimal training requires scaling both in roughly equal proportion, at approximately 20 tokens per parameter. A model with 70 billion parameters, trained on 1.4 trillion tokens (the Chinchilla recipe), outperformed a 280-billion-parameter model trained on 300 billion tokens (the Gopher recipe), despite using the same compute budget. The lesson was clear: the field had been training models that were too large on too little data.
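A rough way to see the arithmetic of the Chinchilla recipe is the commonly used approximation that training compute is about \(C \approx 6ND\) FLOP for \(N\) parameters and \(D\) tokens. Combined with the roughly 20-tokens-per-parameter rule, a compute budget then determines both \(N\) and \(D\). The sketch below uses that approximation; the constant 6 and the ratio 20 are the usual round numbers, not exact fitted values.

```python
import math

def chinchilla_allocation(compute_flop: float, tokens_per_param: float = 20.0):
    """Split a compute budget into parameters and tokens using C ~ 6*N*D
    and the ~20 tokens-per-parameter rule of thumb."""
    # C = 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(compute_flop / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# 70B parameters x 1.4T tokens x 6 is roughly 5.8e23 FLOP.
n, d = chinchilla_allocation(5.8e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
# -> roughly 7e10 parameters and 1.4e12 tokens, i.e. the 70B / 1.4T recipe.
```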
Whatever structural property of the underlying optimization landscape these power laws reflect, the empirical record is strong. The scaling community’s central claim, that performance improves predictably with scale, has been validated repeatedly across model families, training methodologies, and evaluation benchmarks.
The question is not whether the scaling laws hold. The question is what they actually promise.
Diminishing returns as mathematical certainty
A power law with exponent \(\alpha < 1\) is, by definition, a function of diminishing returns: each successive unit of compute buys a smaller reduction in loss than the one before, and the smaller \(\alpha\) is, the more extreme the effect. This is a mathematical consequence of the functional form itself. Let us derive the implications explicitly.
The scaling law for compute is:
\[L(C) = a \cdot C^{-\alpha}, \quad \alpha \approx 0.05\]
Suppose we are currently at compute level \(C_0\) with loss \(L_0 = a \cdot C_0^{-\alpha}\). To reduce the loss by a factor of 2 (halve it), we need compute \(C_1\) such that:
\[a \cdot C_1^{-\alpha} = \frac{L_0}{2} = \frac{a \cdot C_0^{-\alpha}}{2}\]
Solving:
\[C_1^{-\alpha} = \frac{C_0^{-\alpha}}{2}\]
\[C_1 = C_0 \cdot 2^{1/\alpha}\]
For \(\alpha = 0.05\):
\[2^{1/0.05} = 2^{20} \approx 10^6\]
To halve the loss, we need one million times more compute. To halve it again from that new level, we need another factor of \(10^6\) on top: \(10^{12}\) times the original budget. Each successive halving costs a million-fold increase.
We can express this more generally. The compute required to reduce loss by a factor of \(k\) from any starting point is:
\[\frac{C_{\text{new}}}{C_{\text{old}}} = k^{1/\alpha}\]
For even modest improvements, the cost becomes extreme:
| Loss reduction factor | Compute multiplier (\(\alpha = 0.05\)) |
|---|---|
| 2x better | \(\sim 10^6\) |
| 3x better | \(\sim 10^{9.5}\) |
| 5x better | \(\sim 10^{14}\) |
| 10x better | \(\sim 10^{20}\) |
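The multipliers in this table follow directly from \(k^{1/\alpha}\); a few lines of arithmetic reproduce them (a quick check, not an implementation of the scaling laws themselves):

```python
import math

ALPHA = 0.05  # compute exponent from the Kaplan et al. fit

def compute_multiplier(loss_reduction_factor: float, alpha: float = ALPHA) -> float:
    """Compute multiplier needed to cut loss by a given factor under L = a * C^(-alpha)."""
    return loss_reduction_factor ** (1.0 / alpha)

for k in (2, 3, 5, 10):
    mult = compute_multiplier(k)
    print(f"{k}x lower loss -> ~10^{math.log10(mult):.1f}x more compute")
# 2x -> ~10^6.0, 3x -> ~10^9.5, 5x -> ~10^14.0, 10x -> ~10^20.0
```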
A tenfold improvement in loss requires \(10^{20}\) times more compute than the current level. For context, the entire global compute capacity deployed for AI training in 2024 was estimated at roughly \(10^{26}\) FLOP. A tenfold loss improvement from that baseline would require \(10^{46}\) FLOP, exceeding the estimated computational capacity of a Kardashev Type I civilization.
Now connect this to the gap from Chapter 1. We argued that the comparison between biological learning and artificial training should be measured in gradient updates versus learning instances, yielding a gap of \(10^{18}\). But even if we accept the FLOP-to-FLOP comparison favored by scaling optimists, the scaling law itself tells us that progress decelerates at a rate that makes closing the gap extraordinarily expensive. The power law does not promise convergence. It promises an asymptotically slowing approach toward a floor.
The Chinchilla revision, which roughly doubled the effective exponent by fixing the data-parameter ratio, was a genuine improvement. But doubling \(\alpha\) from 0.05 to 0.1 changes the compute multiplier for a tenfold loss reduction from \(10^{20}\) to \(10^{10}\). Better, but still astronomical. And Chinchilla was a one-time correction of a systematic error in training methodology. There is no reason to expect repeated corrections of comparable magnitude.
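Even if another correction of comparable magnitude did appear, the arithmetic would barely move. For a tenfold loss reduction the multiplier is \(10^{1/\alpha}\); the value \(\alpha = 0.2\) below is a purely hypothetical further doubling, not a measured exponent:

\[
\alpha = 0.05 \;\Rightarrow\; 10^{20}, \qquad \alpha = 0.1 \;\Rightarrow\; 10^{10}, \qquad \alpha = 0.2 \;\Rightarrow\; 10^{5}
\]

Two successive Chinchilla-scale corrections would still leave a five-order-of-magnitude compute multiplier.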
The bitter lesson and its limits
In 2019, Rich Sutton published a short essay titled “The Bitter Lesson” that became one of the most cited pieces of informal writing in AI research. His argument was simple and historically well-supported: across the history of artificial intelligence, general methods that leverage computation have consistently won over methods that leverage human knowledge. Hand-crafted features lose to learned features. Expert systems lose to neural networks. Carefully engineered game-playing programs lose to brute-force search combined with learning. The lesson is “bitter” because it means that human cleverness about the structure of problems is less valuable than raw compute applied to general-purpose learning algorithms.
Sutton was right about the history. The trend he identified is real and has continued to hold. But the bitter lesson carries an implicit assumption that we should make explicit: it assumes compute is the binding constraint. Given sufficient data of sufficient quality, more compute yields better performance. The lesson’s historical examples confirm this, because in every case Sutton cited, more data of the relevant kind was available.
The question is whether the assumption holds for the frontier we are now approaching. The scaling laws were measured over text: token prediction on natural language corpora. They describe how loss on text decreases as a function of compute, parameters, and data, all applied to text. There is no empirical evidence that these same scaling laws extend to the kind of learning that evolution performed: embodied, continuous, multi-sensory, physically grounded experience evaluated against survival and reproduction.
This is not a pedantic distinction. Chapter 1 established that the evolutionary training signal was qualitatively different from text in at least three ways: its information density (roughly \(10^9\) bits per second of sensory input versus the bandwidth of written language), its grounding in physical causation (organisms interacted with a world that obeys consistent physical laws, not with statistical regularities in token sequences), and its evaluation criterion (survival and reproduction, not cross-entropy loss on held-out text). The scaling laws tell us how fast text prediction improves with scale. They tell us nothing about whether text prediction, at any scale, converges to the capabilities that embodied evolutionary learning produced.
Chapter 2 also established a fourth difference: the architecture of learning itself. Nature’s learning was continuous, with daily consolidation cycles that interleaved new experience with existing knowledge. It ran on co-located memory and computation with massive parallel write bandwidth. The scaling laws were measured on systems that learn in a single pass, with separated memory and compute, and no consolidation mechanism. Even if more compute and more data were available without limit, the one-shot training paradigm cannot replicate the learn-consolidate cycle that enabled biological knowledge accumulation.
The bitter lesson says: do not bet against scale. Sound advice, as far as it goes. But it does not say: scale solves all problems. The lesson is about the relative merit of general methods versus hand-crafted ones within a domain where more data is available. It is silent on what happens when the data runs out, when the training signal lacks the information content of the target domain, or when the learning architecture cannot support the required mode of knowledge accumulation.
Empirical evidence of deceleration
The power law predicts deceleration. What does the empirical trajectory show?
GPT-2 to GPT-3 (2019 to 2020): Model size increased from 1.5B parameters to 175B parameters, roughly 100x. Training compute increased proportionally. The improvement was dramatic: GPT-3 demonstrated few-shot learning, could follow complex instructions, and showed surprisingly broad knowledge. This was a genuine capability jump, not merely incremental improvement.
GPT-3 to GPT-4 (2020 to 2023): Training compute increased by roughly another 10-100x (estimates vary; OpenAI has not released precise figures). Model architecture became more sophisticated, incorporating multimodality and likely mixture-of-experts. The improvement was real: GPT-4 passes professional exams (bar exam, AP exams), writes more coherent long-form text, handles more complex reasoning chains, integrates vision and text. But the improvement was smaller in qualitative terms than GPT-2 to GPT-3. GPT-4 is not a different kind of system; it is a better version of the same kind of system.
Benchmarks confirm this. On MMLU (Massive Multitask Language Understanding), a broad knowledge benchmark:

- GPT-3: ~43%
- GPT-3.5: ~70%
- GPT-4: ~86%
The jump from GPT-3 to GPT-3.5 was 27 percentage points. The jump from GPT-3.5 to GPT-4 was 16 percentage points. The rate of improvement is slowing even as compute expenditure increases exponentially.
On HumanEval, a code generation benchmark:

- GPT-3: ~0%
- GPT-3.5 (code-davinci-002): ~47%
- GPT-4: ~67%
Again, the largest jump came early. GPT-4 is better, but not proportionally better, despite another one to two orders of magnitude of compute.
Claude 2 to Claude 3 to Claude 3.5 (2023 to 2024): Anthropic’s models show a similar pattern. Claude 3 Opus outperformed Claude 2 significantly on reasoning benchmarks. Claude 3.5 Sonnet (mid-2024) showed further improvement but on a smaller scale. The deceleration is visible.
Gemini models (2023-2024): Google’s Gemini Ultra achieved performance comparable to GPT-4, using massive compute. Gemini 1.5 introduced a 1-million-token context window, a genuine architectural innovation, but capability improvements on standard benchmarks were incremental.
The pattern is consistent across labs and model families: each generation requires more compute, and each generation delivers smaller improvements. This is not surprising. It is what the power law predicts. But it is direct evidence that we are moving up the curve into the regime of diminishing returns.
Test-time compute: does it change the picture?
OpenAI’s o1 model, released in late 2024, introduced a new approach: test-time compute. Rather than simply generating the most likely next token, the model performs internal “reasoning” steps, searching over possible solution paths before committing to an answer. On some benchmarks, particularly mathematical and coding problems, o1 dramatically outperforms GPT-4.
Does this change the scaling picture?
The answer depends on what we mean by “scaling.” Test-time compute does not extend the training data, does not increase the number of learning instances, and does not solve the architectural or sensory bandwidth gaps. What it does is allow the model to spend more inference compute searching over possibilities within the distribution it has already learned.
This is valuable. For problems that admit search (math, coding, formal reasoning), allocating more compute at inference time can find better solutions. This is conceptually similar to chess engines, which improve with more search depth even if the evaluation function remains constant.
But test-time compute has limits:

1. It only helps for problems where the solution can be verified (math, code, logic). For open-ended generation, summarization, or creative tasks, there is no verifier to guide the search (see the sketch after this list).
2. It operates within the learned distribution. If the model has not learned the relevant concepts during training, search cannot discover them. o1 does not suddenly develop physical intuition or embodied common sense; it searches more carefully over the text-based knowledge it already has.
3. It is expensive. Inference cost scales with search depth. If o1 uses 100x more compute per query than GPT-4, then deploying it at scale costs 100x more. This is acceptable for high-value tasks (scientific research, complex coding) but not for general-purpose use.
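To make the first limit concrete, here is a minimal sketch of the simplest form of verifier-guided test-time compute: sample many candidate answers and keep one that a checker accepts. The `generate` and `verify` functions are hypothetical stand-ins, not any lab's actual API; the point is only that the loop buys nothing unless a reliable verifier exists and the right answer is already somewhere in the model's distribution.

```python
import random
from typing import Callable, Optional

def best_of_n(generate: Callable[[], str],
              verify: Callable[[str], bool],
              n: int) -> Optional[str]:
    """Spend more inference compute (n samples) searching for a verifiable answer.
    Cost grows linearly with n; coverage is limited to what `generate` can produce."""
    for _ in range(n):
        candidate = generate()
        if verify(candidate):
            return candidate
    return None  # no verified answer found within the budget

# Toy stand-ins: a "model" that guesses the value of 7 * 8, and a verifier that checks it.
guess = lambda: str(random.randint(0, 99))   # hypothetical generator
check = lambda s: s == "56"                  # verifiable problem: exact answer known

print(best_of_n(guess, check, n=1))    # small budget: usually fails
print(best_of_n(guess, check, n=500))  # large budget: almost always succeeds
```

If `check` were unavailable, as in open-ended generation, the extra samples would be pure cost (the first limit); and if the generator never produces the right concept at all, no sampling budget helps (the second).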
Test-time compute is a genuine innovation and will be valuable in specific domains. But it does not solve the fundamental gaps. It is better search over a limited map, not a larger map. The deceleration of capability improvements with training compute remains, and test-time compute does not bypass the data wall, the sensory bandwidth gap, or the architectural constraints.
The case against
We owe the strongest counterarguments a fair hearing.
“Scaling laws have held for five orders of magnitude. Betting against them is foolish.” This is true, and it is the strongest version of the scaling argument. Five orders of magnitude is a large extrapolation base, and the fits have been clean. But a power law with \(\alpha < 1\) is self-limiting by definition. The fact that it holds does not mean it is sufficient. A function can hold perfectly and still guarantee that the destination is unreachable within any feasible budget. We are not betting against the scaling laws. We are reading them carefully, and what they say is that each unit of progress costs exponentially more than the last.
“Algorithmic improvements change the exponent.” Possible, and some improvements have been genuine. The Chinchilla correction roughly doubled the effective exponent for a fixed compute budget. Mixture-of-experts architectures, better tokenization, and curriculum learning have all contributed incremental gains. But no demonstrated algorithmic improvement has delivered more than a roughly 2x efficiency gain in compute-equivalent terms. The gap is \(10^{18}\). To close it through algorithmic improvement alone would require discovering, in sequence, roughly 60 independent doublings of efficiency, each one a Chinchilla-scale breakthrough. The history of computer science offers no precedent for sustained improvement at this rate on a single problem class.
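The figure of 60 is simply the arithmetic of repeated doubling:

\[
2^{n} = 10^{18} \;\Rightarrow\; n = \frac{18}{\log_{10} 2} \approx 59.8
\]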
Chapter summary
- Scaling laws demonstrate that language model loss follows a power law \(L(C) = a \cdot C^{-\alpha}\) with \(\alpha \approx 0.05\), a function of diminishing returns by definition
- To halve the loss requires \(10^6\) times more compute; to reduce by 10x requires \(10^{20}\) times more compute (exceeding projected global AI compute capacity)
- The Chinchilla correction (2022) roughly doubled the effective exponent by fixing the data-parameter ratio, but this was a one-time correction of a systematic error
- Empirical evidence confirms deceleration: GPT-3 to GPT-4 required 10-100x more compute for smaller qualitative improvements than GPT-2 to GPT-3
- Benchmark trajectories show slowing gains: each generation delivers fewer percentage points of improvement despite exponentially more compute
- The bitter lesson (Sutton, 2019) says general methods that leverage compute beat hand-crafted approaches, but this assumes unlimited data and does not address architectural constraints
- Scaling laws were measured over text prediction; they tell us nothing about whether text prediction converges to embodied, physically grounded intelligence
- Test-time compute (OpenAI’s o1) improves performance on verifiable problems through search but does not extend training data or solve fundamental gaps
- Test-time compute operates within the learned distribution; it cannot discover concepts not present in training data
- The \(10^{18}\) gap cannot be closed by algorithmic improvements alone; this would require 60 sequential Chinchilla-scale breakthroughs with no historical precedent