The efficiency revolution

“The future is already here — it’s just not evenly distributed.” — William Gibson

Frontier model capabilities will stagnate. But efficiency will not. While the performance ceiling is real, the cost to reach that ceiling will collapse. This chapter makes the case that within 10-15 years, models at current frontier quality will run locally on consumer devices at near-zero marginal cost. This is the genuine revolution: not AGI, but impressive machines everywhere.

The efficiency gap

The gap between capability and accessibility is enormous.

Current state of frontier models:

  • Training cost: $100M-$500M in compute expenditure for a single run
  • Inference infrastructure: requires datacenter deployment with thousands of GPUs
  • Access model: gated through API endpoints or subscription services
  • Latency: network round-trips add 100-500ms, queuing adds more during peak usage
  • Privacy: all queries transit through corporate servers
  • Marginal cost: $0.01-$0.10 per 1000 tokens, depending on model size

GPT-4 level intelligence exists, but most of the world cannot run it locally. The model weights occupy hundreds of gigabytes. Inference requires hardware most consumers do not own. The intelligence is concentrated in datacenters, accessed through network pipes.
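
For a sense of scale, weight memory is simply parameter count times bytes per parameter. The parameter counts in this quick calculation are illustrative assumptions, not published figures:

    # Rough weight-memory arithmetic: parameters times bytes per parameter.
    # The parameter counts below are illustrative assumptions, not published figures.
    def weight_gigabytes(n_params, bits_per_weight):
        return n_params * bits_per_weight / 8 / 1e9

    for n_params, bits in [(175e9, 16), (500e9, 16), (500e9, 4)]:
        print(f"{n_params / 1e9:.0f}B params at {bits}-bit: "
              f"{weight_gigabytes(n_params, bits):,.0f} GB")
    # 175B at 16-bit: 350 GB; 500B at 16-bit: 1,000 GB; 500B at 4-bit: 250 GB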

This centralization is not permanent. It is an artifact of the current efficiency level. As efficiency improves, the frontier moves from datacenter to device.

Training cost collapse

Training frontier models is expensive primarily because we use brute force: massive models, enormous datasets, one-shot learning on hardware optimized for throughput rather than efficiency. Multiple pathways exist to reduce this cost by orders of magnitude.

Algorithmic improvements

Better architectures. The transformer, introduced in 2017, was not designed for efficiency. It was designed for expressiveness and parallelizability. Attention is quadratic in sequence length. Feed-forward layers are dense and parameter-heavy. These were acceptable tradeoffs when compute was the bottleneck, but as efficiency becomes the focus, architectural innovations deliver gains.

Mixture of experts (MoE) routes different inputs through different subnetworks, activating only a fraction of total parameters for any given token. Models like GPT-4 reportedly use MoE, achieving better performance per compute than dense models. Such models carry more total parameters than a comparable dense model but activate far fewer per token, reducing effective inference cost.
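
A minimal sketch of the routing idea in plain NumPy; the dimensions, the top-2 gating, and the ReLU experts are illustrative assumptions, not any production design:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_experts, top_k = 64, 8, 2                # illustrative sizes

    # One tiny feed-forward "expert" per slot; only the routed ones run.
    experts = [(rng.normal(size=(d_model, 4 * d_model)) * 0.02,
                rng.normal(size=(4 * d_model, d_model)) * 0.02)
               for _ in range(n_experts)]
    router = rng.normal(size=(d_model, n_experts)) * 0.02

    def moe_forward(x):                                 # x: (d_model,) one token
        logits = x @ router
        chosen = np.argsort(logits)[-top_k:]            # top-k experts for this token
        weights = np.exp(logits[chosen] - logits[chosen].max())
        weights /= weights.sum()                        # softmax over the chosen experts only
        out = np.zeros_like(x)
        for w, idx in zip(weights, chosen):
            w1, w2 = experts[idx]
            out += w * (np.maximum(x @ w1, 0.0) @ w2)   # ReLU feed-forward expert
        return out                                      # 2 of 8 experts ran: ~25% of the FFN compute

    y = moe_forward(rng.normal(size=d_model))
    print(y.shape)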

State space models (Mamba, Hyena) replace attention with linear-time sequence processing, eliminating the quadratic bottleneck. Early results suggest they match transformer performance on long sequences while using far less compute. Whether they scale to frontier quality remains an open question, but the direction is promising.

Improved optimizers. Gradient descent is not the only way to train neural networks, just the most established. Second-order methods, which use curvature information, converge in fewer steps but are too memory- and compute-hungry to apply exactly at scale. Approximate second-order methods (K-FAC, Shampoo) capture some of the benefit with manageable overhead. Each generation of optimizers reduces the number of training steps required to reach a target loss.

Curriculum learning. The order in which a model sees data affects how efficiently it learns. Training on easy examples first, then harder ones, allows faster convergence than random sampling. Careful curriculum design can reduce required training compute by 2-5x.
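
A toy sketch of the scheduling idea, assuming sequence length works as a difficulty proxy (real curricula use richer difficulty signals):

    # Toy curriculum: start with the easiest examples, gradually widen the pool.
    # Using sequence length as the difficulty proxy is an illustrative assumption.
    def curriculum_batches(data, n_stages=3, difficulty=len):
        ranked = sorted(data, key=difficulty)           # easy-to-hard ordering
        for stage in range(1, n_stages + 1):
            pool = ranked[: stage * len(ranked) // n_stages]
            yield stage, pool                           # later stages include harder examples

    examples = ["a cat", "dogs run", "the cat sat on the mat",
                "a long and convoluted sentence about cats and dogs"]
    for stage, pool in curriculum_batches(examples):
        print(stage, pool)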

Knowledge distillation. A large teacher model can train a smaller student model to match its performance through distillation: the student learns from the teacher’s outputs rather than raw data. The student is cheaper to run, and distillation is cheaper than training from scratch. Distillation does not extend the capability frontier, but it democratizes access to frontier capabilities at lower cost.
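
A minimal sketch of the soft-target objective; the temperature and logits are illustrative, and production recipes typically mix this term with a standard hard-label loss:

    import numpy as np

    def softmax(logits, temperature=1.0):
        z = logits / temperature
        z = z - z.max()                                 # numerical stability
        p = np.exp(z)
        return p / p.sum()

    def distillation_loss(teacher_logits, student_logits, temperature=2.0):
        """KL(teacher || student) on temperature-softened distributions."""
        p = softmax(teacher_logits, temperature)        # soft targets from the teacher
        q = softmax(student_logits, temperature)
        return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

    teacher = np.array([4.0, 1.0, -2.0])                # illustrative logits for one token
    student = np.array([3.0, 1.5, -1.0])
    print(distillation_loss(teacher, student))          # minimizing this pulls the student toward the teacher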

Conservative projection: Compounding algorithmic improvements deliver 10-50x training cost reduction over 10 years.

Hardware improvements

Next-generation accelerators. NVIDIA’s H100 GPU, released in 2022, is already being superseded by the H200 and the upcoming B100 series. Each generation delivers roughly 2-3x improvement in FLOPs per watt and FLOPs per dollar. This is not Moore’s Law, which has slowed, but it is sustained progress driven by specialized AI architectures, improved memory bandwidth, and better chip design.

AMD, Google (TPU), Cerebras, Graphcore, and others compete in the accelerator market. Competition drives innovation. Over 10 years, expect 10-30x improvement in training efficiency from hardware alone.

Neuromorphic approaches. IBM’s NorthPole and Intel’s Loihi represent a different paradigm: compute co-located with memory, organized more like a brain than a conventional von Neumann machine. These chips achieve 10-25x better energy efficiency than GPUs for specific workloads, primarily inference. Training on neuromorphic hardware remains experimental, but if successful, it could deliver another 10-100x efficiency gain.

Memory technology. HBM3 is the current standard for high-bandwidth memory. HBM4 is in development, promising higher capacity and lower energy per access. Memristors, which store weights in resistance states, remain in the lab but show potential for orders-of-magnitude improvement in energy efficiency. Whether memristors transition from research to production within 10 years is uncertain, but the trajectory is promising.

Conservative projection: Hardware improvements deliver 10-30x training cost reduction over 10 years.

Compounding training efficiency

Algorithmic and hardware gains multiply. Conservative estimates: \(10 \times 10 = 100\)x reduction in training cost over 10 years. Optimistic estimates: \(50 \times 30 = 1500\)x reduction.

What costs $100M today might cost $1M in 10 years, or potentially as little as $100K in the optimistic case. This does not extend the capability frontier (stagnation still applies), but it makes reaching the frontier far more accessible. More organizations can afford to train frontier models. Fine-tuning becomes economically feasible for domain-specific applications. The centralization of frontier model development begins to erode.

Inference cost collapse

Training happens once. Inference happens billions of times. Inference efficiency is where the revolution actually occurs.

The inference bottleneck

Inference cost in transformers is dominated by three factors:

  1. Memory bandwidth. Moving weights from memory to compute units costs energy and time. For large models, memory access dominates arithmetic operations.

  2. Precision overhead. Models are typically trained and stored with 16-bit floating-point weights. That much precision is unnecessary for inference, but it remains the default because training requires it.

  3. Quadratic attention. Transformer attention scales as \(O(n^2)\) in sequence length. Long contexts become prohibitively expensive.

Each bottleneck has solutions.
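
To put rough numbers on the first and third bottlenecks before turning to the solutions: the model size, width, depth, and context length below are illustrative assumptions, not any particular deployment.

    # Back-of-envelope inference costs per generated token (illustrative sizes, batch size 1).
    n_params = 70e9          # hypothetical dense model
    bytes_per_weight = 2     # 16-bit weights
    d_model = 8192
    n_layers = 80
    seq_len = 8192           # context length

    # 1. Memory bandwidth: every weight is read once for each generated token.
    weight_traffic_gb = n_params * bytes_per_weight / 1e9

    # 3. Quadratic attention: each layer compares the new token against every cached token
    #    (query-key scores plus the weighted sum of values, counting multiply and add),
    #    so per-token cost grows with context and prompt processing grows as O(n^2).
    per_token_attn_flops = n_layers * 4 * seq_len * d_model
    prefill_attn_flops = per_token_attn_flops * seq_len / 2

    print(f"~{weight_traffic_gb:.0f} GB of weights read for every generated token")
    print(f"~{per_token_attn_flops:.1e} attention FLOPs per token at a {seq_len}-token context")
    print(f"~{prefill_attn_flops:.1e} attention FLOPs to process the prompt")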

Quantization: reducing precision

Weights trained at 16-bit precision retain most of their functionality when compressed to lower precision. This is quantization: representing weights with fewer bits.

8-bit quantization: Reduces memory footprint and bandwidth by 2x. Quality loss is minimal for most tasks. This is already standard in production deployments.

4-bit quantization: Reduces memory by 4x. Quality degradation is measurable but acceptable for many applications. Recent methods (GPTQ, AWQ) achieve surprisingly good 4-bit performance.

2-bit quantization: Reduces memory by 8x. Quality loss becomes significant, but the model remains functional for simpler tasks.

1-bit (binary) quantization: Each weight is +1 or -1. Reduces memory by 16x. Recent work (BitNet) demonstrates that carefully trained 1-bit models retain substantial capability, though below full-precision frontiers.

Quantization is not free. Lower precision reduces quality. But the tradeoff is favorable: a 4-bit quantized GPT-4 might perform at 90-95% of full quality while running at 4x lower cost. For most applications, this is acceptable.
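
A minimal round-to-nearest sketch of the idea; the group size and random weights are illustrative, and production methods such as GPTQ and AWQ are considerably more sophisticated:

    import numpy as np

    def quantize_dequantize(w, bits):
        """Symmetric round-to-nearest quantization of one weight group."""
        levels = 2 ** (bits - 1) - 1                    # e.g. 127 for 8-bit, 7 for 4-bit
        scale = np.abs(w).max() / levels
        q = np.clip(np.round(w / scale), -levels, levels)
        return q * scale                                # dequantized approximation

    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.02, size=4096)               # one illustrative weight group
    for bits in (8, 4, 2):
        err = np.abs(quantize_dequantize(w, bits) - w).mean()
        print(f"{bits}-bit: {16 // bits}x smaller than 16-bit, mean abs error {err:.5f}")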

Projection: Quantization delivers 4-16x inference cost reduction with acceptable quality loss.

Sparsity: activating less

Most neural network activations are near zero. Sparse models explicitly zero out connections, activating only a fraction of the network for any given input. Mixture of experts is one form of sparsity. Magnitude-based pruning is another.

Sparse models reduce computation proportionally to sparsity. A 90% sparse model uses 10% of the compute of a dense model. The challenge is maintaining quality during pruning. Careful techniques (gradual pruning during training, structured sparsity that matches hardware) achieve high sparsity with minimal quality loss.
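
A minimal magnitude-pruning sketch; the weight matrix and the 90% target are illustrative, and real pipelines prune gradually during training and use structured patterns that hardware can exploit:

    import numpy as np

    def magnitude_prune(w, sparsity=0.9):
        """Zero out the smallest-magnitude weights; keep the largest (1 - sparsity) fraction."""
        threshold = np.quantile(np.abs(w), sparsity)
        mask = np.abs(w) >= threshold
        return w * mask, mask

    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.02, size=(1024, 1024))
    pruned, mask = magnitude_prune(w, sparsity=0.9)
    print(f"{mask.mean():.1%} of weights remain")       # ~10%: proportional compute savings on sparse hardware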

Projection: Sparsity delivers 5-10x inference cost reduction.

Architectural efficiency

Transformers are not the final word in neural architectures. Alternatives that reduce attention cost are under active development.

State space models: Mamba, Hyena, and related models replace quadratic attention with linear-time updates. Early results suggest they match transformer quality on long-context tasks while using far less compute. If this holds at frontier scale, state space models could deliver 10-100x reduction in inference cost for long sequences.
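
A toy version of the contrast: the diagonal linear recurrence below (a heavy simplification of models like Mamba) folds the entire history into a fixed-size state, so cost per token stays constant instead of growing with context length:

    import numpy as np

    rng = np.random.default_rng(0)
    d_state, d_model, seq_len = 16, 8, 1000             # illustrative sizes

    # Diagonal linear recurrence: h_t = a * h_{t-1} + B x_t, y_t = C h_t.
    a = np.full(d_state, 0.9)                           # per-dimension decay
    B = rng.normal(size=(d_state, d_model)) * 0.1
    C = rng.normal(size=(d_model, d_state)) * 0.1

    def ssm_scan(xs):
        h = np.zeros(d_state)
        ys = []
        for x in xs:                                    # one O(d) update per token: linear in seq_len
            h = a * h + B @ x                           # history compressed into a fixed-size state
            ys.append(C @ h)
        return np.stack(ys)

    ys = ssm_scan(rng.normal(size=(seq_len, d_model)))
    print(ys.shape)                                     # (1000, 8); no n x n attention matrix is ever built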

Mixture of experts: Route inputs through specialized subnetworks. Only a fraction of total parameters activate per token, reducing effective compute.

Local attention: Not every token needs to attend to every other token. Local attention (attending only to nearby tokens) plus occasional global attention (attending to key tokens) reduces quadratic cost while preserving most capability.
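
A sketch of the masking idea; the window size and the choice of global tokens are illustrative assumptions:

    import numpy as np

    def local_attention_mask(seq_len, window=4, global_tokens=(0,)):
        """True where attention is allowed: a sliding local window plus a few global tokens."""
        i = np.arange(seq_len)[:, None]
        j = np.arange(seq_len)[None, :]
        local = np.abs(i - j) <= window                 # each token sees its neighbours
        glob = np.isin(j, global_tokens) | np.isin(i, global_tokens)
        return local | glob

    mask = local_attention_mask(seq_len=512, window=4)
    dense_pairs = mask.size                             # 512 * 512 for full attention
    kept_pairs = mask.sum()
    print(f"{kept_pairs / dense_pairs:.1%} of the full attention pairs are computed")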

Projection: Architectural improvements deliver 5-20x inference cost reduction over 10 years.

Compounding inference efficiency

Again, the gains multiply. Conservative: \(4 \times 5 \times 5 = 100\)x. Optimistic: \(16 \times 10 \times 20 = 3200\)x.

What costs $0.05 per inference today might cost $0.0005 in 10 years (conservative) or roughly $0.000016 (optimistic). Near-zero marginal cost. Trillions of inferences become economically feasible.

From datacenter to device

When inference cost drops 100-1000x, models that today require datacenter infrastructure become runnable on consumer hardware.

Timeline projection:

2026: GPT-3.5 equivalent (175B parameters, 8-bit quantized, sparse) runs on high-end laptops (64GB RAM, integrated GPU). Inference is slow (seconds per response) but functional. Privacy-conscious users adopt local models for sensitive queries.

2028: GPT-4 equivalent (1T parameters, 4-bit quantized, sparse) runs on high-end laptops. Inference speed approaches real-time (100ms per token). Smartphones begin running GPT-3.5 equivalent models. Local-first AI becomes mainstream.

2030: GPT-4 equivalent runs on mid-range laptops and high-end smartphones. Inference is fast (10-50ms per token). Edge devices (tablets, smart glasses) run GPT-3.5 equivalent models. Network dependence for AI evaporates.

2035: Frontier-quality models (GPT-4 or better) run on all consumer devices. Watches, glasses, earbuds, cars. Inference is near-instantaneous (1-10ms per token). Ambient intelligence: AI is everywhere, always available, offline-capable, private.

This is not science fiction. It is the compounding of demonstrated efficiency trends. The physics allows it. The engineering trajectory points toward it. The economic incentives demand it.

The neuromorphic wildcard

Everything discussed so far assumes digital von Neumann architecture. But biology achieves far greater efficiency with analog computation and co-located memory. What if silicon could capture some of biology’s efficiency advantages?

Neuromorphic inference: IBM reports that NorthPole is roughly 25x more energy-efficient than comparable GPUs on some inference benchmarks. Intel’s Loihi 2 demonstrates event-driven spiking networks with minimal idle power. These are inference accelerators, not training platforms, but for frozen models, that is sufficient.

Challenges remain. Neuromorphic chips are limited in capacity (NorthPole holds ~6B parameters). Scaling to frontier model sizes requires multi-chip systems, which reintroduce communication overhead. Programming models are immature. But the physics is favorable: analog computation operates closer to the Landauer limit.

Projection (speculative): If neuromorphic inference matures, it delivers an additional 10-100x efficiency gain over digital inference. This is less certain than the other projections but physically plausible.

The compounding multipliers

Let us be conservative. Over 10-15 years:

  • Training efficiency: 100x (algorithmic + hardware)
  • Quantization: 4x (8-bit standard, 4-bit for edge)
  • Sparsity: 5x (structured pruning)
  • Architecture: 5x (incremental improvements)
  • Total inference: \(4 \times 5 \times 5 = 100\)x

Conservative total: 100x training, 100x inference. A model that cost $100M to train and $0.05 per inference in 2024 costs $1M to train and $0.0005 per inference in 2035.

Optimistic scenario:

  • Training efficiency: 1500x
  • Quantization: 8x (4-bit standard, 2-bit for edge)
  • Sparsity: 10x
  • Architecture: 10x
  • Neuromorphic: 10x (speculative)
  • Total inference: \(8 \times 10 \times 10 \times 10 = 8000\)x

Optimistic total: 1500x training, 8000x inference. A model that cost $100M to train and $0.05 per inference costs $67K to train and $0.000006 per inference.
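
The arithmetic behind both scenarios, spelled out; the multipliers are this chapter's estimates, not measurements:

    # Compounding the chapter's estimated multipliers against 2024 baselines of
    # $100M per training run and $0.05 per inference.
    scenarios = {
        "conservative": {"training": 10 * 10, "inference": 4 * 5 * 5},          # 100x, 100x
        "optimistic":   {"training": 50 * 30, "inference": 8 * 10 * 10 * 10},   # 1500x, 8000x
    }
    for name, s in scenarios.items():
        train_cost = 100e6 / s["training"]
        infer_cost = 0.05 / s["inference"]
        print(f"{name}: ${train_cost:,.0f} per training run, ${infer_cost:.6f} per inference")
    # conservative: $1,000,000 and $0.000500; optimistic: $66,667 and $0.000006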

Even the conservative case puts frontier-quality models on laptops. The optimistic case puts them on watches.

This is the realistic horizon: not AGI, but stagnated frontier capabilities becoming universally accessible through efficiency gains rather than capability scaling.

Chapter summary

  • While frontier capability stagnates (Chapter 7), efficiency improvements will continue exponentially over the next 10-15 years
  • Training efficiency gains: 10-50x from algorithms (better architectures, optimizers, curriculum learning, distillation), 10-30x from hardware (next-gen accelerators, neuromorphic approaches), conservative total 100x
  • Inference efficiency gains: 4-16x from quantization (8-bit standard, 4-bit for edge), 5-10x from sparsity, 5-20x from architectural improvements, conservative total 100x, optimistic 3200x
  • From datacenter to device timeline: GPT-3.5 on laptops by 2026, GPT-4 on laptops by 2028, GPT-4 on smartphones by 2030, frontier quality on all devices by 2035
  • Neuromorphic wildcard: IBM NorthPole achieves 25x efficiency over GPUs for inference; if scaled to frontier models, could deliver additional 10-100x efficiency
  • Compounding multipliers: conservative case (100x training, 100x inference) puts GPT-4 on laptops; optimistic case (1500x training, 8000x inference) puts GPT-4 on watches
  • The efficiency revolution delivers what capability scaling cannot: stagnated frontier quality becomes accessible to everyone at near-zero marginal cost