The sensory bandwidth gap

“We can know more than we can tell.” — Michael Polanyi

The sheer number of learning instances is only half the story. The other half is what those instances were computed over: the training data, to borrow the machine learning framing. This chapter examines the fundamental difference between text and embodied experience.

Nature’s training data

Every organism that contributed to evolution’s computation was embedded in a physical environment. It did not read about sunlight; it photosynthesized or basked in it. It did not process text descriptions of predators; it heard them, smelled them, ran from them, and sometimes was eaten by them. The “training signal” was not a loss function on token prediction. It was survival and reproduction, evaluated against the full sensory bandwidth of embodied existence.

We can now quantify how much bandwidth that actually is. Zheng and Meister estimated in their 2024 analysis in Neuron that the human sensory periphery transmits approximately \(10^9\) bits per second, roughly one gigabit per second, dominated by the optic nerve but with substantial contributions from auditory, somatosensory, proprioceptive, and vestibular channels. Of this torrent, conscious experience processes roughly 10 bits per second. The compression ratio from raw sensation to conscious awareness is on the order of \(10^8\) to one.
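
Stated as a single ratio, using the figures above:

\[\frac{10^9 \text{ bits/s (sensory periphery)}}{10 \text{ bits/s (conscious throughput)}} = 10^8\]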

Over a human lifetime of roughly 80 years, with about 16 waking hours per day, the total raw sensory input amounts to:

\[80 \times 365 \times 16 \times 3600 \times 10^9 \approx 1.7 \times 10^{18} \text{ bits}\]

Now consider text. The entire written output of human civilization, from Sumerian cuneiform to the modern internet, has been estimated at roughly \(3 \times 10^{18}\) bits (including all digitized books, all web pages, all archived documents). This is a generous upper bound; the high-quality subset that language models actually train on is far smaller. The comparison is devastating: all the text humanity has ever produced contains roughly the same quantity of raw information as a single human lifetime of sensory experience. The entire written record of civilization, 5,000 years of accumulated thought, fits inside one pair of eyes.
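
For readers who want to reproduce the arithmetic, the back-of-envelope sketch below restates it in code. The constants are the rough estimates quoted in this chapter, not measurements, and the names are illustrative.

```python
# Back-of-envelope comparison: one lifetime of sensory input vs. all human text.
# All constants are the rough estimates quoted in this chapter.

SENSORY_BITS_PER_SECOND = 1e9   # peripheral sensory bandwidth (Zheng & Meister estimate)
WAKING_HOURS_PER_DAY = 16
LIFESPAN_YEARS = 80
ALL_HUMAN_TEXT_BITS = 3e18      # generous upper bound on all digitized text

waking_seconds = LIFESPAN_YEARS * 365 * WAKING_HOURS_PER_DAY * 3600
lifetime_sensory_bits = waking_seconds * SENSORY_BITS_PER_SECOND

print(f"Lifetime sensory input: {lifetime_sensory_bits:.2e} bits")  # ~1.7e18
print(f"All human text:         {ALL_HUMAN_TEXT_BITS:.2e} bits")    # ~3e18
print(f"Text / one lifetime:    {ALL_HUMAN_TEXT_BITS / lifetime_sensory_bits:.1f}x")
```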

Information quantity vs. information content

But information quantity is not the same as information content. Text is not merely a smaller quantity of the same substance. It is a fundamentally different kind of signal: a lossy compression of experience into symbols, with the vast majority of the original information discarded. When we read “the coffee was hot,” we bring to that sentence a lifetime of thermal experience that the sentence itself does not contain. A language model processes the tokens. A human recalls the burn. The word “hot” in a corpus is a pointer to an experience that the corpus cannot store.

This is not merely a philosophical observation. Polanyi formalized the problem in 1966: “we can know more than we can tell.” The domain of tacit knowledge (skill, intuition, perceptual judgment, embodied understanding) is not a small residual left over after we articulate what we know. It is the majority of what we know. The knowledge management literature consistently estimates that 70 to 80 percent of organizational knowledge is tacit: non-verbalizable, non-transferable through text. Autor brought this into economics in 2014 as “Polanyi’s paradox,” demonstrating that the tasks most resistant to automation are precisely those that rely on tacit knowledge, because we cannot write down rules for what we cannot articulate.

Language captures the 10 bits per second that survive the compression into conscious, articulable thought. It does not capture the \(10^9\) bits per second of raw sensation from which that thought was distilled. A corpus trained on text is trained on the \(10^{-8}\) fraction of experience that made it through the bottleneck of articulation.

The grounding problem

This distinction matters for a precise reason. The learning instances we counted were not performed over tokens. They were performed over the full sensory bandwidth of embodied organisms interacting with physics, or in the microbial case, over the direct chemical and thermal realities of survival. If we want to claim that a system trained on text can match the output of this process, we need a theory of how lossy compression of experience into language preserves the adaptive information that the original experience carried. No such theory exists.

And there is a further consequence: even on a step-for-step basis, the comparison flatters AI. Each biological learning instance involves a whole organism perceiving and acting in a physical environment across its full sensory bandwidth. Each gradient step in a language model processes a batch of text tokens. The informational richness per step is not comparable. If anything, counting one gradient update as equivalent to one learning instance is generous to silicon.

The multimodal response

The obvious objection: multimodal models that process images, video, and audio are closing this gap. They are no longer text-only; they observe the world through vision and sound.

But observation is not interaction. A model that watches a video of fire has not been burned. A model that processes images of food has never been hungry. The difference between passive observation and embodied experience is not merely one of bandwidth; it is one of stakes. Organisms learn because failure has consequences: starvation, predation, reproductive failure, death. The training signal is not mean squared error on pixel prediction. It is survival.

Text bandwidth vs. sensory bandwidth: Human language communicates at roughly 40 bits per second, the rate of controlled speech articulation. Human sensory input runs at roughly \(10^9\) bits per second. Text therefore represents a compression of approximately \(2.5 \times 10^7\) to one. Even with multimodal data added, video at typical compression rates delivers perhaps \(10^6\) bits per second: still three orders of magnitude below raw sensory experience, and with no physical consequences tied to the learning signal.
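
Using the figures in the paragraph above:

\[\frac{10^9 \text{ bits/s (raw sensation)}}{40 \text{ bits/s (speech)}} = 2.5 \times 10^7, \qquad \frac{10^9 \text{ bits/s}}{10^6 \text{ bits/s (compressed video)}} = 10^3\]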

What is lost in compression?

The share of experience that never makes it into text (well over 99.9999 percent, by the arithmetic above) or even into video (roughly 99.9 percent) is not random noise. It is the substrate from which understanding emerges. The weight of an object, the texture of a surface, the proprioceptive feedback from muscle tension, the thermal sensation of temperature, the vestibular sense of balance, the olfactory landscape of a physical space: these are not decorative details. They are the grounding for concepts that language can only name, not convey.

When we say a language model “understands” physics because it can solve physics problems stated in text, we are using “understand” in a sense that would be unrecognizable to a physicist who has spent years in a laboratory, manipulating physical systems, observing outcomes, developing intuition through embodied interaction. The model has learned the symbol manipulation rules of physics. Whether it has learned physics is the question this book exists to explore.

Tacit knowledge that text cannot capture

The most revealing examples of the sensory bandwidth gap are skills that even young children possess but that no amount of text can convey.

Riding a bicycle. Ask any cyclist to explain how they balance. The answer will be vague: “You just lean into it,” “You feel when you’re tipping,” “Your body knows what to do.” This is not evasiveness. It is the honest acknowledgment that the knowledge is tacit. The cerebellum and motor cortex maintain a control loop involving vestibular input (balance), proprioceptive feedback (body position), visual flow (velocity), and predictive models of dynamics. This loop operates at millisecond timescales below conscious awareness. Reading a thousand pages about bicycle physics does not create this control loop. The knowledge is encoded in synaptic weights shaped by thousands of trials, falls, and recoveries.
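
To see how little of that loop survives being written down, consider a deliberately crude sketch: a proportional-derivative controller balancing an inverted-pendulum toy model, a common stand-in for lean dynamics. The model, gains, and time step below are illustrative assumptions, not a description of what the cerebellum does; the point is that even a fully explicit balance rule captures almost none of what the body actually uses.

```python
import math

# Toy inverted-pendulum stand-in for bicycle lean (illustrative assumptions only).
# theta is the lean angle from vertical in radians; a simple PD rule pushes back
# against the lean. Real balancing fuses vestibular, proprioceptive, and visual
# signals in ways this explicit loop does not even gesture at.

G, LENGTH, DT = 9.81, 1.0, 0.001   # gravity, effective pendulum length, 1 ms step
KP, KD = 40.0, 8.0                 # hand-tuned illustrative gains

theta, theta_dot = 0.05, 0.0       # start with a small lean
for _ in range(2000):              # simulate 2 seconds
    control = -KP * theta - KD * theta_dot             # "steer into the fall"
    theta_ddot = (G / LENGTH) * math.sin(theta) + control
    theta_dot += theta_ddot * DT
    theta += theta_dot * DT

print(f"lean after 2 s: {theta:.5f} rad")  # close to zero: the toy model is balanced
```

The controller stabilizes the toy model, and it still says nothing about what it feels like to catch yourself when the front wheel slips on gravel.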

Catching a ball. The problem amounts to solving differential equations in real time: given the ball’s trajectory (which must be estimated from incomplete visual data), predict the interception point and move there. Humans do this effortlessly by age five. The computation happens in visual cortex, parietal cortex, and motor cortex without conscious access. When asked to explain how they catch, subjects say: “I just watch it and move to where it’s going.” This is not an explanation; it is a description of phenomenology. The actual computation, involving optical flow analysis, predictive extrapolation, and motor planning, is entirely tacit.
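
Written down explicitly, the prediction step looks like the sketch below: a toy interception calculation under idealized no-drag, flat-ground assumptions, with made-up numbers. The nervous system does not work this way; it leans on perceptual shortcuts such as the gaze heuristic rather than explicit equations, which is exactly the contrast the example is meant to draw.

```python
import math

# Toy interception under idealized projectile motion (no drag, constant gravity).
# Illustrative assumptions only: this is the explicit, symbolic version of a
# prediction the brain performs tacitly through perceptual heuristics.

G = 9.81  # m/s^2

def interception_point(x0, y0, vx, vy, catch_height=1.0):
    """Return (x, t) where the ball descends through catch_height."""
    # Solve y0 + vy*t - 0.5*G*t^2 = catch_height for the later (descending) root.
    a, b, c = -0.5 * G, vy, y0 - catch_height
    t = (-b - math.sqrt(b * b - 4 * a * c)) / (2 * a)
    return x0 + vx * t, t

x_catch, t_catch = interception_point(x0=0.0, y0=2.0, vx=12.0, vy=8.0)
print(f"run to x = {x_catch:.1f} m and be there within {t_catch:.2f} s")
```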

Judging if ice will hold your weight. This requires integrating visual cues (color, transparency, surface texture), auditory cues (cracking sounds), proprioceptive cues (how the ice flexes underfoot), and contextual knowledge (temperature, wind, time of year). An experienced person makes this judgment in seconds with high confidence. Ask them to articulate their decision process, and they struggle: “It looks solid,” “The color seems right,” “I’ve walked on ice like this before.” These descriptions capture fragments of the input but not the integration process. The judgment is a weighted combination of dozens of features, most of which the person cannot consciously access or verbalize.

Tying shoelaces. This motor skill is learned through repetition until it becomes automatic. Ask someone to describe how they tie their shoes, and they will struggle unless they slow down and consciously monitor their hands. The procedural knowledge is stored in motor cortex and cerebellum, encoded as sequences of muscle activations, not as verbal instructions. You cannot learn to tie shoes by reading instructions alone; you must practice until the motor memory forms.

Estimating object weight from vision. Before lifting an object, humans visually estimate its weight based on size, material, and context. Pick up a box that looks heavy but is empty, and you will apply too much force; your motor system prepared for the visually estimated weight. This mapping from visual features to expected weight is learned through thousands of lifting experiences and is entirely non-verbal. No amount of text describing “metal is denser than wood” creates this perceptual-motor calibration.

Navigating a crowd. Walking through a dense crowd without colliding requires real-time prediction of other people’s trajectories, planning a path, adjusting based on peripheral vision, and coordinating muscle activation to execute the plan. This is continuous sensory-motor integration at millisecond timescales. People do it effortlessly while conversing, thinking about other things, their attention elsewhere. The computation is entirely tacit, operating below the threshold of conscious articulation.

These are not exotic skills. They are everyday embodied intelligence that nearly all humans possess by adulthood. Text rarely describes them in detail because they are difficult to articulate. When text does describe them (instructional manuals, coaching guides), the descriptions are crude approximations. A manual on bicycle riding cannot replace the lived experience of falling and recovering until the sensory-motor loop is calibrated.

A language model trained on text sees the words “riding a bicycle” millions of times. It learns that bicycles have two wheels, that balance is required, that people learn as children. It can answer questions about bicycles and even generate plausible instructions. But it does not have the sensory-motor knowledge that a five-year-old possesses after a weekend of practice. The \(10^9\) bits per second of embodied experience (the falls, the vestibular feedback, the proprioceptive calibration) are not in the training data.

This is the tacit knowledge gap. It is not a small residual left over after articulation. It is the majority of human intelligence: perceptual, procedural, embodied, grounded in physics, and inaccessible to any system trained only on text.

Chapter summary

  • Human sensory bandwidth is approximately \(10^9\) bits per second; conscious articulation captures roughly 10 bits per second, a compression ratio of \(10^8\) to one
  • All human-generated text (books, web, papers) contains roughly \(3 \times 10^{18}\) bits, comparable to a single human lifetime of sensory experience
  • Text is lossy compression of experience into symbols; the word “hot” points to thermal experience the text does not contain
  • Polanyi’s paradox: 70-80% of knowledge is tacit, non-verbalizable, non-transferable through text
  • Biological learning instances operated over full sensory bandwidth of embodied organisms interacting with physics; gradient updates operate over token sequences
  • Multimodal models observe but do not interact; observation is not equivalent to embodied experience with stakes (survival, reproduction)
  • Tacit knowledge examples: riding a bicycle, catching a ball, judging if ice is safe, tying shoelaces, estimating object weight from vision, navigating crowds
  • These are skills even young children possess but that no text can convey; they require lived sensory-motor calibration
  • The training signal for evolution was survival evaluated against physics; the training signal for LLMs is cross-entropy loss on token prediction
  • This is not a quantitative gap that more scale can close; it is a qualitative difference in the substrate of learning