Scaling Searle's Chinese Room For Large Language Models

Today I'm borrowing a 46-year-old philosophical idea and applying it to contemporary generative AI. I’m curious what happens to the idea when it is applied to AI that, unlike the systems speculated about in Searle’s time, has actually been built.

John Searle devised his famous Chinese Room thought experiment in response to what he called strong AI. This wasn’t the human-like or superhuman artificial general intelligence (AGI) that tech CEOs rave about, but the claim that a machine that behaves as though it thinks really does think. One reason the claim was entertained at the time was that the frontier of artificial intelligence was symbolic rather than numerical, meaning that systems were structured according to formal grammar and logic.[1] A scholar who thinks about the rules underpinning language all day may be attracted to the idea of a rule-based system that itself thinks. It’s less compelling to imagine that a concatenation of matrix multiplications and other computations understands the sequences of tokens it builds.

I won’t go into detail about the thought experiment, though I’d encourage you to read it in the original article. The basic idea is that someone can produce convincing text-based answers to questions in a language they know nothing about, without understanding either the questions or the answers. All they need is an exhaustive set of formal rules, written in a language they do know, for how to respond to sequences of symbols that are meaningless to them.

Because software that lived up to Searle’s room was never actually built,[2] the thought experiment benefits from an undefined scale. It is plausible that a single person could follow the potentially vast set of rules only because no one ever formalised them. An LLM’s rules, by contrast, can be counted. Given the number of computations involved in frontier LLM inference and the limited speed at which people can perform arithmetic, the person in the room would need to live for around 380,000 years to generate a single token in a conversation with a chatbot.[3] Since human lifetimes are rather shorter, the computation would have to be carried out by a deeply rigid and dogmatic civilisation organised around the purpose of ingesting and producing holy numbers. Even such a dystopia, similar in size to 19th-century England and communicating by telegraph, would produce only about five tokens a year.[4]

I'm not the first to scale this idea up to population size. Ned Block got there before Searle even wrote his paper: his 1978 “China Brain” imagined the entire population of China, linked by radio, simulating a brain neuron by neuron. Searle himself later answered the connectionists with a “Chinese Gym”: a hall of monolingual English speakers hand-cranking a neural network. So the move I'm making here is a well-worn one. What's changed is that the machine on the other side of the analogy now actually exists, which means the scale is no longer a matter of taste: we can put numbers on it.

Let's circle back to Searle's point about incomprehension. I'll try to sketch it out. The scribes or human computers carrying out the calculation wouldn't know what the numbers meant or what they corresponded to. Those privileged few who could access the vaults where the consecrated weight numbers and numbers for divined and generated tokens were stored would not know what symbols the numbers mapped to. The high-ranking bureaucrats who organised inference computation would blindly follow a kind of bureaucratic scripture: their command and control infrastructure operating in the spirit of the Soviet factories that, graded on tonnage, produced chandeliers heavy enough to tear down ceilings, and nail plants that met their quotas with a handful of railway-spike-sized nails.

Scientists caught researching the inference process or building mechanical computers would be publicly executed as heretics. Those who pursued their research in secret would live lonely lives, questioning the consistency of the system they were enmeshed in but lacking a solid basis for any conclusions. Even during the great conjunctions that occur once a generation, the seers who divined new input token numbers from the entrails of sacrificial animals would not know what the numbers meant.

So there it is: an updated Chinese Room, in all its absurdity. It works better as an illustration of just how big large language models are than as a thought experiment. Searle's room has the advantage of not being a closed system: questions and answers can simply pass under the door. My civilisation has to smuggle its inputs in through an oracle: the divination.


1. This is now known as GOFAI: good old-fashioned AI, a term coined by the philosopher John Haugeland in 1985. Think of SHRDLU, which parsed typed English commands to manipulate a virtual world of coloured blocks; MYCIN, which diagnosed blood infections by chaining through roughly 600 hand-written if-then rules; or Cyc, a decades-long attempt to hand-code common sense as millions of logical assertions. Or ELIZA, the 1966 chatbot that parodied a Rogerian psychotherapist using simple pattern-matching rules, fooling some users into feeling understood, which is precisely the illusion Searle was attacking.

2. The kind of symbolic AI Searle was talking about didn’t make it past R&D, let alone into commercial deployment on a planetary scale, as generative AI has today. Despite some early hype, research into expert systems and the like was eventually abandoned, leading to the second AI winter, which began in the late 1980s.

3. GPT-4.5, which passed the Turing test most of the time in a 2025 study, is thought to have had hundreds of billions of parameters. OpenAI never disclosed a figure, so as an educated guess let’s say 600 billion active parameters. Assuming two computations per parameter to generate a token, and a scribe who manages 0.1 arithmetic operations per second, a single token would take an immortal, tireless human at least 380,000 years. Assuming the 120 layers used in the diagram below, a single layer of the transformer that drove GPT-4.5 would take 3,169 years.
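
For anyone who wants to check the scribe arithmetic, here is a minimal sketch. Every input is a guess from this footnote, apart from the layer count, which I've borrowed from the diagram further down. Reading "two computations" as one multiply and one add is my own interpretation, and the constant names are just labels I made up.

```python
# Back-of-envelope check of the single-scribe estimate.
# None of these figures come from OpenAI; they are the post's assumptions.

ACTIVE_PARAMS = 600e9         # guessed active parameters
OPS_PER_PARAM = 2             # assumed: one multiply and one add per parameter
SCRIBE_OPS_PER_SEC = 0.1      # one arithmetic operation every ten seconds
LAYERS = 120                  # taken from the diagram, also a guess
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def scribe_years(ops: float, ops_per_sec: float = SCRIBE_OPS_PER_SEC) -> float:
    """Years one tireless, immortal scribe needs to perform `ops` operations."""
    return ops / ops_per_sec / SECONDS_PER_YEAR


token_years = scribe_years(ACTIVE_PARAMS * OPS_PER_PARAM)
layer_years = token_years / LAYERS  # treats every layer as equally expensive

if __name__ == "__main__":
    print(f"one token: {token_years:,.0f} years")
    print(f"one layer: {layer_years:,.0f} years")
```

Splitting the total evenly across layers ignores the embedding and output steps, so the per-layer figure is only as good as that simplification.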

4. A significant share of the computation a transformer does is embarrassingly parallel: the vanilla feed-forward neural network that follows the self-attention magic. This could be parallelised across a galaxy if the speed of light didn’t cancel out any parallel gains with a huge communication cost. Barring a mega-structure like a Dyson sphere, even the distances signals would need to travel within a solar system for each layer of a transformer would be prohibitive. Within a single small landmass, however, batches of computation can be sent by telegraph to outlying population centres very quickly. A core city of 1 million scribes working around the clock in two 12-hour shifts, with 10 million more spread across the land, could generate a GPT-4.5 token in just over two months. See the diagram below for an illustration of this. See also this website for a useful illustration of how transformers and LLMs work.

=====================================================================================

                    O N E   T O K E N   B Y   T E L E G R A P H

             the forward pass at city scale: a capital, its towns, a wire

=====================================================================================

  model      600 B parameters, 120 sequential layers, d_model 20,412
  payload    one hidden state = 20,412 values x 16 bit  ~=  327 kbit
  the wire   100 bit/s telegraph, signals at 0.7c -- the 300 km costs a
             millisecond and a half of flight and hours of keying; at this
             scale bandwidth is the tyrant, never the distance


         _
        /_\       +
   +   |(+)|     _|_
  /_\  | : |    /   \      +                ~ 300 km of wire ~
  | |  | : |   /     \    /_\   ~        ,       ,       ,       ,          +
  | |  | : |  /       \   | |  ~        -+-------+-------+-------+-        /_\
 _| |_ | : | |         |  | |   _  _     |       |       |       |   /\    | |   /\
 | : | | : | |  :: ::  | _| |_|    |     |       |       |       |  /__\  _| |_ /__\
 | : | | : | |  :: ::  | | : || [] |     |       |       |       |  |[]|  | : | |[]|
_|___|_|___|_|_________|_|___||____|_____|_______|_______|_______|__|__|__|___|_|__|_

            C A P I T A L                     the telegraph            T O W N S
       1,000,000 core scribes         100 bit/s, signals at 0.7c   10,000,000 scribes


-------------------------------------------------------------------------------------
                     ONE LAYER  --  k of 120, strictly in order
-------------------------------------------------------------------------------------

  clock         CAPITAL (core)                             TOWNS (periphery)
            1e6 scribes, 10 km wide                     1e7 scribes, 300 km out
    .                  |                                           |
 +0.0 h    +-----------+-----------+                               |
    .      | SELF-ATTENTION        |                               |
    .      | half the city on duty |                               |
    .      | (two 12-hour shifts)  |                               |
    .      | core only: every desk |                               |
    .      | may read the whole    |                               |
    .      | token census          |                               |
    .      |  9.4 h arithmetic     |                               |
    .      | +1.8 h keying across  |                               |
    .      |  the city's own wires |                               |
+11.3 h    +-----------+-----------+                               |
    .                  |                                           |
    .                  |          MLP inputs for layer k           |
+11.3 h                |------- 327 kbit ~ 0.9 h of keying ------->|
    .                  |       (300 km of light-lag: 1.4 ms)       |
    .                  |                                           |
+12.2 h    +-----------+-----------+                 +-------------+-------------+
    .      | MLP: capital's share  |                 | MLP FEED-FORWARD          |
    .      | the on-duty half joins|     .  .  .     | ten million desks, each   |
    .      | 0.9 h                 |     in step     | on its own slice; no town |
    .      |                       |     .  .  .     | consults another  (0.9 h) |
+13.0 h    +-----------+-----------+                 +-------------+-------------+
    .                  |                                           |
    .                  |          MLP outputs for layer k          |
+13.0 h                |<------ 327 kbit ~ 0.9 h of keying --------|
    .                  |                                           |
+13.9 h    +-----------+-----------+                               |
    .      | residual add + norm   |                               |
    .      | file the layer record |                               |
    .      | (negligible)          |                               |
    .      +-----------+-----------+                               |
    .                  |                                           |
    v                  v                                           v
             hidden state to layer                       towns sit idle until
             k+1: back to the top                      the next dispatch keys in


-------------------------------------------------------------------------------------
  THE BILL
-------------------------------------------------------------------------------------

  one layer     9.4 + 1.8 + 0.9 + 0.9 + 0.9   =   13.9 hours
  one token     13.95 h  x  120 layers        =   69.7 days
  the split     74% labour / 26% communication

  the capital keys and computes around the clock: two 12-hour shifts, half a
  million desks always lit. the towns need no relief -- each works 0.9 h in
  every 13.9, and the wire itself schedules their rest.

  every communication hour is serialization -- a fist tapping 327,000 bits
  onto the wire at 100 per second. propagation never shows up on the bill:
  the light-lag budget for the whole token is about a third of a second.

=====================================================================================
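
The bill in the diagram can be roughly re-derived from the diagram's own inputs. The sketch below recomputes the payload size, the keying time per dispatch, the light-lag and the totals. The labour hours are copied from the diagram rather than derived, and the parameter-count check uses a common rule of thumb for dense transformers (about 12 x d_model^2 weights per layer, ignoring embeddings), which is my assumption and not something the diagram states. Because the inputs are rounded, the total lands at about 69.6 days rather than the diagram's 69.7.

```python
# Sketch: re-derive the telegraph bill from the diagram's stated inputs.
# Labour hours are copied from the diagram, not computed here.

D_MODEL = 20_412              # hidden-state width
BITS_PER_VALUE = 16
LAYERS = 120
WIRE_BITS_PER_SEC = 100       # telegraph keying rate
DISTANCE_M = 300_000          # capital to towns
SIGNAL_SPEED_M_S = 0.7 * 299_792_458

ATTENTION_LABOUR_H = 9.4      # capital only
CITY_KEYING_H = 1.8           # keying across the capital's own wires
MLP_LABOUR_H = 0.9            # capital and towns working in step

# Assumption (rule of thumb, not from the diagram): ~12 * d_model^2 weights
# per layer for a dense transformer, embeddings ignored.
approx_params = 12 * D_MODEL**2 * LAYERS

payload_bits = D_MODEL * BITS_PER_VALUE                 # one hidden state
dispatch_h = payload_bits / WIRE_BITS_PER_SEC / 3600    # one-way keying time
light_lag_s = DISTANCE_M / SIGNAL_SPEED_M_S             # one-way propagation

labour_h = ATTENTION_LABOUR_H + MLP_LABOUR_H
comms_h = CITY_KEYING_H + 2 * dispatch_h                # out to towns and back
layer_h = labour_h + comms_h
token_days = layer_h * LAYERS / 24
labour_share = labour_h / layer_h
tokens_per_year = 365.25 / token_days
total_light_lag_s = 2 * LAYERS * light_lag_s            # whole-token budget

if __name__ == "__main__":
    print(f"approx parameters : {approx_params / 1e9:.1f} B")
    print(f"payload           : {payload_bits / 1e3:.0f} kbit")
    print(f"keying, one way   : {dispatch_h:.2f} h")
    print(f"light-lag, one way: {light_lag_s * 1e3:.2f} ms")
    print(f"one layer         : {layer_h:.1f} h")
    print(f"one token         : {token_days:.1f} days")
    print(f"labour share      : {labour_share:.0%}")
    print(f"tokens per year   : {tokens_per_year:.1f}")
    print(f"total light-lag   : {total_light_lag_s:.2f} s")
```

Changing WIRE_BITS_PER_SEC or DISTANCE_M shows the diagram's point directly: the token time moves with the keying rate and barely at all with the distance.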