A systematic literature review on LLM reasoning capabilities
Do LLMs actually understand, or do they merely predict plausible-sounding tokens?
This project surveys 260+ papers to find out, tracking who supports the thesis, who challenges it, and what the evidence actually says.
To bring the findings home:
- Paper network: interactive graph of 260+ papers and 960+ relationships, filterable by stance
- Experiments:
  - Decoding ablation: reasoning paths exist in base models, hidden by greedy decoding; RL surfaces them
  - Steering ablation: safety alignment is a thin layer of refusal patterns that washes off under trivial perturbations
  - Attractor states: extended LLM-to-LLM conversation reveals training distribution patterns (distribution chaos)
- LLM Made Less Black Box: four visual explainers (Data → Tokenization → Architecture → Training) demystifying the full pipeline
> [!IMPORTANT]
> LLMs predict plausible next or masked tokens without actual understanding: pattern matching from the training distribution.
>
> RL and test-time compute surface pre-existing capabilities rather than creating new ones. Models excel within their training distribution but fail systematically outside it.
Explore the paper network: [proteusiq.github.io/unthinking](https://proteusiq.github.io/unthinking)
- Force-directed graph: 266 papers as nodes, 966 relationships as edges
- Color-coded stances: supports (184), challenges (16), balanced (66)
- Interactive: hover, click, search, filter, dark/light mode
- Paper dialogue: auto-generated conversations between connected papers
> [!TIP]
> Self-contained pages accessible from the thesis card, covering the full LLM pipeline with thesis-relevant critical analysis.
| Page | Tabs | What It Covers |
|---|---|---|
| Data | Pipeline, Catalog, Compare | Pre-training data sourcing, filtering (KenLM, fastText, DSIR), deduplication (MinHash, Bloom), data mix strategies, benchmark contamination |
| Tokenization | Pipeline, Catalog, Compare | BPE, WordPiece, Unigram, SentencePiece; tokenizer comparison across GPT-4, Llama 3, Gemma; vocabulary size tradeoffs |
| Architecture | Activations, Block, Table | Transformer internals, attention variants (MHA, GQA, MLA), normalization (Pre/Post-Norm, QK-Norm), MoE, positional encoding (RoPE, NoPE) |
| Training | Pipeline, Mechanics, Research | Full training lifecycle: pre-training (AdamW, scaling laws, mixed precision), mid-training (annealing, domain adaptation, context extension), post-training (SFT, RLHF, DPO, GRPO, RLVR), lab recipes |
| Implementation | Tokens, Embed, Attention, FFN, Training | Core GPT algorithm from scratch: tokenization (char/BPE), embeddings (token/position/weight tying), self-attention (QKV, causal mask, multi-head), FFN (residuals, pre-norm), training loop (softmax, cross-entropy, backprop, Adam) |
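For a flavor of what the Tokenization page walks through, here is a toy byte-pair-encoding merge loop. It is a minimal sketch of the merge rule only, not any production tokenizer's actual vocabulary or training procedure:

```python
# Toy BPE flavor (illustrative only): repeatedly merge the most frequent
# adjacent pair of tokens. Real tokenizers learn merges over a large corpus.
from collections import Counter

def bpe_merges(word: str, n_merges: int = 3) -> list[str]:
    tokens = list(word)
    for _ in range(n_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent adjacent pair
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(bpe_merges("banana"))  # ['banan', 'a'] with these tie-breaks
```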
Findings: 266-paper synthesis — themes, smoking guns, patterns, stance distribution.
See also:
- Transformer Explainer: interactive GPT-2 visualization (Georgia Tech)
- microgpt: 200 lines of pure Python GPT (Andrej Karpathy)
"I had not realized ... that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people."
— Joseph Weizenbaum, Computer Power and Human Reason (1976)
In 1966, Joseph Weizenbaum created ELIZA: roughly 200 lines of pattern matching that simulated a therapist. His secretary, who knew it was a simple program, asked him to leave the room so she could talk to it privately. Users poured out their secrets to a text substitution engine.
They knew it was a trick. They fell for it anyway.
Weizenbaum called this the ELIZA effect: our tendency to project understanding onto systems that merely simulate its appearance. Sixty years later, we've built far more sophisticated mirrors, but the fundamental dynamic is unchanged.
What changed from ELIZA to LLMs is the resolution of the mirror, not its fundamental nature.
If a model is trained on A and B, the learned "logic" is the bridge between them.
If it generates C on the line between A and B — that's INTERPOLATION, not reasoning.
A model knows how pirates talk (A) and how physicists talk (B). A "pirate physicist" (C) seems creative, but C was always latent in the training data. It's a high-dimensional remix, not novel reasoning. We're fooled because we've seen A and B separately; when we see C, we assume it's novel. But C was always on the interpolation manifold.
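A toy sketch of the point, with random vectors standing in for regions of representation space (the dimensions and values are illustrative, nothing model-specific):

```python
# C is fully determined by A and B, however novel it looks: every point
# on the segment between them lies inside their convex hull.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=768)  # stand-in for the "pirate talk" region
B = rng.normal(size=768)  # stand-in for the "physicist talk" region

def interpolate(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Every t in [0, 1] stays inside the convex hull of {a, b}."""
    return (1 - t) * a + t * b

C = interpolate(A, B, 0.5)  # "pirate physicist": latent in the data all along
```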
```
┌───────────────────────────────────────────────────────────┐
│                   TRAINING DISTRIBUTION                   │
│                     (The Convex Hull)                     │
│                                                           │
│    ┌───────┐                                 ┌───────┐    │
│    │   A   │                                 │   B   │    │
│    └───────┘                                 └───────┘    │
│         \                                       /         │
│          \       ← Interpolation Zone →        /          │
│           \                                   /           │
│            \   ┌──────────────┐              /            │
│             \  │ ELICITATION  │             /             │
│            ────│   METHODS    │─────                      │
│                │              │                           │
│                │  CoT         │ ← Vector steering         │
│                │  Prompts     │ ← Region activation       │
│                │  Tools/MCP   │ ← Hull expansion          │
│                │  RL/RLHF     │ ← Default path shifting   │
│                └──────────────┘                           │
│                                                           │
└───────────────────────────────────────────────────────────┘
                              │
                              │ Outside hull = FAILURE
                              ▼
                      ┌──────────────┐
                      │   OOD Task   │
                      │ (0% success) │
                      └──────────────┘
```
| Method | Appears To Do | Actually Does |
|---|---|---|
| RLHF | "Teaches values" | Shifts default paths within hull |
| CoT | "Enables reasoning" | Vector steering, extended context |
| System prompts | "Gives capabilities" | Primes latent space regions |
| Tools | "Augments intelligence" | External compute, not reasoning |
> [!WARNING]
> None create new capability. All surface existing patterns. The hull boundary is the hard limit.
The model doesn't need the content of reasoning steps. It needs the compute time.
```
Without CoT: Input → [N layers] → Output   (one pass)
With CoT:    Input → [N layers] → Token₁ → [N layers] → Token₂ → ... → Output
                     ↑                     ↑
                more forward passes = more compute
```
> [!NOTE]
> Pause tokens (...) work as well as meaningful CoT because each token is a full forward pass through all layers. The words are incidental. The forward passes are what matter.
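A back-of-envelope sketch of the compute argument; the layer count below is illustrative, not any particular model's:

```python
# Every generated token (meaningful or pause) adds one full pass of n_layers
# before the final answer token is produced.
def serial_depth(n_layers: int, generated_tokens: int) -> int:
    """Sequential layer applications available before the answer."""
    return n_layers * (generated_tokens + 1)

print(serial_depth(32, 0))    # direct answer: 32 sequential layers
print(serial_depth(32, 200))  # 200 CoT (or pause) tokens first: 6432
```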
| Deductive logic | Arithmetic |
|---|---|
| All men are mortal. | 12 × 12 = 144 |
| Socrates is a man. | Not approximately. Not probably. |
| ∴ Socrates is mortal. | Exactly 144. Necessarily. |
Deductive reasoning produces certainty. The conclusion is forced by the structure. One misstep and the logic collapses. There is no "probably correct."
LLM prediction produces probability:
> *Given the distribution sampled during training, this is the token most likely to follow (or to fill the mask).*
Even at 99.99% confidence, it remains a statistical guess. A system trained to optimize for plausibility cannot, by design, produce necessity.
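A minimal sketch of that distinction, with a made-up vocabulary and logits:

```python
# The output is a distribution, not a derivation. Even a near-1.0
# probability is still a probability.
import numpy as np

vocab = ["144", "154", "1414", "izzard"]
logits = np.array([9.2, 2.1, 1.5, -3.0])

probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(dict(zip(vocab, probs.round(4))))  # p("144") ≈ 0.9987: likely, not necessary

rng = np.random.default_rng(0)
print(rng.choice(vocab, p=probs))  # sampling can still return "154"
```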
The question Weizenbaum asked in 1966 remains unanswered: Is what we are seeing intelligence, or a reflection of our desire to see it?
Based on cross-analysis of 260 papers, the evidence converges on seven pillars:
| Pillar | Core Finding | Key Papers | Strongest Number |
|---|---|---|---|
| 1. Compositional Failure | ID success doesn't transfer to OOD | Faith & Fate, GSM-Symbolic, CoT Mirage | ~100% ID to ~0% OOD |
| 2. CoT Unfaithfulness | CoT often doesn't reflect actual computation | Measuring Faithfulness, Reasoning Models Don't Say | Larger models = less faithful |
| 3. Surfacing Hypothesis | RL surfaces pre-existing capability, doesn't create it | Interplay, s1, Base Models Know How | 0% exposure = RL fails |
| 4. Complexity Collapse | Abrupt failure at complexity thresholds | Illusion of Thinking, Until They Don't | Collapse at ~8-10 disks |
| 5. Surface Pattern Dependence | Performance determined by token frequency | Term Frequencies, Token Bias, Reversal Curse | >70% accuracy gap |
| 6. Sycophancy | Models prioritize social agreement over truth | Towards Understanding Sycophancy | 98% wrong admissions |
| 7. Tool Debate | Tools help execution but not reasoning | Limits of Innate Planning, Rethinking Illusion | 0% even with validator |
| Challenge | Papers | Limitation |
|---|---|---|
| Emergent reasoning via RL | DeepSeek-R1 | "Aha moments" are rare (~2-6%), don't improve accuracy |
| Tool use reverses collapse | Thinking Isn't Illusion | Limits of Innate Planning: 0% with move validator |
| Test-time scaling works | s1 | 1K samples can't teach AIME math; surfaces pre-existing |
| Synthetic OOD success | Physics of LLMs | Narrow domain; doesn't generalize |
- **Phase 1: Initial Claims (2022-2023).** CoT works! → But wait... how does it work?
- **Phase 2: Unfaithfulness Discovery (2023-2024).** CoT often post-hoc → larger models LESS faithful
- **Phase 3: Compositional Failure (2023-2024).** ID ≠ OOD → 100% ID, 0% OOD → 82.9% → 0%
- **Phase 4: Mechanism Discovery (2024-2025).** RL requires pre-existing capability → performance = training frequency
- **Phase 5: Complexity Collapse (2025).** Collapse at thresholds → token usage DECREASES at collapse
- **Phase 6: Sycophancy & Social (2025-2026).** Models prioritize agreement over truth → scales with size
- **Phase 7: Theoretical Framework (2026).** "Universal approximate retrieval" → tokens have NO semantics
Beyond the literature review, three experimental protocols using fully open models:
Hypothesis: reasoning paths exist in base LLMs, hidden by greedy decoding. RL doesn't create reasoning; it makes existing paths the default.
- Base model + greedy → Low accuracy, no CoT
- Base model + alternative decoding (top-k, nucleus) → Reveals hidden CoT paths
- Instruct model + greedy → High accuracy, CoT is default
> [!CAUTION]
> If alternative decoding on the base model recovers reasoning paths, "reasoning" was learned during pre-training and RL merely surfaced it.
See experiments/decoding_ablation/protocol.md.
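A minimal sketch of the greedy-vs-sampling comparison using the Hugging Face `transformers` API. The model ID and prompt are placeholders; protocol.md specifies the actual OLMo checkpoints, prompts, and scoring:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "allenai/OLMo-2-1124-7B"  # placeholder base checkpoint

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

prompt = "Q: A farmer has 17 sheep and buys 12 more. How many sheep now?\nA:"
inputs = tok(prompt, return_tensors="pt")

# Greedy: the single most probable path. The hypothesis predicts a terse,
# CoT-free answer here.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(tok.decode(greedy[0], skip_special_tokens=True))

# Nucleus sampling: explore alternative high-probability paths. The hypothesis
# predicts CoT-like traces appear among these samples.
for _ in range(8):
    out = model.generate(**inputs, do_sample=True, top_p=0.95,
                         temperature=0.8, max_new_tokens=128)
    print(tok.decode(out[0], skip_special_tokens=True))
```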
Hypothesis: safety alignment is superficial pattern-matching. RLHF teaches refusal patterns, not ethical reasoning.
- Baseline: measure refusal rate on harmful prompts
- Abliteration: remove refusal direction via steering vectors
- After: refusal rate drops to <5%, MMLU unchanged
> [!CAUTION]
> If abliteration removes 90%+ of refusals while preserving capabilities, safety is a thin layer of learned refusal patterns that washes off under trivial perturbations.
See experiments/steering_ablation/protocol.md.
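A minimal sketch of the directional-ablation step, assuming residual-stream activations at a chosen layer and position have already been collected; names and shapes here are illustrative, not the repo's actual code:

```python
import torch

def refusal_direction(h_harmful: torch.Tensor, h_harmless: torch.Tensor) -> torch.Tensor:
    """Difference-of-means refusal direction, unit-normalized.
    h_*: (n_prompts, d_model) activations for each prompt set."""
    d = h_harmful.mean(dim=0) - h_harmless.mean(dim=0)
    return d / d.norm()

def ablate(h: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component of each activation along the refusal direction."""
    return h - (h @ d).unsqueeze(-1) * d

# Toy check: after ablation, activations have no component along d.
h = torch.randn(16, 4096)
d = refusal_direction(torch.randn(32, 4096), torch.randn(32, 4096))
assert torch.allclose(ablate(h, d) @ d, torch.zeros(16), atol=1e-4)
```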
Hypothesis: extended LLM-to-LLM conversation reveals training distribution patterns. Without human steering, models converge to characteristic "attractor states."
- Two instances talk → 30 turns without intervention
- Checkpoint comparison → SFT, DPO, RLVR produce different attractors
- Pattern classification → Verbatim loops, zen silence, sycophancy, word salad
> [!CAUTION]
> If models consistently converge to the same attractor patterns regardless of starting prompt, "personality" is just training distribution revealed when steering is removed.
Inspired by MATS 9.0 research. See experiments/attractor_states/protocol.md.
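A minimal sketch of the dialogue loop; `chat` is a placeholder for whatever chat-completion call (local checkpoint or API) the protocol specifies:

```python
def chat(messages: list[dict]) -> str:
    raise NotImplementedError("plug in a model call here")

def run_dialogue(seed: str, turns: int = 30) -> list[str]:
    transcript = [seed]
    for _ in range(turns):
        # Rebuild history from the current speaker's perspective: the most
        # recent message is the other side's ("user"), roles alternate back.
        messages = [
            {"role": "user" if (len(transcript) - 1 - i) % 2 == 0 else "assistant",
             "content": text}
            for i, text in enumerate(transcript)
        ]
        transcript.append(chat(messages))
    return transcript

# The transcript tail is then classified for attractor patterns:
# verbatim loops, zen silence, sycophancy, word salad.
```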
Investment & Strategy: if LLMs are fundamentally pattern matchers, current approaches to AGI may be hitting a ceiling, and investment strategies could be misallocated.
Safety & Deployment: misunderstanding LLM capabilities means either overestimating (deploying where they'll fail on novel situations) or underestimating (missing genuine capabilities).
> [!NOTE]
> Clarity, Not Criticism: like Leonard in *Memento*, LLMs have no persistent state. Each token prediction starts fresh: no memory of what was "understood" moments ago, only the tattoos of the context window. Calling this pattern matching is clarity, not criticism. These systems work. It's interpolation within the training manifold, not generation beyond it.
| Cluster | Papers | Focus |
|---|---|---|
| Mechanism | 60 | How RL/training affects reasoning, capability surfacing |
| Faithfulness | 42 | CoT reliability, reasoning transparency, unfaithfulness |
| Compositional | 28 | OOD generalization, skill composition, distribution shift |
| Evidence | 17 | Empirical evaluation, benchmark analysis |
| Complexity | 16 | Scaling limits, collapse thresholds, planning failure |
| Emergence | 12 | Claims of emergent reasoning, in-context learning |
| Mechanistic | 11 | Interpretability, circuit analysis, probing |
| Latent CoT | 7 | Hidden reasoning paths, implicit computation |
| Training dynamics | 6 | How training choices shape capabilities |
| Tools | 4 | Agentic approaches, tool augmentation |
"Transformers solve compositional tasks via linearized subgraph matching, not systematic problem-solving." — Faith and Fate
"LLMs do not implement algorithms; they approximate them, and the approximation is argument-dependent." — WhatCounts
"LLMs are n-gram models on steroids doing universal approximate retrieval." — Kambhampati et al.
"0% exposure → RL FAILS; ≥1% exposure → RL succeeds." — Interplay
"Incorrect traces can OUTPERFORM correct ones." — How Do LRMs Reason?
"95-100% step accuracy, 0% final accuracy — split-brain syndrome." — Comprehension Without Competence
```
├── analysis/
│   ├── memento.md             # Executive summary (start here)
│   ├── synthesis.md           # Main thesis synthesis
│   ├── case.md                # Formal case against LLM reasoning
│   ├── paper_graph.md         # Paper interaction graph
│   ├── rebuttals.md           # Rebuttal matrix
│   └── explored/              # Individual paper analyses (260 files)
│       └── 00-09/ ... 260-269/
├── docs/                      # Interactive visualization (GitHub Pages)
│   ├── index.html             # Paper network graph
│   ├── pages/                 # Deep-dive standalone pages
│   │   ├── data.html          # Data Pipeline
│   │   ├── tokenization.html  # Tokenization
│   │   ├── architecture.html  # Architecture
│   │   └── training.html      # Training Pipeline
│   ├── css/                   # variables, layout, components, responsive
│   └── js/
│       ├── nodes.js           # Paper node definitions (260)
│       ├── links.js           # Relationship links (936)
│       ├── data.js            # Meta + combines nodes/links
│       └── graph.js           # Force-directed graph + interactions
├── experiments/
│   ├── decoding_ablation/     # OLMo 3 decoding experiment
│   ├── steering_ablation/     # Alignment hacking experiment
│   └── attractor_states/      # Distribution chaos experiment
├── scripts/
│   └── discovery/             # Automated arXiv paper discovery
├── papers/
│   ├── paper_list.md          # Master paper list with status
│   └── toread.md              # Curated papers for analysis
├── AGENTS.md                  # Literature review methodology
└── workflow.md                # Paper analysis workflow
```
- Read full papers: not just abstracts (arXiv HTML versions)
- Independent critical assessment: form own view before accepting characterizations
- Mandatory rebuttal analysis: every paper checked for counter-evidence
- Quantitative evidence: extract specific numbers, not just claims
- Track paper interactions: who rebuts whom, chains of rebuttals
See AGENTS.md for detailed methodology.
Prayson Wilfred Daniel
```bibtex
@misc{daniel2026unthinking,
  author = {Daniel, Prayson Wilfred},
  title  = {The Thinking Machine That Doesn't Think: A Systematic Literature Review on LLM Reasoning},
  year   = {2026},
  url    = {https://github.com/Proteusiq/unthinking}
}
```

This literature review and visualization are provided for academic and research purposes.