Implementation of attention-based reasoning chain pruning for LLMs, inspired by:
- Think Clearly: Improving Reasoning via Redundant Token Pruning (Choi et al., 2025)
- TRAAC: Think Right with Adaptive, Attentive Compression (2025)
Reasoning LLMs (like DeepSeek-R1) generate long chain-of-thought traces that contain significant redundancy. By analyzing attention patterns, we can identify which reasoning steps the model actually relies on when producing its final answer — and prune the rest.
Core insight from "Think Clearly": Steps that receive low attention from subsequent tokens are redundant. Removing them reduces distraction and can actually improve accuracy.
Core insight from "TRAAC": Self-attention over reasoning trajectories identifies important vs. expendable steps, enabling adaptive compression without retraining.
- Generate a full reasoning chain for each math problem
- Segment the chain into discrete reasoning steps (by paragraph breaks / reasoning markers)
- Score each step by computing how much attention the answer-relevant tokens pay to it (averaged across all layers and heads)
- Prune steps below a percentile threshold of importance
- Re-evaluate the model with only the retained reasoning context
- Compare accuracy and token efficiency across pruning thresholds
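The segmentation step above can be sketched as follows (a minimal illustration; the function name and the blank-line splitting rule are my assumptions — the repo's script may also split on explicit reasoning markers):

```python
import re

def segment_reasoning(chain: str) -> list[str]:
    # Split a chain-of-thought trace into discrete steps at blank
    # lines (paragraph breaks), dropping empty fragments.
    parts = re.split(r"\n\s*\n", chain.strip())
    return [p.strip() for p in parts if p.strip()]

chain = "First, note that 2 + 2 = 4.\n\nThen square it to get 16.\n\nSo the answer is 16."
steps = segment_reasoning(chain)
print(len(steps))  # → 3
```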
| Threshold | Accuracy | Steps Kept | Avg Length | Notes |
|---|---|---|---|---|
| 0.0 (baseline) | X% | 100% | N chars | Full reasoning chain |
| 0.1 | X% | ~90% | N chars | Light pruning |
| 0.2 | X% | ~80% | N chars | Moderate pruning |
| 0.3 | X% | ~70% | N chars | Moderate-aggressive |
| 0.4 | X% | ~60% | N chars | Aggressive pruning |
| 0.5 | X% | ~50% | N chars | Heavy pruning |
Run the experiment to fill in actual values.
- Upload `Reasoning_Pruning_Experiment.ipynb` to Colab
- Set runtime to GPU (T4)
- Run all cells
- Create a new notebook and paste in the code from `reasoning_pruning_experiment.py`
- Enable the GPU T4 x2 accelerator
- Run:

```bash
pip install transformers accelerate bitsandbytes torch sentencepiece datasets
python reasoning_pruning_experiment.py
```

- DeepSeek-R1-Distill-Qwen-1.5B (4-bit quantized via bitsandbytes)
- Uses `<think>...</think>` structured reasoning
- Small enough for a free-tier Colab/Kaggle T4 GPU
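For reference, loading this model in 4-bit with bitsandbytes typically looks like the sketch below (the model ID is the public Hugging Face checkpoint named in this README; the repo's script may use different generation settings):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit quantization via bitsandbytes
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 on a T4
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU(s)
)
```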
- AIME 2024 (American Invitational Mathematics Examination)
- 5 problems included in code (expand to full 30 for comprehensive eval)
- Integer answers (0-999), straightforward to verify
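Because AIME answers are integers in 0-999, checking a prediction can be as simple as pulling the last such integer out of the model's output (a heuristic sketch; the repo's actual parser may be stricter):

```python
import re

def extract_answer(text: str):
    # Grab the last standalone 1-3 digit integer in the output;
    # returns None if no candidate answer is found.
    matches = re.findall(r"\b\d{1,3}\b", text)
    return int(matches[-1]) if matches else None

print(extract_answer("so the final answer is 204."))  # → 204
```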
```
├── README.md
├── reasoning_pruning_experiment.py    # Full Python script
├── Reasoning_Pruning_Experiment.ipynb # Colab notebook
└── aime24_pruning_results.json        # Results (generated after running)
```
For each reasoning step, we compute importance as:
```
importance(step_i) = Σ attention(last_token → token_j)   for all token_j in step_i
```

where attention is averaged across all layers and heads. This captures how much the model's final output "looks back" at each reasoning step.
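A minimal NumPy sketch of this scoring (in practice the attention maps come from a Hugging Face forward pass with `output_attentions=True`; the array shapes and function name here are illustrative assumptions):

```python
import numpy as np

def step_importance(attn_layers, step_spans, last_idx):
    # attn_layers: list of (heads, seq_len, seq_len) arrays, one per layer.
    # Average over layers and heads, then read off how much attention the
    # final token pays to each step's token span.
    avg = np.stack(attn_layers).mean(axis=(0, 1))  # (seq_len, seq_len)
    row = avg[last_idx]                            # last token's attention row
    return [float(row[s:e].sum()) for s, e in step_spans]

# Toy check: uniform attention over 6 tokens, 1 layer, 2 heads
uniform = [np.full((2, 6, 6), 1 / 6)]
scores = step_importance(uniform, [(0, 2), (2, 4), (4, 6)], last_idx=5)
print(scores)  # each 2-token step receives 2/6 of the attention mass
```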
Given threshold t (0 to 1):
- Compute the `t`-quantile (i.e., the 100·t-th percentile) of the step importance scores
- Remove all steps below this cutoff
- Always keep at least the highest-importance step
- Reconstruct the reasoning chain from remaining steps
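The pruning rule above can be sketched as follows (names are illustrative; `np.quantile` with `t` in [0, 1] plays the role of the percentile cutoff):

```python
import numpy as np

def prune_steps(steps, scores, t):
    # Drop steps whose importance falls below the t-quantile cutoff,
    # but always retain at least the single most important step.
    cutoff = np.quantile(scores, t)
    kept = [s for s, sc in zip(steps, scores) if sc >= cutoff]
    return kept or [steps[int(np.argmax(scores))]]

steps = ["setup", "detour", "key insight", "final check"]
scores = [0.30, 0.05, 0.40, 0.25]
print(prune_steps(steps, scores, 0.5))  # → ['setup', 'key insight']
```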
After pruning, we feed the retained reasoning back to the model as context and ask it to produce a final answer. This simulates the effect of KV-cache pruning at inference time.
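Concretely, the re-evaluation can rebuild a prompt from the surviving steps, e.g. (the exact prompt template is an assumption; the repo's script may word it differently):

```python
def build_reanswer_prompt(question: str, kept_steps: list) -> str:
    # Re-inject the pruned reasoning as <think> context and ask the
    # model to commit to a final answer.
    reasoning = "\n\n".join(kept_steps)
    return f"{question}\n<think>\n{reasoning}\n</think>\nFinal answer:"

prompt = build_reanswer_prompt("What is 4 squared?", ["4 * 4 = 16."])
```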
```bibtex
@article{choi2025thinkclearly,
  title={Think Clearly: Improving Reasoning via Redundant Token Pruning},
  author={Choi, Daewon and Lee, Jimin and Tack, Jihoon and Song, Woomin and others},
  journal={arXiv preprint arXiv:2507.08806},
  year={2025}
}

@article{traac2025,
  title={Think Right with Adaptive, Attentive Compression},
  journal={arXiv preprint arXiv:2510.01581},
  year={2025}
}
```

Naveen Puppala — Collaboration application for Sarvesh Gharat (IIT Bombay)