# Reward Is Enough: LLMs Are In-Context Reinforcement Learners

This repository contains the code for reproducing the results in the paper "Reward Is Enough: LLMs Are In-Context Reinforcement Learners" (ICLR 2026).
## Repository Structure

```
├── experiments/                                  # Core ICRL experiments
│   ├── game24/                                   # Game of 24
│   ├── creative_writing/                         # Creative Writing
│   ├── math/                                     # Math Competitions (AIME/HMMT)
│   └── sciworld/                                 # ScienceWorld
├── analysis/                                     # Analysis experiments & visualization
│   ├── attention_analysis/                       # Reward-sensitive attention heads
│   ├── beyond_parametric_knowledge/              # ArXiv abstract generation
│   └── data_analysis/                            # Plotting & post-processing
├── requirements/                                 # Dependencies
│   ├── requirements_sciworld_math.txt            # ScienceWorld & Math experiments
│   └── requirements_creative_writing_game24.txt  # Creative Writing & Game of 24
└── README.md
```
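Most of the experiments below call the OpenAI API, so a key is needed before any run. Each run script expects the key to be configured in the file; if you also keep the key in the standard `OPENAI_API_KEY` environment variable (an assumption here — check the script you are running; this helper is not part of the repository), a quick sanity check before a long run looks like:

```python
import os

# Assumes the key is exported as the standard OPENAI_API_KEY environment
# variable; the experiment scripts may instead expect it set inside the file.
key = os.environ.get("OPENAI_API_KEY", "")
if key.startswith("sk-"):
    print("OPENAI_API_KEY is set")
else:
    print("OPENAI_API_KEY is missing or malformed")
```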
## Installation

Install the dependencies. We recommend using uv.

For ScienceWorld and Math experiments:

```shell
uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements/requirements_sciworld_math.txt
```

For Creative Writing and Game of 24 experiments:

```shell
uv pip install -r requirements/requirements_creative_writing_game24.txt
```

## Game of 24

Configure the OpenAI API key and specify which ICRL method or ablation to run in the file, then run:
```shell
cd experiments/game24
python llm_game24_api.py
```

Run the Reflexion baseline:

```shell
python llm_game24_api_reflexion.py
```

Run the Self-Refine baseline:

```shell
python llm_game24_api_self-refine.py
```

Run the Best-of-N baseline:

```shell
python llm_game24_api_rejection.py
```

Run the long CoT baseline:

```shell
python llm_game24_api_CoT.py
```

## Creative Writing

Configure the OpenAI API key and specify which ICRL method or ablation to run in the file, then run:
```shell
cd experiments/creative_writing
python llm_creative_writing_api.py
```

Run the Reflexion baseline:

```shell
python llm_creative_writing_api_reflexion.py
```

Run the Self-Refine baseline:

```shell
python llm_creative_writing_api_self-refine.py
```

Run the long CoT baseline:

```shell
python llm_creative_writing_api_CoT.py
```

## ScienceWorld

Make sure you have Java 1.8+ installed:
```shell
javac -version
```

Clone the ScienceWorld repository and install it:

```shell
git clone https://github.com/allenai/ScienceWorld.git
cd ScienceWorld
pip install -e .
```

Change to the experiment directory:

```shell
cd experiments/sciworld
```

Run the ICRL preset:

```shell
python3 sciworld.py icrl_mode=ICRL num_envs=29
```

Run ICRL ablations, e.g. `explore_only`:

```shell
python3 sciworld.py icrl_mode=ICRL num_envs=29 explore_only=true
```

Run other baselines, e.g. random sampling:

```shell
python3 sciworld.py icrl_mode=RANDOM_SAMPLING num_envs=29 max_env_steps=200
```

For all other available options, including the ablations and baselines, refer to the `SciWorldConfig` class in `experiments/sciworld/sciworld.py`.
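The `key=value` arguments above are overrides applied on top of config defaults. As a rough sketch of how such overrides map onto a config class — hypothetical field names and defaults mirroring the commands above, not the actual `SciWorldConfig` implementation:

```python
from dataclasses import dataclass, fields

@dataclass
class SciWorldConfig:
    # Hypothetical subset of options; see experiments/sciworld/sciworld.py
    # for the authoritative definitions and defaults.
    icrl_mode: str = "ICRL"
    num_envs: int = 29
    max_env_steps: int = 200
    explore_only: bool = False

def parse_overrides(argv: list[str]) -> SciWorldConfig:
    """Apply key=value command-line overrides to the default config."""
    cfg = SciWorldConfig()
    types = {f.name: f.type for f in fields(SciWorldConfig)}
    for arg in argv:
        key, _, raw = arg.partition("=")
        if types[key] is bool:
            value = raw.lower() in ("true", "1")  # bool("false") would be True
        else:
            value = types[key](raw)
        setattr(cfg, key, value)
    return cfg

print(parse_overrides(["icrl_mode=RANDOM_SAMPLING", "explore_only=true"]))
```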
## Math

```shell
cd experiments/math
python math_bench.py
```

## Beyond Parametric Knowledge

```shell
cd analysis/beyond_parametric_knowledge
```

Run ICRL:

```shell
python beyond_parametric_knowledge.py
```

Run the Best-of-N baseline:

```shell
python beyond_parametric_knowledge.py --rejection_sampling
```

Run the exploitation-only ablation:

```shell
python beyond_parametric_knowledge.py --exploitation_only
```

Run the exploration-only ablation:

```shell
python beyond_parametric_knowledge.py --explore_only
```

## Attention Analysis

This analysis examines attention patterns in Qwen3-32B to identify reward-sensitive attention heads. It requires 2 GPUs.
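Before launching, it can be worth confirming that two GPUs are actually visible. A small stdlib-only check that parses `nvidia-smi -L` output (a hypothetical helper for convenience — it is not part of the analysis scripts):

```python
import shutil
import subprocess

def count_gpus(listing: str) -> int:
    """Count GPUs in `nvidia-smi -L` output: one 'GPU <n>: ...' line per device."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))

if shutil.which("nvidia-smi"):
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    n = count_gpus(result.stdout)
    print(f"{n} GPU(s) visible; the attention analysis expects 2")
else:
    print("nvidia-smi not found; no NVIDIA GPUs visible")
```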
```shell
cd analysis/attention_analysis
```

Run the initial analysis (layers -1 to -4, 64 heads each):

```shell
bash test_layers_heads.sh <path_to_output_list.json>
```

Run the extended analysis across all 32 layers:

```shell
bash run_all_layers.sh <path_to_output_list.json>
```

Generate the significant-heads figure:

```shell
python plot_significant_heads_bar.py
```

## Acknowledgement

We have borrowed code from the ScienceWorld, ARMAP, and CLIN repositories.
## Citation

```bibtex
@inproceedings{song2026reward,
  title={Reward Is Enough: LLMs Are In-Context Reinforcement Learners},
  author={Kefan Song and Amir Moeini and Peng Wang and Lei Gong and Rohan Chandra and Shangtong Zhang and Yanjun Qi},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  eprint={2506.06303},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.06303},
}
```