This repository contains the official implementation of NuRL (Nudging the Boundaries of LLM Reasoning).
- Installation
- Quick Start
- Data Preparation
- Training
- Saved Checkpoints
- Evaluation
- Acknowledgements
- License
- Citation
## Installation

This repository has been tested on Python 3.10.12.
- Create a conda environment:

  ```bash
  conda create -n nurl python=3.10.12
  conda activate nurl
  ```

- Install dependencies:

  ```bash
  bash install_requirements.sh
  ```

- Install verl:

  ```bash
  pip install --no-deps -e .
  ```

- (Optional) Configure Weights & Biases:

  ```bash
  export WANDB_API_KEY=your_key
  ```

## Quick Start
- Download the preprocessed training data from Dropbox.
- Extract the downloaded archive and place the `model_output` and `train_data` folders in the root directory of this project.
- Start training (see the Training section).
## Data Preparation

If you have already downloaded the preprocessed data, you can skip this section. Otherwise, follow the steps below to process the data yourself.
### Step 1: Generate Hints
Option A: Generate hints from open-source models:

```bash
cd NuRL
python get_os_model_reasoning.py
python get_os_model_abstract_hint.py
```

Option B: Generate hints from GPT:
```bash
cd NuRL
python get_gpt_abstract_hint.py
```
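For reference, here is a minimal sketch of what GPT-based hint generation could look like, assuming the OpenAI chat completions API; the prompt wording, model choice, and output format in `get_gpt_abstract_hint.py` may differ:

```python
# Sketch only: the actual prompt and model in get_gpt_abstract_hint.py may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def get_abstract_hint(question: str, solution: str) -> str:
    """Ask GPT for a high-level hint that points to the key idea
    without revealing the final answer."""
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice; substitute the model you use
        messages=[{
            "role": "user",
            "content": (
                "Given the problem and its reference solution, write a short, "
                "abstract hint describing the key idea needed to solve it. "
                "Do NOT reveal the final answer or concrete intermediate values.\n\n"
                f"Problem:\n{question}\n\nReference solution:\n{solution}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()
```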
### Step 2: Preprocess Data

Run the preprocessing scripts to generate training data:
```bash
cd data_preprocess
python stage1_grpo.py
python stage2_nurl.py
```

(Optional) Before Stage 2 training, you can filter out easy samples by running:
```bash
cd NuRL
python filter_easy_sample.py
```

This script generates k rollouts per sample and filters out samples for which all k rollouts are correct.
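Conceptually, the filtering works as sketched below. This is a simplified illustration, not the exact implementation in `filter_easy_sample.py`; `generate_rollouts` and `is_correct` are hypothetical helpers:

```python
# Simplified sketch of easy-sample filtering: keep a sample only if at
# least one of its k rollouts is wrong, i.e. the model cannot already
# solve it reliably without a hint.
def filter_easy_samples(samples, generate_rollouts, is_correct, k=8):
    kept = []
    for sample in samples:
        rollouts = generate_rollouts(sample["question"], n=k)
        # A sample is "easy" when all k rollouts are correct; drop it.
        if all(is_correct(r, sample["answer"]) for r in rollouts):
            continue
        kept.append(sample)
    return kept
```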
## Training

Before training, ensure you have completed the Data Preparation step.
Training scripts are located in the root directory and follow the naming convention `stage{1,2}_{model}.sh`. Execute the appropriate script for your desired model and training stage.
Examples:

Stage 1 training with Llama:

```bash
bash stage1_llama.sh
```

Stage 2 training with Qwen:

```bash
bash stage2_qwen.sh
```

Key configuration options:

- `trainer.dynamic_hint_injection`: Set to `True` to enable hint injection for samples where all rollouts fail.
- `trainer.hint_type`: Specifies the hint source. Options: `self` (self-generated hints) or `gpt` (GPT-generated hints).
- `trainer.num_rollout_augmentation`: Number of rollouts to augment with hints. We use `num_rollout - 1` throughout our experiments.
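To make the first option concrete, here is a conceptual sketch of dynamic hint injection. It is an illustration only, with hypothetical `is_correct` and `generate` helpers; the actual trainer implementation differs in detail:

```python
# Conceptual sketch of dynamic hint injection: when every rollout for a
# sample fails, regenerate with the hint prepended to most rollouts,
# keeping (num_rollout - num_rollout_augmentation) rollouts hint-free.
def maybe_inject_hint(sample, rollouts, is_correct, generate,
                      num_rollout_augmentation):
    if any(is_correct(r, sample["answer"]) for r in rollouts):
        return rollouts  # at least one success: no hint needed

    hinted_prompt = sample["hint"] + "\n\n" + sample["question"]
    num_plain = len(rollouts) - num_rollout_augmentation
    # Augmented rollouts see the hint; the rest retry the plain prompt.
    return ([generate(hinted_prompt) for _ in range(num_rollout_augmentation)]
            + [generate(sample["question"]) for _ in range(num_plain)])
```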
## Saved Checkpoints

We provide checkpoints for both Stage 1 and Stage 2 training across three models: Llama, OctoThinker, and Qwen.
| Model | Stage 1 (GRPO) | Stage 2 (Self Hint) | Stage 2 (GPT Hint) |
|---|---|---|---|
| Llama 3.2 3B | | | |
| OctoThinker 3B | | | |
| Qwen3 4B | | | |
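A minimal sketch of loading a released checkpoint, assuming the checkpoints are hosted on the Hugging Face Hub (the `dinobby/...` identifier below is taken from the Evaluation example) and that `accelerate` is installed for `device_map="auto"`:

```python
# Load a released NuRL checkpoint with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dinobby/Llama3.2-3B-Instruct-NuRL-Self"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What is 12 * 34? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```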
## Evaluation

Run evaluation using the following command:

```bash
cd NuRL
python evaluate.py --model dinobby/Llama3.2-3B-Instruct-NuRL-Self --dataset MATH_500 --k 16
```

- `--model`: Path to the model to be evaluated.
- `--dataset`: Dataset to evaluate on. Options: `MATH_500`, `MATH_Hard`, `AIME24`, `MMLU_Pro`, `GPQA`, `Date`.
- `--k`: Number of evaluation runs.
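Since `--k` sets the number of independent evaluation runs, per-run accuracies are typically aggregated afterwards. A minimal sketch, assuming a hypothetical `run_eval_once` helper that is not part of `evaluate.py`'s documented interface:

```python
# Aggregate accuracy over k independent evaluation runs (mean and std).
import statistics


def evaluate_k_runs(run_eval_once, k=16):
    accs = [run_eval_once() for _ in range(k)]
    mean = statistics.mean(accs)
    std = statistics.stdev(accs) if k > 1 else 0.0
    print(f"avg@{k}: {mean:.4f} +/- {std:.4f}")
    return mean, std
```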
## Acknowledgements

This repository builds upon verl and vLLM. We thank them for their valuable contributions to the research community.
## License

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Important Usage Notice: This dataset was generated using GPT-4 and should not be used to develop models that compete with OpenAI.
For more details, see the LICENSE file.
## Citation

If you find this work useful, please consider citing our paper:
```bibtex
@article{chen2025nudging,
  title={Nudging the Boundaries of LLM Reasoning},
  author={Chen, Justin Chih-Yao and Peng, Becky Xiangyu and Choubey, Prafulla Kumar and Huang, Kung-Hsiang and Zhang, Jiaxin and Bansal, Mohit and Wu, Chien-Sheng},
  journal={arXiv preprint arXiv:2509.25666},
  year={2025}
}
```