This repository contains the official implementation of NuRL (Nudging the Boundaries of LLM Reasoning).
- Installation
- Quick Start
- Data Preparation
- Training
- Saved Checkpoints
- Evaluation
- Acknowledgements
- License
- Citation
## Installation

This repository has been tested on Python 3.10.12.
- Create a conda environment:

  ```bash
  conda create -n nurl python=3.10.12
  conda activate nurl
  ```

- Install dependencies:

  ```bash
  bash install_requirements.sh
  ```

- Install verl:

  ```bash
  pip install --no-deps -e .
  ```

- (Optional) Configure Weights & Biases:

  ```bash
  export WANDB_API_KEY=your_key
  ```

## Quick Start
- Download the preprocessed training data from Dropbox.
- Extract the downloaded archive and place the `model_output` and `train_data` folders in the root directory of this project.
- Start training (see the Training section).
## Data Preparation

If you have already downloaded the preprocessed data, you can skip this section. Otherwise, follow the steps below to process the data yourself.
### Step 1: Generate Hints
Option A: Generate hints from open-source models:

```bash
cd NuRL
python get_os_model_reasoning.py
python get_os_model_abstract_hint.py
```

Option B: Generate hints from GPT:
```bash
cd NuRL
python get_gpt_abstract_hint.py
```
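For reference, here is a minimal sketch of what GPT-based hint generation could look like, assuming the OpenAI chat completions API; the prompt wording, model choice, and output format in `get_gpt_abstract_hint.py` may differ:

```python
# Sketch only: the actual prompt and model in get_gpt_abstract_hint.py may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def get_abstract_hint(question: str, solution: str) -> str:
    """Ask GPT for a high-level hint that points to the key idea
    without revealing the final answer."""
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice; substitute the model you use
        messages=[{
            "role": "user",
            "content": (
                "Given the problem and its reference solution, write a short, "
                "abstract hint describing the key idea needed to solve it. "
                "Do NOT reveal the final answer or concrete intermediate values.\n\n"
                f"Problem:\n{question}\n\nReference solution:\n{solution}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()
```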
### Step 2: Preprocess Data

Run the preprocessing scripts to generate training data:
```bash
cd data_preprocess
python stage1_grpo.py
python stage2_nurl.py
```

(Optional) Before Stage 2 training, you can filter out easy samples by running:
```bash
cd NuRL
python filter_easy_sample.py
```

This script generates k rollouts per sample and filters out samples for which all k rollouts are correct.
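Conceptually, the filtering works as sketched below. This is a simplified illustration, not the exact implementation in `filter_easy_sample.py`; `generate_rollouts` and `is_correct` are hypothetical helpers:

```python
# Simplified sketch of easy-sample filtering: keep a sample only if at
# least one of its k rollouts is wrong, i.e. the model cannot already
# solve it reliably without a hint.
def filter_easy_samples(samples, generate_rollouts, is_correct, k=8):
    kept = []
    for sample in samples:
        rollouts = generate_rollouts(sample["question"], n=k)
        # A sample is "easy" when all k rollouts are correct; drop it.
        if all(is_correct(r, sample["answer"]) for r in rollouts):
            continue
        kept.append(sample)
    return kept
```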
## Training

Before training, ensure you have completed the Data Preparation step.
Training scripts are located in the root directory and follow the naming convention `stage{1,2}_{model}.sh`. Execute the appropriate script for your desired model and training stage.
Examples:

Stage 1 training with Llama:

```bash
bash stage1_llama.sh
```

Stage 2 training with Qwen:

```bash
bash stage2_qwen.sh
```

Key configuration options:

- `trainer.dynamic_hint_injection`: Set to `True` to enable hint injection for samples where all rollouts fail.
- `trainer.hint_type`: Specifies the hint source. Options: `self` (self-generated hints) or `gpt` (GPT-generated hints).
- `trainer.num_rollout_augmentation`: Number of rollouts to augment with hints. We use `num_rollout - 1` throughout our experiments.
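To make the first option concrete, here is a conceptual sketch of dynamic hint injection. It is an illustration only, with hypothetical `is_correct` and `generate` helpers; the actual trainer implementation differs in detail:

```python
# Conceptual sketch of dynamic hint injection: when every rollout for a
# sample fails, regenerate with the hint prepended to most rollouts,
# keeping (num_rollout - num_rollout_augmentation) rollouts hint-free.
def maybe_inject_hint(sample, rollouts, is_correct, generate,
                      num_rollout_augmentation):
    if any(is_correct(r, sample["answer"]) for r in rollouts):
        return rollouts  # at least one success: no hint needed

    hinted_prompt = sample["hint"] + "\n\n" + sample["question"]
    num_plain = len(rollouts) - num_rollout_augmentation
    # Augmented rollouts see the hint; the rest retry the plain prompt.
    return ([generate(hinted_prompt) for _ in range(num_rollout_augmentation)]
            + [generate(sample["question"]) for _ in range(num_plain)])
```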
## Saved Checkpoints

We provide checkpoints for both Stage 1 and Stage 2 training across three models: Llama, OctoThinker, and Qwen.
| Model | Stage 1 (GRPO) | Stage 2 (Self Hint) | Stage 2 (GPT Hint) |
|---|---|---|---|
| Llama 3.2 3B | | | |
| OctoThinker 3B | | | |
| Qwen3 4B | | | |
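A minimal sketch of loading a released checkpoint, assuming the checkpoints are hosted on the Hugging Face Hub (the `dinobby/...` identifier below is taken from the Evaluation example) and that `accelerate` is installed for `device_map="auto"`:

```python
# Load a released NuRL checkpoint with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dinobby/Llama3.2-3B-Instruct-NuRL-Self"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What is 12 * 34? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```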
## Evaluation

Run evaluation using the following command:

```bash
cd NuRL
python evaluate.py --model dinobby/Llama3.2-3B-Instruct-NuRL-Self --dataset MATH_500 --k 16
```

- `--model`: Path to the model to be evaluated.
- `--dataset`: Dataset to evaluate on. Options: `MATH_500`, `MATH_Hard`, `AIME24`, `MMLU_Pro`, `GPQA`, `Date`.
- `--k`: Number of evaluation runs.
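Since `--k` sets the number of independent evaluation runs, per-run accuracies are typically aggregated afterwards. A minimal sketch, assuming a hypothetical `run_eval_once` helper that is not part of `evaluate.py`'s documented interface:

```python
# Aggregate accuracy over k independent evaluation runs (mean and std).
import statistics


def evaluate_k_runs(run_eval_once, k=16):
    accs = [run_eval_once() for _ in range(k)]
    mean = statistics.mean(accs)
    std = statistics.stdev(accs) if k > 1 else 0.0
    print(f"avg@{k}: {mean:.4f} +/- {std:.4f}")
    return mean, std
```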
## Acknowledgements

This repository builds upon verl and vLLM. We thank them for their valuable contributions to the research community.
## License

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Important Usage Notice: This dataset was generated using GPT-4 and should not be used to develop models that compete with OpenAI.
For more details, see the LICENSE file.
## Citation

If you find this work useful, please consider citing our paper:
```bibtex
@article{chen2025nudging,
  title={Nudging the Boundaries of LLM Reasoning},
  author={Chen, Justin Chih-Yao and Peng, Becky Xiangyu and Choubey, Prafulla Kumar and Huang, Kung-Hsiang and Zhang, Jiaxin and Bansal, Mohit and Wu, Chien-Sheng},
  journal={arXiv preprint arXiv:2509.25666},
  year={2025}
}
```