
SalesforceAIResearch/NuRL

Nudging the Boundaries of LLM Reasoning

(Figure: overview of the NuRL framework)

Overview

This repository contains the official implementation of NuRL (Nudging the Boundaries of LLM Reasoning).

Table of Contents

  • Installation
  • Quick Start
  • Data Preparation
  • Training
  • Saved Checkpoints
  • Evaluation
  • Acknowledgements
  • License
  • Citation

Installation

Requirements

This repository has been tested on Python 3.10.12.

Setup Instructions

  1. Create a conda environment:
conda create -n nurl python=3.10.12
conda activate nurl
  2. Install dependencies:
bash install_requirements.sh
  3. Install verl:
pip install --no-deps -e .
  4. (Optional) Configure Weights & Biases:
export WANDB_API_KEY=your_key

Quick Start

  1. Download the preprocessed training data from Dropbox.

  2. Extract the downloaded archive and place the model_output and train_data folders in the root directory of this project (a quick sanity check is sketched after this list).

  3. Start training (see the Training section).
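
The archive layout is described only by those two folder names, so a quick sanity check can confirm they landed in the right place. This is a minimal sketch assuming only the folder names above; it is not part of the repository:

import os

# Verify that the extracted data folders sit in the project root.
# The two folder names come from step 2 above; the check itself is
# purely illustrative.
for folder in ("model_output", "train_data"):
    if not os.path.isdir(folder):
        raise FileNotFoundError(f"expected ./{folder}/ in the project root")
print("Preprocessed data folders found.")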

Data Preparation

If you have already downloaded the preprocessed data, you can skip this section. Otherwise, follow the steps below to process the data yourself.

Step 1: Generate Hints

Option A: Generate hints from open-source models:

cd NuRL
python get_os_model_reasoning.py
python get_os_model_abstract_hint.py

Option B: Generate hints from GPT:

cd NuRL
python get_gpt_abstract_hint.py

Step 2: Preprocess Data

Run the preprocessing scripts to generate training data:

cd data_preprocess
python stage1_grpo.py
python stage2_nurl.py

(Optional) Before Stage 2 training, you can filter out easy samples by running:

cd NuRL
python filter_easy_sample.py

This script generates k rollouts and filters out samples where all k rollouts are correct.
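
In outline, the filtering works as follows. This is a minimal sketch of that logic, not the script itself: generate_rollouts and is_correct are hypothetical stand-ins for the repo's actual sampling and grading code.

def filter_easy_samples(samples, generate_rollouts, is_correct, k=8):
    """Drop samples that all k rollouts already solve.

    generate_rollouts and is_correct are hypothetical stand-ins;
    k=8 is an arbitrary illustrative default.
    """
    kept = []
    for sample in samples:
        rollouts = generate_rollouts(sample, k)
        # "Easy" means every one of the k rollouts is graded correct;
        # such samples are removed so Stage 2 focuses on harder problems.
        if not all(is_correct(rollout, sample) for rollout in rollouts):
            kept.append(sample)
    return kept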

Training

Before training, ensure you have completed the Data Preparation step.

Training scripts are located in the root directory and follow the naming convention: stage{1,2}_{model}.sh. Execute the appropriate script for your desired model and training stage.

Examples:

Stage 1 training with Llama:

bash stage1_llama.sh

Stage 2 training with Qwen:

bash stage2_qwen.sh

Training Configuration

  • trainer.dynamic_hint_injection: Set to True to enable hint injection for samples where all rollouts fail.
  • trainer.hint_type: Specifies the hint source. Options: self (self-generated hints) or gpt (GPT-generated hints).
  • trainer.num_rollout_augmentation: Number of rollouts to regenerate with the hint injected. We use num_rollout - 1 throughout our experiments. (A sketch of the resulting injection loop follows this list.)
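
Taken together, these options imply the following per-sample behavior during training. The sketch below is an illustration under stated assumptions, not the trainer's actual code: sample_rollouts, grade, and inject_hint are hypothetical stand-ins for the repo's sampling, reward, and prompt-augmentation logic.

def rollouts_with_dynamic_hints(prompt, hint, num_rollout,
                                sample_rollouts, grade, inject_hint,
                                num_rollout_augmentation=None):
    """Illustrative sketch of trainer.dynamic_hint_injection.

    sample_rollouts, grade, and inject_hint are hypothetical stand-ins,
    not functions from this repository.
    """
    if num_rollout_augmentation is None:
        num_rollout_augmentation = num_rollout - 1  # value used in the experiments
    rollouts = sample_rollouts(prompt, num_rollout)
    if not any(grade(r) for r in rollouts):
        # All rollouts failed: replace num_rollout_augmentation of them with
        # rollouts sampled from a hint-augmented prompt (trainer.hint_type
        # selects whether the hint is self-generated or GPT-generated).
        hinted_prompt = inject_hint(prompt, hint)
        kept = rollouts[: num_rollout - num_rollout_augmentation]
        rollouts = kept + sample_rollouts(hinted_prompt, num_rollout_augmentation)
    return rollouts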

Saved Checkpoints

We provide checkpoints for both Stage 1 and Stage 2 training across three models: Llama, OctoThinker, and Qwen.

Model          | Stage 1 (GRPO) | Stage 2 (Self Hint) | Stage 2 (GPT Hint)
Llama 3.2 3B   | Hugging Face   | Hugging Face        | Hugging Face
OctoThinker 3B | Hugging Face   | Hugging Face        | Hugging Face
Qwen3 4B       | Hugging Face   | Hugging Face        | Hugging Face

Evaluation

Run evaluation using the following command:

cd NuRL
python evaluate.py --model dinobby/Llama3.2-3B-Instruct-NuRL-Self --dataset MATH_500 --k 16

Arguments:

  • --model: Path to the model to be evaluated
  • --dataset: Dataset to evaluate on. Options: MATH_500, MATH_Hard, AIME24, MMLU_Pro, GPQA, Date
  • --k: Number of evaluation runs (see the aggregation sketch after this list)
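
With k > 1, a natural way to summarize the runs is mean accuracy across them. The snippet below is only an illustrative aggregation assuming you already have per-run accuracy scores; evaluate.py may report its own metrics.

# Illustrative aggregation of k evaluation runs into mean accuracy.
# run_accuracies stands in for whatever per-run scores the script produces;
# the numbers here are placeholders, not real results.
run_accuracies = [0.71, 0.69, 0.73]
mean_acc = sum(run_accuracies) / len(run_accuracies)
print(f"avg over {len(run_accuracies)} runs: {mean_acc:.3f}")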

Acknowledgements

This repository builds upon verl and vLLM. We thank their developers for their valuable contributions to the research community.

License

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC-BY-NC 4.0).

Important Usage Notice: This dataset was generated using GPT-4 and should not be used to develop models that compete with OpenAI.

For more details, see the LICENSE file.

Citation

If you find this work useful, please consider citing our paper:

@article{chen2025nudging,
  title={Nudging the Boundaries of LLM Reasoning},
  author={Chen, Justin Chih-Yao and Peng, Becky Xiangyu and Choubey, Prafulla Kumar and Huang, Kung-Hsiang and Zhang, Jiaxin and Bansal, Mohit and Wu, Chien-Sheng},
  journal={arXiv preprint arXiv:2509.25666},
  year={2025}
}
