Long-RL: Scaling RL to Long Sequences

Scaling RL to Long Videos [Paper]
Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han

We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long videos, using cached video embeddings for efficient rollout and prefilling. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training.

TABLE OF CONTENTS

  1. News
  2. Highlights
  3. Introduction
  4. Supported Features
  5. Installation
  6. Training
  7. LongVideo-Reason
  8. Examples
  9. How to contribute
  10. Core Contributors
  11. Citation
  12. Acknowledgement

News

  • [2025.7.19] We release detailed instructions and scripts for the data generation process of our LongVideo-Reason dataset in the longvideo-reason directory.
  • [2025.7.18] We release new supported features, including Open-ended reward, Cached video embeddings, and Chunked gathering as introduced in Supported Features.
  • [2025.7.10] We release the paper and this GitHub repo, Long-RL.

Highlights

  1. Hour-level long video RL training on a single node: We support RL training on hour-level videos (3,600 frames, ~256k tokens) with sequence parallelism, on a single A100 node (8 GPUs); see the example invocation after this list. Example script: examples/new_supports/qwen2_5_vl_3b_video_1h.sh
  2. Omni-model RL: We support RL training on omni models that take text, video, and audio as inputs. Example script: examples/new_supports/qwen2_5_omni_3b_grpo.sh
  3. Image/video generation RL: We support RL training on image/video generation models, such as Stable Diffusion and the Wan series. Example scripts: examples/new_supports/sd3_image_grpo.sh and examples/new_supports/wan_video_grpo.sh
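
Assuming the hour-level script follows the same invocation pattern as the single-node training example in the Training section (the video path as the first argument), a run might look like:

bash examples/new_supports/qwen2_5_vl_3b_video_1h.sh $VIDEO_PATH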

Introduction

Supported models:

  • VILA series models on image and video, with SP support
    • examples/new_supports/nvila_2b_clevr_grpo.sh
    • examples/new_supports/nvila_2b_video_grpo.sh
    • examples/new_supports/longvila_7b_video_grpo.sh
  • Qwen-VL series models on text, image, video, and audio, with SP support
    • examples/new_supports/qwen2_5_3b_math_grpo.sh
    • examples/new_supports/qwen2_5_vl_3b_video_grpo.sh
    • examples/new_supports/qwen2_5_omni_3b_grpo.sh
  • Image and video diffusion model RL
    • examples/new_supports/sd3_image_grpo.sh
    • examples/new_supports/wan_video_grpo.sh

Supported algorithms:

  • In addition to GRPO, DAPO and REINFORCE are supported, all with SP support
    • examples/new_supports/qwen2_5_vl_3b_video_dapo.sh
    • examples/new_supports/qwen2_5_vl_3b_video_grpo.sh
    • examples/new_supports/qwen2_5_vl_3b_video_reinforce.sh

Supported Features

  • Open-ended reward: We support training on open-ended QAs (non-multiple-choice QAs). If you need it (see the sketch after this list):
    • Set --worker.rollout.open_ended_reward=True in the training script.
    • Export your OpenAI API key with export OPENAI_API_KEY=xxx.
  • Cached video embeddings: We support using cached video embeddings for video RL training, since encoding videos on the fly is slow with large batches and long video frames. If you need it (see the sketch after this list):
    • Follow verl/utils/cache_video_embeds_vila.py to cache video embeddings in a local directory.
    • Set --data.cache_dir and --worker.actor.cached_embeds_dir in the training script.
  • Chunked gathering: We support chunked gathering for all_gather_data_proto, which can otherwise hit CPU OOM on machines with limited CPU memory when large batches or long video frames are used. If you need it (see the sketch after this list):
    • Set --worker.rollout.num_chunk_seq in the training script, e.g. 8/16/32. Larger values cost less memory but more time.
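
A minimal sketch of the open-ended reward setup; the flag and environment variable come from the steps above, while where exactly the flag goes depends on your training script:

export OPENAI_API_KEY=xxx   # API key for the judge model that grades open-ended answers
# then, inside your training script, append to the trainer command:
#   --worker.rollout.open_ended_reward=True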
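
Similarly, a sketch of the cached-embeddings workflow; the arguments passed to the caching script below are hypothetical (check verl/utils/cache_video_embeds_vila.py for its actual interface), while the two training flags come from the steps above:

# step 1: pre-compute video embeddings into a local directory (argument names are illustrative)
python3 verl/utils/cache_video_embeds_vila.py --video_dir $VIDEO_PATH --output_dir ./cached_video_embeds
# step 2: point the training script at the cache:
#   --data.cache_dir=./cached_video_embeds
#   --worker.actor.cached_embeds_dir=./cached_video_embeds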
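
And for chunked gathering, a one-line sketch; 16 is just an illustrative choice among the suggested values:

# append to the trainer command in your training script (8/16/32; larger = less CPU memory, more time)
#   --worker.rollout.num_chunk_seq=16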

Installation

git clone https://github.com/NVlabs/Long-RL.git
cd Long-RL
pip install -e .

If you want to train Qwen-Omni models, please run

bash vllm_replace.sh

Training

Single node

For a single node (up to 8 GPUs), you can refer to the training scripts in the examples directory. For example,

bash examples/new_supports/qwen2_5_vl_3b_video_grpo.sh $VIDEO_PATH

Multi-nodes

For jobs that require multiple nodes, you can refer to the instructions in the EasyR1 repo.

We also provide an example launcher script, shown below, where TRAIN_SCRIPT is the single-node training script and NNODES is the number of nodes required:

bash scripts/srun_multi_nodes.sh $TRAIN_SCRIPT $NNODES

For example,

bash scripts/srun_multi_nodes.sh examples/new_supports/qwen2_5_vl_3b_video_grpo.sh 2
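
If your cluster requires submitting through sbatch rather than launching the wrapper directly, a hedged sketch of such a submission follows (the resource flags, and whether scripts/srun_multi_nodes.sh is meant to run inside an allocation, depend on your cluster setup):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --job-name=long-rl-video-grpo
# launch the provided multi-node wrapper inside the allocation
bash scripts/srun_multi_nodes.sh examples/new_supports/qwen2_5_vl_3b_video_grpo.sh 2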

Merge Checkpoint in Hugging Face Format

This follows the approach in the EasyR1 repo.

python3 scripts/model_merger.py --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor

LongVideo-Reason

We provide detailed instructions on the data generation process and how to evaluate models on our LongVideo-Reason benchmark in the longvideo-reason directory.

Examples

(Example demonstration figures)

How to contribute

  • Make sure to have git installed.
  • Create your own fork of the project.
  • Clone the repository on your local machine, using git clone and pasting the URL of this project.
  • Read the Installation section above.
  • Commit and push your changes.
  • Make a pull request when finished modifying the project (a typical workflow is sketched after this list).
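
A typical workflow for the steps above, assuming a fork under your own GitHub account (the fork URL and branch name are placeholders):

git clone https://github.com/<your-username>/Long-RL.git   # your fork of this project
cd Long-RL
git checkout -b my-feature              # work on a dedicated branch
# ... make and test your changes ...
git commit -am "Describe your change"
git push origin my-feature              # then open a pull request against NVlabs/Long-RL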

Core Contributors

Yukang Chen, Wei Huang, Shuai Yang, Qinghao Hu, Baifeng Shi, Hanrong Ye, Ligeng Zhu.

We welcome all possible contributions and will acknowledge all contributors clearly.

Citation

Please consider citing our paper and this framework if they are helpful in your research.

@misc{long-rl,
  title = {Long-RL: Scaling RL to Long Sequences},
  author = {Yukang Chen and Wei Huang and Shuai Yang and Qinghao Hu and Baifeng Shi and Hanrong Ye and Ligeng Zhu and Zhijian Liu and Pavlo Molchanov and Jan Kautz and Xiaojuan Qi and Sifei Liu and Hongxu Yin and Yao Lu and Song Han},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/NVlabs/Long-RL}},
}
@article{chen2025longvila-r1,
      title={Scaling RL to Long Videos},
      author={Yukang Chen and Wei Huang and Baifeng Shi and Qinghao Hu and Hanrong Ye and Ligeng Zhu and Zhijian Liu and Pavlo Molchanov and Jan Kautz and Xiaojuan Qi and Sifei Liu and Hongxu Yin and Yao Lu and Song Han},
      year={2025},
      eprint={2507.07966},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@inproceedings{chen2024longvila,
      title={LongVILA: Scaling Long-Context Visual Language Models for Long Videos},
      author={Yukang Chen and Fuzhao Xue and Dacheng Li and Qinghao Hu and Ligeng Zhu and Xiuyu Li and Yunhao Fang and Haotian Tang and Shang Yang and Zhijian Liu and Ethan He and Hongxu Yin and Pavlo Molchanov and Jan Kautz and Linxi Fan and Yuke Zhu and Yao Lu and Song Han},
      booktitle={The International Conference on Learning Representations (ICLR)},
      year={2025},
}

Acknowledgement

  • EasyR1: the codebase we built upon. Thanks for their wonderful work.
  • verl: the RL training framework we built upon.
  • vllm: we built upon vllm for the rollout engine.
  • Flow-GRPO: we refer to Flow-GRPO for the image/video generation RL part.
