A Closed-Loop Reinforcement Learning Framework for Continuous Co-Evolution of Attacker, Defender, and Evaluator
This repository contains the official implementation of TriPlay-RL, a unified tri-role reinforcement learning framework that enables iterative and co-improving collaboration among three roles with near-zero manual annotation.
| Component | Specification |
|---|---|
| GPU | 8× NVIDIA H800 |
| GPU memory | 81,559 MiB per GPU |
| NVIDIA-SMI | 550.54.15 |
| Driver | 550.54.15 |
| CUDA | 12.4 |
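To compare a local machine against the table above, the GPU name, memory, and driver version can be read from `nvidia-smi`. The following is a minimal sketch and not part of the repository: the helper names `parse_gpu_rows` and `query_gpus` are hypothetical, and it assumes `nvidia-smi` is on the `PATH` (it returns an empty list otherwise).

```python
import shutil
import subprocess

# Standard nvidia-smi query flags; prints one CSV row per GPU without a header.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,driver_version",
    "--format=csv,noheader",
]


def parse_gpu_rows(text):
    """Parse rows such as 'NVIDIA H800, 81559 MiB, 550.54.15' into dicts."""
    rows = []
    for line in text.strip().splitlines():
        name, memory, driver = [field.strip() for field in line.split(",")]
        rows.append(
            {
                "name": name,
                "memory_mib": int(memory.split()[0]),  # drop the 'MiB' unit
                "driver": driver,
            }
        )
    return rows


def query_gpus():
    """Return one dict per visible GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError:
        return []
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    for index, gpu in enumerate(query_gpus()):
        print(index, gpu)
```

Other GPU configurations may also work; the sketch only reports what is installed and does not enforce the values in the table.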
# Clone the repository
git clone https://github.com/Qihoo360/TriPlay-RL.git
cd TriPlay-RL
# Create and activate the conda environment
conda create -n triplay python=3.10 -y
conda activate triplay
# Install the required Python packages
pip install -r requirements.txt
# Install the local TRL fork
cd source_code/trl
pip install -e .
cd ../..

Before launching training, download or prepare the required models under the paths expected by the scripts. By default, the Qwen3-8B training loop expects:
models/qwen/Qwen3-8B
models/qwen/Qwen3-14B
models/qwen/Qwen3-32B
models/glm/glm-4-9b
models/llama3/Meta-Llama-3.1-8B-Instruct
models/all-MiniLM-L6-v2
models/gpt-oss-safeguard-20b
models/octopus-seval-14B
models/llama3guard
models/qwen/Qwen2.5-32B
models/qwen/QwQ-32B
models/open-ai/gpt-oss-20b
models/qwen/Qwen3-30B-A3B-Thinking-2507
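Before a long run, it can help to confirm that these directories exist. The following is a minimal sketch and not part of the repository: the helper name `find_missing_model_dirs` is hypothetical, and it assumes the paths are relative to the repository root.

```python
from pathlib import Path

# Directory names copied from the list above; edit them if your layout differs.
EXPECTED_MODEL_DIRS = [
    "models/qwen/Qwen3-8B",
    "models/qwen/Qwen3-14B",
    "models/qwen/Qwen3-32B",
    "models/glm/glm-4-9b",
    "models/llama3/Meta-Llama-3.1-8B-Instruct",
    "models/all-MiniLM-L6-v2",
    "models/gpt-oss-safeguard-20b",
    "models/octopus-seval-14B",
    "models/llama3guard",
    "models/qwen/Qwen2.5-32B",
    "models/qwen/QwQ-32B",
    "models/open-ai/gpt-oss-20b",
    "models/qwen/Qwen3-30B-A3B-Thinking-2507",
]


def find_missing_model_dirs(root=".", expected=EXPECTED_MODEL_DIRS):
    """Return the expected model directories that do not exist under root."""
    root_path = Path(root)
    return [rel for rel in expected if not (root_path / rel).is_dir()]


if __name__ == "__main__":
    missing = find_missing_model_dirs()
    if missing:
        print("Missing model directories:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected model directories are present.")
```

The check only tests for directory existence; it does not verify that the weights inside are complete or loadable.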
If your local model directories are different, update the model paths in:
scripts_qwen3_8b/train_loop.py
eval.py
In train_loop.py, the most important fields are last_model_path for the initial Red/Blue/Eval models and model_paths_config for fixed reward or judge model overrides.
In eval.py, you can also directly change evaluator model paths in:
- safety_models (e.g. GPT-OSS Safeguard, SEval, Llama Guard)
- helpful_models (e.g. Qwen/QwQ/GPT-OSS helpfulness evaluators)
The file prompts/red_corpus_instances.jsonl is only an illustration of the expected JSONL layout (field shapes and how instances are organized). It is not meant to be sufficient or domain-complete for every deployment.
You should generate or curate your own corpus that matches your scenario (language, risk taxonomy, application domain, etc.), then point the pipeline at that file. For example, set general.traindataset_path in the mode configs under scripts_qwen3_8b/configs/, or change the corresponding override in scripts_qwen3_8b/train_loop.py (where paths like prompts/red_corpus.jsonl are wired for convenience).
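When preparing a custom corpus, a quick structural check against the sample file can catch malformed lines early. The following is a minimal sketch and not part of the repository: the helper names `load_jsonl` and `inconsistent_records` are hypothetical, and because the field names of the corpus are not listed here, the check compares key sets generically rather than assuming any particular schema.

```python
import json


def load_jsonl(path):
    """Parse a JSONL file into a list of dicts, failing loudly on bad lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_number}: expected a JSON object")
            records.append(record)
    return records


def inconsistent_records(records, reference_keys):
    """Return 0-based indices of records whose keys differ from reference_keys."""
    reference = set(reference_keys)
    return [i for i, record in enumerate(records) if set(record) != reference]
```

One way to use it is to load prompts/red_corpus_instances.jsonl, take the keys of its first record as `reference_keys`, and then run `inconsistent_records` on your own corpus. Matching keys do not guarantee that the values are semantically valid for the pipeline.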
python scripts_qwen3_8b/train_loop.py <begin_iteration> <max_iteration>

Example:

python scripts_qwen3_8b/train_loop.py 1 100

The first argument is the starting iteration. The second argument is the final/max iteration. For example, 1 100 runs the loop from iteration 1 through iteration 100.
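The inclusive semantics of the two arguments can be illustrated as follows. This is a minimal sketch under stated assumptions, not the script's actual argument handling: the function name `parse_iteration_args` is hypothetical, and the validation shown (rejecting a maximum below the start) is an assumption about sensible behavior.

```python
def parse_iteration_args(argv):
    """Map [begin_iteration, max_iteration] strings to an inclusive range."""
    if len(argv) != 2:
        raise ValueError("expected: <begin_iteration> <max_iteration>")
    begin_iteration, max_iteration = int(argv[0]), int(argv[1])
    if max_iteration < begin_iteration:
        raise ValueError("max_iteration must be >= begin_iteration")
    # range() excludes its stop value, so add 1 to include max_iteration.
    return range(begin_iteration, max_iteration + 1)


if __name__ == "__main__":
    iterations = parse_iteration_args(["1", "100"])
    print(iterations[0], iterations[-1], len(iterations))
```

Starting from a value greater than 1, for example 37 100, would then cover iterations 37 through 100. Whether the real script restores earlier checkpoints in that case is not covered by this sketch.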
Each full tri-role co-evolution iteration consists of:
- Red mode training
- QA-positive mode training
- Eval mode training
The number of training steps for each role is controlled by steps_config in scripts_qwen3_8b/train_loop.py.
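The ordering described above can be sketched as a flat schedule of (iteration, mode, steps) entries. This is an illustration only: the dictionary shape, the mode keys, and the step counts below are hypothetical placeholders, and the real steps_config in scripts_qwen3_8b/train_loop.py may be structured differently.

```python
# Hypothetical shape and values; check train_loop.py for the real steps_config.
STEPS_CONFIG = {"red": 10, "qa_positive": 10, "eval": 10}

# Order of the three training phases within one co-evolution iteration.
MODE_ORDER = ("red", "qa_positive", "eval")


def build_schedule(begin_iteration, max_iteration, steps_config=STEPS_CONFIG):
    """List (iteration, mode, steps) tuples in the order they would run."""
    schedule = []
    for iteration in range(begin_iteration, max_iteration + 1):
        for mode in MODE_ORDER:
            schedule.append((iteration, mode, steps_config[mode]))
    return schedule


if __name__ == "__main__":
    for entry in build_schedule(1, 2):
        print(entry)
```

Each iteration contributes three entries, so a run over N iterations trains each role N times, with every phase starting from the models produced by the phases before it.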
Attacker (Red):
- 90% ASR against Llama-3.1-Nemotron-Nano-8B-v1
- 3x improvement over baseline ASR against Qwen3-8B
- 20%–50% improvement in adversarial effectiveness while preserving diversity

Defender (Blue):
- 10%–30% gains in safety performance
- Maintains general reasoning capability (no degradation)
- Effectively breaks the safety-utility trade-off

Evaluator (Eval):
- Fine-grained P/S/R classification accuracy
- Improved judgment consistency
- Strong resistance to reward hacking
If you use TriPlay-RL in your research, please cite:
@article{tan2026triplay,
title={TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment},
author={Tan, Zhewen and Yu, Wenhan and Si, Jianfeng and Liu, Tongxin and Guan, Kaiqi and Jin, Huiyan and Tao, Jiawen and Yuan, Xiaokun and Ma, Duohe and Zhang, Xiangzheng and others},
journal={arXiv preprint arXiv:2601.18292},
year={2026}
}


