Minzheng Wang1,2, Yongbin Li3, Haobo Wang4, Xinghua Zhang3🌟,
Nan Xu1, Bingli Wu3, Fei Huang3, Haiyang Yu3, Wenji Mao1,2🌟
🌟 Corresponding author
1 MAIS, Institute of Automation, Chinese Academy of Sciences
2 School of Artificial Intelligence, University of Chinese Academy of Sciences
3 Tongyi Lab, Alibaba Group
4 Peking University
This repository contains the code and data for our paper Adaptive Thinking via Mode Policy Optimization for Social Language Agents. In this paper, we propose the Adaptive Mode Learning (AML) framework to empower social agents with adaptive thinking, enabling them to respond effectively to the dynamics of the social interaction context. Specifically, we first develop four thinking modes inspired by hierarchical cognitive control theory, covering a spectrum from intuitive response, through shallow and strategic thinking, to deep deliberation. We then inject these thinking modes into the model in two stages: behavioral cloning to learn the basic modes, followed by RL-based enhancement of adaptive thinking. For the RL stage, we develop the Adaptive Mode Policy Optimization (AMPO) algorithm, which incorporates mode-level and sample-level information into advantage estimation to strengthen context-aware thinking mode switching. For the reward, we design three reward functions, an answer reward, a format reward, and an answer length reward, providing feedback for choosing both the appropriate thinking mode and the appropriate answer.
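As a rough illustration of this reward design (a minimal sketch, not the implementation in this repo: the `<mode>`/`<answer>` tags, length target, and weights are all placeholders), the three signals can be combined like this:

```python
import re

def format_reward(response: str) -> float:
    # Reward responses that declare a thinking mode and an answer in a
    # tagged format (the <mode>/<answer> tags are illustrative, not the repo's).
    has_mode = re.search(r"<mode>.+?</mode>", response) is not None
    has_answer = re.search(r"<answer>.+?</answer>", response, re.S) is not None
    return 1.0 if has_mode and has_answer else 0.0

def length_reward(answer: str, target_len: int = 200) -> float:
    # Shorter answers score higher, discouraging needless deliberation.
    return max(0.0, 1.0 - len(answer.split()) / target_len)

def total_reward(answer_score: float, response: str, answer: str,
                 w_ans: float = 1.0, w_fmt: float = 0.2, w_len: float = 0.2) -> float:
    # answer_score: goal-completion score produced by the reward model
    # (Qwen2.5-72B-Instruct in Step 3); the weights here are placeholders.
    return (w_ans * answer_score
            + w_fmt * format_reward(response)
            + w_len * length_reward(answer))
```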
Extensive experimental results show that AML and AMPO achieve state-of-the-art (SOTA) performance compared with strong baselines. Details can be found in the paper.
- [2025.05.04] 🔥 AMPO is coming! We release the paper, code, and data! The checkpoint is still under security review and will be available soon!
The full optimization procedure. We employ a two-phase training procedure: the first phase uses mode behavioral cloning so the model learns to understand and accurately follow each thinking mode; the second phase applies adaptive mode policy optimization to enhance adaptive thinking mode switching and reasoning.
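To give a feel for the second phase, here is a minimal sketch of a mode-aware, GRPO-style advantage estimator. The exact estimator and coefficients used by AMPO are in the paper and in ./RL; the 0.5 mixing weight and function shape below are illustrative only:

```python
import numpy as np

def ampo_advantages(rewards, modes):
    # Sample-level term: group-normalized advantage over all rollouts of one prompt.
    rewards = np.asarray(rewards, dtype=float)
    sample_adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # Mode-level term: credit rollouts whose chosen thinking mode outperforms
    # the group mean, encouraging context-aware mode switching.
    mode_means = {m: rewards[[i for i, x in enumerate(modes) if x == m]].mean()
                  for m in set(modes)}
    mode_adv = np.array([mode_means[m] - rewards.mean() for m in modes])
    return sample_adv + 0.5 * mode_adv

# Example: six rollouts of one prompt, each tagged with the mode it chose.
adv = ampo_advantages([0.9, 0.7, 0.2, 0.4, 0.8, 0.3],
                      ["intuitive", "intuitive", "deep", "deep", "strategic", "strategic"])
```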
Step 1: Create the conda environments and install dependencies.
- Clone this repository
git clone https://github.com/MozerWang/AMPO
cd AMPO
- Create the BC conda environment (LLaMA-Factory).
conda create --name BC python=3.11 -y
conda activate BC
cd BC
pip install -e ".[torch,metrics]"
- Create the RL conda environment (verl).
# RL environment (verl)
conda create --name RL python=3.11 -y
conda activate RL
cd RL
pip3 install -e .[vllm]
pip install -r requirements.txt
You can also refer to the installation instructions of verl and LLaMA-Factory.
Step 2: Download the training data from Hugging Face.
git lfs install
git clone https://huggingface.co/datasets/iiiiwis/AMPO
Step 3: Prepare the model API.
- (Must) Set your OpenAI API key in config/gpt_4o.yaml (used for evaluation).
api_key: "Your OPENAI key"
api_url: "API URL"
- (Must) Set your API key in config/qwen2.5_72b_instruct.yaml (used for the reward model).
api_key: "Your key"
api_url: "API URL"
# We also recommend using vLLM, serving the model through an HTTP server that implements OpenAI's Completions and Chat APIs.
# Set your vLLM settings in config/*.yaml
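If you serve the reward model locally with vLLM, a minimal sketch of querying it through the OpenAI-compatible server looks like this (the model name, port, and config path are placeholders):

```python
# Start the server first, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-72B-Instruct --port 8000
# Then point api_url in config/qwen2.5_72b_instruct.yaml to http://localhost:8000/v1
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```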
Step 4: Behavior Cloning Training.
conda activate BC
cd BC
## (Must) First, set the bc_training_data_path in ./BC/data/dataset_info.yaml (a sketch of this file follows after this block)
sh train.sh
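For orientation, here is a minimal sketch of the dataset registration. Apart from bc_training_data_path, the key names are assumptions based on LLaMA-Factory's dataset_info conventions, so verify them against the file shipped in ./BC/data:

```yaml
# ./BC/data/dataset_info.yaml (illustrative sketch only)
bc_training_data:
  file_name: <bc_training_data_path>  # path to the BC data downloaded in Step 2
  formatting: sharegpt
```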
Step 5: RL Training.
conda activate RL
cd RL
## (Must) First, convert the RL training data to ".parquet" format using the script in ./RL/example/data_preprocess/sotopia.py (a sketch of the target layout follows after this block)
sh sotopia_ampo_llama3.1_8b.sh
sh sotopia_ampo_qwen2.5_7b.sh
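For reference, a minimal sketch of the .parquet layout the trainer expects; the field names below follow common verl conventions and may differ from what sotopia.py actually emits, so treat them as assumptions:

```python
# Illustrative only: writes a tiny verl-style .parquet file with pandas (requires pyarrow).
import pandas as pd

records = [{
    "data_source": "sotopia",
    "prompt": [{"role": "user", "content": "..."}],  # chat-format prompt for the policy
    "extra_info": {"index": 0},
}]
pd.DataFrame(records).to_parquet("sotopia_train.parquet")
```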
Step 6: Evaluation and Inference.
conda activate RL
cd RL
sh infer.sh
## show result
python result.py --env sotopia --data_path your_result_path
Thanks to these amazing works!
@article{wang2025ampo,
title={Adaptive Thinking via Mode Policy Optimization for Social Language Agents},
author={Minzheng Wang and Yongbin Li and Haobo Wang and Xinghua Zhang and Nan Xu and Bingli Wu and Fei Huang and Haiyang Yu and Wenji Mao},
year={2025},
journal={arXiv preprint arXiv:2505.02156},
url={https://arxiv.org/abs/2505.02156}
}