Minzheng Wang1,2, Yongbin Li3, Haobo Wang4, Xinghua Zhang3🌟,
Nan Xu1, Bingli Wu3, Fei Huang3, Haiyang Yu3, Wenji Mao1,2🌟
🌟 Corresponding author
1 MAIS, Institute of Automation, Chinese Academy of Sciences
2 School of Artificial Intelligence, University of Chinese Academy of Sciences
3 Tongyi Lab, Alibaba Group
4 Peking University
This repository contains the code and data for our paper Adaptive Thinking via Mode Policy Optimization for Social Language Agents. In this paper, we propose the Adaptive Mode Learning (AML) framework to empower social agents with adaptive thinking, enabling them to respond effectively to the dynamics of the social interaction context. Specifically, we first develop four thinking modes inspired by hierarchical cognitive control theory, covering a spectrum from intuitive response, through shallow and strategic thinking, to deep deliberation. We then inject these thinking modes into the model in two stages: behavioral cloning to learn the basic modes, followed by RL-based enhancement of adaptive thinking. For the RL stage, we develop the Adaptive Mode Policy Optimization (AMPO) algorithm, which incorporates mode-level and sample-level information into advantage estimation to strengthen context-aware thinking mode switching. For the reward, we design three reward functions, an answer reward, a format reward, and an answer length reward, providing feedback for choosing both the appropriate thinking mode and the appropriate answer.
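As a rough illustration of this reward design (a minimal sketch, not the implementation in this repo: the `<mode>`/`<answer>` tags, length target, and weights are all placeholders), the three signals can be combined like this:

```python
import re

def format_reward(response: str) -> float:
    # Reward responses that declare a thinking mode and an answer in a
    # tagged format (the <mode>/<answer> tags are illustrative, not the repo's).
    has_mode = re.search(r"<mode>.+?</mode>", response) is not None
    has_answer = re.search(r"<answer>.+?</answer>", response, re.S) is not None
    return 1.0 if has_mode and has_answer else 0.0

def length_reward(answer: str, target_len: int = 200) -> float:
    # Shorter answers score higher, discouraging needless deliberation.
    return max(0.0, 1.0 - len(answer.split()) / target_len)

def total_reward(answer_score: float, response: str, answer: str,
                 w_ans: float = 1.0, w_fmt: float = 0.2, w_len: float = 0.2) -> float:
    # answer_score: goal-completion score produced by the reward model
    # (Qwen2.5-72B-Instruct in Step 3); the weights here are placeholders.
    return (w_ans * answer_score
            + w_fmt * format_reward(response)
            + w_len * length_reward(answer))
```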
Extensive experimental results show that AML and AMPO achieve state-of-the-art (SOTA) performance compared with strong baselines. Details can be found in the paper.
- [2025.05.04] 🔥 AMPO is coming! We release the paper, code, and data! The checkpoint is still under security review and will be available soon!
The full optimization procedure. We employ a two-phase training procedure: the first phase uses mode behavioral cloning so the model learns to understand and accurately follow each thinking mode; the second phase applies adaptive mode policy optimization to enhance adaptive thinking mode switching and reasoning.
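To give a feel for the second phase, here is a minimal sketch of a mode-aware, GRPO-style advantage estimator. The exact estimator and coefficients used by AMPO are in the paper and in ./RL; the 0.5 mixing weight and function shape below are illustrative only:

```python
import numpy as np

def ampo_advantages(rewards, modes):
    # Sample-level term: group-normalized advantage over all rollouts of one prompt.
    rewards = np.asarray(rewards, dtype=float)
    sample_adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # Mode-level term: credit rollouts whose chosen thinking mode outperforms
    # the group mean, encouraging context-aware mode switching.
    mode_means = {m: rewards[[i for i, x in enumerate(modes) if x == m]].mean()
                  for m in set(modes)}
    mode_adv = np.array([mode_means[m] - rewards.mean() for m in modes])
    return sample_adv + 0.5 * mode_adv

# Example: six rollouts of one prompt, each tagged with the mode it chose.
adv = ampo_advantages([0.9, 0.7, 0.2, 0.4, 0.8, 0.3],
                      ["intuitive", "intuitive", "deep", "deep", "strategic", "strategic"])
```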
Step 1: Create the conda environments and install dependencies.
- Clone this repository
git clone https://github.com/MozerWang/AMPO
cd AMPO
- Create the BC conda environment (LLaMA-Factory).
conda create --name BC python=3.11 -y
conda activate BC
cd BC
pip install -e ".[torch,metrics]"
- Create the RL conda environment (verl).
# RL environment (verl)
conda create --name RL python=3.11 -y
conda activate RL
cd RL
pip3 install -e .[vllm]
pip install -r requirements.txt
You can also refer to the installation instructions of verl and LLaMA-Factory.
Step 2: Download the training data from Hugging Face.
git lfs install
git clone https://huggingface.co/datasets/iiiiwis/AMPO
Step 3: Prepare the model API.
- (Must) Set your OpenAI API key in config/gpt_4o.yaml (used for evaluation).
api_key: "Your OPENAI key"
api_url: "API URL"
- (Must) Set your API key in config/qwen2.5_72b_instruct.yaml (used for the reward model).
api_key: "Your key"
api_url: "API URL"
# We also recommend using vLLM, serving the model through an HTTP server that implements OpenAI's Completions and Chat APIs.
# Set your vLLM settings in config/*.yaml
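If you serve the reward model locally with vLLM, a minimal sketch of querying it through the OpenAI-compatible server looks like this (the model name, port, and config path are placeholders):

```python
# Start the server first, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-72B-Instruct --port 8000
# Then point api_url in config/qwen2.5_72b_instruct.yaml to http://localhost:8000/v1
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```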
Step 4: Behavior Cloning Training.
conda activate BC
cd BC
## (Must) First, set the bc_training_data_path in ./BC/data/dataset_info.yaml (a sketch of this file follows after this block)
sh train.sh
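For orientation, here is a minimal sketch of the dataset registration. Apart from bc_training_data_path, the key names are assumptions based on LLaMA-Factory's dataset_info conventions, so verify them against the file shipped in ./BC/data:

```yaml
# ./BC/data/dataset_info.yaml (illustrative sketch only)
bc_training_data:
  file_name: <bc_training_data_path>  # path to the BC data downloaded in Step 2
  formatting: sharegpt
```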
Step 5: RL Training.
conda activate RL
cd RL
## (Must) First, convert the RL training data to ".parquet" format using the script in ./RL/example/data_preprocess/sotopia.py (a sketch of the target layout follows after this block)
sh sotopia_ampo_llama3.1_8b.sh
sh sotopia_ampo_qwen2.5_7b.sh
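For reference, a minimal sketch of the .parquet layout the trainer expects; the field names below follow common verl conventions and may differ from what sotopia.py actually emits, so treat them as assumptions:

```python
# Illustrative only: writes a tiny verl-style .parquet file with pandas (requires pyarrow).
import pandas as pd

records = [{
    "data_source": "sotopia",
    "prompt": [{"role": "user", "content": "..."}],  # chat-format prompt for the policy
    "extra_info": {"index": 0},
}]
pd.DataFrame(records).to_parquet("sotopia_train.parquet")
```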
Step 6: Evaluation and Inference.
conda activate RL
cd RL
sh infer.sh
## show result
python result.py --env sotopia --data_path your_result_path
Thanks to these amazing works!
@article{wang2025ampo,
title={Adaptive Thinking via Mode Policy Optimization for Social Language Agents},
author={Minzheng Wang and Yongbin Li and Haobo Wang and Xinghua Zhang and Nan Xu and Bingli Wu and Fei Huang and Haiyang Yu and Wenji Mao},
year={2025},
journal={arXiv preprint arXiv:2505.02156},
url={https://arxiv.org/abs/2505.02156}
}