
GRPO grouping in multi-turn agent RL: is it valid to mix samples with different prompts in the same group? #489

@yangdongdong2000

Description

While debugging the calx example, we observed that as training progresses, the agent increasingly stops calling tools.
When using GRPO for multi-turn agent RL training (e.g., tool-calling scenarios), a single rollout can produce multiple training samples — one per conversation turn — where each turn has a different prompt (because earlier turns' tool calls and tool responses get appended to the context).
For example, with rollout_n=4 for one question:

| Sample | Prompt | Response | Reward |
|---|---|---|---|
| Rollout1-Turn0 | [system, user] | "Let me calculate... <tool_call>..." | 1.0 |
| Rollout1-Turn1 | [system, user, assistant, tool_response] | "The answer is 40" | 1.0 |
| Rollout2-Turn0 | [system, user] | "The answer is π square cm" (no tool) | 0.0 |
| Rollout3-Turn0 | [system, user] | "I'll use the calculator... <tool_call>..." | 1.0 |
| Rollout3-Turn1 | [system, user, assistant, tool_response] | "### ANSWER: 40" | 1.0 |
| Rollout4-Turn0 | [system, user] | "The answer is 35" (no tool) | 0.0 |

All 6 samples share the same uid (derived from the same original question), so GRPO computes advantages within this single group. Standard GRPO assumes all samples in a group are generated from the same prompt. However, in multi-turn agent scenarios, Turn0 samples have prompt [system, user], while Turn1 samples have prompt [system, user, assistant, tool_response] — these are fundamentally different inputs. Is it theoretically valid to compute group-normalized advantages across samples with different prompts?
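To make the concern concrete, here is a minimal sketch (not verl's actual code; function and variable names are illustrative) of how GRPO-style group normalization would treat the six samples above once they share one uid. Turn0 and Turn1 samples end up z-scored against each other even though they were generated from different prompts:

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantage: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rewards for the 6 samples in the table, in row order. All six share the
# same uid, so they form a single GRPO group, mixing Turn0 samples
# (prompt = [system, user]) with Turn1 samples
# (prompt = [system, user, assistant, tool_response]).
rewards = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0]
advs = group_advantages(rewards)

# Turn1 samples (indices 1 and 4) receive positive advantages computed
# relative to a baseline that includes Turn0 samples from other rollouts,
# i.e. samples conditioned on a different input distribution.
```

Note that the two no-tool rollouts get strongly negative advantages here, while both turns of each tool-calling rollout get identical positive advantages, so the baseline each sample is compared against is partly built from samples it could never have been an alternative to.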

Labels

question, verl