
Conversation

@karthik19967829

- Fixed bug in `SimpleGRPOLoss` where `logprobs - logprobs.detach()` always equaled 0
- Now correctly uses `behavior_logprobs` for the importance sampling ratio
- Added `behavior_logprobs` field to the `Episode` dataclass
- Modified collate function to include `behavior_logprobs` in training targets
- Updated `simple_grpo_loss` to accept and use `behavior_logprobs`
- Enabled logprobs collection in policy `sampling_params` (`qwen3_1_7b.yaml`)
- Added `behavior_logprobs` extraction from completion responses
- Added `GRPO_BUG_FIX_SUMMARY.md` documenting the issue and solution

This fix ensures the GRPO algorithm can properly learn by calculating the correct importance sampling ratio between current and behavior policies.
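A minimal sketch of what the loss change above could look like, assuming per-token logprob tensors and a padding mask; the name `simple_grpo_loss` comes from the bullets, but the actual signature, shapes, and any KL or clipping terms in torchforge may differ:

```python
import torch

def simple_grpo_loss(
    logprobs: torch.Tensor,           # log-probs under the current (trained) policy
    behavior_logprobs: torch.Tensor,  # log-probs recorded when the episode was sampled
    advantages: torch.Tensor,         # group-relative advantages, broadcastable to logprobs
    padding_mask: torch.Tensor,       # 1.0 for real response tokens, 0.0 for padding
) -> torch.Tensor:
    # Before the fix: logprobs - logprobs.detach() is always 0 in the forward
    # pass, so the ratio is always 1 and never reflects how far the current
    # policy has drifted from the policy that generated the data.
    # After the fix: the importance sampling ratio compares the current policy
    # against the behavior policy that actually sampled the episode.
    ratio = torch.exp(logprobs - behavior_logprobs.detach())
    per_token_loss = -ratio * advantages
    # Average over real (non-padding) tokens only.
    return (per_token_loss * padding_mask).sum() / padding_mask.sum().clamp(min=1)
```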

@felipemello1
Contributor

hey @karthik19967829, thanks for the investigation. Did you see an improvement in rewards? On our end, we found this:

"I ran experiments to diagnose the issue and here is the conclusions and suggested improvements.
Original max_res_tokens was set 1024 based on avg_tokens_generated. However, tracking max_token_generated revealed many LLM responses were truncated (while the answers are always at the end). This created episodes that are incorrectly given 0 reward. Experiments with max_res_tokens=2048 showed increased rewards ==> should set the max_res_tokens higher and should drop episodes that are truncated. ���
groups with advantage=0 (same rewards for all episodes in the group) should not be included as it'll make loss meaningless. Experiment including advantage=0 shows lower L2 Norm differences of params in trainer step, which means model is not learning and wasting cycles ==> should drop episodes with reward std dev < 0.001 (or some value)

SUM(rewards) shouldn't be used to track the progress as there are differences in number of episodes generated (or sampled) per step. ==> should track AVG(rewards) - generated and sampled
��
Based on all above changes, it shows that QWEN3 1.7B already achieves 0.8~0.9 of avg rewards on GSM9k from the start, which would be very hard to improve using traditional RL training ==> it shouldn't be used for an example scenario in README, and it should be changed to other datasets that are harder or the model is not already trained for."
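To make these suggestions concrete, here is a hypothetical post-processing sketch; `groups`, the `rewards`/`truncated` fields, and the 0.001 threshold are illustrative assumptions, not the actual torchforge data structures:

```python
import torch

def postprocess_groups(groups, std_threshold: float = 1e-3):
    """Drop truncated episodes and zero-advantage groups; report AVG(rewards).

    `groups` is assumed to be a list of dicts, each with a 1-D `rewards`
    tensor and a boolean `truncated` mask per episode.
    """
    kept, all_rewards = [], []
    for group in groups:
        # Drop truncated episodes: the answer sits at the end of the response,
        # so a cut-off response is scored 0 even when the reasoning was fine.
        rewards = group["rewards"][~group["truncated"]]
        if rewards.numel() == 0:
            continue
        all_rewards.append(rewards)
        # Drop groups where every episode got (almost) the same reward: the
        # group-relative advantage is ~0 and training on it wastes cycles.
        if rewards.numel() < 2 or rewards.std() < std_threshold:
            continue
        kept.append({**group, "rewards": rewards})
    # Track AVG(rewards) rather than SUM(rewards), since the number of
    # episodes generated (or sampled) per step varies.
    avg_reward = torch.cat(all_rewards).mean().item() if all_rewards else 0.0
    return kept, avg_reward
```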

@sfc-gh-kganesan

@felipemello1 this particular line could be a bug (https://github.com/meta-pytorch/torchforge/pull/571/files#diff-3bdea8452f8678efbb5410b0618ec2942e39e2678db862e9abf4c2fabf507fd8L132) and it can impact other recipes too!
On GSM8K the rewards are not steadily increasing; as you pointed out, that might be due to saturation/truncation, so I will tweak those and see if it helps. I would highly appreciate a review of the suggested loss change to see whether it is a correctness issue that needs to be fixed as well.
