
Conversation

@karthik19967829

- Fixed bug in `SimpleGRPOLoss` where `logprobs - logprobs.detach()` always equaled 0
- Now correctly uses `behavior_logprobs` for the importance sampling ratio
- Added `behavior_logprobs` field to the `Episode` dataclass
- Modified collate function to include `behavior_logprobs` in training targets
- Updated `simple_grpo_loss` to accept and use `behavior_logprobs`
- Enabled logprobs collection in policy `sampling_params` (`qwen3_1_7b.yaml`)
- Added `behavior_logprobs` extraction from completion responses
- Added `GRPO_BUG_FIX_SUMMARY.md` documenting the issue and solution

This fix ensures the GRPO algorithm can properly learn by calculating the correct importance sampling ratio between current and behavior policies.
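A minimal sketch of what the loss change above could look like, assuming per-token logprob tensors and a padding mask; the name `simple_grpo_loss` comes from the bullets, but the actual signature, shapes, and any KL or clipping terms in torchforge may differ:

```python
import torch

def simple_grpo_loss(
    logprobs: torch.Tensor,           # log-probs under the current (trained) policy
    behavior_logprobs: torch.Tensor,  # log-probs recorded when the episode was sampled
    advantages: torch.Tensor,         # group-relative advantages, broadcastable to logprobs
    padding_mask: torch.Tensor,       # 1.0 for real response tokens, 0.0 for padding
) -> torch.Tensor:
    # Before the fix: logprobs - logprobs.detach() is always 0 in the forward
    # pass, so the ratio is always 1 and never reflects how far the current
    # policy has drifted from the policy that generated the data.
    # After the fix: the importance sampling ratio compares the current policy
    # against the behavior policy that actually sampled the episode.
    ratio = torch.exp(logprobs - behavior_logprobs.detach())
    per_token_loss = -ratio * advantages
    # Average over real (non-padding) tokens only.
    return (per_token_loss * padding_mask).sum() / padding_mask.sum().clamp(min=1)
```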

@felipemello1
Contributor

hey @karthik19967829, thanks for the investigation. Did you see an improvement in rewards? On our end, we found this:

"I ran experiments to diagnose the issue and here is the conclusions and suggested improvements.
Original max_res_tokens was set 1024 based on avg_tokens_generated. However, tracking max_token_generated revealed many LLM responses were truncated (while the answers are always at the end). This created episodes that are incorrectly given 0 reward. Experiments with max_res_tokens=2048 showed increased rewards ==> should set the max_res_tokens higher and should drop episodes that are truncated. ���
groups with advantage=0 (same rewards for all episodes in the group) should not be included as it'll make loss meaningless. Experiment including advantage=0 shows lower L2 Norm differences of params in trainer step, which means model is not learning and wasting cycles ==> should drop episodes with reward std dev < 0.001 (or some value)

SUM(rewards) shouldn't be used to track the progress as there are differences in number of episodes generated (or sampled) per step. ==> should track AVG(rewards) - generated and sampled
��
Based on all above changes, it shows that QWEN3 1.7B already achieves 0.8~0.9 of avg rewards on GSM9k from the start, which would be very hard to improve using traditional RL training ==> it shouldn't be used for an example scenario in README, and it should be changed to other datasets that are harder or the model is not already trained for."
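To make these suggestions concrete, here is a hypothetical post-processing sketch; `groups`, the `rewards`/`truncated` fields, and the 0.001 threshold are illustrative assumptions, not the actual torchforge data structures:

```python
import torch

def postprocess_groups(groups, std_threshold: float = 1e-3):
    """Drop truncated episodes and zero-advantage groups; report AVG(rewards).

    `groups` is assumed to be a list of dicts, each with a 1-D `rewards`
    tensor and a boolean `truncated` mask per episode.
    """
    kept, all_rewards = [], []
    for group in groups:
        # Drop truncated episodes: the answer sits at the end of the response,
        # so a cut-off response is scored 0 even when the reasoning was fine.
        rewards = group["rewards"][~group["truncated"]]
        if rewards.numel() == 0:
            continue
        all_rewards.append(rewards)
        # Drop groups where every episode got (almost) the same reward: the
        # group-relative advantage is ~0 and training on it wastes cycles.
        if rewards.numel() < 2 or rewards.std() < std_threshold:
            continue
        kept.append({**group, "rewards": rewards})
    # Track AVG(rewards) rather than SUM(rewards), since the number of
    # episodes generated (or sampled) per step varies.
    avg_reward = torch.cat(all_rewards).mean().item() if all_rewards else 0.0
    return kept, avg_reward
```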

@sfc-gh-kganesan

@felipemello1 this particular line could be a bug (https://github.com/meta-pytorch/torchforge/pull/571/files#diff-3bdea8452f8678efbb5410b0618ec2942e39e2678db862e9abf4c2fabf507fd8L132) and it can impact other recipes too!
On GSM8K the rewards are not steadily increasing; as you pointed out, that might be due to saturation/truncation, so I will tweak those and see if it helps. I would highly appreciate a review of the suggested loss change to see whether it is a correctness issue that needs to be fixed as well.
