@maitchison
Motivation

GRPO-style centering introduces bias into the gradient estimate. Because each sample's own reward is included in the group mean, the expected gradient is scaled by a factor of $$\frac{G-1}{G}$$, where $$G$$ is the group size. While the effect is generally small and can be absorbed into the learning rate, it would be preferable for the learning rate not to depend on group size in this way.

References

Changes

  • Added a correction to the advantages to adjust for the bias introduced by GRPO reward centering.
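
As a rough illustration of the kind of correction described above (not the code in this PR), the sketch below mean-centers one group's rewards and rescales the result by $$\frac{G}{G-1}$$. The function names are hypothetical, and the sketch assumes plain mean-centering with no standard-deviation normalization. With that assumption, the rescaled advantages coincide with those from a leave-one-out baseline, which does not include a sample's own reward.

```python
from statistics import fmean


def grpo_centered_advantages(rewards, correct_bias=True):
    """Mean-center one group's rewards (hypothetical helper, not the PR's code).

    If correct_bias is True, rescale by G / (G - 1) to undo the (G - 1) / G
    shrinkage caused by including each sample in its own baseline.
    """
    G = len(rewards)
    if G < 2:
        raise ValueError("need at least two samples per group")
    mean = fmean(rewards)
    advantages = [r - mean for r in rewards]
    if correct_bias:
        scale = G / (G - 1)
        advantages = [a * scale for a in advantages]
    return advantages


def leave_one_out_advantages(rewards):
    """Baseline for sample i is the mean of the other G - 1 rewards."""
    G = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (G - 1) for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0]
uncorrected = grpo_centered_advantages(rewards, correct_bias=False)
corrected = grpo_centered_advantages(rewards)
loo = leave_one_out_advantages(rewards)
```

Here `corrected` and `loo` agree up to floating-point error, while `uncorrected` is smaller by the factor $$\frac{G-1}{G}$$.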
