Correct for gradient bias in GRPO style reward centering. #135

maitchison · 2025-11-27T23:50:53Z

Motivation

GRPO-style centering introduces bias into the gradient estimate: the expected gradient is scaled by a factor of $$\frac{G-1}{G}$$, where $$G$$ is the group size. While this factor is generally close to 1 and can be absorbed into the learning rate, it would be preferred not to have the learning rate depend on group size in this way.
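A short derivation of where the factor comes from, as a sketch. The symbols $$\bar{r}$$ (group mean including sample $$i$$) and $$\bar{r}_{-i}$$ (leave-one-out mean, excluding sample $$i$$) are notation introduced here for illustration and are not taken from the PR:

$$
r_i - \bar{r} \;=\; r_i - \frac{1}{G} r_i - \frac{G-1}{G}\,\bar{r}_{-i} \;=\; \frac{G-1}{G}\,\bigl(r_i - \bar{r}_{-i}\bigr),
\qquad
\bar{r} = \frac{1}{G}\sum_{j=1}^{G} r_j,
\quad
\bar{r}_{-i} = \frac{1}{G-1}\sum_{j \neq i} r_j .
$$

The leave-one-out baseline $$\bar{r}_{-i}$$ does not depend on sample $$i$$, so using it leaves the policy gradient estimate unbiased. The mean-centered advantage is that same quantity shrunk by $$\frac{G-1}{G}$$, which is why multiplying the advantages by $$\frac{G}{G-1}$$ should remove the scaling.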

References

See https://arxiv.org/pdf/2503.20783 (Pg.14)

Changes

Added a correction to the advantages to adjust for the bias introduced by GRPO reward centering.
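The PR diff is not reproduced here, so the following is only a minimal sketch of what such a correction could look like, assuming it amounts to rescaling the mean-centered advantages by $$\frac{G}{G-1}$$. The function names `centered_advantages` and `leave_one_out_advantages` are hypothetical and are not taken from the repository.

```python
from statistics import fmean


def centered_advantages(rewards, debias=True):
    """Mean-center one group's rewards, optionally rescaled by G / (G - 1).

    Hypothetical helper for illustration; not the code from this PR.
    """
    group_size = len(rewards)
    if group_size < 2:
        # With a single sample the centered advantage is always 0 and the
        # G / (G - 1) correction is undefined.
        raise ValueError("group size must be at least 2")

    mean_reward = fmean(rewards)
    advantages = [r - mean_reward for r in rewards]

    if debias:
        # Undo the (G - 1) / G shrinkage caused by including each sample
        # in its own baseline.
        scale = group_size / (group_size - 1)
        advantages = [a * scale for a in advantages]
    return advantages


def leave_one_out_advantages(rewards):
    """Reference: baseline for sample i is the mean of the other G - 1 rewards."""
    group_size = len(rewards)
    if group_size < 2:
        raise ValueError("group size must be at least 2")
    total = sum(rewards)
    return [r - (total - r) / (group_size - 1) for r in rewards]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0, 1.0]
    print(centered_advantages(rewards, debias=False))
    print(centered_advantages(rewards, debias=True))
    print(leave_one_out_advantages(rewards))
```

Under this assumption the debiased advantages coincide with the leave-one-out advantages, so the effective step size no longer varies with the group size.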

Matthew Aitchison added 6 commits November 28, 2025 12:40

debias the gradient

239cf53

correct indentation

1725945

use correct ratio

690f925

fix typo

0f838c0

fix formatting for ruff

4016afc

another ruff formatting fix

d6c8541
