fix: make OPSM reject whole off-policy sequences#1917
Open
haoyang9804 wants to merge 1 commit into
Open
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 29a0338e27
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
29a0338 to
147d795
Compare
147d795 to
ff40142
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
1. Summary
slime.utils.ppo_utils.compute_opsm_maskcomputes a sequence-level OPSM KL, but the old mask decision was applied token by token using each token's advantage sign. A high-KL off-policy sequence with a negative sequence-level advantage and mixed token advantages could therefore keep positive-advantage tokens active in the policy-gradient loss. This silently changes the training signal instead of rejecting the sequence.This patch computes the OPSM decision from valid tokens at sequence scope and broadcasts that decision to every token in the sequence. It also avoids using masked log-prob deltas in the sequence KL and adds a focused unit test for the mixed-advantage trigger.
2. Concrete Triggering Example
Minimal tensors at the
compute_opsm_maskboundary:The sequence KL is
1.0, aboveopsm_delta, and the valid-token mean advantage is negative. The whole sequence should be rejected.Before the fix:
{ "opsm_mask": [0.0, 1.0, 0.0], "opsm_clipfrac": 0.6666666865348816, "masked_policy_pg_loss_tokens": [0.0, -0.09196986258029938, 0.0], "reduced_policy_loss": -0.030656620860099792, "leaked_positions": [1] }Wrong intermediate value:
opsm_mask == [0.0, 1.0, 0.0], which leaves token position1active and reduces policy loss to-0.030656620860099792.After the fix:
{ "opsm_mask": [0.0, 0.0, 0.0], "opsm_clipfrac": 1.0, "masked_policy_pg_loss_tokens": [0.0, -0.0, 0.0], "reduced_policy_loss": 0.0, "leaked_positions": [] }Fixed value:
opsm_mask == [0.0, 0.0, 0.0], which rejects the whole sequence and reduces policy loss to0.0.3. Reproduction Recipe
Use a refreshed slime checkout as
${SLIME_REPO}and an output directory as${OUTPUT_DIR}:4. Validation Runner
5. Observed Output
Unfixed upstream validation from RL-Sentinel:
{ "status": "reproduced", "observed": { "opsm_mask": [0.0, 1.0, 0.0], "opsm_clipfrac": 0.6666666865348816, "masked_policy_pg_loss_tokens": [0.0, -0.09196986258029938, 0.0], "reduced_policy_loss": -0.030656620860099792 }, "training_signal_effect": { "leaked_positions": [1], "leaked_token_count": 1, "non_crash": true } }Fixed branch verification:
{ "status": "fixed", "observed": { "opsm_mask": [0.0, 0.0, 0.0], "opsm_clipfrac": 1.0, "masked_policy_pg_loss_tokens": [0.0, -0.0, 0.0], "reduced_policy_loss": 0.0 }, "training_signal_effect": { "leaked_positions": [], "leaked_token_count": 0, "non_crash": true } }6. Root Cause
The old implementation calculated
seq_klat sequence scope:It then created a token-wise rejection mask:
That mixes a sequence-level off-policy decision with token-level advantage signs. When the same rejected sequence has a positive-advantage token, that token remains active and contributes to policy loss.
7. Fix
The fix:
seq_klwithout masked token deltas;seq_advantage;sequence_rejected = (seq_advantage < 0) & (seq_kl > args.opsm_delta);8. Tests And Checks
Commands run on the repair branch:
Results:
9. Contribution And Duplicate Checks
Contribution guidance checked:
CONTRIBUTING.mdfrom refreshedTHUDM/slimemain.Duplicate checks performed:
BUG_FINDINGS.md: no existing OPSM mixed-advantage sequence leak entry.myslimelocal and remote fix branches: no OPSM fix branch.myslime/pr_drafts: no OPSM draft.THUDM/slimeREST searches forOPSM,compute_opsm_mask,Off-Policy Sequence Masking, andopsm_deltafound OPSM feature/background PRs such asTHUDM/slime#999, but no exact bug report or fix for this mixed-advantage leak.10. Related PRs Or Fixes
THUDM/slime#999added OPSM support. This is related because it introduced the OPSM path, but it is not an exact duplicate; it does not fix the sequence-level decision being applied token-wise for mixed advantages.