fix: make OPSM reject whole off-policy sequences by haoyang9804 · Pull Request #1917 · THUDM/slime

haoyang9804 · 2026-05-18T04:35:57Z

1. Summary

slime.utils.ppo_utils.compute_opsm_mask computes a sequence-level OPSM KL, but the old mask decision was applied token by token using each token's advantage sign. A high-KL off-policy sequence with a negative sequence-level advantage and mixed token advantages could therefore keep positive-advantage tokens active in the policy-gradient loss. This silently changes the training signal instead of rejecting the sequence.

This patch computes the OPSM decision from valid tokens at sequence scope and broadcasts that decision to every token in the sequence. It also avoids using masked log-prob deltas in the sequence KL and adds a focused unit test for the mixed-advantage trigger.

2. Concrete Triggering Example

Minimal tensors at the compute_opsm_mask boundary:

full_log_probs = [torch.tensor([0.0, 0.0, 0.0])]
full_old_log_probs = [torch.tensor([1.0, 1.0, 1.0])]
advantages = [torch.tensor([-0.5, 0.25, -0.25])]
loss_masks = [torch.tensor([1.0, 1.0, 1.0])]
opsm_delta = 0.1

The sequence KL is 1.0, above opsm_delta = 0.1, and the valid-token mean advantage is (-0.5 + 0.25 - 0.25) / 3 ≈ -0.167, which is negative. Both rejection conditions hold, so the whole sequence should be rejected.
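
As a quick sanity check of those two numbers, the arithmetic can be reproduced in plain Python. This is only a sketch: plain lists stand in for tensors, and the names valid, seq_advantage, and sequence_rejected follow the wording of this PR rather than the exact code in slime.

```python
# Trigger values copied from the example above.
log_probs = [0.0, 0.0, 0.0]
old_log_probs = [1.0, 1.0, 1.0]
advantages = [-0.5, 0.25, -0.25]
loss_mask = [1.0, 1.0, 1.0]
opsm_delta = 0.1

# Indices of tokens that count toward the loss.
valid = [i for i, m in enumerate(loss_mask) if m != 0]
n_valid = max(len(valid), 1)

# Sequence KL estimate: mean of (old - new) log-prob over valid tokens.
seq_kl = sum(old_log_probs[i] - log_probs[i] for i in valid) / n_valid

# Valid-token mean advantage.
seq_advantage = sum(advantages[i] for i in valid) / n_valid

# Both conditions must hold for the sequence to be rejected.
sequence_rejected = seq_advantage < 0 and seq_kl > opsm_delta
print(seq_kl, round(seq_advantage, 4), sequence_rejected)  # 1.0 -0.1667 True
```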

Before the fix:

{
  "opsm_mask": [0.0, 1.0, 0.0],
  "opsm_clipfrac": 0.6666666865348816,
  "masked_policy_pg_loss_tokens": [0.0, -0.09196986258029938, 0.0],
  "reduced_policy_loss": -0.030656620860099792,
  "leaked_positions": [1]
}

Wrong intermediate value: opsm_mask == [0.0, 1.0, 0.0], which leaves token position 1 active and reduces policy loss to -0.030656620860099792.
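
The leaked loss value can be cross-checked by hand. The sketch below assumes compute_policy_loss uses the standard PPO clipped surrogate with ratio = exp(-ppo_kl); that form is an assumption on my part, although it does reproduce the numbers reported above. The helper name clipped_pg_loss is hypothetical.

```python
import math


def clipped_pg_loss(ppo_kl, advantage, eps_clip=0.2, eps_clip_high=0.2):
    """PPO clipped surrogate for a single token (assumed form, not slime's code)."""
    ratio = math.exp(-ppo_kl)  # new/old probability ratio
    clipped_ratio = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip_high)
    # PPO takes the pessimistic (larger) of the two losses.
    return max(-advantage * ratio, -advantage * clipped_ratio)


ppo_kl = [1.0, 1.0, 1.0]          # old_log_prob - log_prob per token
advantages = [-0.5, 0.25, -0.25]
buggy_mask = [0.0, 1.0, 0.0]      # opsm_mask reported before the fix

token_losses = [clipped_pg_loss(k, a) for k, a in zip(ppo_kl, advantages)]
masked_losses = [loss * keep for loss, keep in zip(token_losses, buggy_mask)]
reduced_loss = sum(masked_losses) / len(masked_losses)

print(masked_losses[1])  # about -0.09197, the leaked token
print(reduced_loss)      # about -0.030657
```

Only position 1 survives the buggy mask, so the whole reduced loss comes from a token that belongs to a sequence OPSM intended to drop.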

After the fix:

{
  "opsm_mask": [0.0, 0.0, 0.0],
  "opsm_clipfrac": 1.0,
  "masked_policy_pg_loss_tokens": [0.0, -0.0, 0.0],
  "reduced_policy_loss": 0.0,
  "leaked_positions": []
}

Fixed value: opsm_mask == [0.0, 0.0, 0.0], which rejects the whole sequence and reduces policy loss to 0.0.

3. Reproduction Recipe

Save the validation runner from section 4 as opsm_repro.py, then use a refreshed slime checkout as ${SLIME_REPO} and an output directory as ${OUTPUT_DIR}:

cd "${SLIME_REPO}"
python3 -m pip install --user -e . --no-deps
SLIME_REPO="${SLIME_REPO}" OUTPUT_DIR="${OUTPUT_DIR}" python3 opsm_repro.py
python3 -m json.tool "${OUTPUT_DIR}/opsm_validation.json"
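
For scripted use, the JSON written by the runner can be checked programmatically instead of read by eye. The helpers below are hypothetical and not part of this PR; they only rely on the status and training_signal_effect.leaked_token_count keys that the runner in section 4 emits.

```python
import json
from pathlib import Path


def is_fixed(payload: dict) -> bool:
    """Return True when the runner reported a fixed status and no leaked tokens."""
    effect = payload.get("training_signal_effect", {})
    return payload.get("status") == "fixed" and effect.get("leaked_token_count") == 0


def load_and_check(output_dir: str) -> bool:
    """Read opsm_validation.json from output_dir and apply is_fixed to it."""
    path = Path(output_dir) / "opsm_validation.json"
    return is_fixed(json.loads(path.read_text(encoding="utf-8")))
```

Calling load_and_check(os.environ["OUTPUT_DIR"]) after the recipe would be expected to return False on an unfixed checkout and True on the repair branch.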

4. Validation Runner

#!/usr/bin/env python3
import json
import os
import sys
from argparse import Namespace
from pathlib import Path

repo = Path(os.environ["SLIME_REPO"]).resolve()
out = Path(os.environ["OUTPUT_DIR"]).resolve()
out.mkdir(parents=True, exist_ok=True)
sys.path.insert(0, str(repo))

import torch
from slime.utils.ppo_utils import compute_opsm_mask, compute_policy_loss

args = Namespace(opsm_delta=0.1)
full_log_probs = [torch.tensor([0.0, 0.0, 0.0], dtype=torch.float32)]
full_old_log_probs = [torch.tensor([1.0, 1.0, 1.0], dtype=torch.float32)]
advantages = [torch.tensor([-0.5, 0.25, -0.25], dtype=torch.float32)]
loss_masks = [torch.tensor([1.0, 1.0, 1.0], dtype=torch.float32)]

ppo_kl = full_old_log_probs[0] - full_log_probs[0]
pg_loss, pg_clipfrac = compute_policy_loss(
    ppo_kl=ppo_kl,
    advantages=advantages[0],
    eps_clip=0.2,
    eps_clip_high=0.2,
)
opsm_mask, opsm_clipfrac = compute_opsm_mask(
    args=args,
    full_log_probs=full_log_probs,
    full_old_log_probs=full_old_log_probs,
    advantages=advantages,
    loss_masks=loss_masks,
)

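# The trigger must be rejected as a whole, so the expected keep-mask is all zeros.
expected_opsm_mask = torch.zeros_like(opsm_mask)
masked_pg_loss = pg_loss * opsm_mask
expected_masked_pg_loss = pg_loss * expected_opsm_mask
# Reduce each masked loss to a mean over valid (loss_mask != 0) tokens.
reduced_loss = (masked_pg_loss * loss_masks[0]).sum() / loss_masks[0].sum()
expected_reduced_loss = (expected_masked_pg_loss * loss_masks[0]).sum() / loss_masks[0].sum()
# A leaked position is a valid token that stays active although it should be masked out.
leaked_positions = ((opsm_mask != 0) & (expected_opsm_mask == 0) & (loss_masks[0] != 0)).nonzero(as_tuple=True)[0]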

payload = {
    "status": "fixed" if leaked_positions.numel() == 0 and torch.equal(opsm_mask, expected_opsm_mask) else "reproduced",
    "trigger": {
        "full_log_probs": [full_log_probs[0].tolist()],
        "full_old_log_probs": [full_old_log_probs[0].tolist()],
        "advantages": [advantages[0].tolist()],
        "loss_masks": [loss_masks[0].tolist()],
        "opsm_delta": args.opsm_delta,
    },
    "observed": {
        "opsm_mask": opsm_mask.tolist(),
        "opsm_clipfrac": float(opsm_clipfrac.item()),
        "policy_pg_loss_tokens": pg_loss.tolist(),
        "policy_pg_clipfrac_tokens": pg_clipfrac.tolist(),
        "masked_policy_pg_loss_tokens": masked_pg_loss.tolist(),
        "reduced_policy_loss": float(reduced_loss.item()),
    },
    "expected_fixed": {
        "opsm_mask": expected_opsm_mask.tolist(),
        "opsm_clipfrac": 1.0,
        "masked_policy_pg_loss_tokens": expected_masked_pg_loss.tolist(),
        "reduced_policy_loss": float(expected_reduced_loss.item()),
    },
    "training_signal_effect": {
        "leaked_positions": leaked_positions.tolist(),
        "leaked_token_count": int(leaked_positions.numel()),
        "loss_delta": float((reduced_loss - expected_reduced_loss).item()),
    },
}
print(json.dumps(payload, indent=2, sort_keys=True))
(out / "opsm_validation.json").write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")

5. Observed Output

Unfixed upstream validation from RL-Sentinel:

{
  "status": "reproduced",
  "observed": {
    "opsm_mask": [0.0, 1.0, 0.0],
    "opsm_clipfrac": 0.6666666865348816,
    "masked_policy_pg_loss_tokens": [0.0, -0.09196986258029938, 0.0],
    "reduced_policy_loss": -0.030656620860099792
  },
  "training_signal_effect": {
    "leaked_positions": [1],
    "leaked_token_count": 1,
    "non_crash": true
  }
}

Fixed branch verification:

{
  "status": "fixed",
  "observed": {
    "opsm_mask": [0.0, 0.0, 0.0],
    "opsm_clipfrac": 1.0,
    "masked_policy_pg_loss_tokens": [0.0, -0.0, 0.0],
    "reduced_policy_loss": 0.0
  },
  "training_signal_effect": {
    "leaked_positions": [],
    "leaked_token_count": 0,
    "non_crash": true
  }
}

6. Root Cause

The old implementation calculated seq_kl at sequence scope:

seq_kl = ((full_old_log_prob - full_log_prob) * loss_mask).sum() / torch.clamp_min(loss_mask.sum(), 1)

It then created a token-wise rejection mask:

mask = ((advantage < 0) & (seq_kl > args.opsm_delta)).float()

That mixes a sequence-level off-policy decision with token-level advantage signs. When the same rejected sequence has a positive-advantage token, that token remains active and contributes to policy loss.
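
The leak is easy to see in isolation. The sketch below mimics the old token-wise decision with plain lists; the helper name is hypothetical, and the final keep-mask is assumed to be one minus the rejection mask, which is inferred from the reported opsm_mask rather than copied from slime.

```python
def old_tokenwise_keep_mask(advantages, seq_kl, opsm_delta):
    """Old behaviour: one sequence-level KL, but a per-token advantage test."""
    reject = [1.0 if (a < 0 and seq_kl > opsm_delta) else 0.0 for a in advantages]
    # Assumed convention: tokens that are not rejected stay active.
    return [1.0 - r for r in reject]


seq_kl, opsm_delta = 1.0, 0.1

# Mixed advantages: the positive-advantage token at position 1 leaks through.
print(old_tokenwise_keep_mask([-0.5, 0.25, -0.25], seq_kl, opsm_delta))  # [0.0, 1.0, 0.0]

# Uniformly negative advantages: the old code happens to reject everything.
print(old_tokenwise_keep_mask([-0.5, -0.25, -0.25], seq_kl, opsm_delta))  # [0.0, 0.0, 0.0]
```

The second call shows why the bug stays hidden whenever every token in a sequence shares the same advantage sign.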

7. Fix

The fix:

computes seq_kl over valid tokens only, so log-prob deltas at masked positions never enter the sum;
computes a valid-token mean seq_advantage;
evaluates sequence_rejected = (seq_advantage < 0) & (seq_kl > args.opsm_delta);
broadcasts that scalar decision to a per-token mask for the existing downstream multiplication.
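
The four steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patched slime code: lists stand in for tensors, the helper name sequence_keep_mask is hypothetical, and the NaN case is only one plausible reason for selecting valid tokens instead of multiplying by the mask.

```python
import math


def sequence_keep_mask(log_probs, old_log_probs, advantages, loss_mask, opsm_delta):
    """Decide rejection once per sequence, then broadcast it to every token."""
    valid = [i for i, m in enumerate(loss_mask) if m != 0]
    n_valid = max(len(valid), 1)
    # Only valid positions are read, so masked deltas cannot affect the result.
    seq_kl = sum(old_log_probs[i] - log_probs[i] for i in valid) / n_valid
    seq_advantage = sum(advantages[i] for i in valid) / n_valid
    sequence_rejected = seq_advantage < 0 and seq_kl > opsm_delta
    keep = 0.0 if sequence_rejected else 1.0
    return [keep] * len(advantages)


# The mixed-advantage trigger is now rejected as a whole.
mask = sequence_keep_mask([0.0] * 3, [1.0] * 3, [-0.5, 0.25, -0.25], [1, 1, 1], 0.1)
print(mask)  # [0.0, 0.0, 0.0]

# A masked position holding NaN does not disturb the decision ...
nan = float("nan")
old_lp = [1.0, 1.0, 1.0, nan]
new_lp = [0.0, 0.0, 0.0, 0.0]
lm = [1, 1, 1, 0]
mask_nan = sequence_keep_mask(new_lp, old_lp, [-0.5, 0.25, -0.25, 0.0], lm, 0.1)
print(mask_nan)  # [0.0, 0.0, 0.0, 0.0]

# ... whereas multiplying by the mask would propagate it, since NaN * 0 is NaN.
weighted_sum = sum((o - l) * m for o, l, m in zip(old_lp, new_lp, lm))
print(math.isnan(weighted_sum))  # True
```

Returning a per-token list keeps the downstream multiplication unchanged, which matches the intent of the last step in the list above.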

8. Tests And Checks

Commands run on the repair branch:

python3 -m pytest -q tests/test_ppo_utils.py
python3 -m ruff check slime/utils/ppo_utils.py tests/test_ppo_utils.py
python3 -m pre_commit run --files slime/utils/ppo_utils.py tests/test_ppo_utils.py
REPAIR_REPO="${SLIME_REPO}" VERIFY_DIR="${OUTPUT_DIR}" python3 run_verification.py

Results:

tests/test_ppo_utils.py .. [100%]
ruff check: All checks passed.
pre-commit: ruff, autoflake, isort, and black passed.
verification status: fixed

9. Contribution And Duplicate Checks

Contribution guidance checked: CONTRIBUTING.md from refreshed THUDM/slime main.

Duplicate checks performed:

BUG_FINDINGS.md: no existing OPSM mixed-advantage sequence leak entry.
RL-Sentinel bug DB: no existing entry for this trigger.
Historical RL-Sentinel loop artifacts: OPSM implementation references only; no prior mixed-advantage leak validation or fix.
myslime local and remote fix branches: no OPSM fix branch.
myslime/pr_drafts: no OPSM draft.
GitHub CLI was unavailable or unauthenticated; GitHub REST issue/PR search was used.
THUDM/slime REST searches for OPSM, compute_opsm_mask, Off-Policy Sequence Masking, and opsm_delta found OPSM feature/background PRs such as THUDM/slime#999, but no exact bug report or fix for this mixed-advantage leak.

10. Related PRs Or Fixes

THUDM/slime#999 added OPSM support. This is related because it introduced the OPSM path, but it is not an exact duplicate; it does not fix the sequence-level decision being applied token-wise for mixed advantages.
Prior RL-Sentinel slime findings for GSPO masked old-logprob NaNs and rollout reward normalization are not duplicates. They touch different invariants and code paths.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 29a0338e27


chatgpt-codex-connector Bot reviewed May 18, 2026


Comment thread slime/utils/ppo_utils.py Outdated

haoyang9804 force-pushed the fix/opsm-mixed-adv-sequence-leak branch from 29a0338 to 147d795 Compare May 18, 2026 04:56

fix: make OPSM reject whole off-policy sequences

ff40142

haoyang9804 force-pushed the fix/opsm-mixed-adv-sequence-leak branch from 147d795 to ff40142 Compare May 18, 2026 05:03

haoyang9804 wants to merge 1 commit into THUDM:main from haoyang9804:fix/opsm-mixed-adv-sequence-leak
