fix: make OPSM reject whole off-policy sequences #1917

Open
haoyang9804 wants to merge 1 commit into THUDM:main from haoyang9804:fix/opsm-mixed-adv-sequence-leak

1. Summary

slime.utils.ppo_utils.compute_opsm_mask computes the OPSM KL at sequence level, but the old code applied the mask decision token by token, using each token's own advantage sign. A high-KL off-policy sequence with a negative sequence-level advantage and mixed token advantages could therefore keep its positive-advantage tokens active in the policy-gradient loss. The training signal changes silently, with no crash, instead of the sequence being rejected.

This patch computes the OPSM decision from valid tokens at sequence scope and broadcasts that decision to every token in the sequence. It also avoids using masked log-prob deltas in the sequence KL and adds a focused unit test for the mixed-advantage trigger.
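The PR does not spell out why masked log-prob deltas are kept out of the sequence KL. One plausible reason, stated here as an assumption, is that multiplying by a zero mask does not neutralise non-finite values at masked positions. A minimal illustration in plain Python, with made-up values:

```python
import math

# Hypothetical deltas: the last position is masked out and holds a junk value.
deltas = [1.0, 1.0, float("nan")]
loss_mask = [1.0, 1.0, 0.0]

# Multiply-then-sum: nan * 0.0 is still nan, so the junk value leaks in.
multiplied = sum(d * m for d, m in zip(deltas, loss_mask)) / max(sum(loss_mask), 1)

# Select-then-sum: masked positions are never read.
valid = [d for d, m in zip(deltas, loss_mask) if m != 0]
selected = sum(valid) / max(len(valid), 1)

print(math.isnan(multiplied))  # True
print(selected)                # 1.0
```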

2. Concrete Triggering Example

Minimal tensors at the compute_opsm_mask boundary:

full_log_probs = [torch.tensor([0.0, 0.0, 0.0])]
full_old_log_probs = [torch.tensor([1.0, 1.0, 1.0])]
advantages = [torch.tensor([-0.5, 0.25, -0.25])]
loss_masks = [torch.tensor([1.0, 1.0, 1.0])]
opsm_delta = 0.1

The sequence KL is 1.0, above opsm_delta = 0.1, and the valid-token mean advantage is negative (about -0.167). The whole sequence should be rejected.
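Both sequence-level quantities can be checked by hand. The sketch below uses plain Python lists in place of the tensors so that it runs without torch; the helper name masked_mean is hypothetical and not part of slime.

```python
def masked_mean(values, mask):
    """Mean of `values` over the positions where `mask` is non-zero."""
    kept = [v for v, m in zip(values, mask) if m != 0]
    return sum(kept) / max(len(kept), 1)

log_probs = [0.0, 0.0, 0.0]
old_log_probs = [1.0, 1.0, 1.0]
advantages = [-0.5, 0.25, -0.25]
loss_mask = [1.0, 1.0, 1.0]
opsm_delta = 0.1

# Per-token log-prob delta, old minus new, as in the PR's seq_kl expression.
deltas = [old - new for old, new in zip(old_log_probs, log_probs)]
seq_kl = masked_mean(deltas, loss_mask)
seq_advantage = masked_mean(advantages, loss_mask)

print(seq_kl)                                     # 1.0
print(seq_advantage)                              # about -0.1667
print(seq_kl > opsm_delta and seq_advantage < 0)  # True
```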

Before the fix:

{
  "opsm_mask": [0.0, 1.0, 0.0],
  "opsm_clipfrac": 0.6666666865348816,
  "masked_policy_pg_loss_tokens": [0.0, -0.09196986258029938, 0.0],
  "reduced_policy_loss": -0.030656620860099792,
  "leaked_positions": [1]
}

Wrong intermediate value: opsm_mask == [0.0, 1.0, 0.0], which leaves token position 1 active and yields a reduced (mean-over-tokens) policy loss of -0.030656620860099792 instead of 0.0.
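The leaked value can be reproduced by hand. The sketch assumes that compute_policy_loss implements the standard PPO clipped surrogate with ratio = exp(-ppo_kl); the PR does not show that function, so treat this as an assumption that happens to reproduce the reported number to float32 precision.

```python
import math

ppo_kl = 1.0       # old_log_prob - log_prob at position 1
advantage = 0.25   # the one positive-advantage token
eps_clip = eps_clip_high = 0.2

ratio = math.exp(-ppo_kl)                                        # about 0.3679
unclipped = -advantage * ratio                                   # about -0.09197
clipped_ratio = min(max(ratio, 1 - eps_clip), 1 + eps_clip_high)  # 0.8
clipped = -advantage * clipped_ratio                             # -0.2
token_loss = max(unclipped, clipped)

# Only position 1 survives the old mask; the mean runs over 3 valid tokens.
reduced_loss = token_loss / 3

print(token_loss)    # about -0.0919699
print(reduced_loss)  # about -0.0306566
```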

After the fix:

{
  "opsm_mask": [0.0, 0.0, 0.0],
  "opsm_clipfrac": 1.0,
  "masked_policy_pg_loss_tokens": [0.0, -0.0, 0.0],
  "reduced_policy_loss": 0.0,
  "leaked_positions": []
}

Fixed value: opsm_mask == [0.0, 0.0, 0.0], which rejects the whole sequence and yields a reduced policy loss of 0.0.

3. Reproduction Recipe

Set ${SLIME_REPO} to a refreshed slime checkout and ${OUTPUT_DIR} to an output directory, and save the validation runner from section 4 as opsm_repro.py in the checkout:

cd "${SLIME_REPO}"
python3 -m pip install --user -e . --no-deps
SLIME_REPO="${SLIME_REPO}" OUTPUT_DIR="${OUTPUT_DIR}" python3 opsm_repro.py
python3 -m json.tool "${OUTPUT_DIR}/opsm_validation.json"

4. Validation Runner

#!/usr/bin/env python3
import json
import os
import sys
from argparse import Namespace
from pathlib import Path

repo = Path(os.environ["SLIME_REPO"]).resolve()
out = Path(os.environ["OUTPUT_DIR"]).resolve()
out.mkdir(parents=True, exist_ok=True)
sys.path.insert(0, str(repo))

import torch
from slime.utils.ppo_utils import compute_opsm_mask, compute_policy_loss

args = Namespace(opsm_delta=0.1)
full_log_probs = [torch.tensor([0.0, 0.0, 0.0], dtype=torch.float32)]
full_old_log_probs = [torch.tensor([1.0, 1.0, 1.0], dtype=torch.float32)]
advantages = [torch.tensor([-0.5, 0.25, -0.25], dtype=torch.float32)]
loss_masks = [torch.tensor([1.0, 1.0, 1.0], dtype=torch.float32)]

ppo_kl = full_old_log_probs[0] - full_log_probs[0]
pg_loss, pg_clipfrac = compute_policy_loss(
    ppo_kl=ppo_kl,
    advantages=advantages[0],
    eps_clip=0.2,
    eps_clip_high=0.2,
)
opsm_mask, opsm_clipfrac = compute_opsm_mask(
    args=args,
    full_log_probs=full_log_probs,
    full_old_log_probs=full_old_log_probs,
    advantages=advantages,
    loss_masks=loss_masks,
)

expected_opsm_mask = torch.zeros_like(opsm_mask)
masked_pg_loss = pg_loss * opsm_mask
expected_masked_pg_loss = pg_loss * expected_opsm_mask
reduced_loss = (masked_pg_loss * loss_masks[0]).sum() / loss_masks[0].sum()
expected_reduced_loss = (expected_masked_pg_loss * loss_masks[0]).sum() / loss_masks[0].sum()
leaked_positions = ((opsm_mask != 0) & (expected_opsm_mask == 0) & (loss_masks[0] != 0)).nonzero(as_tuple=True)[0]

payload = {
    "status": "fixed" if leaked_positions.numel() == 0 and torch.equal(opsm_mask, expected_opsm_mask) else "reproduced",
    "trigger": {
        "full_log_probs": [full_log_probs[0].tolist()],
        "full_old_log_probs": [full_old_log_probs[0].tolist()],
        "advantages": [advantages[0].tolist()],
        "loss_masks": [loss_masks[0].tolist()],
        "opsm_delta": args.opsm_delta,
    },
    "observed": {
        "opsm_mask": opsm_mask.tolist(),
        "opsm_clipfrac": float(opsm_clipfrac.item()),
        "policy_pg_loss_tokens": pg_loss.tolist(),
        "policy_pg_clipfrac_tokens": pg_clipfrac.tolist(),
        "masked_policy_pg_loss_tokens": masked_pg_loss.tolist(),
        "reduced_policy_loss": float(reduced_loss.item()),
    },
    "expected_fixed": {
        "opsm_mask": expected_opsm_mask.tolist(),
        "opsm_clipfrac": 1.0,
        "masked_policy_pg_loss_tokens": expected_masked_pg_loss.tolist(),
        "reduced_policy_loss": float(expected_reduced_loss.item()),
    },
    "training_signal_effect": {
        "leaked_positions": leaked_positions.tolist(),
        "leaked_token_count": int(leaked_positions.numel()),
        "loss_delta": float((reduced_loss - expected_reduced_loss).item()),
    },
}
print(json.dumps(payload, indent=2, sort_keys=True))
(out / "opsm_validation.json").write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")

5. Observed Output

Unfixed upstream validation from RL-Sentinel:

{
  "status": "reproduced",
  "observed": {
    "opsm_mask": [0.0, 1.0, 0.0],
    "opsm_clipfrac": 0.6666666865348816,
    "masked_policy_pg_loss_tokens": [0.0, -0.09196986258029938, 0.0],
    "reduced_policy_loss": -0.030656620860099792
  },
  "training_signal_effect": {
    "leaked_positions": [1],
    "leaked_token_count": 1,
    "non_crash": true
  }
}

Fixed branch verification:

{
  "status": "fixed",
  "observed": {
    "opsm_mask": [0.0, 0.0, 0.0],
    "opsm_clipfrac": 1.0,
    "masked_policy_pg_loss_tokens": [0.0, -0.0, 0.0],
    "reduced_policy_loss": 0.0
  },
  "training_signal_effect": {
    "leaked_positions": [],
    "leaked_token_count": 0,
    "non_crash": true
  }
}

6. Root Cause

The old implementation calculated seq_kl at sequence scope:

seq_kl = ((full_old_log_prob - full_log_prob) * loss_mask).sum() / torch.clamp_min(loss_mask.sum(), 1)

It then created a token-wise rejection mask:

mask = ((advantage < 0) & (seq_kl > args.opsm_delta)).float()

That mixes a sequence-level off-policy decision with token-level advantage signs. When the same rejected sequence has a positive-advantage token, that token remains active and contributes to policy loss.
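A minimal re-enactment of the old token-wise decision on the trigger from section 2, in plain Python. It assumes the mask handed to the loss is the complement of the rejection mask, which is inferred from the reported opsm_mask rather than shown in the PR; old_token_wise_keep_mask is a hypothetical name.

```python
def old_token_wise_keep_mask(advantages, seq_kl, opsm_delta):
    """Per-token rejection: each token is judged by its own advantage sign."""
    rejected = [float(a < 0 and seq_kl > opsm_delta) for a in advantages]
    # Assumption: downstream code multiplies the loss by 1 - rejected.
    keep = [1.0 - r for r in rejected]
    clipfrac = sum(rejected) / len(rejected)
    return keep, clipfrac

keep, clipfrac = old_token_wise_keep_mask([-0.5, 0.25, -0.25], seq_kl=1.0, opsm_delta=0.1)
print(keep)      # [0.0, 1.0, 0.0]
print(clipfrac)  # about 0.6667
```

Position 1 has a positive advantage, so its own test fails and it stays active even though the sequence as a whole is off-policy.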

7. Fix

The fix:

  • computes seq_kl from valid tokens only, so log-prob deltas at masked positions never enter the sum;
  • computes a valid-token mean seq_advantage;
  • evaluates sequence_rejected = (seq_advantage < 0) & (seq_kl > args.opsm_delta);
  • broadcasts that scalar decision to a per-token mask for the existing downstream multiplication.
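The four steps can be sketched in plain Python as follows. This is an illustrative stand-in, not the patched slime code: sequence_level_keep_mask is a hypothetical name, lists replace tensors, and returning the complement of the rejection decision as a keep-mask is an assumption based on the reported output.

```python
def sequence_level_keep_mask(log_probs, old_log_probs, advantages, loss_mask, opsm_delta):
    """Decide once per sequence, then broadcast the decision to every token."""
    valid = [i for i, m in enumerate(loss_mask) if m != 0]
    n = max(len(valid), 1)
    # Steps 1 and 2: statistics over valid tokens only.
    seq_kl = sum(old_log_probs[i] - log_probs[i] for i in valid) / n
    seq_advantage = sum(advantages[i] for i in valid) / n
    # Step 3: a single scalar decision for the whole sequence.
    sequence_rejected = seq_advantage < 0 and seq_kl > opsm_delta
    # Step 4: broadcast to a per-token mask (0.0 drops the token's loss).
    keep = 0.0 if sequence_rejected else 1.0
    return [keep] * len(loss_mask)

trigger = dict(
    log_probs=[0.0, 0.0, 0.0],
    old_log_probs=[1.0, 1.0, 1.0],
    advantages=[-0.5, 0.25, -0.25],
    loss_mask=[1.0, 1.0, 1.0],
)
rejected_mask = sequence_level_keep_mask(**trigger, opsm_delta=0.1)
kept_mask = sequence_level_keep_mask(**trigger, opsm_delta=2.0)
print(rejected_mask)  # [0.0, 0.0, 0.0]
print(kept_mask)      # [1.0, 1.0, 1.0]
```

With a threshold above the sequence KL the same sequence is kept in full, so no token is ever treated differently from its neighbours.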

8. Tests And Checks

Commands run on the repair branch:

python3 -m pytest -q tests/test_ppo_utils.py
python3 -m ruff check slime/utils/ppo_utils.py tests/test_ppo_utils.py
python3 -m pre_commit run --files slime/utils/ppo_utils.py tests/test_ppo_utils.py
REPAIR_REPO="${SLIME_REPO}" VERIFY_DIR="${OUTPUT_DIR}" python3 run_verification.py

Results:

tests/test_ppo_utils.py .. [100%]
ruff check: All checks passed.
pre-commit: ruff, autoflake, isort, and black passed.
verification status: fixed

9. Contribution And Duplicate Checks

Contribution guidance checked: CONTRIBUTING.md from refreshed THUDM/slime main.

Duplicate checks performed:

  • BUG_FINDINGS.md: no existing OPSM mixed-advantage sequence leak entry.
  • RL-Sentinel bug DB: no existing entry for this trigger.
  • Historical RL-Sentinel loop artifacts: OPSM implementation references only; no prior mixed-advantage leak validation or fix.
  • myslime local and remote fix branches: no OPSM fix branch.
  • myslime/pr_drafts: no OPSM draft.
  • GitHub CLI was unavailable or unauthenticated; GitHub REST issue/PR search was used.
  • THUDM/slime REST searches for OPSM, compute_opsm_mask, Off-Policy Sequence Masking, and opsm_delta found OPSM feature/background PRs such as THUDM/slime#999, but no exact bug report or fix for this mixed-advantage leak.

10. Related PRs Or Fixes

  • THUDM/slime#999 added OPSM support. This is related because it introduced the OPSM path, but it is not an exact duplicate; it does not fix the sequence-level decision being applied token-wise for mixed advantages.
  • Prior RL-Sentinel slime findings for GSPO masked old-logprob NaNs and rollout reward normalization are not duplicates. They touch different invariants and code paths.

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 29a0338e27


Comment thread on slime/utils/ppo_utils.py (outdated)
haoyang9804 force-pushed the fix/opsm-mixed-adv-sequence-leak branch from 29a0338 to 147d795 (May 18, 2026 04:56)
haoyang9804 force-pushed the fix/opsm-mixed-adv-sequence-leak branch from 147d795 to ff40142 (May 18, 2026 05:03)