IS-MCTS: add optional variance penalty to UCT selection #1516

Open
tommygrammar wants to merge 2 commits into google-deepmind:master from tommygrammar:ismcts-risk-uct

Conversation

@tommygrammar

Summary

Adds an optional variance penalty to the IS-MCTS UCT selection rule via a new risk_lambda_ parameter. When nonzero, the penalty biases action selection toward branches with lower return variance, implementing a risk-sensitive search policy. When zero (the default), the original mean-based selection is recovered exactly. There is no behavior change unless explicitly enabled.


Motivation

Standard UCB/UCT scores actions by expected value alone: two actions with
the same visit count and mean return are indistinguishable regardless of
return variance. A variance-aware selection criterion distinguishes them,
preferring actions whose outcomes are more stable relative to their expected
value. This changes the preference ordering of the search rather than just
its accuracy.

The modification is intended to enable empirical study of risk-sensitive MCTS behavior in imperfect-information games, not to assert that variance penalization improves play strength.


Implementation

The UCT selection score is updated from:

uct_val = mean + uct_c_ * sqrt(log(node->total_visits) / visits);

to:

uct_val = (mean - risk_lambda_ * var) +
          uct_c_ * sqrt(log(node->total_visits) / visits);

Where var is the sample variance of returns observed at the node,
computed incrementally alongside the existing mean accumulation.

risk_lambda_ defaults to 0.0. At this value the variance term
vanishes and original behavior is preserved exactly.
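As a sketch, the incremental variance tracking and the modified selection
score could look like the following. Names such as NodeStats, AddReturn, and
UctScore are illustrative, not OpenSpiel's actual identifiers; the update is
the standard Welford-style running-variance accumulator, which the PR's
"computed incrementally" description is consistent with but does not
explicitly name.

```cpp
#include <cmath>

// Hypothetical per-node accumulator: running mean plus sum of squared
// deviations, updated one return at a time (Welford's algorithm).
struct NodeStats {
  int visits = 0;
  double mean = 0.0;
  double m2 = 0.0;  // sum of squared deviations from the running mean

  void AddReturn(double ret) {
    ++visits;
    double delta = ret - mean;
    mean += delta / visits;
    m2 += delta * (ret - mean);
  }

  // Sample variance of returns seen at this node.
  double Variance() const { return visits > 1 ? m2 / (visits - 1) : 0.0; }
};

// Risk-adjusted UCT score. With risk_lambda == 0.0 the penalty term
// vanishes and this reduces to the original mean-based rule.
double UctScore(const NodeStats& child, int total_visits, double uct_c,
                double risk_lambda) {
  return (child.mean - risk_lambda * child.Variance()) +
         uct_c * std::sqrt(std::log(static_cast<double>(total_visits)) /
                           child.visits);
}
```

Note the exploration term is untouched; only the exploitation term is
shifted down in proportion to observed return variance.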

Files changed:

  • open_spiel/algorithms/is_mcts.h - added risk_lambda_ member,
    variance accumulator per node
  • open_spiel/algorithms/is_mcts.cc - modified UCT selection score,
    updated variance accumulator on node visits

Baseline Compatibility

Default risk_lambda_ = 0.0 preserves the original OpenSpiel IS-MCTS
behavior exactly. The modification is opt-in — existing code using
IS-MCTS with no parameter changes is unaffected.


Empirical Validation

To validate that the modification produces a meaningfully different
search policy rather than noise, I compared original and modified builds
across four game settings: Kuhn Poker (2-player), Kuhn Poker (3-player),
Leduc Poker (2-player), Leduc Poker (3-player). Each setting ran 1500+
games. Complete JSON outputs and benchmark code are included in the PR.

Global search behavior

Metric                         Original   Modified   Change
Policy entropy                 0.5426     0.2552     −53.0%
Max visit ratio (root)         0.8307     0.9327     +12.3%
Action 0 count                                       +4,598,878
Action 1 count                                       −7,015,759
Action 2 count                                       −766,932
JS divergence (action dist.)                         0.0498 (nontrivial)
Chi-square p-value                                   0.0 (reject equality)

The 53% drop in policy entropy and 12% increase in root visit
concentration confirm the modified build is substantially less
exploratory and more decisive. The chi-square test rejects equality
of action distributions overwhelmingly (p = 0.0, dof = 2,
statistic = 3,211,252).
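For reference, the entropy and concentration metrics above can be computed
directly from root visit counts. This is a sketch, not the PR's benchmark
code (which is included in the PR itself); the function names are
illustrative, and it assumes entropy and JS divergence are measured in nats,
which the PR does not state explicitly.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Shannon entropy (nats) of the action distribution implied by visit counts.
double PolicyEntropy(const std::vector<long long>& visits) {
  long long total = 0;
  for (long long v : visits) total += v;
  double h = 0.0;
  for (long long v : visits) {
    if (v == 0) continue;
    double p = static_cast<double>(v) / total;
    h -= p * std::log(p);
  }
  return h;
}

// Fraction of visits going to the most-visited action.
double MaxVisitRatio(const std::vector<long long>& visits) {
  long long total = 0, max_v = 0;
  for (long long v : visits) { total += v; if (v > max_v) max_v = v; }
  return static_cast<double>(max_v) / total;
}

// Jensen-Shannon divergence (nats) between two action distributions.
double JsDivergence(const std::vector<double>& p, const std::vector<double>& q) {
  auto kl = [](const std::vector<double>& a, const std::vector<double>& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
      if (a[i] > 0) d += a[i] * std::log(a[i] / b[i]);
    return d;
  };
  std::vector<double> m(p.size());
  for (std::size_t i = 0; i < p.size(); ++i) m[i] = 0.5 * (p[i] + q[i]);
  return 0.5 * kl(p, m) + 0.5 * kl(q, m);
}
```

A drop in PolicyEntropy together with a rise in MaxVisitRatio is exactly the
"less exploratory, more decisive" signature described above.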

Game-level results

Setting          Total games   Actions (orig)   Actions (mod)   P0 mean diff   P1 mean diff
Kuhn 2-player    1503          6,408            6,339           +0.0100        −0.0100
Kuhn 3-player    1503          9,908            9,857           +0.1051        −0.0645
Leduc 2-player   1503          10,669           9,298           +0.1191        −0.1191
Leduc 3-player   1504          15,786           13,021          +0.2926        −0.1436

Total game counts are essentially unchanged across all settings,
confirming the comparison is structurally fair.

Key action distribution shifts

Kuhn 2-player: Bet −256, Pass +187.
More conservative — the modified build avoids volatile betting branches.

Kuhn 3-player: Bet +95, Pass −146.
More aggressive — in multi-player interaction, some assertive lines
have cleaner risk-adjusted profiles than passive continuation.

Leduc 2-player: Call −1236, Fold +322.
Reduced tolerance for marginal continuation. The modified build prefers
early exit over uncertain middle-ground hands.

Leduc 3-player: Call −4060, Fold +1316, Raise +784.
Strongest effect. The strategy landscape reorganises substantially: calls
collapse while folds and raises both rise. The modified build no longer
tolerates middle-ground continuation lines; it either commits to a strong
branch or exits early.


Interpretation

The variance penalty changes the search from expectation-driven to
risk-adjusted. The downstream effects are consistent across all four
settings and internally coherent:

Policy becomes more concentrated. Entropy drops 53%, root action
dominance increases. The modified build is more decisive and less
exploratory.

Strategic polarisation increases in richer games. In Leduc,
especially 3-player, the collapse of calls and rise of both folds and
raises is the clear empirical signature of variance penalisation:
mediocre continuation lines are suppressed, strong branches are
amplified, uncertain middle lines are removed.

Variance penalisation amplifies first-mover advantage. Player 0
improves consistently across all four settings — the only player to do
so. The effect is modest in Kuhn 2-player (+0.010) and substantial in
Leduc 3-player (+0.293), where player 0 moves from negative to positive
mean return. In sequential imperfect-information games, the first mover
can disproportionately influence the shape of the game tree. A
variance-penalised search appears to value that initiative more than a
pure expected-value search does — favoring branches that settle
uncertainty sooner and convert positional advantage into stable outcomes.

Variance penalisation does not necessarily reduce observed outcome
variance.
In Leduc, return standard deviations increase despite the
risk-sensitive objective. The modified build is optimising a
risk-adjusted criterion, not minimising final outcome spread. The
resulting strategies can produce more polarised realised outcomes even
while being selected for lower internal variance penalty. This
distinction — between the search objective and the downstream empirical
result — is important for interpreting the data correctly.

Effect scales with game richness. The modification produces mild
shifts in Kuhn (small, shallow game) and substantial reorganisation in
Leduc (richer, deeper game with more branching). This is consistent with
the hypothesis that variance-penalised search has more opportunity to
redirect the tree in games with more complex continuation structures.


What This Is Not

This PR does not claim that variance penalisation improves play strength
or produces a stronger agent. The empirical results show that the
modified build produces a different strategy profile, not a better one.
The intent is to enable study of risk-sensitive search behavior and to
provide a parameterised baseline for researchers interested in
risk-adjusted MCTS variants.


tommygrammar and others added 2 commits April 2, 2026 13:00
Introduces a risk_lambda_ parameter to the IS-MCTS UCT selection rule.
When nonzero, the selection score becomes:

    uct_val = (mean - risk_lambda_ * var) +
              uct_c_ * sqrt(log(node->total_visits) / visits)

When risk_lambda_ == 0.0 (the default), the variance term vanishes and
original mean-based selection is recovered exactly. No behavior change
unless the penalty is explicitly enabled.

Empirical validation across Kuhn Poker and Leduc Poker (2-player and
3-player variants) confirms the modification produces a meaningfully
different search policy: policy entropy drops 53%, root visit
concentration increases 12%, and action distributions diverge
significantly from the canonical implementation (JS divergence 0.0498,
chi-square p=0.0). Player 0 improves consistently across all four
settings, consistent with variance penalisation amplifying first-mover
advantage in sequential imperfect-information games.