IS-MCTS: add optional variance penalty to UCT selection #1516
Open
tommygrammar wants to merge 2 commits into google-deepmind:master
Introduces a risk_lambda_ parameter to the IS-MCTS UCT selection rule.
When nonzero, the selection score becomes:
uct_val = (mean - risk_lambda_ * var) +
uct_c_ * sqrt(log(node->total_visits) / visits)
When risk_lambda_ == 0.0 (the default), the variance term vanishes and
original mean-based selection is recovered exactly. No behavior change
unless the penalty is explicitly enabled.
Empirical validation across Kuhn Poker and Leduc Poker (2-player and
3-player variants) confirms the modification produces a meaningfully
different search policy: policy entropy drops 53%, root visit
concentration increases 12%, and action distributions diverge
significantly from the canonical implementation (JS divergence 0.0498,
chi-square p=0.0). Player 0 improves consistently across all four
settings, consistent with variance penalisation amplifying first-mover
advantage in sequential imperfect-information games.
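For reference, the Jensen-Shannon divergence quoted above can be computed from two action-count vectors roughly as follows. This is a minimal sketch, not the benchmark code shipped with the PR: the name `JsDivergence` is hypothetical, and the natural logarithm is an assumption (the log base is not stated here), under which the value is bounded above by ln 2.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: Jensen-Shannon divergence (in nats) between the
// distributions implied by two vectors of non-negative action counts.
double JsDivergence(const std::vector<double>& a,
                    const std::vector<double>& b) {
  assert(a.size() == b.size());
  const double ta = std::accumulate(a.begin(), a.end(), 0.0);
  const double tb = std::accumulate(b.begin(), b.end(), 0.0);
  assert(ta > 0.0 && tb > 0.0);
  double js = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double p = a[i] / ta;
    const double q = b[i] / tb;
    const double m = 0.5 * (p + q);  // Mixture distribution.
    // Zero-probability terms contribute nothing (0 * log 0 is taken as 0).
    if (p > 0.0) js += 0.5 * p * std::log(p / m);
    if (q > 0.0) js += 0.5 * q * std::log(q / m);
  }
  return js;
}
```

Identical count vectors give 0, and vectors with disjoint support give ln 2, so a value of 0.0498 sits near the low end of that range.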
Summary
Adds an optional variance penalty to the IS-MCTS UCT selection rule via a new risk_lambda_ parameter. When nonzero, the penalty biases action selection toward branches with lower return variance, implementing a risk-sensitive search policy. When zero (the default), the original mean-based selection is recovered exactly. There is no behavior change unless explicitly enabled.
Motivation
Standard UCB/UCT scores two actions with the same mean return and visit count identically, regardless of return variance. High-variance and low-variance actions are indistinguishable under pure expected value. A variance-aware selection criterion distinguishes them, preferring actions whose outcomes are more stable relative to their expected value. This changes the preference ordering of the search rather than just its accuracy.
The modification is intended to enable empirical study of risk-sensitive MCTS behavior in imperfect-information games, not to assert that variance penalization improves play strength.
Implementation
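The UCT selection score is updated from:
uct_val = mean +
uct_c_ * sqrt(log(node->total_visits) / visits)
to:
uct_val = (mean - risk_lambda_ * var) +
uct_c_ * sqrt(log(node->total_visits) / visits)
Where
var is the sample variance of returns observed at the node, computed incrementally alongside the existing mean accumulation.
risk_lambda_ defaults to 0.0. At this value the variance term vanishes and original behavior is preserved exactly.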
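The score can be written as a small standalone function. The sketch below is illustrative only: `RiskAdjustedUct` and its parameter names are hypothetical and do not mirror the actual code in is_mcts.cc, and it assumes the child has been visited at least once.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not the OpenSpiel API: risk-adjusted UCT score for one
// child, i.e. (mean - lambda * var) plus the usual exploration bonus.
double RiskAdjustedUct(double mean, double var, int visits, int total_visits,
                       double uct_c, double risk_lambda) {
  assert(visits > 0 && total_visits >= visits);
  const double exploration =
      uct_c * std::sqrt(std::log(static_cast<double>(total_visits)) / visits);
  return (mean - risk_lambda * var) + exploration;
}
```

With `risk_lambda` set to 0.0 the variance term drops out and the function reduces to the mean plus the exploration bonus. With a positive `risk_lambda` the score falls by exactly `risk_lambda * var`.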
Files changed:
open_spiel/algorithms/is_mcts.h - added risk_lambda_ member, variance accumulator per node
open_spiel/algorithms/is_mcts.cc - modified UCT selection score, updated variance accumulator on node visits
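One common way to maintain a per-node variance incrementally is Welford's algorithm. The sketch below assumes that approach for illustration. It is not necessarily the update used in this PR, and `ReturnStats` with its members is a hypothetical type, not one from OpenSpiel.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical per-node accumulator using Welford's online algorithm.
struct ReturnStats {
  int visits = 0;
  double mean = 0.0;
  double m2 = 0.0;  // Running sum of squared deviations from the mean.

  void Add(double ret) {
    // Reject NaN/inf returns, which would poison the running statistics.
    assert(std::isfinite(ret));
    ++visits;
    const double delta = ret - mean;
    mean += delta / visits;
    m2 += delta * (ret - mean);  // Uses the already-updated mean.
  }

  // Unbiased sample variance; defined as 0 with fewer than two samples.
  double SampleVariance() const {
    return visits > 1 ? m2 / (visits - 1) : 0.0;
  }
};
```

This needs one extra double per node and avoids the cancellation errors of the naive sum-of-squares formula. Returning 0 for a single sample means a freshly expanded child carries no penalty, which is a design choice of this sketch.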
Baseline Compatibility
Default risk_lambda_ = 0.0 preserves the original OpenSpiel IS-MCTS behavior exactly. The modification is opt-in: existing code using IS-MCTS with no parameter changes is unaffected.
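To make the opt-in behavior concrete, the sketch below selects among children with and without the penalty. It is a simplified illustration under stated assumptions: `ChildStats` and `SelectChild` are hypothetical names, every child is assumed to have at least one visit, and ties go to the first index rather than being broken randomly.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Illustrative child statistics; not the OpenSpiel node type.
struct ChildStats {
  double mean;
  double var;
  int visits;
};

// Returns the index of the child with the highest risk-adjusted UCT score.
std::size_t SelectChild(const std::vector<ChildStats>& children,
                        int total_visits, double uct_c, double risk_lambda) {
  assert(!children.empty());
  const double log_total = std::log(static_cast<double>(total_visits));
  std::size_t best = 0;
  double best_score = std::numeric_limits<double>::lowest();
  for (std::size_t i = 0; i < children.size(); ++i) {
    const ChildStats& c = children[i];
    assert(c.visits > 0);
    const double score = (c.mean - risk_lambda * c.var) +
                         uct_c * std::sqrt(log_total / c.visits);
    if (score > best_score) {
      best_score = score;
      best = i;
    }
  }
  return best;
}
```

For two children with equal visit counts, means 0.50 and 0.45, and variances 1.00 and 0.01, a `risk_lambda` of 0.0 selects the first child on its higher mean. A `risk_lambda` of 0.5 selects the second, because the first child's adjusted value drops to 0.0 while the second stays at 0.445.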
Empirical Validation
To validate that the modification produces a meaningfully different
search policy rather than noise, I compared original and modified builds
across four game settings: Kuhn Poker (2-player), Kuhn Poker (3-player),
Leduc Poker (2-player), Leduc Poker (3-player). Each setting ran 1500+
games. Complete JSON outputs and benchmark code are included in the PR.
Global search behavior
The 53% drop in policy entropy and 12% increase in root visit
concentration confirm the modified build is substantially less
exploratory and more decisive. The chi-square test rejects equality
of action distributions overwhelmingly (p = 0.0, dof = 2,
statistic = 3,211,252).
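The metrics in this paragraph can be computed from raw counts roughly as follows. This is a sketch under assumptions rather than the benchmark code itself: all function names are hypothetical, "root visit concentration" is assumed to mean the visit share of the most-visited root action, and the chi-square statistic is assumed to be a Pearson test of homogeneity over a 2 x k table of action counts (original build vs modified build).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Shannon entropy (in nats) of the distribution implied by the counts.
double CountEntropy(const std::vector<double>& counts) {
  const double total = std::accumulate(counts.begin(), counts.end(), 0.0);
  assert(total > 0.0);
  double h = 0.0;
  for (double c : counts) {
    if (c > 0.0) {
      const double p = c / total;
      h -= p * std::log(p);
    }
  }
  return h;
}

// Share of visits on the most-visited action (assumed definition).
double VisitConcentration(const std::vector<double>& counts) {
  const double total = std::accumulate(counts.begin(), counts.end(), 0.0);
  assert(total > 0.0);
  return *std::max_element(counts.begin(), counts.end()) / total;
}

// Pearson chi-square statistic for a 2 x k table with rows a and b.
double ChiSquareStatistic(const std::vector<double>& a,
                          const std::vector<double>& b) {
  assert(a.size() == b.size());
  const double ta = std::accumulate(a.begin(), a.end(), 0.0);
  const double tb = std::accumulate(b.begin(), b.end(), 0.0);
  assert(ta > 0.0 && tb > 0.0);
  const double n = ta + tb;
  double stat = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double col = a[i] + b[i];
    if (col == 0.0) continue;  // Empty column adds nothing to the statistic.
    const double ea = ta * col / n;  // Expected count under homogeneity.
    const double eb = tb * col / n;
    stat += (a[i] - ea) * (a[i] - ea) / ea + (b[i] - eb) * (b[i] - eb) / eb;
  }
  return stat;
}
```

For a 2 x k table the degrees of freedom are k - 1, so dof = 2 would be consistent with three action categories. The sketch stops at the statistic and does not compute a p-value.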
Game-level results
Total game counts are essentially unchanged across all settings,
confirming the comparison is structurally fair.
Key action distribution shifts
Kuhn 2-player: Bet −256, Pass +187.
More conservative — the modified build avoids volatile betting branches.
Kuhn 3-player: Bet +95, Pass −146.
More aggressive — in multi-player interaction, some assertive lines
have cleaner risk-adjusted profiles than passive continuation.
Leduc 2-player: Call −1236, Fold +322.
Reduced tolerance for marginal continuation. The modified build prefers
early exit over uncertain middle-ground hands.
Leduc 3-player: Call −4060, Fold +1316, Raise +784.
Strongest effect. The strategy landscape reorganises substantially, i.e.
calls collapse while folds and raises both rise. The modified build is no
longer tolerating middle-ground continuation lines; it either commits
to a strong branch or exits early.
Interpretation
The variance penalty changes the search from expectation-driven to
risk-adjusted. The downstream effects are consistent across all four
settings and internally coherent:
Policy becomes more concentrated. Entropy drops 53%, root action
dominance increases. The modified build is more decisive and less
exploratory.
Strategic polarisation increases in richer games. In Leduc,
especially 3-player, the collapse of calls and rise of both folds and
raises is the clear empirical signature of variance penalisation:
mediocre continuation lines are suppressed, strong branches are
amplified, uncertain middle lines are removed.
Variance penalisation amplifies first-mover advantage. Player 0
improves consistently across all four settings — the only player to do
so. The effect is modest in Kuhn 2-player (+0.010) and substantial in
Leduc 3-player (+0.293), where player 0 moves from negative to positive
mean return. In sequential imperfect-information games, the first mover
can disproportionately influence the shape of the game tree. A
variance-penalised search appears to value that initiative more than a
pure expected-value search does — favoring branches that settle
uncertainty sooner and convert positional advantage into stable outcomes.
Variance penalisation does not necessarily reduce observed outcome
variance. In Leduc, return standard deviations increase despite the
risk-sensitive objective. The modified build is optimising a
risk-adjusted criterion, not minimising final outcome spread. The
resulting strategies can produce more polarised realised outcomes even
while being selected for lower internal variance penalty. This
distinction — between the search objective and the downstream empirical
result — is important for interpreting the data correctly.
Effect scales with game richness. The modification produces mild
shifts in Kuhn (small, shallow game) and substantial reorganisation in
Leduc (richer, deeper game with more branching). This is consistent with
the hypothesis that variance-penalised search has more opportunity to
redirect the tree in games with more complex continuation structures.
What This Is Not
This PR does not claim that variance penalisation improves play strength
or produces a stronger agent. The empirical results show that the
modified build produces a different strategy profile, not a better one.
The intent is to enable study of risk-sensitive search behavior and to
provide a parameterised baseline for researchers interested in
risk-adjusted MCTS variants.