Fix D9: z-score thresholds from two-tailed to one-tailed#2518
Open
jucor wants to merge 1 commit into
Open
Conversation
This was referenced Mar 30, 2026
24de40d to
add1343
Compare
ballPointPenguin
approved these changes
Apr 26, 2026
Contributor
There was a problem hiding this comment.
Pull request overview
This PR aims to align Delphi’s representativeness significance gating with the Clojure reference by switching z-score significance checks from two-tailed (abs(z) >= threshold) to one-tailed, strict comparisons (z > threshold), and by removing the now-dead pca_kmeans_rep/stats.py module + its associated tests/docs.
Changes:
- Update z-score significance semantics to one-tailed + strict
>inrepness.py, and update rep-comment selection filters accordingly. - Remove
delphi/polismath/pca_kmeans_rep/stats.pyanddelphi/tests/test_stats.py; adjust docs to point atrepness.py. - Expand D9 discrepancy tests and update repness unit test boundary expectations.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
delphi/tests/test_stats.py |
Removes tests for the deleted stats.py module. |
delphi/tests/test_repness_unit.py |
Updates z-significance unit tests for one-tailed strict behavior. |
delphi/tests/test_old_format_repness.py |
Mirrors the updated z-significance unit test expectations for the old-format path. |
delphi/tests/test_discrepancy_fixes.py |
Adds D9 semantic tests and new (currently xfailed) constant/threshold checks. |
delphi/polismath/pca_kmeans_rep/stats.py |
Deletes the legacy stats helper module. |
delphi/polismath/pca_kmeans_rep/repness.py |
Switches z-sig checks and rep-comment selection to one-tailed strict comparisons. |
delphi/docs/usage_examples.md |
Updates docs to reference statistical helpers in repness.py (needs name correction). |
delphi/docs/PLAN_DISCREPANCY_FIXES.md |
Updates plan metadata / cross-references (currently references an inconsistent PR number). |
delphi/docs/HANDOFF_PR14_VECTORIZED_REFACTOR.md |
Adds a handoff doc for the upcoming vectorized refactor/testing work. |
Comments suppressed due to low confidence (1)
delphi/tests/test_discrepancy_fixes.py:554
- These D9 tests are still marked
xfail, so CI won’t catch regressions and the suite won’t enforce the new one-tailedZ_90value. OnceZ_90is updated, remove thexfailmarker so the threshold change is actually validated.
@pytest.mark.xfail(reason="D9: Z_90=1.645 (two-tailed), target is 1.2816 (one-tailed)")
def test_z90_matches_clojure(self):
"""Z_90 should be one-tailed (1.2816), not two-tailed (1.645)."""
check.almost_equal(Z_90, 1.2816, abs=0.001,
msg=f"Z_90 should be 1.2816 (one-tailed), got {Z_90}")
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
43
to
+47
|
|
||
| Returns: | ||
| True if significant at 90% confidence | ||
| """ | ||
| return abs(z) >= Z_90 | ||
| return z > Z_90 |
Comment on lines
56
to
+60
|
|
||
| Returns: | ||
| True if significant at 95% confidence | ||
| """ | ||
| return abs(z) >= Z_95 | ||
| return z > Z_95 |
Comment on lines
556
to
560
| @pytest.mark.xfail(reason="D9: Z_95=1.96 (two-tailed), target is 1.6449 (one-tailed)") | ||
| def test_z95_matches_clojure(self): | ||
| """Z_95 should be one-tailed (1.6449), not two-tailed (1.96).""" | ||
| check.almost_equal(Z_95, 1.6449, abs=0.001, | ||
| msg=f"Z_95 should be 1.6449 (one-tailed), got {Z_95}") |
## Summary - Fix D9: change z-score significance thresholds from two-tailed to one-tailed, matching Clojure's `stats.clj` - `Z_90`: 1.645 → 1.2816, `Z_95`: 1.96 → 1.6449 - Also resolves an internal inconsistency — Python's own `stats.py` already used the correct one-tailed values ## Why one-tailed? The proportion tests in Polis check whether a comment's agree (or disagree) rate is **significantly above 0.5** — a directional hypothesis. One-tailed is correct because we only care about one direction at a time. The two-tailed values were 28% more conservative, causing fewer comments to pass significance. ## Test plan - [x] TDD: removed xfail from 3 D9 tests, confirmed red (3 failures), applied fix, confirmed green - [x] Discrepancy tests: 63 passed, 6 skipped, 50 xfailed (all 7 datasets including private) - [x] Regression tests: 19 passed (all 7 datasets, golden snapshots re-recorded) - [x] Repness unit tests: 36 passed (boundary values updated to match new thresholds) - [x] 4 pre-existing failures unrelated to D9 (PCA incremental blobs, DB-dependent tests) 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Squashed commits - Plan: add task parallelization analysis for remaining fixes - Fix D9: match Clojure z-sig semantics (strict >, no abs) and remove dead stats.py - Re-record vw golden snapshot after D9 z-sig semantics change - Update plan: mark D9 as done, note stats.py removal for next PR - Add mathematical rigor and exhaustive testing guidance to fix plan - Plan: move PR 14 earlier (prerequisite for blob tests) + add handoff doc - Re-record golden snapshots after upstream cascade commit-id:0194003d
Delphi Coverage Report
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
stats.cljZ_90: 1.645 → 1.2816,Z_95: 1.96 → 1.6449stats.pyalready used the correct one-tailed valuesWhy one-tailed?
The proportion tests in Polis check whether a comment's agree (or disagree) rate is significantly above 0.5 — a directional hypothesis. One-tailed is correct because we only care about one direction at a time. The two-tailed values were 28% more conservative, causing fewer comments to pass significance.
Test plan
🤖 Generated with Claude Code
Squashed commits
commit-id:0194003d
Stack: