[Stack 16/27] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) #2448
jucor wants to merge 4 commits
Pull request overview
Updates Delphi’s representativeness (“repness”) proportion-test implementation to match the Clojure reference, aligning downstream pat/pdt (and derived agree/disagree metrics) with the parity plan.
Changes:
- Replaced the Python one-proportion z-test with Clojure’s Wilson-score-like `prop_test(succ, n)` formula (with +1 pseudocount) and updated the vectorized equivalent.
- Updated repness callers to pass raw counts (`na`/`nd`, `ns`) rather than smoothed proportions.
- Refreshed and expanded tests (including removing the D5 xfail) and updated docs/journal plan status.
Reviewed changes
Copilot reviewed 6 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| delphi/polismath/pca_kmeans_rep/repness.py | Implements new `prop_test(succ, n)` + vectorized formula and updates callers to use raw counts. |
| delphi/tests/test_discrepancy_fixes.py | Enables D5 parity assertions (removes xfail) and expands coverage across multiple cases/edges. |
| delphi/tests/test_repness_unit.py | Updates unit + vectorized tests for the new prop-test signature/formula. |
| delphi/tests/test_old_format_repness.py | Updates backwards-compatible interface tests for the new prop-test signature. |
| delphi/docs/PLAN_DISCREPANCY_FIXES.md | Marks D5 / PR 4 as DONE in the discrepancy plan table. |
| delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md | Adds PR4 journal entry describing the D5 change, rationale, and validation steps. |
    Returns:
    -   Z-score
    +   Z-score (positive means succ/n > 0.5)
    # 70 successes out of 100: 2*sqrt(101)*((71/101)-0.5) = ~4.08
    assert np.isclose(prop_test(70, 100),
                      2 * math.sqrt(101) * (71/101 - 0.5), atol=0.01)
    # 10 successes out of 50: 2*sqrt(51)*((11/51)-0.5) = ~-4.06
Pull request overview
Copilot reviewed 6 out of 8 changed files in this pull request and generated 1 comment.
    Returns:
    -   Z-score
    +   Z-score (positive means succ/n > 0.5)
The return-value description is slightly inaccurate with the +1 pseudocount: the sign is determined by (succ+1)/(n+1) relative to 0.5 (so e.g. succ==n/2 yields a positive value for n>0). Consider rewording to avoid implying it’s based on the raw succ/n proportion.
Suggested change:

    -   Z-score (positive means succ/n > 0.5)
    +   Z-score (sign determined by (succ + 1) / (n + 1) relative to 0.5; positive when (succ + 1) / (n + 1) > 0.5)
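The review comment's point can be checked numerically. The sketch below is a minimal reimplementation of the formula quoted in this PR, not the code from `repness.py`; the sample counts are chosen only for illustration.

```python
import math


def prop_test(succ: int, n: int) -> float:
    """Formula quoted in this PR: 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)."""
    return 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)


# Raw proportion is exactly 0.5, yet the +1 pseudocount pushes the score above zero.
even_split = prop_test(5, 10)

# The score is zero when (succ + 1) / (n + 1) == 0.5, i.e. when n == 2 * succ + 1.
zero_point = prop_test(5, 11)

# One success fewer than an even split gives a negative score.
below_even = prop_test(4, 10)

print(even_split, zero_point, below_even)
```

So the sign flips at `n == 2 * succ + 1` rather than at `succ / n == 0.5`, which is what the suggested docstring wording captures.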
Extracts n-success and n-trials from every repness entry in the Clojure math blob and feeds them to Python's prop_test(). Compares output to the blob's p-test value, the ground truth oracle. Fails because Python's current prop_test uses the old formula (standard z-test), which produces different values than Clojure's Wilson-score-like formula with built-in +1 pseudocount.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
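A rough sketch of the oracle check this commit describes follows. The blob layout (a `"repness"` mapping from group id to a list of entries) and the `tid` key are assumptions made for illustration; only the `n-success`, `n-trials`, and `p-test` field names come from the commit message. The sample values are synthetic, computed by hand from the formula, and are not taken from a real Clojure blob.

```python
import math


def prop_test(succ, n):
    # Formula quoted in this PR, reimplemented here so the sketch is self-contained.
    return 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)


# Hypothetical, hand-built stand-in for the Clojure math blob.
fake_blob = {
    "repness": {
        "0": [
            {"tid": 1, "n-success": 15, "n-trials": 15, "p-test": 4.0},
            {"tid": 2, "n-success": 0, "n-trials": 3, "p-test": -1.0},
        ],
        "1": [
            {"tid": 1, "n-success": 3, "n-trials": 8, "p-test": -1 / 3},
        ],
    }
}


def find_mismatches(blob, tol=1e-9):
    """Return (group id, tid, python value, blob value) for every disagreement."""
    mismatches = []
    for gid, entries in blob["repness"].items():
        for entry in entries:
            got = prop_test(entry["n-success"], entry["n-trials"])
            if not math.isclose(got, entry["p-test"], abs_tol=tol):
                mismatches.append((gid, entry["tid"], got, entry["p-test"]))
    return mismatches


print(find_mismatches(fake_blob))
```

Collecting mismatches instead of asserting on the first one makes it easier to see whether a failure is systematic (wrong formula) or isolated (one odd entry).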
…eudocount)

Replace Python's standard z-test `prop_test(p, n, p0)` with Clojure's formula `prop_test(succ, n) = 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)`.

The Clojure formula (stats.clj:10-15) has a built-in +1 pseudocount (Laplace smoothing / Beta(1,1) prior) that regularizes extreme values for small Polis groups. This is separate from the PSEUDO_COUNT=2.0 used for pa/pd estimation (Beta(2,2) prior).

Changes:
- prop_test: signature (p, n, p0) → (succ, n), Clojure formula
- prop_test_vectorized: same signature change
- comment_stats / compute_group_comment_stats_df: pass raw counts (na, ns) / (nd, ns) instead of (pa, ns, 0.5)
- Tests updated for new signature and expected values
- Golden snapshots re-recorded (pat/pdt/metric values changed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
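For readers without the diff at hand, the new signatures might look roughly like the sketch below. Only the names `prop_test` and `prop_test_vectorized` and the formula itself come from the commit message; the bodies are an assumption and may differ from what `repness.py` actually contains (for example in how empty groups are handled).

```python
import math

import numpy as np


def prop_test(succ, n):
    """Scalar form: 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)."""
    return 2.0 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)


def prop_test_vectorized(succ, n):
    """Same formula applied element-wise to arrays of raw counts."""
    succ = np.asarray(succ, dtype=float)
    n = np.asarray(n, dtype=float)
    return 2.0 * np.sqrt(n + 1.0) * ((succ + 1.0) / (n + 1.0) - 0.5)


# Raw counts, as the updated callers pass them: agree votes and total votes seen.
na = np.array([70, 10])
ns = np.array([100, 50])
pat = prop_test_vectorized(na, ns)
print(pat)
```

Passing raw counts instead of the smoothed proportion `pa` matters because the formula applies its own +1 pseudocount; feeding it an already smoothed value would smooth twice.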
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Superseded by spr-managed PR stack. See the new stack starting at #2508.
Summary
Replace Python's standard one-proportion z-test `prop_test(p, n, p0)` with Clojure's Wilson-score-like formula `prop_test(succ, n)` from `stats.clj:10-15`: `prop_test(succ, n) = 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)`.

The Clojure formula has a built-in +1 pseudocount (Laplace smoothing / Beta(1,1) prior) that regularizes extreme values for small Polis groups. This is separate from the `PSEUDO_COUNT=2.0` used for `pa`/`pd` estimation (Beta(2,2) prior):
- `pa = (na + 1) / (ns + 2)`: Beta(2,2) prior for probability estimation
- `pat = 2 * sqrt(ns+1) * ((na+1)/(ns+1) - 0.5)`: Beta(1,1) prior for significance testing

What changed in the output: `pat`, `pdt` values (proportion test z-scores), and downstream `agree_metric`/`disagree_metric` values. The z-scores are now slightly different due to `sqrt(n+1)` vs `sqrt(n)` and `(succ+1)/(n+1)` vs `(na+1)/(n+2)` denominators.

Changes
- `repness.py`: `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula
- `repness.py`: `prop_test_vectorized(p, n, p0)` → `prop_test_vectorized(succ, n)`
- `repness.py`: Callers updated to pass raw counts `(na, ns)` instead of `(pa, ns, 0.5)`
- `test_discrepancy_fixes.py`: Removed xfail from D5 formula test, added 8 test cases + edge case
- `test_repness_unit.py`, `test_old_format_repness.py`: Updated for new signature

Test plan
🤖 Generated with Claude Code
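To make the "What changed in the output" note in the Summary concrete, the sketch below compares the two calculations side by side. The old path is assumed here to be the textbook one-proportion z-test applied to the smoothed `pa` with `p0 = 0.5`, as the call `(pa, ns, 0.5)` suggests; the exact previous implementation is not shown on this page, so treat `old_style_z` as a hypothetical stand-in.

```python
import math


def old_style_z(na, ns, p0=0.5):
    # Assumed old path: smooth first, then a standard one-proportion z-test.
    pa = (na + 1) / (ns + 2)
    return (pa - p0) / math.sqrt(p0 * (1 - p0) / ns)


def new_prop_test(na, ns):
    # Formula from this PR, applied directly to raw counts.
    return 2 * math.sqrt(ns + 1) * ((na + 1) / (ns + 1) - 0.5)


rows = []
for na, ns in [(3, 4), (70, 100), (700, 1000)]:
    rows.append((na, ns, old_style_z(na, ns), new_prop_test(na, ns)))

for na, ns, old, new in rows:
    print(f"na={na:4d} ns={ns:5d} old={old:8.4f} new={new:8.4f}")
```

Under these assumptions the relative gap between the two scores is widest for the smallest group and shrinks as `ns` grows, which is where the differing denominators matter most.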