[Stack 17/27] Fix D6: match Clojure two-proportion test formula (+1 pseudocount)#2449
[Stack 17/27] Fix D6: match Clojure two-proportion test formula (+1 pseudocount)#2449jucor wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Aligns Delphi’s Python representativeness scoring with the Clojure implementation for discrepancy D6 by changing the two-proportion z-test to use Laplace-style +1 pseudocounts on raw count inputs.
Changes:
- Changed
two_prop_test/two_prop_test_vectorizedto accept raw counts and apply +1 pseudocount to all four inputs (Clojure parity). - Updated key call sites (
add_comparative_stats,compute_group_comment_stats_df) to pass counts instead of proportions. - Updated/expanded unit and discrepancy tests and re-recorded golden snapshots reflecting new
rat/rdt-derived outputs.
Reviewed changes
Copilot reviewed 6 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
delphi/polismath/pca_kmeans_rep/repness.py |
Reworks scalar + vectorized two-prop test API/formula and updates callers to pass raw counts. |
delphi/tests/test_repness_unit.py |
Updates unit tests for the new two-prop test signature and expected values. |
delphi/tests/test_old_format_repness.py |
Updates old-format compatibility tests for new two-prop test signature. |
delphi/tests/test_discrepancy_fixes.py |
Rewrites D6 parity tests with a reference implementation and adds additional coverage. |
delphi/real_data/r4tykwac8thvzv35jrn53-biodiversity/golden_snapshot.json |
Refreshes golden snapshot outputs impacted by the new z-score computation. |
delphi/docs/PLAN_DISCREPANCY_FIXES.md |
Marks D6 as completed and adds PR mapping entry. |
delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md |
Documents D6 work, rationale, and test outcomes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| p1 = pop_in + 1 | ||
| p2 = pop_out + 1 | ||
|
|
||
| pi1 = s1 / p1 | ||
| pi2 = s2 / p2 | ||
| pi_hat = (s1 + s2) / (p1 + p2) | ||
|
|
||
| if pi_hat == 1.0: | ||
| # Clojure note (stats.clj:26-27): "this isn't quite right... could | ||
| # actually solve this using limits" — returning 0 for now, matching Clojure. | ||
| return 0.0 | ||
|
|
||
| se = math.sqrt(pi_hat * (1 - pi_hat) * (1/p1 + 1/p2)) |
| # Add +1 pseudocount to all four inputs (Clojure: map inc) | ||
| s1 = succ_in + 1 | ||
| s2 = succ_out + 1 | ||
| p1 = pop_in + 1 | ||
| p2 = pop_out + 1 | ||
|
|
||
| # Standard error | ||
| se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2)) | ||
| pi1 = s1 / p1 | ||
| pi2 = s2 / p2 | ||
| pi_hat = (s1 + s2) / (p1 + p2) | ||
|
|
||
| # Z-score calculation | ||
| z = (p1 - p2) / se | ||
| se = np.sqrt(pi_hat * (1 - pi_hat) * (1/p1 + 1/p2)) |
| # With small n, the +1 pseudocount has a large effect | ||
| # succ=1, pop=1 → without pseudocount: p=1.0 (extreme) | ||
| # With pseudocount: (1+1)/(1+1) = 1.0, but denominator also shifts |
25a1a19 to
3232cc6
Compare
d11cc4c to
3232cc6
Compare
3232cc6 to
82f1048
Compare
5fd6423 to
1763fbd
Compare
1763fbd to
67298ed
Compare
82f1048 to
1dbd17f
Compare
67298ed to
de4485d
Compare
6cb475f to
fe09dd8
Compare
de4485d to
cb7496c
Compare
fe09dd8 to
fe2b127
Compare
cb7496c to
68242c4
Compare
fe2b127 to
0a3752c
Compare
68242c4 to
96347d5
Compare
0a3752c to
a511b52
Compare
96347d5 to
42aee66
Compare
a511b52 to
4b6c485
Compare
d0956ba to
c8e60c0
Compare
d8c8881 to
90c75e9
Compare
867fcbe to
e046d53
Compare
d6b65aa to
7f94b38
Compare
e046d53 to
e50a3d8
Compare
7f94b38 to
9867450
Compare
e50a3d8 to
6e59c6c
Compare
1308d91 to
d4b8ef6
Compare
c8a91ac to
f41dfb8
Compare
795026c to
b1eec11
Compare
f41dfb8 to
3526ab6
Compare
3526ab6 to
27da50e
Compare
b1eec11 to
109e60a
Compare
109e60a to
d1605f1
Compare
Delphi Coverage Report
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 8 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if pop_in == 0 or pop_out == 0: | ||
| return 0.0 |
There was a problem hiding this comment.
two_prop_test() claims to match Clojure's stats/two-prop-test, but it returns 0.0 when pop_in==0 or pop_out==0. Clojure does not special-case zero populations; it increments pop-in/pop-out (so division-by-zero is avoided) and can yield a non-zero z-score when one side has no votes. Either remove this early return for true parity, or update the docstring/expected behavior to explicitly document this intentional deviation (and consider validating succ_* <= pop_* instead).
| # Handle edge cases: pop_in=0 or pop_out=0 → 0, pi_hat=1 → 0 | ||
| z = z.where((pop_in > 0) & (pop_out > 0), 0.0) | ||
| z = z.fillna(0.0) | ||
| z = z.replace([np.inf, -np.inf], 0.0) |
There was a problem hiding this comment.
two_prop_test_vectorized() zeroes results when pop_in==0 or pop_out==0, which diverges from the referenced Clojure implementation (it applies +1 pseudocounts and still computes the statistic). If the goal is Clojure parity, drop this mask and rely on the +1 adjustment plus inf/NaN handling; if the goal is to treat no-data rows as 0, please document that deviation clearly (and align scalar/vectorized behavior + tests accordingly).
| # Handle edge cases: pop_in=0 or pop_out=0 → 0, pi_hat=1 → 0 | |
| z = z.where((pop_in > 0) & (pop_out > 0), 0.0) | |
| z = z.fillna(0.0) | |
| z = z.replace([np.inf, -np.inf], 0.0) | |
| # Handle edge cases: pi_hat=0 or 1 → se=0 → inf/NaN; map these to 0 | |
| z = z.replace([np.inf, -np.inf], 0.0) | |
| z = z.fillna(0.0) |
| # Edge cases: pop=0 → 0 | ||
| assert two_prop_test(5, 5, 0, 100) == 0.0 | ||
| assert two_prop_test(5, 5, 100, 0) == 0.0 |
There was a problem hiding this comment.
The “pop=0 → 0” edge case here uses inconsistent raw counts (succ_in=5 even though pop_in=0). In real usage succ_* should never exceed pop_* (and succ_in must be 0 if pop_in is 0), so this test isn’t exercising a realistic boundary. Consider changing these to consistent inputs (e.g., succ_in=0 when pop_in=0) or explicitly testing/expecting input validation behavior (raise or return 0 for invalid succ>pop).
| # Edge cases: pop=0 → 0 | |
| assert two_prop_test(5, 5, 0, 100) == 0.0 | |
| assert two_prop_test(5, 5, 100, 0) == 0.0 | |
| # Edge cases: pop=0 → 0 (use consistent counts: succ_* must be 0 when pop_* is 0) | |
| assert two_prop_test(0, 5, 0, 100) == 0.0 | |
| assert two_prop_test(5, 0, 100, 0) == 0.0 |
| assert two_prop_test(5, 5, 0, 100) == 0.0 | ||
| assert two_prop_test(5, 5, 100, 0) == 0.0 |
There was a problem hiding this comment.
Same issue as in test_repness_unit.py: these edge cases use impossible raw counts (succ_in=5 with pop_in=0). Since the function now takes raw counts, the tests should either use consistent counts (succ==0 when pop==0) or assert a defined behavior for invalid succ>pop inputs.
| assert two_prop_test(5, 5, 0, 100) == 0.0 | |
| assert two_prop_test(5, 5, 100, 0) == 0.0 | |
| assert two_prop_test(0, 5, 0, 100) == 0.0 | |
| assert two_prop_test(5, 0, 100, 0) == 0.0 |
7d16f0e to
618a693
Compare
Also add D4 blob injection test (p-success pseudocount formula) D6: Reconstructs group-vs-other vote counts from group-votes blob data, feeds to two_prop_test(), compares to blob's repness-test. Fails because the old two_prop_test expects proportions, not raw counts. D4: Verifies (n_success+1)/(n_trials+2) matches blob's p-success. Already passes (PSEUDO_COUNT=2.0 was fixed in an earlier PR). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace standard two-proportion z-test with Clojure's version that adds +1 pseudocount to all four inputs (stats.clj:18-33). This Laplace smoothing regularizes z-scores for small group sizes common in Polis. Signature change: two_prop_test(p1, n1, p2, n2) taking proportions → two_prop_test(succ_in, succ_out, pop_in, pop_out) taking raw counts. Updated both scalar and vectorized versions plus all callers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Superseded by spr-managed PR stack. See the new stack starting at #2508. |
Summary
The Python
two_prop_testused a standard two-proportion z-test with no pseudocounts,while Clojure's
stats/two-prop-test(stats.clj:18-33) adds +1 to all four inputs(
succ-in,succ-out,pop-in,pop-out) via(map inc ...)before computingthe pooled z-test. This Laplace smoothing regularizes z-scores for small group sizes,
which are common in Polis conversations.
Changes
two_prop_test(p1, n1, p2, n2)(proportions) →two_prop_test(succ_in, succ_out, pop_in, pop_out)(raw counts)pi1 = (succ_in+1)/(pop_in+1),pi_hat = (s1+s2)/(p1+p2)add_comparative_stats) and vectorized(
compute_group_comment_stats_df) now pass raw counts matching Clojure's(stats/two-prop-test (:na in-stats) (sum :na rest-stats) (:ns in-stats) (sum :ns rest-stats))(repness.clj:97-100)
Affected output fields
rat(agree representativeness test z-score)rdt(disagree representativeness test z-score)agree_metric,disagree_metric(downstream of rat/rdt)Test plan
🤖 Generated with Claude Code