[Stack 17/27] Fix D6: match Clojure two-proportion test formula (+1 pseudocount)#2449

Closed
jucor wants to merge 3 commits into jc/clj-parity-d5-prop-test from jc/clj-parity-d6-two-prop-test

Conversation

@jucor
Collaborator

@jucor jucor commented Mar 16, 2026

Summary

Stacked on #2448 (Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount)). Please review and merge #2448 first.
Next in stack: #2450 (Fix D7: match Clojure repness metric formula (product of 4 signed values))

The Python two_prop_test used a standard two-proportion z-test with no pseudocounts,
while Clojure's stats/two-prop-test (stats.clj:18-33) adds +1 to all four inputs
(succ-in, succ-out, pop-in, pop-out) via (map inc ...) before computing
the pooled z-test. This Laplace smoothing regularizes z-scores for small group sizes,
which are common in Polis conversations.
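
Written out, the smoothed statistic described above has the following form. The symbols $s_{\text{in}}, s_{\text{out}}, n_{\text{in}}, n_{\text{out}}$ are shorthand introduced here for succ-in, succ-out, pop-in and pop-out; the expression for $z$ assumes the usual pooled two-proportion form applied to the incremented counts, and the zero value at $\hat{\pi} = 1$ follows the Clojure behaviour noted in the review.

```latex
\begin{aligned}
s_1 &= s_{\text{in}} + 1, \quad s_2 = s_{\text{out}} + 1, \quad
p_1 = n_{\text{in}} + 1, \quad p_2 = n_{\text{out}} + 1, \\
\pi_1 &= \frac{s_1}{p_1}, \quad \pi_2 = \frac{s_2}{p_2}, \quad
\hat{\pi} = \frac{s_1 + s_2}{p_1 + p_2}, \\
z &=
\begin{cases}
0 & \text{if } \hat{\pi} = 1, \\[4pt]
\dfrac{\pi_1 - \pi_2}{\sqrt{\hat{\pi}\,(1 - \hat{\pi})\left(\dfrac{1}{p_1} + \dfrac{1}{p_2}\right)}} & \text{otherwise.}
\end{cases}
\end{aligned}
```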

Changes

  • Signature change: two_prop_test(p1, n1, p2, n2) (proportions) →
    two_prop_test(succ_in, succ_out, pop_in, pop_out) (raw counts)
  • Formula: Standard pooled z-test on pseudocount-adjusted values. With
    s1 = succ_in+1, s2 = succ_out+1, p1 = pop_in+1, p2 = pop_out+1:
    pi1 = s1/p1, pi2 = s2/p2, pi_hat = (s1+s2)/(p1+p2)
  • Callers updated: Both scalar (add_comparative_stats) and vectorized
    (compute_group_comment_stats_df) now pass raw counts matching Clojure's
    (stats/two-prop-test (:na in-stats) (sum :na rest-stats) (:ns in-stats) (sum :ns rest-stats))
    (repness.clj:97-100)
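
The changes above can be sketched roughly as follows. This is a minimal illustration, not the code in repness.py: the name `two_prop_test_sketch`, the `in_stats` / `rest_stats` dictionaries and their values are hypothetical, the reading of `na` as agree count and `ns` as votes seen is an assumption, and the final `(pi1 - pi2) / se` line is assumed from the standard pooled form because the quoted diff stops at `se`.

```python
import math


def two_prop_test_sketch(succ_in, succ_out, pop_in, pop_out):
    """Pooled two-proportion z-score with +1 added to all four raw counts."""
    # Laplace-style smoothing, mirroring Clojure's (map inc ...)
    s1, s2 = succ_in + 1, succ_out + 1
    p1, p2 = pop_in + 1, pop_out + 1

    pi1 = s1 / p1
    pi2 = s2 / p2
    pi_hat = (s1 + s2) / (p1 + p2)

    # Degenerate case: the pooled variance is zero, so return 0 as Clojure does
    if pi_hat == 1.0:
        return 0.0

    se = math.sqrt(pi_hat * (1 - pi_hat) * (1 / p1 + 1 / p2))
    return (pi1 - pi2) / se


# Hypothetical per-group stats for one comment (assumed: na = agrees, ns = votes seen)
in_stats = {"na": 3, "ns": 3}
rest_stats = [{"na": 4, "ns": 25}, {"na": 6, "ns": 15}]

# Callers pass raw counts: in-group first, then the sum over all other groups
rat = two_prop_test_sketch(
    in_stats["na"],
    sum(g["na"] for g in rest_stats),
    in_stats["ns"],
    sum(g["ns"] for g in rest_stats),
)
print(f"rat = {rat:.3f}")
```

Passing counts rather than proportions is what makes the +1 adjustment possible: once the caller has already divided, the group size needed for smoothing is lost.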

Affected output fields

  • rat (agree representativeness test z-score)
  • rdt (disagree representativeness test z-score)
  • agree_metric, disagree_metric (downstream of rat/rdt)

Test plan

  • Targeted D6 tests pass (formula, edge cases, regularization effect)
  • Full test suite passes (excluding DynamoDB/MinIO tests)
  • Private dataset tests pass (--include-local)
  • Golden snapshots re-recorded for all 7 datasets

🤖 Generated with Claude Code

Contributor

Copilot AI left a comment

Pull request overview

Aligns Delphi’s Python representativeness scoring with the Clojure implementation for discrepancy D6 by changing the two-proportion z-test to use Laplace-style +1 pseudocounts on raw count inputs.

Changes:

  • Changed two_prop_test / two_prop_test_vectorized to accept raw counts and apply +1 pseudocount to all four inputs (Clojure parity).
  • Updated key call sites (add_comparative_stats, compute_group_comment_stats_df) to pass counts instead of proportions.
  • Updated/expanded unit and discrepancy tests and re-recorded golden snapshots reflecting new rat/rdt-derived outputs.

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
| --- | --- |
| delphi/polismath/pca_kmeans_rep/repness.py | Reworks scalar + vectorized two-prop test API/formula and updates callers to pass raw counts. |
| delphi/tests/test_repness_unit.py | Updates unit tests for the new two-prop test signature and expected values. |
| delphi/tests/test_old_format_repness.py | Updates old-format compatibility tests for new two-prop test signature. |
| delphi/tests/test_discrepancy_fixes.py | Rewrites D6 parity tests with a reference implementation and adds additional coverage. |
| delphi/real_data/r4tykwac8thvzv35jrn53-biodiversity/golden_snapshot.json | Refreshes golden snapshot outputs impacted by the new z-score computation. |
| delphi/docs/PLAN_DISCREPANCY_FIXES.md | Marks D6 as completed and adds PR mapping entry. |
| delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md | Documents D6 work, rationale, and test outcomes. |

Comment on lines +129 to +141
    p1 = pop_in + 1
    p2 = pop_out + 1

    pi1 = s1 / p1
    pi2 = s2 / p2
    pi_hat = (s1 + s2) / (p1 + p2)

    if pi_hat == 1.0:
        # Clojure note (stats.clj:26-27): "this isn't quite right... could
        # actually solve this using limits" - returning 0 for now, matching Clojure.
        return 0.0

    se = math.sqrt(pi_hat * (1 - pi_hat) * (1/p1 + 1/p2))
Comment on lines +535 to +545
+    # Add +1 pseudocount to all four inputs (Clojure: map inc)
+    s1 = succ_in + 1
+    s2 = succ_out + 1
+    p1 = pop_in + 1
+    p2 = pop_out + 1
+
-    # Standard error
-    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
+    pi1 = s1 / p1
+    pi2 = s2 / p2
+    pi_hat = (s1 + s2) / (p1 + p2)
+
-    # Z-score calculation
-    z = (p1 - p2) / se
+    se = np.sqrt(pi_hat * (1 - pi_hat) * (1/p1 + 1/p2))
Comment on lines +879 to +881
# With small n, the +1 pseudocount has a large effect
# succ=1, pop=1 → without pseudocount: p=1.0 (extreme)
# With pseudocount: (1+1)/(1+1) = 1.0, but denominator also shifts
@jucor jucor changed the title Fix D6: match Clojure two-proportion test formula (+1 pseudocount) [Stack 15/15] Fix D6: match Clojure two-proportion test formula (+1 pseudocount) Mar 16, 2026
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 25a1a19 to 3232cc6 Compare March 16, 2026 16:05
@jucor jucor closed this Mar 16, 2026
@jucor jucor force-pushed the jc/clj-parity-d6-two-prop-test branch from d11cc4c to 3232cc6 Compare March 16, 2026 16:05
@jucor jucor reopened this Mar 16, 2026
@jucor jucor changed the title [Stack 15/15] Fix D6: match Clojure two-proportion test formula (+1 pseudocount) [Stack 15/16] Fix D6: match Clojure two-proportion test formula (+1 pseudocount) Mar 16, 2026
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 3232cc6 to 82f1048 Compare March 16, 2026 18:06
@jucor jucor force-pushed the jc/clj-parity-d6-two-prop-test branch from 5fd6423 to 1763fbd Compare March 16, 2026 18:08
@jucor jucor changed the title [Stack 15/16] Fix D6: match Clojure two-proportion test formula (+1 pseudocount) [Stack 15/17] Fix D6: match Clojure two-proportion test formula (+1 pseudocount) Mar 16, 2026
@jucor jucor marked this pull request as draft March 17, 2026 10:35
@jucor jucor force-pushed the jc/clj-parity-d6-two-prop-test branch from 1763fbd to 67298ed Compare March 17, 2026 16:10
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 82f1048 to 1dbd17f Compare March 17, 2026 16:10
@jucor jucor changed the title [Stack 15/17] Fix D6: match Clojure two-proportion test formula (+1 pseudocount) [Stack 15/24] Fix D6: match Clojure two-proportion test formula (+1 pseudocount) Mar 17, 2026
@jucor jucor changed the title [Stack 15/24] Fix D6: match Clojure two-proportion test formula (+1 pseudocount) [Stack 15/25] Fix D6: match Clojure two-proportion test formula (+1 pseudocount) Mar 17, 2026
@jucor jucor force-pushed the jc/clj-parity-d6-two-prop-test branch from 67298ed to de4485d Compare March 18, 2026 18:50
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 6cb475f to fe09dd8 Compare March 18, 2026 19:06
@jucor jucor force-pushed the jc/clj-parity-d6-two-prop-test branch from de4485d to cb7496c Compare March 18, 2026 19:06
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from fe09dd8 to fe2b127 Compare March 19, 2026 10:04
@jucor jucor force-pushed the jc/clj-parity-d6-two-prop-test branch from cb7496c to 68242c4 Compare March 19, 2026 10:08
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from fe2b127 to 0a3752c Compare March 19, 2026 10:43
@jucor jucor force-pushed the jc/clj-parity-d6-two-prop-test branch from 68242c4 to 96347d5 Compare March 19, 2026 10:44
@jucor jucor changed the title [Stack 15/25] Fix D6: match Clojure two-proportion test formula (+1 pseudocount) [Stack 14/24] Fix D6: match Clojure two-proportion test formula (+1 pseudocount) Mar 19, 2026
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 0a3752c to a511b52 Compare March 19, 2026 12:31
@jucor jucor force-pushed the jc/clj-parity-d6-two-prop-test branch from 96347d5 to 42aee66 Compare March 19, 2026 12:32
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from a511b52 to 4b6c485 Compare March 19, 2026 14:52
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from d0956ba to c8e60c0 Compare March 24, 2026 10:27
@jucor jucor force-pushed the jc/clj-parity-d6-two-prop-test branch from d8c8881 to 90c75e9 Compare March 24, 2026 10:27
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 867fcbe to e046d53 Compare March 24, 2026 11:08
@jucor jucor force-pushed the jc/clj-parity-d6-two-prop-test branch from d6b65aa to 7f94b38 Compare March 24, 2026 11:08
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from e046d53 to e50a3d8 Compare March 26, 2026 21:24
@jucor jucor force-pushed the jc/clj-parity-d6-two-prop-test branch from 7f94b38 to 9867450 Compare March 26, 2026 21:24
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from e50a3d8 to 6e59c6c Compare March 27, 2026 01:15
@jucor jucor force-pushed the jc/clj-parity-d6-two-prop-test branch 2 times, most recently from 1308d91 to d4b8ef6 Compare March 27, 2026 01:53
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch 2 times, most recently from c8a91ac to f41dfb8 Compare March 27, 2026 02:10
@jucor jucor force-pushed the jc/clj-parity-d6-two-prop-test branch 2 times, most recently from 795026c to b1eec11 Compare March 27, 2026 10:41
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from f41dfb8 to 3526ab6 Compare March 27, 2026 10:41
@jucor jucor changed the title [Stack 15/25] Fix D6: match Clojure two-proportion test formula (+1 pseudocount) [Stack 16/26] Fix D6: match Clojure two-proportion test formula (+1 pseudocount) Mar 30, 2026
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 3526ab6 to 27da50e Compare March 30, 2026 12:48
@jucor jucor force-pushed the jc/clj-parity-d6-two-prop-test branch from b1eec11 to 109e60a Compare March 30, 2026 12:48
@jucor jucor changed the title [Stack 16/26] Fix D6: match Clojure two-proportion test formula (+1 pseudocount) [Stack 17/27] Fix D6: match Clojure two-proportion test formula (+1 pseudocount) Mar 30, 2026
@jucor jucor force-pushed the jc/clj-parity-d6-two-prop-test branch from 109e60a to d1605f1 Compare March 30, 2026 12:54
@github-actions

Delphi Coverage Report

| File | Stmts | Miss | Cover |
| --- | --- | --- | --- |
| __init__.py | 2 | 0 | 100% |
| benchmarks/bench_pca.py | 76 | 76 | 0% |
| benchmarks/bench_repness.py | 81 | 81 | 0% |
| benchmarks/bench_update_votes.py | 38 | 38 | 0% |
| benchmarks/benchmark_utils.py | 34 | 34 | 0% |
| components/__init__.py | 1 | 0 | 100% |
| components/config.py | 165 | 133 | 19% |
| conversation/__init__.py | 2 | 0 | 100% |
| conversation/conversation.py | 1107 | 320 | 71% |
| conversation/manager.py | 131 | 42 | 68% |
| database/__init__.py | 1 | 0 | 100% |
| database/dynamodb.py | 387 | 234 | 40% |
| database/postgres.py | 305 | 205 | 33% |
| pca_kmeans_rep/__init__.py | 5 | 0 | 100% |
| pca_kmeans_rep/clusters.py | 257 | 22 | 91% |
| pca_kmeans_rep/corr.py | 98 | 17 | 83% |
| pca_kmeans_rep/pca.py | 52 | 16 | 69% |
| pca_kmeans_rep/repness.py | 312 | 34 | 89% |
| regression/__init__.py | 4 | 0 | 100% |
| regression/clojure_comparer.py | 188 | 17 | 91% |
| regression/comparer.py | 887 | 720 | 19% |
| regression/datasets.py | 135 | 27 | 80% |
| regression/recorder.py | 36 | 27 | 25% |
| regression/utils.py | 138 | 87 | 37% |
| run_math_pipeline.py | 260 | 114 | 56% |
| umap_narrative/500_generate_embedding_umap_cluster.py | 210 | 109 | 48% |
| umap_narrative/501_calculate_comment_extremity.py | 112 | 53 | 53% |
| umap_narrative/502_calculate_priorities.py | 135 | 135 | 0% |
| umap_narrative/700_datamapplot_for_layer.py | 502 | 502 | 0% |
| umap_narrative/701_static_datamapplot_for_layer.py | 310 | 310 | 0% |
| umap_narrative/702_consensus_divisive_datamapplot.py | 432 | 432 | 0% |
| umap_narrative/801_narrative_report_batch.py | 785 | 785 | 0% |
| umap_narrative/802_process_batch_results.py | 265 | 265 | 0% |
| umap_narrative/803_check_batch_status.py | 175 | 175 | 0% |
| umap_narrative/llm_factory_constructor/__init__.py | 2 | 2 | 0% |
| umap_narrative/llm_factory_constructor/model_provider.py | 157 | 157 | 0% |
| umap_narrative/polismath_commentgraph/__init__.py | 1 | 0 | 100% |
| umap_narrative/polismath_commentgraph/cli.py | 270 | 270 | 0% |
| umap_narrative/polismath_commentgraph/core/__init__.py | 3 | 3 | 0% |
| umap_narrative/polismath_commentgraph/core/clustering.py | 108 | 108 | 0% |
| umap_narrative/polismath_commentgraph/core/embedding.py | 104 | 104 | 0% |
| umap_narrative/polismath_commentgraph/lambda_handler.py | 219 | 219 | 0% |
| umap_narrative/polismath_commentgraph/schemas/__init__.py | 2 | 0 | 100% |
| umap_narrative/polismath_commentgraph/schemas/dynamo_models.py | 160 | 9 | 94% |
| umap_narrative/polismath_commentgraph/tests/conftest.py | 17 | 17 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_clustering.py | 74 | 74 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_embedding.py | 55 | 55 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_storage.py | 87 | 87 | 0% |
| umap_narrative/polismath_commentgraph/utils/__init__.py | 3 | 0 | 100% |
| umap_narrative/polismath_commentgraph/utils/converter.py | 283 | 237 | 16% |
| umap_narrative/polismath_commentgraph/utils/group_data.py | 354 | 336 | 5% |
| umap_narrative/polismath_commentgraph/utils/storage.py | 584 | 518 | 11% |
| umap_narrative/reset_conversation.py | 159 | 50 | 69% |
| umap_narrative/run_pipeline.py | 453 | 312 | 31% |
| utils/general.py | 62 | 41 | 34% |
| Total | 10785 | 7609 | 29% |

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 4 comments.


Comment on lines +121 to 122
    if pop_in == 0 or pop_out == 0:
        return 0.0

Copilot AI Mar 30, 2026

two_prop_test() claims to match Clojure's stats/two-prop-test, but it returns 0.0 when pop_in==0 or pop_out==0. Clojure does not special-case zero populations; it increments pop-in/pop-out (so division-by-zero is avoided) and can yield a non-zero z-score when one side has no votes. Either remove this early return for true parity, or update the docstring/expected behavior to explicitly document this intentional deviation (and consider validating succ_* <= pop_* instead).
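
To make the zero-population point concrete, here is a minimal sketch that assumes the Clojure formula as described in the PR summary (increment all four counts, return 0 only when the pooled proportion is 1). `clojure_style_z` is a hypothetical name used for illustration, not the function in repness.py, and the vote counts are made up.

```python
import math


def clojure_style_z(succ_in, succ_out, pop_in, pop_out):
    # +1 on every count, so a zero population never causes a division by zero
    s1, s2 = succ_in + 1, succ_out + 1
    p1, p2 = pop_in + 1, pop_out + 1
    pi_hat = (s1 + s2) / (p1 + p2)
    if pi_hat == 1.0:
        return 0.0
    se = math.sqrt(pi_hat * (1 - pi_hat) * (1 / p1 + 1 / p2))
    return (s1 / p1 - s2 / p2) / se


# In-group has no votes at all; the other groups have 5 agrees out of 100 votes
z = clojure_style_z(0, 5, 0, 100)
print(f"z = {z:.2f}")  # non-zero, unlike the early return on pop_in == 0
```

With no early return, the empty in-group is treated as one success out of one trial, which yields a positive z-score here rather than 0.0.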

Comment on lines +546 to 549
# Handle edge cases: pop_in=0 or pop_out=0 → 0, pi_hat=1 → 0
z = z.where((pop_in > 0) & (pop_out > 0), 0.0)
z = z.fillna(0.0)
z = z.replace([np.inf, -np.inf], 0.0)

Copilot AI Mar 30, 2026

two_prop_test_vectorized() zeroes results when pop_in==0 or pop_out==0, which diverges from the referenced Clojure implementation (it applies +1 pseudocounts and still computes the statistic). If the goal is Clojure parity, drop this mask and rely on the +1 adjustment plus inf/NaN handling; if the goal is to treat no-data rows as 0, please document that deviation clearly (and align scalar/vectorized behavior + tests accordingly).

Suggested change
- # Handle edge cases: pop_in=0 or pop_out=0 → 0, pi_hat=1 → 0
- z = z.where((pop_in > 0) & (pop_out > 0), 0.0)
- z = z.fillna(0.0)
- z = z.replace([np.inf, -np.inf], 0.0)
+ # Handle edge cases: pi_hat=0 or 1 → se=0 → inf/NaN; map these to 0
+ z = z.replace([np.inf, -np.inf], 0.0)
+ z = z.fillna(0.0)

Comment on lines +73 to +75
# Edge cases: pop=0 → 0
assert two_prop_test(5, 5, 0, 100) == 0.0
assert two_prop_test(5, 5, 100, 0) == 0.0

Copilot AI Mar 30, 2026

The “pop=0 → 0” edge case here uses inconsistent raw counts (succ_in=5 even though pop_in=0). In real usage succ_* should never exceed pop_* (and succ_in must be 0 if pop_in is 0), so this test isn’t exercising a realistic boundary. Consider changing these to consistent inputs (e.g., succ_in=0 when pop_in=0) or explicitly testing/expecting input validation behavior (raise or return 0 for invalid succ>pop).

Suggested change
- # Edge cases: pop=0 → 0
- assert two_prop_test(5, 5, 0, 100) == 0.0
- assert two_prop_test(5, 5, 100, 0) == 0.0
+ # Edge cases: pop=0 → 0 (use consistent counts: succ_* must be 0 when pop_* is 0)
+ assert two_prop_test(0, 5, 0, 100) == 0.0
+ assert two_prop_test(5, 0, 100, 0) == 0.0
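
If input validation were the preferred route, it might look something like the sketch below. The helper name `check_counts` and the choice to raise `ValueError` are assumptions made for illustration; nothing in the PR defines such a function.

```python
def check_counts(succ_in, succ_out, pop_in, pop_out):
    """Raise ValueError if the raw counts cannot describe real votes."""
    for name, succ, pop in (("in", succ_in, pop_in), ("out", succ_out, pop_out)):
        if succ < 0 or pop < 0:
            raise ValueError(f"negative count for {name}-group: succ={succ}, pop={pop}")
        if succ > pop:
            raise ValueError(f"succ_{name}={succ} exceeds pop_{name}={pop}")


# Consistent boundary input: no votes in-group, therefore no successes either
check_counts(0, 5, 0, 100)

# The inconsistent input from the original test is rejected
try:
    check_counts(5, 5, 0, 100)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

A check like this would let the tests assert a defined outcome for succ > pop instead of relying on whatever the formula happens to return.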

Comment on lines +68 to +69
assert two_prop_test(5, 5, 0, 100) == 0.0
assert two_prop_test(5, 5, 100, 0) == 0.0

Copilot AI Mar 30, 2026

Same issue as in test_repness_unit.py: these edge cases use impossible raw counts (succ_in=5 with pop_in=0). Since the function now takes raw counts, the tests should either use consistent counts (succ==0 when pop==0) or assert a defined behavior for invalid succ>pop inputs.

Suggested change
- assert two_prop_test(5, 5, 0, 100) == 0.0
- assert two_prop_test(5, 5, 100, 0) == 0.0
+ assert two_prop_test(0, 5, 0, 100) == 0.0
+ assert two_prop_test(5, 0, 100, 0) == 0.0

@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 7d16f0e to 618a693 Compare March 30, 2026 16:49
jucor and others added 3 commits March 30, 2026 18:04
Also add D4 blob injection test (p-success pseudocount formula)

D6: Reconstructs group-vs-other vote counts from group-votes blob data,
feeds to two_prop_test(), compares to blob's repness-test. Fails because
the old two_prop_test expects proportions, not raw counts.

D4: Verifies (n_success+1)/(n_trials+2) matches blob's p-success.
Already passes (PSEUDO_COUNT=2.0 was fixed in an earlier PR).
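
The D4 formula above can be sketched in a few lines. `p_success` is a hypothetical helper name, and splitting PSEUDO_COUNT as half in the numerator and all of it in the denominator is an assumption chosen so that the result equals (n_success+1)/(n_trials+2).

```python
PSEUDO_COUNT = 2.0  # value named in the commit message


def p_success(n_success, n_trials):
    # Equivalent to (n_success + 1) / (n_trials + 2) when PSEUDO_COUNT is 2.0
    return (n_success + PSEUDO_COUNT / 2) / (n_trials + PSEUDO_COUNT)


print(p_success(0, 0))  # 0.5: no votes gives an uninformative estimate
print(p_success(3, 3))  # 0.8: unanimous small group is pulled below 1.0
```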

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace standard two-proportion z-test with Clojure's version that adds
+1 pseudocount to all four inputs (stats.clj:18-33). This Laplace
smoothing regularizes z-scores for small group sizes common in Polis.

Signature change: two_prop_test(p1, n1, p2, n2) taking proportions →
two_prop_test(succ_in, succ_out, pop_in, pop_out) taking raw counts.
Updated both scalar and vectorized versions plus all callers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This was referenced Mar 30, 2026
@jucor
Collaborator Author

jucor commented Mar 30, 2026

Superseded by spr-managed PR stack. See the new stack starting at #2508.
