
[Stack 16/27] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) #2448

Closed
jucor wants to merge 4 commits into jc/clj-parity-d9-fix from jc/clj-parity-d5-prop-test


Conversation

@jucor
Collaborator

@jucor jucor commented Mar 16, 2026

Summary

Stacked on #2446 (Fix D9: z-score thresholds from two-tailed to one-tailed). Please review and merge #2446 first.
Next in stack: #2449 (Fix D6: match Clojure two-proportion test formula (+1 pseudocount))

Replace Python's standard one-proportion z-test prop_test(p, n, p0) with
Clojure's Wilson-score-like formula prop_test(succ, n) from stats.clj:10-15:

2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)

The Clojure formula has a built-in +1 pseudocount (Laplace smoothing / Beta(1,1)
prior) that regularizes extreme values for small Polis groups. This is separate
from the PSEUDO_COUNT=2.0 used for pa/pd estimation (Beta(2,2) prior):

  • pa = (na + 1) / (ns + 2) — Beta(2,2) prior for probability estimation
  • pat = 2 * sqrt(ns+1) * ((na+1)/(ns+1) - 0.5) — Beta(1,1) prior for significance testing
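
The two estimates can be placed side by side. The following is a minimal sketch built only from the formulas quoted above; the names `estimate_pa` and `prop_test_sketch` are illustrative and are not the functions in repness.py.

```python
import math


def estimate_pa(na: int, ns: int) -> float:
    """Smoothed agree probability: pa = (na + 1) / (ns + 2)."""
    return (na + 1) / (ns + 2)


def prop_test_sketch(succ: int, n: int) -> float:
    """Clojure-style proportion test: 2 * sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)."""
    return 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)


# A small group: 3 agrees out of 3 votes seen.
na, ns = 3, 3
pa = estimate_pa(na, ns)        # (3 + 1) / (3 + 2) = 0.8
pat = prop_test_sketch(na, ns)  # 2 * sqrt(4) * (4 / 4 - 0.5) = 2.0
print(f"pa={pa:.3f} pat={pat:.3f}")
```

Even with unanimous agreement, the smoothed pa stays below 1 and the test statistic stays finite, which is the regularizing effect described above.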

What changed in the output: pat, pdt values (proportion test z-scores),
and downstream agree_metric / disagree_metric values. The z-scores shift
slightly because the test now scales by sqrt(ns+1) rather than sqrt(ns) and uses
(na+1)/(ns+1) rather than the smoothed pa = (na+1)/(ns+2).
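
To get a feel for the size of the shift, the sketch below evaluates both formulas on the same counts. The pre-change code is not shown in this PR, so `old_prop_test` assumes the textbook one-proportion z-test, z = (p - p0) / sqrt(p0 * (1 - p0) / n), fed with the smoothed pa; treat it as an approximation of the old behaviour. Both function names are hypothetical.

```python
import math


def old_prop_test(p: float, n: int, p0: float = 0.5) -> float:
    """Textbook one-proportion z-test (assumed form of the previous Python code)."""
    return (p - p0) / math.sqrt(p0 * (1 - p0) / n)


def new_prop_test(succ: int, n: int) -> float:
    """Clojure formula quoted in this PR."""
    return 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)


for na, ns in [(3, 4), (7, 10), (70, 100)]:
    pa = (na + 1) / (ns + 2)  # smoothed proportion the old callers passed in
    old = old_prop_test(pa, ns)
    new = new_prop_test(na, ns)
    print(f"na={na:3d} ns={ns:3d} old={old:.3f} new={new:.3f}")
```

Under this assumption the two values converge in relative terms as ns grows, so the largest differences show up in small groups.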

Changes

  • repness.py: prop_test(p, n, p0) → prop_test(succ, n) with Clojure formula
  • repness.py: prop_test_vectorized(p, n, p0) → prop_test_vectorized(succ, n)
  • repness.py: Callers updated to pass raw counts (na, ns) instead of (pa, ns, 0.5)
  • test_discrepancy_fixes.py: Removed xfail from D5 formula test, added 8 test cases + edge case
  • test_repness_unit.py, test_old_format_repness.py: Updated for new signature
  • Golden snapshots re-recorded for all datasets
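
For illustration, a vectorized form of the new formula and the updated calling convention could look like the following. This is a sketch assuming NumPy arrays of raw counts; `prop_test_vectorized_sketch` is a hypothetical stand-in, and how repness.py itself handles dtypes or edge cases is not shown in this PR.

```python
import numpy as np


def prop_test_vectorized_sketch(succ, n):
    """Element-wise 2 * sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)."""
    succ = np.asarray(succ, dtype=float)
    n = np.asarray(n, dtype=float)
    return 2.0 * np.sqrt(n + 1.0) * ((succ + 1.0) / (n + 1.0) - 0.5)


# Callers pass raw counts: (na, ns) for agrees and (nd, ns) for disagrees.
na = np.array([70, 10, 0])
nd = np.array([20, 35, 0])
ns = np.array([100, 50, 0])
pat = prop_test_vectorized_sketch(na, ns)
pdt = prop_test_vectorized_sketch(nd, ns)
print(pat, pdt)
```

Because the denominator is n + 1, the expression is defined for n = 0 in this sketch without a special case.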

Test plan

  • D5 formula tests pass (8 input pairs + edge cases)
  • D5 Clojure blob consistency check passes (all datasets)
  • Full test suite passes (public + private, 19/19 regression tests)
  • Only pre-existing failure: pakistan-incremental D2 (unrelated)

🤖 Generated with Claude Code

Contributor

Copilot AI left a comment

Pull request overview

Updates Delphi’s representativeness (“repness”) proportion-test implementation to match the Clojure reference, aligning downstream pat/pdt (and derived agree/disagree metrics) with the parity plan.

Changes:

  • Replaced the Python one-proportion z-test with Clojure’s Wilson-score-like prop_test(succ, n) formula (with +1 pseudocount) and updated vectorized equivalent.
  • Updated repness callers to pass raw counts (na/nd, ns) rather than smoothed proportions.
  • Refreshed and expanded tests (including removing the D5 xfail) and updated docs/journal plan status.

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
delphi/polismath/pca_kmeans_rep/repness.py Implements new prop_test(succ, n) + vectorized formula and updates callers to use raw counts.
delphi/tests/test_discrepancy_fixes.py Enables D5 parity assertions (removes xfail) and expands coverage across multiple cases/edges.
delphi/tests/test_repness_unit.py Updates unit + vectorized tests for the new prop-test signature/formula.
delphi/tests/test_old_format_repness.py Updates backwards-compatible interface tests for the new prop-test signature.
delphi/docs/PLAN_DISCREPANCY_FIXES.md Marks D5 / PR 4 as DONE in the discrepancy plan table.
delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md Adds PR4 journal entry describing the D5 change, rationale, and validation steps.



Returns:
- Z-score
+ Z-score (positive means succ/n > 0.5)
Comment on lines +48 to +51
# 70 successes out of 100: 2*sqrt(101)*((71/101)-0.5) = ~4.08
assert np.isclose(prop_test(70, 100),
                  2 * math.sqrt(101) * (71/101 - 0.5), atol=0.01)
# 10 successes out of 50: 2*sqrt(51)*((11/51)-0.5) = ~-4.06
@jucor jucor changed the title Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) [Stack 14/15] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) Mar 16, 2026
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from db36889 to 69350d5 Compare March 16, 2026 16:04
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 25a1a19 to 3232cc6 Compare March 16, 2026 16:05
@jucor jucor changed the title [Stack 14/15] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) [Stack 14/16] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) Mar 16, 2026
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 69350d5 to 382de2f Compare March 16, 2026 18:06
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 3232cc6 to 82f1048 Compare March 16, 2026 18:06
@jucor jucor changed the title [Stack 14/16] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) [Stack 14/17] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) Mar 16, 2026
@jucor jucor marked this pull request as draft March 17, 2026 10:35
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 82f1048 to 1dbd17f Compare March 17, 2026 16:10
@jucor jucor changed the title [Stack 14/17] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) [Stack 14/24] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) Mar 17, 2026
@jucor jucor changed the title [Stack 14/24] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) [Stack 14/25] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) Mar 17, 2026
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 6cb475f to fe09dd8 Compare March 18, 2026 19:06
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from f8a7007 to 19cce44 Compare March 19, 2026 10:03
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from fe09dd8 to fe2b127 Compare March 19, 2026 10:04
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 19cce44 to bf2dd99 Compare March 19, 2026 10:43
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from fe2b127 to 0a3752c Compare March 19, 2026 10:43
@jucor jucor changed the title [Stack 14/25] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) [Stack 13/24] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) Mar 19, 2026
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from bf2dd99 to f8c5793 Compare March 19, 2026 12:31
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 0a3752c to a511b52 Compare March 19, 2026 12:31
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from f8c5793 to 7f733c1 Compare March 19, 2026 14:52
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from a511b52 to 4b6c485 Compare March 19, 2026 14:52
@jucor jucor changed the title [Stack 13/24] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) [Stack 14/25] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) Mar 19, 2026
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 7f733c1 to c920b61 Compare March 23, 2026 15:11
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 4b6c485 to 140f87f Compare March 23, 2026 15:13
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from c920b61 to e538293 Compare March 23, 2026 15:41
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch 2 times, most recently from 34fc9ce to ee798a6 Compare March 27, 2026 01:15
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch 2 times, most recently from 6e59c6c to c8a91ac Compare March 27, 2026 01:53
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from ee798a6 to 6e54a9c Compare March 27, 2026 01:53
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from c8a91ac to f41dfb8 Compare March 27, 2026 02:10
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 09747ea to 8d94246 Compare March 27, 2026 10:41
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from f41dfb8 to 3526ab6 Compare March 27, 2026 10:41
@jucor jucor changed the title [Stack 14/25] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) [Stack 15/26] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) Mar 30, 2026
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 8d94246 to 9397ddf Compare March 30, 2026 12:48
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 3526ab6 to 27da50e Compare March 30, 2026 12:48
@jucor jucor changed the title [Stack 15/26] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) [Stack 16/27] Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) Mar 30, 2026
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 9397ddf to e96a1f7 Compare March 30, 2026 12:54
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 27da50e to 7d16f0e Compare March 30, 2026 12:54
@github-actions

Delphi Coverage Report

File Stmts Miss Cover
init.py 2 0 100%
benchmarks/bench_pca.py 76 76 0%
benchmarks/bench_repness.py 81 81 0%
benchmarks/bench_update_votes.py 38 38 0%
benchmarks/benchmark_utils.py 34 34 0%
components/init.py 1 0 100%
components/config.py 165 133 19%
conversation/init.py 2 0 100%
conversation/conversation.py 1107 320 71%
conversation/manager.py 131 42 68%
database/init.py 1 0 100%
database/dynamodb.py 387 234 40%
database/postgres.py 305 205 33%
pca_kmeans_rep/init.py 5 0 100%
pca_kmeans_rep/clusters.py 257 22 91%
pca_kmeans_rep/corr.py 98 17 83%
pca_kmeans_rep/pca.py 52 16 69%
pca_kmeans_rep/repness.py 297 38 87%
regression/init.py 4 0 100%
regression/clojure_comparer.py 188 17 91%
regression/comparer.py 887 720 19%
regression/datasets.py 135 27 80%
regression/recorder.py 36 27 25%
regression/utils.py 138 87 37%
run_math_pipeline.py 260 114 56%
umap_narrative/500_generate_embedding_umap_cluster.py 210 109 48%
umap_narrative/501_calculate_comment_extremity.py 112 53 53%
umap_narrative/502_calculate_priorities.py 135 135 0%
umap_narrative/700_datamapplot_for_layer.py 502 502 0%
umap_narrative/701_static_datamapplot_for_layer.py 310 310 0%
umap_narrative/702_consensus_divisive_datamapplot.py 432 432 0%
umap_narrative/801_narrative_report_batch.py 785 785 0%
umap_narrative/802_process_batch_results.py 265 265 0%
umap_narrative/803_check_batch_status.py 175 175 0%
umap_narrative/llm_factory_constructor/init.py 2 2 0%
umap_narrative/llm_factory_constructor/model_provider.py 157 157 0%
umap_narrative/polismath_commentgraph/init.py 1 0 100%
umap_narrative/polismath_commentgraph/cli.py 270 270 0%
umap_narrative/polismath_commentgraph/core/init.py 3 3 0%
umap_narrative/polismath_commentgraph/core/clustering.py 108 108 0%
umap_narrative/polismath_commentgraph/core/embedding.py 104 104 0%
umap_narrative/polismath_commentgraph/lambda_handler.py 219 219 0%
umap_narrative/polismath_commentgraph/schemas/init.py 2 0 100%
umap_narrative/polismath_commentgraph/schemas/dynamo_models.py 160 9 94%
umap_narrative/polismath_commentgraph/tests/conftest.py 17 17 0%
umap_narrative/polismath_commentgraph/tests/test_clustering.py 74 74 0%
umap_narrative/polismath_commentgraph/tests/test_embedding.py 55 55 0%
umap_narrative/polismath_commentgraph/tests/test_storage.py 87 87 0%
umap_narrative/polismath_commentgraph/utils/init.py 3 0 100%
umap_narrative/polismath_commentgraph/utils/converter.py 283 237 16%
umap_narrative/polismath_commentgraph/utils/group_data.py 354 336 5%
umap_narrative/polismath_commentgraph/utils/storage.py 584 518 11%
umap_narrative/reset_conversation.py 159 50 69%
umap_narrative/run_pipeline.py 453 312 31%
utils/general.py 62 41 34%
Total 10770 7613 29%

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 1 comment.




Returns:
- Z-score
+ Z-score (positive means succ/n > 0.5)

Copilot AI Mar 30, 2026

The return-value description is slightly inaccurate with the +1 pseudocount: the sign is determined by (succ+1)/(n+1) relative to 0.5 (so e.g. succ==n/2 yields a positive value for n>0). Consider rewording to avoid implying it’s based on the raw succ/n proportion.

Suggested change
- Z-score (positive means succ/n > 0.5)
+ Z-score (sign determined by (succ + 1) / (n + 1) relative to 0.5; positive when (succ + 1) / (n + 1) > 0.5)
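
The point about the sign can be checked numerically. Algebraically, 2 * sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5) simplifies to (2 * succ - n + 1) / sqrt(n + 1), so the value is positive exactly when succ > (n - 1) / 2. A short sketch, with `prop_test_sketch` as a hypothetical stand-in for the real function:

```python
import math


def prop_test_sketch(succ: float, n: float) -> float:
    return 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)


# An exact 50/50 split still scores above zero because of the +1 pseudocount.
for n in (2, 10, 100):
    z = prop_test_sketch(n / 2, n)
    assert z > 0
    assert math.isclose(z, 1 / math.sqrt(n + 1))

# The score is zero only when succ == (n - 1) / 2, e.g. 2 successes out of 5.
assert math.isclose(prop_test_sketch(2, 5), 0.0, abs_tol=1e-12)
```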

@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 7d16f0e to 618a693 Compare March 30, 2026 16:49
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from e96a1f7 to 574c169 Compare March 30, 2026 16:49
jucor and others added 4 commits March 30, 2026 18:04
Extracts n-success and n-trials from every repness entry in the Clojure
math blob and feeds them to Python's prop_test(). Compares output to
the blob's p-test value — the ground truth oracle.

Fails because Python's current prop_test uses the old formula
(standard z-test) which produces different values than Clojure's
Wilson-score-like formula with built-in +1 pseudocount.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
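
A consistency check of the kind this commit describes might look like the sketch below. The key names n-success, n-trials and p-test come from the commit message; the surrounding blob layout (a mapping from group id to a list of entries), the sample values, and the helper names are assumptions made for illustration, since the real test and blob format are not shown here.

```python
import math


def prop_test_sketch(succ, n):
    """Clojure formula quoted in this PR."""
    return 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)


def find_mismatches(repness, atol=1e-6):
    """Return (group, index, expected, actual) for entries that disagree with the blob."""
    mismatches = []
    for group_id, entries in repness.items():
        for i, entry in enumerate(entries):
            actual = prop_test_sketch(entry["n-success"], entry["n-trials"])
            if not math.isclose(actual, entry["p-test"], abs_tol=atol):
                mismatches.append((group_id, i, entry["p-test"], actual))
    return mismatches


# Hypothetical blob fragment; the p-test values were generated from the formula itself.
sample_repness = {
    "0": [{"n-success": 70, "n-trials": 100, "p-test": 41 / math.sqrt(101)}],
    "1": [{"n-success": 10, "n-trials": 50, "p-test": -29 / math.sqrt(51)}],
}
assert find_mismatches(sample_repness) == []
```

Collecting every mismatch rather than stopping at the first one makes it easier to tell a formula error, which would affect all entries, from an isolated data problem.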
…eudocount)

Replace Python's standard z-test prop_test(p, n, p0) with Clojure's
formula prop_test(succ, n) = 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5).

The Clojure formula (stats.clj:10-15) has a built-in +1 pseudocount
(Laplace smoothing / Beta(1,1) prior) that regularizes extreme values
for small Polis groups. This is separate from the PSEUDO_COUNT=2.0
used for pa/pd estimation (Beta(2,2) prior).

Changes:
- prop_test: signature (p, n, p0) → (succ, n), Clojure formula
- prop_test_vectorized: same signature change
- comment_stats / compute_group_comment_stats_df: pass raw counts
  (na, ns) / (nd, ns) instead of (pa, ns, 0.5)
- Tests updated for new signature and expected values
- Golden snapshots re-recorded (pat/pdt/metric values changed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jucor jucor force-pushed the jc/clj-parity-d5-prop-test branch from 618a693 to de83253 Compare March 30, 2026 17:05
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 574c169 to b64cae8 Compare March 30, 2026 17:05
This was referenced Mar 30, 2026
@jucor
Collaborator Author

jucor commented Mar 30, 2026

Superseded by spr-managed PR stack. See the new stack starting at #2508.

@jucor jucor closed this Mar 30, 2026