Fix D9: z-score thresholds from two-tailed to one-tailed by jucor · Pull Request #2518 · compdemocracy/polis

jucor · 2026-03-30T22:25:17Z

Summary

Fix D9: change z-score significance thresholds from two-tailed to one-tailed, matching Clojure's stats.clj
Z_90: 1.645 → 1.2816, Z_95: 1.96 → 1.6449
Also resolves an internal inconsistency — Python's own stats.py already used the correct one-tailed values

Why one-tailed?

The proportion tests in Polis check whether a comment's agree (or disagree) rate is significantly above 0.5 — a directional hypothesis. One-tailed is correct because we only care about one direction at a time. The two-tailed values were 28% more conservative, causing fewer comments to pass significance.

Test plan

TDD: removed xfail from 3 D9 tests, confirmed red (3 failures), applied fix, confirmed green
Discrepancy tests: 63 passed, 6 skipped, 50 xfailed (all 7 datasets including private)
Regression tests: 19 passed (all 7 datasets, golden snapshots re-recorded)
Repness unit tests: 36 passed (boundary values updated to match new thresholds)
4 pre-existing failures unrelated to D9 (PCA incremental blobs, DB-dependent tests)

🤖 Generated with Claude Code

Squashed commits

Plan: add task parallelization analysis for remaining fixes
Fix D9: match Clojure z-sig semantics (strict >, no abs) and remove dead stats.py
Re-record vw golden snapshot after D9 z-sig semantics change
Update plan: mark D9 as done, note stats.py removal for next PR
Add mathematical rigor and exhaustive testing guidance to fix plan
Plan: move PR 14 earlier (prerequisite for blob tests) + add handoff doc
Re-record golden snapshots after upstream cascade

commit-id:0194003d

Stack:

⚠️ Part of a stack created by spr. Do not merge manually using the UI - doing so may have unexpected results.

Copilot

Pull request overview

This PR aims to align Delphi’s representativeness significance gating with the Clojure reference by switching z-score significance checks from two-tailed (abs(z) >= threshold) to one-tailed, strict comparisons (z > threshold), and by removing the now-dead pca_kmeans_rep/stats.py module + its associated tests/docs.

Changes:

Update z-score significance semantics to one-tailed + strict > in repness.py, and update rep-comment selection filters accordingly.
Remove delphi/polismath/pca_kmeans_rep/stats.py and delphi/tests/test_stats.py; adjust docs to point at repness.py.
Expand D9 discrepancy tests and update repness unit test boundary expectations.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
`delphi/tests/test_stats.py`	Removes tests for the deleted `stats.py` module.
`delphi/tests/test_repness_unit.py`	Updates z-significance unit tests for one-tailed strict behavior.
`delphi/tests/test_old_format_repness.py`	Mirrors the updated z-significance unit test expectations for the old-format path.
`delphi/tests/test_discrepancy_fixes.py`	Adds D9 semantic tests and new (currently xfailed) constant/threshold checks.
`delphi/polismath/pca_kmeans_rep/stats.py`	Deletes the legacy stats helper module.
`delphi/polismath/pca_kmeans_rep/repness.py`	Switches z-sig checks and rep-comment selection to one-tailed strict comparisons.
`delphi/docs/usage_examples.md`	Updates docs to reference statistical helpers in `repness.py` (needs name correction).
`delphi/docs/PLAN_DISCREPANCY_FIXES.md`	Updates plan metadata / cross-references (currently references an inconsistent PR number).
`delphi/docs/HANDOFF_PR14_VECTORIZED_REFACTOR.md`	Adds a handoff doc for the upcoming vectorized refactor/testing work.

Comments suppressed due to low confidence (1)

delphi/tests/test_discrepancy_fixes.py:554

These D9 tests are still marked xfail, so CI won’t catch regressions and the suite won’t enforce the new one-tailed Z_90 value. Once Z_90 is updated, remove the xfail marker so the threshold change is actually validated.

    @pytest.mark.xfail(reason="D9: Z_90=1.645 (two-tailed), target is 1.2816 (one-tailed)")
    def test_z90_matches_clojure(self):
        """Z_90 should be one-tailed (1.2816), not two-tailed (1.645)."""
        check.almost_equal(Z_90, 1.2816, abs=0.001,
                            msg=f"Z_90 should be 1.2816 (one-tailed), got {Z_90}")

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


    Returns:
        True if significant at 90% confidence
    """
-    return abs(z) >= Z_90
+    return z > Z_90



    Returns:
        True if significant at 95% confidence
    """
-    return abs(z) >= Z_95
+    return z > Z_95


    @pytest.mark.xfail(reason="D9: Z_95=1.96 (two-tailed), target is 1.6449 (one-tailed)")
    def test_z95_matches_clojure(self):
        """Z_95 should be one-tailed (1.6449), not two-tailed (1.96)."""
        check.almost_equal(Z_95, 1.6449, abs=0.001,
                            msg=f"Z_95 should be 1.6449 (one-tailed), got {Z_95}")


## Summary - Fix D9: change z-score significance thresholds from two-tailed to one-tailed, matching Clojure's `stats.clj` - `Z_90`: 1.645 → 1.2816, `Z_95`: 1.96 → 1.6449 - Also resolves an internal inconsistency — Python's own `stats.py` already used the correct one-tailed values ## Why one-tailed? The proportion tests in Polis check whether a comment's agree (or disagree) rate is **significantly above 0.5** — a directional hypothesis. One-tailed is correct because we only care about one direction at a time. The two-tailed values were 28% more conservative, causing fewer comments to pass significance. ## Test plan - [x] TDD: removed xfail from 3 D9 tests, confirmed red (3 failures), applied fix, confirmed green - [x] Discrepancy tests: 63 passed, 6 skipped, 50 xfailed (all 7 datasets including private) - [x] Regression tests: 19 passed (all 7 datasets, golden snapshots re-recorded) - [x] Repness unit tests: 36 passed (boundary values updated to match new thresholds) - [x] 4 pre-existing failures unrelated to D9 (PCA incremental blobs, DB-dependent tests) 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Squashed commits - Plan: add task parallelization analysis for remaining fixes - Fix D9: match Clojure z-sig semantics (strict >, no abs) and remove dead stats.py - Re-record vw golden snapshot after D9 z-sig semantics change - Update plan: mark D9 as done, note stats.py removal for next PR - Add mathematical rigor and exhaustive testing guidance to fix plan - Plan: move PR 14 earlier (prerequisite for blob tests) + add handoff doc - Re-record golden snapshots after upstream cascade commit-id:0194003d

github-actions · 2026-05-19T22:43:51Z

Delphi Coverage Report

File	Stmts	Miss	Cover
init.py	2	0	100%
benchmarks/bench_pca.py	76	76	0%
benchmarks/bench_repness.py	81	81	0%
benchmarks/bench_update_votes.py	38	38	0%
benchmarks/benchmark_utils.py	34	34	0%
components/init.py	1	0	100%
components/config.py	165	133	19%
conversation/init.py	2	0	100%
conversation/conversation.py	1107	320	71%
conversation/manager.py	131	42	68%
database/init.py	1	0	100%
database/dynamodb.py	387	234	40%
database/postgres.py	305	205	33%
pca_kmeans_rep/init.py	5	0	100%
pca_kmeans_rep/clusters.py	257	22	91%
pca_kmeans_rep/corr.py	98	17	83%
pca_kmeans_rep/pca.py	52	16	69%
pca_kmeans_rep/repness.py	297	43	86%
regression/init.py	4	0	100%
regression/clojure_comparer.py	188	17	91%
regression/comparer.py	887	720	19%
regression/datasets.py	135	27	80%
regression/recorder.py	36	27	25%
regression/utils.py	138	94	32%
run_math_pipeline.py	260	114	56%
umap_narrative/500_generate_embedding_umap_cluster.py	210	109	48%
umap_narrative/501_calculate_comment_extremity.py	112	53	53%
umap_narrative/502_calculate_priorities.py	135	135	0%
umap_narrative/700_datamapplot_for_layer.py	502	502	0%
umap_narrative/701_static_datamapplot_for_layer.py	310	310	0%
umap_narrative/702_consensus_divisive_datamapplot.py	432	432	0%
umap_narrative/801_narrative_report_batch.py	785	785	0%
umap_narrative/802_process_batch_results.py	265	265	0%
umap_narrative/803_check_batch_status.py	175	175	0%
umap_narrative/llm_factory_constructor/init.py	2	2	0%
umap_narrative/llm_factory_constructor/model_provider.py	157	157	0%
umap_narrative/polismath_commentgraph/init.py	1	0	100%
umap_narrative/polismath_commentgraph/cli.py	270	270	0%
umap_narrative/polismath_commentgraph/core/init.py	3	3	0%
umap_narrative/polismath_commentgraph/core/clustering.py	108	108	0%
umap_narrative/polismath_commentgraph/core/embedding.py	104	104	0%
umap_narrative/polismath_commentgraph/lambda_handler.py	219	219	0%
umap_narrative/polismath_commentgraph/schemas/init.py	2	0	100%
umap_narrative/polismath_commentgraph/schemas/dynamo_models.py	160	9	94%
umap_narrative/polismath_commentgraph/tests/conftest.py	17	17	0%
umap_narrative/polismath_commentgraph/tests/test_clustering.py	74	74	0%
umap_narrative/polismath_commentgraph/tests/test_embedding.py	55	55	0%
umap_narrative/polismath_commentgraph/tests/test_storage.py	87	87	0%
umap_narrative/polismath_commentgraph/utils/init.py	3	0	100%
umap_narrative/polismath_commentgraph/utils/converter.py	283	237	16%
umap_narrative/polismath_commentgraph/utils/group_data.py	354	336	5%
umap_narrative/polismath_commentgraph/utils/storage.py	584	518	11%
umap_narrative/reset_conversation.py	159	50	69%
umap_narrative/run_pipeline.py	453	312	31%
utils/general.py	62	41	34%
Total	10770	7625	29%

jucor changed the title ~~Fix D9: z-score thresholds from two-tailed to one-tailed~~ [Stack 11/17] Fix D9: z-score thresholds from two-tailed to one-tailed Mar 30, 2026

jucor force-pushed the spr/edge/0194003d branch 3 times, most recently from 24de40d to add1343 Compare March 31, 2026 00:35

ballPointPenguin approved these changes Apr 26, 2026

View reviewed changes

jucor requested a review from Copilot May 19, 2026 21:44

Copilot started reviewing on behalf of jucor May 19, 2026 21:44 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

jucor changed the title ~~[Stack 11/17] Fix D9: z-score thresholds from two-tailed to one-tailed~~ Fix D9: z-score thresholds from two-tailed to one-tailed May 19, 2026

jucor force-pushed the spr/edge/b9062b50 branch from 517908a to 97af721 Compare May 19, 2026 22:09

jucor force-pushed the spr/edge/0194003d branch from add1343 to a244e09 Compare May 19, 2026 22:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix D9: z-score thresholds from two-tailed to one-tailed#2518

Fix D9: z-score thresholds from two-tailed to one-tailed#2518
jucor wants to merge 1 commit into
spr/edge/b9062b50from
spr/edge/0194003d

jucor commented Mar 30, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jucor commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why one-tailed?

Test plan

Squashed commits

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 19, 2026

Delphi Coverage Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jucor commented Mar 30, 2026 •

edited

Loading