
[Stack 15/27] Fix D9: z-score thresholds from two-tailed to one-tailed#2446

Closed
jucor wants to merge 7 commits into jc/fix-test-db-connection from jc/clj-parity-d9-fix

Conversation

@jucor
Collaborator

@jucor jucor commented Mar 13, 2026

Summary

Stacked on #2443 (Fix test DB connection: use DATABASE_URL with dotenv). Please review and merge #2443 first.
Next in stack: #2448 (Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount))

  • Fix D9: change z-score significance thresholds from two-tailed to one-tailed, matching Clojure's stats.clj
  • Z_90: 1.645 → 1.2816, Z_95: 1.96 → 1.6449
  • Also resolves an internal inconsistency — Python's own stats.py already used the correct one-tailed values

Why one-tailed?

The proportion tests in Polis check whether a comment's agree (or disagree) rate is significantly above 0.5, which is a directional hypothesis. A one-tailed threshold is appropriate because only one direction is tested at a time. The two-tailed values were higher (about 28% for Z_90 and about 19% for Z_95), so fewer comments passed significance.
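
The threshold values quoted in the summary can be cross-checked against standard normal quantiles. The following is a minimal sketch using only Python's standard library; the helper names `one_tailed_z` and `two_tailed_z` are illustrative and are not part of the Polis codebase.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def one_tailed_z(confidence: float) -> float:
    """Critical z when the whole rejection region sits in the upper tail."""
    return _STD_NORMAL.inv_cdf(confidence)


def two_tailed_z(confidence: float) -> float:
    """Critical z when the rejection region is split across both tails."""
    return _STD_NORMAL.inv_cdf(1 - (1 - confidence) / 2)


for conf in (0.90, 0.95):
    one, two = one_tailed_z(conf), two_tailed_z(conf)
    print(f"{conf:.0%}: one-tailed {one:.4f}, two-tailed {two:.4f}, ratio {two / one:.3f}")
```

The ratio column comes out near 1.28 at the 90% level and near 1.19 at the 95% level, which is the gap between the old and new constants.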

Test plan

  • TDD: removed xfail from 3 D9 tests, confirmed red (3 failures), applied fix, confirmed green
  • Discrepancy tests: 63 passed, 6 skipped, 50 xfailed (all 7 datasets including private)
  • Regression tests: 19 passed (all 7 datasets, golden snapshots re-recorded)
  • Repness unit tests: 36 passed (boundary values updated to match new thresholds)
  • 4 pre-existing failures unrelated to D9 (PCA incremental blobs, DB-dependent tests)

🤖 Generated with Claude Code

@jucor jucor changed the base branch from jc/vectorize-participant-info to jc/fix-test-db-connection March 13, 2026 14:12
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 0f32d4c to abfeacc Compare March 13, 2026 14:13
@jucor jucor changed the title from "Fix D9: z-score thresholds from two-tailed to one-tailed" to "[Stack 13/13] Fix D9: z-score thresholds from two-tailed to one-tailed" Mar 13, 2026
@jucor jucor requested a review from Copilot March 13, 2026 14:14
@jucor jucor force-pushed the jc/fix-test-db-connection branch from e2f39bb to 89297cf Compare March 13, 2026 14:17
Contributor

Copilot AI left a comment

Pull request overview

Updates Delphi’s repness significance thresholds to use one-tailed z-score cutoffs (aligning with the Clojure implementation and existing stats.py expectations), and refreshes affected tests and golden snapshots.

Changes:

  • Update Z_90/Z_95 constants in repness.py to one-tailed thresholds (1.2816 / 1.6449).
  • Un-xfail and adjust unit/discrepancy tests to assert the new thresholds (including boundary assertions).
  • Re-record regression golden snapshot data reflecting the new repness behavior.

Reviewed changes

Copilot reviewed 19 out of 21 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| delphi/polismath/pca_kmeans_rep/repness.py | Switch Z-score thresholds to one-tailed constants and update inline documentation. |
| delphi/tests/test_repness_unit.py | Update z-score significance unit tests for the new thresholds. |
| delphi/tests/test_old_format_repness.py | Update backwards-compat repness tests for the new thresholds. |
| delphi/tests/test_discrepancy_fixes.py | Remove D9 xfail markers now that thresholds match expected values. |
| delphi/tests/simplified_repness_test.py | Update the script constant to the new Z_90 value. |
| delphi/real_data/r6vbnhffkxbd7ifmfbdrd-vw/golden_snapshot.json | Update golden snapshot outputs after repness threshold change. |
| delphi/docs/PLAN_DISCREPANCY_FIXES.md | Add task parallelization notes related to discrepancy fix sequencing. |


Comment thread delphi/polismath/pca_kmeans_rep/repness.py Outdated
Comment thread delphi/docs/PLAN_DISCREPANCY_FIXES.md Outdated
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 89297cf to 403be8d Compare March 13, 2026 16:07
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 9149f7f to e3c5893 Compare March 13, 2026 16:07
@jucor jucor changed the title from "[Stack 13/13] Fix D9: z-score thresholds from two-tailed to one-tailed" to "[Stack 13/15] Fix D9: z-score thresholds from two-tailed to one-tailed" Mar 16, 2026
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 403be8d to 464da16 Compare March 16, 2026 16:04
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from db36889 to 69350d5 Compare March 16, 2026 16:04
@jucor jucor changed the title from "[Stack 13/15] Fix D9: z-score thresholds from two-tailed to one-tailed" to "[Stack 13/16] Fix D9: z-score thresholds from two-tailed to one-tailed" Mar 16, 2026
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 69350d5 to 382de2f Compare March 16, 2026 18:06
@jucor jucor changed the title from "[Stack 13/16] Fix D9: z-score thresholds from two-tailed to one-tailed" to "[Stack 13/17] Fix D9: z-score thresholds from two-tailed to one-tailed" Mar 16, 2026
@jucor jucor changed the title from "[Stack 13/17] Fix D9: z-score thresholds from two-tailed to one-tailed" to "[Stack 13/24] Fix D9: z-score thresholds from two-tailed to one-tailed" Mar 17, 2026
@jucor jucor changed the title from "[Stack 13/24] Fix D9: z-score thresholds from two-tailed to one-tailed" to "[Stack 13/25] Fix D9: z-score thresholds from two-tailed to one-tailed" Mar 17, 2026
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from f8a7007 to 19cce44 Compare March 19, 2026 10:03
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 6b81d8e to 6da6172 Compare March 19, 2026 10:43
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 19cce44 to bf2dd99 Compare March 19, 2026 10:43
@jucor jucor changed the title from "[Stack 13/25] Fix D9: z-score thresholds from two-tailed to one-tailed" to "[Stack 12/24] Fix D9: z-score thresholds from two-tailed to one-tailed" Mar 19, 2026
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 6da6172 to 0ef54ca Compare March 19, 2026 12:31
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from bf2dd99 to f8c5793 Compare March 19, 2026 12:31
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 0ef54ca to b62c0cd Compare March 19, 2026 14:52
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from f8c5793 to 7f733c1 Compare March 19, 2026 14:52
@jucor jucor changed the title from "[Stack 12/24] Fix D9: z-score thresholds from two-tailed to one-tailed" to "[Stack 13/25] Fix D9: z-score thresholds from two-tailed to one-tailed" Mar 19, 2026
@jucor jucor force-pushed the jc/fix-test-db-connection branch from b62c0cd to 83bfe23 Compare March 23, 2026 15:09
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 7f733c1 to c920b61 Compare March 23, 2026 15:11
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 397f00d to e668601 Compare March 27, 2026 01:15
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from ee798a6 to 6e54a9c Compare March 27, 2026 01:53
@jucor jucor force-pushed the jc/fix-test-db-connection branch from e668601 to 4ea1333 Compare March 27, 2026 02:10
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch 2 times, most recently from 09747ea to 8d94246 Compare March 27, 2026 10:41
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 4ea1333 to 917cba8 Compare March 27, 2026 10:41
@jucor jucor changed the title from "[Stack 13/25] Fix D9: z-score thresholds from two-tailed to one-tailed" to "[Stack 14/26] Fix D9: z-score thresholds from two-tailed to one-tailed" Mar 30, 2026
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 917cba8 to e45a120 Compare March 30, 2026 12:48
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 8d94246 to 9397ddf Compare March 30, 2026 12:48
@jucor jucor changed the title from "[Stack 14/26] Fix D9: z-score thresholds from two-tailed to one-tailed" to "[Stack 15/27] Fix D9: z-score thresholds from two-tailed to one-tailed" Mar 30, 2026
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 9397ddf to e96a1f7 Compare March 30, 2026 12:54
@github-actions

Delphi Coverage Report

| File | Stmts | Miss | Cover |
| --- | ---: | ---: | ---: |
| __init__.py | 2 | 0 | 100% |
| benchmarks/bench_pca.py | 76 | 76 | 0% |
| benchmarks/bench_repness.py | 81 | 81 | 0% |
| benchmarks/bench_update_votes.py | 38 | 38 | 0% |
| benchmarks/benchmark_utils.py | 34 | 34 | 0% |
| components/__init__.py | 1 | 0 | 100% |
| components/config.py | 165 | 133 | 19% |
| conversation/__init__.py | 2 | 0 | 100% |
| conversation/conversation.py | 1107 | 320 | 71% |
| conversation/manager.py | 131 | 42 | 68% |
| database/__init__.py | 1 | 0 | 100% |
| database/dynamodb.py | 387 | 234 | 40% |
| database/postgres.py | 305 | 205 | 33% |
| pca_kmeans_rep/__init__.py | 5 | 0 | 100% |
| pca_kmeans_rep/clusters.py | 257 | 22 | 91% |
| pca_kmeans_rep/corr.py | 98 | 17 | 83% |
| pca_kmeans_rep/pca.py | 52 | 16 | 69% |
| pca_kmeans_rep/repness.py | 297 | 43 | 86% |
| regression/__init__.py | 4 | 0 | 100% |
| regression/clojure_comparer.py | 188 | 17 | 91% |
| regression/comparer.py | 887 | 720 | 19% |
| regression/datasets.py | 135 | 27 | 80% |
| regression/recorder.py | 36 | 27 | 25% |
| regression/utils.py | 138 | 87 | 37% |
| run_math_pipeline.py | 260 | 114 | 56% |
| umap_narrative/500_generate_embedding_umap_cluster.py | 210 | 109 | 48% |
| umap_narrative/501_calculate_comment_extremity.py | 112 | 53 | 53% |
| umap_narrative/502_calculate_priorities.py | 135 | 135 | 0% |
| umap_narrative/700_datamapplot_for_layer.py | 502 | 502 | 0% |
| umap_narrative/701_static_datamapplot_for_layer.py | 310 | 310 | 0% |
| umap_narrative/702_consensus_divisive_datamapplot.py | 432 | 432 | 0% |
| umap_narrative/801_narrative_report_batch.py | 785 | 785 | 0% |
| umap_narrative/802_process_batch_results.py | 265 | 265 | 0% |
| umap_narrative/803_check_batch_status.py | 175 | 175 | 0% |
| umap_narrative/llm_factory_constructor/__init__.py | 2 | 2 | 0% |
| umap_narrative/llm_factory_constructor/model_provider.py | 157 | 157 | 0% |
| umap_narrative/polismath_commentgraph/__init__.py | 1 | 0 | 100% |
| umap_narrative/polismath_commentgraph/cli.py | 270 | 270 | 0% |
| umap_narrative/polismath_commentgraph/core/__init__.py | 3 | 3 | 0% |
| umap_narrative/polismath_commentgraph/core/clustering.py | 108 | 108 | 0% |
| umap_narrative/polismath_commentgraph/core/embedding.py | 104 | 104 | 0% |
| umap_narrative/polismath_commentgraph/lambda_handler.py | 219 | 219 | 0% |
| umap_narrative/polismath_commentgraph/schemas/__init__.py | 2 | 0 | 100% |
| umap_narrative/polismath_commentgraph/schemas/dynamo_models.py | 160 | 9 | 94% |
| umap_narrative/polismath_commentgraph/tests/conftest.py | 17 | 17 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_clustering.py | 74 | 74 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_embedding.py | 55 | 55 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_storage.py | 87 | 87 | 0% |
| umap_narrative/polismath_commentgraph/utils/__init__.py | 3 | 0 | 100% |
| umap_narrative/polismath_commentgraph/utils/converter.py | 283 | 237 | 16% |
| umap_narrative/polismath_commentgraph/utils/group_data.py | 354 | 336 | 5% |
| umap_narrative/polismath_commentgraph/utils/storage.py | 584 | 518 | 11% |
| umap_narrative/reset_conversation.py | 159 | 50 | 69% |
| umap_narrative/run_pipeline.py | 453 | 312 | 31% |
| utils/general.py | 62 | 41 | 34% |
| Total | 10770 | 7618 | 29% |

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 4 comments.


Comment on lines 44 to 48
     Returns:
         True if significant at 90% confidence
     """
-    return abs(z) >= Z_90
+    return z > Z_90

Copilot AI Mar 30, 2026

The significance helpers were switched to one-tailed semantics (z > Z_90), but Z_90 is still set to 1.645 (two-tailed 90%). To match Clojure’s z-sig-90? (> 1.2816), update the constant (and its comment) to 1.2816; otherwise the new one-tailed check remains overly conservative and D9 parity tests can’t pass once enabled.

Comment on lines 57 to 61
     Returns:
         True if significant at 95% confidence
     """
-    return abs(z) >= Z_95
+    return z > Z_95

Copilot AI Mar 30, 2026

Same issue as Z_90: z_score_sig_95 now uses one-tailed strict z > Z_95, but Z_95 is still 1.96 (two-tailed 95%). Clojure’s z-sig-95? uses 1.6449; update Z_95 accordingly so 95% gating matches the reference implementation.
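
To make the difference between the old and new checks concrete, here is a hedged sketch that mirrors the two forms shown in the diffs above. The `z_score_sig_90_old` helper, the `Z_90_TWO_TAILED` constant, and the sample z-values are illustrative assumptions; only the names `z_score_sig_90`, `z_score_sig_95`, `Z_90`, `Z_95` and the threshold values come from this PR.

```python
# One-tailed thresholds targeted by this PR (values quoted in the review comments).
Z_90 = 1.2816
Z_95 = 1.6449

# Previous two-tailed 90% value, kept only for comparison in this sketch.
Z_90_TWO_TAILED = 1.645


def z_score_sig_90(z: float) -> bool:
    """One-tailed, strict check: only a sufficiently large positive z counts."""
    return z > Z_90


def z_score_sig_95(z: float) -> bool:
    """Same one-tailed, strict check at the 95% level."""
    return z > Z_95


def z_score_sig_90_old(z: float) -> bool:
    """Earlier behaviour: two-tailed and inclusive (hypothetical helper name)."""
    return abs(z) >= Z_90_TWO_TAILED


for z in (-2.0, 1.2816, 1.5, 1.7):
    print(z, z_score_sig_90_old(z), z_score_sig_90(z), z_score_sig_95(z))
```

In this sketch, `z = -2.0` was significant under the old check but is not under the one-tailed check, `z = 1.5` passes only under the new 90% threshold, and a value exactly at the threshold (`1.2816`) does not pass because the comparison is strict.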

Comment on lines 555 to 565
@@ -562,6 +564,20 @@ def test_z95_matches_clojure(self):
check.almost_equal(Z_95, 1.6449, abs=0.001,
msg=f"Z_95 should be 1.6449 (one-tailed), got {Z_95}")

Copilot AI Mar 30, 2026

The D9 threshold value assertions are still marked @pytest.mark.xfail, which means this PR’s stated purpose (changing Z_90/Z_95 values) isn’t actually enforced by CI. Once the constants are updated, these xfails should be removed so regressions in the threshold values will fail the suite.
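
As an illustration of what enforcing the thresholds could look like once the xfail markers are removed, here is a minimal sketch using plain assertions. The constants are defined locally so the snippet runs on its own; in the real suite they would presumably be imported from `repness.py`, and the project's tests use `check.almost_equal` rather than `math.isclose`. Apart from `test_z95_matches_clojure`, the test names are hypothetical.

```python
import math

# Local stand-ins for the constants defined in repness.py (assumed values).
Z_90 = 1.2816
Z_95 = 1.6449


def test_z90_matches_clojure():
    # No xfail marker: a wrong constant now fails the suite instead of being tolerated.
    assert math.isclose(Z_90, 1.2816, abs_tol=0.001), (
        f"Z_90 should be 1.2816 (one-tailed), got {Z_90}"
    )


def test_z95_matches_clojure():
    assert math.isclose(Z_95, 1.6449, abs_tol=0.001), (
        f"Z_95 should be 1.6449 (one-tailed), got {Z_95}"
    )


def test_two_tailed_values_are_rejected():
    # The previous two-tailed constants fall well outside the tolerance.
    assert not math.isclose(1.645, 1.2816, abs_tol=0.001)
    assert not math.isclose(1.96, 1.6449, abs_tol=0.001)


if __name__ == "__main__":
    for test in (
        test_z90_matches_clojure,
        test_z95_matches_clojure,
        test_two_tailed_values_are_rejected,
    ):
        test()
    print("all threshold checks passed")
```

With the markers gone, reverting either constant to its two-tailed value would turn these checks red.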

z_score = two_prop_test(70, 100, 50, 100) # 70/100 vs 50/100
print(f"Comparison Z-score: {z_score}")
```
Statistical helpers (`z_sig_90`, `z_sig_95`, `prop_test`, `two_prop_test`) are

Copilot AI Mar 30, 2026

This doc references z_sig_90/z_sig_95, but repness.py defines z_score_sig_90/z_score_sig_95 (and stats.py has been removed). Update the names here (or add documented aliases) so the docs reflect the actual public API.

Suggested change
- Statistical helpers (`z_sig_90`, `z_sig_95`, `prop_test`, `two_prop_test`) are
+ Statistical helpers (`z_score_sig_90`, `z_score_sig_95`, `prop_test`, `two_prop_test`) are

@jucor jucor force-pushed the jc/fix-test-db-connection branch from b68bd5b to 583f955 Compare March 30, 2026 16:49
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from e96a1f7 to 574c169 Compare March 30, 2026 16:49
jucor and others added 7 commits March 30, 2026 18:04
Map dependency graph and file boundaries for D5-D12. Two parallel
tracks possible: repness formulas (D5→D6→D7→D8→D10→D11) and
conversation/PCA (D3→D15→D12), with D1/D1b after both.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ead stats.py

repness.py: change z_score_sig_90/95 from abs(z) >= threshold to z > threshold,
matching Clojure's (> z-val 1.2816). Also fix inline significance filters in
select_rep_comments_df to use > without .abs().

Remove stats.py and test_stats.py — unused dead code from the original AI-generated
Clojure port. repness.py defines its own z-sig functions and never imported stats.py.

Update usage_examples.md to point to repness.py instead of deleted stats.py.

Add D9 blob comparison tests (significance sets and z-values, xfail until D5/D6),
D5/D6/D7/D8 blob comparison tests with shared_count guards, tid type fixes,
and D9 unit tests for strict > and one-tailed semantics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Require side-by-side Clojure/Python verification for every formula change
- Exhaustive RED phase: boundary conditions, edge cases, missing test audit
- Double-check array shapes, indices, aggregation axes after implementation
- Allow skipping very large private datasets during iteration, only run as final validation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR 14 (vectorized code refactor) is now a prerequisite for all formula
fix PRs, not a post-parity cleanup. It branches off jc/clj-parity-d9-fix
(Stack 13) to make the vectorized production path readable and testable
against Clojure blob values. Remaining dead code cleanup split to PR 14b.

Handoff doc at delphi/docs/HANDOFF_PR14_VECTORIZED_REFACTOR.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The D9 z-score threshold change (two-tailed → one-tailed) combined
with upstream fixes cascaded into this branch changed the repness
output enough to invalidate the vw golden snapshot. Biodiversity
re-recorded too for consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jucor jucor force-pushed the jc/clj-parity-d9-fix branch from 574c169 to b64cae8 Compare March 30, 2026 17:05
@jucor jucor force-pushed the jc/fix-test-db-connection branch from 583f955 to 380b00d Compare March 30, 2026 17:05
This was referenced Mar 30, 2026
@jucor
Collaborator Author

jucor commented Mar 30, 2026

Superseded by spr-managed PR stack. See the new stack starting at #2508.

@jucor jucor closed this Mar 30, 2026