[Stack 15/27] Fix D9: z-score thresholds from two-tailed to one-tailed#2446
[Stack 15/27] Fix D9: z-score thresholds from two-tailed to one-tailed#2446jucor wants to merge 7 commits into
Conversation
0f32d4c to
abfeacc
Compare
e2f39bb to
89297cf
Compare
There was a problem hiding this comment.
Pull request overview
Updates Delphi’s repness significance thresholds to use one-tailed z-score cutoffs (aligning with the Clojure implementation and existing stats.py expectations), and refreshes affected tests and golden snapshots.
Changes:
- Update
Z_90/Z_95constants inrepness.pyto one-tailed thresholds (1.2816 / 1.6449). - Un-xfail and adjust unit/discrepancy tests to assert the new thresholds (including boundary assertions).
- Re-record regression golden snapshot data reflecting the new repness behavior.
Reviewed changes
Copilot reviewed 19 out of 21 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
delphi/polismath/pca_kmeans_rep/repness.py |
Switch Z-score thresholds to one-tailed constants and update inline documentation. |
delphi/tests/test_repness_unit.py |
Update z-score significance unit tests for the new thresholds. |
delphi/tests/test_old_format_repness.py |
Update backwards-compat repness tests for the new thresholds. |
delphi/tests/test_discrepancy_fixes.py |
Remove D9 xfail markers now that thresholds match expected values. |
delphi/tests/simplified_repness_test.py |
Update the script constant to the new Z_90 value. |
delphi/real_data/r6vbnhffkxbd7ifmfbdrd-vw/golden_snapshot.json |
Update golden snapshot outputs after repness threshold change. |
delphi/docs/PLAN_DISCREPANCY_FIXES.md |
Add task parallelization notes related to discrepancy fix sequencing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
89297cf to
403be8d
Compare
9149f7f to
e3c5893
Compare
403be8d to
464da16
Compare
db36889 to
69350d5
Compare
69350d5 to
382de2f
Compare
f8a7007 to
19cce44
Compare
6b81d8e to
6da6172
Compare
19cce44 to
bf2dd99
Compare
6da6172 to
0ef54ca
Compare
bf2dd99 to
f8c5793
Compare
0ef54ca to
b62c0cd
Compare
f8c5793 to
7f733c1
Compare
b62c0cd to
83bfe23
Compare
7f733c1 to
c920b61
Compare
397f00d to
e668601
Compare
ee798a6 to
6e54a9c
Compare
e668601 to
4ea1333
Compare
09747ea to
8d94246
Compare
4ea1333 to
917cba8
Compare
917cba8 to
e45a120
Compare
8d94246 to
9397ddf
Compare
9397ddf to
e96a1f7
Compare
Delphi Coverage Report
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 11 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Returns: | ||
| True if significant at 90% confidence | ||
| """ | ||
| return abs(z) >= Z_90 | ||
| return z > Z_90 | ||
|
|
There was a problem hiding this comment.
The significance helpers were switched to one-tailed semantics (z > Z_90), but Z_90 is still set to 1.645 (two-tailed 90%). To match Clojure’s z-sig-90? (> 1.2816), update the constant (and its comment) to 1.2816; otherwise the new one-tailed check remains overly conservative and D9 parity tests can’t pass once enabled.
| Returns: | ||
| True if significant at 95% confidence | ||
| """ | ||
| return abs(z) >= Z_95 | ||
| return z > Z_95 | ||
|
|
There was a problem hiding this comment.
Same issue as Z_90: z_score_sig_95 now uses one-tailed strict z > Z_95, but Z_95 is still 1.96 (two-tailed 95%). Clojure’s z-sig-95? uses 1.6449; update Z_95 accordingly so 95% gating matches the reference implementation.
| @@ -562,6 +564,20 @@ def test_z95_matches_clojure(self): | |||
| check.almost_equal(Z_95, 1.6449, abs=0.001, | |||
| msg=f"Z_95 should be 1.6449 (one-tailed), got {Z_95}") | |||
There was a problem hiding this comment.
The D9 threshold value assertions are still marked @pytest.mark.xfail, which means this PR’s stated purpose (changing Z_90/Z_95 values) isn’t actually enforced by CI. Once the constants are updated, these xfails should be removed so regressions in the threshold values will fail the suite.
| z_score = two_prop_test(70, 100, 50, 100) # 70/100 vs 50/100 | ||
| print(f"Comparison Z-score: {z_score}") | ||
| ``` | ||
| Statistical helpers (`z_sig_90`, `z_sig_95`, `prop_test`, `two_prop_test`) are |
There was a problem hiding this comment.
This doc references z_sig_90/z_sig_95, but repness.py defines z_score_sig_90/z_score_sig_95 (and stats.py has been removed). Update the names here (or add documented aliases) so the docs reflect the actual public API.
| Statistical helpers (`z_sig_90`, `z_sig_95`, `prop_test`, `two_prop_test`) are | |
| Statistical helpers (`z_score_sig_90`, `z_score_sig_95`, `prop_test`, `two_prop_test`) are |
b68bd5b to
583f955
Compare
e96a1f7 to
574c169
Compare
Map dependency graph and file boundaries for D5-D12. Two parallel tracks possible: repness formulas (D5→D6→D7→D8→D10→D11) and conversation/PCA (D3→D15→D12), with D1/D1b after both. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ead stats.py repness.py: change z_score_sig_90/95 from abs(z) >= threshold to z > threshold, matching Clojure's (> z-val 1.2816). Also fix inline significance filters in select_rep_comments_df to use > without .abs(). Remove stats.py and test_stats.py — unused dead code from the original AI-generated Clojure port. repness.py defines its own z-sig functions and never imported stats.py. Update usage_examples.md to point to repness.py instead of deleted stats.py. Add D9 blob comparison tests (significance sets and z-values, xfail until D5/D6), D5/D6/D7/D8 blob comparison tests with shared_count guards, tid type fixes, and D9 unit tests for strict > and one-tailed semantics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Require side-by-side Clojure/Python verification for every formula change - Exhaustive RED phase: boundary conditions, edge cases, missing test audit - Double-check array shapes, indices, aggregation axes after implementation - Allow skipping very large private datasets during iteration, only run as final validation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR 14 (vectorized code refactor) is now a prerequisite for all formula fix PRs, not a post-parity cleanup. It branches off jc/clj-parity-d9-fix (Stack 13) to make the vectorized production path readable and testable against Clojure blob values. Remaining dead code cleanup split to PR 14b. Handoff doc at delphi/docs/HANDOFF_PR14_VECTORIZED_REFACTOR.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The D9 z-score threshold change (two-tailed → one-tailed) combined with upstream fixes cascaded into this branch changed the repness output enough to invalidate the vw golden snapshot. Biodiversity re-recorded too for consistency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
574c169 to
b64cae8
Compare
583f955 to
380b00d
Compare
|
Superseded by spr-managed PR stack. See the new stack starting at #2508. |
Summary
stats.cljZ_90: 1.645 → 1.2816,Z_95: 1.96 → 1.6449stats.pyalready used the correct one-tailed valuesWhy one-tailed?
The proportion tests in Polis check whether a comment's agree (or disagree) rate is significantly above 0.5 — a directional hypothesis. One-tailed is correct because we only care about one direction at a time. The two-tailed values were 28% more conservative, causing fewer comments to pass significance.
Test plan
🤖 Generated with Claude Code