diff --git a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md index d0092623a..1409f88cb 100644 --- a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md +++ b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md @@ -387,11 +387,64 @@ Detailed analysis in `HANDOFF_REGRESSION_TEST_PERF.md` for a future session. ### What's Next -1. **PR 3 — Fix D9 (Z-score thresholds)**: `Z_90=1.645` → `1.2816`, `Z_95=1.96` → `1.6449` +1. **PR 4 — Fix D5 (Proportion test)**: Change `prop_test` from standard z-test to Clojure formula. 2. Regression test performance optimization (separate session) --- +## PR 4: Fix D5 — Proportion Test Formula + +### TDD steps +1. **Baseline**: 1 failed (pakistan-incremental D2, pre-existing), 91 passed, 5 skipped, 129 xfailed, 2 xpassed +2. **Red**: Wrote tests calling `prop_test(succ, n)` (new signature) → 3 failures (TypeError: missing p0 arg) +3. **Fix**: Replaced `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula: + `2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)` (stats.clj:10-15) +4. **Green**: All 7 D5 tests pass (formula checks + sanity checks + edge cases) +5. **Full suite (public)**: 4 regression failures (expected — pat/pdt/metric values changed) +6. **Investigation**: All diffs are in `pat`, `pdt`, `agree_metric`, `disagree_metric` — direct + downstream of the prop_test formula change. No unexpected field changes. +7. **Re-recorded golden snapshots** for all 7 datasets (public + private) +8. **Full suite (with --include-local)**: 1 failed (pakistan-incremental D2, pre-existing), + 91 passed, 5 skipped, 129 xfailed, 2 xpassed — no regressions from D5 + +### Changes +- `repness.py`: `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula + `2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)`. Added detailed docstring explaining + the Wilson-score-like regularization and the separate pseudocount from pa/pd. +- `repness.py`: `prop_test_vectorized(p, n, p0)` → `prop_test_vectorized(succ, n)` +- `repness.py`: Updated callers in `comment_stats()` and `compute_group_comment_stats_df()` + to pass raw counts `(na, ns)` / `(nd, ns)` instead of `(pa, ns, 0.5)` / `(pd, ns, 0.5)` +- `test_discrepancy_fixes.py`: Removed xfail from D5 formula test, added comprehensive + test cases (8 input pairs including boundary conditions) and edge case test +- `test_repness_unit.py`: Updated `test_prop_test` and vectorized tests for new signature +- `test_old_format_repness.py`: Updated `test_prop_test` for new signature + +### Key insight: two separate pseudocounts +The Clojure `prop-test` has its own built-in +1 pseudocount (Laplace smoothing / Beta(1,1)), +separate from the PSEUDO_COUNT=2.0 used for pa/pd (Beta(2,2)). The prop_test takes raw +success counts, not pre-smoothed probabilities. This means: +- `pa = (na + 1) / (ns + 2)` — Beta(2,2) prior for probability estimation +- `pat = 2 * sqrt(ns+1) * ((na+1)/(ns+1) - 0.5)` — Beta(1,1) prior for significance testing + +These are conceptually different: the probability is for ranking, the z-score is for +significance filtering. Using different priors is intentional. + +### Session 7 (2026-03-13) + +- Created branch `jc/clj-parity-d5-prop-test` on top of `jc/clj-parity-d9-fix` +- Read Clojure source (stats.clj:10-15, repness.clj:74-75) to verify formula +- TDD cycle: red (3 TypeError failures) → fix → green (7 pass, 4 xfail) +- Full suite: 4 regression failures, all in pat/pdt/metric fields (expected) +- Re-recorded golden snapshots for all 7 datasets +- Final validation: 19/19 regression tests pass, 1 pre-existing failure (pakistan-incremental D2) + +### What's Next + +1. **PR 5 — Fix D6 (Two-proportion test)**: Add +1 pseudocount to all 4 inputs, change signature + from proportions to raw counts. + +--- + ## TDD Discipline **CRITICAL: For every fix, ALWAYS follow this order:** diff --git a/delphi/docs/PLAN_DISCREPANCY_FIXES.md b/delphi/docs/PLAN_DISCREPANCY_FIXES.md index 0172d4837..f4c244e6d 100644 --- a/delphi/docs/PLAN_DISCREPANCY_FIXES.md +++ b/delphi/docs/PLAN_DISCREPANCY_FIXES.md @@ -20,6 +20,7 @@ This plan's "PR N" labels map to actual GitHub PRs as follows: | PR 2 (D4) | #2435 | Stack 9/10 | Fix D4: pseudocount formula | | (perf) | #2436 | Stack 10/10 | Speed up regression tests | | PR 3 (D9) | #2518 | — | Fix D9: z-score thresholds (one-tailed) | +| PR 4 (D5) | #2448 | Stack 14/25 | Fix D5: proportion test formula | Future fix PRs will be appended to the stack as they're created. @@ -465,7 +466,7 @@ By this point, we should have good test coverage from all the per-discrepancy te | D2d | In-conv monotonicity (once in, always in) | **PR 1** | **#2421** | **DONE** ✓ (5 guard tests, T1-T5) | | D3 | K-smoother buffer | PR 10 | — | Fix | | D4 | Pseudocount formula | **PR 2** | **#2435** | **DONE** ✓ | -| D5 | Proportion test | PR 4 | — | Fix | +| D5 | Proportion test | **PR 4** | — | **DONE** ✓ | | D6 | Two-proportion test | PR 5 | — | Fix | | D7 | Repness metric | PR 6 | — | Fix (with flag for old formula) | | D8 | Finalize cmt stats | PR 7 | — | Fix | diff --git a/delphi/polismath/pca_kmeans_rep/repness.py b/delphi/polismath/pca_kmeans_rep/repness.py index cc869acdb..82f53556f 100644 --- a/delphi/polismath/pca_kmeans_rep/repness.py +++ b/delphi/polismath/pca_kmeans_rep/repness.py @@ -60,29 +60,39 @@ def z_score_sig_95(z: float) -> bool: return z > Z_95 -def prop_test(p: float, n: int, p0: float) -> float: +def prop_test(succ: int, n: int) -> float: """ - One-proportion z-test. - + One-proportion z-test, matching Clojure's stats/prop-test (stats.clj:10-15). + + Clojure formula: + (let [[succ n] (map inc [succ n])] + (* 2 (sqrt n) (+ (/ succ n) -0.5))) + + Which simplifies to: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5) + + This is a Wilson-score-like test with built-in +1 pseudocount (Laplace + smoothing). Unlike the standard z-test ((p - p0) / sqrt(p0*(1-p0)/n)), + the +1 terms regularize extreme values for small samples, preventing + spurious significance in small Polis groups. + + Note: the pseudocount here (+1 to succ and n, i.e. Beta(1,1)) is + independent of the PSEUDO_COUNT used for pa/pd computation (Beta(2,2)). + Clojure's prop-test takes raw success counts, not pre-smoothed + probabilities. + Args: - p: Observed proportion - n: Number of observations - p0: Expected proportion under null hypothesis - + succ: Number of successes (e.g. agrees or disagrees) + n: Total number of trials (votes seen) + Returns: - Z-score + Z-score (positive means succ/n > 0.5) """ - if n == 0 or p0 == 0 or p0 == 1: + if n == 0: return 0.0 - - # Calculate standard error - se = math.sqrt(p0 * (1 - p0) / n) - - # Z-score calculation - if se == 0: - return 0.0 - else: - return (p - p0) / se + # Apply +1 pseudocount to both numerator and denominator + succ_pc = succ + 1 + n_pc = n + 1 + return 2 * math.sqrt(n_pc) * (succ_pc / n_pc - 0.5) def two_prop_test(p1: float, n1: int, p2: float, n2: int) -> float: @@ -137,9 +147,10 @@ def comment_stats(votes: np.ndarray, group_members: List[int]) -> Dict[str, Any] p_agree = (n_agree + PSEUDO_COUNT/2) / (n_votes + PSEUDO_COUNT) if n_votes > 0 else 0.5 p_disagree = (n_disagree + PSEUDO_COUNT/2) / (n_votes + PSEUDO_COUNT) if n_votes > 0 else 0.5 - # Calculate significance tests - p_agree_test = prop_test(p_agree, n_votes, 0.5) if n_votes > 0 else 0.0 - p_disagree_test = prop_test(p_disagree, n_votes, 0.5) if n_votes > 0 else 0.0 + # Calculate significance tests — pass raw counts, matching Clojure's + # (stats/prop-test na ns) and (stats/prop-test nd ns) (repness.clj:74-75) + p_agree_test = prop_test(n_agree, n_votes) if n_votes > 0 else 0.0 + p_disagree_test = prop_test(n_disagree, n_votes) if n_votes > 0 else 0.0 # Return stats return { @@ -457,23 +468,28 @@ def select_consensus_comments(all_stats: List[Dict[str, Any]]) -> List[Dict[str, # Vectorized DataFrame-native functions for multi-group operations # ============================================================================= -def prop_test_vectorized(p: pd.Series, n: pd.Series, p0: float = 0.5) -> pd.Series: +def prop_test_vectorized(succ: pd.Series, n: pd.Series) -> pd.Series: """ - Vectorized one-proportion z-test. + Vectorized one-proportion z-test, matching Clojure's stats/prop-test. + + Formula: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5) + + See prop_test() docstring for derivation and rationale. Args: - p: Series of observed proportions - n: Series of number of observations - p0: Expected proportion under null hypothesis (default: 0.5) + succ: Series of success counts (e.g. agrees or disagrees) + n: Series of total trial counts (votes seen) Returns: Series of z-scores """ - se = np.sqrt(p0 * (1 - p0) / n) - z = (p - p0) / se - # Handle edge cases: n=0, p0=0, p0=1 all result in 0 + succ_pc = succ + 1 + n_pc = n + 1 + z = 2 * np.sqrt(n_pc) * (succ_pc / n_pc - 0.5) + # Handle n=0 edge case (n_pc=1, succ_pc=1 → z = 2*1*(1/1 - 0.5) = 1.0, + # but we want 0 for no-data rows) + z = z.where(n > 0, 0.0) z = z.fillna(0.0) - z = z.replace([np.inf, -np.inf], 0.0) return z @@ -620,9 +636,10 @@ def compute_group_comment_stats_df(votes_long: pd.DataFrame, stats_df.loc[other_zero_mask, 'other_pa'] = 0.5 stats_df.loc[other_zero_mask, 'other_pd'] = 0.5 - # Compute proportion tests (group vs 0.5) - stats_df['pat'] = prop_test_vectorized(stats_df['pa'], stats_df['ns'], 0.5) - stats_df['pdt'] = prop_test_vectorized(stats_df['pd'], stats_df['ns'], 0.5) + # Compute proportion tests — pass raw counts, matching Clojure's + # (stats/prop-test na ns) and (stats/prop-test nd ns) (repness.clj:74-75) + stats_df['pat'] = prop_test_vectorized(stats_df['na'], stats_df['ns']) + stats_df['pdt'] = prop_test_vectorized(stats_df['nd'], stats_df['ns']) # Compute representativeness ratios (group vs other) stats_df['ra'] = stats_df['pa'] / stats_df['other_pa'] diff --git a/delphi/tests/test_discrepancy_fixes.py b/delphi/tests/test_discrepancy_fixes.py index e3abee434..316ad80d5 100644 --- a/delphi/tests/test_discrepancy_fixes.py +++ b/delphi/tests/test_discrepancy_fixes.py @@ -695,23 +695,32 @@ class TestD5ProportionTest: Clojure uses Wilson-score-like: 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5) Clojure formula has built-in regularization via +1 terms. + After fix, prop_test(succ, n) matches Clojure exactly. """ - @pytest.mark.xfail(reason="D5: Python standard z-test vs Clojure Wilson-score-like") def test_prop_test_matches_clojure_formula(self): - """prop_test should match Clojure's formula for known inputs.""" - # Example: 12 successes out of 13 trials - succ, n = 12, 13 - # Clojure formula: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5) - expected = 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5) - - # Current Python: prop_test(p, n, 0.5) where p = (succ + pc/2) / (n + pc) - p = (succ + PSEUDO_COUNT / 2) / (n + PSEUDO_COUNT) - python_result = prop_test(p, n, 0.5) - - print(f"prop_test(succ={succ}, n={n}): Python={python_result:.4f}, Clojure={expected:.4f}") - check.almost_equal(python_result, expected, abs=0.01, - msg=f"prop_test mismatch: Python={python_result:.4f}, Clojure={expected:.4f}") + """prop_test(succ, n) should match Clojure's formula for known inputs.""" + test_cases = [ + (12, 13), # High success rate + (5, 8), # Moderate + (0, 10), # All failures + (10, 10), # All successes + (1, 2), # Tiny sample + (50, 100), # Larger sample + (0, 1), # Single trial, no success + (1, 1), # Single trial, success + ] + for succ, n in test_cases: + # Clojure formula: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5) + expected = 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5) + result = prop_test(succ, n) + check.almost_equal(result, expected, abs=1e-10, + msg=f"prop_test({succ}, {n}): got {result:.6f}, expected {expected:.6f}") + + def test_prop_test_edge_cases(self): + """prop_test handles n=0 gracefully.""" + # n=0 should return 0 (no data) + assert prop_test(0, 0) == 0.0 def test_clojure_pat_values_consistent_with_formula(self, clojure_blob, dataset_name): """Sanity check: Clojure's p-test values match the documented formula.""" @@ -1157,14 +1166,13 @@ def test_z_thresholds_are_one_tailed(self): check.almost_equal(Z_95, 1.6449, abs=0.001, msg=f"Z_95={Z_95}, expected 1.6449 (one-tailed)") - def test_clojure_prop_test_formula(self): - """Verify Clojure's proportion test formula: 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5).""" + def test_prop_test_matches_clojure_formula_synthetic(self): + """prop_test(succ, n) should produce 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5).""" # Small n: 5 successes out of 8 trials succ, n = 5, 8 - result = 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5) - # Manual: 2 * 3 * (6/9 - 0.5) = 6 * 0.1667 = 1.0 - expected = 2 * 3.0 * (6.0 / 9.0 - 0.5) - assert abs(result - expected) < 1e-10 + expected = 2 * 3.0 * (6.0 / 9.0 - 0.5) # = 1.0 + result = prop_test(succ, n) + assert abs(result - expected) < 1e-10, f"prop_test({succ}, {n})={result}, expected {expected}" def test_clojure_repness_metric_product(self): """Verify Clojure's repness metric is a product: ra * rat * pa * pat.""" @@ -1179,3 +1187,51 @@ def test_clojure_repful_uses_rat_vs_rdt(self): # rat < rdt → disagree assert (0.5 < 1.5) # rat=0.5, rdt=1.5 → disagree + + +# ============================================================================ +# Blob Injection Tests — Compare Python functions against real Clojure values +# ============================================================================ +# +# These tests extract inputs from the Clojure math blob, feed them to Python +# functions, and compare outputs to the Clojure blob's values. This is the +# only non-tautological way to verify correctness: formula-only tests just +# re-implement our reading of the Clojure source and can't catch misreadings. +# +# Since Python and Clojure may produce different clusters (different k), we +# inject Clojure's own group memberships and vote counts from the blob, +# isolating each computation stage from upstream divergence. +# ============================================================================ + +@pytest.mark.clojure_comparison +class TestD5BlobInjection: + """D5: Verify prop_test against real Clojure blob p-test values. + + For each repness entry in the blob, extract n-success and n-trials, + feed to Python's prop_test(), compare to blob's p-test. + """ + + def test_prop_test_matches_blob_p_test(self, clojure_blob, dataset_name): + """prop_test(n_success, n_trials) should match blob's p-test for every repness entry.""" + repness = clojure_blob.get('repness', {}) + if not repness: + pytest.skip(f"No repness in Clojure blob for {dataset_name}") + + mismatches = [] + total = 0 + for gid, entries in repness.items(): + for entry in entries: + n_success = entry['n-success'] + n_trials = entry['n-trials'] + expected_p_test = entry['p-test'] + actual = prop_test(n_success, n_trials) + total += 1 + if abs(actual - expected_p_test) > 1e-4: + mismatches.append( + f"group={gid} tid={entry['tid']}: " + f"prop_test({n_success}, {n_trials})={actual:.6f}, " + f"blob p-test={expected_p_test:.6f}") + + assert not mismatches, ( + f"[{dataset_name}] {len(mismatches)}/{total} p-test mismatches:\n" + + "\n".join(mismatches[:10])) diff --git a/delphi/tests/test_old_format_repness.py b/delphi/tests/test_old_format_repness.py index 514ddf9e8..113afc700 100644 --- a/delphi/tests/test_old_format_repness.py +++ b/delphi/tests/test_old_format_repness.py @@ -5,6 +5,7 @@ that wraps the new DataFrame-native implementation. """ +import math import numpy as np import pandas as pd import sys @@ -43,15 +44,16 @@ def test_z_score_significance(self): assert not z_score_sig_95(1.64) def test_prop_test(self): - """Test one-proportion z-test.""" - # Test cases - assert np.isclose(prop_test(0.7, 100, 0.5), 4.0, atol=0.1) - assert np.isclose(prop_test(0.2, 50, 0.3), -1.6, atol=0.1) - - # Edge cases - assert prop_test(0.5, 0, 0.5) == 0.0 - assert prop_test(0.7, 100, 0.0) == 0.0 - assert prop_test(0.7, 100, 1.0) == 0.0 + """Test one-proportion z-test (Clojure formula: 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)).""" + # 70 successes out of 100 + assert np.isclose(prop_test(70, 100), + 2 * math.sqrt(101) * (71/101 - 0.5), atol=0.01) + # 10 successes out of 50 + assert np.isclose(prop_test(10, 50), + 2 * math.sqrt(51) * (11/51 - 0.5), atol=0.01) + + # Edge case: n=0 + assert prop_test(0, 0) == 0.0 def test_two_prop_test(self): """Test two-proportion z-test.""" diff --git a/delphi/tests/test_repness_unit.py b/delphi/tests/test_repness_unit.py index dfdcad8aa..3efb7672a 100644 --- a/delphi/tests/test_repness_unit.py +++ b/delphi/tests/test_repness_unit.py @@ -44,15 +44,19 @@ def test_z_score_significance(self): assert not z_score_sig_95(1.64) def test_prop_test(self): - """Test one-proportion z-test.""" - # Test cases - assert np.isclose(prop_test(0.7, 100, 0.5), 4.0, atol=0.1) - assert np.isclose(prop_test(0.2, 50, 0.3), -1.6, atol=0.1) - - # Edge cases - assert prop_test(0.5, 0, 0.5) == 0.0 - assert prop_test(0.7, 100, 0.0) == 0.0 - assert prop_test(0.7, 100, 1.0) == 0.0 + """Test one-proportion z-test (Clojure formula: 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)).""" + # 70 successes out of 100: 2*sqrt(101)*((71/101)-0.5) = ~4.19 + assert np.isclose(prop_test(70, 100), + 2 * math.sqrt(101) * (71/101 - 0.5), atol=0.01) + # 10 successes out of 50: 2*sqrt(51)*((11/51)-0.5) = ~-4.29 + assert np.isclose(prop_test(10, 50), + 2 * math.sqrt(51) * (11/51 - 0.5), atol=0.01) + + # Edge case: n=0 → 0.0 + assert prop_test(0, 0) == 0.0 + # Single trial: 2*sqrt(2)*((2/2)-0.5) = 2*1.414*0.5 = 1.414 + assert np.isclose(prop_test(1, 1), + 2 * math.sqrt(2) * 0.5, atol=0.01) def test_two_prop_test(self): """Test two-proportion z-test.""" @@ -550,23 +554,23 @@ class TestVectorizedFunctions: """Tests for DataFrame-native vectorized functions.""" def test_prop_test_vectorized(self): - """Test vectorized one-proportion z-test.""" - p = pd.Series([0.7, 0.2, 0.5]) + """Test vectorized one-proportion z-test (Clojure formula).""" + succ = pd.Series([70, 10, 50]) n = pd.Series([100, 50, 100]) - result = prop_test_vectorized(p, n, 0.5) + result = prop_test_vectorized(succ, n) # Compare with scalar version - assert np.isclose(result.iloc[0], prop_test(0.7, 100, 0.5), atol=0.01) - assert np.isclose(result.iloc[1], prop_test(0.2, 50, 0.5), atol=0.01) - assert np.isclose(result.iloc[2], prop_test(0.5, 100, 0.5), atol=0.01) + assert np.isclose(result.iloc[0], prop_test(70, 100), atol=0.01) + assert np.isclose(result.iloc[1], prop_test(10, 50), atol=0.01) + assert np.isclose(result.iloc[2], prop_test(50, 100), atol=0.01) def test_prop_test_vectorized_edge_cases(self): """Test vectorized prop test handles edge cases.""" - p = pd.Series([0.5, 0.7]) + succ = pd.Series([0, 70]) n = pd.Series([0, 100]) # n=0 should return 0 - result = prop_test_vectorized(p, n, 0.5) + result = prop_test_vectorized(succ, n) assert result.iloc[0] == 0.0 # n=0 case assert not np.isnan(result.iloc[1]) # normal case