Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
55 changes: 54 additions & 1 deletion delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
Original file line number Diff line number Diff line change
Expand Up @@ -387,11 +387,64 @@ Detailed analysis in `HANDOFF_REGRESSION_TEST_PERF.md` for a future session.

### What's Next

1. **PR 3 — Fix D9 (Z-score thresholds)**: `Z_90=1.645` → `1.2816`, `Z_95=1.96` → `1.6449`
1. **PR 4 — Fix D5 (Proportion test)**: Change `prop_test` from standard z-test to Clojure formula.
2. Regression test performance optimization (separate session)

---

## PR 4: Fix D5 — Proportion Test Formula

### TDD steps
1. **Baseline**: 1 failed (pakistan-incremental D2, pre-existing), 91 passed, 5 skipped, 129 xfailed, 2 xpassed
2. **Red**: Wrote tests calling `prop_test(succ, n)` (new signature) → 3 failures (TypeError: missing p0 arg)
3. **Fix**: Replaced `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula:
`2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)` (stats.clj:10-15)
4. **Green**: All 7 D5 tests pass (formula checks + sanity checks + edge cases)
5. **Full suite (public)**: 4 regression failures (expected — pat/pdt/metric values changed)
6. **Investigation**: All diffs are in `pat`, `pdt`, `agree_metric`, `disagree_metric` — direct
downstream of the prop_test formula change. No unexpected field changes.
7. **Re-recorded golden snapshots** for all 7 datasets (public + private)
8. **Full suite (with --include-local)**: 1 failed (pakistan-incremental D2, pre-existing),
91 passed, 5 skipped, 129 xfailed, 2 xpassed — no regressions from D5

### Changes
- `repness.py`: `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula
`2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)`. Added detailed docstring explaining
the Wilson-score-like regularization and the separate pseudocount from pa/pd.
- `repness.py`: `prop_test_vectorized(p, n, p0)` → `prop_test_vectorized(succ, n)`
- `repness.py`: Updated callers in `comment_stats()` and `compute_group_comment_stats_df()`
to pass raw counts `(na, ns)` / `(nd, ns)` instead of `(pa, ns, 0.5)` / `(pd, ns, 0.5)`
- `test_discrepancy_fixes.py`: Removed xfail from D5 formula test, added comprehensive
test cases (8 input pairs including boundary conditions) and edge case test
- `test_repness_unit.py`: Updated `test_prop_test` and vectorized tests for new signature
- `test_old_format_repness.py`: Updated `test_prop_test` for new signature

### Key insight: two separate pseudocounts
The Clojure `prop-test` has its own built-in +1 pseudocount (Laplace smoothing / Beta(1,1)),
separate from the PSEUDO_COUNT=2.0 used for pa/pd (Beta(2,2)). The prop_test takes raw
success counts, not pre-smoothed probabilities. This means:
- `pa = (na + 1) / (ns + 2)` — Beta(2,2) prior for probability estimation
- `pat = 2 * sqrt(ns+1) * ((na+1)/(ns+1) - 0.5)` — Beta(1,1) prior for significance testing

These are conceptually different: the probability is for ranking, the z-score is for
significance filtering. Using different priors is intentional.

### Session 7 (2026-03-13)

- Created branch `jc/clj-parity-d5-prop-test` on top of `jc/clj-parity-d9-fix`
- Read Clojure source (stats.clj:10-15, repness.clj:74-75) to verify formula
- TDD cycle: red (3 TypeError failures) → fix → green (7 pass, 4 xfail)
- Full suite: 4 regression failures, all in pat/pdt/metric fields (expected)
- Re-recorded golden snapshots for all 7 datasets
- Final validation: 19/19 regression tests pass, 1 pre-existing failure (pakistan-incremental D2)

### What's Next

1. **PR 5 — Fix D6 (Two-proportion test)**: Add +1 pseudocount to all 4 inputs, change signature
from proportions to raw counts.

---

## TDD Discipline

**CRITICAL: For every fix, ALWAYS follow this order:**
Expand Down
3 changes: 2 additions & 1 deletion delphi/docs/PLAN_DISCREPANCY_FIXES.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ This plan's "PR N" labels map to actual GitHub PRs as follows:
| PR 2 (D4) | #2435 | Stack 9/10 | Fix D4: pseudocount formula |
| (perf) | #2436 | Stack 10/10 | Speed up regression tests |
| PR 3 (D9) | #2518 | — | Fix D9: z-score thresholds (one-tailed) |
| PR 4 (D5) | #2448 | Stack 14/25 | Fix D5: proportion test formula |

Future fix PRs will be appended to the stack as they're created.

Expand Down Expand Up @@ -465,7 +466,7 @@ By this point, we should have good test coverage from all the per-discrepancy te
| D2d | In-conv monotonicity (once in, always in) | **PR 1** | **#2421** | **DONE** ✓ (5 guard tests, T1-T5) |
| D3 | K-smoother buffer | PR 10 | — | Fix |
| D4 | Pseudocount formula | **PR 2** | **#2435** | **DONE** ✓ |
| D5 | Proportion test | PR 4 | — | Fix |
| D5 | Proportion test | **PR 4** | — | **DONE** ✓ |
| D6 | Two-proportion test | PR 5 | — | Fix |
| D7 | Repness metric | PR 6 | — | Fix (with flag for old formula) |
| D8 | Finalize cmt stats | PR 7 | — | Fix |
Expand Down
83 changes: 50 additions & 33 deletions delphi/polismath/pca_kmeans_rep/repness.py
Original file line number Diff line number Diff line change
Expand Up @@ -60,29 +60,39 @@ def z_score_sig_95(z: float) -> bool:
return z > Z_95


def prop_test(p: float, n: int, p0: float) -> float:
def prop_test(succ: int, n: int) -> float:
"""
One-proportion z-test.

One-proportion z-test, matching Clojure's stats/prop-test (stats.clj:10-15).

Clojure formula:
(let [[succ n] (map inc [succ n])]
(* 2 (sqrt n) (+ (/ succ n) -0.5)))

Which simplifies to: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)

This is a Wilson-score-like test with built-in +1 pseudocount (Laplace
smoothing). Unlike the standard z-test ((p - p0) / sqrt(p0*(1-p0)/n)),
the +1 terms regularize extreme values for small samples, preventing
spurious significance in small Polis groups.

Note: the pseudocount here (+1 to succ and n, i.e. Beta(1,1)) is
independent of the PSEUDO_COUNT used for pa/pd computation (Beta(2,2)).
Clojure's prop-test takes raw success counts, not pre-smoothed
probabilities.

Args:
p: Observed proportion
n: Number of observations
p0: Expected proportion under null hypothesis

succ: Number of successes (e.g. agrees or disagrees)
n: Total number of trials (votes seen)

Returns:
Z-score
Z-score (positive means succ/n > 0.5)
"""
if n == 0 or p0 == 0 or p0 == 1:
if n == 0:
return 0.0

# Calculate standard error
se = math.sqrt(p0 * (1 - p0) / n)

# Z-score calculation
if se == 0:
return 0.0
else:
return (p - p0) / se
# Apply +1 pseudocount to both numerator and denominator
succ_pc = succ + 1
n_pc = n + 1
return 2 * math.sqrt(n_pc) * (succ_pc / n_pc - 0.5)


def two_prop_test(p1: float, n1: int, p2: float, n2: int) -> float:
Expand Down Expand Up @@ -137,9 +147,10 @@ def comment_stats(votes: np.ndarray, group_members: List[int]) -> Dict[str, Any]
p_agree = (n_agree + PSEUDO_COUNT/2) / (n_votes + PSEUDO_COUNT) if n_votes > 0 else 0.5
p_disagree = (n_disagree + PSEUDO_COUNT/2) / (n_votes + PSEUDO_COUNT) if n_votes > 0 else 0.5

# Calculate significance tests
p_agree_test = prop_test(p_agree, n_votes, 0.5) if n_votes > 0 else 0.0
p_disagree_test = prop_test(p_disagree, n_votes, 0.5) if n_votes > 0 else 0.0
# Calculate significance tests — pass raw counts, matching Clojure's
# (stats/prop-test na ns) and (stats/prop-test nd ns) (repness.clj:74-75)
p_agree_test = prop_test(n_agree, n_votes) if n_votes > 0 else 0.0
p_disagree_test = prop_test(n_disagree, n_votes) if n_votes > 0 else 0.0

# Return stats
return {
Expand Down Expand Up @@ -457,23 +468,28 @@ def select_consensus_comments(all_stats: List[Dict[str, Any]]) -> List[Dict[str,
# Vectorized DataFrame-native functions for multi-group operations
# =============================================================================

def prop_test_vectorized(p: pd.Series, n: pd.Series, p0: float = 0.5) -> pd.Series:
def prop_test_vectorized(succ: pd.Series, n: pd.Series) -> pd.Series:
"""
Vectorized one-proportion z-test.
Vectorized one-proportion z-test, matching Clojure's stats/prop-test.

Formula: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)

See prop_test() docstring for derivation and rationale.

Args:
p: Series of observed proportions
n: Series of number of observations
p0: Expected proportion under null hypothesis (default: 0.5)
succ: Series of success counts (e.g. agrees or disagrees)
n: Series of total trial counts (votes seen)

Returns:
Series of z-scores
"""
se = np.sqrt(p0 * (1 - p0) / n)
z = (p - p0) / se
# Handle edge cases: n=0, p0=0, p0=1 all result in 0
succ_pc = succ + 1
n_pc = n + 1
z = 2 * np.sqrt(n_pc) * (succ_pc / n_pc - 0.5)
# Handle n=0 edge case (n_pc=1, succ_pc=1 → z = 2*1*(1/1 - 0.5) = 1.0,
# but we want 0 for no-data rows)
z = z.where(n > 0, 0.0)
z = z.fillna(0.0)
z = z.replace([np.inf, -np.inf], 0.0)
return z


Expand Down Expand Up @@ -620,9 +636,10 @@ def compute_group_comment_stats_df(votes_long: pd.DataFrame,
stats_df.loc[other_zero_mask, 'other_pa'] = 0.5
stats_df.loc[other_zero_mask, 'other_pd'] = 0.5

# Compute proportion tests (group vs 0.5)
stats_df['pat'] = prop_test_vectorized(stats_df['pa'], stats_df['ns'], 0.5)
stats_df['pdt'] = prop_test_vectorized(stats_df['pd'], stats_df['ns'], 0.5)
# Compute proportion tests — pass raw counts, matching Clojure's
# (stats/prop-test na ns) and (stats/prop-test nd ns) (repness.clj:74-75)
stats_df['pat'] = prop_test_vectorized(stats_df['na'], stats_df['ns'])
stats_df['pdt'] = prop_test_vectorized(stats_df['nd'], stats_df['ns'])

# Compute representativeness ratios (group vs other)
stats_df['ra'] = stats_df['pa'] / stats_df['other_pa']
Expand Down
96 changes: 76 additions & 20 deletions delphi/tests/test_discrepancy_fixes.py
Original file line number Diff line number Diff line change
Expand Up @@ -695,23 +695,32 @@ class TestD5ProportionTest:
Clojure uses Wilson-score-like: 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)

Clojure formula has built-in regularization via +1 terms.
After fix, prop_test(succ, n) matches Clojure exactly.
"""

@pytest.mark.xfail(reason="D5: Python standard z-test vs Clojure Wilson-score-like")
def test_prop_test_matches_clojure_formula(self):
"""prop_test should match Clojure's formula for known inputs."""
# Example: 12 successes out of 13 trials
succ, n = 12, 13
# Clojure formula: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)
expected = 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)

# Current Python: prop_test(p, n, 0.5) where p = (succ + pc/2) / (n + pc)
p = (succ + PSEUDO_COUNT / 2) / (n + PSEUDO_COUNT)
python_result = prop_test(p, n, 0.5)

print(f"prop_test(succ={succ}, n={n}): Python={python_result:.4f}, Clojure={expected:.4f}")
check.almost_equal(python_result, expected, abs=0.01,
msg=f"prop_test mismatch: Python={python_result:.4f}, Clojure={expected:.4f}")
"""prop_test(succ, n) should match Clojure's formula for known inputs."""
test_cases = [
(12, 13), # High success rate
(5, 8), # Moderate
(0, 10), # All failures
(10, 10), # All successes
(1, 2), # Tiny sample
(50, 100), # Larger sample
(0, 1), # Single trial, no success
(1, 1), # Single trial, success
]
for succ, n in test_cases:
# Clojure formula: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)
expected = 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)
result = prop_test(succ, n)
check.almost_equal(result, expected, abs=1e-10,
msg=f"prop_test({succ}, {n}): got {result:.6f}, expected {expected:.6f}")

def test_prop_test_edge_cases(self):
"""prop_test handles n=0 gracefully."""
# n=0 should return 0 (no data)
assert prop_test(0, 0) == 0.0

def test_clojure_pat_values_consistent_with_formula(self, clojure_blob, dataset_name):
"""Sanity check: Clojure's p-test values match the documented formula."""
Expand Down Expand Up @@ -1157,14 +1166,13 @@ def test_z_thresholds_are_one_tailed(self):
check.almost_equal(Z_95, 1.6449, abs=0.001,
msg=f"Z_95={Z_95}, expected 1.6449 (one-tailed)")

def test_clojure_prop_test_formula(self):
"""Verify Clojure's proportion test formula: 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)."""
def test_prop_test_matches_clojure_formula_synthetic(self):
"""prop_test(succ, n) should produce 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)."""
# Small n: 5 successes out of 8 trials
succ, n = 5, 8
result = 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)
# Manual: 2 * 3 * (6/9 - 0.5) = 6 * 0.1667 = 1.0
expected = 2 * 3.0 * (6.0 / 9.0 - 0.5)
assert abs(result - expected) < 1e-10
expected = 2 * 3.0 * (6.0 / 9.0 - 0.5) # = 1.0
result = prop_test(succ, n)
assert abs(result - expected) < 1e-10, f"prop_test({succ}, {n})={result}, expected {expected}"

def test_clojure_repness_metric_product(self):
"""Verify Clojure's repness metric is a product: ra * rat * pa * pat."""
Expand All @@ -1179,3 +1187,51 @@ def test_clojure_repful_uses_rat_vs_rdt(self):

# rat < rdt → disagree
assert (0.5 < 1.5) # rat=0.5, rdt=1.5 → disagree


# ============================================================================
# Blob Injection Tests — Compare Python functions against real Clojure values
# ============================================================================
#
# These tests extract inputs from the Clojure math blob, feed them to Python
# functions, and compare outputs to the Clojure blob's values. This is the
# only non-tautological way to verify correctness: formula-only tests just
# re-implement our reading of the Clojure source and can't catch misreadings.
#
# Since Python and Clojure may produce different clusters (different k), we
# inject Clojure's own group memberships and vote counts from the blob,
# isolating each computation stage from upstream divergence.
# ============================================================================

@pytest.mark.clojure_comparison
class TestD5BlobInjection:
"""D5: Verify prop_test against real Clojure blob p-test values.

For each repness entry in the blob, extract n-success and n-trials,
feed to Python's prop_test(), compare to blob's p-test.
"""

def test_prop_test_matches_blob_p_test(self, clojure_blob, dataset_name):
"""prop_test(n_success, n_trials) should match blob's p-test for every repness entry."""
repness = clojure_blob.get('repness', {})
if not repness:
pytest.skip(f"No repness in Clojure blob for {dataset_name}")

mismatches = []
total = 0
for gid, entries in repness.items():
for entry in entries:
n_success = entry['n-success']
n_trials = entry['n-trials']
expected_p_test = entry['p-test']
actual = prop_test(n_success, n_trials)
total += 1
if abs(actual - expected_p_test) > 1e-4:
mismatches.append(
f"group={gid} tid={entry['tid']}: "
f"prop_test({n_success}, {n_trials})={actual:.6f}, "
f"blob p-test={expected_p_test:.6f}")

assert not mismatches, (
f"[{dataset_name}] {len(mismatches)}/{total} p-test mismatches:\n"
+ "\n".join(mismatches[:10]))
20 changes: 11 additions & 9 deletions delphi/tests/test_old_format_repness.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
that wraps the new DataFrame-native implementation.
"""

import math
import numpy as np
import pandas as pd
import sys
Expand Down Expand Up @@ -43,15 +44,16 @@ def test_z_score_significance(self):
assert not z_score_sig_95(1.64)

def test_prop_test(self):
"""Test one-proportion z-test."""
# Test cases
assert np.isclose(prop_test(0.7, 100, 0.5), 4.0, atol=0.1)
assert np.isclose(prop_test(0.2, 50, 0.3), -1.6, atol=0.1)

# Edge cases
assert prop_test(0.5, 0, 0.5) == 0.0
assert prop_test(0.7, 100, 0.0) == 0.0
assert prop_test(0.7, 100, 1.0) == 0.0
"""Test one-proportion z-test (Clojure formula: 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5))."""
# 70 successes out of 100
assert np.isclose(prop_test(70, 100),
2 * math.sqrt(101) * (71/101 - 0.5), atol=0.01)
# 10 successes out of 50
assert np.isclose(prop_test(10, 50),
2 * math.sqrt(51) * (11/51 - 0.5), atol=0.01)

# Edge case: n=0
assert prop_test(0, 0) == 0.0

def test_two_prop_test(self):
"""Test two-proportion z-test."""
Expand Down
Loading
Loading