Commit a387b9e

Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount)
## Summary

Replace Python's standard one-proportion z-test `prop_test(p, n, p0)` with Clojure's Wilson-score-like formula `prop_test(succ, n)` from `stats.clj:10-15`:

```
2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)
```

The Clojure formula has a built-in +1 pseudocount (Laplace smoothing / Beta(1,1) prior) that regularizes extreme values for small Polis groups. This is separate from the `PSEUDO_COUNT=2.0` used for `pa`/`pd` estimation (Beta(2,2) prior):

- `pa = (na + 1) / (ns + 2)` — Beta(2,2) prior for probability estimation
- `pat = 2 * sqrt(ns+1) * ((na+1)/(ns+1) - 0.5)` — Beta(1,1) prior for significance testing

**What changed in the output**: `pat`, `pdt` values (proportion test z-scores), and downstream `agree_metric` / `disagree_metric` values. The z-scores are now slightly different due to `sqrt(n+1)` vs `sqrt(n)` and `(succ+1)/(n+1)` vs `(na+1)/(n+2)` denominators.

## Changes

- `repness.py`: `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula
- `repness.py`: `prop_test_vectorized(p, n, p0)` → `prop_test_vectorized(succ, n)`
- `repness.py`: Callers updated to pass raw counts `(na, ns)` instead of `(pa, ns, 0.5)`
- `test_discrepancy_fixes.py`: Removed xfail from D5 formula test, added 8 test cases + edge case
- `test_repness_unit.py`, `test_old_format_repness.py`: Updated for new signature
- Golden snapshots re-recorded for all datasets

## Test plan

- [x] D5 formula tests pass (8 input pairs + edge cases)
- [x] D5 Clojure blob consistency check passes (all datasets)
- [x] Full test suite passes (public + private, 19/19 regression tests)
- [x] Only pre-existing failure: pakistan-incremental D2 (unrelated)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Squashed commits

- RED: add D5 blob injection test (prop_test vs Clojure p-test values)
- Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount)
- Update plan and journal: mark D5 as done
- Plan: add D5 PR number and stack position to cross-reference
commit-id:48b77ba3
1 parent 24de40d commit a387b9e
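
To make the change described in the commit message concrete, here is a minimal sketch that puts the old and new statistics side by side for one input. The names `old_prop_test`, `new_prop_test`, and `smoothed_probability` are illustrative only and are not the repository's functions; the formulas are the ones quoted in the message, and the old variant is assumed to have been called with the smoothed `pa` and `p0 = 0.5`, as the diff's removed call sites suggest.

```python
import math

PSEUDO_COUNT = 2.0  # value quoted in the commit message for pa/pd estimation


def old_prop_test(p: float, n: int, p0: float = 0.5) -> float:
    """Standard one-proportion z-test (the behaviour before this commit)."""
    if n == 0 or p0 in (0.0, 1.0):
        return 0.0
    return (p - p0) / math.sqrt(p0 * (1 - p0) / n)


def new_prop_test(succ: int, n: int) -> float:
    """Clojure-style statistic: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)."""
    if n == 0:
        return 0.0
    return 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)


def smoothed_probability(succ: int, n: int) -> float:
    """pa/pd estimate, i.e. (succ + 1) / (n + 2) when PSEUDO_COUNT is 2.0."""
    return (succ + PSEUDO_COUNT / 2) / (n + PSEUDO_COUNT)


succ, n = 12, 13
pa = smoothed_probability(succ, n)
old_z = old_prop_test(pa, n)      # old call shape: smoothed proportion in
new_z = new_prop_test(succ, n)    # new call shape: raw counts in

# With p0 = 0.5 the standard test reduces to 2 * sqrt(n) * (p - 0.5),
# so the two statistics differ only in the sqrt argument and the proportion.
assert math.isclose(old_z, 2 * math.sqrt(n) * (pa - 0.5))

print(f"pa={pa:.4f}  old z={old_z:.4f}  new z={new_z:.4f}")
```

The point of the comparison is that the new statistic never sees `pa`: it applies its own +1 adjustment to the raw counts, which is why the callers had to change along with the formula.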

6 files changed

Lines changed: 214 additions & 81 deletions

delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md

Lines changed: 54 additions & 1 deletion
@@ -387,11 +387,64 @@ Detailed analysis in `HANDOFF_REGRESSION_TEST_PERF.md` for a future session.
 
 ### What's Next
 
-1. **PR 3 — Fix D9 (Z-score thresholds)**: `Z_90=1.645` → `1.2816`, `Z_95=1.96` → `1.6449`
+1. **PR 4 — Fix D5 (Proportion test)**: Change `prop_test` from standard z-test to Clojure formula.
 2. Regression test performance optimization (separate session)
 
 ---
 
+## PR 4: Fix D5 — Proportion Test Formula
+
+### TDD steps
+1. **Baseline**: 1 failed (pakistan-incremental D2, pre-existing), 91 passed, 5 skipped, 129 xfailed, 2 xpassed
+2. **Red**: Wrote tests calling `prop_test(succ, n)` (new signature) → 3 failures (TypeError: missing p0 arg)
+3. **Fix**: Replaced `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula:
+   `2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)` (stats.clj:10-15)
+4. **Green**: All 7 D5 tests pass (formula checks + sanity checks + edge cases)
+5. **Full suite (public)**: 4 regression failures (expected — pat/pdt/metric values changed)
+6. **Investigation**: All diffs are in `pat`, `pdt`, `agree_metric`, `disagree_metric` — direct
+   downstream of the prop_test formula change. No unexpected field changes.
+7. **Re-recorded golden snapshots** for all 7 datasets (public + private)
+8. **Full suite (with --include-local)**: 1 failed (pakistan-incremental D2, pre-existing),
+   91 passed, 5 skipped, 129 xfailed, 2 xpassed — no regressions from D5
+
+### Changes
+- `repness.py`: `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula
+  `2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)`. Added detailed docstring explaining
+  the Wilson-score-like regularization and the separate pseudocount from pa/pd.
+- `repness.py`: `prop_test_vectorized(p, n, p0)` → `prop_test_vectorized(succ, n)`
+- `repness.py`: Updated callers in `comment_stats()` and `compute_group_comment_stats_df()`
+  to pass raw counts `(na, ns)` / `(nd, ns)` instead of `(pa, ns, 0.5)` / `(pd, ns, 0.5)`
+- `test_discrepancy_fixes.py`: Removed xfail from D5 formula test, added comprehensive
+  test cases (8 input pairs including boundary conditions) and edge case test
+- `test_repness_unit.py`: Updated `test_prop_test` and vectorized tests for new signature
+- `test_old_format_repness.py`: Updated `test_prop_test` for new signature
+
+### Key insight: two separate pseudocounts
+The Clojure `prop-test` has its own built-in +1 pseudocount (Laplace smoothing / Beta(1,1)),
+separate from the PSEUDO_COUNT=2.0 used for pa/pd (Beta(2,2)). The prop_test takes raw
+success counts, not pre-smoothed probabilities. This means:
+- `pa = (na + 1) / (ns + 2)` — Beta(2,2) prior for probability estimation
+- `pat = 2 * sqrt(ns+1) * ((na+1)/(ns+1) - 0.5)` — Beta(1,1) prior for significance testing
+
+These are conceptually different: the probability is for ranking, the z-score is for
+significance filtering. Using different priors is intentional.
+
+### Session 7 (2026-03-13)
+
+- Created branch `jc/clj-parity-d5-prop-test` on top of `jc/clj-parity-d9-fix`
+- Read Clojure source (stats.clj:10-15, repness.clj:74-75) to verify formula
+- TDD cycle: red (3 TypeError failures) → fix → green (7 pass, 4 xfail)
+- Full suite: 4 regression failures, all in pat/pdt/metric fields (expected)
+- Re-recorded golden snapshots for all 7 datasets
+- Final validation: 19/19 regression tests pass, 1 pre-existing failure (pakistan-incremental D2)
+
+### What's Next
+
+1. **PR 5 — Fix D6 (Two-proportion test)**: Add +1 pseudocount to all 4 inputs, change signature
+   from proportions to raw counts.
+
+---
+
 ## TDD Discipline
 
 **CRITICAL: For every fix, ALWAYS follow this order:**

delphi/docs/PLAN_DISCREPANCY_FIXES.md

Lines changed: 2 additions & 1 deletion
@@ -20,6 +20,7 @@ This plan's "PR N" labels map to actual GitHub PRs as follows:
 | PR 2 (D4) | #2435 | Stack 9/10 | Fix D4: pseudocount formula |
 | (perf) | #2436 | Stack 10/10 | Speed up regression tests |
 | PR 3 (D9) | #2446 || Fix D9: z-score thresholds (one-tailed) |
+| PR 4 (D5) | #2448 | Stack 14/25 | Fix D5: proportion test formula |
 
 Future fix PRs will be appended to the stack as they're created.
 
@@ -465,7 +466,7 @@ By this point, we should have good test coverage from all the per-discrepancy te
 | D2d | In-conv monotonicity (once in, always in) | **PR 1** | **#2421** | **DONE** ✓ (5 guard tests, T1-T5) |
 | D3 | K-smoother buffer | PR 10 || Fix |
 | D4 | Pseudocount formula | **PR 2** | **#2435** | **DONE**|
-| D5 | Proportion test | PR 4 || Fix |
+| D5 | Proportion test | **PR 4** || **DONE** |
 | D6 | Two-proportion test | PR 5 || Fix |
 | D7 | Repness metric | PR 6 || Fix (with flag for old formula) |
 | D8 | Finalize cmt stats | PR 7 || Fix |

delphi/polismath/pca_kmeans_rep/repness.py

Lines changed: 50 additions & 33 deletions
@@ -60,29 +60,39 @@ def z_score_sig_95(z: float) -> bool:
     return z > Z_95
 
 
-def prop_test(p: float, n: int, p0: float) -> float:
+def prop_test(succ: int, n: int) -> float:
     """
-    One-proportion z-test.
-
+    One-proportion z-test, matching Clojure's stats/prop-test (stats.clj:10-15).
+
+    Clojure formula:
+        (let [[succ n] (map inc [succ n])]
+          (* 2 (sqrt n) (+ (/ succ n) -0.5)))
+
+    Which simplifies to: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)
+
+    This is a Wilson-score-like test with built-in +1 pseudocount (Laplace
+    smoothing). Unlike the standard z-test ((p - p0) / sqrt(p0*(1-p0)/n)),
+    the +1 terms regularize extreme values for small samples, preventing
+    spurious significance in small Polis groups.
+
+    Note: the pseudocount here (+1 to succ and n, i.e. Beta(1,1)) is
+    independent of the PSEUDO_COUNT used for pa/pd computation (Beta(2,2)).
+    Clojure's prop-test takes raw success counts, not pre-smoothed
+    probabilities.
+
     Args:
-        p: Observed proportion
-        n: Number of observations
-        p0: Expected proportion under null hypothesis
-
+        succ: Number of successes (e.g. agrees or disagrees)
+        n: Total number of trials (votes seen)
+
     Returns:
-        Z-score
+        Z-score (positive means succ/n > 0.5)
     """
-    if n == 0 or p0 == 0 or p0 == 1:
+    if n == 0:
         return 0.0
-
-    # Calculate standard error
-    se = math.sqrt(p0 * (1 - p0) / n)
-
-    # Z-score calculation
-    if se == 0:
-        return 0.0
-    else:
-        return (p - p0) / se
+    # Apply +1 pseudocount to both numerator and denominator
+    succ_pc = succ + 1
+    n_pc = n + 1
+    return 2 * math.sqrt(n_pc) * (succ_pc / n_pc - 0.5)
 
 
 def two_prop_test(p1: float, n1: int, p2: float, n2: int) -> float:
@@ -137,9 +147,10 @@ def comment_stats(votes: np.ndarray, group_members: List[int]) -> Dict[str, Any]
     p_agree = (n_agree + PSEUDO_COUNT/2) / (n_votes + PSEUDO_COUNT) if n_votes > 0 else 0.5
     p_disagree = (n_disagree + PSEUDO_COUNT/2) / (n_votes + PSEUDO_COUNT) if n_votes > 0 else 0.5
 
-    # Calculate significance tests
-    p_agree_test = prop_test(p_agree, n_votes, 0.5) if n_votes > 0 else 0.0
-    p_disagree_test = prop_test(p_disagree, n_votes, 0.5) if n_votes > 0 else 0.0
+    # Calculate significance tests — pass raw counts, matching Clojure's
+    # (stats/prop-test na ns) and (stats/prop-test nd ns) (repness.clj:74-75)
+    p_agree_test = prop_test(n_agree, n_votes) if n_votes > 0 else 0.0
+    p_disagree_test = prop_test(n_disagree, n_votes) if n_votes > 0 else 0.0
 
     # Return stats
     return {
@@ -457,23 +468,28 @@ def select_consensus_comments(all_stats: List[Dict[str, Any]]) -> List[Dict[str,
 # Vectorized DataFrame-native functions for multi-group operations
 # =============================================================================
 
-def prop_test_vectorized(p: pd.Series, n: pd.Series, p0: float = 0.5) -> pd.Series:
+def prop_test_vectorized(succ: pd.Series, n: pd.Series) -> pd.Series:
     """
-    Vectorized one-proportion z-test.
+    Vectorized one-proportion z-test, matching Clojure's stats/prop-test.
+
+    Formula: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)
+
+    See prop_test() docstring for derivation and rationale.
 
     Args:
-        p: Series of observed proportions
-        n: Series of number of observations
-        p0: Expected proportion under null hypothesis (default: 0.5)
+        succ: Series of success counts (e.g. agrees or disagrees)
+        n: Series of total trial counts (votes seen)
 
     Returns:
         Series of z-scores
     """
-    se = np.sqrt(p0 * (1 - p0) / n)
-    z = (p - p0) / se
-    # Handle edge cases: n=0, p0=0, p0=1 all result in 0
+    succ_pc = succ + 1
+    n_pc = n + 1
+    z = 2 * np.sqrt(n_pc) * (succ_pc / n_pc - 0.5)
+    # Handle n=0 edge case (n_pc=1, succ_pc=1 → z = 2*1*(1/1 - 0.5) = 1.0,
+    # but we want 0 for no-data rows)
+    z = z.where(n > 0, 0.0)
     z = z.fillna(0.0)
-    z = z.replace([np.inf, -np.inf], 0.0)
     return z
 
 
@@ -620,9 +636,10 @@ def compute_group_comment_stats_df(votes_long: pd.DataFrame,
     stats_df.loc[other_zero_mask, 'other_pa'] = 0.5
     stats_df.loc[other_zero_mask, 'other_pd'] = 0.5
 
-    # Compute proportion tests (group vs 0.5)
-    stats_df['pat'] = prop_test_vectorized(stats_df['pa'], stats_df['ns'], 0.5)
-    stats_df['pdt'] = prop_test_vectorized(stats_df['pd'], stats_df['ns'], 0.5)
+    # Compute proportion tests — pass raw counts, matching Clojure's
+    # (stats/prop-test na ns) and (stats/prop-test nd ns) (repness.clj:74-75)
+    stats_df['pat'] = prop_test_vectorized(stats_df['na'], stats_df['ns'])
+    stats_df['pdt'] = prop_test_vectorized(stats_df['nd'], stats_df['ns'])
 
     # Compute representativeness ratios (group vs other)
     stats_df['ra'] = stats_df['pa'] / stats_df['other_pa']
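
The `n == 0` guard in both `prop_test` and `prop_test_vectorized` matters because the unguarded formula does not return zero for an empty row. The sketch below illustrates this with plain Python lists standing in for the pandas Series; `raw_formula` and `prop_test_many` are hypothetical names used only for this illustration, and the list comprehension is a stand-in for the `z.where(n > 0, 0.0)` masking, not a reproduction of it.

```python
import math


def raw_formula(succ: int, n: int) -> float:
    """Unguarded statistic, exactly as written in the docstring."""
    return 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)


def prop_test_many(succ_counts, n_counts):
    """List-based stand-in for the vectorized version: rows with n == 0 become 0.0."""
    return [
        raw_formula(s, n) if n > 0 else 0.0
        for s, n in zip(succ_counts, n_counts)
    ]


# Without the guard, a row with no votes would score 2 * 1 * (1/1 - 0.5) = 1.0
assert raw_formula(0, 0) == 1.0

# Rows: no data, one trial without success, one trial with success, 5 of 8
z = prop_test_many([0, 0, 1, 5], [0, 1, 1, 8])
print(z)
```

Note that a single failed trial, `(0, 1)`, lands exactly on zero, because the +1 adjustment turns it into 1 success out of 2.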

delphi/tests/test_discrepancy_fixes.py

Lines changed: 76 additions & 20 deletions
@@ -693,23 +693,32 @@ class TestD5ProportionTest:
     Clojure uses Wilson-score-like: 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)
 
     Clojure formula has built-in regularization via +1 terms.
+    After fix, prop_test(succ, n) matches Clojure exactly.
     """
 
-    @pytest.mark.xfail(reason="D5: Python standard z-test vs Clojure Wilson-score-like")
     def test_prop_test_matches_clojure_formula(self):
-        """prop_test should match Clojure's formula for known inputs."""
-        # Example: 12 successes out of 13 trials
-        succ, n = 12, 13
-        # Clojure formula: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)
-        expected = 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)
-
-        # Current Python: prop_test(p, n, 0.5) where p = (succ + pc/2) / (n + pc)
-        p = (succ + PSEUDO_COUNT / 2) / (n + PSEUDO_COUNT)
-        python_result = prop_test(p, n, 0.5)
-
-        print(f"prop_test(succ={succ}, n={n}): Python={python_result:.4f}, Clojure={expected:.4f}")
-        check.almost_equal(python_result, expected, abs=0.01,
-                           msg=f"prop_test mismatch: Python={python_result:.4f}, Clojure={expected:.4f}")
+        """prop_test(succ, n) should match Clojure's formula for known inputs."""
+        test_cases = [
+            (12, 13),  # High success rate
+            (5, 8),  # Moderate
+            (0, 10),  # All failures
+            (10, 10),  # All successes
+            (1, 2),  # Tiny sample
+            (50, 100),  # Larger sample
+            (0, 1),  # Single trial, no success
+            (1, 1),  # Single trial, success
+        ]
+        for succ, n in test_cases:
+            # Clojure formula: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)
+            expected = 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)
+            result = prop_test(succ, n)
+            check.almost_equal(result, expected, abs=1e-10,
+                               msg=f"prop_test({succ}, {n}): got {result:.6f}, expected {expected:.6f}")
+
+    def test_prop_test_edge_cases(self):
+        """prop_test handles n=0 gracefully."""
+        # n=0 should return 0 (no data)
+        assert prop_test(0, 0) == 0.0
 
     def test_clojure_pat_values_consistent_with_formula(self, clojure_blob, dataset_name):
         """Sanity check: Clojure's p-test values match the documented formula."""
@@ -1155,14 +1164,13 @@ def test_z_thresholds_are_one_tailed(self):
         check.almost_equal(Z_95, 1.6449, abs=0.001,
                            msg=f"Z_95={Z_95}, expected 1.6449 (one-tailed)")
 
-    def test_clojure_prop_test_formula(self):
-        """Verify Clojure's proportion test formula: 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)."""
+    def test_prop_test_matches_clojure_formula_synthetic(self):
+        """prop_test(succ, n) should produce 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)."""
         # Small n: 5 successes out of 8 trials
         succ, n = 5, 8
-        result = 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)
-        # Manual: 2 * 3 * (6/9 - 0.5) = 6 * 0.1667 = 1.0
-        expected = 2 * 3.0 * (6.0 / 9.0 - 0.5)
-        assert abs(result - expected) < 1e-10
+        expected = 2 * 3.0 * (6.0 / 9.0 - 0.5)  # = 1.0
+        result = prop_test(succ, n)
+        assert abs(result - expected) < 1e-10, f"prop_test({succ}, {n})={result}, expected {expected}"
 
     def test_clojure_repness_metric_product(self):
         """Verify Clojure's repness metric is a product: ra * rat * pa * pat."""
@@ -1177,3 +1185,51 @@ def test_clojure_repful_uses_rat_vs_rdt(self):
 
         # rat < rdt → disagree
         assert (0.5 < 1.5)  # rat=0.5, rdt=1.5 → disagree
+
+
+# ============================================================================
+# Blob Injection Tests — Compare Python functions against real Clojure values
+# ============================================================================
+#
+# These tests extract inputs from the Clojure math blob, feed them to Python
+# functions, and compare outputs to the Clojure blob's values. This is the
+# only non-tautological way to verify correctness: formula-only tests just
+# re-implement our reading of the Clojure source and can't catch misreadings.
+#
+# Since Python and Clojure may produce different clusters (different k), we
+# inject Clojure's own group memberships and vote counts from the blob,
+# isolating each computation stage from upstream divergence.
+# ============================================================================
+
+@pytest.mark.clojure_comparison
+class TestD5BlobInjection:
+    """D5: Verify prop_test against real Clojure blob p-test values.
+
+    For each repness entry in the blob, extract n-success and n-trials,
+    feed to Python's prop_test(), compare to blob's p-test.
+    """
+
+    def test_prop_test_matches_blob_p_test(self, clojure_blob, dataset_name):
+        """prop_test(n_success, n_trials) should match blob's p-test for every repness entry."""
+        repness = clojure_blob.get('repness', {})
+        if not repness:
+            pytest.skip(f"No repness in Clojure blob for {dataset_name}")
+
+        mismatches = []
+        total = 0
+        for gid, entries in repness.items():
+            for entry in entries:
+                n_success = entry['n-success']
+                n_trials = entry['n-trials']
+                expected_p_test = entry['p-test']
+                actual = prop_test(n_success, n_trials)
+                total += 1
+                if abs(actual - expected_p_test) > 1e-4:
+                    mismatches.append(
+                        f"group={gid} tid={entry['tid']}: "
+                        f"prop_test({n_success}, {n_trials})={actual:.6f}, "
+                        f"blob p-test={expected_p_test:.6f}")
+
+        assert not mismatches, (
+            f"[{dataset_name}] {len(mismatches)}/{total} p-test mismatches:\n"
+            + "\n".join(mismatches[:10]))
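
For readers without access to a Clojure math blob, the shape of the blob injection check can be sketched against synthetic data. Everything below is an assumption made for illustration: `fake_repness` is invented (its numbers do not come from any dataset), `find_p_test_mismatches` is a hypothetical helper, and only the entry keys (`tid`, `n-success`, `n-trials`, `p-test`) and the `1e-4` tolerance are taken from the test above.

```python
import math


def prop_test(succ: int, n: int) -> float:
    """Local copy of the statistic from the diff, so the sketch is self-contained."""
    if n == 0:
        return 0.0
    return 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)


def find_p_test_mismatches(repness: dict, tol: float = 1e-4) -> list:
    """Return (group id, tid, actual, expected) for every entry outside tolerance."""
    mismatches = []
    for gid, entries in repness.items():
        for entry in entries:
            actual = prop_test(entry["n-success"], entry["n-trials"])
            if abs(actual - entry["p-test"]) > tol:
                mismatches.append((gid, entry["tid"], actual, entry["p-test"]))
    return mismatches


# Synthetic stand-in for a blob's "repness" section (made-up values).
fake_repness = {
    "0": [
        {"tid": 1, "n-success": 5, "n-trials": 8, "p-test": 1.0},
        {"tid": 2, "n-success": 1, "n-trials": 1, "p-test": math.sqrt(2)},
    ],
    "1": [
        # Deliberately inconsistent entry, to show what a failure looks like.
        {"tid": 3, "n-success": 12, "n-trials": 13, "p-test": 2.6441},
    ],
}

mismatches = find_p_test_mismatches(fake_repness)
for gid, tid, actual, expected in mismatches:
    print(f"group={gid} tid={tid}: prop_test={actual:.6f}, blob p-test={expected:.6f}")
```

Because the synthetic `p-test` values here were themselves derived from the formula, this sketch only demonstrates the mechanics; the real test is meaningful precisely because its expected values come from Clojure's output.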

delphi/tests/test_old_format_repness.py

Lines changed: 11 additions & 9 deletions
@@ -5,6 +5,7 @@
 that wraps the new DataFrame-native implementation.
 """
 
+import math
 import numpy as np
 import pandas as pd
 import sys
@@ -43,15 +44,16 @@ def test_z_score_significance(self):
         assert not z_score_sig_95(1.64)
 
     def test_prop_test(self):
-        """Test one-proportion z-test."""
-        # Test cases
-        assert np.isclose(prop_test(0.7, 100, 0.5), 4.0, atol=0.1)
-        assert np.isclose(prop_test(0.2, 50, 0.3), -1.6, atol=0.1)
-
-        # Edge cases
-        assert prop_test(0.5, 0, 0.5) == 0.0
-        assert prop_test(0.7, 100, 0.0) == 0.0
-        assert prop_test(0.7, 100, 1.0) == 0.0
+        """Test one-proportion z-test (Clojure formula: 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5))."""
+        # 70 successes out of 100
+        assert np.isclose(prop_test(70, 100),
+                          2 * math.sqrt(101) * (71/101 - 0.5), atol=0.01)
+        # 10 successes out of 50
+        assert np.isclose(prop_test(10, 50),
+                          2 * math.sqrt(51) * (11/51 - 0.5), atol=0.01)
+
+        # Edge case: n=0
+        assert prop_test(0, 0) == 0.0
 
     def test_two_prop_test(self):
         """Test two-proportion z-test."""
