Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount)

jucor · jucor · commit a387b9ed5abb · 2026-03-30T23:44:21.000+01:00
## Summary Replace Python's standard one-proportion z-test `prop_test(p, n, p0)` with Clojure's Wilson-score-like formula `prop_test(succ, n)` from `stats.clj:10-15`: ``` 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5) ``` The Clojure formula has a built-in +1 pseudocount (Laplace smoothing / Beta(1,1) prior) that regularizes extreme values for small Polis groups. This is separate from the `PSEUDO_COUNT=2.0` used for `pa`/`pd` estimation (Beta(2,2) prior): - `pa = (na + 1) / (ns + 2)` — Beta(2,2) prior for probability estimation - `pat = 2 * sqrt(ns+1) * ((na+1)/(ns+1) - 0.5)` — Beta(1,1) prior for significance testing **What changed in the output**: `pat`, `pdt` values (proportion test z-scores), and downstream `agree_metric` / `disagree_metric` values. The z-scores are now slightly different due to `sqrt(n+1)` vs `sqrt(n)` and `(succ+1)/(n+1)` vs `(na+1)/(n+2)` denominators. ## Changes - `repness.py`: `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula - `repness.py`: `prop_test_vectorized(p, n, p0)` → `prop_test_vectorized(succ, n)` - `repness.py`: Callers updated to pass raw counts `(na, ns)` instead of `(pa, ns, 0.5)` - `test_discrepancy_fixes.py`: Removed xfail from D5 formula test, added 8 test cases + edge case - `test_repness_unit.py`, `test_old_format_repness.py`: Updated for new signature - Golden snapshots re-recorded for all datasets ## Test plan - [x] D5 formula tests pass (8 input pairs + edge cases) - [x] D5 Clojure blob consistency check passes (all datasets) - [x] Full test suite passes (public + private, 19/19 regression tests) - [x] Only pre-existing failure: pakistan-incremental D2 (unrelated) 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Squashed commits - RED: add D5 blob injection test (prop_test vs Clojure p-test values) - Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount) - Update plan and journal: mark D5 as done - Plan: add D5 PR number and stack position to cross-reference commit-id:48b77ba3
diff --git a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
@@ -387,11 +387,64 @@ Detailed analysis in `HANDOFF_REGRESSION_TEST_PERF.md` for a future session.
 
 ### What's Next
 
-1. **PR 3 — Fix D9 (Z-score thresholds)**: `Z_90=1.645` → `1.2816`, `Z_95=1.96` → `1.6449`
+1. **PR 4 — Fix D5 (Proportion test)**: Change `prop_test` from standard z-test to Clojure formula.
 2. Regression test performance optimization (separate session)
 
 ---
 
+## PR 4: Fix D5 — Proportion Test Formula
+
+### TDD steps
+1. **Baseline**: 1 failed (pakistan-incremental D2, pre-existing), 91 passed, 5 skipped, 129 xfailed, 2 xpassed
+2. **Red**: Wrote tests calling `prop_test(succ, n)` (new signature) → 3 failures (TypeError: missing p0 arg)
+3. **Fix**: Replaced `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula:
+   `2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)` (stats.clj:10-15)
+4. **Green**: All 7 D5 tests pass (formula checks + sanity checks + edge cases)
+5. **Full suite (public)**: 4 regression failures (expected — pat/pdt/metric values changed)
+6. **Investigation**: All diffs are in `pat`, `pdt`, `agree_metric`, `disagree_metric` — direct
+   downstream of the prop_test formula change. No unexpected field changes.
+7. **Re-recorded golden snapshots** for all 7 datasets (public + private)
+8. **Full suite (with --include-local)**: 1 failed (pakistan-incremental D2, pre-existing),
+   91 passed, 5 skipped, 129 xfailed, 2 xpassed — no regressions from D5
+
+### Changes
+- `repness.py`: `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula
+  `2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)`. Added detailed docstring explaining
+  the Wilson-score-like regularization and the separate pseudocount from pa/pd.
+- `repness.py`: `prop_test_vectorized(p, n, p0)` → `prop_test_vectorized(succ, n)`
+- `repness.py`: Updated callers in `comment_stats()` and `compute_group_comment_stats_df()`
+  to pass raw counts `(na, ns)` / `(nd, ns)` instead of `(pa, ns, 0.5)` / `(pd, ns, 0.5)`
+- `test_discrepancy_fixes.py`: Removed xfail from D5 formula test, added comprehensive
+  test cases (8 input pairs including boundary conditions) and edge case test
+- `test_repness_unit.py`: Updated `test_prop_test` and vectorized tests for new signature
+- `test_old_format_repness.py`: Updated `test_prop_test` for new signature
+
+### Key insight: two separate pseudocounts
+The Clojure `prop-test` has its own built-in +1 pseudocount (Laplace smoothing / Beta(1,1)),
+separate from the PSEUDO_COUNT=2.0 used for pa/pd (Beta(2,2)). The prop_test takes raw
+success counts, not pre-smoothed probabilities. This means:
+- `pa = (na + 1) / (ns + 2)` — Beta(2,2) prior for probability estimation
+- `pat = 2 * sqrt(ns+1) * ((na+1)/(ns+1) - 0.5)` — Beta(1,1) prior for significance testing
+
+These are conceptually different: the probability is for ranking, the z-score is for
+significance filtering. Using different priors is intentional.
+
+### Session 7 (2026-03-13)
+
+- Created branch `jc/clj-parity-d5-prop-test` on top of `jc/clj-parity-d9-fix`
+- Read Clojure source (stats.clj:10-15, repness.clj:74-75) to verify formula
+- TDD cycle: red (3 TypeError failures) → fix → green (7 pass, 4 xfail)
+- Full suite: 4 regression failures, all in pat/pdt/metric fields (expected)
+- Re-recorded golden snapshots for all 7 datasets
+- Final validation: 19/19 regression tests pass, 1 pre-existing failure (pakistan-incremental D2)
+
+### What's Next
+
+1. **PR 5 — Fix D6 (Two-proportion test)**: Add +1 pseudocount to all 4 inputs, change signature
+   from proportions to raw counts.
+
+---
+
 ## TDD Discipline
 
 **CRITICAL: For every fix, ALWAYS follow this order:**
diff --git a/delphi/docs/PLAN_DISCREPANCY_FIXES.md b/delphi/docs/PLAN_DISCREPANCY_FIXES.md
@@ -20,6 +20,7 @@ This plan's "PR N" labels map to actual GitHub PRs as follows:
 | PR 2 (D4) | #2435 | Stack 9/10 | Fix D4: pseudocount formula |
 | (perf) | #2436 | Stack 10/10 | Speed up regression tests |
 | PR 3 (D9) | #2446 | — | Fix D9: z-score thresholds (one-tailed) |
+| PR 4 (D5) | #2448 | Stack 14/25 | Fix D5: proportion test formula |
 
 Future fix PRs will be appended to the stack as they're created.
 
@@ -465,7 +466,7 @@ By this point, we should have good test coverage from all the per-discrepancy te
 | D2d | In-conv monotonicity (once in, always in) | **PR 1** | **#2421** | **DONE** ✓ (5 guard tests, T1-T5) |
 | D3 | K-smoother buffer | PR 10 | — | Fix |
 | D4 | Pseudocount formula | **PR 2** | **#2435** | **DONE** ✓ |
-| D5 | Proportion test | PR 4 | — | Fix |
+| D5 | Proportion test | **PR 4** | — | **DONE** ✓ |
 | D6 | Two-proportion test | PR 5 | — | Fix |
 | D7 | Repness metric | PR 6 | — | Fix (with flag for old formula) |
 | D8 | Finalize cmt stats | PR 7 | — | Fix |
diff --git a/delphi/polismath/pca_kmeans_rep/repness.py b/delphi/polismath/pca_kmeans_rep/repness.py
@@ -60,29 +60,39 @@ def z_score_sig_95(z: float) -> bool:
     return z > Z_95
 
 
-def prop_test(p: float, n: int, p0: float) -> float:
+def prop_test(succ: int, n: int) -> float:
     """
-    One-proportion z-test.
-    
+    One-proportion z-test, matching Clojure's stats/prop-test (stats.clj:10-15).
+
+    Clojure formula:
+        (let [[succ n] (map inc [succ n])]
+          (* 2 (sqrt n) (+ (/ succ n) -0.5)))
+
+    Which simplifies to: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)
+
+    This is a Wilson-score-like test with built-in +1 pseudocount (Laplace
+    smoothing). Unlike the standard z-test ((p - p0) / sqrt(p0*(1-p0)/n)),
+    the +1 terms regularize extreme values for small samples, preventing
+    spurious significance in small Polis groups.
+
+    Note: the pseudocount here (+1 to succ and n, i.e. Beta(1,1)) is
+    independent of the PSEUDO_COUNT used for pa/pd computation (Beta(2,2)).
+    Clojure's prop-test takes raw success counts, not pre-smoothed
+    probabilities.
+
     Args:
-        p: Observed proportion
-        n: Number of observations
-        p0: Expected proportion under null hypothesis
-        
+        succ: Number of successes (e.g. agrees or disagrees)
+        n: Total number of trials (votes seen)
+
     Returns:
-        Z-score
+        Z-score (positive means succ/n > 0.5)
     """
-    if n == 0 or p0 == 0 or p0 == 1:
+    if n == 0:
         return 0.0
-    
-    # Calculate standard error
-    se = math.sqrt(p0 * (1 - p0) / n)
-    
-    # Z-score calculation
-    if se == 0:
-        return 0.0
-    else:
-        return (p - p0) / se
+    # Apply +1 pseudocount to both numerator and denominator
+    succ_pc = succ + 1
+    n_pc = n + 1
+    return 2 * math.sqrt(n_pc) * (succ_pc / n_pc - 0.5)
 
 
 def two_prop_test(p1: float, n1: int, p2: float, n2: int) -> float:
@@ -137,9 +147,10 @@ def comment_stats(votes: np.ndarray, group_members: List[int]) -> Dict[str, Any]
     p_agree = (n_agree + PSEUDO_COUNT/2) / (n_votes + PSEUDO_COUNT) if n_votes > 0 else 0.5
     p_disagree = (n_disagree + PSEUDO_COUNT/2) / (n_votes + PSEUDO_COUNT) if n_votes > 0 else 0.5
     
-    # Calculate significance tests
-    p_agree_test = prop_test(p_agree, n_votes, 0.5) if n_votes > 0 else 0.0
-    p_disagree_test = prop_test(p_disagree, n_votes, 0.5) if n_votes > 0 else 0.0
+    # Calculate significance tests — pass raw counts, matching Clojure's
+    # (stats/prop-test na ns) and (stats/prop-test nd ns) (repness.clj:74-75)
+    p_agree_test = prop_test(n_agree, n_votes) if n_votes > 0 else 0.0
+    p_disagree_test = prop_test(n_disagree, n_votes) if n_votes > 0 else 0.0
     
     # Return stats
     return {
@@ -457,23 +468,28 @@ def select_consensus_comments(all_stats: List[Dict[str, Any]]) -> List[Dict[str,
 # Vectorized DataFrame-native functions for multi-group operations
 # =============================================================================
 
-def prop_test_vectorized(p: pd.Series, n: pd.Series, p0: float = 0.5) -> pd.Series:
+def prop_test_vectorized(succ: pd.Series, n: pd.Series) -> pd.Series:
     """
-    Vectorized one-proportion z-test.
+    Vectorized one-proportion z-test, matching Clojure's stats/prop-test.
+
+    Formula: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)
+
+    See prop_test() docstring for derivation and rationale.
 
     Args:
-        p: Series of observed proportions
-        n: Series of number of observations
-        p0: Expected proportion under null hypothesis (default: 0.5)
+        succ: Series of success counts (e.g. agrees or disagrees)
+        n: Series of total trial counts (votes seen)
 
     Returns:
         Series of z-scores
     """
-    se = np.sqrt(p0 * (1 - p0) / n)
-    z = (p - p0) / se
-    # Handle edge cases: n=0, p0=0, p0=1 all result in 0
+    succ_pc = succ + 1
+    n_pc = n + 1
+    z = 2 * np.sqrt(n_pc) * (succ_pc / n_pc - 0.5)
+    # Handle n=0 edge case (n_pc=1, succ_pc=1 → z = 2*1*(1/1 - 0.5) = 1.0,
+    # but we want 0 for no-data rows)
+    z = z.where(n > 0, 0.0)
     z = z.fillna(0.0)
-    z = z.replace([np.inf, -np.inf], 0.0)
     return z
 
 
@@ -620,9 +636,10 @@ def compute_group_comment_stats_df(votes_long: pd.DataFrame,
     stats_df.loc[other_zero_mask, 'other_pa'] = 0.5
     stats_df.loc[other_zero_mask, 'other_pd'] = 0.5
 
-    # Compute proportion tests (group vs 0.5)
-    stats_df['pat'] = prop_test_vectorized(stats_df['pa'], stats_df['ns'], 0.5)
-    stats_df['pdt'] = prop_test_vectorized(stats_df['pd'], stats_df['ns'], 0.5)
+    # Compute proportion tests — pass raw counts, matching Clojure's
+    # (stats/prop-test na ns) and (stats/prop-test nd ns) (repness.clj:74-75)
+    stats_df['pat'] = prop_test_vectorized(stats_df['na'], stats_df['ns'])
+    stats_df['pdt'] = prop_test_vectorized(stats_df['nd'], stats_df['ns'])
 
     # Compute representativeness ratios (group vs other)
     stats_df['ra'] = stats_df['pa'] / stats_df['other_pa']
diff --git a/delphi/tests/test_discrepancy_fixes.py b/delphi/tests/test_discrepancy_fixes.py
@@ -693,23 +693,32 @@ class TestD5ProportionTest:
         Clojure uses Wilson-score-like: 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)
 
     Clojure formula has built-in regularization via +1 terms.
+    After fix, prop_test(succ, n) matches Clojure exactly.
     """
 
-    @pytest.mark.xfail(reason="D5: Python standard z-test vs Clojure Wilson-score-like")
     def test_prop_test_matches_clojure_formula(self):
-        """prop_test should match Clojure's formula for known inputs."""
-        # Example: 12 successes out of 13 trials
-        succ, n = 12, 13
-        # Clojure formula: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)
-        expected = 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)
-
-        # Current Python: prop_test(p, n, 0.5) where p = (succ + pc/2) / (n + pc)
-        p = (succ + PSEUDO_COUNT / 2) / (n + PSEUDO_COUNT)
-        python_result = prop_test(p, n, 0.5)
-
-        print(f"prop_test(succ={succ}, n={n}): Python={python_result:.4f}, Clojure={expected:.4f}")
-        check.almost_equal(python_result, expected, abs=0.01,
-                            msg=f"prop_test mismatch: Python={python_result:.4f}, Clojure={expected:.4f}")
+        """prop_test(succ, n) should match Clojure's formula for known inputs."""
+        test_cases = [
+            (12, 13),  # High success rate
+            (5, 8),    # Moderate
+            (0, 10),   # All failures
+            (10, 10),  # All successes
+            (1, 2),    # Tiny sample
+            (50, 100), # Larger sample
+            (0, 1),    # Single trial, no success
+            (1, 1),    # Single trial, success
+        ]
+        for succ, n in test_cases:
+            # Clojure formula: 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)
+            expected = 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)
+            result = prop_test(succ, n)
+            check.almost_equal(result, expected, abs=1e-10,
+                                msg=f"prop_test({succ}, {n}): got {result:.6f}, expected {expected:.6f}")
+
+    def test_prop_test_edge_cases(self):
+        """prop_test handles n=0 gracefully."""
+        # n=0 should return 0 (no data)
+        assert prop_test(0, 0) == 0.0
 
     def test_clojure_pat_values_consistent_with_formula(self, clojure_blob, dataset_name):
         """Sanity check: Clojure's p-test values match the documented formula."""
@@ -1155,14 +1164,13 @@ def test_z_thresholds_are_one_tailed(self):
         check.almost_equal(Z_95, 1.6449, abs=0.001,
                             msg=f"Z_95={Z_95}, expected 1.6449 (one-tailed)")
 
-    def test_clojure_prop_test_formula(self):
-        """Verify Clojure's proportion test formula: 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)."""
+    def test_prop_test_matches_clojure_formula_synthetic(self):
+        """prop_test(succ, n) should produce 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)."""
         # Small n: 5 successes out of 8 trials
         succ, n = 5, 8
-        result = 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)
-        # Manual: 2 * 3 * (6/9 - 0.5) = 6 * 0.1667 = 1.0
-        expected = 2 * 3.0 * (6.0 / 9.0 - 0.5)
-        assert abs(result - expected) < 1e-10
+        expected = 2 * 3.0 * (6.0 / 9.0 - 0.5)  # = 1.0
+        result = prop_test(succ, n)
+        assert abs(result - expected) < 1e-10, f"prop_test({succ}, {n})={result}, expected {expected}"
 
     def test_clojure_repness_metric_product(self):
         """Verify Clojure's repness metric is a product: ra * rat * pa * pat."""
@@ -1177,3 +1185,51 @@ def test_clojure_repful_uses_rat_vs_rdt(self):
 
         # rat < rdt → disagree
         assert (0.5 < 1.5)  # rat=0.5, rdt=1.5 → disagree
+
+
+# ============================================================================
+# Blob Injection Tests — Compare Python functions against real Clojure values
+# ============================================================================
+#
+# These tests extract inputs from the Clojure math blob, feed them to Python
+# functions, and compare outputs to the Clojure blob's values. This is the
+# only non-tautological way to verify correctness: formula-only tests just
+# re-implement our reading of the Clojure source and can't catch misreadings.
+#
+# Since Python and Clojure may produce different clusters (different k), we
+# inject Clojure's own group memberships and vote counts from the blob,
+# isolating each computation stage from upstream divergence.
+# ============================================================================
+
+@pytest.mark.clojure_comparison
+class TestD5BlobInjection:
+    """D5: Verify prop_test against real Clojure blob p-test values.
+
+    For each repness entry in the blob, extract n-success and n-trials,
+    feed to Python's prop_test(), compare to blob's p-test.
+    """
+
+    def test_prop_test_matches_blob_p_test(self, clojure_blob, dataset_name):
+        """prop_test(n_success, n_trials) should match blob's p-test for every repness entry."""
+        repness = clojure_blob.get('repness', {})
+        if not repness:
+            pytest.skip(f"No repness in Clojure blob for {dataset_name}")
+
+        mismatches = []
+        total = 0
+        for gid, entries in repness.items():
+            for entry in entries:
+                n_success = entry['n-success']
+                n_trials = entry['n-trials']
+                expected_p_test = entry['p-test']
+                actual = prop_test(n_success, n_trials)
+                total += 1
+                if abs(actual - expected_p_test) > 1e-4:
+                    mismatches.append(
+                        f"group={gid} tid={entry['tid']}: "
+                        f"prop_test({n_success}, {n_trials})={actual:.6f}, "
+                        f"blob p-test={expected_p_test:.6f}")
+
+        assert not mismatches, (
+            f"[{dataset_name}] {len(mismatches)}/{total} p-test mismatches:\n"
+            + "\n".join(mismatches[:10]))
diff --git a/delphi/tests/test_old_format_repness.py b/delphi/tests/test_old_format_repness.py
@@ -5,6 +5,7 @@
 that wraps the new DataFrame-native implementation.
 """
 
+import math
 import numpy as np
 import pandas as pd
 import sys
@@ -43,15 +44,16 @@ def test_z_score_significance(self):
         assert not z_score_sig_95(1.64)
 
     def test_prop_test(self):
-        """Test one-proportion z-test."""
-        # Test cases
-        assert np.isclose(prop_test(0.7, 100, 0.5), 4.0, atol=0.1)
-        assert np.isclose(prop_test(0.2, 50, 0.3), -1.6, atol=0.1)
-
-        # Edge cases
-        assert prop_test(0.5, 0, 0.5) == 0.0
-        assert prop_test(0.7, 100, 0.0) == 0.0
-        assert prop_test(0.7, 100, 1.0) == 0.0
+        """Test one-proportion z-test (Clojure formula: 2*sqrt(n+1)*((succ+1)/(n+1) - 0.5))."""
+        # 70 successes out of 100
+        assert np.isclose(prop_test(70, 100),
+                          2 * math.sqrt(101) * (71/101 - 0.5), atol=0.01)
+        # 10 successes out of 50
+        assert np.isclose(prop_test(10, 50),
+                          2 * math.sqrt(51) * (11/51 - 0.5), atol=0.01)
+
+        # Edge case: n=0
+        assert prop_test(0, 0) == 0.0
 
     def test_two_prop_test(self):
         """Test two-proportion z-test."""
diff --git a/delphi/tests/test_repness_unit.py b/delphi/tests/test_repness_unit.py