compdemocracy · jucor · Mar 17, 2026 · Mar 13, 2026 · Mar 18, 2026 · Copilot
diff --git a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
@@ -440,8 +440,58 @@ significance filtering. Using different priors is intentional.
 
 ### What's Next
 
-1. **PR 5 — Fix D6 (Two-proportion test)**: Add +1 pseudocount to all 4 inputs, change signature
-   from proportions to raw counts.
+1. **PR 6 — Fix D7 (Repness metric)**: Change formula from `pa * (|pat| + |rat|)` to
+   `ra * rat * pa * pat` (Clojure product formula).
+
+---
+
+## PR 5: Fix D6 — Two-Proportion Test Pseudocounts
+
+### TDD steps
+1. **Baseline**: 1 failed (pakistan-incremental D2, pre-existing), 102 passed, 5 skipped, 143 xfailed, 2 xpassed
+2. **Red**: Rewrote `TestD6TwoPropTest` with new signature `two_prop_test(succ_in, succ_out, pop_in, pop_out)`
+   and correct Clojure formula → 3 failures (TypeError: old function expects proportions)
+3. **Fix**: Replaced both `two_prop_test` and `two_prop_test_vectorized` with Clojure formula:
+   add +1 to all 4 inputs (stats.clj:20), compute `pi1=(s+1)/(p+1)`, standard pooled z-test
+4. **Green**: All 3 D6 formula tests pass, 4 blob comparison tests xfail (depend on D10)
+5. **Full suite**: 4 regression failures, all in `rat`/`rdt`/`agree_metric`/`disagree_metric` — direct
+   downstream of the formula change. No unexpected fields affected.
+6. **Re-recorded golden snapshots** for all 7 datasets (public + private)
+7. **Final**: 1 failed (pakistan-incremental D2, pre-existing), 102 passed, 5 skipped, 143 xfailed, 2 xpassed
+
+### Changes
+- `repness.py`: `two_prop_test(p1, n1, p2, n2)` → `two_prop_test(succ_in, succ_out, pop_in, pop_out)`
+  with +1 pseudocount on all 4 inputs, matching Clojure's `(map inc ...)` (stats.clj:20)
+- `repness.py`: `two_prop_test_vectorized` — same signature change
+- `repness.py`: Updated callers in `add_comparative_stats` and `compute_group_comment_stats_df`
+  to pass raw counts `(na, other_na, ns, other_ns)` instead of `(pa, ns, other_pa, other_ns)`
+- `test_discrepancy_fixes.py`: Rewrote `TestD6TwoPropTest` with correct formula, 7 test cases,
+  edge cases, and regularization effect test
+- `test_repness_unit.py`: Updated `test_two_prop_test`, `test_two_prop_test_vectorized`,
+  `test_two_prop_test_vectorized_edge_cases` for new signature
+- `test_old_format_repness.py`: Updated `test_two_prop_test` for new signature
+
+### Key insight: existing test had wrong expected formula
+The pre-existing D6 test computed expected values using `(succ+1)/(n+2)` — as if two pseudocounts
+were added to the denominator. But Clojure's `(map inc ...)` adds +1 to each value independently,
+giving `(succ+1)/(pop+1)`. The formula is a standard pooled z-test on the pseudocount-adjusted values,
+not a Beta distribution posterior.
+
+### Session 8 (2026-03-13)
+
+- Created branch `jc/clj-parity-d6-two-prop-test` on top of `jc/clj-parity-d5-prop-test`
+- Read Clojure source (stats.clj:18-33, repness.clj:97-100) to verify formula and call sites
+- Discovered the existing D6 test had wrong expected formula — fixed
+- TDD cycle: red (3 TypeError failures) → fix → green (3 pass, 4 xfail)
+- Updated all callers: both scalar (`add_comparative_stats`) and vectorized
+  (`compute_group_comment_stats_df`) now pass raw counts
+- Full suite: 4 regression failures, all in rat/rdt/metric fields (expected)
+- Re-recorded golden snapshots for all 7 datasets
+- Final validation: 102 passed, 1 pre-existing failure (pakistan-incremental D2)
+
+### What's Next
+
+1. **PR 6 — Fix D7 (Repness metric)**: Change from `pa * (|pat| + |rat|)` to `ra * rat * pa * pat`.
 
 ---
 

diff --git a/delphi/docs/PLAN_DISCREPANCY_FIXES.md b/delphi/docs/PLAN_DISCREPANCY_FIXES.md
@@ -21,6 +21,7 @@ This plan's "PR N" labels map to actual GitHub PRs as follows:
 | (perf) | #2436 | Stack 10/10 | Speed up regression tests |
 | PR 3 (D9) | #2446 | — | Fix D9: z-score thresholds (one-tailed) |
 | PR 4 (D5) | #2448 | Stack 14/25 | Fix D5: proportion test formula |
+| PR 5 (D6) | #2449 | Stack 15/25 | Fix D6: two-proportion test pseudocounts |
 
 Future fix PRs will be appended to the stack as they're created.
 
@@ -467,7 +468,7 @@ By this point, we should have good test coverage from all the per-discrepancy te
 | D3 | K-smoother buffer | PR 10 | — | Fix |
 | D4 | Pseudocount formula | **PR 2** | **#2435** | **DONE** ✓ |
 | D5 | Proportion test | **PR 4** | — | **DONE** ✓ |
-| D6 | Two-proportion test | PR 5 | — | Fix |
+| D6 | Two-proportion test | **PR 5** | — | **DONE** ✓ |
 | D7 | Repness metric | PR 6 | — | Fix (with flag for old formula) |
 | D8 | Finalize cmt stats | PR 7 | — | Fix |
 | D9 | Z-score thresholds | **PR 3** | **#2446** | **DONE** ✓ |

diff --git a/delphi/polismath/pca_kmeans_rep/repness.py b/delphi/polismath/pca_kmeans_rep/repness.py
@@ -95,33 +95,51 @@ def prop_test(succ: int, n: int) -> float:
     return 2 * math.sqrt(n_pc) * (succ_pc / n_pc - 0.5)
 
 
-def two_prop_test(p1: float, n1: int, p2: float, n2: int) -> float:
+def two_prop_test(succ_in: int, succ_out: int, pop_in: int, pop_out: int) -> float:
     """
-    Two-proportion z-test.
-
+    Two-proportion z-test with +1 pseudocount on all inputs.
+
+    Matches Clojure's stats/two-prop-test (stats.clj:18-33):
+      (let [[succ-in succ-out pop-in pop-out] (map inc [succ-in succ-out pop-in pop-out])
+            pi1 (/ succ-in pop-in)
+            pi2 (/ succ-out pop-out)
+            pi-hat (/ (+ succ-in succ-out) (+ pop-in pop-out))]
+        ...)
+
+    The +1 pseudocount (Laplace smoothing) regularizes the z-score for small
+    samples, preventing extreme values when group sizes are tiny.
+
     Args:
-        p1: First proportion
-        n1: Number of observations for first proportion
-        p2: Second proportion
-        n2: Number of observations for second proportion
-        
+        succ_in: Number of successes in the group (e.g., agrees)
+        succ_out: Number of successes outside the group
+        pop_in: Total votes in the group
+        pop_out: Total votes outside the group
+
     Returns:
-        Z-score
+        Z-score (positive means group proportion > other proportion)
     """
-    if n1 == 0 or n2 == 0:
+    if pop_in == 0 or pop_out == 0:
         return 0.0
-
-    # Pooled probability
-    p = (p1 * n1 + p2 * n2) / (n1 + n2)
-
-    # Standard error
-    se = math.sqrt(p * (1 - p) * (1/n1 + 1/n2))
-
-    # Z-score calculation
+
+    # Add +1 pseudocount to all four inputs (Clojure: map inc)
+    s1 = succ_in + 1
+    s2 = succ_out + 1
+    p1 = pop_in + 1
+    p2 = pop_out + 1
+
+    pi1 = s1 / p1
+    pi2 = s2 / p2
+    pi_hat = (s1 + s2) / (p1 + p2)
+
+    if pi_hat == 1.0:
+        # Clojure note (stats.clj:26-27): "this isn't quite right... could
+        # actually solve this using limits" — returning 0 for now, matching Clojure.
+        return 0.0
+
+    se = math.sqrt(pi_hat * (1 - pi_hat) * (1/p1 + 1/p2))
-    p1 = pop_in + 1
-    p2 = pop_out + 1
-
-    pi1 = s1 / p1
-    pi2 = s2 / p2
-    pi_hat = (s1 + s2) / (p1 + p2)
-
-    if pi_hat == 1.0:
-        # Clojure note (stats.clj:26-27): "this isn't quite right... could
-        # actually solve this using limits" — returning 0 for now, matching Clojure.
-        return 0.0
-
-    se = math.sqrt(pi_hat * (1 - pi_hat) * (1/p1 + 1/p2))
+    pop_in_pc = pop_in + 1
+    pop_out_pc = pop_out + 1
+
+    pi1 = s1 / pop_in_pc
+    pi2 = s2 / pop_out_pc
+    pi_hat = (s1 + s2) / (pop_in_pc + pop_out_pc)
+
+    if pi_hat == 1.0:
+        # Clojure note (stats.clj:26-27): "this isn't quite right... could
+        # actually solve this using limits" — returning 0 for now, matching Clojure.
+        return 0.0
+
+    se = math.sqrt(pi_hat * (1 - pi_hat) * (1 / pop_in_pc + 1 / pop_out_pc))
-    p1 = pop_in + 1
-    p2 = pop_out + 1
-
-    pi1 = s1 / p1
-    pi2 = s2 / p2
-    pi_hat = (s1 + s2) / (p1 + p2)
-
-    if pi_hat == 1.0:
-        # Clojure note (stats.clj:26-27): "this isn't quite right... could
-        # actually solve this using limits" — returning 0 for now, matching Clojure.
-        return 0.0
-
-    se = math.sqrt(pi_hat * (1 - pi_hat) * (1/p1 + 1/p2))
+    pop_in_pc = pop_in + 1
+    pop_out_pc = pop_out + 1
+
+    pi1 = s1 / pop_in_pc
+    pi2 = s2 / pop_out_pc
+    pi_hat = (s1 + s2) / (pop_in_pc + pop_out_pc)
+
+    if pi_hat == 1.0:
+        # Clojure note (stats.clj:26-27): "this isn't quite right... could
+        # actually solve this using limits" — returning 0 for now, matching Clojure.
+        return 0.0
+
+    se = math.sqrt(pi_hat * (1 - pi_hat) * (1 / pop_in_pc + 1 / pop_out_pc))
     if se == 0:
         return 0.0
-    else:
-        return (p1 - p2) / se
+    return (pi1 - pi2) / se
 
 
 def comment_stats(votes: np.ndarray, group_members: List[int]) -> Dict[str, Any]:
@@ -182,15 +200,17 @@ def add_comparative_stats(comment_stats: Dict[str, Any],
     result['ra'] = result['pa'] / other_stats['pa'] if other_stats['pa'] > 0 else 1.0
     result['rd'] = result['pd'] / other_stats['pd'] if other_stats['pd'] > 0 else 1.0
 
-    # Calculate representativeness tests
+    # Calculate representativeness tests — pass raw counts, matching Clojure's
+    # (stats/two-prop-test (:na in-stats) (sum :na rest-stats)
+    #                      (:ns in-stats) (sum :ns rest-stats))  (repness.clj:97-100)
     result['rat'] = two_prop_test(
-        result['pa'], result['ns'], 
-        other_stats['pa'], other_stats['ns']
+        result['na'], other_stats['na'],
+        result['ns'], other_stats['ns']
     )
-    
+
     result['rdt'] = two_prop_test(
-        result['pd'], result['ns'], 
-        other_stats['pd'], other_stats['ns']
+        result['nd'], other_stats['nd'],
+        result['ns'], other_stats['ns']
     )
 
     return result
@@ -493,30 +513,38 @@ def prop_test_vectorized(succ: pd.Series, n: pd.Series) -> pd.Series:
     return z
 
 
-def two_prop_test_vectorized(p1: pd.Series, n1: pd.Series,
-                             p2: pd.Series, n2: pd.Series) -> pd.Series:
+def two_prop_test_vectorized(succ_in: pd.Series, succ_out: pd.Series,
+                             pop_in: pd.Series, pop_out: pd.Series) -> pd.Series:
     """
-    Vectorized two-proportion z-test.
+    Vectorized two-proportion z-test with +1 pseudocount on all inputs.
+
+    Matches Clojure's stats/two-prop-test (stats.clj:18-33).
+    See two_prop_test() scalar version for formula details.
 
     Args:
-        p1: Series of first proportions
-        n1: Series of number of observations for first proportion
-        p2: Series of second proportions
-        n2: Series of number of observations for second proportion
+        succ_in: Series of success counts in the group
+        succ_out: Series of success counts outside the group
+        pop_in: Series of total vote counts in the group
+        pop_out: Series of total vote counts outside the group
 
     Returns:
         Series of z-scores
     """
-    # Pooled probability
-    p_pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
+    # Add +1 pseudocount to all four inputs (Clojure: map inc)
+    s1 = succ_in + 1
+    s2 = succ_out + 1
+    p1 = pop_in + 1
+    p2 = pop_out + 1
 
-    # Standard error
-    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
+    pi1 = s1 / p1
+    pi2 = s2 / p2
+    pi_hat = (s1 + s2) / (p1 + p2)
 
-    # Z-score calculation
-    z = (p1 - p2) / se
+    se = np.sqrt(pi_hat * (1 - pi_hat) * (1/p1 + 1/p2))
+    z = (pi1 - pi2) / se
 
-    # Handle edge cases
+    # Handle edge cases: pop_in=0 or pop_out=0 → 0, pi_hat=1 → 0
+    z = z.where((pop_in > 0) & (pop_out > 0), 0.0)
     z = z.fillna(0.0)
     z = z.replace([np.inf, -np.inf], 0.0)
-    # Handle edge cases: pop_in=0 or pop_out=0 → 0, pi_hat=1 → 0
-    z = z.where((pop_in > 0) & (pop_out > 0), 0.0)
-    z = z.fillna(0.0)
-    z = z.replace([np.inf, -np.inf], 0.0)
+    # Handle edge cases: pi_hat=0 or 1 → se=0 → inf/NaN; map these to 0
+    z = z.replace([np.inf, -np.inf], 0.0)
+    z = z.fillna(0.0)
-    # Handle edge cases: pop_in=0 or pop_out=0 → 0, pi_hat=1 → 0
-    z = z.where((pop_in > 0) & (pop_out > 0), 0.0)
-    z = z.fillna(0.0)
-    z = z.replace([np.inf, -np.inf], 0.0)
+    # Handle edge cases: pi_hat=0 or 1 → se=0 → inf/NaN; map these to 0
+    z = z.replace([np.inf, -np.inf], 0.0)
+    z = z.fillna(0.0)
     return z
@@ -649,14 +677,16 @@ def compute_group_comment_stats_df(votes_long: pd.DataFrame,
     stats_df['ra'] = stats_df['ra'].replace([np.inf, -np.inf], 1.0).fillna(1.0)
     stats_df['rd'] = stats_df['rd'].replace([np.inf, -np.inf], 1.0).fillna(1.0)
 
-    # Compute representativeness tests (two-proportion z-test: group vs other)
+    # Compute representativeness tests — pass raw counts, matching Clojure's
+    # (stats/two-prop-test (:na in-stats) (sum :na rest-stats)
+    #                      (:ns in-stats) (sum :ns rest-stats))  (repness.clj:97-100)
     stats_df['rat'] = two_prop_test_vectorized(
-        stats_df['pa'], stats_df['ns'],
-        stats_df['other_pa'], stats_df['other_votes']
+        stats_df['na'], stats_df['other_agree'],
+        stats_df['ns'], stats_df['other_votes']
     )
     stats_df['rdt'] = two_prop_test_vectorized(
-        stats_df['pd'], stats_df['ns'],
-        stats_df['other_pd'], stats_df['other_votes']
+        stats_df['nd'], stats_df['other_disagree'],
+        stats_df['ns'], stats_df['other_votes']
     )
 
     # Compute metrics