@@ -387,11 +387,64 @@ Detailed analysis in `HANDOFF_REGRESSION_TEST_PERF.md` for a future session.
387 387
388 388 ### What's Next
389 389
390- 1. **PR 3: Fix D9 (Z-score thresholds)**: `Z_90=1.645` → `1.2816`, `Z_95=1.96` → `1.6449`
390+ 1. **PR 4: Fix D5 (Proportion test)**: Change `prop_test` from the standard z-test to the Clojure formula.
391 391 2. Regression test performance optimization (separate session)
392 392
393 393 ---
394 394
395+ ## PR 4: Fix D5 — Proportion Test Formula
396+
397+ ### TDD steps
398+ 1. **Baseline**: 1 failed (pakistan-incremental D2, pre-existing), 91 passed, 5 skipped, 129 xfailed, 2 xpassed
399+ 2. **Red**: Wrote tests calling `prop_test(succ, n)` (new signature) → 3 failures (TypeError: missing p0 arg)
400+ 3. **Fix**: Replaced `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula:
401+ `2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)` (stats.clj:10-15)
402+ 4. **Green**: All 7 D5 tests pass (formula checks + sanity checks + edge cases)
403+ 5. **Full suite (public)**: 4 regression failures (expected: pat/pdt/metric values changed)
404+ 6. **Investigation**: All diffs are in `pat`, `pdt`, `agree_metric`, `disagree_metric`, all directly
405+ downstream of the prop_test formula change. No unexpected field changes.
406+ 7. **Re-recorded golden snapshots** for all 7 datasets (public + private)
407+ 8. **Full suite (with --include-local)**: 1 failed (pakistan-incremental D2, pre-existing),
408+ 91 passed, 5 skipped, 129 xfailed, 2 xpassed; no regressions from D5
409+
410+ ### Changes
411+ - `repness.py`: `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula
412+ `2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)`. Added detailed docstring explaining
413+ the Wilson-score-like regularization and the separate pseudocount from pa/pd.
414+ - `repness.py`: `prop_test_vectorized(p, n, p0)` → `prop_test_vectorized(succ, n)`
415+ - `repness.py`: Updated callers in `comment_stats()` and `compute_group_comment_stats_df()`
416+ to pass raw counts `(na, ns)` / `(nd, ns)` instead of `(pa, ns, 0.5)` / `(pd, ns, 0.5)`
417+ - `test_discrepancy_fixes.py`: Removed xfail from D5 formula test, added comprehensive
418+ test cases (8 input pairs including boundary conditions) and edge case test
419+ - `test_repness_unit.py`: Updated `test_prop_test` and vectorized tests for new signature
420+ - `test_old_format_repness.py`: Updated `test_prop_test` for new signature
421+
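To make the caller change concrete, here is a minimal sketch comparing the two call styles. It assumes the old `prop_test(p, n, p0)` computed the textbook one-proportion z-statistic (the journal only says "standard z-test"); the name `ztest_one_proportion` and the vote counts are hypothetical and are not taken from `repness.py`. The new formula is the one quoted in the list above.

```python
import math


def ztest_one_proportion(p: float, n: float, p0: float = 0.5) -> float:
    """Textbook one-proportion z-statistic (ASSUMED shape of the old prop_test)."""
    return (p - p0) / math.sqrt(p0 * (1.0 - p0) / n)


def prop_test(succ: float, n: float) -> float:
    """Clojure-style formula quoted in this journal: +1 on successes and on trials."""
    return 2.0 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)


# Hypothetical counts: 7 agree votes out of 10 votes seen.
na, ns = 7, 10
pa = (na + 1) / (ns + 2)  # pre-smoothed probability, as passed by the old callers

old_z = ztest_one_proportion(pa, ns, 0.5)  # old call style: (pa, ns, 0.5)
new_z = prop_test(na, ns)                  # new call style: raw counts (na, ns)

print(f"old-style z = {old_z:.4f}, new-style z = {new_z:.4f}")
```

With `p0 = 0.5` the textbook statistic simplifies to `2 * sqrt(n) * (p - 0.5)`, so under that assumption the new formula has the same shape, with `n + 1` in place of `n` and `(succ + 1) / (n + 1)` in place of `p`. The two call styles still give different numbers because the old callers fed in a probability that had already been smoothed.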
422+ ### Key insight: two separate pseudocounts
423+ The Clojure `prop-test` has its own built-in +1 pseudocount (Laplace-style smoothing, loosely Beta(1,1)),
424+ separate from the PSEUDO_COUNT=2.0 used for pa/pd (Beta(2,2)). `prop_test` takes raw
425+ success counts, not pre-smoothed probabilities. This means:
426+ - `pa = (na + 1) / (ns + 2)`: Beta(2,2) prior for probability estimation
427+ - `pat = 2 * sqrt(ns+1) * ((na+1)/(ns+1) - 0.5)`: Beta(1,1) prior for significance testing
428+
429+ These are conceptually different: the probability is for ranking, the z-score is for
430+ significance filtering. Using different priors is intentional.
431+
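The two formulas above can be evaluated side by side on a few raw counts. This is a small sketch, not the code in `repness.py`: the helper name `smoothed_prob`, the way `PSEUDO_COUNT` is split as `pseudo / 2` in the numerator, and the sample `(na, ns)` pairs are all assumptions chosen so that the default reproduces `(na + 1) / (ns + 2)`.

```python
import math

PSEUDO_COUNT = 2.0  # value quoted in this journal for pa/pd


def smoothed_prob(count: float, total: float, pseudo: float = PSEUDO_COUNT) -> float:
    """pa/pd-style estimate; with pseudo=2.0 this is (count + 1) / (total + 2)."""
    return (count + pseudo / 2.0) / (total + pseudo)


def prop_test(succ: float, n: float) -> float:
    """pat/pdt-style z-score with its own +1 pseudocount on succ and n."""
    return 2.0 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)


# Hypothetical (na, ns) pairs, from "no votes" up to a larger group.
for na, ns in [(0, 0), (5, 10), (10, 10), (50, 100)]:
    pa = smoothed_prob(na, ns)
    pat = prop_test(na, ns)
    print(f"na={na:3d} ns={ns:3d}  pa={pa:.4f}  pat={pat:+.4f}")
```

One consequence of the separate pseudocounts shows up at the boundary: with no votes at all, `pa` is 0.5 while `pat` is 1.0 rather than 0, because `(0 + 1) / (0 + 1) = 1`. This follows directly from the two formulas as written.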
432+ ### Session 7 (2026-03-13)
433+
434+ - Created branch `jc/clj-parity-d5-prop-test` on top of `jc/clj-parity-d9-fix`
435+ - Read Clojure source (stats.clj:10-15, repness.clj:74-75) to verify formula
436+ - TDD cycle: red (3 TypeError failures) → fix → green (7 pass, 4 xfail)
437+ - Full suite: 4 regression failures, all in pat/pdt/metric fields (expected)
438+ - Re-recorded golden snapshots for all 7 datasets
439+ - Final validation: 19/19 regression tests pass, 1 pre-existing failure (pakistan-incremental D2)
440+
441+ ### What's Next
442+
443+ 1. **PR 5: Fix D6 (Two-proportion test)**: Add +1 pseudocount to all 4 inputs, change signature
444+ from proportions to raw counts.
445+
446+ ---
447+
395 448 ## TDD Discipline
396 449
397 450 **CRITICAL: For every fix, ALWAYS follow this order:**