Commit ee6be9f

Julien and claude authored and committed
Update plan and journal: mark D5 as done
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent ff3f553 commit ee6be9f

2 files changed

Lines changed: 56 additions & 2 deletions

delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md

Lines changed: 54 additions & 1 deletion
@@ -387,11 +387,64 @@ Detailed analysis in `HANDOFF_REGRESSION_TEST_PERF.md` for a future session.
 
 ### What's Next
 
-1. **PR 3: Fix D9 (Z-score thresholds)**: `Z_90=1.645` → `1.2816`, `Z_95=1.96` → `1.6449`
+1. **PR 4: Fix D5 (Proportion test)**: Change `prop_test` from standard z-test to Clojure formula.
 2. Regression test performance optimization (separate session)
 
 ---
 
+## PR 4: Fix D5 (Proportion Test Formula)
+
+### TDD steps
+1. **Baseline**: 1 failed (pakistan-incremental D2, pre-existing), 91 passed, 5 skipped, 129 xfailed, 2 xpassed
+2. **Red**: Wrote tests calling `prop_test(succ, n)` (new signature) → 3 failures (TypeError: missing p0 arg)
+3. **Fix**: Replaced `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula:
+   `2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)` (stats.clj:10-15)
+4. **Green**: All 7 D5 tests pass (formula checks + sanity checks + edge cases)
+5. **Full suite (public)**: 4 regression failures (expected: pat/pdt/metric values changed)
+6. **Investigation**: All diffs are in `pat`, `pdt`, `agree_metric`, `disagree_metric`, all directly
+   downstream of the prop_test formula change. No unexpected field changes.
+7. **Re-recorded golden snapshots** for all 7 datasets (public + private)
+8. **Full suite (with --include-local)**: 1 failed (pakistan-incremental D2, pre-existing),
+   91 passed, 5 skipped, 129 xfailed, 2 xpassed; no regressions from D5
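The formula quoted in step 3 can be sketched in a few lines of Python. This is a minimal illustration built only from the expression above; the actual `repness.py` implementation, its docstring, and its handling of array inputs are not visible in this commit view, and the check values below are hand-picked rather than taken from the repository's tests.

```python
import math


def prop_test(succ: float, n: float) -> float:
    """Sketch of the Clojure-parity proportion test quoted in the journal.

    Takes a raw success count and a raw sample size (not a proportion).
    """
    # +1 is added to both the success count and the sample size,
    # then the result is scaled by 2 * sqrt(n + 1).
    return 2.0 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)


# Hand-picked sanity checks (illustrative, not the repository's test cases).
print(prop_test(3, 3))  # all successes: positive score
print(prop_test(0, 3))  # no successes: negative score
print(prop_test(0, 0))  # no data at all: still defined, no division by zero
```

Because the denominator is `n + 1`, the sketch stays defined for `n = 0`, which is one kind of edge case a test for this function would plausibly cover.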
+
+### Changes
+- `repness.py`: `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula
+  `2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)`. Added detailed docstring explaining
+  the Wilson-score-like regularization and the separate pseudocount from pa/pd.
+- `repness.py`: `prop_test_vectorized(p, n, p0)` → `prop_test_vectorized(succ, n)`
+- `repness.py`: Updated callers in `comment_stats()` and `compute_group_comment_stats_df()`
+  to pass raw counts `(na, ns)` / `(nd, ns)` instead of `(pa, ns, 0.5)` / `(pd, ns, 0.5)`
+- `test_discrepancy_fixes.py`: Removed xfail from D5 formula test, added comprehensive
+  test cases (8 input pairs including boundary conditions) and edge case test
+- `test_repness_unit.py`: Updated `test_prop_test` and vectorized tests for new signature
+- `test_old_format_repness.py`: Updated `test_prop_test` for new signature
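To see why the caller change shifts `pat`/`pdt`, the two call shapes can be compared side by side. The sketch below is hedged: `z_test_textbook` is the textbook one-sample proportion z-test and is only an assumption about what the "standard z-test" looked like, since the previous `repness.py` code is not shown in this commit. The function names and input values are hypothetical.

```python
import math


def z_test_textbook(p: float, n: float, p0: float = 0.5) -> float:
    # ASSUMPTION: textbook one-sample z-test for a proportion.
    # The repository's previous prop_test(p, n, p0) may have differed.
    return (p - p0) / math.sqrt(p0 * (1 - p0) / n)


def prop_test(succ: float, n: float) -> float:
    # Clojure formula as quoted in the journal.
    return 2.0 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)


na, ns = 7, 10                      # hypothetical: 7 agrees out of 10 votes
pa = (na + 1) / (ns + 2)            # smoothed probability, as defined in the journal

old_pat = z_test_textbook(pa, ns)   # old call shape: (pa, ns, 0.5)
new_pat = prop_test(na, ns)         # new call shape: (na, ns)

print(f"old-style pat = {old_pat:.4f}")
print(f"new-style pat = {new_pat:.4f}")
```

Under these assumptions the two values differ for the same votes, which is consistent with the journal's note that the regression diffs are confined to fields downstream of `prop_test`.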
+
+### Key insight: two separate pseudocounts
+The Clojure `prop-test` has its own built-in +1 pseudocount (Laplace smoothing / Beta(1,1)),
+separate from the PSEUDO_COUNT=2.0 used for pa/pd (Beta(2,2)). The prop_test takes raw
+success counts, not pre-smoothed probabilities. This means:
+- `pa = (na + 1) / (ns + 2)` (Beta(2,2) prior for probability estimation)
+- `pat = 2 * sqrt(ns+1) * ((na+1)/(ns+1) - 0.5)` (Beta(1,1) prior for significance testing)
+
+These are conceptually different: the probability is for ranking, the z-score is for
+significance filtering. Using different priors is intentional.
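A small worked example may make the distinction concrete. The sketch below evaluates both expressions from the list above on the same raw counts; the helper names `pa_estimate` and `pat_score` and the vote counts are hypothetical, chosen only for illustration.

```python
import math


def pa_estimate(na: float, ns: float) -> float:
    # Smoothed agree probability, as written in the journal.
    return (na + 1) / (ns + 2)


def pat_score(na: float, ns: float) -> float:
    # Test statistic computed from the same raw counts (not from pa).
    return 2.0 * math.sqrt(ns + 1) * ((na + 1) / (ns + 1) - 0.5)


# Hypothetical (agree votes, total votes) pairs.
for na, ns in [(1, 1), (7, 10), (70, 100)]:
    print(f"na={na:3d} ns={ns:3d}  pa={pa_estimate(na, ns):.3f}  pat={pat_score(na, ns):.3f}")
```

With these inputs, `(1, 1)` and `(7, 10)` yield the same `pa` (2/3) but different `pat` values (about 1.41 and 1.51), and `(70, 100)` gives a `pat` of about 4.08 while `pa` stays near 0.70. The probability barely moves, while the test statistic grows with the amount of evidence.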
+
+### Session 7 (2026-03-13)
+
+- Created branch `jc/clj-parity-d5-prop-test` on top of `jc/clj-parity-d9-fix`
+- Read Clojure source (stats.clj:10-15, repness.clj:74-75) to verify formula
+- TDD cycle: red (3 TypeError failures) → fix → green (7 pass, 4 xfail)
+- Full suite: 4 regression failures, all in pat/pdt/metric fields (expected)
+- Re-recorded golden snapshots for all 7 datasets
+- Final validation: 19/19 regression tests pass, 1 pre-existing failure (pakistan-incremental D2)
+
+### What's Next
+
+1. **PR 5: Fix D6 (Two-proportion test)**: Add +1 pseudocount to all 4 inputs, change signature
+   from proportions to raw counts.
+
+---
+
 ## TDD Discipline
 
 **CRITICAL: For every fix, ALWAYS follow this order:**

delphi/docs/PLAN_DISCREPANCY_FIXES.md

Lines changed: 2 additions & 1 deletion
@@ -20,6 +20,7 @@ This plan's "PR N" labels map to actual GitHub PRs as follows:
 | PR 2 (D4) | #2435 | Stack 9/10 | Fix D4: pseudocount formula |
 | (perf) | #2436 | Stack 10/10 | Speed up regression tests |
 | PR 3 (D9) | #2446 | | Fix D9: z-score thresholds (one-tailed) |
+| PR 4 (D5) | | | Fix D5: proportion test formula |
 
 Future fix PRs will be appended to the stack as they're created.
 
@@ -465,7 +466,7 @@ By this point, we should have good test coverage from all the per-discrepancy te
 | D2d | In-conv monotonicity (once in, always in) | **PR 1** | **#2421** | **DONE** ✓ (5 guard tests, T1-T5) |
 | D3 | K-smoother buffer | PR 10 | | Fix |
 | D4 | Pseudocount formula | **PR 2** | **#2435** | **DONE** |
-| D5 | Proportion test | PR 4 | | Fix |
+| D5 | Proportion test | **PR 4** | | **DONE** |
 | D6 | Two-proportion test | PR 5 | | Fix |
 | D7 | Repness metric | PR 6 | | Fix (with flag for old formula) |
 | D8 | Finalize cmt stats | PR 7 | | Fix |
