Skip to content

Commit 9af6bdf

Browse files
committed
Fix D9: z-score thresholds from two-tailed to one-tailed
## Summary - Fix D9: change z-score significance thresholds from two-tailed to one-tailed, matching Clojure's `stats.clj` - `Z_90`: 1.645 → 1.2816, `Z_95`: 1.96 → 1.6449 - Also resolves an internal inconsistency — Python's own `stats.py` already used the correct one-tailed values ## Why one-tailed? The proportion tests in Polis check whether a comment's agree (or disagree) rate is **significantly above 0.5** — a directional hypothesis. One-tailed is correct because we only care about one direction at a time. The two-tailed values were 28% more conservative, causing fewer comments to pass significance. ## Test plan - [x] TDD: removed xfail from 3 D9 tests, confirmed red (3 failures), applied fix, confirmed green - [x] Discrepancy tests: 63 passed, 6 skipped, 50 xfailed (all 7 datasets including private) - [x] Regression tests: 19 passed (all 7 datasets, golden snapshots re-recorded) - [x] Repness unit tests: 36 passed (boundary values updated to match new thresholds) - [x] 4 pre-existing failures unrelated to D9 (PCA incremental blobs, DB-dependent tests) 🤖 Generated with [Claude Code](https://claude.com/claude-code) commit-id:0194003d
1 parent ff213bd commit 9af6bdf

9 files changed

Lines changed: 519 additions & 700 deletions
Lines changed: 134 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,134 @@
1+
# Handoff: PR 14 — Make Vectorized Code Readable + Blob Injection Tests
2+
3+
## Goal
4+
5+
The scalar functions (`comment_stats`, `add_comparative_stats`, `repness_metric`,
6+
`finalize_cmt_stats`) read like a step-by-step recipe. Their vectorized replacement
7+
(`compute_group_comment_stats_df`) buries the same logic in 150 lines of DataFrame
8+
plumbing. The vectorized path is the ONLY production code path — the scalar functions
9+
are dead code, called only from tests and benchmarks.
10+
11+
**PR 14 must make the vectorized code at least as readable as the scalar code, then
12+
delete the scalar functions.** This also enables vectorized blob injection tests —
13+
the only non-tautological way to verify correctness against the Clojure math blob.
14+
15+
## Starting point
16+
17+
Branch off `jc/clj-parity-d9-fix` (Stack 13). This is BELOW D5-D8 in the stack,
18+
so the refactor will be inherited by all formula fix PRs.
19+
20+
## Current stack (as of 2026-03-17)
21+
22+
```
23+
Stack 13: jc/clj-parity-d9-fix ← branch off HERE for PR 14
24+
Stack 14: jc/clj-parity-d5-prop-test (draft PR #2448)
25+
Stack 15: jc/clj-parity-d6-two-prop-test (draft PR #2449)
26+
Stack 16: jc/clj-parity-d7-repness-metric (draft PR #2450)
27+
Stack 17: jc/clj-parity-d8-finalize-stats (draft PR #2451)
28+
Stack 18: jc/clj-parity-d15-moderation-handling-zeros-vs-removes (PR #2452)
29+
Stack 19: jc/clj-parity-kmeans-k-divergence (k-divergence fix)
30+
Stack 20: jc/clj-parity-d10-rep-comment-selection
31+
Stack 21: jc/clj-parity-d11-consensus-comment-selection
32+
Stack 22: jc/clj-parity-d3-k-smoother-buffer
33+
Stack 23: jc/clj-parity-d12-comment-priorities
34+
Stack 24: jc/clj-parity-d1-pca-sign-flip-prevention
35+
Stack 25: jc/clj-parity-pr15-load-votes-sort
36+
```
37+
38+
D5-D8 are draft PRs waiting for vectorized blob tests before being marked ready.
39+
40+
## Task sequence
41+
42+
### 1. Refactor `compute_group_comment_stats_df`
43+
44+
Split into two phases:
45+
46+
**(a) DataFrame construction** (the plumbing):
47+
- Build participant→group mapping
48+
- Compute total counts per comment
49+
- Create full (group, comment) cross-product index
50+
- Join total counts, compute "other" counts
51+
52+
**(b) Statistics computation** (the readable part):
53+
A new function with clean DataFrame inputs (columns: `na`, `nd`, `ns`,
54+
`other_agree`, `other_disagree`, `other_votes`) → outputs (columns: `pa`, `pd`,
55+
`pat`, `pdt`, `ra`, `rd`, `rat`, `rdt`, `agree_metric`, `disagree_metric`, `repful`).
56+
57+
This function should read like the scalar recipe:
58+
```python
59+
# Probabilities with pseudocounts
60+
df['pa'] = (df['na'] + PSEUDO_COUNT/2) / (df['ns'] + PSEUDO_COUNT)
61+
# Proportion tests
62+
df['pat'] = prop_test_vectorized(df['na'], df['ns'])
63+
# Representativeness ratios
64+
df['ra'] = df['pa'] / df['other_pa']
65+
# ...etc
66+
```
67+
68+
### 2. Write vectorized blob injection tests
69+
70+
Inject Clojure group memberships + votes into the new statistics function,
71+
compare output to blob values. Test the PRODUCTION code path.
72+
73+
For each `(group, tid)` in the blob's `repness`:
74+
- Compare `pat` (or `pdt`) to blob `p-test`
75+
- Compare `rat` (or `rdt`) to blob `repness-test`
76+
- Compare `pa` (or `pd`) to blob `p-success`
77+
- Compare `repful` to blob `repful-for`
78+
79+
The blob only stores the winning side's values. `repful-for` tells you which
80+
side won: if `agree`, then `n-success`=na, `p-test`=pat, `repness-test`=rat.
81+
If `disagree`, then `n-success`=nd, `p-test`=pdt, `repness-test`=rdt.
82+
83+
### 3. Verify scalar ≡ vectorized
84+
85+
Run both paths on all datasets, compare outputs field by field. Must be
86+
identical (not just close — exact match) since they implement the same formulas.
87+
88+
### 4. Delete scalar functions
89+
90+
Remove: `comment_stats`, `add_comparative_stats`, `repness_metric`,
91+
`finalize_cmt_stats`, and scalar `prop_test`/`two_prop_test` if no longer
92+
needed (the vectorized versions remain).
93+
94+
Update all tests that called the scalar API.
95+
96+
### 5. Cascade rebase
97+
98+
Rebase D5→D6→D7→D8→D15→k-divergence→D10→D11→D3→D12→D1→PR15 on top.
99+
Use `.claude/skills/pr-stack/rebase-stack.sh` for the cascade.
100+
101+
### 6. Add per-PR vectorized blob injection tests
102+
103+
For each of D5, D6, D8: insert a RED blob test commit before the fix,
104+
verify it fails, then verify the fix makes it pass. Same TDD pattern used
105+
for the scalar blob tests already in those branches.
106+
107+
Then mark #2448-#2451 as ready (un-draft).
108+
109+
## Key blob context
110+
111+
- `n-trials` in the Clojure blob = `S` (total seen, INCLUDING passes), not
112+
`A+D` (agrees + disagrees). Verified: `prop_test(11, 14)` matches blob
113+
`p-test` for tid=49 group 0 in vw (A=2, D=11, S=14, A+D=13).
114+
- `group-votes[gid].votes[tid]` = `{A: agrees, D: disagrees, S: total_seen}`
115+
- Blob `repness[gid][i]` fields: `n-success`, `n-trials`, `p-success`,
116+
`p-test`, `repness` (=ra or rd), `repness-test` (=rat or rdt),
117+
`repful-for`, `tid`, `best-agree` (optional).
118+
119+
## Files to modify
120+
121+
- `delphi/polismath/pca_kmeans_rep/repness.py` — the refactor
122+
- `delphi/tests/test_discrepancy_fixes.py` — vectorized blob tests
123+
- `delphi/tests/test_repness_unit.py` — update for new API
124+
- `delphi/tests/test_old_format_repness.py` — update for new API
125+
- `delphi/polismath/benchmarks/bench_repness.py` — update imports
126+
127+
## Reference
128+
129+
- Journal: `delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md` — Session 11 entry,
130+
PR 14 readability goal in Notes for Future Sessions
131+
- Plan: `delphi/docs/PLAN_DISCREPANCY_FIXES.md` — mandatory blob comparison
132+
section in Testing Principles
133+
- Clojure source: `math/src/polismath/math/stats.clj` (prop-test, two-prop-test),
134+
`math/src/polismath/math/repness.clj` (finalize-cmt-stats, repness-metric)

delphi/docs/PLAN_DISCREPANCY_FIXES.md

Lines changed: 99 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,7 @@ This plan's "PR N" labels map to actual GitHub PRs as follows:
1919
| PR 1 (D2) | #2421 | Stack 8/10 | Fix D2: in-conv participant threshold + D2c vote count source |
2020
| PR 2 (D4) | #2435 | Stack 9/10 | Fix D4: pseudocount formula |
2121
| (perf) | #2436 | Stack 10/10 | Speed up regression tests |
22-
| PR 3 (D9) | || *Next: Fix D9 z-score thresholds* |
22+
| PR 3 (D9) | #2446 || Fix D9: z-score thresholds (one-tailed) |
2323

2424
Future fix PRs will be appended to the stack as they're created.
2525

@@ -50,6 +50,10 @@ Because this work will span multiple Claude Code sessions, we maintain:
5050
- **Synthetic edge-case tests**: Every time we discover an edge case specific to one conversation, extract it into a synthetic unit test with made-up data (never real data from private datasets). These run fast and document the intent clearly.
5151
- **E2E awareness**: GitHub Actions has Cypress E2E tests (`cypress-tests.yml`) testing UI workflows, and `python-ci.yml` running pytest regression. The Cypress tests don't test math output values directly, but `python-ci.yml` will break if clustering/repness changes. Formula-level fixes (D4, D5, D6, D7, D8, D9) are pure computation — no E2E risk. Selection logic changes (D10, D11) and priority computation (D12) could affect what the TypeScript server returns. We decide case-by-case which PRs need E2E verification.
5252
- **Remove dead code after replacement**: When a function is replaced by a new implementation (e.g. vectorized version), the old function must be deleted and all callers updated — not left as dead code. Do this in the same PR or a follow-up, after benchmarks and tests confirm the replacement works.
53+
- **Mathematical rigor**: These are math fixes. Every formula change must be verified against the Clojure reference implementation by reading the actual Clojure source and the Python source side-by-side. Verify algebraic equivalence explicitly — don't assume. When in doubt, add a comment showing the derivation.
54+
- **Exhaustive RED phase**: In the RED phase of TDD, don't just write one test showing the discrepancy. Actively ask: "What other behaviors does this change affect? What are the boundary conditions? What happens with empty inputs, single-element inputs, all-agree cases, all-disagree cases?" Write tests for all of them. Before moving to GREEN, explicitly list what tests are still missing and add them. The goal is that the test suite for each fix is comprehensive enough that a wrong implementation cannot pass.
55+
- **Check your work**: After implementing a fix, re-read the Clojure source one more time and verify each line of the Python implementation corresponds correctly. Check array shapes, index semantics (0-based vs 1-based), and aggregation axes. Off-by-one errors and transposed matrices are the most common bugs.
56+
- **Large private datasets are slow**: Some private conversations have 100K–1M votes. Running the full test suite with `--include-local` on all of them can take a very long time. It's OK to run only the small/medium datasets (vw, biodiversity, and the smaller private ones) during the RED/GREEN cycle. Run the full set including large conversations only once, as a final validation before committing — and even then, if a specific large dataset is known to be slow, it's acceptable to skip it and note which ones were tested in the PR description.
5357

5458
### Datasets Available (sorted by size, smallest first)
5559

@@ -383,11 +387,51 @@ This is non-trivial and should be one of the last fixes.
383387

384388
---
385389

386-
### PR 14: Cleanup — Remove Dead Code
390+
### PR 14: Refactor Vectorized Code for Readability + Blob Injection Tests
391+
392+
**MOVED EARLIER**: PR 14 is now a prerequisite for all formula fix PRs (D5-D8+),
393+
not a post-parity cleanup. It branches off `jc/clj-parity-d9-fix` (Stack 13),
394+
below all formula fixes. Reason: the vectorized production path
395+
(`compute_group_comment_stats_df`) is too monolithic to test against the Clojure
396+
blob. The refactor makes it testable AND readable.
397+
398+
**The problem**: The scalar functions (`comment_stats`, `add_comparative_stats`,
399+
`repness_metric`, `finalize_cmt_stats`) read like a step-by-step recipe. The
400+
vectorized replacement (`compute_group_comment_stats_df`) buries the same logic
401+
in 150 lines of DataFrame plumbing. The scalar path is dead code in production —
402+
only called from tests and benchmarks.
403+
404+
**Task**:
405+
1. Split `compute_group_comment_stats_df` into (a) DataFrame construction
406+
(group mapping, cross-product index, joins) and (b) statistics computation
407+
as its own function with clean inputs/outputs — readable AND testable.
408+
2. Write vectorized blob injection tests: inject Clojure group memberships +
409+
votes, compare output to blob values. Tests the PRODUCTION code path.
410+
3. Verify scalar and vectorized paths produce identical output on all datasets.
411+
4. Delete scalar functions. Update tests.
412+
413+
**Files**: `polismath/pca_kmeans_rep/repness.py`, `tests/test_discrepancy_fixes.py`,
414+
`tests/test_repness_unit.py`, `tests/test_old_format_repness.py`,
415+
`polismath/benchmarks/bench_repness.py`
416+
417+
See `delphi/docs/HANDOFF_PR14_VECTORIZED_REFACTOR.md` for full details.
418+
419+
After PR 14, each fix PR gets vectorized blob injection tests added in RED→GREEN
420+
TDD pattern. This includes D5-D8 (repness formula fixes), D10/D11 (selection),
421+
D15 (moderation), D12 (priorities). For D3 (k-smoother) and D1 (PCA sign flip),
422+
which are incremental-only features, add synthetic tests + skip markers for
423+
incremental blob comparison pending replay infrastructure (see Replay PRs A/B/C).
424+
425+
**After adding vectorized tests to each PR, update the plan AND journal** to
426+
record what was tested, what blob fields were compared, and any discrepancies found.
427+
This is mandatory — the plan and journal are how future sessions know what's done.
428+
429+
---
430+
431+
### PR 14b: Cleanup — Remove Remaining Dead Code (after parity)
387432

388433
**Files**: Multiple (see `08-dead-code.md`)
389434
- Custom kmeans chain in `clusters.py`
390-
- Non-vectorized repness functions in `repness.py`
391435
- Buggy `_compute_votes_base()` (after D12 replaces it)
392436
- `stats.py` inconsistencies (after D9 makes `repness.py` authoritative)
393437

@@ -425,13 +469,14 @@ By this point, we should have good test coverage from all the per-discrepancy te
425469
| D6 | Two-proportion test | PR 5 || Fix |
426470
| D7 | Repness metric | PR 6 || Fix (with flag for old formula) |
427471
| D8 | Finalize cmt stats | PR 7 || Fix |
428-
| D9 | Z-score thresholds | PR 3 | | Fix (next) |
472+
| D9 | Z-score thresholds | **PR 3** | **#2446** | **DONE** |
429473
| D10 | Rep comment selection | PR 8 || Fix (with legacy env var) |
430474
| D11 | Consensus selection | PR 9 || Fix (with legacy env var) |
431475
| D12 | Comment priorities | PR 11 || Fix (implement from scratch) |
432476
| D13 | Subgroup clustering ||| **Deferred** (unused) |
433477
| D14 | Large conv optimization ||| **Deferred** (Python fast enough) |
434478
| D15 | Moderation handling | PR 12 || Fix |
479+
| Replay | Replay infrastructure (A/B/C) ||| NOT BUILT — D3/D1 used synthetic tests only. Needed for incremental blob comparison. |
435480

436481
### Non-discrepancy PRs in the stack
437482

@@ -441,6 +486,56 @@ By this point, we should have good test coverage from all the per-discrepancy te
441486

442487
---
443488

489+
## Tasks parallelization
490+
491+
D9 is done (PR #2446). The remaining fixes have the following dependency structure:
492+
493+
### Repness chain dependency graph (all in `repness.py`)
494+
495+
```
496+
D5 ─┬─→ D7 ─┐
497+
D6 ─┘ D8 ─┼─→ D10
498+
499+
D5 ──────────┴─→ D11
500+
```
501+
502+
- **D5, D6**: logically independent, but both modify `repness.py` (signature changes + caller updates in `compute_group_comment_stats_df`) — **must be sequential**
503+
- **D7**: after D5 + D6
504+
- **D8**: after D6
505+
- **D10**: after D7 + D8
506+
- **D11**: after D5 only (parallel with D7, D8, D10)
507+
508+
All are in `repness.py`, strictly sequential within this track.
509+
510+
### File-boundary analysis
511+
512+
Every fix touches `test_discrepancy_fixes.py` (different test classes per fix — low conflict risk, but same file). The production code boundaries are:
513+
514+
| File | Fixes that modify it |
515+
|------|---------------------|
516+
| `repness.py` | D5, D6, D7, D8, D10, D11 |
517+
| `conversation.py` | D3, D12, D15 |
518+
| `pca.py` | D12, D1/D1b |
519+
| `test_repness_unit.py` | D5, D6, D7, D8 |
520+
521+
**Within each file group, fixes must be sequential** to avoid merge conflicts.
522+
523+
### Practical parallel tracks (2 worktrees)
524+
525+
| Track | Worktree | Fixes (sequential within) | Files |
526+
|-------|----------|--------------------------|-------|
527+
| **A — Repness formulas** | main worktree | D5 → D6 → D7 → D8 → D10 → D11 | `repness.py`, `test_repness_unit.py` |
528+
| **B — Conversation/PCA** | separate worktree | D3 → D15 → D12 | `conversation.py`, `pca.py` |
529+
| **C — Late** | (after A+B) | D1/D1b | `pca.py` (needs replay infra) |
530+
531+
**Tracks A and B can run fully in parallel** using separate worktrees. Within each track, fixes are sequential (same files). Track B order is flexible — D3, D15, D12 touch different functions in `conversation.py`, so the order can be chosen for convenience. D12 is the largest (also touches `pca.py`), so putting it last gives D1/D1b a cleaner base.
532+
533+
The shared `test_discrepancy_fixes.py` file will need a mechanical merge when tracks converge, but since each fix modifies a different test class (already scaffolded with xfail markers), conflicts should be trivial to resolve.
534+
535+
**At convergence**: when both tracks are done, rebase Track B onto Track A (or vice versa). The only conflict will be in `test_discrepancy_fixes.py` — resolve by keeping both sets of test class changes.
536+
537+
---
538+
444539
## Test Infrastructure
445540

446541
### `tests/test_discrepancy_fixes.py` — New test file

delphi/docs/usage_examples.md

Lines changed: 2 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -136,17 +136,8 @@ for group_id, comments in repness["group_repness"].items():
136136

137137
### Statistical Functions
138138

139-
```python
140-
from polismath.pca_kmeans_rep.stats import prop_test, two_prop_test
141-
142-
# Test proportion difference
143-
z_score = prop_test(70, 100) # 70 successes out of 100 trials
144-
print(f"Z-score: {z_score}")
145-
146-
# Compare two proportions
147-
z_score = two_prop_test(70, 100, 50, 100) # 70/100 vs 50/100
148-
print(f"Comparison Z-score: {z_score}")
149-
```
139+
Statistical helpers (`z_sig_90`, `z_sig_95`, `prop_test`, `two_prop_test`) are
140+
defined inline in `repness.py` where they are used. See that module for details.
150141

151142
## Practical Examples
152143

delphi/polismath/pca_kmeans_rep/repness.py

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -44,7 +44,7 @@ def z_score_sig_90(z: float) -> bool:
4444
Returns:
4545
True if significant at 90% confidence
4646
"""
47-
return abs(z) >= Z_90
47+
return z > Z_90
4848

4949

5050
def z_score_sig_95(z: float) -> bool:
@@ -57,7 +57,7 @@ def z_score_sig_95(z: float) -> bool:
5757
Returns:
5858
True if significant at 95% confidence
5959
"""
60-
return abs(z) >= Z_95
60+
return z > Z_95
6161

6262

6363
def prop_test(p: float, n: int, p0: float) -> float:
@@ -688,10 +688,10 @@ def select_rep_comments_df(stats_df: pd.DataFrame,
688688
# Best agree: pa > pd and passes significance tests
689689
agree_candidates = stats_df[stats_df['pa'] > stats_df['pd']].copy()
690690
if not agree_candidates.empty:
691-
# Check significance: |pat| >= Z_90 and |rat| >= Z_90
691+
# Check significance: pat > Z_90 and rat > Z_90
692692
passing_agree = agree_candidates[
693-
(agree_candidates['pat'].abs() >= Z_90) &
694-
(agree_candidates['rat'].abs() >= Z_90) &
693+
(agree_candidates['pat'] > Z_90) &
694+
(agree_candidates['rat'] > Z_90) &
695695
(agree_candidates['pa'] >= 0.5)
696696
]
697697
if not passing_agree.empty:
@@ -701,8 +701,8 @@ def select_rep_comments_df(stats_df: pd.DataFrame,
701701
disagree_candidates = stats_df[stats_df['pd'] > stats_df['pa']].copy()
702702
if not disagree_candidates.empty:
703703
passing_disagree = disagree_candidates[
704-
(disagree_candidates['pdt'].abs() >= Z_90) &
705-
(disagree_candidates['rdt'].abs() >= Z_90) &
704+
(disagree_candidates['pdt'] > Z_90) &
705+
(disagree_candidates['rdt'] > Z_90) &
706706
(disagree_candidates['pd'] >= 0.5)
707707
]
708708
if not passing_disagree.empty:

0 commit comments

Comments
 (0)