Fix D9: z-score thresholds from two-tailed to one-tailed
> **Stacked on #2443** (Fix test DB connection: use DATABASE_URL with dotenv). Please review and merge #2443 first.
> **Next in stack:** #2448 (Fix D5: match Clojure prop_test formula (Wilson-score-like with +1 pseudocount))
- Fix D9: change z-score significance thresholds from two-tailed to one-tailed, matching Clojure's `stats.clj`
- `Z_90`: 1.645 → 1.2816, `Z_95`: 1.96 → 1.6449
- Also resolves an internal inconsistency — Python's own `stats.py` already used the correct one-tailed values
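The threshold values listed above can be reproduced from the standard normal quantile function. The following is a minimal sketch using only the Python standard library; the helper names `one_tailed_z` and `two_tailed_z` are illustrative and are not taken from `stats.py` or `stats.clj`.

```python
from statistics import NormalDist


def one_tailed_z(confidence: float) -> float:
    """Critical z for a one-tailed test: all of alpha sits in one tail."""
    return NormalDist().inv_cdf(confidence)


def two_tailed_z(confidence: float) -> float:
    """Critical z for a two-tailed test: alpha is split across both tails."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for level in (0.90, 0.95):
    # Columns: confidence level, two-tailed (old) value, one-tailed (new) value
    print(level, round(two_tailed_z(level), 4), round(one_tailed_z(level), 4))
```

Rounded to four decimals, the one-tailed values come out as 1.2816 and 1.6449, matching the new `Z_90` and `Z_95`.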
The proportion tests in Polis check whether a comment's agree (or disagree) rate is **significantly above 0.5**, which is a directional hypothesis. One-tailed is correct because we only care about one direction at a time. The two-tailed thresholds were 19% to 28% larger (1.96 vs 1.6449 for `Z_95`, 1.645 vs 1.2816 for `Z_90`), so fewer comments passed significance.
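To illustrate the practical effect, here is a hedged sketch using the textbook one-sample z-test for a proportion against 0.5. This is an assumption made for illustration only: it is not the formula used in `repness.py` or `stats.clj` (the D5 PR in this stack describes a different, pseudocount-based variant), and the function name `textbook_prop_z` is hypothetical.

```python
import math

Z_90_ONE_TAILED = 1.2816  # new threshold from this PR
Z_90_TWO_TAILED = 1.645   # previous threshold


def textbook_prop_z(successes: int, n: int) -> float:
    """z-statistic for H0: p = 0.5 vs H1: p > 0.5, without pseudocounts.

    (p_hat - 0.5) / sqrt(0.25 / n) simplifies to (2 * successes - n) / sqrt(n).
    """
    return (2 * successes - n) / math.sqrt(n)


z = textbook_prop_z(58, 100)  # 58 agrees out of 100 votes gives z = 1.6
print(z > Z_90_ONE_TAILED, z > Z_90_TWO_TAILED)  # True False
```

A comment with this vote split clears the one-tailed 90% threshold but was rejected under the old two-tailed value.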
- [x] TDD: removed xfail from 3 D9 tests, confirmed red (3 failures), applied fix, confirmed green
- [x] Discrepancy tests: 63 passed, 6 skipped, 50 xfailed (all 7 datasets including private)
- [x] Regression tests: 19 passed (all 7 datasets, golden snapshots re-recorded)
- [x] Repness unit tests: 36 passed (boundary values updated to match new thresholds)
- [x] 4 pre-existing failures unrelated to D9 (PCA incremental blobs, DB-dependent tests)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
commit-id:0194003d
Future fix PRs will be appended to the stack as they're created.
@@ -50,6 +50,10 @@ Because this work will span multiple Claude Code sessions, we maintain:
 - **Synthetic edge-case tests**: Every time we discover an edge case specific to one conversation, extract it into a synthetic unit test with made-up data (never real data from private datasets). These run fast and document the intent clearly.
 - **E2E awareness**: GitHub Actions has Cypress E2E tests (`cypress-tests.yml`) testing UI workflows, and `python-ci.yml` running pytest regression. The Cypress tests don't test math output values directly, but `python-ci.yml` will break if clustering/repness changes. Formula-level fixes (D4, D5, D6, D7, D8, D9) are pure computation — no E2E risk. Selection logic changes (D10, D11) and priority computation (D12) could affect what the TypeScript server returns. We decide case-by-case which PRs need E2E verification.
 - **Remove dead code after replacement**: When a function is replaced by a new implementation (e.g. vectorized version), the old function must be deleted and all callers updated — not left as dead code. Do this in the same PR or a follow-up, after benchmarks and tests confirm the replacement works.
+- **Mathematical rigor**: These are math fixes. Every formula change must be verified against the Clojure reference implementation by reading the actual Clojure source and the Python source side-by-side. Verify algebraic equivalence explicitly — don't assume. When in doubt, add a comment showing the derivation.
+- **Exhaustive RED phase**: In the RED phase of TDD, don't just write one test showing the discrepancy. Actively ask: "What other behaviors does this change affect? What are the boundary conditions? What happens with empty inputs, single-element inputs, all-agree cases, all-disagree cases?" Write tests for all of them. Before moving to GREEN, explicitly list what tests are still missing and add them. The goal is that the test suite for each fix is comprehensive enough that a wrong implementation cannot pass.
+- **Check your work**: After implementing a fix, re-read the Clojure source one more time and verify each line of the Python implementation corresponds correctly. Check array shapes, index semantics (0-based vs 1-based), and aggregation axes. Off-by-one errors and transposed matrices are the most common bugs.
+- **Large private datasets are slow**: Some private conversations have 100K–1M votes. Running the full test suite with `--include-local` on all of them can take a very long time. It's OK to run only the small/medium datasets (vw, biodiversity, and the smaller private ones) during the RED/GREEN cycle. Run the full set including large conversations only once, as a final validation before committing — and even then, if a specific large dataset is known to be slow, it's acceptable to skip it and note which ones were tested in the PR description.
 
 ### Datasets Available (sorted by size, smallest first)
 
@@ -383,11 +387,51 @@ This is non-trivial and should be one of the last fixes.
 | D14 | Large conv optimization | — | — | **Deferred** (Python fast enough) |
 | D15 | Moderation handling | PR 12 | — | Fix |
+| Replay | Replay infrastructure (A/B/C) | — | — | NOT BUILT — D3/D1 used synthetic tests only. Needed for incremental blob comparison. |
 
 ### Non-discrepancy PRs in the stack
 
@@ -441,6 +486,56 @@ By this point, we should have good test coverage from all the per-discrepancy te
 
 ---
 
+## Tasks parallelization
+
+D9 is done (PR #2446). The remaining fixes have the following dependency structure:
+
+### Repness chain dependency graph (all in `repness.py`)
+
+```
+D5 ─┬─→ D7 ─┐
+D6 ─┘   D8 ─┼─→ D10
+            │
+D5 ──────────┴─→ D11
+```
+
+- **D5, D6**: logically independent, but both modify `repness.py` (signature changes + caller updates in `compute_group_comment_stats_df`) — **must be sequential**
+- **D7**: after D5 + D6
+- **D8**: after D6
+- **D10**: after D7 + D8
+- **D11**: after D5 only (parallel with D7, D8, D10)
+
+All are in `repness.py`, strictly sequential within this track.
+
+### File-boundary analysis
+
+Every fix touches `test_discrepancy_fixes.py` (different test classes per fix — low conflict risk, but same file). The production code boundaries are:
+
+| File | Fixes that modify it |
+|------|---------------------|
+| `repness.py` | D5, D6, D7, D8, D10, D11 |
+| `conversation.py` | D3, D12, D15 |
+| `pca.py` | D12, D1/D1b |
+| `test_repness_unit.py` | D5, D6, D7, D8 |
+
+**Within each file group, fixes must be sequential** to avoid merge conflicts.
+**Tracks A and B can run fully in parallel** using separate worktrees. Within each track, fixes are sequential (same files). Track B order is flexible — D3, D15, D12 touch different functions in `conversation.py`, so the order can be chosen for convenience. D12 is the largest (also touches `pca.py`), so putting it last gives D1/D1b a cleaner base.
+
+The shared `test_discrepancy_fixes.py` file will need a mechanical merge when tracks converge, but since each fix modifies a different test class (already scaffolded with xfail markers), conflicts should be trivial to resolve.
+
+**At convergence**: when both tracks are done, rebase Track B onto Track A (or vice versa). The only conflict will be in `test_discrepancy_fixes.py` — resolve by keeping both sets of test class changes.
+
+---
+
 ## Test Infrastructure
 
 ### `tests/test_discrepancy_fixes.py` — New test file