Add mathematical rigor and exhaustive testing guidance to fix plan

jucor · claude · jucor · commit 19a1efaa78ea · 2026-03-23T15:41:12.000Z
- Require side-by-side Clojure/Python verification for every formula change
- Exhaustive RED phase: boundary conditions, edge cases, missing test audit
- Double-check array shapes, indices, aggregation axes after implementation
- Allow skipping very large private datasets during iteration; run them only as final validation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/delphi/docs/PLAN_DISCREPANCY_FIXES.md b/delphi/docs/PLAN_DISCREPANCY_FIXES.md
@@ -50,6 +50,10 @@ Because this work will span multiple Claude Code sessions, we maintain:
 - **Synthetic edge-case tests**: Every time we discover an edge case specific to one conversation, extract it into a synthetic unit test with made-up data (never real data from private datasets). These run fast and document the intent clearly.
 - **E2E awareness**: GitHub Actions has Cypress E2E tests (`cypress-tests.yml`) testing UI workflows, and `python-ci.yml` running pytest regression. The Cypress tests don't test math output values directly, but `python-ci.yml` will break if clustering/repness changes. Formula-level fixes (D4, D5, D6, D7, D8, D9) are pure computation — no E2E risk. Selection logic changes (D10, D11) and priority computation (D12) could affect what the TypeScript server returns. We decide case-by-case which PRs need E2E verification.
 - **Remove dead code after replacement**: When a function is replaced by a new implementation (e.g. vectorized version), the old function must be deleted and all callers updated — not left as dead code. Do this in the same PR or a follow-up, after benchmarks and tests confirm the replacement works.
+- **Mathematical rigor**: These are math fixes. Every formula change must be verified against the Clojure reference implementation by reading the actual Clojure source and the Python source side-by-side. Verify algebraic equivalence explicitly — don't assume. When in doubt, add a comment showing the derivation.
+- **Exhaustive RED phase**: In the RED phase of TDD, don't just write one test showing the discrepancy. Actively ask: "What other behaviors does this change affect? What are the boundary conditions? What happens with empty inputs, single-element inputs, all-agree cases, all-disagree cases?" Write tests for all of them. Before moving to GREEN, explicitly list what tests are still missing and add them. The goal is that the test suite for each fix is comprehensive enough that a wrong implementation cannot pass.
+- **Check your work**: After implementing a fix, re-read the Clojure source one more time and verify that each line of the Python implementation corresponds to it. Check array shapes, index semantics (0-based vs 1-based), and aggregation axes. Off-by-one errors and transposed matrices are the most common bugs.
+- **Large private datasets are slow**: Some private conversations have 100K–1M votes. Running the full test suite with `--include-local` on all of them can take a very long time. It's OK to run only the small/medium datasets (vw, biodiversity, and the smaller private ones) during the RED/GREEN cycle. Run the full set including large conversations only once, as a final validation before committing — and even then, if a specific large dataset is known to be slow, it's acceptable to skip it and note which ones were tested in the PR description.
 
 ### Datasets Available (sorted by size, smallest first)