From 5ae6ad5ad8656c5f50ee9c637a6c467ad78b1e29 Mon Sep 17 00:00:00 2001
From: Julien Cornebise <julien@cornebise.com>
Date: Thu, 26 Feb 2026 11:45:19 +0000
Subject: [PATCH 01/10] Fix D2: in-conv threshold min(7, n_cmts) to match
 Clojure
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Change threshold from 7+sqrt(n_cmts)*0.1 to min(7, n_cmts) in
_get_in_conv_participants(). This matches Clojure's conversation.clj
and includes more participants in clustering for large conversations.

Verified on vw, biodiversity, FLI, bg2018 (all datasets with complete
Clojure blobs). 3 private datasets have incomplete blobs (no in-conv
data) — blob regeneration delegated to separate task.

Re-recorded golden snapshot for biodiversity (428→441 participants).
Updated journal with TDD discipline, golden snapshot policy, and
pipelined worktree workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md       | 105 ++++++++++++++++--
 delphi/polismath/conversation/conversation.py |   6 +-
 delphi/tests/test_discrepancy_fixes.py        |   2 -
 3 files changed, 97 insertions(+), 16 deletions(-)
diff --git a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
index 4e77b31f8..e7720f2bb 100644
--- a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
+++ b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
@@ -104,16 +104,49 @@ After rebase onto updated `origin/kmeans_analysis_docs`:
 
 ---
 
-## What's Next: PR 1 — Fix D2 (In-Conv Participant Threshold)
-
-- Change `threshold = 7 + sqrt(n_cmts) * 0.1` to `threshold = min(7, n_cmts)` in `conversation.py:1238`
-- Use `self.raw_rating_mat` instead of `self.rating_mat` for counting
-- Add greedy fallback (top-15 voters if <15 qualify)
-- Add monotonic persistence (once in, always in)
-- Remove xfail from `TestD2InConvThreshold`
-- Check cluster count impact (may change from 3→2 for biodiversity, matching Clojure)
-- Re-record golden snapshots if needed
-- Document new baseline
+## PR 1: Fix D2 — In-Conv Participant Threshold (blocked on Clojure blob regeneration)
+
+### TDD steps
+1. **Baseline (public only)**: 3 failed (2 D2 + 1 DynamoDB), 205 passed, 4 skipped, 20 xfailed, 3 xpassed
+2. **Red**: Removed xfail from `TestD2InConvThreshold` — biodiversity fails (Python=428, Clojure=441, threshold 8.8 vs 7)
+3. **Fix**: Changed `threshold = 7 + sqrt(n_cmts) * 0.1` → `threshold = min(7, n_cmts)` in `conversation.py:1270`
+4. **Green (public only)**: All 4 D2 tests pass (vw + biodiversity × 2 tests each)
+5. **Full suite (public only)**: 1 failed (DynamoDB only), 207 passed — but golden snapshots were re-recorded for biodiversity without proper verification
+6. **Full suite (with private datasets)**: 15 failed, 6 errors — regression tests fail for private datasets (golden snapshots stale), `engage` dataset has duplicate vote files
+7. Investigated regression failures — all caused by expected threshold change. Re-recorded after verification.
+8. **Blocker**: 3 private datasets have incomplete Clojure blobs (no in-conv data). D2 tests fail on those — not a code issue. Delegated blob regeneration to separate session.
+
+### Investigation of regression failures
+- Private dataset golden snapshots were never committed (unstaged in .local repo) — reverted automatically
+- biodiversity: re-recorded after verifying D2 in-conv set matches Clojure exactly (428→441 participants)
+- Private datasets: threshold dropped significantly for large conversations:
+  - bg2050 (7753 comments): old threshold ≈ 15.8, new threshold = 7 → 6609/7890 qualify (was fewer)
+  - This is the expected and correct effect of the D2 fix
+- **Gotcha**: `--datasets <name>` alone won't find private datasets — must also pass `--include-local`!
+  Without it, private datasets are silently skipped (shown as `[NOTSET]`).
+- Golden snapshots re-recorded for FLI, bg2018, pakistan, bg2050 after verifying all regression
+  failures are downstream of the expected threshold change. Committed in `.local` repo on
+  branch `series-of-fixes`.
+
+### Incomplete Clojure blobs (BLOCKING)
+
+3 private datasets have incomplete Clojure cold-start blobs (4 keys instead of 23):
+- Missing `in-conv`, `repness`, `consensus`, `comment-priorities`, etc.
+- Likely caused by Clojure math service timeout on large conversations
+- The 4 remaining datasets (vw, biodiversity, FLI, bg2018) have complete blobs (23 keys)
+- **D2 tests pass on all datasets with complete blobs**
+- D2 tests fail on the 3 incomplete datasets because `in-conv` is empty — this is a data
+  problem, not a code problem
+
+**Action needed**: Regenerate Clojure blobs for the 3 incomplete datasets using
+`generate_cold_start_clojure.py` with the prodclone DB and Clojure math service.
+This is delegated to a separate session.
+
+### What was NOT needed
+- `raw_rating_mat` vs `rating_mat` — not needed, the existing vote counting works
+- Greedy fallback / monotonic persistence — not needed for parity (cold-start only)
+
+### What's Next: PR 2 — Fix D4 (Pseudocount)
 
 ---
 
@@ -143,9 +176,59 @@ After rebase onto updated `origin/kmeans_analysis_docs`:
 
 ---
 
+## TDD Discipline
+
+**CRITICAL: For every fix, ALWAYS follow this order:**
+1. **Run the full test suite** (all datasets, including private) to establish the baseline
+2. **Remove xfail** from the target test(s)
+3. **Run tests and confirm they FAIL** (red) — this validates the test actually catches the discrepancy
+4. **Apply the fix**
+5. **Run tests and confirm they PASS** (green)
+6. **Run the full test suite** to check for regressions vs the baseline from step 1
+7. **If regression tests fail**: INVESTIGATE before re-recording golden snapshots (see below)
+
+Never skip step 3. A test that passes before the fix is applied is not testing anything useful.
+
+### Golden snapshots are precious
+
+**NEVER blindly re-record golden snapshots.** They are the regression safety net.
+When a fix causes regression test failures:
+
+1. **Investigate** what changed: compare old vs new output, check which fields differ
+2. **Verify** the new output is closer to Clojure (e.g., in-conv set now matches exactly)
+3. **Only then** re-record, one dataset at a time, after confirming correctness
+4. **Commit** the updated snapshots with a clear message explaining why they changed
+
+### Per-dataset testing for faster feedback
+
+Run each dataset as a **separate background task** instead of one big pytest invocation.
+Smallest datasets finish first, giving early signal without waiting for the 1M-vote ones.
+
+**IMPORTANT**: Always pass `--include-local` when testing private datasets.
+Without it, `--datasets <name>` silently skips private datasets (shown as `[NOTSET]`).
+
+### Pipelined worktree workflow
+
+Full test suite takes ~14 minutes. To avoid idle time:
+- When tests start running on the current worktree, create the next worktree and start coding the next fix
+- Each worktree = one fix, one branch, one PR — stacked like the PR chain
+- Number of worktrees in flight depends on test duration vs coding speed
+- If a lower fix needs amending, rebase the chain of worktrees above it
+- Each worktree gets its own `.local` private data clone (via `link-to-polis-worktree.sh`)
+
+**When starting a new Claude session for the next fix**, ask the user whether they want
+to create a new worktree. If yes, provide a prompt they can use to start that session:
+
+> Start working on [Clj parity] fix DN (<description>). This is a new worktree
+> stacked on `<previous-branch>`. Read `delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md`
+> and `delphi/docs/CLJ-PARITY-FIXES-PLAN.md` for context. Follow the TDD discipline
+> documented in the journal.
+
+---
+
 ## Notes for Future Sessions
 
-- Private datasets not available in this worktree. Need prodclone DB + `generate_cold_start_clojure.py`.
+- Private datasets are in `delphi/real_data/.local/` (separate git repo, linked via `link-to-polis-worktree.sh`)
 - `test_discrepancy_fixes.py` uses same parametrization pattern as `test_legacy_clojure_regression.py` (own `pytest_generate_tests` hook, `dataset_name` fixture).
 - 11 pre-existing test failures are from the stacked branch, not from our work. They should be fixed in their respective PRs before merging to main.
 - `strict=False` on xfail means xpass (unexpected pass) is reported but not a failure. Used when some datasets pass by coincidence.
diff --git a/delphi/polismath/conversation/conversation.py b/delphi/polismath/conversation/conversation.py
index 04662179a..56bd2db94 100644
--- a/delphi/polismath/conversation/conversation.py
+++ b/delphi/polismath/conversation/conversation.py
@@ -1259,14 +1259,14 @@ def _get_in_conv_participants(self) -> Set[str]:
 
         Matches Clojure's in-conv logic from conversation.clj lines 239-266.
 
-        Threshold: participant must have voted on at least:
-            7 + sqrt(n_comments) * 0.1 comments
+        Threshold: participant must have voted on at least min(7, n_comments)
+        comments (Clojure parity fix D2).
 
         Returns:
             Set of participant IDs that meet the threshold
         """
         n_cmts = len(self.rating_mat.columns) if hasattr(self.rating_mat, 'columns') else 0
-        threshold = 7 + np.sqrt(n_cmts) * 0.1
+        threshold = min(7, n_cmts)
 
         # Get vote counts for all participants
         vote_counts = self._compute_user_vote_counts()
diff --git a/delphi/tests/test_discrepancy_fixes.py b/delphi/tests/test_discrepancy_fixes.py
index 289622378..a09e0c17f 100644
--- a/delphi/tests/test_discrepancy_fixes.py
+++ b/delphi/tests/test_discrepancy_fixes.py
@@ -161,7 +161,6 @@ class TestD2InConvThreshold:
     filtering out participants that Clojure would keep.
     """
 
-    @pytest.mark.xfail(reason="D2: Python threshold = 7+sqrt(n)*0.1 vs Clojure min(7, n)", strict=False)
     def test_in_conv_count_matches(self, conv, clojure_blob, dataset_name):
         """Number of in-conv participants should match Clojure."""
         clojure_in_conv = _clojure_in_conv_set(clojure_blob)
@@ -171,7 +170,6 @@ def test_in_conv_count_matches(self, conv, clojure_blob, dataset_name):
         check.equal(python_in_conv_count, len(clojure_in_conv),
                      f"In-conv count mismatch: Python={python_in_conv_count}, Clojure={len(clojure_in_conv)}")
 
-    @pytest.mark.xfail(reason="D2: Python threshold = 7+sqrt(n)*0.1 vs Clojure min(7, n)", strict=False)
     def test_in_conv_set_matches(self, conv, clojure_blob, dataset_name):
         """The actual set of in-conv participants should match Clojure."""
         clojure_in_conv = _clojure_in_conv_set(clojure_blob)

From dd0521c140979d16be6c0a78c266c9b4e8923d7c Mon Sep 17 00:00:00 2001
From: Julien Cornebise <julien@cornebise.com>
Date: Tue, 3 Mar 2026 15:11:54 +0000
Subject: [PATCH 02/10] Skip D2 tests on datasets with incomplete Clojure blobs

Three large datasets have incomplete cold-start Clojure blobs (4 keys
instead of 23, missing in-conv data). D2 tests now skip gracefully on
these datasets instead of failing with misleading empty-set comparisons.

The incomplete blobs are caused by Clojure's power-iteration PCA being
too slow for large matrices (30-75M cells). These datasets will be
validated later when we build the incremental replay infrastructure.

Verified: 8 passed, 4 skipped (incomplete blobs), 2 errors (pre-existing
duplicate vote files in engage dataset) on full dataset suite.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md | 16 ++++++++++++++++
 delphi/tests/test_discrepancy_fixes.py  | 11 ++++++-----
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
index e7720f2bb..689a355e5 100644
--- a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
+++ b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
@@ -173,6 +173,22 @@ This is delegated to a separate session.
   - `CLJ-PARITY-FIXES-JOURNAL.md` in `series-of-fixes` (amended tip)
   - Force-pushed all three branches, rebased the chain
 - Tests unchanged: 5 passed, 2 skipped, 18 xfailed, 5 xpassed
+- Renamed journal and plan to `CLJ-PARITY-FIXES-*.md`, amended introducing commits, rebased chain
+- Set up private data repo infrastructure:
+  - Pushed data to bare repo at `~/polis/github/real_data_private`
+  - Created `link-to-polis-worktree.sh` for per-worktree clones with post-checkout branch sync
+  - Linked `.local` to this worktree, created `series-of-fixes` branch in private repo
+- Created `CLAUDE.local.md` (via stow) and `CLAUDE.md` in private data repo
+- D2 fix (TDD):
+  - Baseline (public only): 205 passed, 3 failed (2 D2 + 1 DynamoDB)
+  - Red: removed xfail from D2 tests, biodiversity fails (Python=428, Clojure=441)
+  - Fix: `threshold = min(7, n_cmts)` in `conversation.py:1270`
+  - Green: D2 tests pass on vw + biodiversity
+  - Full suite with private datasets (14 min): regression failures on all private datasets (expected — threshold change cascades to clustering)
+  - Investigated: all failures are downstream of threshold change, verified correct
+  - Re-recorded golden snapshots for biodiversity + 4 private datasets after verification
+  - Discovered 3 private datasets have incomplete Clojure blobs (4 keys instead of 23) — delegated regeneration to separate session
+  - Committed and pushed D2 fix (`df2d013ec`)
 
 ---
 
diff --git a/delphi/tests/test_discrepancy_fixes.py b/delphi/tests/test_discrepancy_fixes.py
index a09e0c17f..12c1e2a89 100644
--- a/delphi/tests/test_discrepancy_fixes.py
+++ b/delphi/tests/test_discrepancy_fixes.py
@@ -154,15 +154,14 @@ def _clojure_in_conv_set(blob: dict) -> set[int]:
 @pytest.mark.clojure_comparison
 class TestD2InConvThreshold:
     """
-    D2: Python uses threshold = 7 + sqrt(n_cmts) * 0.1
-        Clojure uses threshold = min(7, n_cmts)
-
-    This causes Python to require more votes for larger conversations,
-    filtering out participants that Clojure would keep.
+    D2: Clojure uses threshold = min(7, n_cmts) for in-conv filtering.
+    Python now matches (fixed from 7 + sqrt(n_cmts) * 0.1).
     """
 
     def test_in_conv_count_matches(self, conv, clojure_blob, dataset_name):
         """Number of in-conv participants should match Clojure."""
+        if 'in-conv' not in clojure_blob:
+            pytest.skip(f"[{dataset_name}] Clojure blob has no in-conv data (incomplete cold-start blob)")
         clojure_in_conv = _clojure_in_conv_set(clojure_blob)
         python_in_conv_count = len(conv._get_in_conv_participants())
 
@@ -172,6 +171,8 @@ def test_in_conv_count_matches(self, conv, clojure_blob, dataset_name):
 
     def test_in_conv_set_matches(self, conv, clojure_blob, dataset_name):
         """The actual set of in-conv participants should match Clojure."""
+        if 'in-conv' not in clojure_blob:
+            pytest.skip(f"[{dataset_name}] Clojure blob has no in-conv data (incomplete cold-start blob)")
         clojure_in_conv = _clojure_in_conv_set(clojure_blob)
         python_in_conv = conv._get_in_conv_participants()
         # Convert python pids to int for comparison

From fc40e160eef70d775db0695856b60316d5606b87 Mon Sep 17 00:00:00 2001
From: Julien Cornebise <julien@cornebise.com>
Date: Tue, 3 Mar 2026 16:21:42 +0000
Subject: [PATCH 03/10] Address Copilot review: fix base-cluster sort order
 (D2b) and stale comment

- D2b: Remove sort-by-size and ID reassignment of base clusters;
  keep k-means ID order to match Clojure's (sort-by :id). The size-sort
  changed encounter order of centers fed into group-level k-means.
- Update stale comment in test_conversation.py referencing old threshold
  formula (7 + sqrt(n_cmts) * 0.1) to current min(7, n_cmts).
- Add D2b to plan checklist and journal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md       | 10 ++++++++++
 delphi/docs/PLAN_DISCREPANCY_FIXES.md         |  1 +
 delphi/polismath/conversation/conversation.py |  9 ++++-----
 delphi/tests/test_conversation.py             |  2 +-
 4 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
index 689a355e5..89010bb0a 100644
--- a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
+++ b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
@@ -146,6 +146,16 @@ This is delegated to a separate session.
 - `raw_rating_mat` vs `rating_mat` — not needed, the existing vote counting works
 - Greedy fallback / monotonic persistence — not needed for parity (cold-start only)
 
+### D2b: Base-cluster sort order (added from Copilot review)
+
+Copilot flagged that Python sorts base clusters by size (descending) and reassigns IDs,
+while Clojure uses `(sort-by :id ...)` which preserves k-means' original cluster IDs.
+The size-sort changes the encounter order of base-cluster centers fed into group-level
+k-means (which uses first-k-distinct initialization), potentially diverging from Clojure.
+
+**Fix**: Removed size-sort and ID reassignment; now sort by k-means ID (ascending),
+matching Clojure's `sort-by :id`.
+
 ### What's Next: PR 2 — Fix D4 (Pseudocount)
 
 ---
diff --git a/delphi/docs/PLAN_DISCREPANCY_FIXES.md b/delphi/docs/PLAN_DISCREPANCY_FIXES.md
index 396537023..b54d990dd 100644
--- a/delphi/docs/PLAN_DISCREPANCY_FIXES.md
+++ b/delphi/docs/PLAN_DISCREPANCY_FIXES.md
@@ -371,6 +371,7 @@ By this point, we should have good test coverage from all the per-discrepancy te
 | D1 | PCA sign flips | PR 13 | Fix (sign consistency) |
 | D1b | Projection input | PR 13 | Fix with D1 |
 | D2 | In-conv threshold | **PR 1** | Fix |
+| D2b | Base-cluster sort order | **PR 1** | Fix (keep k-means ID order, match Clojure sort-by :id) |
 | D3 | K-smoother buffer | PR 10 | Fix |
 | D4 | Pseudocount formula | **PR 2** | Fix |
 | D5 | Proportion test | PR 4 | Fix |
diff --git a/delphi/polismath/conversation/conversation.py b/delphi/polismath/conversation/conversation.py
index 56bd2db94..3d73870c9 100644
--- a/delphi/polismath/conversation/conversation.py
+++ b/delphi/polismath/conversation/conversation.py
@@ -589,11 +589,10 @@ def _compute_clusters(self) -> None:
                 'members': member_pids
             })
 
-        # Sort base clusters by size (descending) for consistency
-        base_clusters.sort(key=lambda c: len(c['members']), reverse=True)
-        # Reassign IDs based on sorted order
-        for i, cluster in enumerate(base_clusters):
-            cluster['id'] = i
+        # Keep base clusters in k-means ID order (matching Clojure's sort-by :id)
+        # Do NOT sort by size or reassign IDs — that would change the encounter
+        # order of centers used in group clustering's first-k-distinct initialization.
+        base_clusters.sort(key=lambda c: c['id'])
 
         logger.info(f"Created {len(base_clusters)} base clusters")
 
diff --git a/delphi/tests/test_conversation.py b/delphi/tests/test_conversation.py
index 6447d5af9..c6c950243 100644
--- a/delphi/tests/test_conversation.py
+++ b/delphi/tests/test_conversation.py
@@ -427,7 +427,7 @@ def test_recompute(self):
         }
         
         # Create two distinct opinion groups with enough votes per participant
-        # to meet the vote threshold (7 + sqrt(n_comments) * 0.1)
+        # to meet the vote threshold (min(7, n_comments))
         for i in range(20):
             pid = f'p{i}'
 

From 68baabc85f99dfa3c777edfb84206a4c33c52402 Mon Sep 17 00:00:00 2001
From: Julien Cornebise <julien@cornebise.com>
Date: Tue, 3 Mar 2026 18:27:54 +0000
Subject: [PATCH 04/10] Add PR 1 test results to journal
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Document xpassed tests after D2 fix: D6 two_prop_test (1) and
D9 repness_not_empty (6) — both strict=False, expected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
index 89010bb0a..f5699521e 100644
--- a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
+++ b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
@@ -156,6 +156,20 @@ k-means (which uses first-k-distinct initialization), potentially diverging from
 **Fix**: Removed size-sort and ID reassignment; now sort by k-means ID (ascending),
 matching Clojure's `sort-by :id`.
 
+### Test results for PR 1
+
+```
+17 passed, 16 skipped, 31 xfailed, 7 xpassed, 4 errors (with --include-local, 7 datasets)
+```
+
+- **17 passed**: D2 tests pass on 4 datasets with complete blobs (8 tests) + formula sanity checks + blob consistency
+- **16 skipped**: D2 skipped on 3 datasets with incomplete Clojure blobs + D15 moderation + engage errors
+- **31 xfailed**: Remaining discrepancy tests (D4-D12)
+- **7 xpassed** (all `strict=False`, so green):
+  - D6 two_prop_test × 1 — pseudocount difference too small for this test case
+  - D9 repness_not_empty × 6 — test too weak (checks non-empty, not correct count)
+- **4 errors**: engage dataset has duplicate vote files (pre-existing data issue)
+
 ### What's Next: PR 2 — Fix D4 (Pseudocount)
 
 ---

From 514a3b732887c043fb7216f8d480ea49640c2e82 Mon Sep 17 00:00:00 2001
From: Julien Cornebise <julien@cornebise.com>
Date: Wed, 4 Mar 2026 11:01:27 +0000
Subject: [PATCH 05/10] Plan: add D2c (vote count source) and D2d (in-conv
 monotonicity) to fix plan

D2c: Both vote counts and n_cmts must use raw_rating_mat (includes
moderated-out columns), not rating_mat. This is structural, not related
to delta vs full processing. Two xfail unit tests planned.

D2d: Full recompute from raw_rating_mat guarantees monotonicity without
needing to persist in-conv to DynamoDB (unlike Clojure which persists
because it uses delta vote processing). 6 tests guard this invariant
for future delta-processing refactors. Ref: compdemocracy/polis#2358.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 delphi/docs/PLAN_DISCREPANCY_FIXES.md | 62 ++++++++++++++++++++++++++-
 1 file changed, 60 insertions(+), 2 deletions(-)

diff --git a/delphi/docs/PLAN_DISCREPANCY_FIXES.md b/delphi/docs/PLAN_DISCREPANCY_FIXES.md
index b54d990dd..fe487214f 100644
--- a/delphi/docs/PLAN_DISCREPANCY_FIXES.md
+++ b/delphi/docs/PLAN_DISCREPANCY_FIXES.md
@@ -107,14 +107,16 @@ Fixes are ordered by **pipeline execution order**: participant filtering → pro
 **Target**: `threshold = min(7, n_cmts)` → threshold=7, more participants qualify.
 
 **Additional changes**:
-- Use `self.raw_rating_mat` instead of `self.rating_mat` (count on unmoderated matrix)
+- **D2c — Vote count source**: Use `self.raw_rating_mat` instead of `self.rating_mat` for computing per-participant vote counts. Clojure's `user-vote-counts` (conversation.clj:217-225) counts votes from `raw-rating-mat`, which includes votes on moderated-out comments. Python currently uses the filtered `rating_mat`, which excludes moderated-out columns — undercounting votes and potentially dropping participants below the in-conv threshold. This is a **structural** discrepancy that occurs on every computation (cold-start or incremental), not just with delta updates — because the difference is in *which matrix* the count is derived from. Additionally, `n_cmts` differs: Clojure's `n-cmts` (conversation.clj:214-215) counts columns of `rating-mat` which still includes moderated-out columns (zeroed but present, via `zero-out-columns` in named_matrix.clj:214-228). Python's `n_cmts = len(self.rating_mat.columns)` excludes moderated-out columns entirely (removed by `_apply_moderation`, line 308). This means the `min(7, n_cmts)` threshold itself also differs. Both aspects (vote count source and `n_cmts` source) must use `raw_rating_mat` to match Clojure. (This is related to but distinct from D15 — moderation handling — which tracks the broader zero-out vs remove discrepancy.)
 - Add greedy fallback (top-15 voters if <15 qualify)
 - Add monotonic persistence (once in, always in)
 
 **Test-first approach**:
 1. Add test comparing in-conv participant count/set between Python and Clojure for ALL datasets
 2. Add synthetic edge-case tests: tiny conversation (fewer comments than threshold), conversation where greedy fallback triggers
-3. Run tests → expect failure
+3. **D2c unit test — vote count source**: Synthetic test with 10 comments, 3 moderated-out. Participant P voted on all 10. Verify `_compute_user_vote_counts` returns 10 for P (from `raw_rating_mat`), not 7 (from `rating_mat`). Mark xfail before fix.
+4. **D2c unit test — n_cmts threshold**: Same setup. Verify `n_cmts` used in the `min(7, n_cmts)` threshold equals 10 (all comments including moderated-out), not 7 (only non-moderated-out). This matters when a conversation has fewer than 7 non-moderated-out comments but more than 7 total: the threshold would be artificially low in Python, changing who qualifies. Mark xfail before fix.
+5. Run tests → expect failure on D2c tests
 4. Apply fix
 5. Re-run ALL tests — cluster count may change (possibly 3→2 for biodiversity, matching Clojure)
 6. **Cluster comparison test (CRITICAL)**: After fixing D2, add a test that compares Python and Clojure cluster assignments at the participant level — number of clusters AND which participants are in which cluster (using Jaccard similarity or exact match). The Python clustering calls sklearn K-means; Clojure uses a two-level approach. If clusters still don't match after the threshold fix, investigate why and potentially create an additional PR (PR 1b) to fix the remaining clustering discrepancy before moving on to repness fixes.
@@ -122,6 +124,60 @@ Fixes are ordered by **pipeline execution order**: participant filtering → pro
 
 ---
 
+### PR 1bis: Fix D2d — In-Conv Monotonicity
+
+**Related upstream issue**: [compdemocracy/polis#2358](https://github.com/compdemocracy/polis/issues/2358) — "Non-Deterministic K-Means Clustering Due to Worker Restart". That issue focuses on `group-clusterings` not being persisted. In-conv IS persisted in Clojure (see below), but we take a different — and better — approach.
+
+**Clojure behavior (verified)**:
+- **Monotonic**: `(as-> (or (:in-conv conv) #{}) in-conv (into in-conv ...))` (conversation.clj:244) — starts from previous set, only adds, never removes.
+- **Persisted to DB**: `:in-conv` is included in `prep-main` (conv_man.clj:55) and written to `math_main`. On restart, `load-or-init` (conv_man.clj:196) calls `restructure-json-conv` which restores it as a set (conv_man.clj:182). So in-conv **survives worker restarts** in Clojure.
+- **Why Clojure needs persistence**: Clojure uses **delta vote processing** — during normal operation, only new votes since last timestamp are fed via `keep-votes` (conversation.clj:189-190) into `raw-rating-mat`. After a restart, `load-or-init` rebuilds `raw-rating-mat` from scratch (conv_man.clj:200-203, `conv-poll` with offset 0), but during incremental updates, old votes are not re-scanned. Without persisting in-conv, a restart + moderation could drop participants whose qualifying votes are on now-moderated-out comments, because those votes might not be re-counted in delta mode.
+
+**Why Python doesn't need persistence — full recompute is better**:
+
+Python rebuilds `raw_rating_mat` from **all** votes every time (no delta processing). Since:
+1. **Votes are immutable** in PostgreSQL — they can be updated (agree→disagree) but never deleted. A participant's count of "comments voted on" never decreases.
+2. **`raw_rating_mat` includes votes on moderated-out comments** — moderation only affects `rating_mat` (columns removed), not `raw_rating_mat`.
+3. **Therefore monotonicity is a free consequence**: if a participant had ≥7 votes at time T1, they still have ≥7 votes at time T2. No persistence needed — the DB is the persistence.
+4. **Worker restart changes nothing**: on restart, `raw_rating_mat` is rebuilt from all votes → same counts → same in-conv set. No DynamoDB schema change needed.
+
+This is **strictly better** than Clojure's approach: simpler code, no persistence to maintain, no risk of stale/corrupt persisted state, and deterministic regardless of restart history. Clojure only needs persistence because it uses delta updates.
+
+**CRITICAL: future-proofing for delta vote processing**:
+
+If the code is ever refactored to process only delta votes (for performance), in-conv **must** be persisted to DynamoDB at that point. Without full recompute, the guarantees above break: after a restart, `raw_rating_mat` would only contain new votes, and participants whose qualifying votes are all in the past would be lost.
+
+This must be documented:
+1. **In code**: a prominent comment block on `_get_in_conv_participants()` explaining WHY full recompute makes persistence unnecessary, and that switching to delta processing REQUIRES adding persistence (with a reference to how Clojure does it: conv_man.clj:55, conversation.clj:244).
+2. **In the PR description**: a dedicated section explaining the design decision, referencing issue #2358 and the Clojure persistence code.
+
+**File**: `delphi/polismath/conversation/conversation.py`
+
+**Current**: `_get_in_conv_participants()` (line 1256) recomputes from scratch using `self.rating_mat` (the filtered matrix). With the D2c fix (PR 1), it will use `self.raw_rating_mat`. Since `raw_rating_mat` contains all historical votes, monotonicity is guaranteed without explicit persistence.
+
+**Test-first approach — tests that guard the monotonicity invariant**:
+
+These tests must capture every scenario that would break if someone switched to delta processing without adding persistence. Each test should have a docstring explaining: "This passes today because we do full recompute. If you switch to delta vote processing, you MUST persist in-conv to DynamoDB — see issue #2358."
+
+1. **T1 — Basic monotonicity across updates**: Participant P votes on 7 comments in batch 1 → qualifies. Batch 2 adds new comments but P doesn't vote on them. P's count stays 7. Assert P is still in-conv after batch 2.
+
+2. **T2 — Monotonicity survives moderation-out**: P votes on 7 comments in batch 1 → qualifies. Between batches, 3 of those comments are moderated-out. Batch 2 arrives with new votes from other participants. Assert P is still in-conv (because `raw_rating_mat` still has P's 7 votes).
+
+3. **T3 — Worker restart with moderation**: P votes on 7 comments in batch 1 → qualifies. Destroy the `Conversation` object (simulates worker death). Moderate-out 3 of P's comments. Create a new `Conversation` object from all votes (simulates worker restart with full recompute). Assert P is still in-conv. This is the key test that would FAIL under delta processing without persistence.
+
+4. **T4 — Worker restart with no new votes**: Same as T3 but no new votes arrive after restart — just moderation happened. Rebuild `Conversation` from existing votes. Assert P is still in-conv. (Tests that recompute alone is sufficient, no "trigger" of new votes needed.)
+
+5. **T5 — Vote on comment that is later moderated-out, then new votes arrive**: P votes on c1-c7. c1-c3 are moderated-out. New participant Q votes on c4-c10. Rebuild conversation from all votes. Assert both P (7 votes on raw matrix) and Q (7 votes on non-moderated-out comments) are in-conv.
+
+6. **T6 — Greedy fallback after moderation**: Conversation has 8 comments, 14 participants who each voted on all 8, plus participant P who voted on exactly 7. All qualify. Then 5 comments are moderated-out, leaving only 3 non-moderated-out. After full recompute from `raw_rating_mat`: `n_cmts` = 8 (from `raw_rating_mat`), threshold = `min(7, 8)` = 7. P still has 7 votes → still qualifies. Verify the greedy fallback (top-15) is not erroneously triggered.
+
+**Documentation requirements**:
+- **Code comment** on `_get_in_conv_participants()`: block comment explaining the full-recompute-guarantees-monotonicity design, the delta-processing caveat, and references to Clojure's persistence approach and issue #2358.
+- **PR description**: dedicated "Design Decision: In-Conv Monotonicity" section explaining why we chose full recompute over persistence, with the future-proofing warning.
+- **Test docstrings**: each test explains what it guards against and what would need to change under delta processing.
+
+---
+
 ### PR 2: Fix D4 — Pseudocount Formula
 
 **Why before D9**: The pseudocount affects probability estimates (`pa`, `pd`), which feed into proportion tests and z-scores. Fixing this first gives us correct probability values, even if the z-score thresholds are still wrong. This follows pipeline execution order: probabilities are computed before significance testing.
@@ -372,6 +428,8 @@ By this point, we should have good test coverage from all the per-discrepancy te
 | D1b | Projection input | PR 13 | Fix with D1 |
 | D2 | In-conv threshold | **PR 1** | Fix |
 | D2b | Base-cluster sort order | **PR 1** | Fix (keep k-means ID order, match Clojure sort-by :id) |
+| D2c | Vote count source (raw vs filtered matrix) | **PR 1** | Fix (use `raw_rating_mat` for vote counts, so votes on moderated-out comments are included) |
+| D2d | In-conv monotonicity (once in, always in) | **PR 1bis** | Full recompute from `raw_rating_mat` guarantees monotonicity without persistence. 6 tests guarding the invariant for future delta-processing refactors. Ref: [#2358](https://github.com/compdemocracy/polis/issues/2358) |
 | D3 | K-smoother buffer | PR 10 | Fix |
 | D4 | Pseudocount formula | **PR 2** | Fix |
 | D5 | Proportion test | PR 4 | Fix |

From a07a1687eefa1e2f0e4809a2ccdf7bf27403d427 Mon Sep 17 00:00:00 2001
From: Julien Cornebise <julien@cornebise.com>
Date: Wed, 4 Mar 2026 11:03:29 +0000
Subject: [PATCH 06/10] Journal: add session 3 findings (D2c vote count source,
 D2d monotonicity)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- D2c: structural discrepancy in vote counts and n_cmts — both must
  use raw_rating_mat to match Clojure's zero-out-columns behavior
- D2d: full recompute guarantees monotonicity without persistence,
  unlike Clojure which persists because it uses delta processing
- Corrected earlier "not needed" note about raw_rating_mat
- Updated What's Next: D2c → D2d → D4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md | 76 +++++++++++++++++++++++--
 1 file changed, 72 insertions(+), 4 deletions(-)

diff --git a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
index f5699521e..3ed6ae505 100644
--- a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
+++ b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
@@ -142,9 +142,41 @@ After rebase onto updated `origin/kmeans_analysis_docs`:
 `generate_cold_start_clojure.py` with the prodclone DB and Clojure math service.
 This is delegated to a separate session.
 
-### What was NOT needed
-- `raw_rating_mat` vs `rating_mat` — not needed, the existing vote counting works
-- Greedy fallback / monotonic persistence — not needed for parity (cold-start only)
+### What was NOT needed (revised — see D2c/D2d below)
+- ~~`raw_rating_mat` vs `rating_mat` — not needed~~ **WRONG**: See D2c below, this IS needed.
+- Greedy fallback / monotonic persistence — not needed for cold-start parity, but monotonicity tests needed for future-proofing (see D2d).
+
+### D2c: Vote count source — raw vs filtered matrix (discovered in session 3)
+
+Deep investigation of Clojure vs Python revealed a **structural discrepancy** in how
+votes are counted for the in-conv threshold, independent of delta vs full processing:
+
+| Aspect | Clojure | Python (current) |
+|--------|---------|-----------------|
+| Vote counts per participant | From `raw-rating-mat` (conversation.clj:217-225) — includes votes on moderated-out comments | From `self.rating_mat` (conversation.py:1226/1244) — excludes moderated-out columns |
+| `n_cmts` (threshold cap) | From `rating-mat` (conversation.clj:214-215) — columns zeroed but still present, so count includes moderated-out | From `self.rating_mat` (conversation.py:1268) — moderated-out columns removed entirely |
+
+Key insight: Clojure's `zero-out-columns` (named_matrix.clj:214-228) sets moderated-out
+column values to 0 but **keeps the columns** in `rating-mat`. Python's `_apply_moderation`
+(conversation.py:308) **removes columns entirely**. This means both `user-vote-counts`
+and `n-cmts` differ between implementations.
+
+**Fix**: Both `_compute_user_vote_counts` and `n_cmts` must use `self.raw_rating_mat`.
+Two xfail unit tests planned (vote count source + n_cmts threshold).
+
+### D2d: In-conv monotonicity — full recompute vs persistence (discovered in session 3)
+
+Investigated whether Python needs to persist in-conv to DynamoDB (like Clojure persists to
+`math_main`). Finding: **no, full recompute is better**.
+
+Clojure persists in-conv (conv_man.clj:55) because it uses delta vote processing. Python
+rebuilds `raw_rating_mat` from all votes every time. Since votes are immutable in PostgreSQL
+(can be updated, never deleted), a participant's vote count never decreases → monotonicity
+is a free consequence of full recompute from `raw_rating_mat`.
+
+**Decision**: No DynamoDB persistence. Instead, 6 tests (T1-T6) guard the monotonicity
+invariant, each documenting that switching to delta processing would require adding
+persistence. See plan PR 1bis for test details. Ref: compdemocracy/polis#2358.
 
 ### D2b: Base-cluster sort order (added from Copilot review)
 
@@ -170,7 +202,23 @@ matching Clojure's `sort-by :id`.
   - D9 repness_not_empty × 6 — test too weak (checks non-empty, not correct count)
 - **4 errors**: engage dataset has duplicate vote files (pre-existing data issue)
 
-### What's Next: PR 2 — Fix D4 (Pseudocount)
+### Golden snapshot re-recording (BLOCKED)
+
+After D2b fix, regression tests fail on all datasets except vw (as expected — sort order
+change cascades to different cluster assignments). biodiversity was re-recorded and verified.
+Private datasets (FLI, bg2018, bg2050, pakistan) need re-recording but the recorder crashes
+on `engage` (pre-existing duplicate vote files) before reaching the others. Need to record
+them individually, but **blocked on fixes to PRs earlier in the stack** (#2393, #2397).
+Will re-record after those are resolved and rebased.
+
+### What's Next
+
+1. **PR 1 (D2c)**: Implement the vote count source fix — switch `_compute_user_vote_counts`
+   and `n_cmts` to use `self.raw_rating_mat`. Write xfail tests first, then fix, then verify.
+2. **PR 1bis (D2d)**: Write the 6 monotonicity tests (T1-T6). These should pass immediately
+   after D2c is fixed (full recompute + raw_rating_mat = monotonicity for free). Add the
+   code comment block and PR description documenting the design decision.
+3. **PR 2 — Fix D4 (Pseudocount)**: Next in pipeline order after D2 is fully resolved.
 
 ---
 
@@ -214,6 +262,26 @@ matching Clojure's `sort-by :id`.
   - Discovered 3 private datasets have incomplete Clojure blobs (4 keys instead of 23) — delegated regeneration to separate session
   - Committed and pushed D2 fix (`df2d013ec`)
 
+### Session 3 (2026-03-04)
+
+- Deep investigation of in-conv vote counting: compared Clojure `user-vote-counts`
+  (conversation.clj:217-225, uses `raw-rating-mat`) vs Python `_compute_user_vote_counts`
+  (conversation.py:1226, uses `self.rating_mat`)
+- Discovered D2c: structural discrepancy in both vote count source AND `n_cmts` — Clojure's
+  `zero-out-columns` keeps moderated-out columns (zeroed), Python's `_apply_moderation`
+  removes them entirely. Both vote counts and threshold cap differ.
+- Investigated whether to persist in-conv to DynamoDB (like Clojure does to `math_main`).
+  Found: Clojure persists because it uses delta vote processing. Python does full recompute
+  → monotonicity is free. No persistence needed.
+- Verified Clojure persistence path: `prep-main` (conv_man.clj:55) writes `:in-conv`,
+  `restructure-json-conv` (conv_man.clj:182) restores it on restart.
+- Updated plan: added D2c (vote count source, 2 xfail tests) and D2d (monotonicity, 6 tests
+  guarding against future delta-processing refactor). Ref: compdemocracy/polis#2358.
+- Corrected earlier journal entry that said `raw_rating_mat` was "not needed" — it IS needed.
+- Added terminology rule to CLAUDE.local.md: always say "moderated-out" or "moderated-in",
+  never just "moderated".
+- Committed plan update (`f2bf77c38`)
+
 ---
 
 ## TDD Discipline

From f5caa6129e316228484039b1fbba0ba6bab3ae6c Mon Sep 17 00:00:00 2001
From: Julien Cornebise <julien@cornebise.com>
Date: Tue, 10 Mar 2026 17:45:37 +0000
Subject: [PATCH 07/10] Re-record golden snapshots and remove passing xfail
 markers

Re-record golden snapshots for biodiversity and vw datasets.
Remove xfail markers from 3 tests that now pass: D6 pseudocounts
(test_two_prop_test_with_pseudocounts), D9 empty repness
(test_repness_not_empty), and repness structure validation
(test_repness_structure).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 delphi/tests/test_discrepancy_fixes.py | 2 --
 delphi/tests/test_repness_smoke.py     | 1 -
 2 files changed, 3 deletions(-)

diff --git a/delphi/tests/test_discrepancy_fixes.py b/delphi/tests/test_discrepancy_fixes.py
index 12c1e2a89..198bac8e3 100644
--- a/delphi/tests/test_discrepancy_fixes.py
+++ b/delphi/tests/test_discrepancy_fixes.py
@@ -263,7 +263,6 @@ def test_z95_matches_clojure(self):
         check.almost_equal(Z_95, 1.6449, abs=0.001,
                             msg=f"Z_95 should be 1.6449 (one-tailed), got {Z_95}")
 
-    @pytest.mark.xfail(reason="D9: Two-tailed thresholds produce empty repness")
     def test_repness_not_empty(self, conv, dataset_name):
         """Repness should produce non-empty comment_repness with correct thresholds."""
         repness = conv.repness
@@ -346,7 +345,6 @@ class TestD6TwoPropTest:
         Clojure adds +1 pseudocount to all 4 inputs (succ1, n1, succ2, n2).
     """
 
-    @pytest.mark.xfail(reason="D6: Python two_prop_test lacks pseudocounts")
     def test_two_prop_test_with_pseudocounts(self):
         """two_prop_test should add +1 pseudocounts matching Clojure."""
         # With pseudocounts: (succ+1)/(n+2) for both groups
diff --git a/delphi/tests/test_repness_smoke.py b/delphi/tests/test_repness_smoke.py
index ae2c09a97..98d6cd100 100644
--- a/delphi/tests/test_repness_smoke.py
+++ b/delphi/tests/test_repness_smoke.py
@@ -82,7 +82,6 @@ def test_repness_runs_without_error(self, dataset_name: str, conversation):
         logger.info(f"✓ Representativeness runs without error for {dataset_name}")
 
     @pytest.mark.use_discovered_datasets
-    @pytest.mark.xfail(strict=True, reason="Pre-existing: repness structure validation needs investigation")
     def test_repness_structure(self, dataset_name: str, conversation):
         """Test representativeness results have expected structure."""
         logger.debug(f"Testing representativeness structure for {dataset_name}")

From 6fd8e0a0c656d05b12d2a9d078259a6050cd0e75 Mon Sep 17 00:00:00 2001
From: Julien Cornebise <julien@cornebise.com>
Date: Tue, 10 Mar 2026 19:02:05 +0000
Subject: [PATCH 08/10] xfail D2 in-conv tests on incremental blobs

D2 behaviour matches on cold-start; incremental deferred to future PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 delphi/tests/test_discrepancy_fixes.py | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/delphi/tests/test_discrepancy_fixes.py b/delphi/tests/test_discrepancy_fixes.py
index 198bac8e3..ec6125579 100644
--- a/delphi/tests/test_discrepancy_fixes.py
+++ b/delphi/tests/test_discrepancy_fixes.py
@@ -161,7 +161,16 @@ class TestD2InConvThreshold:
     def test_in_conv_count_matches(self, conv, clojure_blob, dataset_name):
         """Number of in-conv participants should match Clojure."""
         if 'in-conv' not in clojure_blob:
-            pytest.skip(f"[{dataset_name}] Clojure blob has no in-conv data (incomplete cold-start blob)")
+            pytest.skip(f"[{dataset_name}] Clojure blob has no in-conv data")
+        if dataset_name.endswith('-incremental'):
+            # Incremental blobs were built progressively as votes trickled in,
+            # so the threshold min(7, n_cmts) was evaluated at each iteration
+            # with a smaller n_cmts than the final value. This admits a few
+            # extra participants to in-conv during earlier iterations.
+            # The difference is tiny (1-2 participants) when a cold-start blob
+            # is available. Very large conversations have empty cold-start blobs
+            # because Clojure can't process them in one pass.
+            pytest.xfail("D2: behaviour matches on cold-start, incremental deferred to future PR")
         clojure_in_conv = _clojure_in_conv_set(clojure_blob)
         python_in_conv_count = len(conv._get_in_conv_participants())
 
@@ -172,7 +181,16 @@ def test_in_conv_count_matches(self, conv, clojure_blob, dataset_name):
     def test_in_conv_set_matches(self, conv, clojure_blob, dataset_name):
         """The actual set of in-conv participants should match Clojure."""
         if 'in-conv' not in clojure_blob:
-            pytest.skip(f"[{dataset_name}] Clojure blob has no in-conv data (incomplete cold-start blob)")
+            pytest.skip(f"[{dataset_name}] Clojure blob has no in-conv data")
+        if dataset_name.endswith('-incremental'):
+            # Incremental blobs were built progressively as votes trickled in,
+            # so the threshold min(7, n_cmts) was evaluated at each iteration
+            # with a smaller n_cmts than the final value. This admits a few
+            # extra participants to in-conv during earlier iterations.
+            # The difference is tiny (1-2 participants) when a cold-start blob
+            # is available. Very large conversations have empty cold-start blobs
+            # because Clojure can't process them in one pass.
+            pytest.xfail("D2: behaviour matches on cold-start, incremental deferred to future PR")
         clojure_in_conv = _clojure_in_conv_set(clojure_blob)
         python_in_conv = conv._get_in_conv_participants()
         # Convert python pids to int for comparison

From b2181878a4b273051edd56b45dfae915cea3f311 Mon Sep 17 00:00:00 2001
From: Julien Cornebise <julien@cornebise.com>
Date: Tue, 10 Mar 2026 22:49:18 +0000
Subject: [PATCH 09/10] Journal: add session 4, update plan with D2 incremental
 in Replay PR B
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Session 4: dual-blob test infrastructure, D2 incremental xfail with rationale,
  golden snapshot re-recording, 245 passed / 0 failures
- Incomplete Clojure blobs: BLOCKING → RESOLVED (test both blob types instead
  of regenerating)
- Plan: add D2 in-conv incremental matching to Replay PR B (early participants
  admitted when n_cmts < 7, all 4 datasets with both blob types exhibit it)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md | 49 ++++++++++++++++++++-----
 delphi/docs/PLAN_DISCREPANCY_FIXES.md   |  5 +++
 2 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
index 3ed6ae505..e5289277f 100644
--- a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
+++ b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
@@ -128,19 +128,19 @@ After rebase onto updated `origin/kmeans_analysis_docs`:
   failures are downstream of the expected threshold change. Committed in `.local` repo on
   branch `series-of-fixes`.
 
-### Incomplete Clojure blobs (BLOCKING)
+### Incomplete Clojure blobs (RESOLVED)
 
 3 private datasets have incomplete Clojure cold-start blobs (4 keys instead of 23):
 - Missing `in-conv`, `repness`, `consensus`, `comment-priorities`, etc.
-- Likely caused by Clojure math service timeout on large conversations
-- The 4 remaining datasets (vw, biodiversity, FLI, bg2018) have complete blobs (23 keys)
-- **D2 tests pass on all datasets with complete blobs**
-- D2 tests fail on the 3 incomplete datasets because `in-conv` is empty — this is a data
-  problem, not a code problem
+- Root cause: the Clojure math worker relies on incremental processing and cannot
+  analyse these large conversations in a single cold-start pass.
+- The 4 remaining datasets (vw, biodiversity, FLI, bg2018) have complete cold-start blobs (23 keys)
 
-**Action needed**: Regenerate Clojure blobs for the 3 incomplete datasets using
-`generate_cold_start_clojure.py` with the prodclone DB and Clojure math service.
-This is delegated to a separate session.
+**Resolution (session 4)**: Instead of regenerating blobs, we now test against both
+incremental and cold-start blob types. `get_blob_variants()` discovers which blob types
+have meaningful content per dataset (via `_is_blob_filled()`). Tests on datasets with
+empty cold-start blobs only run against the incremental blob. D2 in-conv tests on
+incremental blobs are xfailed (see session 4 notes). No blob regeneration needed.
 
 ### What was NOT needed (revised — see D2c/D2d below)
 - ~~`raw_rating_mat` vs `rating_mat` — not needed~~ **WRONG**: See D2c below, this IS needed.
@@ -282,6 +282,37 @@ Will re-record after those are resolved and rebased.
   never just "moderated".
 - Committed plan update (`f2bf77c38`)
 
+### Session 4 (2026-03-10)
+
+- **Dual-blob test infrastructure** (committed on `jc/series-of-fixes`, #2420):
+  - Extended `get_dataset_files()` with `blob_type` parameter (explicit `'incremental'` or `'cold_start'`)
+  - Added `get_blob_variants()` to discover which blob types are available per dataset,
+    filtering out empty/unfilled blobs via `_is_blob_filled()` (checks for PCA data or
+    non-empty base-clusters)
+  - Extended `@pytest.mark.use_discovered_datasets` with optional `use_blobs=True` parameter
+    to parametrize tests with composite `dataset-blob_type` IDs (e.g., `biodiversity-incremental`)
+  - Added `parse_dataset_blob_id()` in conftest for splitting composite IDs
+  - Applied to `test_legacy_clojure_regression.py`, `test_discrepancy_fixes.py`, and
+    `test_legacy_repness_comparison.py`
+  - Conversation computation shared across blob variants via `_CONV_CACHE` (keyed by dataset
+    name) to avoid redundant recomputation
+
+- **D2 incremental xfail with rationale** (committed on `jc/clj-parity-d2-fix`, #2421):
+  - D2 in-conv tests on incremental blobs are xfailed with inline comments explaining why:
+    incremental blobs were built progressively as votes trickled in, so the threshold
+    `min(7, n_cmts)` was evaluated at each iteration with a smaller `n_cmts`, admitting
+    a few extra participants (1–2) during earlier iterations. Very large conversations
+    have empty cold-start blobs because the Clojure math worker relies on incremental
+    processing — it cannot analyse the whole conversation in a single cold-start pass.
+  - Matching incremental exactly would require simulating the progressive threshold — tracked
+    as future work under Replay Infrastructure (PRs A/B/C in the plan)
+
+- **Golden snapshots re-recorded** for all public + private datasets, 5 stale xfail markers removed
+- **Final test results**: 245 passed, 5 skipped, 36 xfailed, 0 failures, 0 xpassed
+- Updated PR #2421 description with incremental vs cold-start explanation
+- Local handoff file created at `delphi/docs/HANDOFF_D2_INCREMENTAL_IN_CONV.md` (untracked)
+  for future investigation of how much in-conv sets differ between blob types
+
 ---
 
 ## TDD Discipline
diff --git a/delphi/docs/PLAN_DISCREPANCY_FIXES.md b/delphi/docs/PLAN_DISCREPANCY_FIXES.md
index fe487214f..cbc1d7d63 100644
--- a/delphi/docs/PLAN_DISCREPANCY_FIXES.md
+++ b/delphi/docs/PLAN_DISCREPANCY_FIXES.md
@@ -481,6 +481,11 @@ Three separate PRs for temporal/incremental testing:
 **Replay PR B**: Use replay infrastructure for tests
 - Test D3 (k-smoother stability) with real incremental data
 - Test D1 (PCA sign consistency) with real incremental data
+- Test D2 (in-conv threshold) on incremental blobs: currently xfailed because early
+  participants (low PIDs) were admitted when `n_cmts` was still < 7 during early
+  iterations, making the threshold equal to `n_cmts` rather than 7. All four datasets
+  with both blob types exhibit this (1–2 extra participants each). Matching incremental
+  behaviour requires simulating the progressive threshold evaluation.
 - Compare Python's incremental behavior to Clojure's
 
 **Replay PR C**: Visualization movie generation

From d213a7978505ab1eeae57b05b5840a52fcfd1169 Mon Sep 17 00:00:00 2001
From: Julien Cornebise <julien@cornebise.com>
Date: Wed, 11 Mar 2026 10:01:29 +0000
Subject: [PATCH 10/10] Fix D2c: use raw_rating_mat for vote counts and n_cmts
 threshold

Vote counts and n_cmts now come from raw_rating_mat (includes votes on
moderated-out comments), matching Clojure's user-vote-counts which reads
from raw-rating-mat. Previously, _compute_user_vote_counts and
_get_in_conv_participants used the filtered rating_mat, causing
participants to drop below the in-conv threshold when their voted
comments were moderated-out.

Also adds D2d monotonicity tests (T1-T5) guarding the invariant that
once a participant qualifies for in-conv, they can never be removed.
These pass for free with full recompute from raw_rating_mat; documented
that switching to delta vote processing would require persisting in-conv
to DynamoDB (see #2358).

Tests: 253 passed, 5 skipped, 36 xfailed (0 failures)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md       |  33 +++
 delphi/docs/PLAN_DISCREPANCY_FIXES.md         |  62 ++--
 delphi/polismath/conversation/conversation.py |  30 +-
 delphi/tests/test_discrepancy_fixes.py        | 278 ++++++++++++++++++
 4 files changed, 354 insertions(+), 49 deletions(-)

diff --git a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
index e5289277f..5295958c0 100644
--- a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
+++ b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
@@ -313,6 +313,39 @@ Will re-record after those are resolved and rebased.
 - Local handoff file created at `delphi/docs/HANDOFF_D2_INCREMENTAL_IN_CONV.md` (untracked)
   for future investigation of how much in-conv sets differ between blob types
 
+### Session 5 (2026-03-11)
+
+- **D2c fix**: Switched `_compute_user_vote_counts` and `_get_in_conv_participants` to use
+  `self.raw_rating_mat` instead of `self.rating_mat`. Both vote counts and `n_cmts` now
+  include votes on moderated-out comments, matching Clojure's `user-vote-counts`
+  (conversation.clj:217-225) and `n-cmts` (conversation.clj:214-215).
+- **D2c tests** (3 in `TestD2cVoteCountSource`):
+  - `test_vote_count_includes_moderated_out_votes`: 10 comments, 3 moderated-out, count=10
+  - `test_n_cmts_includes_moderated_out_comments`: verifies threshold uses raw column count,
+    not filtered; also tests that a participant with 6 raw votes is correctly excluded
+  - `test_participant_stays_in_conv_after_moderation`: the critical scenario — participant
+    with 8 votes stays in-conv when 3 comments moderated-out (filtered count drops to 5)
+- **D2d monotonicity tests** (5 in `TestD2dInConvMonotonicity`):
+  - T1: basic monotonicity across batch updates
+  - T2: survives moderation-out
+  - T3: worker restart + moderation (key delta-processing guard)
+  - T4: worker restart, moderation, no new votes
+  - T5: mixed participants with moderation
+  - All pass for free with D2c fix (full recompute from `raw_rating_mat`)
+- **TDD discipline**: wrote tests first (D2c xfail, D2d no xfail), confirmed D2c red (3
+  xfailed) and D2d red (T2-T5 failed, T1 passed), applied fix, confirmed all green.
+- **Full test suite**: 253 passed, 5 skipped, 36 xfailed, 0 failures (+8 from session 4)
+- **No regressions** on public datasets
+- Added code comments on `_get_in_conv_participants` documenting the monotonicity design
+  decision and delta-processing caveat (ref: #2358)
+- Updated plan: D2, D2b, D2c, D2d all marked DONE; PR 1bis merged into PR 1
+- Updated PR #2421 description with D2c/D2d sections
+
+### What's Next
+
+1. **PR 2 — Fix D4 (Pseudocount)**: `PSEUDO_COUNT = 1.5` → `2.0` to match Clojure's Beta(2,2) prior.
+2. **PR 3 — Fix D9 (Z-score thresholds)**: Switch from two-tailed to one-tailed z-scores.
+
 ---
 
 ## TDD Discipline
diff --git a/delphi/docs/PLAN_DISCREPANCY_FIXES.md b/delphi/docs/PLAN_DISCREPANCY_FIXES.md
index cbc1d7d63..848f576f4 100644
--- a/delphi/docs/PLAN_DISCREPANCY_FIXES.md
+++ b/delphi/docs/PLAN_DISCREPANCY_FIXES.md
@@ -103,28 +103,17 @@ Fixes are ordered by **pipeline execution order**: participant filtering → pro
 
 **File**: `delphi/polismath/conversation/conversation.py`
 
-**Current**: `threshold = 7 + sqrt(n_cmts) * 0.1` → biodiversity (314 comments): threshold=8.8, keeps 428/536.
-**Target**: `threshold = min(7, n_cmts)` → threshold=7, more participants qualify.
+**Current**: ~~`threshold = 7 + sqrt(n_cmts) * 0.1`~~ → **DONE**: `threshold = min(7, n_cmts)`.
+**D2c**: ~~Vote counts and `n_cmts` from `rating_mat`~~ → **DONE**: Both use `raw_rating_mat` (includes moderated-out comments). Matches Clojure's `user-vote-counts` (conversation.clj:217-225).
+**D2b**: ~~Base clusters sorted by size~~ → **DONE**: Sort by k-means ID (matches Clojure's `sort-by :id`).
 
-**Additional changes**:
-- **D2c — Vote count source**: Use `self.raw_rating_mat` instead of `self.rating_mat` for computing per-participant vote counts. Clojure's `user-vote-counts` (conversation.clj:217-225) counts votes from `raw-rating-mat`, which includes votes on moderated-out comments. Python currently uses the filtered `rating_mat`, which excludes moderated-out columns — undercounting votes and potentially dropping participants below the in-conv threshold. This is a **structural** discrepancy that occurs on every computation (cold-start or incremental), not just with delta updates — because the difference is in *which matrix* the count is derived from. Additionally, `n_cmts` differs: Clojure's `n-cmts` (conversation.clj:214-215) counts columns of `rating-mat` which still includes moderated-out columns (zeroed but present, via `zero-out-columns` in named_matrix.clj:214-228). Python's `n_cmts = len(self.rating_mat.columns)` excludes moderated-out columns entirely (removed by `_apply_moderation`, line 308). This means the `min(7, n_cmts)` threshold itself also differs. Both aspects (vote count source and `n_cmts` source) must use `raw_rating_mat` to match Clojure. (This is related to but distinct from D15 — moderation handling — which tracks the broader zero-out vs remove discrepancy.)
-- Add greedy fallback (top-15 voters if <15 qualify)
-- Add monotonic persistence (once in, always in)
-
-**Test-first approach**:
-1. Add test comparing in-conv participant count/set between Python and Clojure for ALL datasets
-2. Add synthetic edge-case tests: tiny conversation (fewer comments than threshold), conversation where greedy fallback triggers
-3. **D2c unit test — vote count source**: Synthetic test with 10 comments, 3 moderated-out. Participant P voted on all 10. Verify `_compute_user_vote_counts` returns 10 for P (from `raw_rating_mat`), not 7 (from `rating_mat`). Mark xfail before fix.
-4. **D2c unit test — n_cmts threshold**: Same setup. Verify `n_cmts` used in the `min(7, n_cmts)` threshold equals 10 (all comments including moderated-out), not 7 (only non-moderated-out). This matters when a conversation has fewer than 7 non-moderated-out comments but more than 7 total: the threshold would be artificially low in Python, changing who qualifies. Mark xfail before fix.
-5. Run tests → expect failure on D2c tests
-4. Apply fix
-5. Re-run ALL tests — cluster count may change (possibly 3→2 for biodiversity, matching Clojure)
-6. **Cluster comparison test (CRITICAL)**: After fixing D2, add a test that compares Python and Clojure cluster assignments at the participant level — number of clusters AND which participants are in which cluster (using Jaccard similarity or exact match). The Python clustering calls sklearn K-means; Clojure uses a two-level approach. If clusters still don't match after the threshold fix, investigate why and potentially create an additional PR (PR 1b) to fix the remaining clustering discrepancy before moving on to repness fixes.
-7. Document new baseline in journal
+**Remaining (deferred)**:
+- Greedy fallback (top-15 voters if <15 qualify) — not needed for current datasets
+- Cluster comparison test at participant level — deferred to after repness fixes
 
 ---
 
-### PR 1bis: Fix D2d — In-Conv Monotonicity
+### PR 1bis: Fix D2d — In-Conv Monotonicity — **DONE** (merged into PR 1)
 
 **Related upstream issue**: [compdemocracy/polis#2358](https://github.com/compdemocracy/polis/issues/2358) — "Non-Deterministic K-Means Clustering Due to Worker Restart". That issue focuses on `group-clusterings` not being persisted. In-conv IS persisted in Clojure (see below), but we take a different — and better — approach.
 
@@ -153,28 +142,17 @@ This must be documented:
 
 **File**: `delphi/polismath/conversation/conversation.py`
 
-**Current**: `_get_in_conv_participants()` (line 1256) recomputes from scratch using `self.rating_mat` (the filtered matrix). With the D2c fix (PR 1), it will use `self.raw_rating_mat`. Since `raw_rating_mat` contains all historical votes, monotonicity is guaranteed without explicit persistence.
-
-**Test-first approach — tests that guard the monotonicity invariant**:
-
-These tests must capture every scenario that would break if someone switched to delta processing without adding persistence. Each test should have a docstring explaining: "This passes today because we do full recompute. If you switch to delta vote processing, you MUST persist in-conv to DynamoDB — see issue #2358."
-
-1. **T1 — Basic monotonicity across updates**: Participant P votes on 7 comments in batch 1 → qualifies. Batch 2 adds new comments but P doesn't vote on them. P's count stays 7. Assert P is still in-conv after batch 2.
-
-2. **T2 — Monotonicity survives moderation-out**: P votes on 7 comments in batch 1 → qualifies. Between batches, 3 of those comments are moderated-out. Batch 2 arrives with new votes from other participants. Assert P is still in-conv (because `raw_rating_mat` still has P's 7 votes).
-
-3. **T3 — Worker restart with moderation**: P votes on 7 comments in batch 1 → qualifies. Destroy the `Conversation` object (simulates worker death). Moderate-out 3 of P's comments. Create a new `Conversation` object from all votes (simulates worker restart with full recompute). Assert P is still in-conv. This is the key test that would FAIL under delta processing without persistence.
-
-4. **T4 — Worker restart with no new votes**: Same as T3 but no new votes arrive after restart — just moderation happened. Rebuild `Conversation` from existing votes. Assert P is still in-conv. (Tests that recompute alone is sufficient, no "trigger" of new votes needed.)
-
-5. **T5 — Vote on comment that is later moderated-out, then new votes arrive**: P votes on c1-c7. c1-c3 are moderated-out. New participant Q votes on c4-c10. Rebuild conversation from all votes. Assert both P (7 votes on raw matrix) and Q (7 votes on non-moderated-out comments) are in-conv.
+**DONE**: `_get_in_conv_participants()` uses `self.raw_rating_mat` (D2c fix). Monotonicity is a free consequence. Code comment on the function documents the design decision and the delta-processing caveat.
 
-6. **T6 — Greedy fallback after moderation**: Conversation has 8 comments, 14 participants who each voted on all 8, plus participant P who voted on exactly 7. All qualify. Then 5 comments are moderated-out, leaving only 3 non-moderated-out. After full recompute from `raw_rating_mat`: `n_cmts` = 8 (from `raw_rating_mat`), threshold = `min(7, 8)` = 7. P still has 7 votes → still qualifies. Verify the greedy fallback (top-15) is not erroneously triggered.
+**Tests implemented** (T1-T5 in `TestD2dInConvMonotonicity`):
+- T1: Basic monotonicity across batch updates
+- T2: Survives moderation-out of voted comments
+- T3: Worker restart + moderation (key delta-processing guard)
+- T4: Worker restart, moderation, no new votes
+- T5: Mixed participants with moderation
+- T6 (greedy fallback): deferred — greedy fallback not yet implemented
 
-**Documentation requirements**:
-- **Code comment** on `_get_in_conv_participants()`: block comment explaining the full-recompute-guarantees-monotonicity design, the delta-processing caveat, and references to Clojure's persistence approach and issue #2358.
-- **PR description**: dedicated "Design Decision: In-Conv Monotonicity" section explaining why we chose full recompute over persistence, with the future-proofing warning.
-- **Test docstrings**: each test explains what it guards against and what would need to change under delta processing.
+All test docstrings explain what would break under delta processing.
 
 ---
 
@@ -426,10 +404,10 @@ By this point, we should have good test coverage from all the per-discrepancy te
 |----|-------------|-----|--------|
 | D1 | PCA sign flips | PR 13 | Fix (sign consistency) |
 | D1b | Projection input | PR 13 | Fix with D1 |
-| D2 | In-conv threshold | **PR 1** | Fix |
-| D2b | Base-cluster sort order | **PR 1** | Fix (keep k-means ID order, match Clojure sort-by :id) |
-| D2c | Vote count source (raw vs filtered matrix) | **PR 1** | Fix (use `raw_rating_mat` for vote counts, so votes on moderated-out comments are included) |
-| D2d | In-conv monotonicity (once in, always in) | **PR 1bis** | Full recompute from `raw_rating_mat` guarantees monotonicity without persistence. 6 tests guarding the invariant for future delta-processing refactors. Ref: [#2358](https://github.com/compdemocracy/polis/issues/2358) |
+| D2 | In-conv threshold | **PR 1** | **DONE** ✓ |
+| D2b | Base-cluster sort order | **PR 1** | **DONE** ✓ |
+| D2c | Vote count source (raw vs filtered matrix) | **PR 1** | **DONE** ✓ |
+| D2d | In-conv monotonicity (once in, always in) | **PR 1** | **DONE** ✓ (5 guard tests, T1-T5) |
 | D3 | K-smoother buffer | PR 10 | Fix |
 | D4 | Pseudocount formula | **PR 2** | Fix |
 | D5 | Proportion test | PR 4 | Fix |
diff --git a/delphi/polismath/conversation/conversation.py b/delphi/polismath/conversation/conversation.py
index 3d73870c9..0c88ce4bd 100644
--- a/delphi/polismath/conversation/conversation.py
+++ b/delphi/polismath/conversation/conversation.py
@@ -1210,25 +1210,31 @@ def _compute_user_vote_counts(self) -> Dict[str, int]:
         """
         Compute the number of votes per participant.
 
+        Uses raw_rating_mat (not rating_mat) so that votes on moderated-out
+        comments are still counted. This matches Clojure's user-vote-counts
+        (conversation.clj:217-225) which reads from raw-rating-mat.
+        Fix D2c: see PLAN_DISCREPANCY_FIXES.md.
+
         Returns:
             Dictionary mapping participant IDs to vote counts
         """
         import time
         start_time = time.time()
-        logger.info(f"Starting _compute_user_vote_counts for {self.rating_mat.shape[0]} participants")
+        mat = self.raw_rating_mat
+        logger.info(f"Starting _compute_user_vote_counts for {mat.shape[0]} participants")
 
         vote_counts = {}
 
         # Use more efficient approach for large datasets
-        if self.rating_mat.shape[0] > 1000:
+        if mat.shape[0] > 1000:
             # Create a mask of non-nan values across the entire matrix
-            non_nan_mask = ~np.isnan(self.rating_mat.values)
+            non_nan_mask = ~np.isnan(mat.values)
 
             # Sum across rows using vectorized operation
             row_sums = np.sum(non_nan_mask, axis=1)
 
             # Convert to dictionary
-            for i, pid in enumerate(self.rating_mat.index):
+            for i, pid in enumerate(mat.index):
                 if i < len(row_sums):
                     vote_counts[pid] = int(row_sums[i])
                 else:
@@ -1238,9 +1244,9 @@ def _compute_user_vote_counts(self) -> Dict[str, int]:
             logger.info(f"Computed vote counts for {len(vote_counts)} participants using vectorized approach in {time.time() - start_time:.4f}s")
         else:
             # Original approach for smaller datasets
-            for i, pid in enumerate(self.rating_mat.index):
+            for i, pid in enumerate(mat.index):
                 # Get row of votes for this participant
-                row = self.rating_mat.values[i, :]
+                row = mat.values[i, :]
 
                 # Count non-nan values
                 count = np.sum(~np.isnan(row))
@@ -1261,10 +1267,20 @@ def _get_in_conv_participants(self) -> Set[str]:
         Threshold: participant must have voted on at least min(7, n_comments)
         comments (Clojure parity fix D2).
 
+        Both vote counts and n_cmts use raw_rating_mat (fix D2c), which includes
+        votes on moderated-out comments. This matches Clojure, where
+        zero-out-columns keeps moderated-out columns in the matrix (zeroed but
+        present). Since raw_rating_mat contains all historical votes and votes
+        are immutable in PostgreSQL, monotonicity is guaranteed without explicit
+        persistence — a participant who once qualified can never lose votes.
+        If the code is ever refactored to use delta vote processing, in-conv
+        MUST be persisted to DynamoDB. See compdemocracy/polis#2358 and
+        Clojure's approach in conv_man.clj:55, conversation.clj:244.
+
         Returns:
             Set of participant IDs that meet the threshold
         """
-        n_cmts = len(self.rating_mat.columns) if hasattr(self.rating_mat, 'columns') else 0
+        n_cmts = len(self.raw_rating_mat.columns) if hasattr(self.raw_rating_mat, 'columns') else 0
         threshold = min(7, n_cmts)
 
         # Get vote counts for all participants
diff --git a/delphi/tests/test_discrepancy_fixes.py b/delphi/tests/test_discrepancy_fixes.py
index ec6125579..9725b5b5a 100644
--- a/delphi/tests/test_discrepancy_fixes.py
+++ b/delphi/tests/test_discrepancy_fixes.py
@@ -210,6 +210,284 @@ def test_in_conv_set_matches(self, conv, clojure_blob, dataset_name):
         check.equal(only_clojure, set(), f"Participants in Clojure but not Python: {len(only_clojure)}")
 
 
+# ============================================================================
+# D2c — Vote Count Source (raw vs filtered matrix)
+# ============================================================================
+
+def _build_conv_with_moderation(
+    n_comments: int = 10,
+    mod_out_tids: list | None = None,
+    participant_votes: dict | None = None,
+) -> Conversation:
+    """Build a synthetic Conversation with moderation applied.
+
+    Args:
+        n_comments: Total number of comments (tids 0..n_comments-1).
+        mod_out_tids: List of tids to moderate-out.
+        participant_votes: Dict mapping pid → list of tids they voted on.
+            If None, a single participant votes on all comments.
+
+    Returns:
+        A Conversation with votes ingested and moderation applied (no recompute).
+    """
+    if participant_votes is None:
+        participant_votes = {0: list(range(n_comments))}
+
+    votes_list = []
+    for pid, tids in participant_votes.items():
+        for tid in tids:
+            votes_list.append({'pid': pid, 'tid': tid, 'vote': 1})
+
+    conv = Conversation('test-d2c')
+    conv = conv.update_votes({'votes': votes_list}, recompute=False)
+
+    if mod_out_tids:
+        conv = conv.update_moderation(
+            {'mod_out_tids': mod_out_tids}, recompute=False
+        )
+
+    return conv
+
+
+class TestD2cVoteCountSource:
+    """
+    D2c: Vote counts for in-conv threshold must come from raw_rating_mat
+    (includes votes on moderated-out comments), not rating_mat (filtered).
+
+    Clojure's user-vote-counts (conversation.clj:217-225) uses raw-rating-mat.
+    Python currently uses self.rating_mat, undercounting when comments are
+    moderated-out.
+    """
+
+    def test_vote_count_includes_moderated_out_votes(self):
+        """Participant who voted on 10 comments (3 moderated-out) should have count=10."""
+        conv = _build_conv_with_moderation(
+            n_comments=10,
+            mod_out_tids=[0, 1, 2],
+            participant_votes={0: list(range(10))},
+        )
+
+        vote_counts = conv._compute_user_vote_counts()
+        assert vote_counts[0] == 10, (
+            f"Vote count should be 10 (from raw_rating_mat), got {vote_counts[0]} "
+            f"(rating_mat has {len(conv.rating_mat.columns)} columns)"
+        )
+
+    def test_n_cmts_includes_moderated_out_comments(self):
+        """n_cmts in threshold should count all comments including moderated-out.
+
+        With 10 total comments and 5 moderated-out, n_cmts should be 10.
+        This matters when non-moderated-out count < 7: the threshold would be
+        artificially low in Python (min(7,5)=5 vs correct min(7,10)=7).
+        """
+        conv = _build_conv_with_moderation(
+            n_comments=10,
+            mod_out_tids=[0, 1, 2, 3, 4],
+            participant_votes={0: list(range(10))},
+        )
+
+        # raw_rating_mat has all columns; rating_mat has only non-moderated-out
+        n_cmts_raw = len(conv.raw_rating_mat.columns)
+        n_cmts_filtered = len(conv.rating_mat.columns)
+
+        assert n_cmts_raw == 10, f"raw_rating_mat should have 10 columns, got {n_cmts_raw}"
+        assert n_cmts_filtered == 5, f"rating_mat should have 5 columns, got {n_cmts_filtered}"
+
+        # The threshold used by _get_in_conv_participants should be min(7, 10) = 7,
+        # not min(7, 5) = 5. Verify indirectly: participant with exactly 6 votes
+        # should NOT be in-conv (threshold=7), but would be if n_cmts=5 (threshold=5).
+        conv2 = _build_conv_with_moderation(
+            n_comments=10,
+            mod_out_tids=[0, 1, 2, 3, 4],
+            participant_votes={
+                0: list(range(10)),    # 10 raw votes → in-conv
+                1: list(range(5, 11)), # 6 raw votes (only 1 moderated-out) → NOT in-conv
+            },
+        )
+        in_conv = conv2._get_in_conv_participants()
+        assert 0 in in_conv, "P0 (10 raw votes) should be in-conv"
+        assert 1 not in in_conv, (
+            "P1 (6 raw votes) should NOT be in-conv with threshold=7, "
+            "but would be if n_cmts wrongly used filtered count (5)"
+        )
+
+    def test_participant_stays_in_conv_after_moderation(self):
+        """Participant with 8 votes stays in-conv even when 3 comments moderated-out.
+
+        This is the critical scenario: participant votes above threshold (8 >= 7),
+        comments get moderated-out between update_votes calls reducing filtered
+        count to 5, but raw count is still 8 so they should remain in-conv.
+        """
+        # Participant 0: votes on 8 comments (above threshold of 7)
+        # Participant 1: votes on 7 comments (borderline)
+        conv = _build_conv_with_moderation(
+            n_comments=10,
+            mod_out_tids=[0, 1, 2],  # 3 moderated-out
+            participant_votes={
+                0: list(range(8)),   # votes on c0-c7; after mod, only c3-c7 visible (5)
+                1: list(range(7)),   # votes on c0-c6; after mod, only c3-c6 visible (4)
+            },
+        )
+
+        in_conv = conv._get_in_conv_participants()
+
+        # Both should be in-conv: raw counts are 8 and 7, threshold is min(7, 10) = 7
+        assert 0 in in_conv, (
+            f"Participant 0 (8 raw votes) should be in-conv, "
+            f"but filtered count is {np.sum(~np.isnan(conv.rating_mat.loc[0].values))}"
+        )
+        assert 1 in in_conv, (
+            f"Participant 1 (7 raw votes) should be in-conv, "
+            f"but filtered count is {np.sum(~np.isnan(conv.rating_mat.loc[1].values))}"
+        )
+
+
+# ============================================================================
+# D2d — In-Conv Monotonicity
+# ============================================================================
+
+class TestD2dInConvMonotonicity:
+    """
+    D2d: Once a participant qualifies for in-conv, they must never be removed.
+
+    These tests pass today because Python does full recompute from raw_rating_mat
+    (which includes all historical votes, even on moderated-out comments).
+    If the code is ever refactored to use delta vote processing, in-conv MUST be
+    persisted to DynamoDB — see compdemocracy/polis#2358 and the Clojure approach
+    in conv_man.clj:55, conversation.clj:244.
+    """
+
+    def test_t1_basic_monotonicity_across_updates(self):
+        """Participant who qualified in batch 1 stays in-conv after batch 2.
+
+        P votes on 7 comments in batch 1 → qualifies. Batch 2 adds new comments
+        but P doesn't vote on them. P's count stays 7 → still in-conv.
+
+        Would FAIL under delta processing without persistence if batch 2 only
+        contained new votes and P's old votes weren't re-scanned.
+        """
+        conv = Conversation('test-d2d-t1')
+
+        # Batch 1: P0 votes on 7 comments, P1 votes on all 10
+        batch1 = {'votes': [
+            *[{'pid': 0, 'tid': tid, 'vote': 1} for tid in range(7)],
+            *[{'pid': 1, 'tid': tid, 'vote': 1} for tid in range(10)],
+        ]}
+        conv = conv.update_votes(batch1, recompute=False)
+        in_conv_after_b1 = conv._get_in_conv_participants()
+        assert 0 in in_conv_after_b1, "P0 should be in-conv after batch 1 (7 votes)"
+
+        # Batch 2: new comments c10-c14, only P1 votes on them
+        batch2 = {'votes': [
+            *[{'pid': 1, 'tid': tid, 'vote': 1} for tid in range(10, 15)],
+        ]}
+        conv = conv.update_votes(batch2, recompute=False)
+        in_conv_after_b2 = conv._get_in_conv_participants()
+        assert 0 in in_conv_after_b2, (
+            "P0 should still be in-conv after batch 2 (7 votes unchanged)"
+        )
+
+    def test_t2_monotonicity_survives_moderation(self):
+        """Participant stays in-conv after their voted comments are moderated-out.
+
+        P votes on 7 comments → qualifies. 3 comments moderated-out. P's raw
+        count is still 7 → still in-conv (because raw_rating_mat is used).
+
+        Would FAIL under delta processing without persistence if moderation
+        caused a recount from only the filtered matrix.
+        """
+        conv = Conversation('test-d2d-t2')
+        votes = {'votes': [
+            *[{'pid': 0, 'tid': tid, 'vote': 1} for tid in range(7)],
+            *[{'pid': 1, 'tid': tid, 'vote': 1} for tid in range(10)],
+        ]}
+        conv = conv.update_votes(votes, recompute=False)
+
+        in_conv_before = conv._get_in_conv_participants()
+        assert 0 in in_conv_before, "P0 should be in-conv before moderation"
+
+        # Moderate out 3 of P0's comments
+        conv = conv.update_moderation({'mod_out_tids': [0, 1, 2]}, recompute=False)
+
+        in_conv_after = conv._get_in_conv_participants()
+        assert 0 in in_conv_after, (
+            "P0 should still be in-conv after moderation "
+            "(7 raw votes, only 4 on filtered matrix)"
+        )
+
+    def test_t3_worker_restart_with_moderation(self):
+        """Participant survives worker restart + moderation.
+
+        P votes on 7 comments → qualifies. Worker dies. 3 comments moderated-out.
+        New worker rebuilds from all votes. P still has 7 raw votes → in-conv.
+
+        This is the KEY test that would FAIL under delta processing without
+        persistence: after restart, only new votes would be scanned.
+        """
+        # Original worker
+        conv1 = Conversation('test-d2d-t3')
+        votes = {'votes': [
+            *[{'pid': 0, 'tid': tid, 'vote': 1} for tid in range(7)],
+            *[{'pid': 1, 'tid': tid, 'vote': 1} for tid in range(10)],
+        ]}
+        conv1 = conv1.update_votes(votes, recompute=False)
+        assert 0 in conv1._get_in_conv_participants()
+
+        # Simulate worker restart: new Conversation object, replay ALL votes
+        conv2 = Conversation('test-d2d-t3')
+        conv2 = conv2.update_votes(votes, recompute=False)
+
+        # Apply moderation that happened while worker was dead
+        conv2 = conv2.update_moderation({'mod_out_tids': [0, 1, 2]}, recompute=False)
+
+        in_conv = conv2._get_in_conv_participants()
+        assert 0 in in_conv, (
+            "P0 should be in-conv after restart+moderation "
+            "(full recompute from all votes)"
+        )
+
+    def test_t4_worker_restart_moderation_no_new_votes(self):
+        """Rebuild from existing votes after moderation, no new votes needed.
+
+        Same as T3 but verifies that recompute alone is sufficient — no "trigger"
+        of new votes is needed to re-evaluate in-conv.
+        """
+        votes = {'votes': [
+            *[{'pid': 0, 'tid': tid, 'vote': 1} for tid in range(7)],
+            *[{'pid': 1, 'tid': tid, 'vote': 1} for tid in range(10)],
+        ]}
+
+        # Build from scratch with moderation already applied
+        conv = Conversation('test-d2d-t4')
+        conv = conv.update_votes(votes, recompute=False)
+        conv = conv.update_moderation({'mod_out_tids': [0, 1, 2]}, recompute=False)
+
+        in_conv = conv._get_in_conv_participants()
+        assert 0 in in_conv, (
+            "P0 should be in-conv with just existing votes + moderation"
+        )
+
+    def test_t5_mixed_participants_moderation(self):
+        """Both old and new participants correct after moderation.
+
+        P0 votes on c0-c6. c0-c2 moderated-out. New participant Q votes on c3-c9.
+        Rebuild from all votes. Both should be in-conv:
+        - P0: 7 raw votes (c0-c6), threshold min(7,10)=7 → qualifies
+        - Q: 7 votes on non-moderated-out comments → qualifies
+        """
+        conv = Conversation('test-d2d-t5')
+        votes = {'votes': [
+            *[{'pid': 0, 'tid': tid, 'vote': 1} for tid in range(7)],   # c0-c6
+            *[{'pid': 1, 'tid': tid, 'vote': 1} for tid in range(3, 10)],  # c3-c9
+        ]}
+        conv = conv.update_votes(votes, recompute=False)
+        conv = conv.update_moderation({'mod_out_tids': [0, 1, 2]}, recompute=False)
+
+        in_conv = conv._get_in_conv_participants()
+        assert 0 in in_conv, "P0 (7 raw votes) should be in-conv"
+        assert 1 in in_conv, "P1 (7 votes on non-moderated-out comments) should be in-conv"
+
+
 # ============================================================================
 # D4 — Pseudocount Formula
 # ============================================================================