Fix D2c: use raw_rating_mat for vote counts and n_cmts threshold

jucor · claude · jucor · commit 2ef165aa7be5 · 2026-03-11T10:03:47.000Z
Vote counts and n_cmts now come from raw_rating_mat (includes votes on moderated-out comments), matching Clojure's user-vote-counts which reads from raw-rating-mat. Previously, _compute_user_vote_counts and _get_in_conv_participants used the filtered rating_mat, causing participants to drop below the in-conv threshold when their voted comments were moderated-out.

Also adds D2d monotonicity tests (T1-T5) guarding the invariant that once a participant qualifies for in-conv, they can never be removed. These pass for free with full recompute from raw_rating_mat; documented that switching to delta vote processing would require persisting in-conv to DynamoDB (see #2358).

Tests: 253 passed, 5 skipped, 36 xfailed (0 failures)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
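
For readers skimming the diff, the behavioral change can be illustrated with a small standalone sketch. It is not the code in `conversation.py`: plain dicts stand in for the DataFrame-backed `raw_rating_mat` and `rating_mat`, and the `in_conv` helper, the participant IDs and the comment IDs are hypothetical. It mirrors the scenario from `test_participant_stays_in_conv_after_moderation` (8 votes, 3 comments moderated-out).

```python
from typing import Dict, Iterable, Set

# Hypothetical toy data: participant -> {comment id -> vote}.
# These dicts stand in for the matrices used in conversation.py.
comments = [f"c{i}" for i in range(1, 11)]
raw_votes: Dict[str, Dict[str, int]] = {
    "P": {c: 1 for c in comments[:8]},    # P voted on c1..c8 (8 votes)
    "Q": {c: -1 for c in comments[3:]},   # Q voted on c4..c10 (7 votes)
}
moderated_out = {"c1", "c2", "c3"}


def in_conv(votes: Dict[str, Dict[str, int]], columns: Iterable[str]) -> Set[str]:
    """Participants with at least min(7, n_cmts) votes on the given columns."""
    cols = set(columns)
    threshold = min(7, len(cols))
    return {
        pid
        for pid, row in votes.items()
        if sum(1 for c in row if c in cols) >= threshold
    }


# Before the fix: counts and n_cmts come from the filtered columns only.
filtered_columns = [c for c in comments if c not in moderated_out]
before_fix = in_conv(raw_votes, filtered_columns)

# After the fix: counts and n_cmts come from every column, moderated-out included.
after_fix = in_conv(raw_votes, comments)
```

With the filtered columns, P keeps only 5 countable votes and drops out while Q stays; with the raw columns, both P and Q meet the threshold of 7.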
diff --git a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
@@ -313,6 +313,39 @@ Will re-record after those are resolved and rebased.
 - Local handoff file created at `delphi/docs/HANDOFF_D2_INCREMENTAL_IN_CONV.md` (untracked)
   for future investigation of how much in-conv sets differ between blob types
 
+### Session 5 (2026-03-11)
+
+- **D2c fix**: Switched `_compute_user_vote_counts` and `_get_in_conv_participants` to use
+  `self.raw_rating_mat` instead of `self.rating_mat`. Both vote counts and `n_cmts` now
+  include votes on moderated-out comments, matching Clojure's `user-vote-counts`
+  (conversation.clj:217-225) and `n-cmts` (conversation.clj:214-215).
+- **D2c tests** (3 in `TestD2cVoteCountSource`):
+  - `test_vote_count_includes_moderated_out_votes`: 10 comments, 3 moderated-out, count=10
+  - `test_n_cmts_includes_moderated_out_comments`: verifies threshold uses raw column count,
+    not filtered; also tests that a participant with 6 raw votes is correctly excluded
+  - `test_participant_stays_in_conv_after_moderation`: the critical scenario — participant
+    with 8 votes stays in-conv when 3 comments moderated-out (filtered count drops to 5)
+- **D2d monotonicity tests** (5 in `TestD2dInConvMonotonicity`):
+  - T1: basic monotonicity across batch updates
+  - T2: survives moderation-out
+  - T3: worker restart + moderation (key delta-processing guard)
+  - T4: worker restart, moderation, no new votes
+  - T5: mixed participants with moderation
+  - All pass for free with D2c fix (full recompute from `raw_rating_mat`)
+- **TDD discipline**: wrote tests first (D2c xfail, D2d no xfail), confirmed D2c red (3
+  xfailed) and D2d red (T2-T5 failed, T1 passed), applied fix, confirmed all green.
+- **Full test suite**: 253 passed, 5 skipped, 36 xfailed, 0 failures (+8 from session 4)
+- **No regressions** on public datasets
+- Added code comments on `_get_in_conv_participants` documenting the monotonicity design
+  decision and delta-processing caveat (ref: #2358)
+- Updated plan: D2, D2b, D2c, D2d all marked DONE; PR 1bis merged into PR 1
+- Updated PR #2421 description with D2c/D2d sections
+
+### What's Next
+
+1. **PR 2 — Fix D4 (Pseudocount)**: `PSEUDO_COUNT = 1.5` → `2.0` to match Clojure's Beta(2,2) prior.
+2. **PR 3 — Fix D9 (Z-score thresholds)**: Switch from two-tailed to one-tailed z-scores.
+
 ---
 
 ## TDD Discipline
diff --git a/delphi/docs/PLAN_DISCREPANCY_FIXES.md b/delphi/docs/PLAN_DISCREPANCY_FIXES.md
@@ -103,28 +103,17 @@ Fixes are ordered by **pipeline execution order**: participant filtering → pro
 
 **File**: `delphi/polismath/conversation/conversation.py`
 
-**Current**: `threshold = 7 + sqrt(n_cmts) * 0.1` → biodiversity (314 comments): threshold=8.8, keeps 428/536.
-**Target**: `threshold = min(7, n_cmts)` → threshold=7, more participants qualify.
+**Current**: ~~`threshold = 7 + sqrt(n_cmts) * 0.1`~~ → **DONE**: `threshold = min(7, n_cmts)`.
+**D2c**: ~~Vote counts and `n_cmts` from `rating_mat`~~ → **DONE**: Both use `raw_rating_mat` (includes moderated-out comments). Matches Clojure's `user-vote-counts` (conversation.clj:217-225).
+**D2b**: ~~Base clusters sorted by size~~ → **DONE**: Sort by k-means ID (matches Clojure's `sort-by :id`).
 
-**Additional changes**:
-- **D2c — Vote count source**: Use `self.raw_rating_mat` instead of `self.rating_mat` for computing per-participant vote counts. Clojure's `user-vote-counts` (conversation.clj:217-225) counts votes from `raw-rating-mat`, which includes votes on moderated-out comments. Python currently uses the filtered `rating_mat`, which excludes moderated-out columns — undercounting votes and potentially dropping participants below the in-conv threshold. This is a **structural** discrepancy that occurs on every computation (cold-start or incremental), not just with delta updates — because the difference is in *which matrix* the count is derived from. Additionally, `n_cmts` differs: Clojure's `n-cmts` (conversation.clj:214-215) counts columns of `rating-mat` which still includes moderated-out columns (zeroed but present, via `zero-out-columns` in named_matrix.clj:214-228). Python's `n_cmts = len(self.rating_mat.columns)` excludes moderated-out columns entirely (removed by `_apply_moderation`, line 308). This means the `min(7, n_cmts)` threshold itself also differs. Both aspects (vote count source and `n_cmts` source) must use `raw_rating_mat` to match Clojure. (This is related to but distinct from D15 — moderation handling — which tracks the broader zero-out vs remove discrepancy.)
-- Add greedy fallback (top-15 voters if <15 qualify)
-- Add monotonic persistence (once in, always in)
-
-**Test-first approach**:
-1. Add test comparing in-conv participant count/set between Python and Clojure for ALL datasets
-2. Add synthetic edge-case tests: tiny conversation (fewer comments than threshold), conversation where greedy fallback triggers
-3. **D2c unit test — vote count source**: Synthetic test with 10 comments, 3 moderated-out. Participant P voted on all 10. Verify `_compute_user_vote_counts` returns 10 for P (from `raw_rating_mat`), not 7 (from `rating_mat`). Mark xfail before fix.
-4. **D2c unit test — n_cmts threshold**: Same setup. Verify `n_cmts` used in the `min(7, n_cmts)` threshold equals 10 (all comments including moderated-out), not 7 (only non-moderated-out). This matters when a conversation has fewer than 7 non-moderated-out comments but more than 7 total: the threshold would be artificially low in Python, changing who qualifies. Mark xfail before fix.
-5. Run tests → expect failure on D2c tests
-4. Apply fix
-5. Re-run ALL tests — cluster count may change (possibly 3→2 for biodiversity, matching Clojure)
-6. **Cluster comparison test (CRITICAL)**: After fixing D2, add a test that compares Python and Clojure cluster assignments at the participant level — number of clusters AND which participants are in which cluster (using Jaccard similarity or exact match). The Python clustering calls sklearn K-means; Clojure uses a two-level approach. If clusters still don't match after the threshold fix, investigate why and potentially create an additional PR (PR 1b) to fix the remaining clustering discrepancy before moving on to repness fixes.
-7. Document new baseline in journal
+**Remaining (deferred)**:
+- Greedy fallback (top-15 voters if <15 qualify) — not needed for current datasets
+- Cluster comparison test at participant level — deferred to after repness fixes
 
 ---
 
-### PR 1bis: Fix D2d — In-Conv Monotonicity
+### PR 1bis: Fix D2d — In-Conv Monotonicity — **DONE** (merged into PR 1)
 
 **Related upstream issue**: [compdemocracy/polis#2358](https://github.com/compdemocracy/polis/issues/2358) — "Non-Deterministic K-Means Clustering Due to Worker Restart". That issue focuses on `group-clusterings` not being persisted. In-conv IS persisted in Clojure (see below), but we take a different — and better — approach.
 
@@ -153,28 +142,17 @@ This must be documented:
 
 **File**: `delphi/polismath/conversation/conversation.py`
 
-**Current**: `_get_in_conv_participants()` (line 1256) recomputes from scratch using `self.rating_mat` (the filtered matrix). With the D2c fix (PR 1), it will use `self.raw_rating_mat`. Since `raw_rating_mat` contains all historical votes, monotonicity is guaranteed without explicit persistence.
-
-**Test-first approach — tests that guard the monotonicity invariant**:
-
-These tests must capture every scenario that would break if someone switched to delta processing without adding persistence. Each test should have a docstring explaining: "This passes today because we do full recompute. If you switch to delta vote processing, you MUST persist in-conv to DynamoDB — see issue #2358."
-
-1. **T1 — Basic monotonicity across updates**: Participant P votes on 7 comments in batch 1 → qualifies. Batch 2 adds new comments but P doesn't vote on them. P's count stays 7. Assert P is still in-conv after batch 2.
-
-2. **T2 — Monotonicity survives moderation-out**: P votes on 7 comments in batch 1 → qualifies. Between batches, 3 of those comments are moderated-out. Batch 2 arrives with new votes from other participants. Assert P is still in-conv (because `raw_rating_mat` still has P's 7 votes).
-
-3. **T3 — Worker restart with moderation**: P votes on 7 comments in batch 1 → qualifies. Destroy the `Conversation` object (simulates worker death). Moderate-out 3 of P's comments. Create a new `Conversation` object from all votes (simulates worker restart with full recompute). Assert P is still in-conv. This is the key test that would FAIL under delta processing without persistence.
-
-4. **T4 — Worker restart with no new votes**: Same as T3 but no new votes arrive after restart — just moderation happened. Rebuild `Conversation` from existing votes. Assert P is still in-conv. (Tests that recompute alone is sufficient, no "trigger" of new votes needed.)
-
-5. **T5 — Vote on comment that is later moderated-out, then new votes arrive**: P votes on c1-c7. c1-c3 are moderated-out. New participant Q votes on c4-c10. Rebuild conversation from all votes. Assert both P (7 votes on raw matrix) and Q (7 votes on non-moderated-out comments) are in-conv.
+**DONE**: `_get_in_conv_participants()` uses `self.raw_rating_mat` (D2c fix). Monotonicity is a free consequence. Code comment on the function documents the design decision and the delta-processing caveat.
 
-6. **T6 — Greedy fallback after moderation**: Conversation has 8 comments, 14 participants who each voted on all 8, plus participant P who voted on exactly 7. All qualify. Then 5 comments are moderated-out, leaving only 3 non-moderated-out. After full recompute from `raw_rating_mat`: `n_cmts` = 8 (from `raw_rating_mat`), threshold = `min(7, 8)` = 7. P still has 7 votes → still qualifies. Verify the greedy fallback (top-15) is not erroneously triggered.
+**Tests implemented** (T1-T5 in `TestD2dInConvMonotonicity`):
+- T1: Basic monotonicity across batch updates
+- T2: Survives moderation-out of voted comments
+- T3: Worker restart + moderation (key delta-processing guard)
+- T4: Worker restart, moderation, no new votes
+- T5: Mixed participants with moderation
+- T6 (greedy fallback): deferred — greedy fallback not yet implemented
 
-**Documentation requirements**:
-- **Code comment** on `_get_in_conv_participants()`: block comment explaining the full-recompute-guarantees-monotonicity design, the delta-processing caveat, and references to Clojure's persistence approach and issue #2358.
-- **PR description**: dedicated "Design Decision: In-Conv Monotonicity" section explaining why we chose full recompute over persistence, with the future-proofing warning.
-- **Test docstrings**: each test explains what it guards against and what would need to change under delta processing.
+All test docstrings explain what would break under delta processing.
 
 ---
 
@@ -426,10 +404,10 @@ By this point, we should have good test coverage from all the per-discrepancy te
 |----|-------------|-----|--------|
 | D1 | PCA sign flips | PR 13 | Fix (sign consistency) |
 | D1b | Projection input | PR 13 | Fix with D1 |
-| D2 | In-conv threshold | **PR 1** | Fix |
-| D2b | Base-cluster sort order | **PR 1** | Fix (keep k-means ID order, match Clojure sort-by :id) |
-| D2c | Vote count source (raw vs filtered matrix) | **PR 1** | Fix (use `raw_rating_mat` for vote counts, so votes on moderated-out comments are included) |
-| D2d | In-conv monotonicity (once in, always in) | **PR 1bis** | Full recompute from `raw_rating_mat` guarantees monotonicity without persistence. 6 tests guarding the invariant for future delta-processing refactors. Ref: [#2358](https://github.com/compdemocracy/polis/issues/2358) |
+| D2 | In-conv threshold | **PR 1** | **DONE** ✓ |
+| D2b | Base-cluster sort order | **PR 1** | **DONE** ✓ |
+| D2c | Vote count source (raw vs filtered matrix) | **PR 1** | **DONE** ✓ |
+| D2d | In-conv monotonicity (once in, always in) | **PR 1** | **DONE** ✓ (5 guard tests, T1-T5) |
 | D3 | K-smoother buffer | PR 10 | Fix |
 | D4 | Pseudocount formula | **PR 2** | Fix |
 | D5 | Proportion test | PR 4 | Fix |
diff --git a/delphi/polismath/conversation/conversation.py b/delphi/polismath/conversation/conversation.py
@@ -1211,25 +1211,31 @@ def _compute_user_vote_counts(self) -> Dict[str, int]:
         """
         Compute the number of votes per participant.
 
+        Uses raw_rating_mat (not rating_mat) so that votes on moderated-out
+        comments are still counted. This matches Clojure's user-vote-counts
+        (conversation.clj:217-225) which reads from raw-rating-mat.
+        Fix D2c: see PLAN_DISCREPANCY_FIXES.md.
+
         Returns:
             Dictionary mapping participant IDs to vote counts
         """
         import time
         start_time = time.time()
-        logger.info(f"Starting _compute_user_vote_counts for {self.rating_mat.shape[0]} participants")
+        mat = self.raw_rating_mat
+        logger.info(f"Starting _compute_user_vote_counts for {mat.shape[0]} participants")
 
         vote_counts = {}
 
         # Use more efficient approach for large datasets
-        if self.rating_mat.shape[0] > 1000:
+        if mat.shape[0] > 1000:
             # Create a mask of non-nan values across the entire matrix
-            non_nan_mask = ~np.isnan(self.rating_mat.values)
+            non_nan_mask = ~np.isnan(mat.values)
 
             # Sum across rows using vectorized operation
             row_sums = np.sum(non_nan_mask, axis=1)
 
             # Convert to dictionary
-            for i, pid in enumerate(self.rating_mat.index):
+            for i, pid in enumerate(mat.index):
                 if i < len(row_sums):
                     vote_counts[pid] = int(row_sums[i])
                 else:
@@ -1239,9 +1245,9 @@ def _compute_user_vote_counts(self) -> Dict[str, int]:
             logger.info(f"Computed vote counts for {len(vote_counts)} participants using vectorized approach in {time.time() - start_time:.4f}s")
         else:
             # Original approach for smaller datasets
-            for i, pid in enumerate(self.rating_mat.index):
+            for i, pid in enumerate(mat.index):
                 # Get row of votes for this participant
-                row = self.rating_mat.values[i, :]
+                row = mat.values[i, :]
 
                 # Count non-nan values
                 count = np.sum(~np.isnan(row))
@@ -1262,10 +1268,20 @@ def _get_in_conv_participants(self) -> Set[str]:
         Threshold: participant must have voted on at least min(7, n_comments)
         comments (Clojure parity fix D2).
 
+        Both vote counts and n_cmts use raw_rating_mat (fix D2c), which includes
+        votes on moderated-out comments. This matches Clojure, where
+        zero-out-columns keeps moderated-out columns in the matrix (zeroed but
+        present). Since raw_rating_mat contains all historical votes and votes
+        are immutable in PostgreSQL, monotonicity is guaranteed without explicit
+        persistence — a participant who once qualified can never lose votes.
+        If the code is ever refactored to use delta vote processing, in-conv
+        MUST be persisted to DynamoDB. See compdemocracy/polis#2358 and
+        Clojure's approach in conv_man.clj:55, conversation.clj:244.
+
         Returns:
             Set of participant IDs that meet the threshold
         """
-        n_cmts = len(self.rating_mat.columns) if hasattr(self.rating_mat, 'columns') else 0
+        n_cmts = len(self.raw_rating_mat.columns) if hasattr(self.raw_rating_mat, 'columns') else 0
         threshold = min(7, n_cmts)
 
         # Get vote counts for all participants
diff --git a/delphi/tests/test_discrepancy_fixes.py b/delphi/tests/test_discrepancy_fixes.py