Skip to content

Commit aabb481

Browse files
jucorclaude
andcommitted
Fix D2c: use raw_rating_mat for vote counts and n_cmts threshold
Vote counts and n_cmts now come from raw_rating_mat (includes votes on moderated-out comments), matching Clojure's user-vote-counts which reads from raw-rating-mat. Previously, _compute_user_vote_counts and _get_in_conv_participants used the filtered rating_mat, causing participants to drop below the in-conv threshold when their voted comments were moderated-out. Also adds D2d monotonicity tests (T1-T5) guarding the invariant that once a participant qualifies for in-conv, they can never be removed. These pass for free with full recompute from raw_rating_mat; documented that switching to delta vote processing would require persisting in-conv to DynamoDB (see #2358). Tests: 253 passed, 5 skipped, 36 xfailed (0 failures) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 1190bcb commit aabb481

6 files changed

Lines changed: 33795 additions & 42300 deletions

File tree

delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -313,6 +313,39 @@ Will re-record after those are resolved and rebased.
313313
- Local handoff file created at `delphi/docs/HANDOFF_D2_INCREMENTAL_IN_CONV.md` (untracked)
314314
for future investigation of how much in-conv sets differ between blob types
315315

316+
### Session 5 (2026-03-11)
317+
318+
- **D2c fix**: Switched `_compute_user_vote_counts` and `_get_in_conv_participants` to use
319+
`self.raw_rating_mat` instead of `self.rating_mat`. Both vote counts and `n_cmts` now
320+
include votes on moderated-out comments, matching Clojure's `user-vote-counts`
321+
(conversation.clj:217-225) and `n-cmts` (conversation.clj:214-215).
322+
- **D2c tests** (3 in `TestD2cVoteCountSource`):
323+
- `test_vote_count_includes_moderated_out_votes`: 10 comments, 3 moderated-out, count=10
324+
- `test_n_cmts_includes_moderated_out_comments`: verifies threshold uses raw column count,
325+
not filtered; also tests that a participant with 6 raw votes is correctly excluded
326+
- `test_participant_stays_in_conv_after_moderation`: the critical scenario — participant
327+
with 8 votes stays in-conv when 3 comments moderated-out (filtered count drops to 5)
328+
- **D2d monotonicity tests** (5 in `TestD2dInConvMonotonicity`):
329+
- T1: basic monotonicity across batch updates
330+
- T2: survives moderation-out
331+
- T3: worker restart + moderation (key delta-processing guard)
332+
- T4: worker restart, moderation, no new votes
333+
- T5: mixed participants with moderation
334+
- All pass for free with D2c fix (full recompute from `raw_rating_mat`)
335+
- **TDD discipline**: wrote tests first (D2c xfail, D2d no xfail), confirmed D2c red (3
336+
xfailed) and D2d red (T2-T5 failed, T1 passed), applied fix, confirmed all green.
337+
- **Full test suite**: 253 passed, 5 skipped, 36 xfailed, 0 failures (+8 from session 4)
338+
- **No regressions** on public datasets
339+
- Added code comments on `_get_in_conv_participants` documenting the monotonicity design
340+
decision and delta-processing caveat (ref: #2358)
341+
- Updated plan: D2, D2b, D2c, D2d all marked DONE; PR 1bis merged into PR 1
342+
- Updated PR #2421 description with D2c/D2d sections
343+
344+
### What's Next
345+
346+
1. **PR 2 — Fix D4 (Pseudocount)**: `PSEUDO_COUNT = 1.5``2.0` to match Clojure's Beta(2,2) prior.
347+
2. **PR 3 — Fix D9 (Z-score thresholds)**: Switch from two-tailed to one-tailed z-scores.
348+
316349
---
317350

318351
## TDD Discipline

delphi/docs/PLAN_DISCREPANCY_FIXES.md

Lines changed: 20 additions & 42 deletions
Original file line numberDiff line numberDiff line change
@@ -103,28 +103,17 @@ Fixes are ordered by **pipeline execution order**: participant filtering → pro
103103

104104
**File**: `delphi/polismath/conversation/conversation.py`
105105

106-
**Current**: `threshold = 7 + sqrt(n_cmts) * 0.1` → biodiversity (314 comments): threshold=8.8, keeps 428/536.
107-
**Target**: `threshold = min(7, n_cmts)` → threshold=7, more participants qualify.
106+
**Current**: ~~`threshold = 7 + sqrt(n_cmts) * 0.1`~~**DONE**: `threshold = min(7, n_cmts)`.
107+
**D2c**: ~~Vote counts and `n_cmts` from `rating_mat`~~**DONE**: Both use `raw_rating_mat` (includes moderated-out comments). Matches Clojure's `user-vote-counts` (conversation.clj:217-225).
108+
**D2b**: ~~Base clusters sorted by size~~**DONE**: Sort by k-means ID (matches Clojure's `sort-by :id`).
108109

109-
**Additional changes**:
110-
- **D2c — Vote count source**: Use `self.raw_rating_mat` instead of `self.rating_mat` for computing per-participant vote counts. Clojure's `user-vote-counts` (conversation.clj:217-225) counts votes from `raw-rating-mat`, which includes votes on moderated-out comments. Python currently uses the filtered `rating_mat`, which excludes moderated-out columns — undercounting votes and potentially dropping participants below the in-conv threshold. This is a **structural** discrepancy that occurs on every computation (cold-start or incremental), not just with delta updates — because the difference is in *which matrix* the count is derived from. Additionally, `n_cmts` differs: Clojure's `n-cmts` (conversation.clj:214-215) counts columns of `rating-mat` which still includes moderated-out columns (zeroed but present, via `zero-out-columns` in named_matrix.clj:214-228). Python's `n_cmts = len(self.rating_mat.columns)` excludes moderated-out columns entirely (removed by `_apply_moderation`, line 308). This means the `min(7, n_cmts)` threshold itself also differs. Both aspects (vote count source and `n_cmts` source) must use `raw_rating_mat` to match Clojure. (This is related to but distinct from D15 — moderation handling — which tracks the broader zero-out vs remove discrepancy.)
111-
- Add greedy fallback (top-15 voters if <15 qualify)
112-
- Add monotonic persistence (once in, always in)
113-
114-
**Test-first approach**:
115-
1. Add test comparing in-conv participant count/set between Python and Clojure for ALL datasets
116-
2. Add synthetic edge-case tests: tiny conversation (fewer comments than threshold), conversation where greedy fallback triggers
117-
3. **D2c unit test — vote count source**: Synthetic test with 10 comments, 3 moderated-out. Participant P voted on all 10. Verify `_compute_user_vote_counts` returns 10 for P (from `raw_rating_mat`), not 7 (from `rating_mat`). Mark xfail before fix.
118-
4. **D2c unit test — n_cmts threshold**: Same setup. Verify `n_cmts` used in the `min(7, n_cmts)` threshold equals 10 (all comments including moderated-out), not 7 (only non-moderated-out). This matters when a conversation has fewer than 7 non-moderated-out comments but more than 7 total: the threshold would be artificially low in Python, changing who qualifies. Mark xfail before fix.
119-
5. Run tests → expect failure on D2c tests
120-
4. Apply fix
121-
5. Re-run ALL tests — cluster count may change (possibly 3→2 for biodiversity, matching Clojure)
122-
6. **Cluster comparison test (CRITICAL)**: After fixing D2, add a test that compares Python and Clojure cluster assignments at the participant level — number of clusters AND which participants are in which cluster (using Jaccard similarity or exact match). The Python clustering calls sklearn K-means; Clojure uses a two-level approach. If clusters still don't match after the threshold fix, investigate why and potentially create an additional PR (PR 1b) to fix the remaining clustering discrepancy before moving on to repness fixes.
123-
7. Document new baseline in journal
110+
**Remaining (deferred)**:
111+
- Greedy fallback (top-15 voters if <15 qualify) — not needed for current datasets
112+
- Cluster comparison test at participant level — deferred to after repness fixes
124113

125114
---
126115

127-
### PR 1bis: Fix D2d — In-Conv Monotonicity
116+
### PR 1bis: Fix D2d — In-Conv Monotonicity**DONE** (merged into PR 1)
128117

129118
**Related upstream issue**: [compdemocracy/polis#2358](https://github.com/compdemocracy/polis/issues/2358) — "Non-Deterministic K-Means Clustering Due to Worker Restart". That issue focuses on `group-clusterings` not being persisted. In-conv IS persisted in Clojure (see below), but we take a different — and better — approach.
130119

@@ -153,28 +142,17 @@ This must be documented:
153142

154143
**File**: `delphi/polismath/conversation/conversation.py`
155144

156-
**Current**: `_get_in_conv_participants()` (line 1256) recomputes from scratch using `self.rating_mat` (the filtered matrix). With the D2c fix (PR 1), it will use `self.raw_rating_mat`. Since `raw_rating_mat` contains all historical votes, monotonicity is guaranteed without explicit persistence.
157-
158-
**Test-first approach — tests that guard the monotonicity invariant**:
159-
160-
These tests must capture every scenario that would break if someone switched to delta processing without adding persistence. Each test should have a docstring explaining: "This passes today because we do full recompute. If you switch to delta vote processing, you MUST persist in-conv to DynamoDB — see issue #2358."
161-
162-
1. **T1 — Basic monotonicity across updates**: Participant P votes on 7 comments in batch 1 → qualifies. Batch 2 adds new comments but P doesn't vote on them. P's count stays 7. Assert P is still in-conv after batch 2.
163-
164-
2. **T2 — Monotonicity survives moderation-out**: P votes on 7 comments in batch 1 → qualifies. Between batches, 3 of those comments are moderated-out. Batch 2 arrives with new votes from other participants. Assert P is still in-conv (because `raw_rating_mat` still has P's 7 votes).
165-
166-
3. **T3 — Worker restart with moderation**: P votes on 7 comments in batch 1 → qualifies. Destroy the `Conversation` object (simulates worker death). Moderate-out 3 of P's comments. Create a new `Conversation` object from all votes (simulates worker restart with full recompute). Assert P is still in-conv. This is the key test that would FAIL under delta processing without persistence.
167-
168-
4. **T4 — Worker restart with no new votes**: Same as T3 but no new votes arrive after restart — just moderation happened. Rebuild `Conversation` from existing votes. Assert P is still in-conv. (Tests that recompute alone is sufficient, no "trigger" of new votes needed.)
169-
170-
5. **T5 — Vote on comment that is later moderated-out, then new votes arrive**: P votes on c1-c7. c1-c3 are moderated-out. New participant Q votes on c4-c10. Rebuild conversation from all votes. Assert both P (7 votes on raw matrix) and Q (7 votes on non-moderated-out comments) are in-conv.
145+
**DONE**: `_get_in_conv_participants()` uses `self.raw_rating_mat` (D2c fix). Monotonicity is a free consequence. Code comment on the function documents the design decision and the delta-processing caveat.
171146

172-
6. **T6 — Greedy fallback after moderation**: Conversation has 8 comments, 14 participants who each voted on all 8, plus participant P who voted on exactly 7. All qualify. Then 5 comments are moderated-out, leaving only 3 non-moderated-out. After full recompute from `raw_rating_mat`: `n_cmts` = 8 (from `raw_rating_mat`), threshold = `min(7, 8)` = 7. P still has 7 votes → still qualifies. Verify the greedy fallback (top-15) is not erroneously triggered.
147+
**Tests implemented** (T1-T5 in `TestD2dInConvMonotonicity`):
148+
- T1: Basic monotonicity across batch updates
149+
- T2: Survives moderation-out of voted comments
150+
- T3: Worker restart + moderation (key delta-processing guard)
151+
- T4: Worker restart, moderation, no new votes
152+
- T5: Mixed participants with moderation
153+
- T6 (greedy fallback): deferred — greedy fallback not yet implemented
173154

174-
**Documentation requirements**:
175-
- **Code comment** on `_get_in_conv_participants()`: block comment explaining the full-recompute-guarantees-monotonicity design, the delta-processing caveat, and references to Clojure's persistence approach and issue #2358.
176-
- **PR description**: dedicated "Design Decision: In-Conv Monotonicity" section explaining why we chose full recompute over persistence, with the future-proofing warning.
177-
- **Test docstrings**: each test explains what it guards against and what would need to change under delta processing.
155+
All test docstrings explain what would break under delta processing.
178156

179157
---
180158

@@ -426,10 +404,10 @@ By this point, we should have good test coverage from all the per-discrepancy te
426404
|----|-------------|-----|--------|
427405
| D1 | PCA sign flips | PR 13 | Fix (sign consistency) |
428406
| D1b | Projection input | PR 13 | Fix with D1 |
429-
| D2 | In-conv threshold | **PR 1** | Fix |
430-
| D2b | Base-cluster sort order | **PR 1** | Fix (keep k-means ID order, match Clojure sort-by :id) |
431-
| D2c | Vote count source (raw vs filtered matrix) | **PR 1** | Fix (use `raw_rating_mat` for vote counts, so votes on moderated-out comments are included) |
432-
| D2d | In-conv monotonicity (once in, always in) | **PR 1bis** | Full recompute from `raw_rating_mat` guarantees monotonicity without persistence. 6 tests guarding the invariant for future delta-processing refactors. Ref: [#2358](https://github.com/compdemocracy/polis/issues/2358) |
407+
| D2 | In-conv threshold | **PR 1** | **DONE** |
408+
| D2b | Base-cluster sort order | **PR 1** | **DONE** |
409+
| D2c | Vote count source (raw vs filtered matrix) | **PR 1** | **DONE** |
410+
| D2d | In-conv monotonicity (once in, always in) | **PR 1** | **DONE** ✓ (5 guard tests, T1-T5) |
433411
| D3 | K-smoother buffer | PR 10 | Fix |
434412
| D4 | Pseudocount formula | **PR 2** | Fix |
435413
| D5 | Proportion test | PR 4 | Fix |

delphi/polismath/conversation/conversation.py

Lines changed: 23 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1210,25 +1210,31 @@ def _compute_user_vote_counts(self) -> Dict[str, int]:
12101210
"""
12111211
Compute the number of votes per participant.
12121212
1213+
Uses raw_rating_mat (not rating_mat) so that votes on moderated-out
1214+
comments are still counted. This matches Clojure's user-vote-counts
1215+
(conversation.clj:217-225) which reads from raw-rating-mat.
1216+
Fix D2c: see PLAN_DISCREPANCY_FIXES.md.
1217+
12131218
Returns:
12141219
Dictionary mapping participant IDs to vote counts
12151220
"""
12161221
import time
12171222
start_time = time.time()
1218-
logger.info(f"Starting _compute_user_vote_counts for {self.rating_mat.shape[0]} participants")
1223+
mat = self.raw_rating_mat
1224+
logger.info(f"Starting _compute_user_vote_counts for {mat.shape[0]} participants")
12191225

12201226
vote_counts = {}
12211227

12221228
# Use more efficient approach for large datasets
1223-
if self.rating_mat.shape[0] > 1000:
1229+
if mat.shape[0] > 1000:
12241230
# Create a mask of non-nan values across the entire matrix
1225-
non_nan_mask = ~np.isnan(self.rating_mat.values)
1231+
non_nan_mask = ~np.isnan(mat.values)
12261232

12271233
# Sum across rows using vectorized operation
12281234
row_sums = np.sum(non_nan_mask, axis=1)
12291235

12301236
# Convert to dictionary
1231-
for i, pid in enumerate(self.rating_mat.index):
1237+
for i, pid in enumerate(mat.index):
12321238
if i < len(row_sums):
12331239
vote_counts[pid] = int(row_sums[i])
12341240
else:
@@ -1238,9 +1244,9 @@ def _compute_user_vote_counts(self) -> Dict[str, int]:
12381244
logger.info(f"Computed vote counts for {len(vote_counts)} participants using vectorized approach in {time.time() - start_time:.4f}s")
12391245
else:
12401246
# Original approach for smaller datasets
1241-
for i, pid in enumerate(self.rating_mat.index):
1247+
for i, pid in enumerate(mat.index):
12421248
# Get row of votes for this participant
1243-
row = self.rating_mat.values[i, :]
1249+
row = mat.values[i, :]
12441250

12451251
# Count non-nan values
12461252
count = np.sum(~np.isnan(row))
@@ -1261,10 +1267,20 @@ def _get_in_conv_participants(self) -> Set[str]:
12611267
Threshold: participant must have voted on at least min(7, n_comments)
12621268
comments (Clojure parity fix D2).
12631269
1270+
Both vote counts and n_cmts use raw_rating_mat (fix D2c), which includes
1271+
votes on moderated-out comments. This matches Clojure, where
1272+
zero-out-columns keeps moderated-out columns in the matrix (zeroed but
1273+
present). Since raw_rating_mat contains all historical votes and votes
1274+
are immutable in PostgreSQL, monotonicity is guaranteed without explicit
1275+
persistence — a participant who once qualified can never lose votes.
1276+
If the code is ever refactored to use delta vote processing, in-conv
1277+
MUST be persisted to DynamoDB. See compdemocracy/polis#2358 and
1278+
Clojure's approach in conv_man.clj:55, conversation.clj:244.
1279+
12641280
Returns:
12651281
Set of participant IDs that meet the threshold
12661282
"""
1267-
n_cmts = len(self.rating_mat.columns) if hasattr(self.rating_mat, 'columns') else 0
1283+
n_cmts = len(self.raw_rating_mat.columns) if hasattr(self.raw_rating_mat, 'columns') else 0
12681284
threshold = min(7, n_cmts)
12691285

12701286
# Get vote counts for all participants

0 commit comments

Comments
 (0)