[Stack 10/27] Fix D2: in-conv participant threshold + D2c vote count source#2421
[Stack 10/27] Fix D2: in-conv participant threshold + D2c vote count source#2421jucor wants to merge 10 commits into
Conversation
e43b0f9 to
5e2a0de
Compare
fa6075a to
4c2d5a7
Compare
5e2a0de to
33ecc1c
Compare
4c2d5a7 to
f44511a
Compare
33ecc1c to
d7b7c34
Compare
f44511a to
1db230b
Compare
d2b434e to
4220632
Compare
1db230b to
e28e871
Compare
e28e871 to
f4f4aa4
Compare
0e8c100 to
2ef165a
Compare
There was a problem hiding this comment.
Pull request overview
Aligns Python’s in-conv participant selection and clustering inputs with the legacy Clojure implementation (D2/D2b/D2c), and expands the regression test harness to handle multiple Clojure blob variants (incremental vs cold-start) while adding guard tests for monotonicity (D2d).
Changes:
- Update in-conv logic to use
threshold = min(7, n_cmts)and compute bothn_cmts+ vote counts fromraw_rating_mat(includes moderated-out comment votes). - Preserve base-cluster encounter order by keeping k-means IDs (sort-by
id, no size-based reordering / ID reassignment). - Extend test infrastructure to parametrize over blob variants and add D2c/D2d synthetic/guard tests.
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| delphi/polismath/conversation/conversation.py | Implements D2/D2b/D2c logic changes (threshold, raw vote counts, base-cluster order) and adds rationale comments. |
| delphi/polismath/regression/datasets.py | Adds blob_type selection to get_dataset_files() and introduces get_blob_variants() discovery via _is_blob_filled(). |
| delphi/polismath/regression/init.py | Exposes get_blob_variants from the regression package. |
| delphi/tests/conftest.py | Adds use_blobs support to use_discovered_datasets marker and introduces parse_dataset_blob_id(). |
| delphi/tests/test_discrepancy_fixes.py | Converts D2 tests to blob-aware datasets, adds D2c vote-source tests and D2d monotonicity guard tests. |
| delphi/tests/test_legacy_clojure_regression.py | Converts legacy Clojure regression tests to run per dataset+blob variant with caching. |
| delphi/tests/test_legacy_repness_comparison.py | Updates legacy repness comparison to use blob-aware dataset IDs. |
| delphi/tests/test_repness_smoke.py | Removes an xfail on repness structure validation. |
| delphi/tests/test_conversation.py | Updates the in-test comment describing the vote threshold formula. |
| delphi/real_data/r6vbnhffkxbd7ifmfbdrd-vw/golden_snapshot.json | Re-recorded golden snapshot (timestamps, math_tick, and timing stats changed). |
| delphi/docs/PLAN_DISCREPANCY_FIXES.md | Marks D2/D2b/D2c/D2d as DONE and documents the monotonicity design note. |
| delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md | Updates the parity-fix journal with session notes, rationale, and outcomes for D2c/D2d. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 19 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
4d12c2e to
7a05ab7
Compare
aabb481 to
9bf6805
Compare
7a05ab7 to
679694b
Compare
679694b to
1439005
Compare
9bf6805 to
21abf22
Compare
21abf22 to
8a05bd9
Compare
1439005 to
689ed20
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (3)
delphi/polismath/conversation/conversation.py:1291
update_votes()explicitly preserves pid/tid types (they can be ints/strings), and the new synthetic tests also use int pids._compute_user_vote_counts()and_get_in_conv_participants()are annotated asDict[str, int]/Set[str], but they actually return keys of whatever type the matrix index uses. Adjust the type hints (e.g.,Hashable/Any) to match runtime behavior so the annotations remain accurate.
def _compute_user_vote_counts(self) -> Dict[str, int]:
"""
Compute the number of votes per participant.
Uses raw_rating_mat (not rating_mat) so that votes on moderated-out
comments are still counted. This matches Clojure's user-vote-counts
(conversation.clj:217-225) which reads from raw-rating-mat.
Fix D2c: see PLAN_DISCREPANCY_FIXES.md.
Returns:
Dictionary mapping participant IDs to vote counts
"""
import time
start_time = time.time()
mat = self.raw_rating_mat
logger.info(f"Starting _compute_user_vote_counts for {mat.shape[0]} participants")
vote_counts = {}
# Use more efficient approach for large datasets
if mat.shape[0] > 1000:
# Create a mask of non-nan values across the entire matrix
non_nan_mask = ~np.isnan(mat.values)
# Sum across rows using vectorized operation
row_sums = np.sum(non_nan_mask, axis=1)
# Convert to dictionary
for i, pid in enumerate(mat.index):
if i < len(row_sums):
vote_counts[pid] = int(row_sums[i])
else:
# Fallback if dimensions don't match
vote_counts[pid] = 0
logger.info(f"Computed vote counts for {len(vote_counts)} participants using vectorized approach in {time.time() - start_time:.4f}s")
else:
# Original approach for smaller datasets
for i, pid in enumerate(mat.index):
# Get row of votes for this participant
row = mat.values[i, :]
# Count non-nan values
count = np.sum(~np.isnan(row))
# Store count
vote_counts[pid] = int(count)
logger.info(f"Computed vote counts for {len(vote_counts)} participants using original approach in {time.time() - start_time:.4f}s")
return vote_counts
def _get_in_conv_participants(self) -> Set[str]:
"""
Get participants who have voted enough to be included in clustering.
Matches Clojure's in-conv logic from conversation.clj lines 239-266.
Threshold: participant must have voted on at least min(7, n_comments)
comments (Clojure parity fix D2).
Both vote counts and n_cmts use raw_rating_mat (fix D2c), which includes
votes on moderated-out comments. This matches Clojure, where
zero-out-columns keeps moderated-out columns in the matrix (zeroed but
present). Since raw_rating_mat contains all historical votes and votes
are immutable in PostgreSQL, monotonicity is guaranteed without explicit
persistence — a participant who once qualified can never lose votes.
If the code is ever refactored to use delta vote processing, in-conv
MUST be persisted to DynamoDB. See compdemocracy/polis#2358 and
Clojure's approach in conv_man.clj:55, conversation.clj:244.
Returns:
Set of participant IDs that meet the threshold
"""
n_cmts = len(self.raw_rating_mat.columns) if hasattr(self.raw_rating_mat, 'columns') else 0
threshold = min(7, n_cmts)
# Get vote counts for all participants
vote_counts = self._compute_user_vote_counts()
# Filter participants meeting threshold
in_conv = {pid for pid, count in vote_counts.items() if count >= threshold}
delphi/tests/test_discrepancy_fixes.py:669
test_two_prop_test_with_pseudocountscurrently uses (succ1,n1)=(10,20) and (succ2,n2)=(15,30), which both produce p=0.5. That makes both the plain two-proportion z-test and the pseudocount-adjusted version return ~0, so the test will pass even if the pseudocount adjustment is not implemented. Use inputs where p1 != p2 (and ideally where the +1/+2 adjustment materially changes the z-score) so this test actually detects the D6 discrepancy.
def test_two_prop_test_with_pseudocounts(self):
"""two_prop_test should add +1 pseudocounts matching Clojure."""
# With pseudocounts: (succ+1)/(n+2) for both groups
succ1, n1 = 10, 20
succ2, n2 = 15, 30
# Clojure formula adds +1 to successes and +2 to trials
p1_clj = (succ1 + 1) / (n1 + 2)
p2_clj = (succ2 + 1) / (n2 + 2)
p_pooled_clj = (succ1 + succ2 + 2) / (n1 + n2 + 4)
se_clj = math.sqrt(p_pooled_clj * (1 - p_pooled_clj) * (1 / (n1 + 2) + 1 / (n2 + 2)))
expected = (p1_clj - p2_clj) / se_clj if se_clj > 0 else 0.0
# Python currently doesn't add pseudocounts
p1_py = succ1 / n1
p2_py = succ2 / n2
python_result = two_prop_test(p1_py, n1, p2_py, n2)
print(f"two_prop_test: Python={python_result:.4f}, Clojure(with pseudocounts)={expected:.4f}")
check.almost_equal(python_result, expected, abs=0.01,
msg=f"two_prop_test should include pseudocounts: Python={python_result:.4f}, expected={expected:.4f}")
delphi/tests/test_discrepancy_fixes.py:575
- The docstring says this asserts repness is non-empty "with correct thresholds", but the test doesn't validate the Z-score thresholds (and D9 is still tracked via the z90/z95 xfail tests). As written, this can pass even when thresholds are still wrong, which makes it misleading as a D9 regression signal. Consider either tying this assertion to the threshold constants (or to Clojure reference output), or moving it out of the D9 section / rewording to avoid implying it validates the thresholds.
def test_repness_not_empty(self, conv, dataset_name):
"""Repness should produce non-empty comment_repness with correct thresholds."""
repness = conv.repness
check.is_not_none(repness, "Repness should not be None")
if repness:
check.is_in('comment_repness', repness, "Should have comment_repness key")
if 'comment_repness' in repness:
check.greater(len(repness['comment_repness']), 0,
"comment_repness should not be empty")
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
8a05bd9 to
adac3bb
Compare
Change threshold from 7+sqrt(n_cmts)*0.1 to min(7, n_cmts) in _get_in_conv_participants(). This matches Clojure's conversation.clj and includes more participants in clustering for large conversations. Verified on vw, biodiversity, FLI, bg2018 (all datasets with complete Clojure blobs). 3 private datasets have incomplete blobs (no in-conv data) — blob regeneration delegated to separate task. Re-recorded golden snapshot for biodiversity (428→441 participants). Updated journal with TDD discipline, golden snapshot policy, and pipelined worktree workflow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three large datasets have incomplete cold-start Clojure blobs (4 keys instead of 23, missing in-conv data). D2 tests now skip gracefully on these datasets instead of failing with misleading empty-set comparisons. The incomplete blobs are caused by Clojure's power-iteration PCA being too slow for large matrices (30-75M cells). These datasets will be validated later when we build the incremental replay infrastructure. Verified: 8 passed, 4 skipped (incomplete blobs), 2 errors (pre-existing duplicate vote files in engage dataset) on full dataset suite. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…omment - D2b: Remove sort-by-size and ID reassignment of base clusters; keep k-means ID order to match Clojure's (sort-by :id). The size-sort changed encounter order of centers fed into group-level k-means. - Update stale comment in test_conversation.py referencing old threshold formula (7 + sqrt(n_cmts) * 0.1) to current min(7, n_cmts). - Add D2b to plan checklist and journal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document xpassed tests after D2 fix: D6 two_prop_test (1) and D9 repness_not_empty (6) — both strict=False, expected. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ix plan D2c: Both vote counts and n_cmts must use raw_rating_mat (includes moderated-out columns), not rating_mat. This is structural, not related to delta vs full processing. Two xfail unit tests planned. D2d: Full recompute from raw_rating_mat guarantees monotonicity without needing to persist in-conv to DynamoDB (unlike Clojure which persists because it uses delta vote processing). 6 tests guard this invariant for future delta-processing refactors. Ref: #2358. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ity) - D2c: structural discrepancy in vote counts and n_cmts — both must use raw_rating_mat to match Clojure's zero-out-columns behavior - D2d: full recompute guarantees monotonicity without persistence, unlike Clojure which persists because it uses delta processing - Corrected earlier "not needed" note about raw_rating_mat - Updated What's Next: D2c → D2d → D4 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Re-record golden snapshots for biodiversity and vw datasets. Remove xfail markers from 3 tests that now pass: D6 pseudocounts (test_two_prop_test_with_pseudocounts), D9 empty repness (test_repness_not_empty), and repness structure validation (test_repness_structure). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
D2 behaviour matches on cold-start; incremental deferred to future PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Session 4: dual-blob test infrastructure, D2 incremental xfail with rationale, golden snapshot re-recording, 245 passed / 0 failures - Incomplete Clojure blobs: BLOCKING → RESOLVED (test both blob types instead of regenerating) - Plan: add D2 in-conv incremental matching to Replay PR B (early participants admitted when n_cmts < 7, all 4 datasets with both blob types exhibit it) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Vote counts and n_cmts now come from raw_rating_mat (includes votes on moderated-out comments), matching Clojure's user-vote-counts which reads from raw-rating-mat. Previously, _compute_user_vote_counts and _get_in_conv_participants used the filtered rating_mat, causing participants to drop below the in-conv threshold when their voted comments were moderated-out. Also adds D2d monotonicity tests (T1-T5) guarding the invariant that once a participant qualifies for in-conv, they can never be removed. These pass for free with full recompute from raw_rating_mat; documented that switching to delta vote processing would require persisting in-conv to DynamoDB (see #2358). Tests: 253 passed, 5 skipped, 36 xfailed (0 failures) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
adac3bb to
d213a79
Compare
35e84d5 to
9585ffa
Compare
Delphi Coverage Report
|
|
Superseded by spr-managed PR stack. See the new stack starting at #2508. |
Summary
Fixes the in-conv participant threshold (D2), vote count source (D2c), and base-cluster sort order (D2b) to match Clojure. Adds monotonicity guard tests (D2d).
D2: In-conv threshold
threshold = 7 + sqrt(n_cmts) * 0.1— increasingly restrictive for larger conversations (e.g., 8.8 for biodiversity's 314 comments)threshold = min(7, n_cmts)— matches Clojure exactlyD2b: Base-cluster sort order (from Copilot review)
(sort-by :id ...)D2c: Vote count source (raw vs filtered matrix)
_compute_user_vote_countsandn_cmtsusedself.rating_mat(filtered — moderated-out comment columns removed). A participant who voted on 8 comments could drop to 5 visible votes after 3 comments were moderated-out, falling below threshold.self.raw_rating_mat(includes all votes, even on moderated-out comments), matching Clojure'suser-vote-counts(conversation.clj:217-225) which reads fromraw-rating-mat.D2d: In-conv monotonicity (design decision)
Python does full recompute from
raw_rating_matevery time, so monotonicity ("once in, always in") is guaranteed without persistence — votes are immutable in PostgreSQL, so a participant's count never decreases. This is strictly better than Clojure's approach (which persists in-conv tomath_mainbecause it uses delta vote processing).5 guard tests (T1-T5) document this invariant and warn that switching to delta processing would require persisting in-conv to DynamoDB (ref: #2358).
Impact
Incremental vs cold-start blob testing
D2 tests run against both cold-start and incremental Clojure blobs (infrastructure from #2420):
min(7, n_cmts)is evaluated once with the finaln_cmts. Python matches these exactly.n_cmts, admitting a few extra participants during earlier iterations. The difference is tiny (1–2 participants).D2 tests on incremental blobs are currently xfailed with an explanatory comment. Matching incremental behaviour exactly would require simulating the progressive threshold — tracked as future work under Replay Infrastructure.
Test results
Test plan
🤖 Generated with Claude Code