[Clj parity PR 1] Fix D2: in-conv participant threshold by jucor · Pull Request #2408 · compdemocracy/polis

jucor · 2026-03-03T15:19:02Z

Summary

Stacked on #2401 (per-discrepancy test infrastructure). Please review and merge #2401 first.

To see only the diff added by this PR (3 commits): compare series-of-fixes...clj-parity-d2-fix

Fixes the in-conv participant threshold (D2 discrepancy) and base-cluster sort order (D2b) to match Clojure.

D2: In-conv threshold

Before: threshold = 7 + sqrt(n_cmts) * 0.1 — increasingly restrictive for larger conversations (e.g., 8.8 for biodiversity's 314 comments)
After: threshold = min(7, n_cmts) — matches Clojure exactly

D2b: Base-cluster sort order (from Copilot review)

Before: Base clusters sorted by size (descending) with IDs reassigned — changes encounter order of centers fed into group-level k-means
After: Keep k-means ID order, matching Clojure's (sort-by :id ...)

Impact

biodiversity: 428 → 441 in-conv participants (now matches Clojure)
Verified on 4 datasets with complete Clojure cold-start blobs: vw, biodiversity, and 2 private datasets

Incomplete Clojure blobs

3 private datasets have incomplete Clojure cold-start blobs (4 keys instead of 23, missing in-conv data). This is caused by Clojure's power-iteration PCA being too slow on large matrices (30-75M cells). D2 tests now skip gracefully on these datasets. They will be validated later when we build the incremental replay infrastructure (needed for D3/D1 anyway).

Test results

8 passed, 4 skipped (incomplete blobs), 2 errors (pre-existing duplicate vote files in one dataset)

Test plan

D2 tests pass on all datasets with complete Clojure blobs
D2 tests skip gracefully on datasets with incomplete blobs
Full test suite shows no regressions vs baseline
Copilot review comments addressed (D2b sort order, stale test comment)
Golden snapshots re-recorded for affected datasets

🤖 Generated with Claude Code

Copilot

Pull request overview

Updates the Delphi math pipeline and test/tooling to better align Python outputs with the Clojure reference implementation, including the D2 “in-conv” participant vote threshold adjustment.

Changes:

Update “in-conv” participant threshold logic (D2) and expand clustering to a two-level (base clusters → group clusters) structure with unfolded groups for downstream computations.
Rework test dataset parametrization to support dataset selection (--datasets) and parallel test grouping (xdist_group), and adjust smoke/regression tests accordingly.
Add/refresh regression and Clojure comparison tooling (new CLI scripts/options), dataset blob handling (cold-start blobs), and supporting documentation.

Reviewed changes

Copilot reviewed 51 out of 55 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
delphi/polismath/conversation/conversation.py	Implements in-conv threshold change and two-level clustering; updates repness/group-vote computations to use unfolded group membership.
delphi/polismath/pca_kmeans_rep/clusters.py	Aligns initialization with Clojure-style “first k distinct points” and adds sklearn-backed helpers.
delphi/polismath/regression/datasets.py	Adds cold-start blob detection and prefers cold-start blobs when available.
delphi/polismath/regression/init.py	Re-exports Clojure comparer utilities from the regression package.
delphi/scripts/regression_comparer.py	Adds CLI options (include-local, PCA sign-flip tolerance, outlier tolerances) and passes them into the comparer.
delphi/scripts/regression_recorder.py	Adds `--include-local` support when recording golden snapshots.
delphi/scripts/clojure_comparer.py	New CLI for comparing Python outputs against Clojure math blobs.
delphi/pyproject.toml	Adds click dependency; configures pytest-xdist loadgroup and adds marker metadata.
delphi/tests/conftest.py	Adds `--datasets` filtering, xdist grouping helpers, and collection-time deselection logic.
delphi/tests/test_regression.py	Enables PCA sign-flip tolerance in regression comparisons; adds targeted unit tests around tolerance behavior.
delphi/tests/test_repness_smoke.py	Switches dataset parametrization and uses unfolded group clusters when calling repness functions.
delphi/tests/test_legacy_repness_comparison.py	Switches dataset parametrization for legacy comparison tests.
delphi/tests/test_conversation_smoke.py	Switches dataset parametrization to xdist-grouped params.
delphi/tests/test_pipeline_integrity.py	Switches dataset parametrization to xdist-grouped params.
delphi/tests/test_pca_smoke.py	Switches dataset parametrization to xdist-grouped params.
delphi/tests/test_pca_unit.py	Removes tests for removed PCA helpers; adds NaN-handling and no-vote participant coverage for `pca_project_dataframe`.
delphi/tests/test_pca_edge_cases.py	Simplifies edge-case coverage to focus on `pca_project_dataframe`.
delphi/tests/test_edge_cases.py	Updates expected repness output for insufficient-data scenario.
delphi/tests/test_datasets.py	Updates `DatasetInfo` construction for new cold-start blob field and expected file checks.
delphi/tests/test_clusters.py	Updates expectations for `init_clusters()` behavior when `k > n_points`.
delphi/tests/test_math_pipeline_runs_e2e.py	Updates mock data directory name to match committed dataset directory.
delphi/tests/README.md	Documents new pytest CLI options and parallel execution behavior.
delphi/docs/CLOJURE_TWO_LEVEL_CLUSTERING.md	New documentation for Clojure’s hierarchical clustering architecture.
delphi/docs/SUBGROUP_CLUSTERING_THIRD_LEVEL.md	New documentation describing Clojure’s (currently unimplemented) subgroup clustering level.
delphi/docs/CLOJURE_COMPARISON.md	New/expanded documentation for Clojure comparison tests and tooling.
delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md	New journal tracking parity work and discrepancy status.
delphi/polismath/benchmarks/bench_pca.py	New PCA benchmarking script.
delphi/CLAUDE.md	Updates testing/regression/Clojure-comparison guidance.
delphi/tests/test_legacy_clojure_output.py	Removes legacy standalone analysis script.
delphi/tests/test_golden_data.py	Removes old xfail golden comparison test.
delphi/tests/legacy_compare_with_clojure.py	Removes legacy direct comparison script.
deep-analysis-for-julien/01-overview-and-architecture.md	Adds deep-dive architecture analysis docs.
deep-analysis-for-julien/02-pca-analysis.md	Adds PCA discrepancy analysis.
deep-analysis-for-julien/03-clustering-analysis.md	Adds clustering discrepancy analysis.
deep-analysis-for-julien/04-repness-analysis.md	Adds repness discrepancy analysis.
deep-analysis-for-julien/05-participant-filtering.md	Adds participant filtering / priority analysis.
deep-analysis-for-julien/06-comment-routing.md	Adds comment routing analysis.
deep-analysis-for-julien/07-discrepancies.md	Adds consolidated discrepancy list.
deep-analysis-for-julien/08-dead-code.md	Adds dead-code inventory.
deep-analysis-for-julien/09-fix-plan.md	Adds fix plan for parity work.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-03T15:52:33Z

+        # Add clusters data in hierarchical format
+        # Base clusters (participants → ~100 clusters) in folded format
+        result['base-clusters'] = self._fold_base_clusters(self.base_clusters)
+
+        # Group clusters (base clusters → 2-5 groups)
+        # Members are base cluster IDs, not participant IDs
+        result['group-clusters'] = self.group_clusters
+
+        # Legacy field for backward compatibility (remove after full migration)
        result['group_clusters'] = self.group_clusters


to_dict() now sets result['base-clusters'] to the folded base-cluster structure, but later in the same method it overwrites base-clusters with self.group_clusters (so consumers will never actually see base clusters). This breaks the intended hierarchical output format and can also confuse downstream code that expects folded base-clusters. Suggest removing the later overwrite and ensuring base-clusters always contains the folded base clusters (and if you need the legacy behavior, keep it under a separate legacy key).

Copilot · 2026-03-03T15:52:33Z

+        # Sort base clusters by size (descending) for consistency
+        base_clusters.sort(key=lambda c: len(c['members']), reverse=True)
+        # Reassign IDs based on sorted order
+        for i, cluster in enumerate(base_clusters):
+            cluster['id'] = i


Base clusters are being sorted by size and then IDs are reassigned. Because group-cluster k-means initialization uses “first k distinct points in encounter order”, this size-sort changes the encounter order of base-cluster centers and can change the resulting group clustering (and likely diverge from Clojure’s sort-by :id behavior). If parity is the goal, keep base clusters in deterministic ID order (and avoid reassigning IDs based on size) before running the second-level clustering.

Suggested change

# Sort base clusters by size (descending) for consistency

base_clusters.sort(key=lambda c: len(c['members']), reverse=True)

# Reassign IDs based on sorted order

for i, cluster in enumerate(base_clusters):

cluster['id'] = i

# NOTE:

# Keep base clusters in deterministic ID (encounter) order.

# Avoid sorting by size and reassigning IDs, since that would

# change the encounter order of centers used in group clustering.

# If parity with a Clojure implementation using (sort-by :id) is

# desired, this existing ID assignment and order already matches.

Copilot · 2026-03-03T15:52:33Z

+        # Create two distinct opinion groups with enough votes per participant
+        # to meet the vote threshold (7 + sqrt(n_comments) * 0.1)


The comment in this test still references the old in-conv vote threshold formula 7 + sqrt(n_comments) * 0.1, but the implementation has been changed to min(7, n_comments). Please update the comment so the test documents the correct behavior it relies on.

Change threshold from 7+sqrt(n_cmts)*0.1 to min(7, n_cmts) in _get_in_conv_participants(). This matches Clojure's conversation.clj and includes more participants in clustering for large conversations. Verified on vw, biodiversity, FLI, bg2018 (all datasets with complete Clojure blobs). 3 private datasets have incomplete blobs (no in-conv data) — blob regeneration delegated to separate task. Re-recorded golden snapshot for biodiversity (428→441 participants). Updated journal with TDD discipline, golden snapshot policy, and pipelined worktree workflow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Three large datasets have incomplete cold-start Clojure blobs (4 keys instead of 23, missing in-conv data). D2 tests now skip gracefully on these datasets instead of failing with misleading empty-set comparisons. The incomplete blobs are caused by Clojure's power-iteration PCA being too slow for large matrices (30-75M cells). These datasets will be validated later when we build the incremental replay infrastructure. Verified: 8 passed, 4 skipped (incomplete blobs), 2 errors (pre-existing duplicate vote files in engage dataset) on full dataset suite. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…omment - D2b: Remove sort-by-size and ID reassignment of base clusters; keep k-means ID order to match Clojure's (sort-by :id). The size-sort changed encounter order of centers fed into group-level k-means. - Update stale comment in test_conversation.py referencing old threshold formula (7 + sqrt(n_cmts) * 0.1) to current min(7, n_cmts). - Add D2b to plan checklist and journal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Document xpassed tests after D2 fix: D6 two_prop_test (1) and D9 repness_not_empty (6) — both strict=False, expected. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ix plan D2c: Both vote counts and n_cmts must use raw_rating_mat (includes moderated-out columns), not rating_mat. This is structural, not related to delta vs full processing. Two xfail unit tests planned. D2d: Full recompute from raw_rating_mat guarantees monotonicity without needing to persist in-conv to DynamoDB (unlike Clojure which persists because it uses delta vote processing). 6 tests guard this invariant for future delta-processing refactors. Ref: compdemocracy#2358. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ity) - D2c: structural discrepancy in vote counts and n_cmts — both must use raw_rating_mat to match Clojure's zero-out-columns behavior - D2d: full recompute guarantees monotonicity without persistence, unlike Clojure which persists because it uses delta processing - Corrected earlier "not needed" note about raw_rating_mat - Updated What's Next: D2c → D2d → D4 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jucor · 2026-03-05T13:32:27Z

Superseded by #2421 (moved head branch to upstream for simpler stacked PR workflow).

jucor requested a review from Copilot March 3, 2026 15:46

Copilot started reviewing on behalf of jucor March 3, 2026 15:46 View session

Copilot AI reviewed Mar 3, 2026

View reviewed changes

jucor force-pushed the clj-parity-d2-fix branch 2 times, most recently from f27a629 to 27f3bbb Compare March 4, 2026 11:57

jucor marked this pull request as draft March 4, 2026 11:57

jucor force-pushed the clj-parity-d2-fix branch 2 times, most recently from 98e6965 to 598879d Compare March 4, 2026 13:35

jucor and others added 6 commits March 4, 2026 14:04

Add PR 1 test results to journal

6f35786

Document xpassed tests after D2 fix: D6 two_prop_test (1) and D9 repness_not_empty (6) — both strict=False, expected. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jucor force-pushed the clj-parity-d2-fix branch from 598879d to fa6075a Compare March 4, 2026 14:04

jucor closed this Mar 5, 2026

jucor deleted the clj-parity-d2-fix branch March 5, 2026 12:41

jucor restored the clj-parity-d2-fix branch March 5, 2026 12:43

jucor reopened this Mar 5, 2026

jucor changed the base branch from edge to jc/series-of-fixes March 5, 2026 12:47

jucor mentioned this pull request Mar 5, 2026

[Stack 10/27] Fix D2: in-conv participant threshold + D2c vote count source #2421

Closed

6 tasks

jucor closed this Mar 5, 2026

jucor deleted the clj-parity-d2-fix branch March 5, 2026 14:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Clj parity PR 1] Fix D2: in-conv participant threshold#2408

[Clj parity PR 1] Fix D2: in-conv participant threshold#2408
jucor wants to merge 6 commits into
compdemocracy:jc/series-of-fixesfrom
jucor:clj-parity-d2-fix

jucor commented Mar 3, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 3, 2026

Uh oh!

Copilot AI Mar 3, 2026

Uh oh!

Copilot AI Mar 3, 2026

Uh oh!

jucor commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		# Create two distinct opinion groups with enough votes per participant
		# to meet the vote threshold (7 + sqrt(n_comments) * 0.1)

Conversation

jucor commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

D2: In-conv threshold

D2b: Base-cluster sort order (from Copilot review)

Impact

Incomplete Clojure blobs

Test results

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

jucor commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jucor commented Mar 3, 2026 •

edited

Loading