[Clj parity PR 1] Fix D2: in-conv participant threshold#2408
Conversation
There was a problem hiding this comment.
Pull request overview
Updates the Delphi math pipeline and test/tooling to better align Python outputs with the Clojure reference implementation, including the D2 “in-conv” participant vote threshold adjustment.
Changes:
- Update “in-conv” participant threshold logic (D2) and expand clustering to a two-level (base clusters → group clusters) structure with unfolded groups for downstream computations.
- Rework test dataset parametrization to support dataset selection (
--datasets) and parallel test grouping (xdist_group), and adjust smoke/regression tests accordingly. - Add/refresh regression and Clojure comparison tooling (new CLI scripts/options), dataset blob handling (cold-start blobs), and supporting documentation.
Reviewed changes
Copilot reviewed 51 out of 55 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| delphi/polismath/conversation/conversation.py | Implements in-conv threshold change and two-level clustering; updates repness/group-vote computations to use unfolded group membership. |
| delphi/polismath/pca_kmeans_rep/clusters.py | Aligns initialization with Clojure-style “first k distinct points” and adds sklearn-backed helpers. |
| delphi/polismath/regression/datasets.py | Adds cold-start blob detection and prefers cold-start blobs when available. |
| delphi/polismath/regression/init.py | Re-exports Clojure comparer utilities from the regression package. |
| delphi/scripts/regression_comparer.py | Adds CLI options (include-local, PCA sign-flip tolerance, outlier tolerances) and passes them into the comparer. |
| delphi/scripts/regression_recorder.py | Adds --include-local support when recording golden snapshots. |
| delphi/scripts/clojure_comparer.py | New CLI for comparing Python outputs against Clojure math blobs. |
| delphi/pyproject.toml | Adds click dependency; configures pytest-xdist loadgroup and adds marker metadata. |
| delphi/tests/conftest.py | Adds --datasets filtering, xdist grouping helpers, and collection-time deselection logic. |
| delphi/tests/test_regression.py | Enables PCA sign-flip tolerance in regression comparisons; adds targeted unit tests around tolerance behavior. |
| delphi/tests/test_repness_smoke.py | Switches dataset parametrization and uses unfolded group clusters when calling repness functions. |
| delphi/tests/test_legacy_repness_comparison.py | Switches dataset parametrization for legacy comparison tests. |
| delphi/tests/test_conversation_smoke.py | Switches dataset parametrization to xdist-grouped params. |
| delphi/tests/test_pipeline_integrity.py | Switches dataset parametrization to xdist-grouped params. |
| delphi/tests/test_pca_smoke.py | Switches dataset parametrization to xdist-grouped params. |
| delphi/tests/test_pca_unit.py | Removes tests for removed PCA helpers; adds NaN-handling and no-vote participant coverage for pca_project_dataframe. |
| delphi/tests/test_pca_edge_cases.py | Simplifies edge-case coverage to focus on pca_project_dataframe. |
| delphi/tests/test_edge_cases.py | Updates expected repness output for insufficient-data scenario. |
| delphi/tests/test_datasets.py | Updates DatasetInfo construction for new cold-start blob field and expected file checks. |
| delphi/tests/test_clusters.py | Updates expectations for init_clusters() behavior when k > n_points. |
| delphi/tests/test_math_pipeline_runs_e2e.py | Updates mock data directory name to match committed dataset directory. |
| delphi/tests/README.md | Documents new pytest CLI options and parallel execution behavior. |
| delphi/docs/CLOJURE_TWO_LEVEL_CLUSTERING.md | New documentation for Clojure’s hierarchical clustering architecture. |
| delphi/docs/SUBGROUP_CLUSTERING_THIRD_LEVEL.md | New documentation describing Clojure’s (currently unimplemented) subgroup clustering level. |
| delphi/docs/CLOJURE_COMPARISON.md | New/expanded documentation for Clojure comparison tests and tooling. |
| delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md | New journal tracking parity work and discrepancy status. |
| delphi/polismath/benchmarks/bench_pca.py | New PCA benchmarking script. |
| delphi/CLAUDE.md | Updates testing/regression/Clojure-comparison guidance. |
| delphi/tests/test_legacy_clojure_output.py | Removes legacy standalone analysis script. |
| delphi/tests/test_golden_data.py | Removes old xfail golden comparison test. |
| delphi/tests/legacy_compare_with_clojure.py | Removes legacy direct comparison script. |
| deep-analysis-for-julien/01-overview-and-architecture.md | Adds deep-dive architecture analysis docs. |
| deep-analysis-for-julien/02-pca-analysis.md | Adds PCA discrepancy analysis. |
| deep-analysis-for-julien/03-clustering-analysis.md | Adds clustering discrepancy analysis. |
| deep-analysis-for-julien/04-repness-analysis.md | Adds repness discrepancy analysis. |
| deep-analysis-for-julien/05-participant-filtering.md | Adds participant filtering / priority analysis. |
| deep-analysis-for-julien/06-comment-routing.md | Adds comment routing analysis. |
| deep-analysis-for-julien/07-discrepancies.md | Adds consolidated discrepancy list. |
| deep-analysis-for-julien/08-dead-code.md | Adds dead-code inventory. |
| deep-analysis-for-julien/09-fix-plan.md | Adds fix plan for parity work. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Add clusters data in hierarchical format | ||
| # Base clusters (participants → ~100 clusters) in folded format | ||
| result['base-clusters'] = self._fold_base_clusters(self.base_clusters) | ||
|
|
||
| # Group clusters (base clusters → 2-5 groups) | ||
| # Members are base cluster IDs, not participant IDs | ||
| result['group-clusters'] = self.group_clusters | ||
|
|
||
| # Legacy field for backward compatibility (remove after full migration) | ||
| result['group_clusters'] = self.group_clusters |
There was a problem hiding this comment.
to_dict() now sets result['base-clusters'] to the folded base-cluster structure, but later in the same method it overwrites base-clusters with self.group_clusters (so consumers will never actually see base clusters). This breaks the intended hierarchical output format and can also confuse downstream code that expects folded base-clusters. Suggest removing the later overwrite and ensuring base-clusters always contains the folded base clusters (and if you need the legacy behavior, keep it under a separate legacy key).
| # Sort base clusters by size (descending) for consistency | ||
| base_clusters.sort(key=lambda c: len(c['members']), reverse=True) | ||
| # Reassign IDs based on sorted order | ||
| for i, cluster in enumerate(base_clusters): | ||
| cluster['id'] = i |
There was a problem hiding this comment.
Base clusters are being sorted by size and then IDs are reassigned. Because group-cluster k-means initialization uses “first k distinct points in encounter order”, this size-sort changes the encounter order of base-cluster centers and can change the resulting group clustering (and likely diverge from Clojure’s sort-by :id behavior). If parity is the goal, keep base clusters in deterministic ID order (and avoid reassigning IDs based on size) before running the second-level clustering.
| # Sort base clusters by size (descending) for consistency | |
| base_clusters.sort(key=lambda c: len(c['members']), reverse=True) | |
| # Reassign IDs based on sorted order | |
| for i, cluster in enumerate(base_clusters): | |
| cluster['id'] = i | |
| # NOTE: | |
| # Keep base clusters in deterministic ID (encounter) order. | |
| # Avoid sorting by size and reassigning IDs, since that would | |
| # change the encounter order of centers used in group clustering. | |
| # If parity with a Clojure implementation using (sort-by :id) is | |
| # desired, this existing ID assignment and order already matches. |
| # Create two distinct opinion groups with enough votes per participant | ||
| # to meet the vote threshold (7 + sqrt(n_comments) * 0.1) |
There was a problem hiding this comment.
The comment in this test still references the old in-conv vote threshold formula 7 + sqrt(n_comments) * 0.1, but the implementation has been changed to min(7, n_comments). Please update the comment so the test documents the correct behavior it relies on.
f27a629 to
27f3bbb
Compare
98e6965 to
598879d
Compare
Change threshold from 7+sqrt(n_cmts)*0.1 to min(7, n_cmts) in _get_in_conv_participants(). This matches Clojure's conversation.clj and includes more participants in clustering for large conversations. Verified on vw, biodiversity, FLI, bg2018 (all datasets with complete Clojure blobs). 3 private datasets have incomplete blobs (no in-conv data) — blob regeneration delegated to separate task. Re-recorded golden snapshot for biodiversity (428→441 participants). Updated journal with TDD discipline, golden snapshot policy, and pipelined worktree workflow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three large datasets have incomplete cold-start Clojure blobs (4 keys instead of 23, missing in-conv data). D2 tests now skip gracefully on these datasets instead of failing with misleading empty-set comparisons. The incomplete blobs are caused by Clojure's power-iteration PCA being too slow for large matrices (30-75M cells). These datasets will be validated later when we build the incremental replay infrastructure. Verified: 8 passed, 4 skipped (incomplete blobs), 2 errors (pre-existing duplicate vote files in engage dataset) on full dataset suite. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…omment - D2b: Remove sort-by-size and ID reassignment of base clusters; keep k-means ID order to match Clojure's (sort-by :id). The size-sort changed encounter order of centers fed into group-level k-means. - Update stale comment in test_conversation.py referencing old threshold formula (7 + sqrt(n_cmts) * 0.1) to current min(7, n_cmts). - Add D2b to plan checklist and journal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document xpassed tests after D2 fix: D6 two_prop_test (1) and D9 repness_not_empty (6) — both strict=False, expected. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ix plan D2c: Both vote counts and n_cmts must use raw_rating_mat (includes moderated-out columns), not rating_mat. This is structural, not related to delta vs full processing. Two xfail unit tests planned. D2d: Full recompute from raw_rating_mat guarantees monotonicity without needing to persist in-conv to DynamoDB (unlike Clojure which persists because it uses delta vote processing). 6 tests guard this invariant for future delta-processing refactors. Ref: compdemocracy#2358. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ity) - D2c: structural discrepancy in vote counts and n_cmts — both must use raw_rating_mat to match Clojure's zero-out-columns behavior - D2d: full recompute guarantees monotonicity without persistence, unlike Clojure which persists because it uses delta processing - Corrected earlier "not needed" note about raw_rating_mat - Updated What's Next: D2c → D2d → D4 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Superseded by #2421 (moved head branch to upstream for simpler stacked PR workflow). |
Summary
Fixes the in-conv participant threshold (D2 discrepancy) and base-cluster sort order (D2b) to match Clojure.
D2: In-conv threshold
threshold = 7 + sqrt(n_cmts) * 0.1— increasingly restrictive for larger conversations (e.g., 8.8 for biodiversity's 314 comments)threshold = min(7, n_cmts)— matches Clojure exactlyD2b: Base-cluster sort order (from Copilot review)
(sort-by :id ...)Impact
Incomplete Clojure blobs
3 private datasets have incomplete Clojure cold-start blobs (4 keys instead of 23, missing
in-convdata). This is caused by Clojure's power-iteration PCA being too slow on large matrices (30-75M cells). D2 tests now skip gracefully on these datasets. They will be validated later when we build the incremental replay infrastructure (needed for D3/D1 anyway).Test results
Test plan
🤖 Generated with Claude Code