Skip to content

[Clj parity PR 1] Fix D2: in-conv participant threshold#2408

Closed
jucor wants to merge 6 commits into
compdemocracy:jc/series-of-fixesfrom
jucor:clj-parity-d2-fix
Closed

[Clj parity PR 1] Fix D2: in-conv participant threshold#2408
jucor wants to merge 6 commits into
compdemocracy:jc/series-of-fixesfrom
jucor:clj-parity-d2-fix

Conversation

@jucor
Copy link
Copy Markdown
Collaborator

@jucor jucor commented Mar 3, 2026

Summary

Stacked on #2401 (per-discrepancy test infrastructure). Please review and merge #2401 first.

To see only the diff added by this PR (3 commits): compare series-of-fixes...clj-parity-d2-fix

Fixes the in-conv participant threshold (D2 discrepancy) and base-cluster sort order (D2b) to match Clojure.

D2: In-conv threshold

  • Before: threshold = 7 + sqrt(n_cmts) * 0.1 — increasingly restrictive for larger conversations (e.g., 8.8 for biodiversity's 314 comments)
  • After: threshold = min(7, n_cmts) — matches Clojure exactly

D2b: Base-cluster sort order (from Copilot review)

  • Before: Base clusters sorted by size (descending) with IDs reassigned — changes encounter order of centers fed into group-level k-means
  • After: Keep k-means ID order, matching Clojure's (sort-by :id ...)

Impact

  • biodiversity: 428 → 441 in-conv participants (now matches Clojure)
  • Verified on 4 datasets with complete Clojure cold-start blobs: vw, biodiversity, and 2 private datasets

Incomplete Clojure blobs

3 private datasets have incomplete Clojure cold-start blobs (4 keys instead of 23, missing in-conv data). This is caused by Clojure's power-iteration PCA being too slow on large matrices (30-75M cells). D2 tests now skip gracefully on these datasets. They will be validated later when we build the incremental replay infrastructure (needed for D3/D1 anyway).

Test results

8 passed, 4 skipped (incomplete blobs), 2 errors (pre-existing duplicate vote files in one dataset)

Test plan

  • D2 tests pass on all datasets with complete Clojure blobs
  • D2 tests skip gracefully on datasets with incomplete blobs
  • Full test suite shows no regressions vs baseline
  • Copilot review comments addressed (D2b sort order, stale test comment)
  • Golden snapshots re-recorded for affected datasets

🤖 Generated with Claude Code

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the Delphi math pipeline and test/tooling to better align Python outputs with the Clojure reference implementation, including the D2 “in-conv” participant vote threshold adjustment.

Changes:

  • Update “in-conv” participant threshold logic (D2) and expand clustering to a two-level (base clusters → group clusters) structure with unfolded groups for downstream computations.
  • Rework test dataset parametrization to support dataset selection (--datasets) and parallel test grouping (xdist_group), and adjust smoke/regression tests accordingly.
  • Add/refresh regression and Clojure comparison tooling (new CLI scripts/options), dataset blob handling (cold-start blobs), and supporting documentation.

Reviewed changes

Copilot reviewed 51 out of 55 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
delphi/polismath/conversation/conversation.py Implements in-conv threshold change and two-level clustering; updates repness/group-vote computations to use unfolded group membership.
delphi/polismath/pca_kmeans_rep/clusters.py Aligns initialization with Clojure-style “first k distinct points” and adds sklearn-backed helpers.
delphi/polismath/regression/datasets.py Adds cold-start blob detection and prefers cold-start blobs when available.
delphi/polismath/regression/init.py Re-exports Clojure comparer utilities from the regression package.
delphi/scripts/regression_comparer.py Adds CLI options (include-local, PCA sign-flip tolerance, outlier tolerances) and passes them into the comparer.
delphi/scripts/regression_recorder.py Adds --include-local support when recording golden snapshots.
delphi/scripts/clojure_comparer.py New CLI for comparing Python outputs against Clojure math blobs.
delphi/pyproject.toml Adds click dependency; configures pytest-xdist loadgroup and adds marker metadata.
delphi/tests/conftest.py Adds --datasets filtering, xdist grouping helpers, and collection-time deselection logic.
delphi/tests/test_regression.py Enables PCA sign-flip tolerance in regression comparisons; adds targeted unit tests around tolerance behavior.
delphi/tests/test_repness_smoke.py Switches dataset parametrization and uses unfolded group clusters when calling repness functions.
delphi/tests/test_legacy_repness_comparison.py Switches dataset parametrization for legacy comparison tests.
delphi/tests/test_conversation_smoke.py Switches dataset parametrization to xdist-grouped params.
delphi/tests/test_pipeline_integrity.py Switches dataset parametrization to xdist-grouped params.
delphi/tests/test_pca_smoke.py Switches dataset parametrization to xdist-grouped params.
delphi/tests/test_pca_unit.py Removes tests for removed PCA helpers; adds NaN-handling and no-vote participant coverage for pca_project_dataframe.
delphi/tests/test_pca_edge_cases.py Simplifies edge-case coverage to focus on pca_project_dataframe.
delphi/tests/test_edge_cases.py Updates expected repness output for insufficient-data scenario.
delphi/tests/test_datasets.py Updates DatasetInfo construction for new cold-start blob field and expected file checks.
delphi/tests/test_clusters.py Updates expectations for init_clusters() behavior when k > n_points.
delphi/tests/test_math_pipeline_runs_e2e.py Updates mock data directory name to match committed dataset directory.
delphi/tests/README.md Documents new pytest CLI options and parallel execution behavior.
delphi/docs/CLOJURE_TWO_LEVEL_CLUSTERING.md New documentation for Clojure’s hierarchical clustering architecture.
delphi/docs/SUBGROUP_CLUSTERING_THIRD_LEVEL.md New documentation describing Clojure’s (currently unimplemented) subgroup clustering level.
delphi/docs/CLOJURE_COMPARISON.md New/expanded documentation for Clojure comparison tests and tooling.
delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md New journal tracking parity work and discrepancy status.
delphi/polismath/benchmarks/bench_pca.py New PCA benchmarking script.
delphi/CLAUDE.md Updates testing/regression/Clojure-comparison guidance.
delphi/tests/test_legacy_clojure_output.py Removes legacy standalone analysis script.
delphi/tests/test_golden_data.py Removes old xfail golden comparison test.
delphi/tests/legacy_compare_with_clojure.py Removes legacy direct comparison script.
deep-analysis-for-julien/01-overview-and-architecture.md Adds deep-dive architecture analysis docs.
deep-analysis-for-julien/02-pca-analysis.md Adds PCA discrepancy analysis.
deep-analysis-for-julien/03-clustering-analysis.md Adds clustering discrepancy analysis.
deep-analysis-for-julien/04-repness-analysis.md Adds repness discrepancy analysis.
deep-analysis-for-julien/05-participant-filtering.md Adds participant filtering / priority analysis.
deep-analysis-for-julien/06-comment-routing.md Adds comment routing analysis.
deep-analysis-for-julien/07-discrepancies.md Adds consolidated discrepancy list.
deep-analysis-for-julien/08-dead-code.md Adds dead-code inventory.
deep-analysis-for-julien/09-fix-plan.md Adds fix plan for parity work.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 1464 to 1473
# Add clusters data in hierarchical format
# Base clusters (participants → ~100 clusters) in folded format
result['base-clusters'] = self._fold_base_clusters(self.base_clusters)

# Group clusters (base clusters → 2-5 groups)
# Members are base cluster IDs, not participant IDs
result['group-clusters'] = self.group_clusters

# Legacy field for backward compatibility (remove after full migration)
result['group_clusters'] = self.group_clusters
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_dict() now sets result['base-clusters'] to the folded base-cluster structure, but later in the same method it overwrites base-clusters with self.group_clusters (so consumers will never actually see base clusters). This breaks the intended hierarchical output format and can also confuse downstream code that expects folded base-clusters. Suggest removing the later overwrite and ensuring base-clusters always contains the folded base clusters (and if you need the legacy behavior, keep it under a separate legacy key).

Copilot uses AI. Check for mistakes.
Comment on lines +593 to +597
# Sort base clusters by size (descending) for consistency
base_clusters.sort(key=lambda c: len(c['members']), reverse=True)
# Reassign IDs based on sorted order
for i, cluster in enumerate(base_clusters):
cluster['id'] = i
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Base clusters are being sorted by size and then IDs are reassigned. Because group-cluster k-means initialization uses “first k distinct points in encounter order”, this size-sort changes the encounter order of base-cluster centers and can change the resulting group clustering (and likely diverge from Clojure’s sort-by :id behavior). If parity is the goal, keep base clusters in deterministic ID order (and avoid reassigning IDs based on size) before running the second-level clustering.

Suggested change
# Sort base clusters by size (descending) for consistency
base_clusters.sort(key=lambda c: len(c['members']), reverse=True)
# Reassign IDs based on sorted order
for i, cluster in enumerate(base_clusters):
cluster['id'] = i
# NOTE:
# Keep base clusters in deterministic ID (encounter) order.
# Avoid sorting by size and reassigning IDs, since that would
# change the encounter order of centers used in group clustering.
# If parity with a Clojure implementation using (sort-by :id) is
# desired, this existing ID assignment and order already matches.

Copilot uses AI. Check for mistakes.
Comment thread delphi/tests/test_conversation.py Outdated
Comment on lines +429 to +430
# Create two distinct opinion groups with enough votes per participant
# to meet the vote threshold (7 + sqrt(n_comments) * 0.1)
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment in this test still references the old in-conv vote threshold formula 7 + sqrt(n_comments) * 0.1, but the implementation has been changed to min(7, n_comments). Please update the comment so the test documents the correct behavior it relies on.

Copilot uses AI. Check for mistakes.
@jucor jucor force-pushed the clj-parity-d2-fix branch 2 times, most recently from f27a629 to 27f3bbb Compare March 4, 2026 11:57
@jucor jucor marked this pull request as draft March 4, 2026 11:57
@jucor jucor force-pushed the clj-parity-d2-fix branch 2 times, most recently from 98e6965 to 598879d Compare March 4, 2026 13:35
jucor and others added 6 commits March 4, 2026 14:04
Change threshold from 7+sqrt(n_cmts)*0.1 to min(7, n_cmts) in
_get_in_conv_participants(). This matches Clojure's conversation.clj
and includes more participants in clustering for large conversations.

Verified on vw, biodiversity, FLI, bg2018 (all datasets with complete
Clojure blobs). 3 private datasets have incomplete blobs (no in-conv
data) — blob regeneration delegated to separate task.

Re-recorded golden snapshot for biodiversity (428→441 participants).
Updated journal with TDD discipline, golden snapshot policy, and
pipelined worktree workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three large datasets have incomplete cold-start Clojure blobs (4 keys
instead of 23, missing in-conv data). D2 tests now skip gracefully on
these datasets instead of failing with misleading empty-set comparisons.

The incomplete blobs are caused by Clojure's power-iteration PCA being
too slow for large matrices (30-75M cells). These datasets will be
validated later when we build the incremental replay infrastructure.

Verified: 8 passed, 4 skipped (incomplete blobs), 2 errors (pre-existing
duplicate vote files in engage dataset) on full dataset suite.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…omment

- D2b: Remove sort-by-size and ID reassignment of base clusters;
  keep k-means ID order to match Clojure's (sort-by :id). The size-sort
  changed encounter order of centers fed into group-level k-means.
- Update stale comment in test_conversation.py referencing old threshold
  formula (7 + sqrt(n_cmts) * 0.1) to current min(7, n_cmts).
- Add D2b to plan checklist and journal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document xpassed tests after D2 fix: D6 two_prop_test (1) and
D9 repness_not_empty (6) — both strict=False, expected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ix plan

D2c: Both vote counts and n_cmts must use raw_rating_mat (includes
moderated-out columns), not rating_mat. This is structural, not related
to delta vs full processing. Two xfail unit tests planned.

D2d: Full recompute from raw_rating_mat guarantees monotonicity without
needing to persist in-conv to DynamoDB (unlike Clojure which persists
because it uses delta vote processing). 6 tests guard this invariant
for future delta-processing refactors. Ref: compdemocracy#2358.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ity)

- D2c: structural discrepancy in vote counts and n_cmts — both must
  use raw_rating_mat to match Clojure's zero-out-columns behavior
- D2d: full recompute guarantees monotonicity without persistence,
  unlike Clojure which persists because it uses delta processing
- Corrected earlier "not needed" note about raw_rating_mat
- Updated What's Next: D2c → D2d → D4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jucor jucor force-pushed the clj-parity-d2-fix branch from 598879d to fa6075a Compare March 4, 2026 14:04
@jucor jucor closed this Mar 5, 2026
@jucor jucor deleted the clj-parity-d2-fix branch March 5, 2026 12:41
@jucor jucor restored the clj-parity-d2-fix branch March 5, 2026 12:43
@jucor jucor reopened this Mar 5, 2026
@jucor jucor changed the base branch from edge to jc/series-of-fixes March 5, 2026 12:47
@jucor
Copy link
Copy Markdown
Collaborator Author

jucor commented Mar 5, 2026

Superseded by #2421 (moved head branch to upstream for simpler stacked PR workflow).

@jucor jucor closed this Mar 5, 2026
@jucor jucor deleted the clj-parity-d2-fix branch March 5, 2026 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants