
Commit e15037c

jucor and claude authored
[Stack 5/27] Cold-start Clojure math blob generation and cluster visualization (#2485)
* Add script to generate cold-start Clojure math blobs for fair comparison

Creates generate_cold_start_clojure.py to generate fresh cold-start Clojure reference data for fair Python vs Clojure comparison.

The script:
- Stops if math worker is running (prevents conflicts)
- Backs up existing math_main row
- Deletes row to force cold-start (load-or-init creates fresh new-conv)
- Runs Clojure computation via Docker
- Extracts cold-start math blob
- Restores original row automatically

Key features:
- Support for --all flag to process all datasets
- Support for --include-local flag for local datasets
- Automatic zid lookup from report_id via reports table
- Loads configuration from polis-kmeans/.env (DATABASE_URL)

Test infrastructure updates:
- datasets.py now prefers cold-start blobs when available
- Added has_cold_start_blob field to DatasetInfo
- get_dataset_files() uses cold-start blob by default

Documentation updates:
- Comprehensive usage guide in SESSION_HANDOFF_KMEANS.md
- Commands to find and verify cold-start blobs
- Configuration requirements and setup instructions

Reference data:
- Generated cold-start blobs for biodiversity and vw datasets
- Tests will now use these for fair cold vs cold comparison

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Implement conversation replay approach for cold-start generation and add visualization with PCA sign flip detection

Cold-start generation (generate_cold_start_clojure.py):
- Rewrite using "conversation replay" approach that works with Clojure poller design
- Creates temporary conversation with fresh zid, copies votes with fresh timestamps
- Runs poller with MATH_ZID_ALLOWLIST to only process the replayed conversation
- Automatically cleans up all temporary data (math tables, votes, conversation)
- Add bash wrapper script (generate_cold_start.sh) that stops math containers first

Visualization (visualize_cluster_comparison.py):
- Add PCA sign flip detection by comparing component correlations
- Apply sign corrections to base cluster centers before visualization
- Fix convex hull rendering to show outlines for both datasets in overlay view
- Include sign_flips in metrics JSON output

Documentation:
- Update SESSION_HANDOFF_KMEANS.md with new approach and remove "BROKEN" warnings
- Document the conversation replay workflow and cleanup behavior

Regenerate cold-start blobs for biodiversity and vw datasets.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Improve cold-start generator and cluster visualizer CLI

Cold-start generator (generate_cold_start_clojure.py):
- Add --pause-math to pause/resume workers instead of stopping
- Add --verbose/-v for real-time Clojure poller output
- Support multiple datasets as arguments
- Use fast INSERT...SELECT for vote copying (was executemany)
- Handle duplicate votes with DISTINCT ON
- Remove shell wrapper (functionality now in Python script)

Cluster visualizer (visualize_cluster_comparison.py):
- Add --all option for processing all datasets
- Synchronize X/Y axis limits in side-by-side plots
- Print full absolute paths for generated PNGs
- Support multiple datasets as arguments

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Compute cold-start math blobs for kmeans

* Interrupt upon Clojure error

* test_datasets: add cold_start blob fixture and has_cold_start arg

Deferred from commit 18ad361: the test_datasets.py changes depend on the has_cold_start_blob field introduced in datasets.py by the cold-start tooling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent a2cee07 commit e15037c

8 files changed

Lines changed: 152979 additions & 12 deletions


delphi/docs/SESSION_HANDOFF_KMEANS.md

Lines changed: 322 additions & 2 deletions
@@ -527,8 +527,327 @@ After implementation:
527527
| Add in-conv filtering | ✅ Done | `_get_in_conv_participants()` |
528528
| Update serialization | ✅ Done | `_fold_base_clusters()`, outputs hierarchical format |
529529
| Add incremental clustering (`:last-clusters`) | ⏳ TODO | Need to use previous clusters as initialization |
530-
| Generate clean Clojure references | ⏳ TODO | Fresh computations for fair comparison |
531-
| Update tests for fair comparison | ⏳ TODO | Compare cold-start vs cold-start |
530+
| Generate clean Clojure references | ✅ Done | Uses "fake conversation" approach - creates temp conversation with copied votes |
531+
| Update tests for fair comparison | ✅ Done | Tests now prefer cold-start blobs automatically |
532+
| Implement fake conversation approach | ✅ Done | Script creates temp zid, copies votes with fresh timestamps, runs poller |
533+
534+
---
535+
536+
## Generating Clean Cold-Start Clojure References
537+
538+
To ensure fair comparison between Python and Clojure implementations, we can generate fresh cold-start Clojure math blobs using the `generate_cold_start_clojure.py` script.
539+
540+
### Why Cold-Start Matters
541+
542+
The Clojure implementation uses `:last-clusters` to warm-start from previous state. As a result, its output depends on the worker's history and not only on the votes:
543+
- If math worker was NOT restarted: uses previous clusters for initialization
544+
- If math worker WAS restarted: loses in-memory state, behaves differently
545+
546+
By generating cold-start references, we compare:
547+
- **Python cold-start** (always) vs **Clojure cold-start** (forced by replaying the votes into a fresh temporary conversation that has no `math_main` row)
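
To see why the initialization matters for a fair comparison, here is a toy sketch. It is a plain 1-D Lloyd's k-means written for illustration only, not the Polis clustering code, and `kmeans_1d` and the sample points are hypothetical. The same data is clustered twice, once from spread-out starting centers and once from centers carried over from some earlier state.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Plain Lloyd's algorithm on 1-D data; returns the final clusters."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return clusters


points = [0, 1, 10, 11, 20, 21]
cold = kmeans_1d(points, centers=[0, 21])  # fresh, spread-out initialization
warm = kmeans_1d(points, centers=[0, 1])   # centers inherited from earlier state
print(cold)
print(warm)
```

Both runs converge, but to different partitions of the same points. A warm-started reference can therefore disagree with a cold-started one even when both implementations are correct.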
548+
549+
### How to Generate Cold-Start Blobs
550+
551+
**Prerequisites:**
552+
1. Stop any running math worker: `docker compose stop math` (from the worktree root)
553+
2. Ensure DATABASE_URL is set in the worktree root's `.env` file
554+
- Example: `DATABASE_URL=postgres://postgres:password@host.docker.internal:5433/polis-dev`
555+
556+
**Generate for single dataset:**
557+
```bash
558+
cd delphi # from worktree root
559+
uv run python scripts/generate_cold_start_clojure.py biodiversity
560+
```
561+
562+
**Generate for all committed datasets:**
563+
```bash
564+
uv run python scripts/generate_cold_start_clojure.py --all
565+
```
566+
567+
**Generate for all datasets including local (.local/):**
568+
```bash
569+
uv run python scripts/generate_cold_start_clojure.py --all --include-local
570+
```
571+
572+
**Advanced options:**
573+
```bash
574+
# Keep fake conversation data for debugging (not cleaned up)
575+
uv run python scripts/generate_cold_start_clojure.py biodiversity --no-cleanup
576+
577+
# Process a specific local dataset
578+
uv run python scripts/generate_cold_start_clojure.py my-local-dataset
579+
580+
# Increase timeout for large datasets
581+
uv run python scripts/generate_cold_start_clojure.py biodiversity --timeout 600
582+
```
583+
584+
**Output files:**
585+
- `{report_id}_math_blob_cold_start.json` - Fresh cold-start computation
586+
587+
### Finding Pre-Computed Cold-Start Math Blobs
588+
589+
After running the script, cold-start math blobs are saved in the dataset directories:
590+
591+
**Location pattern:**
592+
```
593+
delphi/real_data/{report_id}-{dataset_name}/{report_id}_math_blob_cold_start.json
594+
```
595+
596+
**For committed datasets:**
597+
```bash
598+
# Biodiversity example
599+
ls -lh delphi/real_data/r4tykwac8thvzv35jrn53-biodiversity/r4tykwac8thvzv35jrn53_math_blob_cold_start.json
600+
601+
# VW example
602+
ls -lh delphi/real_data/r6vbnhffkxbd7ifmfbdrd-vw/r6vbnhffkxbd7ifmfbdrd_math_blob_cold_start.json
603+
604+
# List all cold-start blobs
605+
find delphi/real_data -name "*_math_blob_cold_start.json" -type f
606+
```
607+
608+
**For local datasets:**
609+
```bash
610+
# List all cold-start blobs in .local/
611+
find delphi/real_data/.local -name "*_math_blob_cold_start.json" -type f
612+
```
613+
614+
**Verify a cold-start blob was created:**
615+
```bash
616+
cd delphi # from worktree root
617+
618+
# Check file exists and size
619+
ls -lh real_data/r4tykwac8thvzv35jrn53-biodiversity/*cold_start*.json
620+
621+
# Quick inspection of content
622+
jq 'keys | length' real_data/r4tykwac8thvzv35jrn53-biodiversity/r4tykwac8thvzv35jrn53_math_blob_cold_start.json
623+
624+
# Compare file sizes (cold-start should be similar to original)
625+
ls -lh real_data/r4tykwac8thvzv35jrn53-biodiversity/*_math_blob*.json
626+
```
627+
628+
**Dataset directory structure after running script:**
629+
```
630+
real_data/r4tykwac8thvzv35jrn53-biodiversity/
631+
├── 2024-11-12-1652-r4tykwac8thvzv35jrn53-votes.csv
632+
├── 2024-11-12-1652-r4tykwac8thvzv35jrn53-comments.csv
633+
├── 2024-11-12-1652-r4tykwac8thvzv35jrn53-summary.csv
634+
├── r4tykwac8thvzv35jrn53_math_blob.json # Original (unknown provenance)
635+
├── r4tykwac8thvzv35jrn53_math_blob_cold_start.json # Fresh cold-start ✨
636+
└── golden_snapshot.json
637+
```
638+
639+
### How Tests Use Cold-Start Blobs
640+
641+
The test infrastructure automatically detects and prefers cold-start blobs when available:
642+
643+
**Automatic detection (in `datasets.py`):**
644+
```python
645+
def get_dataset_files(name: str, prefer_cold_start: bool = True):
646+
    # Automatically uses {report_id}_math_blob_cold_start.json if it exists
647+
    # Falls back to {report_id}_math_blob.json otherwise
648+
```
649+
650+
**Run tests (will auto-use cold-start blobs):**
651+
```bash
652+
cd delphi # from worktree root
653+
654+
# Run all Clojure comparison tests
655+
uv run pytest tests/test_legacy_clojure_regression.py -v
656+
657+
# Run specific clustering comparison
658+
uv run pytest tests/test_legacy_clojure_regression.py::TestClojureRegression::test_group_clustering -v
659+
660+
# Run for specific dataset
661+
uv run pytest tests/test_legacy_clojure_regression.py -v -k biodiversity
662+
```
663+
664+
**Check which blob is being used:**
665+
```python
666+
from polismath.regression import get_dataset_files
667+
668+
# Will use cold-start if available
669+
files = get_dataset_files('biodiversity')
670+
print(f"Using: {files['math_blob']}")
671+
# Output: .../r4tykwac8thvzv35jrn53_math_blob_cold_start.json (if exists)
672+
673+
# Force use of original blob
674+
files = get_dataset_files('biodiversity', prefer_cold_start=False)
675+
print(f"Using: {files['math_blob']}")
676+
# Output: .../r4tykwac8thvzv35jrn53_math_blob.json
677+
```
678+
679+
**Check which datasets have cold-start blobs:**
680+
```bash
681+
cd delphi # from worktree root
682+
683+
# List all datasets with cold-start blobs
684+
uv run python -c "
685+
from polismath.regression import discover_datasets
686+
datasets = discover_datasets(include_local=False)
687+
for name, info in datasets.items():
688+
    status = '✓ cold-start' if info.has_cold_start_blob else '✗ original only'
689+
    print(f'{name}: {status}')
690+
"
691+
```
692+
693+
### How the Script Works (Fake Conversation Approach)
694+
695+
The script uses a "fake conversation" approach to generate true cold-start computations. This works WITH the Clojure poller's design rather than against it.
696+
697+
**The Process:**
698+
699+
1. **Create fake conversation**: Insert a minimal row in `conversations` table with a fresh auto-generated zid
700+
2. **Copy votes with fresh timestamps**: Copy all votes from source zid to fake zid, with timestamps starting from "now" (spaced 10ms apart to preserve order)
701+
3. **Run poller**: Start the Clojure poller with `MATH_ZID_ALLOWLIST={fake_zid}` to only process our fake conversation
702+
4. **Wait for computation**: Poll `math_main` until `base-clusters` has data
703+
5. **Extract and save**: Save the math blob with the original source zid for consistency
704+
6. **Cleanup**: Delete all fake data from `conversations`, `votes`, `votes_latest_unique`, `participants`, and `math_main`
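
Step 4 above amounts to a poll-until-ready loop with a timeout. A minimal sketch, assuming a hypothetical `fetch_blob` callable that returns the current `math_main` blob as a dict (or `None` if no row exists yet). The function name, the polling interval, and the injectable `clock`/`sleep` parameters are illustrative and not taken from the script; only the 300-second default mirrors the documented `--timeout` default.

```python
import time


def wait_for_result(fetch_blob, timeout_s=300.0, interval_s=2.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll fetch_blob() until it returns a blob with non-empty base-clusters."""
    deadline = clock() + timeout_s
    while True:
        blob = fetch_blob()
        # An empty or missing "base-clusters" entry means the math is not done yet.
        if blob and blob.get("base-clusters"):
            return blob
        if clock() >= deadline:
            raise TimeoutError(f"no base-clusters after {timeout_s} seconds")
        sleep(interval_s)


# Stand-in for the database: no row, then an empty result, then real data.
responses = iter([None, {"base-clusters": {}}, {"base-clusters": {"id": [0, 1]}}])
blob = wait_for_result(lambda: next(responses), sleep=lambda s: None)
print(blob)
```

Passing the clock and sleep functions in keeps the loop testable without waiting in real time.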
705+
706+
**Why This Works:**
707+
- The poller finds votes with `created > last_poll_timestamp`
708+
- Fresh timestamps ensure the votes are picked up
709+
- A new zid means no interference from existing math_main rows or cached state
710+
- The `MATH_ZID_ALLOWLIST` filter restricts processing to just our fake conversation
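
The timestamp rewrite can be sketched in a few lines. This is an illustration of the idea only: the script itself copies votes in SQL, and `fresh_timestamps`, the millisecond epoch values, and the sample `last_poll_timestamp` below are assumptions made for the example.

```python
def fresh_timestamps(original_created_ms, now_ms, spacing_ms=10):
    """Map original vote timestamps to new ones starting at now_ms, keeping order."""
    # Rank votes by their original creation time (sorted() is stable, so ties
    # keep their input order).
    order = sorted(range(len(original_created_ms)),
                   key=lambda i: original_created_ms[i])
    fresh = [0] * len(original_created_ms)
    for rank, idx in enumerate(order):
        fresh[idx] = now_ms + rank * spacing_ms
    return fresh


original = [1_700_000_300, 1_700_000_100, 1_700_000_200]
now_ms = 2_000_000_000_000
fresh = fresh_timestamps(original, now_ms)

# Hypothetical poller state: anything created after this is picked up.
last_poll_timestamp = now_ms - 1
print(fresh)
print(all(t > last_poll_timestamp for t in fresh))
```

The relative order of the votes is unchanged, and every copied vote is newer than the poller's last poll, so none of them is skipped.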
711+
712+
**Key Functions:**
713+
- `create_fake_conversation(conn, source_zid)` → Creates minimal conversation row, returns new zid
714+
- `copy_votes_with_fresh_timestamps(conn, source_zid, fake_zid)` → Copies votes with sequential fresh timestamps
715+
- `cleanup_fake_conversation(conn, fake_zid)` → Deletes all fake data from all tables
716+
717+
### Command-Line Options
718+
719+
- `DATASETS...`: Specify one or more dataset names (e.g., `biodiversity vw`)
720+
- `--all`: Process all datasets
721+
- `--include-local`: Include datasets from `real_data/.local/`
722+
- `--no-cleanup`: Keep fake conversation data for debugging (normally cleaned up automatically)
723+
- `--timeout N`: Set timeout in seconds for math computation (default: 300)
724+
- `--pause-math`: Automatically pause running math workers (resumes after completion)
725+
- `--verbose` / `-v`: Show detailed output including real-time Clojure poller logs
726+
727+
**Examples:**
728+
```bash
729+
# Single dataset
730+
uv run python scripts/generate_cold_start_clojure.py biodiversity
731+
732+
# Multiple datasets
733+
uv run python scripts/generate_cold_start_clojure.py biodiversity vw american-assembly
734+
735+
# All datasets with verbose output and longer timeout
736+
uv run python scripts/generate_cold_start_clojure.py --all --include-local --pause-math --timeout 600 -v
737+
```
738+
739+
### Safety Features
740+
741+
- **Math worker detection**: Refuses to run if math worker is active (use `--pause-math` to auto-pause)
742+
- **Environment validation**: Checks that DATABASE_URL is set before proceeding
743+
- **Report ID validation**: Verifies report_id exists in reports table
744+
- **Vote verification**: Confirms source zid has votes before attempting computation
745+
- **Automatic cleanup**: Fake conversation data is always deleted (unless `--no-cleanup`)
746+
- **No permanent changes**: Original database data is never modified
747+
- **Error detection**: Monitors Clojure poller output for fatal errors and aborts early (see below)
748+
749+
### Clojure Error Detection
750+
751+
The script monitors the Clojure poller output for fatal errors and aborts early instead of waiting for timeout. Detected patterns:
752+
- `"Failed conversation update"` - General computation failure
753+
- `"nil has zero dimensionality"` - Empty matrix in PCA (see Known Limitations)
754+
- `"Re-queueing messages for failed update"` - Persistent failure
755+
- `"java.lang.OutOfMemoryError"` - Memory exhaustion
756+
757+
When detected, the script aborts with:
758+
```
759+
✗ Clojure poller failed: Clojure error detected: nil has zero dimensionality
760+
The conversation data may not be processable by the Clojure implementation.
761+
```
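
The detection described above is a substring scan over the poller's output. A minimal sketch using the four documented patterns; `find_fatal_error` is a hypothetical helper, and the sample log lines are made up for the example except for the exception text, which comes from the Known Limitations section.

```python
FATAL_PATTERNS = (
    "Failed conversation update",
    "nil has zero dimensionality",
    "Re-queueing messages for failed update",
    "java.lang.OutOfMemoryError",
)


def find_fatal_error(lines):
    """Return the first fatal pattern found in the log lines, or None."""
    for line in lines:
        for pattern in FATAL_PATTERNS:
            if pattern in line:
                return pattern
    return None


# Made-up log excerpt for illustration.
log = [
    "INFO polling for new votes",
    "clojure.lang.ExceptionInfo: nil has zero dimensionality, "
    "cannot get count for dimension: 0",
]
print(find_fatal_error(log))
```

Scanning line by line lets the caller abort as soon as a match appears instead of waiting for the timeout.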
762+
763+
---
764+
765+
## Cluster Visualization Script
766+
767+
The `visualize_cluster_comparison.py` script generates visual comparisons between different clustering outputs.
768+
769+
### Usage
770+
771+
```bash
772+
cd delphi
773+
774+
# Single dataset
775+
uv run python scripts/visualize_cluster_comparison.py biodiversity
776+
777+
# Multiple datasets
778+
uv run python scripts/visualize_cluster_comparison.py biodiversity vw
779+
780+
# All datasets
781+
uv run python scripts/visualize_cluster_comparison.py --all --include-local
782+
```
783+
784+
### Output
785+
786+
For each dataset, generates:
787+
- `{dataset}_golden_vs_coldstart_sidebyside.png` - Python vs Clojure cold-start side-by-side
788+
- `{dataset}_golden_vs_coldstart_overlay.png` - Python vs Clojure cold-start overlay
789+
- `{dataset}_coldstart_vs_regular_sidebyside.png` - Clojure cold-start vs original
790+
- `{dataset}_coldstart_vs_regular_overlay.png` - Clojure cold-start vs original overlay
791+
- `{dataset}_*_metrics.json` - Comparison metrics (Jaccard similarity, etc.)
792+
793+
Output directory: `scripts/outputs/cluster_visualizations/{dataset}/`
794+
795+
Full absolute paths are printed for each PNG, allowing alt-click to open in IDE.
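
For reference, Jaccard similarity between two groups is the size of their intersection divided by the size of their union. A minimal sketch, assuming groups are compared as sets of participant ids; which sets the script actually compares is not shown here, and the ids below are made up.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty groups are treated as identical
    return len(a & b) / len(a | b)


python_group = {1, 2, 3, 4}
clojure_group = {3, 4, 5}
print(jaccard(python_group, clojure_group))
```

A value of 1.0 means the two groups have identical membership, and 0.0 means they share no members.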
796+
797+
### Features
798+
799+
- **PCA sign flip detection**: Automatically detects and corrects PCA sign flips between implementations
800+
- **Synchronized axes**: Side-by-side plots share the same X/Y limits for direct comparison
801+
- **Convex hulls**: Shows group boundaries with convex hulls in overlay mode
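
PCA components are only defined up to sign, so two correct implementations can produce mirrored projections. One way to detect this, in line with the correlation-based approach mentioned in the commit message, is sketched below in pure Python. The function names and sample loadings are hypothetical, and the sketch assumes no component has zero variance.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def detect_sign_flips(reference, other):
    """Return +1 or -1 per component: -1 where `other` is mirrored."""
    return [-1 if pearson(r, o) < 0 else 1 for r, o in zip(reference, other)]


def apply_sign_flips(points, flips):
    """Multiply each coordinate of each point by its component's sign."""
    return [[v * s for v, s in zip(p, flips)] for p in points]


# Two components over three comments; the first is mirrored in `other`.
reference = [[0.5, -0.2, 0.8], [0.1, 0.9, -0.3]]
other = [[-0.5, 0.2, -0.8], [0.1, 0.9, -0.3]]
flips = detect_sign_flips(reference, other)

centers = [[1.0, 2.0], [-3.0, 0.5]]
corrected = apply_sign_flips(centers, flips)
print(flips)
print(corrected)
```

Applying the signs to the cluster centers of one implementation puts both plots in the same orientation before they are overlaid.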
802+
803+
---
804+
805+
## Known Limitations
806+
807+
### Clojure "nil has zero dimensionality" Error
808+
809+
Some large conversations fail in the Clojure implementation with:
810+
```
811+
clojure.lang.ExceptionInfo: nil has zero dimensionality, cannot get count for dimension: 0
812+
at polismath.math.conversation/partial-pca/learn (conversation.clj:719)
813+
```
814+
815+
**Cause**: The conversation data results in an empty or nil matrix during PCA computation. This can happen when:
816+
- All comments are moderated out
817+
- Not enough participants meet the "in-conv" threshold
818+
- Edge cases in the data that produce empty participant matrices
819+
820+
**Impact**: These conversations cannot be processed by the Clojure implementation and will fail in the cold-start generation script.
821+
822+
**Workaround**: The Python implementation may handle these edge cases differently. For affected conversations, only Python-generated outputs will be available.
823+
824+
**Known affected datasets**: bg2050 (pakistan)
825+
826+
---
827+
828+
## Configuration
829+
830+
### Environment Setup
831+
832+
The cold-start generation script requires database access. Configuration is loaded from the worktree root's `.env` file.
833+
834+
**Required variables**:
835+
```bash
836+
DATABASE_URL=postgres://postgres:password@host.docker.internal:5433/polis-dev
837+
MATH_ENV=prod # or 'dev' for development
838+
```
839+
840+
**Setup**:
841+
```bash
842+
# Copy from main polis repo if not already present
843+
cp /path/to/polis/.env /path/to/polis-kmeans/.env
844+
```
845+
846+
The script also needs the Clojure math worker Docker image available:
847+
```bash
848+
cd /path/to/polis-kmeans
849+
docker compose build math # If image not already built
850+
```
532851

533852
---
534853

@@ -538,3 +857,4 @@ After implementation:
538857
- **Clojure source**: `math/src/polismath/math/`
539858
- **Python clusters**: `polismath/pca_kmeans_rep/clusters.py`
540859
- **Python conversation**: `polismath/conversation/conversation.py`
860+
- **Cold-start script**: `scripts/generate_cold_start_clojure.py`

delphi/polismath/regression/datasets.py

Lines changed: 24 additions & 4 deletions
@@ -36,6 +36,7 @@ class DatasetInfo:
     is_local: bool
     has_golden: bool
     has_math_blob: bool
+    has_cold_start_blob: bool
     has_votes: bool
     has_comments: bool

@@ -66,10 +67,15 @@ def get_local_data_dir() -> Path:

 def _check_files(path: Path, report_id: str) -> dict:
     """Check which required files exist."""
+    # Check for both cold-start and original math blobs
+    cold_start_blob = path / f"{report_id}_math_blob_cold_start.json"
+    original_blob = path / f"{report_id}_math_blob.json"
+
     return {
         'has_votes': any(path.glob(f"*-{report_id}-votes.csv")),
         'has_comments': any(path.glob(f"*-{report_id}-comments.csv")),
-        'has_math_blob': (path / f"{report_id}_math_blob.json").exists(),
+        'has_math_blob': cold_start_blob.exists() or original_blob.exists(),
+        'has_cold_start_blob': cold_start_blob.exists(),
         'has_golden': (path / "golden_snapshot.json").exists(),
     }

@@ -150,8 +156,13 @@ def get_dataset_report_id(name: str) -> str:
     return get_dataset_info(name).report_id


-def get_dataset_files(name: str) -> Dict[str, str]:
-    """Get file paths for a dataset."""
+def get_dataset_files(name: str, prefer_cold_start: bool = True) -> Dict[str, str]:
+    """Get file paths for a dataset.
+
+    Args:
+        name: Dataset name
+        prefer_cold_start: If True (default), use cold-start blob when available
+    """
     info = get_dataset_info(name)
     rid = info.report_id

@@ -163,13 +174,22 @@ def find_file(pattern: str) -> str:
             raise ValueError(f"Multiple files matching {pattern} in {info.path}: {matches}")
         return str(matches[0].resolve())

+    # Check for cold-start blob first, fall back to original
+    cold_start_blob = info.path / f"{rid}_math_blob_cold_start.json"
+    original_blob = info.path / f"{rid}_math_blob.json"
+
+    if prefer_cold_start and cold_start_blob.exists():
+        math_blob_path = str(cold_start_blob)
+    else:
+        math_blob_path = str(original_blob)
+
     return {
         'report_id': rid,
         'data_dir': str(info.path),
         'votes': find_file(f"*-{rid}-votes.csv"),
         'comments': find_file(f"*-{rid}-comments.csv"),
         'summary': find_file(f"*-{rid}-summary.csv"),
-        'math_blob': str(info.path / f"{rid}_math_blob.json"),
+        'math_blob': math_blob_path,
     }
