[Stack 5/27] Cold-start Clojure math blob generation and cluster visualization (#2485)
* Add script to generate cold-start Clojure math blobs for fair comparison
Creates generate_cold_start_clojure.py to generate fresh cold-start Clojure
reference data for fair Python vs Clojure comparison. The script:
- Stops if math worker is running (prevents conflicts)
- Backs up existing math_main row
- Deletes row to force cold-start (load-or-init creates fresh new-conv)
- Runs Clojure computation via Docker
- Extracts cold-start math blob
- Restores original row automatically
Key features:
- Support for --all flag to process all datasets
- Support for --include-local flag for local datasets
- Automatic zid lookup from report_id via reports table
- Loads configuration from polis-kmeans/.env (DATABASE_URL)
Test infrastructure updates:
- datasets.py now prefers cold-start blobs when available
- Added has_cold_start_blob field to DatasetInfo
- get_dataset_files() uses cold-start blob by default
Documentation updates:
- Comprehensive usage guide in SESSION_HANDOFF_KMEANS.md
- Commands to find and verify cold-start blobs
- Configuration requirements and setup instructions
Reference data:
- Generated cold-start blobs for biodiversity and vw datasets
- Tests will now use these for fair cold vs cold comparison
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Implement conversation replay approach for cold-start generation and add visualization with PCA sign flip detection
Cold-start generation (generate_cold_start_clojure.py):
- Rewrite using "conversation replay" approach that works with Clojure poller design
- Creates temporary conversation with fresh zid, copies votes with fresh timestamps
- Runs poller with MATH_ZID_ALLOWLIST to only process the replayed conversation
- Automatically cleans up all temporary data (math tables, votes, conversation)
- Add bash wrapper script (generate_cold_start.sh) that stops math containers first
Visualization (visualize_cluster_comparison.py):
- Add PCA sign flip detection by comparing component correlations
- Apply sign corrections to base cluster centers before visualization
- Fix convex hull rendering to show outlines for both datasets in overlay view
- Include sign_flips in metrics JSON output
Documentation:
- Update SESSION_HANDOFF_KMEANS.md with new approach and remove "BROKEN" warnings
- Document the conversation replay workflow and cleanup behavior
Regenerate cold-start blobs for biodiversity and vw datasets.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Improve cold-start generator and cluster visualizer CLI
Cold-start generator (generate_cold_start_clojure.py):
- Add --pause-math to pause/resume workers instead of stopping
- Add --verbose/-v for real-time Clojure poller output
- Support multiple datasets as arguments
- Use fast INSERT...SELECT for vote copying (was executemany)
- Handle duplicate votes with DISTINCT ON
- Remove shell wrapper (functionality now in Python script)
Cluster visualizer (visualize_cluster_comparison.py):
- Add --all option for processing all datasets
- Synchronize X/Y axis limits in side-by-side plots
- Print full absolute paths for generated PNGs
- Support multiple datasets as arguments
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Compute cold-start math blobs for kmeans
* Interrupt upon Clojure error
* test_datasets: add cold_start blob fixture and has_cold_start arg
Deferred from commit 18ad361 — the test_datasets.py changes depend on the
has_cold_start_blob field introduced in datasets.py by the cold-start tooling.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
To ensure a fair comparison between the Python and Clojure implementations, we can generate fresh cold-start Clojure math blobs using the `generate_cold_start_clojure.py` script.

### Why Cold-Start Matters

The Clojure implementation uses `:last-clusters` to warm-start from previous state. This creates non-deterministic behavior:
- If the math worker was NOT restarted: it uses the previous clusters for initialization
- If the math worker WAS restarted: it loses its in-memory state and behaves differently

By generating cold-start references, we compare:
- **Python cold-start** (always) vs **Clojure cold-start** (forced by deleting math_main row)
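
To make the initialization effect concrete, here is a toy sketch in pure Python. It is not the polismath or Clojure clustering code: `kmeans_1d` and the six data points are hypothetical, chosen only to show that the same data can converge to different clusters depending on the starting centers, and that a warm start simply stays at the previous result.

```python
def kmeans_1d(points, init_centers, max_iters=100):
    """Lloyd's algorithm on 1-D points, starting from the given centers."""
    centers = list(init_centers)
    for _ in range(max_iters):
        # Assign each point to its nearest center (ties go to the lower index).
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; empty groups stay put.
        new_centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


points = [0, 1, 10, 11, 20, 21]
run_a = kmeans_1d(points, [0, 1])    # one initialization
run_b = kmeans_1d(points, [20, 21])  # a different initialization
warm = kmeans_1d(points, run_a)      # warm start from a previous result

print(run_a, run_b, warm)
```

The two runs split the middle pair of points differently, which is the kind of divergence a cold vs cold comparison is meant to rule out.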

### How to Generate Cold-Start Blobs

**Prerequisites:**
1. Stop any running math worker: `docker compose stop math` (from the worktree root)
2. Ensure DATABASE_URL is set in the worktree root's `.env` file

```bash
# ...
from polismath.regression import discover_datasets
datasets = discover_datasets(include_local=False)
for name, info in datasets.items():
    status = '✓ cold-start' if info.has_cold_start_blob else '✗ original only'
    print(f'{name}: {status}')
"
```

### How the Script Works (Fake Conversation Approach)

The script uses a "fake conversation" approach to generate true cold-start computations. This works WITH the Clojure poller's design rather than against it.

**The Process:**

1. **Create fake conversation**: Insert a minimal row in `conversations` table with a fresh auto-generated zid
2. **Copy votes with fresh timestamps**: Copy all votes from source zid to fake zid, with timestamps starting from "now" (spaced 10ms apart to preserve order)
3. **Run poller**: Start the Clojure poller with `MATH_ZID_ALLOWLIST={fake_zid}` to only process our fake conversation
4. **Wait for computation**: Poll `math_main` until `base-clusters` has data
5. **Extract and save**: Save the math blob with the original source zid for consistency
6. **Cleanup**: Delete all fake data from `conversations`, `votes`, `votes_latest_unique`, `participants`, and `math_main`
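
Step 2 can be sketched in plain Python. This is an illustration only: the real script copies votes in SQL, and the vote fields (`zid`, `pid`, `tid`, `vote`, `created`), the millisecond unit, and the helper names `fresh_timestamps` and `replay_votes` are assumptions made for the example.

```python
import time


def fresh_timestamps(n_votes, start_ms=None, step_ms=10):
    """Return n_votes strictly increasing millisecond timestamps from start_ms."""
    if start_ms is None:
        start_ms = int(time.time() * 1000)  # "now", in milliseconds
    return [start_ms + i * step_ms for i in range(n_votes)]


def replay_votes(votes, fake_zid, start_ms=None):
    """Copy votes to fake_zid, re-stamped 10 ms apart in their original order."""
    ordered = sorted(votes, key=lambda v: v["created"])
    stamps = fresh_timestamps(len(ordered), start_ms)
    # dict(v, ...) builds a new dict, so the source votes are left untouched.
    return [dict(v, zid=fake_zid, created=ts) for v, ts in zip(ordered, stamps)]


votes = [
    {"zid": 1, "pid": 0, "tid": 5, "vote": 1, "created": 300},
    {"zid": 1, "pid": 1, "tid": 5, "vote": -1, "created": 100},
    {"zid": 1, "pid": 2, "tid": 6, "vote": 0, "created": 200},
]
replayed = replay_votes(votes, fake_zid=999, start_ms=1_000)
for v in replayed:
    print(v["pid"], v["created"])
```

Sorting before re-stamping is what preserves the original vote order under the new timestamps.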

**Why This Works:**
- The poller finds votes with `created > last_poll_timestamp`
- Fresh timestamps ensure the votes are picked up
- A new zid means no interference from existing math_main rows or cached state
- The `MATH_ZID_ALLOWLIST` filter restricts processing to just our fake conversation
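
The selection logic described in these bullets can be modelled as a small filter. This is a sketch of the idea, not the Clojure poller: `votes_to_process` and the dict-based vote records are hypothetical, and only the `created > last_poll_timestamp` condition and the allowlist come from the text above.

```python
def votes_to_process(votes, last_poll_timestamp, zid_allowlist=None):
    """Select votes newer than the last poll, optionally limited to allowed zids."""
    selected = []
    for v in votes:
        if v["created"] <= last_poll_timestamp:
            continue  # already seen by an earlier poll
        if zid_allowlist is not None and v["zid"] not in zid_allowlist:
            continue  # conversation is not on the allowlist
        selected.append(v)
    return selected


votes = [
    {"zid": 7, "created": 50},     # old vote in a real conversation
    {"zid": 7, "created": 150},    # new vote in a real conversation
    {"zid": 999, "created": 160},  # replayed vote in the fake conversation
]
picked = votes_to_process(votes, last_poll_timestamp=100, zid_allowlist={999})
print(picked)
```

With the allowlist set to the fake zid, only the replayed vote is selected even though another conversation also has new votes.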

**Key Functions:**
- `create_fake_conversation(conn, source_zid)` → Creates minimal conversation row, returns new zid
- `copy_votes_with_fresh_timestamps(conn, source_zid, fake_zid)` → Copies votes with sequential fresh timestamps
- `cleanup_fake_conversation(conn, fake_zid)` → Deletes all fake data from all tables
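
As an illustration of the cleanup guarantee, the sketch below runs the deletion in a `finally` block so it happens even if the computation fails. It uses an in-memory SQLite database as a stand-in for the real database; the one-column schema, the deletion order, and the helper name `cleanup_fake_rows` are assumptions for the example, not the script's actual code.

```python
import sqlite3

# Tables named in the cleanup step; the single-column schema is a stand-in.
FAKE_DATA_TABLES = ["math_main", "votes_latest_unique", "votes", "participants", "conversations"]


def cleanup_fake_rows(conn, fake_zid):
    """Delete every row belonging to fake_zid; return per-table delete counts."""
    deleted = {}
    with conn:  # commits on success, rolls back if any DELETE fails
        for table in FAKE_DATA_TABLES:
            cur = conn.execute(f"DELETE FROM {table} WHERE zid = ?", (fake_zid,))
            deleted[table] = cur.rowcount
    return deleted


conn = sqlite3.connect(":memory:")
for table in FAKE_DATA_TABLES:
    conn.execute(f"CREATE TABLE {table} (zid INTEGER)")
    # zid 1 plays the original conversation, zid 999 the fake one.
    conn.executemany(f"INSERT INTO {table} (zid) VALUES (?)", [(1,), (999,)])
conn.commit()

try:
    pass  # the cold-start computation would run here
finally:
    counts = cleanup_fake_rows(conn, 999)

print(counts)
```

Only rows with the fake zid are removed, so the original conversation's rows are left as they were.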

### Command-Line Options

- `DATASETS...`: Specify one or more dataset names (e.g., `biodiversity vw`)
- `--all`: Process all datasets
- `--include-local`: Include datasets from `real_data/.local/`
- `--no-cleanup`: Keep fake conversation data for debugging (normally cleaned up automatically)
- `--timeout N`: Set timeout in seconds for math computation (default: 300)
- `--pause-math`: Automatically pause running math workers (resumes after completion)
- `--verbose` / `-v`: Show detailed output including real-time Clojure poller logs

**Examples:**
```bash
# Single dataset
uv run python scripts/generate_cold_start_clojure.py biodiversity

# Multiple datasets
uv run python scripts/generate_cold_start_clojure.py biodiversity vw american-assembly

# All datasets with verbose output and longer timeout
uv run python scripts/generate_cold_start_clojure.py --all --include-local --pause-math --timeout 600 -v
```
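
For readers who want to see how such an interface could be declared, here is a minimal `argparse` sketch that mirrors the options listed above. The flag names and the default timeout come from this document; the parser structure and the `build_parser` helper are assumptions, and the real script may be organized differently.

```python
import argparse


def build_parser():
    """Declare the documented options; help strings are omitted for brevity."""
    parser = argparse.ArgumentParser(prog="generate_cold_start_clojure.py")
    parser.add_argument("datasets", nargs="*", metavar="DATASETS")
    parser.add_argument("--all", action="store_true")
    parser.add_argument("--include-local", action="store_true")
    parser.add_argument("--no-cleanup", action="store_true")
    parser.add_argument("--timeout", type=int, default=300, metavar="N")
    parser.add_argument("--pause-math", action="store_true")
    parser.add_argument("--verbose", "-v", action="store_true")
    return parser


# Same arguments as the last example above.
args = build_parser().parse_args(
    ["--all", "--include-local", "--pause-math", "--timeout", "600", "-v"]
)
print(args)
```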

### Safety Features

- **Math worker detection**: Refuses to run if math worker is active (use `--pause-math` to auto-pause)
- **Environment validation**: Checks that DATABASE_URL is set before proceeding
- **Report ID validation**: Verifies report_id exists in reports table
- **Vote verification**: Confirms source zid has votes before attempting computation
- **Automatic cleanup**: Fake conversation data is always deleted (unless `--no-cleanup`)
- **No permanent changes**: Original database data is never modified
- **Error detection**: Monitors Clojure poller output for fatal errors and aborts early (see below)

### Clojure Error Detection

The script monitors the Clojure poller output for fatal errors and aborts early instead of waiting for timeout. Detected patterns:
- `"Failed conversation update"` - General computation failure
- `"nil has zero dimensionality"` - Empty matrix in PCA (see Known Limitations)
- `"Re-queueing messages for failed update"` - Persistent failure
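
A simple way to implement this kind of check is a substring scan over the poller's output lines. The sketch below assumes exactly the three patterns listed above; the helper name `first_fatal_error` and the sample log lines are hypothetical and do not reproduce real poller output.

```python
FATAL_PATTERNS = (
    "Failed conversation update",
    "nil has zero dimensionality",
    "Re-queueing messages for failed update",
)


def first_fatal_error(lines):
    """Return (pattern, line) for the first line containing a fatal pattern, else None."""
    for line in lines:
        for pattern in FATAL_PATTERNS:
            if pattern in line:
                return pattern, line.rstrip("\n")
    return None


# Made-up log lines, for illustration only.
sample_output = [
    "INFO polling for new votes\n",
    "ERROR Failed conversation update for zid 999\n",
    "INFO Re-queueing messages for failed update\n",
]
hit = first_fatal_error(sample_output)
print(hit)
```

Returning on the first match is what lets a caller stop the poller immediately rather than wait for the timeout.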

Full absolute paths are printed for each PNG, allowing alt-click to open in IDE.

### Features

- **PCA sign flip detection**: Automatically detects and corrects PCA sign flips between implementations
- **Synchronized axes**: Side-by-side plots share the same X/Y limits for direct comparison
- **Convex hulls**: Shows group boundaries with convex hulls in overlay mode
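
PCA components are only defined up to sign, so two implementations can produce mirrored axes for the same data. One way to detect this, sketched below in pure Python, is to check the sign of the correlation between matching components. The function names and the sample projections are hypothetical, and the sketch assumes both projections list the same participants in the same order; the visualizer's actual logic may differ.

```python
def detect_sign_flips(proj_a, proj_b):
    """For each component, return -1 if the two projections are anti-correlated, else 1."""
    n_components = len(proj_a[0])
    signs = []
    for k in range(n_components):
        a = [row[k] for row in proj_a]
        b = [row[k] for row in proj_b]
        mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
        # The covariance has the same sign as the Pearson correlation.
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
        signs.append(-1 if cov < 0 else 1)
    return signs


def apply_signs(points, signs):
    """Multiply each coordinate by the sign of its component."""
    return [[s * v for s, v in zip(signs, row)] for row in points]


proj_a = [[1.0, 2.0], [-1.0, 0.5], [0.5, -2.0]]
proj_b = [[1.1, -2.0], [-0.9, -0.4], [0.4, 2.1]]  # second axis is mirrored
signs = detect_sign_flips(proj_a, proj_b)
corrected_centers = apply_signs([[2.0, 3.0]], signs)
print(signs, corrected_centers)
```

Applying the detected signs to one implementation's cluster centers puts both sets of centers in the same orientation before plotting.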

---

## Known Limitations

### Clojure "nil has zero dimensionality" Error

Some large conversations fail in the Clojure implementation with:
```
clojure.lang.ExceptionInfo: nil has zero dimensionality, cannot get count for dimension: 0
at polismath.math.conversation/partial-pca/learn (conversation.clj:719)
```

**Cause**: The conversation data results in an empty or nil matrix during PCA computation. This can happen when:
- All comments are moderated out
- Not enough participants meet the "in-conv" threshold
- Edge cases in the data that produce empty participant matrices

**Impact**: These conversations cannot be processed by the Clojure implementation and will fail in the cold-start generation script.

**Workaround**: The Python implementation may handle these edge cases differently. For affected conversations, only Python-generated outputs will be available.

**Known affected datasets**: bg2050 (pakistan)

---

## Configuration

### Environment Setup

The cold-start generation script requires database access. Configuration is loaded from the worktree root's `.env` file.
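
A quick way to confirm that a `.env` file actually defines `DATABASE_URL` is to parse it directly. The sketch below handles only simple `KEY=VALUE` lines; `read_env_value` is a hypothetical helper, the connection string is a placeholder, and the script itself may load its configuration through a different mechanism.

```python
import os
import tempfile


def read_env_value(env_path, key):
    """Return the value for key from a simple KEY=VALUE .env file, or None if absent."""
    with open(env_path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            name, _, value = line.partition("=")
            if name.strip() == key:
                return value.strip().strip("'\"")  # drop surrounding quotes
    return None


# Demonstration against a throwaway file with a placeholder URL.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as fh:
        fh.write("# local settings\n")
        fh.write("DATABASE_URL='postgres://user:pass@localhost:5432/example'\n")
    database_url = read_env_value(env_path, "DATABASE_URL")
    missing = read_env_value(env_path, "NOT_SET")

print(database_url, missing)
```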