
feat: Union-find context compaction as alternative to flat summarization #22877

@kimjune01

Summary

An alternative compression strategy that shifts summarization off the blocking path. Instead of compressing all old messages into a single snapshot, it clusters them into a union-find forest and resolves summaries asynchronously while the main LLM call is in flight. Implemented behind a feature flag; 62 tests pass, and the approach was evaluated on 12 GitHub issue conversations.

Problem

Flat summarization has three pain points:

  1. Blocking UX: Compression triggers a 20-30s spinner while the LLM summarizes + verifies
  2. Lost details: A single summary snapshot drops specific technical values under compression pressure
  3. Cliff edge: once context usage crosses the 70% threshold, everything older is compressed at once, all-or-nothing

Proposed Solution

Overlap window architecture: hot zone + cold forest with deferred background summarization.

  • append() is synchronous (<1ms): pushes to the hot zone and graduates the oldest messages into a union-find forest by embedding similarity.
  • render() is synchronous (<0.1ms): returns cached cluster summaries plus the hot zone verbatim. No LLM calls.
  • resolveDirty() is async fire-and-forget: batch-summarizes dirty clusters during the main LLM call wait (a 5-30s window), one call per dirty cluster.
  • Overlap window: graduated messages stay in the hot zone for ~2 turns, so by the time they are evicted, their cluster summary is already resolved.

The key idea: hide summarization latency under the normal model wait instead of blocking the UI.
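To make the lifecycle concrete, here is a minimal sketch of the append/graduate/evict flow. The class name, types, and one-message-per-cluster "clustering" are hypothetical simplifications for illustration; the real ContextWindow clusters by embedding similarity and runs resolveDirty() fire-and-forget.

```typescript
type Message = { id: number; text: string };

interface Cluster {
  members: Message[];
  summary: string | null;
  dirty: boolean;
}

// Hypothetical, simplified sketch of the overlap-window lifecycle.
class ContextWindowSketch {
  private hot: Message[] = [];
  private graduated = 0; // count of hot-zone messages also in the forest (the overlap)
  private clusters: Cluster[] = [];

  constructor(private graduateAt = 4, private evictAt = 6) {
    // Config validation from the PR: evictAt < graduateAt would corrupt index tracking.
    if (evictAt < graduateAt) throw new Error("evictAt must be >= graduateAt");
  }

  // Synchronous, no LLM calls.
  append(msg: Message): void {
    this.hot.push(msg);
    // Graduate: messages beyond graduateAt enter the cold forest but remain
    // in the hot zone (the overlap window) until eviction.
    while (this.hot.length - this.graduated > this.graduateAt) {
      const m = this.hot[this.graduated++];
      this.clusters.push({ members: [m], summary: null, dirty: true });
    }
    // Evict: drop from the hot zone once past the overlap window; by now the
    // cluster summary has usually been resolved in the background.
    while (this.hot.length > this.evictAt) {
      this.hot.shift();
      this.graduated--;
    }
  }

  // Synchronous: cached summaries + hot zone verbatim.
  render(): string[] {
    const summaries = this.clusters
      .filter((c) => c.summary !== null)
      .map((c) => c.summary as string);
    return [...summaries, ...this.hot.map((m) => m.text)];
  }

  // Fire-and-forget in the real design; the summarizer is injected here.
  async resolveDirty(summarize: (ms: Message[]) => Promise<string>): Promise<void> {
    for (const c of this.clusters.filter((c) => c.dirty)) {
      const inputs = c.members; // snapshot for the staleness check
      const result = await summarize(inputs);
      if (c.members === inputs) { // discard stale result if the cluster changed
        c.summary = result;
        c.dirty = false;
      }
    }
  }
}
```

Under this sketch, a message written at turn t is summarizable from turn t + graduateAt but stays renderable verbatim until turn t + evictAt, which is the window that hides summarization latency.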

Experimental Results (v2)

Preregistered experiment on 12 GitHub issue conversations (120 messages each), evaluated with Gemini 3.1 Flash Lite as LLM-as-judge. Classification: exploratory benchmark validation.

| Hypothesis | Result | Details |
| --- | --- | --- |
| H2a: Latency | PASS | append p95 = 0.33ms, render p50 = 0.006ms (criterion: <100ms) |
| H3: Cost | PASS | 0.79x flat's token cost in this sample (112K vs 143K tokens) |
| H1: Recall | Inconclusive | +8.3pp (30.2% vs 21.9%), p=0.136. Won 8/12 conversations. Promising but not statistically significant at n=12. |
| H2b: Background | INFO | resolveDirty p50 = 4.0s, fits within the main LLM call wait |

Cost note: Union-find makes more summarizer calls than flat (35 vs 24), but each call summarizes a smaller cluster incrementally, so total token volume was lower in this sample.

Cost breakdown:

  • Union-find v2: 35 summarizer calls across 12 conversations
  • Flat: 24 calls (2 per conversation)
  • v1 prototype: ~960 calls (O(n²) — every merge triggered a full re-summarization)

Current evidence is strongest on latency and cost shape. Recall needs a larger sample (24+ conversations) with a production model to reach significance.

Full results, CSVs, and preregistration: kimjune01/union-find-compaction-for-gemini-cli

Design document (transformation spec, design decisions, work log): transformation-design.md

Implementation

Branch feat/union-find-compaction. Feature-flagged behind COMPRESSION_STRATEGY experiment flag, defaults to flat.

New files:

  • contextWindow.ts: Forest (union-find with path compression + union by rank) and ContextWindow (overlap window manager)
  • embeddingService.ts: TF-IDF embedder (synchronous, no API calls). Embedder interface includes optional embedQuery() for non-mutating retrieval.
  • clusterSummarizer.ts: LLM summarizer using existing BaseLlmClient
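For reference, the union-find core named above (path compression + union by rank) can be sketched as follows. This is a generic illustration of the technique with a hypothetical class name, not the actual Forest class, which additionally tracks centroids, summaries, and dirty flags:

```typescript
// Generic union-find sketch: path compression + union by rank.
class ForestSketch {
  private parent: number[] = [];
  private rank: number[] = [];

  // Adds a singleton set; returns the new node's id.
  add(): number {
    const id = this.parent.length;
    this.parent.push(id);
    this.rank.push(0);
    return id;
  }

  // Path compression: point every visited node directly at the root.
  find(x: number): number {
    if (this.parent[x] !== x) {
      this.parent[x] = this.find(this.parent[x]);
    }
    return this.parent[x];
  }

  // Union by rank: attach the shallower tree under the deeper root.
  union(a: number, b: number): number {
    const ra = this.find(a);
    const rb = this.find(b);
    if (ra === rb) return ra;
    if (this.rank[ra] < this.rank[rb]) { this.parent[ra] = rb; return rb; }
    if (this.rank[ra] > this.rank[rb]) { this.parent[rb] = ra; return ra; }
    this.parent[rb] = ra;
    this.rank[ra]++;
    return ra;
  }
}
```

The two optimizations together give near-constant amortized find/union, which is what keeps append() on the synchronous fast path.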

Modified files:

  • chatCompressionService.ts: Dual-path routing (flat vs union-find)
  • flagNames.ts: Feature flag
  • config.ts: getCompressionStrategy() method

62 tests across 4 test files. All existing flat compression tests still pass.

Review guide (3 commits on branch):

  1. 9dbb15f — Initial implementation (v1 architecture, superseded)
  2. 6950b3f — v2 rewrite with overlap window (the core design)
  3. cc8e2e2 — Bug fixes from code review (start here for the subtle parts)

Reading order within contextWindow.ts: cosineSimilarity() → Forest class → ContextWindow class. Then chatCompressionService.ts's compactWithUnionFind() for integration.

Review-Driven Hardening

After the initial implementation, a code review identified 10 issues (5 fixed, 5 documented as known limitations). Fixes applied:

  • Embedding dimension mismatch → NaN: TF-IDF vocabulary grows over time, making newer vectors longer. cosineSimilarity() and centroid merging now handle mismatched dimensions safely.
  • resolveDirty() race condition: If union() modifies a cluster during async summarization, the stale result is discarded via reference-equality check on dirty inputs. Combined inputs resolve in a future call.
  • Render query contamination: embed() mutates the TF-IDF corpus. Added embedQuery() (non-mutating) for retrieval paths so searching doesn't change future embeddings.
  • In-flight merge during summarization: If a root is merged while its summary is in flight, the summary is safely skipped rather than overwriting the merged cluster's combined dirty state.
  • Config validation: Constructor rejects evictAt < graduateAt (would corrupt internal index tracking).
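The dimension-mismatch fix can be illustrated with a generic sketch (hypothetical function name, not the actual cosineSimilarity() in contextWindow.ts): vectors of different lengths are compared by treating the missing trailing entries as zeros, and a zero-norm vector yields 0 rather than 0/0 = NaN.

```typescript
// Dimension-safe cosine similarity: TF-IDF vectors grow with the vocabulary,
// so newer vectors can be longer than older ones.
function cosineSimilaritySafe(a: number[], b: number[]): number {
  // Extra dimensions in the longer vector pair with implicit zeros,
  // so they contribute nothing to the dot product.
  const n = Math.min(a.length, b.length);
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < n; i++) dot += a[i] * b[i];
  for (const x of a) na += x * x; // norms use the FULL vectors
  for (const x of b) nb += x * x;
  if (na === 0 || nb === 0) return 0; // avoid 0/0 = NaN for empty embeddings
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```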

Details in the work log.
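As a generic illustration of the embed()/embedQuery() split behind the contamination fix (hypothetical class name and simplified TF-IDF weighting, not the actual embeddingService.ts): embed() updates the document-frequency corpus as it vectorizes, while embedQuery() vectorizes against the current corpus without mutating it, so retrieval-time queries don't skew future IDF weights.

```typescript
// Simplified TF-IDF embedder sketch with mutating and non-mutating paths.
class TfIdfSketch {
  private vocab = new Map<string, number>(); // term -> dimension index
  private docFreq: number[] = [];            // per-term document counts
  private docs = 0;

  // Mutating path: grows the vocabulary and document frequencies.
  embed(text: string): number[] {
    const terms = text.toLowerCase().split(/\s+/).filter(Boolean);
    this.docs++;
    for (const t of new Set(terms)) {
      if (!this.vocab.has(t)) {
        this.vocab.set(t, this.vocab.size);
        this.docFreq.push(0);
      }
      this.docFreq[this.vocab.get(t)!]++;
    }
    return this.vectorize(terms);
  }

  // Read-only path for retrieval: the corpus is left unchanged.
  embedQuery(text: string): number[] {
    return this.vectorize(text.toLowerCase().split(/\s+/).filter(Boolean));
  }

  private vectorize(terms: string[]): number[] {
    const v = new Array(this.vocab.size).fill(0);
    for (const t of terms) {
      const i = this.vocab.get(t);
      if (i === undefined) continue; // unknown query terms simply drop out
      v[i] += Math.log(1 + this.docs / (1 + this.docFreq[i]));
    }
    return v;
  }
}
```

The observable contract is simple: any number of embedQuery() calls leaves vector dimensionality and weights unchanged, while each embed() call may grow both.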

Known Limitations

  • Unbounded _nodes map: Graduated messages stay in the forest indefinitely. Design doc specifies a 1000-message eviction policy, but it's not yet implemented.
  • IDF drift: Old embeddings and centroids are never recomputed as the TF-IDF corpus evolves. Retrieval quality degrades over very long conversations.
  • Unbounded TF-IDF vocabulary: _vocab and _termDocFreq grow without bound. Bounded in practice by conversation length.
  • totalMessages double-counts overlap: Graduated messages in the overlap window exist in both forest and hot zone.
  • AbortSignal not propagated: ClusterSummarizer creates a detached AbortController per call. User cancellation doesn't stop background summarization.

On Change Size

This is a large change (3 new + 3 modified files). It's currently packaged as one feature-flagged experiment path because the components are tightly coupled. If the direction is interesting, follow-up refinement can be split into smaller reviewable pieces.

Mitigations:

  • Feature-flagged, defaults to flat, zero impact unless explicitly enabled
  • No migration, existing conversations untouched
  • 62 dedicated tests + existing compression tests still pass
  • Preregistered experiment with reproducible harness (npx tsx run-v2-experiment.ts)
  • Recommend validation at higher sample size with Gemini 3 Pro before enabling by default

Ask

Looking for feedback on whether this direction is worth upstreaming behind an experiment flag. If so, I can open a PR for code review.

Disclosure

All written artifacts were produced with LLM assistance: implementation and prose via Claude (Opus 4.6), code review and issue review via GPT-5.4 (codex). The experiment harness, data collection, and analysis are reproducible from the linked repo.

Labels: area/agent (Issues related to Core Agent, Tools, Memory, Sub-Agents, Hooks, Agent Quality)