## Summary
An alternative compression strategy that shifts summarization off the blocking path. Instead of compressing all old messages into a single snapshot, this clusters them into a union-find forest and resolves summaries asynchronously during the main LLM call wait. Implemented behind a feature flag with 62 tests passing, evaluated on 12 GitHub issue conversations.
## Problem
Flat summarization has three pain points:
- Blocking UX: Compression triggers a 20-30s spinner while the LLM summarizes + verifies
- Lost details: A single summary snapshot drops specific technical values under compression pressure
- Cliff edge: Everything older than 70% gets compressed at once, all-or-nothing
## Proposed Solution
Overlap window architecture: hot zone + cold forest with deferred background summarization.
- `append()` is synchronous (<1ms). Pushes to the hot zone and graduates the oldest messages into a union-find forest by embedding similarity.
- `render()` is synchronous (<0.1ms). Returns cached cluster summaries plus the hot zone verbatim. No LLM calls.
- `resolveDirty()` is async fire-and-forget. Batch-summarizes dirty clusters during the main LLM call wait (5-30s window). One call per dirty cluster.
- Overlap window: Graduated messages stay in hot zone for ~2 turns. By the time they evict, their cluster summary is already resolved.
The key idea: hide summarization latency under the normal model wait instead of blocking the UI.
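The shape of that three-method surface can be sketched as follows. The method names come from the document; the signatures, the `ContextWindowLike` interface, and the `turn()` driver are illustrative assumptions, not the actual implementation:

```typescript
// Hypothetical sketch of the hot-zone API described above.
// Signatures and the turn() driver are assumptions for illustration.
interface ContextWindowLike {
  append(message: string): void;     // sync: push to hot zone, maybe graduate
  render(): string[];                // sync: cached summaries + hot zone verbatim
  resolveDirty(): Promise<void>;     // async: one LLM call per dirty cluster
}

async function turn(
  ctx: ContextWindowLike,
  userMessage: string,
  callModel: (prompt: string[]) => Promise<string>,
): Promise<string> {
  ctx.append(userMessage);
  // Fire-and-forget: summarization overlaps the 5-30s model wait
  // instead of blocking the UI before it.
  const background = ctx.resolveDirty();
  const reply = await callModel(ctx.render());
  ctx.append(reply);
  // Stale or skipped summaries are not fatal; they resolve on a later pass.
  await background.catch(() => {});
  return reply;
}
```

The point of the sketch is the ordering: `resolveDirty()` is started before `callModel()` is awaited, so its latency hides inside the model wait.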
## Experimental Results (v2)
Preregistered experiment on 12 GitHub issue conversations (120 messages each), evaluated with Gemini 3.1 Flash Lite as LLM-as-judge. Classification: exploratory benchmark validation.
| Hypothesis | Result | Details |
| --- | --- | --- |
| H2a: Latency | PASS | append p95 = 0.33ms, render p50 = 0.006ms (criterion: <100ms) |
| H3: Cost | PASS | 0.79x flat's token cost in this sample (112K vs 143K tokens) |
| H1: Recall | Inconclusive | +8.3pp (30.2% vs 21.9%), p=0.136. Won 8/12 conversations. Promising but not statistically significant at n=12. |
| H2b: Background | INFO | resolveDirty p50 = 4.0s, fits within main LLM call wait |
Cost note: Union-find makes more summarizer calls than flat (35 vs 24), but each call summarizes a smaller cluster incrementally, so total token volume was lower in this sample.
Cost breakdown:
- Union-find v2: 35 summarizer calls across 12 conversations
- Flat: 24 calls (2 per conversation)
- v1 prototype: ~960 calls (O(n²) — every merge triggered a full re-summarization)
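The difference in cost shape can be illustrated with a toy model. This is not the actual implementation; the policies, the `resolveEvery` cadence, and the single-cluster framing are simplifying assumptions made only to show why per-merge re-summarization explodes while deferred per-dirty-cluster resolution stays flat:

```typescript
// Toy model (illustrative assumptions, not measured code): count summarizer
// calls as n messages graduate into a cluster one at a time.
// v1: every merge re-summarizes immediately, and each call re-reads the whole
//     growing cluster, hence O(n^2) token volume.
// v2: a merge only marks the cluster dirty; resolveDirty() runs during the
//     model wait and makes one call per dirty cluster.
function simulateCalls(n: number, policy: "v1" | "v2", resolveEvery = 5): number {
  let calls = 0;
  let dirty = false;
  for (let i = 1; i <= n; i++) {
    if (policy === "v1") {
      calls++; // one re-summarization per merge
    } else {
      dirty = true;
      if (i % resolveEvery === 0) { // a resolveDirty() pass
        calls++;
        dirty = false;
      }
    }
  }
  if (policy === "v2" && dirty) calls++; // final resolve
  return calls;
}
```

Under these assumptions, 100 graduations cost 100 calls under the v1 policy but only 20 under v2 with a resolve pass every 5 messages.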
Current evidence is strongest on latency and cost shape. Recall needs a larger sample (24+ conversations) with a production model to reach significance.
Full results, CSVs, and preregistration: kimjune01/union-find-compaction-for-gemini-cli
Design document (transformation spec, design decisions, work log): transformation-design.md
## Implementation
Branch `feat/union-find-compaction`. Feature-flagged behind the `COMPRESSION_STRATEGY` experiment flag; defaults to flat.
New files:
- `contextWindow.ts`: `Forest` (union-find with path compression + union by rank) and `ContextWindow` (overlap window manager)
- `embeddingService.ts`: TF-IDF embedder (synchronous, no API calls). The `Embedder` interface includes an optional `embedQuery()` for non-mutating retrieval.
- `clusterSummarizer.ts`: LLM summarizer using the existing `BaseLlmClient`

Modified files:
- `chatCompressionService.ts`: Dual-path routing (flat vs union-find)
- `flagNames.ts`: Feature flag
- `config.ts`: `getCompressionStrategy()` method
62 tests across 4 test files. All existing flat compression tests still pass.
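For orientation, the core structure `Forest` is described as union-find with path compression and union by rank. A minimal standalone version of that structure looks like the sketch below; the class name `MiniForest` is hypothetical, and the real `Forest` additionally tracks centroids, summaries, and dirty flags:

```typescript
// Minimal union-find with path compression and union by rank: the structure
// the document says Forest is built on. Illustrative sketch only.
class MiniForest {
  private parent: number[] = [];
  private rank: number[] = [];

  // Create a new singleton cluster and return its id.
  makeSet(): number {
    this.parent.push(this.parent.length);
    this.rank.push(0);
    return this.parent.length - 1;
  }

  // Find the cluster root, flattening the path as we go (path halving).
  find(x: number): number {
    while (this.parent[x] !== x) {
      this.parent[x] = this.parent[this.parent[x]];
      x = this.parent[x];
    }
    return x;
  }

  // Merge two clusters; the higher-rank root absorbs the other.
  union(a: number, b: number): number {
    let ra = this.find(a);
    let rb = this.find(b);
    if (ra === rb) return ra;
    if (this.rank[ra] < this.rank[rb]) [ra, rb] = [rb, ra]; // union by rank
    this.parent[rb] = ra;
    if (this.rank[ra] === this.rank[rb]) this.rank[ra]++;
    return ra;
  }
}
```

Both optimizations together give near-constant amortized `find`/`union`, which is what keeps `append()` in the sub-millisecond range.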
Review guide (3 commits on branch):
- `9dbb15f` — Initial implementation (v1 architecture, superseded)
- `6950b3f` — v2 rewrite with overlap window (the core design)
- `cc8e2e2` — Bug fixes from code review (start here for the subtle parts)
Reading order within `contextWindow.ts`: `cosineSimilarity()` → `Forest` class → `ContextWindow` class. Then `compactWithUnionFind()` in `chatCompressionService.ts` for the integration.
## Review-Driven Hardening
After the initial implementation, a code review identified 10 issues (5 fixed, 5 documented as known limitations). Fixes applied:
- Embedding dimension mismatch → NaN: the TF-IDF vocabulary grows over time, making newer vectors longer. `cosineSimilarity()` and centroid merging now handle mismatched dimensions safely.
- `resolveDirty()` race condition: if `union()` modifies a cluster during async summarization, the stale result is discarded via a reference-equality check on the dirty inputs. Combined inputs resolve in a future call.
- Render query contamination: `embed()` mutates the TF-IDF corpus. Added `embedQuery()` (non-mutating) for retrieval paths so searching doesn't change future embeddings.
- In-flight merge during summarization: if a root is merged while its summary is in flight, the summary is safely skipped rather than overwriting the merged cluster's combined dirty state.
- Config validation: the constructor rejects `evictAt < graduateAt` (which would corrupt internal index tracking).

Details in the work log.
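The first fix, dimension-tolerant cosine similarity, can be sketched as follows. Treating the missing trailing dimensions of an older (shorter) vector as zeros makes the dot product well-defined, and guarding zero-norm vectors avoids the NaN. The function name and exact handling here are assumptions, not the repository's code:

```typescript
// Sketch of a dimension-tolerant cosine similarity for a growing TF-IDF
// vocabulary: older vectors are shorter, so extra trailing dimensions pair
// with implicit zeros. Illustrative; not the actual implementation.
function cosineSimilaritySafe(a: number[], b: number[]): number {
  // Dimensions beyond min(a.length, b.length) multiply an implicit zero,
  // so they contribute nothing to the dot product.
  const n = Math.min(a.length, b.length);
  let dot = 0;
  for (let i = 0; i < n; i++) dot += a[i] * b[i];

  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  const na = norm(a);
  const nb = norm(b);
  if (na === 0 || nb === 0) return 0; // guard: zero vector would yield NaN
  return dot / (na * nb);
}
```

Note the norms still use each vector's full length, so a padded vector and its unpadded original score 1.0 against each other.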
## Known Limitations
- Unbounded `_nodes` map: graduated messages stay in the forest indefinitely. The design doc specifies a 1000-message eviction policy, but it's not yet implemented.
- IDF drift: old embeddings and centroids are never recomputed as the TF-IDF corpus evolves. Retrieval quality degrades over very long conversations.
- Unbounded TF-IDF vocabulary: `_vocab` and `_termDocFreq` grow without bound. Bounded in practice by conversation length.
- `totalMessages` double-counts overlap: graduated messages in the overlap window exist in both the forest and the hot zone.
- AbortSignal not propagated: `ClusterSummarizer` creates a detached `AbortController` per call, so user cancellation doesn't stop background summarization.
## On Change Size
This is a large change (3 new + 3 modified files). It's currently packaged as one feature-flagged experiment path because the components are tightly coupled. If the direction is interesting, follow-up refinement can be split into smaller reviewable pieces.
Mitigations:
- Feature-flagged, defaults to flat, zero impact unless explicitly enabled
- No migration, existing conversations untouched
- 62 dedicated tests + existing compression tests still pass
- Preregistered experiment with reproducible harness (`npx tsx run-v2-experiment.ts`)
- Recommend validation at higher sample size with Gemini 3 Pro before enabling by default
## Ask
Looking for feedback on whether this direction is worth upstreaming behind an experiment flag. If so, I can open a PR for code review.
## Disclosure
All written artifacts were produced with LLM assistance: implementation and prose via Claude (Opus 4.6), code review and issue review via GPT-5.4 (codex). The experiment harness, data collection, and analysis are reproducible from the linked repo.