
feat: Union-find context compaction as alternative to flat summarization #22877

@kimjune01

Summary

An alternative compression strategy that shifts summarization off the blocking path. Instead of compressing all old messages into a single snapshot, it clusters them into a union-find forest and resolves summaries asynchronously while the main LLM call is in flight. Implemented behind a feature flag; 62 tests pass, and the approach was evaluated on 12 GitHub issue conversations.

Problem

Flat summarization has three pain points:

  1. Blocking UX: Compression triggers a 20-30s spinner while the LLM summarizes + verifies
  2. Lost details: A single summary snapshot drops specific technical values under compression pressure
  3. Cliff edge: once context usage crosses the 70% threshold, everything older is compressed at once, all-or-nothing

Proposed Solution

Overlap window architecture: hot zone + cold forest with deferred background summarization.

  • append() is synchronous (<1ms): pushes to the hot zone and graduates the oldest messages into a union-find forest by embedding similarity.
  • render() is synchronous (<0.1ms): returns cached cluster summaries plus the hot zone verbatim. No LLM calls.
  • resolveDirty() is async fire-and-forget: batch-summarizes dirty clusters during the main LLM call wait (a 5-30s window), one call per dirty cluster.
  • Overlap window: graduated messages stay in the hot zone for ~2 turns, so by the time they are evicted, their cluster summary is already resolved.

The key idea: hide summarization latency under the normal model wait instead of blocking the UI.
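To make the lifecycle concrete, here is a minimal sketch of the append/graduate/evict flow. The class name, types, and one-message-per-cluster "clustering" are hypothetical simplifications for illustration; the real ContextWindow clusters by embedding similarity and runs resolveDirty() fire-and-forget.

```typescript
type Message = { id: number; text: string };

interface Cluster {
  members: Message[];
  summary: string | null;
  dirty: boolean;
}

// Hypothetical, simplified sketch of the overlap-window lifecycle.
class ContextWindowSketch {
  private hot: Message[] = [];
  private graduated = 0; // count of hot-zone messages also in the forest (the overlap)
  private clusters: Cluster[] = [];

  constructor(private graduateAt = 4, private evictAt = 6) {
    // Config validation from the PR: evictAt < graduateAt would corrupt index tracking.
    if (evictAt < graduateAt) throw new Error("evictAt must be >= graduateAt");
  }

  // Synchronous, no LLM calls.
  append(msg: Message): void {
    this.hot.push(msg);
    // Graduate: messages beyond graduateAt enter the cold forest but remain
    // in the hot zone (the overlap window) until eviction.
    while (this.hot.length - this.graduated > this.graduateAt) {
      const m = this.hot[this.graduated++];
      this.clusters.push({ members: [m], summary: null, dirty: true });
    }
    // Evict: drop from the hot zone once past the overlap window; by now the
    // cluster summary has usually been resolved in the background.
    while (this.hot.length > this.evictAt) {
      this.hot.shift();
      this.graduated--;
    }
  }

  // Synchronous: cached summaries + hot zone verbatim.
  render(): string[] {
    const summaries = this.clusters
      .filter((c) => c.summary !== null)
      .map((c) => c.summary as string);
    return [...summaries, ...this.hot.map((m) => m.text)];
  }

  // Fire-and-forget in the real design; the summarizer is injected here.
  async resolveDirty(summarize: (ms: Message[]) => Promise<string>): Promise<void> {
    for (const c of this.clusters.filter((c) => c.dirty)) {
      const inputs = c.members; // snapshot for the staleness check
      const result = await summarize(inputs);
      if (c.members === inputs) { // discard stale result if the cluster changed
        c.summary = result;
        c.dirty = false;
      }
    }
  }
}
```

Under this sketch, a message written at turn t is summarizable from turn t + graduateAt but stays renderable verbatim until turn t + evictAt, which is the window that hides summarization latency.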

Experimental Results (v2)

Preregistered experiment on 12 GitHub issue conversations (120 messages each), evaluated with Gemini 3.1 Flash Lite as LLM-as-judge. Classification: exploratory benchmark validation.

| Hypothesis | Result | Details |
| --- | --- | --- |
| H2a: Latency | PASS | append p95 = 0.33ms, render p50 = 0.006ms (criterion: <100ms) |
| H3: Cost | PASS | 0.79x flat's token cost in this sample (112K vs 143K tokens) |
| H1: Recall | Inconclusive | +8.3pp (30.2% vs 21.9%), p=0.136. Won 8/12 conversations. Promising but not statistically significant at n=12. |
| H2b: Background | INFO | resolveDirty p50 = 4.0s, fits within the main LLM call wait |

Cost note: Union-find makes more summarizer calls than flat (35 vs 24), but each call summarizes a smaller cluster incrementally, so total token volume was lower in this sample.

Cost breakdown:

  • Union-find v2: 35 summarizer calls across 12 conversations
  • Flat: 24 calls (2 per conversation)
  • v1 prototype: ~960 calls (O(n²) — every merge triggered a full re-summarization)

Current evidence is strongest on latency and cost shape. Recall needs a larger sample (24+ conversations) with a production model to reach significance.

Full results, CSVs, and preregistration: kimjune01/union-find-compaction-for-gemini-cli

Design document (transformation spec, design decisions, work log): transformation-design.md

Implementation

Branch feat/union-find-compaction. Feature-flagged behind COMPRESSION_STRATEGY experiment flag, defaults to flat.

New files:

  • contextWindow.ts: Forest (union-find with path compression + union by rank) and ContextWindow (overlap window manager)
  • embeddingService.ts: TF-IDF embedder (synchronous, no API calls). Embedder interface includes optional embedQuery() for non-mutating retrieval.
  • clusterSummarizer.ts: LLM summarizer using existing BaseLlmClient
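For reference, the union-find core named above (path compression + union by rank) can be sketched as follows. This is a generic illustration of the technique with a hypothetical class name, not the actual Forest class, which additionally tracks centroids, summaries, and dirty flags:

```typescript
// Generic union-find sketch: path compression + union by rank.
class ForestSketch {
  private parent: number[] = [];
  private rank: number[] = [];

  // Adds a singleton set; returns the new node's id.
  add(): number {
    const id = this.parent.length;
    this.parent.push(id);
    this.rank.push(0);
    return id;
  }

  // Path compression: point every visited node directly at the root.
  find(x: number): number {
    if (this.parent[x] !== x) {
      this.parent[x] = this.find(this.parent[x]);
    }
    return this.parent[x];
  }

  // Union by rank: attach the shallower tree under the deeper root.
  union(a: number, b: number): number {
    const ra = this.find(a);
    const rb = this.find(b);
    if (ra === rb) return ra;
    if (this.rank[ra] < this.rank[rb]) { this.parent[ra] = rb; return rb; }
    if (this.rank[ra] > this.rank[rb]) { this.parent[rb] = ra; return ra; }
    this.parent[rb] = ra;
    this.rank[ra]++;
    return ra;
  }
}
```

The two optimizations together give near-constant amortized find/union, which is what keeps append() on the synchronous fast path.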

Modified files:

  • chatCompressionService.ts: Dual-path routing (flat vs union-find)
  • flagNames.ts: Feature flag
  • config.ts: getCompressionStrategy() method

62 tests across 4 test files. All existing flat compression tests still pass.

Review guide (3 commits on branch):

  1. 9dbb15f — Initial implementation (v1 architecture, superseded)
  2. 6950b3f — v2 rewrite with overlap window (the core design)
  3. cc8e2e2 — Bug fixes from code review (start here for the subtle parts)

Reading order within contextWindow.ts: cosineSimilarity() → Forest class → ContextWindow class. Then chatCompressionService.ts's compactWithUnionFind() for integration.

Review-Driven Hardening

After the initial implementation, a code review identified 10 issues (5 fixed, 5 documented as known limitations). Fixes applied:

  • Embedding dimension mismatch → NaN: TF-IDF vocabulary grows over time, making newer vectors longer. cosineSimilarity() and centroid merging now handle mismatched dimensions safely.
  • resolveDirty() race condition: If union() modifies a cluster during async summarization, the stale result is discarded via reference-equality check on dirty inputs. Combined inputs resolve in a future call.
  • Render query contamination: embed() mutates the TF-IDF corpus. Added embedQuery() (non-mutating) for retrieval paths so searching doesn't change future embeddings.
  • In-flight merge during summarization: If a root is merged while its summary is in flight, the summary is safely skipped rather than overwriting the merged cluster's combined dirty state.
  • Config validation: Constructor rejects evictAt < graduateAt (would corrupt internal index tracking).
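The dimension-mismatch fix can be illustrated with a generic sketch (hypothetical function name, not the actual cosineSimilarity() in contextWindow.ts): vectors of different lengths are compared by treating the missing trailing entries as zeros, and a zero-norm vector yields 0 rather than 0/0 = NaN.

```typescript
// Dimension-safe cosine similarity: TF-IDF vectors grow with the vocabulary,
// so newer vectors can be longer than older ones.
function cosineSimilaritySafe(a: number[], b: number[]): number {
  // Extra dimensions in the longer vector pair with implicit zeros,
  // so they contribute nothing to the dot product.
  const n = Math.min(a.length, b.length);
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < n; i++) dot += a[i] * b[i];
  for (const x of a) na += x * x; // norms use the FULL vectors
  for (const x of b) nb += x * x;
  if (na === 0 || nb === 0) return 0; // avoid 0/0 = NaN for empty embeddings
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```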

Details in the work log.
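As a generic illustration of the embed()/embedQuery() split behind the contamination fix (hypothetical class name and simplified TF-IDF weighting, not the actual embeddingService.ts): embed() updates the document-frequency corpus as it vectorizes, while embedQuery() vectorizes against the current corpus without mutating it, so retrieval-time queries don't skew future IDF weights.

```typescript
// Simplified TF-IDF embedder sketch with mutating and non-mutating paths.
class TfIdfSketch {
  private vocab = new Map<string, number>(); // term -> dimension index
  private docFreq: number[] = [];            // per-term document counts
  private docs = 0;

  // Mutating path: grows the vocabulary and document frequencies.
  embed(text: string): number[] {
    const terms = text.toLowerCase().split(/\s+/).filter(Boolean);
    this.docs++;
    for (const t of new Set(terms)) {
      if (!this.vocab.has(t)) {
        this.vocab.set(t, this.vocab.size);
        this.docFreq.push(0);
      }
      this.docFreq[this.vocab.get(t)!]++;
    }
    return this.vectorize(terms);
  }

  // Read-only path for retrieval: the corpus is left unchanged.
  embedQuery(text: string): number[] {
    return this.vectorize(text.toLowerCase().split(/\s+/).filter(Boolean));
  }

  private vectorize(terms: string[]): number[] {
    const v = new Array(this.vocab.size).fill(0);
    for (const t of terms) {
      const i = this.vocab.get(t);
      if (i === undefined) continue; // unknown query terms simply drop out
      v[i] += Math.log(1 + this.docs / (1 + this.docFreq[i]));
    }
    return v;
  }
}
```

The observable contract is simple: any number of embedQuery() calls leaves vector dimensionality and weights unchanged, while each embed() call may grow both.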

Known Limitations

  • Unbounded _nodes map: Graduated messages stay in the forest indefinitely. Design doc specifies a 1000-message eviction policy, but it's not yet implemented.
  • IDF drift: Old embeddings and centroids are never recomputed as the TF-IDF corpus evolves. Retrieval quality degrades over very long conversations.
  • Unbounded TF-IDF vocabulary: _vocab and _termDocFreq grow without bound. Bounded in practice by conversation length.
  • totalMessages double-counts overlap: Graduated messages in the overlap window exist in both forest and hot zone.
  • AbortSignal not propagated: ClusterSummarizer creates a detached AbortController per call. User cancellation doesn't stop background summarization.

On Change Size

This is a large change (3 new + 3 modified files). It's currently packaged as one feature-flagged experiment path because the components are tightly coupled. If the direction is interesting, follow-up refinement can be split into smaller reviewable pieces.

Mitigations:

  • Feature-flagged, defaults to flat, zero impact unless explicitly enabled
  • No migration, existing conversations untouched
  • 62 dedicated tests + existing compression tests still pass
  • Preregistered experiment with reproducible harness (npx tsx run-v2-experiment.ts)
  • Recommend validation at higher sample size with Gemini 3 Pro before enabling by default

Ask

Looking for feedback on whether this direction is worth upstreaming behind an experiment flag. If so, I can open a PR for code review.

Disclosure

All written artifacts were produced with LLM assistance: implementation and prose via Claude (Opus 4.6), code review and issue review via GPT-5.4 (codex). The experiment harness, data collection, and analysis are reproducible from the linked repo.

Labels: area/agent (Issues related to Core Agent, Tools, Memory, Sub-Agents, Hooks, Agent Quality)