Reversible context compression (ctxzip): 60-95% fewer tokens on bulky tool outputs, losslessly #241

Open: initializ-mk wants to merge 10 commits into main from feat/ctxzip-compression

@initializ-mk

Contributor

What

Integrates ctxzip v0.1.0 — reversible, structure-aware context compression — into the agent loop. Bulky tool outputs (JSON arrays, logs, grep results) are compressed before reaching the LLM; everything dropped is stored in a durable local bbolt store behind an inline <<ctxzip:HASH ...>> marker and retrievable via the new context_expand tool. Lossy on the wire, lossless end-to-end.

Off by default. Enable via compression.enabled: true in forge.yaml, FORGE_COMPRESSION=true, forge run --compression, or the new init-wizard step.

Architecture — three seams (forge-core/compress/)

  1. AfterToolExecHook: compresses tool output once, at production time, before it enters Memory. Compressed bytes never change afterwards, so the conversation prefix stays byte-stable and provider prompt caches keep hitting. Registered after guardrail/redaction hooks; error results and small outputs stay verbatim.
  2. WrapClient: an llm.Client decorator below the FallbackChain (so it covers retries and compactor calls) that compresses the live zone of each request. Deterministic across turns: the relevance query is pinned to the first user message, never the latest turn.
  3. ExpandTool: the context_expand(hash) builtin; the loop executes it like any other tool, so no retrieval machinery is needed. Tolerates imperfect hashes (whole markers, truncated hex → unique-prefix resolution against recently emitted markers).
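
As a rough illustration of the first seam together with the fail-open posture, the sketch below shows the decision an after-tool-exec hook might make. ToolResult, compressFunc, afterToolExec, and the three stand-in compressors are hypothetical names for this example, not the forge-core/compress or ctxzip API, and the threshold is compared against byte length for simplicity.

```go
package main

import (
	"errors"
	"fmt"
)

// ToolResult is a hypothetical stand-in for the runtime's tool result type.
type ToolResult struct {
	Output  string
	IsError bool
}

// compressFunc is a hypothetical stand-in for the ctxzip call. It returns the
// compressed text and an estimate of the tokens saved.
type compressFunc func(in string) (out string, savedTokens int, err error)

// fakeCompress pretends to replace the input with a marker and save 100 tokens.
func fakeCompress(in string) (string, int, error) {
	return "<<ctxzip:0123456789ab>>", 100, nil
}

// failingCompress simulates a store or compressor failure.
func failingCompress(in string) (string, int, error) {
	return "", 0, errors.New("store unavailable")
}

// inflatingCompress simulates output that did not get smaller.
func inflatingCompress(in string) (string, int, error) {
	return in + in, 0, nil
}

// afterToolExec leaves error results and small outputs verbatim, and falls
// back to the original on any compression error or non-positive saving.
func afterToolExec(res ToolResult, minChars int, compress compressFunc) ToolResult {
	if res.IsError || len(res.Output) < minChars {
		return res
	}
	out, saved, err := compress(res.Output)
	if err != nil || saved <= 0 {
		return res // fail-open: keep the uncompressed original
	}
	res.Output = out
	return res
}

func main() {
	big := ToolResult{Output: "line 1\nline 2\nline 3"}
	fmt.Println(afterToolExec(big, 8, fakeCompress).Output)
	fmt.Println(afterToolExec(big, 8, failingCompress).Output == big.Output)
}
```

Because the hook runs once when the output is produced, whatever it returns is what stays in Memory, which is the property that keeps the prefix byte-stable.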

A runtime-owned system directive is appended when compression is on, so every skill's agent knows what markers are and when to expand — skill authors need zero awareness.

Provider prompt-cache hints (ClientConfig.PromptCaching)

Gated by compression.cache_hints, which defaults to the value of compression.enabled:

  • anthropic: cache_control: ephemeral breakpoints on the last tool definition + system block (block-form system only when on — wire format byte-identical to today when off, test-asserted). Also applies on the aws_sigv4 Bedrock-passthrough path.
  • openai/gemini: stable prompt_cache_key derived from (model, system, tool names).
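
One way a stable key could be derived from (model, system, tool names) is sketched below. This is an assumption about the shape of the derivation, not the PR's code: promptCacheKey is a hypothetical name, and sorting the tool names, using SHA-256, and truncating to 16 hex characters are choices made only for this example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// promptCacheKey hashes the model, the system prompt, and the sorted tool
// names, so the key is identical on every turn with the same configuration.
func promptCacheKey(model, system string, toolNames []string) string {
	names := append([]string(nil), toolNames...) // copy: do not reorder the caller's slice
	sort.Strings(names)
	h := sha256.New()
	// A NUL byte after each part keeps ("ab", "c") distinct from ("a", "bc").
	for _, part := range []string{model, system, strings.Join(names, "\x00")} {
		h.Write([]byte(part))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))[:16]
}

func main() {
	a := promptCacheKey("gpt-4o", "You are an agent.", []string{"grep_search", "context_expand"})
	b := promptCacheKey("gpt-4o", "You are an agent.", []string{"context_expand", "grep_search"})
	fmt.Println(a == b) // tool registration order does not change the key
}
```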

Observability

  • context_compressed / context_expanded audit events (via EmitFromContext — correlation_id/task_id/seq/signing like every other event) carrying per-event figures plus running totals.
  • invocation_complete gains compression_saved_tokens_total, compression_count, expansion_count — per-invocation, keyed by correlation ID so concurrent tasks never cross-contaminate.

Config & CLI surfaces

compression:
  enabled: true
  keep_patterns: [CrashLoopBackOff, PAYMENT_DECLINED]  # domain never-drop vocabulary
  # store_path (.forge/ctxzip.db) / ttl (30m) / min_tool_output_chars (2048) / cache_hints

  • forge run --compression[=false] — tri-state override (absent = yaml/env decide)
  • forge serve --compression[=false] — forwarded to the daemon
  • forge init --compression + a new Context Compression TUI wizard step
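
The tri-state override can be pictured with a small resolver. This is a sketch, not the forge CLI code: resolveCompression is a hypothetical name, and the assumed precedence (explicit flag, then FORGE_COMPRESSION, then forge.yaml) is inferred from the description above.

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveCompression returns the effective setting.
// flag is nil when --compression was not passed; env is "" when unset.
func resolveCompression(flag *bool, env string, yamlEnabled bool) bool {
	if flag != nil {
		return *flag // explicit flag wins in either direction
	}
	if v, err := strconv.ParseBool(env); err == nil {
		return v // a parseable env value overrides yaml (assumed precedence)
	}
	return yamlEnabled
}

func main() {
	off := false
	// --compression=false overrides both the env var and forge.yaml.
	fmt.Println(resolveCompression(&off, "true", true)) // prints false
	// Flag absent: env and yaml decide.
	fmt.Println(resolveCompression(nil, "", true)) // prints true
}
```

Using a pointer for the flag is what makes the third state ("not passed") representable; a plain bool could not distinguish it from false.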

Live-tested

Hardened against a real gpt-4o agent over a 150-pod fixture (several commits exist because live testing found the failure modes — expand/compress tail-chase, hash transcription, store-dir creation):

  • grep output compressed 1,397 → 51 tokens (96%) with the one CrashLoopBackOff pod kept verbatim
  • model called context_expand unprompted, hash transcribed correctly, got intact lines back, answered with the exact error (OOMKilled 512Mi)
  • surgical sessions correctly report compression_count: 0 — compression is insurance against bulk, not a tax on every call

Notes for reviewers

  • Fail-open everywhere: store init failure, compression error, or inflation all fall back to uncompressed originals.
  • Token figures in audit fields are tokenizer estimates (documented); billed truth stays in llm_call.input_tokens.
  • Known follow-ups (intentionally out of scope): compression-aware tool limits (grep's internal 50-line default truncates upstream of the reversible layer); grep_search returning "(no matches found)" for a nonexistent file; tools: in forge.yaml not registering builtins (banner is cosmetic).

🤖 Generated with Claude Code

initializ-mk and others added 9 commits July 3, 2026 16:34
…ompt-cache hints

Wires github.com/initializ/ctxzip into the agent loop as an opt-in feature
(compression.enabled in forge.yaml, or FORGE_COMPRESSION=true). Bulky tool
outputs and conversation content are compressed before reaching the LLM;
everything dropped is stored in a durable local bbolt store (.forge/ctxzip.db)
behind a <<ctxzip:HASH>> marker, retrievable via the new context_expand tool —
lossy on the wire, lossless end-to-end.

New package forge-core/compress:
- AfterToolExecHook — compresses tool output once at production time, before
  it enters Memory, so historic bytes never change and provider prompt caches
  keep hitting. Registered after guardrail/redaction hooks; error results and
  small outputs are left verbatim.
- WrapClient — llm.Client decorator compressing the live zone of every
  outbound request (frozen prefix + recent turns forwarded byte-identical).
  Deterministic across turns: the relevance query is pinned to the first user
  message, never the latest turn.
- ExpandTool — context_expand builtin retrieving originals by marker hash;
  registered only when compression is on (memory_get pattern).

Provider prompt-cache hints (ClientConfig.PromptCaching, gated by
compression.cache_hints, defaulting to compression.enabled):
- anthropic: cache_control ephemeral breakpoints on the last tool definition
  and the system block (block-form system only when caching — wire format is
  byte-identical to the previous contract when off). Also applies on the
  aws_sigv4 path, which speaks the same Messages wire format.
- openai: stable prompt_cache_key derived from (model, system, tool names)
  for cache-shard pinning; prefix caching itself is automatic.

Config: CompressionConfig (enabled / store_path / ttl / min_tool_output_chars
/ cache_hints) following the MemoryConfig.LongTerm opt-in pattern, wired in
the runner beside initLongTermMemory. Fail-open: any init error runs the
agent uncompressed.

Tests: hook compression + context_expand round-trip, error/small-output
verbatim guarantees, live-zone vs frozen-prefix boundaries, cross-turn
determinism (the cache-safety property), marker-hash normalization,
anthropic/openai wire-format assertions on and off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…silience, store dir

Fixes from live agent testing (gpt-4o against a 150-pod fixture):

- Never recompress context_expand output, at both seams (hook skips the
  tool by name; client wrapper passes ctxzip SkipNames). Without this the
  loop chased its own tail: the model expanded a marker and the hook
  crushed the expansion straight back into a marker.
- Hash transcription resilience: models truncate or mangle marker hashes
  when copying them into tool calls. The Runtime now remembers emitted
  marker hashes; the expand tool resolves a unique prefix (≥6 chars) on
  exact-miss, and normalizeHash strips a glued ":count" suffix.
- Create the store's parent directory (bbolt creates the file, not the
  dir) — on a fresh project .forge/ doesn't exist and compression failed
  open with "no such file or directory".
- Bump ctxzip a8b7923→94668f4: line-mode text compression (grep/log
  layout preserved byte-faithfully through the CCR round trip), stop-term
  filtering in line dedup, 12-hex marker hashes.
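
The hash-transcription repair described in this list can be sketched as follows. This is a minimal illustration, not the code in this commit: the real normalizeHash may differ, resolve and minPrefixLen are hypothetical names, and the marker layout after the hash is assumed to be separated by a space or a colon.

```go
package main

import (
	"fmt"
	"strings"
)

const minPrefixLen = 6 // shortest prefix accepted for repair

// normalizeHash reduces whatever the model pasted to a bare lowercase hash.
// It accepts a whole marker or a hash with a glued ":count" suffix.
func normalizeHash(raw string) string {
	s := strings.TrimSpace(raw)
	s = strings.TrimPrefix(s, "<<")
	s = strings.TrimSuffix(s, ">>")
	s = strings.TrimPrefix(s, "ctxzip:")
	if i := strings.IndexAny(s, ": \t"); i >= 0 {
		s = s[:i] // drop ":count" or trailing marker metadata
	}
	return strings.ToLower(s)
}

// resolve returns the full hash for an exact match, or for a unique prefix
// of at least minPrefixLen characters among recently emitted hashes.
func resolve(input string, recent []string) (string, bool) {
	h := normalizeHash(input)
	for _, r := range recent {
		if r == h {
			return r, true
		}
	}
	if len(h) < minPrefixLen {
		return "", false
	}
	match := ""
	for _, r := range recent {
		if strings.HasPrefix(r, h) {
			if match != "" && match != r {
				return "", false // ambiguous prefix: refuse to guess
			}
			match = r
		}
	}
	return match, match != ""
}

func main() {
	recent := []string{"9f3a1c0b7e2d", "abcdef012345"}
	fmt.Println(resolve("<<ctxzip:abcdef01 3 lines>>", recent))
}
```

Refusing ambiguous or very short prefixes keeps the repair from returning the wrong original when two recent hashes share leading characters.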

Live result on "status breakdown + unhealthy pod" over 150 pods:
tool output crushed 1397→51 tokens (96%), model called context_expand
with a correctly-transcribed hash, got intact lines back, and answered
with the exact error (CrashLoopBackOff / OOMKilled 512Mi) — total
session 10,982 input tokens vs 19,799 in the pre-fix run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
compression.keep_patterns lets the agent builder declare a domain
vocabulary of case-insensitive substrings compression must never drop:

  compression:
    enabled: true
    keep_patterns: [CrashLoopBackOff, ImagePullBackOff, OOMKilled]

Threaded through compress.Config into both seams (AfterToolExec hook and
the llm.Client wrapper) as ctxzip Options.MustKeep. Union semantics with
ctxzip's built-in error floor — patterns only ever add protection.
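
A sketch of the union semantics, under stated assumptions: builtinFloor is a made-up stand-in for ctxzip's built-in error vocabulary, and mustKeep is a hypothetical helper rather than the Options.MustKeep implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// builtinFloor is a hypothetical stand-in for ctxzip's built-in error terms.
var builtinFloor = []string{"error", "panic", "oomkilled"}

// mustKeep reports whether a line is protected: it matches the built-in floor
// or any configured keep pattern, compared as case-insensitive substrings.
func mustKeep(line string, keepPatterns []string) bool {
	lower := strings.ToLower(line)
	for _, set := range [][]string{builtinFloor, keepPatterns} {
		for _, p := range set {
			if p != "" && strings.Contains(lower, strings.ToLower(p)) {
				return true
			}
		}
	}
	return false
}

func main() {
	patterns := []string{"CrashLoopBackOff", "ImagePullBackOff"}
	fmt.Println(mustKeep("pod-7   crashloopbackoff   12 restarts", patterns))
	fmt.Println(mustKeep("pod-1   Running            0 restarts", patterns))
}
```

Because the configured patterns are only ever checked in addition to the floor, adding a pattern can protect more lines but can never unprotect one.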

Bumps ctxzip to 304962f, whose defaults also grow k8s state words
(crash/backoff/oomkilled/evicted/unhealthy/degraded) after live testing
showed "CrashLoopBackOff" matched nothing in the original error list.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two new audit events make token savings attributable instead of living
only in debug logs:

- context_compressed — fired from both seams (tool_output hook, request
  wrapper) with seam, tool, tokens_before/after, saved_tokens, plus
  running totals (total_saved_tokens / total_compressions /
  total_expansions) so any single event shows the cumulative picture,
  not just the per-call delta.
- context_expanded — fired on every context_expand retrieval with hash,
  hit, bytes and the same running totals; expansions are the cost side
  auditors net against savings.
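
For orientation, a context_compressed payload with running totals might look like the sketch below. The field names come from the list above; the struct, the JSON layout, the seam value, and the helper names are assumptions for illustration, not the AuditLogger schema. The figures are the live-test numbers (1397 → 51 tokens).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// compressedEvent is a hypothetical shape for the context_compressed event.
type compressedEvent struct {
	Event             string `json:"event"`
	Seam              string `json:"seam"`
	Tool              string `json:"tool,omitempty"`
	TokensBefore      int    `json:"tokens_before"`
	TokensAfter       int    `json:"tokens_after"`
	SavedTokens       int    `json:"saved_tokens"`
	TotalSavedTokens  int    `json:"total_saved_tokens"`
	TotalCompressions int    `json:"total_compressions"`
	TotalExpansions   int    `json:"total_expansions"`
}

// totals holds the running figures carried on every event.
type totals struct{ saved, compressions, expansions int }

// record updates the running totals and returns the event to emit.
func (t *totals) record(seam, tool string, before, after int) compressedEvent {
	saved := before - after
	t.saved += saved
	t.compressions++
	return compressedEvent{
		Event: "context_compressed", Seam: seam, Tool: tool,
		TokensBefore: before, TokensAfter: after, SavedTokens: saved,
		TotalSavedTokens: t.saved, TotalCompressions: t.compressions,
		TotalExpansions: t.expansions,
	}
}

// savedPercent returns the integer percentage of tokens removed.
func savedPercent(before, after int) int {
	if before <= 0 {
		return 0
	}
	return (before - after) * 100 / before
}

func main() {
	var run totals
	ev := run.record("tool_output", "grep_search", 1397, 51)
	b, _ := json.Marshal(ev)
	fmt.Println(string(b))
	fmt.Printf("saved %d%%\n", savedPercent(ev.TokensBefore, ev.TokensAfter)) // prints saved 96%
}
```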

Events flow through AuditLogger.EmitFromContext, so correlation_id /
task_id / seq are stamped like every other audit event and SIEM
consumers can join savings to invocations. compress stays decoupled via
a Config.Audit callback; nil disables emission. Runtime.Totals() exposes
the process-lifetime snapshot. Token figures are tokenizer estimates
(directionally accurate), not provider-billed counts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
invocation_complete now carries compression_saved_tokens_total,
compression_count, and (when nonzero) expansion_count alongside the
existing input/output token totals, so per-invocation cost rollups show
what compression saved without joining context_compressed events.

Savings are accumulated per correlation ID inside compress.Runtime and
popped once at the response boundary (TakeInvocationTotals), so
concurrent invocations never cross-contaminate — diffing the process-
lifetime totals would have. Fields are present whenever compression is
enabled; zeros mean "on, but nothing was worth compressing". Token
figures remain tokenizer estimates.
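
The accumulate-then-pop pattern can be sketched like this. It is a simplified illustration, not compress.Runtime: the tracker type and its method names are hypothetical, and the correlation ID is passed as a plain string rather than read from a context.

```go
package main

import (
	"fmt"
	"sync"
)

// invocationTotals is the per-invocation bucket.
type invocationTotals struct {
	SavedTokens  int
	Compressions int
	Expansions   int
}

// tracker keeps one bucket per correlation ID.
type tracker struct {
	mu            sync.Mutex
	perInvocation map[string]*invocationTotals
}

func newTracker() *tracker {
	return &tracker{perInvocation: map[string]*invocationTotals{}}
}

// recordCompression adds one compression to the invocation's bucket.
func (t *tracker) recordCompression(correlationID string, saved int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	b := t.perInvocation[correlationID]
	if b == nil {
		b = &invocationTotals{}
		t.perInvocation[correlationID] = b
	}
	b.SavedTokens += saved
	b.Compressions++
}

// takeInvocationTotals returns the bucket and removes it, so a second call
// for the same ID yields zeros instead of double-counting.
func (t *tracker) takeInvocationTotals(correlationID string) invocationTotals {
	t.mu.Lock()
	defer t.mu.Unlock()
	b := t.perInvocation[correlationID]
	if b == nil {
		return invocationTotals{}
	}
	delete(t.perInvocation, correlationID)
	return *b
}

func main() {
	tr := newTracker()
	tr.recordCompression("task-a", 1346)
	tr.recordCompression("task-b", 10)
	fmt.Println(tr.takeInvocationTotals("task-a")) // only task-a's figures
	fmt.Println(tr.takeInvocationTotals("task-a")) // zeros: already popped
}
```

Diffing a single process-wide counter before and after an invocation would attribute task-b's savings to task-a whenever the two overlap; separate buckets avoid that.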

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ker awareness

Marker awareness was living in the test agent's SKILL.md, which does not
scale: every skill author would have to document compression. Compression
is a runtime capability, so the runtime now briefs the model itself —
when compression is enabled, compress.SystemDirective is appended to the
system prompt (same pattern as codeAgentDirective), explaining what
<<ctxzip:...>> markers are, that the visible remainder keeps errors and
representative content, and when/how to call context_expand.

The directive is a constant, keeping the system prompt byte-stable across
turns for provider prompt caches. A guard test pins it to the real tool
name and marker prefix so a rename cannot silently orphan the text.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replaces the feat-branch pseudo-version with the immutable release tag —
the forge PR now depends on a stable, reviewable ctxzip version.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Compression becomes reachable from every entry point, not just
forge.yaml/env:

- `forge run --compression` / `--compression=false` — tri-state: absent
  leaves yaml/env resolution untouched; explicit values override both by
  setting FORGE_COMPRESSION (same pattern as --model → MODEL_NAME).
- `forge serve --compression[=false]` — forwarded to the forked daemon
  `forge run`, only when explicitly passed.
- `forge init --compression` — non-interactive scaffolding writes a
  commented `compression.enabled: true` block into forge.yaml.
- init TUI wizard — new "Context Compression" step (SingleSelect,
  Enabled/Disabled with explanatory descriptions) between Skills and
  Auth; selection flows through WizardContext.Compression into the same
  forge.yaml block.

Scaffold test covers both directions: --compression writes the block,
default omits it (off by default).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
New docs/core-concepts/context-compression.md — the feature's home:
problem statement, pipeline diagram, keep-floor layers, configuration
and precedence, provider cache hints, observability, failure posture.

Per the sync-docs mapping, updates ripple to:
- forge-yaml-schema.md — compression block in the full schema plus a
  dedicated reference section with field table
- cli-reference.md — --compression rows in the init/run/serve flag
  tables; wizard step order documented under forge init
- runtime-engine.md — Context Compression section describing the three
  loop seams (hook after guardrails, client wrapper below the fallback
  chain, context_expand tool) and the cache-stability posture
- audit-logging.md — context_compressed / context_expanded event rows;
  invocation_complete row gains the per-invocation compression fields
- tools-and-builtins.md — context_expand in the builtin table plus a
  Context Expansion Tool section (hash tolerance, miss guidance)
- environment-variables.md — FORGE_COMPRESSION
- README.md — Context Compression row in the documentation table
- .claude/skills/forge.md — swept sections 8 (memory), 13 (CLI), 14
  (schema), 17 (audit reference), 19 (docs map); ToC unchanged (no new
  numbered sections)

Link check: 0 broken links across README + 55 docs files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@initializ-mk

Contributor Author

Requesting changes (posting as a comment — GitHub won't let the author submit a formal Request-Changes review on their own PR). Blocking item: the streaming perInvocation leak below.

Reviewed for correctness and traced the invocation-complete paths + concurrency. Strong PR — fail-open everywhere, careful locking (accumulate under lock → snapshot → emit audit outside lock), the expand-tool-output-never-recompressed tail-chase fix, and the "wire format byte-identical when off" prompt-cache path is test-asserted. Builds clean; compress + llm/providers tests pass. One real bug on the primary path.

🔴 perInvocation leaks (and compression metrics are missing) on the streaming paths

TakeInvocationTotals — which both pops perInvocation[correlationID] and adds the compression_* fields — is called in exactly one place: executeTask (runner.go:1552). But there are three invocation_complete emission sites, and the other two are the sendSubscribe streaming handlers that run their own ExecuteStream loop instead of executeTask:

  • tasks/sendSubscribe (JSON-RPC SSE) → emits at 1296, no Take.
  • REST POST /tasks/sendSubscribe → emits at 1821, no Take.

Both streaming handlers set a correlation ID on ctx (WithCorrelationID) and use the shared compressed client + hooks, so a streaming invocation that compresses does call recordCompression → bumpInvocation, adding a perInvocation[cid] entry. Since only executeTask pops, and perInvocation has no TTL/sweep, every streaming invocation that compresses leaks its bucket permanently. sendSubscribe is a primary A2A mode, so with compression on this grows unbounded over process lifetime.

Same root cause, second gap: the PR advertises invocation_complete gaining compression_saved_tokens_total / compression_count / expansion_count, but those are only added in executeTask — so streaming invocation_complete events don't carry them. The feature is under-delivered on the common path.

Fix (one change closes both): factor the "TakeInvocationTotals(ctx) + populate the three compression fields" block out of executeTask into a helper, and call it at all three EmitInvocationComplete sites (1296, 1565, 1821). Optionally also give perInvocation a size cap / periodic sweep so a missed Take can't leak — but the shared helper is the real fix and also ships the metrics.

🟡 Secondary

  • recent marker map is unbounded — every emitted marker hash is remembered (rememberMarkers) and never evicted. Small strings, but grows for the process lifetime; a cap / LRU would bound it (you only need recent markers for imperfect-hash resolution).
  • bbolt is a single-writer file lock. .forge/ctxzip.db can be opened by only one process; two agents on a shared volume (K8s multi-replica) or forge run + forge serve on the same dir will collide. Fail-open only helps if ccr.NewBoltStore errors rather than blocks on the flock — worth confirming it sets an Open timeout, and documenting the single-writer constraint in the compression doc ("one store per process/replica").
  • Inflation guard is == 0, not <= 0 (hook.go: res.SavedTokens() == 0). If ctxzip can ever return negative savings (inflated output), that guard misses it and you'd record negative savings + apply the inflated bytes. Depends on ctxzip's contract; low risk, but <= 0 is safer.

Note (in the PR's favor)

go.mod is correct — ctxzip is a direct require and bbolt is correctly // indirect (transitive via ctxzip; compress imports ctxzip/ccr, not bbolt). No go mod tidy needed.

Verdict

Request changes for the streaming perInvocation leak + missing metrics: a genuine unbounded leak on the main path and a gap between advertised and actual streaming behavior. The shared-helper refactor is small and fixes both. The bbolt single-writer note and the recent-marker cap are worth doing but non-blocking.

…s; bound runtime maps

Addresses the PR #241 review (blocking + secondary items):

- 🔴 Streaming perInvocation leak + missing metrics: TakeInvocationTotals
  was only called in executeTask, but invocation_complete is emitted from
  THREE sites — the tasks/sendSubscribe JSON-RPC SSE and REST streaming
  handlers ran their own emission without popping the bucket, so every
  streaming invocation that compressed leaked its correlation bucket
  permanently AND its invocation_complete lacked the compression fields.
  The pop+populate block is now a shared helper (appendCompressionFields,
  documented as required at every emission site) called from all three.
  Pinned by TestAppendCompressionFields_PopsAndPopulates: fields
  populated, pop is one-shot (no double-count), nil-safe.

- 🟡 Leak backstops: perInvocation is bounded to 1024 buckets with
  oldest-touched eviction (a future missed pop can no longer grow
  unbounded); the recent-marker prefix-resolution set is bounded to 2048
  with oldest-emitted eviction. Both pinned by tests.

- 🟡 Inflation guards tightened from == 0 to <= 0 at both seams — ctxzip
  clamps SavedTokens at zero today, but the guard must not silently apply
  inflated output if that contract ever changes.

- 🟡 bbolt single-writer constraint documented in context-compression.md:
  the store holds an exclusive flock with a 5s open timeout (verified in
  ctxzip's NewBoltStore), so a second process fails open and runs
  uncompressed; each replica should get its own store_path.
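
The leak backstops above amount to a capped set with oldest-first eviction. A minimal sketch for the recent-marker case follows; recentSet and its methods are hypothetical names, the example uses a cap of 2 instead of 2048, and the type is not safe for concurrent use without an external lock.

```go
package main

import "fmt"

// recentSet remembers at most limit hashes, evicting the oldest-emitted one.
type recentSet struct {
	limit int
	order []string // emission order, oldest first
	seen  map[string]struct{}
}

func newRecentSet(limit int) *recentSet {
	if limit < 1 {
		limit = 1
	}
	return &recentSet{limit: limit, seen: map[string]struct{}{}}
}

// remember records a hash; re-emitting a known hash does not refresh its age.
func (s *recentSet) remember(hash string) {
	if _, ok := s.seen[hash]; ok {
		return
	}
	if len(s.order) >= s.limit {
		oldest := s.order[0]
		s.order = s.order[1:]
		delete(s.seen, oldest)
	}
	s.order = append(s.order, hash)
	s.seen[hash] = struct{}{}
}

// has reports whether the hash is still remembered.
func (s *recentSet) has(hash string) bool {
	_, ok := s.seen[hash]
	return ok
}

// size returns the number of remembered hashes.
func (s *recentSet) size() int { return len(s.order) }

func main() {
	s := newRecentSet(2)
	for _, h := range []string{"aaaaaaaaaaaa", "bbbbbbbbbbbb", "cccccccccccc"} {
		s.remember(h)
	}
	fmt.Println(s.has("aaaaaaaaaaaa"), s.has("cccccccccccc"), s.size())
}
```

Evicting a hash from this set only removes it from prefix repair; as the review response notes, exact hashes still resolve through the store.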

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@initializ-mk

Contributor Author

Resolved in deb06fd — all four findings addressed:

🔴 Streaming perInvocation leak + missing metrics — fixed with the suggested shared helper. The pop+populate block is now appendCompressionFields(ctx, fields), called at all three EmitInvocationComplete sites (executeTask, JSON-RPC SSE sendSubscribe, REST sendSubscribe). The helper's doc comment names all three sites explicitly so a future fourth emission path knows the contract. Pinned by TestAppendCompressionFields_PopsAndPopulates: fields populated on first call, one-shot pop (second call returns zeros — no double-count if two sites ever fire), nil-safe with compression disabled.

🟡 Leak backstops (took the "optionally also" suggestion): perInvocation is now bounded to 1024 buckets with oldest-touched eviction — a future missed pop degrades to bounded memory instead of an unbounded leak. recent markers bounded to 2048 with oldest-emitted eviction (only recent markers matter for transcription repair; exact hashes still resolve via the store). Both pinned by tests (TestPerInvocationBuckets_Bounded, TestRecentMarkers_Bounded).

🟡 Inflation guards: both seams tightened from == 0 to <= 0. Confirmed ctxzip's SavedTokens() clamps at zero today, so this is defense-in-depth against a future contract change — noted as such in the code comment.

🟡 bbolt single-writer: confirmed ccr.NewBoltStore sets bolt.Options{Timeout: 5 * time.Second} — a second process errors after 5s rather than blocking, so fail-open engages (warning logged, agent runs uncompressed). Documented in context-compression.md § Failure posture: one store per process/replica, give each replica its own store_path.

Gate: build ✅, vet ✅, gofmt ✅, forge-core 32 packages ✅, forge-cli/runtime full suite ✅.

And thanks for the go.mod note — glad the direct/indirect split checked out.

🤖 Generated with Claude Code
