Reversible context compression (ctxzip): 60-95% fewer tokens on bulky tool outputs, losslessly #241
initializ-mk wants to merge 10 commits into
…ompt-cache hints

Wires github.com/initializ/ctxzip into the agent loop as an opt-in feature (compression.enabled in forge.yaml, or FORGE_COMPRESSION=true). Bulky tool outputs and conversation content are compressed before reaching the LLM; everything dropped is stored in a durable local bbolt store (.forge/ctxzip.db) behind a <<ctxzip:HASH>> marker, retrievable via the new context_expand tool: lossy on the wire, lossless end-to-end.

New package forge-core/compress:
- AfterToolExecHook: compresses tool output once at production time, before it enters Memory, so historic bytes never change and provider prompt caches keep hitting. Registered after guardrail/redaction hooks; error results and small outputs are left verbatim.
- WrapClient: llm.Client decorator compressing the live zone of every outbound request (frozen prefix + recent turns forwarded byte-identical). Deterministic across turns: the relevance query is pinned to the first user message, never the latest turn.
- ExpandTool: context_expand builtin retrieving originals by marker hash; registered only when compression is on (memory_get pattern).

Provider prompt-cache hints (ClientConfig.PromptCaching, gated by compression.cache_hints, defaulting to compression.enabled):
- anthropic: cache_control ephemeral breakpoints on the last tool definition and the system block (block-form system only when caching; wire format is byte-identical to the previous contract when off). Also applies on the aws_sigv4 path, which speaks the same Messages wire format.
- openai: stable prompt_cache_key derived from (model, system, tool names) for cache-shard pinning; prefix caching itself is automatic.

Config: CompressionConfig (enabled / store_path / ttl / min_tool_output_chars / cache_hints) following the MemoryConfig.LongTerm opt-in pattern, wired in the runner beside initLongTermMemory. Fail-open: any init error runs the agent uncompressed.

Tests: hook compression + context_expand round-trip, error/small-output verbatim guarantees, live-zone vs frozen-prefix boundaries, cross-turn determinism (the cache-safety property), marker-hash normalization, anthropic/openai wire-format assertions on and off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
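The openai hint above derives a stable prompt_cache_key from (model, system, tool names). A minimal sketch of one way to build such a key follows; promptCacheKey is a hypothetical helper, and the sorting of tool names, the SHA-256 digest, and the 16-character truncation are assumptions rather than details taken from this PR.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// promptCacheKey is a hypothetical helper, not the forge implementation.
// It hashes the parts of a request that stay fixed across turns, so every
// turn of a conversation yields the same key.
func promptCacheKey(model, system string, toolNames []string) string {
	// Copy before sorting so the caller's slice is left untouched.
	names := append([]string(nil), toolNames...)
	// Assumption: sort so tool registration order cannot change the key.
	sort.Strings(names)

	// NUL separators keep ("ab", "c") distinct from ("a", "bc").
	joined := strings.Join([]string{model, system, strings.Join(names, "\x00")}, "\x00")
	sum := sha256.Sum256([]byte(joined))
	// Assumption: 16 hex characters are enough for a shard key.
	return hex.EncodeToString(sum[:])[:16]
}

func main() {
	a := promptCacheKey("gpt-4o", "You are a cluster assistant.", []string{"grep_search", "context_expand"})
	b := promptCacheKey("gpt-4o", "You are a cluster assistant.", []string{"context_expand", "grep_search"})
	fmt.Println(a == b) // prints "true": same model, system, and tool set
}
```

Because the key ignores message content, it stays constant for the whole conversation, which is the property cache-shard pinning needs.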
…silience, store dir

Fixes from live agent testing (gpt-4o against a 150-pod fixture):
- Never recompress context_expand output, at both seams (hook skips the tool by name; client wrapper passes ctxzip SkipNames). Without this the loop chased its own tail: the model expanded a marker and the hook crushed the expansion straight back into a marker.
- Hash transcription resilience: models truncate or mangle marker hashes when copying them into tool calls. The Runtime now remembers emitted marker hashes; the expand tool resolves a unique prefix (≥6 chars) on exact-miss, and normalizeHash strips a glued ":count" suffix.
- Create the store's parent directory (bbolt creates the file, not the dir): on a fresh project .forge/ doesn't exist and compression failed open with "no such file or directory".
- Bump ctxzip to a8b7923→94668f4: line-mode text compression (grep/log layout preserved byte-faithfully through the CCR round trip), stop-term filtering in line dedup, 12-hex marker hashes.

Live result on "status breakdown + unhealthy pod" over 150 pods: tool output crushed 1397→51 tokens (96%), model called context_expand with a correctly-transcribed hash, got intact lines back, and answered with the exact error (CrashLoopBackOff / OOMKilled 512Mi). Total session: 10,982 input tokens vs 19,799 in the pre-fix run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
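The hash-resilience rule described above (exact match first, then a unique prefix of at least 6 characters, with a glued ":count" suffix stripped) can be sketched roughly as follows. The function bodies, the resolve helper, and the text that follows the hash inside a marker are assumptions for illustration; only the name normalizeHash and the 6-character minimum come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// minPrefix is the shortest truncated hash accepted, per the commit message.
const minPrefix = 6

// normalizeHash accepts a whole marker, a bare hash, or a hash with a glued
// ":count" suffix, and returns the lowercase hash alone. Hypothetical body.
func normalizeHash(s string) string {
	s = strings.TrimSpace(s)
	s = strings.TrimPrefix(s, "<<ctxzip:")
	s = strings.TrimSuffix(s, ">>")
	// Cut at the first ':' or space: anything after the hash is metadata.
	if i := strings.IndexAny(s, ": "); i >= 0 {
		s = s[:i]
	}
	return strings.ToLower(s)
}

// resolve tries an exact match against recently emitted hashes, then falls
// back to a prefix match that must be unambiguous.
func resolve(input string, recent []string) (string, bool) {
	h := normalizeHash(input)
	for _, r := range recent {
		if r == h {
			return r, true
		}
	}
	if len(h) < minPrefix {
		return "", false
	}
	match := ""
	for _, r := range recent {
		if strings.HasPrefix(r, h) {
			if match != "" {
				return "", false // two candidates: refuse to guess
			}
			match = r
		}
	}
	return match, match != ""
}

func main() {
	recent := []string{"a1b2c3d4e5f6", "a1b2c3ffffff", "0123456789ab"}
	for _, in := range []string{
		"<<ctxzip:0123456789ab 148 lines>>", // whole marker pasted back
		"a1b2c3d4e5f6:148",                  // hash with a glued count
		"a1b2c3d4",                          // truncated but unique
		"a1b2c3",                            // truncated and ambiguous
	} {
		h, ok := resolve(in, recent)
		fmt.Printf("%-36q -> %q %v\n", in, h, ok)
	}
}
```

Refusing an ambiguous prefix rather than picking the most recent candidate keeps a mangled hash from silently returning the wrong original.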
compression.keep_patterns lets the agent builder declare a domain
vocabulary of case-insensitive substrings compression must never drop:
  compression:
    enabled: true
    keep_patterns: [CrashLoopBackOff, ImagePullBackOff, OOMKilled]
Threaded through compress.Config into both seams (AfterToolExec hook and
the llm.Client wrapper) as ctxzip Options.MustKeep. Union semantics with
ctxzip's built-in error floor — patterns only ever add protection.
Bumps ctxzip to 304962f, whose defaults also grow k8s state words
(crash/backoff/oomkilled/evicted/unhealthy/degraded) after live testing
showed "CrashLoopBackOff" matched nothing in the original error list.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
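As an illustration of the keep_patterns semantics in this commit (case-insensitive substrings, unioned with a built-in floor so patterns only add protection), here is a minimal sketch. builtinFloor, keepFloor, and mustKeep are hypothetical names, and the built-in list shown is a stand-in, not ctxzip's actual defaults.

```go
package main

import (
	"fmt"
	"strings"
)

// builtinFloor stands in for ctxzip's built-in error vocabulary; the real
// list is not reproduced here.
var builtinFloor = []string{"error", "failed", "oomkilled"}

// keepFloor returns the union of the built-in floor and the user's
// keep_patterns, lowercased once so matching is case-insensitive.
func keepFloor(userPatterns []string) []string {
	all := append(append([]string(nil), builtinFloor...), userPatterns...)
	floor := make([]string, 0, len(all))
	for _, p := range all {
		if p = strings.ToLower(strings.TrimSpace(p)); p != "" {
			floor = append(floor, p)
		}
	}
	return floor
}

// mustKeep reports whether line contains any floor pattern as a substring.
func mustKeep(line string, floor []string) bool {
	lower := strings.ToLower(line)
	for _, p := range floor {
		if strings.Contains(lower, p) {
			return true
		}
	}
	return false
}

func main() {
	floor := keepFloor([]string{"CrashLoopBackOff", "ImagePullBackOff"})
	lines := []string{
		"api-7f9c   1/1  Running           0",
		"worker-2b  0/1  crashloopbackoff  14",
		"cache-55d  0/1  ImagePullBackOff  0",
	}
	for _, l := range lines {
		fmt.Printf("keep=%-5v %s\n", mustKeep(l, floor), l)
	}
}
```

Since user patterns are appended to the floor and never subtracted from it, an empty keep_patterns list leaves the built-in protection unchanged.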
Two new audit events make token savings attributable instead of living only in debug logs:
- context_compressed: fired from both seams (tool_output hook, request wrapper) with seam, tool, tokens_before/after, saved_tokens, plus running totals (total_saved_tokens / total_compressions / total_expansions) so any single event shows the cumulative picture, not just the per-call delta.
- context_expanded: fired on every context_expand retrieval with hash, hit, bytes and the same running totals; expansions are the cost side auditors net against savings.

Events flow through AuditLogger.EmitFromContext, so correlation_id / task_id / seq are stamped like every other audit event and SIEM consumers can join savings to invocations. compress stays decoupled via a Config.Audit callback; nil disables emission. Runtime.Totals() exposes the process-lifetime snapshot. Token figures are tokenizer estimates (directionally accurate), not provider-billed counts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
invocation_complete now carries compression_saved_tokens_total, compression_count, and (when nonzero) expansion_count alongside the existing input/output token totals, so per-invocation cost rollups show what compression saved without joining context_compressed events.

Savings are accumulated per correlation ID inside compress.Runtime and popped once at the response boundary (TakeInvocationTotals), so concurrent invocations never cross-contaminate (diffing the process-lifetime totals would have). Fields are present whenever compression is enabled; zeros mean "on, but nothing was worth compressing". Token figures remain tokenizer estimates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
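The accumulate-per-correlation-ID, pop-once pattern described here might look roughly like the sketch below. The tracker type and its record and take methods are hypothetical; take is only modeled on the TakeInvocationTotals name in the commit, whose real signature is not shown in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// totals is a hypothetical per-invocation accumulator.
type totals struct {
	SavedTokens  int
	Compressions int
}

// tracker keeps one bucket per correlation ID behind a single mutex.
type tracker struct {
	mu            sync.Mutex
	perInvocation map[string]*totals
}

func newTracker() *tracker {
	return &tracker{perInvocation: make(map[string]*totals)}
}

// record adds one compression to the bucket for correlationID.
func (t *tracker) record(correlationID string, saved int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	b := t.perInvocation[correlationID]
	if b == nil {
		b = &totals{}
		t.perInvocation[correlationID] = b
	}
	b.SavedTokens += saved
	b.Compressions++
}

// take returns the bucket and deletes it, so a second call yields zeros.
func (t *tracker) take(correlationID string) totals {
	t.mu.Lock()
	defer t.mu.Unlock()
	b := t.perInvocation[correlationID]
	delete(t.perInvocation, correlationID)
	if b == nil {
		return totals{}
	}
	return *b
}

func main() {
	tr := newTracker()
	jobs := []struct {
		id       string
		n, saved int
	}{{"inv-a", 3, 100}, {"inv-b", 2, 50}}

	var wg sync.WaitGroup
	for _, j := range jobs {
		wg.Add(1)
		go func(id string, n, saved int) {
			defer wg.Done()
			for i := 0; i < n; i++ {
				tr.record(id, saved)
			}
		}(j.id, j.n, j.saved)
	}
	wg.Wait()

	fmt.Println(tr.take("inv-a")) // {300 3}
	fmt.Println(tr.take("inv-b")) // {100 2}
	fmt.Println(tr.take("inv-a")) // {0 0}: already popped
}
```

Keying by correlation ID is what keeps the two concurrent invocations apart; subtracting process-wide totals before and after a call would mix their savings together.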
…ker awareness

Marker awareness was living in the test agent's SKILL.md, which does not scale: every skill author would have to document compression. Compression is a runtime capability, so the runtime now briefs the model itself. When compression is enabled, compress.SystemDirective is appended to the system prompt (same pattern as codeAgentDirective), explaining what <<ctxzip:...>> markers are, that the visible remainder keeps errors and representative content, and when/how to call context_expand.

The directive is a constant, keeping the system prompt byte-stable across turns for provider prompt caches. A guard test pins it to the real tool name and marker prefix so a rename cannot silently orphan the text.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replaces the feat-branch pseudo-version with the immutable release tag, so the forge PR now depends on a stable, reviewable ctxzip version.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Compression becomes reachable from every entry point, not just forge.yaml/env:
- `forge run --compression` / `--compression=false`: tri-state. Absent leaves yaml/env resolution untouched; explicit values override both by setting FORGE_COMPRESSION (same pattern as --model → MODEL_NAME).
- `forge serve --compression[=false]`: forwarded to the forked daemon `forge run`, only when explicitly passed.
- `forge init --compression`: non-interactive scaffolding writes a commented `compression.enabled: true` block into forge.yaml.
- init TUI wizard: new "Context Compression" step (SingleSelect, Enabled/Disabled with explanatory descriptions) between Skills and Auth; selection flows through WizardContext.Compression into the same forge.yaml block.

Scaffold test covers both directions: --compression writes the block, default omits it (off by default).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
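The tri-state resolution described for `forge run --compression` can be sketched as a small pure function. resolveCompression is a hypothetical name, a nil pointer models "flag not passed", and the ordering of the environment variable ahead of forge.yaml is an inference from "override both by setting FORGE_COMPRESSION", not something the commit states outright.

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveCompression decides whether compression is on.
//   - cliFlag: nil when --compression was not passed, otherwise its value.
//   - env: the raw FORGE_COMPRESSION value ("" when unset).
//   - yamlEnabled: compression.enabled from forge.yaml.
func resolveCompression(cliFlag *bool, env string, yamlEnabled bool) bool {
	if cliFlag != nil {
		return *cliFlag // an explicit flag wins in either direction
	}
	// Assumption: a parseable env value beats yaml; unset or garbage
	// values fall through.
	if v, err := strconv.ParseBool(env); err == nil {
		return v
	}
	return yamlEnabled
}

func main() {
	off := false
	fmt.Println(resolveCompression(nil, "", true))      // true: yaml decides
	fmt.Println(resolveCompression(nil, "true", false)) // true: env decides
	fmt.Println(resolveCompression(&off, "true", true)) // false: flag overrides both
}
```

A plain bool could not distinguish "not passed" from "passed as false", which is why the sketch uses a pointer for the flag.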
New docs/core-concepts/context-compression.md, the feature's home: problem statement, pipeline diagram, keep-floor layers, configuration and precedence, provider cache hints, observability, failure posture.

Per the sync-docs mapping, updates ripple to:
- forge-yaml-schema.md: compression block in the full schema plus a dedicated reference section with field table
- cli-reference.md: --compression rows in the init/run/serve flag tables; wizard step order documented under forge init
- runtime-engine.md: Context Compression section describing the three loop seams (hook after guardrails, client wrapper below the fallback chain, context_expand tool) and the cache-stability posture
- audit-logging.md: context_compressed / context_expanded event rows; invocation_complete row gains the per-invocation compression fields
- tools-and-builtins.md: context_expand in the builtin table plus a Context Expansion Tool section (hash tolerance, miss guidance)
- environment-variables.md: FORGE_COMPRESSION
- README.md: Context Compression row in the documentation table
- .claude/skills/forge.md: swept sections 8 (memory), 13 (CLI), 14 (schema), 17 (audit reference), 19 (docs map); ToC unchanged (no new numbered sections)

Link check: 0 broken links across README + 55 docs files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reviewed for correctness and traced the invocation-complete paths + concurrency. Strong PR — fail-open everywhere, careful locking (accumulate under lock → snapshot → emit audit outside lock), the expand-tool-output-never-recompressed tail-chase fix, and the "wire format byte-identical when off" prompt-cache path is test-asserted. Builds clean; 🔴
…s; bound runtime maps

Addresses the PR #241 review (blocking + secondary items):
- 🔴 Streaming perInvocation leak + missing metrics: TakeInvocationTotals was only called in executeTask, but invocation_complete is emitted from THREE sites. The tasks/sendSubscribe JSON-RPC SSE and REST streaming handlers ran their own emission without popping the bucket, so every streaming invocation that compressed leaked its correlation bucket permanently AND its invocation_complete lacked the compression fields. The pop+populate block is now a shared helper (appendCompressionFields, documented as required at every emission site) called from all three. Pinned by TestAppendCompressionFields_PopsAndPopulates: fields populated, pop is one-shot (no double-count), nil-safe.
- 🟡 Leak backstops: perInvocation is bounded to 1024 buckets with oldest-touched eviction (a future missed pop can no longer grow unbounded); the recent-marker prefix-resolution set is bounded to 2048 with oldest-emitted eviction. Both pinned by tests.
- 🟡 Inflation guards tightened from == 0 to <= 0 at both seams: ctxzip clamps SavedTokens at zero today, but the guard must not silently apply inflated output if that contract ever changes.
- 🟡 bbolt single-writer constraint documented in context-compression.md: the store holds an exclusive flock with a 5s open timeout (verified in ctxzip's NewBoltStore), so a second process fails open and runs uncompressed; each replica should get its own store_path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
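The leak backstop above (a capped bucket map with oldest-touched eviction) can be illustrated with a small sketch. boundedTotals and its methods are hypothetical, the logical clock and the linear eviction scan are assumptions chosen for brevity, and locking is omitted; a real version would sit behind the runtime's mutex.

```go
package main

import "fmt"

// bucket holds one invocation's savings plus a last-touched stamp.
type bucket struct {
	saved   int
	touched uint64
}

// boundedTotals caps the number of live buckets. Not safe for concurrent
// use as written.
type boundedTotals struct {
	max     int
	clock   uint64 // logical clock, bumped on every add
	buckets map[string]*bucket
}

func newBoundedTotals(max int) *boundedTotals {
	return &boundedTotals{max: max, buckets: make(map[string]*bucket)}
}

// add records savings for id, evicting the least recently touched bucket
// first if a new bucket would exceed the cap.
func (b *boundedTotals) add(id string, saved int) {
	b.clock++
	bk := b.buckets[id]
	if bk == nil {
		if len(b.buckets) >= b.max {
			b.evictOldest()
		}
		bk = &bucket{}
		b.buckets[id] = bk
	}
	bk.saved += saved
	bk.touched = b.clock
}

// evictOldest removes the bucket with the smallest touched stamp. A linear
// scan is acceptable at the sizes the commit mentions (1024 buckets).
func (b *boundedTotals) evictOldest() {
	oldestID, found := "", false
	var oldest uint64
	for id, bk := range b.buckets {
		if !found || bk.touched < oldest {
			oldestID, oldest, found = id, bk.touched, true
		}
	}
	if found {
		delete(b.buckets, oldestID)
	}
}

func main() {
	bt := newBoundedTotals(2)
	bt.add("inv-a", 100)
	bt.add("inv-b", 50)
	bt.add("inv-a", 25) // touches inv-a again, so inv-b is now the oldest
	bt.add("inv-c", 10) // cap reached: inv-b is evicted

	_, hasB := bt.buckets["inv-b"]
	fmt.Println(len(bt.buckets), hasB) // 2 false
}
```

Evicting by last touch rather than by creation time means a long-running invocation that is still compressing keeps its bucket, while an abandoned one ages out.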
Resolved in deb06fd. All four findings addressed:
- 🔴 Streaming
- 🟡 Leak backstops (took the "optionally also" suggestion):
- 🟡 Inflation guards: both seams tightened from
- 🟡 bbolt single-writer: confirmed

Gate: build ✅, vet ✅, gofmt ✅,

And thanks for the go.mod note, glad the direct/indirect split checked out.

🤖 Generated with Claude Code
## What

Integrates ctxzip v0.1.0 (reversible, structure-aware context compression) into the agent loop. Bulky tool outputs (JSON arrays, logs, grep results) are compressed before reaching the LLM; everything dropped is stored in a durable local bbolt store behind an inline `<<ctxzip:HASH ...>>` marker and retrievable via the new `context_expand` tool. Lossy on the wire, lossless end-to-end.

Off by default. Enable via `compression.enabled: true` in forge.yaml, `FORGE_COMPRESSION=true`, `forge run --compression`, or the new init-wizard step.

## Architecture: three seams (`forge-core/compress/`)

- `AfterToolExecHook`: compresses tool output once, at production time, before it enters Memory. Compressed bytes never change afterwards, so the conversation prefix stays byte-stable and provider prompt caches keep hitting. Registered after guardrail/redaction hooks; error results and small outputs stay verbatim.
- `WrapClient`: `llm.Client` decorator below the FallbackChain (covers retries + compactor calls) compressing the live zone of each request. Deterministic across turns: relevance query pinned to the first user message, never the latest turn.
- `ExpandTool`: `context_expand(hash)` builtin; the loop executes it like any other tool, no retrieval machinery needed. Tolerates imperfect hashes (whole markers, truncated hex → unique-prefix resolution against recently emitted markers).

A runtime-owned system directive is appended when compression is on, so every skill's agent knows what markers are and when to expand; skill authors need zero awareness.

## Provider prompt-cache hints (`ClientConfig.PromptCaching`)

Gated by `compression.cache_hints` (defaults to `enabled`):

- Anthropic: `cache_control: ephemeral` breakpoints on the last tool definition + system block (block-form system only when on; wire format byte-identical to today when off, test-asserted). Also applies on the `aws_sigv4` Bedrock-passthrough path.
- OpenAI: `prompt_cache_key` derived from (model, system, tool names).

## Observability

- `context_compressed` / `context_expanded` audit events (via `EmitFromContext`: correlation_id/task_id/seq/signing like every other event) carrying per-event figures plus running totals.
- `invocation_complete` gains `compression_saved_tokens_total`, `compression_count`, `expansion_count`: per-invocation, keyed by correlation ID so concurrent tasks never cross-contaminate.

## Config & CLI surfaces

- `forge run --compression[=false]`: tri-state override (absent = yaml/env decide)
- `forge serve --compression[=false]`: forwarded to the daemon
- `forge init --compression` + a new Context Compression TUI wizard step

## Live-tested

Hardened against a real gpt-4o agent over a 150-pod fixture (several commits exist because live testing found the failure modes: expand/compress tail-chase, hash transcription, store-dir creation):

- … `CrashLoopBackOff` pod kept verbatim
- … `context_expand` unprompted, hash transcribed correctly, got intact lines back, answered with the exact error (OOMKilled 512Mi)
- … `compression_count: 0`: compression is insurance against bulk, not a tax on every call

## Notes for reviewers

- … `llm_call.input_tokens`.
- … `grep_search` returning "(no matches found)" for a nonexistent file; `tools:` in forge.yaml not registering builtins (banner is cosmetic).

🤖 Generated with Claude Code