Reversible context compression (ctxzip): 60-95% fewer tokens on bulky tool outputs, losslessly by initializ-mk · Pull Request #241 · initializ/forge

initializ-mk · 2026-07-04T20:56:56Z

What

Integrates ctxzip v0.1.0 — reversible, structure-aware context compression — into the agent loop. Bulky tool outputs (JSON arrays, logs, grep results) are compressed before reaching the LLM; everything dropped is stored in a durable local bbolt store behind an inline <<ctxzip:HASH ...>> marker and retrievable via the new context_expand tool. Lossy on the wire, lossless end-to-end.

Off by default. Enable via compression.enabled: true in forge.yaml, FORGE_COMPRESSION=true, forge run --compression, or the new init-wizard step.

Architecture — three seams (`forge-core/compress/`)

AfterToolExecHook — compresses tool output once, at production time, before it enters Memory. Compressed bytes never change afterwards, so the conversation prefix stays byte-stable and provider prompt caches keep hitting. Registered after guardrail/redaction hooks; error results and small outputs stay verbatim.
WrapClient — llm.Client decorator below the FallbackChain (covers retries + compactor calls) compressing the live zone of each request. Deterministic across turns: relevance query pinned to the first user message, never the latest turn.
ExpandTool — context_expand(hash) builtin; the loop executes it like any other tool, no retrieval machinery needed. Tolerates imperfect hashes (whole markers, truncated hex → unique-prefix resolution against recently emitted markers).
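The hash tolerance described for ExpandTool can be sketched as follows. This is a minimal illustration, not the code in forge-core/compress: the function names, the error values, and the `140 lines` text inside the sample marker are assumptions; only the 6-character minimum prefix, the 12-hex hashes, and the glued ":count" suffix come from the commit messages in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// minPrefixLen is the shortest truncated hash worth resolving (the PR cites >= 6 chars).
const minPrefixLen = 6

var (
	errNotFound  = errors.New("no marker matches")
	errAmbiguous = errors.New("prefix matches more than one marker")
)

// normalizeHash reduces whatever the model pasted to bare lowercase hex.
// It accepts a whole marker ("<<ctxzip:HASH ...>>") or a hash with a glued ":count" suffix.
func normalizeHash(raw string) string {
	s := strings.ToLower(strings.TrimSpace(raw))
	s = strings.TrimPrefix(s, "<<")
	s = strings.TrimSuffix(s, ">>")
	s = strings.TrimPrefix(s, "ctxzip:")
	// Keep only the leading run of hex digits; this drops " ...", ":count", and similar tails.
	end := 0
	for end < len(s) && strings.IndexByte("0123456789abcdef", s[end]) >= 0 {
		end++
	}
	return s[:end]
}

// resolveHash returns the full hash on an exact match, or on a unique prefix
// of a recently emitted marker. recent is assumed to hold distinct hashes.
func resolveHash(raw string, recent []string) (string, error) {
	h := normalizeHash(raw)
	for _, r := range recent {
		if r == h {
			return r, nil
		}
	}
	if len(h) < minPrefixLen {
		return "", errNotFound
	}
	match := ""
	for _, r := range recent {
		if strings.HasPrefix(r, h) {
			if match != "" {
				return "", errAmbiguous
			}
			match = r
		}
	}
	if match == "" {
		return "", errNotFound
	}
	return match, nil
}

func main() {
	recent := []string{"a1b2c3d4e5f6", "a1b2c3ffffff", "0123456789ab"}
	inputs := []string{"<<ctxzip:0123456789ab 140 lines>>", "012345", "a1b2c3", "a1b2c3d4e5f6:12"}
	for _, in := range inputs {
		h, err := resolveHash(in, recent)
		fmt.Printf("%q -> %q, %v\n", in, h, err)
	}
}
```

Refusing an ambiguous prefix instead of picking the first match keeps a mangled hash from silently returning the wrong original.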

A runtime-owned system directive is appended when compression is on, so every skill's agent knows what markers are and when to expand — skill authors need zero awareness.

Provider prompt-cache hints (`ClientConfig.PromptCaching`)

Gated by compression.cache_hints (defaults to the value of compression.enabled):

anthropic: cache_control: ephemeral breakpoints on the last tool definition + system block (block-form system only when on — wire format byte-identical to today when off, test-asserted). Also applies on the aws_sigv4 Bedrock-passthrough path.
openai/gemini: stable prompt_cache_key derived from (model, system, tool names).
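One way to derive a stable key from (model, system, tool names) is to hash them, as in the sketch below. The function name, the SHA-256 choice, the sorting of tool names, and the 32-character truncation are all assumptions for illustration; the PR only states which inputs feed the key.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// promptCacheKey derives a deterministic key from the parts of a request that
// define its cacheable prefix. Hypothetical helper, not the forge implementation.
func promptCacheKey(model, system string, toolNames []string) string {
	names := append([]string(nil), toolNames...) // copy so the caller's slice is not reordered
	sort.Strings(names)                          // assumption: tool order should not change the key

	h := sha256.New()
	for _, part := range append([]string{model, system}, names...) {
		// Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
		fmt.Fprintf(h, "%d:%s|", len(part), part)
	}
	return hex.EncodeToString(h.Sum(nil))[:32]
}

func main() {
	a := promptCacheKey("gpt-4o", "You are a cluster assistant.", []string{"grep_search", "context_expand"})
	b := promptCacheKey("gpt-4o", "You are a cluster assistant.", []string{"context_expand", "grep_search"})
	fmt.Println(a == b) // same inputs in a different tool order yield the same key
}
```

Because the key depends only on inputs that stay fixed for a session, every turn of the same agent maps to the same cache shard.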

Observability

context_compressed / context_expanded audit events (via EmitFromContext — correlation_id/task_id/seq/signing like every other event) carrying per-event figures plus running totals.
invocation_complete gains compression_saved_tokens_total, compression_count, expansion_count — per-invocation, keyed by correlation ID so concurrent tasks never cross-contaminate.
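The per-invocation accounting above can be pictured as a map of buckets keyed by correlation ID, with a one-shot "take" at the response boundary. The sketch below is an assumption-laden illustration: the type and method names are hypothetical stand-ins for compress.Runtime and TakeInvocationTotals, and audit emission is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// totals mirrors the three per-invocation figures named in the PR.
type totals struct {
	SavedTokens  int
	Compressions int
	Expansions   int
}

// tracker accumulates totals per correlation ID (hypothetical stand-in for compress.Runtime).
type tracker struct {
	mu            sync.Mutex
	perInvocation map[string]*totals
}

func newTracker() *tracker {
	return &tracker{perInvocation: make(map[string]*totals)}
}

// recordCompression adds one compression to the invocation's bucket.
func (t *tracker) recordCompression(correlationID string, saved int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	b := t.perInvocation[correlationID]
	if b == nil {
		b = &totals{}
		t.perInvocation[correlationID] = b
	}
	b.SavedTokens += saved
	b.Compressions++
}

// takeInvocationTotals pops the bucket: the first call returns the totals,
// any later call for the same ID returns zeros, so nothing is double-counted.
func (t *tracker) takeInvocationTotals(correlationID string) totals {
	t.mu.Lock()
	defer t.mu.Unlock()
	b := t.perInvocation[correlationID]
	delete(t.perInvocation, correlationID)
	if b == nil {
		return totals{}
	}
	return *b
}

func main() {
	tr := newTracker()
	var wg sync.WaitGroup
	for _, id := range []string{"task-a", "task-b"} {
		for i := 0; i < 100; i++ {
			wg.Add(1)
			go func(id string) {
				defer wg.Done()
				tr.recordCompression(id, 10)
			}(id)
		}
	}
	wg.Wait()
	fmt.Printf("%+v\n", tr.takeInvocationTotals("task-a")) // task-a only, unaffected by task-b
	fmt.Printf("%+v\n", tr.takeInvocationTotals("task-a")) // already popped: zeros
}
```

Keying by correlation ID is what keeps concurrent tasks apart; diffing a single process-wide counter before and after a task would mix in savings from whatever else ran in between.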

Config & CLI surfaces

compression:
  enabled: true
  keep_patterns: [CrashLoopBackOff, PAYMENT_DECLINED]  # domain never-drop vocabulary
  # store_path (.forge/ctxzip.db) / ttl (30m) / min_tool_output_chars (2048) / cache_hints

forge run --compression[=false] — tri-state override (absent = yaml/env decide)
forge serve --compression[=false] — forwarded to the daemon
forge init --compression + a new Context Compression TUI wizard step
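The tri-state behaviour of `--compression` (absent, true, false) can be sketched with the standard library `flag` package. This is only an illustration: forge's real CLI wiring is not shown in this PR, the function name is hypothetical, and the rule that a set FORGE_COMPRESSION wins over forge.yaml is an assumption.

```go
package main

import (
	"flag"
	"fmt"
	"strconv"
)

// resolveCompression decides whether compression is on.
// An explicitly passed flag wins; otherwise the env value (if set) decides;
// otherwise the yaml value stands. The env-over-yaml order is an assumption.
func resolveCompression(args []string, env string, yamlEnabled bool) (bool, error) {
	fs := flag.NewFlagSet("run", flag.ContinueOnError)
	flagVal := fs.Bool("compression", false, "enable context compression")
	if err := fs.Parse(args); err != nil {
		return false, err
	}

	// Visit only walks flags that were actually set, which is how
	// "absent" is told apart from an explicit --compression=false.
	explicit := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "compression" {
			explicit = true
		}
	})
	if explicit {
		return *flagVal, nil
	}
	if env != "" {
		v, err := strconv.ParseBool(env)
		if err != nil {
			return false, fmt.Errorf("FORGE_COMPRESSION: %w", err)
		}
		return v, nil
	}
	return yamlEnabled, nil
}

func main() {
	on, err := resolveCompression([]string{"--compression=false"}, "true", true)
	fmt.Println(on, err) // the explicit flag overrides both env and yaml

	on, err = resolveCompression(nil, "", true)
	fmt.Println(on, err) // flag absent, env unset: yaml decides
}
```

A plain boolean flag with a default of false could not express "absent", which is why the presence check is separate from the value.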

Live-tested

Hardened against a real gpt-4o agent over a 150-pod fixture (several commits exist because live testing found the failure modes — expand/compress tail-chase, hash transcription, store-dir creation):

grep output compressed 1,397 → 51 tokens (96%) with the one CrashLoopBackOff pod kept verbatim
model called context_expand unprompted, hash transcribed correctly, got intact lines back, answered with the exact error (OOMKilled 512Mi)
surgical sessions correctly report compression_count: 0 — compression is insurance against bulk, not a tax on every call

Notes for reviewers

Fail-open everywhere: store init failure, compression error, or inflation all fall back to uncompressed originals.
Token figures in audit fields are tokenizer estimates (documented); billed truth stays in llm_call.input_tokens.
Known follow-ups (intentionally out of scope): compression-aware tool limits (grep's internal 50-line default truncates upstream of the reversible layer); grep_search returning "(no matches found)" for a nonexistent file; tools: in forge.yaml not registering builtins (banner is cosmetic).

🤖 Generated with Claude Code

…ompt-cache hints

Wires github.com/initializ/ctxzip into the agent loop as an opt-in feature (compression.enabled in forge.yaml, or FORGE_COMPRESSION=true). Bulky tool outputs and conversation content are compressed before reaching the LLM; everything dropped is stored in a durable local bbolt store (.forge/ctxzip.db) behind a <<ctxzip:HASH>> marker, retrievable via the new context_expand tool — lossy on the wire, lossless end-to-end.

New package forge-core/compress:
- AfterToolExecHook — compresses tool output once at production time, before it enters Memory, so historic bytes never change and provider prompt caches keep hitting. Registered after guardrail/redaction hooks; error results and small outputs are left verbatim.
- WrapClient — llm.Client decorator compressing the live zone of every outbound request (frozen prefix + recent turns forwarded byte-identical). Deterministic across turns: the relevance query is pinned to the first user message, never the latest turn.
- ExpandTool — context_expand builtin retrieving originals by marker hash; registered only when compression is on (memory_get pattern).

Provider prompt-cache hints (ClientConfig.PromptCaching, gated by compression.cache_hints, defaulting to compression.enabled):
- anthropic: cache_control ephemeral breakpoints on the last tool definition and the system block (block-form system only when caching — wire format is byte-identical to the previous contract when off). Also applies on the aws_sigv4 path, which speaks the same Messages wire format.
- openai: stable prompt_cache_key derived from (model, system, tool names) for cache-shard pinning; prefix caching itself is automatic.

Config: CompressionConfig (enabled / store_path / ttl / min_tool_output_chars / cache_hints) following the MemoryConfig.LongTerm opt-in pattern, wired in the runner beside initLongTermMemory. Fail-open: any init error runs the agent uncompressed.

Tests: hook compression + context_expand round-trip, error/small-output verbatim guarantees, live-zone vs frozen-prefix boundaries, cross-turn determinism (the cache-safety property), marker-hash normalization, anthropic/openai wire-format assertions on and off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…silience, store dir

Fixes from live agent testing (gpt-4o against a 150-pod fixture):
- Never recompress context_expand output, at both seams (hook skips the tool by name; client wrapper passes ctxzip SkipNames). Without this the loop chased its own tail: the model expanded a marker and the hook crushed the expansion straight back into a marker.
- Hash transcription resilience: models truncate or mangle marker hashes when copying them into tool calls. The Runtime now remembers emitted marker hashes; the expand tool resolves a unique prefix (≥6 chars) on exact-miss, and normalizeHash strips a glued ":count" suffix.
- Create the store's parent directory (bbolt creates the file, not the dir) — on a fresh project .forge/ doesn't exist and compression failed open with "no such file or directory".
- Bump ctxzip to a8b7923→94668f4: line-mode text compression (grep/log layout preserved byte-faithfully through the CCR round trip), stop-term filtering in line dedup, 12-hex marker hashes.

Live result on "status breakdown + unhealthy pod" over 150 pods: tool output crushed 1397→51 tokens (96%), model called context_expand with a correctly-transcribed hash, got intact lines back, and answered with the exact error (CrashLoopBackOff / OOMKilled 512Mi) — total session 10,982 input tokens vs 19,799 in the pre-fix run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

compression.keep_patterns lets the agent builder declare a domain vocabulary of case-insensitive substrings compression must never drop:

    compression:
      enabled: true
      keep_patterns: [CrashLoopBackOff, ImagePullBackOff, OOMKilled]

Threaded through compress.Config into both seams (AfterToolExec hook and the llm.Client wrapper) as ctxzip Options.MustKeep. Union semantics with ctxzip's built-in error floor — patterns only ever add protection.

Bumps ctxzip to 304962f, whose defaults also grow k8s state words (crash/backoff/oomkilled/evicted/unhealthy/degraded) after live testing showed "CrashLoopBackOff" matched nothing in the original error list.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
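The keep_patterns semantics described in this commit (case-insensitive substring match, union with a built-in floor) can be sketched as below. The function names and the sample built-in list are assumptions for illustration; the actual matching lives inside ctxzip's Options.MustKeep handling, which this PR does not show.

```go
package main

import (
	"fmt"
	"strings"
)

// builtinFloor is a stand-in for ctxzip's built-in never-drop terms (illustrative subset).
var builtinFloor = []string{"error", "oomkilled", "evicted"}

// mustKeep reports whether a line contains any pattern, ignoring case.
// Empty patterns are skipped so they cannot match every line.
func mustKeep(line string, patterns []string) bool {
	lower := strings.ToLower(line)
	for _, p := range patterns {
		if p != "" && strings.Contains(lower, strings.ToLower(p)) {
			return true
		}
	}
	return false
}

// partition splits lines into those that must survive compression and those
// the compressor may drop. User patterns are unioned with the built-in floor,
// so they can only add protection, never remove it.
func partition(lines, userPatterns []string) (kept, droppable []string) {
	patterns := append(append([]string(nil), builtinFloor...), userPatterns...)
	for _, l := range lines {
		if mustKeep(l, patterns) {
			kept = append(kept, l)
		} else {
			droppable = append(droppable, l)
		}
	}
	return kept, droppable
}

func main() {
	lines := []string{
		"pod-001 Running",
		"pod-002 Running",
		"pod-077 CrashLoopBackOff",
		"pod-101 Evicted",
	}
	kept, droppable := partition(lines, []string{"crashloopbackoff"})
	fmt.Println("kept:", kept)
	fmt.Println("droppable:", len(droppable))
}
```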

Two new audit events make token savings attributable instead of living only in debug logs:
- context_compressed — fired from both seams (tool_output hook, request wrapper) with seam, tool, tokens_before/after, saved_tokens, plus running totals (total_saved_tokens / total_compressions / total_expansions) so any single event shows the cumulative picture, not just the per-call delta.
- context_expanded — fired on every context_expand retrieval with hash, hit, bytes and the same running totals; expansions are the cost side auditors net against savings.

Events flow through AuditLogger.EmitFromContext, so correlation_id / task_id / seq are stamped like every other audit event and SIEM consumers can join savings to invocations. compress stays decoupled via a Config.Audit callback; nil disables emission. Runtime.Totals() exposes the process-lifetime snapshot. Token figures are tokenizer estimates (directionally accurate), not provider-billed counts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

invocation_complete now carries compression_saved_tokens_total, compression_count, and (when nonzero) expansion_count alongside the existing input/output token totals, so per-invocation cost rollups show what compression saved without joining context_compressed events. Savings are accumulated per correlation ID inside compress.Runtime and popped once at the response boundary (TakeInvocationTotals), so concurrent invocations never cross-contaminate — diffing the process-lifetime totals would have. Fields are present whenever compression is enabled; zeros mean "on, but nothing was worth compressing". Token figures remain tokenizer estimates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ker awareness

Marker awareness was living in the test agent's SKILL.md, which does not scale: every skill author would have to document compression. Compression is a runtime capability, so the runtime now briefs the model itself — when compression is enabled, compress.SystemDirective is appended to the system prompt (same pattern as codeAgentDirective), explaining what <<ctxzip:...>> markers are, that the visible remainder keeps errors and representative content, and when/how to call context_expand. The directive is a constant, keeping the system prompt byte-stable across turns for provider prompt caches. A guard test pins it to the real tool name and marker prefix so a rename cannot silently orphan the text.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replaces the feat-branch pseudo-version with the immutable release tag — the forge PR now depends on a stable, reviewable ctxzip version. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Compression becomes reachable from every entry point, not just forge.yaml/env:
- `forge run --compression` / `--compression=false` — tri-state: absent leaves yaml/env resolution untouched; explicit values override both by setting FORGE_COMPRESSION (same pattern as --model → MODEL_NAME).
- `forge serve --compression[=false]` — forwarded to the forked daemon `forge run`, only when explicitly passed.
- `forge init --compression` — non-interactive scaffolding writes a commented `compression.enabled: true` block into forge.yaml.
- init TUI wizard — new "Context Compression" step (SingleSelect, Enabled/Disabled with explanatory descriptions) between Skills and Auth; selection flows through WizardContext.Compression into the same forge.yaml block.

Scaffold test covers both directions: --compression writes the block, default omits it (off by default).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

New docs/core-concepts/context-compression.md — the feature's home: problem statement, pipeline diagram, keep-floor layers, configuration and precedence, provider cache hints, observability, failure posture.

Per the sync-docs mapping, updates ripple to:
- forge-yaml-schema.md — compression block in the full schema plus a dedicated reference section with field table
- cli-reference.md — --compression rows in the init/run/serve flag tables; wizard step order documented under forge init
- runtime-engine.md — Context Compression section describing the three loop seams (hook after guardrails, client wrapper below the fallback chain, context_expand tool) and the cache-stability posture
- audit-logging.md — context_compressed / context_expanded event rows; invocation_complete row gains the per-invocation compression fields
- tools-and-builtins.md — context_expand in the builtin table plus a Context Expansion Tool section (hash tolerance, miss guidance)
- environment-variables.md — FORGE_COMPRESSION
- README.md — Context Compression row in the documentation table
- .claude/skills/forge.md — swept sections 8 (memory), 13 (CLI), 14 (schema), 17 (audit reference), 19 (docs map); ToC unchanged (no new numbered sections)

Link check: 0 broken links across README + 55 docs files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

initializ-mk · 2026-07-04T23:25:14Z

Requesting changes (posting as a comment — GitHub won't let the author submit a formal Request-Changes review on their own PR). Blocking item: the streaming perInvocation leak below.

Reviewed for correctness and traced the invocation-complete paths + concurrency. Strong PR — fail-open everywhere, careful locking (accumulate under lock → snapshot → emit audit outside lock), the expand-tool-output-never-recompressed tail-chase fix, and the "wire format byte-identical when off" prompt-cache path is test-asserted. Builds clean; compress + llm/providers tests pass. One real bug on the primary path.

🔴 `perInvocation` leaks (and compression metrics are missing) on the streaming paths

TakeInvocationTotals — which both pops perInvocation[correlationID] and adds the compression_* fields — is called in exactly one place: executeTask (runner.go:1552). But there are three invocation_complete emission sites, and the other two are the sendSubscribe streaming handlers that run their own ExecuteStream loop instead of executeTask:

tasks/sendSubscribe (JSON-RPC SSE) → emits at 1296, no Take.
REST POST /tasks/sendSubscribe → emits at 1821, no Take.

Both streaming handlers set a correlation ID on ctx (WithCorrelationID) and use the shared compressed client + hooks, so a streaming invocation that compresses does call recordCompression → bumpInvocation, adding a perInvocation[cid] entry. Since only executeTask pops, and perInvocation has no TTL/sweep, every streaming invocation that compresses leaks its bucket permanently. sendSubscribe is a primary A2A mode, so with compression on this grows unbounded over process lifetime.

Same root cause, second gap: the PR advertises invocation_complete gaining compression_saved_tokens_total / compression_count / expansion_count, but those are only added in executeTask — so streaming invocation_complete events don't carry them. The feature is under-delivered on the common path.

Fix (one change closes both): factor the "TakeInvocationTotals(ctx) + populate the three compression fields" block out of executeTask into a helper, and call it at all three EmitInvocationComplete sites (1296, 1565, 1821). Optionally also give perInvocation a size cap / periodic sweep so a missed Take can't leak — but the shared helper is the real fix and also ships the metrics.

🟡 Secondary

recent marker map is unbounded — every emitted marker hash is remembered (rememberMarkers) and never evicted. Small strings, but grows for the process lifetime; a cap / LRU would bound it (you only need recent markers for imperfect-hash resolution).
bbolt takes an exclusive, single-writer file lock, so .forge/ctxzip.db can be opened by only one process at a time; two agents on a shared volume (K8s multi-replica) or forge run + forge serve on the same dir will collide. Fail-open only helps if ccr.NewBoltStore errors rather than blocks on the flock — worth confirming it sets an Open timeout, and documenting the single-writer constraint in the compression doc ("one store per process/replica").
Inflation guard is == 0, not <= 0 (hook.go: res.SavedTokens() == 0). If ctxzip can ever return negative savings (inflated output), that guard misses it and you'd record negative savings + apply the inflated bytes. Depends on ctxzip's contract; low risk, but <= 0 is safer.

Note (in the PR's favor)

go.mod is correct — ctxzip is a direct require and bbolt is correctly // indirect (transitive via ctxzip; compress imports ctxzip/ccr, not bbolt). No go mod tidy needed.

Verdict

Request changes for the streaming perInvocation leak + missing metrics — a genuine unbounded leak on the main path and a gap between advertised and actual streaming behavior. The shared-helper refactor is small and fixes both. The bbolt single-writer note and recent cap are worth doing but non-blocking.

…s; bound runtime maps

Addresses the PR #241 review (blocking + secondary items):
- 🔴 Streaming perInvocation leak + missing metrics: TakeInvocationTotals was only called in executeTask, but invocation_complete is emitted from THREE sites — the tasks/sendSubscribe JSON-RPC SSE and REST streaming handlers ran their own emission without popping the bucket, so every streaming invocation that compressed leaked its correlation bucket permanently AND its invocation_complete lacked the compression fields. The pop+populate block is now a shared helper (appendCompressionFields, documented as required at every emission site) called from all three. Pinned by TestAppendCompressionFields_PopsAndPopulates: fields populated, pop is one-shot (no double-count), nil-safe.
- 🟡 Leak backstops: perInvocation is bounded to 1024 buckets with oldest-touched eviction (a future missed pop can no longer grow unbounded); the recent-marker prefix-resolution set is bounded to 2048 with oldest-emitted eviction. Both pinned by tests.
- 🟡 Inflation guards tightened from == 0 to <= 0 at both seams — ctxzip clamps SavedTokens at zero today, but the guard must not silently apply inflated output if that contract ever changes.
- 🟡 bbolt single-writer constraint documented in context-compression.md: the store holds an exclusive flock with a 5s open timeout (verified in ctxzip's NewBoltStore), so a second process fails open and runs uncompressed; each replica should get its own store_path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

initializ-mk · 2026-07-04T23:34:06Z

Resolved in deb06fd — all four findings addressed:

🔴 Streaming perInvocation leak + missing metrics — fixed with the suggested shared helper. The pop+populate block is now appendCompressionFields(ctx, fields), called at all three EmitInvocationComplete sites (executeTask, JSON-RPC SSE sendSubscribe, REST sendSubscribe). The helper's doc comment names all three sites explicitly so a future fourth emission path knows the contract. Pinned by TestAppendCompressionFields_PopsAndPopulates: fields populated on first call, one-shot pop (second call returns zeros — no double-count if two sites ever fire), nil-safe with compression disabled.

🟡 Leak backstops (took the "optionally also" suggestion): perInvocation is now bounded to 1024 buckets with oldest-touched eviction — a future missed pop degrades to bounded memory instead of an unbounded leak. recent markers bounded to 2048 with oldest-emitted eviction (only recent markers matter for transcription repair; exact hashes still resolve via the store). Both pinned by tests (TestPerInvocationBuckets_Bounded, TestRecentMarkers_Bounded).

🟡 Inflation guards: both seams tightened from == 0 to <= 0. Confirmed ctxzip's SavedTokens() clamps at zero today, so this is defense-in-depth against a future contract change — noted as such in the code comment.

🟡 bbolt single-writer: confirmed ccr.NewBoltStore sets bolt.Options{Timeout: 5 * time.Second} — a second process errors after 5s rather than blocking, so fail-open engages (warning logged, agent runs uncompressed). Documented in context-compression.md § Failure posture: one store per process/replica, give each replica its own store_path.
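The leak backstop mentioned above (a bucket map capped at a fixed size with oldest-touched eviction) can be sketched with `container/list`. This is an illustrative structure under stated assumptions, not the code from deb06fd: the type and method names are hypothetical, and the mutex the real runtime needs for concurrent use is omitted to keep the eviction logic visible.

```go
package main

import (
	"container/list"
	"fmt"
)

// bucket holds one invocation's running savings.
type bucket struct {
	id    string
	saved int
}

// boundedBuckets caps the number of live buckets; when full, the bucket
// touched longest ago is evicted. Not safe for concurrent use as written.
type boundedBuckets struct {
	max   int
	order *list.List               // front = most recently touched
	items map[string]*list.Element // correlation ID -> element in order
}

func newBoundedBuckets(max int) *boundedBuckets {
	return &boundedBuckets{max: max, order: list.New(), items: make(map[string]*list.Element)}
}

// add accumulates savings for id and marks it as most recently touched.
func (b *boundedBuckets) add(id string, saved int) {
	if el, ok := b.items[id]; ok {
		el.Value.(*bucket).saved += saved
		b.order.MoveToFront(el)
		return
	}
	if b.order.Len() >= b.max {
		oldest := b.order.Back()
		b.order.Remove(oldest)
		delete(b.items, oldest.Value.(*bucket).id)
	}
	b.items[id] = b.order.PushFront(&bucket{id: id, saved: saved})
}

// take pops the bucket for id and returns its savings (0 if absent or evicted).
func (b *boundedBuckets) take(id string) int {
	el, ok := b.items[id]
	if !ok {
		return 0
	}
	b.order.Remove(el)
	delete(b.items, id)
	return el.Value.(*bucket).saved
}

func main() {
	b := newBoundedBuckets(2) // the PR uses 1024; 2 keeps the demo short
	b.add("task-a", 10)
	b.add("task-b", 20)
	b.add("task-a", 5)  // touching task-a makes task-b the oldest
	b.add("task-c", 30) // map is full: task-b is evicted
	fmt.Println(b.take("task-a"), b.take("task-b"), b.take("task-c"))
}
```

With the cap in place, a missed pop costs at most one stale bucket until it ages out, instead of growing for the life of the process.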

Gate: build ✅, vet ✅, gofmt ✅, forge-core 32 packages ✅, forge-cli/runtime full suite ✅.

And thanks for the go.mod note — glad the direct/indirect split checked out.

🤖 Generated with Claude Code

initializ-mk and others added 9 commits July 3, 2026 16:34

chore: pin ctxzip to released v0.1.0

e2673d1


initializ-mk wants to merge 10 commits into main from feat/ctxzip-compression
