initializ · initializ-mk · Jul 3, 2026
diff --git a/.claude/skills/forge.md b/.claude/skills/forge.md
@@ -351,7 +351,19 @@ caps, compaction triggers. Embedding provider auto-detects from the
 LLM provider (Anthropic → `voyage-3` family; OpenAI → `text-embedding-3-small`)
 unless `memory.embedding_provider` is explicit.
 
-**Read**: `docs/core-concepts/memory-system.md`.
+Opt-in **context compression** (ctxzip): when `compression.enabled` is
+set, large tool outputs are compressed reversibly before reaching the
+LLM — an `AfterToolExec` hook compresses once at production time, an
+`llm.Client` wrapper compresses each request's live zone, and the
+`context_expand` builtin retrieves offloaded originals by
+`<<ctxzip:HASH>>` marker from a bbolt store (`.forge/ctxzip.db`,
+30-min TTL). `compression.keep_patterns` declares domain vocabulary
+that is never dropped; `compression.cache_hints` injects provider
+prompt-cache primitives (anthropic `cache_control`, openai
+`prompt_cache_key`). Fail-open: any error runs uncompressed.
+
+**Read**: `docs/core-concepts/memory-system.md`,
+`docs/core-concepts/context-compression.md`.
 
 ---
 
@@ -768,10 +780,10 @@ Full reference: `docs/reference/cli-reference.md`.
 
 | Subcommand | Purpose | Key flags |
 |---|---|---|
-| `forge init` | Scaffold a new agent: `forge.yaml`, `.env`, `SKILL.md`, `guardrails.json`. Interactive TUI by default; `--non-interactive` for CI | `--model-provider`, `--model-name`, `--channels`, `--auth`, `--from-skills` |
+| `forge init` | Scaffold a new agent: `forge.yaml`, `.env`, `SKILL.md`, `guardrails.json`. Interactive TUI by default; `--non-interactive` for CI | `--model-provider`, `--model-name`, `--channels`, `--auth`, `--from-skills`, `--compression` |
 | `forge build` | Run the build pipeline → `.forge-output/agent.json` + container Dockerfile + K8s manifests + (optional) signature | `--output-dir`, `--sign` |
 | `forge validate` | Lint `forge.yaml` + SKILL.md. `--platform-policy=PATH` lints a policy file standalone | `--strict`, `--command-compat`, `--platform-policy` |
-| `forge run` | Dev-mode A2A server with hot-reload | `--port`, `--host`, `--with slack,telegram`, `--mock-tools`, `--no-auth`, `--cors-origins`, `--audit-socket`, `--audit-http-endpoint`, `--rate-limit-*`, `--otel-enabled`, `--otel-endpoint`, `--otel-sampler` |
+| `forge run` | Dev-mode A2A server with hot-reload | `--port`, `--host`, `--with slack,telegram`, `--mock-tools`, `--no-auth`, `--cors-origins`, `--audit-socket`, `--audit-http-endpoint`, `--rate-limit-*`, `--otel-enabled`, `--otel-endpoint`, `--otel-sampler`, `--compression[=false]` |
 | `forge serve start \| stop \| status \| logs` | Daemonized A2A server (forks `forge run`). Forwards CLI flags + env to the child | `--port`, `--shutdown-timeout`, `--with` |
 | `forge export` | Export `agent.json` for registry upload | |
 | `forge package` | Generate Dockerfile + Kubernetes manifests + `egress_allowlist.json`. `--prod` rejects `dev-open` egress + dev-only tools | `--registry`, `--tag`, `--base`, `--prod` |
@@ -852,6 +864,11 @@ memory:
   long_term: false
   embedding_provider: openai
 
+compression:
+  enabled: false          # reversible context compression (ctxzip)
+  keep_patterns: []       # never-drop vocabulary
+  cache_hints: true       # provider prompt-cache primitives
+
 mcp:
   token_store_path: ~/.forge/mcp-tokens.enc
   servers:
@@ -1042,7 +1059,9 @@ when OTel tracing is enabled (OTel v1 / Phase 4 / #105). Both use
 | `EventMCPToolConflict` | `mcp_tool_conflict` | Namespaced tool collision detected |
 | `EventMCPTokenRefresh` | `mcp_token_refresh` | OAuth 2.1 token refresh result |
 | `EventAgentCardPublished` | `agent_card_published` | Agent Card finalized at startup / hot-reload; `name`, `version`, `protocol_version`, `url`, `skill_count`, `capabilities`, `security_schemes`, `card_size_bytes`, `card_sha256` (FWS-1) |
-| `AuditInvocationComplete` | `invocation_complete` | A2A invocation closed; `duration_ms`, `input_tokens_total`, `output_tokens_total`, `llm_call_count`, `model`, `provider` (FWS-3) |
+| `context_compressed` | `context_compressed` | Context compression shrank content; `seam` (`tool_output` / `request`), `tool`, `tokens_before` / `tokens_after` / `saved_tokens` + running totals (tokenizer estimates) |
+| `context_expanded` | `context_expanded` | Model retrieved offloaded content via `context_expand`; `hash`, `hit`, `bytes` + running totals |
+| `AuditInvocationComplete` | `invocation_complete` | A2A invocation closed; `duration_ms`, `input_tokens_total`, `output_tokens_total`, `llm_call_count`, `model`, `provider` (FWS-3); with compression enabled also `compression_saved_tokens_total`, `compression_count`, `expansion_count` |
 | `AuditInvocationCancelled` | `invocation_cancelled` | A2A invocation cancelled via `tasks/cancel`; classified `reason` + partial token totals (FWS-4) |
 | `AuditTaskAdmissionDenied` | `task_admission_denied` | Inbound `tasks/send` denied by the platform admission middleware (#201; opt-in via `FORGE_ADMISSION_URL` + `FORGE_PLATFORM_TOKEN`); `reason`, `scope`, `window`, `reset_at`, `cached`. Caller sees HTTP 402 Payment Required. |
 | `AuditPolicyLoaded` | `policy_loaded` | One per non-empty policy layer at startup; `layer`, `source`, per-list size counters (FWS-5/6) |
@@ -1103,6 +1122,7 @@ docs/
 │   ├── skill-md-format.md        ← SKILL.md schema
 │   ├── channels.md
 │   ├── memory-system.md
+│   ├── context-compression.md    ← reversible tool-output compression
 │   ├── scheduling.md
 │   └── observability-tracing.md  ← OTel v1 (#108) — spans, propagation, audit cross-link
 ├── security/

diff --git a/README.md b/README.md
@@ -86,6 +86,7 @@ You write a `SKILL.md`. Forge compiles it into a secure, runnable agent with egr
 | [Tools](docs/core-concepts/tools-and-builtins.md) | Built-in tools, adapters, and custom tools |
 | [Runtime](docs/core-concepts/runtime-engine.md) | LLM providers, fallback chains, running modes |
 | [Memory](docs/core-concepts/memory-system.md) | Session persistence and long-term memory |
+| [Context Compression](docs/core-concepts/context-compression.md) | Reversible compression of bulky tool outputs — fewer tokens, nothing lost |
 | [Channels](docs/core-concepts/channels.md) | Slack and Telegram adapter setup |
 | [Scheduling](docs/core-concepts/scheduling.md) | Cron configuration and schedule tools |
 | [Tracing](docs/core-concepts/observability-tracing.md) | OpenTelemetry distributed tracing — spans, propagation, audit cross-link |

diff --git a/docs/core-concepts/context-compression.md b/docs/core-concepts/context-compression.md
@@ -0,0 +1,117 @@
+---
+title: "Context Compression"
+description: "Reversible compression of bulky tool outputs — fewer tokens, nothing lost."
+order: 6
+---
+
+Forge can compress bulky tool outputs before they reach the LLM — reversibly: everything dropped stays retrievable, so compression is lossy on the wire but lossless end-to-end.
+
+Powered by [ctxzip](https://github.com/initializ/ctxzip). Off by default; enable per agent in `forge.yaml`, per run with a flag, or at scaffold time in the init wizard.
+
+## The problem it solves
+
+Agent tool outputs are dominated by repetition: 149 pods that are `Running` and one that is `CrashLoopBackOff`; hundreds of log lines differing only by timestamp; JSON list responses where the model needs three rows. Without compression these outputs either flood the context window or get **truncated** — destroying whatever fell past the cut, which is frequently the one row that mattered.
+
+Compression inverts the tradeoff: keep what matters (errors, anomalies, query-relevant rows, boundaries), offload the rest to a local store, and let the model retrieve the original if it turns out to need it.
+
+## How it works
+
+```
+tool executes
+   │
+   ▼
+AfterToolExec hook ──── output ≥ 2 KB? ──── compress once, at production time
+   │                                         dropped content → .forge/ctxzip.db
+   │                                         replaced by <<ctxzip:HASH note>> marker
+   ▼
+Memory (compressed bytes never change → provider prompt caches stay warm)
+   │
+   ▼
+LLM client wrapper ──── compresses the live zone of each request
+   │                    (system prompt + recent turns forwarded byte-identical)
+   ▼
+LLM sees:  [... kept rows, errors intact ...] <<ctxzip:ac998fea694b 149_lines_offloaded>>
+   │
+   └─ needs the offloaded data? → calls context_expand(hash) → original returned
+```
+
+Three pieces, all automatic once enabled:
+
+| Piece | What it does |
+|-------|--------------|
+| Tool-output hook | Compresses each large tool result once, before it enters session memory. Error results and small outputs are left verbatim. |
+| Client wrapper | Compresses the remaining live zone of each outbound request. Deterministic across turns so historic messages always compress to identical bytes. |
+| `context_expand` tool | Registered automatically. The model calls it with a marker's hash to get the original content back. A system-prompt directive teaches every agent what markers are — skills need zero awareness. |
+
+## What is never dropped
+
+Fidelity is layered; every layer only ever adds protection:
+
+1. **Error floor** — content matching error vocabulary (`error`, `fail`, `panic`, `timeout`, `crash`, `backoff`, `oomkilled`, `evicted`, …) is kept verbatim.
+2. **`keep_patterns`** — your domain's never-drop vocabulary (see below).
+3. **Query anchors** — items matching the conversation's ask survive.
+4. **Structure** — head/tail windows and one exemplar of each near-duplicate group.
+5. **Reversibility** — everything else is offloaded to the store, not deleted.
+6. **Source of truth** — after the store TTL (30 min), the disk or the original command still holds the data; a retrieval miss tells the model to re-run the producing tool.
+
+## Configuration
+
+```yaml
+# forge.yaml
+compression:
+  enabled: true                # default: false
+  keep_patterns:               # domain vocabulary that must never be dropped
+    - CrashLoopBackOff
+    - PAYMENT_DECLINED
+  # store_path: .forge/ctxzip.db      # offloaded-originals store (bbolt)
+  # ttl: 30m                          # how long originals stay retrievable
+  # min_tool_output_chars: 2048       # hook floor; smaller outputs untouched
+  # cache_hints: true                 # provider prompt-cache hints (defaults to enabled)
+```
+
+Precedence (most specific wins):
+
+```
+forge run --compression[=false]  >  FORGE_COMPRESSION=true|false  >  compression.enabled  >  off
+```
+
+| Surface | Usage |
+|---------|-------|
+| `forge run --compression` | Enable for one run; `--compression=false` force-disables even when `forge.yaml` enables it |
+| `forge serve --compression[=false]` | Forwarded to the daemon |
+| `forge init --compression` | Scaffold a new agent with the block enabled |
+| init TUI wizard | "Context Compression" step (between Skills and Auth) |
+
+## Provider prompt-cache hints
+
+Compressing the wrong bytes can *cost* tokens by busting the provider's prompt cache, so compression never touches the system prompt, tool definitions, or recent turns, and its output is deterministic across turns. On top of that, `cache_hints` (on by default when compression is enabled) injects each provider's native cache primitives:
+
+| Provider | Hint |
+|----------|------|
+| anthropic | `cache_control: {type: ephemeral}` breakpoints on the last tool definition and the system block — caches the stable tools+system prefix across turns. Also applies on the `aws_sigv4` Bedrock-passthrough path. |
+| openai / gemini | A stable `prompt_cache_key` derived from (model, system prompt, tool names) — pins cache routing; prefix caching itself is automatic. |
+
+When `cache_hints` is off, provider wire formats are byte-identical to a build without compression.
+
+## Observability
+
+Savings are first-class audit events, not log noise — see [Audit Logging](../security/audit-logging.md) for the event schema:
+
+- `context_compressed` — per compression: seam, tool, tokens before/after/saved, plus running totals.
+- `context_expanded` — per retrieval: hash, hit, bytes — the cost side to net against savings.
+- `invocation_complete` gains `compression_saved_tokens_total`, `compression_count`, and `expansion_count`, accumulated per invocation (concurrent tasks never cross-contaminate).
+
+Token figures are tokenizer estimates (directionally accurate); billed truth remains `llm_call.input_tokens`. A surgical session that produced only small outputs correctly reports `compression_count: 0` — compression is insurance against bulk, not a tax on every call.
+
+## Failure posture
+
+Fail-open, always: if the store cannot be opened, a compressor errors, or "compression" would grow a message, the original content is used unchanged. Error tool results are never compressed. An expired retrieval is not a dead end — the model is told to re-run the tool that produced the output.
+
+**Single-writer store.** The bbolt store at `store_path` holds an exclusive file lock — one store per process. A second process pointing at the same file (two replicas on a shared volume, or `forge run` alongside `forge serve` in the same directory) fails to acquire the lock after a 5-second timeout and that process runs uncompressed (fail-open, with a startup warning). Give each replica its own `store_path` — offloaded originals are only ever retrieved by the process that offloaded them, so the store has no reason to be shared.
+
+## Related
+
+- [Runtime Engine](runtime-engine.md) — where the hook and client wrapper sit in the agent loop
+- [Tools & Builtins](tools-and-builtins.md) — the `context_expand` tool
+- [forge.yaml Schema](../reference/forge-yaml-schema.md) — the `compression` block
+- [CLI Reference](../reference/cli-reference.md) — flags and wizard step
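
The "What is never dropped" layers in `docs/core-concepts/context-compression.md` can be illustrated with a short Go sketch. It is not ctxzip's implementation: `selectLines`, `containsAny`, and the `window` parameter are hypothetical names, only the error floor, `keep_patterns`, and head/tail windows are modeled, and matching is assumed to be a case-insensitive substring test.

```go
package main

import (
	"fmt"
	"strings"
)

// errorVocab copies the example terms from the doc; the real list is longer.
var errorVocab = []string{"error", "fail", "panic", "timeout", "crash", "backoff", "oomkilled", "evicted"}

// containsAny reports whether line contains any term, ignoring case.
// Substring matching is an assumption made for this sketch.
func containsAny(line string, terms []string) bool {
	lower := strings.ToLower(line)
	for _, t := range terms {
		if strings.Contains(lower, strings.ToLower(t)) {
			return true
		}
	}
	return false
}

// selectLines (hypothetical) splits lines into those kept verbatim and those
// that would be offloaded to the store. A line is kept when it falls in the
// head/tail window, matches the error floor, or matches a keep pattern.
func selectLines(lines, keepPatterns []string, window int) (kept, offloaded []string) {
	for i, line := range lines {
		inWindow := i < window || i >= len(lines)-window
		if inWindow || containsAny(line, errorVocab) || containsAny(line, keepPatterns) {
			kept = append(kept, line)
		} else {
			offloaded = append(offloaded, line)
		}
	}
	return kept, offloaded
}

func main() {
	var lines []string
	for i := 0; i < 10; i++ {
		lines = append(lines, fmt.Sprintf("pod-%d Running", i))
	}
	lines[5] = "pod-5 CrashLoopBackOff"

	kept, offloaded := selectLines(lines, []string{"PAYMENT_DECLINED"}, 1)
	fmt.Println(len(kept), len(offloaded)) // prints "3 7": head, tail, and the crashing pod survive
}
```

Each check can only add a line to `kept`, which mirrors the statement that every layer only ever adds protection.
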
diff --git a/docs/core-concepts/runtime-engine.md b/docs/core-concepts/runtime-engine.md
@@ -277,6 +277,10 @@ The `FilesDir` is set via `LLMExecutorConfig.FilesDir` and made available to too
 
 For details on session persistence, context window management, compaction, and long-term memory, see [Memory](memory-system.md).
 
+## Context Compression
+
+When `compression.enabled` is set, the runner wires reversible context compression (ctxzip) into the loop at three points: an `AfterToolExec` hook compresses large tool outputs once, before they enter memory (registered after the guardrail hooks, so it compresses redacted output); the LLM client is wrapped in a compressing decorator below the fallback chain (so retries and compactor summarization calls are covered too); and the `context_expand` retrieval tool is registered so the model can recover offloaded content by marker hash. A constant system-prompt directive teaches the model what `<<ctxzip:...>>` markers are — individual skills need no awareness. Compression output is deterministic across turns and never touches the system prompt or recent messages, keeping provider prompt caches warm; `compression.cache_hints` additionally injects the provider's native cache primitives (anthropic `cache_control` breakpoints, openai `prompt_cache_key`). See [Context Compression](context-compression.md).
+
 ## Hooks
 
 The engine fires hooks at key points in the loop. See [Hooks](hooks.md) for details.

diff --git a/docs/core-concepts/tools-and-builtins.md b/docs/core-concepts/tools-and-builtins.md
@@ -30,6 +30,7 @@ Tools are capabilities that an LLM agent can invoke during execution. Forge prov
 | `read_skill` | Load full instructions for an available skill on demand |
 | `memory_search` | Search long-term memory (when enabled) |
 | `memory_get` | Read memory files (when enabled) |
+| `context_expand` | Retrieve the original content behind a `<<ctxzip:...>>` compression marker (when [compression](context-compression.md) is enabled) |
 | `cli_execute` | Execute pre-approved CLI binaries |
 | `schedule_set` | Create or update a recurring cron schedule |
 | `schedule_list` | List all active and inactive schedules |
@@ -180,6 +181,10 @@ When [long-term memory](memory-system.md) is enabled, two additional tools are r
 
 These tools allow the agent to recall information from previous sessions.
 
+## Context Expansion Tool
+
+When [context compression](context-compression.md) is enabled, the `context_expand` tool is registered. Compressed tool outputs carry inline `<<ctxzip:HASH note>>` markers; the model calls `context_expand` with the hash to retrieve the offloaded original from the local store. The tool tolerates imperfect input — a whole marker pasted as the hash, or a truncated hash that uniquely prefixes a recently emitted one — and a miss (expired/evicted entry) returns guidance to re-run the producing tool rather than an error.
+
 ## Development Tools
 
 Development tools (`local_shell`, `local_file_browser`, `debug_console`, `test_runner`) are available during `forge run --dev` but are **automatically filtered out** in production builds by the `ToolFilterStage`.

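The input tolerance described for `context_expand` (a whole marker pasted as the hash, or a truncated hash that uniquely prefixes a recent one) can be sketched in Go as follows. `normalizeHash` and `resolveHash` are hypothetical helpers, and the list of recently emitted hashes is assumed to be available to the tool; the actual lookup logic may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeHash (hypothetical) accepts a bare hash or a whole
// "<<ctxzip:HASH note>>" marker and returns just the hash part.
func normalizeHash(input string) string {
	s := strings.TrimSpace(input)
	s = strings.TrimPrefix(s, "<<ctxzip:")
	s = strings.TrimSuffix(s, ">>")
	if fields := strings.Fields(s); len(fields) > 0 {
		return fields[0]
	}
	return ""
}

// resolveHash (hypothetical) maps the input to one recently emitted hash.
// An exact match wins; otherwise the input must prefix exactly one hash.
// The entries of recent are assumed to be distinct.
func resolveHash(input string, recent []string) (string, bool) {
	h := normalizeHash(input)
	if h == "" {
		return "", false
	}
	for _, full := range recent {
		if full == h {
			return full, true
		}
	}
	match := ""
	for _, full := range recent {
		if strings.HasPrefix(full, h) {
			if match != "" {
				return "", false // ambiguous prefix: treat as a miss
			}
			match = full
		}
	}
	return match, match != ""
}

func main() {
	recent := []string{"ac998fea694b", "0f3d21aa9c10"}

	h, ok := resolveHash("<<ctxzip:ac998fea694b 149_lines_offloaded>>", recent)
	fmt.Println(h, ok) // prints "ac998fea694b true"

	h, ok = resolveHash("0f3d", recent)
	fmt.Println(h, ok) // prints "0f3d21aa9c10 true"
}
```

In this sketch a miss returns `false`, at which point the caller would return the re-run guidance instead of an error.
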
diff --git a/docs/reference/cli-reference.md b/docs/reference/cli-reference.md
@@ -18,7 +18,7 @@ Complete reference for all Forge CLI commands.
 
 ## `forge init`
 
-Initialize a new agent project.
+Initialize a new agent project. Without `--non-interactive`, a TUI wizard walks through: name → model provider → fallbacks → channel → tools → skills → context compression → authentication → egress review → summary.
 
 ```
 forge init [name] [flags]
@@ -39,6 +39,7 @@ forge init [name] [flags]
 | `--org-id` | | | OpenAI Organization ID (enterprise) |
 | `--from-skills` | | | Path to a SKILL.md file for auto-configuration |
 | `--non-interactive` | | `false` | Skip interactive prompts |
+| `--compression` | | `false` | Enable reversible context compression — writes `compression.enabled: true` to the scaffolded forge.yaml. See [Context Compression](../core-concepts/context-compression.md) |
 | `--auth` | | | Auth mode: `none`, `oidc`, `http_verifier`, `aws_sigv4`, `gcp_iap`, `azure_ad`, `custom` |
 | `--auth-issuer` | | | OIDC issuer URL (required with `--auth=oidc`) |
 | `--auth-audience` | | | OIDC audience (required with `--auth=oidc`) |
@@ -217,6 +218,7 @@ forge run [flags]
 | `--enforce-guardrails` | `false` | Enforce guardrail violations as errors |
 | `--model` | | Override model name (sets `MODEL_NAME` env var) |
 | `--provider` | | LLM provider: `openai`, `anthropic`, or `ollama` |
+| `--compression` | | Enable reversible context compression; `--compression=false` forces it off. When the flag is absent, `FORGE_COMPRESSION` and `forge.yaml` decide. Passing the flag sets the `FORGE_COMPRESSION` env var. See [Context Compression](../core-concepts/context-compression.md) |
 | `--env` | `.env` | Path to .env file |
 | `--with` | | Comma-separated channel adapters (e.g., `slack,telegram`) |
 | `--auth-url` | | External auth provider URL for token validation |
@@ -291,6 +293,7 @@ forge serve [start|stop|status|logs] [flags]
 | `--host` | `127.0.0.1` | Bind address (secure default) |
 | `--with` | | Channel adapters |
 | `--cors-origins` | localhost | Comma-separated CORS allowed origins |
+| `--compression` | | Enable reversible context compression; `--compression=false` forces it off. Forwarded to the daemon `forge run` only when explicitly passed |
 
 ### Examples
 

diff --git a/docs/reference/environment-variables.md b/docs/reference/environment-variables.md
@@ -12,6 +12,7 @@ order: 3
 | `FORGE_MODEL_FALLBACKS` | Fallback chain (e.g., `"anthropic:claude-sonnet-4,gemini"`) |
 | `FORGE_MEMORY_PERSISTENCE` | Set `false` to disable session persistence |
 | `FORGE_MEMORY_LONG_TERM` | Set `true` to enable long-term memory |
+| `FORGE_COMPRESSION` | Set `true`/`false` to override `compression.enabled` (reversible context compression); the `--compression` flag overrides both |
 | `FORGE_EMBEDDING_PROVIDER` | Override embedding provider |
 | `OPENAI_API_KEY` | OpenAI API key |
 | `OPENAI_ORG_ID` | OpenAI Organization ID (enterprise); overrides `organization_id` in YAML |
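
The documented precedence (`--compression` flag, then `FORGE_COMPRESSION`, then `compression.enabled`, then off) can be written as a small Go function. This is an illustrative sketch: `resolveCompression` is a hypothetical name, the three sources are modeled as separate inputs for clarity, and treating an unparsable env value as unset is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveCompression (hypothetical) applies the documented precedence.
// flag is nil when --compression was not passed; env is the raw value of
// FORGE_COMPRESSION ("" when unset); yamlEnabled is compression.enabled.
func resolveCompression(flag *bool, env string, yamlEnabled bool) bool {
	if flag != nil {
		return *flag // explicit flag wins, including --compression=false
	}
	if v, err := strconv.ParseBool(env); err == nil {
		return v // env overrides forge.yaml
	}
	// Unset or unparsable env falls through to forge.yaml (assumption).
	return yamlEnabled // false when the block is absent, i.e. off by default
}

func main() {
	off := false
	fmt.Println(resolveCompression(&off, "true", true)) // false: flag force-disables
	fmt.Println(resolveCompression(nil, "true", false)) // true: env overrides forge.yaml
	fmt.Println(resolveCompression(nil, "", false))     // false: nothing enables it
}
```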