28 changes: 24 additions & 4 deletions .claude/skills/forge.md
Original file line number Diff line number Diff line change
@@ -351,7 +351,19 @@ caps, compaction triggers. Embedding provider auto-detects from the
LLM provider (Anthropic → `voyage-3` family; OpenAI → `text-embedding-3-small`)
unless `memory.embedding_provider` is explicit.

**Read**: `docs/core-concepts/memory-system.md`.
Opt-in **context compression** (ctxzip): when `compression.enabled` is
set, large tool outputs are compressed reversibly before reaching the
LLM — an `AfterToolExec` hook compresses once at production time, an
`llm.Client` wrapper compresses each request's live zone, and the
`context_expand` builtin retrieves offloaded originals by
`<<ctxzip:HASH>>` marker from a bbolt store (`.forge/ctxzip.db`,
30-min TTL). `compression.keep_patterns` declares domain vocabulary
that is never dropped; `compression.cache_hints` injects provider
prompt-cache primitives (anthropic `cache_control`, openai
`prompt_cache_key`). Fail-open: any error runs uncompressed.

**Read**: `docs/core-concepts/memory-system.md`,
`docs/core-concepts/context-compression.md`.

---

@@ -768,10 +780,10 @@ Full reference: `docs/reference/cli-reference.md`.

| Subcommand | Purpose | Key flags |
|---|---|---|
| `forge init` | Scaffold a new agent: `forge.yaml`, `.env`, `SKILL.md`, `guardrails.json`. Interactive TUI by default; `--non-interactive` for CI | `--model-provider`, `--model-name`, `--channels`, `--auth`, `--from-skills` |
| `forge init` | Scaffold a new agent: `forge.yaml`, `.env`, `SKILL.md`, `guardrails.json`. Interactive TUI by default; `--non-interactive` for CI | `--model-provider`, `--model-name`, `--channels`, `--auth`, `--from-skills`, `--compression` |
| `forge build` | Run the build pipeline → `.forge-output/agent.json` + container Dockerfile + K8s manifests + (optional) signature | `--output-dir`, `--sign` |
| `forge validate` | Lint `forge.yaml` + SKILL.md. `--platform-policy=PATH` lints a policy file standalone | `--strict`, `--command-compat`, `--platform-policy` |
| `forge run` | Dev-mode A2A server with hot-reload | `--port`, `--host`, `--with slack,telegram`, `--mock-tools`, `--no-auth`, `--cors-origins`, `--audit-socket`, `--audit-http-endpoint`, `--rate-limit-*`, `--otel-enabled`, `--otel-endpoint`, `--otel-sampler` |
| `forge run` | Dev-mode A2A server with hot-reload | `--port`, `--host`, `--with slack,telegram`, `--mock-tools`, `--no-auth`, `--cors-origins`, `--audit-socket`, `--audit-http-endpoint`, `--rate-limit-*`, `--otel-enabled`, `--otel-endpoint`, `--otel-sampler`, `--compression[=false]` |
| `forge serve start \| stop \| status \| logs` | Daemonized A2A server (forks `forge run`). Forwards CLI flags + env to the child | `--port`, `--shutdown-timeout`, `--with` |
| `forge export` | Export `agent.json` for registry upload | |
| `forge package` | Generate Dockerfile + Kubernetes manifests + `egress_allowlist.json`. `--prod` rejects `dev-open` egress + dev-only tools | `--registry`, `--tag`, `--base`, `--prod` |
@@ -852,6 +864,11 @@ memory:
  long_term: false
  embedding_provider: openai

compression:
  enabled: false        # reversible context compression (ctxzip)
  keep_patterns: []     # never-drop vocabulary
  cache_hints: true     # provider prompt-cache primitives

mcp:
  token_store_path: ~/.forge/mcp-tokens.enc
  servers:
@@ -1042,7 +1059,9 @@ when OTel tracing is enabled (OTel v1 / Phase 4 / #105). Both use
| `EventMCPToolConflict` | `mcp_tool_conflict` | Namespaced tool collision detected |
| `EventMCPTokenRefresh` | `mcp_token_refresh` | OAuth 2.1 token refresh result |
| `EventAgentCardPublished` | `agent_card_published` | Agent Card finalized at startup / hot-reload; `name`, `version`, `protocol_version`, `url`, `skill_count`, `capabilities`, `security_schemes`, `card_size_bytes`, `card_sha256` (FWS-1) |
| `AuditInvocationComplete` | `invocation_complete` | A2A invocation closed; `duration_ms`, `input_tokens_total`, `output_tokens_total`, `llm_call_count`, `model`, `provider` (FWS-3) |
| `context_compressed` | `context_compressed` | Context compression shrank content; `seam` (`tool_output` / `request`), `tool`, `tokens_before` / `tokens_after` / `saved_tokens` + running totals (tokenizer estimates) |
| `context_expanded` | `context_expanded` | Model retrieved offloaded content via `context_expand`; `hash`, `hit`, `bytes` + running totals |
| `AuditInvocationComplete` | `invocation_complete` | A2A invocation closed; `duration_ms`, `input_tokens_total`, `output_tokens_total`, `llm_call_count`, `model`, `provider` (FWS-3); with compression enabled also `compression_saved_tokens_total`, `compression_count`, `expansion_count` |
| `AuditInvocationCancelled` | `invocation_cancelled` | A2A invocation cancelled via `tasks/cancel`; classified `reason` + partial token totals (FWS-4) |
| `AuditTaskAdmissionDenied` | `task_admission_denied` | Inbound `tasks/send` denied by the platform admission middleware (#201; opt-in via `FORGE_ADMISSION_URL` + `FORGE_PLATFORM_TOKEN`); `reason`, `scope`, `window`, `reset_at`, `cached`. Caller sees HTTP 402 Payment Required. |
| `AuditPolicyLoaded` | `policy_loaded` | One per non-empty policy layer at startup; `layer`, `source`, per-list size counters (FWS-5/6) |
@@ -1103,6 +1122,7 @@ docs/
│ ├── skill-md-format.md ← SKILL.md schema
│ ├── channels.md
│ ├── memory-system.md
│ ├── context-compression.md ← reversible tool-output compression
│ ├── scheduling.md
│ └── observability-tracing.md ← OTel v1 (#108) — spans, propagation, audit cross-link
├── security/
1 change: 1 addition & 0 deletions README.md
@@ -86,6 +86,7 @@ You write a `SKILL.md`. Forge compiles it into a secure, runnable agent with egr
| [Tools](docs/core-concepts/tools-and-builtins.md) | Built-in tools, adapters, and custom tools |
| [Runtime](docs/core-concepts/runtime-engine.md) | LLM providers, fallback chains, running modes |
| [Memory](docs/core-concepts/memory-system.md) | Session persistence and long-term memory |
| [Context Compression](docs/core-concepts/context-compression.md) | Reversible compression of bulky tool outputs — fewer tokens, nothing lost |
| [Channels](docs/core-concepts/channels.md) | Slack and Telegram adapter setup |
| [Scheduling](docs/core-concepts/scheduling.md) | Cron configuration and schedule tools |
| [Tracing](docs/core-concepts/observability-tracing.md) | OpenTelemetry distributed tracing — spans, propagation, audit cross-link |
117 changes: 117 additions & 0 deletions docs/core-concepts/context-compression.md
@@ -0,0 +1,117 @@
---
title: "Context Compression"
description: "Reversible compression of bulky tool outputs — fewer tokens, nothing lost."
order: 6
---

Forge can compress bulky tool outputs before they reach the LLM — reversibly: everything dropped stays retrievable, so compression is lossy on the wire but lossless end-to-end.

Powered by [ctxzip](https://github.com/initializ/ctxzip). Off by default; enable per agent in `forge.yaml`, per run with a flag, or at scaffold time in the init wizard.

## The problem it solves

Agent tool outputs are dominated by repetition: 149 pods that are `Running` and one that is `CrashLoopBackOff`; hundreds of log lines differing only by timestamp; JSON list responses where the model needs three rows. Without compression these outputs either flood the context window or get **truncated** — destroying whatever fell past the cut, which is frequently the one row that mattered.

Compression inverts the tradeoff: keep what matters (errors, anomalies, query-relevant rows, boundaries), offload the rest to a local store, and let the model retrieve the original if it turns out to need it.

## How it works

```
tool executes
     │
     ▼
AfterToolExec hook ──── output ≥ 2 KB? ──── compress once, at production time
     │                    dropped content → .forge/ctxzip.db
     │                    replaced by <<ctxzip:HASH note>> marker
     ▼
Memory (compressed bytes never change → provider prompt caches stay warm)
     │
     ▼
LLM client wrapper ──── compresses the live zone of each request
     │                    (system prompt + recent turns forwarded byte-identical)
     ▼
LLM sees: [... kept rows, errors intact ...] <<ctxzip:ac998fea694b 149_lines_offloaded>>
     └─ needs the offloaded data? → calls context_expand(hash) → original returned
```

Three pieces, all automatic once enabled:

| Piece | What it does |
|-------|--------------|
| Tool-output hook | Compresses each large tool result once, before it enters session memory. Error results and small outputs are left verbatim. |
| Client wrapper | Compresses the remaining live zone of each outbound request. Deterministic across turns so historic messages always compress to identical bytes. |
| `context_expand` tool | Registered automatically. The model calls it with a marker's hash to get the original content back. A system-prompt directive teaches every agent what markers are — skills need zero awareness. |

## What is never dropped

Fidelity is layered; every layer only ever adds protection:

1. **Error floor** — content matching error vocabulary (`error`, `fail`, `panic`, `timeout`, `crash`, `backoff`, `oomkilled`, `evicted`, …) is kept verbatim.
2. **`keep_patterns`** — your domain's never-drop vocabulary (see below).
3. **Query anchors** — items matching the conversation's ask survive.
4. **Structure** — head/tail windows and one exemplar of each near-duplicate group.
5. **Reversibility** — everything else is offloaded to the store, not deleted.
6. **Source of truth** — after the store TTL (30 min), the disk or the original command still holds the data; a retrieval miss tells the model to re-run the producing tool.
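
The layering above can be pictured with a small line filter. This is a rough sketch under stated assumptions, not ctxzip's actual algorithm: `selectKept`, the digit-masking notion of "near-duplicate", and the `window` parameter are all hypothetical, and query anchors (layer 3) are left out for brevity.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// errorFloor is an illustrative subset of the error vocabulary listed above.
var errorFloor = []string{"error", "fail", "panic", "timeout", "crash", "backoff", "oomkilled", "evicted"}

// digits is used to mask numbers so lines differing only by a number share a shape
// (an assumed, simplistic definition of "near-duplicate").
var digits = regexp.MustCompile(`[0-9]+`)

// containsAny reports whether line contains any of words, case-insensitively.
func containsAny(line string, words []string) bool {
	lower := strings.ToLower(line)
	for _, w := range words {
		if strings.Contains(lower, strings.ToLower(w)) {
			return true
		}
	}
	return false
}

// selectKept returns the lines that stay in context and the number offloaded.
// window is the size of both the head and the tail window.
func selectKept(lines, keepPatterns []string, window int) (kept []string, offloaded int) {
	seen := map[string]bool{}
	for i, line := range lines {
		shape := digits.ReplaceAllString(line, "#")
		switch {
		case containsAny(line, errorFloor), containsAny(line, keepPatterns):
			kept = append(kept, line) // layers 1 and 2: never dropped
		case i < window || i >= len(lines)-window:
			kept = append(kept, line) // layer 4: head/tail windows
		case !seen[shape]:
			kept = append(kept, line) // layer 4: one exemplar per group
		default:
			offloaded++ // layer 5: offloaded to the store, not deleted
		}
		seen[shape] = true
	}
	return kept, offloaded
}

func main() {
	var lines []string
	for i := 0; i < 10; i++ {
		lines = append(lines, fmt.Sprintf("pod-%d Running", i))
	}
	lines[6] = "pod-6 CrashLoopBackOff"
	kept, offloaded := selectKept(lines, []string{"PAYMENT_DECLINED"}, 1)
	fmt.Println(strings.Join(kept, "\n"))
	fmt.Printf("offloaded: %d\n", offloaded)
}
```

In this toy run the head line, the tail line, and the `CrashLoopBackOff` line survive, while the repeated `Running` lines are counted as offloaded.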

## Configuration

```yaml
# forge.yaml
compression:
  enabled: true                    # default: false
  keep_patterns:                   # domain vocabulary that must never be dropped
    - CrashLoopBackOff
    - PAYMENT_DECLINED
  # store_path: .forge/ctxzip.db   # offloaded-originals store (bbolt)
  # ttl: 30m                       # how long originals stay retrievable
  # min_tool_output_chars: 2048    # hook floor; smaller outputs untouched
  # cache_hints: true              # provider prompt-cache hints (defaults to enabled)
```

Precedence (most specific wins):

```
forge run --compression[=false] > FORGE_COMPRESSION=true|false > compression.enabled > off
```
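
As a worked illustration of that precedence chain, here is a minimal sketch. `resolveCompression` is a hypothetical helper, not Forge's code, and the use of `strconv.ParseBool` (with unparseable values falling through to forge.yaml) is an assumption about how the environment value might be read.

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveCompression applies: flag > FORGE_COMPRESSION > compression.enabled > off.
// flag is nil when --compression was not passed; env is the raw FORGE_COMPRESSION
// value ("" when unset); yamlEnabled is compression.enabled (false when absent).
func resolveCompression(flag *bool, env string, yamlEnabled bool) bool {
	if flag != nil {
		return *flag
	}
	// Assumption: an empty or unparseable value is ignored rather than rejected.
	if v, err := strconv.ParseBool(env); err == nil {
		return v
	}
	return yamlEnabled
}

func main() {
	off := false
	fmt.Println(resolveCompression(&off, "true", true)) // flag wins
	fmt.Println(resolveCompression(nil, "true", false)) // env wins over forge.yaml
	fmt.Println(resolveCompression(nil, "", true))      // forge.yaml decides
	fmt.Println(resolveCompression(nil, "", false))     // nothing set: off
}
```

The pointer for the flag models the three states the docs describe: passed as true, passed as `=false`, or absent.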

| Surface | Usage |
|---------|-------|
| `forge run --compression` | Enable for one run; `--compression=false` force-disables even when forge.yaml enables it |
| `forge serve --compression[=false]` | Forwarded to the daemon |
| `forge init --compression` | Scaffold a new agent with the block enabled |
| init TUI wizard | "Context Compression" step (between Skills and Auth) |

## Provider prompt-cache hints

Compressing the wrong bytes can *cost* tokens by busting the provider's prompt cache, so compression never touches the system prompt, tool definitions, or recent turns, and its output is deterministic across turns. On top of that, `cache_hints` (on by default when compression is enabled) injects each provider's native cache primitives:

| Provider | Hint |
|----------|------|
| anthropic | `cache_control: {type: ephemeral}` breakpoints on the last tool definition and the system block — caches the stable tools+system prefix across turns. Also applies on the `aws_sigv4` Bedrock-passthrough path. |
| openai / gemini | A stable `prompt_cache_key` derived from (model, system prompt, tool names) — pins cache routing; prefix caching itself is automatic. |

When `cache_hints` is off, provider wire formats are byte-identical to a build without compression.
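
To make "a stable key derived from (model, system prompt, tool names)" concrete, the sketch below shows one way such a key could be computed. The actual derivation is not shown in this document; the hypothetical `cacheKey` function, the SHA-256 hash, the sorting of tool names, and the 16-character truncation are all assumptions for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// cacheKey derives a deterministic key from (model, system prompt, tool names).
// Sorting a copy of the names makes the key independent of registration order.
func cacheKey(model, systemPrompt string, toolNames []string) string {
	names := append([]string(nil), toolNames...)
	sort.Strings(names)
	h := sha256.New()
	for _, part := range []string{model, systemPrompt, strings.Join(names, ",")} {
		h.Write([]byte(part))
		h.Write([]byte{0}) // separator so adjacent fields cannot run together
	}
	return hex.EncodeToString(h.Sum(nil))[:16]
}

func main() {
	a := cacheKey("some-model", "You are an agent.", []string{"cli_execute", "context_expand"})
	b := cacheKey("some-model", "You are an agent.", []string{"context_expand", "cli_execute"})
	fmt.Println(a == b) // same inputs in a different order give the same key
}
```

The property that matters is that the key only changes when the cacheable prefix changes, so every turn of a session routes to the same cache.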

## Observability

Savings are first-class audit events, not log noise — see [Audit Logging](../security/audit-logging.md) for the event schema:

- `context_compressed` — per compression: seam, tool, tokens before/after/saved, plus running totals.
- `context_expanded` — per retrieval: hash, hit, bytes — the cost side to net against savings.
- `invocation_complete` gains `compression_saved_tokens_total`, `compression_count`, and `expansion_count`, accumulated per invocation (concurrent tasks never cross-contaminate).

Token figures are tokenizer estimates (directionally accurate); billed truth remains `llm_call.input_tokens`. A surgical session that produced only small outputs correctly reports `compression_count: 0` — compression is insurance against bulk, not a tax on every call.

## Failure posture

Fail-open, always: if the store cannot be opened, a compressor errors, or "compression" would grow a message, the original content is used unchanged. Error tool results are never compressed. An expired retrieval is not a dead end — the model is told to re-run the tool that produced the output.
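
The fail-open rule can be sketched as a small guard around any compressor. `compressOrOriginal` is a hypothetical name and the compressor signature is an assumption; the sketch only models the "errors" and "would grow" cases, not the store-open failure or the error-result exemption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// compressOrOriginal returns the compressed form only when compression
// succeeded and did not grow the content; otherwise the original is used.
func compressOrOriginal(original string, compress func(string) (string, error)) string {
	out, err := compress(original)
	if err != nil || len(out) > len(original) {
		return original // fail-open: never block or degrade the request
	}
	return out
}

func main() {
	big := strings.Repeat("pod Running\n", 200)
	ok := func(s string) (string, error) { return "<<ctxzip:HASH note>>", nil }
	failing := func(s string) (string, error) { return "", errors.New("store unavailable") }

	fmt.Println(compressOrOriginal(big, ok) != big)       // compressed form used
	fmt.Println(compressOrOriginal(big, failing) == big)  // error: original used
	fmt.Println(compressOrOriginal("tiny", ok) == "tiny") // would grow: original used
}
```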

**Single-writer store.** The bbolt store at `store_path` holds an exclusive file lock, so only one process can open a given store file at a time. A second process pointing at the same file (two replicas on a shared volume, or `forge run` alongside `forge serve` in the same directory) fails to acquire the lock after a 5-second timeout, and that process runs uncompressed (fail-open, with a startup warning). Give each replica its own `store_path`: offloaded originals are only ever retrieved by the process that offloaded them, so the store has no reason to be shared.

## Related

- [Runtime Engine](runtime-engine.md) — where the hook and client wrapper sit in the agent loop
- [Tools & Builtins](tools-and-builtins.md) — the `context_expand` tool
- [forge.yaml Schema](../reference/forge-yaml-schema.md) — the `compression` block
- [CLI Reference](../reference/cli-reference.md) — flags and wizard step
4 changes: 4 additions & 0 deletions docs/core-concepts/runtime-engine.md
@@ -277,6 +277,10 @@ The `FilesDir` is set via `LLMExecutorConfig.FilesDir` and made available to too

For details on session persistence, context window management, compaction, and long-term memory, see [Memory](memory-system.md).

## Context Compression

When `compression.enabled` is set, the runner wires reversible context compression (ctxzip) into the loop at three points: an `AfterToolExec` hook compresses large tool outputs once, before they enter memory (registered after the guardrail hooks, so it compresses redacted output); the LLM client is wrapped in a compressing decorator below the fallback chain (so retries and compactor summarization calls are covered too); and the `context_expand` retrieval tool is registered so the model can recover offloaded content by marker hash. A constant system-prompt directive teaches the model what `<<ctxzip:...>>` markers are — individual skills need no awareness. Compression output is deterministic across turns and never touches the system prompt or recent messages, keeping provider prompt caches warm; `compression.cache_hints` additionally injects the provider's native cache primitives (anthropic `cache_control` breakpoints, openai `prompt_cache_key`). See [Context Compression](context-compression.md).
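
The "compressing decorator" idea can be sketched as follows. `Message`, `Client`, `compressingClient`, and `keepRecent` are hypothetical stand-ins, not Forge's real `llm` types, and the stub compressor simply swaps long content for a placeholder marker; the point is only that system messages and recent turns pass through byte-identical while older content is compressed deterministically.

```go
package main

import (
	"fmt"
	"strings"
)

// Message and Client are hypothetical stand-ins for the real llm types.
type Message struct {
	Role    string
	Content string
}

type Client interface {
	Chat(msgs []Message) string
}

// compressingClient decorates a Client. System messages and the last
// keepRecent messages are forwarded unchanged; everything else is compressed.
type compressingClient struct {
	next       Client
	keepRecent int
	compress   func(string) string // must be deterministic so history stays stable
}

func (c compressingClient) Chat(msgs []Message) string {
	out := make([]Message, len(msgs))
	copy(out, msgs) // never mutate the caller's slice
	for i := range out {
		recent := i >= len(out)-c.keepRecent
		if out[i].Role == "system" || recent {
			continue
		}
		out[i].Content = c.compress(out[i].Content)
	}
	return c.next.Chat(out)
}

// echoClient stands in for a provider client and shows what it received.
type echoClient struct{}

func (echoClient) Chat(msgs []Message) string {
	parts := make([]string, len(msgs))
	for i, m := range msgs {
		parts[i] = m.Content
	}
	return strings.Join(parts, " | ")
}

func main() {
	stub := func(s string) string {
		if len(s) < 32 {
			return s
		}
		return "<<ctxzip:HASH note>>" // placeholder for real compression
	}
	c := compressingClient{next: echoClient{}, keepRecent: 1, compress: stub}
	fmt.Println(c.Chat([]Message{
		{Role: "system", Content: "You are a helpful agent."},
		{Role: "tool", Content: strings.Repeat("pod Running; ", 50)},
		{Role: "user", Content: "Which pod is failing?"},
	}))
}
```

Because the wrapper satisfies the same interface as the client it wraps, callers such as retries or summarization calls would be covered without knowing compression exists.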

## Hooks

The engine fires hooks at key points in the loop. See [Hooks](hooks.md) for details.
5 changes: 5 additions & 0 deletions docs/core-concepts/tools-and-builtins.md
@@ -30,6 +30,7 @@ Tools are capabilities that an LLM agent can invoke during execution. Forge prov
| `read_skill` | Load full instructions for an available skill on demand |
| `memory_search` | Search long-term memory (when enabled) |
| `memory_get` | Read memory files (when enabled) |
| `context_expand` | Retrieve the original content behind a `<<ctxzip:...>>` compression marker (when [compression](context-compression.md) is enabled) |
| `cli_execute` | Execute pre-approved CLI binaries |
| `schedule_set` | Create or update a recurring cron schedule |
| `schedule_list` | List all active and inactive schedules |
@@ -180,6 +181,10 @@ When [long-term memory](memory-system.md) is enabled, two additional tools are r

These tools allow the agent to recall information from previous sessions.

## Context Expansion Tool

When [context compression](context-compression.md) is enabled, the `context_expand` tool is registered. Compressed tool outputs carry inline `<<ctxzip:HASH note>>` markers; the model calls `context_expand` with the hash to retrieve the offloaded original from the local store. The tool tolerates imperfect input — a whole marker pasted as the hash, or a truncated hash that uniquely prefixes a recently emitted one — and a miss (expired/evicted entry) returns guidance to re-run the producing tool rather than an error.
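
The tolerant input handling described above might look roughly like this. It is a sketch under assumptions: `resolveHash` and the `recent` list are hypothetical, and the hex-only hash shape is inferred from the example marker `<<ctxzip:ac998fea694b 149_lines_offloaded>>` rather than from a specification.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// markerRe matches a whole marker and captures its hash.
// Assumption: hashes are lowercase hex, as in the documented example.
var markerRe = regexp.MustCompile(`<<ctxzip:([0-9a-f]+)[^>]*>>`)

// resolveHash maps model input to a full hash from recent, or reports a miss.
func resolveHash(input string, recent []string) (string, bool) {
	input = strings.TrimSpace(input)
	if m := markerRe.FindStringSubmatch(input); m != nil {
		input = m[1] // a whole marker was pasted; keep only the hash
	}
	if input == "" {
		return "", false
	}
	for _, h := range recent {
		if h == input {
			return h, true // exact hit
		}
	}
	match := ""
	for _, h := range recent {
		if strings.HasPrefix(h, input) {
			if match != "" {
				return "", false // prefix is ambiguous: treat as a miss
			}
			match = h
		}
	}
	return match, match != ""
}

func main() {
	recent := []string{"ac998fea694b", "7f00deadbeef"}
	fmt.Println(resolveHash("<<ctxzip:ac998fea694b 149_lines_offloaded>>", recent))
	fmt.Println(resolveHash("7f00", recent))
	fmt.Println(resolveHash("ffff", recent))
}
```

On a miss, a caller following the documented behavior would return guidance to re-run the producing tool instead of an error.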

## Development Tools

Development tools (`local_shell`, `local_file_browser`, `debug_console`, `test_runner`) are available during `forge run --dev` but are **automatically filtered out** in production builds by the `ToolFilterStage`.
5 changes: 4 additions & 1 deletion docs/reference/cli-reference.md
@@ -18,7 +18,7 @@ Complete reference for all Forge CLI commands.

## `forge init`

Initialize a new agent project.
Initialize a new agent project. Without `--non-interactive`, a TUI wizard walks through: name → model provider → fallbacks → channel → tools → skills → context compression → authentication → egress review → summary.

```
forge init [name] [flags]
@@ -39,6 +39,7 @@ forge init [name] [flags]
| `--org-id` | | | OpenAI Organization ID (enterprise) |
| `--from-skills` | | | Path to a SKILL.md file for auto-configuration |
| `--non-interactive` | | `false` | Skip interactive prompts |
| `--compression` | | `false` | Enable reversible context compression — writes `compression.enabled: true` to the scaffolded forge.yaml. See [Context Compression](../core-concepts/context-compression.md) |
| `--auth` | | | Auth mode: `none`, `oidc`, `http_verifier`, `aws_sigv4`, `gcp_iap`, `azure_ad`, `custom` |
| `--auth-issuer` | | | OIDC issuer URL (required with `--auth=oidc`) |
| `--auth-audience` | | | OIDC audience (required with `--auth=oidc`) |
@@ -217,6 +218,7 @@ forge run [flags]
| `--enforce-guardrails` | `false` | Enforce guardrail violations as errors |
| `--model` | | Override model name (sets `MODEL_NAME` env var) |
| `--provider` | | LLM provider: `openai`, `anthropic`, or `ollama` |
| `--compression` | | Enable reversible context compression; `--compression=false` forces it off. When passed, the flag sets `FORGE_COMPRESSION`; when absent, forge.yaml and the environment decide. See [Context Compression](../core-concepts/context-compression.md) |
| `--env` | `.env` | Path to .env file |
| `--with` | | Comma-separated channel adapters (e.g., `slack,telegram`) |
| `--auth-url` | | External auth provider URL for token validation |
@@ -291,6 +293,7 @@ forge serve [start|stop|status|logs] [flags]
| `--host` | `127.0.0.1` | Bind address (secure default) |
| `--with` | | Channel adapters |
| `--cors-origins` | localhost | Comma-separated CORS allowed origins |
| `--compression` | | Enable reversible context compression; `--compression=false` forces it off. Forwarded to the daemon `forge run` only when explicitly passed |

### Examples

1 change: 1 addition & 0 deletions docs/reference/environment-variables.md
@@ -12,6 +12,7 @@ order: 3
| `FORGE_MODEL_FALLBACKS` | Fallback chain (e.g., `"anthropic:claude-sonnet-4,gemini"`) |
| `FORGE_MEMORY_PERSISTENCE` | Set `false` to disable session persistence |
| `FORGE_MEMORY_LONG_TERM` | Set `true` to enable long-term memory |
| `FORGE_COMPRESSION` | Set `true`/`false` to override `compression.enabled` (reversible context compression); the `--compression` flag overrides both |
| `FORGE_EMBEDDING_PROVIDER` | Override embedding provider |
| `OPENAI_API_KEY` | OpenAI API key |
| `OPENAI_ORG_ID` | OpenAI Organization ID (enterprise); overrides `organization_id` in YAML |