feat(compression): full overhaul — pipeline, metrics, idempotency, robustness#82
feat(compression): full overhaul — pipeline, metrics, idempotency, robustness#82manthis wants to merge 15 commits into
Conversation
Refuse to compress payloads larger than 2 MB. Prevents pathological or malicious tool outputs from turning a single request into a DoS vector on the gateway. Applied at the central segment-protection helper, which all three agent pipelines (Claude/Codex/OpenCode) route through.
Prepend an invisible marker `<!--ec1-->` to every compressed result and short-circuit the compressor when the input already carries it. Without this guard, every conversation turn re-runs every previously compressed tool message — and any change in compressor behaviour (bug fix, threshold tweak, new strategy) silently invalidates the upstream prompt cache, turning what should be a cache hit into a full re-encode. The marker prefix `<!--ec` is recognized regardless of version number, so rolling forward to `ec2` will still treat older outputs as already compressed and leave them alone.
Introduce CompressionMetrics, a thread-safe collector of per-tool counters (invocations, skips, bytes_in, bytes_out). Embedded in CompressionConfig and shared across cloned layer/service handles via Arc, so the gateway can scrape coherent stats from anywhere. Without measurement we tune compression strategies blind. The snapshot API returns sorted owned data, ready to be exposed through a metrics endpoint or a CLI report.
…ive mode - Recognize TOML, JSON, YAML, Markdown extensions. JSON/Markdown skip comment-stripping entirely so values that look like comments survive. - Detect well-known lockfiles by basename (Cargo.lock, package-lock.json, yarn.lock, pnpm-lock.yaml, go.sum, etc.) and replace the body with a short head + tail + elision stub. The LLM almost never needs the full contents of a generated lockfile. - For brace-language files above 500 lines, collapse function/class bodies of 8+ lines into a `// ... (N lines collapsed)` placeholder. Comment-stripping alone leaves a 5000-line file at maybe 4500 lines — this turns big files into a usable skeleton.
…peline Add CompressionTechnique trait + CompressionPipeline that chains techniques in order. Existing tool-output compression becomes ToolResultsTechnique — the only technique shipped today, but the trait is the seam for upcoming work (image down-sample, system-prompt deduplication, conversation summarization, …). CompressionLayer now builds a default `[tool-results]` pipeline so existing call sites stay drop-in compatible. CompressionLayer::with_pipeline is the new escape hatch for callers that want to assemble a custom chain.
…l silent chains Bash command dispatch now unwraps: - Leading env-var assignments (FOO=bar cargo build) - Wrapper keywords: sudo, time, env, nohup, exec - Silent leading sub-commands followed by &&/;/|| (cd path && cargo build, export X=1 && cargo test, ...) After unwrapping, dispatch is by basename so absolute paths like /usr/local/bin/cargo route to the cargo compressor identically to bare "cargo". Net effect: real-world tool calls like `cd src && cargo build` and `sudo /usr/bin/find . -name '*.rs'` now compress instead of falling through to "no compression".
Codex tool outputs carry an "Exit code: N" line in the header. The header is stripped before compression so that line was being thrown away — the agent saw a successfully compressed body and lost the failure signal entirely. We now parse the exit code from the header (supports both "Exit code: N" and "Process exited with code N" formats), and re-inject it as a `[exit N]` prefix on non-zero exits. The prefix is placed AFTER the version marker so the idempotency check stays intact across passes.
`<tool_use_error>` and `<persisted-output>` were detected with bare `String::contains`, which produced false positives the moment a tool output mentioned the tag in body content (a Read of this very source file would have triggered). Anchor the check to line start so only genuine tag blocks short-circuit compression. Documents and code that *talk about* the tags now compress normally.
The pure 10 % savings ratio was rejecting useful compression on big files: 8 % of 100 KB is 8 KB of avoidable tokens that we were throwing away. Switch the gate to (savings ≥ 10 % OR savings ≥ 200 bytes) — either threshold alone is enough to keep the result. Applied to both the Claude and OpenCode Read compressors so they stay in sync.
split_into_segments was running a regex find_iter on every output. 99% of tool outputs do not contain a `<system-reminder>` block, so a literal substring check is enough to short-circuit and skip the regex entirely. Same correctness, less work on the hot path.
Coding agents send the same long system prompt (CLAUDE.md, agent rules,
tool descriptions) on every request. Anthropic's prompt cache can avoid
re-encoding it — but only when the block carries a cache_control hint.
Many clients don't set one.
This new technique scans system messages and injects
`cache_control: {"type": "ephemeral"}` on large, un-hinted blocks.
Caps total injections to leave headroom under Anthropic's 4-breakpoint
limit, and counts pre-existing hints (ours or upstream's) so repeated
applies stay idempotent.
Also demonstrates the pipeline architecture from the previous commit:
just plug it into CompressionLayer::with_pipeline alongside the existing
ToolResultsTechnique.
Session logs already record `tool_compression_stats` per tool; the
stats command was throwing it away. Add a `--per-tool` flag that sums
{count, before, after} across every stored session and renders the
biggest absolute savings first, with the same compression bar widget
as the per-session row.
Useful both for tuning thresholds and for spotting tools where the
compressor is doing nothing.
Six micro-benches covering compress_tool_output for Bash/Read/Grep/Glob, the segment-protection helper on its fast path, and the Codex shell pipeline. Each one exercises a realistic-sized payload (40 hunks, 800 lines, 2k matches, etc.) so a regression in the strategy or in the marker/threshold pipeline is visible without the noise of micro-noise. Run with: cargo bench -p edgee-compressor
…ixture Builds a Claude-shaped CompletionRequest mirroring a real coding-agent mid-session payload (3.5 KB system prompt + 4-turn dialogue with Read, Glob, Bash cargo, Grep tool calls totalling ~75 KB of tool output) and runs it through the full pipeline (SystemPromptCacheTechnique + ToolResultsTechnique). Asserts: - end-to-end byte savings ≥ 40 % (measured: 59 %) - every compressed tool message starts with the version marker - system prompt receives cache_control: ephemeral - per-tool metrics are populated and self-consistent - second pass through the pipeline is a no-op (idempotency, prompt cache stays stable across turns) - Codex variant: non-zero exit codes survive as `[exit N]` after the marker, and the codex pipeline is also idempotent Per-tool numbers from this fixture: Bash cargo: 3892 → 220 (94 %) Glob 250: 5139 → 619 (88 %) Grep 240: 13380 → 1859 (86 %) Read 800ln: 52983 → 27954 (47 %)
KokaKiwi
left a comment
There was a problem hiding this comment.
Thanks for the PR, lots of nice stuff in here. The per-tool metrics, the memory guard, and SystemPromptCacheTechnique all fill real gaps, and the jump from 386 to 435 tests with zero clippy warnings is appreciated.
Flagging the review with "Request changes" because of the brace-collapse compression on crates/compressor/src/strategy/claude/read.rs that could silently drop code when braces appear inside string literals.
We previously had some "aggressive" Read compression as well but it seemed to incapacitate the coding agent a lot when doing that so we had to tone down a bit this tool compression specifically.
Rest is smaller reviews plus one design question I would love your take on.
| /// Brace counting is naive — it does not strip braces inside strings or | ||
| /// comments. Mis-counts only ever cause "kept too much" (the body fails to | ||
| /// collapse), never silent data loss. | ||
| pub(crate) fn aggressive_collapse_braces( |
There was a problem hiding this comment.
The brace counter on lines 547-560 walks every { / } character without filtering string literals, raw strings, or block comments. If a body contains a literal brace, depth reaches zero inside the literal, closed is set, and everything between that false close and the real } is silently dropped.
Repro shape:
fn build_query() {
let sql = r#"SELECT * FROM t WHERE x = '}'"#;
actual_code_here();
}The doc-comment above the function says mis-counts only cause "kept too much, never silent data loss". That contract is violated here.
Tightening the boundary check is enough to make this sound. We accept fewer collapses (when } shares a line) but never drop code:
let closing = lines[j].1.trim();
if closed && j > i && matches!(closing, "}" | "};" | "},") {
// safe to collapse
}While we're here, the docstring at lines 530-531 needs to be updated too.
Review comment authored with Claude Code (Opus 4.7).
| let mut map = self | ||
| .inner | ||
| .lock() | ||
| .expect("compression metrics mutex poisoned"); |
There was a problem hiding this comment.
All four public methods of CompressionMetrics (lines 54, 68, 80, 92) lock the mutex with .expect("compression metrics mutex poisoned"). Once any thread panics while holding the lock, every subsequent request on the hot path panics as well, a single observability-path issue takes the whole gateway down.
For accumulating counters, a partial / stale read is always preferable to a crash. The standard recovery is enough here:
self.inner.lock().unwrap_or_else(|e| e.into_inner())Apply to all four call sites.
Review comment authored with Claude Code (Opus 4.7).
| //! prompt cache can avoid re-encoding it — but only if the request marks the | ||
| //! block with a `cache_control` hint. Many clients don't. | ||
| //! | ||
| //! This technique scans every system / developer message and, for those large |
There was a problem hiding this comment.
The doc says "system / developer message" but the implementation only matches Message::System (line 93), and DeveloperMessage doesn't have a cache_control field anyway. Either drop "developer" from the doc, or add the field in gateway-core and a Message::Developer arm in the loop and the counter.
There was a problem hiding this comment.
The CompressionTechnique trait + CompressionPipeline design works and makes sense for the near-term roadmap (MCP cleaning, caveman summarization, etc.). The builder API (CompressionLayer::with_pipeline) is a nice escape hatch too.
One thing I had in mind when thinking about this was using Tower layers directly, each technique as its own tower::Layer in the ServiceBuilder chain, which would keep the composition model consistent with the rest of the gateway stack. That said, I think there's a middle ground: we could keep the internal CompressionPipeline as-is and bridge it to Tower properly in a follow-up, once we have a clearer picture of what the next techniques look like. Not asking for changes here, just flagging it as something worth revisiting post-merge.
There was a problem hiding this comment.
I want to double-check the assumption behind is_already_compressed and the COMPRESSION_MARKER prefix.
The stated goal is cache-safety: if the same compressed message re-enters the pipeline unchanged, the upstream Anthropic prompt cache sees a stable byte sequence. That logic holds if the gateway ever receives its own compressed output back, but in our proxy model, it doesn't. The gateway compresses tool results before forwarding to Anthropic, but the compressed form is never echoed back to the client. On every subsequent turn the client resends its own copy of the conversation history with the original, uncompressed tool outputs, so the gateway always starts fresh.
The scenario the marker is protecting against (a previously compressed message re-entering the pipeline) can't happen structurally, so the guard feels un-needed. And even if we wanted that protection at the compressor crate level, it seems more natural for the crate's caller to be responsible for not feeding already-compressed content back in, rather than baking the marker into the output format.
Happy to be convinced otherwise if there's a flow I'm missing.
Summary
Reworks the compression layer end-to-end. Two main goals:
would re-process every tool message on every turn — any change in
compression behaviour (bug fix, threshold tweak) would invalidate the
Anthropic / OpenAI prompt cache for the whole conversation.
always promised "multiple composable techniques", but only one was wired
in and the dispatch was hard-coded. This PR introduces the trait + chain
and demonstrates it with a second technique.
Along the way, fixes a handful of correctness gaps (
<tool_use_error>falsepositives, lost Codex exit codes), tightens dispatch on real-world bash
invocations (
sudo cargo build,cd src && cargo test,/usr/local/bin/cargo),adds per-tool metrics, image-resistant memory bounds, and a benchmark suite.
14 commits, +2 100 / −80 lines, 49 new tests (386 → 435), zero clippy
warnings under
-D warnings.Real-world results
End-to-end integration test (
tests/integration.rs) on a fixture mirroring amid-session Claude Code request — 3.5 KB system prompt + 4-turn dialogue with
Read / Glob / Bash / Grep tool calls totalling ~75 KB of tool output:
Per-tool breakdown:
cargo build)Idempotency verified: a second pass through the pipeline is a byte-for-byte
no-op, so re-encoding never wakes the prompt cache.
What changed
Core correctness & cache safety
<!--ec1-->. Re-entry detects the marker and short-circuits, so a stablecompressed message stays byte-identical across turns and the upstream
prompt cache survives.
MAX_COMPRESSIBLE_BYTES = 2 MiB. Pathological tooloutputs no longer turn a single request into a DoS vector.
<tool_use_error>/<persisted-output>are now anchored to line start. Documents and codethat mention the tags compress normally instead of being silently
rejected.
shell_commandarere-injected as
[exit N]after the marker, so the agent still seesfailure signal even after the header is stripped.
Architecture
CompressionTechniquetrait +CompressionPipeline— composable chainapplied in order to every
CompletionRequest. Existing tool-resultcompression becomes
ToolResultsTechnique.CompressionLayer::newkeepsthe previous behaviour drop-in;
CompressionLayer::with_pipelineis thenew escape hatch for custom chains.
SystemPromptCacheTechnique(new) — auto-injectscache_control: {"type": "ephemeral"}on large, un-hinted system messagesso Anthropic can serve them from prompt cache. Caps total injections to
stay under the 4-breakpoint limit, counts pre-existing hints to remain
idempotent across passes.
Bash dispatch
Real-world tool calls now route correctly:
FOO=bar cargo build).sudo,time,env,nohup,exec.&&/;/||(
cd src && cargo build,export X=1 && cargo test)./usr/local/bin/cargoand barecargorouteidentically.
Read enhancements
comment-stripping entirely so values that look like comments survive.
Cargo.lock,package-lock.json,yarn.lock,pnpm-lock.yaml,go.sum, …) — replaced with a head + tail + elisionstub. Generated lockfiles eat token budget for almost no informational
value.
class bodies of 8+ lines into a
// ... (N lines collapsed)placeholder.A 5 000-line source file becomes a usable skeleton instead of a slightly
smaller wall of text.
(whichever fires first). Stops rejecting modest absolute gains on large
files.
Observability
CompressionMetrics— per-tool counters (invocations,skipped,bytes_in,bytes_out) shared across cloned layer / service handlesvia
Arc. Snapshot + totals APIs return owned data, ready for exportvia a
/metricsHTTP handler.edgee stats --per-tool— aggregatestool_compression_statsfromevery stored session log, sorted by absolute savings descending. Useful
both for tuning and for spotting tools where the compressor does
nothing.
Performance & test infrastructure
split_into_segments— skip the regex walkentirely when the literal
<system-reminder>substring is absent(>99 % of inputs).
Glob / segment-protection / Codex pipeline.
cargo bench -p edgee-compressor.Estimated cost per improvement
Cost dimensions per change:
provider, multiplied by realistic conversation length.
Latency numbers below are taken from
cargo bench -p edgee-compressor -- --quickon an Apple Silicon dev machine, measured per tool message.MAX_COMPRESSIBLE_BYTES)len()compare)<!--ec1-->)starts_with+ oneformat!)Stringgrow)lines().anyinstead ofcontains)format!only on non-zero exits&strslices)Stringon hitCargo.lockshrinks from 5–50 KB to ~200 bytesVecreallocated to filtered size<system-reminder>CompressionTechniquepipelineVeciter + dyn dispatch)Arc<CompressionPipeline>per layer (shared)SystemPromptCacheTechniqueserde_json::Valueper injection (≤2 per request)CompressionMetricsHashMap::entry)String(tool name) on first encounter, then nothingedgee stats --per-toolConcrete bench results on the fixture in
tests/integration.rs:Net per-request overhead is well under a millisecond for everything except
the Grep
contentstrategy, which was already this slow before this PR(2 000 matches × O(matches) regrouping). Filing a follow-up to revisit it.
Test plan
cargo fmt --allcargo clippy --all-targets -- -D warnings— zero warningscargo test --all— 435 passed, 0 failedcargo bench -p edgee-compressor --no-run— benches compilecargo bench -p edgee-compressor -- --quick— benches execute, numbers abovecargo test -p edgee-compression-layer --test integration -- --nocapture— 59 % byte savings on a realistic fixture, idempotency confirmed
survives round-trips and the Anthropic cache_creation / cache_read
counters reflect the SystemPromptCacheTechnique injections
edgee stats --per-toolagainst an existing session logdirectory
Backwards compatibility
CompressionLayer::new(config)still exists and produces a defaultpipeline of
[tool-results]— no behavioural change for current callers.CompressionConfig::new(agent)returns the sameArc<CompressionConfig>as before; the new
metricsfield defaults to a freshArc<CompressionMetrics>.(
compress_tool_output,compress_codex_tool_output,claude_compressor_for,…).
Future work (deliberately out of scope)
imagecrate dependency, invasive content-typehandling).
argumentsdedup via content-hash references (request-levelstate).
cargo-fuzz/proptest.contentmode performance — 234 ms on 2 000 matches is theone outlier in the bench suite (pre-existing).
From HelloMax to Edgee with 🫶