
feat(compression): full overhaul — pipeline, metrics, idempotency, robustness #82

Open
manthis wants to merge 15 commits into edgee-ai:main from manthis:feat/compression

Conversation

@manthis (Contributor) commented May 1, 2026

Summary

Reworks the compression layer end-to-end. Two main goals:

  1. Stop silently breaking the upstream prompt cache. The previous compressor
    would re-process every tool message on every turn — any change in
    compression behaviour (bug fix, threshold tweak) would invalidate the
    Anthropic / OpenAI prompt cache for the whole conversation.
  2. Open the architecture for new techniques. The crate description has
    always promised "multiple composable techniques", but only one was wired
    in and the dispatch was hard-coded. This PR introduces the trait + chain
    and demonstrates it with a second technique.

Along the way, fixes a handful of correctness gaps (<tool_use_error> false
positives, lost Codex exit codes), tightens dispatch on real-world bash
invocations (sudo cargo build, cd src && cargo test, /usr/local/bin/cargo),
adds per-tool metrics, image-resistant memory bounds, and a benchmark suite.

15 commits, +2 100 / −80 lines, 49 new tests (386 → 435), zero clippy
warnings under -D warnings.

Real-world results

End-to-end integration test (tests/integration.rs) on a fixture mirroring a
mid-session Claude Code request — 3.5 KB system prompt + 4-turn dialogue with
Read / Glob / Bash / Grep tool calls totalling ~75 KB of tool output:

system prompt: 3 602 bytes  → cache_control: ephemeral injected
tool inputs:   75 394 bytes
tool outputs:  30 652 bytes
saved:         44 742 bytes (59 %)

Per-tool breakdown:

Tool                   Input    Output   Saved
Bash (cargo build)     3 892       220    94 %
Glob (250 paths)       5 139       619    88 %
Grep (240 matches)    13 380     1 859    86 %
Read (800-line Rust)  52 983    27 954    47 %

Idempotency verified: a second pass through the pipeline is a byte-for-byte
no-op, so re-encoding never wakes the prompt cache.

What changed

Core correctness & cache safety

  • Versioned output marker — every compressed result is prefixed with
    <!--ec1-->. Re-entry detects the marker and short-circuits, so a stable
    compressed message stays byte-identical across turns and the upstream
    prompt cache survives (sketch after this list).
  • Memory guard — MAX_COMPRESSIBLE_BYTES = 2 MiB. Pathological tool
    outputs no longer turn a single request into a DoS vector.
  • Stricter protected-tag detection — <tool_use_error> /
    <persisted-output> are now anchored to line start. Documents and code
    that mention the tags compress normally instead of being silently
    rejected.
  • Codex exit code preserved — non-zero exits from shell_command are
    re-injected as [exit N] after the marker, so the agent still sees
    failure signal even after the header is stripped.
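
A minimal sketch of the marker check, with hypothetical helper names (the
crate's real ones may differ). The probe matches the <!--ec prefix rather
than the full marker, so rolling forward to ec2 still leaves ec1 output
untouched:

const COMPRESSION_MARKER: &str = "<!--ec1-->";

/// Version-agnostic probe: any "<!--ec" prefix counts as already compressed.
fn is_already_compressed(text: &str) -> bool {
    text.starts_with("<!--ec")
}

fn compress(text: &str) -> String {
    if is_already_compressed(text) {
        return text.to_owned(); // byte-for-byte no-op: the prompt cache stays warm
    }
    let body = run_strategies(text); // stand-in for the real per-tool dispatch
    format!("{COMPRESSION_MARKER}{body}")
}

fn run_strategies(text: &str) -> String {
    text.trim_end().to_owned() // placeholder strategy
}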

Architecture

  • CompressionTechnique trait + CompressionPipeline — composable chain
    applied in order to every CompletionRequest (sketch after this list).
    Existing tool-result compression becomes ToolResultsTechnique.
    CompressionLayer::new keeps the previous behaviour drop-in;
    CompressionLayer::with_pipeline is the new escape hatch for custom chains.
  • SystemPromptCacheTechnique (new) — auto-injects
    cache_control: {"type": "ephemeral"} on large, un-hinted system messages
    so Anthropic can serve them from prompt cache. Caps total injections to
    stay under the 4-breakpoint limit, counts pre-existing hints to remain
    idempotent across passes.
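
A hedged sketch of the seam: trait and type names follow the text above,
but the real signatures (config, metrics, error handling) likely differ.

use std::sync::Arc;

pub struct CompletionRequest { /* gateway request fields elided */ }

/// One composable pass over a request. Techniques must be idempotent so
/// re-running the chain on already-processed input is a no-op.
pub trait CompressionTechnique: Send + Sync {
    fn name(&self) -> &'static str;
    fn apply(&self, request: &mut CompletionRequest);
}

pub struct CompressionPipeline {
    techniques: Vec<Arc<dyn CompressionTechnique>>,
}

impl CompressionPipeline {
    pub fn apply(&self, request: &mut CompletionRequest) {
        for technique in &self.techniques {
            technique.apply(request); // in order; each sees the previous output
        }
    }
}

Under this shape, CompressionLayer::new builds the one-element
[tool-results] chain and with_pipeline accepts any custom Vec of techniques.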

Bash dispatch

Real-world tool calls now route correctly (sketch after this list):

  • Strip leading env-var assignments (FOO=bar cargo build).
  • Strip wrapper keywords: sudo, time, env, nohup, exec.
  • Peel silent leading sub-commands followed by && / ; / ||
    (cd src && cargo build, export X=1 && cargo test).
  • Dispatch by basename so /usr/local/bin/cargo and bare cargo route
    identically.
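
A sketch of the unwrapping order, with hypothetical helper names:

const WRAPPERS: &[&str] = &["sudo", "time", "env", "nohup", "exec"];
const SILENT: &[&str] = &["cd", "export", "set"];

/// Splits at the earliest of the given separators, if any.
fn split_once_any<'a>(s: &'a str, seps: &[&str]) -> Option<(&'a str, &'a str)> {
    seps.iter()
        .filter_map(|sep| s.split_once(sep))
        .min_by_key(|(head, _)| head.len())
}

/// Basename of the command that does the real work, e.g.
/// "cd src && sudo /usr/local/bin/cargo build" -> Some("cargo").
fn dispatch_key(command: &str) -> Option<String> {
    let mut rest = command.trim();
    // Peel silent leading sub-commands chained with && / ; / ||.
    while let Some((head, tail)) = split_once_any(rest, &["&&", ";", "||"]) {
        match head.split_whitespace().next() {
            Some(first) if SILENT.contains(&first) => rest = tail.trim_start(),
            _ => break,
        }
    }
    let mut words = rest.split_whitespace().peekable();
    // Strip env-var assignments (FOO=bar) and wrapper keywords.
    while let Some(&word) = words.peek() {
        if word.contains('=') || WRAPPERS.contains(&word) {
            words.next();
        } else {
            break;
        }
    }
    // Dispatch by basename so absolute paths and bare names route the same.
    words.next().map(|w| w.rsplit('/').next().unwrap_or(w).to_owned())
}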

Read enhancements

  • New language coverage: TOML, JSON, YAML, Markdown. JSON / Markdown skip
    comment-stripping entirely so values that look like comments survive.
  • Lockfile detection (Cargo.lock, package-lock.json, yarn.lock,
    pnpm-lock.yaml, go.sum, …) — replaced with a head + tail + elision
    stub. Generated lockfiles eat token budget for almost no informational
    value.
  • Aggressive mode for brace-language files >500 lines: collapse function /
    class bodies of 8+ lines into a // ... (N lines collapsed) placeholder.
    A 5 000-line source file becomes a usable skeleton instead of a slightly
    smaller wall of text.
  • Threshold relaxed: accept either ≥10 % savings or ≥200 bytes saved
    (whichever fires first). Stops rejecting modest absolute gains on large
    files.
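
The relaxed gate, sketched on byte lengths:

fn keep_compressed(original_len: usize, compressed_len: usize) -> bool {
    let saved = original_len.saturating_sub(compressed_len);
    // 10 % relative OR 200 bytes absolute: an 8 % saving on a 100 KB file
    // (8 KB) misses the relative bar but clears the absolute one.
    saved * 10 >= original_len || saved >= 200
}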

Observability

  • CompressionMetrics — per-tool counters (invocations, skipped,
    bytes_in, bytes_out) shared across cloned layer / service handles
    via Arc (sketch after this list). Snapshot + totals APIs return owned
    data, ready for export via a /metrics HTTP handler.
  • edgee stats --per-tool — aggregates tool_compression_stats from
    every stored session log, sorted by absolute savings descending. Useful
    both for tuning and for spotting tools where the compressor does
    nothing.
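
A sketch of the collector's shape; field and method names are assumptions
from the description above (the .expect on the lock mirrors the code under
review later in this thread):

use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Default, Clone, Copy)]
pub struct ToolCounters {
    pub invocations: u64,
    pub skipped: u64,
    pub bytes_in: u64,
    pub bytes_out: u64,
}

#[derive(Default)]
pub struct CompressionMetrics {
    inner: Mutex<HashMap<String, ToolCounters>>,
}

impl CompressionMetrics {
    pub fn record(&self, tool: &str, bytes_in: u64, bytes_out: u64) {
        let mut map = self.inner.lock().expect("compression metrics mutex poisoned");
        let counters = map.entry(tool.to_owned()).or_default();
        counters.invocations += 1;
        counters.bytes_in += bytes_in;
        counters.bytes_out += bytes_out;
    }

    /// Owned snapshot, safe to serialize from a /metrics handler.
    pub fn snapshot(&self) -> Vec<(String, ToolCounters)> {
        let map = self.inner.lock().expect("compression metrics mutex poisoned");
        map.iter().map(|(name, c)| (name.clone(), *c)).collect()
    }
}

Shared as Arc<CompressionMetrics> inside CompressionConfig, so cloned layer
and service handles all feed the same map.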

Performance & test infrastructure

  • Regex early-out in split_into_segments — skip the regex walk
    entirely when the literal <system-reminder> substring is absent
    (>99 % of inputs); sketched after this list.
  • Criterion benchmarks — six micro-benches covering Bash / Read / Grep /
    Glob / segment-protection / Codex pipeline. cargo bench -p edgee-compressor.
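
The early-out, sketched (the Segment shape is assumed):

enum Segment<'a> {
    Compressible(&'a str),
    Protected(&'a str),
}

fn split_into_segments(output: &str) -> Vec<Segment<'_>> {
    // Literal probe first: >99 % of outputs carry no <system-reminder>
    // block, so the regex walk is skipped entirely on the fast path.
    if !output.contains("<system-reminder>") {
        return vec![Segment::Compressible(output)];
    }
    split_with_regex(output)
}

fn split_with_regex(_output: &str) -> Vec<Segment<'_>> {
    unimplemented!("pre-existing regex find_iter walk, unchanged by this PR")
}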

Estimated cost per improvement

Cost dimensions per change:

  • Latency — added time per request on the hot path.
  • Memory — bytes allocated per request beyond what we had before.
  • Tokens — bytes added to (or removed from) the wire payload sent to the
    provider, multiplied by realistic conversation length.
  • Code — net SLOC added and conceptual surface to maintain.

Latency numbers below are taken from cargo bench -p edgee-compressor -- --quick on an Apple Silicon dev machine, measured per tool message.

  • Memory guard (MAX_COMPRESSIBLE_BYTES) — latency: ~1 ns (len() compare);
    memory: 0; tokens: 0; code: +30 SLOC.
  • Versioned marker (<!--ec1-->) — latency: ~50 ns (one starts_with + one
    format!); memory: +10 bytes per compressed message (one String grow);
    tokens: +10 bytes ≈ 3 tokens per tool message, pays for itself in 1 cache
    hit on the next turn; code: +50 SLOC.
  • Stricter tag detection — latency: +5–50 µs on outputs that don't carry
    the tags (lines().any instead of contains); memory: 0; tokens: 0;
    code: +10 SLOC.
  • Codex exit-code preservation — latency: ~80 ns header scan, one format!
    only on non-zero exits; memory: +10 bytes when exit ≠ 0; tokens: +~5
    tokens on failed shell calls only; code: +30 SLOC.
  • Bash dispatch (prefixes / basename / silent chains) — latency: ~150 ns of
    byte scanning per Bash tool call; memory: 0 (all &str slices); tokens: 0;
    code: +160 SLOC.
  • Read: new langs + threshold relax — latency: ~0 (added match arms);
    memory: 0; tokens: negative — unlocks compression that was previously
    rejected; code: +15 SLOC.
  • Read: lockfile stubs — latency: O(1) basename match, format!() only on
    hit; memory: one small String on hit; tokens: large negative — a typical
    Cargo.lock shrinks from 5–50 KB to ~200 bytes; code: +60 SLOC.
  • Read: aggressive brace collapse — latency: O(n) brace scan, only when
    file > 500 lines AND brace-language, ~30 µs on a 1 000-line file; memory:
    one Vec reallocated to filtered size; tokens: large negative — a
    5 000-line file collapses to a few hundred lines of skeleton;
    code: +120 SLOC.
  • Regex early-out — latency: negative — skips a 1–10 µs regex walk on the
    >99 % of outputs without a <system-reminder>; memory: 0; tokens: 0;
    code: +5 SLOC.
  • CompressionTechnique pipeline — latency: ~5 ns per technique (Vec iter +
    dyn dispatch); memory: one Arc<CompressionPipeline> per layer (shared);
    tokens: 0; code: +130 SLOC.
  • SystemPromptCacheTechnique — latency: ~1 µs per request (one walk over
    messages); memory: one serde_json::Value per injection (≤2 per request);
    tokens: 0 added — enables a 90 %+ token-cost discount on cached system
    blocks via the Anthropic prompt cache; code: +220 SLOC.
  • Per-tool CompressionMetrics — latency: ~200–500 ns per tool message (one
    mutex acquire + one HashMap::entry); memory: one String (tool name) on
    first encounter, then nothing; tokens: 0; code: +180 SLOC.
  • edgee stats --per-tool — latency: out of hot path, only runs on CLI
    invocation; memory: negligible; tokens: 0; code: +100 SLOC.
  • Criterion benches — latency: out of hot path, dev-only; memory: 0;
    tokens: 0; code: +310 SLOC (dev-only).

Concrete bench results on the fixture in tests/integration.rs:

bash_git_diff_40_hunks            22 µs
read_rust_800_lines              114 µs
grep_content_2000_matches        234 ms   (pre-existing algorithm; flagged for follow-up)
glob_400_paths                    55 µs
segment_protection_no_reminder    28 µs   (the early-out path)
codex_shell_command_200_files     52 µs

Net per-request overhead is well under a millisecond for everything except
the Grep content strategy, which was already this slow before this PR
(2 000 matches × O(matches) regrouping). Filing a follow-up to revisit it.

Test plan

  • cargo fmt --all
  • cargo clippy --all-targets -- -D warnings — zero warnings
  • cargo test --all — 435 passed, 0 failed
  • cargo bench -p edgee-compressor --no-run — benches compile
  • cargo bench -p edgee-compressor -- --quick — benches execute, numbers above
  • cargo test -p edgee-compression-layer --test integration -- --nocapture
    — 59 % byte savings on a realistic fixture, idempotency confirmed
  • Smoke test against a live Claude Code session — verify the marker
    survives round-trips and the Anthropic cache_creation / cache_read
    counters reflect the SystemPromptCacheTechnique injections
  • Confirm edgee stats --per-tool against an existing session log
    directory

Backwards compatibility

  • CompressionLayer::new(config) still exists and produces a default
    pipeline of [tool-results] — no behavioural change for current callers.
  • CompressionConfig::new(agent) returns the same Arc<CompressionConfig>
    as before; the new metrics field defaults to a fresh
    Arc<CompressionMetrics>.
  • All public re-exports from previous releases are preserved
    (compress_tool_output, compress_codex_tool_output, claude_compressor_for,
    …).

Future work (deliberately out of scope)

  • Image down-sampling (needs image crate dependency, invasive content-type
    handling).
  • Tool-call arguments dedup via content-hash references (request-level
    state).
  • Async LLM-based conversation roll-up for old turns.
  • Fuzzing the parsers via cargo-fuzz / proptest.
  • Per-conversation LRU cache to avoid re-indexing on every request.
  • Revisit grep content mode performance — 234 ms on 2 000 matches is the
    one outlier in the bench suite (pre-existing).

From HelloMax to Edgee with 🫶

@manthis manthis requested a review from a team as a code owner May 1, 2026 08:16
manthis added 15 commits May 6, 2026 16:24
Refuse to compress payloads larger than 2 MiB. Prevents pathological
or malicious tool outputs from turning a single request into a DoS
vector on the gateway.

Applied at the central segment-protection helper, which all three
agent pipelines (Claude/Codex/OpenCode) route through.
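
As a sketch, the guard is a single length compare at that shared entry point:

const MAX_COMPRESSIBLE_BYTES: usize = 2 * 1024 * 1024; // 2 MiB

/// One len() compare on the hot path; oversized payloads pass through
/// untouched instead of feeding a DoS-shaped input to the strategies.
fn within_compression_budget(output: &str) -> bool {
    output.len() <= MAX_COMPRESSIBLE_BYTES
}
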
Prepend an invisible marker `<!--ec1-->` to every compressed result and
short-circuit the compressor when the input already carries it. Without
this guard, every conversation turn re-runs every previously compressed
tool message — and any change in compressor behaviour (bug fix, threshold
tweak, new strategy) silently invalidates the upstream prompt cache,
turning what should be a cache hit into a full re-encode.

The marker prefix `<!--ec` is recognized regardless of version number, so
rolling forward to `ec2` will still treat older outputs as already
compressed and leave them alone.
Introduce CompressionMetrics, a thread-safe collector of per-tool
counters (invocations, skips, bytes_in, bytes_out). Embedded in
CompressionConfig and shared across cloned layer/service handles via
Arc, so the gateway can scrape coherent stats from anywhere.

Without measurement we tune compression strategies blind. The snapshot
API returns sorted owned data, ready to be exposed through a metrics
endpoint or a CLI report.
…ive mode

- Recognize TOML, JSON, YAML, Markdown extensions. JSON/Markdown skip
  comment-stripping entirely so values that look like comments survive.
- Detect well-known lockfiles by basename (Cargo.lock, package-lock.json,
  yarn.lock, pnpm-lock.yaml, go.sum, etc.) and replace the body with a
  short head + tail + elision stub (sketch below). The LLM almost never
  needs the full contents of a generated lockfile.
- For brace-language files above 500 lines, collapse function/class
  bodies of 8+ lines into a `// ... (N lines collapsed)` placeholder.
  Comment-stripping alone leaves a 5000-line file at maybe 4500 lines —
  this turns big files into a usable skeleton.
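
A sketch of the stub; head/tail line counts and the elision wording are
assumptions, not the crate's exact output:

const LOCKFILES: &[&str] = &[
    "Cargo.lock", "package-lock.json", "yarn.lock", "pnpm-lock.yaml", "go.sum",
];

fn stub_lockfile(path: &str, body: &str) -> Option<String> {
    let name = path.rsplit('/').next()?;
    if !LOCKFILES.contains(&name) {
        return None;
    }
    let lines: Vec<&str> = body.lines().collect();
    if lines.len() <= 16 {
        return None; // already small; a stub would not pay
    }
    let head = lines[..8].join("\n");
    let tail = lines[lines.len() - 4..].join("\n");
    let elided = lines.len() - 12;
    Some(format!(
        "{head}\n... ({elided} lines elided, generated lockfile) ...\n{tail}"
    ))
}
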
…peline

Add CompressionTechnique trait + CompressionPipeline that chains
techniques in order. Existing tool-output compression becomes
ToolResultsTechnique — the only technique shipped today, but the trait
is the seam for upcoming work (image down-sample, system-prompt
deduplication, conversation summarization, …).

CompressionLayer now builds a default `[tool-results]` pipeline so
existing call sites stay drop-in compatible. CompressionLayer::with_pipeline
is the new escape hatch for callers that want to assemble a custom chain.
…l silent chains

Bash command dispatch now unwraps:
- Leading env-var assignments (FOO=bar cargo build)
- Wrapper keywords: sudo, time, env, nohup, exec
- Silent leading sub-commands followed by &&/;/|| (cd path && cargo
  build, export X=1 && cargo test, ...)

After unwrapping, dispatch is by basename so absolute paths like
/usr/local/bin/cargo route to the cargo compressor identically to bare
"cargo".

Net effect: real-world tool calls like `cd src && cargo build` and
`sudo /usr/bin/find . -name '*.rs'` now compress instead of falling
through to "no compression".
Codex tool outputs carry an "Exit code: N" line in the header. The
header is stripped before compression so that line was being thrown
away — the agent saw a successfully compressed body and lost the
failure signal entirely.

We now parse the exit code from the header (supports both
"Exit code: N" and "Process exited with code N" formats), and re-inject
it as a `[exit N]` prefix on non-zero exits. The prefix is placed AFTER
the version marker so the idempotency check stays intact across passes.
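
Sketch of the parse and re-injection (header formats from this commit,
function names hypothetical):

fn parse_exit_code(header: &str) -> Option<i32> {
    header.lines().find_map(|line| {
        line.strip_prefix("Exit code: ")
            .or_else(|| line.strip_prefix("Process exited with code "))
            .and_then(|n| n.trim().parse().ok())
    })
}

/// Marker first, then the exit prefix: the idempotency probe matches on
/// the "<!--ec" prefix, so it still fires on a second pass.
fn finalize(body: &str, exit_code: i32) -> String {
    if exit_code != 0 {
        format!("<!--ec1-->[exit {exit_code}] {body}")
    } else {
        format!("<!--ec1-->{body}")
    }
}
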
`<tool_use_error>` and `<persisted-output>` were detected with bare
`String::contains`, which produced false positives the moment a tool
output mentioned the tag in body content (a Read of this very source
file would have triggered).

Anchor the check to line start so only genuine tag blocks short-circuit
compression. Documents and code that *talk about* the tags now compress
normally.
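
The anchored check, sketched:

/// A genuine tag block starts its own line; prose that merely mentions the
/// tag no longer short-circuits compression.
fn has_protected_tag(output: &str) -> bool {
    output.lines().any(|line| {
        line.starts_with("<tool_use_error>") || line.starts_with("<persisted-output>")
    })
}
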
The pure 10 % savings ratio was rejecting useful compression on big
files: 8 % of 100 KB is 8 KB of avoidable tokens that we were throwing
away. Switch the gate to (savings ≥ 10 % OR savings ≥ 200 bytes) — either
threshold alone is enough to keep the result.

Applied to both the Claude and OpenCode Read compressors so they stay
in sync.
split_into_segments was running a regex find_iter on every output.
99% of tool outputs do not contain a `<system-reminder>` block, so a
literal substring check is enough to short-circuit and skip the regex
entirely. Same correctness, less work on the hot path.
Coding agents send the same long system prompt (CLAUDE.md, agent rules,
tool descriptions) on every request. Anthropic's prompt cache can avoid
re-encoding it — but only when the block carries a cache_control hint.
Many clients don't set one.

This new technique scans system messages and injects
`cache_control: {"type": "ephemeral"}` on large, un-hinted blocks.
Caps total injections to leave headroom under Anthropic's 4-breakpoint
limit, and counts pre-existing hints (ours or upstream's) so repeated
applies stay idempotent.

Also demonstrates the pipeline architecture from the previous commit:
just plug it into CompressionLayer::with_pipeline alongside the existing
ToolResultsTechnique.
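
A sketch of the injection pass over Anthropic-style system content blocks;
the size threshold, cap, and block shape are assumptions:

use serde_json::{json, Value};

const MIN_CACHEABLE_BYTES: usize = 1024; // "large" threshold, assumed
const MAX_INJECTIONS: usize = 2; // headroom under the 4-breakpoint limit

fn inject_cache_hints(system_blocks: &mut [Value]) {
    // Count pre-existing hints (ours or upstream's) so repeated passes
    // stay idempotent and never exceed the cap.
    let existing = system_blocks
        .iter()
        .filter(|b| b.get("cache_control").is_some())
        .count();
    let mut budget = MAX_INJECTIONS.saturating_sub(existing);
    for block in system_blocks.iter_mut() {
        if budget == 0 {
            break;
        }
        let large = block
            .get("text")
            .and_then(Value::as_str)
            .is_some_and(|t| t.len() >= MIN_CACHEABLE_BYTES);
        if large && block.get("cache_control").is_none() {
            block["cache_control"] = json!({ "type": "ephemeral" });
            budget -= 1;
        }
    }
}
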
Session logs already record `tool_compression_stats` per tool; the
stats command was throwing it away. Add a `--per-tool` flag that sums
{count, before, after} across every stored session and renders the
biggest absolute savings first, with the same compression bar widget
as the per-session row.

Useful both for tuning thresholds and for spotting tools where the
compressor is doing nothing.
Six micro-benches covering compress_tool_output for Bash/Read/Grep/Glob,
the segment-protection helper on its fast path, and the Codex shell
pipeline. Each one exercises a realistic-sized payload (40 hunks,
800 lines, 2k matches, etc.) so a regression in the strategy or in the
marker/threshold pipeline is visible above the benchmark noise floor.

Run with: cargo bench -p edgee-compressor
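
Sketch of one bench in the suite; the fixture builder is hypothetical and
compress_tool_output's real signature may take more context:

use criterion::{criterion_group, criterion_main, Criterion};
use edgee_compressor::compress_tool_output; // path assumed

fn build_git_diff_fixture(hunks: usize) -> String {
    // Hypothetical fixture: a realistic multi-hunk diff body.
    (0..hunks)
        .map(|i| format!("@@ -{i},8 +{i},9 @@\n-old line\n+new line\n"))
        .collect()
}

fn bash_git_diff_40_hunks(c: &mut Criterion) {
    let payload = build_git_diff_fixture(40);
    c.bench_function("bash_git_diff_40_hunks", |b| {
        b.iter(|| compress_tool_output("Bash", &payload))
    });
}

criterion_group!(benches, bash_git_diff_40_hunks);
criterion_main!(benches);
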
…ixture

Builds a Claude-shaped CompletionRequest mirroring a real coding-agent
mid-session payload (3.5 KB system prompt + 4-turn dialogue with Read,
Glob, Bash cargo, Grep tool calls totalling ~75 KB of tool output) and
runs it through the full pipeline (SystemPromptCacheTechnique +
ToolResultsTechnique).

Asserts:
- end-to-end byte savings ≥ 40 % (measured: 59 %)
- every compressed tool message starts with the version marker
- system prompt receives cache_control: ephemeral
- per-tool metrics are populated and self-consistent
- second pass through the pipeline is a no-op (idempotency, prompt
  cache stays stable across turns)
- Codex variant: non-zero exit codes survive as `[exit N]` after the
  marker, and the codex pipeline is also idempotent

Per-tool numbers from this fixture:
  Bash cargo:    3892 → 220   (94 %)
  Glob 250:      5139 → 619   (88 %)
  Grep 240:    13380 → 1859   (86 %)
  Read 800ln:  52983 → 27954  (47 %)
@KokaKiwi force-pushed the feat/compression branch from 4c4975e to 45f1cbb on May 6, 2026 14:24
@KokaKiwi (Member) left a comment

Thanks for the PR, lots of nice stuff in here. The per-tool metrics, the memory guard, and SystemPromptCacheTechnique all fill real gaps, and the jump from 386 to 435 tests with zero clippy warnings is appreciated.

Flagging the review with "Request changes" because of the brace-collapse
compression in crates/compressor/src/strategy/claude/read.rs, which can
silently drop code when braces appear inside string literals.
We previously had some "aggressive" Read compression as well, but it tended
to incapacitate the coding agent, so we had to tone down that tool's
compression specifically.

The rest is smaller review comments, plus one design question I'd love your take on.

/// Brace counting is naive — it does not strip braces inside strings or
/// comments. Mis-counts only ever cause "kept too much" (the body fails to
/// collapse), never silent data loss.
pub(crate) fn aggressive_collapse_braces(
Member

The brace counter on lines 547-560 walks every { / } character without filtering string literals, raw strings, or block comments. If a body contains a literal brace, depth reaches zero inside the literal, closed is set, and everything between that false close and the real } is silently dropped.

Repro shape:

fn build_query() {
    let sql = r#"SELECT * FROM t WHERE x = '}'"#;
    actual_code_here();
}

The doc-comment above the function says mis-counts only cause "kept too much, never silent data loss". That contract is violated here.

Tightening the boundary check is enough to make this sound. We accept fewer collapses (when } shares a line) but never drop code:

let closing = lines[j].1.trim();
if closed && j > i && matches!(closing, "}" | "};" | "},") {
    // safe to collapse
}

While we're here, the docstring at lines 530-531 needs to be updated too.


Review comment authored with Claude Code (Opus 4.7).

let mut map = self
.inner
.lock()
.expect("compression metrics mutex poisoned");
Member

All four public methods of CompressionMetrics (lines 54, 68, 80, 92) lock the mutex with .expect("compression metrics mutex poisoned"). Once any thread panics while holding the lock, every subsequent request on the hot path panics as well: a single observability-path issue takes the whole gateway down.

For accumulating counters, a partial / stale read is always preferable to a crash. The standard recovery is enough here:

self.inner.lock().unwrap_or_else(|e| e.into_inner())

Apply to all four call sites.
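
For illustration, the suggested recovery in one call site (method shape
assumed from the excerpt above):

pub fn record(&self, tool: &str, bytes_in: u64, bytes_out: u64) {
    // A stale or partially-updated read beats a cascading panic: the
    // counters are monotonic, so recovering the poisoned map is safe.
    let mut map = self.inner.lock().unwrap_or_else(|e| e.into_inner());
    let counters = map.entry(tool.to_owned()).or_default();
    counters.invocations += 1;
    counters.bytes_in += bytes_in;
    counters.bytes_out += bytes_out;
}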


Review comment authored with Claude Code (Opus 4.7).

//! prompt cache can avoid re-encoding it — but only if the request marks the
//! block with a `cache_control` hint. Many clients don't.
//!
//! This technique scans every system / developer message and, for those large
Member

The doc says "system / developer message" but the implementation only matches Message::System (line 93), and DeveloperMessage doesn't have a cache_control field anyway. Either drop "developer" from the doc, or add the field in gateway-core plus a Message::Developer arm in the loop and the counter.

Member

The CompressionTechnique trait + CompressionPipeline design works and makes sense for the near-term roadmap (MCP cleaning, caveman summarization, etc.). The builder API (CompressionLayer::with_pipeline) is a nice escape hatch too.

One thing I had in mind when thinking about this was using Tower layers directly, each technique as its own tower::Layer in the ServiceBuilder chain, which would keep the composition model consistent with the rest of the gateway stack. That said, I think there's a middle ground: we could keep the internal CompressionPipeline as-is and bridge it to Tower properly in a follow-up, once we have a clearer picture of what the next techniques look like. Not asking for changes here, just flagging it as something worth revisiting post-merge.

Member

I want to double-check the assumption behind is_already_compressed and the COMPRESSION_MARKER prefix.

The stated goal is cache-safety: if the same compressed message re-enters the pipeline unchanged, the upstream Anthropic prompt cache sees a stable byte sequence. That logic holds if the gateway ever receives its own compressed output back, but in our proxy model, it doesn't. The gateway compresses tool results before forwarding to Anthropic, but the compressed form is never echoed back to the client. On every subsequent turn the client resends its own copy of the conversation history with the original, uncompressed tool outputs, so the gateway always starts fresh.

The scenario the marker is protecting against (a previously compressed message re-entering the pipeline) can't happen structurally, so the guard feels unnecessary. And even if we wanted that protection at the compressor crate level, it seems more natural for the crate's caller to be responsible for not feeding already-compressed content back in, rather than baking the marker into the output format.

Happy to be convinced otherwise if there's a flow I'm missing.
