
feat(compression): full overhaul — pipeline, metrics, idempotency, robustness #82

Open
manthis wants to merge 15 commits into edgee-ai:main from manthis:feat/compression

Conversation

@manthis (Contributor) commented May 1, 2026

Summary

Reworks the compression layer end-to-end. Two main goals:

  1. Stop silently breaking the upstream prompt cache. The previous compressor
    would re-process every tool message on every turn — any change in
    compression behaviour (bug fix, threshold tweak) would invalidate the
    Anthropic / OpenAI prompt cache for the whole conversation.
  2. Open the architecture for new techniques. The crate description has
    always promised "multiple composable techniques", but only one was wired
    in and the dispatch was hard-coded. This PR introduces the trait + chain
    and demonstrates it with a second technique.

Along the way, fixes a handful of correctness gaps (<tool_use_error> false
positives, lost Codex exit codes), tightens dispatch on real-world bash
invocations (sudo cargo build, cd src && cargo test, /usr/local/bin/cargo),
adds per-tool metrics, image-resistant memory bounds, and a benchmark suite.

15 commits, +2 100 / −80 lines, 49 new tests (386 → 435), zero clippy
warnings under -D warnings.

Real-world results

End-to-end integration test (tests/integration.rs) on a fixture mirroring a
mid-session Claude Code request — 3.5 KB system prompt + 4-turn dialogue with
Read / Glob / Bash / Grep tool calls totalling ~75 KB of tool output:

system prompt: 3 602 bytes  → cache_control: ephemeral injected
tool inputs:   75 394 bytes
tool outputs:  30 652 bytes
saved:         44 742 bytes (59 %)

Per-tool breakdown:

Tool                   Input    Output   Saved
Bash (cargo build)     3 892       220    94 %
Glob (250 paths)       5 139       619    88 %
Grep (240 matches)    13 380     1 859    86 %
Read (800-line Rust)  52 983    27 954    47 %

Idempotency verified: a second pass through the pipeline is a byte-for-byte
no-op, so re-encoding never wakes the prompt cache.

What changed

Core correctness & cache safety

  • Versioned output marker — every compressed result is prefixed with
    <!--ec1-->. Re-entry detects the marker and short-circuits, so a stable
    compressed message stays byte-identical across turns and the upstream
    prompt cache survives (sketch after this list).
  • Memory guard — MAX_COMPRESSIBLE_BYTES = 2 MiB. Pathological tool
    outputs no longer turn a single request into a DoS vector.
  • Stricter protected-tag detection — <tool_use_error> /
    <persisted-output> are now anchored to line start. Documents and code
    that mention the tags compress normally instead of being silently
    rejected.
  • Codex exit code preserved — non-zero exits from shell_command are
    re-injected as [exit N] after the marker, so the agent still sees
    failure signal even after the header is stripped.
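
A minimal sketch of the marker check, with hypothetical helper names (the
crate's real ones may differ). The probe matches the <!--ec prefix rather
than the full marker, so rolling forward to ec2 still leaves ec1 output
untouched:

const COMPRESSION_MARKER: &str = "<!--ec1-->";

/// Version-agnostic probe: any "<!--ec" prefix counts as already compressed.
fn is_already_compressed(text: &str) -> bool {
    text.starts_with("<!--ec")
}

fn compress(text: &str) -> String {
    if is_already_compressed(text) {
        return text.to_owned(); // byte-for-byte no-op: the prompt cache stays warm
    }
    let body = run_strategies(text); // stand-in for the real per-tool dispatch
    format!("{COMPRESSION_MARKER}{body}")
}

fn run_strategies(text: &str) -> String {
    text.trim_end().to_owned() // placeholder strategy
}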

Architecture

  • CompressionTechnique trait + CompressionPipeline — composable chain
    applied in order to every CompletionRequest (sketch after this list).
    Existing tool-result compression becomes ToolResultsTechnique.
    CompressionLayer::new keeps the previous behaviour drop-in;
    CompressionLayer::with_pipeline is the new escape hatch for custom chains.
  • SystemPromptCacheTechnique (new) — auto-injects
    cache_control: {"type": "ephemeral"} on large, un-hinted system messages
    so Anthropic can serve them from prompt cache. Caps total injections to
    stay under the 4-breakpoint limit, counts pre-existing hints to remain
    idempotent across passes.
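
A hedged sketch of the seam: trait and type names follow the text above,
but the real signatures (config, metrics, error handling) likely differ.

use std::sync::Arc;

pub struct CompletionRequest { /* gateway request fields elided */ }

/// One composable pass over a request. Techniques must be idempotent so
/// re-running the chain on already-processed input is a no-op.
pub trait CompressionTechnique: Send + Sync {
    fn name(&self) -> &'static str;
    fn apply(&self, request: &mut CompletionRequest);
}

pub struct CompressionPipeline {
    techniques: Vec<Arc<dyn CompressionTechnique>>,
}

impl CompressionPipeline {
    pub fn apply(&self, request: &mut CompletionRequest) {
        for technique in &self.techniques {
            technique.apply(request); // in order; each sees the previous output
        }
    }
}

Under this shape, CompressionLayer::new builds the one-element
[tool-results] chain and with_pipeline accepts any custom Vec of techniques.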

Bash dispatch

Real-world tool calls now route correctly (sketch after this list):

  • Strip leading env-var assignments (FOO=bar cargo build).
  • Strip wrapper keywords: sudo, time, env, nohup, exec.
  • Peel silent leading sub-commands followed by && / ; / ||
    (cd src && cargo build, export X=1 && cargo test).
  • Dispatch by basename so /usr/local/bin/cargo and bare cargo route
    identically.
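
A sketch of the unwrapping order, with hypothetical helper names:

const WRAPPERS: &[&str] = &["sudo", "time", "env", "nohup", "exec"];
const SILENT: &[&str] = &["cd", "export", "set"];

/// Splits at the earliest of the given separators, if any.
fn split_once_any<'a>(s: &'a str, seps: &[&str]) -> Option<(&'a str, &'a str)> {
    seps.iter()
        .filter_map(|sep| s.split_once(sep))
        .min_by_key(|(head, _)| head.len())
}

/// Basename of the command that does the real work, e.g.
/// "cd src && sudo /usr/local/bin/cargo build" -> Some("cargo").
fn dispatch_key(command: &str) -> Option<String> {
    let mut rest = command.trim();
    // Peel silent leading sub-commands chained with && / ; / ||.
    while let Some((head, tail)) = split_once_any(rest, &["&&", ";", "||"]) {
        match head.split_whitespace().next() {
            Some(first) if SILENT.contains(&first) => rest = tail.trim_start(),
            _ => break,
        }
    }
    let mut words = rest.split_whitespace().peekable();
    // Strip env-var assignments (FOO=bar) and wrapper keywords.
    while let Some(&word) = words.peek() {
        if word.contains('=') || WRAPPERS.contains(&word) {
            words.next();
        } else {
            break;
        }
    }
    // Dispatch by basename so absolute paths and bare names route the same.
    words.next().map(|w| w.rsplit('/').next().unwrap_or(w).to_owned())
}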

Read enhancements

  • New language coverage: TOML, JSON, YAML, Markdown. JSON / Markdown skip
    comment-stripping entirely so values that look like comments survive.
  • Lockfile detection (Cargo.lock, package-lock.json, yarn.lock,
    pnpm-lock.yaml, go.sum, …) — replaced with a head + tail + elision
    stub. Generated lockfiles eat token budget for almost no informational
    value.
  • Aggressive mode for brace-language files >500 lines: collapse function /
    class bodies of 8+ lines into a // ... (N lines collapsed) placeholder.
    A 5 000-line source file becomes a usable skeleton instead of a slightly
    smaller wall of text.
  • Threshold relaxed: accept either ≥10 % savings or ≥200 bytes saved
    (whichever fires first). Stops rejecting modest absolute gains on large
    files.
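
The relaxed gate, sketched on byte lengths:

fn keep_compressed(original_len: usize, compressed_len: usize) -> bool {
    let saved = original_len.saturating_sub(compressed_len);
    // 10 % relative OR 200 bytes absolute: an 8 % saving on a 100 KB file
    // (8 KB) misses the relative bar but clears the absolute one.
    saved * 10 >= original_len || saved >= 200
}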

Observability

  • CompressionMetrics — per-tool counters (invocations, skipped,
    bytes_in, bytes_out) shared across cloned layer / service handles
    via Arc (sketch after this list). Snapshot + totals APIs return owned
    data, ready for export via a /metrics HTTP handler.
  • edgee stats --per-tool — aggregates tool_compression_stats from
    every stored session log, sorted by absolute savings descending. Useful
    both for tuning and for spotting tools where the compressor does
    nothing.
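
A sketch of the collector's shape; field and method names are assumptions
from the description above (the .expect on the lock mirrors the code under
review later in this thread):

use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Default, Clone, Copy)]
pub struct ToolCounters {
    pub invocations: u64,
    pub skipped: u64,
    pub bytes_in: u64,
    pub bytes_out: u64,
}

#[derive(Default)]
pub struct CompressionMetrics {
    inner: Mutex<HashMap<String, ToolCounters>>,
}

impl CompressionMetrics {
    pub fn record(&self, tool: &str, bytes_in: u64, bytes_out: u64) {
        let mut map = self.inner.lock().expect("compression metrics mutex poisoned");
        let counters = map.entry(tool.to_owned()).or_default();
        counters.invocations += 1;
        counters.bytes_in += bytes_in;
        counters.bytes_out += bytes_out;
    }

    /// Owned snapshot, safe to serialize from a /metrics handler.
    pub fn snapshot(&self) -> Vec<(String, ToolCounters)> {
        let map = self.inner.lock().expect("compression metrics mutex poisoned");
        map.iter().map(|(name, c)| (name.clone(), *c)).collect()
    }
}

Shared as Arc<CompressionMetrics> inside CompressionConfig, so cloned layer
and service handles all feed the same map.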

Performance & test infrastructure

  • Regex early-out in split_into_segments — skip the regex walk
    entirely when the literal <system-reminder> substring is absent
    (>99 % of inputs); sketched after this list.
  • Criterion benchmarks — six micro-benches covering Bash / Read / Grep /
    Glob / segment-protection / Codex pipeline. cargo bench -p edgee-compressor.
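
The early-out, sketched (the Segment shape is assumed):

enum Segment<'a> {
    Compressible(&'a str),
    Protected(&'a str),
}

fn split_into_segments(output: &str) -> Vec<Segment<'_>> {
    // Literal probe first: >99 % of outputs carry no <system-reminder>
    // block, so the regex walk is skipped entirely on the fast path.
    if !output.contains("<system-reminder>") {
        return vec![Segment::Compressible(output)];
    }
    split_with_regex(output)
}

fn split_with_regex(_output: &str) -> Vec<Segment<'_>> {
    unimplemented!("pre-existing regex find_iter walk, unchanged by this PR")
}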

Estimated cost per improvement

Cost dimensions per change:

  • Latency — added time per request on the hot path.
  • Memory — bytes allocated per request beyond what we had before.
  • Tokens — bytes added to (or removed from) the wire payload sent to the
    provider, multiplied by realistic conversation length.
  • Code — net SLOC added and conceptual surface to maintain.

Latency numbers below are taken from cargo bench -p edgee-compressor -- --quick on an Apple Silicon dev machine, measured per tool message.

  • Memory guard (MAX_COMPRESSIBLE_BYTES) — latency: ~1 ns (len() compare);
    memory: 0; tokens: 0; code: +30 SLOC.
  • Versioned marker (<!--ec1-->) — latency: ~50 ns (one starts_with + one
    format!); memory: +10 bytes per compressed message (one String grow);
    tokens: +10 bytes ≈ 3 tokens per tool message, pays for itself in 1 cache
    hit on the next turn; code: +50 SLOC.
  • Stricter tag detection — latency: +5–50 µs on outputs that don't carry
    the tags (lines().any instead of contains); memory: 0; tokens: 0;
    code: +10 SLOC.
  • Codex exit-code preservation — latency: ~80 ns header scan, one format!
    only on non-zero exits; memory: +10 bytes when exit ≠ 0; tokens: +~5
    tokens on failed shell calls only; code: +30 SLOC.
  • Bash dispatch (prefixes / basename / silent chains) — latency: ~150 ns of
    byte scanning per Bash tool call; memory: 0 (all &str slices); tokens: 0;
    code: +160 SLOC.
  • Read: new langs + threshold relax — latency: ~0 (added match arms);
    memory: 0; tokens: negative — unlocks compression that was previously
    rejected; code: +15 SLOC.
  • Read: lockfile stubs — latency: O(1) basename match, format!() only on
    hit; memory: one small String on hit; tokens: large negative — a typical
    Cargo.lock shrinks from 5–50 KB to ~200 bytes; code: +60 SLOC.
  • Read: aggressive brace collapse — latency: O(n) brace scan, only when
    file > 500 lines AND brace-language, ~30 µs on a 1 000-line file; memory:
    one Vec reallocated to filtered size; tokens: large negative — a
    5 000-line file collapses to a few hundred lines of skeleton;
    code: +120 SLOC.
  • Regex early-out — latency: negative — skips a 1–10 µs regex walk on the
    >99 % of outputs without a <system-reminder>; memory: 0; tokens: 0;
    code: +5 SLOC.
  • CompressionTechnique pipeline — latency: ~5 ns per technique (Vec iter +
    dyn dispatch); memory: one Arc<CompressionPipeline> per layer (shared);
    tokens: 0; code: +130 SLOC.
  • SystemPromptCacheTechnique — latency: ~1 µs per request (one walk over
    messages); memory: one serde_json::Value per injection (≤2 per request);
    tokens: 0 added — enables a 90 %+ token-cost discount on cached system
    blocks via the Anthropic prompt cache; code: +220 SLOC.
  • Per-tool CompressionMetrics — latency: ~200–500 ns per tool message (one
    mutex acquire + one HashMap::entry); memory: one String (tool name) on
    first encounter, then nothing; tokens: 0; code: +180 SLOC.
  • edgee stats --per-tool — latency: out of hot path, only runs on CLI
    invocation; memory: negligible; tokens: 0; code: +100 SLOC.
  • Criterion benches — latency: out of hot path, dev-only; memory: 0;
    tokens: 0; code: +310 SLOC (dev-only).

Concrete bench results on the fixture in tests/integration.rs:

bash_git_diff_40_hunks            22 µs
read_rust_800_lines              114 µs
grep_content_2000_matches        234 ms   (pre-existing algorithm; flagged for follow-up)
glob_400_paths                    55 µs
segment_protection_no_reminder    28 µs   (the early-out path)
codex_shell_command_200_files     52 µs

Net per-request overhead is well under a millisecond for everything except
the Grep content strategy, which was already this slow before this PR
(2 000 matches × O(matches) regrouping). Filing a follow-up to revisit it.

Test plan

  • cargo fmt --all
  • cargo clippy --all-targets -- -D warnings — zero warnings
  • cargo test --all — 435 passed, 0 failed
  • cargo bench -p edgee-compressor --no-run — benches compile
  • cargo bench -p edgee-compressor -- --quick — benches execute, numbers above
  • cargo test -p edgee-compression-layer --test integration -- --nocapture
    — 59 % byte savings on a realistic fixture, idempotency confirmed
  • Smoke test against a live Claude Code session — verify the marker
    survives round-trips and the Anthropic cache_creation / cache_read
    counters reflect the SystemPromptCacheTechnique injections
  • Confirm edgee stats --per-tool against an existing session log
    directory

Backwards compatibility

  • CompressionLayer::new(config) still exists and produces a default
    pipeline of [tool-results] — no behavioural change for current callers.
  • CompressionConfig::new(agent) returns the same Arc<CompressionConfig>
    as before; the new metrics field defaults to a fresh
    Arc<CompressionMetrics>.
  • All public re-exports from previous releases are preserved
    (compress_tool_output, compress_codex_tool_output, claude_compressor_for,
    …).

Future work (deliberately out of scope)

  • Image down-sampling (needs image crate dependency, invasive content-type
    handling).
  • Tool-call arguments dedup via content-hash references (request-level
    state).
  • Async LLM-based conversation roll-up for old turns.
  • Fuzzing the parsers via cargo-fuzz / proptest.
  • Per-conversation LRU cache to avoid re-indexing on every request.
  • Revisit grep content mode performance — 234 ms on 2 000 matches is the
    one outlier in the bench suite (pre-existing).

From HelloMax to Edgee with 🫶

@manthis manthis requested a review from a team as a code owner May 1, 2026 08:16
manthis added 15 commits May 6, 2026 16:24
Refuse to compress payloads larger than 2 MiB. Prevents pathological
or malicious tool outputs from turning a single request into a DoS
vector on the gateway.

Applied at the central segment-protection helper, which all three
agent pipelines (Claude/Codex/OpenCode) route through.
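
As a sketch, the guard is a single length compare at that shared entry point:

const MAX_COMPRESSIBLE_BYTES: usize = 2 * 1024 * 1024; // 2 MiB

/// One len() compare on the hot path; oversized payloads pass through
/// untouched instead of feeding a DoS-shaped input to the strategies.
fn within_compression_budget(output: &str) -> bool {
    output.len() <= MAX_COMPRESSIBLE_BYTES
}
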
Prepend an invisible marker `<!--ec1-->` to every compressed result and
short-circuit the compressor when the input already carries it. Without
this guard, every conversation turn re-runs every previously compressed
tool message — and any change in compressor behaviour (bug fix, threshold
tweak, new strategy) silently invalidates the upstream prompt cache,
turning what should be a cache hit into a full re-encode.

The marker prefix `<!--ec` is recognized regardless of version number, so
rolling forward to `ec2` will still treat older outputs as already
compressed and leave them alone.
Introduce CompressionMetrics, a thread-safe collector of per-tool
counters (invocations, skips, bytes_in, bytes_out). Embedded in
CompressionConfig and shared across cloned layer/service handles via
Arc, so the gateway can scrape coherent stats from anywhere.

Without measurement we tune compression strategies blind. The snapshot
API returns sorted owned data, ready to be exposed through a metrics
endpoint or a CLI report.
…ive mode

- Recognize TOML, JSON, YAML, Markdown extensions. JSON/Markdown skip
  comment-stripping entirely so values that look like comments survive.
- Detect well-known lockfiles by basename (Cargo.lock, package-lock.json,
  yarn.lock, pnpm-lock.yaml, go.sum, etc.) and replace the body with a
  short head + tail + elision stub (sketch below). The LLM almost never
  needs the full contents of a generated lockfile.
- For brace-language files above 500 lines, collapse function/class
  bodies of 8+ lines into a `// ... (N lines collapsed)` placeholder.
  Comment-stripping alone leaves a 5000-line file at maybe 4500 lines —
  this turns big files into a usable skeleton.
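
A sketch of the stub; head/tail line counts and the elision wording are
assumptions, not the crate's exact output:

const LOCKFILES: &[&str] = &[
    "Cargo.lock", "package-lock.json", "yarn.lock", "pnpm-lock.yaml", "go.sum",
];

fn stub_lockfile(path: &str, body: &str) -> Option<String> {
    let name = path.rsplit('/').next()?;
    if !LOCKFILES.contains(&name) {
        return None;
    }
    let lines: Vec<&str> = body.lines().collect();
    if lines.len() <= 16 {
        return None; // already small; a stub would not pay
    }
    let head = lines[..8].join("\n");
    let tail = lines[lines.len() - 4..].join("\n");
    let elided = lines.len() - 12;
    Some(format!(
        "{head}\n... ({elided} lines elided, generated lockfile) ...\n{tail}"
    ))
}
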
…peline

Add CompressionTechnique trait + CompressionPipeline that chains
techniques in order. Existing tool-output compression becomes
ToolResultsTechnique — the only technique shipped today, but the trait
is the seam for upcoming work (image down-sample, system-prompt
deduplication, conversation summarization, …).

CompressionLayer now builds a default `[tool-results]` pipeline so
existing call sites stay drop-in compatible. CompressionLayer::with_pipeline
is the new escape hatch for callers that want to assemble a custom chain.
…l silent chains

Bash command dispatch now unwraps:
- Leading env-var assignments (FOO=bar cargo build)
- Wrapper keywords: sudo, time, env, nohup, exec
- Silent leading sub-commands followed by &&/;/|| (cd path && cargo
  build, export X=1 && cargo test, ...)

After unwrapping, dispatch is by basename so absolute paths like
/usr/local/bin/cargo route to the cargo compressor identically to bare
"cargo".

Net effect: real-world tool calls like `cd src && cargo build` and
`sudo /usr/bin/find . -name '*.rs'` now compress instead of falling
through to "no compression".
Codex tool outputs carry an "Exit code: N" line in the header. The
header is stripped before compression so that line was being thrown
away — the agent saw a successfully compressed body and lost the
failure signal entirely.

We now parse the exit code from the header (supports both
"Exit code: N" and "Process exited with code N" formats), and re-inject
it as a `[exit N]` prefix on non-zero exits. The prefix is placed AFTER
the version marker so the idempotency check stays intact across passes.
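
Sketch of the parse and re-injection (header formats from this commit,
function names hypothetical):

fn parse_exit_code(header: &str) -> Option<i32> {
    header.lines().find_map(|line| {
        line.strip_prefix("Exit code: ")
            .or_else(|| line.strip_prefix("Process exited with code "))
            .and_then(|n| n.trim().parse().ok())
    })
}

/// Marker first, then the exit prefix: the idempotency probe matches on
/// the "<!--ec" prefix, so it still fires on a second pass.
fn finalize(body: &str, exit_code: i32) -> String {
    if exit_code != 0 {
        format!("<!--ec1-->[exit {exit_code}] {body}")
    } else {
        format!("<!--ec1-->{body}")
    }
}
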
`<tool_use_error>` and `<persisted-output>` were detected with bare
`String::contains`, which produced false positives the moment a tool
output mentioned the tag in body content (a Read of this very source
file would have triggered).

Anchor the check to line start so only genuine tag blocks short-circuit
compression. Documents and code that *talk about* the tags now compress
normally.
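
The anchored check, sketched:

/// A genuine tag block starts its own line; prose that merely mentions the
/// tag no longer short-circuits compression.
fn has_protected_tag(output: &str) -> bool {
    output.lines().any(|line| {
        line.starts_with("<tool_use_error>") || line.starts_with("<persisted-output>")
    })
}
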
The pure 10 % savings ratio was rejecting useful compression on big
files: 8 % of 100 KB is 8 KB of avoidable tokens that we were throwing
away. Switch the gate to (savings ≥ 10 % OR savings ≥ 200 bytes) — either
threshold alone is enough to keep the result.

Applied to both the Claude and OpenCode Read compressors so they stay
in sync.
split_into_segments was running a regex find_iter on every output.
99% of tool outputs do not contain a `<system-reminder>` block, so a
literal substring check is enough to short-circuit and skip the regex
entirely. Same correctness, less work on the hot path.
Coding agents send the same long system prompt (CLAUDE.md, agent rules,
tool descriptions) on every request. Anthropic's prompt cache can avoid
re-encoding it — but only when the block carries a cache_control hint.
Many clients don't set one.

This new technique scans system messages and injects
`cache_control: {"type": "ephemeral"}` on large, un-hinted blocks.
Caps total injections to leave headroom under Anthropic's 4-breakpoint
limit, and counts pre-existing hints (ours or upstream's) so repeated
applies stay idempotent.

Also demonstrates the pipeline architecture from the previous commit:
just plug it into CompressionLayer::with_pipeline alongside the existing
ToolResultsTechnique.
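
A sketch of the injection pass over Anthropic-style system content blocks;
the size threshold, cap, and block shape are assumptions:

use serde_json::{json, Value};

const MIN_CACHEABLE_BYTES: usize = 1024; // "large" threshold, assumed
const MAX_INJECTIONS: usize = 2; // headroom under the 4-breakpoint limit

fn inject_cache_hints(system_blocks: &mut [Value]) {
    // Count pre-existing hints (ours or upstream's) so repeated passes
    // stay idempotent and never exceed the cap.
    let existing = system_blocks
        .iter()
        .filter(|b| b.get("cache_control").is_some())
        .count();
    let mut budget = MAX_INJECTIONS.saturating_sub(existing);
    for block in system_blocks.iter_mut() {
        if budget == 0 {
            break;
        }
        let large = block
            .get("text")
            .and_then(Value::as_str)
            .is_some_and(|t| t.len() >= MIN_CACHEABLE_BYTES);
        if large && block.get("cache_control").is_none() {
            block["cache_control"] = json!({ "type": "ephemeral" });
            budget -= 1;
        }
    }
}
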
Session logs already record `tool_compression_stats` per tool; the
stats command was throwing it away. Add a `--per-tool` flag that sums
{count, before, after} across every stored session and renders the
biggest absolute savings first, with the same compression bar widget
as the per-session row.

Useful both for tuning thresholds and for spotting tools where the
compressor is doing nothing.
Six micro-benches covering compress_tool_output for Bash/Read/Grep/Glob,
the segment-protection helper on its fast path, and the Codex shell
pipeline. Each one exercises a realistic-sized payload (40 hunks,
800 lines, 2k matches, etc.) so a regression in the strategy or in the
marker/threshold pipeline is visible above the benchmark noise floor.

Run with: cargo bench -p edgee-compressor
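
Sketch of one bench in the suite; the fixture builder is hypothetical and
compress_tool_output's real signature may take more context:

use criterion::{criterion_group, criterion_main, Criterion};
use edgee_compressor::compress_tool_output; // path assumed

fn build_git_diff_fixture(hunks: usize) -> String {
    // Hypothetical fixture: a realistic multi-hunk diff body.
    (0..hunks)
        .map(|i| format!("@@ -{i},8 +{i},9 @@\n-old line\n+new line\n"))
        .collect()
}

fn bash_git_diff_40_hunks(c: &mut Criterion) {
    let payload = build_git_diff_fixture(40);
    c.bench_function("bash_git_diff_40_hunks", |b| {
        b.iter(|| compress_tool_output("Bash", &payload))
    });
}

criterion_group!(benches, bash_git_diff_40_hunks);
criterion_main!(benches);
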
…ixture

Builds a Claude-shaped CompletionRequest mirroring a real coding-agent
mid-session payload (3.5 KB system prompt + 4-turn dialogue with Read,
Glob, Bash cargo, Grep tool calls totalling ~75 KB of tool output) and
runs it through the full pipeline (SystemPromptCacheTechnique +
ToolResultsTechnique).

Asserts:
- end-to-end byte savings ≥ 40 % (measured: 59 %)
- every compressed tool message starts with the version marker
- system prompt receives cache_control: ephemeral
- per-tool metrics are populated and self-consistent
- second pass through the pipeline is a no-op (idempotency, prompt
  cache stays stable across turns)
- Codex variant: non-zero exit codes survive as `[exit N]` after the
  marker, and the codex pipeline is also idempotent

Per-tool numbers from this fixture:
  Bash cargo:    3892 → 220   (94 %)
  Glob 250:      5139 → 619   (88 %)
  Grep 240:    13380 → 1859   (86 %)
  Read 800ln:  52983 → 27954  (47 %)
@KokaKiwi force-pushed the feat/compression branch from 4c4975e to 45f1cbb on May 6, 2026 14:24
@KokaKiwi (Member) left a comment

Thanks for the PR, lots of nice stuff in here. The per-tool metrics, the memory guard, and SystemPromptCacheTechnique all fill real gaps, and the jump from 386 to 435 tests with zero clippy warnings is appreciated.

Flagging the review with "Request changes" because of the brace-collapse
compression in crates/compressor/src/strategy/claude/read.rs, which can
silently drop code when braces appear inside string literals.
We previously had some "aggressive" Read compression as well, but it tended
to incapacitate the coding agent, so we had to tone down that tool's
compression specifically.

The rest is smaller review comments, plus one design question I'd love your take on.

/// Brace counting is naive — it does not strip braces inside strings or
/// comments. Mis-counts only ever cause "kept too much" (the body fails to
/// collapse), never silent data loss.
pub(crate) fn aggressive_collapse_braces(
Member

The brace counter on lines 547-560 walks every { / } character without filtering string literals, raw strings, or block comments. If a body contains a literal brace, depth reaches zero inside the literal, closed is set, and everything between that false close and the real } is silently dropped.

Repro shape:

fn build_query() {
    let sql = r#"SELECT * FROM t WHERE x = '}'"#;
    actual_code_here();
}

The doc-comment above the function says mis-counts only cause "kept too much, never silent data loss". That contract is violated here.

Tightening the boundary check is enough to make this sound. We accept fewer collapses (when } shares a line) but never drop code:

let closing = lines[j].1.trim();
if closed && j > i && matches!(closing, "}" | "};" | "},") {
    // safe to collapse
}

While we're here, the docstring at lines 530-531 needs to be updated too.


Review comment authored with Claude Code (Opus 4.7).

let mut map = self
.inner
.lock()
.expect("compression metrics mutex poisoned");
Member

All four public methods of CompressionMetrics (lines 54, 68, 80, 92) lock the mutex with .expect("compression metrics mutex poisoned"). Once any thread panics while holding the lock, every subsequent request on the hot path panics as well: a single observability-path issue takes the whole gateway down.

For accumulating counters, a partial / stale read is always preferable to a crash. The standard recovery is enough here:

self.inner.lock().unwrap_or_else(|e| e.into_inner())

Apply to all four call sites.
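
For illustration, the suggested recovery in one call site (method shape
assumed from the excerpt above):

pub fn record(&self, tool: &str, bytes_in: u64, bytes_out: u64) {
    // A stale or partially-updated read beats a cascading panic: the
    // counters are monotonic, so recovering the poisoned map is safe.
    let mut map = self.inner.lock().unwrap_or_else(|e| e.into_inner());
    let counters = map.entry(tool.to_owned()).or_default();
    counters.invocations += 1;
    counters.bytes_in += bytes_in;
    counters.bytes_out += bytes_out;
}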


Review comment authored with Claude Code (Opus 4.7).

//! prompt cache can avoid re-encoding it — but only if the request marks the
//! block with a `cache_control` hint. Many clients don't.
//!
//! This technique scans every system / developer message and, for those large
Member

The doc says "system / developer message" but the implementation only matches Message::System (line 93), and DeveloperMessage doesn't have a cache_control field anyway. Either drop "developer" from the doc, or add the field in gateway-core plus a Message::Developer arm in the loop and the counter.

Member

The CompressionTechnique trait + CompressionPipeline design works and makes sense for the near-term roadmap (MCP cleaning, caveman summarization, etc.). The builder API (CompressionLayer::with_pipeline) is a nice escape hatch too.

One thing I had in mind when thinking about this was using Tower layers directly, each technique as its own tower::Layer in the ServiceBuilder chain, which would keep the composition model consistent with the rest of the gateway stack. That said, I think there's a middle ground: we could keep the internal CompressionPipeline as-is and bridge it to Tower properly in a follow-up, once we have a clearer picture of what the next techniques look like. Not asking for changes here, just flagging it as something worth revisiting post-merge.

Member

I want to double-check the assumption behind is_already_compressed and the COMPRESSION_MARKER prefix.

The stated goal is cache-safety: if the same compressed message re-enters the pipeline unchanged, the upstream Anthropic prompt cache sees a stable byte sequence. That logic holds if the gateway ever receives its own compressed output back, but in our proxy model, it doesn't. The gateway compresses tool results before forwarding to Anthropic, but the compressed form is never echoed back to the client. On every subsequent turn the client resends its own copy of the conversation history with the original, uncompressed tool outputs, so the gateway always starts fresh.

The scenario the marker is protecting against (a previously compressed message re-entering the pipeline) can't happen structurally, so the guard feels unnecessary. And even if we wanted that protection at the compressor crate level, it seems more natural for the crate's caller to be responsible for not feeding already-compressed content back in, rather than baking the marker into the output format.

Happy to be convinced otherwise if there's a flow I'm missing.
