Comprehensive guide to Lynkr's token optimization strategies — benchmarked on real agentic coding workloads.
Lynkr reduces tokens sent to the model through multiple independent mechanisms. Benchmarked results on Claude Code / Cursor sessions:
| Optimization | Measured Reduction | Scenario |
|---|---|---|
| Smart tool selection | 47–60% | 14-tool request (read or write task) |
| TOON JSON compression | 87.6% | Large grep/file-read tool result (60-item array) |
| Tool-result compression (RTK) | up to 87.6% | grep/test/git/lint/build/log/JSON tool output |
| Semantic cache | 100% on hit, 171ms | Paraphrased repeat query |
| MCP Code Mode | 96% | 100+ MCP tool schemas → 4 meta-tools |
| History compression | up to 80% | Long multi-turn sessions |
At 100,000 requests/month on a tool-heavy agentic workload, this translates to roughly $86k–$115k in annual savings on the hosted providers in the cost comparison.
Measured on identical prompts, same backend provider (June 2026):
| Scenario | Tokens without Lynkr | Tokens with Lynkr | Reduction |
|---|---|---|---|
| 14-tool read request | 1,042 | 547 | 47% |
| 14-tool write request | 1,043 | 412 | 60% |
| JSON grep result (60 items) | 3,458 | 427 | 87.6% |
| Semantic cache (2nd call) | 2,857 | 0 | 100% |
Scenario: 100,000 requests/month, 50k input tokens, 2k output tokens per request
| Provider | Without Lynkr | With Lynkr | Monthly Savings | Annual Savings |
|---|---|---|---|---|
| Claude Sonnet 4.5 | $16,000 | $6,400 | $9,600 | $115,200 |
| GPT-4o | $12,000 | $4,800 | $7,200 | $86,400 |
| Ollama (Local) | API costs | $0 | $12,000+ | $144,000+ |
Problem: Sending 100+ MCP tool schemas on every request consumes roughly 17,500 tokens before the model sees any actual work.
Solution: Replace all MCP tool schemas with 4 meta-tools that enable lazy tool discovery.
How it works:
- Without Code Mode: Every MCP tool schema sent on every request
- With Code Mode: Only 4 meta-tools sent (~700 tokens)

The 4 meta-tools:
- mcp_list_tools → Discover available tools (compact listing)
- mcp_tool_info → Load full schema for one specific tool
- mcp_tool_docs → Get usage examples + parameters
- mcp_execute → Execute a tool by name with JSON args
Example workflow:
Turn 1: mcp_list_tools({ server_id: "github" })
→ Returns: ["create_issue", "list_prs", "merge_pr", ...]
Turn 2: mcp_tool_info({ server_id: "github", tool_name: "create_issue" })
→ Returns: { inputSchema: { title: string, body: string, ... } }
Turn 3: mcp_execute({
server_id: "github",
tool_name: "create_issue",
arguments: { title: "Bug", body: "..." }
})
Token savings:
Without Code Mode: 100 tools × 175 tokens = 17,500 tokens
With Code Mode: 4 meta-tools × 175 tokens = 700 tokens
Savings: 96% (16,800 tokens saved)
Trade-off: Requires 3 sequential tool calls (discover → inspect → execute) instead of 1 direct call. This adds latency but saves massive context in MCP-heavy setups.
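
The discover → inspect → execute flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Lynkr's implementation: `REGISTRY` and its handlers are hypothetical stand-ins for connected MCP servers, and `schema_tokens` simply reuses the 175-tokens-per-schema figure from the example above.

```python
from typing import Any

# Hypothetical in-memory registry standing in for connected MCP servers.
REGISTRY: dict[str, dict[str, dict[str, Any]]] = {
    "github": {
        "create_issue": {
            "inputSchema": {"title": "string", "body": "string"},
            "handler": lambda args: {"created": args["title"]},
        },
        "list_prs": {
            "inputSchema": {"state": "string"},
            "handler": lambda args: {"prs": []},
        },
    }
}


def mcp_list_tools(server_id: str) -> list[str]:
    """Compact listing: tool names only, no schemas."""
    return sorted(REGISTRY[server_id])


def mcp_tool_info(server_id: str, tool_name: str) -> dict[str, Any]:
    """Full schema for exactly one tool, loaded on demand."""
    return {"inputSchema": REGISTRY[server_id][tool_name]["inputSchema"]}


def mcp_execute(server_id: str, tool_name: str, arguments: dict[str, Any]) -> Any:
    """Run a tool by name with JSON-style arguments."""
    return REGISTRY[server_id][tool_name]["handler"](arguments)


def schema_tokens(tool_count: int, tokens_per_schema: int = 175) -> int:
    """Schema overhead per request, using the per-schema estimate from the example."""
    return tool_count * tokens_per_schema


baseline = schema_tokens(100)   # every MCP schema on every request
code_mode = schema_tokens(4)    # only the meta-tools
saved = baseline - code_mode
percent_saved = round(100 * saved / baseline)
```

Only the schema of the tool the model actually picks is ever loaded, which is where the saving comes from; the cost is the extra round trips noted in the trade-off.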
Configuration:
# Enable MCP Code Mode
CODE_MODE_ENABLED=true
# Tool list cache TTL in milliseconds (default: 60000 = 1 minute)
CODE_MODE_CACHE_TTL=60000

Inspired by: Bifrost's Code Mode architecture.
Problem: Sending all tool schemas on every request wastes tokens. A read-only query doesn't need Write, Edit, Bash, or Git schemas.
Solution: Lynkr classifies each request and strips irrelevant tool definitions before forwarding.
How it works:
- Chat queries → Only Read tool
- File operations → Read, Write, Edit tools
- Git operations → git_* tools
- Code execution → Bash tool
Benchmarked on 14-tool Claude Code session:
Read task: 1,042 tokens raw → 547 tokens after selection (−47%)
Write task: 1,043 tokens raw → 412 tokens after selection (−60%)
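
As a rough illustration of classify-then-filter, here is a minimal Python sketch. The keyword rules in `classify` and the `TOOLS_BY_CATEGORY` mapping are assumptions made for this example; the source does not describe how Lynkr actually classifies requests, only the four categories and the tools each one keeps.

```python
import re

# Hypothetical mapping that mirrors the four cases listed above.
# Git requests are handled separately by a git_* prefix match.
TOOLS_BY_CATEGORY = {
    "chat": {"Read"},
    "file": {"Read", "Write", "Edit"},
    "exec": {"Bash"},
}


def classify(prompt: str) -> str:
    """Very rough keyword classifier (an assumption, not Lynkr's logic)."""
    text = prompt.lower()
    if re.search(r"\b(git|commit|branch|rebase)\b", text):
        return "git"
    if re.search(r"\b(run|execute|install|test)\b", text):
        return "exec"
    if re.search(r"\b(write|edit|create|rename|refactor)\b", text):
        return "file"
    return "chat"


def select_tools(prompt: str, tools: list[dict]) -> list[dict]:
    """Return only the tool definitions relevant to this request."""
    category = classify(prompt)
    if category == "git":
        return [t for t in tools if t["name"].startswith("git_")]
    allowed = TOOLS_BY_CATEGORY[category]
    return [t for t in tools if t["name"] in allowed]


all_tools = [{"name": n} for n in
             ["Read", "Write", "Edit", "Bash", "git_status", "git_commit"]]
read_only = select_tools("What does this function do?", all_tools)
```

A production classifier would need to handle mixed requests (for example, "edit the file and commit it"); this sketch picks a single category per request.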
Configuration:
# Automatic - no configuration needed
# Lynkr detects request type and filters tools

Problem: Repeated system prompts consume tokens.
Solution: Cache and reuse prompts across requests.
How it works:
- SHA-256 hash of prompt
- LRU cache with TTL (default: 5 minutes)
- Cache hit = free tokens
Example:
First request: 2,000 token system prompt
Subsequent requests: 0 tokens (cache hit)
10 requests: Save 18,000 tokens (90% reduction)
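
The three ingredients above (SHA-256 key, LRU eviction, TTL expiry) fit together roughly as follows. This is a sketch, not Lynkr's code: the `PromptCache` class, its method names, and what gets stored as the cached value are all assumptions; only the defaults (64 entries, 5 minutes) come from the configuration below.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class PromptCache:
    """LRU cache with per-entry TTL, keyed by the SHA-256 of the prompt."""

    def __init__(self, max_entries: int = 64, ttl_ms: int = 300_000,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_entries = max_entries
        self._ttl_s = ttl_ms / 1000
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: OrderedDict = OrderedDict()  # key -> (expires_at, value)

    @staticmethod
    def key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> Optional[Any]:
        k = self.key(prompt)
        entry = self._entries.get(k)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[k]        # expired: treat as a miss
            return None
        self._entries.move_to_end(k)    # mark as most recently used
        return value

    def put(self, prompt: str, value: Any) -> None:
        k = self.key(prompt)
        self._entries[k] = (self._clock() + self._ttl_s, value)
        self._entries.move_to_end(k)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```

In this sketch a read does not extend an entry's TTL; whether Lynkr refreshes the TTL on a hit is not stated in the source.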
Configuration:
# Enable prompt caching (default: enabled)
PROMPT_CACHE_ENABLED=true
# Cache TTL in milliseconds (default: 300000 = 5 minutes)
PROMPT_CACHE_TTL_MS=300000
# Max cached entries (default: 64)
PROMPT_CACHE_MAX_ENTRIES=64

Problem: Duplicate memories inject redundant context.
Solution: Deduplicate memories before injection.
How it works:
- Track last N memories injected
- Skip if same memory was in last 5 requests
- Only inject novel context
Example:
Original: 5 memories × 200 tokens × 10 requests = 10,000 tokens
With dedup: 5 memories × 200 tokens + 3 new × 200 = 1,600 tokens
Savings: 84% (8,400 tokens saved)
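
A lookback window like this can be sketched with a bounded deque. The `MemoryDeduplicator` class below is hypothetical, and it assumes memories can be compared as plain strings and that the window tracks what was actually injected (not every candidate); the source does not specify either detail.

```python
from collections import deque


class MemoryDeduplicator:
    """Skip memories that were already injected within the last N requests."""

    def __init__(self, lookback: int = 5) -> None:
        # One set of injected memories per recent request; old sets fall off.
        self._recent: deque = deque(maxlen=lookback)

    def filter(self, memories: list[str]) -> list[str]:
        seen = set().union(*self._recent)
        novel = [m for m in memories if m not in seen]
        self._recent.append(set(novel))  # record only what was injected
        return novel


dedup = MemoryDeduplicator(lookback=2)
first = dedup.filter(["uses pnpm", "prefers tabs"])   # both are new
second = dedup.filter(["uses pnpm", "targets Node 20"])  # only the second is new
```

With this policy a memory becomes eligible for re-injection once it has aged out of the window, so long sessions still see it occasionally.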
Configuration:
# Enable memory deduplication (default: enabled)
MEMORY_DEDUP_ENABLED=true
# Lookback window for dedup (default: 5)
MEMORY_DEDUP_LOOKBACK=5

Problem: Long tool outputs (file contents, bash output) waste tokens.
Solution: Intelligently truncate tool responses.
How it works:
- File Read: Limit to 2,000 lines
- Bash output: Limit to 1,000 lines
- Keep most relevant portions
- Add truncation indicator
Example:
Original file read: 10,000 lines = 50,000 tokens
Truncated: 2,000 lines = 10,000 tokens
Savings: 80% (40,000 tokens saved)
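
A line-capped truncation with an indicator might look like the sketch below. It keeps the head of the output, which is an assumption: the source says the "most relevant portions" are kept but does not say how relevance is decided. The function name and the indicator text are also illustrative.

```python
def truncate_lines(text: str, max_lines: int) -> str:
    """Keep the first max_lines lines and append a truncation indicator."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text  # short outputs pass through untouched
    dropped = len(lines) - max_lines
    kept = lines[:max_lines]
    kept.append(f"[... truncated {dropped} more lines ...]")
    return "\n".join(kept)


# Limits taken from the list above.
FILE_READ_MAX_LINES = 2000
BASH_MAX_LINES = 1000

big_file = "\n".join(f"line {i}" for i in range(10_000))
truncated = truncate_lines(big_file, FILE_READ_MAX_LINES)
```

The indicator matters: without it the model cannot tell a short file from a truncated one and may reason about code that is not there.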
Configuration:
# Automatic - no configuration needed
# Built into Read and Bash tools

Problem: Long system prompts for simple queries.
Solution: Adapt prompt complexity to request type.
How it works:
- Simple chat: Minimal system prompt (500 tokens)
- File operations: Medium prompt (1,000 tokens)
- Complex multi-tool: Full prompt (2,000 tokens)
Example:
10 simple queries with full prompt: 10 × 2,000 = 20,000 tokens
10 simple queries with minimal: 10 × 500 = 5,000 tokens
Savings: 75% (15,000 tokens saved)
Configuration:
# Automatic - no configuration needed
# Lynkr detects request complexity

Problem: Long conversation history accumulates tokens, especially with repetitive tool outputs.
Solution: Compress old messages using Distill algorithms while keeping recent ones detailed.
How it works:
- Last 5 messages: Full detail
- Messages 6-20: Summarized
- Messages 21+: Archived (not sent)
- Distill structural dedup: Repetitive tool results across history are collapsed
- Delta rendering: Sequential similar tool outputs show only changes
- ANSI/whitespace normalization: Cleans up noisy terminal output
Distill Algorithms (ported from samuelfaj/distill):
| Algorithm | What it does | Savings |
|---|---|---|
| Structural similarity | Jaccard index on normalized line signatures — detects near-duplicate tool results | 30-50% on repetitive outputs |
| Delta rendering | Only sends added/removed lines between sequential results | 60-90% when re-reading same files |
| Block deduplication | Collapses consecutive similar sections within a single output | 20-40% on verbose logs |
| Bad distillation detection | Prevents compression when it would lose too much information | Quality guard |
| Text normalization | Strips ANSI codes, normalizes whitespace and line endings | 5-10% on terminal output |
Example:
20-turn conversation without compression: 100,000 tokens
With Distill compression: 20,000 tokens
- Old messages summarized: -60,000 tokens
- Duplicate tool results collapsed: -15,000 tokens
- Delta rendering on re-reads: -5,000 tokens
Savings: 80% (80,000 tokens saved)
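
To make the structural-similarity and delta-rendering ideas concrete, here is a small Python sketch. It is not the Distill port itself: it assumes a "line signature" is just the line with ANSI codes stripped and whitespace collapsed, and it uses the standard library's `difflib.ndiff` as a stand-in for whatever diffing the real pipeline uses.

```python
import difflib
import re

# Matches common ANSI escape sequences such as colour codes.
ANSI = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def normalize(text: str) -> list[str]:
    """Strip ANSI codes, unify line endings, collapse whitespace, drop blanks."""
    lines = []
    for line in text.replace("\r\n", "\n").split("\n"):
        line = " ".join(ANSI.sub("", line).split())
        if line:
            lines.append(line)
    return lines


def jaccard(a: str, b: str) -> float:
    """Jaccard index over the sets of normalized lines of two tool results."""
    sa, sb = set(normalize(a)), set(normalize(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def delta(old: str, new: str) -> list[str]:
    """Only the added and removed lines between two sequential results."""
    diff = difflib.ndiff(normalize(old), normalize(new))
    return [line for line in diff if line.startswith(("+ ", "- "))]


run_1 = "PASS a\nPASS b\nFAIL c"
run_2 = "PASS a\nPASS b\nPASS c"
similarity = jaccard(run_1, run_2)
changes = delta(run_1, run_2)
```

A pipeline along these lines would compare each new tool result with the previous one and, above some similarity threshold, forward `changes` instead of the full output.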
Configuration:
# Automatic - no configuration needed
# Distill algorithms are built into the compression pipeline
HISTORY_COMPRESSION_ENABLED=true # Enable conversation compression (default: true)
HISTORY_KEEP_RECENT_TURNS=10 # Keep last N turns verbatim (default: 10)
HISTORY_SUMMARIZE_OLDER=true # Summarize older turns (default: true)

Problem: Tool results dominate agentic token usage. A single grep, test run, git diff, or JSON API response can be thousands of tokens, most of it boilerplate the model doesn't need to reason over.
Lynkr compresses tool_result blocks in-process before forwarding (no added latency), via two complementary mechanisms.
The first mechanism, tool-result compression (RTK), detects the shape of a tool result and rewrites it to a compact, information-preserving summary. Each detector only fires when it recognizes the format; unrecognized text passes through unchanged.
| Detector | What it compresses | Example outcome |
|---|---|---|
| test_output | jest/vitest/pytest/cargo/go test logs | Keep the summary line + failures, drop passing-test noise |
| git_diff | git diff | Per-file +adds/-dels with capped change lines |
| git_status | git status | Branch + staged/modified/untracked lists |
| git_log | git log | One line per commit (<sha7> <subject> (author, date)) |
| lint_output | eslint/tsc/ruff/clippy/biome | Counts grouped by rule, not every occurrence |
| build_output | npm/cargo/webpack | Errors + capped warnings + success line |
| container_output | docker/kubectl tables | Header + first N rows + "+M more" |
| json_response | large JSON objects | Structural skeleton (search/fetch results preserved) |
| grep_output | grep/rg (file:line:content) | Grouped by file, capped at 10 matches/file |
| directory_listing | ls/find/tree | Grouped by directory with counts |
| large_file | long source files | Imports + signatures skeleton |
| dedup_log | repetitive logs | Collapses consecutive duplicate lines |
| smart_truncate | very long unmatched output | Keeps head + tail, drops the middle |
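
As one worked example of a detector, the grep case (group by file, cap at 10 matches per file) could be sketched like this. The function name, the output layout, and the "+N more" line are illustrative assumptions; only the grouping rule and the cap come from the table.

```python
from collections import defaultdict


def compress_grep(output: str, max_per_file: int = 10) -> str:
    """Group file:line:content matches by file and cap matches per file."""
    groups: dict = defaultdict(list)
    other = []  # lines that do not look like grep matches
    for line in output.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3 and parts[1].isdigit():
            path, lineno, content = parts
            groups[path].append(f"  {lineno}: {content.strip()}")
        else:
            other.append(line)
    if not groups:
        return output  # format not recognized: pass through unchanged

    out = []
    for path, matches in groups.items():
        out.append(f"{path} [{len(matches)}]")
        out.extend(matches[:max_per_file])
        extra = len(matches) - max_per_file
        if extra > 0:
            out.append(f"  +{extra} more")
    return "\n".join(out + other)


raw = "\n".join([f"src/a.py:{i}:x = {i}" for i in range(1, 13)]
                + ["src/b.py:3:y = 1"])
compact = compress_grep(raw)
```

The saving comes from printing each path once instead of on every match line, plus the per-file cap on noisy files.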
Tier-aware thresholds — compression only kicks in above a size that scales with the routing tier, so cheap models get aggressive compression and reasoning models get the full picture:
| Tier | Compress if result exceeds |
|---|---|
| SIMPLE | 300 chars |
| MEDIUM | 800 chars |
| COMPLEX | 2,000 chars |
| REASONING | never |
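
The threshold table reduces to a small lookup. In this sketch the names `THRESHOLDS` and `should_compress` are made up for illustration, and "exceeds" is read as strictly greater than the character count in the table.

```python
from typing import Optional

# Character thresholds from the table above; None means "never compress".
THRESHOLDS: dict[str, Optional[int]] = {
    "SIMPLE": 300,
    "MEDIUM": 800,
    "COMPLEX": 2000,
    "REASONING": None,
}


def should_compress(tier: str, result: str) -> bool:
    """True when a tool result is large enough to compress for this tier."""
    limit = THRESHOLDS[tier]
    return limit is not None and len(result) > limit


decision = should_compress("MEDIUM", "x" * 1500)
```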
Lossless recovery (tee): the full original is stashed for 5 minutes and a pointer ([full: tee_…]) is appended to the compressed result. The model — or you — can fetch the original via GET /tee/:id if the detail is actually needed.
Always on (no configuration). Metrics: GET /metrics/tool-compression.
The second mechanism is TOON. For large JSON tool results (arrays of objects, API payloads), TOON re-encodes the structure into a far denser representation than pretty-printed JSON: an 87.6% reduction on a 60-item grep array in benchmarks (3,458 → 427 tokens). Plain text and small payloads are left untouched.
TOON_ENABLED=true # opt-in (default: false)
TOON_MIN_BYTES=4096 # only compress payloads larger than this
TOON_FAIL_OPEN=true # on any encode error, forward the original (default: true)
TOON_LOG_STATS=true # log per-call compression stats

Problem: Even with all other optimizations, large requests can still exceed context limits.
Solution: Headroom is a Python sidecar that applies ML-based compression.
How it works:
- Smart Crusher: Statistical JSON field compression
- Cache Aligner: Stabilizes dynamic content for provider cache hits
- CCR: Reversible compression with on-demand retrieval
- Rolling Window: Token budget enforcement
- LLMLingua (optional): BERT-based 20x compression
Auto-rebuild: When you run npm start, Lynkr automatically rebuilds the Headroom Docker image if source files changed — ensuring you always run the latest code.
Configuration:
HEADROOM_ENABLED=true
# See headroom.md for full configuration reference

When all phases work together:
Example Request Flow:
- Original request: 50,000 input tokens
  - System prompt: 2,000 tokens
  - Tools: 4,500 tokens (30 tools)
  - Memories: 1,000 tokens (5 memories)
  - Conversation: 20,000 tokens (20 messages)
  - User query: 22,500 tokens
- After optimization: 28,150 input tokens
  - System prompt: 0 tokens (cache hit)
  - Tools: 450 tokens (3 relevant tools)
  - Memories: 200 tokens (deduplicated)
  - Conversation: 5,000 tokens (compressed)
  - User query: 22,500 tokens (same)
- Savings: about 44% reduction (21,850 tokens saved)
# Check metrics endpoint
curl http://localhost:8081/metrics | grep lynkr_tokens
# Output:
# lynkr_tokens_input_total{provider="databricks"} 1234567
# lynkr_tokens_output_total{provider="databricks"} 234567
# lynkr_tokens_cached_total 500000

# Enable token logging
LOG_LEVEL=info
# Logs show:
# {"level":"info","tokens":{"input":1250,"output":234,"cached":750}}

# All optimizations are enabled by default
# No configuration needed

# Route simple requests to free Ollama, complex to cloud
# Set all 4 TIER_* env vars to enable tier-based routing
TIER_SIMPLE=ollama:llama3.2
TIER_MEDIUM=openrouter:openai/gpt-4o-mini
TIER_COMPLEX=azure-openai:gpt-4o
TIER_REASONING=azure-openai:gpt-4o
FALLBACK_ENABLED=true
FALLBACK_PROVIDER=databricks

# Check cache hit rate
curl http://localhost:8081/metrics | grep cache_hits
# Adjust cache size if needed
PROMPT_CACHE_MAX_ENTRIES=128 # Increase for more caching

Calculate your potential savings:
Formula:
Monthly Requests = 100,000
Avg Input Tokens = 50,000
Avg Output Tokens = 2,000
Cost per 1M Input = $3.00
Cost per 1M Output = $15.00
Without Lynkr:
Input Cost = (100,000 × 50,000 ÷ 1,000,000) × $3 = $15,000
Output Cost = (100,000 × 2,000 ÷ 1,000,000) × $15 = $3,000
Total = $18,000/month
With Lynkr (60% savings):
Total = $7,200/month
Savings = $10,800/month = $129,600/year
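
The same formula as a small Python helper, so you can plug in your own numbers. The function name and parameter names are illustrative; like the worked example above, it applies the savings rate to the whole bill, which is a simplification.

```python
def monthly_cost(requests: int, input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Monthly spend in dollars for a given request volume and token profile."""
    input_cost = requests * input_tokens / 1_000_000 * input_price_per_m
    output_cost = requests * output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Values from the worked example above.
baseline = monthly_cost(100_000, 50_000, 2_000, 3.00, 15.00)
savings_rate = 0.60
with_lynkr = baseline * (1 - savings_rate)
monthly_savings = baseline - with_lynkr
annual_savings = monthly_savings * 12

print(f"Without Lynkr: ${baseline:,.0f}/month")
print(f"With Lynkr:    ${with_lynkr:,.0f}/month")
print(f"Savings:       ${monthly_savings:,.0f}/month = ${annual_savings:,.0f}/year")
```

Replace `savings_rate` with a figure measured on your own traffic (for example, from the metrics endpoint) rather than relying on the 60% used here.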
Your numbers:
- Monthly requests: _____
- Avg input tokens: _____
- Avg output tokens: _____
- Provider cost: _____
Result: $_____ saved per year
- Installation Guide - Install Lynkr
- Provider Configuration - Configure providers
- Production Guide - Deploy to production
- FAQ - Common questions
- GitHub Discussions - Ask questions
- GitHub Issues - Report issues