Token Optimization Guide

Comprehensive guide to Lynkr's token optimization strategies — benchmarked on real agentic coding workloads.

Overview

Lynkr reduces tokens sent to the model through multiple independent mechanisms. Benchmarked results on Claude Code / Cursor sessions:

Optimization	Measured Reduction	Scenario
Smart tool selection	47–60%	14-tool request (read or write task)
TOON JSON compression	87.6%	Large grep/file-read tool result (60-item array)
Tool-result compression (RTK)	up to 87.6%	grep/test/git/lint/build/log/JSON tool output
Semantic cache	100% on hit, 171ms	Paraphrased repeat query
MCP Code Mode	96%	100+ MCP tool schemas → 4 meta-tools
History compression	up to 80%	Long multi-turn sessions

At 100,000 requests/month on a tool-heavy agentic workload, this translates to $77k–$115k annual savings.

Benchmarked Savings Breakdown

Measured on identical prompts, same backend provider (June 2026):

Scenario	Tokens without Lynkr	Tokens with Lynkr	Reduction
14-tool read request	1,042	547	47%
14-tool write request	1,043	412	60%
JSON grep result (60 items)	3,458	427	87.6%
Semantic cache (2nd call)	2,857	0	100%

Estimated Savings at Scale

Scenario: 100,000 requests/month, 50k input tokens, 2k output tokens per request

Provider	Without Lynkr	With Lynkr	Monthly Savings	Annual Savings
Claude Sonnet 4.5	$16,000	$6,400	$9,600	$115,200
GPT-4o	$12,000	$4,800	$7,200	$86,400
Ollama (Local)	API costs	$0	$12,000+	$144,000+

Optimization Phases

Phase 0: MCP Code Mode (96% reduction for MCP tools)

Problem: Sending 100+ MCP tool schemas consumes massive tokens (~17,500 tokens).

Solution: Replace all MCP tool schemas with 4 meta-tools that enable lazy tool discovery.

How it works:

Without Code Mode: Every MCP tool schema sent on every request
With Code Mode: Only 4 meta-tools sent (~700 tokens)
- mcp_list_tools → Discover available tools (compact listing)
- mcp_tool_info → Load full schema for one specific tool
- mcp_tool_docs → Get usage examples + parameters
- mcp_execute → Execute a tool by name with JSON args

Example workflow:

Turn 1: mcp_list_tools({ server_id: "github" })
  → Returns: ["create_issue", "list_prs", "merge_pr", ...]

Turn 2: mcp_tool_info({ server_id: "github", tool_name: "create_issue" })
  → Returns: { inputSchema: { title: string, body: string, ... } }

Turn 3: mcp_execute({
    server_id: "github",
    tool_name: "create_issue",
    arguments: { title: "Bug", body: "..." }
  })

Token savings:

Without Code Mode: 100 tools × 175 tokens = 17,500 tokens
With Code Mode: 4 meta-tools × 175 tokens = 700 tokens
Savings: 96% (16,800 tokens saved)

Trade-off: Requires 3 sequential tool calls (discover → inspect → execute) instead of 1 direct call. This adds latency but saves massive context in MCP-heavy setups.

Configuration:

# Enable MCP Code Mode
CODE_MODE_ENABLED=true

# Tool list cache TTL in milliseconds (default: 60000 = 1 minute)
CODE_MODE_CACHE_TTL=60000

Inspired by: Bifrost's Code Mode architecture.

Phase 1: Smart Tool Selection (47–60% measured reduction)

Problem: Sending all tool schemas on every request wastes tokens. A read-only query doesn't need Write, Edit, Bash, or Git schemas.

Solution: Classifies each request and strips irrelevant tool definitions before forwarding.

How it works:

Chat queries → Only Read tool
File operations → Read, Write, Edit tools
Git operations → git_* tools
Code execution → Bash tool

Benchmarked on 14-tool Claude Code session:

Read task:  1,042 tokens raw → 547 tokens after selection  (−47%)
Write task: 1,043 tokens raw → 412 tokens after selection  (−60%)

Configuration:

# Automatic - no configuration needed
# Lynkr detects request type and filters tools

Phase 2: Prompt Caching (30-45% reduction)

Problem: Repeated system prompts consume tokens.

Solution: Cache and reuse prompts across requests.

How it works:

SHA-256 hash of prompt
LRU cache with TTL (default: 5 minutes)
Cache hit = free tokens

Example:

First request: 2,000 token system prompt
Subsequent requests: 0 tokens (cache hit)
10 requests: Save 18,000 tokens (90% reduction)

Configuration:

# Enable prompt caching (default: enabled)
PROMPT_CACHE_ENABLED=true

# Cache TTL in milliseconds (default: 300000 = 5 minutes)
PROMPT_CACHE_TTL_MS=300000

# Max cached entries (default: 64)
PROMPT_CACHE_MAX_ENTRIES=64

Phase 3: Memory Deduplication (20-30% reduction)

Problem: Duplicate memories inject redundant context.

Solution: Deduplicate memories before injection.

How it works:

Track last N memories injected
Skip if same memory was in last 5 requests
Only inject novel context

Example:

Original: 5 memories × 200 tokens × 10 requests = 10,000 tokens
With dedup: 5 memories × 200 tokens + 3 new × 200 = 1,600 tokens
Savings: 84% (8,400 tokens saved)

Configuration:

# Enable memory deduplication (default: enabled)
MEMORY_DEDUP_ENABLED=true

# Lookback window for dedup (default: 5)
MEMORY_DEDUP_LOOKBACK=5

Phase 4: Tool Response Truncation (15-25% reduction)

Problem: Long tool outputs (file contents, bash output) waste tokens.

Solution: Intelligently truncate tool responses.

How it works:

File Read: Limit to 2,000 lines
Bash output: Limit to 1,000 lines
Keep most relevant portions
Add truncation indicator

Example:

Original file read: 10,000 lines = 50,000 tokens
Truncated: 2,000 lines = 10,000 tokens
Savings: 80% (40,000 tokens saved)

Configuration:

# Automatic - no configuration needed
# Built into Read and Bash tools

Phase 5: Dynamic System Prompts (10-20% reduction)

Problem: Long system prompts for simple queries.

Solution: Adapt prompt complexity to request type.

How it works:

Simple chat: Minimal system prompt (500 tokens)
File operations: Medium prompt (1,000 tokens)
Complex multi-tool: Full prompt (2,000 tokens)

Example:

10 simple queries with full prompt: 10 × 2,000 = 20,000 tokens
10 simple queries with minimal: 10 × 500 = 5,000 tokens
Savings: 75% (15,000 tokens saved)

Configuration:

# Automatic - no configuration needed
# Lynkr detects request complexity

Phase 6: Conversation Compression with Distill (20-40% reduction)

Problem: Long conversation history accumulates tokens, especially with repetitive tool outputs.

Solution: Compress old messages using Distill algorithms while keeping recent ones detailed.

How it works:

Last 5 messages: Full detail
Messages 6-20: Summarized
Messages 21+: Archived (not sent)
Distill structural dedup: Repetitive tool results across history are collapsed
Delta rendering: Sequential similar tool outputs show only changes
ANSI/whitespace normalization: Cleans up noisy terminal output

Distill Algorithms (ported from samuelfaj/distill):

Algorithm	What it does	Savings
Structural similarity	Jaccard index on normalized line signatures — detects near-duplicate tool results	30-50% on repetitive outputs
Delta rendering	Only sends added/removed lines between sequential results	60-90% when re-reading same files
Block deduplication	Collapses consecutive similar sections within a single output	20-40% on verbose logs
Bad distillation detection	Prevents compression when it would lose too much information	Quality guard
Text normalization	Strips ANSI codes, normalizes whitespace and line endings	5-10% on terminal output

Example:

20-turn conversation without compression: 100,000 tokens
With Distill compression: 20,000 tokens
  - Old messages summarized: -60,000 tokens
  - Duplicate tool results collapsed: -15,000 tokens
  - Delta rendering on re-reads: -5,000 tokens
Savings: 80% (80,000 tokens saved)

Configuration:

# Automatic - no configuration needed
# Distill algorithms are built into the compression pipeline
HISTORY_COMPRESSION_ENABLED=true     # Enable conversation compression (default: true)
HISTORY_KEEP_RECENT_TURNS=10         # Keep last N turns verbatim (default: 10)
HISTORY_SUMMARIZE_OLDER=true         # Summarize older turns (default: true)

Phase 7: Tool-Result Compression (up to 87.6% on tool output)

Problem: Tool results dominate agentic token usage. A single grep, test run, git diff, or JSON API response can be thousands of tokens — most of it boilerplate the model doesn't need to reason over.

Lynkr compresses tool_result blocks in-process before forwarding (no added latency), via two complementary mechanisms.

7a. RTK pattern compression

Detects the shape of a tool result and rewrites it to a compact, information-preserving summary. Each detector only fires when it recognizes the format; unrecognized text passes through unchanged.

Detector	What it compresses	Example outcome
`test_output`	jest/vitest/pytest/cargo/go test logs	Keep the summary line + failures, drop passing-test noise
`git_diff`	`git diff`	Per-file `+adds/-dels` with capped change lines
`git_status`	`git status`	Branch + staged/modified/untracked lists
`git_log`	`git log`	One line per commit (`<sha7> <subject> (author, date)`)
`lint_output`	eslint/tsc/ruff/clippy/biome	Counts grouped by rule, not every occurrence
`build_output`	npm/cargo/webpack	Errors + capped warnings + success line
`container_output`	docker/kubectl tables	Header + first N rows + “+M more”
`json_response`	large JSON objects	Structural skeleton (search/fetch results preserved)
`grep_output`	`grep`/`rg` (`file:line:content`)	Grouped by file, capped at 10 matches/file
`directory_listing`	`ls`/`find`/`tree`	Grouped by directory with counts
`large_file`	long source files	Imports + signatures skeleton
`dedup_log`	repetitive logs	Collapses consecutive duplicate lines
`smart_truncate`	very long unmatched output	Keeps head + tail, drops the middle

Tier-aware thresholds — compression only kicks in above a size that scales with the routing tier, so cheap models get aggressive compression and reasoning models get the full picture:

Tier	Compress if result exceeds
SIMPLE	300 chars
MEDIUM	800 chars
COMPLEX	2,000 chars
REASONING	never

Lossless recovery (tee): the full original is stashed for 5 minutes and a pointer ([full: tee_…]) is appended to the compressed result. The model — or you — can fetch the original via GET /tee/:id if the detail is actually needed.

Always on (no configuration). Metrics: GET /metrics/tool-compression.

7b. TOON compression (binary JSON encoding)

For large JSON tool results (arrays of objects, API payloads), TOON re-encodes the structure into a far denser representation than pretty-printed JSON — 87.6% reduction on a 60-item grep array in benchmarks. Plain text and small payloads are left untouched.

TOON_ENABLED=true        # opt-in (default: false)
TOON_MIN_BYTES=4096      # only compress payloads larger than this
TOON_FAIL_OPEN=true      # on any encode error, forward the original (default: true)
TOON_LOG_STATS=true      # log per-call compression stats

Phase 8: Headroom Context Compression (Optional, 47-92% reduction)

Problem: Even with all other optimizations, large requests can still exceed context limits.

Solution: Headroom is a Python sidecar that applies ML-based compression.

How it works:

Smart Crusher: Statistical JSON field compression
Cache Aligner: Stabilizes dynamic content for provider cache hits
CCR: Reversible compression with on-demand retrieval
Rolling Window: Token budget enforcement
LLMLingua (optional): BERT-based 20x compression

Auto-rebuild: When you run npm start, Lynkr automatically rebuilds the Headroom Docker image if source files changed — ensuring you always run the latest code.

Configuration:

HEADROOM_ENABLED=true
# See headroom.md for full configuration reference

Combined Savings

When all phases work together:

Example Request Flow:

Original request: 50,000 input tokens
- System prompt: 2,000 tokens
- Tools: 4,500 tokens (30 tools)
- Memories: 1,000 tokens (5 memories)
- Conversation: 20,000 tokens (20 messages)
- User query: 22,500 tokens
After optimization: 12,500 input tokens
- System prompt: 0 tokens (cache hit)
- Tools: 450 tokens (3 relevant tools)
- Memories: 200 tokens (deduplicated)
- Conversation: 5,000 tokens (compressed)
- User query: 22,500 tokens (same)
Savings: 75% reduction (37,500 tokens saved)

Monitoring Token Usage

Real-Time Tracking

# Check metrics endpoint
curl http://localhost:8081/metrics | grep lynkr_tokens

# Output:
# lynkr_tokens_input_total{provider="databricks"} 1234567
# lynkr_tokens_output_total{provider="databricks"} 234567
# lynkr_tokens_cached_total 500000

Per-Request Logging

# Enable token logging
LOG_LEVEL=info

# Logs show:
# {"level":"info","tokens":{"input":1250,"output":234,"cached":750}}

Best Practices

1. Enable All Optimizations

# All optimizations are enabled by default
# No configuration needed

2. Use Tier-Based Routing

# Route simple requests to free Ollama, complex to cloud
# Set all 4 TIER_* env vars to enable tier-based routing
TIER_SIMPLE=ollama:llama3.2
TIER_MEDIUM=openrouter:openai/gpt-4o-mini
TIER_COMPLEX=azure-openai:gpt-4o
TIER_REASONING=azure-openai:gpt-4o
FALLBACK_ENABLED=true
FALLBACK_PROVIDER=databricks

3. Monitor and Tune

# Check cache hit rate
curl http://localhost:8081/metrics | grep cache_hits

# Adjust cache size if needed
PROMPT_CACHE_MAX_ENTRIES=128  # Increase for more caching

ROI Calculator

Calculate your potential savings:

Formula:

Monthly Requests = 100,000
Avg Input Tokens = 50,000
Avg Output Tokens = 2,000
Cost per 1M Input = $3.00
Cost per 1M Output = $15.00

Without Lynkr:
Input Cost = (100,000 × 50,000 ÷ 1,000,000) × $3 = $15,000
Output Cost = (100,000 × 2,000 ÷ 1,000,000) × $15 = $3,000
Total = $18,000/month

With Lynkr (60% savings):
Total = $7,200/month

Savings = $10,800/month = $129,600/year

Your numbers:

Monthly requests: _____
Avg input tokens: _____
Avg output tokens: _____
Provider cost: _____

Result: $_____ saved per year

Next Steps

Installation Guide - Install Lynkr
Provider Configuration - Configure providers
Production Guide - Deploy to production
FAQ - Common questions

Getting Help

GitHub Discussions - Ask questions
GitHub Issues - Report issues

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Token Optimization Guide

Overview

Benchmarked Savings Breakdown

Estimated Savings at Scale

Optimization Phases

Phase 0: MCP Code Mode (96% reduction for MCP tools)

Phase 1: Smart Tool Selection (47–60% measured reduction)

Phase 2: Prompt Caching (30-45% reduction)

Phase 3: Memory Deduplication (20-30% reduction)

Phase 4: Tool Response Truncation (15-25% reduction)

Phase 5: Dynamic System Prompts (10-20% reduction)

Phase 6: Conversation Compression with Distill (20-40% reduction)

Phase 7: Tool-Result Compression (up to 87.6% on tool output)

7a. RTK pattern compression

7b. TOON compression (binary JSON encoding)

Phase 8: Headroom Context Compression (Optional, 47-92% reduction)

Combined Savings

Monitoring Token Usage

Real-Time Tracking

Per-Request Logging

Best Practices

1. Enable All Optimizations

2. Use Tier-Based Routing

3. Monitor and Tune

ROI Calculator

Next Steps

Getting Help

Uh oh!

FilesExpand file tree

token-optimization.md

Latest commit

History

token-optimization.md

File metadata and controls

Token Optimization Guide

Overview

Benchmarked Savings Breakdown

Estimated Savings at Scale

Optimization Phases

Phase 0: MCP Code Mode (96% reduction for MCP tools)

Phase 1: Smart Tool Selection (47–60% measured reduction)

Phase 2: Prompt Caching (30-45% reduction)

Phase 3: Memory Deduplication (20-30% reduction)

Phase 4: Tool Response Truncation (15-25% reduction)

Phase 5: Dynamic System Prompts (10-20% reduction)

Phase 6: Conversation Compression with Distill (20-40% reduction)

Phase 7: Tool-Result Compression (up to 87.6% on tool output)

7a. RTK pattern compression

7b. TOON compression (binary JSON encoding)

Phase 8: Headroom Context Compression (Optional, 47-92% reduction)

Combined Savings

Monitoring Token Usage

Real-Time Tracking

Per-Request Logging

Best Practices

1. Enable All Optimizations

2. Use Tier-Based Routing

3. Monitor and Tune

ROI Calculator

Next Steps

Getting Help