Token Optimization Guide

Comprehensive guide to Lynkr's token optimization strategies — benchmarked on real agentic coding workloads.


Overview

Lynkr reduces tokens sent to the model through multiple independent mechanisms. Benchmarked results on Claude Code / Cursor sessions:

| Optimization | Measured Reduction | Scenario |
|---|---|---|
| Smart tool selection | 47–60% | 14-tool request (read or write task) |
| TOON JSON compression | 87.6% | Large grep/file-read tool result (60-item array) |
| Tool-result compression (RTK) | up to 87.6% | grep/test/git/lint/build/log/JSON tool output |
| Semantic cache | 100% on hit, 171ms | Paraphrased repeat query |
| MCP Code Mode | 96% | 100+ MCP tool schemas → 4 meta-tools |
| History compression | up to 80% | Long multi-turn sessions |

At 100,000 requests/month on a tool-heavy agentic workload, this translates to an estimated $86k–$115k in annual savings on cloud APIs (see Estimated Savings at Scale).


Benchmarked Savings Breakdown

Measured on identical prompts, same backend provider (June 2026):

| Scenario | Tokens without Lynkr | Tokens with Lynkr | Reduction |
|---|---|---|---|
| 14-tool read request | 1,042 | 547 | 47% |
| 14-tool write request | 1,043 | 412 | 60% |
| JSON grep result (60 items) | 3,458 | 427 | 87.6% |
| Semantic cache (2nd call) | 2,857 | 0 | 100% |

Estimated Savings at Scale

Scenario: 100,000 requests/month, 50k input tokens, 2k output tokens per request

| Provider | Without Lynkr | With Lynkr | Monthly Savings | Annual Savings |
|---|---|---|---|---|
| Claude Sonnet 4.5 | $16,000 | $6,400 | $9,600 | $115,200 |
| GPT-4o | $12,000 | $4,800 | $7,200 | $86,400 |
| Ollama (Local) | API costs | $0 | $12,000+ | $144,000+ |

Optimization Phases

Phase 0: MCP Code Mode (96% reduction for MCP tools)

Problem: Sending 100+ MCP tool schemas on every request costs roughly 17,500 tokens before the model has read a single line of the conversation.

Solution: Replace all MCP tool schemas with 4 meta-tools that enable lazy tool discovery.

How it works:

  • Without Code Mode: Every MCP tool schema sent on every request
  • With Code Mode: Only 4 meta-tools sent (~700 tokens)
    • mcp_list_tools → Discover available tools (compact listing)
    • mcp_tool_info → Load full schema for one specific tool
    • mcp_tool_docs → Get usage examples + parameters
    • mcp_execute → Execute a tool by name with JSON args

Example workflow:

Turn 1: mcp_list_tools({ server_id: "github" })
  → Returns: ["create_issue", "list_prs", "merge_pr", ...]

Turn 2: mcp_tool_info({ server_id: "github", tool_name: "create_issue" })
  → Returns: { inputSchema: { title: string, body: string, ... } }

Turn 3: mcp_execute({
    server_id: "github",
    tool_name: "create_issue",
    arguments: { title: "Bug", body: "..." }
  })

Token savings:

Without Code Mode: 100 tools × 175 tokens = 17,500 tokens
With Code Mode: 4 meta-tools × 175 tokens = 700 tokens
Savings: 96% (16,800 tokens saved)

Trade-off: Calling a tool takes 3 sequential tool calls (discover → inspect → execute) instead of 1 direct call. This adds latency, but in MCP-heavy setups it removes about 16,800 tokens of schemas from every request.
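
The discover → inspect → execute flow can be sketched in a few lines of Python. This is a minimal illustration, not Lynkr's implementation: the in-memory `REGISTRY`, the schema shapes, and the echo-style `mcp_execute` are assumptions made for the example, and `mcp_tool_docs` is left out for brevity.

```python
# Hypothetical registry: server_id -> tool_name -> full schema.
REGISTRY = {
    "github": {
        "create_issue": {
            "description": "Create an issue",
            "inputSchema": {"title": "string", "body": "string"},
        },
        "list_prs": {
            "description": "List pull requests",
            "inputSchema": {"state": "string"},
        },
    }
}

TOKENS_PER_SCHEMA = 175  # average schema size used in this guide's estimate


def mcp_list_tools(server_id):
    """Compact listing: tool names only, no schemas."""
    return sorted(REGISTRY[server_id])


def mcp_tool_info(server_id, tool_name):
    """Load the full schema for exactly one tool, on demand."""
    return REGISTRY[server_id][tool_name]


def mcp_execute(server_id, tool_name, arguments):
    """Reject unknown argument names, then echo the call (no real side effects)."""
    schema = REGISTRY[server_id][tool_name]["inputSchema"]
    unknown = set(arguments) - set(schema)
    if unknown:
        raise ValueError(f"unknown arguments: {sorted(unknown)}")
    return {"server": server_id, "tool": tool_name, "arguments": arguments}


def schema_tokens(n_tools):
    """Estimated tokens spent on tool schemas per request."""
    return n_tools * TOKENS_PER_SCHEMA


names = mcp_list_tools("github")                      # turn 1: discover
info = mcp_tool_info("github", "create_issue")        # turn 2: inspect
result = mcp_execute("github", "create_issue",
                     {"title": "Bug", "body": "..."})  # turn 3: execute
```

With the guide's estimate of 175 tokens per schema, `schema_tokens(100)` and `schema_tokens(4)` give the 17,500 and 700 token figures used in the savings calculation.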

Configuration:

# Enable MCP Code Mode
CODE_MODE_ENABLED=true

# Tool list cache TTL in milliseconds (default: 60000 = 1 minute)
CODE_MODE_CACHE_TTL=60000

Inspired by: Bifrost's Code Mode architecture.


Phase 1: Smart Tool Selection (47–60% measured reduction)

Problem: Sending all tool schemas on every request wastes tokens. A read-only query doesn't need Write, Edit, Bash, or Git schemas.

Solution: Lynkr classifies each request and strips irrelevant tool definitions before forwarding it.

How it works:

  • Chat queries → Only Read tool
  • File operations → Read, Write, Edit tools
  • Git operations → git_* tools
  • Code execution → Bash tool
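
A rough sketch of this kind of request classification follows. The keyword rules and the `classify` / `select_tools` names are hypothetical stand-ins; the guide does not describe Lynkr's actual classifier, which may work quite differently.

```python
import re

# Hypothetical keyword rules, checked in order; first match wins.
RULES = [
    ("git", re.compile(r"\b(git|commit|branch|merge|rebase)\b", re.I)),
    ("exec", re.compile(r"\b(run|execute|install|build|test)\b", re.I)),
    ("file", re.compile(r"\b(write|edit|create|modify|rename|delete)\b", re.I)),
]

# Which tool names survive for each request category (mirrors the list above).
KEEP = {
    "chat": lambda name: name == "Read",
    "file": lambda name: name in {"Read", "Write", "Edit"},
    "git": lambda name: name.startswith("git_"),
    "exec": lambda name: name == "Bash",
}


def classify(prompt):
    """Return the first matching category, or 'chat' when nothing matches."""
    for category, pattern in RULES:
        if pattern.search(prompt):
            return category
    return "chat"


def select_tools(prompt, tools):
    """Drop tool definitions that the request category does not need."""
    keep = KEEP[classify(prompt)]
    return [tool for tool in tools if keep(tool["name"])]
```

Keyword matching is the simplest possible classifier; a misclassified request loses tools it needs, so a production classifier would likely want a safe fallback that forwards the full tool list.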

Benchmarked on 14-tool Claude Code session:

Read task:  1,042 tokens raw → 547 tokens after selection  (−47%)
Write task: 1,043 tokens raw → 412 tokens after selection  (−60%)

Configuration:

# Automatic - no configuration needed
# Lynkr detects request type and filters tools

Phase 2: Prompt Caching (30-45% reduction)

Problem: Repeated system prompts consume tokens.

Solution: Cache and reuse prompts across requests.

How it works:

  • SHA-256 hash of prompt
  • LRU cache with TTL (default: 5 minutes)
  • Cache hit = free tokens
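
The three ingredients above (SHA-256 key, LRU eviction, TTL expiry) fit together roughly as follows. This is a sketch under assumptions: the `PromptCache` class and its method names are invented for illustration, and only the defaults (64 entries, 5 minutes) come from this guide.

```python
import hashlib
import time
from collections import OrderedDict


class PromptCache:
    """LRU cache with per-entry TTL, keyed by the SHA-256 of the prompt text."""

    def __init__(self, max_entries=64, ttl_ms=300_000, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_s = ttl_ms / 1000
        self.clock = clock              # injectable so expiry can be tested
        self._entries = OrderedDict()   # key -> (expires_at, value)

    @staticmethod
    def key(prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt):
        k = self.key(prompt)
        entry = self._entries.get(k)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[k]        # expired: treat as a miss
            return None
        self._entries.move_to_end(k)    # mark as most recently used
        return value

    def put(self, prompt, value):
        k = self.key(prompt)
        self._entries[k] = (self.clock() + self.ttl_s, value)
        self._entries.move_to_end(k)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```

Hashing the prompt keeps keys a fixed size regardless of prompt length, and the injectable `clock` makes TTL behaviour testable without sleeping.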

Example:

First request: 2,000-token system prompt (cache miss)
Subsequent requests: 0 tokens (cache hit)
Over 10 requests: 2,000 tokens instead of 20,000, saving 18,000 (90% of system-prompt tokens)

Configuration:

# Enable prompt caching (default: enabled)
PROMPT_CACHE_ENABLED=true

# Cache TTL in milliseconds (default: 300000 = 5 minutes)
PROMPT_CACHE_TTL_MS=300000

# Max cached entries (default: 64)
PROMPT_CACHE_MAX_ENTRIES=64

Phase 3: Memory Deduplication (20-30% reduction)

Problem: Duplicate memories inject redundant context.

Solution: Deduplicate memories before injection.

How it works:

  • Track last N memories injected
  • Skip if same memory was in last 5 requests
  • Only inject novel context
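
A lookback window like this can be modelled with a bounded deque. The sketch below assumes memories are identified by hashable ids and that the window tracks what was actually injected; the `MemoryDeduper` name and its interface are illustrative, not Lynkr's API.

```python
from collections import deque


class MemoryDeduper:
    """Skip memories that were already injected within the last `lookback` requests."""

    def __init__(self, lookback=5):
        # One set of injected memory ids per request; the oldest set
        # falls out of the window automatically.
        self._recent = deque(maxlen=lookback)

    def filter(self, memories):
        seen = set().union(*self._recent)
        novel = [m for m in memories if m not in seen]
        self._recent.append(set(novel))
        return novel
```

Because only injected memories are recorded, a memory becomes eligible again once its injection has aged out of the window, so long sessions still get an occasional refresh.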

Example:

Original: 5 memories × 200 tokens × 10 requests = 10,000 tokens
With dedup: 5 memories × 200 tokens + 3 new × 200 = 1,600 tokens
Savings: 84% (8,400 tokens saved)

Configuration:

# Enable memory deduplication (default: enabled)
MEMORY_DEDUP_ENABLED=true

# Lookback window for dedup (default: 5)
MEMORY_DEDUP_LOOKBACK=5

Phase 4: Tool Response Truncation (15-25% reduction)

Problem: Long tool outputs (file contents, bash output) waste tokens.

Solution: Intelligently truncate tool responses.

How it works:

  • File Read: Limit to 2,000 lines
  • Bash output: Limit to 1,000 lines
  • Keep most relevant portions
  • Add truncation indicator
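
One simple way to truncate while keeping a visible indicator is to preserve the head and tail of the output. The guide does not say how Lynkr decides which portions are "most relevant", so the head/tail split, the `truncate_lines` name, and the marker text below are assumptions.

```python
# Line limits taken from the list above.
LIMITS = {"Read": 2000, "Bash": 1000}


def truncate_lines(text, max_lines, head_ratio=0.5):
    """Keep the first and last lines of `text` and mark what was dropped."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    head = int(max_lines * head_ratio)
    tail = max_lines - head
    omitted = len(lines) - max_lines
    marker = f"... [{omitted} lines truncated] ..."
    kept = lines[:head] + [marker] + (lines[-tail:] if tail else [])
    return "\n".join(kept)
```

For example, `truncate_lines(output, LIMITS["Bash"])` would keep the first 500 and last 500 lines of a long command output, which tends to preserve both the invocation context and the final error or summary.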

Example:

Original file read: 10,000 lines = 50,000 tokens
Truncated: 2,000 lines = 10,000 tokens
Savings: 80% (40,000 tokens saved)

Configuration:

# Automatic - no configuration needed
# Built into Read and Bash tools

Phase 5: Dynamic System Prompts (10-20% reduction)

Problem: Long system prompts for simple queries.

Solution: Adapt prompt complexity to request type.

How it works:

  • Simple chat: Minimal system prompt (500 tokens)
  • File operations: Medium prompt (1,000 tokens)
  • Complex multi-tool: Full prompt (2,000 tokens)

Example:

10 simple queries with full prompt: 10 × 2,000 = 20,000 tokens
10 simple queries with minimal: 10 × 500 = 5,000 tokens
Savings: 75% (15,000 tokens saved)

Configuration:

# Automatic - no configuration needed
# Lynkr detects request complexity

Phase 6: Conversation Compression with Distill (20-40% reduction)

Problem: Long conversation history accumulates tokens, especially with repetitive tool outputs.

Solution: Compress old messages using Distill algorithms while keeping recent ones detailed.

How it works:

  • Last 5 messages: Full detail
  • Messages 6-20: Summarized
  • Messages 21+: Archived (not sent)
  • Distill structural dedup: Repetitive tool results across history are collapsed
  • Delta rendering: Sequential similar tool outputs show only changes
  • ANSI/whitespace normalization: Cleans up noisy terminal output

Distill Algorithms (ported from samuelfaj/distill):

| Algorithm | What it does | Savings |
|---|---|---|
| Structural similarity | Jaccard index on normalized line signatures; detects near-duplicate tool results | 30-50% on repetitive outputs |
| Delta rendering | Only sends added/removed lines between sequential results | 60-90% when re-reading same files |
| Block deduplication | Collapses consecutive similar sections within a single output | 20-40% on verbose logs |
| Bad distillation detection | Prevents compression when it would lose too much information | Quality guard |
| Text normalization | Strips ANSI codes, normalizes whitespace and line endings | 5-10% on terminal output |
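
To make the first two algorithms concrete, here is a small sketch of text normalization, Jaccard similarity, and delta rendering. It is not the Distill port itself: treating each normalized line as its own "signature", the 0.8 threshold, and the function names are assumptions for illustration.

```python
import difflib
import re

ANSI = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def normalize(text):
    """Strip ANSI codes, unify line endings, and collapse runs of whitespace."""
    text = ANSI.sub("", text).replace("\r\n", "\n")
    return [re.sub(r"\s+", " ", line).strip() for line in text.split("\n")]


def jaccard(a, b):
    """Jaccard index over the sets of normalized lines in two tool results."""
    sa, sb = set(normalize(a)), set(normalize(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.8):
    """Hypothetical cut-off: results this similar are candidates for collapsing."""
    return jaccard(a, b) >= threshold


def delta(prev, curr):
    """Return only the removed ('- ') and added ('+ ') lines between two results."""
    diff = difflib.ndiff(normalize(prev), normalize(curr))
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

When the same file is re-read after a one-line edit, `delta` returns two short lines instead of the whole file, which is where the large savings on re-reads come from.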

Example:

20-turn conversation without compression: 100,000 tokens
With Distill compression: 20,000 tokens
  - Old messages summarized: -60,000 tokens
  - Duplicate tool results collapsed: -15,000 tokens
  - Delta rendering on re-reads: -5,000 tokens
Savings: 80% (80,000 tokens saved)

Configuration:

# Enabled by default; the Distill algorithms are built into the compression pipeline.
# Optional overrides:
HISTORY_COMPRESSION_ENABLED=true     # Enable conversation compression (default: true)
HISTORY_KEEP_RECENT_TURNS=10         # Keep last N turns verbatim (default: 10)
HISTORY_SUMMARIZE_OLDER=true         # Summarize older turns (default: true)

Phase 7: Tool-Result Compression (up to 87.6% on tool output)

Problem: Tool results dominate agentic token usage. A single grep, test run, git diff, or JSON API response can be thousands of tokens — most of it boilerplate the model doesn't need to reason over.

Lynkr compresses tool_result blocks in-process before forwarding (no added latency), via two complementary mechanisms.

7a. RTK pattern compression

Detects the shape of a tool result and rewrites it to a compact, information-preserving summary. Each detector only fires when it recognizes the format; unrecognized text passes through unchanged.

| Detector | What it compresses | Example outcome |
|---|---|---|
| test_output | jest/vitest/pytest/cargo/go test logs | Keep the summary line + failures, drop passing-test noise |
| git_diff | git diff | Per-file +adds/-dels with capped change lines |
| git_status | git status | Branch + staged/modified/untracked lists |
| git_log | git log | One line per commit (<sha7> <subject> (author, date)) |
| lint_output | eslint/tsc/ruff/clippy/biome | Counts grouped by rule, not every occurrence |
| build_output | npm/cargo/webpack | Errors + capped warnings + success line |
| container_output | docker/kubectl tables | Header + first N rows + “+M more” |
| json_response | large JSON objects | Structural skeleton (search/fetch results preserved) |
| grep_output | grep/rg (file:line:content) | Grouped by file, capped at 10 matches/file |
| directory_listing | ls/find/tree | Grouped by directory with counts |
| large_file | long source files | Imports + signatures skeleton |
| dedup_log | repetitive logs | Collapses consecutive duplicate lines |
| smart_truncate | very long unmatched output | Keeps head + tail, drops the middle |
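
As an illustration of how one detector might behave, the sketch below handles the `grep_output` case: it fires only when every line looks like `file:line:content`, groups matches by file, and caps each file at 10 matches. The regular expression, the output layout, and the `compress_grep` name are assumptions, not Lynkr's actual detector.

```python
import re

# file:line:content, as printed by `grep -n` or `rg` (assumed shape).
GREP_LINE = re.compile(r"^([^:\n]+):(\d+):(.*)$")


def compress_grep(output, max_per_file=10):
    """Group grep matches by file; pass unrecognized text through unchanged."""
    lines = [line for line in output.splitlines() if line.strip()]
    matches = [GREP_LINE.match(line) for line in lines]
    if not lines or not all(matches):
        return output  # detector does not fire
    by_file = {}
    for m in matches:
        by_file.setdefault(m.group(1), []).append((m.group(2), m.group(3).strip()))
    out = []
    for path, hits in by_file.items():
        out.append(f"{path} ({len(hits)} matches)")
        out.extend(f"  {lineno}: {text}" for lineno, text in hits[:max_per_file])
        if len(hits) > max_per_file:
            out.append(f"  +{len(hits) - max_per_file} more")
    return "\n".join(out)
```

The savings come from printing each file path once instead of once per match, plus the per-file cap; the pass-through branch is what keeps the detector safe on output it does not recognize.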

Tier-aware thresholds — compression only kicks in above a size that scales with the routing tier, so cheap models get aggressive compression and reasoning models get the full picture:

| Tier | Compress if result exceeds |
|---|---|
| SIMPLE | 300 chars |
| MEDIUM | 800 chars |
| COMPLEX | 2,000 chars |
| REASONING | never |

Lossless recovery (tee): the full original is stashed for 5 minutes and a pointer ([full: tee_…]) is appended to the compressed result. The model — or you — can fetch the original via GET /tee/:id if the detail is actually needed.
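
The tier thresholds and the tee stash combine naturally into one gate. The sketch below is an assumption-laden illustration: the in-memory store, the id format after `tee_`, and the function names are hypothetical; only the thresholds, the 5-minute lifetime, and the `[full: tee_…]` pointer shape come from this guide.

```python
import time
import uuid

# Character thresholds per routing tier; None means "never compress".
TIER_THRESHOLDS = {"SIMPLE": 300, "MEDIUM": 800, "COMPLEX": 2000, "REASONING": None}
TEE_TTL_S = 300  # originals are kept for 5 minutes

_tee = {}  # tee_id -> (expires_at, original); hypothetical in-memory store


def should_compress(result, tier):
    limit = TIER_THRESHOLDS[tier]
    return limit is not None and len(result) > limit


def stash(original, now=None):
    """Store the uncompressed result and return an id for later retrieval."""
    now = time.monotonic() if now is None else now
    tee_id = "tee_" + uuid.uuid4().hex[:12]  # id format is an assumption
    _tee[tee_id] = (now + TEE_TTL_S, original)
    return tee_id


def fetch(tee_id, now=None):
    """Return the original text, or None if it is unknown or has expired."""
    now = time.monotonic() if now is None else now
    entry = _tee.get(tee_id)
    if entry is None or now >= entry[0]:
        _tee.pop(tee_id, None)
        return None
    return entry[1]


def compress_result(result, tier, compressor):
    """Compress above the tier threshold and append a pointer to the original."""
    if not should_compress(result, tier):
        return result
    return f"{compressor(result)}\n[full: {stash(result)}]"
```

Here `fetch` plays the role that `GET /tee/:id` plays in Lynkr: compression stays lossy on the wire but recoverable on demand.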

Always on (no configuration). Metrics: GET /metrics/tool-compression.

7b. TOON compression (compact text encoding for JSON)

For large JSON tool results (arrays of objects, API payloads), TOON re-encodes the structure into a far denser representation than pretty-printed JSON — 87.6% reduction on a 60-item grep array in benchmarks. Plain text and small payloads are left untouched.

TOON_ENABLED=true        # opt-in (default: false)
TOON_MIN_BYTES=4096      # only compress payloads larger than this
TOON_FAIL_OPEN=true      # on any encode error, forward the original (default: true)
TOON_LOG_STATS=true      # log per-call compression stats
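
The density gain is easiest to see on a uniform array of objects, where the keys can be stated once in a header instead of on every item. The sketch below is a simplified tabular encoding in the spirit of TOON, not a conforming TOON encoder and not Lynkr's code: quoting rules are reduced to the bare minimum, nested values are rejected, and the function names are hypothetical. The wrapper mirrors the `TOON_MIN_BYTES` and `TOON_FAIL_OPEN` settings above.

```python
import json


def _cell(value):
    """Render one scalar; quote strings that would break a comma-separated row."""
    if isinstance(value, (dict, list)):
        raise ValueError("nested values are not handled in this sketch")
    if isinstance(value, str):
        needs_quotes = value == "" or any(c in value for c in ',"\n')
        return json.dumps(value) if needs_quotes else value
    return json.dumps(value)  # numbers, booleans, null


def encode_tabular(name, rows):
    """Encode a uniform list of flat objects as one header line plus one row each."""
    if not isinstance(rows, list) or not rows:
        raise ValueError("expected a non-empty list")
    if not all(isinstance(row, dict) for row in rows):
        raise ValueError("expected a list of objects")
    fields = list(rows[0])
    if any(list(row) != fields for row in rows):
        raise ValueError("rows must share the same keys in the same order")
    header = f"{name}[{len(rows)}]{{{','.join(fields)}}}:"
    body = ["  " + ",".join(_cell(row[f]) for f in fields) for row in rows]
    return "\n".join([header] + body)


def compress_json_result(text, min_bytes=4096, fail_open=True):
    """Re-encode large JSON arrays; small or unparseable payloads pass through."""
    if len(text.encode("utf-8")) < min_bytes:
        return text
    try:
        return encode_tabular("items", json.loads(text))
    except (ValueError, TypeError):
        if fail_open:
            return text
        raise
```

For a two-item array with `file` and `line` keys, `encode_tabular("matches", rows)` yields a header `matches[2]{file,line}:` followed by one indented row per item, so key names and JSON punctuation are paid for once rather than per item.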

Phase 8: Headroom Context Compression (Optional, 47-92% reduction)

Problem: Even with all other optimizations, large requests can still exceed context limits.

Solution: Headroom is a Python sidecar that applies ML-based compression.

How it works:

  • Smart Crusher: Statistical JSON field compression
  • Cache Aligner: Stabilizes dynamic content for provider cache hits
  • CCR: Reversible compression with on-demand retrieval
  • Rolling Window: Token budget enforcement
  • LLMLingua (optional): BERT-based prompt compression, up to 20x

Auto-rebuild: When you run npm start, Lynkr automatically rebuilds the Headroom Docker image if source files changed — ensuring you always run the latest code.

Configuration:

HEADROOM_ENABLED=true
# See headroom.md for full configuration reference

Combined Savings

When all phases work together:

Example Request Flow:

  1. Original request: 50,000 input tokens

    • System prompt: 2,000 tokens
    • Tools: 4,500 tokens (30 tools)
    • Memories: 1,000 tokens (5 memories)
    • Conversation: 20,000 tokens (20 messages)
    • User query: 22,500 tokens
  2. After optimization: 28,150 input tokens

    • System prompt: 0 tokens (cache hit)
    • Tools: 450 tokens (3 relevant tools)
    • Memories: 200 tokens (deduplicated)
    • Conversation: 5,000 tokens (compressed)
    • User query: 22,500 tokens (same)
  3. Savings: 44% reduction (21,850 tokens saved)


Monitoring Token Usage

Real-Time Tracking

# Check metrics endpoint
curl http://localhost:8081/metrics | grep lynkr_tokens

# Output:
# lynkr_tokens_input_total{provider="databricks"} 1234567
# lynkr_tokens_output_total{provider="databricks"} 234567
# lynkr_tokens_cached_total 500000

Per-Request Logging

# Enable token logging
LOG_LEVEL=info

# Logs show:
# {"level":"info","tokens":{"input":1250,"output":234,"cached":750}}

Best Practices

1. Enable All Optimizations

# Most optimizations are enabled by default and need no configuration.
# TOON and Headroom are opt-in:
TOON_ENABLED=true
HEADROOM_ENABLED=true

2. Use Tier-Based Routing

# Route simple requests to free Ollama, complex to cloud
# Set all 4 TIER_* env vars to enable tier-based routing
TIER_SIMPLE=ollama:llama3.2
TIER_MEDIUM=openrouter:openai/gpt-4o-mini
TIER_COMPLEX=azure-openai:gpt-4o
TIER_REASONING=azure-openai:gpt-4o
FALLBACK_ENABLED=true
FALLBACK_PROVIDER=databricks

3. Monitor and Tune

# Check cache hit rate
curl http://localhost:8081/metrics | grep cache_hits

# Adjust cache size if needed
PROMPT_CACHE_MAX_ENTRIES=128  # Increase for more caching

ROI Calculator

Calculate your potential savings:

Formula:

Monthly Requests = 100,000
Avg Input Tokens = 50,000
Avg Output Tokens = 2,000
Cost per 1M Input = $3.00
Cost per 1M Output = $15.00

Without Lynkr:
Input Cost = (100,000 × 50,000 ÷ 1,000,000) × $3 = $15,000
Output Cost = (100,000 × 2,000 ÷ 1,000,000) × $15 = $3,000
Total = $18,000/month

With Lynkr (60% savings):
Total = $7,200/month

Savings = $10,800/month = $129,600/year

Your numbers:

  • Monthly requests: _____
  • Avg input tokens: _____
  • Avg output tokens: _____
  • Provider cost: _____

Result: $_____ saved per year
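
The same formula as a small Python helper, so you can plug in your own numbers. The function names are illustrative; like the formula above, it applies the savings rate to the whole bill, which is a simplification because Lynkr's optimizations mainly reduce input tokens.

```python
def monthly_cost(requests, input_tokens, output_tokens, in_price, out_price):
    """Monthly API cost in dollars; prices are per 1M tokens."""
    input_cost = requests * input_tokens / 1_000_000 * in_price
    output_cost = requests * output_tokens / 1_000_000 * out_price
    return input_cost + output_cost


def roi(requests, input_tokens, output_tokens, in_price, out_price, savings_rate):
    """Return monthly cost without/with Lynkr and the resulting savings."""
    without = monthly_cost(requests, input_tokens, output_tokens, in_price, out_price)
    with_lynkr = without * (1 - savings_rate)
    monthly = without - with_lynkr
    return {
        "without": round(without, 2),
        "with": round(with_lynkr, 2),
        "monthly_savings": round(monthly, 2),
        "annual_savings": round(monthly * 12, 2),
    }


# The worked example above: 100k requests, 50k in / 2k out, $3 / $15 per 1M, 60% savings.
example = roi(100_000, 50_000, 2_000, 3.00, 15.00, 0.60)
```

With the inputs from the worked example, `example` reports $18,000 without Lynkr, $7,200 with Lynkr, and $129,600 in annual savings.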


Next Steps


Getting Help