The @rapid/llm-proxy unified LLM layer consolidates three previously separate LLM abstractions into a single, comprehensive library that provides intelligent routing, resilience, and cost optimization across 14 different LLM providers, including subscription-based providers that eliminate per-token API costs. It is maintained as a standalone package shared with OKB and other projects.
Why it was created:
- Before: 3 separate LLM abstractions (Semantic Analysis, Unified Inference Engine, Semantic Validator)
- Problem: Code duplication, inconsistent provider handling, no shared caching or resilience
- After: Single unified layer (`@rapid/llm-proxy`) with shared infrastructure, tier-based routing, circuit breaker, and LRU cache
Key Features:
- Parallelized copilot-first routing — Copilot scales beautifully with parallelism (0.77s effective per call at 10 concurrent)
- Zero-cost routing via GitHub Copilot and Claude Code subscriptions
- Automatic fallback to paid APIs on quota exhaustion
- Batch-optimized — agents already use `Promise.all` with concurrency 5-20; copilot as primary unlocks peak throughput
- Optimistic quota tracking with exponential backoff
- `LLMService` (Facade)
  - Single entry point for all LLM operations
  - Handles tier-based routing
  - Manages provider selection and fallback
  - Integrates with infrastructure (cache, circuit breaker, metrics)
- `ProviderRegistry`
  - Central registry for all LLM providers
  - Dynamic provider registration and lookup
  - Configuration validation
- Infrastructure Layer
  - Circuit Breaker: Prevents cascading failures (threshold: 5 failures, reset: 60s)
  - LRU Cache: 1000 entries, 1-hour TTL
  - Metrics: Request tracking, cost monitoring, performance stats
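The circuit-breaker policy is simple enough to sketch. The class below is a minimal illustration of the stated behavior (threshold 5, reset after 60s); the class and method names are illustrative, not the library's exported API:

```typescript
// Minimal circuit-breaker sketch matching the stated policy:
// open after 5 consecutive failures, allow a retry after 60s.
class SimpleCircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly threshold: number = 5,
    private readonly resetTimeoutMs: number = 60_000,
  ) {}

  isOpen(): boolean {
    if (this.failures < this.threshold) return false;
    // After the reset timeout, let one trial request through (half-open).
    return Date.now() - this.openedAt < this.resetTimeoutMs;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = Date.now();
  }

  recordSuccess(): void {
    this.failures = 0; // any success closes the circuit again
  }
}
```

Presumably the service keeps one breaker instance per provider, so a failing provider can be skipped temporarily without affecting the rest.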
The unified layer serves three primary consumers:
- `SemanticAnalyzer` (`integrations/mcp-server-semantic-analysis/`)
  - Batch analysis workflows
  - Git history analysis
  - Ontology classification
- `UnifiedInferenceEngine` (shared utility)
  - General-purpose LLM inference
  - Multi-provider support
- `SemanticValidator` (`integrations/mcp-constraint-monitor/`)
  - Constraint violation detection
  - Semantic code analysis
The system supports 14 LLM providers with tier-based model selection:
Claude Code (subscription provider)
CLI Command: `claude`
Cost: $0 per token (uses existing Claude Max subscription)
| Tier | Model | Description |
|---|---|---|
| Fast | sonnet | Claude Sonnet 4.5 (fast tier) |
| Standard | sonnet | Claude Sonnet 4.5 (standard tier) |
| Premium | opus | Claude Opus 4.6 (highest quality) |
Requirements:
- Install Claude Code CLI: https://claude.ai/downloads
- Authenticate: `claude login`
- Verify: `claude --version`
Features:
- Automatic quota tracking with persistent storage
- Exponential backoff on exhaustion (5m → 15m → 1h)
- Seamless fallback to API providers
- From containers: Falls back to LLM Proxy Bridge on `host.docker.internal:12435`
GitHub Copilot (subscription provider)
Method: Direct HTTP POST to Copilot API
Cost: $0 per token (uses existing GitHub Copilot subscription)
| Tier | Model | Description |
|---|---|---|
| Fast | claude-haiku-4.5 | Benchmarked: 5s sequential, 0.77s @10 parallel |
| Standard | claude-sonnet-4.5 | Claude Sonnet 4.5 via Copilot |
| Premium | claude-opus-4.6 | Claude Opus 4.6 via Copilot |
Why Copilot is primary: Performance benchmarks revealed that Copilot API calls scale beautifully with parallelism — 0.77s effective per call at 10 concurrent (vs 5s sequential). Since batch agents already parallelize LLM calls via Promise.all (concurrency 5-20), copilot as the first-choice provider unlocks peak throughput.
Authentication:
- Reads OAuth token from `~/.local/share/opencode/auth.json`
- Direct HTTP POST to OpenAI-compatible Copilot API endpoint
- No CLI tools required
Features:
- Shared quota tracking system
- Automatic provider rotation on exhaustion
- Zero API costs
- From containers: Falls back to LLM Proxy Bridge on `host.docker.internal:12435`
When running inside Docker, host-side tools are unavailable. The LLM Proxy Bridge runs on the host (port 12435) and forwards requests. For Copilot, the proxy bridge reads OAuth tokens from ~/.local/share/opencode/auth.json and makes direct HTTP POST calls to the Copilot API. For Claude Code, it spawns the claude CLI. Each provider automatically detects and uses the proxy during initialization when the LLM_CLI_PROXY_URL environment variable is set.
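A minimal sketch of that container fallback, assuming only the `LLM_CLI_PROXY_URL` variable and the bridge address named above; the request route and payload shape are assumptions, not the bridge's documented API:

```typescript
// Assumption-level sketch of the container fallback. Only the env var name
// and the bridge address come from this document; the route is hypothetical.
const bridgeUrl = process.env.LLM_CLI_PROXY_URL; // e.g. http://host.docker.internal:12435

async function completeViaBridge(payload: unknown): Promise<unknown> {
  if (!bridgeUrl) {
    throw new Error('LLM_CLI_PROXY_URL not set; provider uses host CLI/API directly');
  }
  const res = await fetch(`${bridgeUrl}/v1/chat/completions`, { // hypothetical route
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(payload),
  });
  if (!res.ok) throw new Error(`Proxy bridge returned ${res.status}`);
  return res.json();
}
```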
Groq (API provider)
API Key: GROQ_API_KEY
| Tier | Model | Performance | Cost |
|---|---|---|---|
| Fast | llama-3.1-8b-instant | 750 tok/s | ~$0.05/M tokens |
| Standard | llama-3.3-70b-versatile | 275 tok/s | ~$0.59/M tokens |
| Premium | openai/gpt-oss-120b | - | High |
Anthropic (API provider)
API Key: ANTHROPIC_API_KEY
| Tier | Model | Cost (input/output) |
|---|---|---|
| Fast | claude-haiku-4-5 | $1/$5 per MTok |
| Standard | claude-sonnet-4-5 | $3/$15 per MTok |
| Premium | claude-opus-4-6 | $5/$25 per MTok |
OpenAI (API provider)
API Key: OPENAI_API_KEY
| Tier | Model | Description |
|---|---|---|
| Fast | gpt-4.1-mini | Affordable small model |
| Standard | gpt-4.1 | Latest standard model |
| Premium | o4-mini | Reasoning model |
Google Gemini (API provider)
API Key: GOOGLE_API_KEY
| Tier | Model | Description |
|---|---|---|
| Fast | gemini-2.5-flash | Fast, cost-effective |
| Standard | gemini-2.5-flash | Good balance |
| Premium | gemini-2.5-pro | Deep reasoning |
GitHub Models (API provider)
API Key: GITHUB_TOKEN
Base URL: https://models.github.ai/inference/v1
| Tier | Model |
|---|---|
| Fast | gpt-4.1-mini |
| Standard | gpt-4.1 |
| Premium | o4-mini |
Docker Model Runner (DMR)
Local provider - no API key required
Base URL: http://localhost:12434/engines/v1
- Default Model: `ai/llama3.2`
- Specialized Models:
  - `ai/llama3.2:3B-Q4_K_M` (lightweight tasks)
  - `ai/qwen2.5-coder:7B-Q4_K_M` (code analysis)
Ollama
Local provider - no API key required
- Supports any locally installed Ollama models
- Used as final fallback for local-only mode
Mock
Test/debug mode - no API key required
- Returns simulated responses
- Used for testing and development
The system routes requests to providers based on task complexity and cost optimization:
Fast Tier (zero cost → low cost, high speed)
- Simple extraction and parsing
- Basic classification
- File pattern matching
- Provider Priority: Copilot → Groq → Claude Code → Anthropic → OpenAI → Gemini → GitHub Models
Standard Tier (zero cost → balanced cost/quality)
- Semantic code analysis
- Git history analysis
- Documentation linking
- Ontology classification
- Provider Priority: Copilot → Groq → Claude Code → Anthropic → OpenAI → Gemini → GitHub Models
Premium Tier (zero cost → highest quality)
- Insight generation
- Pattern recognition
- Quality assurance review
- Deep code analysis
- Provider Priority: Copilot → Groq → Claude Code → Anthropic → OpenAI → Gemini → GitHub Models
Tasks are automatically mapped to tiers based on their complexity:
```yaml
# Fast tier examples
- git_file_extraction
- commit_message_parsing
- basic_classification

# Standard tier examples
- git_history_analysis
- semantic_code_analysis
- ontology_classification

# Premium tier examples
- insight_generation
- observation_generation
- pattern_recognition
```

Provider resolution order (sketched in code after this list):
- Primary: Try providers in priority order (copilot first — parallelism-optimized)
- Subscription check: Verify quota availability (copilot, claude-code)
- Circuit breaker check: Skip failed providers temporarily
- Cache check: Return cached results if available
- API fallback: Use paid API providers (Groq, Anthropic, OpenAI, Gemini, GitHub Models)
- Local fallback: DMR → Ollama (always available, no API costs)
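A compact sketch of this resolution chain; every helper name below is a hypothetical stand-in, not one of the library's exports:

```typescript
// Hypothetical illustration of the routing steps listed above.
type Tier = 'fast' | 'standard' | 'premium';
interface Req { tier: Tier; messages: { role: string; content: string }[] }
interface Res { content: string; provider: string }

declare const providerPriority: Record<Tier, string[]>;
declare const providers: Map<string, { complete(r: Req): Promise<Res> }>;
declare const quota: { isExhausted(p: string): boolean };
declare const breaker: { isOpen(p: string): boolean; recordFailure(p: string): void };
declare const cache: { get(k: string): Res | undefined; set(k: string, v: Res): void };
declare function cacheKey(r: Req): string;
declare function isSubscription(p: string): boolean; // true for copilot, claude-code
declare function localFallback(r: Req): Promise<Res>; // DMR, then Ollama

async function resolveProvider(req: Req): Promise<Res> {
  const hit = cache.get(cacheKey(req));
  if (hit) return hit; // cached result short-circuits everything
  for (const name of providerPriority[req.tier]) {
    if (isSubscription(name) && quota.isExhausted(name)) continue; // quota check
    if (breaker.isOpen(name)) continue; // skip providers that recently failed
    try {
      const result = await providers.get(name)!.complete(req);
      cache.set(cacheKey(req), result);
      return result;
    } catch {
      breaker.recordFailure(name); // circuit opens after repeated failures
    }
  }
  return localFallback(req); // local providers, always available
}
```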
Parallelism: Batch agents call LLMService via Promise.all (concurrency 5-20). Copilot scales from 5s sequential to 0.77s effective per call at 10 concurrent, making it ideal as the primary provider.
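Such a batch can use a concurrency-limited map; the helper below is a generic sketch of that Promise.all pattern, not a function shipped by `@rapid/llm-proxy`:

```typescript
// Generic concurrency-limited map, in the spirit of the batch agents'
// Promise.all usage described above.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++; // safe: single-threaded event loop, no await in between
      results[i] = await fn(items[i]);
    }
  };
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}

// e.g. 100 standard-tier calls at concurrency 10:
// const results = await mapWithConcurrency(requests, 10, (r) => llmService.complete(r));
```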
The system tracks subscription usage and automatically handles quota exhaustion:
Storage: .data/llm-subscription-usage.json
Tracked Metrics:
- Completions per hour (rolling window)
- Estimated token usage
- Quota exhaustion state
- Consecutive failure count
Soft Limits:
- Claude Code: 100 completions/hour
- Copilot: 100 completions/hour
When quota is exhausted, the system applies exponential backoff:
- First exhaustion: Retry after 5 minutes
- Second exhaustion: Retry after 15 minutes
- Third+ exhaustion: Retry after 1 hour
Automatic recovery: On successful completion, reset failure counters
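Sketched in code, the tracked record and backoff schedule might look like this; the field names are assumptions, while the 100/hour soft limit and the 5m/15m/1h schedule come from the numbers above:

```typescript
// Hypothetical shape of one provider's persisted usage record.
interface ProviderUsage {
  completions: number[];   // epoch-ms timestamps, pruned to the last 24h
  exhaustedUntil: number;  // epoch-ms; 0 when not exhausted
  exhaustionCount: number; // reset to 0 on a successful completion
}

const SOFT_LIMIT_PER_HOUR = 100;
const BACKOFF_MS = [5 * 60_000, 15 * 60_000, 60 * 60_000]; // 5m, 15m, 1h

function isExhausted(u: ProviderUsage, now: number = Date.now()): boolean {
  const lastHour = u.completions.filter((t) => now - t < 3_600_000).length;
  return now < u.exhaustedUntil || lastHour >= SOFT_LIMIT_PER_HOUR;
}

function markExhausted(u: ProviderUsage, now: number = Date.now()): void {
  u.exhaustionCount += 1;
  const idx = Math.min(u.exhaustionCount, BACKOFF_MS.length) - 1;
  u.exhaustedUntil = now + BACKOFF_MS[idx]; // 5m, then 15m, then 1h thereafter
}
```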
```
Request → Check Copilot quota (primary — parallelism-optimized)
   ↓ (exhausted)
→ Use Groq (paid API, fast fallback)
   ↓ (circuit breaker open)
→ Check Claude Code quota
   ↓ (exhausted)
→ Use Anthropic (paid API)
   ↓ (all failed)
→ Use DMR (local)
```
Cost Impact:
- If subscriptions available: $0
- If subscriptions exhausted: Standard API costs apply
- Seamless transition - no user intervention needed
Quota data is automatically:
- Persisted to disk after each request
- Pruned (keep last 24 hours only)
- Loaded on service initialization
Reset quota tracking (for testing):

```bash
rm .data/llm-subscription-usage.json
```

The system supports three routing modes:
Mock Mode
Environment: SEMANTIC_ANALYSIS_MODE=mock
- Uses Mock provider exclusively
- Returns simulated responses
- No API calls or costs
- Ideal for testing and development
Local Mode
Environment: SEMANTIC_ANALYSIS_MODE=local
- Uses DMR and Ollama only
- No external API calls
- Zero API costs
- Requires local model servers running
Public Mode (default)
Environment: SEMANTIC_ANALYSIS_MODE=public or unset
- Uses all cloud providers (Groq, Anthropic, OpenAI, Gemini, GitHub)
- Falls back to DMR/Ollama if all cloud providers fail
- Optimizes for quality and availability
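Mode resolution amounts to a single switch on the environment variable. The variable name and values come from this document; the function itself is illustrative, not part of the library:

```typescript
// Illustrative mode resolution; falls back to public mode when unset.
type RoutingMode = 'mock' | 'local' | 'public';

function resolveRoutingMode(env: NodeJS.ProcessEnv = process.env): RoutingMode {
  switch (env.SEMANTIC_ANALYSIS_MODE) {
    case 'mock':  return 'mock';   // Mock provider only, no API calls or costs
    case 'local': return 'local';  // DMR and Ollama only, no external calls
    default:      return 'public'; // all cloud providers, local fallback
  }
}
```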
The system provides interfaces for extending functionality:
```typescript
interface MockServiceInterface {
  getMockResponse(task: string, tier: string): Promise<string>;
}
```

Used to customize mock responses for testing.
```typescript
interface BudgetTrackerInterface {
  trackCost(provider: string, tokens: number, cost: number): void;
  getBudgetRemaining(): number;
  isOverBudget(): boolean;
}
```

Enables cost tracking and budget enforcement.
```typescript
interface SensitivityClassifierInterface {
  classifyContent(content: string): 'public' | 'internal' | 'confidential';
  canUseProvider(provider: string, sensitivity: string): boolean;
}
```

Restricts provider usage based on content sensitivity.
All LLM provider configuration is centralized in:
File: config/llm-providers.yaml
Key configuration sections:
```yaml
providers:
  # Subscription providers (zero cost)
  claude-code:
    cliCommand: "claude"
    timeout: 60000
    models:
      fast: "sonnet"
      standard: "sonnet"
      premium: "opus"
    quotaTracking:
      enabled: true
      softLimitPerHour: 100

  copilot:
    cliCommand: "copilot-cli"
    timeout: 120000
    models:
      fast: "claude-haiku-4.5"       # Benchmarked: 0.77s @10 parallel
      standard: "claude-sonnet-4.5"
      premium: "claude-opus-4.6"
    quotaTracking:
      enabled: true
      softLimitPerHour: 100

  # API providers (per-token cost)
  groq:
    apiKeyEnvVar: GROQ_API_KEY
    fast: "llama-3.1-8b-instant"
    standard: "llama-3.3-70b-versatile"
    premium: "openai/gpt-oss-120b"

  anthropic:
    apiKeyEnvVar: ANTHROPIC_API_KEY
    fast: "claude-haiku-4-5"
    standard: "claude-sonnet-4-5"
    premium: "claude-opus-4-6"

  # ... more providers (openai, gemini, github-models)

# Copilot first — scales with parallelism (0.77s effective @10 concurrent)
# Batch agents use Promise.all, so copilot as primary unlocks peak throughput
provider_priority:
  fast: ["copilot", "groq", "claude-code", "anthropic", "openai", "gemini", "github-models"]
  standard: ["copilot", "groq", "claude-code", "anthropic", "openai", "gemini", "github-models"]
  premium: ["copilot", "groq", "claude-code", "anthropic", "openai", "gemini", "github-models"]

cache:
  maxSize: 1000
  ttlMs: 3600000          # 1 hour

circuit_breaker:
  threshold: 5
  resetTimeoutMs: 60000   # 1 minute
```

Force specific behavior via environment variables:
```bash
# Force all tasks to premium tier
export SEMANTIC_ANALYSIS_TIER=premium

# Force specific provider (skip routing)
export SEMANTIC_ANALYSIS_PROVIDER=anthropic

# Use budget mode (fast tier everywhere)
export SEMANTIC_ANALYSIS_COST_MODE=budget

# Use local-only mode
export SEMANTIC_ANALYSIS_MODE=local
```

Per-workflow limits (configured in llm-providers.yaml):
- Budget mode: $0.05 per run
- Standard mode: $0.50 per run
- Quality mode: $2.00 per run
Batch workflow limits:
- Max tokens per batch: 500,000
- Max cost per batch: $1.00 USD
- Total budget: $50.00 USD
- Automatic fallback to local on quota exceeded
- Copilot-first parallelized routing: Copilot scales with concurrency (0.77s @10 parallel), batch agents use Promise.all
- Tier-based routing: Use cheapest provider that meets quality requirements
- Caching: Avoid duplicate LLM calls (1-hour TTL)
- Automatic fallback: Switch to paid APIs only when subscriptions exhausted
- Local fallback: Switch to DMR/Ollama when budget exhausted
- Circuit breaker: Stop calling failed providers quickly
Typical UKB batch analysis run:
- 50 fast tier calls (extraction, parsing)
- 100 standard tier calls (semantic analysis)
- 20 premium tier calls (insight generation)
Before subscriptions (all API):
- Fast: 50 × $0.001 = $0.05
- Standard: 100 × $0.01 = $1.00
- Premium: 20 × $0.05 = $1.00
- Total: ~$2.05 per run
After subscriptions (until quota exhausted):
- Fast: $0 (Copilot/claude-haiku-4.5, parallelized)
- Standard: $0 (Copilot/claude-sonnet-4.5)
- Premium: $0 (Copilot/claude-opus-4.6)
- Total: $0.00 per run ✅
- Bonus: ~3x faster via parallelized copilot calls
Estimated savings: ~$50-100/month for active development
The LLM layer tracks:
- Request metrics: Total calls per provider, success/failure rates
- Performance: Latency per provider, throughput
- Cost: Token usage, estimated costs per provider
- Cache: Hit/miss ratio, cache size
Metrics are exposed via the LLMService.getMetrics() method.
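A quick way to inspect them; both calls are documented here, but the shape of the returned object is not, so this sketch simply dumps it:

```typescript
import { LLMService } from '@rapid/llm-proxy';

// Dump the current counters: per-provider request stats, latency,
// estimated cost, and cache hit/miss ratio.
const metrics = LLMService.getInstance().getMetrics();
console.log(JSON.stringify(metrics, null, 2));
```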
```typescript
import { LLMService, loadProviderConfig } from '@rapid/llm-proxy';
import type { LLMCompletionRequest } from '@rapid/llm-proxy';

const llmService = LLMService.getInstance();

const request: LLMCompletionRequest = {
  messages: [
    { role: 'user', content: 'Analyze this code for bugs...' }
  ],
  tier: 'standard', // or 'fast' / 'premium'
  task: 'semantic_code_analysis',
  temperature: 0.7,
  maxTokens: 4096
};

const result = await llmService.complete(request);

Logger.log('info', result.content);                        // LLM response
Logger.log('info', `Provider: ${result.provider}`);        // Which provider was used
Logger.log('info', `Model: ${result.model}`);              // Specific model
Logger.log('info', `Tokens: ${result.usage.totalTokens}`); // Token usage
Logger.log('info', `Cached: ${result.cached}`);            // Was it from cache?
```

- LLM Provider Guide - User guide for working with providers
- Semantic Analysis Integration - SA consumer usage
- Getting Started - Installation and API key setup
Configuration Files:
- `config/llm-providers.yaml` - Full provider configuration schema
- `docs/provider-configuration.md` - Detailed API key setup guide