
LLM Architecture

Overview

The @rapid/llm-proxy unified LLM layer consolidates three previously separate LLM abstractions into a single, comprehensive library that provides intelligent routing, resilience, and cost optimization across 14 different LLM providers, including subscription-based providers that eliminate per-token API costs. It is maintained as a standalone package shared with OKB and other projects.

Why it was created:

  • Before: 3 separate LLM abstractions (Semantic Analysis, Unified Inference Engine, Semantic Validator)
  • Problem: Code duplication, inconsistent provider handling, no shared caching or resilience
  • After: Single unified layer (@rapid/llm-proxy) with shared infrastructure, tier-based routing, circuit breaker, and LRU cache

Key Features:

  • Parallelized copilot-first routing — Copilot scales beautifully with parallelism (0.77s effective per call at 10 concurrent)
  • Zero-cost routing via GitHub Copilot and Claude Code subscriptions
  • Automatic fallback to paid APIs on quota exhaustion
  • Batch-optimized — agents already use Promise.all with concurrency 5-20, copilot as primary unlocks peak throughput
  • Optimistic quota tracking with exponential backoff

Architecture Components

LLM Provider Architecture

Core Components

  1. LLMService (Facade)

    • Single entry point for all LLM operations
    • Handles tier-based routing
    • Manages provider selection and fallback
    • Integrates with infrastructure (cache, circuit breaker, metrics)
  2. ProviderRegistry

    • Central registry for all LLM providers
    • Dynamic provider registration and lookup
    • Configuration validation
  3. Infrastructure Layer

    • Circuit Breaker: Prevents cascading failures (threshold: 5 failures, reset: 60s)
    • LRU Cache: 1000 entries, 1-hour TTL
    • Metrics: Request tracking, cost monitoring, performance stats
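
These defaults can be pictured as thin wrappers around each provider call. The sketch below is illustrative only; the class and method names are not the actual @rapid/llm-proxy internals, it simply mirrors the documented defaults (5 failures to open, 60s reset, 1000 cache entries, 1-hour TTL).

// Illustrative sketch of the documented defaults; not the actual
// @rapid/llm-proxy internals, just the behaviour they describe.
class SimpleCircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 5, private resetTimeoutMs = 60_000) {}

  isOpen(): boolean {
    if (this.failures < this.threshold) return false;
    // After the reset timeout, allow a trial request (half-open).
    return Date.now() - this.openedAt < this.resetTimeoutMs;
  }

  recordSuccess(): void { this.failures = 0; }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures === this.threshold) this.openedAt = Date.now();
  }
}

// LRU cache with the documented defaults: 1000 entries, 1-hour TTL.
class SimpleLruCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(private maxSize = 1000, private ttlMs = 3_600_000) {}

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit || hit.expiresAt < Date.now()) { this.entries.delete(key); return undefined; }
    this.entries.delete(key); // re-insert to mark as most recently used
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key); // refresh position if the key already exists
    if (this.entries.size >= this.maxSize) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}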

Consumers

The unified layer serves three primary consumers:

  1. SemanticAnalyzer (integrations/mcp-server-semantic-analysis/)

    • Batch analysis workflows
    • Git history analysis
    • Ontology classification
  2. UnifiedInferenceEngine (shared utility)

    • General-purpose LLM inference
    • Multi-provider support
  3. SemanticValidator (integrations/mcp-constraint-monitor/)

    • Constraint violation detection
    • Semantic code analysis

Supported Providers

The system supports 14 LLM providers with tier-based model selection:

Subscription Providers (Zero Cost)

1. Claude Code

CLI Command: claude
Cost: $0 per token (uses existing Claude Max subscription)

| Tier | Model | Description |
|---|---|---|
| Fast | sonnet | Claude Sonnet 4.5 (fast tier) |
| Standard | sonnet | Claude Sonnet 4.5 (standard tier) |
| Premium | opus | Claude Opus 4.6 (highest quality) |

Requirements:

  • claude CLI installed and authenticated, with an active Claude Max subscription

Features:

  • Automatic quota tracking with persistent storage
  • Exponential backoff on exhaustion (5m → 15m → 1h)
  • Seamless fallback to API providers
  • From containers: Falls back to LLM Proxy Bridge on host.docker.internal:12435

2. GitHub Copilot (Primary Provider)

Method: Direct HTTP POST to the Copilot API
Cost: $0 per token (uses existing GitHub Copilot subscription)

| Tier | Model | Description |
|---|---|---|
| Fast | claude-haiku-4.5 | Benchmarked: 5s sequential, 0.77s @ 10 parallel |
| Standard | claude-sonnet-4.5 | Claude Sonnet 4.5 via Copilot |
| Premium | claude-opus-4.6 | Claude Opus 4.6 via Copilot |

Why Copilot is primary: Performance benchmarks revealed that Copilot API calls scale beautifully with parallelism — 0.77s effective per call at 10 concurrent (vs 5s sequential). Since batch agents already parallelize LLM calls via Promise.all (concurrency 5-20), copilot as the first-choice provider unlocks peak throughput.

Authentication:

  • Reads OAuth token from ~/.local/share/opencode/auth.json
  • Direct HTTP POST to OpenAI-compatible Copilot API endpoint
  • No CLI tools required
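
A minimal sketch of that flow, assuming an OpenAI-compatible chat completions endpoint; the endpoint URL and the auth.json field names below are placeholders rather than the real values:

// Illustrative sketch of the token-based flow above. The endpoint URL and the
// auth.json field names are placeholders, not the real values.
import { readFile } from 'node:fs/promises';
import { homedir } from 'node:os';
import { join } from 'node:path';

const COPILOT_ENDPOINT = 'https://<copilot-api-host>/v1/chat/completions'; // placeholder

async function copilotComplete(prompt: string): Promise<string> {
  const authPath = join(homedir(), '.local/share/opencode/auth.json');
  const auth = JSON.parse(await readFile(authPath, 'utf8'));
  const token = auth.token ?? auth.access_token; // exact field depends on the auth.json schema

  const res = await fetch(COPILOT_ENDPOINT, {
    method: 'POST',
    headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'claude-haiku-4.5', // fast tier per the table above
      messages: [{ role: 'user', content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`Copilot request failed: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content; // OpenAI-compatible response shape
}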

Features:

  • Shared quota tracking system
  • Automatic provider rotation on exhaustion
  • Zero API costs
  • From containers: Falls back to LLM Proxy Bridge on host.docker.internal:12435

LLM Proxy Bridge (Docker Bridge)

When running inside Docker, host-side tools are unavailable. The LLM Proxy Bridge runs on the host (port 12435) and forwards requests. For Copilot, the proxy bridge reads OAuth tokens from ~/.local/share/opencode/auth.json and makes direct HTTP POST calls to the Copilot API. For Claude Code, it spawns the claude CLI. Each provider automatically detects and uses the proxy during initialization when the LLM_CLI_PROXY_URL environment variable is set.
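
A minimal sketch of the detection step; only the LLM_CLI_PROXY_URL variable and the host/port come from the description above, while the route and request shape are assumptions for illustration:

// Sketch of proxy detection at provider init; the route and body are assumed.
interface BridgeRequest {
  provider: 'copilot' | 'claude-code';
  messages: { role: string; content: string }[];
}

async function completeViaBridgeIfConfigured(req: BridgeRequest): Promise<string | null> {
  const proxyUrl = process.env.LLM_CLI_PROXY_URL; // e.g. http://host.docker.internal:12435
  if (!proxyUrl) return null; // not running behind the bridge; use host-side tools directly

  const res = await fetch(`${proxyUrl}/complete`, { // route name is illustrative
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`LLM Proxy Bridge error: ${res.status}`);
  return (await res.json()).content;
}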


API Providers (Per-Token Cost)

3. Groq

API Key: GROQ_API_KEY

| Tier | Model | Performance | Cost |
|---|---|---|---|
| Fast | llama-3.1-8b-instant | 750 tok/s | ~$0.05/M tokens |
| Standard | llama-3.3-70b-versatile | 275 tok/s | ~$0.59/M tokens |
| Premium | openai/gpt-oss-120b | - | High |

4. Anthropic

API Key: ANTHROPIC_API_KEY

| Tier | Model | Cost |
|---|---|---|
| Fast | claude-haiku-4-5 | $1/$5 per MTok |
| Standard | claude-sonnet-4-5 | $3/$15 per MTok |
| Premium | claude-opus-4-6 | $5/$25 per MTok |

5. OpenAI

API Key: OPENAI_API_KEY

| Tier | Model | Description |
|---|---|---|
| Fast | gpt-4.1-mini | Affordable small model |
| Standard | gpt-4.1 | Latest standard model |
| Premium | o4-mini | Reasoning model |

6. Google Gemini

API Key: GOOGLE_API_KEY

| Tier | Model | Description |
|---|---|---|
| Fast | gemini-2.5-flash | Fast, cost-effective |
| Standard | gemini-2.5-flash | Good balance |
| Premium | gemini-2.5-pro | Deep reasoning |

7. GitHub Models

API Key: GITHUB_TOKEN
Base URL: https://models.github.ai/inference/v1

| Tier | Model |
|---|---|
| Fast | gpt-4.1-mini |
| Standard | gpt-4.1 |
| Premium | o4-mini |

8. DMR (Docker Model Runner)

Local provider - no API key required
Base URL: http://localhost:12434/engines/v1

  • Default Model: ai/llama3.2
  • Specialized Models:
    • ai/llama3.2:3B-Q4_K_M (lightweight tasks)
    • ai/qwen2.5-coder:7B-Q4_K_M (code analysis)
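
Assuming the base URL above exposes an OpenAI-compatible /chat/completions route, a local call can be sketched as:

// Minimal sketch of a local DMR call; the /chat/completions route is assumed.
async function dmrComplete(prompt: string): Promise<string> {
  const res = await fetch('http://localhost:12434/engines/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'ai/llama3.2', // default model; no API key needed
      messages: [{ role: 'user', content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`DMR request failed: ${res.status}`);
  return (await res.json()).choices[0].message.content;
}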

9. Ollama

Local provider - no API key required

  • Supports any locally installed Ollama models
  • Used as final fallback for local-only mode

10. Mock Provider

Test/debug mode - no API key required

  • Returns simulated responses
  • Used for testing and development

Tier-Based Routing

LLM Tier Routing

The system routes requests to providers based on task complexity and cost optimization:

Tier Definitions

Fast Tier (zero cost → low cost, high speed)

  • Simple extraction and parsing
  • Basic classification
  • File pattern matching
  • Provider Priority: Copilot → Groq → Claude Code → Anthropic → OpenAI → Gemini → GitHub Models

Standard Tier (zero cost → balanced cost/quality)

  • Semantic code analysis
  • Git history analysis
  • Documentation linking
  • Ontology classification
  • Provider Priority: Copilot → Groq → Claude Code → Anthropic → OpenAI → Gemini → GitHub Models

Premium Tier (zero cost → highest quality)

  • Insight generation
  • Pattern recognition
  • Quality assurance review
  • Deep code analysis
  • Provider Priority: Copilot → Groq → Claude Code → Anthropic → OpenAI → Gemini → GitHub Models

Task-to-Tier Mapping

Tasks are automatically mapped to tiers based on their complexity:

# Fast tier examples
- git_file_extraction
- commit_message_parsing
- basic_classification

# Standard tier examples
- git_history_analysis
- semantic_code_analysis
- ontology_classification

# Premium tier examples
- insight_generation
- observation_generation
- pattern_recognition

Fallback Chain

  1. Primary: Try providers in priority order (copilot first — parallelism-optimized)
  2. Subscription check: Verify quota availability (copilot, claude-code)
  3. Circuit breaker check: Skip failed providers temporarily
  4. Cache check: Return cached results if available
  5. API fallback: Use paid API providers (Groq, Anthropic, OpenAI, Gemini, GitHub Models)
  6. Local fallback: DMR → Ollama (always available, no API costs)

Parallelism: Batch agents call LLMService via Promise.all (concurrency 5-20). Copilot scales from 5s sequential to 0.77s effective per call at 10 concurrent, making it ideal as the primary provider.
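
A minimal sketch of that batching pattern (the helper below is illustrative, not a @rapid/llm-proxy API):

// Sketch of the batching pattern: issue LLM calls in chunks of N concurrent
// requests (agents typically use a concurrency of 5-20).
async function completeInBatches<T>(
  inputs: T[],
  complete: (input: T) => Promise<string>,
  concurrency = 10,
): Promise<string[]> {
  const results: string[] = [];
  for (let i = 0; i < inputs.length; i += concurrency) {
    const chunk = inputs.slice(i, i + concurrency);
    // Each chunk runs in parallel; Copilot amortizes to ~0.77s per call at 10 concurrent.
    results.push(...(await Promise.all(chunk.map(complete))));
  }
  return results;
}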


Subscription Quota Management

The system tracks subscription usage and automatically handles quota exhaustion:

Quota Tracking

Storage: .data/llm-subscription-usage.json

Tracked Metrics:

  • Completions per hour (rolling window)
  • Estimated token usage
  • Quota exhaustion state
  • Consecutive failure count

Soft Limits:

  • Claude Code: 100 completions/hour
  • Copilot: 100 completions/hour

Exponential Backoff

When quota is exhausted, the system applies exponential backoff:

  1. First exhaustion: Retry after 5 minutes
  2. Second exhaustion: Retry after 15 minutes
  3. Third+ exhaustion: Retry after 1 hour

Automatic recovery: on a successful completion, the failure counters are reset.
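
A small sketch of that schedule, keyed by the consecutive-exhaustion count:

// Sketch of the documented backoff schedule, keyed by consecutive exhaustions.
const BACKOFF_MS = [5 * 60_000, 15 * 60_000, 60 * 60_000]; // 5m, 15m, 1h

function nextRetryDelayMs(consecutiveExhaustions: number): number {
  // 1st exhaustion -> 5 minutes, 2nd -> 15 minutes, 3rd and beyond -> 1 hour.
  const index = Math.min(Math.max(consecutiveExhaustions - 1, 0), BACKOFF_MS.length - 1);
  return BACKOFF_MS[index];
}
// On a successful completion the caller resets the exhaustion counter to 0.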

Automatic Fallback

Request → Check Copilot quota (primary — parallelism-optimized)
       ↓ (exhausted)
       → Use Groq (paid API, fast fallback)
       ↓ (circuit breaker open)
       → Check Claude Code quota
       ↓ (exhausted)
       → Use Anthropic (paid API)
       ↓ (all failed)
       → Use DMR (local)

Cost Impact:

  • If subscriptions available: $0
  • If subscriptions exhausted: Standard API costs apply
  • Seamless transition - no user intervention needed

Data Persistence

Quota data is automatically:

  • Persisted to disk after each request
  • Pruned (keep last 24 hours only)
  • Loaded on service initialization
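
A minimal sketch of that persist-and-prune cycle; the on-disk record shape below is an assumption, not the real file schema:

// Sketch of the persist-and-prune behaviour described above.
import { existsSync, readFileSync, writeFileSync } from 'node:fs';

interface UsageRecord { provider: string; timestamp: number; tokens: number }

const USAGE_FILE = '.data/llm-subscription-usage.json';
const DAY_MS = 24 * 60 * 60 * 1000;

function loadUsage(): UsageRecord[] {
  return existsSync(USAGE_FILE) ? JSON.parse(readFileSync(USAGE_FILE, 'utf8')) : [];
}

function recordUsage(record: UsageRecord): void {
  // Keep only the last 24 hours, then persist after every request.
  const recent = loadUsage().filter(r => record.timestamp - r.timestamp < DAY_MS);
  recent.push(record);
  writeFileSync(USAGE_FILE, JSON.stringify(recent, null, 2));
}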

Reset quota tracking (for testing):

rm .data/llm-subscription-usage.json

Mode Routing

The system supports three routing modes:

1. Mock Mode

Environment: SEMANTIC_ANALYSIS_MODE=mock

  • Uses Mock provider exclusively
  • Returns simulated responses
  • No API calls or costs
  • Ideal for testing and development

2. Local Mode

Environment: SEMANTIC_ANALYSIS_MODE=local

  • Uses DMR and Ollama only
  • No external API calls
  • Zero API costs
  • Requires local model servers running

3. Public Mode (Default)

Environment: SEMANTIC_ANALYSIS_MODE=public or unset

  • Uses all cloud providers (Groq, Anthropic, OpenAI, Gemini, GitHub)
  • Falls back to DMR/Ollama if all cloud providers fail
  • Optimizes for quality and availability
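
As a rough illustration, mode selection can be pictured as a simple mapping from the environment variable to a provider set; the lists below mirror the descriptions above, while the real routing lives inside LLMService:

// Sketch of how the mode could select a provider set (illustrative only).
type RoutingMode = 'mock' | 'local' | 'public';

function providersForMode(mode = process.env.SEMANTIC_ANALYSIS_MODE): string[] {
  switch ((mode ?? 'public') as RoutingMode) {
    case 'mock':
      return ['mock']; // simulated responses only, no API calls
    case 'local':
      return ['dmr', 'ollama']; // no external API calls
    default:
      // Public mode: cloud providers first, local providers as the final fallback.
      return ['copilot', 'groq', 'claude-code', 'anthropic', 'openai',
              'gemini', 'github-models', 'dmr', 'ollama'];
  }
}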

Dependency Injection Hooks

The system provides interfaces for extending functionality:

1. MockServiceInterface

interface MockServiceInterface {
  getMockResponse(task: string, tier: string): Promise<string>;
}

Used to customize mock responses for testing.

2. BudgetTrackerInterface

interface BudgetTrackerInterface {
  trackCost(provider: string, tokens: number, cost: number): void;
  getBudgetRemaining(): number;
  isOverBudget(): boolean;
}

Enables cost tracking and budget enforcement.

3. SensitivityClassifierInterface

interface SensitivityClassifierInterface {
  classifyContent(content: string): 'public' | 'internal' | 'confidential';
  canUseProvider(provider: string, sensitivity: string): boolean;
}

Restricts provider usage based on content sensitivity.
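
As an example of plugging in one of these hooks, a minimal implementation of SensitivityClassifierInterface might look like this; the keyword heuristic and the provider allow-list are assumptions for the example:

// Illustrative implementation of the sensitivity hook defined above.
class KeywordSensitivityClassifier implements SensitivityClassifierInterface {
  classifyContent(content: string): 'public' | 'internal' | 'confidential' {
    if (/api[_-]?key|password|secret/i.test(content)) return 'confidential';
    if (/internal|proprietary/i.test(content)) return 'internal';
    return 'public';
  }

  canUseProvider(provider: string, sensitivity: string): boolean {
    // Keep confidential content on local providers only.
    if (sensitivity === 'confidential') return provider === 'dmr' || provider === 'ollama';
    return true;
  }
}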


Configuration

All LLM provider configuration is centralized in:

File: config/llm-providers.yaml

Key configuration sections:

providers:
  # Subscription providers (zero cost)
  claude-code:
    cliCommand: "claude"
    timeout: 60000
    models:
      fast: "sonnet"
      standard: "sonnet"
      premium: "opus"
    quotaTracking:
      enabled: true
      softLimitPerHour: 100

  copilot:
    cliCommand: "copilot-cli"
    timeout: 120000
    models:
      fast: "claude-haiku-4.5"        # Benchmarked: 0.77s @10 parallel
      standard: "claude-sonnet-4.5"
      premium: "claude-opus-4.6"
    quotaTracking:
      enabled: true
      softLimitPerHour: 100

  # API providers (per-token cost)
  groq:
    apiKeyEnvVar: GROQ_API_KEY
    fast: "llama-3.1-8b-instant"
    standard: "llama-3.3-70b-versatile"
    premium: "openai/gpt-oss-120b"

  anthropic:
    apiKeyEnvVar: ANTHROPIC_API_KEY
    fast: "claude-haiku-4-5"
    standard: "claude-sonnet-4-5"
    premium: "claude-opus-4-6"
  # ... more providers (openai, gemini, github-models)

# Copilot first — scales with parallelism (0.77s effective @10 concurrent)
# Batch agents use Promise.all, so copilot as primary unlocks peak throughput
provider_priority:
  fast: ["copilot", "groq", "claude-code", "anthropic", "openai", "gemini", "github-models"]
  standard: ["copilot", "groq", "claude-code", "anthropic", "openai", "gemini", "github-models"]
  premium: ["copilot", "groq", "claude-code", "anthropic", "openai", "gemini", "github-models"]

cache:
  maxSize: 1000
  ttlMs: 3600000  # 1 hour

circuit_breaker:
  threshold: 5
  resetTimeoutMs: 60000  # 1 minute
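
The package exports loadProviderConfig for loading this file; as a standalone illustration (not the package's actual implementation), an equivalent read with js-yaml might look like this:

// Standalone illustration of reading the file above with js-yaml; the package's
// own loadProviderConfig may differ in shape and validation.
import { readFileSync } from 'node:fs';
import yaml from 'js-yaml';

interface ProviderConfigFile {
  providers: Record<string, unknown>;
  provider_priority: Record<'fast' | 'standard' | 'premium', string[]>;
  cache: { maxSize: number; ttlMs: number };
  circuit_breaker: { threshold: number; resetTimeoutMs: number };
}

const config = yaml.load(readFileSync('config/llm-providers.yaml', 'utf8')) as ProviderConfigFile;

// Ordered provider list the router tries for the standard tier
const standardPriority = config.provider_priority.standard; // ["copilot", "groq", ...]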

Environment Overrides

Force specific behavior via environment variables:

# Force all tasks to premium tier
export SEMANTIC_ANALYSIS_TIER=premium

# Force specific provider (skip routing)
export SEMANTIC_ANALYSIS_PROVIDER=anthropic

# Use budget mode (fast tier everywhere)
export SEMANTIC_ANALYSIS_COST_MODE=budget

# Use local-only mode
export SEMANTIC_ANALYSIS_MODE=local

Cost Management

Cost Limits

Per-workflow limits (configured in llm-providers.yaml):

  • Budget mode: $0.05 per run
  • Standard mode: $0.50 per run
  • Quality mode: $2.00 per run

Batch workflow limits:

  • Max tokens per batch: 500,000
  • Max cost per batch: $1.00 USD
  • Total budget: $50.00 USD
  • Automatic fallback to local on quota exceeded

Cost Optimization Strategies

  1. Copilot-first parallelized routing: Copilot scales with concurrency (0.77s @10 parallel), batch agents use Promise.all
  2. Tier-based routing: Use cheapest provider that meets quality requirements
  3. Caching: Avoid duplicate LLM calls (1-hour TTL)
  4. Automatic fallback: Switch to paid APIs only when subscriptions exhausted
  5. Local fallback: Switch to DMR/Ollama when budget exhausted
  6. Circuit breaker: Stop calling failed providers quickly

Cost Savings Example

Typical UKB batch analysis run:

  • 50 fast tier calls (extraction, parsing)
  • 100 standard tier calls (semantic analysis)
  • 20 premium tier calls (insight generation)

Before subscriptions (all API):

  • Fast: 50 × $0.001 = $0.05
  • Standard: 100 × $0.01 = $1.00
  • Premium: 20 × $0.05 = $1.00
  • Total: ~$2.05 per run

After subscriptions (until quota exhausted):

  • Fast: $0 (Copilot/claude-haiku-4.5, parallelized)
  • Standard: $0 (Copilot/claude-sonnet-4.5)
  • Premium: $0 (Copilot/claude-opus-4.6)
  • Total: $0.00 per run
  • Bonus: ~3x faster via parallelized copilot calls

Estimated savings: ~$50-100/month for active development


Metrics and Monitoring

The LLM layer tracks:

  • Request metrics: Total calls per provider, success/failure rates
  • Performance: Latency per provider, throughput
  • Cost: Token usage, estimated costs per provider
  • Cache: Hit/miss ratio, cache size

Metrics are exposed via the LLMService.getMetrics() method.
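
A minimal usage sketch follows; the field names on the returned object are assumptions about its shape, not the exact API:

// Minimal usage sketch of the metrics accessor.
import { LLMService } from '@rapid/llm-proxy';

const metrics = LLMService.getInstance().getMetrics();

console.log(metrics.requestsByProvider); // total calls and success/failure per provider
console.log(metrics.cacheHitRate);       // cache hit/miss ratio
console.log(metrics.estimatedCostUsd);   // estimated spend per provider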


Integration Example

import { LLMService, loadProviderConfig } from '@rapid/llm-proxy';
import type { LLMCompletionRequest } from '@rapid/llm-proxy';

// LLMService is a singleton; loadProviderConfig is also exported for loading
// config/llm-providers.yaml explicitly when needed.
const llmService = LLMService.getInstance();

const request: LLMCompletionRequest = {
  messages: [
    { role: 'user', content: 'Analyze this code for bugs...' }
  ],
  tier: 'standard',               // or 'fast' / 'premium'
  task: 'semantic_code_analysis',
  temperature: 0.7,
  maxTokens: 4096
};

const result = await llmService.complete(request);

// Logger is the project's logging utility.
Logger.log('info', result.content);                        // LLM response
Logger.log('info', `Provider: ${result.provider}`);        // which provider was used
Logger.log('info', `Model: ${result.model}`);              // specific model
Logger.log('info', `Tokens: ${result.usage.totalTokens}`); // token usage
Logger.log('info', `Cached: ${result.cached}`);            // was it served from cache?

Related Documentation

Configuration Files:

  • config/llm-providers.yaml - Full provider configuration schema
  • docs/provider-configuration.md - Detailed API key setup guide