
LLM Architecture

Overview

The @rapid/llm-proxy unified LLM layer consolidates three previously separate LLM abstractions into a single, comprehensive library that provides intelligent routing, resilience, and cost optimization across 14 different LLM providers, including subscription-based providers that eliminate per-token API costs. It is maintained as a standalone package shared with OKB and other projects.

Why it was created:

  • Before: 3 separate LLM abstractions (Semantic Analysis, Unified Inference Engine, Semantic Validator)
  • Problem: Code duplication, inconsistent provider handling, no shared caching or resilience
  • After: Single unified layer (@rapid/llm-proxy) with shared infrastructure, tier-based routing, circuit breaker, and LRU cache

Key Features:

  • Parallelized copilot-first routing — Copilot scales beautifully with parallelism (0.77s effective per call at 10 concurrent)
  • Zero-cost routing via GitHub Copilot and Claude Code subscriptions
  • Automatic fallback to paid APIs on quota exhaustion
  • Batch-optimized — agents already use Promise.all with concurrency 5-20, copilot as primary unlocks peak throughput
  • Optimistic quota tracking with exponential backoff

Architecture Components

LLM Provider Architecture

Core Components

  1. LLMService (Facade)

    • Single entry point for all LLM operations
    • Handles tier-based routing
    • Manages provider selection and fallback
    • Integrates with infrastructure (cache, circuit breaker, metrics)
  2. ProviderRegistry

    • Central registry for all LLM providers
    • Dynamic provider registration and lookup
    • Configuration validation
  3. Infrastructure Layer

    • Circuit Breaker: Prevents cascading failures (threshold: 5 failures, reset: 60s)
    • LRU Cache: 1000 entries, 1-hour TTL
    • Metrics: Request tracking, cost monitoring, performance stats
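
These defaults can be pictured as thin wrappers around each provider call. The sketch below is illustrative only; the class and method names are not the actual @rapid/llm-proxy internals, it simply mirrors the documented defaults (5 failures to open, 60s reset, 1000 cache entries, 1-hour TTL).

// Illustrative sketch of the documented defaults; not the actual
// @rapid/llm-proxy internals, just the behaviour they describe.
class SimpleCircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 5, private resetTimeoutMs = 60_000) {}

  isOpen(): boolean {
    if (this.failures < this.threshold) return false;
    // After the reset timeout, allow a trial request (half-open).
    return Date.now() - this.openedAt < this.resetTimeoutMs;
  }

  recordSuccess(): void { this.failures = 0; }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures === this.threshold) this.openedAt = Date.now();
  }
}

// LRU cache with the documented defaults: 1000 entries, 1-hour TTL.
class SimpleLruCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(private maxSize = 1000, private ttlMs = 3_600_000) {}

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit || hit.expiresAt < Date.now()) { this.entries.delete(key); return undefined; }
    this.entries.delete(key); // re-insert to mark as most recently used
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key); // refresh position if the key already exists
    if (this.entries.size >= this.maxSize) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}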

Consumers

The unified layer serves three primary consumers:

  1. SemanticAnalyzer (integrations/mcp-server-semantic-analysis/)

    • Batch analysis workflows
    • Git history analysis
    • Ontology classification
  2. UnifiedInferenceEngine (shared utility)

    • General-purpose LLM inference
    • Multi-provider support
  3. SemanticValidator (integrations/mcp-constraint-monitor/)

    • Constraint violation detection
    • Semantic code analysis

Supported Providers

The system supports 14 LLM providers with tier-based model selection:

Subscription Providers (Zero Cost)

1. Claude Code

CLI Command: claude
Cost: $0 per token (uses existing Claude Max subscription)

| Tier | Model | Description |
|---|---|---|
| Fast | sonnet | Claude Sonnet 4.5 (fast tier) |
| Standard | sonnet | Claude Sonnet 4.5 (standard tier) |
| Premium | opus | Claude Opus 4.6 (highest quality) |

Requirements:

  • claude CLI installed and authenticated, with an active Claude Max subscription

Features:

  • Automatic quota tracking with persistent storage
  • Exponential backoff on exhaustion (5m → 15m → 1h)
  • Seamless fallback to API providers
  • From containers: Falls back to LLM Proxy Bridge on host.docker.internal:12435

2. GitHub Copilot (Primary Provider)

Method: Direct HTTP POST to the Copilot API
Cost: $0 per token (uses existing GitHub Copilot subscription)

| Tier | Model | Description |
|---|---|---|
| Fast | claude-haiku-4.5 | Benchmarked: 5s sequential, 0.77s @ 10 parallel |
| Standard | claude-sonnet-4.5 | Claude Sonnet 4.5 via Copilot |
| Premium | claude-opus-4.6 | Claude Opus 4.6 via Copilot |

Why Copilot is primary: Performance benchmarks revealed that Copilot API calls scale beautifully with parallelism — 0.77s effective per call at 10 concurrent (vs 5s sequential). Since batch agents already parallelize LLM calls via Promise.all (concurrency 5-20), copilot as the first-choice provider unlocks peak throughput.

Authentication:

  • Reads OAuth token from ~/.local/share/opencode/auth.json
  • Direct HTTP POST to OpenAI-compatible Copilot API endpoint
  • No CLI tools required
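
A minimal sketch of that flow, assuming an OpenAI-compatible chat completions endpoint; the endpoint URL and the auth.json field names below are placeholders rather than the real values:

// Illustrative sketch of the token-based flow above. The endpoint URL and the
// auth.json field names are placeholders, not the real values.
import { readFile } from 'node:fs/promises';
import { homedir } from 'node:os';
import { join } from 'node:path';

const COPILOT_ENDPOINT = 'https://<copilot-api-host>/v1/chat/completions'; // placeholder

async function copilotComplete(prompt: string): Promise<string> {
  const authPath = join(homedir(), '.local/share/opencode/auth.json');
  const auth = JSON.parse(await readFile(authPath, 'utf8'));
  const token = auth.token ?? auth.access_token; // exact field depends on the auth.json schema

  const res = await fetch(COPILOT_ENDPOINT, {
    method: 'POST',
    headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'claude-haiku-4.5', // fast tier per the table above
      messages: [{ role: 'user', content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`Copilot request failed: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content; // OpenAI-compatible response shape
}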

Features:

  • Shared quota tracking system
  • Automatic provider rotation on exhaustion
  • Zero API costs
  • From containers: Falls back to LLM Proxy Bridge on host.docker.internal:12435

LLM Proxy Bridge (Docker Bridge)

When running inside Docker, host-side tools are unavailable. The LLM Proxy Bridge runs on the host (port 12435) and forwards requests. For Copilot, the proxy bridge reads OAuth tokens from ~/.local/share/opencode/auth.json and makes direct HTTP POST calls to the Copilot API. For Claude Code, it spawns the claude CLI. Each provider automatically detects and uses the proxy during initialization when the LLM_CLI_PROXY_URL environment variable is set.
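
A minimal sketch of the detection step; only the LLM_CLI_PROXY_URL variable and the host/port come from the description above, while the route and request shape are assumptions for illustration:

// Sketch of proxy detection at provider init; the route and body are assumed.
interface BridgeRequest {
  provider: 'copilot' | 'claude-code';
  messages: { role: string; content: string }[];
}

async function completeViaBridgeIfConfigured(req: BridgeRequest): Promise<string | null> {
  const proxyUrl = process.env.LLM_CLI_PROXY_URL; // e.g. http://host.docker.internal:12435
  if (!proxyUrl) return null; // not running behind the bridge; use host-side tools directly

  const res = await fetch(`${proxyUrl}/complete`, { // route name is illustrative
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`LLM Proxy Bridge error: ${res.status}`);
  return (await res.json()).content;
}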


API Providers (Per-Token Cost)

3. Groq

API Key: GROQ_API_KEY

| Tier | Model | Performance | Cost |
|---|---|---|---|
| Fast | llama-3.1-8b-instant | 750 tok/s | ~$0.05/M tokens |
| Standard | llama-3.3-70b-versatile | 275 tok/s | ~$0.59/M tokens |
| Premium | openai/gpt-oss-120b | - | High |

4. Anthropic

API Key: ANTHROPIC_API_KEY

| Tier | Model | Cost |
|---|---|---|
| Fast | claude-haiku-4-5 | $1/$5 per MTok |
| Standard | claude-sonnet-4-5 | $3/$15 per MTok |
| Premium | claude-opus-4-6 | $5/$25 per MTok |

5. OpenAI

API Key: OPENAI_API_KEY

| Tier | Model | Description |
|---|---|---|
| Fast | gpt-4.1-mini | Affordable small model |
| Standard | gpt-4.1 | Latest standard model |
| Premium | o4-mini | Reasoning model |

6. Google Gemini

API Key: GOOGLE_API_KEY

| Tier | Model | Description |
|---|---|---|
| Fast | gemini-2.5-flash | Fast, cost-effective |
| Standard | gemini-2.5-flash | Good balance |
| Premium | gemini-2.5-pro | Deep reasoning |

7. GitHub Models

API Key: GITHUB_TOKEN
Base URL: https://models.github.ai/inference/v1

| Tier | Model |
|---|---|
| Fast | gpt-4.1-mini |
| Standard | gpt-4.1 |
| Premium | o4-mini |

8. DMR (Docker Model Runner)

Local provider - no API key required
Base URL: http://localhost:12434/engines/v1

  • Default Model: ai/llama3.2
  • Specialized Models:
    • ai/llama3.2:3B-Q4_K_M (lightweight tasks)
    • ai/qwen2.5-coder:7B-Q4_K_M (code analysis)
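
Assuming the base URL above exposes an OpenAI-compatible /chat/completions route, a local call can be sketched as:

// Minimal sketch of a local DMR call; the /chat/completions route is assumed.
async function dmrComplete(prompt: string): Promise<string> {
  const res = await fetch('http://localhost:12434/engines/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'ai/llama3.2', // default model; no API key needed
      messages: [{ role: 'user', content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`DMR request failed: ${res.status}`);
  return (await res.json()).choices[0].message.content;
}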

9. Ollama

Local provider - no API key required

  • Supports any locally installed Ollama models
  • Used as final fallback for local-only mode

10. Mock Provider

Test/debug mode - no API key required

  • Returns simulated responses
  • Used for testing and development

Tier-Based Routing

LLM Tier Routing

The system routes requests to providers based on task complexity and cost optimization:

Tier Definitions

Fast Tier (zero cost → low cost, high speed)

  • Simple extraction and parsing
  • Basic classification
  • File pattern matching
  • Provider Priority: Copilot → Groq → Claude Code → Anthropic → OpenAI → Gemini → GitHub Models

Standard Tier (zero cost → balanced cost/quality)

  • Semantic code analysis
  • Git history analysis
  • Documentation linking
  • Ontology classification
  • Provider Priority: Copilot → Groq → Claude Code → Anthropic → OpenAI → Gemini → GitHub Models

Premium Tier (zero cost → highest quality)

  • Insight generation
  • Pattern recognition
  • Quality assurance review
  • Deep code analysis
  • Provider Priority: Copilot → Groq → Claude Code → Anthropic → OpenAI → Gemini → GitHub Models

Task-to-Tier Mapping

Tasks are automatically mapped to tiers based on their complexity:

# Fast tier examples
- git_file_extraction
- commit_message_parsing
- basic_classification

# Standard tier examples
- git_history_analysis
- semantic_code_analysis
- ontology_classification

# Premium tier examples
- insight_generation
- observation_generation
- pattern_recognition

Fallback Chain

  1. Primary: Try providers in priority order (copilot first — parallelism-optimized)
  2. Subscription check: Verify quota availability (copilot, claude-code)
  3. Circuit breaker check: Skip failed providers temporarily
  4. Cache check: Return cached results if available
  5. API fallback: Use paid API providers (Groq, Anthropic, OpenAI, Gemini, GitHub Models)
  6. Local fallback: DMR → Ollama (always available, no API costs)

Parallelism: Batch agents call LLMService via Promise.all (concurrency 5-20). Copilot scales from 5s sequential to 0.77s effective per call at 10 concurrent, making it ideal as the primary provider.
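
A minimal sketch of that batching pattern (the helper below is illustrative, not a @rapid/llm-proxy API):

// Sketch of the batching pattern: issue LLM calls in chunks of N concurrent
// requests (agents typically use a concurrency of 5-20).
async function completeInBatches<T>(
  inputs: T[],
  complete: (input: T) => Promise<string>,
  concurrency = 10,
): Promise<string[]> {
  const results: string[] = [];
  for (let i = 0; i < inputs.length; i += concurrency) {
    const chunk = inputs.slice(i, i + concurrency);
    // Each chunk runs in parallel; Copilot amortizes to ~0.77s per call at 10 concurrent.
    results.push(...(await Promise.all(chunk.map(complete))));
  }
  return results;
}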


Subscription Quota Management

The system tracks subscription usage and automatically handles quota exhaustion:

Quota Tracking

Storage: .data/llm-subscription-usage.json

Tracked Metrics:

  • Completions per hour (rolling window)
  • Estimated token usage
  • Quota exhaustion state
  • Consecutive failure count

Soft Limits:

  • Claude Code: 100 completions/hour
  • Copilot: 100 completions/hour

Exponential Backoff

When quota is exhausted, the system applies exponential backoff:

  1. First exhaustion: Retry after 5 minutes
  2. Second exhaustion: Retry after 15 minutes
  3. Third+ exhaustion: Retry after 1 hour

Automatic recovery: on a successful completion, the failure counters are reset.
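
A small sketch of that schedule, keyed by the consecutive-exhaustion count:

// Sketch of the documented backoff schedule, keyed by consecutive exhaustions.
const BACKOFF_MS = [5 * 60_000, 15 * 60_000, 60 * 60_000]; // 5m, 15m, 1h

function nextRetryDelayMs(consecutiveExhaustions: number): number {
  // 1st exhaustion -> 5 minutes, 2nd -> 15 minutes, 3rd and beyond -> 1 hour.
  const index = Math.min(Math.max(consecutiveExhaustions - 1, 0), BACKOFF_MS.length - 1);
  return BACKOFF_MS[index];
}
// On a successful completion the caller resets the exhaustion counter to 0.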

Automatic Fallback

Request → Check Copilot quota (primary — parallelism-optimized)
       ↓ (exhausted)
       → Use Groq (paid API, fast fallback)
       ↓ (circuit breaker open)
       → Check Claude Code quota
       ↓ (exhausted)
       → Use Anthropic (paid API)
       ↓ (all failed)
       → Use DMR (local)

Cost Impact:

  • If subscriptions available: $0
  • If subscriptions exhausted: Standard API costs apply
  • Seamless transition - no user intervention needed

Data Persistence

Quota data is automatically:

  • Persisted to disk after each request
  • Pruned (keep last 24 hours only)
  • Loaded on service initialization
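
A minimal sketch of that persist-and-prune cycle; the on-disk record shape below is an assumption, not the real file schema:

// Sketch of the persist-and-prune behaviour described above.
import { existsSync, readFileSync, writeFileSync } from 'node:fs';

interface UsageRecord { provider: string; timestamp: number; tokens: number }

const USAGE_FILE = '.data/llm-subscription-usage.json';
const DAY_MS = 24 * 60 * 60 * 1000;

function loadUsage(): UsageRecord[] {
  return existsSync(USAGE_FILE) ? JSON.parse(readFileSync(USAGE_FILE, 'utf8')) : [];
}

function recordUsage(record: UsageRecord): void {
  // Keep only the last 24 hours, then persist after every request.
  const recent = loadUsage().filter(r => record.timestamp - r.timestamp < DAY_MS);
  recent.push(record);
  writeFileSync(USAGE_FILE, JSON.stringify(recent, null, 2));
}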

Reset quota tracking (for testing):

rm .data/llm-subscription-usage.json

Mode Routing

The system supports three routing modes:

1. Mock Mode

Environment: SEMANTIC_ANALYSIS_MODE=mock

  • Uses Mock provider exclusively
  • Returns simulated responses
  • No API calls or costs
  • Ideal for testing and development

2. Local Mode

Environment: SEMANTIC_ANALYSIS_MODE=local

  • Uses DMR and Ollama only
  • No external API calls
  • Zero API costs
  • Requires local model servers running

3. Public Mode (Default)

Environment: SEMANTIC_ANALYSIS_MODE=public or unset

  • Uses all cloud providers (Groq, Anthropic, OpenAI, Gemini, GitHub)
  • Falls back to DMR/Ollama if all cloud providers fail
  • Optimizes for quality and availability
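
As a rough illustration, mode selection can be pictured as a simple mapping from the environment variable to a provider set; the lists below mirror the descriptions above, while the real routing lives inside LLMService:

// Sketch of how the mode could select a provider set (illustrative only).
type RoutingMode = 'mock' | 'local' | 'public';

function providersForMode(mode = process.env.SEMANTIC_ANALYSIS_MODE): string[] {
  switch ((mode ?? 'public') as RoutingMode) {
    case 'mock':
      return ['mock']; // simulated responses only, no API calls
    case 'local':
      return ['dmr', 'ollama']; // no external API calls
    default:
      // Public mode: cloud providers first, local providers as the final fallback.
      return ['copilot', 'groq', 'claude-code', 'anthropic', 'openai',
              'gemini', 'github-models', 'dmr', 'ollama'];
  }
}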

Dependency Injection Hooks

The system provides interfaces for extending functionality:

1. MockServiceInterface

interface MockServiceInterface {
  getMockResponse(task: string, tier: string): Promise<string>;
}

Used to customize mock responses for testing.

2. BudgetTrackerInterface

interface BudgetTrackerInterface {
  trackCost(provider: string, tokens: number, cost: number): void;
  getBudgetRemaining(): number;
  isOverBudget(): boolean;
}

Enables cost tracking and budget enforcement.

3. SensitivityClassifierInterface

interface SensitivityClassifierInterface {
  classifyContent(content: string): 'public' | 'internal' | 'confidential';
  canUseProvider(provider: string, sensitivity: string): boolean;
}

Restricts provider usage based on content sensitivity.
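
As an example of plugging in one of these hooks, a minimal implementation of SensitivityClassifierInterface might look like this; the keyword heuristic and the provider allow-list are assumptions for the example:

// Illustrative implementation of the sensitivity hook defined above.
class KeywordSensitivityClassifier implements SensitivityClassifierInterface {
  classifyContent(content: string): 'public' | 'internal' | 'confidential' {
    if (/api[_-]?key|password|secret/i.test(content)) return 'confidential';
    if (/internal|proprietary/i.test(content)) return 'internal';
    return 'public';
  }

  canUseProvider(provider: string, sensitivity: string): boolean {
    // Keep confidential content on local providers only.
    if (sensitivity === 'confidential') return provider === 'dmr' || provider === 'ollama';
    return true;
  }
}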


Configuration

All LLM provider configuration is centralized in:

File: config/llm-providers.yaml

Key configuration sections:

providers:
  # Subscription providers (zero cost)
  claude-code:
    cliCommand: "claude"
    timeout: 60000
    models:
      fast: "sonnet"
      standard: "sonnet"
      premium: "opus"
    quotaTracking:
      enabled: true
      softLimitPerHour: 100

  copilot:
    cliCommand: "copilot-cli"
    timeout: 120000
    models:
      fast: "claude-haiku-4.5"        # Benchmarked: 0.77s @10 parallel
      standard: "claude-sonnet-4.5"
      premium: "claude-opus-4.6"
    quotaTracking:
      enabled: true
      softLimitPerHour: 100

  # API providers (per-token cost)
  groq:
    apiKeyEnvVar: GROQ_API_KEY
    fast: "llama-3.1-8b-instant"
    standard: "llama-3.3-70b-versatile"
    premium: "openai/gpt-oss-120b"

  anthropic:
    apiKeyEnvVar: ANTHROPIC_API_KEY
    fast: "claude-haiku-4-5"
    standard: "claude-sonnet-4-5"
    premium: "claude-opus-4-6"
  # ... more providers (openai, gemini, github-models)

# Copilot first — scales with parallelism (0.77s effective @10 concurrent)
# Batch agents use Promise.all, so copilot as primary unlocks peak throughput
provider_priority:
  fast: ["copilot", "groq", "claude-code", "anthropic", "openai", "gemini", "github-models"]
  standard: ["copilot", "groq", "claude-code", "anthropic", "openai", "gemini", "github-models"]
  premium: ["copilot", "groq", "claude-code", "anthropic", "openai", "gemini", "github-models"]

cache:
  maxSize: 1000
  ttlMs: 3600000  # 1 hour

circuit_breaker:
  threshold: 5
  resetTimeoutMs: 60000  # 1 minute
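
The package exports loadProviderConfig for loading this file; as a standalone illustration (not the package's actual implementation), an equivalent read with js-yaml might look like this:

// Standalone illustration of reading the file above with js-yaml; the package's
// own loadProviderConfig may differ in shape and validation.
import { readFileSync } from 'node:fs';
import yaml from 'js-yaml';

interface ProviderConfigFile {
  providers: Record<string, unknown>;
  provider_priority: Record<'fast' | 'standard' | 'premium', string[]>;
  cache: { maxSize: number; ttlMs: number };
  circuit_breaker: { threshold: number; resetTimeoutMs: number };
}

const config = yaml.load(readFileSync('config/llm-providers.yaml', 'utf8')) as ProviderConfigFile;

// Ordered provider list the router tries for the standard tier
const standardPriority = config.provider_priority.standard; // ["copilot", "groq", ...]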

Environment Overrides

Force specific behavior via environment variables:

# Force all tasks to premium tier
export SEMANTIC_ANALYSIS_TIER=premium

# Force specific provider (skip routing)
export SEMANTIC_ANALYSIS_PROVIDER=anthropic

# Use budget mode (fast tier everywhere)
export SEMANTIC_ANALYSIS_COST_MODE=budget

# Use local-only mode
export SEMANTIC_ANALYSIS_MODE=local

Cost Management

Cost Limits

Per-workflow limits (configured in llm-providers.yaml):

  • Budget mode: $0.05 per run
  • Standard mode: $0.50 per run
  • Quality mode: $2.00 per run

Batch workflow limits:

  • Max tokens per batch: 500,000
  • Max cost per batch: $1.00 USD
  • Total budget: $50.00 USD
  • Automatic fallback to local on quota exceeded

Cost Optimization Strategies

  1. Copilot-first parallelized routing: Copilot scales with concurrency (0.77s @10 parallel), batch agents use Promise.all
  2. Tier-based routing: Use cheapest provider that meets quality requirements
  3. Caching: Avoid duplicate LLM calls (1-hour TTL)
  4. Automatic fallback: Switch to paid APIs only when subscriptions exhausted
  5. Local fallback: Switch to DMR/Ollama when budget exhausted
  6. Circuit breaker: Stop calling failed providers quickly

Cost Savings Example

Typical UKB batch analysis run:

  • 50 fast tier calls (extraction, parsing)
  • 100 standard tier calls (semantic analysis)
  • 20 premium tier calls (insight generation)

Before subscriptions (all API):

  • Fast: 50 × $0.001 = $0.05
  • Standard: 100 × $0.01 = $1.00
  • Premium: 20 × $0.05 = $1.00
  • Total: ~$2.05 per run

After subscriptions (until quota exhausted):

  • Fast: $0 (Copilot/claude-haiku-4.5, parallelized)
  • Standard: $0 (Copilot/claude-sonnet-4.5)
  • Premium: $0 (Copilot/claude-opus-4.6)
  • Total: $0.00 per run
  • Bonus: ~3x faster via parallelized copilot calls

Estimated savings: ~$50-100/month for active development


Metrics and Monitoring

The LLM layer tracks:

  • Request metrics: Total calls per provider, success/failure rates
  • Performance: Latency per provider, throughput
  • Cost: Token usage, estimated costs per provider
  • Cache: Hit/miss ratio, cache size

Metrics are exposed via the LLMService.getMetrics() method.
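
A minimal usage sketch follows; the field names on the returned object are assumptions about its shape, not the exact API:

// Minimal usage sketch of the metrics accessor.
import { LLMService } from '@rapid/llm-proxy';

const metrics = LLMService.getInstance().getMetrics();

console.log(metrics.requestsByProvider); // total calls and success/failure per provider
console.log(metrics.cacheHitRate);       // cache hit/miss ratio
console.log(metrics.estimatedCostUsd);   // estimated spend per provider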


Integration Example

import { LLMService, loadProviderConfig } from '@rapid/llm-proxy';
import type { LLMCompletionRequest } from '@rapid/llm-proxy';

// LLMService is a singleton; loadProviderConfig is also exported for loading
// config/llm-providers.yaml explicitly when needed.
const llmService = LLMService.getInstance();

const request: LLMCompletionRequest = {
  messages: [
    { role: 'user', content: 'Analyze this code for bugs...' }
  ],
  tier: 'standard',               // or 'fast' / 'premium'
  task: 'semantic_code_analysis',
  temperature: 0.7,
  maxTokens: 4096
};

const result = await llmService.complete(request);

// Logger is the project's logging utility.
Logger.log('info', result.content);                        // LLM response
Logger.log('info', `Provider: ${result.provider}`);        // which provider was used
Logger.log('info', `Model: ${result.model}`);              // specific model
Logger.log('info', `Tokens: ${result.usage.totalTokens}`); // token usage
Logger.log('info', `Cached: ${result.cached}`);            // was it served from cache?

Related Documentation

Configuration Files:

  • config/llm-providers.yaml - Full provider configuration schema
  • docs/provider-configuration.md - Detailed API key setup guide