This document describes the high-level architecture of LlamaGate, including component interactions, data flow, and design decisions.
- Overview
- System Architecture
- Component Overview
- Request Flow
- Data Flow
- Component Interactions
- Design Patterns
- Concurrency Model
- Error Handling
- Configuration Management
LlamaGate is a single-binary HTTP proxy/gateway that sits between clients and Ollama, providing:
- OpenAI-compatible API endpoints
- Request caching
- Authentication and rate limiting
- MCP (Model Context Protocol) client integration
- Extension system for extensibility
- Structured logging
Architecture Principles:
- Single Binary: Everything compiled into one executable
- Stateless: No persistent state (except in-memory cache)
- Layered: Clear separation of concerns
- Extensible: Extension system for custom functionality
┌─────────────────────────────────────────────────────────────┐
│ Client Application │
│ (Python, JavaScript, cURL, etc.) │
└───────────────────────┬─────────────────────────────────────┘
│ HTTP Requests
│ (OpenAI-compatible)
▼
┌─────────────────────────────────────────────────────────────┐
│ LlamaGate Server │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ HTTP Layer (Gin Router) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │ │
│ │ │ Middleware │ │ API Handlers │ │ Routes │ │ │
│ │ │ - Auth │ │ - Health │ │ - /v1/* │ │ │
│ │ │ - Rate Limit │ │ - MCP │ │ - /health │ │ │
│ │ │ - Request ID │ │ - Extensions │ │ │ │ │
│ │ └──────────────┘ └──────────────┘ └───────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Proxy Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │ │
│ │ │ Cache │ │ Proxy │ │ Tool Loop │ │ │
│ │ │ Manager │ │ Handler │ │ Executor │ │ │
│ │ └──────────────┘ └──────────────┘ └───────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ MCP Client Layer (Optional) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │ │
│ │ │ Manager │ │ Client │ │ Pool │ │ │
│ │ │ (Servers) │ │ (Per │ │ (HTTP │ │ │
│ │ │ │ │ Server) │ │ Only) │ │ │
│ │ └──────────────┘ └──────────────┘ └───────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Extension System (Optional) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │ │
│ │ │ Registry │ │ Workflow │ │ Context │ │ │
│ │ │ │ │ Executor │ │ │ │ │
│ │ └──────────────┘ └──────────────┘ └───────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
└───────────────────────┬─────────────────────────────────────┘
│ HTTP Requests
│ (Ollama API)
▼
┌─────────────────────────────────────────────────────────────┐
│ Ollama Server │
│ (Local LLM Inference Engine) │
└─────────────────────────────────────────────────────────────┘
Responsibilities:
- Load configuration
- Initialize logger
- Initialize components (cache, proxy, MCP, extensions)
- Set up HTTP router and middleware
- Register API routes
- Start HTTP server
- Handle graceful shutdown
Key Functions:
main()- Application entry point- Component initialization
- Route registration
- Server lifecycle management
Responsibilities:
- Load configuration from multiple sources
- Environment variables
.envfile- YAML/JSON config files
- Validate configuration
- Provide defaults
Key Components:
Configstruct - Main configuration containerLoad()- Configuration loaderValidate()- Configuration validator
Configuration Sources (Priority):
- Environment variables (highest)
- Config files (YAML/JSON)
.envfile- Default values (lowest)
Responsibilities:
- Handle HTTP requests
- Route requests to appropriate handlers
- Apply middleware (auth, rate limiting, logging)
- Format responses
- Error handling
Key Components:
- RequestIDMiddleware - Generate unique request IDs
- AuthMiddleware - API key authentication
- RateLimitMiddleware - Rate limiting (leaky bucket)
- Helpers - Path normalization, health endpoint detection
- HealthHandler - Health check endpoint
- MCPHandler - MCP server management endpoints
- ExtensionHandler - Extension management endpoints
Responsibilities:
- Handle OpenAI-compatible requests
- Convert between OpenAI and Ollama formats
- Manage caching
- Execute tool loops (MCP tools)
- Handle streaming responses
- Inject MCP resource context
Key Components:
- Proxy - Main proxy handler
- ToolLoop - Multi-round tool execution
- ResourceContext - MCP resource injection
- Validation - Request validation
- ExtensionLLMHandler - LLM handler for extensions
Request Flow:
- Receive OpenAI-format request
- Validate request
- Check cache (non-streaming)
- Inject MCP resource context (if MCP enabled)
- Execute tool loop (if tools requested)
- Convert to Ollama format
- Forward to Ollama
- Convert response back to OpenAI format
- Cache response (non-streaming)
- Return to client
Responsibilities:
- In-memory caching of requests/responses
- TTL-based expiration
- Cache key generation
- Cache size management
Key Features:
- TTL-based expiration
- Model-aware caching
- Message-based cache keys
- Thread-safe operations
Cache Key Format:
{model}:{hash(messages)}
Responsibilities:
- Connect to MCP servers
- Discover tools, resources, prompts
- Execute tools
- Manage connections
- Health monitoring
- Connection pooling (HTTP transport)
Key Components:
- Client - MCP client per server
- ServerManager - Manages multiple servers
- Transport - Communication layer (stdio, HTTP, SSE)
- HealthMonitor - Health checking
- ConnectionPool - Connection pooling (HTTP)
- Cache - Metadata caching
Transport Types:
- Stdio - Process-based (fully implemented)
- HTTP - HTTP-based (fully implemented)
- SSE - Server-Sent Events (stub, not implemented)
Responsibilities:
- Register MCP tools
- Convert MCP tools to OpenAI format
- Apply security guardrails
- Tool allow/deny lists
- Tool execution limits
Key Components:
- Manager - Tool registry
- Mapper - Format conversion
- Guardrails - Security and limits
Tool Naming:
- Tools are namespaced:
mcp.<serverName>.<toolName> - Prevents collisions between servers
Responsibilities:
- Extension discovery and registration
- Workflow execution
- Middleware hooks
- Observer hooks
- Extension manifest management
Key Components:
- Registry - Extension registration
- WorkflowExecutor - Execute agentic workflows
- HookManager - Middleware and observer hooks
- Manifest - YAML-based extension definitions
- Types - Core extension types (LLMHandlerFunc, etc.)
Extension Types:
- Workflow Extension - Agentic workflows with LLM calls
- Middleware Extension - Request/response middleware hooks
- Observer Extension - Response observation hooks
Responsibilities:
- Initialize structured logging
- Configure log levels
- File/console output
- Request/response logging
Key Features:
- JSON structured logging
- Request ID correlation
- Configurable log levels
- File and console output
1. Client Request
└─> POST /v1/chat/completions
└─> Headers: X-API-Key, Content-Type
└─> Body: { model, messages, ... }
2. HTTP Layer
└─> RequestIDMiddleware (generate request ID)
└─> AuthMiddleware (validate API key)
└─> RateLimitMiddleware (check rate limits)
└─> LoggingMiddleware (log request)
3. Proxy Layer
└─> Parse and validate request
└─> Check cache (if not streaming)
└─> Inject MCP resource context (if MCP enabled)
└─> Execute tool loop (if tools requested)
└─> Convert to Ollama format
└─> Forward to Ollama
4. Ollama Processing
└─> Load model (if not loaded)
└─> Process request
└─> Generate response
5. Response Handling
└─> Receive response from Ollama
└─> Convert to OpenAI format
└─> Cache response (if not streaming)
└─> Return to client
6. HTTP Layer
└─> LoggingMiddleware (log response)
└─> Return HTTP response
1. Client Request (with tools)
└─> POST /v1/chat/completions
└─> Body: { model, messages, tools: [...] }
2. Proxy Layer
└─> Detect tool request
└─> Enter tool loop
3. Tool Loop (Multi-Round)
└─> Round 1:
├─> Call Ollama with tools
├─> Model returns tool calls
├─> Execute tools via MCP
└─> Inject tool results
└─> Round 2:
├─> Call Ollama with tool results
├─> Model may call more tools
└─> Repeat until done or limit reached
4. Final Response
└─> Model returns final answer
└─> Convert to OpenAI format
└─> Return to client
1. Client Request
└─> POST /v1/extensions/:name/execute
└─> Body: { input: {...} }
2. Extension Handler
└─> Get extension from registry
└─> Validate input
└─> Execute extension
3. Extension Execution
└─> Execute workflow (if workflow type)
├─> Step 1: LLM call (via LLMHandler)
├─> Step 2: Template render
├─> Step 3: File write
└─> Step N: Final result
└─> Return result
4. Response
└─> Format extension result
└─> Return to client
OpenAI Format (Client)
│
├─> { model, messages, temperature, stream }
│
▼
LlamaGate Processing
│
├─> Add MCP resource context (if enabled)
├─> Add tool descriptions (if tools available)
├─> Generate cache key
│
▼
Ollama Format
│
├─> { model, messages, options: { temperature }, stream }
│
▼
Ollama Server
│
├─> Process request
├─> Generate response
│
▼
Ollama Response
│
├─> { message: { role, content }, ... }
│
▼
LlamaGate Processing
│
├─> Convert to OpenAI format
├─> Cache response (if not streaming)
│
▼
OpenAI Format (Client)
│
├─> { id, object, created, model, choices: [...] }
Client Request
│
├─> { model, messages, tools: [...] }
│
▼
Tool Loop
│
├─> Round 1:
│ ├─> Call Ollama → Model returns tool_calls
│ ├─> Extract tool calls
│ ├─> Execute tools via MCP
│ └─> Inject tool results
│
├─> Round 2:
│ ├─> Call Ollama with results
│ ├─> Model may call more tools
│ └─> Repeat...
│
└─> Final round:
└─> Model returns final answer
main.go
│
├─> config.Load()
│ └─> Loads from .env, YAML, environment
│
├─> logger.Init()
│ └─> Sets up Zerolog
│
├─> cache.New()
│ └─> Creates in-memory cache
│
├─> proxy.NewWithTimeout()
│ ├─> Uses cache
│ └─> Creates HTTP client
│
├─> setup.InitializeMCP()
│ ├─> Creates ServerManager
│ ├─> Creates Clients (per server)
│ ├─> Creates ConnectionPool (HTTP)
│ ├─> Creates HealthMonitor
│ └─> Discovers tools/resources/prompts
│
├─> setup.ConfigureProxy()
│ ├─> Sets tool manager
│ └─> Sets guardrails
│
└─> extensions.NewRegistry()
└─> Creates extension registry
└─> extensions.DiscoverExtensions()
└─> Discovers extensions from extensions/ directory
Proxy ↔ Cache:
- Proxy checks cache before forwarding requests
- Proxy stores responses in cache
- Cache provides TTL-based expiration
Proxy ↔ MCP:
- Proxy uses ToolManager to get available tools
- Proxy uses ToolManager to execute tools
- ToolManager uses MCP clients to execute tools
Proxy ↔ Extensions:
- Proxy provides LLM handler to extensions
- Extensions can make LLM calls through proxy
- Extensions can access MCP tools via workflow steps
MCP ↔ Tools:
- MCP clients discover tools from servers
- ToolManager registers tools from MCP
- ToolManager converts MCP tools to OpenAI format
Layers:
- HTTP Layer - Request/response handling
- Proxy Layer - Business logic, format conversion
- Service Layer - MCP, extensions, tools
- Transport Layer - HTTP client, MCP transports
Benefits:
- Clear separation of concerns
- Easy to test
- Easy to modify
Pattern:
- Components receive dependencies via constructors
- No global state
- Easy to mock for testing
Example:
proxy := proxy.NewWithTimeout(ollamaHost, cache, timeout)Pattern:
- Components communicate via interfaces
- Easy to swap implementations
- Better testability
Examples:
Transportinterface (stdio, HTTP, SSE)LLMHandlerFuncinterface (for extensions)ServerManagerInterface(for proxy)
Pattern:
- Centralized registration and lookup
- Thread-safe access
- Extension and tool registries
Examples:
ExtensionRegistry- Extension registrationToolManager- Tool registrationServerManager- MCP server registration
Pattern:
- Factory functions for creating instances
- Configuration-based creation
- Default values
Examples:
NewProxy(),NewClient(),NewRegistry()DefaultPoolConfig(),DefaultPoolConfig()
Background Goroutines:
- HealthMonitor - Periodic health checks
- Cache Cleanup - TTL-based cache cleanup
- MCP Connection Pool - Connection management
Request Handling:
- Each HTTP request handled in separate goroutine
- Gin router manages goroutine pool
- No shared mutable state (except caches with locks)
Mutexes:
- Cache operations (read/write locks)
- Extension registry (read/write locks)
- MCP server manager (read/write locks)
- Connection pool (mutex for pool operations)
Channels:
- Health monitor stop signal
- Cache cleanup stop signal
- Graceful shutdown coordination
sync.Once:
- Health monitor start (prevents race conditions)
- Cache cleanup start (prevents race conditions)
- Health monitor stop (prevents double-close)
- Cache stop (prevents double-close)
Pattern:
- Errors bubble up from lower layers
- Structured error responses
- Request IDs for tracing
Error Types:
ValidationError- Invalid inputInternalError- Server errorsServiceUnavailable- MCP/extension system unavailableNotFound- Resource not foundRateLimitError- Rate limit exceeded
{
"error": {
"message": "Error description",
"type": "error_type",
"request_id": "550e8400-..."
}
}Priority Order:
- Environment variables (highest)
- YAML/JSON config files
.envfile- Default values (lowest)
Process:
- Load
.envfile (if exists) - Load YAML/JSON config (if exists)
- Override with environment variables
- Validate configuration
- Apply defaults for missing values
type Config struct {
// Core
OllamaHost string
APIKey string
RateLimitRPS float64
Debug bool
Port string
LogFile string
Timeout time.Duration
// MCP (optional)
MCP *MCPConfig
}Decision: Everything compiled into one executable
Rationale:
- Easy deployment
- No dependency management
- Fast startup
- Simple distribution
Decision: Cache is in-memory only, lost on restart
Rationale:
- Simple implementation
- Fast access
- No external dependencies
- Good for most use cases
Trade-off:
- Cache lost on restart
- Limited by memory
- No persistence
Decision: MCP is optional, disabled by default
Rationale:
- Reduces complexity for basic use cases
- Only enable when needed
- Faster startup without MCP
Decision: YAML-based extension system for custom functionality
Rationale:
- Allows customization without modifying core
- Enables agentic workflows
- Model-friendly (YAML manifest definitions)
- No compilation required
Decision: Perfect OpenAI API compatibility
Rationale:
- Zero migration effort
- Same SDKs work
- Drop-in replacement
- This is the core value proposition
- Cache Key: Model + message hash
- TTL: Configurable (default: 5 minutes)
- Size Limits: Configurable
- Thread-Safe: Read/write locks
- HTTP Transport: Connection pooling enabled
- Pool Size: Configurable (default: 10)
- Idle Timeout: Configurable (default: 5 minutes)
- Reuse: Connections reused across requests
- Algorithm: Leaky bucket
- Scope: Global (all requests)
- Configurable: Requests per second
- Response: 429 Too Many Requests
- Method: API key via header
- Header:
X-API-KeyorAuthorization: Bearer - Optional: Can be disabled
- Implementation: Constant-time comparison
- Algorithm: Leaky bucket
- Scope: Global
- Configurable: RPS limit
- Response: 429 with retry-after
- Allow Lists: Glob patterns for allowed tools
- Deny Lists: Glob patterns for denied tools
- Timeouts: Per-tool execution timeouts
- Size Limits: Maximum result size
- Round Limits: Maximum tool execution rounds
How to Extend:
- Create
manifest.yamlinextensions/directory - Define workflow steps, middleware hooks, or observer hooks
- Extensions are auto-discovered at startup
- Access LLM via
LLMHandlerFuncin workflow steps - Access MCP tools via workflow steps
How to Extend:
- Create workflow extension with
manifest.yaml - Define steps: template.load, template.render, llm.chat, file.write
- Extensions execute via
POST /v1/extensions/:name/execute
How to Extend:
- Add MCP server to config
- Server automatically discovered
- Tools automatically exposed
-
Persistent Cache
- Redis integration
- File-based cache
- Database-backed cache
-
HTTPS/TLS Support
- Native TLS support
- Let's Encrypt integration
- Certificate management
-
Monitoring Dashboard
- Health dashboard
- Metrics visualization
- Performance monitoring
-
Clustering
- Multi-instance support
- Load balancing
- Shared cache
-
Extension Marketplace (Future)
- Extension discovery
- Extension sharing
- Extension versioning
- Project Structure - Directory structure
- MCP Integration - MCP client details
- API Reference - HTTP API details
- Configuration Guide - Configuration options
Last Updated: 2026-01-09