The LLM Router provides a unified interface for text generation across multiple LLM providers, with built-in caching, retry logic, and integration with the existing IPFS Accelerate endpoint multiplexing infrastructure.
- Unified API: Single
generate_text()function works with all providers - Multiple Providers: OpenRouter, Codex CLI, Copilot CLI/SDK, Gemini, Claude, Backend Manager, local HuggingFace
- Automatic Fallback: Tries multiple providers in order of availability
- Response Caching: CID-based or SHA256-based caching for deterministic results
- Dependency Injection: Share resources (caches, managers) across calls
- Integration: Works seamlessly with existing CLI wrappers and backend manager
- No Duplication: Reuses existing infrastructure (DualModeWrapper, BaseCLIWrapper)
from ipfs_accelerate_py import generate_text
# Auto-select best available provider
response = generate_text("Explain what IPFS is in one sentence")
print(response)from ipfs_accelerate_py import generate_text
# Use OpenRouter
response = generate_text(
"Write a Python function to calculate fibonacci",
provider="openrouter",
model_name="openai/gpt-4o-mini",
max_tokens=256,
temperature=0.7
)import os
from ipfs_accelerate_py import generate_text
# Enable response cache (enabled by default)
os.environ["IPFS_ACCELERATE_PY_ROUTER_RESPONSE_CACHE"] = "1"
# First call - cache miss
response1 = generate_text("What is 2+2?")
# Second call - cache hit (much faster)
response2 = generate_text("What is 2+2?")from ipfs_accelerate_py import register_llm_provider, generate_text
# Define a custom provider
class MyProvider:
def generate(self, prompt, *, model_name=None, **kwargs):
return f"Custom response to: {prompt}"
# Register it
register_llm_provider("my_provider", lambda: MyProvider())
# Use it
response = generate_text("test", provider="my_provider")from ipfs_accelerate_py import RouterDeps, generate_text
# Create shared deps container
deps = RouterDeps()
# You can inject pre-configured components
# deps.backend_manager = my_backend_manager
# deps.remote_cache = my_remote_cache
# All calls will share the same resources
response1 = generate_text("First request", deps=deps)
response2 = generate_text("Second request", deps=deps)
print(f"Shared cache has {len(deps.router_cache)} items")API-based access to multiple LLM providers.
Configuration:
export OPENROUTER_API_KEY="your-api-key"
export IPFS_ACCELERATE_PY_OPENROUTER_MODEL="openai/gpt-4o-mini" # optionalUsage:
response = generate_text(
"Your prompt",
provider="openrouter",
model_name="openai/gpt-4o-mini",
max_tokens=256
)Uses existing OpenAI Codex CLI integration.
Configuration:
export IPFS_ACCELERATE_PY_CODEX_MODEL="gpt-3.5-turbo" # optionalUsage:
response = generate_text(
"Write a sorting algorithm",
provider="codex_cli"
)Uses existing GitHub Copilot CLI wrapper.
Usage:
response = generate_text(
"Suggest a git command to...",
provider="copilot_cli"
)Uses existing GitHub Copilot SDK wrapper.
Configuration:
export IPFS_ACCELERATE_PY_COPILOT_SDK_MODEL="gpt-4o" # optionalUsage:
response = generate_text(
"Your prompt",
provider="copilot_sdk",
model_name="gpt-4o"
)Uses existing Gemini CLI integration (SDK-based).
Configuration:
export GOOGLE_API_KEY="your-api-key"
export IPFS_ACCELERATE_PY_GEMINI_MODEL="gemini-pro" # optionalUsage:
response = generate_text(
"Explain quantum computing",
provider="gemini_cli",
temperature=0.7
)Uses existing Claude Code CLI integration (SDK-based).
Configuration:
export ANTHROPIC_API_KEY="your-api-key"
export IPFS_ACCELERATE_PY_CLAUDE_MODEL="claude-3-5-sonnet-20241022" # optionalUsage:
response = generate_text(
"Write a unit test for...",
provider="claude_code"
)Uses InferenceBackendManager for distributed/multiplexed inference.
Note: This provider requires backends to expose a callable instance attribute.
The actual execution behavior depends on backend implementation. This is a best-effort
provider intended for future integration with fully-featured backend instances.
Configuration:
export IPFS_ACCELERATE_PY_ENABLE_BACKEND_MANAGER="1"
export IPFS_ACCELERATE_PY_LLM_LOAD_BALANCING="round_robin" # or least_loaded, best_performanceUsage:
response = generate_text(
"Your prompt",
provider="backend_manager"
)Current Limitations:
- Requires backends with callable
instanceattribute for inference execution - The InferenceBackendManager singleton is used (config parameters documented for future use)
Fallback to local transformers models.
Configuration:
export IPFS_ACCELERATE_PY_LLM_MODEL="gpt2" # or any HF modelUsage:
response = generate_text(
"Your prompt",
provider="local_hf",
model_name="gpt2"
)IPFS_ACCELERATE_PY_LLM_PROVIDER: Force a specific provider (bypasses auto-detection)
IPFS_ACCELERATE_PY_ROUTER_CACHE: Enable/disable provider caching (default: "1")IPFS_ACCELERATE_PY_ROUTER_RESPONSE_CACHE: Enable/disable response caching (default: "1")IPFS_ACCELERATE_PY_ROUTER_CACHE_KEY: Cache key strategy ("sha256" or "cid", default: "sha256")IPFS_ACCELERATE_PY_ROUTER_CACHE_CID_BASE: CID encoding base (default: "base32")
IPFS_ACCELERATE_PY_ENABLE_BACKEND_MANAGER: Enable backend manager provider (default: "0")IPFS_ACCELERATE_PY_LLM_LOAD_BALANCING: Load balancing strategy (default: "round_robin")
IPFS_ACCELERATE_PY_LLM_MODEL: Default model for providers that support it
When no provider is specified, the router tries providers in this order:
- Backend Manager (if enabled via env var)
- OpenRouter (if API key configured)
- Codex CLI (if available)
- Copilot CLI (if available)
- Copilot SDK (if available)
- Gemini (if API key configured)
- Claude (if API key configured)
- Local HuggingFace (if transformers installed)
The router integrates with existing CLI wrappers without duplication:
# These are already available and used by the router
from ipfs_accelerate_py.cli_integrations import (
OpenAICodexCLIIntegration,
GeminiCLIIntegration,
ClaudeCodeCLIIntegration
)
from ipfs_accelerate_py.copilot_cli.wrapper import CopilotCLI
from ipfs_accelerate_py.copilot_sdk.wrapper import CopilotSDKThe router can use the InferenceBackendManager for distributed inference:
from ipfs_accelerate_py import generate_text
import os
# Enable backend manager
os.environ["IPFS_ACCELERATE_PY_ENABLE_BACKEND_MANAGER"] = "1"
# This will route through the backend manager
response = generate_text("Your prompt", provider="backend_manager")Response caching integrates with the existing CID-based caching:
from ipfs_accelerate_py import generate_text
import os
# Use CID-based caching
os.environ["IPFS_ACCELERATE_PY_ROUTER_CACHE_KEY"] = "cid"
# Responses are cached by content-addressed CID
response = generate_text("Your prompt")from ipfs_accelerate_py import get_llm_provider, generate_text
# Get a provider instance
provider = get_llm_provider("openrouter")
# Reuse it for multiple requests
response1 = generate_text("First", provider_instance=provider)
response2 = generate_text("Second", provider_instance=provider)from ipfs_accelerate_py import RouterDeps, generate_text
from ipfs_accelerate_py.inference_backend_manager import get_backend_manager
# Create custom deps with pre-configured components
deps = RouterDeps()
deps.backend_manager = get_backend_manager()
# All calls use the same backend manager
response1 = generate_text("First", deps=deps)
response2 = generate_text("Second", deps=deps)from ipfs_accelerate_py import clear_llm_router_caches
# Clear internal provider caches
clear_llm_router_caches()See examples/llm_router_example.py for comprehensive usage examples.
Run the example:
python examples/llm_router_example.pyRun the integration tests:
python test/test_llm_router_integration.pyThe LLM Router follows these design principles:
- No Import-Time Side Effects: All heavy imports are lazy
- Reuse Existing Infrastructure: No duplication of CLI wrappers or backend managers
- Dependency Injection: Optional
RouterDepsfor sharing resources - Provider Registry: Extensible via
register_llm_provider() - Automatic Fallback: Tries multiple providers in order
- CID-Based Caching: Content-addressed caching for determinism
- Integration Ready: Works with existing endpoint multiplexing
- Full Integration: Seamlessly works with existing InferenceBackendManager
- No Duplication: Reuses all existing CLI/SDK wrappers
- Distributed Ready: Supports distributed/P2P inference via backend manager
- CID Caching: Built-in CID-based caching support
- Existing Patterns: Follows DualModeWrapper and BaseCLIWrapper patterns
- Endpoint Multiplexing: Can multiplex across peers via backend manager
- Add streaming support for long responses
- Add token counting and rate limiting
- Add more provider-specific optimizations
- Add metrics and monitoring integration
- Add distributed caching via libp2p
- Add provider health checks
See the main project LICENSE file.