The local provider connects Koan to any OpenAI-compatible LLM server running on your machine. This gives you full offline operation with no API costs — ideal for experimentation, privacy-sensitive work, or when you've exhausted your cloud quota.
Pick one:
Ollama (recommended for beginners)
# macOS
brew install ollama
# Start the server
ollama serve
# Pull a model
ollama pull qwen2.5-coder:14bOllama exposes an OpenAI-compatible API at http://localhost:11434/v1.
llama.cpp
# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
# Download a GGUF model and start the server
./llama-server -m /path/to/model.gguf --port 8080API at http://localhost:8080/v1.
LM Studio
Download from https://lmstudio.ai, load a model, and start the local
server from the UI. API at http://localhost:1234/v1.
vLLM
pip install vllm
vllm serve "Qwen/Qwen2.5-Coder-14B" --port 8000API at http://localhost:8000/v1.
In config.yaml:
cli_provider: "local"
local_llm:
base_url: "http://localhost:11434/v1" # Adjust for your server
model: "qwen2.5-coder:14b" # Model name on your server
api_key: "" # Usually empty for local serversOr via environment variables (in .env):
KOAN_CLI_PROVIDER=local
KOAN_LOCAL_LLM_BASE_URL=http://localhost:11434/v1
KOAN_LOCAL_LLM_MODEL=qwen2.5-coder:14b
KOAN_LOCAL_LLM_API_KEY=Environment variables override config.yaml values.
If you're using Ollama, the easiest way to start Koan is:
make ollamaThis single command starts all three components in the background:
ollama serve— the LLM server- The Telegram bridge (awake)
- The agent loop (run)
To stop everything:
make stopThis stops all Koan processes including ollama serve.
To check what's running:
make statusEnvironment variables from .env (like OLLAMA_HOST) are automatically
loaded and passed to all components.
If you want to manually verify the LLM server:
# Quick test with curl
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen2.5-coder:14b", "messages": [{"role": "user", "content": "Hello"}]}'Use local LLM for specific projects (e.g., small libraries) while keeping Claude for complex work:
# projects.yaml
defaults:
cli_provider: "claude"
projects:
critical-app:
path: "/path/to/app"
# Uses Claude (default)
side-project:
path: "/path/to/side"
cli_provider: "local" # Use local LLM for this projectUnlike Claude and Copilot which call external CLI binaries, the local
provider runs Koan's own local_llm_runner.py — a Python-based agentic
loop that:
- Sends your prompt + system context to the LLM via the OpenAI API
- Parses
tool_callsfrom the response (function calling format) - Executes tools locally (read, write, edit, grep, glob, shell)
- Feeds results back to the LLM
- Repeats until the LLM produces a final text response or max turns
This means any LLM server supporting the OpenAI function calling protocol will work.
Not all local models handle tool use (function calling) well. Models that work best with Koan's agentic loop:
| Model | Size | Tool Use | Notes |
|---|---|---|---|
qwen2.5-coder:14b |
14B | Good | Best balance of size and capability |
qwen2.5-coder:7b |
7B | Fair | Lighter, faster, less reliable tool use |
deepseek-coder-v2:16b |
16B | Good | Strong coding, good function calling |
codellama:34b |
34B | Fair | Needs more RAM, variable tool use |
mistral:7b |
7B | Basic | Fast but limited tool use |
Hardware requirements vary by model size:
- 7B models: 8GB RAM minimum, 16GB recommended
- 14B models: 16GB RAM minimum, 32GB recommended
- 34B+ models: 32GB+ RAM, consider GPU acceleration
| Feature | Claude Code | Local LLM |
|---|---|---|
| Binary | claude (external) |
Python runner (built-in) |
| Tool protocol | Native Claude tools | OpenAI function calling |
| Fallback model | Supported | Not supported |
| MCP support | Yes | No (tools are built-in) |
| Output format | JSON supported | JSON supported |
| Max turns | Supported | Supported |
| Cost | API subscription | Free (hardware only) |
| Quality | State of the art | Varies by model |
| Speed | Network latency | Local inference speed |
cli_provider: "local"
local_llm:
# Server URL — must expose /v1/chat/completions
base_url: "http://localhost:11434/v1"
# Model name — as recognized by your server
model: "qwen2.5-coder:14b"
# API key — usually empty for local servers
# Some servers (e.g., vLLM with auth) may require one
api_key: ""| Variable | Default | Description |
|---|---|---|
KOAN_CLI_PROVIDER |
claude |
Set to local to enable |
KOAN_LOCAL_LLM_BASE_URL |
http://localhost:11434/v1 |
Server URL |
KOAN_LOCAL_LLM_MODEL |
(none) | Model name (required) |
KOAN_LOCAL_LLM_API_KEY |
(none) | API key if needed |
| Server | Default URL |
|---|---|
| Ollama | http://localhost:11434/v1 |
| llama.cpp | http://localhost:8080/v1 |
| LM Studio | http://localhost:1234/v1 |
| vLLM | http://localhost:8000/v1 |
Your LLM server isn't running. Start it:
# Ollama
ollama serve
# llama.cpp
./llama-server -m /path/to/model.gguf
# Check if the server is up
curl http://localhost:11434/v1/modelsThe model name in your config doesn't match what's loaded on the server.
# Ollama — list available models
ollama list
# Pull a model if needed
ollama pull qwen2.5-coder:14bLocal models have variable tool-use capability. If the agent produces garbage or ignores tool results:
- Try a larger model (14B+ recommended for tool use)
- Try a different model family (Qwen2.5-Coder works well)
- Consider using Claude for complex missions and local LLM for simpler tasks via per-project configuration
Local inference speed depends on your hardware. Tips:
- Use quantized models (Q4_K_M is a good balance)
- Enable GPU acceleration in your server config
- Use a smaller model for chat/lightweight tasks
- Set
models.lightweightto a smaller local model
Most local servers don't require an API key. If yours does, set it in config:
local_llm:
api_key: "your-key-here"Or via environment:
KOAN_LOCAL_LLM_API_KEY=your-key-here