Run Marcus end-to-end on your own hardware — no API keys, no usage costs. This guide covers picking models, configuring Ollama, and running enough capacity to keep multiple agents busy in parallel.
Marcus is multi-agent. Two distinct LLM roles must both work:
| Role | What it does | Hard requirement |
|---|---|---|
| Planner | Decomposes a project description into a task graph on the board. Marcus calls it once at create_project time. |
Strong instruction-following + structured-output reliability |
| Workers | Actual coding agents (Claude Code, Codex, Aider, custom). Each pulls tasks from the board and writes code. | Must support tool / function calling — Marcus and MCP both depend on it |
⚠️ Worker models without tool-calling will silently fail. They can't invokerequest_next_task,report_task_progress,log_artifact, etc. If you pick a worker model, verify it advertises tool-calling support on its model card.
qwen3.5:35b-a3b-coding-nvfp4 runs comfortably on a 16GB+ M-series Mac and serves as both planner and worker. NVFP4 quantization is tuned for Apple Silicon — strong code generation, reliable structured output, and tool-calling support. If you're on a Mac, start here and skip the rest of the matrix.
ollama pull qwen3.5:35b-a3b-coding-nvfp4Capacity on 16GB unified memory: 1 planner + ~2 workers concurrently.
| Model | Quantization | Notes |
|---|---|---|
qwen3.5:35b-a3b-coding-nvfp4 |
NVFP4 | Best on Apple Silicon. Doubles as worker. |
qwen2.5-coder:7b |
Q4 or Q5 | Lowest known-working planner. Reliable on modest hardware. |
ministral:14b (Ministral-3-14B) |
Q4+ | Larger planner option — better task decomposition on complex projects. |
qwen2.5-coder:14b |
Q4+ | Higher-quality plans when you have RAM to spare. |
Anything below 7B has not produced reliable plans in our testing.
| Model | Notes |
|---|---|
qwen3.5:35b-a3b-coding-nvfp4 |
Best on Apple Silicon. Same model can serve the planner. |
qwen2.5-coder:7b / :14b / :32b |
Tool-calling supported, strong code generation. |
deepseek-coder (instruct variants) |
Tool-calling supported. |
| Hosted Claude / GPT via the worker agent itself | The easiest path — let Claude Code or Codex use their normal models. |
If you're unsure whether a model supports tool calling, check the Ollama model page for "Tools" in the capabilities list.
One Ollama process serves requests serially per model. If two workers ask the same ollama instance for completions at the same time, the second request waits. To get real parallelism:
- Option A — multiple Ollama instances. Launch additional
ollama serveprocesses on different ports (OLLAMA_HOST=127.0.0.1:11435 ollama serve, then point a worker at:11435). One instance per concurrent worker. - Option B —
OLLAMA_NUM_PARALLEL. Setexport OLLAMA_NUM_PARALLEL=4before starting Ollama to let a single instance handle multiple requests concurrently. Each parallel slot uses additional VRAM — verify you have headroom. - Option C — fewer workers. If hardware is tight, run 1 planner + 2 workers. Most coordination value shows up before you saturate the box.
Rule of thumb: 16GB unified memory → 1 planner + 2 workers. 32GB+ → 4+ workers comfortably.
curl -fsSL https://ollama.com/install.sh | sh
# Or download from https://ollama.com/download# Apple Silicon — best dual-role pick (planner + workers)
ollama pull qwen3.5:35b-a3b-coding-nvfp4
# Or, the lowest known-working planner for modest hardware
ollama pull qwen2.5-coder:7b
# Or, a larger planner option
ollama pull ministral:14bEdit config_marcus.json:
{
"ai": {
"provider": "local",
"enabled": true,
"local_model": "qwen3.5:35b-a3b-coding-nvfp4",
"local_url": "http://localhost:11434/v1",
"local_key": "none"
}
}Or override with environment variables (these win over config_marcus.json):
export MARCUS_LLM_PROVIDER=local
export MARCUS_LOCAL_LLM_PATH=qwen3.5:35b-a3b-coding-nvfp4
export MARCUS_LOCAL_LLM_URL=http://localhost:11434/v1./marcus start
./marcus board # check tasks land on the boardEach worker is a coding agent — most commonly Claude Code, but any MCP-compatible agent works. Point each worker at its own Ollama endpoint (see "Running multiple workers in parallel" above) and confirm the model supports tool calling.
{
"auto_find_board": false,
"kanban": {
"provider": "sqlite",
"sqlite_db_path": "./data/kanban.db",
"sqlite_attachments_dir": "./data/attachments"
},
"ai": {
"provider": "local",
"enabled": true,
"local_model": "qwen2.5-coder:7b",
"local_url": "http://localhost:11434/v1",
"local_key": "none",
"anthropic_api_key": "",
"openai_api_key": ""
},
"features": {
"events": true,
"context": true,
"memory": false,
"visibility": false
}
}Anything that speaks the OpenAI API works (llama.cpp server, LocalAI, text-generation-webui, vLLM):
{
"ai": {
"provider": "local",
"local_model": "your-model",
"local_url": "http://localhost:8080/v1",
"local_key": "your-api-key-if-needed"
}
}- Environment variables (
MARCUS_*) config_marcus.json- Built-in defaults
export OLLAMA_NUM_CTX=8192 # bigger context window
export OLLAMA_NUM_PARALLEL=4 # concurrent requests per instance
export OLLAMA_KEEP_ALIVE=30m # keep model resident between callsLocal-provider request timeout is 120s by default.
export MARCUS_LLM_PROVIDER=anthropic # or openaiOr set "ai.provider" in config_marcus.json.
Failed to connect to local LLM server
ollama list— is Ollama actually running?curl http://localhost:11434/api/tags— does it answer?- Did you pull the model?
ollama pull <model>
Worker silently does nothing / never calls request_next_task
- The model likely lacks tool-calling support. Switch to a model whose card lists Tools as a capability.
Plans come back malformed / empty
- Your planner model is too small or too quantized. Try
qwen2.5-coder:7bat Q5 minimum.
Second worker stalls when first is busy
- One Ollama instance, no parallelism. Set
OLLAMA_NUM_PARALLELor run a secondollama serveon a different port.
Slow responses
- Smaller model, GPU acceleration, lower
max_tokens, or reduceOLLAMA_NUM_CTX.
- Privacy — code never leaves your machine
- Cost — zero per-token charges, run as many experiments as you want
- Offline — works on a plane
- Reproducibility — pin a quantization, get the same outputs
- Browse
good first issueand try a contribution end-to-end on local models. - See Configuration Reference for every option.
- See PROTOCOL.md if you're building a worker runner for a non-Claude agent.