
feat: add OllamaAdapter for local-inference benchmarking #1495

Open

SyncroAgency wants to merge 1 commit into garrytan:main from SyncroAgency:feat/ollama-adapter

Conversation

@SyncroAgency

Summary

Wires a fourth provider into /benchmark-models alongside Claude, GPT, and Gemini. Ollama talks HTTP directly to a local daemon (http://localhost:11434 by default) rather than shelling out to a CLI, so adapter setup is ollama serve instead of a login flow. Cost is zero (local inference) and the tool-call surface is empty (/api/generate is pure completion).

Why

Lets users compare local models (qwen2.5-coder:7b, llama3.2:3b, etc.) head-to-head against cloud providers using the same harness. Use cases:

  • Cost/quality tradeoff sizing before routing a task local vs cloud
  • Validating a local model is "good enough" for a specific skill prompt
  • Latency baselines for local inference vs cloud RTT

Strictly additive — zero behavioral change for existing claude/gpt/gemini paths.

Adapter behavior

| Method | Behavior |
| --- | --- |
| available() | Probes GET /api/tags with a 2s timeout. Reports a remediation hint on every failure mode (install URL, ollama serve, ollama pull &lt;model&gt;). |
| run() | POSTs to /api/generate with {model, prompt, stream: false}. Parses response, prompt_eval_count, eval_count, and model. |
| estimateCost() | Returns 0 via priced-at-zero rows in the PRICING table (future paid-GPU hosts can override per-model). |

Env overrides: GSTACK_OLLAMA_URL (custom host/port), GSTACK_OLLAMA_MODEL (default model, falls back to qwen2.5-coder:7b).
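
A condensed sketch of that surface, for reviewers skimming without the diff. The endpoint paths and response fields (/api/tags, /api/generate, response, prompt_eval_count, eval_count) follow Ollama's documented API; the return shapes and hint strings are illustrative, not the repo's actual types:

```ts
// Sketch of the adapter's HTTP surface. Endpoint paths and response
// fields follow Ollama's documented API; return shapes and hint text
// here are illustrative, not the repo's actual types.
const BASE_URL = process.env.GSTACK_OLLAMA_URL ?? "http://localhost:11434";
const DEFAULT_MODEL = process.env.GSTACK_OLLAMA_MODEL ?? "qwen2.5-coder:7b";

// available(): probe the local daemon's model list, with a hard 2s budget.
async function available(): Promise<{ ok: boolean; hint?: string }> {
  try {
    const res = await fetch(`${BASE_URL}/api/tags`, {
      signal: AbortSignal.timeout(2000),
    });
    if (!res.ok) return { ok: false, hint: "is `ollama serve` running?" };
    const { models } = (await res.json()) as { models: unknown[] };
    return models.length > 0
      ? { ok: true }
      : { ok: false, hint: `ollama pull ${DEFAULT_MODEL}` };
  } catch {
    // Unreachable daemon or probe timeout: point at install + serve.
    return { ok: false, hint: "https://ollama.com/download, then `ollama serve`" };
  }
}

// run(): single non-streaming completion.
async function run(prompt: string, model = DEFAULT_MODEL) {
  const res = await fetch(`${BASE_URL}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  const body = (await res.json()) as {
    response: string;
    prompt_eval_count: number;
    eval_count: number;
    model: string;
  };
  return {
    text: body.response,
    inputTokens: body.prompt_eval_count,
    outputTokens: body.eval_count,
    model: body.model,
  };
}

// estimateCost(): local inference is free; zero-cost PRICING rows back this.
function estimateCost(): number {
  return 0;
}
```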

Files touched

  • test/helpers/providers/types.ts — extend Family union with 'ollama'
  • test/helpers/providers/ollama.ts — new; the adapter itself
  • test/helpers/benchmark-runner.ts — register OllamaAdapter; switch hardcoded provider unions to Family type for forward compat
  • bin/gstack-model-benchmark — register adapter, generalize whitelist into a VALID_PROVIDERS array (parser still rejects unknown names with WARN, unchanged behavior)
  • test/helpers/tool-map.ts — add ollama row (all tools false — /api/generate exposes no agentic surface)
  • test/helpers/pricing.ts — zero-cost rows for qwen2.5-coder:7b, llama3.2:3b, nomic-embed-text
  • benchmark-models/SKILL.md.tmpl (+ regenerated SKILL.md) — include ollama in the default dry-run model list and describe the local-vs-cloud workflow
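
The wiring itself is mechanical. Roughly, assuming PRICING keys on model name with per-million-token rates (the real row shape is whatever pricing.ts already defines):

```ts
// types.ts — the Family union gains a fourth member.
type Family = "claude" | "gpt" | "gemini" | "ollama";

// bin/gstack-model-benchmark — whitelist generalized into an array;
// the parser still rejects unknown names with a WARN, as before.
const VALID_PROVIDERS: Family[] = ["claude", "gpt", "gemini", "ollama"];

// pricing.ts — zero-cost rows keep estimateCost() on a single code path
// across all four families. The per-million-token field names here are
// an assumption, not the file's actual shape.
const PRICING: Record<string, { inPerMTok: number; outPerMTok: number }> = {
  "qwen2.5-coder:7b": { inPerMTok: 0, outPerMTok: 0 },
  "llama3.2:3b": { inPerMTok: 0, outPerMTok: 0 },
  "nomic-embed-text": { inPerMTok: 0, outPerMTok: 0 },
};
```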

Tests

  • test/providers-ollama.test.ts — new; 14 offline unit tests via fetch stubbing covering:
    • availability probing (ok, no models, unreachable, non-2xx, env-overridden URL)
    • generate POST shape + model defaults + env override
    • ECONNREFUSED → binary_missing routing
    • 404 → remediation hint routing
    • AbortController timeout → timeout error code
    • zero-cost estimation
    • stable name/family identity
  • test/skill-e2e-benchmark-providers.test.ts — live smoke gated on EVALS=1. Skips cleanly if daemon not running. Skips ok-text assertion (local model quality varies).
  • test/benchmark-cli.test.ts — two new regression tests covering ollama acceptance in the --models whitelist and the valid-providers WARN message
  • test/benchmark-runner.test.ts — PRICING + missingTools + TOOL_COMPATIBILITY assertions extended to the fourth family
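
The offline tests hinge on swapping globalThis.fetch for a canned responder and restoring it afterward. A self-contained sketch of that pattern under bun:test, using a stand-in runViaGenerate helper where the real tests exercise OllamaAdapter:

```ts
import { test, expect, afterEach } from "bun:test";

// Always restore the real fetch so stubs can't leak across tests.
const realFetch = globalThis.fetch;
afterEach(() => {
  globalThis.fetch = realFetch;
});

// Stand-in for the adapter's run(); the real tests call OllamaAdapter.
async function runViaGenerate(model: string, prompt: string) {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  return (await res.json()) as Record<string, unknown>;
}

test("generate POST carries {model, prompt, stream:false}", async () => {
  let captured: unknown;
  globalThis.fetch = (async (_url: any, init: any) => {
    captured = JSON.parse(String(init?.body)); // record what the caller sent
    return Response.json({
      response: "hi",
      prompt_eval_count: 3,
      eval_count: 1,
      model: "qwen2.5-coder:7b",
    });
  }) as typeof fetch;

  const out = await runViaGenerate("qwen2.5-coder:7b", "hi");
  expect(captured).toEqual({ model: "qwen2.5-coder:7b", prompt: "hi", stream: false });
  expect(out.eval_count).toBe(1);
});

test("connection-refused errors surface for remediation routing", async () => {
  globalThis.fetch = (async () => {
    throw new Error("ECONNREFUSED");
  }) as typeof fetch;

  await expect(runViaGenerate("any", "hi")).rejects.toThrow("ECONNREFUSED");
});
```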

Test plan

  • bun test test/providers-ollama.test.ts — 14/14 pass (offline)
  • bun test test/benchmark-{cli,runner}.test.ts — green for ollama-related additions; pre-existing Windows-only failure (gpt: NOT READY env-strip test) unchanged
  • Full bun test — zero new failures vs main baseline
  • bun run gen:skill-docs --host all — regenerated; SKILL.md freshness checks pass
  • bun run build — compiles browse/design/pdf binaries; trailing chmod/rm steps fail on Windows shell but that's pre-existing
  • Manual smoke: bun run bin/gstack-model-benchmark --prompt hi --models claude,gpt,gemini,ollama --dry-run prints availability for all 4 with helpful remediation when ollama daemon is down
  • Live bun run test:e2e EVALS=1 with the daemon running — to be validated on a reviewer's machine (mine doesn't have Ollama running today; adapter responses are confirmed via offline stubs)

Out of scope

This PR does not wire Ollama into /qa, /review, /ship, or any other slash skill — those skills hardcode the current Claude session. Adding routing-layer integration would be a separate, larger architectural change. This PR is the minimal additive step that lets /benchmark-models ask "is local good enough?" with data.

