feat: add OllamaAdapter for local-inference benchmarking #1495
Open
SyncroAgency wants to merge 1 commit into
Conversation
Wires a fourth provider into `/benchmark-models` alongside Claude, GPT,
and Gemini. Ollama talks HTTP directly to a local daemon
(http://localhost:11434 by default) rather than shelling out to a CLI,
so adapter setup is `ollama serve` instead of a login flow. Cost is
zero (local inference) and tool-call surface is empty (the
`/api/generate` endpoint is pure completion).
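For reference, the raw round-trip the adapter wraps looks roughly like this. The endpoint and response fields follow Ollama's documented `/api/generate` API; the model name is only an example:

```ts
// Raw HTTP round-trip the adapter wraps (no gstack code involved).
const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3.2:3b",          // example model
    prompt: "Say hi in one word.",
    stream: false,                 // one JSON body instead of an NDJSON stream
  }),
});
const body = await res.json();
console.log(body.response);          // completion text
console.log(body.prompt_eval_count); // input tokens
console.log(body.eval_count);        // output tokens
```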
Why: lets users compare local models (qwen2.5-coder:7b, llama3.2:3b,
etc.) head-to-head against cloud providers using the same harness. Use
cases include cost/quality tradeoff sizing before routing a task local
vs cloud, and validating that a local model is "good enough" for a
specific skill prompt.
Adapter behavior (see the sketch after this list):
- available() probes GET /api/tags with 2s timeout; reports
remediation hint (install URL, ollama serve, ollama pull <model>)
on every failure mode.
- run() POSTs to /api/generate with {model, prompt, stream:false};
parses response, prompt_eval_count, eval_count, model.
- estimateCost() returns 0 via priced-at-zero rows in PRICING table
(future paid-GPU hosts can override per-model).
- Env overrides: GSTACK_OLLAMA_URL (custom host/port),
GSTACK_OLLAMA_MODEL (default model).
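A condensed sketch of how those pieces could hang together. The object shape, field names, and hint strings below are illustrative assumptions, not the PR's actual types:

```ts
// Illustrative sketch only: the adapter interface and result fields are
// assumptions, not the PR's actual types. Mirrors the behavior listed above.
const BASE_URL = process.env.GSTACK_OLLAMA_URL ?? "http://localhost:11434";
const DEFAULT_MODEL = process.env.GSTACK_OLLAMA_MODEL ?? "qwen2.5-coder:7b";

export const ollamaAdapter = {
  name: "ollama",
  family: "ollama" as const,

  // Probe the daemon; the 2s AbortSignal timeout keeps dry-runs snappy.
  async available(): Promise<{ ok: boolean; hint?: string }> {
    try {
      const res = await fetch(`${BASE_URL}/api/tags`, {
        signal: AbortSignal.timeout(2000),
      });
      if (!res.ok) {
        return { ok: false, hint: `ollama pull ${DEFAULT_MODEL}` };
      }
      return { ok: true };
    } catch {
      // Covers ECONNREFUSED and timeouts alike.
      return { ok: false, hint: "install https://ollama.com, then: ollama serve" };
    }
  },

  // Single non-streaming completion; token counts come back in the body.
  async run(prompt: string, model: string = DEFAULT_MODEL) {
    const res = await fetch(`${BASE_URL}/api/generate`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt, stream: false }),
    });
    const body = await res.json();
    return {
      text: body.response,
      inputTokens: body.prompt_eval_count,
      outputTokens: body.eval_count,
      model: body.model,
    };
  },

  // Local inference is free today; zero-cost PRICING rows make that explicit.
  estimateCost(): number {
    return 0;
  },
};
```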
Wiring touched (type-level sketch after this list):
- types.ts: extend Family union with 'ollama'.
- benchmark-runner.ts: import + register OllamaAdapter; switch
hardcoded unions to Family type for forward compat.
- bin/gstack-model-benchmark: register adapter, generalize
whitelist into a VALID_PROVIDERS array.
- tool-map.ts: add ollama row (all tools false — /api/generate
exposes no agentic surface).
- pricing.ts: add zero-cost rows for qwen2.5-coder:7b,
llama3.2:3b, nomic-embed-text.
- benchmark-models/SKILL.md(.tmpl): include ollama in default
dry-run model list + describe local-vs-cloud workflow.
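The type-level change is small. A sketch, assuming the union is a string-literal type (the three existing literal names are inferred from the summary, not copied from the diff):

```ts
// Sketch of the Family extension and whitelist generalization.
export type Family = "claude" | "gpt" | "gemini" | "ollama";

// Whitelist derived from the union, so a fifth provider is a one-line change.
const VALID_PROVIDERS: readonly Family[] = ["claude", "gpt", "gemini", "ollama"];

// Type guard a --models parser could use; unknown names still WARN upstream.
function isValidProvider(name: string): name is Family {
  return (VALID_PROVIDERS as readonly string[]).includes(name);
}
```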
Tests (stubbing pattern sketched after this list):
- test/providers-ollama.test.ts: 14 offline unit tests via fetch
stubbing covering availability probing, generate POST shape, model
override env vars, ECONNREFUSED→binary_missing routing, 404→remediation
hint routing, timeout via AbortController, zero-cost estimation,
stable name/family identity.
- test/skill-e2e-benchmark-providers.test.ts: live smoke gated on
EVALS=1 — skips cleanly if daemon not running.
- test/benchmark-cli.test.ts: regression coverage for ollama
acceptance in --models whitelist + valid-providers WARN message.
- test/benchmark-runner.test.ts: PRICING + missingTools + TOOL_COMPATIBILITY
assertions extended to the fourth family.
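The offline pattern is plain global-fetch substitution. A minimal sketch of one failure-routing case, assuming `bun:test` and the adapter shape sketched earlier (import path and test name are hypothetical):

```ts
// Minimal sketch of the fetch-stubbing pattern (bun:test assumed).
import { test, expect, afterEach } from "bun:test";
import { ollamaAdapter } from "./ollama"; // hypothetical path, per the sketch above

const realFetch = globalThis.fetch;
afterEach(() => {
  globalThis.fetch = realFetch; // always restore the real fetch
});

test("connection refused surfaces a remediation hint", async () => {
  // Simulate a daemon that is not running: fetch rejects like ECONNREFUSED.
  globalThis.fetch = (() =>
    Promise.reject(new Error("connect ECONNREFUSED 127.0.0.1:11434"))
  ) as unknown as typeof fetch;

  const { ok, hint } = await ollamaAdapter.available();
  expect(ok).toBe(false);
  expect(hint).toContain("ollama serve");
});
```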
Summary
Wires a fourth provider into `/benchmark-models` alongside Claude, GPT, and Gemini. Ollama talks HTTP directly to a local daemon (`http://localhost:11434` by default) rather than shelling out to a CLI, so adapter setup is `ollama serve` instead of a login flow. Cost is zero (local inference) and tool-call surface is empty (`/api/generate` is pure completion).
Why
Lets users compare local models (`qwen2.5-coder:7b`, `llama3.2:3b`, etc.) head-to-head against cloud providers using the same harness. Use cases: cost/quality tradeoff sizing before routing a task local vs cloud, and validating that a local model is "good enough" for a specific skill prompt.
Strictly additive — zero behavioral change for existing claude/gpt/gemini paths.
Adapter behavior
- `available()` — probes `GET /api/tags` with a 2s timeout. Reports a remediation hint on every failure mode (install URL, `ollama serve`, `ollama pull <model>`).
- `run()` — POSTs to `/api/generate` with `{model, prompt, stream:false}`. Parses `response`, `prompt_eval_count`, `eval_count`, `model`.
- `estimateCost()` — returns 0 via priced-at-zero rows in the PRICING table.
- Env overrides: `GSTACK_OLLAMA_URL` (custom host/port), `GSTACK_OLLAMA_MODEL` (default model, falls back to `qwen2.5-coder:7b`).
Files touched
- `test/helpers/providers/types.ts` — extend `Family` union with `'ollama'`.
- `test/helpers/providers/ollama.ts` — new, the adapter.
- `test/helpers/benchmark-runner.ts` — register `OllamaAdapter`; switch hardcoded provider unions to the `Family` type for forward compat.
- `bin/gstack-model-benchmark` — register adapter, generalize the whitelist into a `VALID_PROVIDERS` array (the parser still rejects unknown names with a WARN; behavior unchanged).
- `test/helpers/tool-map.ts` — add `ollama` row (all tools false — `/api/generate` exposes no agentic surface).
- `test/helpers/pricing.ts` — zero-cost rows for `qwen2.5-coder:7b`, `llama3.2:3b`, `nomic-embed-text`.
- `benchmark-models/SKILL.md.tmpl` + regenerated `SKILL.md` — include ollama in the default dry-run and describe the local-vs-cloud workflow.
Tests
- `test/providers-ollama.test.ts` — new, 14 offline unit tests via `fetch` stubbing, covering availability probing, the generate POST shape, model-override env vars, `binary_missing` routing, the timeout error code, zero-cost estimation, and stable name/family identity.
- `test/skill-e2e-benchmark-providers.test.ts` — live smoke gated on `EVALS=1`. Skips cleanly if the daemon is not running. Skips the ok-text assertion (local model quality varies).
- `test/benchmark-cli.test.ts` — +2 regression tests covering ollama in the `--models` whitelist + the valid-providers WARN message.
- `test/benchmark-runner.test.ts` — PRICING + `missingTools` + `TOOL_COMPATIBILITY` assertions extended to the fourth family.
Test plan
- `bun test test/providers-ollama.test.ts` — 14/14 pass (offline).
- `bun test test/benchmark-{cli,runner}.test.ts` — green for ollama-related additions; the pre-existing Windows-only failure (`gpt: NOT READY` env-strip test) is unchanged.
- `bun test` — zero new failures vs the `main` baseline.
- `bun run gen:skill-docs --host all` — regenerated; SKILL.md freshness checks pass.
- `bun run build` — compiles browse/design/pdf binaries; the trailing `chmod`/`rm` steps fail on a Windows shell, but that is pre-existing.
- `bun run bin/gstack-model-benchmark --prompt hi --models claude,gpt,gemini,ollama --dry-run` — prints availability for all 4 providers, with helpful remediation when the ollama daemon is down.
- `bun run test:e2e EVALS=1` with the daemon running — to be validated on a reviewer's machine (mine doesn't have Ollama running today; the adapter response is confirmed via offline stubs).
Out of scope
This PR does not wire Ollama into `/qa`, `/review`, `/ship`, or any other slash skill — those skills hardcode the current Claude session. Adding routing-layer integration would be a separate, larger architectural change. This PR is the minimal additive step that lets `/benchmark-models` ask "is local good enough?" with data.