feat(serve): qmd serve — shared model server, continuing #511#663

Open
brettdavies wants to merge 6 commits into
tobi:mainfrom
brettdavies:feat/qmd-serve-low-vram
@brettdavies brettdavies commented May 19, 2026

qmd serve: shared model server, continuing #511

Continuation note: Three of the six commits in this PR are @jaylfc's work from #511, cherry-picked with -x (preserving his Author: line and recording the source SHAs). His PR has been open since April; this PR rebases it onto current main, applies the changes he's been iterating on, and adds two small things on top: a low-vram passthrough that piggybacks on the engine flag from #662, and three UX touches to make the combination discoverable. Of the other three commits, one is the #662 engine commit this PR stacks on and two are mine: the ~17-line UX commit and a later store-layer follow-up fix. If you'd rather take #511 and have me re-open a follow-up against it, I can do that. I wanted to put the rebased + stacked version in front of you first since #511 hasn't moved.

Why

Multiple QMD clients each cold-load the embed, generate, and rerank models into their own process. On a 16 GB device running three agents, that's 3× the model memory; on ARM64 / headless servers, each client has to compile node-llama-cpp against whatever GPU drivers are installed locally.

qmd serve centralises the three models behind an HTTP server. Clients (CLIs in LXC containers, agents on separate hosts) talk to a single warm instance via the RemoteLLM class introduced in this PR, activated by QMD_REMOTE_URL=http://host:7832. The remote client implements the same LLM interface as LlamaCpp, so it slots in transparently anywhere the SDK accepts an LLM.
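
A minimal sketch of the activation rule described above, assuming a much smaller interface than the real one. `StubLocal`, `StubRemote`, `PostFn` and `pickBackend` are hypothetical names used only for illustration, and the `/embed` request and response shapes are assumptions, not the PR's actual wire format.

```typescript
// Hypothetical, trimmed-down stand-in for the SDK's LLM interface.
interface LLM {
  readonly kind: "local" | "remote";
  embed(text: string): Promise<number[]>;
}

// Injected transport so the sketch does not depend on a particular HTTP client.
type PostFn = (url: string, body: unknown) => Promise<unknown>;

// Stand-in for the in-process engine (LlamaCpp in the real code).
class StubLocal implements LLM {
  readonly kind = "local" as const;
  async embed(text: string): Promise<number[]> {
    return [text.length]; // placeholder vector, not a real embedding
  }
}

// Stand-in for RemoteLLM: forwards the call to the server's /embed endpoint.
class StubRemote implements LLM {
  readonly kind = "remote" as const;
  constructor(readonly baseUrl: string, private readonly post: PostFn) {}
  async embed(text: string): Promise<number[]> {
    // Request and response field names are assumptions.
    const data = (await this.post(`${this.baseUrl}/embed`, { text })) as {
      embedding: number[];
    };
    return data.embedding;
  }
}

// QMD_REMOTE_URL set: use the remote client. Unset or blank: stay local.
function pickBackend(
  env: Record<string, string | undefined>,
  post: PostFn,
): LLM {
  const url = env.QMD_REMOTE_URL?.trim();
  return url ? new StubRemote(url.replace(/\/+$/, ""), post) : new StubLocal();
}
```

Because both stand-ins satisfy the same interface, callers never branch on which one they were handed; that is the property the description relies on when it says the remote client slots in anywhere an LLM is accepted.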

Use cases the original PR called out:

  • Multi-agent setups sharing one embedding server.
  • LXC/Docker containers reaching a host-level GPU/NPU without compiling node-llama-cpp inside each container.
  • ARM64 / headless servers: bypass local llama.cpp build entirely.
  • ARM SBCs (Orange Pi RK3588 with NPU) where RAM is scarce.

I've been running this fork on an RTX 3090 Ti alongside an Ollama Gemma 4 26B; the low-vram combination (this PR combined with the engine flag below) made the three-model fit possible.

What this PR does

  1. qmd serve: long-running HTTP server. Two backends:
      • local (default): loads GGUF models via node-llama-cpp.
      • ollama: proxies to any Ollama-compatible REST endpoint (Ollama itself, rkllama on RK3588 NPU, etc).
  2. HTTP endpoints: /embed, /embed-batch, /rerank, /expand, /tokenize, /vsearch, /health, /status, /collections, /search, /browse.
  3. RemoteLLM (src/llm-remote.ts): LLM-interface implementation that POSTs to the server. Auto-activates via QMD_REMOTE_URL (or --remote-url <url> per-invocation).
  4. --low-vram passthrough on serve: picks up the engine flag from #662 (feat(llm): add --low-vram to share a GPU with another large model). qmd serve --low-vram loads one heavy model at a time so the serve process can coexist with a large local Ollama on the same GPU. Warns rather than silently ignoring when combined with --backend ollama (ollama owns its own model lifecycle).
  5. Startup log + help text surface the low-vram mode and the model lifecycle (resident vs on-demand) so an operator can see at a glance what the server will load.

| Endpoint | Method | Purpose |
|---|---|---|
| /embed, /embed-batch | POST | embedding (single + batch) |
| /rerank | POST | document reranking |
| /expand | POST | query expansion (lex/vec/hyde) |
| /tokenize | POST | token count |
| /health | GET | server status + loaded models |
| /status | GET | index health (doc counts, embedding status) |
| /collections | GET | collection list |
| /search?q=… | GET | FTS5 keyword search |
| /vsearch | POST | vector similarity search |
| /browse | GET | paginated chunk listing |

Default bind is 127.0.0.1; --bind 0.0.0.0 exposes the server to the network. Request bodies are capped at 50 MB. Every endpoint strictly validates its input.
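
To make the body cap and input validation concrete, here is a hedged sketch of what a check for /embed-batch might look like. The `texts` field name, the status codes, the error strings and the exact byte count behind "50 MB" are all assumptions; `parseEmbedBatch` is a hypothetical helper, not a function from this PR.

```typescript
// 50 MB cap; whether the server counts MiB or MB is an assumption.
const MAX_BODY_BYTES = 50 * 1024 * 1024;

type Parsed =
  | { ok: true; texts: string[] }
  | { ok: false; status: number; error: string };

// contentLength is the request's declared size in bytes; raw is the body text.
function parseEmbedBatch(contentLength: number, raw: string): Parsed {
  // Reject oversized requests before attempting to parse them.
  if (contentLength > MAX_BODY_BYTES) {
    return { ok: false, status: 413, error: "body too large" };
  }
  let body: unknown;
  try {
    body = JSON.parse(raw);
  } catch {
    return { ok: false, status: 400, error: "invalid JSON" };
  }
  if (typeof body !== "object" || body === null) {
    return { ok: false, status: 400, error: "expected a JSON object" };
  }
  // Type validation: every element must be a string.
  const texts = (body as { texts?: unknown }).texts;
  if (
    !Array.isArray(texts) ||
    !texts.every((t): t is string => typeof t === "string")
  ) {
    return { ok: false, status: 400, error: "texts must be an array of strings" };
  }
  return { ok: true, texts };
}
```

Returning a tagged result instead of throwing keeps the HTTP handler's error path explicit: the handler maps `status` and `error` straight onto the response.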

Relationship to other open PRs

This PR is stacked on the engine --low-vram PR (filed alongside as #662). That PR adds lowVram to LlamaCppConfig and surfaces it as a global flag/env. The serve passthrough in this PR is then trivial: qmd serve --low-vram just sets QMD_LOW_VRAM=1, the LocalBackend's LlamaCpp constructor reads it, and the heavy-model lifecycle is handled in the engine. No serve-specific concurrency code.
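
The passthrough can be pictured roughly as follows. This is a sketch under stated assumptions: `ServeOptions`, `applyServeFlags` and `engineLowVram` are hypothetical names, the warning text is invented, and whether the real code skips setting the variable for the ollama backend is a guess.

```typescript
type Backend = "local" | "ollama";

interface ServeOptions {
  backend: Backend;
  lowVram: boolean;
}

// Serve side: translate the flag into the env var the engine reads.
// Returns any warnings the caller should print.
function applyServeFlags(
  opts: ServeOptions,
  env: Record<string, string | undefined>,
): string[] {
  const warnings: string[] = [];
  if (!opts.lowVram) return warnings;
  if (opts.backend === "ollama") {
    // The upstream server owns its model lifecycle, so the flag cannot apply.
    warnings.push(
      "--low-vram has no effect with --backend ollama; configure keep-alive on the upstream server instead",
    );
    return warnings;
  }
  env.QMD_LOW_VRAM = "1";
  return warnings;
}

// Engine side: the local engine's constructor would consult the same variable.
function engineLowVram(env: Record<string, string | undefined>): boolean {
  return env.QMD_LOW_VRAM === "1";
}
```

Keeping the serve layer down to "set one env var" is what lets the dispose and reload logic live entirely in the engine, which is the point the paragraph above makes.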

This PR also overlaps in problem space with #608 ("Daemon-aware CLI fast-path"). Both keep models warm across requests. The honest comparison:

| | qmd mcp --http --daemon (+ #608) | qmd serve (this PR) |
|---|---|---|
| Protocol | MCP over Streamable HTTP | Plain REST |
| What it exposes | High-level tools: query, get, multi_get | Low-level model primitives + index endpoints |
| Who calls it | MCP agents + same-machine CLI (after #608) | RemoteLLM qmd clients (multi-host, LXC) |
| Bind | localhost (PID-file discovery) | 127.0.0.1 by default; designed to be exposed via --bind 0.0.0.0 |
| Backend swap | local node-llama-cpp only | local or ollama-compat proxy |

There's a real question (open below) about whether these should remain two separate daemons or whether the model-primitive endpoints + RemoteLLM should land as an extension of the MCP daemon instead. The substance of @jaylfc's work (the primitive endpoints, the RemoteLLM client, the ollama proxy backend) survives either shape; only the top-level serve subcommand vs mcp --http --daemon choice changes.

Architecture

c9d6c27  Brett    feat(llm): add low-vram mode that loads one heavy model at a time   ← #662
7a1bda7  @jaylfc   feat: remote model server (qmd serve) and client support
34519c3  @jaylfc   feat(serve): add /vsearch endpoint for semantic vector search
87d35e2  @jaylfc   refactor: rename rkllama backend to ollama-compat — generic naming
d4590fa  Brett    feat(serve): surface low-vram in serve help, startup log, ollama warning
fbfcdcd  Brett    fix(remote): route store layer through getDefaultLLM so QMD_REMOTE_URL reaches RemoteLLM

The top commit (c9d6c27) is #662, which this stacks on. The middle three are @jaylfc's feat/remote-llm-provider-clean cherry-picked verbatim onto current main (preserving his Author: field + recording the source SHA in each commit message). One conflict resolution was needed beyond mechanical re-application: b5f156c "Improve qmd diagnostics and embed resilience" landed on main while this branch was being rebased, refactoring the embedding error-handling into tryEmbedChunk / recordFailure / retryFailedChunks. That supersedes @jaylfc's manual 3-attempt retry loop, so the rebase takes main's structure on those hunks; less code, integrates with the new failure tracker.

My next commit (d4590fa) is 17 lines: serve help text, startup log labels (resident / on-demand), and a warning when --low-vram combines with --backend ollama.

Follow-up commit (fbfcdcd): completing the store-layer wiring

Caught while running qmd query against the daemon on a VRAM-constrained box: the CLI sets setDefaultLLM(new RemoteLLM(...)) correctly, but the store layer hardcoded getDefaultLlamaCpp() at every embed / expandQuery / rerank call site (src/store.ts:84/3562/3737/3783). The polymorphic accessor was set and then bypassed one layer deeper. On a box with enough headroom the bypass is invisible: clients silently pay for two LLM backends. With Ollama co-resident (~21 GB on a 24 GB GPU), the rerank stage fails with Failed to create any rerank context.
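
A simplified sketch of the routing after the fix, to show why the earlier hardcoded accessor defeated the override. Only the names `setDefaultLLM`, `getDefaultLLM` and `llmOverride` come from the description; the one-method interface, the `rerankChunks` call site and the throw-when-unset behaviour are assumptions made to keep the example small.

```typescript
// Trimmed-down stand-in for the LLM interface.
interface LLM {
  rerank(query: string, docs: string[]): Promise<number[]>;
}

let defaultLLM: LLM | null = null;

// Called once at the CLI entrypoint, e.g. with a RemoteLLM instance.
function setDefaultLLM(llm: LLM): void {
  defaultLLM = llm;
}

// Polymorphic accessor; throwing when unset is a simplification for this sketch.
function getDefaultLLM(): LLM {
  if (!defaultLLM) throw new Error("no default LLM configured");
  return defaultLLM;
}

// Hypothetical store call site after the fix: an explicit override wins,
// otherwise the polymorphic default is used. Before the fix, the equivalent
// line always asked for the local engine, so the configured default was ignored.
async function rerankChunks(
  query: string,
  docs: string[],
  llmOverride?: LLM,
): Promise<number[]> {
  const llm = llmOverride ?? getDefaultLLM();
  return llm.rerank(query, docs);
}
```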

The follow-up commit:

  • Routes the four store call sites through getDefaultLLM() and widens their llmOverride?: LlamaCpp params to LLM. Store.llm widens to LLM to match the documented "Can be LlamaCpp or RemoteLLM" comment.
  • Guards cli/qmd.ts:getStore so it skips the eager setDefaultLlamaCpp(new LlamaCpp(...)) when QMD_REMOTE_URL is set. The --remote-url flag mirrors into the env so one source of truth gates the guard.
  • Adds optional embedModelName / generateModelName / rerankModelName readonly accessors to the LLM interface so existing consumers keep their ?? DEFAULT_* fallbacks without casting.
  • Replaces the buggy lazy require("./llm-remote.js") in getDefaultLLM (which fails under vitest ESM resolution) with a static import. No runtime cycle: llm-remote.ts only type-imports from llm.ts.
  • Adds a requireLlamaCpp(llm, op) narrow helper for indexing paths (generateEmbeddings, embedding fingerprint adoption) that genuinely need the local engine. It accepts a real LlamaCpp unconditionally, refuses with a clear message when the active LLM is RemoteLLM, and casts through for duck-typed test fixtures so existing mocks keep working.
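
The narrowing helper in the last bullet could look roughly like this. The class bodies are stand-ins (the real `LlamaCpp` and `RemoteLLM` are far larger), `tokenCount` is a hypothetical local-only method, and the error wording is an assumption.

```typescript
interface LLM {
  embed(text: string): Promise<number[]>;
}

// Stand-in for the local engine.
class LlamaCpp implements LLM {
  async embed(text: string): Promise<number[]> {
    return [text.length]; // placeholder vector
  }
  // Hypothetical local-only capability that indexing paths might rely on.
  tokenCount(text: string): number {
    return text.split(/\s+/).filter(Boolean).length;
  }
}

// Stand-in for the remote client.
class RemoteLLM implements LLM {
  constructor(readonly baseUrl: string) {}
  async embed(_text: string): Promise<number[]> {
    throw new Error("network call omitted in this sketch");
  }
}

// Narrow an LLM to LlamaCpp for operations that need the local engine.
function requireLlamaCpp(llm: LLM, op: string): LlamaCpp {
  // A real local engine is always accepted.
  if (llm instanceof LlamaCpp) return llm;
  // The remote client is refused with an explicit message.
  if (llm instanceof RemoteLLM) {
    throw new Error(
      `${op} requires a local LlamaCpp engine, but the active LLM is RemoteLLM`,
    );
  }
  // Anything else is assumed to be a duck-typed test fixture and is cast through.
  return llm as unknown as LlamaCpp;
}
```

The cast in the final branch trades type safety for compatibility with existing mocks; a fixture missing a local-only method would only fail when that method is called.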

Test additions:

  • New test/store-remote-llm.test.ts (6 tests): covers both llmOverride and getDefaultLLM() routing for expandQuery and rerank, plus the Store adapter routing through Store.llm. All pass under both Node/vitest and Bun.
  • LlamaCpp Integration describe block in test/store.test.ts now opportunistically routes through qmd serve: probes http://127.0.0.1:7832/health in beforeAll, calls setDefaultLLM(new RemoteLLM(...)) when reachable, falls back to local LlamaCpp when not. The 3 rerank tests in that block which previously failed under VRAM pressure now pass via the daemon, turning the integration suite into an end-to-end regression check for the remote path.
  • Test environment hermetics: vitest.config.ts and src/test-preload.ts both unset QMD_REMOTE_URL so a developer's shell-set env var doesn't accidentally route 200+ unrelated tests through a real daemon (or fail them with ECONNREFUSED when one isn't running).
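
The opportunistic probe in the second bullet amounts to "try /health, fall back on any failure". A hedged sketch, with the probe injected so it can be exercised without a server; `HealthProbe` and `chooseTestBackend` are hypothetical names, and treating any non-OK response as "use local" is an assumption.

```typescript
// Minimal shape of the response we care about; fetch's Response satisfies it.
type HealthProbe = (url: string) => Promise<{ ok: boolean }>;

// Decide which backend an integration suite should use.
async function chooseTestBackend(
  probe: HealthProbe,
  baseUrl: string = "http://127.0.0.1:7832",
): Promise<"remote" | "local"> {
  try {
    const res = await probe(`${baseUrl}/health`);
    return res.ok ? "remote" : "local";
  } catch {
    // Connection refused, timeout, DNS failure: no daemon, so use the local engine.
    return "local";
  }
}
```

In a `beforeAll` hook the probe would typically be a thin wrapper around `fetch` with a short timeout, and a `"remote"` result would be followed by `setDefaultLLM(new RemoteLLM(...))` as the bullet describes.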

Tests

Engine-level concurrency tests for the low-vram mode live in #662 (test/llm-low-vram.test.ts, 6 tests, all pass against a fake llama backend so they run without GPU). The store-layer wiring fix in fbfcdcd adds 6 new tests in test/store-remote-llm.test.ts and converts the existing LlamaCpp Integration describe block in test/store.test.ts into an opportunistic remote regression check (see the follow-up section above).

Full suite on this branch with qmd serve running locally: 880/903 pass. The 23 remaining failures: 14 LLM-class tests in test/llm.test.ts and test/mcp.test.ts that exercise the LlamaCpp class directly (not through the store) and fail under VRAM pressure (Failed to create any rerank context), and 9 CLI tests that fail on Node 26 due to DEP0205 deprecation warnings leaking to stderr. Both families reproduce against bare upstream/main in the same environment.

tsc --noEmit clean.

Backwards compatibility

  • New serve subcommand and RemoteLLM class. Purely additive.
  • QMD_REMOTE_URL and --remote-url are new (renamed from QMD_SERVER / --server in @jaylfc's earlier draft per a naming discussion; see the rename commit for the rationale). Nothing in main reads them.
  • The existing qmd query path is unchanged: with no QMD_REMOTE_URL set, behaviour is identical to today.

Things I deliberately didn't do

  • Auth. Bind defaults to 127.0.0.1; if you want network exposure (--bind 0.0.0.0), the server is expected to sit behind a firewall / Tailscale / etc. Adding token auth or mTLS is a separate concern.
  • Unify with the MCP daemon. Real question worth maintainer input; see open questions below.
  • Replace --backend rkllama with anything more specific. @jaylfc's rename to ollama-compat is the right generic name; specific NPU implementations like rkllama remain compatible via that interface.

Open questions for you

  1. Should this and qmd mcp --http --daemon be one process or two? They're both long-running HTTP servers that hold models warm. The differences (MCP protocol vs REST, agent tools vs model primitives, localhost vs network) are real but orthogonal: one daemon could expose both surfaces. If you'd prefer a single canonical daemon, the substance of this PR (primitive endpoints + RemoteLLM client + ollama proxy backend) could land as an extension of the MCP daemon instead of a new serve subcommand. Happy to reshape if that's the call.
  2. /v1/ versioning, as introduced in #608 (Daemon-aware CLI fast-path: ~4× speedup for qmd query)? This PR uses unversioned routes (/embed, /rerank, etc.) matching @jaylfc's original. If you want /v1/embed etc. for consistency with #608's /v1/search, it's an easy rename.
  3. Authoring credit. Three commits are @jaylfc's verbatim with (cherry picked from commit …) trailers. If you'd rather take #511 (feat: remote model server (qmd serve) for shared inference across clients) directly instead of this rebased + stacked version, I'm fine to close this and re-open a smaller PR against #511's branch with just my two additions.

brettdavies and others added 6 commits May 31, 2026 01:55
Adds `lowVram` to `LlamaCppConfig` (also `--low-vram` CLI flag and
`QMD_LOW_VRAM=1` env). When enabled, the heavy generate (~2 GB) and
rerank (~2.3 GB) models are disposed immediately after each use,
while the tiny embed model (~320 MB) stays resident. Peak VRAM drops
from ~5.4 GB to ~2.6 GB at the cost of per-stage load latency
(~3 s → ~5.6 s on a typical GPU).

This makes qmd usable on GPUs where loading all three models at once
exhausts free VRAM — for example, sharing a 24 GB GPU with a ~20 GB
Ollama instance. Addresses the failure mode tracked in tobi#275 across
all entry points that construct an LlamaCpp instance: `qmd query`,
`qmd mcp`, and the upcoming `qmd serve`.

The pipeline stages (expand → embed → search → rerank) are inherently
sequential, so disposing between them only costs reload time, not
correctness.

Concurrency: when lowVram is on, expandQuery and rerank calls
serialize through per-method promise chains so a dispose can never
race with another caller's in-flight use of the same model. embed
and embedBatch remain parallel. The two chains are independent —
expand and rerank against their separate heavy models can run in
parallel against each other.
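
One way to picture the per-method promise chains is the sketch below. It is not the engine's code: `SerialLane`, `HeavyModel` and `withHeavyModel` are hypothetical names, and the load-use-dispose shape is inferred from the commit message.

```typescript
// A lane runs tasks strictly one after another by chaining onto a tail promise.
class SerialLane {
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(() => task());
    // Swallow failures on the tail so one rejected task does not block the lane.
    this.tail = result.catch(() => undefined);
    return result;
  }
}

interface HeavyModel {
  dispose(): Promise<void>;
}

// Load, use, then dispose, all inside the lane, so a dispose can never overlap
// another caller's in-flight use of the same model.
function withHeavyModel<M extends HeavyModel, T>(
  lane: SerialLane,
  load: () => Promise<M>,
  use: (model: M) => Promise<T>,
): Promise<T> {
  return lane.run(async () => {
    const model = await load();
    try {
      return await use(model);
    } finally {
      await model.dispose();
    }
  });
}

// Independent lanes: an expand call and a rerank call may overlap each other.
const expandLane = new SerialLane();
const rerankLane = new SerialLane();
```

Two calls submitted to the same lane at the same moment run as load, use, dispose, load, use, dispose. Calls on `expandLane` and `rerankLane` are not ordered relative to each other, matching the note that the two chains are independent.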

The flag is global (`--low-vram` works on any subcommand that
constructs an LlamaCpp), so qmd query, qmd mcp, and other one-shot
commands all benefit — not just long-lived daemons. Naming follows
the existing engine-knob pattern (QMD_FORCE_CPU, QMD_LLAMA_GPU).

Adds qmd serve: HTTP server for embedding, reranking, query expansion.
Supports local (node-llama-cpp) and rkllama (RK3588 NPU) backends.
RemoteLLM client auto-activates via QMD_SERVER env var.

Includes:
- Batch embedding (single HTTP call for all chunks)
- NPU timeout/retry tuning for ARM SBCs
- rkllama rerank via logit-based scoring
- Index endpoints: /search, /browse, /collections, /status
- Security: default bind 127.0.0.1, 50MB body limit, type validation
- Updated README and CHANGELOG

(cherry picked from commit 108559d)
(cherry picked from commit d7b340d)
Embeds the query via the configured backend (rkllama/local), then
runs sqlite-vec nearest-neighbour search against stored vectors.
Returns ranked results with cosine similarity scores.
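
For readers unfamiliar with the ranking step, here is an in-memory illustration of cosine-similarity nearest-neighbour search. It deliberately does not use sqlite-vec (the commit does); `StoredChunk`, `cosine` and `rankByCosine` are hypothetical names, and a brute-force scan is swapped in purely to show the scoring.

```typescript
interface StoredChunk {
  id: string;
  vector: number[];
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // zero vectors score 0 by convention here
}

// Score every stored chunk against the query embedding, best match first.
function rankByCosine(
  query: number[],
  chunks: StoredChunk[],
  limit: number,
): { id: string; score: number }[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosine(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

A vector index avoids scoring every stored vector on each query; the scan above is only meant to make the returned scores easy to interpret.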

Enables TinyAgentOS to offer semantic memory search over HTTP.

(cherry picked from commit cc225b1)
(cherry picked from commit a33cf0f)
…I naming

- Backend type: 'rkllama' → 'ollama' (Ollama-compatible, works with rkllama/ollama/etc)
- CLI: --backend-url replaces --rkllama-url (old flag kept as deprecated alias)
- Class: RKLlamaBackend → OllamaCompatBackend
- Default URL: localhost:11434 (standard Ollama port)
- All internal comments genericised
- --rkllama-url and RKLLAMA_URL env var still work for backwards compat

(cherry picked from commit b6c1019)
(cherry picked from commit 6c21a7b)
…warning

The `--low-vram` flag from the engine layer applies automatically to
`qmd serve` because it constructs an `LlamaCpp` instance via the local
backend. This commit adds three small UX touches so the feature is
discoverable in the serve context:

- Serve help text mentions `qmd serve --low-vram` alongside the other
  serve modes.
- Startup log shows "Backend: local low-vram (one heavy model at a
  time)" and tags each model line with (resident)/(on demand) so an
  operator can see at a glance which models will reload between
  pipeline stages.
- Warn (instead of silently ignoring) when `--low-vram` is combined
  with `--backend ollama` — ollama is a separate process whose model
  lifecycle qmd can't control, so the flag has no effect; operators
  should configure keep-alive on the upstream server instead.

The underlying mechanism lives in `LlamaCpp` and is shared with
`qmd query`, `qmd mcp`, and any other entry point that constructs a
local engine.
…L reaches RemoteLLM

The serve / RemoteLLM commits earlier in this PR wire `setDefaultLLM(new RemoteLLM(...))` at the CLI entrypoint, but the store layer hardcoded `getDefaultLlamaCpp()` at every embed / expandQuery / rerank call site. The polymorphic accessor was set correctly and then bypassed one layer deeper. `qmd query` continued to load the local `LlamaCpp` for all three pipeline stages, allocating ~5.4 GB of VRAM even when a healthy `qmd serve` daemon was reachable.

On a VRAM-constrained box (e.g. shared with a co-resident Ollama), this silently OOMs the rerank stage with `Failed to create any rerank context`; with more headroom, the bug is invisible but the user is paying for two LLM backends.

Changes:

- `src/store.ts:84` (`getLlm`), `:3562` (embed), `:3737` (expandQuery), `:3783` (rerank) now use `getDefaultLLM()` and accept a widened `LLM` override instead of `LlamaCpp`. `Store.llm` widens to `LLM` to match the documented "Can be LlamaCpp or RemoteLLM" comment.
- `src/cli/qmd.ts:getStore` skips `setDefaultLlamaCpp(new LlamaCpp(...))` when `QMD_REMOTE_URL` is set, so the CLI no longer eagerly allocates VRAM the user delegated to `qmd serve`. The `--remote-url` flag mirrors into `process.env.QMD_REMOTE_URL` so one source of truth gates the guard.
- `src/llm.ts`: `LLM` interface gains optional `embedModelName` / `generateModelName` / `rerankModelName` readonly accessors so existing consumers can keep their `?? DEFAULT_*` fallbacks without casting. Also drops the buggy CommonJS `require("./llm-remote.js")` lazy import in `getDefaultLLM` (broke under vitest ESM resolution); the static import does not introduce a cycle because `llm-remote.ts` only type-imports from `llm.ts`.
- Indexing paths that genuinely need the local engine (`generateEmbeddings`, embedding fingerprint adoption) go through a new `requireLlamaCpp(llm, op)` helper. It accepts a real LlamaCpp unconditionally, refuses with a clear message when the active LLM is a RemoteLLM, and casts through for duck-typed test fixtures so existing mocks keep working. Querying (embed, expand, rerank, vec) works end-to-end against `RemoteLLM`.

Test changes:

- New `test/store-remote-llm.test.ts` (6 tests): covers both `llmOverride` and `getDefaultLLM()` routing for `expandQuery` and `rerank`, plus the `Store` adapter routing through `Store.llm`. All pass under both Node/vitest and Bun.
- `LlamaCpp Integration` describe block in `test/store.test.ts` now opportunistically routes through `qmd serve` (probes `http://127.0.0.1:7832/health` in `beforeAll`, calls `setDefaultLLM(new RemoteLLM(...))` when reachable). This turns the integration suite into an end-to-end regression check for the remote path AND lets the rerank tests pass on VRAM-constrained dev boxes where the local rerank model can't load alongside a co-resident model. Falls back to local LlamaCpp when the daemon isn't running.
- `rerank deduplicates identical chunks across files` spy updated from `getDefaultLlamaCpp` to `getDefaultLLM` to match the new routing.
- Test environment hermetics: `vitest.config.ts` and `src/test-preload.ts` both unset `QMD_REMOTE_URL` so a developer's shell-set env var doesn't accidentally route 200+ unrelated tests through a real daemon (or fail them with `ECONNREFUSED` when it isn't running).
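
The hermetics bullet boils down to removing the variable before any test code reads it. A small sketch, assuming a helper that works on an env-like object; `scrubRemoteEnv` and `REMOTE_ENV_KEYS` are hypothetical names, not what `src/test-preload.ts` actually contains.

```typescript
// Variables that would redirect tests to a live daemon if left set.
const REMOTE_ENV_KEYS = ["QMD_REMOTE_URL"] as const;

// Remove the variables from an env-like object and report which ones were set.
function scrubRemoteEnv(env: Record<string, string | undefined>): string[] {
  const removed: string[] = [];
  for (const key of REMOTE_ENV_KEYS) {
    if (env[key] !== undefined) {
      delete env[key];
      removed.push(key);
    }
  }
  return removed;
}
```

A preload script would call this with `process.env` before importing anything that resolves the default LLM, so a developer's shell setting cannot leak into unrelated tests.
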
@brettdavies brettdavies force-pushed the feat/qmd-serve-low-vram branch from 02f65c7 to ec9811c Compare May 31, 2026 06:57