feat(serve): qmd serve — shared model server, continuing #511#663
Open
brettdavies wants to merge 6 commits into
Adds `lowVram` to `LlamaCppConfig` (also `--low-vram` CLI flag and `QMD_LOW_VRAM=1` env). When enabled, the heavy generate (~2 GB) and rerank (~2.3 GB) models are disposed immediately after each use, while the tiny embed model (~320 MB) stays resident. Peak VRAM drops from ~5.4 GB to ~2.6 GB at the cost of per-stage load latency (~3 s → ~5.6 s on a typical GPU).

This makes qmd usable on GPUs where loading all three models at once exhausts free VRAM, for example when sharing a 24 GB GPU with a ~20 GB Ollama instance. Addresses the failure mode tracked in tobi#275 across all entry points that construct a `LlamaCpp` instance: `qmd query`, `qmd mcp`, and the upcoming `qmd serve`. The pipeline stages (expand → embed → search → rerank) are inherently sequential, so disposing between them only costs reload time, not correctness.

Concurrency: when `lowVram` is on, `expandQuery` and `rerank` calls serialize through per-method promise chains so a dispose can never race with another caller's in-flight use of the same model. `embed` and `embedBatch` remain parallel. The two chains are independent: expand and rerank use separate heavy models, so they can run in parallel with each other.

The flag is global (`--low-vram` works on any subcommand that constructs a `LlamaCpp`), so `qmd query`, `qmd mcp`, and other one-shot commands all benefit, not just long-lived daemons. Naming follows the existing engine-knob pattern (`QMD_FORCE_CPU`, `QMD_LLAMA_GPU`).
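The per-method promise chains described in the concurrency note can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this commit: `SerialQueue` and `LowVramStage` are hypothetical names, and model load/dispose is simulated with counters rather than real `node-llama-cpp` calls.

```typescript
// Hypothetical helper: runs async tasks one at a time through a promise chain.
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    // Start this task only after the previously queued task has settled.
    const result = this.tail.then(() => task());
    // Swallow rejections on the chain itself so one failure does not block later callers.
    this.tail = result.catch(() => undefined);
    return result;
  }
}

// Stand-in for one heavy pipeline stage (expandQuery or rerank) in low-VRAM mode.
class LowVramStage {
  private readonly queue = new SerialQueue();
  private readonly name: string;
  liveModels = 0; // models currently loaded (simulated)
  peakLiveModels = 0; // highest value liveModels ever reached

  constructor(name: string) {
    this.name = name;
  }

  call(input: string): Promise<string> {
    return this.queue.run(async () => {
      this.liveModels++; // simulated model load
      this.peakLiveModels = Math.max(this.peakLiveModels, this.liveModels);
      try {
        await Promise.resolve(); // simulated inference
        await Promise.resolve();
        return `${this.name}(${input})`;
      } finally {
        this.liveModels--; // simulated dispose right after use
      }
    });
  }
}

// Each stage owns its own queue, so the two chains stay independent.
const expandStage = new LowVramStage("expand");
const rerankStage = new LowVramStage("rerank");
```

In this sketch, overlapping `rerankStage.call(...)` invocations run strictly one after another, while a concurrent `expandStage.call(...)` is not blocked by them because it waits on a different chain.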
Adds qmd serve: HTTP server for embedding, reranking, query expansion. Supports local (node-llama-cpp) and rkllama (RK3588 NPU) backends. RemoteLLM client auto-activates via QMD_SERVER env var.

Includes:
- Batch embedding (single HTTP call for all chunks)
- NPU timeout/retry tuning for ARM SBCs
- rkllama rerank via logit-based scoring
- Index endpoints: /search, /browse, /collections, /status
- Security: default bind 127.0.0.1, 50MB body limit, type validation
- Updated README and CHANGELOG

(cherry picked from commit 108559d)
(cherry picked from commit d7b340d)
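As a rough illustration of the "50MB body limit, type validation" item, a request validator for a batch-embedding endpoint might look like the sketch below. The request shape `{ texts: string[] }`, the function name `validateEmbedBatch`, and the status codes are assumptions made for this example; the commit message does not specify them.

```typescript
// 50 MB cap, matching the limit named in the commit message.
const MAX_BODY_BYTES = 50 * 1024 * 1024;

type ValidationResult =
  | { ok: true; texts: string[] }
  | { ok: false; status: number; error: string };

// Hypothetical validator: the { texts: string[] } body shape is an assumption.
function validateEmbedBatch(rawBody: string, maxBytes: number = MAX_BODY_BYTES): ValidationResult {
  // Measure the UTF-8 size, not the JS string length.
  const bytes = new TextEncoder().encode(rawBody).length;
  if (bytes > maxBytes) {
    return { ok: false, status: 413, error: "request body too large" };
  }

  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return { ok: false, status: 400, error: "body is not valid JSON" };
  }

  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, status: 400, error: "body must be a JSON object" };
  }

  const texts = (parsed as { texts?: unknown }).texts;
  if (!Array.isArray(texts) || !texts.every((t) => typeof t === "string")) {
    return { ok: false, status: 400, error: "texts must be an array of strings" };
  }

  return { ok: true, texts: texts as string[] };
}
```

Rejecting oversized or mistyped bodies before they reach a backend keeps a malformed client request from tying up the model.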
Embeds the query via the configured backend (rkllama/local), then runs sqlite-vec nearest-neighbour search against stored vectors. Returns ranked results with cosine similarity scores. Enables TinyAgentOS to offer semantic memory search over HTTP. (cherry picked from commit cc225b1) (cherry picked from commit a33cf0f)
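To make the ranking step concrete, here is a brute-force sketch of cosine-similarity nearest-neighbour search. It is only an illustration: the commit delegates the search to sqlite-vec, whose API is not shown here, and the names `cosineSimilarity`, `nearest`, and `StoredVector` are hypothetical.

```typescript
interface StoredVector {
  id: string;
  vector: number[];
}

interface RankedResult {
  id: string;
  score: number; // cosine similarity in [-1, 1]
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Linear scan over all stored vectors; a vector index would replace this loop.
function nearest(query: number[], stored: StoredVector[], k: number): RankedResult[] {
  return stored
    .map((s) => ({ id: s.id, score: cosineSimilarity(query, s.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The query vector would come from the configured embedding backend; the scan then returns the `k` stored vectors with the highest similarity, best match first.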
…I naming

- Backend type: 'rkllama' → 'ollama' (Ollama-compatible, works with rkllama/ollama/etc)
- CLI: --backend-url replaces --rkllama-url (old flag kept as deprecated alias)
- Class: RKLlamaBackend → OllamaCompatBackend
- Default URL: localhost:11434 (standard Ollama port)
- All internal comments genericised
- --rkllama-url and RKLLAMA_URL env var still work for backwards compat

(cherry picked from commit b6c1019)
(cherry picked from commit 6c21a7b)
…warning

The `--low-vram` flag from the engine layer applies automatically to `qmd serve` because it constructs an `LlamaCpp` instance via the local backend. This commit adds three small UX touches so the feature is discoverable in the serve context:

- Serve help text mentions `qmd serve --low-vram` alongside the other serve modes.
- Startup log shows "Backend: local low-vram (one heavy model at a time)" and tags each model line with (resident)/(on demand) so an operator can see at a glance which models will reload between pipeline stages.
- Warn (instead of silently ignoring) when `--low-vram` is combined with `--backend ollama`: ollama is a separate process whose model lifecycle qmd can't control, so the flag has no effect; operators should configure keep-alive on the upstream server instead.

The underlying mechanism lives in `LlamaCpp` and is shared with `qmd query`, `qmd mcp`, and any other entry point that constructs a local engine.
…L reaches RemoteLLM
The serve / RemoteLLM commits earlier in this PR wire `setDefaultLLM(new RemoteLLM(...))` at the CLI entrypoint, but the store layer hardcoded `getDefaultLlamaCpp()` at every embed / expandQuery / rerank call site. The polymorphic accessor was set correctly and then bypassed one layer deeper. `qmd query` continued to load the local `LlamaCpp` for all three pipeline stages, allocating ~5.4 GB of VRAM even when a healthy `qmd serve` daemon was reachable.
On a VRAM-constrained box (e.g. shared with a co-resident Ollama), this silently OOMs the rerank stage with `Failed to create any rerank context`; with more headroom, the bug is invisible but the user is paying for two LLM backends.
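The shape of the bug can be shown with a small sketch. The classes below are fakes written for this illustration (`FakeLocalLLM`, `FakeRemoteLLM`, and both `embed…` functions are hypothetical); only the accessor names `setDefaultLLM` and `getDefaultLLM` come from the PR, and their real implementations may differ.

```typescript
// Minimal stand-in for the shared LLM interface (the real one has more methods).
interface LLM {
  embed(text: string): Promise<number[]>;
}

class FakeLocalLLM implements LLM {
  static constructed = 0; // counts how often the "expensive" local engine is built
  constructor() {
    FakeLocalLLM.constructed++;
  }
  async embed(_text: string): Promise<number[]> {
    return [0];
  }
}

class FakeRemoteLLM implements LLM {
  async embed(_text: string): Promise<number[]> {
    return [1]; // a real client would POST to the server here
  }
}

let defaultLLM: LLM | null = null;

function setDefaultLLM(llm: LLM | null): void {
  defaultLLM = llm;
}

function getDefaultLLM(): LLM {
  // Fall back to the local engine only when nothing was configured.
  if (defaultLLM === null) defaultLLM = new FakeLocalLLM();
  return defaultLLM;
}

// Before: the call site ignores the accessor and always builds a local engine.
function embedBypassingAccessor(text: string): Promise<number[]> {
  return new FakeLocalLLM().embed(text);
}

// After: the call site honours an explicit override, then the configured default.
function embedViaAccessor(text: string, llmOverride?: LLM): Promise<number[]> {
  return (llmOverride ?? getDefaultLLM()).embed(text);
}
```

With a remote default installed, only `embedViaAccessor` avoids constructing the local engine, which is the behaviour the store-layer change is after.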
Changes:
- `src/store.ts:84` (`getLlm`), `:3562` (embed), `:3737` (expandQuery), `:3783` (rerank) now use `getDefaultLLM()` and accept a widened `LLM` override instead of `LlamaCpp`. `Store.llm` widens to `LLM` to match the documented "Can be LlamaCpp or RemoteLLM" comment.
- `src/cli/qmd.ts:getStore` skips `setDefaultLlamaCpp(new LlamaCpp(...))` when `QMD_REMOTE_URL` is set, so the CLI no longer eagerly allocates VRAM the user delegated to `qmd serve`. The `--remote-url` flag mirrors into `process.env.QMD_REMOTE_URL` so one source of truth gates the guard.
- `src/llm.ts`: `LLM` interface gains optional `embedModelName` / `generateModelName` / `rerankModelName` readonly accessors so existing consumers can keep their `?? DEFAULT_*` fallbacks without casting. Also drops the buggy CommonJS `require("./llm-remote.js")` lazy import in `getDefaultLLM` (broke under vitest ESM resolution); the static import does not introduce a cycle because `llm-remote.ts` only type-imports from `llm.ts`.
- Indexing paths that genuinely need the local engine (`generateEmbeddings`, embedding fingerprint adoption) go through a new `requireLlamaCpp(llm, op)` helper. It accepts a real LlamaCpp unconditionally, refuses with a clear message when the active LLM is a RemoteLLM, and casts through for duck-typed test fixtures so existing mocks keep working. Querying (embed, expand, rerank, vec) works end-to-end against `RemoteLLM`.
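A narrowing helper with the three behaviours described for `requireLlamaCpp(llm, op)` could look roughly like this. The two classes are stand-ins defined only for the example, and the error wording is invented; the real helper in `src/store.ts` may be structured differently.

```typescript
// Stand-ins for the real classes; only their identity matters for this sketch.
class LlamaCppStandIn {
  readonly kind = "local";
}

class RemoteLLMStandIn {
  readonly kind = "remote";
}

// Sketch of a narrowing helper: returns a local engine or throws.
function requireLlamaCpp(llm: unknown, op: string): LlamaCppStandIn {
  // 1. A real local engine is accepted unconditionally.
  if (llm instanceof LlamaCppStandIn) return llm;

  // 2. A remote client is refused with a message naming the operation.
  if (llm instanceof RemoteLLMStandIn) {
    throw new Error(`${op} needs the local engine, but the active LLM is a RemoteLLM`);
  }

  // 3. Anything else (for example a duck-typed test mock) is cast through unchanged.
  return llm as LlamaCppStandIn;
}
```

Checking for the remote class explicitly, rather than rejecting everything that is not a local engine, is what lets existing duck-typed mocks keep working.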
Test changes:
- New `test/store-remote-llm.test.ts` (6 tests): covers both `llmOverride` and `getDefaultLLM()` routing for `expandQuery` and `rerank`, plus the `Store` adapter routing through `Store.llm`. All pass under both Node/vitest and Bun.
- `LlamaCpp Integration` describe block in `test/store.test.ts` now opportunistically routes through `qmd serve` (probes `http://127.0.0.1:7832/health` in `beforeAll`, calls `setDefaultLLM(new RemoteLLM(...))` when reachable). This turns the integration suite into an end-to-end regression check for the remote path AND lets the rerank tests pass on VRAM-constrained dev boxes where the local rerank model can't load alongside a co-resident model. Falls back to local LlamaCpp when the daemon isn't running.
- `rerank deduplicates identical chunks across files` spy updated from `getDefaultLlamaCpp` to `getDefaultLLM` to match the new routing.
- Test environment hermetics: `vitest.config.ts` and `src/test-preload.ts` both unset `QMD_REMOTE_URL` so a developer's shell-set env var doesn't accidentally route 200+ unrelated tests through a real daemon (or fail them with `ECONNREFUSED` when it isn't running).
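The opportunistic probe in `beforeAll` can be approximated with a helper like the one below. `isDaemonReachable`, the `FetchLike` type, and the 500 ms timeout are assumptions for this sketch; the fetch function is injected so the helper can be exercised without a running daemon.

```typescript
// Narrow view of fetch: only what the probe needs.
type FetchLike = (
  url: string,
  init?: { signal?: AbortSignal },
) => Promise<{ ok: boolean }>;

// Returns true only if GET <baseUrl>/health answers with a 2xx within the timeout.
async function isDaemonReachable(
  baseUrl: string,
  fetchImpl: FetchLike,
  timeoutMs: number = 500,
): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(`${baseUrl}/health`, { signal: controller.signal });
    return res.ok;
  } catch {
    // Connection refused, DNS failure, or abort: treat all as "not reachable".
    return false;
  } finally {
    clearTimeout(timer);
  }
}
```

A test suite could call this once during setup and install the remote client only when it resolves to `true`, falling back to the local engine otherwise.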
qmd serve: shared model server, continuing #511

Why

Multiple QMD clients each cold-load the embed, generate, and rerank models into their own process. On a 16 GB device running three agents, that's 3× the model memory; on ARM64 / headless servers, each client has to compile `node-llama-cpp` against whatever GPU drivers are installed locally.

`qmd serve` centralises the three models behind an HTTP server. Clients (CLIs in LXC containers, agents on separate hosts) talk to a single warm instance via the `RemoteLLM` class introduced in this PR, activated by `QMD_REMOTE_URL=http://host:7832`. The remote client implements the same `LLM` interface as `LlamaCpp`, so it slots in transparently anywhere the SDK accepts an LLM.

Use cases the original PR called out:

- …`node-llama-cpp` inside each container.

I've been running this fork on an RTX 3090 Ti alongside an Ollama Gemma 4 26B; the low-vram combination (this PR combined with the engine flag below) made the three-model fit possible.
What this PR does
- `qmd serve`: long-running HTTP server. Two backends:
  - `local` (default): loads GGUF models via `node-llama-cpp`.
  - `ollama`: proxies to any Ollama-compatible REST endpoint (Ollama itself, rkllama on RK3588 NPU, etc).
- Endpoints: `/embed`, `/embed-batch`, `/rerank`, `/expand`, `/tokenize`, `/vsearch`, `/health`, `/status`, `/collections`, `/search`, `/browse`.
- `RemoteLLM` (`src/llm-remote.ts`): `LLM`-interface implementation that POSTs to the server. Auto-activates via `QMD_REMOTE_URL` (or `--remote-url <url>` per-invocation).
- `--low-vram` passthrough on serve: picks up the engine flag from feat(llm): add `--low-vram` to share a GPU with another large model #662. `qmd serve --low-vram` loads one heavy model at a time so the serve process can coexist with a large local Ollama on the same GPU. Warns rather than silently ignoring when combined with `--backend ollama` (ollama owns its own model lifecycle).

[Endpoint table: `/embed`, `/embed-batch`; `/rerank`; `/expand`; `/tokenize`; `/health`; `/status`; `/collections`; `/search?q=…`; `/vsearch`; `/browse`]

Default bind is `127.0.0.1`; `--bind 0.0.0.0` exposes the server to the network. Body size is capped at 50 MB. Strict input validation on every endpoint.

Relationship to other open PRs
This PR is stacked on the engine `--low-vram` PR (filed alongside as #662). That PR adds `lowVram` to `LlamaCppConfig` and surfaces it as a global flag/env. The serve passthrough in this PR is then trivial: `qmd serve --low-vram` just sets `QMD_LOW_VRAM=1`, the LocalBackend's `LlamaCpp` constructor reads it, and the heavy-model lifecycle is handled in the engine. No serve-specific concurrency code.

This PR also overlaps in problem space with #608 ("Daemon-aware CLI fast-path"). Both keep models warm across requests. The honest comparison:

[Comparison table: `qmd mcp --http --daemon` (+ #608) vs `qmd serve` (this PR). Surviving cells: `query`, `get`, `multi_get`; `RemoteLLM` qmd clients (multi-host, LXC); `0.0.0.0`]

There's a real question (open below) about whether these should remain two separate daemons or whether the model-primitive endpoints + `RemoteLLM` should land as an extension of the MCP daemon instead. The substance of @jaylfc's work (the primitive endpoints, the `RemoteLLM` client, the ollama proxy backend) survives either shape; only the top-level `serve` subcommand vs `mcp --http --daemon` choice changes.

Architecture
The top commit (`c9d6c27`) is #662, which this stacks on. The middle three are @jaylfc's `feat/remote-llm-provider-clean` cherry-picked verbatim onto current main (preserving his `Author:` field + recording the source SHA in each commit message). One conflict resolution was needed beyond mechanical re-application: `b5f156c "Improve qmd diagnostics and embed resilience"` landed on main while this branch was being rebased, refactoring the embedding error-handling into `tryEmbedChunk` / `recordFailure` / `retryFailedChunks`. That supersedes @jaylfc's manual 3-attempt retry loop, so the rebase takes main's structure on those hunks: less code, and it integrates with the new failure tracker.

My next commit (`d4590fa`) is 17 lines: serve help text, startup log labels (resident / on-demand), and a warning when `--low-vram` combines with `--backend ollama`.

Follow-up commit (fbfcdcd): completing the store-layer wiring
Caught while running `qmd query` against the daemon on a VRAM-constrained box: the CLI sets `setDefaultLLM(new RemoteLLM(...))` correctly, but the store layer hardcoded `getDefaultLlamaCpp()` at every embed / expandQuery / rerank call site (`src/store.ts:84/3562/3737/3783`). The polymorphic accessor was set and then bypassed one layer deeper. On a box with enough headroom the bypass is invisible: clients silently pay for two LLM backends. With Ollama co-resident (~21 GB on a 24 GB GPU), the rerank stage fails with `Failed to create any rerank context`.

The follow-up commit:

- Routes those call sites through `getDefaultLLM()` and widens their `llmOverride?: LlamaCpp` params to `LLM`. `Store.llm` widens to `LLM` to match the documented "Can be LlamaCpp or RemoteLLM" comment.
- Changes `cli/qmd.ts:getStore` so it skips the eager `setDefaultLlamaCpp(new LlamaCpp(...))` when `QMD_REMOTE_URL` is set. The `--remote-url` flag mirrors into the env so one source of truth gates the guard.
- Adds optional `embedModelName` / `generateModelName` / `rerankModelName` readonly accessors to the `LLM` interface so existing consumers keep their `?? DEFAULT_*` fallbacks without casting.
- Replaces the CommonJS `require("./llm-remote.js")` in `getDefaultLLM` (which fails under vitest ESM resolution) with a static import. No runtime cycle: `llm-remote.ts` only type-imports from `llm.ts`.
- Adds a `requireLlamaCpp(llm, op)` narrow helper for indexing paths (`generateEmbeddings`, embedding fingerprint adoption) that genuinely need the local engine. It accepts a real LlamaCpp unconditionally, refuses with a clear message when the active LLM is `RemoteLLM`, and casts through for duck-typed test fixtures so existing mocks keep working.

Test additions:

- `test/store-remote-llm.test.ts` (6 tests): covers both `llmOverride` and `getDefaultLLM()` routing for `expandQuery` and `rerank`, plus the `Store` adapter routing through `Store.llm`. All pass under both Node/vitest and Bun.
- `LlamaCpp Integration` describe block in `test/store.test.ts` now opportunistically routes through `qmd serve`: probes `http://127.0.0.1:7832/health` in `beforeAll`, calls `setDefaultLLM(new RemoteLLM(...))` when reachable, falls back to local LlamaCpp when not. The 3 rerank tests in that block which previously failed under VRAM pressure now pass via the daemon, turning the integration suite into an end-to-end regression check for the remote path.
- `vitest.config.ts` and `src/test-preload.ts` both unset `QMD_REMOTE_URL` so a developer's shell-set env var doesn't accidentally route 200+ unrelated tests through a real daemon (or fail them with `ECONNREFUSED` when one isn't running).

Tests
Engine-level concurrency tests for the low-vram mode live in #662 (`test/llm-low-vram.test.ts`, 6 tests, all pass against a fake llama backend so they run without GPU). The store-layer wiring fix in `fbfcdcd` adds 6 new tests in `test/store-remote-llm.test.ts` and converts the existing `LlamaCpp Integration` describe block in `test/store.test.ts` into an opportunistic remote regression check (see the follow-up section above).

Full suite on this branch with `qmd serve` running locally: 880/903 pass. The 23 remaining failures: 14 LLM-class tests in `test/llm.test.ts` and `test/mcp.test.ts` that exercise the `LlamaCpp` class directly (not through the store) and fail under VRAM pressure (`Failed to create any rerank context`), and 9 CLI tests that fail on Node 26 due to `DEP0205` deprecation warnings leaking to stderr. Both families reproduce against bare `upstream/main` in the same environment. `tsc --noEmit` clean.

Backwards compatibility
- …`serve` subcommand and `RemoteLLM` class. Purely additive.
- `QMD_REMOTE_URL` and `--remote-url` are new (renamed from `QMD_SERVER` / `--server` in @jaylfc's earlier draft per a naming discussion; see the rename commit for the rationale). Nothing in `main` reads them.
- …`qmd query` path is unchanged: with no `QMD_REMOTE_URL` set, behaviour is identical to today.

Things I deliberately didn't do
- …`127.0.0.1`; if you want network exposure (`--bind 0.0.0.0`) it's behind a firewall / Tailscale / etc. Adding token auth or mTLS is a separate concern.
- …`--backend rkllama` with anything more specific. @jaylfc's rename to `ollama-compat` is the right generic name; specific NPU implementations like rkllama remain compatible via that interface.

Open questions for you
- …`qmd mcp --http --daemon` be one process or two? They're both long-running HTTP servers that hold models warm. The differences (MCP protocol vs REST, agent tools vs model primitives, localhost vs network) are real but orthogonal: one daemon could expose both surfaces. If you'd prefer a single canonical daemon, the substance of this PR (primitive endpoints + `RemoteLLM` client + ollama proxy backend) could land as an extension of the MCP daemon instead of a new `serve` subcommand. Happy to reshape if that's the call.
- …`/v1/` versioning, like Daemon-aware CLI fast-path: ~4× speedup for qmd query #608 introduced? This PR uses unversioned routes (`/embed`, `/rerank`, etc.) matching @jaylfc's original. If you want `/v1/embed` etc. for consistency with Daemon-aware CLI fast-path: ~4× speedup for qmd query #608's `/v1/search`, easy rename.
- …`(cherry picked from commit …)` trailers. If you'd rather take feat: remote model server (qmd serve) for shared inference across clients #511 directly instead of this rebased + stacked version, I'm fine to close this and re-open a smaller PR against feat: remote model server (qmd serve) for shared inference across clients #511's branch with just my two additions.