A design rationale for the enhancements layered on top of upstream
tobi/qmd. This document is intentionally descriptive, not prescriptive — it explains what changed, why it changed, and what was deliberately left alone, so future contributors (and future-me) can reason about the codebase without guessing at intent.
Author of enhancements: Nguyen Ngoc Tuan, Founder, Transform Group (Lark Platinum Partner)
Original author: Tobi Lutke
Upstream QMD is a sharply-focused on-device hybrid search engine: BM25 + vector + LLM reranking, everything running locally via node-llama-cpp against small GGUF models. It is excellent for a personal Obsidian vault on a developer laptop with a GPU.
It starts to strain in three directions:
- CPU-only deployment: embedding a large corpus with embeddinggemma-300M on a CPU takes hours. For servers, CI runners, and low-end laptops this is a non-starter.
- Non-English content: embeddinggemma-300M is heavily English-biased. Vietnamese, CJK, and mixed-language corpora get poor semantic recall.
- Opaque cost: once any cloud API is involved, users need to see their consumption. Upstream had no concept of usage because there was no usage to track.
The changes in this fork address all three without compromising upstream's core promise of local-first operation. Every new capability is opt-in; the default execution path is byte-for-byte compatible with upstream.
Headline additions:
- Polymorphic provider backend: embed/rerank can be routed to Jina AI independently, via env var or per-index YAML.
- Full observability layer: usage tracking, quota warnings, ASCII histograms, JSON/CSV exports for scripting.
- Reproducible benchmarking: qmd bench jina measures local vs remote latency and throughput with multi-run statistics.
- Secrets hygiene: .env auto-load, gitignored by default, plus a pre-commit secret scanner that catches keys for eight providers.
Delivered with zero new runtime dependencies, zero regressions, and 68 new tests (467 in total, up from 399 upstream).
The enhancement strategy is captured by a single rule:
Additive, never subtractive. Every new capability must be opt-in. Upstream users who set no new env vars and touch no new YAML fields must experience an identical tool.
Concretely this means:
- No existing CLI command changed its default behaviour.
- No existing function signature changed.
- No existing env var acquired a new meaning.
- No new runtime dependency was added to package.json.
- The local-first execution path has the same I/O footprint as before.
This rule was the single hardest constraint of the work. Many "obvious" refactors were rejected because they would have forced a migration on upstream users. The cost is some code duplication around branching logic (local-path vs remote-path); the payoff is painless merges if upstream chooses to pull any of this work back.
Each section below describes one shift: what the code does now, what the code used to do, and — most importantly — why the change exists.
Before. LlamaCpp was a direct wrapper around node-llama-cpp. Calling
embed() always hit a local GGUF model; calling rerank() always loaded
qwen3-reranker into VRAM. There was no indirection, no strategy interface,
no way to swap models without forking.
After. LlamaCpp still exists and is still the sole entry point for
all callers — the SDK, the CLI, the MCP server. But its constructor now
resolves two optional pluggable components:
LlamaCpp config resolution (for embed, same for rerank):
1. explicit config.remoteEmbedder instance (SDK path)
2. config.embedModel starts with "jina:" (YAML path)
3. env QMD_EMBED_PROVIDER=jina (env path)
4. fall through to local node-llama-cpp (default)
When a remote provider resolves, embed() / embedBatch() delegates to it
and the local embedding model is never loaded. When nothing resolves,
behaviour is identical to upstream — same code path, same performance, same
VRAM usage.
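The four-step resolution order can be sketched as a small pure function. This is an illustration under stated assumptions, not the code in src/llm.ts: the names `resolveEmbedRoute`, `EmbedConfig` and `EmbedRoute` are hypothetical, and only `remoteEmbedder`, `embedModel`, the `jina:` prefix and `QMD_EMBED_PROVIDER` come from the description above.

```typescript
// Hypothetical shapes for illustration; the fork's real types may differ.
interface RemoteEmbedder {
  embed(text: string): Promise<number[]>;
}

interface EmbedConfig {
  remoteEmbedder?: RemoteEmbedder; // SDK path
  embedModel?: string;             // YAML path, e.g. "jina:<model>"
}

type EmbedRoute =
  | { kind: "remote"; source: "sdk" | "yaml" | "env" }
  | { kind: "local" };

// env is passed in rather than read from process.env so the function
// stays pure and easy to test.
function resolveEmbedRoute(
  config: EmbedConfig,
  env: Record<string, string | undefined>,
): EmbedRoute {
  // 1. An explicit instance handed over by the SDK wins.
  if (config.remoteEmbedder) return { kind: "remote", source: "sdk" };
  // 2. A "jina:"-prefixed model name from the per-index YAML.
  if (config.embedModel !== undefined && config.embedModel.startsWith("jina:")) {
    return { kind: "remote", source: "yaml" };
  }
  // 3. The environment variable (exact match is an assumption).
  if (env.QMD_EMBED_PROVIDER === "jina") return { kind: "remote", source: "env" };
  // 4. Nothing resolved: stay on the local node-llama-cpp path.
  return { kind: "local" };
}
```

Because the local branch is the final fall-through, a caller who sets none of the three inputs never touches the remote code at all, which is what keeps the default path identical to upstream.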
Why. Upstream's single-path design is a feature for most users, but it forecloses three legitimate deployments:
- Running QMD on a CPU-only server where embeddinggemma batch throughput is measured in single-digit docs/second.
- Indexing a corpus that's primarily Vietnamese, where embeddinggemma-300M falls short on semantic recall.
- Deploying QMD inside a constrained container (serverless, minimal images) where bundling a GGUF model is wasteful.
The polymorphic backend lets users hit any of these targets without giving up the local-first default. Generation (query expansion) is deliberately not pluggable — it runs at query time and is the most latency-sensitive step, so the local ~1.7B parameter model stays put.
Key files.
- src/embedders/jina.ts: JinaEmbedder, JinaReranker, and env factories.
- src/llm.ts: the resolution logic lives in LlamaCpp's constructor.
Before. qmd embed printed a progress bar and exited. If you had been
running with a cloud provider it would still have told you nothing about
cost, consumption, or trajectory. Upstream had no place for this because
upstream had no cloud provider.
After. Every successful Jina API call flows through a UsageReporter
callback that appends a row to a new jina_usage SQLite table:
CREATE TABLE jina_usage (
id INTEGER PRIMARY KEY AUTOINCREMENT,
operation TEXT NOT NULL, -- 'embed_query' | 'embed_passage' | 'rerank'
model TEXT NOT NULL,
total_tokens INTEGER NOT NULL,
prompt_tokens INTEGER,
at TEXT NOT NULL -- ISO 8601
);
CREATE INDEX idx_jina_usage_at ON jina_usage(at);
The table is an append-only event log, not a counter. This distinction matters: counters lose history and cannot answer questions like "how much did I spend on re-indexing last Friday?". Event logs answer every windowed question you might later ask (24h, 7d, 30d, custom, per-operation, per-model) without a schema migration.
On top of the event log sits a single helper, getUsageSnapshot(), that
computes rolling-window totals plus optional quota state. All display
paths consume this one helper:
- qmd usage (text)
- qmd usage --json
- qmd usage --csv
- qmd usage chart (ASCII histogram)
- qmd status (compact summary)
This guarantees consistency: the number shown in the histogram footer is always the same number in the JSON payload. No display format can drift away from the source of truth.
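To make the event-log argument concrete, here is a minimal in-memory sketch of how rolling-window totals and a per-operation breakdown could be derived from rows shaped like jina_usage. It is not the fork's getUsageSnapshot(): the function names `windowTotals` and `tokensByOperation` are hypothetical, and the real helper presumably aggregates in SQLite rather than in JavaScript.

```typescript
// Row shape mirrors the jina_usage columns described above.
interface UsageRow {
  operation: string;
  model: string;
  total_tokens: number;
  at: string; // ISO 8601 timestamp
}

interface WindowTotals {
  last24h: number;
  last7d: number;
  last30d: number;
  allTime: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Every window is derived from the same rows, so adding a new window
// later needs no schema change.
function windowTotals(rows: UsageRow[], now: Date): WindowTotals {
  const totals: WindowTotals = { last24h: 0, last7d: 0, last30d: 0, allTime: 0 };
  for (const row of rows) {
    const ageMs = now.getTime() - Date.parse(row.at);
    totals.allTime += row.total_tokens;
    if (ageMs <= 30 * DAY_MS) totals.last30d += row.total_tokens;
    if (ageMs <= 7 * DAY_MS) totals.last7d += row.total_tokens;
    if (ageMs <= DAY_MS) totals.last24h += row.total_tokens;
  }
  return totals;
}

// Per-operation totals, sorted by tokens in descending order.
function tokensByOperation(rows: UsageRow[]): Array<[string, number]> {
  const sums = new Map<string, number>();
  for (const row of rows) {
    sums.set(row.operation, (sums.get(row.operation) ?? 0) + row.total_tokens);
  }
  return Array.from(sums.entries()).sort((a, b) => b[1] - a[1]);
}
```

A counter column could only ever report the equivalent of `allTime`; the other three fields exist only because each row keeps its own timestamp.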
Why. The moment you introduce a metered cloud API, not knowing your consumption becomes a blocker to actually using it. Users will not deploy a tool in production if they cannot answer "how much have I used this month?". The observability layer is not a nice-to-have — it is the minimum viable trust surface for cloud usage.
Key files.
- src/store.ts: recordJinaUsage, getUsageSnapshot, getDailyUsage, JinaUsageSummary type.
- src/cli/qmd.ts: showUsage, showUsageChart, renderUsageJson, renderUsageCsv, renderUsageChart.
Before. QMD made no performance claims because it had only one backend. There was nothing to compare against.
After. qmd bench jina runs a reproducible latency + throughput
benchmark with deterministic synthetic workloads:
- Three measured stages: embed_single, embed_batch, rerank.
- Built-in vocabulary mix (English prose, code snippets, Vietnamese text) so results are reproducible and multilingual tokenisation is exercised.
- Each run warms up the backend once, then collects samples for each stage.
- With --runs N (up to 100), every stage's samples are flattened across runs and summarised with median, mean, standard deviation, p95, min, and max.
- Stages with stddev > 20% of median are highlighted yellow ("this measurement is noisy — re-run with more samples").
- When both backends run, a comparison table shows per-stage speedup ratios and picks a winner.
The entire report is emitted as a stable qmd.bench.jina.v1 JSON document
via --json, suitable for CI regression gates.
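The per-stage statistics described above can be sketched as follows. This is an assumption-laden illustration, not the harness in src/bench/bench-jina.ts: `flattenRuns`, `summarize` and `StageSummary` are hypothetical names, and the choice of sample standard deviation (n - 1) and nearest-rank p95 is mine, since the text does not say which definitions the fork uses.

```typescript
interface StageSummary {
  median: number;
  mean: number;
  stddev: number;
  p95: number;
  min: number;
  max: number;
  noisy: boolean; // stddev > 20% of median
}

// Flatten the per-run sample arrays for one stage into a single array.
function flattenRuns(runs: number[][]): number[] {
  return runs.reduce((all, run) => all.concat(run), [] as number[]);
}

function summarize(samplesMs: number[]): StageSummary {
  if (samplesMs.length === 0) throw new Error("no samples to summarise");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const n = sorted.length;
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const mean = sorted.reduce((sum, x) => sum + x, 0) / n;
  // Sample standard deviation (divides by n - 1); assumed, see lead-in.
  const variance =
    n > 1 ? sorted.reduce((sum, x) => sum + (x - mean) * (x - mean), 0) / (n - 1) : 0;
  const stddev = Math.sqrt(variance);
  // Nearest-rank percentile: the smallest sample with at least 95% at or below it.
  const p95 = sorted[Math.min(n - 1, Math.ceil(0.95 * n) - 1)];
  return {
    median,
    mean,
    stddev,
    p95,
    min: sorted[0],
    max: sorted[n - 1],
    noisy: stddev > 0.2 * median,
  };
}
```

Summarising the flattened samples, rather than averaging per-run medians, keeps a single slow outlier visible in p95 and max instead of smoothing it away.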
Why. The answer to "is Jina actually faster for me?" depends on local
hardware (CPU / GPU / RAM), geography (network RTT to Jina's EU/US
datacenters), workload (batch indexing vs query-time), and content mix
(code vs prose vs CJK). Any universal claim I make in documentation is
wrong for some subset of users. qmd bench jina lets users stop trusting
documentation and start trusting their own numbers.
The --runs flag exists because single measurements over a network are a
lie: RTT variance can be 2× the median under load. Without multi-sample
statistics there is no honest comparison.
Key files.
- src/bench/bench-jina.ts: full benchmark harness.
- src/cli/qmd.ts: renderBenchJinaTable for the human-readable output.
Before. Upstream had a few graceful-degradation paths: if a YAML config
failed to parse, getStore() swallowed the error and used defaults. This
is reasonable when stakes are low.
After. The same kind of error is now fatal if it involves a remote provider. Specifically:
- If ~/.config/qmd/<index>.yml specifies models.embed: "jina:..." but no JINA_API_KEY is present in the environment, QMD exits immediately with a formatted error message telling the user exactly which env var to set. It does not silently fall back to the local embeddinggemma model.
- If QMD_EMBED_PROVIDER=jina is set but the API key is missing, qmd status displays a red error line naming the problem.
- Init errors during LlamaCpp construction are caught, re-thrown with a user-friendly wrapper, and surfaced in qmd status under the relevant provider row.
Why. Silent fallback is a UX bug when stakes are high. If a user configures Jina, forgets the API key, and silently drops to local embeddinggemma, they will produce 384-dimensional vectors that are incompatible with the 1024-dimensional vectors they expected — and they may not notice until search quality degrades in production. Explicit errors cost one extra line of code and save a class of incidents.
The principle is "fail loud when intent is clear, recover gracefully when intent is ambiguous." Upstream's silent fallback is correct for the local-only path; explicit failure is correct for the remote-provider path.
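A minimal sketch of that principle, assuming a hypothetical helper named `requireJinaKey` (the fork's actual checks live in getStore() and the buildJina...FromUri functions, whose signatures are not shown here). The error wording is illustrative as well.

```typescript
// Returns the API key when a remote provider was requested, null when the
// user's intent is local-only, and throws when intent is clear but the key
// is missing. env is injected instead of reading process.env directly.
function requireJinaKey(
  embedModel: string | undefined,
  env: Record<string, string | undefined>,
): string | null {
  const wantsJina =
    (embedModel !== undefined && embedModel.startsWith("jina:")) ||
    env.QMD_EMBED_PROVIDER === "jina";

  // No remote provider requested: nothing to validate, stay local.
  if (!wantsJina) return null;

  const key = env.JINA_API_KEY;
  if (key === undefined || key.trim() === "") {
    // Fail loud: never fall back to the local model here, because the
    // resulting vectors would not match what the user configured.
    throw new Error(
      "Jina embedding is configured but JINA_API_KEY is not set. " +
        "Set JINA_API_KEY in your environment or .env file.",
    );
  }
  return key;
}
```

The function has exactly one silent return (`null` for the local path) and one loud exit, which mirrors the split between ambiguous and clear intent.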
Key files.
- src/cli/qmd.ts: getStore() error surfacing for YAML-driven Jina config.
- src/llm.ts: buildJinaEmbedderFromUri / buildJinaRerankerFromUri throw with helpful messages.
Before. The canonical way to configure QMD was to export variables
in your shell. If you wanted persistence across sessions you either put
the exports in ~/.bashrc (global scope pollution) or wrote a wrapper
shell script (per-project brittleness).
After. QMD auto-loads a .env file at CLI startup, searching the
current working directory and walking up to five parent directories.
Shell env vars always win; the .env file provides defaults. A
QMD_ENV_FILE=/path/to/other.env override supports multi-environment
workflows (.env.production, .env.staging, etc.).
The loader is a custom ~180-line zero-dependency implementation in
src/dotenv.ts, supporting:
- Simple KEY=value assignment
- Double-quoted values with escape sequences (\n, \t, \", \\)
- Single-quoted raw values (no escape interpretation)
- Optional export prefix (shell-compatible)
- # comments, including trailing inline comments on unquoted values
- UTF-8 BOM at file start
- Upward directory walk (capped at 5 levels to avoid surprising behaviour)
Shipped with 18 unit tests covering every parser corner case plus the override and walk semantics.
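The parsing rules listed above fit in a short function. The following is a simplified sketch, not the contents of src/dotenv.ts: `parseDotenv`, `parseValue` and `applyDefaults` are hypothetical names, the directory walk and QMD_ENV_FILE handling are left out, and edge cases such as multi-line values are ignored.

```typescript
function parseValue(rest: string): string {
  if (rest.startsWith('"')) {
    // Double-quoted: interpret \n and \t; any other escaped character
    // (including \" and \\) is kept literally without the backslash.
    let out = "";
    for (let i = 1; i < rest.length; i++) {
      const ch = rest[i];
      if (ch === "\\" && i + 1 < rest.length) {
        const next = rest[++i];
        out += next === "n" ? "\n" : next === "t" ? "\t" : next;
        continue;
      }
      if (ch === '"') return out; // closing quote ends the value
      out += ch;
    }
    return out; // unterminated quote: keep what was read
  }
  if (rest.startsWith("'")) {
    // Single-quoted: raw text, no escape interpretation.
    const end = rest.indexOf("'", 1);
    return end === -1 ? rest.slice(1) : rest.slice(1, end);
  }
  // Unquoted: drop a trailing inline comment (whitespace followed by '#').
  const hash = rest.search(/\s#/);
  return (hash === -1 ? rest : rest.slice(0, hash)).trim();
}

function parseDotenv(source: string): Record<string, string> {
  const result: Record<string, string> = {};
  // Strip a UTF-8 BOM if the file starts with one.
  const text = source.charCodeAt(0) === 0xfeff ? source.slice(1) : source;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const match = /^(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*)$/.exec(line);
    if (!match) continue; // lines this sketch does not understand are skipped
    result[match[1]] = parseValue(match[2]);
  }
  return result;
}

// Shell variables always win: only keys that are still undefined are filled.
function applyDefaults(
  env: Record<string, string | undefined>,
  parsed: Record<string, string>,
): void {
  for (const key of Object.keys(parsed)) {
    if (env[key] === undefined) env[key] = parsed[key];
  }
}
```

Keeping parsing (`parseDotenv`) separate from application (`applyDefaults`) is what makes the "shell wins, file provides defaults" rule a one-line check rather than something threaded through the parser.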
Why. Two converging reasons:
- Dependency minimalism. Adding the dotenv npm package for ~30 lines of parser code would have violated the "zero new runtime dependencies" constraint. Custom implementation was the proportional response, and it keeps parser behaviour (BOM stripping, export prefix) under our control without waiting on upstream releases.
- API key safety. Without a persistent non-shell storage location for secrets, users inevitably hardcode keys into shell scripts, git history, or config files. .env is the industry-standard safe place because every sensible .gitignore already excludes it.
Key files.
- src/dotenv.ts: the parser + loader.
- src/cli/qmd.ts: loadDotenv() call at the top, before any other import.
- .env.example: full config template with inline documentation.
Before. Upstream had no secret-handling story because it had no secrets.
The .gitignore did not exclude .env (it didn't need to).
After. Six layers of protection, each designed to catch failures in the previous one:
| Layer | Mechanism | Catches |
|---|---|---|
| 0 | Rotate key immediately at provider on any exposure | Keys already leaked |
| 1 | .gitignore excludes .env, .env.*, *.key, *.pem, secrets/, credentials.json | Accidental git add of .env |
| 2 | .env.example template shows the shape, never the value | User copy-pasting the wrong file |
| 3 | Code reads env vars only — no code path writes a key to a file | Logic errors persisting secrets |
| 4 | Secret scanner (scripts/scan-secrets.sh) matches provider prefixes for Jina, OpenAI, Anthropic, Voyage, Cohere, GitHub tokens, AWS access keys, PEM private key blocks | Hardcoded keys in any file |
| 5 | Pre-commit hook auto-installed via scripts/install-hooks.sh invokes the scanner on staged files | Forgetting to run the scanner manually |
Two subtle details matter:
- Masked scanner output. When the scanner finds a match, it prints file:line (key redacted — check the file). It never echoes the actual key value, so running the scanner in a CI log does not re-leak the secret. Several commercial scanners (including early TruffleHog) made this mistake; QMD's doesn't.
- Scanner allowlist. The scanner skips test/*.test.ts, README.md, CHANGELOG.md, and .env.example because those files legitimately contain placeholder keys (for tests) or descriptions of key shapes (for docs). The allowlist is a named regex constant, not scattered continue statements, so it's auditable.
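Both details can be illustrated in a few lines. The real scanner is a shell script (scripts/scan-secrets.sh); the TypeScript below is only a sketch of the same two ideas, with hypothetical names (`SECRET_PATTERNS`, `ALLOWLIST`, `scanFile`, `formatFinding`). The regexes are illustrative approximations of well-known key shapes, not the fork's actual patterns, and the output message is abbreviated.

```typescript
// Illustrative patterns only; the fork's shell scanner may match differently.
const SECRET_PATTERNS: Array<{ provider: string; pattern: RegExp }> = [
  { provider: "sk- style API key", pattern: /sk-[A-Za-z0-9_-]{20,}/ },
  { provider: "GitHub token", pattern: /ghp_[A-Za-z0-9]{36}/ },
  { provider: "AWS access key", pattern: /AKIA[0-9A-Z]{16}/ },
  { provider: "PEM private key", pattern: /-----BEGIN [A-Z ]*PRIVATE KEY-----/ },
];

// The allowlist lives in one named constant so it can be audited in one place.
const ALLOWLIST = /^(test\/.*\.test\.ts|README\.md|CHANGELOG\.md|\.env\.example)$/;

interface Finding {
  file: string;
  line: number; // 1-based
  provider: string;
}

function scanFile(file: string, content: string): Finding[] {
  if (ALLOWLIST.test(file)) return [];
  const findings: Finding[] = [];
  content.split(/\r?\n/).forEach((text, index) => {
    for (const { provider, pattern } of SECRET_PATTERNS) {
      if (pattern.test(text)) findings.push({ file, line: index + 1, provider });
    }
  });
  return findings;
}

// The finding carries only a location, never the matched text, so the
// report cannot re-leak the key into a CI log.
function formatFinding(finding: Finding): string {
  return `${finding.file}:${finding.line} (key redacted)`;
}
```

Note that `Finding` has no field for the matched string at all; masking is enforced by the data shape rather than by remembering to redact at print time.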
Why. Once an API key enters the codebase anywhere — even accidentally — the only 100% safe response is to rotate it. Git history is forever; force-pushing to rewrite history is unreliable and social-engineering-risky. Prevention is the only game. No single layer can promise 100%, but six stacked layers asymptote toward it.
Key files.
- .gitignore: layer 1.
- .env.example: layer 2.
- scripts/scan-secrets.sh: layer 4.
- scripts/pre-commit: layer 5.
- scripts/install-hooks.sh: installer.
Listing the non-changes is as important as listing the changes, because it tells you where the upstream design was already correct and should not be disturbed.
| Preserved | Reason |
|---|---|
| Hybrid retrieval pipeline (BM25 + vector + RRF + rerank) | Upstream's strongest idea. New providers plug into existing stages; the pipeline shape is untouched. |
| Smart chunking (900 tokens, markdown boundaries, AST-aware optional) | The chunker is provider-agnostic. With remote embedding it falls back to a char-based token estimate (~3 chars/token) since no local tokenizer is loaded. |
| SQLite + sqlite-vec storage | No vector DB swap. New tables (jina_usage) are additive; the vectors_vec schema is untouched. |
| Manual indexing (qmd embed) | No automatic file watcher. User stays in control of when re-indexing happens, which matters because re-embedding with a remote provider consumes quota. |
| CLI ergonomics (flags, output formats, colours) | New commands (usage, bench jina) mimic existing patterns (--json, --csv). No UX shape changes. |
| Test structure (vitest, CI=true guards) | New tests live alongside existing tests in test/. They honour the CI guard except where testing the remote-provider code path requires disabling it (documented in the test itself). |
| Query expansion backend | Runs at every query, latency-critical. Stays local. |
The non-changes form the contract that makes this fork safely mergeable back upstream if that ever becomes desirable.
UPSTREAM
qmd CLI ──► LlamaCpp ──► node-llama-cpp ──► local GGUF
(embed + rerank + generate)
THIS FORK
qmd CLI
│
├─► loadDotenv() ─────────────────── populate process.env from .env (if present)
│
├─► LlamaCpp (constructor dispatches per component)
│ │
│ ├─► embed:
│ │ config.remoteEmbedder ──► JinaEmbedder
│ │ config.embedModel "jina:*" ──► JinaEmbedder
│ │ QMD_EMBED_PROVIDER=jina ──► JinaEmbedder
│ │ else ──► node-llama-cpp (local)
│ │
│ ├─► rerank:
│ │ (same four-way dispatch as embed)
│ │
│ └─► generate:
│ always local (query expansion is latency-critical)
│
├─► setRemoteUsageReporter((event) =>
│ recordJinaUsage(db, event)) ◄── wired at store init
│
└─► commands:
status ──┐
usage ──┤
usage --json ──┼── all share getUsageSnapshot()
usage --csv ──┤ (single source of truth)
usage chart ──┤
bench jina ──┘
Jina API response (includes usage.total_tokens)
│
▼
JinaEmbedder.reportUsage() / JinaReranker.reportUsage()
│ (wrapped in try/catch — reporter must never break request path)
▼
UsageReporter callback
│ (injected by store layer at LlamaCpp construction time)
▼
recordJinaUsage(db, event)
│ (best-effort SQLite INSERT; logs to stderr on failure)
▼
jina_usage table (append-only)
│
▼ (read path)
getUsageSnapshot(db, options)
│
├── totals.last24h / last7d / last30d / allTime
├── byOperation (sorted by tokens desc)
└── quota (optional — null unless QMD_JINA_QUOTA is set)
├── limit, window, used
├── usedFraction, remaining
├── warnFraction
└── severity (ok | warn | critical | over)
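The quota branch at the bottom of the flow above can be sketched as a pure function over the window total. This is a guess at the shape, not the fork's implementation: `quotaState` is a hypothetical name, and the default thresholds (warn at 80%, critical at 95%, over once usage exceeds the limit) are assumptions, since the document only names the four severity levels and the warnFraction field.

```typescript
type Severity = "ok" | "warn" | "critical" | "over";

interface QuotaState {
  limit: number;
  used: number;
  usedFraction: number;
  remaining: number;
  warnFraction: number;
  severity: Severity;
}

function quotaState(
  used: number,
  limit: number,
  warnFraction = 0.8,      // assumed default
  criticalFraction = 0.95, // assumed; not stated in this document
): QuotaState {
  if (!(limit > 0)) throw new Error("quota limit must be a positive number");
  const usedFraction = used / limit;

  // Check the most severe condition first so exactly one level applies.
  let severity: Severity = "ok";
  if (usedFraction > 1) severity = "over";
  else if (usedFraction >= criticalFraction) severity = "critical";
  else if (usedFraction >= warnFraction) severity = "warn";

  return {
    limit,
    used,
    usedFraction,
    remaining: Math.max(0, limit - used), // never report negative headroom
    warnFraction,
    severity,
  };
}
```

A CI gate built on the JSON output would then only need to compare the severity string, without re-deriving any thresholds of its own.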
| Metric | Upstream | This fork | Delta |
|---|---|---|---|
| Total tests | ~399 | 467 | +68 |
| Jina provider (embed, rerank, usage, quota, bench) | 0 | 50 | +50 |
| Dotenv parser (quoted, escaped, BOM, override, walk, security) | 0 | 18 | +18 |
| Runtime dependencies added | 0 | 0 | 0 |
| Regressions | — | 0 | — |
| Typecheck | clean | clean | — |
Test infrastructure notes:
- All Jina tests mock global fetch via vi.stubGlobal. Zero real Jina API calls happen during test runs. This means CI works without a Jina API key, tests don't consume quota, and network flakiness cannot break the suite.
- The dotenv tests create files in os.tmpdir() and clean up in afterEach. Each test isolates its own process.env mutations via a saved-and-restored snapshot.
- The benchmark tests clear the CI=true guard because the bench exercises LlamaCpp.embedBatch, which upstream disables under CI. This is safe because the test is fully mocked.
Every feature in this document can be mapped to a concrete user request that surfaced during implementation:
| User need (observed) | Feature added |
|---|---|
| "Audit how embedding works in qmd" | (no change — purely informational) |
| "Does Jina speed it up?" | qmd bench jina |
| "Upgrade qmd with Jina as an option, so I can deploy anywhere" | Polymorphic backend |
| Vietnamese content showing weak recall | Jina v3 multilingual provider |
| "Paying plan, 1 billion tokens — how do I track?" | jina_usage table, qmd usage, quota warnings |
| "Will Jina really be faster on my M2?" | qmd bench jina --runs N with stddev |
| "Make the key an env var so I can push to GitHub safely" | .env loader, gitignore, secret scanner, pre-commit hook |
| "Show me the usage over time, not just totals" | qmd usage chart histogram |
| "Export to spreadsheet for reporting" | qmd usage --csv |
| "I want to gate CI on quota" | qmd usage --json with severity field |
None of these features were added speculatively. Each one solved a problem that had already blocked a real workflow. This is the defining property of a non-over-engineered addition.
Things that are deliberately out of scope for now, but may make sense in a follow-up:
- Other remote providers. The provider abstraction is shaped for Jina but not formally generalised. Adding Voyage or OpenAI embeddings would currently require touching LlamaCpp's constructor dispatch. A clean EmbeddingProvider interface would isolate that; it is worth doing if a second remote provider is actually requested.
- Automatic .env encryption at rest. Tools like git-crypt or age can encrypt .env so it can be committed. Not done here because it adds a dependency (or a non-trivial custom implementation) and the gitignore+scanner approach covers the 99% case.
- Usage alerting out of the box. qmd usage --json gives CI pipelines everything they need to build an alert, but QMD itself never opens a webhook or emails anyone. This is deliberate: QMD is a library + CLI, not a daemon. A thin qmd-alertd daemon could live in a separate package if demand justifies it.
- Merge back upstream. The strict "additive never subtractive" discipline was specifically designed to keep that option open. Whether Tobi wants any of this upstream is a conversation for later.
- Jina embedding caching. Currently every qmd embed run re-sends every chunk to Jina. A content-hash-keyed cache in the local DB would avoid re-billing for unchanged documents across re-indexing runs. Worth doing once a user actually hits this pain point.
For the next contributor who walks in and asks "where does everything live":
src/
├── dotenv.ts # NEW — zero-dep .env loader
├── embedders/
│ └── jina.ts # NEW — JinaEmbedder + JinaReranker + env factories
├── bench/
│ └── bench-jina.ts # NEW — latency/throughput benchmark harness
├── llm.ts # MOD — LlamaCpp provider dispatch + usage reporter hook
├── store.ts # MOD — jina_usage table + getUsageSnapshot + getDailyUsage
├── cli/
│ └── qmd.ts # MOD — loadDotenv, showUsage, showUsageChart,
│ # renderUsageJson, renderUsageCsv,
│ # renderBenchJinaTable, quota rendering
└── index.ts # MOD — SDK wires recordJinaUsage via setRemoteUsageReporter
scripts/
├── scan-secrets.sh # NEW — multi-provider secret scanner
├── pre-commit # NEW — invokes scan-secrets.sh on staged files
└── install-hooks.sh # MOD — installs pre-commit alongside pre-push
test/
├── jina.test.ts # NEW — 50 tests covering embed, rerank, usage, quota, bench
└── dotenv.test.ts # NEW — 18 tests covering parser corner cases
docs/
└── 01-architecture-changes.md # NEW — this document
.env.example # NEW — config template with inline documentation
.gitignore # MOD — excludes .env, .env.*, *.key, *.pem, secrets/
README.md # MOD — documents all new features
CHANGELOG.md # MOD — Unreleased section documents all shifts
- Original QMD: Tobi Lutke — the hybrid retrieval pipeline, smart chunking, MCP server, and every pre-existing piece of this codebase. This fork would not exist without that foundation.
- Fork enhancements: Nguyen Ngoc Tuan — Founder, Transform Group (Lark Platinum Partner). Polymorphic backend, Jina integration, usage tracking, quota warnings, benchmarking, secrets hygiene, and this document.
License remains MIT, identical to upstream.