01 — Architecture Changes

A design rationale for the enhancements layered on top of upstream tobi/qmd. This document is intentionally descriptive, not prescriptive — it explains what changed, why it changed, and what was deliberately left alone, so future contributors (and future-me) can reason about the codebase without guessing at intent.

Author of enhancements: Nguyen Ngoc Tuan, Founder, Transform Group (Lark Platinum Partner)

Original author: Tobi Lutke

1. Executive Summary

Upstream QMD is a sharply-focused on-device hybrid search engine: BM25 + vector + LLM reranking, everything running locally via node-llama-cpp against small GGUF models. It is excellent for a personal Obsidian vault on a developer laptop with a GPU.

It starts to strain in three directions:

CPU-only deployment — embedding a large corpus with embeddinggemma-300M on a CPU takes hours. For servers, CI runners, and low-end laptops this is a non-starter.
Non-English content — embeddinggemma-300M is heavily English-biased. Vietnamese, CJK, and mixed-language corpora get poor semantic recall.
Opaque cost — once any cloud API is involved, users need to see their consumption. Upstream had no concept of usage because there was no usage to track.

The changes in this fork address all three without compromising upstream's core promise of local-first operation. Every new capability is opt-in; the default execution path is byte-for-byte compatible with upstream.

Headline additions:

Polymorphic provider backend — embed/rerank can be routed to Jina AI independently, via env var or per-index YAML.
Full observability layer — usage tracking, quota warnings, ASCII histograms, JSON/CSV exports for scripting.
Reproducible benchmarking — qmd bench jina measures local vs remote latency and throughput with multi-run statistics.
Secrets hygiene — .env auto-load, gitignored by default, plus a pre-commit secret scanner that catches keys for eight providers.

Delivered with zero new runtime dependencies, zero regressions, and 68 new tests (399 upstream → 467 in this fork).

2. Design Philosophy

The enhancement strategy is captured by a single rule:

Additive, never subtractive. Every new capability must be opt-in. Upstream users who set no new env vars and touch no new YAML fields must experience an identical tool.

Concretely this means:

No existing CLI command changed its default behaviour.
No existing function signature changed.
No existing env var acquired a new meaning.
No new runtime dependency was added to package.json.
The local-first execution path has the same I/O footprint as before.

This rule was the single hardest constraint of the work. Many "obvious" refactors were rejected because they would have forced a migration on upstream users. The cost is some code duplication around branching logic (local-path vs remote-path); the payoff is painless merges if upstream chooses to pull any of this work back.

3. The Six Substantive Shifts

Each section below describes one shift: what the code does now, what the code used to do, and — most importantly — why the change exists.

3.1 Fixed backend → Polymorphic backend

Before. LlamaCpp was a direct wrapper around node-llama-cpp. Calling embed() always hit a local GGUF model; calling rerank() always loaded qwen3-reranker into VRAM. There was no indirection, no strategy interface, no way to swap models without forking.

After. LlamaCpp still exists and is still the sole entry point for all callers — the SDK, the CLI, the MCP server. But its constructor now resolves two optional pluggable components:

LlamaCpp config resolution (for embed, same for rerank):

  1. explicit config.remoteEmbedder instance         (SDK path)
  2. config.embedModel starts with "jina:"           (YAML path)
  3. env QMD_EMBED_PROVIDER=jina                     (env path)
  4. fall through to local node-llama-cpp            (default)

When a remote provider resolves, embed() / embedBatch() delegates to it and the local embedding model is never loaded. When nothing resolves, behaviour is identical to upstream — same code path, same performance, same VRAM usage.
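
The four-step resolution order above can be sketched as a small pure function. This is a minimal illustration, not the code in src/llm.ts: the `EmbedConfig`, `EmbedBackend`, and `resolveEmbedBackend` names are hypothetical, and the environment is passed in as a plain object so the logic is easy to test.

```typescript
// Hypothetical types for illustration; the real LlamaCpp config may differ.
interface RemoteEmbedder {
  embed(text: string): Promise<number[]>;
}

interface EmbedConfig {
  remoteEmbedder?: RemoteEmbedder; // SDK path
  embedModel?: string;             // YAML path, e.g. "jina:<model>"
}

type Env = Record<string, string | undefined>;

type EmbedBackend =
  | { kind: "remote"; source: "sdk" | "yaml" | "env" }
  | { kind: "local" };

// Walks the four steps in priority order; the first match wins.
function resolveEmbedBackend(config: EmbedConfig, env: Env): EmbedBackend {
  if (config.remoteEmbedder) return { kind: "remote", source: "sdk" };
  if (config.embedModel?.startsWith("jina:")) return { kind: "remote", source: "yaml" };
  if (env.QMD_EMBED_PROVIDER === "jina") return { kind: "remote", source: "env" };
  return { kind: "local" }; // default: identical to upstream
}
```

Keeping the decision separate from the construction of the embedder means the priority order can be unit-tested without loading a model or touching the network.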

Why. Upstream's single-path design is a feature for most users, but it forecloses three legitimate deployments:

Running QMD on a CPU-only server where embeddinggemma batch throughput is measured in single-digit docs/second.
Indexing a corpus that's primarily Vietnamese, where embeddinggemma-300M falls short on semantic recall.
Deploying QMD inside a constrained container (serverless, minimal images) where bundling a GGUF model is wasteful.

The polymorphic backend lets users hit any of these targets without giving up the local-first default. Generation (query expansion) is deliberately not pluggable — it runs at query time and is the most latency-sensitive step, so the local ~1.7B parameter model stays put.

Key files.

src/embedders/jina.ts — JinaEmbedder, JinaReranker, and env factories.
src/llm.ts — the resolution logic lives in LlamaCpp's constructor.

3.2 Opaque behaviour → Observable system

Before. qmd embed printed a progress bar and exited. If you had been running with a cloud provider it would still have told you nothing about cost, consumption, or trajectory. Upstream had no place for this because upstream had no cloud provider.

After. Every successful Jina API call flows through a UsageReporter callback that appends a row to a new jina_usage SQLite table:

CREATE TABLE jina_usage (
  id             INTEGER PRIMARY KEY AUTOINCREMENT,
  operation      TEXT NOT NULL,     -- 'embed_query' | 'embed_passage' | 'rerank'
  model          TEXT NOT NULL,
  total_tokens   INTEGER NOT NULL,
  prompt_tokens  INTEGER,
  at             TEXT NOT NULL      -- ISO 8601
);
CREATE INDEX idx_jina_usage_at ON jina_usage(at);

The table is an append-only event log, not a counter. This distinction matters: counters lose history and cannot answer questions like "how much did I spend on re-indexing last Friday?". Event logs answer every windowed question you might later ask (24h, 7d, 30d, custom, per-operation, per-model) without a schema migration.

On top of the event log sits a single helper, getUsageSnapshot(), that computes rolling-window totals plus optional quota state. All display paths consume this one helper:

qmd usage (text)
qmd usage --json
qmd usage --csv
qmd usage chart (ASCII histogram)
qmd status (compact summary)

This guarantees consistency: the number shown in the histogram footer is always the same number in the JSON payload. No display format can drift away from the source of truth.
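
To make the event-log argument concrete, here is a sketch of how rolling-window totals can be derived from raw rows. It works on an in-memory array rather than SQLite, and `windowTotals` and `byOperation` are hypothetical names; the real getUsageSnapshot() presumably runs equivalent queries against the jina_usage table.

```typescript
// Row shape mirrors the jina_usage columns above; everything else is hypothetical.
interface UsageRow {
  operation: "embed_query" | "embed_passage" | "rerank";
  model: string;
  total_tokens: number;
  at: string; // ISO 8601
}

interface WindowTotals {
  last24h: number;
  last7d: number;
  last30d: number;
  allTime: number;
}

const HOUR_MS = 60 * 60 * 1000;

// Sums total_tokens for rows newer than each cutoff, relative to `now`.
function windowTotals(rows: UsageRow[], now: Date): WindowTotals {
  const totals: WindowTotals = { last24h: 0, last7d: 0, last30d: 0, allTime: 0 };
  for (const row of rows) {
    const ageMs = now.getTime() - Date.parse(row.at);
    totals.allTime += row.total_tokens;
    if (ageMs <= 30 * 24 * HOUR_MS) totals.last30d += row.total_tokens;
    if (ageMs <= 7 * 24 * HOUR_MS) totals.last7d += row.total_tokens;
    if (ageMs <= 24 * HOUR_MS) totals.last24h += row.total_tokens;
  }
  return totals;
}

// Per-operation totals, sorted by tokens descending.
function byOperation(rows: UsageRow[]): Array<{ operation: string; tokens: number }> {
  const sums = new Map<string, number>();
  for (const row of rows) {
    sums.set(row.operation, (sums.get(row.operation) ?? 0) + row.total_tokens);
  }
  return Array.from(sums.entries())
    .map(([operation, tokens]) => ({ operation, tokens }))
    .sort((a, b) => b.tokens - a.tokens);
}
```

Because every window is computed from the same rows, adding a new window later (say, 90 days) is a new cutoff, not a schema change.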

Why. The moment you introduce a metered cloud API, not knowing your consumption becomes a blocker to actually using it. Users will not deploy a tool in production if they cannot answer "how much have I used this month?". The observability layer is not a nice-to-have — it is the minimum viable trust surface for cloud usage.

Key files.

src/store.ts — recordJinaUsage, getUsageSnapshot, getDailyUsage, JinaUsageSummary type.
src/cli/qmd.ts — showUsage, showUsageChart, renderUsageJson, renderUsageCsv, renderUsageChart.

3.3 Trust me → Measure yourself

Before. QMD made no performance claims because it had only one backend. There was nothing to compare against.

After. qmd bench jina runs a reproducible latency + throughput benchmark with deterministic synthetic workloads:

Three measured stages: embed_single, embed_batch, rerank.
Built-in vocabulary mix (English prose, code snippets, Vietnamese text) so results are reproducible and multilingual tokenisation is exercised.
Each run warms up the backend once, then collects samples for each stage.
With --runs N (up to 100), every stage's samples are flattened across runs and summarised with median, mean, standard deviation, p95, min, and max.
Stages with stddev > 20% of median are highlighted yellow ("this measurement is noisy — re-run with more samples").
When both backends run, a comparison table shows per-stage speedup ratios and picks a winner.

The entire report is emitted as a stable qmd.bench.jina.v1 JSON document via --json, suitable for CI regression gates.
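
The per-stage statistics listed above can be sketched as follows. This is an illustration under stated assumptions, not the harness in src/bench/bench-jina.ts: `summarize` is a hypothetical name, the percentile uses the nearest-rank method, and the standard deviation is the population form. The fork may use different conventions for both.

```typescript
interface StageSummary {
  median: number;
  mean: number;
  stddev: number;
  p95: number;
  min: number;
  max: number;
  noisy: boolean; // true when stddev exceeds 20% of the median
}

// Nearest-rank percentile on an ascending-sorted array (p in 0..100).
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Summarises latency samples (milliseconds) flattened across all runs.
function summarize(samplesMs: number[]): StageSummary {
  if (samplesMs.length === 0) throw new Error("no samples to summarise");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((sum, x) => sum + x, 0) / n;
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Population variance; dividing by (n - 1) instead is an equally valid choice.
  const variance = sorted.reduce((sum, x) => sum + (x - mean) ** 2, 0) / n;
  const stddev = Math.sqrt(variance);
  return {
    median,
    mean,
    stddev,
    p95: percentile(sorted, 95),
    min: sorted[0],
    max: sorted[n - 1],
    noisy: stddev > 0.2 * median,
  };
}
```

Comparing stddev against the median rather than the mean keeps a single slow outlier from hiding its own noise by inflating the baseline.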

Why. The answer to "is Jina actually faster for me?" depends on local hardware (CPU / GPU / RAM), geography (network RTT to Jina's EU/US datacenters), workload (batch indexing vs query-time), and content mix (code vs prose vs CJK). Any universal claim I make in documentation is wrong for some subset of users. qmd bench jina lets users stop trusting documentation and start trusting their own numbers.

The --runs flag exists because single measurements over a network are a lie: RTT variance can be 2× the median under load. Without multi-sample statistics there is no honest comparison.

Key files.

src/bench/bench-jina.ts — full benchmark harness.
src/cli/qmd.ts — renderBenchJinaTable for the human-readable output.

3.4 Silent failures → Explicit errors

Before. Upstream had a few graceful-degradation paths: if a YAML config failed to parse, getStore() swallowed the error and used defaults. This is reasonable when stakes are low.

After. The same kind of error is now fatal if it involves a remote provider. Specifically:

If ~/.config/qmd/<index>.yml specifies models.embed: "jina:..." but no JINA_API_KEY is present in the environment, QMD exits immediately with a formatted error message telling the user exactly which env var to set. It does not silently fall back to the local embeddinggemma model.
If QMD_EMBED_PROVIDER=jina is set but the API key is missing, qmd status displays a red error line naming the problem.
Init errors during LlamaCpp construction are caught, re-thrown with a user-friendly wrapper, and surfaced in qmd status under the relevant provider row.

Why. Silent fallback is a UX bug when stakes are high. If a user configures Jina, forgets the API key, and silently drops to local embeddinggemma, they will produce 768-dimensional vectors that are incompatible with the 1024-dimensional vectors they expected, and they may not notice until search quality degrades in production. Explicit errors cost one extra line of code and save a class of incidents.

The principle is "fail loud when intent is clear, recover gracefully when intent is ambiguous." Upstream's silent fallback is correct for the local-only path; explicit failure is correct for the remote-provider path.
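
A minimal sketch of that principle, assuming a hypothetical `chooseEmbedder` helper (the fork's actual functions are buildJinaEmbedderFromUri and buildJinaRerankerFromUri, whose signatures are not shown here). The environment is passed in explicitly so the behaviour can be tested without mutating process.env.

```typescript
type Env = Record<string, string | undefined>;

type EmbedderChoice =
  | { provider: "jina"; model: string; apiKey: string }
  | { provider: "local" };

// Hypothetical helper: decides which embedder to build from a model URI.
function chooseEmbedder(embedModel: string | undefined, env: Env): EmbedderChoice {
  if (embedModel?.startsWith("jina:")) {
    const apiKey = env.JINA_API_KEY;
    // Intent is clear (the user asked for Jina), so refuse to fall back to local.
    if (!apiKey) {
      throw new Error(
        `Model "${embedModel}" requires JINA_API_KEY, but it is not set. ` +
          "Export it in your shell or add it to a .env file."
      );
    }
    return { provider: "jina", model: embedModel.slice("jina:".length), apiKey };
  }
  // No remote intent was expressed: the local default is the correct behaviour.
  return { provider: "local" };
}
```

The error message names the exact variable to set, which is what turns a fatal error into a one-step fix for the user.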

Key files.

src/cli/qmd.ts — getStore() error surfacing for YAML-driven Jina config.
src/llm.ts — buildJinaEmbedderFromUri / buildJinaRerankerFromUri throw with helpful messages.

3.5 Shell env vars → Shell OR `.env` file

Before. The canonical way to configure QMD was to export variables in your shell. If you wanted persistence across sessions you either put the exports in ~/.bashrc (global scope pollution) or wrote a wrapper shell script (per-project brittleness).

After. QMD auto-loads a .env file at CLI startup, searching the current working directory and walking up to five parent directories. Shell env vars always win; the .env file provides defaults. A QMD_ENV_FILE=/path/to/other.env override supports multi-environment workflows (.env.production, .env.staging, etc.).

The loader is a custom ~180-line zero-dependency implementation in src/dotenv.ts, supporting:

Simple KEY=value assignment
Double-quoted values with escape sequences (\n, \t, \", \\)
Single-quoted raw values (no escape interpretation)
Optional export prefix (shell-compatible)
# comments, including trailing inline comments on unquoted values
UTF-8 BOM at file start
Upward directory walk (capped at 5 levels to avoid surprising behaviour)

Shipped with 18 unit tests covering every parser corner case plus the override and walk semantics.
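
For readers who want to see the parsing rules in one place, here is a compact sketch covering the syntax listed above. It is not the code in src/dotenv.ts: `parseDotenv` and `applyDefaults` are hypothetical names, malformed lines are simply skipped, and unknown escapes resolve to the escaped character, which may differ from the real loader.

```typescript
// Parses .env text into key/value pairs. Malformed lines are skipped.
function parseDotenv(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  const body = text.replace(/^\uFEFF/, ""); // strip a leading UTF-8 BOM
  for (const rawLine of body.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const m = /^(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*)$/.exec(line);
    if (!m) continue;
    const key = m[1];
    let value = m[2];
    if (value.startsWith('"')) {
      // Double-quoted: read up to the closing unescaped quote, then unescape.
      const dq = /^"((?:[^"\\]|\\.)*)"/.exec(value);
      if (!dq) continue;
      value = dq[1].replace(/\\(.)/g, (_, c: string) =>
        c === "n" ? "\n" : c === "t" ? "\t" : c
      );
    } else if (value.startsWith("'")) {
      // Single-quoted: raw value, no escape interpretation.
      const sq = /^'([^']*)'/.exec(value);
      if (!sq) continue;
      value = sq[1];
    } else {
      // Unquoted: drop a trailing inline comment introduced by whitespace + '#'.
      value = value.replace(/\s+#.*$/, "").trim();
    }
    out[key] = value;
  }
  return out;
}

// Shell variables win; .env only fills in keys that are not already set.
function applyDefaults(
  env: Record<string, string | undefined>,
  parsed: Record<string, string>
): void {
  for (const key of Object.keys(parsed)) {
    if (env[key] === undefined) env[key] = parsed[key];
  }
}
```

Separating parsing from application keeps the "shell wins" rule in a single three-line function, which is the part most likely to cause surprises if it is ever changed.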

Why. Two converging reasons:

Dependency minimalism. Adding the dotenv npm package for ~30 lines of parser code would have violated the "zero new runtime dependencies" constraint. Custom implementation was the proportional response, and it lets us add features dotenv doesn't have (BOM stripping, export prefix) without waiting on upstream releases.
API key safety. Without a persistent non-shell storage location for secrets, users inevitably hardcode keys into shell scripts, git history, or config files. .env is the industry-standard safe place because every sensible .gitignore already excludes it.

Key files.

src/dotenv.ts — the parser + loader.
src/cli/qmd.ts — loadDotenv() call at the top, before any other import.
.env.example — full config template with inline documentation.

3.6 No safety rails → Defense in depth

Before. Upstream had no secret-handling story because it had no secrets. The .gitignore did not exclude .env (it didn't need to).

After. Six layers of protection, each designed to catch failures in the previous one:

| Layer | Mechanism | Catches |
| --- | --- | --- |
| 0 | Rotate key immediately at provider on any exposure | Keys already leaked |
| 1 | `.gitignore` excludes `.env`, `.env.*`, `*.key`, `*.pem`, `secrets/`, `credentials.json` | Accidental `git add` of `.env` |
| 2 | `.env.example` template shows the shape, never the value | User copy-pasting the wrong file |
| 3 | Code reads env vars only; no code path writes a key to a file | Logic errors persisting secrets |
| 4 | Secret scanner (`scripts/scan-secrets.sh`) matches provider prefixes for Jina, OpenAI, Anthropic, Voyage, Cohere, GitHub tokens, AWS access keys, PEM private key blocks | Hardcoded keys in any file |
| 5 | Pre-commit hook auto-installed via `scripts/install-hooks.sh` invokes the scanner on staged files | Forgetting to run the scanner manually |

Two subtle details matter:

Masked scanner output. When the scanner finds a match, it prints file:line (key redacted — check the file). It never echoes the actual key value, so running the scanner in a CI log does not re-leak the secret. Several commercial scanners (including early TruffleHog) made this mistake; QMD's doesn't.
Scanner allowlist. The scanner skips test/*.test.ts, README.md, CHANGELOG.md, and .env.example because those files legitimately contain placeholder keys (for tests) or descriptions of key shapes (for docs). The allowlist is a named regex constant, not scattered continue statements, so it's auditable.
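
The two details above, masked output and a single named allowlist, can be illustrated in a few lines. The real scanner is a shell script (scripts/scan-secrets.sh); this TypeScript sketch only mirrors its shape. `DEMO_KEY_PATTERN` is a made-up pattern, not any provider's real key format, and `scanFile` and `formatFinding` are hypothetical names.

```typescript
// Made-up pattern for illustration only; real provider key formats differ
// and the actual rules live in scripts/scan-secrets.sh.
const DEMO_KEY_PATTERN = /\bdemo_[A-Za-z0-9]{16,}\b/;

// One named allowlist constant, so every skipped path is visible in one place.
const ALLOWLIST = /^(test\/.*\.test\.ts|README\.md|CHANGELOG\.md|\.env\.example)$/;

interface Finding {
  file: string;
  line: number; // 1-based
}

// Returns the locations of suspected keys; the matched text is never stored.
function scanFile(file: string, content: string): Finding[] {
  if (ALLOWLIST.test(file)) return [];
  const findings: Finding[] = [];
  content.split(/\r?\n/).forEach((text, i) => {
    if (DEMO_KEY_PATTERN.test(text)) findings.push({ file, line: i + 1 });
  });
  return findings;
}

// Reports location only; the key value is never interpolated into the output.
function formatFinding(f: Finding): string {
  return `${f.file}:${f.line} (key redacted - check the file)`;
}
```

Because `Finding` has no field for the matched text, there is no way for a later formatting change to leak the key by accident.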

Why. Once an API key enters the codebase anywhere — even accidentally — the only 100% safe response is to rotate it. Git history is forever; force-pushing to rewrite history is unreliable and social-engineering-risky. Prevention is the only game. No single layer can promise 100%, but six stacked layers asymptote toward it.

Key files.

.gitignore — layer 1.
.env.example — layer 2.
scripts/scan-secrets.sh — layer 4.
scripts/pre-commit — layer 5.
scripts/install-hooks.sh — installer.

4. What Deliberately Did Not Change

Listing the non-changes is as important as listing the changes, because it tells you where the upstream design was already correct and should not be disturbed.

| Preserved | Reason |
| --- | --- |
| Hybrid retrieval pipeline (BM25 + vector + RRF + rerank) | Upstream's strongest idea. New providers plug into existing stages; the pipeline shape is untouched. |
| Smart chunking (900 tokens, markdown boundaries, AST-aware optional) | The chunker is provider-agnostic. With remote embedding it falls back to a char-based token estimate (~3 chars/token) since no local tokenizer is loaded. |
| SQLite + sqlite-vec storage | No vector DB swap. New tables (`jina_usage`) are additive; the `vectors_vec` schema is untouched. |
| Manual indexing (`qmd embed`) | No automatic file watcher. User stays in control of when re-indexing happens, which matters because re-embedding with a remote provider consumes quota. |
| CLI ergonomics (flags, output formats, colours) | New commands (`usage`, `bench jina`) mimic existing patterns (`--json`, `--csv`). No UX shape changes. |
| Test structure (vitest, `CI=true` guards) | New tests live alongside existing tests in `test/`. They honour the CI guard except where testing the remote-provider code path requires disabling it (documented in the test itself). |
| Query expansion backend | Runs at every query, latency-critical. Stays local. |

The non-changes form the contract that makes this fork safely mergeable back upstream if that ever becomes desirable.

5. Architectural Delta (Diagrams)

Backend resolution

UPSTREAM

  qmd CLI ──► LlamaCpp ──► node-llama-cpp ──► local GGUF
                              (embed + rerank + generate)


THIS FORK

  qmd CLI
    │
    ├─► loadDotenv() ─────────────────── populate process.env from .env (if present)
    │
    ├─► LlamaCpp (constructor dispatches per component)
    │     │
    │     ├─► embed:
    │     │     config.remoteEmbedder      ──► JinaEmbedder
    │     │     config.embedModel "jina:*" ──► JinaEmbedder
    │     │     QMD_EMBED_PROVIDER=jina    ──► JinaEmbedder
    │     │     else                       ──► node-llama-cpp (local)
    │     │
    │     ├─► rerank:
    │     │     (same four-way dispatch as embed)
    │     │
    │     └─► generate:
    │           always local (query expansion is latency-critical)
    │
    ├─► setRemoteUsageReporter((event) =>
    │     recordJinaUsage(db, event))  ◄── wired at store init
    │
    └─► commands:
          status       ──┐
          usage        ──┤
          usage --json ──┼── all share getUsageSnapshot()
          usage --csv  ──┤   (single source of truth)
          usage chart  ──┤
          bench jina   ──┘

Usage data flow

Jina API response (includes usage.total_tokens)
    │
    ▼
JinaEmbedder.reportUsage() / JinaReranker.reportUsage()
    │   (wrapped in try/catch — reporter must never break request path)
    ▼
UsageReporter callback
    │   (injected by store layer at LlamaCpp construction time)
    ▼
recordJinaUsage(db, event)
    │   (best-effort SQLite INSERT; logs to stderr on failure)
    ▼
jina_usage table (append-only)
    │
    ▼ (read path)
getUsageSnapshot(db, options)
    │
    ├── totals.last24h / last7d / last30d / allTime
    ├── byOperation (sorted by tokens desc)
    └── quota (optional — null unless QMD_JINA_QUOTA is set)
          ├── limit, window, used
          ├── usedFraction, remaining
          ├── warnFraction
          └── severity (ok | warn | critical | over)
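
The quota branch of the diagram can be sketched as a pure function over `limit` and `used`. The field names follow the diagram, but the thresholds are assumptions: the source names a `critical` level without stating its cutoff, so `CRITICAL_FRACTION` and the default `warnFraction` below are placeholders, and `quotaState` and `exitCodeFor` are hypothetical names.

```typescript
type Severity = "ok" | "warn" | "critical" | "over";

interface QuotaState {
  limit: number;
  used: number;
  usedFraction: number;
  remaining: number;
  warnFraction: number;
  severity: Severity;
}

// Assumed threshold: the source names a "critical" level but not its cutoff.
const CRITICAL_FRACTION = 0.95;

// The 0.8 default for warnFraction is also an assumption for this sketch.
function quotaState(limit: number, used: number, warnFraction = 0.8): QuotaState {
  if (limit <= 0) throw new Error("limit must be positive");
  const usedFraction = used / limit;
  let severity: Severity = "ok";
  if (usedFraction >= 1) severity = "over";
  else if (usedFraction >= CRITICAL_FRACTION) severity = "critical";
  else if (usedFraction >= warnFraction) severity = "warn";
  return {
    limit,
    used,
    usedFraction,
    remaining: Math.max(0, limit - used),
    warnFraction,
    severity,
  };
}

// A CI gate could map severity to a process exit code.
function exitCodeFor(severity: Severity): number {
  return severity === "critical" || severity === "over" ? 1 : 0;
}
```

A pipeline consuming `qmd usage --json` could apply the same mapping on the `severity` field without recomputing fractions itself.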

6. Test Delta

| Metric | Upstream | This fork | Delta |
| --- | --- | --- | --- |
| Total tests | ~399 | 467 | +68 |
| Jina provider (embed, rerank, usage, quota, bench) | 0 | 50 | +50 |
| Dotenv parser (quoted, escaped, BOM, override, walk, security) | 0 | 18 | +18 |
| Runtime dependencies added | 0 | 0 | 0 |
| Regressions | n/a | 0 | n/a |
| Typecheck | clean | clean | n/a |

Test infrastructure notes:

All Jina tests mock global fetch via vi.stubGlobal. Zero real Jina API calls happen during test runs. This means CI works without a Jina API key, tests don't consume quota, and network flakiness cannot break the suite.
The dotenv tests create files in os.tmpdir() and clean up in afterEach. Each test isolates its own process.env mutations via a saved-and-restored snapshot.
The benchmark tests clear the CI=true guard because the bench exercises LlamaCpp.embedBatch, which upstream disables under CI. This is safe because the test is fully mocked.

7. Why This Isn't Over-Engineering

Every feature in this document can be mapped to a concrete user request that surfaced during implementation:

| User need (observed) | Feature added |
| --- | --- |
| "Audit how embedding works in qmd" | (no change; purely informational) |
| "Does Jina speed it up?" | `qmd bench jina` |
| "Upgrade qmd with Jina as an option, so I can deploy anywhere" | Polymorphic backend |
| Vietnamese content showing weak recall | Jina v3 multilingual provider |
| "Paying plan, 1 billion tokens: how do I track?" | `jina_usage` table, `qmd usage`, quota warnings |
| "Will Jina really be faster on my M2?" | `qmd bench jina --runs N` with stddev |
| "Make the key an env var so I can push to GitHub safely" | `.env` loader, gitignore, secret scanner, pre-commit hook |
| "Show me the usage over time, not just totals" | `qmd usage chart` histogram |
| "Export to spreadsheet for reporting" | `qmd usage --csv` |
| "I want to gate CI on quota" | `qmd usage --json` with `severity` field |

None of these features were added speculatively. Each one solved a problem that had already blocked a real workflow. This is the defining property of a non-over-engineered addition.

8. Open Questions & Future Work

Things that are deliberately out of scope for now, but may make sense in a follow-up:

Other remote providers. The provider abstraction is shaped for Jina but not formally generalised. Adding Voyage or OpenAI embeddings would currently require touching LlamaCpp's constructor dispatch. A clean EmbeddingProvider interface would isolate that — worth doing if a second remote provider is actually requested.
Automatic .env encryption at rest. Tools like git-crypt or age can encrypt .env so it can be committed. Not done here because it adds a dependency (or a non-trivial custom implementation) and the gitignore+scanner approach covers the 99% case.
Usage alerting out of the box. qmd usage --json gives CI pipelines everything they need to build an alert, but QMD itself never opens a webhook or emails anyone. This is deliberate: QMD is a library + CLI, not a daemon. A thin qmd-alertd daemon could live in a separate package if demand justifies it.
Merge back upstream. The strict "additive never subtractive" discipline was specifically designed to keep that option open. Whether Tobi wants any of this upstream is a conversation for later.
Jina embedding caching. Currently every qmd embed run re-sends every chunk to Jina. A content-hash-keyed cache in the local DB would avoid re-billing for unchanged documents across re-indexing runs. Worth doing once a user actually hits this pain point.
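
As a rough sketch of what the caching idea in the last item might look like (nothing like this exists in the fork yet): chunks are keyed by model plus a hash of their content, and only cache misses are sent to the embedder. Everything here is hypothetical. The embed function is synchronous for brevity where a real Jina call would be async, the cache is an in-memory `Map` standing in for a table in the local DB, and the 32-bit FNV-1a hash is a dependency-free stand-in for a cryptographic digest such as SHA-256.

```typescript
// Stand-in 32-bit FNV-1a hash so the sketch has no dependencies; a real cache
// would likely use a cryptographic digest such as SHA-256 to avoid collisions.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16).padStart(8, "0");
}

// Hypothetical embedder signature: one vector per input text, same order.
type EmbedFn = (texts: string[]) => number[][];

// Returns one vector per chunk, calling `embed` only for chunks not yet cached.
function embedWithCache(
  chunks: string[],
  model: string,
  cache: Map<string, number[]>,
  embed: EmbedFn
): number[][] {
  // The model is part of the key so switching models never reuses stale vectors.
  const keys = chunks.map((c) => `${model}:${fnv1a(c)}`);
  const missingKeys: string[] = [];
  const missingTexts: string[] = [];
  keys.forEach((key, i) => {
    if (!cache.has(key) && missingKeys.indexOf(key) === -1) {
      missingKeys.push(key);
      missingTexts.push(chunks[i]);
    }
  });
  if (missingTexts.length > 0) {
    const vectors = embed(missingTexts);
    missingKeys.forEach((key, i) => cache.set(key, vectors[i]));
  }
  return keys.map((key) => cache.get(key) as number[]);
}
```

On a re-index where most documents are unchanged, only the new or edited chunks would reach the remote API, which is where the quota saving would come from.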

9. File Map

For the next contributor who walks in and asks "where does everything live":

src/
├── dotenv.ts                 # NEW  — zero-dep .env loader
├── embedders/
│   └── jina.ts               # NEW  — JinaEmbedder + JinaReranker + env factories
├── bench/
│   └── bench-jina.ts         # NEW  — latency/throughput benchmark harness
├── llm.ts                    # MOD  — LlamaCpp provider dispatch + usage reporter hook
├── store.ts                  # MOD  — jina_usage table + getUsageSnapshot + getDailyUsage
├── cli/
│   └── qmd.ts                # MOD  — loadDotenv, showUsage, showUsageChart,
│                             #        renderUsageJson, renderUsageCsv,
│                             #        renderBenchJinaTable, quota rendering
└── index.ts                  # MOD  — SDK wires recordJinaUsage via setRemoteUsageReporter

scripts/
├── scan-secrets.sh           # NEW  — multi-provider secret scanner
├── pre-commit                # NEW  — invokes scan-secrets.sh on staged files
└── install-hooks.sh          # MOD  — installs pre-commit alongside pre-push

test/
├── jina.test.ts              # NEW  — 50 tests covering embed, rerank, usage, quota, bench
└── dotenv.test.ts            # NEW  — 18 tests covering parser corner cases

docs/
└── 01-architecture-changes.md # NEW — this document

.env.example                  # NEW  — config template with inline documentation
.gitignore                    # MOD  — excludes .env, .env.*, *.key, *.pem, secrets/
README.md                     # MOD  — documents all new features
CHANGELOG.md                  # MOD  — Unreleased section documents all shifts

10. Credits

Original QMD: Tobi Lutke — the hybrid retrieval pipeline, smart chunking, MCP server, and every pre-existing piece of this codebase. This fork would not exist without that foundation.
Fork enhancements: Nguyen Ngoc Tuan — Founder, Transform Group (Lark Platinum Partner). Polymorphic backend, Jina integration, usage tracking, quota warnings, benchmarking, secrets hygiene, and this document.

License remains MIT, identical to upstream.
