Substrate-owned embedding: clients send text, server embeds. Load BGE-small via candle, batch concurrent requests, cache results. After this phase, you can take a string and get a deterministic 384-dim vector at ≥ 1K texts/sec on the reference machine.
- Phase 4 complete.
- spec/07_embedding/00_purpose.md
- spec/07_embedding/01_model_choice.md: why BGE-small.
- spec/07_embedding/02_inference_pipeline.md: mean pooling + L2 normalize.
- spec/07_embedding/04_batching_gpu.md
- spec/07_embedding/03_caching.md
- spec/07_embedding/05_fingerprinting.md
- spec/07_embedding/06_migration.md
- `crates/brain-embed` exports `Embedder`, `EmbedderConfig`.
- BGE-small loads from a local model file (downloaded out-of-band, NOT at runtime).
- 1K texts/sec sustained.
- Cache hit on identical strings.
- Tag: `phase-5-complete`.
Reads: spec/07_embedding/06_migration.md
Writes: crates/brain-embed/src/model.rs
What to build:
- Load tokenizer (HuggingFace tokenizers crate).
- Load model weights (candle-transformers BERT).
- The model file path is configured (no auto-download; that's an operational concern).
Done when: A test loads a checked-in tiny test model fixture and embeds "hello world" to a non-trivial vector.
Reads: spec/07_embedding/02_inference_pipeline.md
Writes: crates/brain-embed/src/tokenize.rs
Done when: Truncation at max_length (512); padding for batches; attention masks correct.
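The truncation, padding, and attention-mask criteria can be sketched on raw token ids. This is a minimal std-only illustration, not the crate's tokenizer code: `PaddedBatch` and `pad_and_truncate` are hypothetical names, and the sketch ignores special tokens such as `[CLS]`/`[SEP]`, which a real tokenizer has to preserve when truncating.

```rust
/// A padded batch: token ids plus a matching attention mask
/// (1 = real token, 0 = padding). Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
pub struct PaddedBatch {
    pub input_ids: Vec<Vec<u32>>,
    pub attention_mask: Vec<Vec<u32>>,
}

/// Truncate every sequence to `max_length`, then right-pad each one with
/// `pad_id` up to the longest remaining sequence in the batch.
pub fn pad_and_truncate(seqs: &[Vec<u32>], max_length: usize, pad_id: u32) -> PaddedBatch {
    // Batch width is the longest sequence after truncation.
    let width = seqs
        .iter()
        .map(|s| s.len().min(max_length))
        .max()
        .unwrap_or(0);

    let mut input_ids = Vec::with_capacity(seqs.len());
    let mut attention_mask = Vec::with_capacity(seqs.len());
    for seq in seqs {
        let kept = seq.len().min(max_length);
        let mut ids = seq[..kept].to_vec();
        let mut mask = vec![1u32; kept];
        // Padding positions get `pad_id` and a mask value of 0.
        ids.resize(width, pad_id);
        mask.resize(width, 0);
        input_ids.push(ids);
        attention_mask.push(mask);
    }
    PaddedBatch { input_ids, attention_mask }
}
```

With `max_length = 3`, the inputs `[1, 2, 3, 4, 5]` and `[7]` become `[1, 2, 3]` and `[7, 0, 0]`, with masks `[1, 1, 1]` and `[1, 0, 0]`.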
Reads: spec/07_embedding/02_inference_pipeline.md
Writes: crates/brain-embed/src/forward.rs
Done when: Mean-pooled output matches a reference implementation (e.g. sentence-transformers in Python) to within numerical noise (ε = 1e-4).
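The pooling step named above (masked mean over token embeddings, then L2 normalization) and the ε comparison can be sketched on plain vectors. This is an illustration only: `mean_pool_l2` and `max_abs_diff` are hypothetical helpers, and the real forward pass would operate on candle tensors rather than `Vec<f32>`.

```rust
/// Masked mean pooling over per-token embeddings, followed by L2
/// normalization. Tokens whose mask entry is 0 (padding) are skipped.
pub fn mean_pool_l2(token_embeddings: &[Vec<f32>], attention_mask: &[u32]) -> Vec<f32> {
    assert_eq!(token_embeddings.len(), attention_mask.len());
    let dim = token_embeddings.first().map_or(0, |t| t.len());

    let mut pooled = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (tok, &m) in token_embeddings.iter().zip(attention_mask) {
        if m == 0 {
            continue;
        }
        for (acc, &x) in pooled.iter_mut().zip(tok) {
            *acc += x;
        }
        count += 1.0;
    }

    // Guard against an all-padding row.
    let count = count.max(1.0);
    for v in pooled.iter_mut() {
        *v /= count;
    }

    // L2 normalize; the floor avoids dividing by zero for a zero vector.
    let norm = pooled.iter().map(|v| v * v).sum::<f32>().sqrt().max(1e-12);
    for v in pooled.iter_mut() {
        *v /= norm;
    }
    pooled
}

/// Largest element-wise absolute difference, for comparing against a
/// reference vector with a tolerance such as 1e-4.
pub fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}
```

For example, tokens `[3, 0]` and `[0, 4]` plus one masked padding token average to `[1.5, 2.0]`, which normalizes to `[0.6, 0.8]`.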
Reads: spec/07_embedding/04_batching_gpu.md
Writes: crates/brain-embed/src/batcher.rs
What to build:
- Channel-fed batcher: collect requests for up to `batch_window_ms` (e.g. 5 ms) or until `batch_size` (e.g. 32), then dispatch as one forward pass.
- Each request gets a oneshot channel to receive its result.
Done when: Concurrent embed calls amortize forward-pass cost; per-call latency ≤ 1.5x the batched per-text latency.
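The window-and-batch design described above can be sketched with the standard library alone. Assumptions: `Request`, `run_batcher`, and `embed` are hypothetical names, a `std::sync::mpsc` channel used once stands in for a true oneshot channel, and `forward` is any closure that maps a batch of texts to one vector per text. It illustrates the design only and is not the dispatch surface that sub-task 5.4 actually shipped.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::time::{Duration, Instant};

/// One embed request: the text plus a reply channel that is used once.
pub struct Request {
    pub text: String,
    pub reply: Sender<Vec<f32>>,
}

/// Collect requests until `batch_size` is reached or `window` has elapsed
/// since the first request of the batch, then run `forward` once.
pub fn run_batcher<F>(rx: Receiver<Request>, batch_size: usize, window: Duration, mut forward: F)
where
    F: FnMut(&[String]) -> Vec<Vec<f32>>,
{
    // Block for the first request of each batch; exit once all senders are gone.
    while let Ok(first) = rx.recv() {
        let deadline = Instant::now() + window;
        let mut batch = vec![first];
        while batch.len() < batch_size {
            let remaining = deadline.saturating_duration_since(Instant::now());
            match rx.recv_timeout(remaining) {
                Ok(req) => batch.push(req),
                // Window expired or senders disconnected: dispatch what we have.
                Err(_) => break,
            }
        }

        let texts: Vec<String> = batch.iter().map(|r| r.text.clone()).collect();
        let outputs = forward(&texts);
        for (req, out) in batch.into_iter().zip(outputs) {
            // Ignore send errors: the caller may have stopped waiting.
            let _ = req.reply.send(out);
        }
    }
}

/// Client side: submit one text and block until its vector comes back.
pub fn embed(tx: &Sender<Request>, text: &str) -> Option<Vec<f32>> {
    let (reply, result) = mpsc::channel();
    tx.send(Request { text: text.to_string(), reply }).ok()?;
    result.recv().ok()
}
```

A caller would run `run_batcher` on its own thread and hand clones of the request `Sender` to concurrent callers. Dropping every `Sender` shuts the loop down.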
Reads: spec/07_embedding/03_caching.md
Writes: crates/brain-embed/src/cache.rs
Done when: Identical text returns cached vector; cache size configurable; eviction is LRU.
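A minimal sketch of the three criteria (hit on identical text, configurable size, LRU eviction), assuming a hypothetical `LruCache` type keyed by the raw text. Recency is tracked in a `VecDeque`, so each access is O(n); a production cache would likely use a linked structure or an existing crate instead.

```rust
use std::collections::{HashMap, VecDeque};

/// Text -> vector cache with least-recently-used eviction.
pub struct LruCache {
    capacity: usize,
    map: HashMap<String, Vec<f32>>,
    order: VecDeque<String>, // front = least recently used
}

impl LruCache {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Move `key` to the most-recently-used position.
    fn touch(&mut self, key: &str) {
        if let Some(pos) = self.order.iter().position(|k| k.as_str() == key) {
            self.order.remove(pos);
        }
        self.order.push_back(key.to_string());
    }

    /// Return the cached vector for `key`, refreshing its recency on a hit.
    pub fn get(&mut self, key: &str) -> Option<Vec<f32>> {
        let hit = self.map.get(key).cloned()?;
        self.touch(key);
        Some(hit)
    }

    /// Insert or overwrite `key`, evicting the least recently used entry
    /// when a new key would exceed `capacity`.
    pub fn put(&mut self, key: &str, value: Vec<f32>) {
        if self.capacity == 0 {
            return;
        }
        if !self.map.contains_key(key) && self.map.len() >= self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        self.map.insert(key.to_string(), value);
        self.touch(key);
    }

    pub fn len(&self) -> usize {
        self.map.len()
    }
}
```

With capacity 2, inserting `a` and `b`, reading `a`, then inserting `c` evicts `b`, because `a` was used more recently.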
Reads: spec/07_embedding/05_fingerprinting.md
Writes: crates/brain-embed/tests/determinism.rs
Done when: The same input produces bit-identical output across 100 runs. (Numerical determinism may require pinning candle's matmul; document if not.)
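"Bit-identical" is stricter than float equality: `==` treats `0.0` and `-0.0` as equal and never equates NaN with itself. A sketch of the check, assuming hypothetical helpers `bit_identical` and `is_deterministic` and an `embed` closure standing in for the real embedder:

```rust
/// True when two vectors match bit for bit.
pub fn bit_identical(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}

/// Embed `text` `runs` times and check every output against the first.
pub fn is_deterministic<F>(mut embed: F, text: &str, runs: usize) -> bool
where
    F: FnMut(&str) -> Vec<f32>,
{
    let reference = embed(text);
    (1..runs).all(|_| bit_identical(&reference, &embed(text)))
}
```

In a test this would read roughly `assert!(is_deterministic(|t| embedder_call(t), "hello world", 100))`, where `embedder_call` is a placeholder for whatever the crate exposes.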
Reads: spec/19_benchmarks/02_performance_targets.md
Writes: crates/brain-embed/benches/throughput.rs
Done when: ≥ 1K texts/sec sustained on reference hardware (best-effort if not available; record baseline).
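A hand-timed throughput measurement can be as small as the sketch below. `texts_per_sec` and `measure` are hypothetical helpers, and the `embed` closure stands in for the real embedder. A real bench would also warm up first and run long enough to count as "sustained".

```rust
use std::time::{Duration, Instant};

/// Throughput in texts per second for `count` texts over `elapsed`.
pub fn texts_per_sec(count: usize, elapsed: Duration) -> f64 {
    count as f64 / elapsed.as_secs_f64()
}

/// Time one pass of `embed` over `texts` and return the observed rate.
pub fn measure<F>(mut embed: F, texts: &[&str]) -> f64
where
    F: FnMut(&str) -> Vec<f32>,
{
    let start = Instant::now();
    for &t in texts {
        // black_box keeps the optimizer from discarding the result.
        std::hint::black_box(embed(t));
    }
    texts_per_sec(texts.len(), start.elapsed())
}
```

The floor check then reduces to comparing the measured rate against `1000.0`, recording the baseline instead of failing when the reference hardware is not available.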
- Sub-tasks 5.1–5.7 complete.
- `cargo test -p brain-embed` green (53 passed; integration tests gated on `BRAIN_EMBED_MODEL_DIR` skip cleanly without it).
- Determinism test wired (`tests/determinism.rs`, 5 properties; gated on env var).
- Throughput bench wired with hand-timed 1 000/s floor assert (`benches/throughput.rs`; gated on env var).
- Tag `phase-5-complete`.
Sub-task 5.4 ships the dispatch surface (Dispatcher trait + CpuDispatcher passthrough) rather than the GPU window-and-batch machinery the original sketch implied — spec §02/03 §7 + §10 are explicit that CPU has no internal batching. The window+batch design is reserved for a future GPU sub-task behind the same trait.
Spec deviations logged in docs/development/spec-deviations.md:
- SD-5.1-1: refuse `pytorch_model.bin` outright (arbitrary-code-execution risk).
- SD-5.1-2: safetensors loaded via the safe full-file loader to preserve `#![forbid(unsafe_code)]` in `brain-embed`.
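The refusal in SD-5.1-1 amounts to a check on the weights path before any bytes are read. A minimal sketch, assuming a hypothetical `check_weights_path` function and `WeightsError` enum; the actual error types and the exact rule in `brain-embed` may differ.

```rust
use std::path::Path;

/// Why a weights file was rejected. Hypothetical error type.
#[derive(Debug, PartialEq)]
pub enum WeightsError {
    /// Pickle-based checkpoints can execute arbitrary code when loaded.
    PickleRefused,
    /// Anything that is not a safetensors file.
    UnsupportedFormat,
}

/// Accept only `.safetensors`; refuse `.bin` (e.g. pytorch_model.bin) outright.
pub fn check_weights_path(path: &Path) -> Result<(), WeightsError> {
    match path.extension().and_then(|e| e.to_str()) {
        Some("safetensors") => Ok(()),
        Some("bin") => Err(WeightsError::PickleRefused),
        _ => Err(WeightsError::UnsupportedFormat),
    }
}
```

Checking the extension is only a first gate; the safetensors parser still validates the file contents when it loads them.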
The model file is large (~130 MB for BGE-small). Don't check it into git. Operators run scripts/bootstrap-model.sh to download it into the XDG default path; see docs/notes/embedding-model-install.md for path resolution, manual install, air-gapped options, and the fingerprint flow.