Perspicacite supports multiple embedding models and RAG strategies. This guide explains the three tiers we've validated through systematic benchmarking on the SciFact claim-retrieval dataset (5 183 PubMed abstracts, 300 dev claims), and shows how to configure each one.
| Tier | Embedding model | NDCG@10 | RAM footprint | API cost | Best for |
|---|---|---|---|---|---|
| 1: Default | all-MiniLM-L6-v2 | 0.857 | ~90 MB | free | Any machine, quick setup |
| 2: OpenAI | text-embedding-3-large | 0.872* | ~0 MB local | ~$0.13 / M tokens | High accuracy, cloud cost OK |
| 3a: Biomedical local | S-PubMedBert-MS-MARCO | 0.873 | ~440 MB | free | Biomedical / life science |
| 3b: General local SOTA | BAAI/bge-m3 | 0.879 | ~2.3 GB | free | Best overall; beats OpenAI on SciFact |

* OpenAI vector-only = 0.864; 0.872 with ms-marco-MiniLM-L-12-v2 cross-encoder rerank (over_retrieve=4×).
All NDCG@10 figures are with over_retrieve=4×, BAAI/bge-reranker-v2-m3 reranker unless
noted. Base (no rerank) figures are 5–14 pp lower.
Config file: config.yml
sentence-transformers/all-MiniLM-L6-v2 — 384-dim, ~90 MB, runs on CPU in <10 ms/query.
| Mode | Over-retrieve | Reranker | NDCG@10 | R@5 | MRR |
|---|---|---|---|---|---|
| vector | 1× | — | 0.745 | 0.844 | 0.718 |
| hybrid | 1× | — | 0.774 | 0.839 | 0.757 |
| vector | 4× | ms-marco-MiniLM-L-12-v2 | 0.851 | 0.917 | 0.837 |
| vector | 4× | bge-reranker-v2-m3 | 0.857 | 0.915 | 0.847 |
| hybrid | 4× | bge-reranker-v2-m3 | 0.808 | 0.839 | 0.806 |
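The NDCG@10, R@5, and MRR columns can be reproduced for a single claim with a few lines of Python. The sketch below assumes binary relevance (a document either is gold evidence for the claim or is not); this guide does not say which evaluation library perspicacite-eval uses, so the function names are illustrative rather than its actual API.

```python
import math
from typing import Sequence, Set


def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def mrr(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant document (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical claim: one gold abstract, retrieved at rank 2.
ranked = ["d3", "d1", "d7"]
relevant = {"d1"}
print(round(ndcg_at_k(ranked, relevant), 3))  # 0.631
print(recall_at_k(ranked, relevant))          # 1.0
print(mrr(ranked, relevant))                  # 0.5
```

The table values are these per-claim scores averaged over the 300 dev claims.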
cd ~/git/Perspicacite-AI
# Basic (no reranker — fastest, lowest RAM)
uv run perspicacite -c config.yml serve
# With reranker enabled (server-side reranking in /api/chat advanced mode)
# The reranker model is set in config.yml under rag_modes.reranker_model
uv run perspicacite -c config.yml serve

Key config.yml settings:
knowledge_base:
  embedding_model: "all-MiniLM-L6-v2"
  similarity_threshold: 0.7 # MiniLM cosine scores are well-distributed
  default_top_k: 10
rag_modes:
  reranker_model: "cross-encoder/ms-marco-MiniLM-L-6-v2" # ~120 MB

llm:
  # Option A: OpenRouter free tier (DeepSeek V4 Flash, no cost)
  default_provider: "openrouter"
  default_model: "deepseek/deepseek-v4-flash"
  # Option B: local Ollama (completely offline); uncomment these two lines
  # and comment out Option A to use it.
  # See Tier 3 section for Ollama setup
  # default_provider: "ollama"
  # default_model: "qwen3:8b" # ~5 GB; smaller than 14B

Config file: config_openai_large.yml
text-embedding-3-large — 3 072-dim, OpenAI API, ~$0.13 per million tokens.
| Mode | Over-retrieve | Reranker | NDCG@10 | R@5 | MRR |
|---|---|---|---|---|---|
| vector | 1× | — | 0.864 | 0.932 | 0.849 |
| vector | 4× | ms-marco-MiniLM-L-12-v2 | 0.872 | 0.951 | 0.855 |
Gain over MiniLM baseline: +12 pp NDCG@10 (no rerank), +2 pp with CE reranker.
export OPENAI_API_KEY="sk-..."
# Using the dedicated config (port 8002 by default)
uv run perspicacite -c config_openai_large.yml serve

Key settings:
knowledge_base:
  embedding_model: "text-embedding-3-large"
  similarity_threshold: 0.0 # CRITICAL: OpenAI cosine scores differ from MiniLM
  # 0.7 threshold will filter most/all results

Cost estimate: ingesting SciFact (5 183 abstracts, ~2.1 M tokens) cost < $0.30. Query embedding for 300 eval claims: < $0.01. For personal KB use, cost is negligible.
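The ingest figure follows directly from the token count and the per-million-token price quoted above. A small sketch of that arithmetic, using a hypothetical helper name and assuming a flat price with no batching discount:

```python
def embedding_cost_usd(n_tokens: int, usd_per_million: float = 0.13) -> float:
    """API cost in USD for embedding n_tokens at a flat per-million-token price."""
    return n_tokens / 1_000_000 * usd_per_million


# ~2.1 M tokens for the SciFact abstracts, as stated in this guide.
scifact_ingest = embedding_cost_usd(2_100_000)
print(f"${scifact_ingest:.2f}")  # $0.27
```

To budget for your own KB, substitute your corpus token count and the current price from your provider.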
You can run both servers simultaneously (they share chroma_db/ but use different KBs):
# Terminal 1 — MiniLM on :8000
uv run perspicacite -c config.yml serve
# Terminal 2 — OpenAI on :8002
OPENAI_API_KEY=$OPENAI_API_KEY uv run perspicacite -c config_openai_large.yml serve

Each server uses its own KB (scifact_abstracts for MiniLM, scifact_openai_large for OpenAI) and embeds queries with the matching model. See Critical Gotcha #1 below.
Config file: config_pubmedbert.yml
pritamdeka/S-PubMedBert-MS-MARCO — 768-dim, PubMedBERT fine-tuned for retrieval on
MS-MARCO. Domain-adapted for medical/biological text. ~440 MB.
| Mode | Over-retrieve | Reranker | NDCG@10 | R@5 | MRR |
|---|---|---|---|---|---|
| vector | 1× | — | — | — | — |
| vector | 4× | bge-reranker-v2-m3 | 0.873 | 0.933 | 0.864 |
| hybrid | 4× | bge-reranker-v2-m3 | 0.842 | 0.887 | 0.832 |
This configuration edges past OpenAI 3-large with reranking (0.873 vs 0.872 NDCG@10) on biomedical text while being fully local and free; among the configurations we tested, only bge-m3 (Tier 3b, 0.879) scores higher. The key is combining a domain-adapted retrieval model with a powerful cross-encoder reranker.
uv run perspicacite -c config_pubmedbert.yml serve
# Model auto-downloads from HuggingFace on first run (~440 MB)

For offline environments (after first download):
TRANSFORMERS_OFFLINE=1 HF_DATASETS_OFFLINE=1 \
uv run perspicacite -c config_pubmedbert.yml serve

Key settings:
knowledge_base:
  embedding_model: "pritamdeka/S-PubMedBert-MS-MARCO"
  similarity_threshold: 0.0 # Lower scores than MiniLM; 0.7 would filter results
  default_top_k: 10
rag_modes:
  reranker_model: "BAAI/bge-reranker-v2-m3" # ~2.2 GB, pulls on first run

Why bge-reranker-v2-m3? It's the MTEB SOTA cross-encoder and adds ~4–6 pp NDCG@10 over ms-marco-MiniLM-L-12-v2 in our experiments. It requires ~2.2 GB RAM but runs on CPU (slowly) or GPU (fast). If RAM is constrained, use cross-encoder/ms-marco-MiniLM-L-12-v2 (~120 MB) instead, which still gains ~2–3 pp.
Config file: config_bge_m3.yml
BAAI/bge-m3 — 1 024-dim, multilingual MTEB SOTA retrieval model. ~2.3 GB.
knowledge_base:
  embedding_model: "BAAI/bge-m3"
  similarity_threshold: 0.0

Benchmark result (SciFact dev, 300 claims, 5 183 PubMed abstracts): NDCG@10 = 0.879 with the BAAI/bge-reranker-v2-m3 reranker (over_retrieve=4×). This is the current best overall, beating PubMedBERT (0.873) and OpenAI 3-large + ms-marco (0.872). Note: an earlier result of 0.655 was invalid because that run accidentally targeted the MiniLM server (port 8000), where similarity_threshold: 0.7 filtered most bge-m3 results. The corrected run uses port 8004 with similarity_threshold: 0.0.
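The invalid 0.655 run comes down to a score cutoff applied before results are returned. A minimal sketch of that effect, with made-up cosine scores and a hypothetical helper name (the server's real filtering code is not shown in this guide):

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (document id, cosine similarity)


def apply_similarity_threshold(hits: List[Hit], threshold: float) -> List[Hit]:
    """Keep only hits whose similarity score is at least `threshold`."""
    return [(doc_id, score) for doc_id, score in hits if score >= threshold]


# Illustrative scores only: these are not measured bge-m3 outputs.
hits = [("doc_a", 0.62), ("doc_b", 0.55), ("doc_c", 0.41)]

print(len(apply_similarity_threshold(hits, 0.7)))  # 0
print(len(apply_similarity_threshold(hits, 0.0)))  # 3
```

If a model's typical scores for relevant documents sit below the cutoff, correct hits are dropped before the reranker ever sees them, which is why the non-MiniLM configs set the threshold to 0.0.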
GPU launch:
# If you have a CUDA GPU, sentence-transformers will use it automatically
uv run perspicacite -c config_bge_m3.yml serve

Perspicacite stores all embedding vectors in a single chroma_db/ directory, and all servers share it. This means you can run multiple servers (different embedding models, different ports) and each maintains its own KB collections.
# Port 8000 — MiniLM (always-on, fast, general queries)
uv run perspicacite -c config.yml serve &
# Port 8001 — SPECTER2 (scientific citation context)
TRANSFORMERS_OFFLINE=1 HF_DATASETS_OFFLINE=1 \
uv run perspicacite -c config_specter2.yml serve &
# Port 8002 — OpenAI 3-large (highest accuracy, paid)
OPENAI_API_KEY=$OPENAI_API_KEY \
uv run perspicacite -c config_openai_large.yml serve &

Each KB must be ingested separately through the server that owns its embedding model:
# Ingest SciFact abstracts into MiniLM KB (port 8000)
PERSPICACITE_URL=http://localhost:8000 \
uv run python scripts/ingest_corpus.py --corpus abstracts --kb-name scifact_abstracts
# Ingest same abstracts into SPECTER2 KB (port 8001)
PERSPICACITE_URL=http://localhost:8001 \
uv run python scripts/ingest_corpus.py --corpus abstracts --kb-name scifact_specter2
# Ingest into OpenAI KB (port 8002)
PERSPICACITE_URL=http://localhost:8002 \
uv run python scripts/ingest_corpus.py --corpus abstracts --kb-name scifact_openai_large

When using perspicacite-eval, the CrossServerRRFAdapter fuses results from two
different-model KBs using Reciprocal Rank Fusion (RRF), then applies a cross-encoder.
This is experimental but shows promise for further +2–5 pp gains.
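For readers unfamiliar with RRF, the fusion step can be sketched as follows. This is the standard formula, score(d) = Σ 1 / (k + rank), with the commonly used constant k = 60; the actual constant and interface of CrossServerRRFAdapter are not documented here, so treat the function name and signature as assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
    top_n: Optional[int] = None,
) -> List[str]:
    """Fuse several ranked ID lists; each list adds 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused if top_n is None else fused[:top_n]


# Hypothetical top-3 lists from two servers with different embedding models.
minilm_hits = ["d1", "d2", "d3"]
openai_hits = ["d2", "d4", "d1"]
print(reciprocal_rank_fusion([minilm_hits, openai_hits]))
# ['d2', 'd1', 'd4', 'd3']
```

RRF uses only ranks, never raw scores, which is what makes it safe to combine models whose cosine scores live on different scales. The fused list would then be passed to the cross-encoder.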
Cross-encoder rerankers apply a second pass over the retrieved candidates. Configure
rag_modes.reranker_model in your config file.
| Reranker | Size | Speed (CPU) | NDCG gain vs no-rerank |
|---|---|---|---|
| cross-encoder/ms-marco-MiniLM-L-6-v2 | ~120 MB | fast (<1 s/query) | +8–10 pp |
| cross-encoder/ms-marco-MiniLM-L-12-v2 | ~120 MB | fast (<1 s/query) | +10–12 pp |
| BAAI/bge-reranker-v2-m3 | ~2.2 GB | slow (2–5 s/query CPU) | +12–14 pp |

Over-retrieve setting: Reranking only helps when you fetch more candidates than you need. Use default_top_k × 3–4 for the initial retrieval, then rerank to top_k. In the Perspicacite API, set top_k=20 to retrieve 20, and the server reranks to return the best 5–10.
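The over-retrieve pattern can be sketched independently of any particular model. In the example below the retriever and the pair scorer are toy stand-ins (hypothetical `first_n` and `word_overlap` functions); in a real setup the scorer would be a cross-encoder and the retriever a vector search. None of these names come from the Perspicacite codebase.

```python
from typing import Callable, List


def over_retrieve_and_rerank(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    score_pair: Callable[[str, str], float],
    top_k: int = 5,
    over_retrieve: int = 4,
) -> List[str]:
    """Fetch top_k * over_retrieve candidates, rescore each, keep the best top_k."""
    candidates = retrieve(query, top_k * over_retrieve)
    ranked = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return ranked[:top_k]


# Toy corpus, listed in a (deliberately imperfect) first-stage order.
corpus = [
    "aspirin reduces fever",
    "zinc and the common cold",
    "aspirin lowers heart attack risk",
    "vitamin C and colds",
]


def first_n(query: str, n: int) -> List[str]:
    """Stand-in retriever: returns the first n documents of the corpus."""
    return corpus[:n]


def word_overlap(query: str, doc: str) -> float:
    """Stand-in reranker: counts words shared by query and document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


query = "aspirin heart attack"
print(over_retrieve_and_rerank(query, first_n, word_overlap, top_k=1, over_retrieve=1))
# ['aspirin reduces fever']
print(over_retrieve_and_rerank(query, first_n, word_overlap, top_k=1, over_retrieve=4))
# ['aspirin lowers heart attack risk']
```

With over_retrieve=1 the reranker can only reorder what the first stage already returned, so the better document at position 3 is never considered; with 4× it is.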
llm:
  default_provider: "openrouter"
  default_model: "deepseek/deepseek-v4-flash" # free, fast, good quality
  # Alternatives:
  # "deepseek/deepseek-r1:free" # reasoning model, free
  # "google/gemma-3-27b-it:free" # 27B Google model, free
  # "anthropic/claude-3-5-haiku" # paid, fast
  # "anthropic/claude-opus-4-5" # paid, best quality
  providers:
    openrouter:
      base_url: "https://openrouter.ai/api/v1"
      timeout: 120

Set OPENROUTER_API_KEY in environment. Free-tier models require no payment.

llm:
  default_provider: "anthropic"
  default_model: "claude-opus-4-5"
  providers:
    anthropic:
      base_url: "https://api.anthropic.com"
      timeout: 120

Set ANTHROPIC_API_KEY in environment.

Install Ollama:
brew install ollama # macOS
# or: curl -fsSL https://ollama.com/install.sh | sh (Linux)
ollama serve # start daemon
ollama pull qwen3:14b # ~9 GB; best balance of quality and speed on M-series
# Lighter alternatives:
# ollama pull qwen3:8b # ~5 GB
# ollama pull llama3.2:3b # ~2 GB (fast, lower quality)

Config:
llm:
  default_provider: "ollama"
  default_model: "qwen3:14b"
  providers:
    ollama:
      base_url: "http://localhost:11434"
      timeout: 300 # 14B can be slow for long answers

See config_qwen3_14b.yml for a complete example.

Thinking mode (Qwen3): Qwen3 supports /think and /no_think tokens. The server
inserts these based on mode complexity. Set QWEN3_NO_THINK=1 env var to always
disable thinking (faster, lower quality for complex tasks).
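As a rough illustration of how such a switch could behave, the sketch below appends Qwen3's /think or /no_think soft switch to a prompt. The function name, the `complex_mode` flag, and the rule for when thinking is enabled are all assumptions made for this example; the server's real logic is not shown in this guide.

```python
import os
from typing import Mapping, Optional


def apply_thinking_switch(
    prompt: str,
    complex_mode: bool,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Append /think or /no_think to a prompt (hypothetical helper).

    QWEN3_NO_THINK=1 always wins; otherwise thinking is enabled only
    when the caller marks the RAG mode as complex.
    """
    env = os.environ if env is None else env
    if env.get("QWEN3_NO_THINK") == "1":
        return f"{prompt} /no_think"
    return f"{prompt} {'/think' if complex_mode else '/no_think'}"


print(apply_thinking_switch("Summarise the evidence.", True, env={}))
# Summarise the evidence. /think
print(apply_thinking_switch("Summarise the evidence.", True, env={"QWEN3_NO_THINK": "1"}))
# Summarise the evidence. /no_think
```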
# Single server, MiniLM, OpenRouter LLM
cp config.yml config_laptop.yml
# Edit config_laptop.yml: llm.default_model = "deepseek/deepseek-v4-flash"
uv run perspicacite -c config_laptop.yml serve

RAM: ~300 MB. Works on any machine with internet for LLM calls.

# Port 8005 — PubMedBERT + bge-reranker + local Qwen3
# First run: downloads ~2.6 GB of models
uv run perspicacite -c config_pubmedbert.yml serve
# With local LLM (Ollama):
# Edit config_pubmedbert.yml: llm.default_provider = "ollama", default_model = "qwen3:14b"

RAM: ~3 GB for PubMedBERT + bge-reranker alone; adding the Ollama model brings this to roughly ~8 GB (Qwen3 8B, ~5 GB) or ~11 GB (Qwen3 14B).

Run three servers on fixed ports, each serving a different embedding tier. Users choose the server URL that matches their embedding tier. Clients on the same network can all share the same server.
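One way to keep clients from picking the wrong server is a small lookup from KB name to server URL. The sketch below uses the KB names and ports that appear elsewhere in this guide; the `KB_SERVERS` mapping and `server_url_for` helper are hypothetical, and on a shared machine you would replace localhost with the server's hostname.

```python
from typing import Dict

# KB name -> URL of the server whose embedding model built that KB.
KB_SERVERS: Dict[str, str] = {
    "scifact_abstracts": "http://localhost:8000",     # MiniLM
    "scifact_openai_large": "http://localhost:8002",  # OpenAI 3-large
    "scifact_pubmedbert": "http://localhost:8005",    # PubMedBERT
}


def server_url_for(kb_name: str) -> str:
    """Return the server URL for a KB, failing loudly for unknown names."""
    try:
        return KB_SERVERS[kb_name]
    except KeyError:
        raise KeyError(f"No server registered for KB {kb_name!r}") from None


print(server_url_for("scifact_pubmedbert"))  # http://localhost:8005
```

Failing on unknown KB names is deliberate: falling back to a default server would reintroduce the model-mismatch problem this layout is meant to avoid.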
# OpenAI 3-large + bge-reranker + Claude Opus
OPENAI_API_KEY=$OPENAI_API_KEY ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
uv run perspicacite -c config_openai_large.yml serve
# Edit config_openai_large.yml:
# llm.default_provider = "anthropic"
# llm.default_model = "claude-opus-4-5"
# rag_modes.reranker_model = "BAAI/bge-reranker-v2-m3"

Perspicacite stores two independent records per KB:

| Store | Location | Used for |
|---|---|---|
| ChromaDB vectors | chroma_db/<uuid>/ | Actual embedded dimension |
| SQLite kb_metadata | data/perspicacite.db | Query model selection |

search_knowledge_base reads kb_metadata.embedding_model from SQLite to choose
which model embeds the query. If they disagree (e.g. vectors are 768-dim PubMedBERT
but SQLite says all-MiniLM-L6-v2), every search fails with a ChromaDB
dimension-mismatch error — silently returning 0 results.
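A check for this kind of mismatch can be scripted. The sketch below compares the embedding_model recorded in kb_metadata against the model you expect for each KB. It assumes only the name and embedding_model columns mentioned in this guide and runs against an in-memory database for demonstration; the real table may have more columns, and `find_model_mismatches` is a hypothetical helper, not part of Perspicacite.

```python
import sqlite3
from typing import Dict, Optional, Tuple


def find_model_mismatches(
    conn: sqlite3.Connection,
    expected: Dict[str, str],
) -> Dict[str, Tuple[Optional[str], str]]:
    """Map KB name -> (recorded model, expected model) for every disagreement."""
    rows = conn.execute("SELECT name, embedding_model FROM kb_metadata").fetchall()
    recorded = dict(rows)
    return {
        name: (recorded.get(name), model)
        for name, model in expected.items()
        if recorded.get(name) != model
    }


# Demo data: the PubMedBERT KB is wrongly recorded as MiniLM.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kb_metadata (name TEXT, embedding_model TEXT, paper_count INTEGER)"
)
conn.executemany(
    "INSERT INTO kb_metadata VALUES (?, ?, ?)",
    [
        ("scifact_abstracts", "all-MiniLM-L6-v2", 5183),
        ("scifact_pubmedbert", "all-MiniLM-L6-v2", 5183),
    ],
)

expected = {
    "scifact_abstracts": "all-MiniLM-L6-v2",
    "scifact_pubmedbert": "pritamdeka/S-PubMedBert-MS-MARCO",
}
print(find_model_mismatches(conn, expected))
# {'scifact_pubmedbert': ('all-MiniLM-L6-v2', 'pritamdeka/S-PubMedBert-MS-MARCO')}
```

To run it against a real install, open data/perspicacite.db with sqlite3.connect instead of the in-memory database. A KB missing from the table shows up with None as its recorded model.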
Verify metadata after ingest:
sqlite3 ~/git/Perspicacite-AI/data/perspicacite.db \
"SELECT name, embedding_model, paper_count FROM kb_metadata WHERE name LIKE 'scifact%';"

Fix metadata mismatch:
sqlite3 ~/git/Perspicacite-AI/data/perspicacite.db \
"UPDATE kb_metadata SET embedding_model='pritamdeka/S-PubMedBert-MS-MARCO'
WHERE name='scifact_pubmedbert';"

Always use the right server URL for each KB:
# CORRECT: PubMedBERT KB queried through PubMedBERT server
PERSPICACITE_URL=http://localhost:8005 uv run eval --corpora pubmedbert
# WRONG: PubMedBERT KB queried through MiniLM server → dim mismatch → 0 results
PERSPICACITE_URL=http://localhost:8000 uv run eval --corpora pubmedbert

After completing our domain-adaptation experiments (in progress), we plan to release fine-tuned retrieval models on HuggingFace:
- HolobiomicsLab/perspicacite-retrieval-biomedical: PubMedBERT-based model fine-tuned on biomedical claim → evidence pairs from SciFact + custom in-house data
- HolobiomicsLab/perspicacite-reranker-biomedical: cross-encoder fine-tuned on the same domain

These will be drop-in replacements in the embedding_model and reranker_model
config fields. Expected release: after SciFact fine-tuning experiments complete.

- docs/guides/ingest-bibtex.md: Ingesting BibTeX / PDF collections
- docs/guides/zotero-integration.md: Sync from Zotero library
- docs/MCP.md: MCP tool reference for programmatic access
- perspicacite-eval/docs/baseline_2026_05_24.md: Full benchmark results table