
feat(rag): decouple embedding provider from active LLM #741

Open

ShaerWare wants to merge 1 commit into main from feat/embeddings-decouple-llm

Conversation

@ShaerWare
Owner

Summary

  • Add a third tier in init_wiki_rag() that picks any enabled gemini / openai / deepseek / openrouter provider from the cloud_llm_providers table and uses its key for embeddings — only when the active LLM didn't already produce one.
  • Tighten the OpenAI-compatible branch in the active-LLM tier: only fire for provider_type ∈ {openai, deepseek, openrouter} instead of "any provider with a base_url" (previously this would have happily handed a claude_bridge base_url to the OpenAI embeddings client).
  • Order stays the same: local sentence-transformers → active-LLM cloud → DB-provider cloud → BM25-only (see the sketch after this list).
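
For reference, a minimal sketch of that selection order. The tier structure, provider types, and table name follow this PR's description; the function signature, data shapes, and example values below are illustrative, not the actual init_wiki_rag() code:

```python
from typing import Optional

OPENAI_COMPATIBLE = {"openai", "deepseek", "openrouter"}   # tightened list
EMBEDDING_CAPABLE = {"gemini"} | OPENAI_COMPATIBLE         # scanned in the DB tier

def pick_embedding_provider(
    local_model_available: bool,
    active_llm: Optional[dict],          # e.g. {"provider_type": "claude_bridge"}
    db_providers: list[dict],            # rows from cloud_llm_providers
) -> str:
    # Tier 1: local sentence-transformers model.
    if local_model_available:
        return "local"
    # Tier 2: reuse the active LLM's credentials, but only for provider types
    # known to expose an embeddings endpoint (no generic base_url fallback).
    if active_llm and active_llm["provider_type"] in EMBEDDING_CAPABLE:
        return f"active-llm:{active_llm['provider_type']}"
    # Tier 3 (new): any enabled embedding-capable row from cloud_llm_providers.
    for row in db_providers:
        if row.get("enabled") and row["provider_type"] in EMBEDDING_CAPABLE:
            return f"db:{row['name']}"
    # Tier 4: no embeddings available; retrieval is BM25-only.
    return "bm25-only"

# Example: chat stays on claude_bridge, embeddings come from gemini-default.
print(pick_embedding_provider(
    local_model_available=False,
    active_llm={"provider_type": "claude_bridge"},
    db_providers=[{"provider_type": "gemini", "name": "gemini-default", "enabled": True}],
))  # -> "db:gemini-default"
```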

Why

Server runs LLM_BACKEND=cloud:claude-bridge-… for chat. claude_bridge has no /v1/embeddings endpoint, so before this patch Wiki RAG fell through to BM25-only (or to Vector Search, which was OOM-crashing on a 6 GB VPS with the 768-dim mpnet model). Result: zero semantic search for every assistant on the server.

With this patch, chat keeps claude_bridge (Sonnet quality) while embeddings go through the existing enabled gemini-default provider via REST. No local model in memory, free tier covers our volume.
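
To illustrate what "embeddings via REST" can look like in this setup, a hedged sketch against Google's public embedContent endpoint. The model name (text-embedding-004) and request shape come from Google's API documentation, not from this repo's code, and the helper below is hypothetical:

```python
import requests

def gemini_embed(text: str, api_key: str) -> list[float]:
    # Google Generative Language API, embedContent method; key taken from the
    # enabled gemini provider row in cloud_llm_providers.
    url = (
        "https://generativelanguage.googleapis.com/v1beta/"
        "models/text-embedding-004:embedContent"
    )
    resp = requests.post(
        url,
        params={"key": api_key},
        json={"content": {"parts": [{"text": text}]}},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]["values"]  # 768-dim vector for this model
```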

Test plan

  • ruff check + ruff format clean
  • Verified locally: cloud_provider_service.get_by_type("gemini") + get_provider_with_key() return the api_key from the DB
  • Server pull → restart ai-secretary → grep the startup log for "Wiki RAG: cloud embeddings (Gemini from DB: gemini-default)" (this line only appears if VECTOR_SEARCH_URL is unset or empty)
  • Smoke-test a chat query against a collection — should return ranked semantic matches, not just BM25 hits

Notes

  • This patch is the safety net. On the server we also re-enabled Vector Search with a lighter model (paraphrase-multilingual-MiniLM-L12-v2, 384-dim, ~470 MB peak vs the previous 5+ GB). When VS is up, it still wins, since this code path is gated on not vector_search_configured. When VS is down or unconfigured, embeddings still work; that is the new behavior (a minimal sketch of the gate follows).
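
For context, the gate works roughly like this. The flag name mirrors what the note above describes and the env var comes from the test plan; the surrounding structure is assumed, not copied from the repo:

```python
import os

# Hypothetical sketch: Vector Search counts as configured when its URL is set.
vector_search_configured = bool(os.getenv("VECTOR_SEARCH_URL"))
if not vector_search_configured:
    # Only then do we walk the embedding tiers sketched in the Summary.
    ...
```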

🤖 Generated with Claude Code

Commit message

Wiki RAG used to derive its embedding provider only from the active LLM
config. When the active provider was claude_bridge (or anything without
/v1/embeddings), this fell through to BM25-only — unless Vector Search
was configured. On the server VS was OOM-crashing, so semantic search
was effectively dead across all assistants.

Add a third tier after the active-LLM lookup: scan the cloud_llm_providers
table for any enabled gemini / openai / deepseek / openrouter provider and
use its key for embeddings. This lets chat keep claude_bridge while embeddings
go through Gemini REST (no local model in memory, free tier 1500 req/min).

Also tighten the OpenAI-compatible branch to a known provider_type list —
the previous `cloud_config.get("base_url")` fallback was too permissive
and would happily hand a claude_bridge base_url to the OpenAI embeddings
client.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
