Feat/OpenAI embeddings #689
Open
jonesj38 wants to merge 13 commits into
Adds support for using OpenAI's text-embedding-3-small model as an
alternative to local llama-cpp embeddings.
Changes:
- New openai-llm.ts: OpenAI API client implementing LLM interface
- llm.ts: Embedding config management, getDefaultEmbeddingLLM()
- collections.ts: EmbeddingProviderConfig for YAML config schema
- store.ts: Use configurable embedding LLM, skip local model for
query expansion/rerank when using OpenAI
- qmd.ts: Load embedding config on startup
- package.json: Add openai dependency
- README.md: Documentation for OpenAI embeddings
Configuration (in ~/.config/qmd/index.yml):
  embedding:
    provider: openai
    openai:
      api_key: sk-...                 # Optional, falls back to OPENAI_API_KEY env
      model: text-embedding-3-small   # Optional, this is the default
Benefits:
- Much faster embedding (~10x vs local models on CPU)
- No GPU/VRAM requirements
- More reliable (no local model loading issues)
- Cost: ~$0.02 per 1M tokens
- OpenAI embeddings (text-embedding-3-small, 1536d) via QMD_OPENAI=1
- Query expansion with gpt-4o-mini (~200ms vs 30s local)
- Tiktoken for fast tokenization (no model loading)
- Exponential backoff with jitter for rate limits (429)
- Inter-batch delay (150ms) to avoid hitting RPM limits
- Performance: search 3-5s (was 30-60s), embed ~10min (was 2hrs)
Files: openai-llm.ts, llm.ts, store.ts, qmd.ts
Deps: openai, tiktoken
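The retry and pacing behaviour described in this commit can be sketched roughly as follows. This is an illustration, not the code in openai-llm.ts: the helper names (backoffDelayMs, withRateLimitRetry, embedInBatches), the 500 ms base, the 30 s cap, and the retry count are all assumptions. Only the 429 trigger and the 150 ms inter-batch delay come from the commit message. It also assumes the thrown error exposes a numeric `status` field.

```typescript
// Hypothetical helpers; names and default values are assumptions.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// "Full jitter": a random delay in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 500,
  capMs = 30000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Retries `fn` only when the thrown error carries HTTP status 429.
async function withRateLimitRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = (err as { status?: number } | undefined)?.status;
      if (status !== 429 || attempt >= maxRetries) throw err;
      await wait(backoffDelayMs(attempt));
    }
  }
}

// Runs batches one after another with a fixed pause between them.
async function embedInBatches<T, R>(
  batches: T[][],
  embedBatch: (batch: T[]) => Promise<R[]>,
  interBatchDelayMs = 150,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<R[]> {
  const out: R[] = [];
  let first = true;
  for (const batch of batches) {
    if (!first) await wait(interBatchDelayMs);
    first = false;
    const vectors = await withRateLimitRetry(() => embedBatch(batch), 5, wait);
    for (const v of vectors) out.push(v);
  }
  return out;
}
```

Injecting `wait` and `random` keeps the sketch testable without real timers. A real client might also honour a Retry-After header when the server sends one.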
Replace the rerank() stub with a real listwise reranker using gpt-4o-mini.
- Sends top candidates with query to gpt-4o-mini as a ranking task
- Parses comma-separated index output, handles missing/duplicate indices
- Skips API call for ≤2 documents (not worth the latency)
- Falls back to original order on API failure
- Cost: ~$0.001 per rerank call
- Updated qmd.ts to route through OpenAI reranker instead of skipping
The full qmd query pipeline with OpenAI now:
1. Query expansion (gpt-4o-mini)
2. BM25 + vector search (parallel)
3. RRF fusion
4. Listwise reranking (gpt-4o-mini) ← NEW
5. Position-aware blending
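To make the index handling concrete, here is a minimal sketch of how a comma-separated model reply might be turned into a safe ordering. The names parseRanking, rerankListwise, and askModel are hypothetical. The sketch assumes the prompt asks for 0-based indices and that omitted documents are appended in their original order; the PR's actual rules may differ.

```typescript
// Hypothetical helper: converts a reply such as "2, 0, 2, 7" into a full
// permutation of [0, n). Assumes 0-based indices in the model output.
function parseRanking(reply: string, n: number): number[] {
  const seen = new Set<number>();
  const order: number[] = [];
  for (const token of reply.split(",")) {
    const trimmed = token.trim();
    if (!/^\d+$/.test(trimmed)) continue;     // ignore non-numeric tokens
    const idx = Number(trimmed);
    if (idx >= n || seen.has(idx)) continue;  // out of range or duplicate
    seen.add(idx);
    order.push(idx);
  }
  // Indices the model omitted are appended in their original relative order.
  for (let i = 0; i < n; i++) {
    if (!seen.has(i)) order.push(i);
  }
  return order;
}

// `askModel` stands in for the chat-completion call; it is not a real API.
async function rerankListwise<T>(
  query: string,
  docs: T[],
  askModel: (query: string, docs: T[]) => Promise<string>,
): Promise<T[]> {
  if (docs.length <= 2) return docs;          // not worth an API round trip
  try {
    const reply = await askModel(query, docs);
    return parseRanking(reply, docs.length).map((i) => docs[i]);
  } catch {
    return docs;                              // API failure: keep original order
  }
}
```

Because parseRanking always returns every index exactly once, a malformed reply degrades to the original order and never drops a document.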
Accept comma-separated collection names in -c flag for cross-collection search. All three search modes (search, vsearch, query) now support querying multiple collections simultaneously.
Changes:
- resolveCollectionFilter() helper parses and validates comma-separated names
- searchFTS() accepts string | string[] for collection filtering
- searchVec() accepts string | string[] for collection filtering
- SQL uses IN clause for multi-collection filtering
- Updated interface types and test for new parameter types
Usage:
  qmd search 'auth' -c repo-a,repo-b
  qmd vsearch 'auth patterns' -c docs,examples
  qmd query 'OAuth implementation' -c project,patterns,docs
This enables Shad's multi-vault search to pass all vault collections in a single qmd call instead of running separate searches per collection.
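A rough sketch of the parse-then-filter idea follows. The PR's resolveCollectionFilter() is not shown in this conversation, so the function names, signatures, error message, and the empty-filter behaviour below are assumptions made for illustration.

```typescript
// Hypothetical: splits "repo-a, repo-b" into unique names and checks each one
// against the list of configured collections.
function parseCollectionFlag(flag: string, known: readonly string[]): string[] {
  const names = Array.from(
    new Set(flag.split(",").map((s) => s.trim()).filter((s) => s.length > 0)),
  );
  const unknown = names.filter((name) => known.indexOf(name) === -1);
  if (unknown.length > 0) {
    throw new Error(`Unknown collection(s): ${unknown.join(", ")}`);
  }
  return names;
}

// Builds a parameterized IN clause so names are bound, never interpolated.
function collectionClause(
  column: string,
  filter: string | string[],
): { sql: string; params: string[] } {
  const names = Array.isArray(filter) ? filter : [filter];
  // Assumption: an empty filter means "no restriction".
  if (names.length === 0) return { sql: "1 = 1", params: [] };
  const placeholders = names.map(() => "?").join(", ");
  return { sql: `${column} IN (${placeholders})`, params: names };
}
```

For example, collectionClause("collection", ["docs", "examples"]) yields the fragment `collection IN (?, ?)` with the two names as bound parameters. Accepting `string | string[]` lets existing single-collection callers keep working.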
Add support for separate OpenAI-compatible servers for embeddings vs chat (expansion/reranking). Common in setups where local GPU serves embeddings and cloud handles chat. Implements Kaspre's split-URL pattern from PR tobi#116 discussion.
- Add chat_base_url and chat_api_key to YAML config and OpenAIConfig
- Add QMD_OPENAI_* env var prefix (QMD_OPENAI_BASE_URL, QMD_OPENAI_API_KEY, QMD_OPENAI_CHAT_BASE_URL, QMD_OPENAI_CHAT_API_KEY) per alexleach's suggestion
- Wire expansion_model and base_url through YAML config per viniciushsantana's feedback
- Route expandQuery() and rerank() through chatClient, embed()/embedBatch() through embedding client
- Fix upstream rebase issues (Database.transaction type, collectionName rename)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ifferent models can be used for each. Thanks to @Kaspre for their comment.
  embedding:
    provider: openai
    openai:
      api_key: "sk-..."
      base_url: "http://localhost:8081/v1"          # embeddings
      model: "nomic-embed-text"
      chat_base_url: "https://ollama.com/v1"        # expansion (falls back to base_url)
      chat_api_key: "..."                           # (falls back to api_key)
      expansion_model: "gemma3:4b"
      rerank_base_url: "https://api.cohere.com/v1"  # reranking (falls back to chat_base_url)
      rerank_api_key: "..."                         # (falls back to chat_api_key)
      rerank_model: "rerank-v3"                     # (falls back to expansion_model)
Also rebased onto main.
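The fallback chain in the config comments (rerank falls back to chat, chat falls back to embeddings) could be resolved along these lines. The field names mirror the YAML keys above. The interface and function names are hypothetical, and built-in model defaults are deliberately left out.

```typescript
// Mirrors the YAML keys shown above; every field is optional.
interface OpenAIProviderConfig {
  api_key?: string;
  base_url?: string;
  model?: string;
  chat_base_url?: string;
  chat_api_key?: string;
  expansion_model?: string;
  rerank_base_url?: string;
  rerank_api_key?: string;
  rerank_model?: string;
}

interface Endpoint {
  baseURL?: string;
  apiKey?: string;
  model?: string;
}

// Hypothetical resolver: each tier inherits whatever it does not override.
function resolveEndpoints(cfg: OpenAIProviderConfig): {
  embedding: Endpoint;
  chat: Endpoint;
  rerank: Endpoint;
} {
  const embedding: Endpoint = {
    baseURL: cfg.base_url,
    apiKey: cfg.api_key,
    model: cfg.model,
  };
  const chat: Endpoint = {
    baseURL: cfg.chat_base_url ?? embedding.baseURL,
    apiKey: cfg.chat_api_key ?? embedding.apiKey,
    model: cfg.expansion_model,
  };
  const rerank: Endpoint = {
    baseURL: cfg.rerank_base_url ?? chat.baseURL,
    apiKey: cfg.rerank_api_key ?? chat.apiKey,
    model: cfg.rerank_model ?? chat.model,
  };
  return { embedding, chat, rerank };
}
```

Resolving the three endpoints once at startup keeps the fallback rules in one place, so the embedding, expansion, and reranking clients do not each re-implement them.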
…ename, embed fix
Changes based on PR comments:
1. Configurable base_url for OpenAI-compatible APIs (Ollama, vLLM, Azure)
   - collections.ts: EmbeddingProviderConfig already has base_url field
   - qmd.ts: now passes base_url and expansion_model from YAML to setEmbeddingConfig
   - openai-llm.ts: constructor accepts baseURL config
2. Env var rename: QMD_OPENAI_API_KEY takes priority over OPENAI_API_KEY
   - Avoids conflict with official openai-node SDK (per @alexleach)
   - Falls back to OPENAI_API_KEY for backwards compatibility
3. generateEmbeddings bypasses LlamaCpp when using OpenAI (per @viniciushsantana)
   - OpenAI path calls API directly, no local model session needed
   - Refactored to shared runEmbedding() with pluggable embed/embedBatch fns
4. expandQuery now actually calls OpenAI for query expansion
   - Was previously returning lex-only fallback when isUsingOpenAI()
   - Now uses gpt-4o-mini via openaiLLM.expandQuery()
5. README updated with base_url, expansion_model docs
Addresses: @alexleach (env naming, base_url), @viniciushsantana (embed fix, expansion_model, base_url YAML wiring)
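The key-resolution order from item 2 might look like the sketch below. The commit only states that QMD_OPENAI_API_KEY wins over OPENAI_API_KEY. Putting the YAML api_key first is an assumption based on the earlier "falls back to OPENAI_API_KEY env" config comment, and resolveApiKey is a hypothetical name.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical: the first non-empty value wins.
// Assumed order: YAML api_key, then QMD_OPENAI_API_KEY, then OPENAI_API_KEY.
function resolveApiKey(yamlKey: string | undefined, env: Env): string | undefined {
  const candidates = [yamlKey, env["QMD_OPENAI_API_KEY"], env["OPENAI_API_KEY"]];
  for (const candidate of candidates) {
    if (candidate !== undefined && candidate.trim() !== "") return candidate;
  }
  return undefined;
}
```

In a CLI this would typically be called as `resolveApiKey(config.api_key, process.env)`. Taking the environment as a parameter keeps the function easy to test.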
…ments
- Lazy-load node-llama-cpp to skip native compilation in OpenAI mode
- Add tiktoken-based input truncation (QMD_OPENAI_MAX_INPUT_TOKENS)
- QMD_OPENAI_BASE_URL auto-activates OpenAI mode (no QMD_OPENAI=1 needed)
- Skip LlamaCpp init in qmd status when using OpenAI
- Restore terminal cursor on embed error (try/finally)
- Bypass withLLMSession in vectorSearch/querySearch for OpenAI mode
Co-authored-by: ALB.Leach <alexleach@users.noreply.github.com>
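The cursor fix relies on the usual try/finally pattern. The sketch below uses the standard ANSI sequences for hiding and showing the cursor. withHiddenCursor is a hypothetical wrapper, not a function from qmd.ts.

```typescript
const HIDE_CURSOR = "\x1b[?25l"; // ANSI: hide cursor
const SHOW_CURSOR = "\x1b[?25h"; // ANSI: show cursor

// Hypothetical wrapper: the cursor is restored whether `work` resolves or throws.
async function withHiddenCursor<T>(
  write: (s: string) => void,
  work: () => Promise<T>,
): Promise<T> {
  write(HIDE_CURSOR);
  try {
    return await work();
  } finally {
    write(SHOW_CURSOR);
  }
}
```

A caller might use `withHiddenCursor((s) => process.stderr.write(s), () => runEmbed())`, where runEmbed is a placeholder for the embedding loop. The error still propagates to the caller; only the terminal state is cleaned up.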
Re-opening of previous pull request #116 and a solution to issue #620
Summary
Optional OpenAI integration for embeddings, query expansion, and reranking. It is dramatically faster for users who prefer API-based inference over local models (see Performance).
Performance
Operation | Local (llama-cpp) | OpenAI
-- | -- | --
Query expansion | 30-40s | 200ms
Full re-embed (30k chunks) | ~2 hours | ~10 min
Tokenizer load | 30s | 0s
Search latency | 30-60s | 3-5s
Reranking (30 docs) | 10-15s | 1-2s
Features
• OpenAI Embeddings — text-embedding-3-small (1536 dims), native batch API, ~$0.02/1M tokens
• OpenAI Query Expansion — gpt-4o-mini for lex/vec/hyde variants
• OpenAI Reranking — API-based reranking replaces local qwen3-reranker, eliminating model download and GGUF inference overhead
• Tiktoken chunking — eliminates model load time for tokenization
• Robust retry logic — exponential backoff with jitter for rate limits
Usage
  export OPENAI_API_KEY="sk-..."
  export QMD_OPENAI=1
  qmd embed -f        # Re-embed with OpenAI
  qmd search "query"
Design
• Opt-in — local models remain the default
• Graceful fallback — errors don't crash, just skip
• Replace local reranking with OpenAI — no GGUF model download or local inference needed
• No breaking changes — existing workflows unchanged
Files Changed
• src/openai-llm.ts — new OpenAI LLM implementation
• src/llm.ts — embedding config, provider switching
• src/store.ts — tiktoken chunking integration
• src/qmd.ts — QMD_OPENAI env var support
Dependencies
• openai — API client
• tiktoken — fast BPE tokenization