
Feat/OpenAI embeddings #689

Open
jonesj38 wants to merge 13 commits into tobi:main from jonesj38:feat/openai-embeddings

jonesj38 commented May 28, 2026

Reopens the previous pull request #116 and addresses issue #620.

Summary

Adds an optional OpenAI integration for embeddings, query expansion, and reranking. For users who prefer API-based inference over local models, it is substantially faster, as the measurements under Performance show.

Performance

| Operation | Local (llama-cpp) | OpenAI |
| --- | --- | --- |
| Query expansion | 30-40s | 200ms |
| Full re-embed (30k chunks) | ~2 hours | ~10 min |
| Tokenizer load | 30s | 0s |
| Search latency | 30-60s | 3-5s |
| Reranking (30 docs) | 10-15s | 1-2s |

Features

• OpenAI Embeddings — text-embedding-3-small (1536 dims), native batch API, ~$0.02/1M tokens
• OpenAI Query Expansion — gpt-4o-mini for lex/vec/hyde variants
• OpenAI Reranking — API-based reranking replaces local qwen3-reranker, eliminating model download and GGUF inference overhead
• Tiktoken chunking — eliminates model load time for tokenization
• Robust retry logic — exponential backoff with jitter for rate limits
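
The retry behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `backoffDelayMs` and `withRetry`, the 500 ms base delay, the 30 s cap, and the "full jitter" strategy are all assumptions made for the example.

```typescript
// Hypothetical helpers; the PR's actual retry code is not shown on this page.
type RandomFn = () => number;

/** Full-jitter delay: a random value in [0, min(maxMs, baseMs * 2^attempt)). */
function backoffDelayMs(
  attempt: number,
  baseMs = 500,
  maxMs = 30_000,
  random: RandomFn = Math.random,
): number {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

/** Retries `fn` only when it fails with HTTP status 429 (rate limited). */
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = (err as { status?: number }).status;
      // Give up on non-rate-limit errors or once attempts are exhausted.
      if (status !== 429 || attempt + 1 >= maxAttempts) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Injecting the random source keeps the delay calculation deterministic in tests, while the jitter spreads retries from concurrent callers so they do not hit the rate limit in lockstep.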
Usage

  export OPENAI_API_KEY="sk-..."
  export QMD_OPENAI=1
  qmd embed -f        # Re-embed with OpenAI
  qmd search "query"

Design

• Opt-in — local models remain the default
• Graceful fallback — errors don't crash, just skip
• Replace local reranking with OpenAI — no GGUF model download or local inference needed
• No breaking changes — existing workflows unchanged
Files Changed

• src/openai-llm.ts — new OpenAI LLM implementation
• src/llm.ts — embedding config, provider switching
• src/store.ts — tiktoken chunking integration
• src/qmd.ts — QMD_OPENAI env var support
Dependencies

• openai — API client
• tiktoken — fast BPE tokenization

jonesj38 and others added 13 commits April 11, 2026 19:23
Adds support for using OpenAI's text-embedding-3-small model as an
alternative to local llama-cpp embeddings.

Changes:
- New openai-llm.ts: OpenAI API client implementing LLM interface
- llm.ts: Embedding config management, getDefaultEmbeddingLLM()
- collections.ts: EmbeddingProviderConfig for YAML config schema
- store.ts: Use configurable embedding LLM, skip local model for
  query expansion/rerank when using OpenAI
- qmd.ts: Load embedding config on startup
- package.json: Add openai dependency
- README.md: Documentation for OpenAI embeddings

Configuration (in ~/.config/qmd/index.yml):
  embedding:
    provider: openai
    openai:
      api_key: sk-...  # Optional, falls back to OPENAI_API_KEY env
      model: text-embedding-3-small  # Optional, this is the default
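
One way the configuration above could be resolved at startup is sketched here. The interface and function names are hypothetical (the PR's real schema is `EmbeddingProviderConfig` in collections.ts, whose exact fields are not shown); only the two fallbacks, `OPENAI_API_KEY` for the key and `text-embedding-3-small` for the model, come from the commit message.

```typescript
// Hypothetical types mirroring the YAML keys shown in the commit message.
interface OpenAIYaml {
  api_key?: string;
  model?: string;
}
interface EmbeddingYaml {
  provider?: string;
  openai?: OpenAIYaml;
}
type ResolvedEmbedding =
  | { provider: "local" }
  | { provider: "openai"; apiKey: string; model: string };

function resolveEmbedding(
  cfg: EmbeddingYaml | undefined,
  env: Record<string, string | undefined>,
): ResolvedEmbedding {
  // Local models stay the default unless the provider is set explicitly.
  if (cfg?.provider !== "openai") return { provider: "local" };

  // The YAML key wins; otherwise fall back to the environment variable.
  const apiKey = cfg.openai?.api_key ?? env["OPENAI_API_KEY"];
  if (!apiKey) {
    throw new Error("OpenAI provider selected but no API key was found");
  }
  return {
    provider: "openai",
    apiKey,
    model: cfg.openai?.model ?? "text-embedding-3-small",
  };
}
```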

Benefits:
- Much faster embedding (~10x vs local models on CPU)
- No GPU/VRAM requirements
- More reliable (no local model loading issues)
- Cost: ~$0.02 per 1M tokens

- OpenAI embeddings (text-embedding-3-small, 1536d) via QMD_OPENAI=1
- Query expansion with gpt-4o-mini (~200ms vs 30s local)
- Tiktoken for fast tokenization (no model loading)
- Exponential backoff with jitter for rate limits (429)
- Inter-batch delay (150ms) to avoid hitting RPM limits
- Performance: search 3-5s (was 30-60s), embed ~10min (was 2hrs)

Files: openai-llm.ts, llm.ts, store.ts, qmd.ts
Deps: openai, tiktoken
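
The batching with an inter-batch delay described above might look like the following sketch. The function names and the batch size of 100 are assumptions; only the 150 ms pause is taken from the commit message. The embedding call is injected so the example does not depend on the `openai` client.

```typescript
// Hypothetical signature for whatever performs the actual API request.
type EmbedBatchFn = (texts: string[]) => Promise<number[][]>;

/** Splits `items` into consecutive chunks of at most `size` elements. */
function toBatches<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("batch size must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const pause = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

/** Embeds all texts batch by batch, pausing between batches (not after the last). */
async function embedAll(
  texts: string[],
  embedBatch: EmbedBatchFn,
  batchSize = 100, // assumed value
  delayMs = 150, // inter-batch delay named in the commit message
): Promise<number[][]> {
  const vectors: number[][] = [];
  let first = true;
  for (const batch of toBatches(texts, batchSize)) {
    if (!first) await pause(delayMs);
    first = false;
    vectors.push(...(await embedBatch(batch)));
  }
  return vectors;
}
```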
Replace the rerank() stub with a real listwise reranker using gpt-4o-mini.

- Sends top candidates with query to gpt-4o-mini as a ranking task
- Parses comma-separated index output, handles missing/duplicate indices
- Skips API call for ≤2 documents (not worth the latency)
- Falls back to original order on API failure
- Cost: ~$0.001 per rerank call
- Updated qmd.ts to route through OpenAI reranker instead of skipping
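
The parsing and fallback rules in the list above can be illustrated with a short sketch. It assumes the model replies with 0-based, comma-separated indices; the function names and the injected `askModel` callback are hypothetical and stand in for the gpt-4o-mini call.

```typescript
/**
 * Turns model output such as "2, 0, 2, 7" into a full permutation of
 * 0..count-1: invalid or duplicate indices are dropped, and any index the
 * model omitted is appended in its original order.
 */
function parseRanking(output: string, count: number): number[] {
  const seen = new Set<number>();
  const order: number[] = [];
  for (const token of output.split(",")) {
    const trimmed = token.trim();
    if (!/^\d+$/.test(trimmed)) continue; // not a plain non-negative integer
    const index = Number(trimmed);
    if (index >= count || seen.has(index)) continue; // out of range or duplicate
    seen.add(index);
    order.push(index);
  }
  for (let i = 0; i < count; i++) {
    if (!seen.has(i)) order.push(i);
  }
  return order;
}

/** Listwise rerank with the short-circuit and failure fallback described above. */
async function rerankDocs<T>(
  docs: T[],
  askModel: (docs: T[]) => Promise<string>,
): Promise<T[]> {
  if (docs.length <= 2) return docs; // not worth an API round trip
  try {
    const output = await askModel(docs);
    return parseRanking(output, docs.length).map((i) => docs[i] as T);
  } catch {
    return docs; // keep the original order on API failure
  }
}
```

For instance, `parseRanking("2, 0, 2, 7", 4)` yields `[2, 0, 1, 3]`: the duplicate `2` and the out-of-range `7` are ignored, and the unmentioned indices `1` and `3` keep their original relative order.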

The full qmd query pipeline with OpenAI now:
1. Query expansion (gpt-4o-mini)
2. BM25 + vector search (parallel)
3. RRF fusion
4. Listwise reranking (gpt-4o-mini) ← NEW
5. Position-aware blending
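
For readers unfamiliar with step 3, reciprocal rank fusion (RRF) merges several ranked lists by summing `1 / (k + rank)` for each document. The sketch below is a generic textbook version, not qmd's implementation: the function name, the unweighted sum, and the conventional constant `k = 60` are assumptions.

```typescript
/** Fuses ranked lists of document ids; higher fused score ranks first. */
function rrfFuse(
  rankings: string[][],
  k = 60, // conventional default, assumed here
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Example: a BM25 list and a vector-search list over the same three documents.
const fused = rrfFuse([
  ["a", "b", "c"],
  ["b", "c", "a"],
]);
console.log(fused.map((entry) => entry.id).join(","));
```

A document that appears near the top of both lists accumulates the largest score, which is why `b` outranks `a` in the example even though `a` is first in one list.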
Accept comma-separated collection names in -c flag for cross-collection
search. All three search modes (search, vsearch, query) now support
querying multiple collections simultaneously.

Changes:
- resolveCollectionFilter() helper parses and validates comma-separated names
- searchFTS() accepts string | string[] for collection filtering
- searchVec() accepts string | string[] for collection filtering
- SQL uses IN clause for multi-collection filtering
- Updated interface types and test for new parameter types
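
A minimal sketch of the comma-separated parsing and the `IN` clause described above follows. The function names and the `collection` column name are hypothetical, and unlike `resolveCollectionFilter()` this sketch does not validate names against the collections that actually exist.

```typescript
/** Accepts "a,b" or ["a", "b"]; trims names, drops empties and duplicates. */
function parseCollectionNames(arg: string | string[]): string[] {
  const raw = Array.isArray(arg) ? arg : arg.split(",");
  const names = raw.map((name) => name.trim()).filter((name) => name.length > 0);
  return Array.from(new Set(names));
}

/** Builds a parameterised filter so names are never interpolated into SQL. */
function collectionClause(
  names: string[],
  column = "collection", // assumed column name
): { sql: string; params: string[] } {
  if (names.length === 0) return { sql: "", params: [] };
  const placeholders = names.map(() => "?").join(", ");
  return { sql: `${column} IN (${placeholders})`, params: names };
}
```

Using one `?` placeholder per name keeps the query safe with any collection name and lets a single-collection filter go through the same code path as a multi-collection one.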

Usage:
  qmd search 'auth' -c repo-a,repo-b
  qmd vsearch 'auth patterns' -c docs,examples
  qmd query 'OAuth implementation' -c project,patterns,docs

This enables Shad's multi-vault search to pass all vault collections
in a single qmd call instead of running separate searches per collection.

Add support for separate OpenAI-compatible servers for embeddings vs
chat (expansion/reranking). Common in setups where local GPU serves
embeddings and cloud handles chat. Implements Kaspre's split-URL
pattern from PR tobi#116 discussion.

- Add chat_base_url and chat_api_key to YAML config and OpenAIConfig
- Add QMD_OPENAI_* env var prefix (QMD_OPENAI_BASE_URL,
  QMD_OPENAI_API_KEY, QMD_OPENAI_CHAT_BASE_URL,
  QMD_OPENAI_CHAT_API_KEY) per alexleach's suggestion
- Wire expansion_model and base_url through YAML config
  per viniciushsantana's feedback
- Route expandQuery() and rerank() through chatClient,
  embed()/embedBatch() through embedding client
- Fix upstream rebase issues (Database.transaction type, collectionName
  rename)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ifferent models can be used to each. Thanks to @Kaspre for their comment

embedding:
  provider: openai
  openai:
    api_key: "sk-..."
    base_url: "http://localhost:8081/v1"          # embeddings
    model: "nomic-embed-text"

    chat_base_url: "https://ollama.com/v1"        # expansion (falls back to base_url)
    chat_api_key: "..."                           # (falls back to api_key)
    expansion_model: "gemma3:4b"

    rerank_base_url: "https://api.cohere.com/v1"  # reranking (falls back to chat_base_url)
    rerank_api_key: "..."                         # (falls back to chat_api_key)
    rerank_model: "rerank-v3"                     # (falls back to expansion_model)

also rebased onto main
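
The fallback chain annotated in the config comments above (rerank falls back to chat, chat falls back to the embedding endpoint) could be resolved along these lines. The field names mirror the YAML keys shown; the `Endpoint` type and `resolveEndpoints` function are hypothetical.

```typescript
// Fields mirror the YAML keys in the example config; all are optional.
interface OpenAISettings {
  api_key?: string;
  base_url?: string;
  model?: string;
  chat_base_url?: string;
  chat_api_key?: string;
  expansion_model?: string;
  rerank_base_url?: string;
  rerank_api_key?: string;
  rerank_model?: string;
}

interface Endpoint {
  baseUrl: string | undefined;
  apiKey: string | undefined;
  model: string | undefined;
}

function resolveEndpoints(cfg: OpenAISettings): {
  embedding: Endpoint;
  chat: Endpoint;
  rerank: Endpoint;
} {
  const embedding: Endpoint = {
    baseUrl: cfg.base_url,
    apiKey: cfg.api_key,
    model: cfg.model,
  };
  // Chat (query expansion) falls back to the embedding endpoint and key.
  const chat: Endpoint = {
    baseUrl: cfg.chat_base_url ?? embedding.baseUrl,
    apiKey: cfg.chat_api_key ?? embedding.apiKey,
    model: cfg.expansion_model,
  };
  // Reranking falls back to the chat endpoint, key, and expansion model.
  const rerank: Endpoint = {
    baseUrl: cfg.rerank_base_url ?? chat.baseUrl,
    apiKey: cfg.rerank_api_key ?? chat.apiKey,
    model: cfg.rerank_model ?? chat.model,
  };
  return { embedding, chat, rerank };
}
```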
…ename, embed fix

Changes based on PR comments:

1. Configurable base_url for OpenAI-compatible APIs (Ollama, vLLM, Azure)
   - collections.ts: EmbeddingProviderConfig already has base_url field
   - qmd.ts: now passes base_url and expansion_model from YAML to setEmbeddingConfig
   - openai-llm.ts: constructor accepts baseURL config

2. Env var rename: QMD_OPENAI_API_KEY takes priority over OPENAI_API_KEY
   - Avoids conflict with official openai-node SDK (per @alexleach)
   - Falls back to OPENAI_API_KEY for backwards compatibility

3. generateEmbeddings bypasses LlamaCpp when using OpenAI (per @viniciushsantana)
   - OpenAI path calls API directly, no local model session needed
   - Refactored to shared runEmbedding() with pluggable embed/embedBatch fns

4. expandQuery now actually calls OpenAI for query expansion
   - Was previously returning lex-only fallback when isUsingOpenAI()
   - Now uses gpt-4o-mini via openaiLLM.expandQuery()

5. README updated with base_url, expansion_model docs

Addresses: @alexleach (env naming, base_url), @viniciushsantana (embed fix,
expansion_model, base_url YAML wiring)
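
The key precedence from item 2 can be expressed in a couple of lines. The function name is hypothetical, and treating an empty string as "unset" is an assumption of this sketch rather than something the commit message states.

```typescript
type Env = Record<string, string | undefined>;

/** QMD_OPENAI_API_KEY takes priority; OPENAI_API_KEY is the fallback. */
function resolveApiKeyFromEnv(env: Env): string | undefined {
  // `||` (not `??`) so that an empty variable is treated as unset.
  return env["QMD_OPENAI_API_KEY"] || env["OPENAI_API_KEY"] || undefined;
}
```

In a real process this would be called as `resolveApiKeyFromEnv(process.env)`.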
…ments

- Lazy-load node-llama-cpp to skip native compilation in OpenAI mode
- Add tiktoken-based input truncation (QMD_OPENAI_MAX_INPUT_TOKENS)
- QMD_OPENAI_BASE_URL auto-activates OpenAI mode (no QMD_OPENAI=1 needed)
- Skip LlamaCpp init in qmd status when using OpenAI
- Restore terminal cursor on embed error (try/finally)
- Bypass withLLMSession in vectorSearch/querySearch for OpenAI mode

Co-authored-by: ALB.Leach <alexleach@users.noreply.github.com>
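
Two of the changes in this last commit are small enough to sketch: activating OpenAI mode from the environment, and restoring the terminal cursor with try/finally. Function names are hypothetical; the escape sequences are the standard ANSI codes for hiding and showing the cursor, and the writer is injected so the sketch does not depend on a real terminal.

```typescript
type Env = Record<string, string | undefined>;

/** OpenAI mode is on with QMD_OPENAI=1 or whenever QMD_OPENAI_BASE_URL is set. */
function isOpenAIMode(env: Env): boolean {
  return env["QMD_OPENAI"] === "1" || Boolean(env["QMD_OPENAI_BASE_URL"]);
}

const HIDE_CURSOR = "\x1b[?25l";
const SHOW_CURSOR = "\x1b[?25h";

/** Runs `task` with the cursor hidden and always shows it again, even on error. */
async function withHiddenCursor<T>(
  write: (text: string) => void,
  task: () => Promise<T>,
): Promise<T> {
  write(HIDE_CURSOR);
  try {
    return await task();
  } finally {
    write(SHOW_CURSOR);
  }
}
```

A caller would pass something like `(text) => process.stdout.write(text)` as the writer and the embedding run as the task.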
@socket-security

Review the following changes in direct dependencies.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Added | npm/tiktoken@1.0.22 | 100 | 100 | 100 | 81 | 100 |
| Added | npm/openai@4.104.0 | 84 | 100 | 100 | 100 | 100 |

Shrub24 commented May 29, 2026

How does this compare to #446 and #619?
