feat(llm): switch to llama.cpp backend, fix embedding params
Replace candle with llama-cpp-2 for all ML inference. This enables Metal GPU
acceleration (88 files in 70 s vs. 37+ min on CPU).
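
A minimal sketch of the new loading path, assuming the llama-cpp-2 crate is
built with its `metal` feature; the helper name and the layer count are
illustrative, not this repo's actual code:

```rust
use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::model::{params::LlamaModelParams, LlamaModel};

// Hypothetical helper: load a GGUF model with its layers offloaded to the GPU.
fn load_model(path: &str) -> anyhow::Result<(LlamaBackend, LlamaModel)> {
    let backend = LlamaBackend::init()?;
    // Any value above the model's layer count offloads everything; with the
    // `metal` feature the layers run on the Apple GPU instead of the CPU path
    // that candle's quantized ops were pinned to.
    let params = LlamaModelParams::default().with_n_gpu_layers(1000);
    let model = LlamaModel::load_from_file(&backend, path, &params)?;
    Ok((backend, model))
}
```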
Fixes (tied together in the sketch below):
- use encode(), not decode(), for embeddings
- set n_ubatch >= n_tokens, since embedding models must fit the whole prompt
  into a single micro-batch
- tokenize with AddBos::Never, because PromptFormat already adds <bos>
- force the CPU device for quantized ops on the candle fallback path, since
  candle's Metal backend does not support them
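
A minimal embedding sketch against llama-cpp-2 showing the first three fixes.
It reuses the hypothetical `load_model` output above; builder names such as
`with_n_ubatch` may differ slightly across crate versions:

```rust
use llama_cpp_2::context::params::LlamaContextParams;
use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::llama_batch::LlamaBatch;
use llama_cpp_2::model::{AddBos, LlamaModel};

fn embed(backend: &LlamaBackend, model: &LlamaModel, prompt: &str) -> anyhow::Result<Vec<f32>> {
    // PromptFormat already prepends <bos>, so the tokenizer must not add a second one.
    let tokens = model.str_to_token(prompt, AddBos::Never)?;
    let n_tokens = tokens.len() as u32;

    // Embedding models process the whole prompt in one micro-batch, so the
    // context's n_batch and n_ubatch must both be at least n_tokens.
    let ctx_params = LlamaContextParams::default()
        .with_embeddings(true)
        .with_n_batch(n_tokens)
        .with_n_ubatch(n_tokens);
    let mut ctx = model.new_context(backend, ctx_params)?;

    let mut batch = LlamaBatch::new(tokens.len(), 1);
    batch.add_sequence(&tokens, 0, false)?;
    // encode() is the entry point for non-causal embedding models;
    // decode() drives causal generation.
    ctx.encode(&mut batch)?;

    Ok(ctx.embeddings_seq_ith(0)?.to_vec())
}
```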
Keep the BERT GGUF support code as a fallback. Default model: embeddinggemma-300M.