Commit ebb814b
refactor(llm): replace candle backend with llama-cpp-2 for Metal GPU support
candle lacks Metal kernels for quantized GGUF models (rms-norm, QMatMul).
llama.cpp has mature Metal support and auto-detects the GPU at build time.
- Replace candle-core/candle-nn/candle-transformers with llama-cpp-2 (see the Cargo.toml sketch below)
- CandleEmbed -> LlamaEmbed, CandleOrchestrator -> LlamaOrchestrator,
CandleRerank -> LlamaRerank
- Remove select_device(), CandleQMatMul, EmbedLayer, BertLayer,
EmbedModelVariant (llama.cpp handles all model loading internally)
- Remove metal/accelerate/cuda feature flags (llama.cpp handles GPU
detection at CMake build time)
- LlamaContext is !Send, so contexts are created per call from the
  stored LlamaModel (which is Send+Sync); see the Rust sketch after this list
- Public API unchanged: traits, MockLlm, download infra, FlexTokenizer,
PromptFormat, heuristic_orchestrate all preserved
- 270 tests pass (net -1: removed select_device test)
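For illustration, the dependency swap might look like the sketch below. The version numbers are placeholders, not the ones pinned by this commit; the point is that the GPU feature flags disappear because llama.cpp's CMake build detects the backend itself.

```toml
# Illustrative Cargo.toml sketch -- versions are placeholders.
[dependencies]
# candle-core = "0.8"          # removed
# candle-nn = "0.8"            # removed
# candle-transformers = "0.8"  # removed
llama-cpp-2 = "0.1"  # builds llama.cpp via CMake, which detects
                     # Metal/CUDA at build time (no feature flags needed)
```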
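A minimal sketch of the per-call context pattern from the list above. The types here are self-contained stand-ins for the llama-cpp-2 ones, not the crate's real API: a raw-pointer PhantomData reproduces the !Send property of the context, and the hypothetical `complete` method shows why a fresh context per call keeps the orchestrator thread-safe.

```rust
use std::marker::PhantomData;
use std::sync::Arc;

// Stand-ins for the llama-cpp-2 types: the model holds immutable weights
// (Send + Sync); the context holds mutable inference state (KV cache).
struct LlamaModel;

struct LlamaContext<'m> {
    _model: &'m LlamaModel,
    // A raw-pointer PhantomData opts the context out of Send/Sync,
    // mirroring the !Send LlamaContext in the real crate.
    _not_send: PhantomData<*mut ()>,
}

impl LlamaModel {
    fn new_context(&self) -> LlamaContext<'_> {
        LlamaContext { _model: self, _not_send: PhantomData }
    }
}

struct LlamaOrchestrator {
    // Arc<LlamaModel> is Send + Sync, so the orchestrator can be shared
    // across threads even though contexts cannot.
    model: Arc<LlamaModel>,
}

impl LlamaOrchestrator {
    fn complete(&self, prompt: &str) -> String {
        // Create a fresh context per call; the !Send context lives only
        // in this stack frame and never crosses a thread boundary.
        let _ctx = self.model.new_context();
        // ... run the prompt through `_ctx` here ...
        format!("completion for: {prompt}")
    }
}

fn main() {
    let orch = LlamaOrchestrator { model: Arc::new(LlamaModel) };
    println!("{}", orch.complete("hello"));
}
```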
7 files changed: 373 additions, 1915 deletions