Commit ebb814b
refactor(llm): replace candle backend with llama-cpp-2 for Metal GPU support
candle lacks Metal kernels for quantized GGUF models (rms-norm, QMatMul).
llama.cpp has mature Metal support and auto-detects the GPU at build time.
- Replace candle-core/candle-nn/candle-transformers with llama-cpp-2 (see the Cargo.toml sketch below)
- CandleEmbed -> LlamaEmbed, CandleOrchestrator -> LlamaOrchestrator,
CandleRerank -> LlamaRerank
- Remove select_device(), CandleQMatMul, EmbedLayer, BertLayer,
EmbedModelVariant (llama.cpp handles all model loading internally)
- Remove metal/accelerate/cuda feature flags (llama.cpp handles GPU
detection at CMake build time)
- LlamaContext is !Send, so contexts are created per call from the
  stored LlamaModel (which is Send+Sync); see the Rust sketch after this list
- Public API unchanged: traits, MockLlm, download infra, FlexTokenizer,
PromptFormat, heuristic_orchestrate all preserved
- 270 tests pass (net -1: removed select_device test)
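For illustration, the dependency swap might look like the sketch below. The version numbers are placeholders, not the ones pinned by this commit; the point is that the GPU feature flags disappear because llama.cpp's CMake build detects the backend itself.

```toml
# Illustrative Cargo.toml sketch -- versions are placeholders.
[dependencies]
# candle-core = "0.8"          # removed
# candle-nn = "0.8"            # removed
# candle-transformers = "0.8"  # removed
llama-cpp-2 = "0.1"  # builds llama.cpp via CMake, which detects
                     # Metal/CUDA at build time (no feature flags needed)
```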
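A minimal sketch of the per-call context pattern from the list above. The types here are self-contained stand-ins for the llama-cpp-2 ones, not the crate's real API: a raw-pointer PhantomData reproduces the !Send property of the context, and the hypothetical `complete` method shows why a fresh context per call keeps the orchestrator thread-safe.

```rust
use std::marker::PhantomData;
use std::sync::Arc;

// Stand-ins for the llama-cpp-2 types: the model holds immutable weights
// (Send + Sync); the context holds mutable inference state (KV cache).
struct LlamaModel;

struct LlamaContext<'m> {
    _model: &'m LlamaModel,
    // A raw-pointer PhantomData opts the context out of Send/Sync,
    // mirroring the !Send LlamaContext in the real crate.
    _not_send: PhantomData<*mut ()>,
}

impl LlamaModel {
    fn new_context(&self) -> LlamaContext<'_> {
        LlamaContext { _model: self, _not_send: PhantomData }
    }
}

struct LlamaOrchestrator {
    // Arc<LlamaModel> is Send + Sync, so the orchestrator can be shared
    // across threads even though contexts cannot.
    model: Arc<LlamaModel>,
}

impl LlamaOrchestrator {
    fn complete(&self, prompt: &str) -> String {
        // Create a fresh context per call; the !Send context lives only
        // in this stack frame and never crosses a thread boundary.
        let _ctx = self.model.new_context();
        // ... run the prompt through `_ctx` here ...
        format!("completion for: {prompt}")
    }
}

fn main() {
    let orch = LlamaOrchestrator { model: Arc::new(LlamaModel) };
    println!("{}", orch.complete("hello"));
}
```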
7 files changed: 373 additions, 1915 deletions