Commit 328874d

model: tag ffn_latent as MUL_MAT to fix buft probe (#23664)
ffn_latent_down/up are declared as GGML_OP_MUL in LLM_TENSOR_INFOS, but nemotron-h feeds them through ggml_mul_mat. The loader's buft probe asks the backend about the declared op, so it tested an elementwise MUL on a q8_0 weight. That check used to return true unconditionally, and the weight stayed on the GPU by luck. Once supports_op told the truth, the probe got a no and the loader pushed the weight and its matmul to the CPU, splitting the graph.

Tagging the tensors as MUL_MAT asks the real question; the math is unchanged.

Verified on Nemotron 3 Super 120B Q5_K_M: throughput recovers from 64.9 to 103.22 t/s.
1 parent c1f1e28 commit 328874d
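
The failure mode described in the commit message can be sketched with a toy model. This is an illustration only: the `Toy*` types and `toy_*` functions below are hypothetical stand-ins, not ggml or llama.cpp APIs, and the strict backend's rule (elementwise MUL rejects quantized weights, matmul accepts them) is an assumption chosen to match the behaviour the message describes.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; these are not ggml or llama.cpp types.
enum class ToyOp { MUL, MUL_MAT };
enum class ToyWeightType { F32, Q8_0 };
enum class ToyBuft { GPU, CPU };

using ToySupportsOp = bool (*)(ToyOp, ToyWeightType);

// Lenient backend: claims support for every op (the "stayed on GPU by luck" case).
inline bool toy_supports_op_lenient(ToyOp, ToyWeightType) { return true; }

// Honest backend (assumed rule for this sketch): matmul accepts quantized
// weights, elementwise MUL only accepts F32 operands.
inline bool toy_supports_op_strict(ToyOp op, ToyWeightType type) {
    if (op == ToyOp::MUL_MAT) return true;
    return type == ToyWeightType::F32;
}

// The probe asks about the op *declared* for the tensor, not the op the
// graph actually runs. A wrong declaration therefore yields a wrong placement.
inline ToyBuft toy_probe_buft(ToyOp declared_op, ToyWeightType type,
                              ToySupportsOp supports_op) {
    return supports_op(declared_op, type) ? ToyBuft::GPU : ToyBuft::CPU;
}
```

In this sketch, a q8_0 weight declared as `MUL` lands on the GPU with the lenient backend and on the CPU with the strict one, while the same weight declared as `MUL_MAT` stays on the GPU under both. That mirrors why only the declaration needed to change, not the graph.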

1 file changed

Lines changed: 3 additions & 2 deletions

src/llama-arch.cpp
@@ -767,8 +767,9 @@ static const std::map<llm_tensor, llm_tensor_info> LLM_TENSOR_INFOS = {
     {LLM_TENSOR_NEXTN_SHARED_HEAD_HEAD, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
     {LLM_TENSOR_NEXTN_SHARED_HEAD_NORM, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
     // Nemotron 3 Super
-    {LLM_TENSOR_FFN_LATENT_DOWN, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
-    {LLM_TENSOR_FFN_LATENT_UP, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
+    // latent projections feed ggml_mul_mat, the buft probe must use MUL_MAT to keep them on GPU
+    {LLM_TENSOR_FFN_LATENT_DOWN, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
+    {LLM_TENSOR_FFN_LATENT_UP, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
 };
 
 LLM_KV::LLM_KV(llm_arch arch, const char * suffix) : arch(arch), suffix(suffix) {}
