Commit 328874d
authored
model: tag ffn_latent as MUL_MAT to fix buft probe (#23664)
ffn_latent_down/up are declared GGML_OP_MUL in LLM_TENSOR_INFOS but
nemotron-h feeds them through ggml_mul_mat. The loader buft probe asks
the backend about the declared op, so it tested an elementwise MUL on a
q8_0 weight. That used to return true unconditionally and the weight
stayed on GPU by luck. Once supports_op told the truth, the probe got a
no and the loader pushed the weight and its matmul to CPU, splitting the
graph. Tagging it MUL_MAT asks the real question, the math is unchanged.
Verified on Nemotron 3 Super 120B Q5_K_M: from 64.9 back to 103.22 t/s.1 parent c1f1e28 commit 328874d
1 file changed
Lines changed: 3 additions & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
767 | 767 | | |
768 | 768 | | |
769 | 769 | | |
770 | | - | |
771 | | - | |
| 770 | + | |
| 771 | + | |
| 772 | + | |
772 | 773 | | |
773 | 774 | | |
774 | 775 | | |
| |||
0 commit comments