Commit dbf3d40
fix(llama-graph): n_head_v reshape uses Q-head count, not KV-head count (#78)
The post-attention V-padded reshape in build_attn used hparams.n_head_kv(il),
but cur returned from build_attn_mha has shape (n_embd_head * n_head, n_tokens),
where n_head is the Q-head count. On GQA models where n_head != n_head_kv (e.g.
Qwen2.5-0.5B with head_dim=64 padded → 128, n_head=14, n_head_kv=2), the
element count requested by the reshape does not match the element count of cur,
so the assertion in ggml_reshape_3d fails and the process aborts.
Symptom: GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) at ggml.c:3656.
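The mismatch can be checked with plain arithmetic. The sketch below is illustrative only and does not call ggml: the helper names, the (n_embd_head, n_heads, n_tokens) target layout, and n_tokens = 4 are assumptions made for the example. The head counts and the padded head_dim are the Qwen2.5-0.5B values quoted above.

```cpp
#include <cstdint>

// Elements in the attention output, using the shape stated above:
// (n_embd_head * n_head, n_tokens), with n_head the Q-head count.
constexpr int64_t attn_out_elements(int64_t n_embd_head, int64_t n_head, int64_t n_tokens) {
    return n_embd_head * n_head * n_tokens;
}

// Elements implied by a 3D reshape to (n_embd_head, n_heads, n_tokens).
// ASSUMPTION: this target layout is hypothetical, chosen for illustration.
constexpr int64_t reshape_elements(int64_t n_embd_head, int64_t n_heads, int64_t n_tokens) {
    return n_embd_head * n_heads * n_tokens;
}

// Mirrors the condition the assertion enforces: a reshape is only valid
// when the source and target element counts are equal.
constexpr bool reshape_is_valid(int64_t src_elements, int64_t dst_elements) {
    return src_elements == dst_elements;
}

// Values quoted in the commit message for Qwen2.5-0.5B.
constexpr int64_t kPaddedHeadDim = 128; // head_dim=64 padded to 128
constexpr int64_t kNHead         = 14;  // Q-head count
constexpr int64_t kNHeadKv       = 2;   // KV-head count
constexpr int64_t kNTokens       = 4;   // hypothetical batch size

constexpr int64_t kSrc = attn_out_elements(kPaddedHeadDim, kNHead, kNTokens);

// Pre-fix: the KV-head count yields too few elements, so the check fails.
static_assert(!reshape_is_valid(kSrc, reshape_elements(kPaddedHeadDim, kNHeadKv, kNTokens)),
              "KV-head count must not match the Q-head-sized tensor");

// Post-fix: the Q-head count matches the tensor exactly.
static_assert(reshape_is_valid(kSrc, reshape_elements(kPaddedHeadDim, kNHead, kNTokens)),
              "Q-head count must match");
```

With these numbers the tensor holds 128 * 14 * 4 = 7168 elements, while a reshape sized by n_head_kv asks for 128 * 2 * 4 = 1024. The two counts agree only when n_head == n_head_kv, which is why non-GQA configurations never hit the assertion.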
Reported and diagnosed by @bingh0 in #78. Verified
locally on Qwen2.5-7B (head_dim=128, no padding, regression check passes) and
on AMD MI300X with Qwen2.5-0.5B (head_dim=64, was crashing pre-fix).
Three sites fixed (lines 2285, 2412, 2532 — same idiom in three build_attn
overloads).
Closes #78. Likely also closes #108 (speculative decoding hits the same
assertion).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 157f27f commit dbf3d40
1 file changed
Lines changed: 18 additions & 3 deletions
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 2280 | | - (1 line removed) |
| | 2280-2285 | + (6 lines added) |
| 2402 | | - (1 line removed) |
| | 2407-2412 | + (6 lines added) |
| 2517 | | - (1 line removed) |
| | 2527-2532 | + (6 lines added) |