vulkan: submit more frequently on integrated GPUs to fix #185 device-lost (gfx1151/RADV) by TheTom · Pull Request #186 · TheTom/llama-cpp-turboquant

TheTom · 2026-06-19T23:52:20Z

Fix for #185: draft-mtp device-lost at deep context on gfx1151 / Vulkan RADV

Root cause

On hybrid Qwen3.5/3.6, the dense MTP head is the only full-KV attention path (the main model is SWA + DeltaNet). At deep context, the Vulkan graph batches a large amount of work, including that full-attention MTP dispatch, into a single vkQueueSubmit. On integrated GPUs / APUs, that one submit can run longer than the amdgpu job watchdog allows (amdgpu.lockup_timeout, default ~2000ms) and trigger an ErrorDeviceLost. The byte-based submit heuristic only accounts for matmul work, so flash-attention-heavy graphs slip past it.
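To make the failure mode concrete, here is a toy model of the batching logic, not the actual graph_compute code. The Node struct, the per-node timings, and the byte threshold are all made-up assumptions; the only things taken from this PR are the two flush conditions (a node count and a matmul-only byte count) and the ~2000ms watchdog.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: op kinds, costs, and thresholds are hypothetical.
enum class Op { MatMul, FlashAttn, Other };

struct Node {
    Op       op;
    uint64_t matmul_bytes;  // what a matmul-only byte heuristic would count
    double   gpu_ms;        // assumed GPU time for this node
};

// Returns the longest GPU time accumulated in any single submit when the batch
// is flushed after `nodes_per_submit` nodes or `bytes_per_submit` matmul bytes.
static double longest_submit_ms(const std::vector<Node> & graph,
                                int nodes_per_submit,
                                uint64_t bytes_per_submit) {
    assert(nodes_per_submit > 0);
    double   worst = 0.0;
    double   current_ms = 0.0;
    uint64_t current_bytes = 0;
    int      current_nodes = 0;
    for (const Node & n : graph) {
        current_ms += n.gpu_ms;
        current_nodes += 1;
        if (n.op == Op::MatMul) {
            current_bytes += n.matmul_bytes;  // flash attention adds nothing here
        }
        if (current_nodes >= nodes_per_submit || current_bytes >= bytes_per_submit) {
            if (current_ms > worst) worst = current_ms;
            current_ms = 0.0;
            current_bytes = 0;
            current_nodes = 0;
        }
    }
    if (current_ms > worst) worst = current_ms;  // trailing partial batch
    return worst;
}

// Hypothetical graph: `count` flash-attention nodes costing `ms_each` each.
static std::vector<Node> fa_heavy_graph(int count, double ms_each) {
    return std::vector<Node>(static_cast<size_t>(count),
                             Node{Op::FlashAttn, 0, ms_each});
}
```

With 100 flash-attention nodes at an assumed 30 ms each, a node limit of 100 yields a single 3000 ms submit in this model, while a node limit of 1 caps every submit at 30 ms. The byte threshold never fires because no matmul bytes are counted.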

Fix

Submit more frequently on UMA (integrated) devices:

int nodes_per_submit = ctx->device->uma ? 1 : 100;

with a GGML_VK_NODES_PER_SUBMIT env override for tuning. Discrete GPUs are unaffected (uma == false keeps the existing 100). Mirrors upstream ggml-org#21724, which resolves the same ErrorDeviceLost with no measurable regression.
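The PR text does not show how the override is parsed, so the following is only a sketch of one plausible shape. The helper name pick_nodes_per_submit, the upper bound, and the fall-back-to-default behavior on invalid input are assumptions; only the uma ? 1 : 100 default and the GGML_VK_NODES_PER_SUBMIT variable name come from this PR.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Hypothetical helper: choose the node-count submit threshold.
// `uma` mirrors ctx->device->uma from the PR. `env_value` is the raw value of
// GGML_VK_NODES_PER_SUBMIT, or nullptr when the variable is unset.
static int pick_nodes_per_submit(bool uma, const char * env_value) {
    const int default_value = uma ? 1 : 100;  // default stated in the PR
    if (env_value == nullptr || *env_value == '\0') {
        return default_value;
    }
    char * end = nullptr;
    errno = 0;
    const long parsed = std::strtol(env_value, &end, 10);
    // Assumed policy: ignore trailing garbage, overflow, and values below 1.
    if (errno != 0 || *end != '\0' || parsed < 1 || parsed > 1000000) {
        return default_value;
    }
    return static_cast<int>(parsed);
}
```

A caller would pass std::getenv("GGML_VK_NODES_PER_SUBMIT") as the second argument. Keeping the parsing in a pure function like this makes the default and the override easy to check without a Vulkan device.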

Validation

gfx1151 / RADV (Strix Halo, Radeon 8060S), the exact repro hardware: prefilled to 70,020 tokens with --spec-type draft-mtp and finished normally, 0 device-lost (previously crashed every run at ~43k to 47k). Confirmed by @Defilan.
gfx1201 (RX 9070 XT) and Blackwell (5090): no regression; both are discrete (uma == false), so behavior is unchanged.

Notes

The investigation-time GGML_VK_FA_LOG diagnostic was removed before merge now that the fix is validated.
The earlier per-workgroup-KV split_k approach (this PR's original title) was dropped: on the repro box split_k stayed 1 all the way to 70k with no crash, confirming the submit frequency, not FA chunking, is the operative fix.

Deep-context flash attention device-losts on AMD APUs / RADV (Strix Halo gfx1151, issue #185): the byte-based submit heuristic in graph_compute only accounts for matmul work, so flash-attention-heavy graphs (the dense MTP head doing full attention over deep KV) batch up to nodes_per_submit=100 nodes into one vkQueueSubmit, exceeding the GPU job watchdog (amdgpu.lockup_timeout default 2000ms) and triggering a compute-ring reset / ErrorDeviceLost. This is the same root cause and fix as upstream ggml-org#21724 (lowering nodes_per_submit resolves the identical device-lost with no measurable regression). Default to frequent submits on uma/integrated devices and add a GGML_VK_NODES_PER_SUBMIT override for tuning. Discrete GPUs keep the existing batch size.

…rams Logs per-FLASH_ATTN_EXT dispatch: N, KV, gqa_ratio, K/V types, mask + mask_opt state, uma, split_k/split_kv, and workgroup grid. The last line before a device-lost identifies the exact crashing dispatch and its single-submit GPU shape (issue #185), to drive the KV-chunking thresholds. Env-gated, no effect unless GGML_VK_FA_LOG is set. Pair with GGML_VK_PERF_LOGGER for per-op GPU timing to confirm the FA submit approaching the amdgpu watchdog.

TheTom · 2026-06-21T11:59:34Z

Added env-gated FA diagnostics on this branch (5cc7f0cfc): GGML_VK_FA_LOG=1 logs per-FLASH_ATTN_EXT dispatch params (N, KV, gqa, K/V types, mask/mask_opt, uma, split_k/split_kv, wg). Zero-risk (env-gated, no behavior change). Run the deep-context (>54k) repro on this branch with GGML_VK_FA_LOG=1 GGML_VK_PERF_LOGGER=1 and capture the last [FA] line before the device-lost + the FA op timing. Full instructions in #185. That data sizes the KV-chunking patch (reuse the existing split_k partial + fa_split_k_reduce, driven across per-seq submits). This branch also carries the shipped v2 fix (nodes_per_submit = uma ? 1 : 100) which holds to ~50k.

Defilan · 2026-06-23T16:52:34Z

gfx1151 / RADV validation — fix holds, no device-lost at 70k.

Built the #186 branch tip (5cc7f0c, system_fingerprint: b1-5cc7f0c) and ran the repro on the Strix Halo box (AMD Radeon 8060S, gfx1151, RADV, Vulkan): --flash-attn on --cache-type-k f16 --cache-type-v f16 --spec-type draft-mtp --ubatch-size 2048 --parallel 1, single deep-context request, GGML_VK_FA_LOG=1.

Result: no device-lost. Prefilled to 70,020 prompt tokens and completed normally (finish_reason: stop) — well past the ~43k–47k where it previously crashed every run. 665 FLASH_ATTN_EXT dispatches captured, KV 256 → 70144, 0 crashes.

Deepest dispatches:

[FA] N=2048 KV=65536 gqa=1 K=f16 V=f16 mask=1 mask_opt=1 uma=1 split_k=1 split_kv=65536 wg=[2048,24,1]
[FA] N=2048 KV=67584 gqa=1 K=f16 V=f16 mask=1 mask_opt=1 uma=1 split_k=1 split_kv=67584 wg=[2048,24,1]
[FA] N=2044 KV=70144 gqa=1 K=f16 V=f16 mask=1 mask_opt=1 uma=1 split_k=1 split_kv=70144 wg=[2044,24,1]

Note split_k stayed 1 (a single full-KV dispatch) all the way to 70k — so on this box the device-lost is resolved by the c584da0 submit-frequency change, not by FA chunking. Happy to also build with max_kv_per_workgroup forcing a split if you want to compare the two approaches at depth.
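For anyone checking a captured log from the investigation branch (the GGML_VK_FA_LOG diagnostic is not in the final change), a small parser can confirm that split_k never left 1. This is a sketch that assumes every line has exactly the layout of the three lines quoted above; the FaDispatch struct and parse_fa_line function are hypothetical names, not part of ggml.

```cpp
#include <cassert>
#include <cstdio>

// Fields of one "[FA] ..." line, named after the keys in the quoted log.
struct FaDispatch {
    int  n, kv, gqa;
    char k_type[16], v_type[16];
    int  mask, mask_opt, uma, split_k, split_kv;
    int  wg[3];
};

// Parses one line in the quoted format; returns false on any mismatch.
static bool parse_fa_line(const char * line, FaDispatch & out) {
    const int matched = std::sscanf(
        line,
        "[FA] N=%d KV=%d gqa=%d K=%15s V=%15s mask=%d mask_opt=%d uma=%d "
        "split_k=%d split_kv=%d wg=[%d,%d,%d]",
        &out.n, &out.kv, &out.gqa, out.k_type, out.v_type,
        &out.mask, &out.mask_opt, &out.uma, &out.split_k, &out.split_kv,
        &out.wg[0], &out.wg[1], &out.wg[2]);
    return matched == 13;  // all thirteen fields must convert
}
```

Feeding each captured line through parse_fa_line and tracking the maximum kv alongside the maximum split_k would reproduce the observation above: KV grows to 70144 while split_k stays at 1.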

TheTom mentioned this pull request Jun 19, 2026

Bug: draft-mtp device-lost crash at deep context on gfx1151 / Vulkan (RADV) #185

Open

github-actions Bot added ggml Vulkan labels Jun 19, 2026

TheTom force-pushed the investigate/mtp-devicelost branch from 6cc8c50 to c584da0 Compare June 20, 2026 01:37

vulkan: drop GGML_VK_FA_LOG investigation diagnostic

19364e8

The #185 device-lost fix is validated on gfx1151/RADV (no crash through 70k tokens), so the per-dispatch flash-attention logging scaffolding added during the investigation is no longer needed.

TheTom changed the title ~~vulkan: bound per-workgroup KV in flash attention (candidate fix for #185 device-lost)~~ vulkan: submit more frequently on integrated GPUs to fix #185 device-lost (gfx1151/RADV) Jun 24, 2026

TheTom wants to merge 3 commits into feature/turboquant-kv-cache from investigate/mtp-devicelost
