vulkan: submit more frequently on integrated GPUs to fix #185 device-lost (gfx1151/RADV) #186
TheTom wants to merge 3 commits into
Deep-context flash attention hits device-lost on AMD APUs / RADV (Strix Halo gfx1151, issue #185). The byte-based submit heuristic in graph_compute only accounts for matmul work, so flash-attention-heavy graphs (the dense MTP head doing full attention over deep KV) batch up to nodes_per_submit=100 nodes into one vkQueueSubmit. That submit can exceed the GPU job watchdog (amdgpu.lockup_timeout, default 2000ms) and trigger a compute-ring reset / ErrorDeviceLost.

This is the same root cause and fix as upstream ggml-org#21724, where lowering nodes_per_submit resolves the identical device-lost with no measurable regression. Default to frequent submits on uma/integrated devices and add a GGML_VK_NODES_PER_SUBMIT override for tuning. Discrete GPUs keep the existing batch size.
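The gap in the submit heuristic can be illustrated with a toy model. The sketch below is not the ggml-vulkan code: the `Node` type, the `Op` enum, and the function name are hypothetical, and the flush conditions (matmul byte budget, node cap, end of graph) are a simplification assumed from the description above. It shows that when only matmul bytes count toward the budget, the node cap is the only thing limiting how much flash-attention work lands in one submit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for graph nodes; not the real ggml types.
enum class Op { MulMat, FlashAttn, Other };

struct Node {
    Op       op;
    uint64_t bytes;  // rough size of the work the node does
};

// Toy model of a submit loop: flush when the matmul byte budget is exceeded,
// when the node cap is reached, or at the end of the graph.
// Returns the largest number of flash-attention nodes in any single submit.
std::size_t max_fa_nodes_in_one_submit(const std::vector<Node> & graph,
                                       std::size_t nodes_per_submit,
                                       uint64_t matmul_bytes_per_submit) {
    assert(nodes_per_submit > 0);
    std::size_t batched = 0, fa_in_batch = 0, worst = 0;
    uint64_t matmul_bytes = 0;
    for (std::size_t i = 0; i < graph.size(); ++i) {
        const Node & n = graph[i];
        if (n.op == Op::MulMat)    matmul_bytes += n.bytes;  // only matmul counts
        if (n.op == Op::FlashAttn) ++fa_in_batch;            // invisible to the budget
        ++batched;
        const bool last = (i + 1 == graph.size());
        if (matmul_bytes > matmul_bytes_per_submit ||
            batched >= nodes_per_submit || last) {
            worst = std::max(worst, fa_in_batch);
            batched = 0;
            fa_in_batch = 0;
            matmul_bytes = 0;
        }
    }
    return worst;
}
```

In this model, a graph of 200 flash-attention nodes with a cap of 100 packs 100 of them into one submit regardless of the byte budget, while a cap of 10 bounds each submit at 10 nodes.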
…rams

Logs per-FLASH_ATTN_EXT dispatch: N, KV, gqa_ratio, K/V types, mask + mask_opt state, uma, split_k/split_kv, and workgroup grid. The last line before a device-lost identifies the exact crashing dispatch and its single-submit GPU shape (issue #185), which drives the KV-chunking thresholds. Env-gated: no effect unless GGML_VK_FA_LOG is set. Pair with GGML_VK_PERF_LOGGER for per-op GPU timing to confirm that the FA submit approaches the amdgpu watchdog.
Added env-gated FA diagnostics on this branch (…
gfx1151 / RADV validation: fix holds, no device-lost at 70k.
Built the #186 branch tip (…
Result: no device-lost. Prefilled to 70,020 prompt tokens and completed normally (…
Deepest dispatches: …
Note …
The #185 device-lost fix is validated on gfx1151/RADV (no crash through 70k tokens), so the per-dispatch flash-attention logging scaffolding added during the investigation is no longer needed.
Fix for #185: draft-mtp device-lost at deep context on gfx1151 / Vulkan RADV
Root cause
On hybrid Qwen3.5/3.6 the dense MTP head is the only full-KV attention path (the main model is SWA + DeltaNet). At deep context the Vulkan graph batches a large amount of work, including that full-attention MTP dispatch, into a single vkQueueSubmit. On integrated GPUs / APUs that can exceed the amdgpu job watchdog (default lockup_timeout ~2000ms on RADV) and trigger an ErrorDeviceLost. The byte-based submit heuristic only accounts for matmul work, so flash-attention-heavy graphs slip past it.
Fix
Submit more frequently on uma devices, with a GGML_VK_NODES_PER_SUBMIT env override for tuning. Discrete GPUs are unaffected (uma == false keeps the existing 100). Mirrors upstream ggml-org#21724, which resolves the same ErrorDeviceLost with no measurable regression.
Validation
… --spec-type draft-mtp and finished normally, 0 device-lost (previously crashed every run at ~43k to 47k). Confirmed by @Defilan.
… uma == false), so behavior is unchanged.
Notes
The GGML_VK_FA_LOG diagnostic was removed before merge now that the fix is validated.
The split_k approach (this PR's original title) was dropped: on the repro box split_k stayed 1 all the way to 70k with no crash, confirming that the submit frequency, not FA chunking, is the operative fix.
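The submit-frequency default described above can be sketched as a small selection function. This is a minimal sketch under stated assumptions, not the PR's code: the function names and the UMA default of 10 are hypothetical placeholders (the value actually chosen is not given here); only the discrete default of 100 and the GGML_VK_NODES_PER_SUBMIT variable name come from the text. Ignoring malformed or non-positive override values is also an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

constexpr uint32_t kDiscreteNodesPerSubmit = 100;  // existing batch size, per the PR text
constexpr uint32_t kUmaNodesPerSubmit      = 10;   // ASSUMPTION: illustrative placeholder only

// Picks the per-submit node cap. `override_value` is the raw override string
// (nullptr when unset). Anything that is not a plain positive decimal integer
// is ignored and the device-type default is kept.
uint32_t pick_nodes_per_submit(bool uma, const char * override_value) {
    uint32_t n = uma ? kUmaNodesPerSubmit : kDiscreteNodesPerSubmit;
    if (override_value != nullptr &&
        override_value[0] >= '0' && override_value[0] <= '9') {
        char * end = nullptr;
        const unsigned long parsed = std::strtoul(override_value, &end, 10);
        if (*end == '\0' && parsed > 0 && parsed <= UINT32_MAX) {
            n = static_cast<uint32_t>(parsed);
        }
    }
    assert(n > 0);
    return n;
}

// Convenience wrapper that reads the environment variable named in the PR.
uint32_t nodes_per_submit_from_env(bool uma) {
    return pick_nodes_per_submit(uma, std::getenv("GGML_VK_NODES_PER_SUBMIT"));
}
```

Under these assumptions, running with `GGML_VK_NODES_PER_SUBMIT=32` would cap every submit at 32 nodes on both integrated and discrete devices, while leaving the variable unset keeps 100 on discrete GPUs.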