vulkan: submit more frequently on integrated GPUs to fix #185 device-lost (gfx1151/RADV)#186

Open
TheTom wants to merge 3 commits into feature/turboquant-kv-cache from investigate/mtp-devicelost


TheTom commented Jun 19, 2026

Owner

Fix for #185: draft-mtp device-lost at deep context on gfx1151 / Vulkan RADV

Root cause

On hybrid Qwen3.5/3.6 the dense MTP head is the only full-KV attention path (the main model is SWA + DeltaNet). At deep context the Vulkan backend batches a large amount of graph work, including that full-attention MTP dispatch, into a single vkQueueSubmit. On integrated GPUs / APUs that single submit can run longer than the amdgpu job watchdog allows (amdgpu.lockup_timeout, default ~2000ms) and trigger an ErrorDeviceLost. The byte-based submit heuristic only accounts for matmul work, so flash-attention-heavy graphs slip past it.

Fix

Submit more frequently on uma devices:

int nodes_per_submit = ctx->device->uma ? 1 : 100;

with a GGML_VK_NODES_PER_SUBMIT env override for tuning. Discrete GPUs are unaffected (uma == false keeps the existing 100). Mirrors upstream ggml-org#21724, which resolves the same ErrorDeviceLost with no measurable regression.
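
A minimal sketch of how the default and the env override could combine. The helper name `pick_nodes_per_submit` and its validation rules (positive integers only, anything else falls back to the default) are assumptions for illustration, not the code in this PR; only the `uma ? 1 : 100` default and the `GGML_VK_NODES_PER_SUBMIT` variable name come from the description above.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical helper (not the PR's actual code): choose how many graph
// nodes to record before submitting to the queue.
static int pick_nodes_per_submit(bool uma) {
    // Default from the PR description: submit every node on UMA devices,
    // keep the existing batch of 100 on discrete GPUs.
    int nodes_per_submit = uma ? 1 : 100;

    // Optional tuning override. The parsing rules here are an assumption:
    // accept only a fully numeric, positive value that fits in an int.
    const char * env = std::getenv("GGML_VK_NODES_PER_SUBMIT");
    if (env != nullptr && *env != '\0') {
        char * end = nullptr;
        errno = 0;
        const long v = std::strtol(env, &end, 10);
        if (errno == 0 && end != env && *end == '\0' && v >= 1 && v <= INT_MAX) {
            nodes_per_submit = static_cast<int>(v);
        }
    }
    return nodes_per_submit;
}
```

With this shape, an invalid or unset override leaves the device-based default untouched, so a typo in the environment cannot silently restore the large batch on an APU.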

Validation

  • gfx1151 / RADV (Strix Halo, Radeon 8060S), the exact repro hardware: prefilled to 70,020 tokens with --spec-type draft-mtp and finished normally, 0 device-lost (previously crashed every run at ~43k to 47k). Confirmed by @Defilan.
  • gfx1201 (RX 9070 XT) and Blackwell (5090): no regression; both are discrete (uma == false), so behavior is unchanged.

Notes

  • The investigation-time GGML_VK_FA_LOG diagnostic was removed before merge now that the fix is validated.
  • The earlier per-workgroup-KV split_k approach (this PR's original title) was dropped: on the repro box split_k stayed 1 all the way to 70k with no crash, confirming the submit frequency, not FA chunking, is the operative fix.

Deep-context flash attention triggers device-lost on AMD APUs / RADV (Strix Halo
gfx1151, issue #185): the byte-based submit heuristic in graph_compute only
accounts for matmul work, so flash-attention-heavy graphs (the dense MTP head
doing full attention over deep KV) batch up to nodes_per_submit=100 nodes into
one vkQueueSubmit, exceeding the GPU job watchdog (amdgpu.lockup_timeout,
default 2000ms) and triggering a compute-ring reset / ErrorDeviceLost.

This is the same root cause and fix as upstream ggml-org#21724
(lowering nodes_per_submit resolves the identical device-lost with no
measurable regression). Default to frequent submits on uma/integrated devices
and add a GGML_VK_NODES_PER_SUBMIT override for tuning. Discrete GPUs keep the
existing batch size.
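
The batching behavior described in this commit message can be modeled with a small sketch. This is a simplified stand-in, not the real graph_compute loop: the `Node`, `Op`, and `plan_submits` names are hypothetical, and the only rules modeled are the two named above (a node-count cap and a byte budget that counts matmul work only).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified graph node: just an op kind and a byte size.
enum class Op { MulMat, FlashAttn, Other };

struct Node {
    Op       op;
    uint64_t bytes; // data touched by the op (illustrative only)
};

// Returns the number of nodes recorded into each queue submit.
// A submit is flushed when the node cap is reached, when the
// matmul-only byte budget is reached, or at the end of the graph.
static std::vector<int> plan_submits(const std::vector<Node> & graph,
                                     int nodes_per_submit,
                                     uint64_t mul_mat_bytes_per_submit) {
    std::vector<int> submits;
    int      pending  = 0;
    uint64_t mm_bytes = 0;
    for (std::size_t i = 0; i < graph.size(); ++i) {
        ++pending;
        // Only matmul nodes feed the byte heuristic; flash attention
        // contributes nothing here, however expensive it is.
        if (graph[i].op == Op::MulMat) {
            mm_bytes += graph[i].bytes;
        }
        const bool last = (i + 1 == graph.size());
        if (pending >= nodes_per_submit || mm_bytes >= mul_mat_bytes_per_submit || last) {
            submits.push_back(pending);
            pending  = 0;
            mm_bytes = 0;
        }
    }
    return submits;
}
```

In this model, a graph of 50 flash-attention nodes with a cap of 100 ends up in one submit regardless of the byte budget, while a cap of 1 yields 50 short submits, which is the effect the uma default relies on.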
TheTom force-pushed the investigate/mtp-devicelost branch from 6cc8c50 to c584da0 on June 20, 2026 01:37
…rams

Logs per-FLASH_ATTN_EXT dispatch: N, KV, gqa_ratio, K/V types, mask + mask_opt
state, uma, split_k/split_kv, and workgroup grid. The last line before a
device-lost identifies the exact crashing dispatch and its single-submit GPU
shape (issue #185), to drive the KV-chunking thresholds. Env-gated, no effect
unless GGML_VK_FA_LOG is set. Pair with GGML_VK_PERF_LOGGER for per-op GPU
timing to confirm the FA submit approaching the amdgpu watchdog.
TheTom commented Jun 21, 2026

Owner Author

Added env-gated FA diagnostics on this branch (5cc7f0cfc): GGML_VK_FA_LOG=1 logs per-FLASH_ATTN_EXT dispatch params (N, KV, gqa, K/V types, mask/mask_opt, uma, split_k/split_kv, wg). Zero-risk (env-gated, no behavior change). Run the deep-context (>54k) repro on this branch with GGML_VK_FA_LOG=1 GGML_VK_PERF_LOGGER=1 and capture the last [FA] line before the device-lost + the FA op timing. Full instructions in #185. That data sizes the KV-chunking patch (reuse the existing split_k partial + fa_split_k_reduce, driven across per-seq submits). This branch also carries the shipped v2 fix (nodes_per_submit = uma ? 1 : 100) which holds to ~50k.

Defilan commented Jun 23, 2026


gfx1151 / RADV validation — fix holds, no device-lost at 70k.

Built the #186 branch tip (5cc7f0c, system_fingerprint: b1-5cc7f0c) and ran the repro on the Strix Halo box (AMD Radeon 8060S, gfx1151, RADV, Vulkan): --flash-attn on --cache-type-k f16 --cache-type-v f16 --spec-type draft-mtp --ubatch-size 2048 --parallel 1, single deep-context request, GGML_VK_FA_LOG=1.

Result: no device-lost. Prefilled to 70,020 prompt tokens and completed normally (finish_reason: stop) — well past the ~43k–47k where it previously crashed every run. 665 FLASH_ATTN_EXT dispatches captured, KV 256 → 70144, 0 crashes.

Deepest dispatches:

[FA] N=2048 KV=65536 gqa=1 K=f16 V=f16 mask=1 mask_opt=1 uma=1 split_k=1 split_kv=65536 wg=[2048,24,1]
[FA] N=2048 KV=67584 gqa=1 K=f16 V=f16 mask=1 mask_opt=1 uma=1 split_k=1 split_kv=67584 wg=[2048,24,1]
[FA] N=2044 KV=70144 gqa=1 K=f16 V=f16 mask=1 mask_opt=1 uma=1 split_k=1 split_kv=70144 wg=[2044,24,1]

Note split_k stayed 1 (a single full-KV dispatch) all the way to 70k — so on this box the device-lost is resolved by the c584da0 submit-frequency change, not by FA chunking. Happy to also build with max_kv_per_workgroup forcing a split if you want to compare the two approaches at depth.

The #185 device-lost fix is validated on gfx1151/RADV (no crash through
70k tokens), so the per-dispatch flash-attention logging scaffolding added
during the investigation is no longer needed.
TheTom changed the title from "vulkan: bound per-workgroup KV in flash attention (candidate fix for #185 device-lost)" to "vulkan: submit more frequently on integrated GPUs to fix #185 device-lost (gfx1151/RADV)" on Jun 24, 2026