Eval bug: Vulkan turbo3 KV produces incoherent decode while HIP on same model is fine (7900 XTX, head_dim=128) #88

@apollosenvy

Description

What happened

On an AMD 7900 XTX (gfx1100), llama-cli with -ctk turbo3 -ctv turbo3 produces fluent text when dispatched to the ROCm/HIP backend but produces mostly random UTF-8 when dispatched to the Vulkan backend, using the same model and build.

This shows up on head_dim=128 models and persists with -fa off, so it is not specific to the flash-attention path -- it looks like the copy-to-quant / dequant roundtrip on Vulkan diverges from the HIP implementation in some systematic way.

Per-op tests against CPU are all green, so the regression is in a combination that the per-op suite does not cover.

Environment

  • GPU: AMD Radeon RX 7900 XTX, gfx1100, 24GB
  • Driver: RADV NAVI31 (Mesa 25.x), Vulkan 1.4.341
  • ROCm: 7.2.1
  • CPU: Ryzen 9900X, 192GB DDR5
  • Build: feature/turboquant-kv-cache @ 59798f1 with PR vulkan: fix turbo3 build + coopmat FA after April upstream sync #87 on top (needed so the process gets past the prior assertion)
  • CMake: -DGGML_HIP=ON -DGGML_VULKAN=ON -DAMDGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=OFF

Reproduce

./build/bin/llama-cli \
  -m Qwen3-8B-Q4_K_M.gguf \
  -ngl 99 -fa on -ctk turbo3 -ctv turbo3 \
  -dev Vulkan0 -no-cnv \
  -p "The capital of France is" -n 20 --no-warmup

Vulkan output from that run (same seed / greedy start): ba(踠enance雉寰侧0呋窄疟

Same invocation with -dev ROCm0: Okay, the user is asking about the capital of France. Let (coherent, model reasons normally from there).

-fa off on Vulkan gives a different bad string (包pany描述 � BANK, after which the parser complains and the process aborts), which rules out the fused-FA path as the sole cause.

Test suite results

test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3   530/530 pass
test-backend-ops -o SET_ROWS       -b Vulkan0                     147/147 pass
test-backend-ops -o FLASH_ATTN_EXT -b ROCm0   -p type_KV=turbo3    all pass
llama-cli        -dev ROCm0       -ctk turbo3 -ctv turbo3         coherent output
llama-cli        -dev Vulkan0     -ctk turbo3 -ctv turbo3         garbage output

Suspicion

Because the per-op tests pass while the full decode diverges, the bug most likely lives in how compound operations chain, or in a shape or dtype the per-op suite never generates. Candidates worth checking first:

  • the graph-side ggml_cast from turbo3 to F16;
  • the cpy_turbo3_0_f32 / set_rows_turbo3_0 roundtrip across multiple decode steps;
  • whether the Metal-style norm-correction bake used by the CUDA/HIP path ever made it into the Vulkan dequant kernel.
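To illustrate the third candidate: the sketch below is not the real turbo3 codec (its block layout is not public knowledge here) — it is a toy symmetric 4-bit quantizer where `correction` stands in for the hypothetical norm-correction bake. The point is that a dequant kernel which omits the factor stays within a loose per-op tolerance on every element, yet shifts every attention-score dot product in the same direction, which is consistent with per-op tests passing while decode diverges.

```python
import random

QBLOCK = 32  # toy block size; the real turbo3 layout is an unknown here

def quantize(xs):
    """Toy symmetric 4-bit quantization: per-block scale + int codes in [-7, 7]."""
    blocks = []
    for i in range(0, len(xs), QBLOCK):
        blk = xs[i:i + QBLOCK]
        amax = max(abs(v) for v in blk) or 1.0
        scale = amax / 7.0
        codes = [max(-7, min(7, round(v / scale))) for v in blk]
        blocks.append((scale, codes))
    return blocks

def dequantize(blocks, correction=1.0):
    """Dequantize; correction=1.0 models a kernel that skips the bake."""
    out = []
    for scale, codes in blocks:
        out.extend(c * scale * correction for c in codes)
    return out

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(4096)]
good = dequantize(quantize(x), correction=0.995)  # bake applied (HIP-style)
bad  = dequantize(quantize(x), correction=1.0)    # bake missing (hypothetical Vulkan)

# Per-element disagreement between the two kernels is tiny, so a per-op
# test with a loose absolute tolerance would pass:
max_err = max(abs(g - b) for g, b in zip(good, bad))

# But a long reduction (one attention score sums over head_dim, and decode
# sums over every cached position) accumulates the systematic bias:
dot_good = sum(g * g2 for g, g2 in zip(good, good))
dot_bad  = sum(b * g2 for b, g2 in zip(bad, good))
print(f"max per-element error: {max_err:.4f}")
print(f"dot-product drift:     {abs(dot_good - dot_bad):.2f}")
```

If something like this is the cause, dumping the dequantized K/V tensors from both backends after a few decode steps and comparing per-channel means (rather than max abs error) should make the bias visible.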

Not blocking

Filing for visibility. The HIP path is fully functional on this hardware, so AMD users have a working route via -DGGML_HIP=ON.
