What happened

On an AMD 7900 XTX (gfx1100), llama-cli with -ctk turbo3 -ctv turbo3 produces fluent text when dispatched to the ROCm/HIP backend but mostly random UTF-8 when dispatched to the Vulkan backend, using the same model and build.

This shows up on head_dim=128 models and persists with -fa off, so it is not the flash-attention path specifically -- it looks like the copy-to-quant / dequant roundtrip on Vulkan diverges from the HIP implementation in some systematic way.
Environment

feature/turboquant-kv-cache @ 59798f1, with PR #87 (vulkan: fix turbo3 build + coopmat FA after April upstream sync) applied on top, which is needed so the process gets past the prior assertion. Build flags: -DGGML_HIP=ON -DGGML_VULKAN=ON -DAMDGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=OFF.
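For reference, a full configure/build line with those flags might look like the following -- a sketch assuming the stock llama.cpp CMake flow; only the -D flags above are from this report, the rest is the standard workflow:

# Assumed build invocation; the -D flags are from this report, the
# generator defaults and Release config are the usual llama.cpp flow.
cmake -B build -DGGML_HIP=ON -DGGML_VULKAN=ON \
      -DAMDGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=OFF
cmake --build build --config Release -j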
Reproduce

./build/bin/llama-cli \
  -m Qwen3-8B-Q4_K_M.gguf \
  -ngl 99 -fa on -ctk turbo3 -ctv turbo3 \
  -dev Vulkan0 -no-cnv \
  -p "The capital of France is" -n 20 --no-warmup
Vulkan output from that run (same seed / greedy start): ba(踠enance雉寰侧0呋窄疟

Same invocation with -dev ROCm0: Okay, the user is asking about the capital of France. Let (coherent; the model reasons normally from there).

-fa off on Vulkan gives a different bad string (包pany描述 � BANK, after which the parser complains and the process aborts), which rules out the fused-FA path as the sole cause.
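For a quick side-by-side, the two backends can be compared in one shot; this is just the command above looped with only -dev swapped (device names are the ones this report already uses):

# A/B sketch: identical flags, only the device changes between runs.
for dev in Vulkan0 ROCm0; do
  echo "=== $dev ==="
  ./build/bin/llama-cli \
    -m Qwen3-8B-Q4_K_M.gguf \
    -ngl 99 -fa on -ctk turbo3 -ctv turbo3 \
    -dev "$dev" -no-cnv \
    -p "The capital of France is" -n 20 --no-warmup
done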
Test suite results

Per-op tests against CPU are all green, so the regression is in a combination that the per-op suite does not cover.

Suspicion

Because per-op tests pass but the full decode diverges, the bug is most likely in how compound operations chain, or in a shape or dtype that the per-op suite does not generate. Candidates worth checking first: the graph-side ggml_cast from turbo3 to F16, the cpy_turbo3_0_f32 / set_rows_turbo3_0 roundtrip across multiple decode steps, and whether the Metal-style norm-correction bake that the CUDA/HIP path uses ever made it into the Vulkan dequant kernel.
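One way to start narrowing this down: run the stock per-op suite on the Vulkan device, filtered to the suspect ops. The -o/-b filters are real test-backend-ops options; whether the harness actually generates turbo3 cases for these ops at head_dim=128 is exactly the coverage gap in question.

# Per-op suite filtered to the ops implicated above. In upstream ggml,
# ggml_cast lowers to a CPY node, so -o CPY should also exercise the
# cast path. Whether turbo3 dtypes appear in the generated cases is an
# open question -- that may be the hole to plug.
./build/bin/test-backend-ops test -b Vulkan0 -o CPY
./build/bin/test-backend-ops test -b Vulkan0 -o SET_ROWS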
Not blocking
Filing for visibility. The HIP path is fully functional on this hardware, so AMD users have a working route via -DGGML_HIP=ON.