Eval bug: Vulkan turbo3 KV produces incoherent decode while HIP on same model is fine (7900 XTX, head_dim=128) #88

@apollosenvy

Description

What happened

On an AMD 7900 XTX (gfx1100), llama-cli with -ctk turbo3 -ctv turbo3 produces fluent text when dispatched to the ROCm/HIP backend but produces mostly random UTF-8 when dispatched to the Vulkan backend, using the same model and build.

This shows up on head_dim=128 models and persists with -fa off, so it is not specific to the flash-attention path -- it looks like the copy-to-quant / dequant roundtrip on Vulkan diverges from the HIP implementation in some systematic way.

Per-op tests against CPU are all green, so the regression is in a combination that the per-op suite does not cover.

Environment

  • GPU: AMD Radeon RX 7900 XTX, gfx1100, 24GB
  • Driver: RADV NAVI31 (Mesa 25.x), Vulkan 1.4.341
  • ROCm: 7.2.1
  • CPU: Ryzen 9900X, 192GB DDR5
  • Build: feature/turboquant-kv-cache @ 59798f1 with PR vulkan: fix turbo3 build + coopmat FA after April upstream sync #87 on top (needed so the process gets past the prior assertion)
  • CMake: -DGGML_HIP=ON -DGGML_VULKAN=ON -DAMDGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=OFF

Reproduce

./build/bin/llama-cli \
  -m Qwen3-8B-Q4_K_M.gguf \
  -ngl 99 -fa on -ctk turbo3 -ctv turbo3 \
  -dev Vulkan0 -no-cnv \
  -p "The capital of France is" -n 20 --no-warmup

Vulkan output from that run (same seed / greedy start): ba(踠enance雉寰侧0呋窄疟

Same invocation with -dev ROCm0: Okay, the user is asking about the capital of France. Let (coherent, model reasons normally from there).

-fa off on Vulkan gives a different bad string (包pany描述 � BANK, after which the parser complains and the process aborts), which rules out the fused-FA path as the sole cause.

Test suite results

test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3   530/530 pass
test-backend-ops -o SET_ROWS       -b Vulkan0                     147/147 pass
test-backend-ops -o FLASH_ATTN_EXT -b ROCm0   -p type_KV=turbo3    all pass
llama-cli        -dev ROCm0       -ctk turbo3 -ctv turbo3         coherent output
llama-cli        -dev Vulkan0     -ctk turbo3 -ctv turbo3         garbage output

Suspicion

Because the per-op tests pass while the full decode diverges, the bug most likely lives in how compound operations chain, or in a shape or dtype the per-op suite never generates. Candidates worth checking first:

  • the graph-side ggml_cast from turbo3 to F16;
  • the cpy_turbo3_0_f32 / set_rows_turbo3_0 roundtrip across multiple decode steps;
  • whether the Metal-style norm-correction bake used by the CUDA/HIP path ever made it into the Vulkan dequant kernel.
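To illustrate the third candidate: the sketch below is not the real turbo3 codec (its block layout is not public knowledge here) — it is a toy symmetric 4-bit quantizer where `correction` stands in for the hypothetical norm-correction bake. The point is that a dequant kernel which omits the factor stays within a loose per-op tolerance on every element, yet shifts every attention-score dot product in the same direction, which is consistent with per-op tests passing while decode diverges.

```python
import random

QBLOCK = 32  # toy block size; the real turbo3 layout is an unknown here

def quantize(xs):
    """Toy symmetric 4-bit quantization: per-block scale + int codes in [-7, 7]."""
    blocks = []
    for i in range(0, len(xs), QBLOCK):
        blk = xs[i:i + QBLOCK]
        amax = max(abs(v) for v in blk) or 1.0
        scale = amax / 7.0
        codes = [max(-7, min(7, round(v / scale))) for v in blk]
        blocks.append((scale, codes))
    return blocks

def dequantize(blocks, correction=1.0):
    """Dequantize; correction=1.0 models a kernel that skips the bake."""
    out = []
    for scale, codes in blocks:
        out.extend(c * scale * correction for c in codes)
    return out

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(4096)]
good = dequantize(quantize(x), correction=0.995)  # bake applied (HIP-style)
bad  = dequantize(quantize(x), correction=1.0)    # bake missing (hypothetical Vulkan)

# Per-element disagreement between the two kernels is tiny, so a per-op
# test with a loose absolute tolerance would pass:
max_err = max(abs(g - b) for g, b in zip(good, bad))

# But a long reduction (one attention score sums over head_dim, and decode
# sums over every cached position) accumulates the systematic bias:
dot_good = sum(g * g2 for g, g2 in zip(good, good))
dot_bad  = sum(b * g2 for b, g2 in zip(bad, good))
print(f"max per-element error: {max_err:.4f}")
print(f"dot-product drift:     {abs(dot_good - dot_bad):.2f}")
```

If something like this is the cause, dumping the dequantized K/V tensors from both backends after a few decode steps and comparing per-channel means (rather than max abs error) should make the bias visible.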

Not blocking

Filing for visibility. The HIP path is fully functional on this hardware, so AMD users have a working route via -DGGML_HIP=ON.
