Eval bug: cublasSgemm_v2 CUBLAS_STATUS_INVALID_VALUE during prompt-cache invalidation under sustained load (turbo3 KV + --n-cpu-moe + MoE model) #119

@LCARSops

Description

Name and Version

$ ./build/bin/llama-server --version
version: turboquant feature/turboquant-kv-cache @ 4f331667d9badcc71ab864f1e298591061d82050
built with cc (Ubuntu 13.3.0-...) 13.3.0 for x86_64-linux-gnu
CUDA: 12.0, driver 570.211.01

Operating systems

Linux

GGML backends

CUDA

Hardware

AMD Ryzen 5 5600 (6c/12t) + NVIDIA RTX 3070 8 GB, 31 GB system RAM

Models

unsloth/Qwen3.6-35B-A3B-GGUF, file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf (35B-A3B MoE)

Problem description & steps to reproduce

llama-server (built from the feature/turboquant-kv-cache branch) crashes via ggml_abort on a cublasSgemm_v2 call returning CUBLAS_STATUS_INVALID_VALUE. The crash is triggered mid-prompt-prefill (progress = 0.688...), immediately after the slot has restored 1 prompt-cache checkpoint and erased 2 invalidated checkpoints.

Crashes do not happen on cold workloads — only after sustained inference load against the same server. Time-to-crash is inversely correlated with concurrent request rate, suggesting a race in prompt-cache lifetime management.

Build:

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-server -j

Launch:

./build/bin/llama-server \
  --model Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8085 --jinja --alias qwen3.6-35b \
  --ctx-size 131072 --n-gpu-layers 999 --n-cpu-moe 32 \
  --cache-type-k turbo3 --cache-type-v turbo3 --flash-attn on \
  --batch-size 1024 --ubatch-size 512 --parallel 1 \
  --threads 6 --cont-batching --metrics

Workload: a script POSTing back-to-back chat completions to /v1/chat/completions, each with a fresh ~1500–2000 token user prompt (different content every time — README text from random GitHub repos), enable_thinking: false, max_tokens: 500. No tool calls. Each request fully completes before the next is sent (single-runner mode); multiple runners issue concurrently.
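A minimal sketch of one runner in that workload (the endpoint, port, alias, and request fields come from the launch command and description above; the exact mechanism for passing enable_thinking: false is an assumption on my part, as is the prompt source):

```python
import json
import urllib.request

# Port and path from the launch command / OpenAI-compatible API above.
SERVER = "http://127.0.0.1:8085/v1/chat/completions"

def build_payload(prompt: str) -> dict:
    # Mirrors the described request shape: a fresh ~1500-2000 token user
    # prompt per request, thinking disabled, 500-token completion cap, no tools.
    return {
        "model": "qwen3.6-35b",
        "messages": [{"role": "user", "content": prompt}],
        # Assumed field name for disabling thinking; adjust to your server build.
        "chat_template_kwargs": {"enable_thinking": False},
        "max_tokens": 500,
        "stream": False,
    }

def run_once(prompt: str) -> str:
    req = urllib.request.Request(
        SERVER,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    # Blocks until the completion finishes, so each runner is strictly
    # sequential (single-runner mode); concurrency comes from running
    # several of these loops in parallel processes.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```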

Time-to-crash, three runs on the same server build:

Concurrent runners | Time-to-crash            | Repos completed pre-crash
1                  | did not crash within 2 h | 249
2                  | ~60 min                  | ~290
4                  | ~33 min                  | ~96

Crash is non-deterministic but reliably reproducible at sustained load with --cache-type-k turbo3.

Workaround that fixed it: switching to upstream ggml-org/llama.cpp master, dropping --cache-type-k turbo3 --cache-type-v turbo3 for q8_0, and reducing --ctx-size to 32768. Stable for 2 h+ at single runner, no errors, +30 % generation throughput (49.6 vs 38.0 t/s) on the same hardware. This points strongly at the turboquant-fork-specific code path — likely the interaction between turbo3 KV cache, prompt-cache checkpoint invalidation, and --n-cpu-moe expert offload.
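For reference, the stable configuration was roughly the following (reconstructed from the launch command above plus the three changes just described; not a verbatim copy of what was run):

```shell
# upstream ggml-org/llama.cpp master: q8_0 KV cache instead of turbo3,
# context reduced to 32768; all other flags unchanged
./build/bin/llama-server \
  --model Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8085 --jinja --alias qwen3.6-35b \
  --ctx-size 32768 --n-gpu-layers 999 --n-cpu-moe 32 \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
  --batch-size 1024 --ubatch-size 512 --parallel 1 \
  --threads 6 --cont-batching --metrics
```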

Hypothesis: cublasSgemm_v2 is called with shape parameters (row_diff, src1_ncols, ne10, etc.) derived from a tensor descriptor that was valid before the checkpoint invalidations but stale after. cuBLAS rejects invalid stride/dim combinations as CUBLAS_STATUS_INVALID_VALUE. The race window is the slot's update_slots checkpoint reconciliation between dispatching graph compute and cuBLAS executing the matmul.
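For context, the conditions under which cuBLAS returns CUBLAS_STATUS_INVALID_VALUE for a GEMM can be sketched as a pure check (a simplification of the documented shape/leading-dimension constraints; the mapping back to the crashing call is in the comments):

```python
def sgemm_args_valid(m: int, n: int, k: int,
                     lda: int, ldb: int, ldc: int,
                     op_a: str = "T", op_b: str = "N") -> bool:
    """Shape/stride constraints cuBLAS enforces for cublasSgemm; violating
    any of them yields CUBLAS_STATUS_INVALID_VALUE."""
    if min(m, n, k) < 0:
        return False
    # Leading dimension of A must cover A's stored rows: m if 'N', k if 'T'.
    if lda < max(1, m if op_a == "N" else k):
        return False
    # Leading dimension of B: k if 'N', n if 'T'.
    if ldb < max(1, k if op_b == "N" else n):
        return False
    # C is always m x n, so ldc must cover m rows.
    return ldc >= max(1, m)

# In the crashing call (CUBLAS_OP_T, CUBLAS_OP_N) the mapping is:
#   m = row_diff, n = src1_ncols, k = ne10, lda = ne00, ldb = ne10
# so a stale descriptor producing, e.g., ne00 < ne10 (lda < k under OP_T)
# or a negative row_diff would be rejected exactly as observed.
```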

I'm happy to test a fix on the same workload — can provide a coredump or full log on request.

First Bad Commit

Not bisected within the fork — only the merge commit 4f331667d9badcc71ab864f1e298591061d82050 on feature/turboquant-kv-cache was tested. Happy to bisect if helpful.

Relevant log output

Crash signature
slot update_slots: id  0 | task 77444 | restored context checkpoint (pos_min = 139, pos_max = 139, n_tokens = 140, n_past = 140, size = 62.813 MiB)
slot update_slots: id  0 | task 77444 | erased invalidated context checkpoint (pos_min = 913, pos_max = 913, n_tokens = 914, ...)
slot update_slots: id  0 | task 77444 | erased invalidated context checkpoint (pos_min = 1425, pos_max = 1425, n_tokens = 1426, ...)
slot update_slots: id  0 | task 77444 | prompt processing progress, n_tokens = 1164, batch.n_tokens = 1024, progress = 0.688350
CUDA error: an unsupported value or parameter was passed to the function
  current device: 0, in function ggml_cuda_op_mul_mat_cublas at ggml/src/ggml-cuda/ggml-cuda.cu:1666
  cublasSgemm_v2(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10,
                 &alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)

Backtrace (relevant frames):

ggml_abort
libggml-cuda.so.0  (ggml_cuda_op_mul_mat_cublas chain)
ggml_backend_sched_graph_compute_async
llama_context::graph_compute
llama_context::process_ubatch
llama_context::decode
llama_decode

A new request (srv params_from_: Chat format: peg-native) appears interleaved with the abort backtrace — suggesting another request was being scheduled at the moment of the failure.