Eval bug: cublasSgemm_v2 CUBLAS_STATUS_INVALID_VALUE during prompt-cache invalidation under sustained load (turbo3 KV + --n-cpu-moe + MoE model) #119

@LCARSops

Description

Name and Version

$ ./build/bin/llama-server --version
version: turboquant feature/turboquant-kv-cache @ 4f331667d9badcc71ab864f1e298591061d82050
built with cc (Ubuntu 13.3.0-...) 13.3.0 for x86_64-linux-gnu
CUDA: 12.0, driver 570.211.01

Operating systems

Linux

GGML backends

CUDA

Hardware

AMD Ryzen 5 5600 (6c/12t) + NVIDIA RTX 3070 8 GB, 31 GB system RAM

Models

unsloth/Qwen3.6-35B-A3B-GGUF, file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf (35B-A3B MoE)

Problem description & steps to reproduce

llama-server (built from the feature/turboquant-kv-cache branch) crashes via ggml_abort on a cublasSgemm_v2 call returning CUBLAS_STATUS_INVALID_VALUE. The crash is triggered mid-prompt-prefill (progress = 0.688...), immediately after the slot has restored 1 prompt-cache checkpoint and erased 2 invalidated checkpoints.

Crashes do not happen on cold workloads — only after sustained inference load against the same server. Time-to-crash is inversely correlated with concurrent request rate, suggesting a race in prompt-cache lifetime management.

Build:

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-server -j

Launch:

./build/bin/llama-server \
  --model Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8085 --jinja --alias qwen3.6-35b \
  --ctx-size 131072 --n-gpu-layers 999 --n-cpu-moe 32 \
  --cache-type-k turbo3 --cache-type-v turbo3 --flash-attn on \
  --batch-size 1024 --ubatch-size 512 --parallel 1 \
  --threads 6 --cont-batching --metrics

Workload: a script POSTing back-to-back chat completions to /v1/chat/completions, each with a fresh ~1500–2000 token user prompt (different content every time — README text from random GitHub repos), enable_thinking: false, max_tokens: 500. No tool calls. Each request fully completes before the next is sent (single-runner mode); multiple runners issue concurrently.
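A minimal sketch of one runner in that workload (the endpoint, port, alias, and request fields come from the launch command and description above; the exact mechanism for passing enable_thinking: false is an assumption on my part, as is the prompt source):

```python
import json
import urllib.request

# Port and path from the launch command / OpenAI-compatible API above.
SERVER = "http://127.0.0.1:8085/v1/chat/completions"

def build_payload(prompt: str) -> dict:
    # Mirrors the described request shape: a fresh ~1500-2000 token user
    # prompt per request, thinking disabled, 500-token completion cap, no tools.
    return {
        "model": "qwen3.6-35b",
        "messages": [{"role": "user", "content": prompt}],
        # Assumed field name for disabling thinking; adjust to your server build.
        "chat_template_kwargs": {"enable_thinking": False},
        "max_tokens": 500,
        "stream": False,
    }

def run_once(prompt: str) -> str:
    req = urllib.request.Request(
        SERVER,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    # Blocks until the completion finishes, so each runner is strictly
    # sequential (single-runner mode); concurrency comes from running
    # several of these loops in parallel processes.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```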

Time-to-crash, three runs on the same server build:

Concurrent runners | Time-to-crash            | Repos completed pre-crash
1                  | did not crash within 2 h | 249
2                  | ~60 min                  | ~290
4                  | ~33 min                  | ~96

Crash is non-deterministic but reliably reproducible at sustained load with --cache-type-k turbo3.

Workaround that fixed it: switching to upstream ggml-org/llama.cpp master, dropping --cache-type-k turbo3 --cache-type-v turbo3 for q8_0, and reducing --ctx-size to 32768. Stable for 2 h+ at single runner, no errors, +30 % generation throughput (49.6 vs 38.0 t/s) on the same hardware. This points strongly at the turboquant-fork-specific code path — likely the interaction between turbo3 KV cache, prompt-cache checkpoint invalidation, and --n-cpu-moe expert offload.
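For reference, the stable configuration was roughly the following (reconstructed from the launch command above plus the three changes just described; not a verbatim copy of what was run):

```shell
# upstream ggml-org/llama.cpp master: q8_0 KV cache instead of turbo3,
# context reduced to 32768; all other flags unchanged
./build/bin/llama-server \
  --model Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8085 --jinja --alias qwen3.6-35b \
  --ctx-size 32768 --n-gpu-layers 999 --n-cpu-moe 32 \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
  --batch-size 1024 --ubatch-size 512 --parallel 1 \
  --threads 6 --cont-batching --metrics
```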

Hypothesis: cublasSgemm_v2 is called with shape parameters (row_diff, src1_ncols, ne10, etc.) derived from a tensor descriptor that was valid before the checkpoint invalidations but stale after. cuBLAS rejects invalid stride/dim combinations as CUBLAS_STATUS_INVALID_VALUE. The race window is the slot's update_slots checkpoint reconciliation between dispatching graph compute and cuBLAS executing the matmul.
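For context, the conditions under which cuBLAS returns CUBLAS_STATUS_INVALID_VALUE for a GEMM can be sketched as a pure check (a simplification of the documented shape/leading-dimension constraints; the mapping back to the crashing call is in the comments):

```python
def sgemm_args_valid(m: int, n: int, k: int,
                     lda: int, ldb: int, ldc: int,
                     op_a: str = "T", op_b: str = "N") -> bool:
    """Shape/stride constraints cuBLAS enforces for cublasSgemm; violating
    any of them yields CUBLAS_STATUS_INVALID_VALUE."""
    if min(m, n, k) < 0:
        return False
    # Leading dimension of A must cover A's stored rows: m if 'N', k if 'T'.
    if lda < max(1, m if op_a == "N" else k):
        return False
    # Leading dimension of B: k if 'N', n if 'T'.
    if ldb < max(1, k if op_b == "N" else n):
        return False
    # C is always m x n, so ldc must cover m rows.
    return ldc >= max(1, m)

# In the crashing call (CUBLAS_OP_T, CUBLAS_OP_N) the mapping is:
#   m = row_diff, n = src1_ncols, k = ne10, lda = ne00, ldb = ne10
# so a stale descriptor producing, e.g., ne00 < ne10 (lda < k under OP_T)
# or a negative row_diff would be rejected exactly as observed.
```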

I'm happy to test a fix on the same workload — can provide a coredump or full log on request.

First Bad Commit

Not bisected within the fork — only the merge commit 4f331667d9badcc71ab864f1e298591061d82050 on feature/turboquant-kv-cache was tested. Happy to bisect if helpful.

Relevant log output

Crash signature
slot update_slots: id  0 | task 77444 | restored context checkpoint (pos_min = 139, pos_max = 139, n_tokens = 140, n_past = 140, size = 62.813 MiB)
slot update_slots: id  0 | task 77444 | erased invalidated context checkpoint (pos_min = 913, pos_max = 913, n_tokens = 914, ...)
slot update_slots: id  0 | task 77444 | erased invalidated context checkpoint (pos_min = 1425, pos_max = 1425, n_tokens = 1426, ...)
slot update_slots: id  0 | task 77444 | prompt processing progress, n_tokens = 1164, batch.n_tokens = 1024, progress = 0.688350
CUDA error: an unsupported value or parameter was passed to the function
  current device: 0, in function ggml_cuda_op_mul_mat_cublas at ggml/src/ggml-cuda/ggml-cuda.cu:1666
  cublasSgemm_v2(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10,
                 &alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)

Backtrace (relevant frames):

ggml_abort
libggml-cuda.so.0  (ggml_cuda_op_mul_mat_cublas chain)
ggml_backend_sched_graph_compute_async
llama_context::graph_compute
llama_context::process_ubatch
llama_context::decode
llama_decode

A new request (srv params_from_: Chat format: peg-native) appears interleaved with the abort backtrace — suggesting another request was being scheduled at the moment of the failure.