Name and Version
$ ./build/bin/llama-server --version
version: turboquant feature/turboquant-kv-cache @ 4f331667d9badcc71ab864f1e298591061d82050
built with cc (Ubuntu 13.3.0-...) 13.3.0 for x86_64-linux-gnu
CUDA: 12.0, driver 570.211.01
Operating systems
Linux
GGML backends
CUDA
Hardware
AMD Ryzen 5 5600 (6c/12t) + NVIDIA RTX 3070 8 GB, 31 GB system RAM
Models
unsloth/Qwen3.6-35B-A3B-GGUF — Qwen3.6-35B-A3B-UD-Q4_K_M.gguf (35B-A3B MoE)
Problem description & steps to reproduce
llama-server (built from the feature/turboquant-kv-cache branch) crashes via ggml_abort on a cublasSgemm_v2 call returning CUBLAS_STATUS_INVALID_VALUE. The crash is triggered mid-prompt-prefill (progress = 0.688...), immediately after the slot has restored 1 prompt-cache checkpoint and erased 2 invalidated checkpoints.
Crashes do not happen on cold workloads — only after sustained inference load against the same server. Time-to-crash is inversely correlated with concurrent request rate, suggesting a race in prompt-cache lifetime management.
Build:
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-server -j
Launch:
./build/bin/llama-server --model Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --host 0.0.0.0 --port 8085 --jinja --alias qwen3.6-35b --ctx-size 131072 --n-gpu-layers 999 --n-cpu-moe 32 --cache-type-k turbo3 --cache-type-v turbo3 --flash-attn on --batch-size 1024 --ubatch-size 512 --parallel 1 --threads 6 --cont-batching --metrics
Workload: a script POSTing back-to-back chat completions to /v1/chat/completions, each with a fresh ~1500–2000 token user prompt (different content every time — README text from random GitHub repos), enable_thinking: false, max_tokens: 500. No tool calls. Within one runner, each request fully completes before the next is sent; with more than one runner, the runners issue requests concurrently.
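For reference, a minimal sketch of one runner, assuming the standard OpenAI-compatible request shape. The real script pulls README text from GitHub; here a random base64 blob stands in for the fresh prompt content, and passing enable_thinking via chat_template_kwargs is my assumption about where that flag goes in the request body:

```bash
#!/usr/bin/env bash
# One runner: sequential requests, fresh prompt content per iteration.
# Start N copies of this script for N concurrent runners.
URL="http://localhost:8085/v1/chat/completions"
while true; do
  # Stand-in for ~1500-2000 tokens of fresh README text (defeats prompt-cache reuse).
  PROMPT=$(head -c 6000 /dev/urandom | base64 | tr -d '\n')
  jq -cn --arg p "$PROMPT" '{
    model: "qwen3.6-35b",
    messages: [{role: "user", content: ("Summarize this README:\n" + $p)}],
    max_tokens: 500,
    chat_template_kwargs: {enable_thinking: false}   # placement assumed, see note above
  }' | curl -s "$URL" -H "Content-Type: application/json" --data-binary @- >/dev/null
done
```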
Time-to-crash, three runs on the same server build:
| Concurrent runners | Time-to-crash | Repos completed pre-crash |
|---|---|---|
| 1 | did not crash within 2 h | 249 |
| 2 | ~60 min | ~290 |
| 4 | ~33 min | ~96 |
Crash is non-deterministic but reliably reproducible at sustained load with --cache-type-k turbo3.
Workaround that fixed it: switching to upstream ggml-org/llama.cpp master, replacing --cache-type-k turbo3 --cache-type-v turbo3 with q8_0, and reducing --ctx-size to 32768. Stable for 2 h+ with a single runner, no errors, and +30 % generation throughput (49.6 vs 38.0 t/s) on the same hardware. This points strongly at the turboquant-fork-specific code path — likely the interaction between the turbo3 KV cache, prompt-cache checkpoint invalidation, and --n-cpu-moe expert offload.
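For comparison, the stable upstream launch, reconstructed from the flags above rather than copied from shell history (only the KV cache type and context size differ from the crashing configuration):

```bash
# Upstream ggml-org/llama.cpp master: q8_0 KV cache instead of turbo3, ctx 32768.
./build/bin/llama-server --model Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8085 --jinja --alias qwen3.6-35b \
  --ctx-size 32768 --n-gpu-layers 999 --n-cpu-moe 32 \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
  --batch-size 1024 --ubatch-size 512 --parallel 1 --threads 6 \
  --cont-batching --metrics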
Hypothesis: cublasSgemm_v2 is called with shape parameters (row_diff, src1_ncols, ne10, etc.) derived from a tensor descriptor that was valid before the checkpoint invalidations but stale after. cuBLAS rejects invalid stride/dim combinations as CUBLAS_STATUS_INVALID_VALUE. The race window is the slot's update_slots checkpoint reconciliation between dispatching graph compute and cuBLAS executing the matmul.
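To narrow the window, I can re-run with synchronous kernel launches and cuBLAS API logging enabled (standard CUDA/cuBLAS environment variables, nothing fork-specific), which should make the exact sgemm shape visible right at the abort:

```bash
# Serialize kernel launches and log every cuBLAS call with its parameters.
CUDA_LAUNCH_BLOCKING=1 \
CUBLAS_LOGINFO_DBG=1 CUBLAS_LOGDEST_DBG=stderr \
./build/bin/llama-server ...   # same launch flags as above
```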
I'm happy to test a fix on the same workload — can provide a coredump or full log on request.
First Bad Commit
Not bisected within the fork — only the merge commit 4f331667d9badcc71ab864f1e298591061d82050 on feature/turboquant-kv-cache was tested. Happy to bisect if helpful.
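If a bisect is wanted, the plan would be roughly the following sketch. It assumes the upstream merge-base is good, which matches the workaround result but has not been verified on the fork's history:

```bash
git bisect start
git bisect bad 4f331667d9badcc71ab864f1e298591061d82050
# Assumed-good endpoint: the last upstream commit merged into the feature branch.
git bisect good $(git merge-base 4f331667d9badcc71ab864f1e298591061d82050 origin/master)
```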
Relevant log output
Crash signature
slot update_slots: id 0 | task 77444 | restored context checkpoint (pos_min = 139, pos_max = 139, n_tokens = 140, n_past = 140, size = 62.813 MiB)
slot update_slots: id 0 | task 77444 | erased invalidated context checkpoint (pos_min = 913, pos_max = 913, n_tokens = 914, ...)
slot update_slots: id 0 | task 77444 | erased invalidated context checkpoint (pos_min = 1425, pos_max = 1425, n_tokens = 1426, ...)
slot update_slots: id 0 | task 77444 | prompt processing progress, n_tokens = 1164, batch.n_tokens = 1024, progress = 0.688350
CUDA error: an unsupported value or parameter was passed to the function
current device: 0, in function ggml_cuda_op_mul_mat_cublas at ggml/src/ggml-cuda/ggml-cuda.cu:1666
cublasSgemm_v2(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10,
&alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)
Backtrace (relevant frames):
ggml_abort
libggml-cuda.so.0 (ggml_cuda_op_mul_mat_cublas chain)
ggml_backend_sched_graph_compute_async
llama_context::graph_compute
llama_context::process_ubatch
llama_context::decode
llama_decode
A new request (srv params_from_: Chat format: peg-native) appears interleaved with the abort backtrace — suggesting another request was being scheduled at the moment of the failure.