fix: two x86_64 Linux inference crashes (sched_reserve assert + CUDA concat dispatch)#7

Open
randomsamples wants to merge 2 commits into antirez:main from randomsamples:fix/x86-linux-inference-crashes
Conversation

@randomsamples

Summary

Two one-line fixes that make the server actually run on x86_64 Linux. Before these patches the server aborts on every run, so the model was completely unusable on Linux.

Tested on: Ubuntu 24.04, x86_64, GCC 13.3, CUDA 12.1, RTX 3090, DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf. Both crashes reproduced 100% of the time before the fixes.


Bug 1 — sched_reserve assert when n_ctx <= n_batch

File: src/llama-context.cpp

sched_reserve() picks a synthetic start position reserve_pos0 to size the "resumed decode" graph. When n_ctx <= n_tokens (equivalently n_ctx <= n_batch, since the dry-run batch is full-size) the fallback was n_tokens, placing the dry-run batch at positions [n_tokens .. 2*n_tokens-1], entirely outside the KV cache window [0 .. n_ctx-1].

Inside llm_build_deepseek4 the decode path then computed n_comp_visible = 2*n_ctx/ratio vs n_comp_cache = n_ctx/ratio, triggering:

GGML_ASSERT(n_comp_visible <= n_comp_cache) failed

common_params_fit_impl() probes n_ctx = n_batch as its first candidate, so this assertion fires unconditionally before inference ever starts.

Fix: use 0 as the fallback. When n_ctx == n_batch a "resumed" full-batch decode is impossible anyway — prefill-from-zero is the only valid graph shape.


Bug 2 — CUDA GGML_OP_CONCAT dispatch mismatch

File: ggml/src/ggml-cuda/ggml-cuda.cu

ggml_backend_cuda_device_supports_op() returned true for GGML_OP_CONCAT with any type except I32/I16. But ggml_cuda_op_concat() asserts src0->type == GGML_TYPE_F32. DeepSeek V4 attention state tensors are non-F32, so the backend scheduler dispatched them to CUDA and the kernel aborted on the first decode step:

concat.cu:165: GGML_ASSERT(src0->type == GGML_TYPE_F32) failed

Fix: restrict supports_op for GGML_OP_CONCAT to F32 only, matching what the kernel actually implements.
