fix: two x86_64 Linux inference crashes (sched_reserve assert + CUDA concat dispatch)#7

Open
randomsamples wants to merge 2 commits into antirez:main from randomsamples:fix/x86-linux-inference-crashes
Conversation

@randomsamples

Summary

Two one-line fixes that make the server actually run on x86_64 Linux. Before these patches the server aborts on every run, so the model was completely unusable on Linux.

Tested on: Ubuntu 24.04, x86_64, GCC 13.3, CUDA 12.1, RTX 3090, DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf. Both crashes reproduced 100% of the time before the fixes.


Bug 1 — sched_reserve assert when n_ctx <= n_batch

File: src/llama-context.cpp

sched_reserve() picks a synthetic start position reserve_pos0 to size the "resumed decode" graph. When n_ctx <= n_tokens (equivalently n_ctx <= n_batch, since the dry-run batch is full-size) the fallback was n_tokens, placing the dry-run batch at positions [n_tokens .. 2*n_tokens-1], entirely outside the KV cache window [0 .. n_ctx-1].

Inside llm_build_deepseek4 the decode path then computed n_comp_visible = 2*n_ctx/ratio vs n_comp_cache = n_ctx/ratio, triggering:

GGML_ASSERT(n_comp_visible <= n_comp_cache) failed

common_params_fit_impl() probes n_ctx = n_batch as its first candidate, so this assertion fires unconditionally before inference ever starts.

Fix: use 0 as the fallback. When n_ctx == n_batch a "resumed" full-batch decode is impossible anyway — prefill-from-zero is the only valid graph shape.


Bug 2 — CUDA GGML_OP_CONCAT dispatch mismatch

File: ggml/src/ggml-cuda/ggml-cuda.cu

ggml_backend_cuda_device_supports_op() returned true for GGML_OP_CONCAT with any type except I32/I16. But ggml_cuda_op_concat() asserts src0->type == GGML_TYPE_F32. DeepSeek V4 attention state tensors are non-F32, so the backend scheduler dispatched them to CUDA and the kernel aborted on the first decode step:

concat.cu:165: GGML_ASSERT(src0->type == GGML_TYPE_F32) failed

Fix: restrict supports_op for GGML_OP_CONCAT to F32 only, matching what the kernel actually implements.
