
ggml_cuda_op_concat crash on first inference — Blackwell sm_120a, IQ2_XXS GGUF #6

Description

@louisamystudio

Environment

  • GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
  • CUDA: 12.8 (driver 595.97, CUDA runtime 13.2)
  • Compute capability: 12.0 (sm_120; CMake auto-upgraded to sm_120a)
  • Build: compiled with -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
  • Commit: 2f2d44052b7d15c9c4dd6610f6e14a5f7b2d5f3f
  • GGUF: antirez/deepseek-v4-gguf, file DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf (86.72 GB, SHA256 verified)
  • OS: WSL2 Ubuntu 24.04 on Windows 11, CUDA toolkit 12.8 at /usr/local/cuda-12.8

What happens

The server boots and loads the model successfully (all 44 layers offloaded to GPU, 81,687 MiB). /health returns {"status":"ok"}, /v1/models returns the correct model entry. However, any inference request crashes the server on the first forward pass.

Crash output (q8_0 KV cache, -ctk q8_0 -ctv q8_0)

/home/agent/ai-agent/llama.cpp-deepseek-v4-flash/ggml/src/ggml-cuda/concat.cu:217:
GGML_ABORT("Invalid dim: %d", dim)

#3  ggml_cuda_op_concat(ggml_backend_cuda_context&, ggml_tensor*)
#4  ggml_cuda_compute_forward(ggml_backend_cuda_context&, ggml_tensor*)
#5  ggml_cuda_graph_evaluate_and_capture(...)
#6  ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*)
#7  ggml_backend_sched_graph_compute_async()
#8  llama_context::graph_compute(ggml_cgraph*, bool)
#9  llama_context::process_ubatch(...)
#10 llama_context::decode(llama_batch const&)
#11 llama_decode()
#12 server_context_impl::update_slots()

The crash occurs in ggml_cuda_op_concat at the switch(dim) statement: the V4-Flash graph reaches the non-contiguous tensor path with a dim value outside the supported range 0–3, so the switch falls through to GGML_ABORT.
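To narrow down which concat node carries the bad value, a small pre-dispatch check could report the reason before the abort fires. The following is a minimal sketch, not ggml code: check_concat and kMaxDims are hypothetical names, and the assumption is that the shapes and the dim value seen by ggml_cuda_op_concat are available at the call site.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumption: tensors have at most 4 dimensions, as the supported range 0-3 implies.
constexpr int kMaxDims = 4;

// Hypothetical diagnostic (not part of ggml): returns an empty string when a
// concat of shapes ne_a and ne_b along `dim` is well-formed, otherwise a short
// description of the first problem found.
std::string check_concat(const int64_t (&ne_a)[kMaxDims],
                         const int64_t (&ne_b)[kMaxDims],
                         int dim) {
    // The concat axis must name one of the existing dimensions.
    if (dim < 0 || dim >= kMaxDims) {
        return "invalid dim " + std::to_string(dim);
    }
    // Every axis other than the concat axis must have matching extents.
    for (int i = 0; i < kMaxDims; ++i) {
        if (i != dim && ne_a[i] != ne_b[i]) {
            return "shape mismatch on axis " + std::to_string(i);
        }
    }
    return "";
}
```

Logging the returned string together with the tensor name just before the switch(dim) would show whether the value is garbage (for example an uninitialized parameter) or a deliberate axis the CUDA path does not expect.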

Secondary crash (f16 KV cache, -fa on, no -ctk/-ctv)

/home/agent/ai-agent/llama.cpp-deepseek-v4-flash/ggml/src/ggml-cuda/concat.cu:165:
GGML_ASSERT(src0->type == GGML_TYPE_F32) failed

concat.cu asserts all inputs are GGML_TYPE_F32, but V4-Flash passes f16 tensors to the concat operation when running without KV quantization.

Root cause analysis

ggml_cuda_op_concat() is F32-only: it asserts src0->type == GGML_TYPE_F32 at lines 163–165 of concat.cu, and its non-contiguous branch only handles dim 0–3. The DeepSeek V4 Flash graph issues concat operations on non-F32 tensors (f16 KV cache) and, with a quantized KV cache, reaches the non-contiguous branch with a dim value outside this range.
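For non-quantized element types, concat does not depend on the element type beyond its size, which suggests the F32 restriction could in principle be lifted for f16. The host-side sketch below illustrates the idea under stated assumptions: concat_bytes is a hypothetical helper (not a ggml function), both inputs are contiguous with ne[0] as the fastest-varying dimension, and block-quantized types such as q8_0 are out of scope.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not part of ggml): concatenates two contiguous 4-D
// tensors along `dim` by copying raw bytes, so it works for any fixed-size
// element type (f32, f16 bit patterns, ...). Returns an empty vector when
// `dim` is out of range or the non-concat extents differ.
std::vector<uint8_t> concat_bytes(const uint8_t * a, const int64_t (&ne_a)[4],
                                  const uint8_t * b, const int64_t (&ne_b)[4],
                                  int dim, size_t elem_size) {
    assert(elem_size > 0);
    if (dim < 0 || dim > 3) {
        return {};
    }
    for (int i = 0; i < 4; ++i) {
        if (i != dim && ne_a[i] != ne_b[i]) {
            return {};
        }
    }
    // Bytes contributed by each input per outer index: all dims up to `dim`.
    size_t chunk_a = elem_size;
    size_t chunk_b = elem_size;
    for (int i = 0; i <= dim; ++i) {
        chunk_a *= static_cast<size_t>(ne_a[i]);
        chunk_b *= static_cast<size_t>(ne_b[i]);
    }
    // Number of outer indices: product of the dims above `dim`.
    size_t outer = 1;
    for (int i = dim + 1; i < 4; ++i) {
        outer *= static_cast<size_t>(ne_a[i]);
    }
    std::vector<uint8_t> dst(outer * (chunk_a + chunk_b));
    for (size_t o = 0; o < outer; ++o) {
        uint8_t * out = dst.data() + o * (chunk_a + chunk_b);
        std::memcpy(out, a + o * chunk_a, chunk_a);            // slice of a
        std::memcpy(out + chunk_a, b + o * chunk_b, chunk_b);  // then slice of b
    }
    return dst;
}
```

A CUDA version would replace the memcpy pair with device-side copies or a kernel templated on element size. Whether that is the right fix here depends on why the V4-Flash graph emits these concat nodes in the first place.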

Workarounds tried — all fail

| Configuration | Result |
| --- | --- |
| -fa on -ctk q8_0 -ctv q8_0 | CRASH: Invalid dim at concat.cu:217 |
| -fa off | FAIL at load: "V cache quantization requires flash_attn" |
| -fa on (no KV quant, f16 default) | CRASH: F32 assertion at concat.cu:165 |

Launch command (minimal repro)

./build/bin/llama-server \
  -m DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf \
  -a deepseek-v4-flash --host 127.0.0.1 --port 8009 \
  -c 4096 -ngl all -sm none -mg 0 -fa on \
  -ctk q8_0 -ctv q8_0 --no-mmap -np 1
# Server starts and loads model. Send any /v1/chat/completions request → crash.

Notes

  • Model loads fully into GPU VRAM (no CPU offload)
  • /health and /v1/models respond correctly before first inference
  • The T01.5 --list-devices output confirms CUDA is found: CUDA0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (97886 MiB)
  • ggml-cuda.cu shows that the concat op dispatches to ggml_cuda_op_concat, which is the F32-only implementation
