Environment
- GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
- CUDA: toolkit 12.8 (driver 595.97, CUDA runtime 13.2)
- Compute capability: sm_12.0 (CMake auto-upgraded to sm_120a)
- Build: compiled with -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
- Commit: 2f2d44052b7d15c9c4dd6610f6e14a5f7b2d5f3f
- GGUF: antirez/deepseek-v4-gguf, file DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf (86.72 GB, SHA256 verified)
- OS: WSL2 Ubuntu 24.04 on Windows 11, CUDA toolkit 12.8 at /usr/local/cuda-12.8
What happens
The server boots and loads the model successfully (all 44 layers offloaded to GPU, 81,687 MiB). /health returns {"status":"ok"} and /v1/models returns the correct model entry. However, any inference request crashes the server on the first forward pass.
Crash output (q8_0 KV cache, -ctk q8_0 -ctv q8_0)
/home/agent/ai-agent/llama.cpp-deepseek-v4-flash/ggml/src/ggml-cuda/concat.cu:217:
GGML_ABORT("Invalid dim: %d", dim)
#3 ggml_cuda_op_concat(ggml_backend_cuda_context&, ggml_tensor*)
#4 ggml_cuda_compute_forward(ggml_backend_cuda_context&, ggml_tensor*)
#5 ggml_cuda_graph_evaluate_and_capture(...)
#6 ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*)
#7 ggml_backend_sched_graph_compute_async()
#8 llama_context::graph_compute(ggml_cgraph*, bool)
#9 llama_context::process_ubatch(...)
#10 llama_context::decode(llama_batch const&)
#11 llama_decode()
#12 server_context_impl::update_slots()
The crash occurs in ggml_cuda_op_concat at the switch(dim) statement: the V4-Flash graph passes a dim value outside the supported range 0–3 through the non-contiguous tensor path.
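To make the "supported range 0–3" concrete, here is a minimal host-side sketch of the shape rule a four-dimensional concat has to satisfy. Shape4, ConcatShape, and concat_shape are hypothetical names introduced for illustration only; this is not the code in concat.cu, and it assumes the four-extent tensor layout implied by the 0–3 range.

```cpp
#include <cstdint>

// Hypothetical stand-in for a tensor shape: four extents, ne[0] varies fastest.
struct Shape4 {
    std::int64_t ne[4];
};

// Result of the shape check: ok is false when the concat request is invalid.
struct ConcatShape {
    bool   ok;
    Shape4 shape;
};

// Computes the output shape of concat(a, b, dim).
// The request is rejected when dim is outside 0..3 or when the two inputs
// differ on any axis other than dim.
constexpr ConcatShape concat_shape(const Shape4& a, const Shape4& b, int dim) {
    if (dim < 0 || dim > 3) {
        return ConcatShape{false, Shape4{{0, 0, 0, 0}}};
    }
    Shape4 out = a;
    for (int d = 0; d < 4; ++d) {
        if (d == dim) {
            out.ne[d] = a.ne[d] + b.ne[d];  // extents add along the concat axis
        } else if (a.ne[d] != b.ne[d]) {
            return ConcatShape{false, Shape4{{0, 0, 0, 0}}};
        }
    }
    return ConcatShape{true, out};
}
```

Under these assumptions, concat_shape(a, b, 4) or concat_shape(a, b, -1) returns ok == false, which corresponds to the kind of request that reaches the "Invalid dim" abort above. The block has no main; it can be checked with g++ -std=c++14 -c.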
Secondary crash (f16 KV cache, -fa on, no -ctk/-ctv)
/home/agent/ai-agent/llama.cpp-deepseek-v4-flash/ggml/src/ggml-cuda/concat.cu:165:
GGML_ASSERT(src0->type == GGML_TYPE_F32) failed
concat.cu asserts all inputs are GGML_TYPE_F32, but V4-Flash passes f16 tensors to the concat operation when running without KV quantization.
Root cause analysis
ggml_cuda_op_concat() is F32-only: it asserts src0->type == GGML_TYPE_F32 at lines 163–165, and its non-contiguous branch only handles dim 0–3. The DeepSeek V4 Flash architecture requires concat operations on non-F32 tensors and/or with dimension values outside this range.
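The two guards can be modelled as follows, to show why the two KV cache configurations fail at different lines. This is a sketch under stated assumptions, not the actual concat.cu source: ElemType, ConcatStatus, and check_concat are hypothetical names, and the guard order (type assertion first, dim switch second) is inferred from the line numbers in the crash output.

```cpp
// Hypothetical stand-ins; these are not the real ggml enums or functions.
enum class ElemType { F32, F16, Q8_0 };
enum class ConcatStatus { Ok, NonF32Input, InvalidDim };

// Applies the two guards in the order the reported line numbers suggest:
//   1. the F32 type assertion (reported at concat.cu:165)
//   2. the dim switch in the non-contiguous branch (reported at concat.cu:217)
constexpr ConcatStatus check_concat(ElemType src0, ElemType src1, int dim) {
    if (src0 != ElemType::F32 || src1 != ElemType::F32) {
        return ConcatStatus::NonF32Input;  // corresponds to the GGML_ASSERT failure
    }
    if (dim < 0 || dim > 3) {
        return ConcatStatus::InvalidDim;   // corresponds to GGML_ABORT("Invalid dim")
    }
    return ConcatStatus::Ok;
}
```

In this model, an f16 input fails the first guard whatever dim is, which would match the f16 run stopping at line 165. The q8_0 run reaching line 217 would mean its concat inputs passed the type guard and failed only on dim.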
Workarounds tried — all fail
| Configuration | Result |
| --- | --- |
| -fa on -ctk q8_0 -ctv q8_0 | CRASH: Invalid dim at concat.cu:217 |
| -fa off | FAIL at load: "V cache quantization requires flash_attn" |
| -fa on (no KV quant, f16 default) | CRASH: F32 assertion at concat.cu:165 |
Launch command (minimal repro)
./build/bin/llama-server \
-m DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf \
-a deepseek-v4-flash --host 127.0.0.1 --port 8009 \
-c 4096 -ngl all -sm none -mg 0 -fa on \
-ctk q8_0 -ctv q8_0 --no-mmap -np 1
# Server starts and loads model. Send any /v1/chat/completions request → crash.
Notes
- Model loads fully into GPU VRAM (no CPU offload)
- /health and /v1/models respond correctly before first inference
- The T01.5 --list-devices output confirms CUDA is found: CUDA0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (97886 MiB)
- ggml-cuda.cu shows that the concat op dispatches to ggml_cuda_op_concat, which is the F32-only implementation