ggml_cuda_op_concat crash on first inference — Blackwell sm_120a, IQ2_XXS GGUF

## Environment

- **GPU:** NVIDIA RTX PRO 6000 Blackwell Workstation Edition
- **CUDA:** 12.8 (driver 595.97, CUDA runtime 13.2)
- **Compute capability:** sm_12.0 (CMake auto-upgraded to sm_120a)
- **Build:** compiled with `-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120`
- **Commit:** `2f2d44052b7d15c9c4dd6610f6e14a5f7b2d5f3f`
- **GGUF:** `antirez/deepseek-v4-gguf` — `DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf` (86.72 GB, SHA256 verified)
- **OS:** WSL2 Ubuntu 24.04 on Windows 11, CUDA toolkit 12.8 at `/usr/local/cuda-12.8`

## What happens

The server boots and loads the model successfully (all 44 layers offloaded to GPU, 81,687 MiB). `/health` returns `{"status":"ok"}`, `/v1/models` returns the correct model entry. However, **any inference request crashes the server** on the first forward pass.

## Crash output (q8_0 KV cache, the default)

```
/home/agent/ai-agent/llama.cpp-deepseek-v4-flash/ggml/src/ggml-cuda/concat.cu:217:
GGML_ABORT("Invalid dim: %d", dim)

#3  ggml_cuda_op_concat(ggml_backend_cuda_context&, ggml_tensor*)
#4  ggml_cuda_compute_forward(ggml_backend_cuda_context&, ggml_tensor*)
#5  ggml_cuda_graph_evaluate_and_capture(...)
#6  ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*)
#7  ggml_backend_sched_graph_compute_async()
#8  llama_context::graph_compute(ggml_cgraph*, bool)
#9  llama_context::process_ubatch(...)
#10 llama_context::decode(llama_batch const&)
#11 llama_decode()
#12 server_context_impl::update_slots()
```

The crash occurs in `ggml_cuda_op_concat` at the `switch(dim)` statement — the V4-Flash graph is passing a `dim` value outside the supported range 0–3 via the non-contiguous tensor path.

## Secondary crash (f16 KV cache, `-fa on`, no `-ctk/-ctv`)

```
/home/agent/ai-agent/llama.cpp-deepseek-v4-flash/ggml/src/ggml-cuda/concat.cu:165:
GGML_ASSERT(src0->type == GGML_TYPE_F32) failed
```

`concat.cu` asserts all inputs are `GGML_TYPE_F32`, but V4-Flash passes f16 tensors to the concat operation when running without KV quantization.

## Root cause analysis

`ggml_cuda_op_concat()` is F32-only — it asserts `src0->type == GGML_TYPE_F32` at line 163–165 and its non-contiguous branch only handles `dim` 0–3. The DeepSeek V4 Flash architecture requires concat operations on non-F32 tensors and/or with dimension values outside this range.

## Workarounds tried — all fail

| Configuration | Result |
|---|---|
| `-fa on -ctk q8_0 -ctv q8_0` | CRASH: Invalid dim at concat.cu:217 |
| `-fa off` | FAIL at load: "V cache quantization requires flash_attn" |
| `-fa on` (no KV quant, f16 default) | CRASH: F32 assertion at concat.cu:165 |

## Launch command (minimal repro)

```bash
./build/bin/llama-server \
  -m DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf \
  -a deepseek-v4-flash --host 127.0.0.1 --port 8009 \
  -c 4096 -ngl all -sm none -mg 0 -fa on \
  -ctk q8_0 -ctv q8_0 --no-mmap -np 1
# Server starts and loads model. Send any /v1/chat/completions request → crash.
```

## Notes

- Model loads fully into GPU VRAM (no CPU offload)
- `/health` and `/v1/models` respond correctly before first inference
- The T01.5 `--list-devices` output confirms CUDA is found: `CUDA0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (97886 MiB)`
- `ggml-cuda.cu` shows the concat op dispatches to `ggml_cuda_op_concat` which is the F32-only implementation

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ggml_cuda_op_concat crash on first inference — Blackwell sm_120a, IQ2_XXS GGUF #6

Environment

What happens

Crash output (q8_0 KV cache, the default)

Secondary crash (f16 KV cache, `-fa on`, no `-ctk/-ctv`)

Root cause analysis

Workarounds tried — all fail

Launch command (minimal repro)

Notes

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Configuration	Result
`-fa on -ctk q8_0 -ctv q8_0`	CRASH: Invalid dim at concat.cu:217
`-fa off`	FAIL at load: "V cache quantization requires flash_attn"
`-fa on` (no KV quant, f16 default)	CRASH: F32 assertion at concat.cu:165

ggml_cuda_op_concat crash on first inference — Blackwell sm_120a, IQ2_XXS GGUF #6

Description

Environment

What happens

Crash output (q8_0 KV cache, the default)

Secondary crash (f16 KV cache, -fa on, no -ctk/-ctv)

Root cause analysis

Workarounds tried — all fail

Launch command (minimal repro)

Notes

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Secondary crash (f16 KV cache, `-fa on`, no `-ctk/-ctv`)