fix: support non-F32 quantized types in CUDA concat op #4
Conversation
Remove hardcoded F32 assertions in ggml_cuda_op_concat. Add byte-level cudaMemcpy path for contiguous quantized tensors (dim 1/2/3). Fix hardcoded /4 float offset to use ggml_nbytes(). Enables running DeepSeek V4 Flash quantized GGUF on NVIDIA CUDA.
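The assertion change boils down to something like the sketch below. This is paraphrased, not the exact diff, and the precise set of checks is an assumption: the F32-only requirement is replaced with a type-consistency check, which is the invariant the byte-level copy relies on.

```cpp
// Paraphrased sketch of the type-check change in ggml_cuda_op_concat; not the exact diff.
//
// Before: the op hard-required F32, e.g.
//   GGML_ASSERT(src0->type == GGML_TYPE_F32);
//
// After: any type is accepted as long as the inputs and the output agree,
// which is what the byte-level cudaMemcpy path relies on.
GGML_ASSERT(src0->type == dst->type);
GGML_ASSERT(src1->type == dst->type);
```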
hello @cdome94 thank you for the patch! Downloaded, tested and it's working, but I'm getting 1-2 t/s on the same HW as yours; may I ask what parameters you are passing to llama? thank you!
Hi! Glad it's working for you. Here are the parameters I'm using:

```
./build/bin/llama-server \
  -m DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf \
  -ngl 999 \
  -c 65536 \
  --host 0.0.0.0 \
  --port 8080
```

I'm getting around 8-12 t/s on the GB10 with this setup. A few things that might affect speed:

- What context length are you using?
hey, thanks for the reply - I tried decreasing the context and it's all on the GPU, but still no luck, not many tokens/s; the setup is the same, a DGX Spark.
Overview
ggml_cuda_op_concat crashed with GGML_ASSERT(src0->type == GGML_TYPE_F32) when running DeepSeek V4 Flash quantized GGUF models on NVIDIA CUDA, making it impossible to use -ngl on any NVIDIA GPU.

Root causes:

- Hardcoded F32 assertions on the source and destination tensor types
- Offsets computed with / 4 (sizeof float) instead of ggml_nbytes()

Fix:

- src0->type == dst->type consistency check
- Byte-level cudaMemcpy path for contiguous quantized tensors along dim 1/2/3 (see the sketch below)
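A minimal sketch of what such a byte-level copy path can look like for the contiguous case. The helper name, the dim parameter, and the way the stream is passed in are assumptions for illustration, not the literal patch; the point is that sizes and offsets are computed in bytes via ggml_nbytes() rather than in float counts.

```cpp
// Sketch only: byte-level copy path for concatenating two contiguous tensors of
// the same (possibly quantized) type along dim 1, 2 or 3.
#include <cuda_runtime.h>
#include "ggml.h"

static void concat_contiguous_bytes(const struct ggml_tensor * src0,
                                    const struct ggml_tensor * src1,
                                    struct ggml_tensor * dst,
                                    int dim, cudaStream_t stream) {
    GGML_ASSERT(dim >= 1 && dim <= 3);
    GGML_ASSERT(src0->type == dst->type && src1->type == dst->type);
    GGML_ASSERT(ggml_is_contiguous(src0) && ggml_is_contiguous(src1) && ggml_is_contiguous(dst));

    // Number of "outer" slices above the concat dimension. Within one outer slice,
    // the data of dims 0..dim is a single contiguous run of bytes.
    int64_t n_outer = 1;
    for (int i = dim + 1; i < GGML_MAX_DIMS; ++i) {
        n_outer *= dst->ne[i];
    }

    // Size everything in bytes via ggml_nbytes() instead of float counts (/ 4),
    // so quantized block types (Q2_K, IQ2_XXS, ...) are copied intact.
    const size_t chunk0 = ggml_nbytes(src0) / (size_t) n_outer;
    const size_t chunk1 = ggml_nbytes(src1) / (size_t) n_outer;

    const char * p0 = (const char *) src0->data;
    const char * p1 = (const char *) src1->data;
    char       * pd = (char       *) dst->data;

    // For each outer slice, dst holds src0's bytes followed by src1's bytes.
    // (Error checking of the CUDA calls is omitted for brevity.)
    for (int64_t k = 0; k < n_outer; ++k) {
        cudaMemcpyAsync(pd,          p0, chunk0, cudaMemcpyDeviceToDevice, stream);
        cudaMemcpyAsync(pd + chunk0, p1, chunk1, cudaMemcpyDeviceToDevice, stream);
        p0 += chunk0;
        p1 += chunk1;
        pd += chunk0 + chunk1;
    }
}
```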
Tested on: NVIDIA GB10 (122 GB unified memory), DeepSeek V4 Flash IQ2XXS-w2Q2K-AProjQ8-SExpQ8 quantization.

Additional information
Related discussion: ggml-org/llama.cpp#22376