fix: support non-F32 quantized types in CUDA concat op #4
Conversation
Remove hardcoded F32 assertions in ggml_cuda_op_concat. Add byte-level cudaMemcpy path for contiguous quantized tensors (dim 1/2/3). Fix hardcoded /4 float offset to use ggml_nbytes(). Enables running DeepSeek V4 Flash quantized GGUF on NVIDIA CUDA.
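The assertion change boils down to something like the sketch below. This is paraphrased, not the exact diff, and the precise set of checks is an assumption: the F32-only requirement is replaced with a type-consistency check, which is the invariant the byte-level copy relies on.

```cpp
// Paraphrased sketch of the type-check change in ggml_cuda_op_concat; not the exact diff.
//
// Before: the op hard-required F32, e.g.
//   GGML_ASSERT(src0->type == GGML_TYPE_F32);
//
// After: any type is accepted as long as the inputs and the output agree,
// which is what the byte-level cudaMemcpy path relies on.
GGML_ASSERT(src0->type == dst->type);
GGML_ASSERT(src1->type == dst->type);
```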
hello @cdome94 thank you for the patch! Downloaded, tested and it's working, but I'm getting 1-2 t/s on the same HW as yours; may I ask what parameters you are passing to llama? thank you!
Hi! Glad it's working for you. Here are the parameters I'm using:

```
./build/bin/llama-server \
  -m DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf \
  -ngl 999 \
  -c 65536 \
  --host 0.0.0.0 \
  --port 8080
```

I'm getting around 8-12 t/s on the GB10 with this setup. A few things that might affect speed:

- What context length are you using?
hey, thanks for the reply - I tried decreasing the context and it's all on the GPU, but still no luck, not many tokens/s; the setup is the same, a DGX Spark.
Overview
ggml_cuda_op_concat crashed with GGML_ASSERT(src0->type == GGML_TYPE_F32) when running DeepSeek V4 Flash quantized GGUF models on NVIDIA CUDA, making it impossible to use -ngl on any NVIDIA GPU.

Root causes:

- Hardcoded F32 assertions on the source and destination tensor types
- Offsets computed with / 4 (sizeof float) instead of ggml_nbytes()

Fix:

- src0->type == dst->type consistency check
- Byte-level cudaMemcpy path for contiguous quantized tensors along dim 1/2/3 (see the sketch below)
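A minimal sketch of what such a byte-level copy path can look like for the contiguous case. The helper name, the dim parameter, and the way the stream is passed in are assumptions for illustration, not the literal patch; the point is that sizes and offsets are computed in bytes via ggml_nbytes() rather than in float counts.

```cpp
// Sketch only: byte-level copy path for concatenating two contiguous tensors of
// the same (possibly quantized) type along dim 1, 2 or 3.
#include <cuda_runtime.h>
#include "ggml.h"

static void concat_contiguous_bytes(const struct ggml_tensor * src0,
                                    const struct ggml_tensor * src1,
                                    struct ggml_tensor * dst,
                                    int dim, cudaStream_t stream) {
    GGML_ASSERT(dim >= 1 && dim <= 3);
    GGML_ASSERT(src0->type == dst->type && src1->type == dst->type);
    GGML_ASSERT(ggml_is_contiguous(src0) && ggml_is_contiguous(src1) && ggml_is_contiguous(dst));

    // Number of "outer" slices above the concat dimension. Within one outer slice,
    // the data of dims 0..dim is a single contiguous run of bytes.
    int64_t n_outer = 1;
    for (int i = dim + 1; i < GGML_MAX_DIMS; ++i) {
        n_outer *= dst->ne[i];
    }

    // Size everything in bytes via ggml_nbytes() instead of float counts (/ 4),
    // so quantized block types (Q2_K, IQ2_XXS, ...) are copied intact.
    const size_t chunk0 = ggml_nbytes(src0) / (size_t) n_outer;
    const size_t chunk1 = ggml_nbytes(src1) / (size_t) n_outer;

    const char * p0 = (const char *) src0->data;
    const char * p1 = (const char *) src1->data;
    char       * pd = (char       *) dst->data;

    // For each outer slice, dst holds src0's bytes followed by src1's bytes.
    // (Error checking of the CUDA calls is omitted for brevity.)
    for (int64_t k = 0; k < n_outer; ++k) {
        cudaMemcpyAsync(pd,          p0, chunk0, cudaMemcpyDeviceToDevice, stream);
        cudaMemcpyAsync(pd + chunk0, p1, chunk1, cudaMemcpyDeviceToDevice, stream);
        p0 += chunk0;
        p1 += chunk1;
        pd += chunk0 + chunk1;
    }
}
```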
Tested on: NVIDIA GB10 (122 GB unified memory), DeepSeek V4 Flash IQ2XXS-w2Q2K-AProjQ8-SExpQ8 quantization.

Additional information
Related discussion: ggml-org/llama.cpp#22376