ggml: add DeepSeek V4 hyperconnection + KV ops (CPU) #23122
|
@cchuter is this complete for CPU inference? Where can I download gguf to test? |
|
@segmond No, these are just the backend ops which are needed for the model support in CPU version. @cchuter actually when adding new ops for models, we tend to prefer PRs with the model support added in as well, as otherwise there's no simple way to test whether the operation implementation is correct. |
Five new ggml ops for DeepSeek-V4-Flash with CPU reference implementations and test-backend-ops coverage: DSV4_HC_SPLIT_SINKHORN, DSV4_HC_WEIGHTED_SUM, DSV4_HC_EXPAND, DSV4_FP8_KV_QUANTIZE, DSV4_ROPE_TAIL. CPU is the reference backend. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DeepSeek-V4-Flash model: graph (src/models/deepseek4.cpp), arch / hparams / model-loader wiring, the dsv4_* compressed-KV extension to llama_memory_hybrid_iswa, GGUF conversion (conversion/deepseek.py + constants/writer keys), and the V4 chat template. Standard build_attn_mha attention path; no DeepSeek Sparse Attention. Exercises the DSV4 ops from the preceding commit so they are testable end-to-end on CPU. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
97704a6 to
d99045f
Compare
|
@segmond, GGUFs are here: https://huggingface.co/teamblobfish/DeepSeek-V4-Flash-GGUF @pwilkin, I've updated the PR to add a full CPU end-to-end test. The model loads and answers the test questions. I was a little overaggressive in splitting up the PRs. Here's the plan: This PR 1: base ggml and CPU support. PRs 4 and 5 are not done yet and might prove challenging. They will also require new GGUFs (which I will make). The full working code base of PRs 1-3 is here: https://github.com/cchuter/llama.cpp/tree/v4-clean-base Before all those PRs are merged I don't expect a performant DeepSeek4, but I do expect a working one at all stages. Right now you should peak around 20-30 t/s generation on GPUs |
ngxson left a comment
No good feeling about this PR. I assume backend maintainers won't be happy to accept this as-is, but let's wait to see what other maintainers say about this.
It seems like most ops added in this PR can be implemented in another way.
struct ggml_context * ctx,
struct ggml_tensor * x,
struct ggml_tensor * weights);
|
|
Why not use ggml_mul, ggml_add and ggml_sum_rows?
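To make the reviewer's point concrete, here is a minimal plain-C sketch (no ggml dependency) showing that a weighted sum over streams is the same computation as a broadcast multiply followed by a reduction. The memory layout, the function names, and the `n_embd` / `n_hc` parameters are assumptions for illustration and are not taken from the PR; whether the decomposition maps cleanly onto existing ggml ops depends on the real tensor layout.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout (hypothetical, not from the PR): x holds n_hc streams of
 * n_embd floats each, stored one stream after another; weights holds n_hc
 * floats; out holds n_embd floats. */

/* Fused form: out[i] = sum over h of weights[h] * x[h][i]. */
void hc_weighted_sum_fused(const float *x, const float *weights,
                           float *out, size_t n_embd, size_t n_hc) {
    assert(x && weights && out);
    for (size_t i = 0; i < n_embd; i++) {
        float acc = 0.0f;
        for (size_t h = 0; h < n_hc; h++) {
            acc += weights[h] * x[h * n_embd + i];
        }
        out[i] = acc;
    }
}

/* Decomposed form: a broadcast multiply into a scratch buffer, followed by
 * a separate reduction over the stream axis. */
void hc_weighted_sum_decomposed(const float *x, const float *weights,
                                float *scratch, float *out,
                                size_t n_embd, size_t n_hc) {
    assert(x && weights && scratch && out);
    /* Step 1: scale every stream by its weight. */
    for (size_t h = 0; h < n_hc; h++) {
        for (size_t i = 0; i < n_embd; i++) {
            scratch[h * n_embd + i] = weights[h] * x[h * n_embd + i];
        }
    }
    /* Step 2: sum the scaled streams element by element. */
    for (size_t i = 0; i < n_embd; i++) {
        float acc = 0.0f;
        for (size_t h = 0; h < n_hc; h++) {
            acc += scratch[h * n_embd + i];
        }
        out[i] = acc;
    }
}
```

The decomposed form needs an intermediate buffer the size of the input, which is the usual trade-off a dedicated fused op avoids.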
|
|
// DeepSeek V4 hyperconnection expand helper.
// Computes post * block_out + comb^T @ residual for each token.
GGML_API struct ggml_tensor * ggml_dsv4_hc_expand(
What about ggml_mul, ggml_add and ggml_mul_mat?
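For reference, the formula in the header comment, post * block_out + comb^T @ residual, can be written out for a single token as the plain-C sketch below. The shapes are assumptions made only for this example (`post` has `n_hc` entries, `block_out` has `n_embd` entries, `comb` is an `n_hc` x `n_hc` row-major matrix, `residual` and `out` hold `n_hc` streams of `n_embd` floats); the function name is hypothetical and this is not the PR's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Single-token sketch of: out = post * block_out + comb^T @ residual.
 * Assumed shapes (hypothetical):
 *   block_out : [n_embd]
 *   post      : [n_hc]
 *   comb      : [n_hc][n_hc], row-major, comb[g][h]
 *   residual  : [n_hc][n_embd]
 *   out       : [n_hc][n_embd]
 */
void hc_expand_token(const float *block_out, const float *post,
                     const float *comb, const float *residual,
                     float *out, size_t n_embd, size_t n_hc) {
    assert(block_out && post && comb && residual && out);
    for (size_t h = 0; h < n_hc; h++) {
        for (size_t i = 0; i < n_embd; i++) {
            /* Outer-product term: post[h] * block_out[i]. */
            float acc = post[h] * block_out[i];
            /* (comb^T @ residual)[h][i] = sum over g of comb[g][h] * residual[g][i]. */
            for (size_t g = 0; g < n_hc; g++) {
                acc += comb[g * n_hc + h] * residual[g * n_embd + i];
            }
            out[h * n_embd + i] = acc;
        }
    }
}
```

Written this way, the first term is a broadcast multiply, the second is a small matrix product, and the two are joined by an add, which is the decomposition the reviewer is asking about.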
// DeepSeek V4 partial RoPE helper.
// Leaves the non-RoPE prefix unchanged and applies RoPE to the tail,
// matching ggml_concat(prefix, ggml_rope_ext(tail)).
GGML_API struct ggml_tensor * ggml_dsv4_rope_tail(
You must have skipped reading the tips and tricks in https://github.com/ggml-org/llama.cpp/blob/master/docs/development/HOWTO-add-model.md
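The partial-RoPE behaviour described in the header comment of this thread (prefix left untouched, rotation applied only to the tail) can be sketched for a single vector in plain C as follows. To keep the example free of any frequency-schedule assumptions, the caller passes precomputed cosine and sine values per rotated pair. The adjacent-pair convention, the function name, and all parameter names are assumptions for illustration; the actual op may pair elements differently.

```c
#include <assert.h>
#include <stddef.h>

/* Rotate only the last n_rot elements of x in place, leaving the first
 * (n_dims - n_rot) elements unchanged.
 * Assumptions (hypothetical): elements are rotated in adjacent pairs, and
 * cos_tab / sin_tab hold one precomputed value per pair (n_rot / 2 each). */
void rope_tail(float *x, size_t n_dims, size_t n_rot,
               const float *cos_tab, const float *sin_tab) {
    assert(x && cos_tab && sin_tab);
    assert(n_rot <= n_dims && n_rot % 2 == 0);

    float *tail = x + (n_dims - n_rot);  /* start of the rotated region */
    for (size_t k = 0; k < n_rot / 2; k++) {
        const float c  = cos_tab[k];
        const float s  = sin_tab[k];
        const float x0 = tail[2 * k];
        const float x1 = tail[2 * k + 1];
        /* Standard 2D rotation of the pair (x0, x1) by the pair's angle. */
        tail[2 * k]     = x0 * c - x1 * s;
        tail[2 * k + 1] = x0 * s + x1 * c;
    }
}
```

Because the prefix is never read or written, the result matches splitting the vector, rotating the tail, and concatenating the two parts again.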
|
Did you just copy this straight from antirez's version? https://github.com/antirez/llama.cpp-deepseek-v4-flash/blob/main/ggml/include/ggml.h Please note that dishonesty about the code's origin will result in being banned from the project. |
|
@ngxson you're right, thanks for pushing on it. This is derived from antirez's llama.cpp-deepseek-v4-flash (which builds on fairydreaming's and the ggml/llama.cpp lineage). The V4 ops and model are his; my part is CUDA/multi-GPU, a DSA-free rebase onto current master, and GGUF packaging. I should have credited that in the PR from the start and didn't. That's on me. I have contacted @antirez and @fairydreaming to let them know I'm building on top of their work and crediting. They've both moved on to other things. |
|
Happy to restructure, or close, if you'd rather. I'm not aware of the history of this work |
|
Since your code contains many lines that are directly copied from antirez's work, we require explicit agreement from the original author to proceed. |
|
Hi, I agree with taking the code I developed and making it part of llama.cpp or any other compatibly-licensed project. I don't ask for any credit. Have fun hacking LLMs! :) |
|
I checked out your v4-clean-base branch and gave it a go; it still needs more work. I couldn't build it because GGML_OP_COUNT wasn't updated in ggml-rpc.h; I had to bump it from 96 to 101 to build. My guess is you didn't build with RPC. On a multi-GPU CUDA setup it fails: it aborts from ggml_backend_sched_split_graph(), and I had to bump up GGML_SCHED_MAX_BACKENDS and GGML_SCHED_MAX_SPLIT_INPUTS to stop the crash. Once I go past 1 GPU I get the infamous <<<<<<<<<<< for output. With 1 GPU it works. 3090s btw. Never mind the bad output: with the model completely loaded in memory vs. one 24 GB GPU, performance isn't much improved. All in memory, TG is 14.81 tk/s; with 1 GPU it is 11.70 tk/s, for about 4500 tokens generated. PP with multi-GPU is roughly 3x faster, 177 vs 65. Performance leaves a lot to be desired, but for now let's focus on correctness. To give an example, I get 800 tk/s on PP and 52 tk/s on TG for Qwen3.5-122, which is larger, loaded all in memory. From reading the DeepSeek paper, it's supposed to require less compute.... I have tried every fork I could find, and these two, in this order, have worked best for me on a multi-CUDA setup with a mix of 3090s/3080s. https://github.com/nonzod/llama.cpp-deepseek-v4-flash-spark |
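The build failure reported above is what one would expect from a compile-time guard on the op count that was not updated along with the enum. The sketch below shows the general pattern with a hypothetical `demo_op` enum; the names are invented for illustration and the real ggml sources are much larger, so treat this as an explanation of the mechanism, not a copy of it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical op enum standing in for a much longer real one. */
enum demo_op {
    DEMO_OP_HC_SPLIT_SINKHORN,
    DEMO_OP_HC_WEIGHTED_SUM,
    DEMO_OP_HC_EXPAND,
    DEMO_OP_FP8_KV_QUANTIZE,
    DEMO_OP_ROPE_TAIL,
    DEMO_OP_COUNT /* must stay last */
};

/* Name table that has to grow together with the enum. */
static const char *const DEMO_OP_NAME[] = {
    "HC_SPLIT_SINKHORN",
    "HC_WEIGHTED_SUM",
    "HC_EXPAND",
    "FP8_KV_QUANTIZE",
    "ROPE_TAIL",
};

/* Compile-time guards: adding an op without updating every dependent
 * table or hard-coded count stops the build instead of failing at runtime. */
static_assert(DEMO_OP_COUNT == 5, "DEMO_OP_COUNT != 5: update dependent code");
static_assert(sizeof(DEMO_OP_NAME) / sizeof(DEMO_OP_NAME[0]) == DEMO_OP_COUNT,
              "name table out of sync with enum");

/* Look up the printable name of an op. */
const char *demo_op_name(enum demo_op op) {
    assert((int) op >= 0 && (int) op < DEMO_OP_COUNT);
    return DEMO_OP_NAME[op];
}
```

A guard like this only fires in translation units that are actually compiled, which would explain why a build without the RPC backend did not surface the stale count.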
|
@segmond thanks, that's very helpful. I'll hold off on the other GPU PRs until I've confirmed perf and correctness. |
|
any chance this would also support #22436 |
|
No news in 2 weeks? |
|
Things are happening. |
|
Sorry, I've refocused and I've been working on adding multi-GPU support to @antirez's project. Anyone is free to use my fork and previous work to further this along; no credit needed. If I get good results I'll see if I can get it working in llama.cpp as well. |
|
can ds4 run on multiple gpu on different machines? |
|
@vanmilleru I think antirez's ds4 project just added such a feature, but it seems to be only for the DS4 PRO Q4 GGUF to run across two 512 GB M3 Ultra Mac Studios. |
|
@fairydreaming any chance nvidia gpu can also get this |
@vanmilleru No idea, but there's #24162, so maybe you can check if it works on multiple machines by using llama.cpp RPC. |
This is the first of at least 4 PRs to support DeepSeek V4 (#22319). The full branch with complete support for DeepSeek V4 is here: https://github.com/cchuter/llama.cpp/tree/feat/v4-port-cuda (this is the branch the PRs will be carved from, and it will undoubtedly change in response to reviewer feedback on this and the following PRs).
Overview
I used guidance from @CISC in the issue to break up the PR. This is the first to add the basic ggml support. It adds the five DeepSeek-V4-Flash-specific ggml ops with CPU reference implementations and test-backend-ops coverage:
This is CPU only.
GGML_OP_COUNT goes 96 -> 101 (5 new ops).
The CPU implementations are the numerical reference; test-backend-ops compares backend ops against the CPU backend, so on a CPU-only build the new cases register but are inert (CPU is the reference). There will be follow-up GPU PRs.
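As a rough illustration of what "compares backend ops against the CPU backend" means numerically, the sketch below computes a normalized mean squared error between a reference output and a candidate output and checks it against a tolerance. The function names and the tolerance are assumptions for this example; the actual metric and thresholds used by test-backend-ops should be checked in its source.

```c
#include <assert.h>
#include <stddef.h>

/* Normalized mean squared error of out relative to ref:
 *   sum((out - ref)^2) / sum(ref^2)
 * If the reference is all zeros, fall back to the unnormalized error. */
double nmse(const float *ref, const float *out, size_t n) {
    assert(ref && out);
    double err = 0.0;
    double ref_sq = 0.0;
    for (size_t i = 0; i < n; i++) {
        const double d = (double) out[i] - (double) ref[i];
        err    += d * d;
        ref_sq += (double) ref[i] * (double) ref[i];
    }
    if (ref_sq == 0.0) {
        return err;
    }
    return err / ref_sq;
}

/* Returns 1 when the candidate output is within max_err of the reference.
 * max_err is a caller-chosen tolerance (hypothetical, not a ggml constant). */
int matches_reference(const float *ref, const float *out,
                      size_t n, double max_err) {
    return nmse(ref, out, n) <= max_err;
}
```

With only the CPU backend built, the reference and the candidate are the same implementation, so a check of this kind has nothing independent to compare against.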
Additional information
Attribution
Derived from antirez/llama.cpp-deepseek-v4-flash (https://github.com/antirez/llama.cpp-deepseek-v4-flash), which builds on fairydreaming's work and the ggml/llama.cpp lineage. The V4 ops and model implementation originate there. Used with antirez's permission.
Requirements