ggml: add DeepSeek V4 hyperconnection + KV ops (CPU) #23122

Open
cchuter wants to merge 2 commits into ggml-org:master from cchuter:upstream-pr/dsv4-ops-cpu

Conversation

@cchuter

@cchuter cchuter commented May 15, 2026

This is the first of at least 4 PRs to support DeepSeek V4 (#22319). The full branch with complete DeepSeek V4 support is here: https://github.com/cchuter/llama.cpp/tree/feat/v4-port-cuda (the PRs will be carved from this branch, which will undoubtedly change in response to reviewer feedback on this and the following PRs).

Overview

I used guidance from @CISC in the issue to break up the PR. This is the first to add the basic ggml support. It adds the five DeepSeek-V4-Flash-specific ggml ops with CPU reference implementations and test-backend-ops coverage:

  • GGML_OP_DSV4_HC_SPLIT_SINKHORN: hyperconnection mix split + Sinkhorn
  • GGML_OP_DSV4_HC_WEIGHTED_SUM: hyperconnection weighted residual sum
  • GGML_OP_DSV4_HC_EXPAND: hyperconnection stream expand
  • GGML_OP_DSV4_FP8_KV_QUANTIZE: e4m3 FP8 KV-cache quantize/dequantize
  • GGML_OP_DSV4_ROPE_TAIL: V4 partial-RoPE tail rotation
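
For readers unfamiliar with the "Sinkhorn" part of the first op's name: in general, Sinkhorn normalization alternately rescales the rows and columns of a positive matrix until both sum to 1. The sketch below shows only that generic idea in plain C. The function name, the square row-major layout, and the fixed iteration count are assumptions made for illustration; the PR text does not say how the op itself is implemented.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper, not the PR's kernel: generic Sinkhorn normalization
// of an n x n row-major matrix with strictly positive entries.
// Each iteration rescales every row to sum to 1, then every column to sum to 1.
static void sinkhorn_normalize(float * m, size_t n, int n_iter) {
    assert(m != NULL && n > 0 && n_iter > 0);
    for (int it = 0; it < n_iter; ++it) {
        // row pass
        for (size_t r = 0; r < n; ++r) {
            float sum = 0.0f;
            for (size_t c = 0; c < n; ++c) sum += m[r*n + c];
            for (size_t c = 0; c < n; ++c) m[r*n + c] /= sum;
        }
        // column pass
        for (size_t c = 0; c < n; ++c) {
            float sum = 0.0f;
            for (size_t r = 0; r < n; ++r) sum += m[r*n + c];
            for (size_t r = 0; r < n; ++r) m[r*n + c] /= sum;
        }
    }
}
```

After enough iterations the matrix is approximately doubly stochastic; how many iterations are enough depends on the input, so a fixed count is a simplification.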

This is CPU only.

GGML_OP_COUNT goes 96 -> 101 (5 new ops).
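
Of the five ops, the FP8 one is the easiest to illustrate in isolation. The sketch below decodes and encodes a single e4m3 value in plain C, assuming the common "E4M3FN" layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7, no infinities, maximum finite value 448). Whether the PR's op uses exactly this layout, and what scaling it applies around it, is not stated in this thread; the function names are hypothetical.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

// Decode one e4m3 byte (E4M3FN layout assumed: bias 7, no infinities,
// codes 0x7F and 0xFF are NaN).
static float fp8_e4m3_to_float(uint8_t v) {
    const int sign = (v >> 7) & 0x1;
    const int exp  = (v >> 3) & 0xF;
    const int man  =  v       & 0x7;
    float mag;
    if (exp == 0xF && man == 0x7) {
        return NAN;
    } else if (exp == 0) {
        mag = ldexpf((float) man, -9);                    // subnormal: (man/8) * 2^-6
    } else {
        mag = ldexpf(1.0f + (float) man / 8.0f, exp - 7); // normal
    }
    return sign ? -mag : mag;
}

// Encode by exhaustive nearest-value search over all finite codes.
// Slow, but easy to check by eye. Out-of-range inputs saturate to +-448.
// Ties go to the lower code, which is not IEEE round-to-nearest-even.
static uint8_t float_to_fp8_e4m3(float x) {
    assert(isfinite(x));
    uint8_t best     = 0;
    float   best_err = INFINITY;
    for (int code = 0; code < 256; ++code) {
        if ((code & 0x7F) == 0x7F) continue;              // skip the NaN codes
        const float err = fabsf(fp8_e4m3_to_float((uint8_t) code) - x);
        if (err < best_err) { best_err = err; best = (uint8_t) code; }
    }
    return best;
}
```

A production kernel would round with bit manipulation rather than search 254 candidates, but the search version makes the value grid explicit.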

The CPU implementations are the numerical reference; test-backend-ops compares backend ops against the CPU backend, so on a CPU-only build the new cases register but are inert (CPU is the reference). There will be follow-up GPU PRs.
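
To make "compares backend ops against the CPU backend" concrete: one common way to score such a comparison is a normalized mean squared error (NMSE) between the reference output and the backend output. The sketch below assumes that metric; the function name is illustrative and any pass/fail threshold would be a separate choice, so treat it as a sketch rather than a copy of the test harness.

```c
#include <assert.h>
#include <stddef.h>

// NMSE of a backend result `out` against the CPU reference `ref`:
//   sum((out - ref)^2) / sum(ref^2)
// Undefined if `ref` is all zeros (division by zero).
static double nmse(const float * ref, const float * out, size_t n) {
    assert(ref != NULL && out != NULL && n > 0);
    double err  = 0.0;
    double norm = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err  += d * d;
        norm += (double) ref[i] * (double) ref[i];
    }
    return err / norm;
}
```

Comparing the CPU backend against itself always yields 0, which is why the new cases add no signal until a second backend implements the ops.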

Additional information

Attribution

Derived from antirez/llama.cpp-deepseek-v4-flash (https://github.com/antirez/llama.cpp-deepseek-v4-flash), which builds on fairydreaming's work and the ggml/llama.cpp lineage. The V4 ops and model implementation originate there. Used with antirez's permission.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, I used a combination of models through Claude Code. I am the architect, and I have built a custom agent dev team for coding purposes.

@cchuter cchuter requested a review from ggerganov as a code owner May 15, 2026 21:55
@github-actions github-actions Bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels May 15, 2026
@ggml-gh-bot

This comment was marked as resolved.

@segmond

segmond commented May 16, 2026

@cchuter is this complete for CPU inference? Where can I download gguf to test?

@pwilkin

pwilkin commented May 16, 2026

Member

@segmond No, these are just the backend ops needed for model support on CPU.

@cchuter actually when adding new ops for models, we tend to prefer PRs with the model support added in as well, as otherwise there's no simple way to test whether the operation implementation is correct.

cchuter and others added 2 commits May 16, 2026 11:31
Five new ggml ops for DeepSeek-V4-Flash with CPU reference
implementations and test-backend-ops coverage:
DSV4_HC_SPLIT_SINKHORN, DSV4_HC_WEIGHTED_SUM, DSV4_HC_EXPAND,
DSV4_FP8_KV_QUANTIZE, DSV4_ROPE_TAIL. CPU is the reference backend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DeepSeek-V4-Flash model: graph (src/models/deepseek4.cpp), arch /
hparams / model-loader wiring, the dsv4_* compressed-KV extension to
llama_memory_hybrid_iswa, GGUF conversion (conversion/deepseek.py +
constants/writer keys), and the V4 chat template. Standard build_attn_mha
attention path; no DeepSeek Sparse Attention. Exercises the DSV4 ops
from the preceding commit so they are testable end-to-end on CPU.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cchuter cchuter force-pushed the upstream-pr/dsv4-ops-cpu branch from 97704a6 to d99045f Compare May 16, 2026 16:55
@cchuter cchuter requested review from a team and CISC as code owners May 16, 2026 16:55
@cchuter

cchuter commented May 16, 2026

Author

@segmond, ggufs are here: https://huggingface.co/teamblobfish/DeepSeek-V4-Flash-GGUF

@pwilkin, I've updated the PR to add a full CPU end-to-end test. The model loads and answers the test questions.

I was a little overaggressive in splitting up the PRs. Here's the plan:

This PR 1: base ggml and cpu support
PR 2: Metal Kernel
PR 3: CUDA (including multi-gpu support, that was fun)
PR 4: MTP (yay, nice timing on the merge)
PR 5: DSA (this is the PR that was recently closed by @fairydreaming)

PRs 4 and 5 are not done yet and might prove challenging. They will also require new GGUFs (which I will make).

The full working code base for PRs 1-3 is here: https://github.com/cchuter/llama.cpp/tree/v4-clean-base

Before all those PRs are merged I don't expect a performant DeepSeek V4, but I do expect a working one at all stages. Right now you should peak around 20-30 t/s generation on GPUs.

@github-actions github-actions Bot added the model (Model specific), examples, and python (python script changes) labels May 16, 2026

@ngxson ngxson left a comment

Collaborator

I don't have a good feeling about this PR. I assume backend maintainers won't be happy to accept this as-is, but let's wait and see what other maintainers say about it.

It seems like most of the ops added in this PR can be implemented in another way.

Comment thread ggml/include/ggml.h
struct ggml_context * ctx,
struct ggml_tensor * x,
struct ggml_tensor * weights);

Collaborator

Why not use ggml_mul, ggml_add and ggml_sum_rows?
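
The trade-off behind this review question can be sketched in plain C, without ggml calls. The code assumes, for illustration only, that for one token `x` holds `n_hc` streams of `n_embd` floats each and `weights` holds one scalar per stream; the thread does not state the actual tensor shapes, and both function names are hypothetical. The fused version computes the result in one pass, while the composed version mirrors "multiply, then reduce" and needs a scratch buffer the size of `x`.

```c
#include <assert.h>
#include <stddef.h>

// Fused form (hypothetical): out[d] = sum over h of w[h] * x[h][d]
static void hc_weighted_sum_fused(const float * x, const float * w,
                                  size_t n_hc, size_t n_embd, float * out) {
    assert(x != NULL && w != NULL && out != NULL);
    for (size_t d = 0; d < n_embd; ++d) {
        float acc = 0.0f;
        for (size_t h = 0; h < n_hc; ++h) acc += w[h] * x[h*n_embd + d];
        out[d] = acc;
    }
}

// Composed form (hypothetical): a broadcast multiply into `scratch`,
// followed by a separate reduction over the stream axis.
static void hc_weighted_sum_composed(const float * x, const float * w,
                                     size_t n_hc, size_t n_embd,
                                     float * scratch, float * out) {
    assert(x != NULL && w != NULL && scratch != NULL && out != NULL);
    // step 1: elementwise multiply, weights broadcast along n_embd
    for (size_t h = 0; h < n_hc; ++h)
        for (size_t d = 0; d < n_embd; ++d)
            scratch[h*n_embd + d] = w[h] * x[h*n_embd + d];
    // step 2: sum over streams
    for (size_t d = 0; d < n_embd; ++d) {
        float acc = 0.0f;
        for (size_t h = 0; h < n_hc; ++h) acc += scratch[h*n_embd + d];
        out[d] = acc;
    }
}
```

Both produce the same values; the difference is the intermediate buffer and the extra pass over memory, which is the usual argument for a fused op and the usual cost of adding one to every backend.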

Comment thread ggml/include/ggml.h

// DeepSeek V4 hyperconnection expand helper.
// Computes post * block_out + comb^T @ residual for each token.
GGML_API struct ggml_tensor * ggml_dsv4_hc_expand(

Collaborator

What about ggml_mul, ggml_add and ggml_mul_mat?
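
The header comment quoted in this thread gives the formula `post * block_out + comb^T @ residual` per token. A plain-C reading of that formula is sketched below. The shapes are assumptions, since the thread does not list them: `post` has `n_hc` scalars, `block_out` has `n_embd` floats, `comb` is an `n_hc x n_hc` row-major matrix, and `residual` and `out` hold `n_hc` streams of `n_embd` floats. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical single-token reference:
//   out[i][d] = post[i] * block_out[d] + sum over j of comb[j][i] * residual[j][d]
// comb is indexed [j][i] because the formula uses comb transposed.
static void hc_expand_ref(const float * post, const float * block_out,
                          const float * comb, const float * residual,
                          size_t n_hc, size_t n_embd, float * out) {
    assert(post != NULL && block_out != NULL && comb != NULL);
    assert(residual != NULL && out != NULL);
    for (size_t i = 0; i < n_hc; ++i) {
        for (size_t d = 0; d < n_embd; ++d) {
            float acc = post[i] * block_out[d];       // outer-product term
            for (size_t j = 0; j < n_hc; ++j) {
                acc += comb[j*n_hc + i] * residual[j*n_embd + d];
            }
            out[i*n_embd + d] = acc;
        }
    }
}
```

Written this way, the two terms map onto the ops named in the review comment: the first is a broadcast multiply, the second a small matrix product, and the sum an add.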

Comment thread ggml/include/ggml.h
// DeepSeek V4 partial RoPE helper.
// Leaves the non-RoPE prefix unchanged and applies RoPE to the tail,
// matching ggml_concat(prefix, ggml_rope_ext(tail)).
GGML_API struct ggml_tensor * ggml_dsv4_rope_tail(

Collaborator

@ngxson

ngxson commented May 17, 2026

Collaborator

Did you just copy this straight from antirez's version? https://github.com/antirez/llama.cpp-deepseek-v4-flash/blob/main/ggml/include/ggml.h

Please note that being dishonest about the code's origin will result in being banned from the project.

@cchuter

cchuter commented May 17, 2026

Author

@ngxson you're right, thanks for pushing on it. This is derived from antirez's llama.cpp-deepseek-v4-flash (which builds on fairydreaming's and the ggml/llama.cpp lineage). The V4 ops and model are his; my part is CUDA/multi-GPU, a DSA-free rebase onto current master, and GGUF packaging. I should have credited that in the PR from the start and didn't. That's on me.

I have contacted @antirez and @fairydreaming to let them know I'm building on top of their work and crediting them. They've both moved on to other things.

@cchuter

cchuter commented May 17, 2026

Author

Happy to restructure, or close, if you'd rather. I'm not aware of the history of this work.

@ngxson

ngxson commented May 17, 2026

Collaborator

Since your code contains many lines that are directly copied from antirez's work, we require explicit agreement from the original author to proceed.

@antirez

antirez commented May 17, 2026

Hi, I agree with taking the code I developed and making it part of llama.cpp or any other compatibly-licensed project. I don't ask for any credit. Have fun hacking LLMs! :)

@segmond

segmond commented May 17, 2026

I checked out your v4-clean-base branch and gave it a go; it still needs more work. I couldn't build it because GGML_OP_COUNT wasn't updated in ggml-rpc.h. I had to bump it from 96 to 101 to build. My guess is you didn't build with RPC.

On a multi-GPU CUDA setup it fails, aborting from ggml_backend_sched_split_graph(). I had to bump up GGML_SCHED_MAX_BACKEND and GGML_SCHED_MAX_SPLIT_INPUTS to stop the crash.

Once I go past 1 GPU I get the infamous <<<<<<<<<<< for output. With 1 GPU it works (3090s, by the way). Never mind the bad output: with the model completely loaded in memory vs. on one 24 GB GPU, performance isn't much improved. All in memory, TG is 14.81 tk/s; with 1 GPU, 11.70 tk/s, for about 4500 tokens generated. PP with multi-GPU is about 3x faster, 177 vs 65. Performance leaves a lot to be desired, but for now let's focus on correctness. For comparison, I get 800 tk/s on PP and 52 tk/s on TG for Qwen3.5-122, which is larger, loaded all in memory. From reading the DeepSeek paper, it's supposed to require less compute....

I have tried every fork I could find, and these 2, in this order, have worked best for me on a multi-GPU CUDA setup with a mix of 3090s and 3080s.

https://github.com/nonzod/llama.cpp-deepseek-v4-flash-spark
https://github.com/Fringe210/llama.cpp-deepseek-v4-flash-cuda

@whoreson

This comment was marked as spam.

@cchuter

cchuter commented May 18, 2026

Author

@segmond thanks, that's very helpful. I'll hold off on the other GPU PRs until I've confirmed perf and correctness.

@vanmilleru

Any chance this would also support #22436?

@narikm

narikm commented Jun 1, 2026

No news in 2 weeks?

@pwilkin

pwilkin commented Jun 1, 2026

Member

Things are happening.

@cchuter

cchuter commented Jun 1, 2026

Author

Sorry, I've refocused and I've been working on adding multi-GPU support to @antirez's project. Anyone is free to use my fork and previous work to further this along; no credit needed. If I get good results I'll see if I can get it working in llama.cpp as well.

@vanmilleru

Can DS4 run on multiple GPUs on different machines?

@fairydreaming

Collaborator

@vanmilleru I think antirez's ds4 project just added such a feature, but it seems to be only for running the DS4 PRO Q4 GGUF across two 512 GB M3 Ultra Mac Studios.

@vanmilleru

@fairydreaming any chance nvidia gpu can also get this

@fairydreaming

Collaborator

> @fairydreaming any chance nvidia gpu can also get this

@vanmilleru No idea, but there's #24162, so maybe you can check if it works on multiple machines by using llama.cpp RPC.

Labels

examples, ggml (changes relating to the ggml tensor library for machine learning), model (Model specific), python (python script changes), testing (Everything test related)

9 participants