ggml: add DeepSeek V4 hyperconnection + KV ops (CPU) by cchuter · Pull Request #23122 · ggml-org/llama.cpp

cchuter · 2026-05-15T21:55:13Z

This is the first of at least four PRs to support DeepSeek V4 (#22319). The full branch with complete DeepSeek V4 support is here: https://github.com/cchuter/llama.cpp/tree/feat/v4-port-cuda (the PRs will be carved from this branch, which will undoubtedly change in response to review of this and the following PRs).

Overview

I used guidance from @CISC in the issue to break up the PR. This is the first to add the basic ggml support. It adds the five DeepSeek-V4-Flash-specific ggml ops with CPU reference implementations and test-backend-ops coverage:

GGML_OP_DSV4_HC_SPLIT_SINKHORN: hyperconnection mix split + Sinkhorn
GGML_OP_DSV4_HC_WEIGHTED_SUM: hyperconnection weighted residual sum
GGML_OP_DSV4_HC_EXPAND: hyperconnection stream expand
GGML_OP_DSV4_FP8_KV_QUANTIZE: e4m3 FP8 KV-cache quantize/dequantize
GGML_OP_DSV4_ROPE_TAIL: V4 partial-RoPE tail rotation
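
For readers unfamiliar with the Sinkhorn step named in the first op, here is a minimal sketch of generic Sinkhorn-Knopp normalization in plain Python. It is an illustration only: the function name, the iteration count, and the epsilon are assumptions, and the "split" half of the real op is not described in this thread, so it is not modelled here.

```python
def sinkhorn(matrix, iterations=20, eps=1e-9):
    """Alternately normalize rows and columns of a positive square matrix.

    Generic Sinkhorn-Knopp; iteration count and eps are illustrative
    assumptions, not values taken from the PR.
    """
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iterations):
        # Row step: scale each row so it sums to 1.
        for i in range(n):
            s = sum(m[i]) + eps
            m[i] = [v / s for v in m[i]]
        # Column step: scale each column so it sums to 1.
        for j in range(n):
            s = sum(m[i][j] for i in range(n)) + eps
            for i in range(n):
                m[i][j] /= s
    return m


mixed = sinkhorn([[1.0, 2.0], [3.0, 4.0]])
row_sums = [sum(row) for row in mixed]
col_sums = [sum(mixed[i][j] for i in range(2)) for j in range(2)]
```

After enough iterations both `row_sums` and `col_sums` are close to 1, i.e. the matrix is approximately doubly stochastic.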

This PR is CPU only.

GGML_OP_COUNT goes 96 -> 101 (5 new ops).

The CPU implementations are the numerical reference. test-backend-ops compares each backend's ops against the CPU backend, so on a CPU-only build the new cases register but are inert (there is no second backend to compare against the reference). Follow-up GPU PRs will come later.
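
To make "compares against the CPU backend" concrete, here is a minimal sketch of a reference comparison based on normalized mean squared error (NMSE). It assumes an NMSE-style metric; the helper names and the 1e-7 threshold are hypothetical and are not taken from this PR.

```python
def nmse(reference, candidate):
    """Squared error normalized by the energy of the reference output."""
    err = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    ref = sum(r * r for r in reference)
    return err / ref if ref > 0.0 else 0.0


def matches_reference(reference, candidate, max_nmse=1e-7):
    # max_nmse is an assumed threshold, chosen here for illustration only.
    return nmse(reference, candidate) <= max_nmse


cpu_out = [1.0, 2.0]             # stands in for the CPU reference result
same_out = [1.0, 2.0]            # a backend that agrees exactly
drifted_out = [1.0, 2.1]         # a backend with a visible numerical error
```

With only the CPU backend available there is no second output to feed into such a check, which is why the new cases are inert on a CPU-only build.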

Additional information

Attribution

Derived from antirez/llama.cpp-deepseek-v4-flash (https://github.com/antirez/llama.cpp-deepseek-v4-flash), which builds on fairydreaming's work and the ggml/llama.cpp lineage. The V4 ops and model implementation originate there. Used with antirez's permission.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, I used a combination of models through Claude Code. I am the architect, and I have built a custom agent dev team for coding purposes.

segmond · 2026-05-16T11:46:33Z

@cchuter Is this complete for CPU inference? Where can I download a GGUF to test?

pwilkin · 2026-05-16T12:54:45Z

@segmond No, these are just the backend ops needed for model support on the CPU backend.

@cchuter Actually, when adding new ops for models, we tend to prefer PRs that include the model support as well; otherwise there's no simple way to test whether the op implementation is correct.

Five new ggml ops for DeepSeek-V4-Flash with CPU reference implementations and test-backend-ops coverage: DSV4_HC_SPLIT_SINKHORN, DSV4_HC_WEIGHTED_SUM, DSV4_HC_EXPAND, DSV4_FP8_KV_QUANTIZE, DSV4_ROPE_TAIL. CPU is the reference backend. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DeepSeek-V4-Flash model: graph (src/models/deepseek4.cpp), arch / hparams / model-loader wiring, the dsv4_* compressed-KV extension to llama_memory_hybrid_iswa, GGUF conversion (conversion/deepseek.py + constants/writer keys), and the V4 chat template. Standard build_attn_mha attention path; no DeepSeek Sparse Attention. Exercises the DSV4 ops from the preceding commit so they are testable end-to-end on CPU. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cchuter · 2026-05-16T17:05:09Z

@segmond, ggufs are here: https://huggingface.co/teamblobfish/DeepSeek-V4-Flash-GGUF

@pwilkin, I've updated the PR to add a full CPU end-to-end test. The model loads and answers the test questions.

I was a little overaggressive in splitting up the PRs. Here's the plan:

This PR 1: base ggml and cpu support
PR 2: Metal Kernel
PR 3: CUDA (including multi-gpu support, that was fun)
PR 4: MTP (yay, nice timing on the merge)
PR 5: DSA (this is the PR that was recently closed by @fairydreaming)

PRs 4 and 5 are not done yet and might prove challenging. They will also require new GGUFs (which I will make).

The full working code base for PRs 1-3 is here: https://github.com/cchuter/llama.cpp/tree/v4-clean-base

Until all those PRs are merged I don't expect a performant DeepSeek V4, but I do expect a working one at every stage. Right now generation should peak at around 20-30 t/s on GPUs.

ngxson

I don't have a good feeling about this PR. I assume backend maintainers won't be happy to accept this as-is, but let's wait and see what other maintainers say.

It seems like most of the ops added in this PR can be implemented in another way.

ngxson · 2026-05-16T18:26:54Z

+            struct ggml_context * ctx,
+            struct ggml_tensor  * x,
+            struct ggml_tensor  * weights);
+


Why not use ggml_mul, ggml_add and ggml_sum_rows?
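
A minimal sketch of the point being made, in plain Python rather than ggml: a weighted sum over hyperconnection streams gives the same result whether it is computed in one fused loop or as a broadcast multiply followed by a reduction. The per-token layout (streams x embedding) and all function names are assumptions for illustration; they are not the PR's actual tensor shapes.

```python
def hc_weighted_sum_fused(x, weights):
    """One fused loop: out[i] = sum over streams s of weights[s] * x[s][i]."""
    n_embd = len(x[0])
    out = [0.0] * n_embd
    for s, w in enumerate(weights):
        for i in range(n_embd):
            out[i] += w * x[s][i]
    return out


def mul_broadcast(x, weights):
    # Elementwise multiply, broadcasting one weight across each stream.
    return [[w * v for v in row] for row, w in zip(x, weights)]


def sum_over_streams(x):
    # Reduction across the stream dimension.
    return [sum(col) for col in zip(*x)]


def hc_weighted_sum_composed(x, weights):
    """Same result built from a multiply and a reduction."""
    return sum_over_streams(mul_broadcast(x, weights))


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # 2 streams, 3 embedding values
w = [0.5, 0.25]
fused = hc_weighted_sum_fused(x, w)
composed = hc_weighted_sum_composed(x, w)
```

Whether the composed form is fast enough on real backends is a separate question that this sketch does not address.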

ngxson · 2026-05-16T18:27:57Z

+
+    // DeepSeek V4 hyperconnection expand helper.
+    // Computes post * block_out + comb^T @ residual for each token.
+    GGML_API struct ggml_tensor * ggml_dsv4_hc_expand(


What about ggml_mul, ggml_add and ggml_mul_mat?
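
As a sketch of what that composition could look like, the header comment's formula `post * block_out + comb^T @ residual` can be written both as a fused loop and as a broadcast multiply, a matrix multiply, and an add. This is plain Python with assumed per-token shapes (`post`: n_hc, `block_out`: n_embd, `comb`: n_hc x n_hc, `residual`: n_hc x n_embd); the real op's layout may differ.

```python
def hc_expand_fused(post, block_out, comb, residual):
    """out[h][i] = post[h] * block_out[i] + sum_k comb[k][h] * residual[k][i]."""
    n_hc, n_embd = len(post), len(block_out)
    out = []
    for h in range(n_hc):
        row = []
        for i in range(n_embd):
            mixed = sum(comb[k][h] * residual[k][i] for k in range(n_hc))
            row.append(post[h] * block_out[i] + mixed)
        out.append(row)
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def outer(u, v):
    # Broadcast multiply: every stream scales the same block output.
    return [[ui * vj for vj in v] for ui in u]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def hc_expand_composed(post, block_out, comb, residual):
    return add(outer(post, block_out), matmul(transpose(comb), residual))


post = [1.0, 2.0]
block_out = [1.0, 0.5]
comb = [[1.0, 0.0], [2.0, 1.0]]
residual = [[1.0, 2.0], [3.0, 4.0]]
fused = hc_expand_fused(post, block_out, comb, residual)
composed = hc_expand_composed(post, block_out, comb, residual)
```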

ngxson · 2026-05-16T18:28:45Z

+    // DeepSeek V4 partial RoPE helper.
+    // Leaves the non-RoPE prefix unchanged and applies RoPE to the tail,
+    // matching ggml_concat(prefix, ggml_rope_ext(tail)).
+    GGML_API struct ggml_tensor * ggml_dsv4_rope_tail(


You seem to have skipped the tips and tricks in https://github.com/ggml-org/llama.cpp/blob/master/docs/development/HOWTO-add-model.md
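
The equivalence stated in the header comment, that the op matches concatenating the untouched prefix with a RoPE-rotated tail, can be sketched in plain Python as follows. The adjacent-pair rotation convention, the frequency base of 10000, and the function names are assumptions for illustration; the thread does not say which RoPE variant V4 uses.

```python
import math


def rope(vec, pos, theta_base=10000.0):
    """Rotate adjacent pairs of vec by position-dependent angles.

    Pairing convention and theta_base are illustrative assumptions.
    """
    n = len(vec)
    out = vec[:]
    for p in range(n // 2):
        angle = pos * theta_base ** (-2.0 * p / n)
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = vec[2 * p], vec[2 * p + 1]
        out[2 * p] = x0 * c - x1 * s
        out[2 * p + 1] = x0 * s + x1 * c
    return out


def rope_tail(vec, n_rot, pos):
    """Leave the first len(vec) - n_rot values alone, rotate the last n_rot."""
    split = len(vec) - n_rot
    prefix, tail = vec[:split], vec[split:]
    return prefix + rope(tail, pos)   # concat(prefix, rope(tail))


head = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rotated = rope_tail(head, n_rot=4, pos=3)
unrotated = rope_tail(head, n_rot=4, pos=0)
```

The prefix passes through unchanged, position 0 is the identity, and the rotation preserves the norm of the tail.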

ngxson · 2026-05-17T11:01:37Z

Did you just copy this straight from antirez's version? https://github.com/antirez/llama.cpp-deepseek-v4-flash/blob/main/ggml/include/ggml.h

Please note that being dishonest about the code's origin will result in being banned from the project.

cchuter · 2026-05-17T14:06:40Z

@ngxson you're right, thanks for pushing on it. This is derived from antirez's llama.cpp-deepseek-v4-flash (which builds on fairydreaming's and the ggml/llama.cpp lineage). The V4 ops and model are his; my part is CUDA/multi-GPU, a DSA-free rebase onto current master, and GGUF packaging. I should have credited that in the PR from the start and didn't. That's on me.

I have contacted @antirez and @fairydreaming to let them know I'm building on top of their work and crediting. They've both moved on to other things.

cchuter · 2026-05-17T14:10:24Z

Happy to restructure, or close, if you'd rather. I'm not aware of the history of this work.

ngxson · 2026-05-17T14:40:22Z

Since your code contains many lines that are directly copied from antirez's work, we require explicit agreement from the original author to proceed.

antirez · 2026-05-17T14:52:55Z

Hi, I agree with taking the code I developed and making it part of llama.cpp or any other compatibly-licensed project. I don't ask for any credit. Have fun hacking LLMs! :)

segmond · 2026-05-17T18:34:04Z

I checked out your v4-clean-base branch and gave it a go; it still needs more work. I couldn't build it because GGML_OP_COUNT wasn't updated in ggml-rpc.h; I had to bump it from 96 to 101 to build. My guess is you didn't build with RPC.

On a multi-GPU CUDA setup it fails, aborting from ggml_backend_sched_split_graph(). I had to bump up GGML_SCHED_MAX_BACKENDS and GGML_SCHED_MAX_SPLIT_INPUTS to stop the crash.

Once I go past 1 GPU I get the infamous <<<<<<<<<<< for output. With 1 GPU it works. 3090s, by the way. Never mind the bad output: with the model completely loaded in memory vs 1 24 GB GPU, performance isn't much improved. All in memory, TG is 14.81 tk/s; with 1 GPU, 11.70 tk/s, for about 4500 tokens generated. PP with multi-GPU is nearly 3x faster, 177 vs 65. Performance leaves a lot to be desired, but for now let's focus on correctness. To give an example, I get 800 tk/s on PP and 52 tk/s on TG for Qwen3.5-122, which is larger, loaded all in memory. From reading the DeepSeek paper, it's supposed to require less compute....

I have tried every fork I could find, and these two, in this order, have worked best for me on a multi-GPU CUDA setup with a mix of 3090s and 3080s.

https://github.com/nonzod/llama.cpp-deepseek-v4-flash-spark
https://github.com/Fringe210/llama.cpp-deepseek-v4-flash-cuda

cchuter · 2026-05-18T19:51:02Z

@segmond thanks, that's very helpful. I'll hold off on the other GPU PRs until I've confirmed perf and correctness.

vanmilleru · 2026-05-20T16:20:10Z

Any chance this would also support #22436?

narikm · 2026-06-01T15:49:45Z

No news in 2 weeks?

pwilkin · 2026-06-01T16:23:33Z

Things are happening.

cchuter · 2026-06-01T17:00:36Z

Sorry, I've refocused and have been working on adding multi-GPU support to @antirez's project. Anyone is free to use my fork and previous work to move this along; no credit needed. If I get good results I'll see if I can get it working in llama.cpp as well.

vanmilleru · 2026-06-02T17:08:43Z

Can ds4 run on multiple GPUs across different machines?

fairydreaming · 2026-06-02T17:17:16Z

@vanmilleru I think antirez's ds4 project just added such a feature, but it seems to be only for running the DS4 PRO Q4 GGUF across two 512 GB M3 Ultra Mac Studios.

vanmilleru · 2026-06-03T03:35:44Z

@fairydreaming Any chance NVIDIA GPUs can also get this?

fairydreaming · 2026-06-06T07:35:37Z

@fairydreaming any chance nvidia gpu can also get this

@vanmilleru No idea, but there's #24162, so maybe you can check if it works on multiple machines by using llama.cpp RPC.

cchuter requested a review from ggerganov as a code owner May 15, 2026 21:55

github-actions Bot added the testing and ggml labels May 15, 2026

cchuter mentioned this pull request May 15, 2026

Model request: DeepSeek V4 Series #22319

Open

cchuter and others added 2 commits May 16, 2026 11:31

cchuter force-pushed the upstream-pr/dsv4-ops-cpu branch from 97704a6 to d99045f Compare May 16, 2026 16:55

cchuter requested review from a team and CISC as code owners May 16, 2026 16:55

github-actions Bot added the model, examples and python labels May 16, 2026

ngxson reviewed May 16, 2026

ngxson mentioned this pull request May 17, 2026

contrib: require explicit agreement for including external code #23201

Open

JoursBleu mentioned this pull request May 29, 2026

cuda: enable DeepSeek V4 on ROCm/HIP (gfx1151 Strix Halo) — depends on #23122 #23863

Closed
