cuda: enable DeepSeek V4 on ROCm/HIP (gfx1151 Strix Halo), depends on #23122 (#23863)

Closed: JoursBleu wants to merge 8 commits into ggml-org:master from JoursBleu:rocm/hip-fixes

@JoursBleu (Contributor) commented May 29, 2026

A small HIP portability patch, based on @cchuter's feat/v4-port-cuda, that lets DeepSeek-V4-Flash build and run on ROCm. Verified on gfx1151 (Strix Halo) at a decode speed of 14.8 t/s.

cchuter and others added 8 commits May 15, 2026 16:13
Adds the five DeepSeek-V4-Flash-specific ggml ops with CPU reference
implementations and test-backend-ops coverage:

  - GGML_OP_DSV4_HC_SPLIT_SINKHORN  hyperconnection mix split + Sinkhorn
  - GGML_OP_DSV4_HC_WEIGHTED_SUM    hyperconnection weighted residual sum
  - GGML_OP_DSV4_HC_EXPAND          hyperconnection stream expand
  - GGML_OP_DSV4_FP8_KV_QUANTIZE    e4m3 FP8 KV-cache quantize/dequantize
  - GGML_OP_DSV4_ROPE_TAIL          V4 partial-RoPE tail rotation
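
To make the "e4m3 FP8" entry in the list above concrete, here is a minimal sketch of the E4M3 byte format (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, largest finite value 448). It is not the code from this commit: the function names `fp8_e4m3_decode` and `fp8_e4m3_encode` are hypothetical, the encoder is a brute-force nearest-value search chosen for clarity rather than speed, and any per-block scaling the real op may apply is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode one E4M3 byte (assumed OCP-style layout: S.EEEE.MMM, exponent bias 7).
// The pattern S.1111.111 is NaN; the format has no infinities.
static float fp8_e4m3_decode(uint8_t code) {
    const int sign = (code >> 7) & 0x1;
    const int exp  = (code >> 3) & 0xF;
    const int mant = code & 0x7;

    float value;
    if (exp == 0xF && mant == 0x7) {
        value = NAN;
    } else if (exp == 0) {
        value = std::ldexp((float) mant / 8.0f, -6);              // subnormal: 2^-6 * m/8
    } else {
        value = std::ldexp(1.0f + (float) mant / 8.0f, exp - 7);  // normal: 2^(e-7) * (1 + m/8)
    }
    return sign ? -value : value;
}

// Encode by scanning all 254 non-NaN codes for the nearest value.
// Inputs are clamped to +/-448 first. Ties keep the first (lower) code, which
// rounds toward zero; hardware typically uses round-to-nearest-even instead.
static uint8_t fp8_e4m3_encode(float x) {
    if (std::isnan(x)) {
        return 0x7F;
    }
    x = std::fmin(std::fmax(x, -448.0f), 448.0f);

    uint8_t best     = 0;
    float   best_err = INFINITY;
    for (int code = 0; code < 256; ++code) {
        if ((code & 0x7F) == 0x7F) {
            continue; // skip the two NaN encodings
        }
        const float err = std::fabs(fp8_e4m3_decode((uint8_t) code) - x);
        if (err < best_err) {
            best_err = err;
            best     = (uint8_t) code;
        }
    }
    assert(best_err != INFINITY); // a clamped finite input always has a nearest code
    return best;
}
```

A quantize/dequantize round trip is then `fp8_e4m3_decode(fp8_e4m3_encode(x))`; values beyond the representable range saturate to 448 in magnitude.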

CPU-only, per the maintainer guidance to land new model ops with a CPU
implementation first and add accelerator backends in separate per-backend
PRs. No CUDA/Metal code. No model/arch wiring (separate PR). No DeepSeek
Sparse Attention / lightning-indexer / flash-attn-top_k changes — V4 does
not use those.

GGML_OP_COUNT goes 96 -> 101 (5 new ops).

The CPU implementations are the numerical reference; test-backend-ops
compares backend ops against the CPU backend, so on a CPU-only build the
new cases register but are inert (CPU is the reference). They are
exercised once a GPU backend implementing them is built (follow-up
per-backend PR).
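
The backend-versus-CPU comparison described above can be sketched as a normalized mean squared error (NMSE) check. This is an illustration only: the helper names `nmse` and `matches_reference` and the `1e-7` default threshold are assumptions made for this sketch, not values taken from this commit.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// NMSE of a backend result against the CPU reference:
// sum((out - ref)^2) / sum(ref^2), accumulated in double precision.
static double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    assert(ref.size() == out.size());
    double err  = 0.0;
    double norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err  += d * d;
        norm += (double) ref[i] * (double) ref[i];
    }
    if (norm == 0.0) {
        // All-zero reference: only an exact match counts as zero error.
        return err == 0.0 ? 0.0 : (double) INFINITY;
    }
    return err / norm;
}

// Hypothetical pass/fail wrapper; max_nmse is an illustrative threshold.
static bool matches_reference(const std::vector<float> & ref,
                              const std::vector<float> & out,
                              double max_nmse = 1e-7) {
    return nmse(ref, out) <= max_nmse;
}
```

With only the CPU backend built there is no second result to feed into such a check, which is why the new cases stay inert until a GPU backend implements the ops.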

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Additively reconstructs the full DeepSeek-V4-Flash CPU model path on top of
the clean DSV4 ggml ops base, with DeepSeek Sparse Attention (DSA) entirely
absent: V4 model graph (src/models/deepseek4.cpp), arch/hparams/tensor
tables, the dsv4_* compressed-KV hybrid-iswa extension, the V4 converter
class grafted into the refactored conversion/ package, and the V4 chat
template. No DEEPSEEK32 / kv_cache_dsa / lightning_indexer /
build_attn_inp_k_dsa / ggml_flash_attn_ext_add_top_k. build_attn_mha keeps
its pristine upstream signature; deepseek4.cpp calls it without top_k.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Code-review round-1 blocker fix: the V4 graph (src/models/deepseek4.cpp)
calls get_dsv4_attn_k / get_dsv4_index_k / get_dsv4_n_comp /
has_dsv4_compressed_kv, which were missing because the
src/llama-memory-hybrid-iswa.{cpp,h} dsv4_* compressed-KV extension was
not applied in the first L1 commit. Without it llama-cli fails to compile.
This adds the DSA-free dsv4_* extension additively (0 DSA identifiers).
Verified: full CPU build (cmake --build build-cpu-l1 --target llama-cli)
now succeeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds Metal implementations + dispatch for the five DeepSeek-V4-Flash
ggml ops: DSV4_HC_SPLIT_SINKHORN, DSV4_HC_WEIGHTED_SUM, DSV4_HC_EXPAND,
DSV4_FP8_KV_QUANTIZE, DSV4_ROPE_TAIL.
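
For readers unfamiliar with the "Sinkhorn" part of DSV4_HC_SPLIT_SINKHORN, the following is a generic Sinkhorn-Knopp sketch: alternately normalizing rows and columns drives a strictly positive square matrix toward a doubly stochastic one. It assumes nothing about how the op in this PR splits or parameterizes the hyperconnection mix; `sinkhorn_normalize`, the row-major layout, and the fixed iteration count are all hypothetical choices for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place Sinkhorn-Knopp on a row-major n x n matrix with strictly positive entries.
// Returns the largest deviation of any row sum from 1 after the final column pass,
// which serves as a simple convergence indicator.
static float sinkhorn_normalize(std::vector<float> & m, std::size_t n, int n_iter) {
    assert(m.size() == n * n);

    for (int it = 0; it < n_iter; ++it) {
        // Row pass: scale each row so it sums to 1.
        for (std::size_t r = 0; r < n; ++r) {
            float sum = 0.0f;
            for (std::size_t c = 0; c < n; ++c) sum += m[r * n + c];
            for (std::size_t c = 0; c < n; ++c) m[r * n + c] /= sum;
        }
        // Column pass: scale each column so it sums to 1.
        for (std::size_t c = 0; c < n; ++c) {
            float sum = 0.0f;
            for (std::size_t r = 0; r < n; ++r) sum += m[r * n + c];
            for (std::size_t r = 0; r < n; ++r) m[r * n + c] /= sum;
        }
    }

    float max_dev = 0.0f;
    for (std::size_t r = 0; r < n; ++r) {
        float sum = 0.0f;
        for (std::size_t c = 0; c < n; ++c) sum += m[r * n + c];
        max_dev = std::max(max_dev, std::fabs(sum - 1.0f));
    }
    return max_dev;
}
```

A GPU kernel would typically run a small fixed number of iterations per matrix; the sketch keeps the iteration count as a parameter so convergence can be inspected through the returned deviation.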

Additive only, bounded to ggml/src/ggml-metal/* (7 files, +913/-0).
No DeepSeek Sparse Attention / lightning-indexer / flash-attn-top-k
Metal code. Validated on Apple M3 Ultra (Metal, reference platform):
coherent V4 inference with DSA entirely absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CUDA implementations + dispatch for the five DeepSeek-V4-Flash ggml ops
(DSV4_HC_SPLIT_SINKHORN, DSV4_HC_WEIGHTED_SUM, DSV4_HC_EXPAND,
DSV4_FP8_KV_QUANTIZE, DSV4_ROPE_TAIL), plus the V4-coupled multi-GPU
correctness fixes in ggml-cuda.cu (split-buffer DSV4 exception, view_src
cross-device dispatch refusal, env-gated peer-copy debug) and the
software-FP16 dequantize_V<half> fallback in fattn-common.cuh.

Additive, bounded to ggml/src/ggml-cuda/* (12 files, +1142/-2; the 2
deletions are the intentional split-buffer guard replacement). No
DeepSeek Sparse Attention / lightning-indexer / flash-attn-top-k code.
The dequantize_V<half> fallback is code byte-identical to the merged
PR-series fix d4e21f0; one rationale comment was generalized to not
name a kernel that does not exist on this DSA-free base.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Brings the DSA-free V4 source of truth up to date with ggml-org master
(18d1717 -> 0253fb2). 22/23 overlapping core files auto-merged.

Sole content conflict: src/llama-arch.cpp LLM_TENSOR_INFOS. Resolved by
keeping the V4 hyperconnection / compressed-KV tensor->op mappings
(INDEXER_COMPRESSOR_*, HC_*, FFN_GATE_TID2EID, *_APE) and adopting
upstream's NextN/MTP reclassification (LAYER_OUTPUT -> LAYER_REPEATING,
the loader-block-index fault fix). V4 functionally ignores NextN/MTP
(reserved for future MTP), so upstream's more-robust shared-infra
classification is the correct up-to-date choice.

DSA-exclusion intact: lightning-indexer / flash_attn-topk / kv_cache_dsa
absent; the 5 V4 layer commits unchanged. Revalidation on gpudual
(CUDA build + DSV4 19/19 + coherent decode) before push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Silent semantic conflict from the +20 upstream merge: origin/master
added a new `uint32_t n_rs_seq` parameter (position 13) to the
llama_memory_hybrid_iswa constructor. Git auto-merged the header and
the V4 call site separately with no text conflict, leaving the V4
create_memory() path passing 16 args to a 17-arg ctor (filter_attn
landed on `bool unified`, a std::function->bool error).

Pass `cparams.n_rs_seq` at position 13, matching the convention of the
two non-V4 hybrid_iswa/hybrid call sites in this file. C++ side
(llama, llama-completion, test-backend-ops) builds clean locally;
gpudual CUDA revalidation follows.
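
The failure mode described above can be reproduced in miniature. In this sketch, `hybrid_memory_sketch` is a hypothetical stand-in with a trimmed parameter list, not the real llama_memory_hybrid_iswa constructor; it only assumes that the trailing filter parameter has a default, so a stale call still has a valid argument count and the mismatch surfaces as a type error.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <type_traits>
#include <utility>

using layer_filter = std::function<bool(int32_t)>;

// std::function only has an *explicit* operator bool, so it cannot be passed
// where a bool parameter is expected. This is what exposed the shifted call.
static_assert(!std::is_convertible<layer_filter, bool>::value,
              "std::function must not convert implicitly to bool");

// Hypothetical, heavily trimmed constructor. n_rs_seq is the newly inserted
// parameter; everything after it moves one position to the right.
struct hybrid_memory_sketch {
    uint32_t     n_seq_max;
    uint32_t     n_rs_seq;
    bool         unified;
    layer_filter filter_attn;

    hybrid_memory_sketch(uint32_t n_seq_max, uint32_t n_rs_seq, bool unified,
                         layer_filter filter_attn = nullptr)
        : n_seq_max(n_seq_max), n_rs_seq(n_rs_seq), unified(unified),
          filter_attn(std::move(filter_attn)) {
        assert(n_seq_max > 0);
    }
};

// A call written against the old 3-parameter signature:
//
//   layer_filter keep_even = [](int32_t il) { return il % 2 == 0; };
//   hybrid_memory_sketch stale(4, /*unified=*/true, keep_even);
//
// `true` silently becomes n_rs_seq == 1, and keep_even lands on `bool unified`,
// which fails to compile because of the explicit conversion checked above.
```

The compile error is the lucky case. Had the shifted argument been a type that converts implicitly to `bool` (an integer, or a captureless lambda via its function-pointer conversion), the stale call would have built cleanly and misbehaved at runtime.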

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JoursBleu JoursBleu closed this May 29, 2026
@github-actions bot added the model, testing, Nvidia GPU, examples, python, ggml, and Apple Metal labels May 29, 2026