cuda: enable DeepSeek V4 on ROCm/HIP (gfx1151 Strix Halo) — depends on #23122 by JoursBleu · Pull Request #23863 · ggml-org/llama.cpp

JoursBleu · 2026-05-29T10:05:21Z

A small HIP portability patch based on @cchuter's feat/v4-port-cuda enables DeepSeek-V4-Flash to build and run on ROCm (verified with gfx1151 Strix Halo, decoding speed of 14.8 t/s).

Adds the five DeepSeek-V4-Flash-specific ggml ops with CPU reference implementations and test-backend-ops coverage:

- GGML_OP_DSV4_HC_SPLIT_SINKHORN  hyperconnection mix split + Sinkhorn
- GGML_OP_DSV4_HC_WEIGHTED_SUM    hyperconnection weighted residual sum
- GGML_OP_DSV4_HC_EXPAND          hyperconnection stream expand
- GGML_OP_DSV4_FP8_KV_QUANTIZE    e4m3 FP8 KV-cache quantize/dequantize
- GGML_OP_DSV4_ROPE_TAIL          V4 partial-RoPE tail rotation

CPU-only, per the maintainer guidance to land new model ops with a CPU implementation first and add accelerator backends in separate per-backend PRs. No CUDA/Metal code. No model/arch wiring (separate PR). No DeepSeek Sparse Attention / lightning-indexer / flash-attn-top_k changes; V4 does not use those.

GGML_OP_COUNT goes 96 -> 101 (5 new ops).

The CPU implementations are the numerical reference; test-backend-ops compares backend ops against the CPU backend, so on a CPU-only build the new cases register but are inert (CPU is the reference). They are exercised once a GPU backend implementing them is built (follow-up per-backend PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
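To make the "e4m3 FP8 quantize/dequantize" wording concrete, here is a minimal C++ sketch of an 8-bit e4m3 round trip. It assumes the common E4M3FN layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, maximum finite value 448, no infinities). The function names `e4m3_to_float` and `float_to_e4m3` are hypothetical, and the sketch ignores any per-block scaling or tensor layout the actual op may use, which this page does not describe.

```cpp
#include <cmath>
#include <cstdint>

// Decode one e4m3 byte (assumed E4M3FN layout) to float.
float e4m3_to_float(uint8_t v) {
    const int sign = v >> 7;
    const int exp  = (v >> 3) & 0xF;
    const int man  = v & 0x7;
    if (exp == 0xF && man == 0x7) {
        return NAN;                       // the only NaN pattern per sign
    }
    float mag;
    if (exp == 0) {
        mag = std::ldexp((float) man, -9);            // subnormal: man/8 * 2^-6
    } else {
        mag = std::ldexp(1.0f + man / 8.0f, exp - 7); // normal: 1.man * 2^(exp-7)
    }
    return sign ? -mag : mag;
}

// Encode float to the nearest e4m3 byte by brute-force search over the
// 127 finite magnitudes. Saturates at +-448; ties go to the even code.
// A production kernel would use bit manipulation instead of a search.
uint8_t float_to_e4m3(float x) {
    if (std::isnan(x)) {
        return 0x7F;
    }
    const uint8_t sign = std::signbit(x) ? 0x80 : 0x00;
    const float   mag  = std::fmin(std::fabs(x), 448.0f);
    uint8_t best     = 0;
    float   best_err = mag;               // error of code 0 (value 0.0)
    for (int code = 1; code <= 0x7E; ++code) {
        const float err = std::fabs(e4m3_to_float((uint8_t) code) - mag);
        if (err < best_err || (err == best_err && (code & 1) == 0)) {
            best     = (uint8_t) code;
            best_err = err;
        }
    }
    return (uint8_t) (sign | best);
}
```

Decoding `float_to_e4m3(x)` with `e4m3_to_float` gives the value a quantize/dequantize pass would leave in place of `x` under these assumptions; values beyond 448 in magnitude saturate rather than overflow.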

Additively reconstructs the full DeepSeek-V4-Flash CPU model path on top of the clean DSV4 ggml ops base, with DeepSeek Sparse Attention (DSA) entirely absent: V4 model graph (src/models/deepseek4.cpp), arch/hparams/tensor tables, the dsv4_* compressed-KV hybrid-iswa extension, the V4 converter class grafted into the refactored conversion/ package, and the V4 chat template. No DEEPSEEK32 / kv_cache_dsa / lightning_indexer / build_attn_inp_k_dsa / ggml_flash_attn_ext_add_top_k. build_attn_mha keeps its pristine upstream signature; deepseek4.cpp calls it without top_k. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Code-review round-1 blocker fix: the V4 graph (src/models/deepseek4.cpp) calls get_dsv4_attn_k / get_dsv4_index_k / get_dsv4_n_comp / has_dsv4_compressed_kv, which were missing because the src/llama-memory-hybrid-iswa.{cpp,h} dsv4_* compressed-KV extension was not applied in the first L1 commit. Without it llama-cli fails to compile. This adds the DSA-free dsv4_* extension additively (0 DSA identifiers). Verified: full CPU build (cmake --build build-cpu-l1 --target llama-cli) now succeeds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds Metal implementations + dispatch for the five DeepSeek-V4-Flash ggml ops: DSV4_HC_SPLIT_SINKHORN, DSV4_HC_WEIGHTED_SUM, DSV4_HC_EXPAND, DSV4_FP8_KV_QUANTIZE, DSV4_ROPE_TAIL. Additive only, bounded to ggml/src/ggml-metal/* (7 files, +913/-0). No DeepSeek Sparse Attention / lightning-indexer / flash-attn-top-k Metal code. Validated on Apple M3 Ultra (Metal, reference platform): coherent V4 inference with DSA entirely absent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CUDA implementations + dispatch for the five DeepSeek-V4-Flash ggml ops (DSV4_HC_SPLIT_SINKHORN, DSV4_HC_WEIGHTED_SUM, DSV4_HC_EXPAND, DSV4_FP8_KV_QUANTIZE, DSV4_ROPE_TAIL), plus the V4-coupled multi-GPU correctness fixes in ggml-cuda.cu (split-buffer DSV4 exception, view_src cross-device dispatch refusal, env-gated peer-copy debug) and the software-FP16 dequantize_V<half> fallback in fattn-common.cuh. Additive, bounded to ggml/src/ggml-cuda/* (12 files, +1142/-2; the 2 deletions are the intentional split-buffer guard replacement). No DeepSeek Sparse Attention / lightning-indexer / flash-attn-top-k code. The dequantize_V<half> fallback is code byte-identical to the merged PR-series fix d4e21f0; one rationale comment was generalized to not name a kernel that does not exist on this DSA-free base. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
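Since the CPU implementations serve as the numerical reference for backend kernels like these, a comparison between a backend result and the CPU result can be sketched as a normalized mean squared error. This is an illustrative sketch only: the helper name `nmse` and the way the buffers are obtained are assumptions, not the test harness's actual code or thresholds.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Normalized MSE between a reference (CPU) result and a backend result:
//   sum((out - ref)^2) / sum(ref^2)
// Returns infinity for mismatched sizes or an all-zero reference with a
// nonzero error, and 0 when both buffers are identical.
double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    if (ref.size() != out.size()) {
        return std::numeric_limits<double>::infinity();
    }
    double err  = 0.0;
    double norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err  += d * d;
        norm += (double) ref[i] * (double) ref[i];
    }
    if (norm == 0.0) {
        return err == 0.0 ? 0.0 : std::numeric_limits<double>::infinity();
    }
    return err / norm;
}
```

A caller would run the same op on both backends with identical inputs, copy both outputs to host memory, and accept the backend result when `nmse(cpu_out, gpu_out)` falls below a tolerance chosen for that op.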

Brings the DSA-free V4 source of truth up to date with ggml-org master (18d1717 -> 0253fb2). 22/23 overlapping core files auto-merged. Sole content conflict: src/llama-arch.cpp LLM_TENSOR_INFOS. Resolved by keeping the V4 hyperconnection / compressed-KV tensor->op mappings (INDEXER_COMPRESSOR_*, HC_*, FFN_GATE_TID2EID, *_APE) and adopting upstream's NextN/MTP reclassification (LAYER_OUTPUT -> LAYER_REPEATING, the loader-block-index fault fix). V4 functionally ignores NextN/MTP (reserved for future MTP), so upstream's more-robust shared-infra classification is the correct up-to-date choice. DSA-exclusion intact: lightning-indexer / flash_attn-topk / kv_cache_dsa absent; the 5 V4 layer commits unchanged. Revalidation on gpudual (CUDA build + DSV4 19/19 + coherent decode) before push. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Silent semantic conflict from the +20 upstream merge: origin/master added a new `uint32_t n_rs_seq` parameter (position 13) to the llama_memory_hybrid_iswa constructor. Git auto-merged the header and the V4 call site separately with no text conflict, leaving the V4 create_memory() path passing 16 args to a 17-arg ctor (filter_attn landed on `bool unified`, a std::function->bool error). Pass `cparams.n_rs_seq` at position 13, matching the convention of the two non-V4 hybrid_iswa/hybrid call sites in this file. C++ side (llama, llama-completion, test-backend-ops) builds clean locally; gpudual CUDA revalidation follows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
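The failure mode described above can be reproduced in miniature. The sketch below uses hypothetical, much shorter constructors (`memory_before`, `memory_after`) rather than the real 17-argument llama_memory_hybrid_iswa signature: a new `uint32_t` parameter is inserted before `bool unified`, and a call site that was not updated shifts every later argument one slot to the left. The `bool` argument converts silently to `uint32_t`, but the `std::function` argument cannot convert implicitly to `bool`, which is what surfaces the bug at compile time.

```cpp
#include <cstdint>
#include <functional>
#include <type_traits>

// Hypothetical stand-in for the layer filter callback type.
using layer_filter = std::function<bool(int32_t)>;

// Hypothetical constructor shape before the upstream change.
struct memory_before {
    uint32_t     n_seq_max;
    bool         unified;
    layer_filter filter_attn;
    layer_filter filter_recr;
    memory_before(uint32_t n_seq_max, bool unified,
                  layer_filter filter_attn = nullptr,
                  layer_filter filter_recr = nullptr)
        : n_seq_max(n_seq_max), unified(unified),
          filter_attn(filter_attn), filter_recr(filter_recr) {}
};

// Hypothetical shape after the change: n_rs_seq inserted before unified.
struct memory_after {
    uint32_t     n_seq_max;
    uint32_t     n_rs_seq;
    bool         unified;
    layer_filter filter_attn;
    layer_filter filter_recr;
    memory_after(uint32_t n_seq_max, uint32_t n_rs_seq, bool unified,
                 layer_filter filter_attn = nullptr,
                 layer_filter filter_recr = nullptr)
        : n_seq_max(n_seq_max), n_rs_seq(n_rs_seq), unified(unified),
          filter_attn(filter_attn), filter_recr(filter_recr) {}
};

// The stale argument list is valid against the old constructor...
constexpr bool stale_call_on_old_ctor =
    std::is_constructible<memory_before,
                          uint32_t, bool, layer_filter, layer_filter>::value;

// ...but not against the new one: filter_attn lands on `bool unified`.
constexpr bool stale_call_on_new_ctor =
    std::is_constructible<memory_after,
                          uint32_t, bool, layer_filter, layer_filter>::value;

// Passing the new argument in its position restores a valid call.
constexpr bool fixed_call_on_new_ctor =
    std::is_constructible<memory_after,
                          uint32_t, uint32_t, bool, layer_filter, layer_filter>::value;

static_assert(stale_call_on_old_ctor,  "old call matches old ctor");
static_assert(!stale_call_on_new_ctor, "stale call must not compile");
static_assert(fixed_call_on_new_ctor,  "fixed call matches new ctor");
```

Had the shifted arguments all been implicitly convertible (for example, only integers and booleans), the stale call would have compiled and misbehaved at run time, so the type error here is the fortunate case.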

cchuter and others added 8 commits May 15, 2026 16:13

cuda: enable dsv4 with rocm/hip

cb0c5f6

JoursBleu closed this May 29, 2026

JoursBleu wants to merge 8 commits into ggml-org:master from JoursBleu:rocm/hip-fixes
