cuda: enable DeepSeek V4 on ROCm/HIP (gfx1151 Strix Halo) — depends on #23122 #23863
Closed
JoursBleu wants to merge 8 commits into
Adds the five DeepSeek-V4-Flash-specific ggml ops with CPU reference implementations and test-backend-ops coverage:

- GGML_OP_DSV4_HC_SPLIT_SINKHORN  hyperconnection mix split + Sinkhorn
- GGML_OP_DSV4_HC_WEIGHTED_SUM    hyperconnection weighted residual sum
- GGML_OP_DSV4_HC_EXPAND          hyperconnection stream expand
- GGML_OP_DSV4_FP8_KV_QUANTIZE    e4m3 FP8 KV-cache quantize/dequantize
- GGML_OP_DSV4_ROPE_TAIL          V4 partial-RoPE tail rotation

CPU-only, per the maintainer guidance to land new model ops with a CPU implementation first and add accelerator backends in separate per-backend PRs. No CUDA/Metal code. No model/arch wiring (separate PR). No DeepSeek Sparse Attention / lightning-indexer / flash-attn-top_k changes: V4 does not use those.

GGML_OP_COUNT goes 96 -> 101 (5 new ops).

The CPU implementations are the numerical reference; test-backend-ops compares backend ops against the CPU backend, so on a CPU-only build the new cases register but are inert (CPU is the reference). They are exercised once a GPU backend implementing them is built (follow-up per-backend PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
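To make the "backend vs. CPU reference" comparison concrete, here is a minimal sketch of the kind of check involved, using normalized mean squared error (NMSE). The helper names `nmse` and `backend_matches_reference` and the threshold passed in are illustrative assumptions, not code from this PR or the exact test-backend-ops logic.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: normalized mean squared error of a backend result
// against the CPU reference, i.e. sum((out - ref)^2) / sum(ref^2).
double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    assert(ref.size() == out.size());
    double err  = 0.0;
    double norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = double(out[i]) - double(ref[i]);
        err  += d * d;
        norm += double(ref[i]) * double(ref[i]);
    }
    if (norm == 0.0) {
        // All-zero reference: only an all-zero output counts as a match.
        return err == 0.0 ? 0.0 : std::numeric_limits<double>::infinity();
    }
    return err / norm;
}

// Hypothetical acceptance check; max_nmse is chosen by the caller (assumption).
bool backend_matches_reference(const std::vector<float> & ref,
                               const std::vector<float> & out,
                               double max_nmse) {
    return nmse(ref, out) <= max_nmse;
}
```

On a CPU-only build the reference and the tested output come from the same backend, which is why the new cases are inert there.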
Additively reconstructs the full DeepSeek-V4-Flash CPU model path on top of the clean DSV4 ggml ops base, with DeepSeek Sparse Attention (DSA) entirely absent: V4 model graph (src/models/deepseek4.cpp), arch/hparams/tensor tables, the dsv4_* compressed-KV hybrid-iswa extension, the V4 converter class grafted into the refactored conversion/ package, and the V4 chat template. No DEEPSEEK32 / kv_cache_dsa / lightning_indexer / build_attn_inp_k_dsa / ggml_flash_attn_ext_add_top_k. build_attn_mha keeps its pristine upstream signature; deepseek4.cpp calls it without top_k. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code-review round-1 blocker fix: the V4 graph (src/models/deepseek4.cpp)
calls get_dsv4_attn_k / get_dsv4_index_k / get_dsv4_n_comp /
has_dsv4_compressed_kv, which were missing because the
src/llama-memory-hybrid-iswa.{cpp,h} dsv4_* compressed-KV extension was
not applied in the first L1 commit. Without it llama-cli fails to compile.
This adds the DSA-free dsv4_* extension additively (0 DSA identifiers).
Verified: full CPU build (cmake --build build-cpu-l1 --target llama-cli)
now succeeds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds Metal implementations + dispatch for the five DeepSeek-V4-Flash ggml ops: DSV4_HC_SPLIT_SINKHORN, DSV4_HC_WEIGHTED_SUM, DSV4_HC_EXPAND, DSV4_FP8_KV_QUANTIZE, DSV4_ROPE_TAIL. Additive only, bounded to ggml/src/ggml-metal/* (7 files, +913/-0). No DeepSeek Sparse Attention / lightning-indexer / flash-attn-top-k Metal code. Validated on Apple M3 Ultra (Metal, reference platform): coherent V4 inference with DSA entirely absent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
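For readers unfamiliar with the Sinkhorn step named in DSV4_HC_SPLIT_SINKHORN, the following is a generic Sinkhorn-Knopp normalization sketch: it alternately rescales rows and columns of a strictly positive square matrix until both sum to 1. It is not the ggml op; the function name, the row-major layout, and the iteration count are assumptions, and how the real op splits or preprocesses its input is not described in this PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not the ggml op: generic Sinkhorn-Knopp iteration on
// a strictly positive n x n matrix stored row-major. The result approaches a
// doubly stochastic matrix (every row and every column sums to 1).
std::vector<float> sinkhorn_normalize(std::vector<float> m, std::size_t n, int n_iters) {
    assert(m.size() == n * n);
    for (int it = 0; it < n_iters; ++it) {
        // Row step: scale each row so it sums to 1.
        for (std::size_t r = 0; r < n; ++r) {
            float sum = 0.0f;
            for (std::size_t c = 0; c < n; ++c) sum += m[r * n + c];
            for (std::size_t c = 0; c < n; ++c) m[r * n + c] /= sum;
        }
        // Column step: scale each column so it sums to 1.
        for (std::size_t c = 0; c < n; ++c) {
            float sum = 0.0f;
            for (std::size_t r = 0; r < n; ++r) sum += m[r * n + c];
            for (std::size_t r = 0; r < n; ++r) m[r * n + c] /= sum;
        }
    }
    return m;
}
```

Calling `sinkhorn_normalize({1, 2, 3, 4}, 2, 50)` yields a 2 x 2 matrix whose rows and columns each sum to 1 within float tolerance.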
CUDA implementations + dispatch for the five DeepSeek-V4-Flash ggml ops (DSV4_HC_SPLIT_SINKHORN, DSV4_HC_WEIGHTED_SUM, DSV4_HC_EXPAND, DSV4_FP8_KV_QUANTIZE, DSV4_ROPE_TAIL), plus the V4-coupled multi-GPU correctness fixes in ggml-cuda.cu (split-buffer DSV4 exception, view_src cross-device dispatch refusal, env-gated peer-copy debug) and the software-FP16 dequantize_V<half> fallback in fattn-common.cuh. Additive, bounded to ggml/src/ggml-cuda/* (12 files, +1142/-2; the 2 deletions are the intentional split-buffer guard replacement). No DeepSeek Sparse Attention / lightning-indexer / flash-attn-top-k code. The dequantize_V<half> fallback is code byte-identical to the merged PR-series fix d4e21f0; one rationale comment was generalized to not name a kernel that does not exist on this DSA-free base. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
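As background for DSV4_FP8_KV_QUANTIZE, this is a small reference sketch of an e4m3 round trip. It assumes the common "E4M3FN" layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, maximum finite value 448); whether the op uses exactly this variant, and any scaling it applies, is not stated in this PR. The encoder is a brute-force nearest-value search chosen for clarity, not speed, and it breaks ties toward the lower code rather than rounding to even.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical reference decoder for an E4M3FN byte (assumed layout).
float e4m3_decode(uint8_t v) {
    const int sign = v >> 7;
    const int exp  = (v >> 3) & 0xF;
    const int man  = v & 0x7;
    if (exp == 0xF && man == 0x7) {
        return std::numeric_limits<float>::quiet_NaN(); // the only NaN pattern
    }
    // exp == 0 is subnormal: (man / 8) * 2^-6; otherwise (1 + man / 8) * 2^(exp - 7).
    const float mag = exp == 0 ? std::ldexp(man / 8.0f, -6)
                               : std::ldexp(1.0f + man / 8.0f, exp - 7);
    return sign ? -mag : mag;
}

// Hypothetical brute-force encoder: pick the finite code closest to x.
// Out-of-range inputs saturate to +/-448 because that is the nearest value.
uint8_t e4m3_encode(float x) {
    if (std::isnan(x)) {
        return 0x7F;
    }
    uint8_t best     = 0;
    float   best_err = std::numeric_limits<float>::infinity();
    for (int code = 0; code < 256; ++code) {
        if ((code & 0x7F) == 0x7F) continue; // skip NaN encodings
        const float err = std::fabs(e4m3_decode(uint8_t(code)) - x);
        if (err < best_err) {
            best_err = err;
            best     = uint8_t(code);
        }
    }
    return best;
}
```

With only 3 mantissa bits, a value such as 0.3 does not survive the round trip exactly; it lands on the nearest representable neighbour.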
Brings the DSA-free V4 source of truth up to date with ggml-org master (18d1717 -> 0253fb2). 22/23 overlapping core files auto-merged. Sole content conflict: src/llama-arch.cpp LLM_TENSOR_INFOS. Resolved by keeping the V4 hyperconnection / compressed-KV tensor->op mappings (INDEXER_COMPRESSOR_*, HC_*, FFN_GATE_TID2EID, *_APE) and adopting upstream's NextN/MTP reclassification (LAYER_OUTPUT -> LAYER_REPEATING, the loader-block-index fault fix). V4 functionally ignores NextN/MTP (reserved for future MTP), so upstream's more-robust shared-infra classification is the correct up-to-date choice. DSA-exclusion intact: lightning-indexer / flash_attn-topk / kv_cache_dsa absent; the 5 V4 layer commits unchanged. Revalidation on gpudual (CUDA build + DSV4 19/19 + coherent decode) before push. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Silent semantic conflict from the +20 upstream merge: origin/master added a new `uint32_t n_rs_seq` parameter (position 13) to the llama_memory_hybrid_iswa constructor. Git auto-merged the header and the V4 call site separately with no text conflict, leaving the V4 create_memory() path passing 16 args to a 17-arg ctor (filter_attn landed on `bool unified`, a std::function->bool error). Pass `cparams.n_rs_seq` at position 13, matching the convention of the two non-V4 hybrid_iswa/hybrid call sites in this file. C++ side (llama, llama-completion, test-backend-ops) builds clean locally; gpudual CUDA revalidation follows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
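The failure mode described above can be reproduced in miniature. The sketch below uses a hypothetical, heavily reduced constructor (four parameters instead of seventeen; the names are borrowed from the commit message but the order and types are assumptions). A parameter inserted in the middle of a positional list shifts every later argument at stale call sites, and because `std::function` converts to `bool` only explicitly, the shifted filter argument cannot bind to the `bool` slot.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

using layer_filter = std::function<bool(int32_t)>;

// Hypothetical stand-in for the real class, reduced to the relevant shape.
struct hybrid_memory_sketch {
    uint32_t     n_seq_max;
    uint32_t     n_rs_seq;   // the newly inserted parameter
    bool         unified;
    layer_filter filter_attn;

    hybrid_memory_sketch(uint32_t n_seq_max_, uint32_t n_rs_seq_, bool unified_, layer_filter filter_attn_)
        : n_seq_max(n_seq_max_), n_rs_seq(n_rs_seq_), unified(unified_), filter_attn(std::move(filter_attn_)) {
        assert(filter_attn); // the sketch expects a non-empty filter
    }
};

// A stale call site written before n_rs_seq existed would look like
//     hybrid_memory_sketch(4, true, filter);
// and no longer compiles: one argument is missing, and the filter lines up
// with `bool unified`, which std::function cannot implicitly convert to.

// Updated call site: pass the new argument at its position.
hybrid_memory_sketch make_sketch(uint32_t n_rs_seq) {
    layer_filter even_layers = [](int32_t il) { return il % 2 == 0; };
    return hybrid_memory_sketch(4, n_rs_seq, true, even_layers);
}
```

Because the header and the call site are edited in different files, a textual merge reports no conflict; only a build catches it.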
A small HIP portability patch based on @cchuter's feat/v4-port-cuda enables DeepSeek-V4-Flash to build and run on ROCm (verified on gfx1151 Strix Halo, decoding at 14.8 t/s).