
Add DeepSeek-V3.2 fused indexer path #788

Draft

XiaobingSuper wants to merge 11 commits into ROCm:main from XiaobingSuper:zxb/dsv32-indexer-fusion

Conversation

@XiaobingSuper
Contributor

@XiaobingSuper XiaobingSuper commented May 14, 2026

Summary

  • Fuse DeepSeek-V3.2 indexer wk and weights_proj when the checkpoint layout is compatible, including FP8 block-scale wk load support.
  • Add the fused indexer Q RoPE + Q quant + weight scaling + K norm/RoPE/cache path, guarded by ATOM_ENABLE_DS_INDEXER_QK_ROPE_CACHE_FUSION and enabled by default.
  • Preserve fallback behavior when the env is disabled, when the fused kernel shape constraints are not met, or when indexer.wk uses an unsupported weight dtype such as FP4/MXFP4.
  • Route sparse indexer plugin mode through the same fused/fallback selection (a minimal selection sketch follows below).
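
The fused/fallback selection can be pictured with a minimal, self-contained sketch. Both path functions below are hypothetical stand-ins (the real entry points in this PR are AITER's indexer_qk_rope_quant_and_cache for the fused path and indexer_k_quant_and_cache for the fallback, whose signatures are not reproduced here), and the concrete shape constraint is purely illustrative:

```python
def fused_qk_rope_quant_and_cache(q, k):  # hypothetical stand-in
    return "fused"

def fallback_indexer_path(q, k):  # hypothetical stand-in
    return "fallback"

def run_indexer(q, k, enable_fusion: bool, wk_dtype: str, head_dim: int):
    # Fall back when the env gate is off, when the fused kernel's shape
    # constraints are not met (the head_dim check is purely illustrative),
    # or when indexer.wk uses an unsupported weight dtype such as FP4/MXFP4.
    if not enable_fusion or wk_dtype in ("fp4", "mxfp4") or head_dim % 64 != 0:
        return fallback_indexer_path(q, k)
    return fused_qk_rope_quant_and_cache(q, k)
```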

Performance Validation

Environment:

  • Model: /shared/data/amd_int/models/DeepSeek-V3.2
  • Hardware: MI355, TP=4
  • Benchmark: ISL=1000, OSL=100, NUM=10*CON, warmups=4*CON
  • Baseline: same ATOM/AITER branch with ATOM_ENABLE_DS_INDEXER_QK_ROPE_CACHE_FUSION=0
  • Fused: default path with ATOM_ENABLE_DS_INDEXER_QK_ROPE_CACHE_FUSION=1
  • AITER JIT cache was cleared before validation
  • Logs: /workdir/my_test/results/indexer_refusion_repro_20260514_120947

| CON | Baseline output tok/s | Fused output tok/s | Output delta | Baseline total tok/s | Fused total tok/s | Total speedup | Total delta |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 191.42 | 202.15 | +5.6% | 2105.61 | 2223.67 | 1.056x | +5.6% |
| 8 | 324.49 | 342.91 | +5.7% | 3569.41 | 3772.00 | 1.057x | +5.7% |
| 16 | 465.05 | 481.38 | +3.5% | 5115.58 | 5295.15 | 1.035x | +3.5% |

Accuracy Validation

gsm8k, 3-shot, fused path enabled:

| Filter | exact_match | Stderr |
| --- | --- | --- |
| flexible-extract | 0.9462 | 0.0062 |
| strict-match | 0.9469 | 0.0062 |

No material accuracy regression was observed in this run.

Test plan

  • git diff --check passed for ATOM changes.
  • python3 -m py_compile atom/models/deepseek_v2.py atom/plugin/attention_mla_sparse.py atom/utils/envs.py passed.
  • AITER indexer_qk_rope_quant_and_cache JIT compiled and ran during local validation.
  • Re-ran performance validation for CON=4/8/16 with the env-controlled fallback and fused paths.
  • Re-ran gsm8k 3-shot accuracy with the fused path enabled.

Notes

Made with Cursor

XiaobingSuper and others added 6 commits May 14, 2026 05:51
Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot AI left a comment


Pull request overview

This PR introduces a fused indexer path for DeepSeek-V3.2 to improve decode performance. It fuses the indexer's wk and weights_proj into a single GEMM (with FP8 block-scale wk load support) and adds an AITER indexer_qk_rope_quant_and_cache kernel that combines Q RoPE, Q quantization, weight scaling, and K norm/RoPE/cache writes. The new path is gated by ATOM_ENABLE_DS_INDEXER_QK_ROPE_CACHE_FUSION (default on), with fallbacks when the env is disabled, when shape constraints aren't met, or when indexer.wk uses an unsupported dtype.

Changes:

  • New IndexerWkWeightsProjLinear (FP8 block-scale wk dequant → BF16) and packed-modules wiring to fuse indexer.wk + indexer.weights_proj (a concatenated-GEMM sketch follows after this list).
  • New fused indexer_qk_rope_quant_and_cache call site in both native and sparse plugin paths, with env-guarded fallback to indexer_k_quant_and_cache.
  • Updated sparse_attn_indexer custom-op signatures (and fakes) to take K-norm/RoPE cache/scale args; fake/dummy returns now allocate fp32 weights to match the fused output dtype.
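
The wk + weights_proj fusion follows the standard packed-linear pattern: stack both projections' output rows into one weight, run a single GEMM, and split the result. A minimal sketch, assuming already-dequantized weights; the class name and dimensions here are hypothetical, not the PR's IndexerWkWeightsProjLinear:

```python
import torch
import torch.nn as nn

class FusedWkWeightsProj(nn.Module):
    """Hypothetical sketch: one GEMM serving two projections of the same input."""

    def __init__(self, in_dim: int, wk_out: int, weights_proj_out: int):
        super().__init__()
        self.wk_out = wk_out
        # One weight holding both projections, stacked along the output dim.
        self.fused = nn.Linear(in_dim, wk_out + weights_proj_out, bias=False)

    def forward(self, x: torch.Tensor):
        y = self.fused(x)              # one GEMM instead of two
        k = y[..., : self.wk_out]      # wk output slice
        w = y[..., self.wk_out :]      # weights_proj output slice
        return k, w

# Usage: the weight loader would copy the (dequantized) wk rows and the
# weights_proj rows into the matching slices of `fused.weight`.
m = FusedWkWeightsProj(in_dim=7168, wk_out=128, weights_proj_out=64)
k, w = m(torch.randn(2, 7168))
```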

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| atom/utils/envs.py | Adds the ATOM_ENABLE_DS_INDEXER_QK_ROPE_CACHE_FUSION env var (default 1; a flag-reading sketch follows below). |
| atom/plugin/attention_mla_sparse.py | Plugin-mode sparse indexer routes to the fused kernel and updates fake/dummy returns. |
| atom/models/deepseek_v2.py | Adds the fused wk+weights_proj linear, the fused QK-rope/cache call, the fusion eligibility check, and packed-modules and quant-exclude updates. |
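
The exact helper in atom/utils/envs.py is not reproduced on this page; a default-on flag typically reads along these lines (an assumed pattern, not the PR's code):

```python
import os

# Assumed pattern for a default-on integer flag: fusion stays enabled unless
# the user exports ATOM_ENABLE_DS_INDEXER_QK_ROPE_CACHE_FUSION=0.
ATOM_ENABLE_DS_INDEXER_QK_ROPE_CACHE_FUSION: bool = bool(
    int(os.getenv("ATOM_ENABLE_DS_INDEXER_QK_ROPE_CACHE_FUSION", "1"))
)
```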

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +2142 to +2147

```python
model_prefix = maybe_prefix(prefix, "model")
use_indexer_wk_weights_proj_fusion = _can_fuse_indexer_wk_weights_proj(
    config,
    quant_config,
    f"{model_prefix}.layers.0.self_attn.indexer",
)
```
Comment on lines 2253 to 2257

```python
quant_exclude_name_mapping: dict[str, str] = {
    # HF quant config uses "indexers_proj" but the ATOM module path is
    # "indexer.weights_proj". str.replace translates each exclude entry.
    "indexers_proj": "indexer.weights_proj",
    # "indexer.wk_weights_proj". str.replace translates each exclude entry.
    "indexers_proj": "indexer.wk_weights_proj",
}
```
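
As quoted, this dict repeats the "indexers_proj" key, and a Python dict literal silently keeps only the last duplicate, so the "indexer.weights_proj" mapping is dropped. One possible fix, assuming the mapping should track the model-level fusion decision (use_indexer_wk_weights_proj_fusion is the flag computed in the snippet above):

```python
# Duplicate keys in a dict literal collapse to the last value, so map the
# single HF name to whichever ATOM module path is actually in use.
quant_exclude_name_mapping: dict[str, str] = {
    "indexers_proj": (
        "indexer.wk_weights_proj"
        if use_indexer_wk_weights_proj_fusion
        else "indexer.weights_proj"
    ),
}
```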
Comment threads:

  • atom/plugin/attention_mla_sparse.py
  • atom/models/deepseek_v2.py (outdated)
  • atom/models/deepseek_v2.py (outdated)
  • atom/models/deepseek_v2.py
@XiaobingSuper XiaobingSuper requested a review from valarLip May 14, 2026 13:31
XiaobingSuper and others added 2 commits May 14, 2026 08:43
Keep indexer weight fusion decisions consistent with quant fallback paths and make dummy/profile behavior deterministic.

Co-authored-by: Cursor <cursoragent@cursor.com>
Keep the PR compatible with the repository's Black formatting check after the review fixes.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 14, 2026 13:45
Collapse the review helper stack while keeping a single model-level fusion decision for weight loading.

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@XiaobingSuper XiaobingSuper marked this pull request as draft May 14, 2026 13:56
Ensure threaded checkpoint loading dequantizes pending FP8 indexer wk weights on the same device as their scales.

Co-authored-by: Cursor <cursoragent@cursor.com>
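
The device-alignment fix in this commit can be sketched as follows; the helper name and the 128x128 block layout are assumptions, not the PR's actual loader code:

```python
import torch

def dequant_indexer_wk(weight_fp8: torch.Tensor, scales: torch.Tensor,
                       block: int = 128) -> torch.Tensor:
    # Under threaded checkpoint loading, the pending FP8 weight and its block
    # scales may land on different devices, so align them before dequantizing
    # rather than assuming both already share a device.
    weight_fp8 = weight_fp8.to(scales.device)
    w = weight_fp8.to(torch.float32)
    # Broadcast each per-block scale over its 2-D tile, then trim to shape.
    s = scales.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    return (w * s[: w.shape[0], : w.shape[1]]).to(torch.bfloat16)
```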
