
Add DeepSeek-V3.2 fused indexer path #788

Draft

XiaobingSuper wants to merge 11 commits into ROCm:main from XiaobingSuper:zxb/dsv32-indexer-fusion

Conversation

@XiaobingSuper
Contributor

@XiaobingSuper XiaobingSuper commented May 14, 2026

Summary

  • Fuse DeepSeek-V3.2 indexer wk and weights_proj when the checkpoint layout is compatible, including FP8 block-scale wk load support.
  • Add the fused indexer Q RoPE + Q quant + weight scaling + K norm/RoPE/cache path, guarded by ATOM_ENABLE_DS_INDEXER_QK_ROPE_CACHE_FUSION and enabled by default.
  • Preserve fallback behavior when the env is disabled, when the fused kernel shape constraints are not met, or when indexer.wk uses an unsupported weight dtype such as FP4/MXFP4.
  • Route sparse indexer plugin mode through the same fused/fallback selection (a minimal selection sketch follows below).
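
The fused/fallback selection can be pictured with a minimal, self-contained sketch. Both path functions below are hypothetical stand-ins (the real entry points in this PR are AITER's indexer_qk_rope_quant_and_cache for the fused path and indexer_k_quant_and_cache for the fallback, whose signatures are not reproduced here), and the concrete shape constraint is purely illustrative:

```python
def fused_qk_rope_quant_and_cache(q, k):  # hypothetical stand-in
    return "fused"

def fallback_indexer_path(q, k):  # hypothetical stand-in
    return "fallback"

def run_indexer(q, k, enable_fusion: bool, wk_dtype: str, head_dim: int):
    # Fall back when the env gate is off, when the fused kernel's shape
    # constraints are not met (the head_dim check is purely illustrative),
    # or when indexer.wk uses an unsupported weight dtype such as FP4/MXFP4.
    if not enable_fusion or wk_dtype in ("fp4", "mxfp4") or head_dim % 64 != 0:
        return fallback_indexer_path(q, k)
    return fused_qk_rope_quant_and_cache(q, k)
```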

Performance Validation

Environment:

  • Model: /shared/data/amd_int/models/DeepSeek-V3.2
  • Hardware: MI355, TP=4
  • Benchmark: ISL=1000, OSL=100, NUM=10*CON, warmups=4*CON
  • Baseline: same ATOM/AITER branch with ATOM_ENABLE_DS_INDEXER_QK_ROPE_CACHE_FUSION=0
  • Fused: default path with ATOM_ENABLE_DS_INDEXER_QK_ROPE_CACHE_FUSION=1
  • AITER JIT cache was cleared before validation
  • Logs: /workdir/my_test/results/indexer_refusion_repro_20260514_120947

| CON | Baseline output tok/s | Fused output tok/s | Output delta | Baseline total tok/s | Fused total tok/s | Total speedup | Total delta |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 191.42 | 202.15 | +5.6% | 2105.61 | 2223.67 | 1.056x | +5.6% |
| 8 | 324.49 | 342.91 | +5.7% | 3569.41 | 3772.00 | 1.057x | +5.7% |
| 16 | 465.05 | 481.38 | +3.5% | 5115.58 | 5295.15 | 1.035x | +3.5% |

Accuracy Validation

gsm8k, 3-shot, fused path enabled:

| Filter | exact_match | Stderr |
| --- | --- | --- |
| flexible-extract | 0.9462 | 0.0062 |
| strict-match | 0.9469 | 0.0062 |

No material accuracy regression was observed in this run.

Test plan

  • git diff --check passed for ATOM changes.
  • python3 -m py_compile atom/models/deepseek_v2.py atom/plugin/attention_mla_sparse.py atom/utils/envs.py passed.
  • AITER indexer_qk_rope_quant_and_cache JIT compiled and ran during local validation.
  • Re-ran performance validation for CON=4/8/16 with the env-controlled fallback and fused paths.
  • Re-ran gsm8k 3-shot accuracy with the fused path enabled.

Notes

Made with Cursor

XiaobingSuper and others added 6 commits May 14, 2026 05:51
Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot AI left a comment


Pull request overview

This PR introduces a fused indexer path for DeepSeek-V3.2 to improve decode performance. It fuses the indexer's wk and weights_proj into a single GEMM (with FP8 block-scale wk load support) and adds an AITER indexer_qk_rope_quant_and_cache kernel that combines Q RoPE, Q quantization, weight scaling, and K norm/RoPE/cache writes. The new path is gated by ATOM_ENABLE_DS_INDEXER_QK_ROPE_CACHE_FUSION (default on), with fallbacks when the env is disabled, when shape constraints aren't met, or when indexer.wk uses an unsupported dtype.

Changes:

  • New IndexerWkWeightsProjLinear (FP8 block-scale wk dequant → BF16) and packed-modules wiring to fuse indexer.wk + indexer.weights_proj (a concatenated-GEMM sketch follows after this list).
  • New fused indexer_qk_rope_quant_and_cache call site in both native and sparse plugin paths, with env-guarded fallback to indexer_k_quant_and_cache.
  • Updated sparse_attn_indexer custom-op signatures (and fakes) to take K-norm/RoPE cache/scale args; fake/dummy returns now allocate fp32 weights to match the fused output dtype.
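
The wk + weights_proj fusion follows the standard packed-linear pattern: stack both projections' output rows into one weight, run a single GEMM, and split the result. A minimal sketch, assuming already-dequantized weights; the class name and dimensions here are hypothetical, not the PR's IndexerWkWeightsProjLinear:

```python
import torch
import torch.nn as nn

class FusedWkWeightsProj(nn.Module):
    """Hypothetical sketch: one GEMM serving two projections of the same input."""

    def __init__(self, in_dim: int, wk_out: int, weights_proj_out: int):
        super().__init__()
        self.wk_out = wk_out
        # One weight holding both projections, stacked along the output dim.
        self.fused = nn.Linear(in_dim, wk_out + weights_proj_out, bias=False)

    def forward(self, x: torch.Tensor):
        y = self.fused(x)              # one GEMM instead of two
        k = y[..., : self.wk_out]      # wk output slice
        w = y[..., self.wk_out :]      # weights_proj output slice
        return k, w

# Usage: the weight loader would copy the (dequantized) wk rows and the
# weights_proj rows into the matching slices of `fused.weight`.
m = FusedWkWeightsProj(in_dim=7168, wk_out=128, weights_proj_out=64)
k, w = m(torch.randn(2, 7168))
```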

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| atom/utils/envs.py | Adds the ATOM_ENABLE_DS_INDEXER_QK_ROPE_CACHE_FUSION env var (default 1; a flag-reading sketch follows below). |
| atom/plugin/attention_mla_sparse.py | Plugin-mode sparse indexer routes to the fused kernel and updates fake/dummy returns. |
| atom/models/deepseek_v2.py | Adds the fused wk+weights_proj linear, the fused QK-rope/cache call, the fusion eligibility check, and packed-modules and quant-exclude updates. |
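
The exact helper in atom/utils/envs.py is not reproduced on this page; a default-on flag typically reads along these lines (an assumed pattern, not the PR's code):

```python
import os

# Assumed pattern for a default-on integer flag: fusion stays enabled unless
# the user exports ATOM_ENABLE_DS_INDEXER_QK_ROPE_CACHE_FUSION=0.
ATOM_ENABLE_DS_INDEXER_QK_ROPE_CACHE_FUSION: bool = bool(
    int(os.getenv("ATOM_ENABLE_DS_INDEXER_QK_ROPE_CACHE_FUSION", "1"))
)
```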

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +2142 to +2147

```python
model_prefix = maybe_prefix(prefix, "model")
use_indexer_wk_weights_proj_fusion = _can_fuse_indexer_wk_weights_proj(
    config,
    quant_config,
    f"{model_prefix}.layers.0.self_attn.indexer",
)
```
Comment on lines 2253 to 2257

```python
quant_exclude_name_mapping: dict[str, str] = {
    # HF quant config uses "indexers_proj" but the ATOM module path is
    # "indexer.weights_proj". str.replace translates each exclude entry.
    "indexers_proj": "indexer.weights_proj",
    # "indexer.wk_weights_proj". str.replace translates each exclude entry.
    "indexers_proj": "indexer.wk_weights_proj",
}
```
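
As quoted, this dict repeats the "indexers_proj" key, and a Python dict literal silently keeps only the last duplicate, so the "indexer.weights_proj" mapping is dropped. One possible fix, assuming the mapping should track the model-level fusion decision (use_indexer_wk_weights_proj_fusion is the flag computed in the snippet above):

```python
# Duplicate keys in a dict literal collapse to the last value, so map the
# single HF name to whichever ATOM module path is actually in use.
quant_exclude_name_mapping: dict[str, str] = {
    "indexers_proj": (
        "indexer.wk_weights_proj"
        if use_indexer_wk_weights_proj_fusion
        else "indexer.weights_proj"
    ),
}
```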
Comment threads:

  • atom/plugin/attention_mla_sparse.py
  • atom/models/deepseek_v2.py (outdated)
  • atom/models/deepseek_v2.py (outdated)
  • atom/models/deepseek_v2.py
@XiaobingSuper XiaobingSuper requested a review from valarLip May 14, 2026 13:31
XiaobingSuper and others added 2 commits May 14, 2026 08:43
Keep indexer weight fusion decisions consistent with quant fallback paths and make dummy/profile behavior deterministic.

Co-authored-by: Cursor <cursoragent@cursor.com>
Keep the PR compatible with the repository's Black formatting check after the review fixes.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 14, 2026 13:45
Collapse the review helper stack while keeping a single model-level fusion decision for weight loading.

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@XiaobingSuper XiaobingSuper marked this pull request as draft May 14, 2026 13:56
Ensure threaded checkpoint loading dequantizes pending FP8 indexer wk weights on the same device as their scales.

Co-authored-by: Cursor <cursoragent@cursor.com>
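
The device-alignment fix in this commit can be sketched as follows; the helper name and the 128x128 block layout are assumptions, not the PR's actual loader code:

```python
import torch

def dequant_indexer_wk(weight_fp8: torch.Tensor, scales: torch.Tensor,
                       block: int = 128) -> torch.Tensor:
    # Under threaded checkpoint loading, the pending FP8 weight and its block
    # scales may land on different devices, so align them before dequantizing
    # rather than assuming both already share a device.
    weight_fp8 = weight_fp8.to(scales.device)
    w = weight_fp8.to(torch.float32)
    # Broadcast each per-block scale over its 2-D tile, then trim to shape.
    s = scales.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    return (w * s[: w.shape[0], : w.shape[1]]).to(torch.bfloat16)
```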
