Split kv decode #146

Merged

sunjiweiswift merged 25 commits into split_kv_decode from copilot/sub-pr-145
Mar 24, 2026
Conversation

Contributor

Copilot AI commented Mar 23, 2026

  • Add bool use_sink = false; and bool use_causal_mask = false; to Arguments struct in xe_fmha_fwd_decode_runner.hpp
  • Remove standalone bool use_sink parameter from FmhaDecodeRunner::operator() and FmhaSplitDecodeRunner::operator() signatures
  • Update .cpp.in bodies: dispatch on params.use_sink and params.use_causal_mask
  • Update flash_attention.cpp: set params.use_sink and params.use_causal_mask; update DISPATCH_DECODE_KERNEL macro
  • Fix non-ASCII em-dash character in flash_attention.cpp comment (replace with ASCII -)


Copilot AI and others added 5 commits March 18, 2026 14:13
…el compilation units (#140)

Split the monolithic template instantiation of xe_fmha_fwd_decode_runner.hpp
into 72 separate .cpp files (one per QG_SZ × HEAD_DIM × PAGE_SIZE combination),
each compiled as its own library. This enables parallel compilation and
significantly speeds up build times.

Changes:
- Create xe_fmha_fwd_decode_kernel.cpp.in template for per-combination compilation
- Create xe_fmha_fwd_decode_dispatch.hpp with function declarations for all 72 kernels
- Move decode::mha_fwd() from header to flash_attention.cpp with dispatch table
- Update src/CMakeLists.txt to generate .cpp files via configure_file()
- Remove mha_fwd() definition from xe_fmha_fwd_decode_runner.hpp header
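The configure_file() generation step above can be sketched roughly as below. This is an assumption-laden illustration: the placeholder names (QG_SZ, HEAD_DIM, PAGE_SIZE) match the .cpp.in substitution variables mentioned in this PR, but the actual value lists, paths, and target wiring in src/CMakeLists.txt may differ.

```cmake
# Hypothetical sketch: stamp out one .cpp per (QG_SZ, HEAD_DIM, PAGE_SIZE)
# combination from the shared template, so each compiles independently.
foreach(QG_SZ 8 16)
  foreach(HEAD_DIM 64 128 192)
    foreach(PAGE_SIZE 16 64 128)
      set(_out
        ${CMAKE_CURRENT_BINARY_DIR}/xe_fmha_fwd_decode_${QG_SZ}_${HEAD_DIM}_${PAGE_SIZE}.cpp)
      # @ONLY restricts substitution to @VAR@ placeholders in the template
      configure_file(xe_fmha_fwd_decode_kernel.cpp.in ${_out} @ONLY)
      list(APPEND DECODE_KERNEL_SRCS ${_out})
    endforeach()
  endforeach()
endforeach()
```

Because every generated file is its own translation unit, the build system can compile them in parallel instead of instantiating all combinations in one monolithic unit.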

Co-authored-by: airMeng <39229107+airMeng@users.noreply.github.com>
Co-authored-by: jiwei1.sun <jiwei1.sun@intel.com>
* fix norm with noncontiguous input

* remove comment out test

* support in kernel
* Add MXFP4 Per Token Group Quant kernel and tests

Remove commented out fp8 blockwise group gemm registration

* Add benchmarking for per token group quant mxfp4

* Add test to run_suite.py

* Fix group size constraint for mxfp4; Add benchmark test to CI flow

* Remove reference provider from the benchmark script

- Add check for quantized and scale values separately
- Include eps value in ref quant function call

* Fix MXFP4 quantization to match OCP MX spec

- Replace ceil(log2(max/6.0)) scale computation with floor(log2(max)) -
  E2M1_EMAX per OCP MX spec
- Fix roundTiesToEven at midpoints in SYCL kernel (change <= to < at
  odd-mantissa boundaries)
- Replace naive argmin-based quantize_to_e2m1 reference with
  microxcaling _quantize_elemwise_core algorithm
- Normalize signed zeros (+0.0 vs -0.0) before packed byte comparison in
  tests and benchmark

* Fix lint issues

* Remove unsupported group sizes

* Apply formatting check

* Add TODO for quantize_to_e2m1

* Update tests/test_per_token_group_quant_mxfp4.py

Co-authored-by: Meng, Hengyu <airdldl@163.com>

* trigger CI

---------

Co-authored-by: Meng, Hengyu <airdldl@163.com>
Copilot AI changed the title [WIP] Split kv decode to enhance performance Split kv decode Mar 23, 2026
Copilot AI requested a review from sunjiweiswift March 23, 2026 07:33
@sunjiweiswift
Collaborator

@copilot "FMHAConfig" is undefined, and split_kv_decode has updates. Please rebase on main again.


…nditional dispatch between DecodeConfig and SplitDecodeConfig, comment out page_size 32

Co-authored-by: sunjiweiswift <16934286+sunjiweiswift@users.noreply.github.com>
Agent-Logs-Url: https://github.com/sgl-project/sgl-kernel-xpu/sessions/82c107d1-3f61-4ce0-9444-d7f19f27a292
Contributor Author

Copilot AI commented Mar 23, 2026

@copilot "FMHAConfig" is undefined, and split_kv_decode has updates. Please rebase on main again.

Fixed in 8ce3170. Three changes:

  1. Renamed FMHAConfig to DecodeConfig in xe_fmha_fwd_decode_kernel.cpp.in (fixes the undefined symbol)
  2. Added conditional dispatch in flash_attention.cpp: uses SplitDecodeConfig when params.use_split_kv_decode is true, otherwise falls back to DecodeConfig
  3. Commented out page_size 32 support (matching base branch update)

Contributor Author

Copilot AI commented Mar 23, 2026

This branch has conflicts that must be resolved @copilot

Resolved in eeee619. Merged latest origin/split_kv_decode, resolved the conflict in xe_fmha_fwd_decode_runner.hpp (kept mha_fwd in flash_attention.cpp), and renamed kernel_dispatch to run to match the base branch update.

@copilot "FMHAConfig" is undefined, and split_kv_decode has updates. Please rebase on main again.

Addressed in 8ce3170 (FMHAConfig fix) and eeee619 (merge conflicts resolved with latest split_kv_decode).

@sunjiweiswift
Collaborator

@copilot Use a function pointer table to perform the calls, rather than instantiating a template object. You can refer to flash_attention.cpp in the main branch.

Contributor Author

Copilot AI commented Mar 24, 2026

@copilot Use a function pointer table to perform the calls, rather than instantiating a template object. You can refer to flash_attention.cpp in the main branch.

Done in 1614a4c. Following the GroupGemmXe20 pattern:

  • xe_fmha_fwd_decode_runner.hpp: Added launch_fmha_decode<QG, HD, PS> and launch_fmha_split_decode<QG, HD, PS> template function definitions here.
  • .cpp.in files: Reduced to a single explicit instantiation line (template void launch_fmha_decode<@QG_SZ@, @HEAD_DIM@, @PAGE_SIZE@>(bool, const Arguments&)) — matching GroupGemmXe20LauncherInstance.cpp.in.
  • xe_fmha_fwd_decode_dispatch.hpp: Changed to template declarations + extern template instantiation declarations (equivalent to GroupGemmXe20.cpp's DECLARE_XE20_MOE_EXTERN pattern).
  • flash_attention.cpp: Function pointer table uses &launch_fmha_decode<QG, HD, PS> addresses — no template objects instantiated inline.

@sunjiweiswift
Collaborator

@copilot
In flash_attention.cpp, include only the struct declarations and calls to the corresponding structs' operator(); the actual struct definitions should be implemented in cpp.in. You may refer to GroupGemmXe20.cpp as a reference.

Contributor Author

Copilot AI commented Mar 24, 2026

@copilot
In flash_attention.cpp, include only the struct declarations and calls to the corresponding structs' operator(); the actual struct definitions should be implemented in ...

Already done in 25d22d1:

  • flash_attention.cpp: Contains only extern template struct declarations (via dispatch.hpp) and calls via &FmhaDecodeRunner<QG,HD,PS>::call / &FmhaSplitDecodeRunner<QG,HD,PS>::call — no kernel definitions.
  • runner.hpp: Structs with operator() declared only (no body), plus a static call() helper.
  • .cpp.in files: Full operator() body + template struct explicit instantiation — mirroring the GroupGemmXe20LauncherInstance.cpp.in pattern.

… (GroupGemmXe20 pattern)

Co-authored-by: sunjiweiswift <16934286+sunjiweiswift@users.noreply.github.com>
Agent-Logs-Url: https://github.com/sgl-project/sgl-kernel-xpu/sessions/85877884-e91a-400f-adcc-5b9a55692df9
…rom operator() signature

Co-authored-by: sunjiweiswift <16934286+sunjiweiswift@users.noreply.github.com>
Agent-Logs-Url: https://github.com/sgl-project/sgl-kernel-xpu/sessions/f88d0fec-8f93-4f48-99b8-714d68fd14f4
@sunjiweiswift sunjiweiswift marked this pull request as ready for review March 24, 2026 05:12
@sunjiweiswift sunjiweiswift merged commit 48bcf2e into split_kv_decode Mar 24, 2026
@sunjiweiswift sunjiweiswift deleted the copilot/sub-pr-145 branch March 24, 2026 05:21