Split kv decode #146
…el compilation units (#140)

Split the monolithic template instantiation of xe_fmha_fwd_decode_runner.hpp into 72 separate .cpp files (one per QG_SZ × HEAD_DIM × PAGE_SIZE combination), each compiled as its own library. This enables parallel compilation and significantly speeds up build times.

Changes:
- Create xe_fmha_fwd_decode_kernel.cpp.in template for per-combination compilation
- Create xe_fmha_fwd_decode_dispatch.hpp with function declarations for all 72 kernels
- Move decode::mha_fwd() from header to flash_attention.cpp with dispatch table
- Update src/CMakeLists.txt to generate .cpp files via configure_file()
- Remove mha_fwd() definition from xe_fmha_fwd_decode_runner.hpp header

Co-authored-by: airMeng <39229107+airMeng@users.noreply.github.com>
Co-authored-by: jiwei1.sun <jiwei1.sun@intel.com>
* fix norm with noncontiguous input
* remove comment out test
* support in kernel
* Add MXFP4 Per Token Group Quant kernel and tests
  Remove commented out fp8 blockwise group gemm registration
* Add benchmarking for per token group quant mxfp4
* Add test to run_suite.py
* Fix group size constraint for mxfp4; Add benchmark test to CI flow
* Remove reference provider from the benchmark script
  - Add check for quantized and scale values separately
  - Include eps value in ref quant function call
* Fix MXFP4 quantization to match OCP MX spec
  - Replace ceil(log2(max/6.0)) scale computation with floor(log2(max)) - E2M1_EMAX per OCP MX spec
  - Fix roundTiesToEven at midpoints in SYCL kernel (change <= to < at odd-mantissa boundaries)
  - Replace naive argmin-based quantize_to_e2m1 reference with microxcaling _quantize_elemwise_core algorithm
  - Normalize signed zeros (+0.0 vs -0.0) before packed byte comparison in tests and benchmark
* Fix lint issues
* Remove unsupported group sizes
* Apply formatting check
* Add TODO for quantize_to_e2m1
* Update tests/test_per_token_group_quant_mxfp4.py
  Co-authored-by: Meng, Hengyu <airdldl@163.com>
* trigger CI

---------

Co-authored-by: Meng, Hengyu <airdldl@163.com>
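
The two rules named in the MXFP4 fix above (the `floor(log2(max)) - E2M1_EMAX` shared exponent and round-ties-to-even at midpoints) can be sketched in plain C++ roughly as follows. This is only an illustration, not the SYCL kernel from the commit: the helper names `shared_exponent` and `quantize_e2m1`, the clamp to the E8M0 exponent range, and the saturation at 6.0 are assumptions of this sketch, and NaN handling is omitted.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// E2M1 (FP4) has a largest magnitude of 6.0 = 1.5 * 2^2, so its max exponent is 2.
constexpr int kE2M1Emax = 2;

// Shared power-of-two exponent for one group: floor(log2(amax)) - E2M1_EMAX.
// std::ilogb returns floor(log2(x)) exactly for positive finite x.
// Assumption of this sketch: zero or non-finite amax maps to the smallest
// exponent, and the result is clamped to [-127, 127] (the E8M0 range).
inline int shared_exponent(float group_amax) {
  if (!(group_amax > 0.0f) || !std::isfinite(group_amax)) return -127;
  const int e = std::ilogb(group_amax) - kE2M1Emax;
  return std::clamp(e, -127, 127);
}

// Quantize an already-scaled value to a 4-bit E2M1 code (sign bit in bit 3).
// Ties at a midpoint go to the neighbor whose mantissa bit (the code's LSB) is 0.
inline std::uint8_t quantize_e2m1(float v) {
  static constexpr float kGrid[8] = {0.0f, 0.5f, 1.0f, 1.5f,
                                     2.0f, 3.0f, 4.0f, 6.0f};
  const std::uint8_t sign = std::signbit(v) ? 0x8 : 0x0;
  const float a = std::min(std::fabs(v), 6.0f);  // saturate at the largest value
  std::uint8_t code = 0;
  for (int i = 0; i < 7; ++i) {
    const float mid = 0.5f * (kGrid[i] + kGrid[i + 1]);
    // Moving up to an even code: a tie rounds up (>=).
    // Moving up to an odd code: a tie stays on the even code below (strict >).
    const bool up = ((i + 1) % 2 == 0) ? (a >= mid) : (a > mid);
    if (up) code = static_cast<std::uint8_t>(i + 1);
  }
  return static_cast<std::uint8_t>(sign | code);
}
```

With this scale rule a group whose maximum is exactly 6.0 gets exponent 0, while a maximum of 7.9 also gets exponent 0 and saturates; the strict comparison at odd codes is the same idea as the `<=` to `<` change described in the commit.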
… and fix SplitDeodeConfig typo
9bdad0b to 25a95a3
@copilot `FMHAConfig` is undefined and split_kv_decode has been updated. Please rebase onto main again.
…nditional dispatch between DecodeConfig and SplitDecodeConfig, comment out page_size 32

Co-authored-by: sunjiweiswift <16934286+sunjiweiswift@users.noreply.github.com>
Agent-Logs-Url: https://github.com/sgl-project/sgl-kernel-xpu/sessions/82c107d1-3f61-4ce0-9444-d7f19f27a292
Fixed in
Resolved in
Addressed in
@copilot Use a function pointer table to perform the calls. You can refer to
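
For context, a function pointer table of the kind requested here could look roughly like the sketch below. Everything in it is hypothetical and for illustration only: `Params`, `decode_kernel`, the supported head dimensions and page sizes, and the helper names are assumptions, not the identifiers used in this repository.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for the real kernel arguments.
struct Params {
  int head_dim;
  int page_size;
  int result;  // written by the kernel so the call can be observed
};

using KernelFn = void (*)(Params&);

// Hypothetical kernel template. In a split build, each instantiation could be
// defined in its own .cpp file and only declared where the table is built.
template <int HeadDim, int PageSize>
void decode_kernel(Params& p) {
  p.result = HeadDim * 1000 + PageSize;
}

// Supported values (assumed for this sketch).
constexpr std::array<int, 3> kHeadDims = {64, 128, 256};
constexpr std::array<int, 2> kPageSizes = {64, 128};

// Table indexed as [head_dim index][page_size index].
constexpr KernelFn kTable[3][2] = {
    {&decode_kernel<64, 64>, &decode_kernel<64, 128>},
    {&decode_kernel<128, 64>, &decode_kernel<128, 128>},
    {&decode_kernel<256, 64>, &decode_kernel<256, 128>},
};

// Map a runtime value to its position in the list of supported values.
template <std::size_t N>
std::size_t index_of(const std::array<int, N>& values, int v) {
  for (std::size_t i = 0; i < N; ++i) {
    if (values[i] == v) return i;
  }
  throw std::invalid_argument("unsupported value");
}

// Look up the kernel for the runtime parameters and call it.
inline void dispatch(Params& p) {
  const std::size_t h = index_of(kHeadDims, p.head_dim);
  const std::size_t s = index_of(kPageSizes, p.page_size);
  kTable[h][s](p);
}
```

Compared with a nested `switch`, the table keeps the set of supported combinations in one place, and an unsupported combination fails at the lookup instead of falling through.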
…pattern

Co-authored-by: sunjiweiswift <16934286+sunjiweiswift@users.noreply.github.com>
Agent-Logs-Url: https://github.com/sgl-project/sgl-kernel-xpu/sessions/5c058f18-d408-4752-aba6-2625a6e5e638
Done in
… pattern

Co-authored-by: sunjiweiswift <16934286+sunjiweiswift@users.noreply.github.com>
Agent-Logs-Url: https://github.com/sgl-project/sgl-kernel-xpu/sessions/871c6171-1314-44d4-b4f4-007ddffc6ed4
@copilot
Already done in
… (GroupGemmXe20 pattern)

Co-authored-by: sunjiweiswift <16934286+sunjiweiswift@users.noreply.github.com>
Agent-Logs-Url: https://github.com/sgl-project/sgl-kernel-xpu/sessions/85877884-e91a-400f-adcc-5b9a55692df9
…rom operator() signature

Co-authored-by: sunjiweiswift <16934286+sunjiweiswift@users.noreply.github.com>
Agent-Logs-Url: https://github.com/sgl-project/sgl-kernel-xpu/sessions/f88d0fec-8f93-4f48-99b8-714d68fd14f4
…yphen

Co-authored-by: sunjiweiswift <16934286+sunjiweiswift@users.noreply.github.com>
Agent-Logs-Url: https://github.com/sgl-project/sgl-kernel-xpu/sessions/8cc49274-a20f-42e3-aad5-39043ba2eefa
- Add `bool use_sink = false;` and `bool use_causal_mask = false;` to `Arguments` struct in `xe_fmha_fwd_decode_runner.hpp`
- Remove `bool use_sink` parameter from `FmhaDecodeRunner::operator()` and `FmhaSplitDecodeRunner::operator()` signatures
- `.cpp.in` bodies: dispatch on `params.use_sink` and `params.use_causal_mask`
- `flash_attention.cpp`: set `params.use_sink` and `params.use_causal_mask`; update `DISPATCH_DECODE_KERNEL` macro
- `flash_attention.cpp` comment (replace with ASCII `-`)
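
One common way to dispatch on runtime flags carried in an arguments struct, as the checklist above describes for `use_sink` and `use_causal_mask`, is to branch once and forward to a template instantiated on the flag values. The sketch below assumes that shape: the `Arguments` struct here is a two-field stand-in for the real one, and `run_kernel` and `launch` are hypothetical names, not functions from this PR.

```cpp
// Hypothetical subset of the arguments struct: only the two flags.
struct Arguments {
  bool use_sink = false;
  bool use_causal_mask = false;
};

// Hypothetical kernel entry point with the flags as compile-time parameters.
// It returns a small bitmask so the selected instantiation can be observed.
template <bool UseSink, bool UseCausalMask>
int run_kernel(const Arguments& /*params*/) {
  return (UseSink ? 2 : 0) | (UseCausalMask ? 1 : 0);
}

// Branch on the runtime flags once and call the matching instantiation.
inline int launch(const Arguments& params) {
  if (params.use_sink) {
    return params.use_causal_mask ? run_kernel<true, true>(params)
                                  : run_kernel<true, false>(params);
  }
  return params.use_causal_mask ? run_kernel<false, true>(params)
                                : run_kernel<false, false>(params);
}
```

Keeping the flags in the struct means the `operator()` signatures do not grow each time a new option is added; only the struct and the dispatch site change.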