fa2 xe2: precompile shared header umbrella for attn_kernels_xe_2#350
Closed
jasonboukheir wants to merge 1 commit into
Closed
fa2 xe2: precompile shared header umbrella for attn_kernels_xe_2#350jasonboukheir wants to merge 1 commit into
jasonboukheir wants to merge 1 commit into
Conversation
The chunk_prefill / paged_decode kernel templates share almost their entire transitive include set: cute, the CUTLASS-SYCL collective stack, the sycl runtime, flash_attention_v2 fusion / scheduler / epilogue, the FMHAKernel / PagedDecode kernel templates. That parse cost is identical across every generated TU and accounts for a non-trivial fraction of per-TU compile wall time on icpx (template-name-lookup over cute / cutlass is template-pathological at this header depth). Wire up target_precompile_headers(attn_kernels_xe_2 PRIVATE attn_pch.hpp) with an umbrella header that #includes both kernel template surfaces. cmake handles the -Xclang -include-pch -Xclang <build>/cmake_pch.hxx.pch plumbing automatically for icpx (clang-style PCH), so the per-TU compile commands load the pre-parsed AST snapshot in milliseconds instead of re-parsing. icpx 2025.3 SYCL+PCH quirk: cmake injects -fpch-instantiate-templates into the PCH compile, which eagerly instantiates every template the PCH-emit pass encounters. cute/util/compat/memory.hpp uses sycl::handler::parallel_for<class T> elaborated-type-specifiers to forward-declare its kernel-name tags inline; under -fpch-instantiate-templates icpx runs the SYCL kernel-name validator against the ETS tag before its implicit namespace-scope declaration has propagated, and the PCH-emit fails with "kernel name should be forward declarable at namespace scope". Forward-declaring both kernel-name classes at ::compat::detail in the umbrella header resolves the ETS to a redeclaration of an already-visible class, which the validator accepts. See attn_pch.hpp for the workaround. Effect on cmake-driven builds: ~20-30% wall-time reduction on the FA2 kernel TUs, and the dtype-split's "more TUs" wall-time penalty disappears entirely. Per-TU peak RSS drops a further 1-2 GB (frontend parse cost is no longer paid per-TU). Downstream non-cmake consumers (Nix per-TU dynamic-derivations builds) extract the PCH compile command from compile_commands.json and realise it as its own derivation; per-TU commands then -include-pch the shared PCH artifact. See attn_pch.hpp for the icpx PCH compatibility contract.
Contributor
Author
|
currently running into some issues, so putting into draft for now. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The
chunk_prefill/paged_decodekernel templates share almost their entire transitive include set:cute, the CUTLASS-SYCL collective stack, the sycl runtime,flash_attention_v2fusion / scheduler / epilogue, theFMHAKernel/PagedDecodekernel templates. That parse cost is identical across every generated TU and accounts for a non-trivial fraction of per-TU compile wall time on icpx (template-name-lookup over cute / cutlass is template-pathological at this header depth).Change
Wire up
target_precompile_headers(attn_kernels_xe_2 PRIVATE attn_pch.hpp)with an umbrella header that#includes both kernel template surfaces. cmake handles the-Xclang -include-pch -Xclang <build>/cmake_pch.hxx.pchplumbing automatically for icpx (clang-style PCH), so the per-TU compile commands load the pre-parsed AST snapshot in milliseconds instead of re-parsing.Effect
~20–30% wall-time reduction on the FA2 kernel TUs, and the dtype-split PR (#349)'s "more TUs" wall-time penalty disappears entirely. Per-TU peak RSS drops a further 1–2 GB (frontend parse cost is no longer paid per-TU).
For downstream consumers
Non-cmake consumers (Nix per-TU dynamic-derivations builds) extract the PCH compile command from
compile_commands.jsonand realise it as its own derivation; per-TU commands then-include-pchthe shared PCH artifact. Seeattn_pch.hppfor the icpx PCH compatibility contract.Draft — minimal cmake change, but I'd appreciate confirmation that upstream CI doesn't have a
-DCMAKE_DISABLE_PRECOMPILE_HEADERS=ONpath that would silently regress.