Skip to content

fa2 xe2: precompile shared header umbrella for attn_kernels_xe_2#350

Closed
jasonboukheir wants to merge 1 commit into
vllm-project:mainfrom
jasonboukheir:fa2-pch
Closed

fa2 xe2: precompile shared header umbrella for attn_kernels_xe_2#350
jasonboukheir wants to merge 1 commit into
vllm-project:mainfrom
jasonboukheir:fa2-pch

Conversation

@jasonboukheir
Copy link
Copy Markdown
Contributor

@jasonboukheir jasonboukheir commented May 13, 2026

The chunk_prefill / paged_decode kernel templates share almost their entire transitive include set: cute, the CUTLASS-SYCL collective stack, the sycl runtime, flash_attention_v2 fusion / scheduler / epilogue, the FMHAKernel / PagedDecode kernel templates. That parse cost is identical across every generated TU and accounts for a non-trivial fraction of per-TU compile wall time on icpx (template-name-lookup over cute / cutlass is template-pathological at this header depth).

Change

Wire up target_precompile_headers(attn_kernels_xe_2 PRIVATE attn_pch.hpp) with an umbrella header that #includes both kernel template surfaces. cmake handles the -Xclang -include-pch -Xclang <build>/cmake_pch.hxx.pch plumbing automatically for icpx (clang-style PCH), so the per-TU compile commands load the pre-parsed AST snapshot in milliseconds instead of re-parsing.

Effect

~20–30% wall-time reduction on the FA2 kernel TUs, and the dtype-split PR (#349)'s "more TUs" wall-time penalty disappears entirely. Per-TU peak RSS drops a further 1–2 GB (frontend parse cost is no longer paid per-TU).

For downstream consumers

Non-cmake consumers (Nix per-TU dynamic-derivations builds) extract the PCH compile command from compile_commands.json and realise it as its own derivation; per-TU commands then -include-pch the shared PCH artifact. See attn_pch.hpp for the icpx PCH compatibility contract.

Draft — minimal cmake change, but I'd appreciate confirmation that upstream CI doesn't have a -DCMAKE_DISABLE_PRECOMPILE_HEADERS=ON path that would silently regress.

The chunk_prefill / paged_decode kernel templates share almost their
entire transitive include set: cute, the CUTLASS-SYCL collective stack,
the sycl runtime, flash_attention_v2 fusion / scheduler / epilogue, the
FMHAKernel / PagedDecode kernel templates. That parse cost is identical
across every generated TU and accounts for a non-trivial fraction of
per-TU compile wall time on icpx (template-name-lookup over cute /
cutlass is template-pathological at this header depth).

Wire up target_precompile_headers(attn_kernels_xe_2 PRIVATE
attn_pch.hpp) with an umbrella header that #includes both kernel
template surfaces. cmake handles the
-Xclang -include-pch -Xclang <build>/cmake_pch.hxx.pch plumbing
automatically for icpx (clang-style PCH), so the per-TU compile commands
load the pre-parsed AST snapshot in milliseconds instead of re-parsing.

icpx 2025.3 SYCL+PCH quirk: cmake injects -fpch-instantiate-templates
into the PCH compile, which eagerly instantiates every template the
PCH-emit pass encounters. cute/util/compat/memory.hpp uses
sycl::handler::parallel_for<class T> elaborated-type-specifiers to
forward-declare its kernel-name tags inline; under
-fpch-instantiate-templates icpx runs the SYCL kernel-name validator
against the ETS tag before its implicit namespace-scope declaration has
propagated, and the PCH-emit fails with "kernel name should be forward
declarable at namespace scope". Forward-declaring both kernel-name
classes at ::compat::detail in the umbrella header resolves the ETS to
a redeclaration of an already-visible class, which the validator
accepts. See attn_pch.hpp for the workaround.

Effect on cmake-driven builds: ~20-30% wall-time reduction on the FA2
kernel TUs, and the dtype-split's "more TUs" wall-time penalty
disappears entirely. Per-TU peak RSS drops a further 1-2 GB (frontend
parse cost is no longer paid per-TU).

Downstream non-cmake consumers (Nix per-TU dynamic-derivations builds)
extract the PCH compile command from compile_commands.json and realise
it as its own derivation; per-TU commands then -include-pch the shared
PCH artifact. See attn_pch.hpp for the icpx PCH compatibility contract.
@jasonboukheir jasonboukheir marked this pull request as draft May 14, 2026 05:26
@jasonboukheir
Copy link
Copy Markdown
Contributor Author

currently running into some issues, so putting into draft for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant