[SYCL] Improve get_kernel_info performance#21866
[SYCL] Improve get_kernel_info performance#21866cperkinsintel wants to merge 12 commits intointel:syclfrom
Conversation
…r type kernels. Not yet working with free-function kernels. Signed-off-by: Chris Perkins <chris.perkins@intel.com>
| @@ -0,0 +1,66 @@ | |||
| //==-- kernel_cache.hpp - Fast kernel cache access for get_kernel_info -----==// | |||
There was a problem hiding this comment.
The header file seems misleading, this file just accesses the cache to retrieve the kernel.
| sycl::detail::getDeviceKernelInfo<KernelName>(); | ||
|
|
||
| // Dispatch to appropriate fast cache function based on query type | ||
| if constexpr (std::is_same_v< |
There was a problem hiding this comment.
This should reuse the code from source/detail/kernel_info.hpp instead. AFAIU, all we actually need is an exported function that retrieves the kernel from the program manager with getOrCreateKernel.
There was a problem hiding this comment.
done.
I also didn't want these if constexpr to be duplicated below, but the fallback behavior is different. I could make a helper and template with a callable but that seems really overengineered and hard to read for very little gain.
There was a problem hiding this comment.
Why do we need anything except queryCachedKernelInfo from kernel_info_queries? Can we just fetch the kernel (with a single new exported function) then call get_kernel_device_specific_info<...> directly instead for all of these cases?
Also, what are the queries that we use the fallback path for and why?
| return Info; | ||
| } | ||
|
|
||
| // Overload for free function kernels |
There was a problem hiding this comment.
Can we make use of this new overload for free function kernel submission as well (as a follow-up change)?
There was a problem hiding this comment.
I will look into it. The free functions submissions ARE O(1) (so long as they are explicitly submitted), but I'll see if they can use this path to make the code more DRY.
There was a problem hiding this comment.
Pull request overview
This PR optimizes ext::oneapi::get_kernel_info device-specific queries by reusing the runtime’s fast kernel cache (via ProgramManager::getOrCreateKernel) instead of constructing a kernel_bundle for common query types, and extends device-kernel-info lookup to support free-function kernels by name.
Changes:
- Add exported runtime helpers (
getKernel*) that query kernel device-specific info using cached kernels. - Add an exported
getDeviceKernelInfo(std::string_view)overload to support free-function kernels. - Update
ext::oneapi::get_kernel_infoto use the fast-path for common device-specific queries, with fallback to the previous bundle-based implementation.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| sycl/source/detail/kernel_info_queries.cpp | Implements cached-kernel device-specific query helpers. |
| sycl/include/sycl/detail/kernel_info_queries.hpp | Declares exported cached-kernel query helper APIs. |
| sycl/include/sycl/ext/oneapi/get_kernel_info.hpp | Switches device-specific queries to fast cached-kernel path for selected params; keeps fallback. |
| sycl/include/sycl/detail/get_device_kernel_info.hpp | Adds getDeviceKernelInfo(std::string_view) and free-function kernel template helper. |
| sycl/source/detail/get_device_kernel_info.cpp | Implements the new getDeviceKernelInfo(std::string_view) overload. |
| sycl/source/CMakeLists.txt | Adds the new kernel_info_queries.cpp compilation unit. |
| sycl/test/abi/sycl_symbols_linux.dump | Updates ABI symbol expectations for new exported helpers/overloads. |
| sycl/test/abi/sycl_symbols_windows.dump | Updates ABI symbol expectations for new exported helpers/overloads. |
| #include <sycl/detail/kernel_desc.hpp> | ||
|
|
||
| #include <detail/global_handler.hpp> | ||
| #include <detail/program_manager/program_manager.hpp> | ||
|
|
||
| #include <string_view> | ||
|
|
||
| namespace sycl { | ||
| inline namespace _V1 { | ||
| namespace detail { | ||
|
|
||
| DeviceKernelInfo &getDeviceKernelInfo(const CompileTimeKernelInfoTy &Info) { | ||
| return ProgramManager::getInstance().getDeviceKernelInfo(Info); | ||
| } | ||
|
|
||
| DeviceKernelInfo &getDeviceKernelInfo(std::string_view KernelName) { | ||
| return ProgramManager::getInstance().getDeviceKernelInfo(KernelName); | ||
| } |
There was a problem hiding this comment.
getDeviceKernelInfo(...) definitions in this TU are not preceded by the __SYCL_EXPORT declarations from sycl/detail/get_device_kernel_info.hpp. When building the SYCL DLL on Windows (__SYCL_BUILD_SYCL_DLL), this can result in the new getDeviceKernelInfo(std::string_view) (and potentially the existing overload) not being exported, causing link failures for users including the header. Include sycl/detail/get_device_kernel_info.hpp here (or annotate the definitions with __SYCL_EXPORT) so the dllexport attribute is applied in this translation unit.
There was a problem hiding this comment.
Copilot is (mostly) mistaken here. __SYCL_EXPORT IS used on those declarations in the .hpp. We don't use them in the .cpp files. But get_device_kernel_info.cpp is not including its own matching .hpp directly, it's getting it indirectly, which is maybe why Copilot is confused. I'm adding a direct include.
| // Dispatch to appropriate fast cache function based on query type | ||
| if constexpr (std::is_same_v< | ||
| Param, | ||
| sycl::info::kernel_device_specific::work_group_size>) { | ||
| return sycl::detail::getKernelWorkGroupSize(CtxImpl, DevImpl, DKI); | ||
| } else if constexpr (std::is_same_v<Param, | ||
| sycl::info::kernel_device_specific:: | ||
| compile_work_group_size>) { | ||
| return sycl::detail::getKernelCompileWorkGroupSize(CtxImpl, DevImpl, DKI); | ||
| } else if constexpr (std::is_same_v<Param, | ||
| sycl::info::kernel_device_specific:: | ||
| preferred_work_group_size_multiple>) { | ||
| return sycl::detail::getKernelPreferredWorkGroupSizeMultiple(CtxImpl, | ||
| DevImpl, DKI); | ||
| } else if constexpr (std::is_same_v<Param, | ||
| sycl::info::kernel_device_specific:: | ||
| private_mem_size>) { | ||
| return sycl::detail::getKernelPrivateMemSize(CtxImpl, DevImpl, DKI); | ||
| } else if constexpr (std::is_same_v<Param, | ||
| sycl::info::kernel_device_specific:: | ||
| max_sub_group_size>) { | ||
| return sycl::detail::getKernelMaxSubGroupSize(CtxImpl, DevImpl, DKI); | ||
| } else if constexpr (std::is_same_v<Param, | ||
| sycl::info::kernel_device_specific:: | ||
| compile_num_sub_groups>) { | ||
| return sycl::detail::getKernelCompileNumSubGroups(CtxImpl, DevImpl, DKI); | ||
| } else if constexpr (std::is_same_v<Param, | ||
| sycl::info::kernel_device_specific:: | ||
| compile_sub_group_size>) { | ||
| return sycl::detail::getKernelCompileSubGroupSize(CtxImpl, DevImpl, DKI); | ||
| } else { | ||
| // Fallback for any other queries | ||
| auto Bundle = | ||
| sycl::get_kernel_bundle<KernelName, sycl::bundle_state::executable>( | ||
| Ctx, {Dev}); | ||
| return Bundle.template get_kernel<KernelName>().template get_info<Param>( | ||
| Dev); | ||
| } |
There was a problem hiding this comment.
The device-specific fast-path dispatch is duplicated for named kernels and free-function kernels. This increases the chance the two overloads diverge when new kernel_device_specific queries are added or adjusted. Consider factoring the if constexpr chain into a single internal helper (e.g., a templated dispatch_device_specific_query<Param>(CtxImpl, DevImpl, DKI)), and reuse it from both overloads.
O(1) optimization fix for
get_kernel_info, working for both lambda-functor and free function kernels.