Skip to content

Add sycl_khr_free_function_commands extension#922

Open
slawekptak wants to merge 77 commits intoKhronosGroup:mainfrom
slawekptak:khr_free_function_commands_new
Open

Add sycl_khr_free_function_commands extension#922
slawekptak wants to merge 77 commits intoKhronosGroup:mainfrom
slawekptak:khr_free_function_commands_new

Conversation

@slawekptak
Copy link
Copy Markdown

This is a new, follow-up PR to #644, originally created by John Pennycook. All the future work related to that PR will be continued here. The reason for creating a new PR is that the PR ownership transfer is required.

This extension provides an alternative mechanism for submitting commands to a device via free-functions that require developers to opt-in to the creation of event objects.

It also proposes alternative names for several commands (e.g., launch) and simplifies some concepts (e.g., by removing the need for the nd_range class).

Pennycook and others added 30 commits October 17, 2024 15:00
This extension provides an alternative mechanism for submitting commands to a
device via free-functions that require developers to opt-in to the creation of
event objects.

It also proposes alternative names for several commands (e.g., launch) and
simplifies some concepts (e.g., by removing the need for the nd_range class).
Previous "0 or more" wording only made sense when reductions could be
optionally provided to functions like parallel_for; now that there are
dedicated *_reduce functions, at least one reduction is required.
"is" is more consistent with ISO C++ wording.
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
There is no need to constrain T here because T must be device-copyable in order
to construct the accessor passed as an argument.
Renaming sycl::nd_item is not a necessary part of the API redesign for
submitting work, so it should be moved to its own extension.

This will also give us more time to consider the design and naming of any
proposed replacement(s), including how they should interact with new
functionality proposed in other KHRs.
There are currently no backends that define interop for reductions,
so we can remove these functions for now. If we decide later that
these functions are necessary, we can release a revision of the KHR.
Co-authored-by: Andrey Alekseenko <al42and@gmail.com>
The function names for memory operations now follow
the "enqueue_*" pattern, to indicate that these operations
are added to the queue and not executed immediately.
- Changed the return type of the functions to void (signal_event
should be used to track completion).
- Added the signal_event, wait_event and wait_events structs
to be used with the requirements object.
- Added the following functions: make_event, enqueue_signal_event,
enqueue_wait_event, enqueue_wait_events, enqueue_barrier.
- Removed the following functions: command_barrier, event_barrier.
- Updated the code example.
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
slawekptak and others added 6 commits February 5, 2026 08:57
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Copy link
Copy Markdown
Member

@keryell keryell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@abagusetty
Copy link
Copy Markdown

Thanks a bunch for the efforts to push this to finish line.

My recollection, prefetch_host is sparingly used ATM. And these few apps (qmcpack, exachem, etc) use it at production-scale.

My two-cents is towards free function standardization of this API just to reduce the cycles over non-free

Copy link
Copy Markdown

@tdavidcl tdavidcl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! This extension will solve many pain point that we have with our current SYCL usage, it looks great.

I just had kind of neatpick. In the examples i did not find guidelines related to the use of local memory or anything that require the sycl::handler since this extension hides it away. Is the use of free function for commands limited to kernels that do not employ local memory ? If yes are there plans to lift such limit ?

Also i don't recall seeing any named kernel in the proposed extension. Can we name them for profiling (#1002 maybe ?) ?

@gmlueck
Copy link
Copy Markdown
Contributor

gmlueck commented Apr 10, 2026

In the examples i did not find guidelines related to the use of local memory or anything that require the sycl::handler since this extension hides it away. Is the use of free function for commands limited to kernels that do not employ local memory ? If yes are there plans to lift such limit ?

Yes. the working group has discussed this. We plan to add a KHR similar to this oneAPI extension:

https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_work_group_scratch_memory.asciidoc

And I think we might consider a KHR similar to this as well:

https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_work_group_static.asciidoc

@tdavidcl
Copy link
Copy Markdown

tdavidcl commented Apr 10, 2026

Thanks for the clarification.

https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_work_group_scratch_memory.asciidoc

I don't really like the approach as one has to rebuild aliases every time to use it. It looks overly complex for what it aims to achieve and also probably for people coming from other portability solutions / CUDA & HIP, ... What is the motivation behind that design rather than the second one ?

https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_work_group_static.asciidoc

This one on the other hand would be wonderful to have. And if we want to use it as scratch we can always declare a char[] local as one would do in e.g. CUDA. Also i would favor it since it is the closest to the other GPU backends.

@gmlueck
Copy link
Copy Markdown
Contributor

gmlueck commented Apr 10, 2026

I don't really like the approach as one has to rebuild aliases every time to use it. It looks overly complex for what it aims to achieve and also probably for people coming from other portability solutions / CUDA & HIP, ... What is the motivation behind that design rather than the second one ?

Why do you think it's more complex or harder to use than the CUDA API where you declare an unbounded array? We actually designed this API as a cleaner version of the CUDA one. One thing we don't like about the CUDA interface is that multiple unbounded array declarations alias each other, which seems very non-obvious. For example:

extern __shared__ float data1[];
extern __shared__ char data2[];

This looks like two different arrays of shared memory (work-group local memory). However, the two arrays data1 and data2 actually alias each other and point to exactly the same memory.

We think the proposed SYCL API makes this less surprising because there's just a single global function that returns a pointer to the start of the dynamically-sized local memory.

I don't really like the approach as one has to rebuild aliases every time to use it.

What do you mean by "rebuild aliases"? Do you mean that you need to cast the pointer to a particular type? Would it help if we made the function templated like this:

template<typename T>
T* get_work_group_scratch_memory();

Then, you would use it like:

auto pscratch = sycl::khr::get_work_group_scratch_memory<int>();

@tdavidcl
Copy link
Copy Markdown

Oh i see now why.

It's hard to always remember what is what between CUDA and SYCL.

My confusion was related to "Static Shared Memory" in CUDA. which does resemble a lot more the second extension and is the most used one in CUDA. Whereas the "Dynamic Shared Memory" variant corresponding work_group_scratch_memory is quite rare in the type of apps i'm working with.

Do you mean that you need to cast the pointer to a particular type? Would it help if we made the function templated like this

Yup, this syntax is much cleaner i think, with the static cast it would have made a very long line.

int* pscratch = sycl::khr::get_work_group_scratch_memory<int>();

which is already 65 chars so it allows for 3 level of indentation (4 space) before passing the 80 columns mark.

We are still forced to do some weird business though if we have multiple types similarely to cuda (there is probably some memory alignment to account for but not yet 😅 )

extern __shared__ int s[];
int *integerData = s;                        // nI ints
float *floatData = (float*)&integerData[nI]; // nF floats
char *charData = (char*)&floatData[nF];      // nC chars

I don't know if there is a clean way to handle this ... maybe this ? But a clean wrapper that does it automatically would be wonderfull.

int *integerData = sycl::khr::get_work_group_scratch_memory<int>(); // nI ints
float *floatData = reinterpret_cast<float>(integerData + nI);   // nF floats
char *charData   = reinterpret_cast<char>(floatData + nF);      // nC chars

That probably goes beyond the scope of the discussion of this PR though 😅

PS: why name it scratch instead of local in the name since it refers to local memory ?

@gmlueck
Copy link
Copy Markdown
Contributor

gmlueck commented Apr 10, 2026

I don't know if there is a clean way to handle this ... maybe this ? But a clean wrapper that does it automatically would be wonderfull.

I think you'd need to do something similar to what you show. I'm not sure about providing a "clean wrapper". I think a wrapper like this would be very complicated, and it would be easier to just do the pointer math like you show.

PS: why name it scratch instead of local in the name since it refers to local memory ?

I think we were trying to use the term "work-group memory" instead of "local memory" because that seemed clearer. Therefore, we were trying to avoid using the word "local" in the API name.

@tdavidcl
Copy link
Copy Markdown

I think we were trying to use the term "work-group memory" instead of "local memory" because that seemed clearer. Therefore, we were trying to avoid using the word "local" in the API name.

I get this one, but since all of the doc and SYCL tutorials outline it as local memory (= work group local memory) in the memory hierarchy of the model wouldn't it be too confusing to introduce a new name. Also

int* pscratch = sycl::khr::get_local_scratch_memory<int>();
// or
int* pscratch = sycl::khr::get_local_memory_pointer<int>();

is shorter 😛

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants