Add sycl_khr_free_function_commands extension#922
Add sycl_khr_free_function_commands extension#922slawekptak wants to merge 77 commits intoKhronosGroup:mainfrom
Conversation
This extension provides an alternative mechanism for submitting commands to a device via free-functions that require developers to opt-in to the creation of event objects. It also proposes alternative names for several commands (e.g., launch) and simplifies some concepts (e.g., by removing the need for the nd_range class).
Previous "0 or more" wording only made sense when reductions could be optionally provided to functions like parallel_for; now that there are dedicated *_reduce functions, at least one reduction is required.
"is" is more consistent with ISO C++ wording.
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
There is no need to constrain T here because T must be device-copyable in order to construct the accessor passed as an argument.
Renaming sycl::nd_item is not a necessary part of the API redesign for submitting work, so it should be moved to its own extension. This will also give us more time to consider the design and naming of any proposed replacement(s), including how they should interact with new functionality proposed in other KHRs.
There are currently no backends that define interop for reductions, so we can remove these functions for now. If we decide later that these functions are necessary, we can release a revision of the KHR.
Co-authored-by: Andrey Alekseenko <al42and@gmail.com>
The function names for memory operations now follow the "enqueue_*" pattern, to indicate that these operations are added to the queue and not executed immediately.
- Changed the return type of the functions to void (signal_event should be used to track completion). - Added the signal_event, wait_event and wait_events structs to be used with the requirements object. - Added the following functions: make_event, enqueue_signal_event, enqueue_wait_event, enqueue_wait_events, enqueue_barrier. - Removed the following functions: command_barrier, event_barrier. - Updated the code example.
…gnal_event function.
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
|
Thanks a bunch for the efforts to push this to finish line. My recollection, prefetch_host is sparingly used ATM. And these few apps (qmcpack, exachem, etc) use it at production-scale. My two-cents is towards free function standardization of this API just to reduce the cycles over non-free |
There was a problem hiding this comment.
Very nice! This extension will solve many pain point that we have with our current SYCL usage, it looks great.
I just had kind of neatpick. In the examples i did not find guidelines related to the use of local memory or anything that require the sycl::handler since this extension hides it away. Is the use of free function for commands limited to kernels that do not employ local memory ? If yes are there plans to lift such limit ?
Also i don't recall seeing any named kernel in the proposed extension. Can we name them for profiling (#1002 maybe ?) ?
Yes. the working group has discussed this. We plan to add a KHR similar to this oneAPI extension: And I think we might consider a KHR similar to this as well: |
|
Thanks for the clarification. I don't really like the approach as one has to rebuild aliases every time to use it. It looks overly complex for what it aims to achieve and also probably for people coming from other portability solutions / CUDA & HIP, ... What is the motivation behind that design rather than the second one ? This one on the other hand would be wonderful to have. And if we want to use it as scratch we can always declare a char[] local as one would do in e.g. CUDA. Also i would favor it since it is the closest to the other GPU backends. |
Why do you think it's more complex or harder to use than the CUDA API where you declare an unbounded array? We actually designed this API as a cleaner version of the CUDA one. One thing we don't like about the CUDA interface is that multiple unbounded array declarations alias each other, which seems very non-obvious. For example: This looks like two different arrays of shared memory (work-group local memory). However, the two arrays We think the proposed SYCL API makes this less surprising because there's just a single global function that returns a pointer to the start of the dynamically-sized local memory.
What do you mean by "rebuild aliases"? Do you mean that you need to cast the pointer to a particular type? Would it help if we made the function templated like this: Then, you would use it like: |
|
Oh i see now why. It's hard to always remember what is what between CUDA and SYCL. My confusion was related to "Static Shared Memory" in CUDA. which does resemble a lot more the second extension and is the most used one in CUDA. Whereas the "Dynamic Shared Memory" variant corresponding work_group_scratch_memory is quite rare in the type of apps i'm working with.
Yup, this syntax is much cleaner i think, with the static cast it would have made a very long line. int* pscratch = sycl::khr::get_work_group_scratch_memory<int>();which is already 65 chars so it allows for 3 level of indentation (4 space) before passing the 80 columns mark. We are still forced to do some weird business though if we have multiple types similarely to cuda (there is probably some memory alignment to account for but not yet 😅 ) extern __shared__ int s[];
int *integerData = s; // nI ints
float *floatData = (float*)&integerData[nI]; // nF floats
char *charData = (char*)&floatData[nF]; // nC charsI don't know if there is a clean way to handle this ... maybe this ? But a clean wrapper that does it automatically would be wonderfull. int *integerData = sycl::khr::get_work_group_scratch_memory<int>(); // nI ints
float *floatData = reinterpret_cast<float>(integerData + nI); // nF floats
char *charData = reinterpret_cast<char>(floatData + nF); // nC charsThat probably goes beyond the scope of the discussion of this PR though 😅 PS: why name it scratch instead of local in the name since it refers to local memory ? |
I think you'd need to do something similar to what you show. I'm not sure about providing a "clean wrapper". I think a wrapper like this would be very complicated, and it would be easier to just do the pointer math like you show.
I think we were trying to use the term "work-group memory" instead of "local memory" because that seemed clearer. Therefore, we were trying to avoid using the word "local" in the API name. |
I get this one, but since all of the doc and SYCL tutorials outline it as local memory (= work group local memory) in the memory hierarchy of the model wouldn't it be too confusing to introduce a new name. Also is shorter 😛 |
This is a new, follow-up PR to #644, originally created by John Pennycook. All the future work related to that PR will be continued here. The reason for creating a new PR is that the PR ownership transfer is required.
This extension provides an alternative mechanism for submitting commands to a device via free-functions that require developers to opt-in to the creation of event objects.
It also proposes alternative names for several commands (e.g., launch) and simplifies some concepts (e.g., by removing the need for the nd_range class).