Add support for batched tasks. #668
Merged
Task Batching {#task_batching}
==============

Task batching lets a device submit hook combine several compatible ready tasks
into one device operation. The runtime still owns dependency management, data
movement, completion, and release; the submit hook only decides which pending
tasks are compatible with the task it was asked to submit.

Batching is opt-in at the chore level. A task that does not advertise batching
is always delivered to the submit hook as a singleton task.

Enabling batching
-----------------

For PTG-generated tasks, use the `batch = true` body property on a device body:

```c
BODY [type=CUDA
      batch = true
      dyld=cublasDgemm dyldtype=cublas_dgemm_t]
{
    /* GPU submit body. */
}
```

For DTD tasks, add `PARSEC_DEV_CHORE_ALLOW_BATCH` to the device type when
registering or selecting the chore:

```c
parsec_dtd_task_class_add_chore(tp, tc,
                                PARSEC_DEV_CUDA | PARSEC_DEV_CHORE_ALLOW_BATCH,
                                kernel_cuda);
```

The selected device type must also support batching at runtime. The device layer
uses `parsec_mca_device_type_supports_batch()` to check this and
`parsec_mca_device_type_sanitize_batch()` to drop the batching hint when the
selected device cannot batch. The MCA parameter `device_enable_batching`
defaults to the compile-time batching capability and can be used to disable
batching globally at runtime; it is read-only when batching support is not
compiled in.

Recommended collection helper
-----------------------------

The preferred interface for GPU submit hooks is
`parsec_gpu_task_collect_batch()`. The runtime passes the submit hook a
singleton `parsec_gpu_task_t *gpu_task`. The hook calls the collector with a
callback that decides, for each task currently pending on the same stream,
whether that candidate can be added to the batch headed by `gpu_task`.

The callback has the type `parsec_gpu_task_batch_cb_t` and receives:

- `candidate`: a pending task from `gpu_stream->fifo_pending`;
- `batch_head`: the task originally passed to the submit hook;
- `callback_data`: user data passed through by the caller.

The callback return value controls the iterator:

- negative: stop immediately and return that error code;
- zero: remove `candidate` from the pending FIFO and append it to
  `batch_head`'s task ring;
- positive: leave `candidate` pending and continue to the next pending task.

The callback must not modify `gpu_stream->fifo_pending` directly.

Example:

```c
static int
gemm_batch_match(parsec_gpu_task_t *candidate,
                 parsec_gpu_task_t *batch_head,
                 void *callback_data)
{
    (void)callback_data;

    if( (batch_head->ec->task_class == candidate->ec->task_class) &&
        (batch_head->ec->selected_chore == candidate->ec->selected_chore) &&
        (batch_head->ec->selected_device == candidate->ec->selected_device) ) {
        return 0;
    }
    return 1;
}

int
gemm_kernel_cuda(parsec_device_gpu_module_t *gpu_device,
                 parsec_gpu_task_t *gpu_task,
                 parsec_gpu_exec_stream_t *gpu_stream)
{
    int batch_count;
    parsec_gpu_task_t *current;

    (void)gpu_device;

    batch_count = parsec_gpu_task_collect_batch(gpu_stream, gpu_task,
                                                gemm_batch_match, NULL);
    if( batch_count < 0 ) {
        return batch_count;
    }

    current = gpu_task;
    do {
        parsec_task_t *task = current->ec;

        /* Submit one device operation for task, or use the whole ring to
         * issue a real batched operation. */

        current = (parsec_gpu_task_t *)current->list_item.list_next;
    } while( current != gpu_task );

    return PARSEC_HOOK_RETURN_DONE;
}
```

`parsec_gpu_task_collect_batch()` returns the number of tasks in the ring on
success, including the original `gpu_task`, or the negative error code returned
by the callback. Tasks accepted before an error remain attached to `gpu_task`;
tasks not accepted remain in `gpu_stream->fifo_pending`.

The submit hook does not need a completion callback merely to return the ring to
the runtime. If a batched submit hook returns a non-singleton task ring, the GPU
progress engine automatically chains that ring into the next stream's pending
FIFO after the recorded device event completes. The normal data retrieval,
epilog, ownership, pushout, and task completion paths then process the tasks one
at a time.

Iterating over the returned ring
--------------------------------

A batched submit hook should treat `gpu_task` as the head of a circular task
ring. This works for both singleton and batched cases:

```c
parsec_gpu_task_t *current = gpu_task;

do {
    parsec_task_t *task = current->ec;

    /* Use task. */

    current = (parsec_gpu_task_t *)current->list_item.list_next;
} while( current != gpu_task );
```

Original direct collection style
--------------------------------

The helper above is intentionally conservative: it keeps FIFO ownership inside
the device layer and exposes only a compatibility callback to the submit hook.
Under very high load, the per-candidate callback dispatch can become visible in
profiles. A specialized submit hook can still use the original direct style and
manipulate the pending FIFO and task ring itself.

This style is more fragile and should be reserved for code that is already
device-runtime aware. The hook must preserve FIFO correctness, keep rejected
tasks pending, and unlock the FIFO on every exit path.

```c
parsec_list_t *pending = gpu_stream->fifo_pending;
parsec_list_item_t *item;
parsec_list_item_t *next;
int batch_count = 1;

PARSEC_LIST_ITEM_SINGLETON(&gpu_task->list_item);

parsec_list_lock(pending);
for(item = (parsec_list_item_t *)pending->ghost_element.list_next;
    item != &pending->ghost_element;
    item = next) {
    parsec_gpu_task_t *candidate;

    next = (parsec_list_item_t *)item->list_next;
    candidate = (parsec_gpu_task_t *)item;

    if( compatible_with_batch(candidate, gpu_task) ) {
        (void)parsec_list_nolock_remove(pending, item);
        (void)parsec_list_item_ring_push(&gpu_task->list_item, item);
        batch_count++;
    }
}
parsec_list_unlock(pending);
```

The direct style avoids the generic iterator and callback dispatch, and it can
fold the compatibility test into a tight kernel-specific loop. The cost is that
the submit hook now depends on internal list and stream details and must be
updated if the GPU stream internals change.
It's not clear how PTG tasks can actually batch kernel invocations. Simply stringing kernels together on the same stream won't save much.
The same way as DTD tasks, and the same way we did two years ago for the GB submission.
GB submissions are not part of the docs.
This PR has been lingering here for a very long time. Let's get it in.