Skip to content

Replace detail::merge_sort::dispatch by CUB's public API#8473

Merged
bernhardmgruber merged 7 commits intoNVIDIA:mainfrom
bernhardmgruber:use_merge_sort_public
Apr 20, 2026
Merged

Replace detail::merge_sort::dispatch by CUB's public API#8473
bernhardmgruber merged 7 commits intoNVIDIA:mainfrom
bernhardmgruber:use_merge_sort_public

Conversation

@bernhardmgruber
Copy link
Copy Markdown
Contributor

@bernhardmgruber bernhardmgruber commented Apr 15, 2026

Fixes: #8376

@bernhardmgruber bernhardmgruber requested review from a team as code owners April 15, 2026 22:16
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Apr 15, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Apr 15, 2026
}
};

struct block_size_compare_t
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This operator assumes that only one CTA is going to be launched. Perhaps an assertion in the if branch should be added to make that assumption explicit

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I added atomicMax to reduce the blockDim in case there would be multiple blocks.

@github-actions

This comment has been minimized.

@bernhardmgruber bernhardmgruber force-pushed the use_merge_sort_public branch 3 times, most recently from 6447c37 to 4eb7bcc Compare April 16, 2026 10:08
Comment on lines +44 to 54
unsigned int* ptr;

__device__ int operator()(int a, int b)
{
if (threadIdx.x == 0)
{
*ptr = blockDim.x;
// use an atomic operation to write the block dim in case multiple blocks are launched
atomicMax(ptr, blockDim.x);
}
return a + b;
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remark: This is a drive-by fix for consistency.

Comment on lines +515 to +519
if constexpr (cuda::std::execution::__queryable_with<env_t, get_expected_allocation_size_t>)
{
const size_t expected_bytes_allocated = fixed_env.query(get_expected_allocation_size_t{});
REQUIRE(expected_bytes_allocated == bytes_allocated);
}
Copy link
Copy Markdown
Contributor Author

@bernhardmgruber bernhardmgruber Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By making the expected_allocation_size check in the launch wrapper optional, we can also use the launch wrappers to test whether e.g. tunings were applied. At least for TEST_LAUNCH != 1.

@gonidelis I think you ran into the same issue, which is why many tests are guarded by #if TEST_LAUNCH == 0. Just FYI.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@bernhardmgruber bernhardmgruber enabled auto-merge (squash) April 20, 2026 11:38
@github-actions
Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 37m: Pass: 100%/323 | Total: 4d 22h | Max: 1h 09m | Hits: 92%/297570

See results here.

@bernhardmgruber bernhardmgruber merged commit 68c6260 into NVIDIA:main Apr 20, 2026
343 checks passed
@github-project-automation github-project-automation Bot moved this from In Review to Done in CCCL Apr 20, 2026
@bernhardmgruber bernhardmgruber deleted the use_merge_sort_public branch April 20, 2026 13:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Use the new tuning API internally for detail::merge_sort::dispatch

3 participants