run to run scan warpspeed impl sm100+#9263
Conversation
|
Ready to act? Review this PR in Change Stack to turn feedback into patch suggestions you can inspect and refine. No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
OverviewThis PR implements run-to-run support for the warpspeed scan optimization on SM100+ targets, enabling deterministic DeviceScan execution. The changes introduce a stable reduction order variant of the warpspeed lookahead logic and thread this stability setting through the scan dispatch pipeline. ChangesWarpspeed Lookahead Stable VariantAdded
Warpspeed Scan Pipeline IntegrationExtended the warpspeed scan implementation in
Kernel Dispatch UpdateUpdated Policy Selection for Stable ReductionModified Related IssueCloses WalkthroughAdds a deterministic ChangesStable Warpspeed Scan Implementation
Assessment against linked issues
Possibly related PRs
Suggested reviewers
important: Confirm that warpspeed stable lookahead updates important: Verify the suggestion: Consider adding a static_assert or comment that documents Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cub/cub/device/dispatch/tuning/tuning_scan.cuh (1)
1038-1047:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winsuggestion: Update the inline rationale for the
require_stable_reduction_order→cc >= {10, 0}gate:warpIncrementalLookaheadStableis available for__cccl_ptx_isa >= 860(sm_90+), but the scan policy selector only produces ascan_warpspeed_policywhencc >= {10, 0}(otherwiseget_warpspeed_policyreturns{}), so stable warpspeed on sm_90+ is blocked by warpspeed policy/tuning availability—not by stable lookahead codegen availability.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: dfcdb20c-106f-4ae5-a688-9e19e5475411
📒 Files selected for processing (4)
cub/cub/detail/warpspeed/look_ahead.cuhcub/cub/device/dispatch/kernels/kernel_scan.cuhcub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuhcub/cub/device/dispatch/tuning/tuning_scan.cuh
|
/ok to test cbd13bb |
🥳 CI Workflow Results🟩 Finished in 2h 26m: Pass: 100%/284 | Total: 11d 15h | Max: 2h 26m | Hits: 18%/1000913See results here. |
|
pre-commit.ci autofix |
| const ::cuda::std::uint32_t lanemaskEq = ::cuda::ptx::get_sreg_lanemask_eq(); | ||
|
|
||
| // Adjust the left pointer down to the nearest 32-multiple so we do batched sums | ||
| int idxTileCur = (idxTilePrev / 32) * 32; |
There was a problem hiding this comment.
Suggestion: Use cuda::round_down.
| AccumT aggrExclusiveCtaCur = aggrExclusiveCtaPrev; | ||
|
|
||
| using warp_reduce_t = WarpReduce<AccumT>; | ||
| static_assert(sizeof(typename warp_reduce_t::TempStorage) <= 4, |
There was a problem hiding this comment.
Why 4? I assume this is sizeof(uint32_t)? If so, best to say sizeof(uint32_t) instead (or better yet, refer to an actual type/value so that when that size is changed, the check automatically is as well).
There was a problem hiding this comment.
Because the TempStorage is a struct with further nested types that have no value, but because there are data members it has a size of 1. For some reason @elstehle chose 4 here, but the check is basically that no temporary storage is required. Btw, is_empty also does not work here.
There was a problem hiding this comment.
Could you please put this as a comment then in the src? 4 is quite a magic value to capture this, I would have expected 1 or something like that then
There was a problem hiding this comment.
Because the TempStorage is a struct with further nested types that have no value, but because there are data members it has a size of 1.
Not sure about that. I think it is just inheriting from cub::Uninitialized<cub::NullType>.
Therefore the check that I came up with is
static_assert(::cuda::std::is_base_of_v<cub::Uninitialized<cub::NullType>, TempStorage>, "Code assumes empty TempStorage");Pretty verbose/not super readable, but at least no magic number and a bit clearer in its motivation once one gets to the bottom of it? And no chance for this one to not trigger if we would start requiring temporary storage.
There was a problem hiding this comment.
I would strongly suggest an inline variable of the form:
template<class>
inline constexpr bool __requires_temp_storage = true;
template<>
inline constexpr bool __requires_temp_storage<cub::Uninitialized<cub::NullType>> = false;| [[maybe_unused]] typename warp_reduce_t::TempStorage temp_storage; | ||
|
|
||
| using warp_reduce_or_t = WarpReduce<::cuda::std::uint32_t>; | ||
| typename warp_reduce_or_t::TempStorage temp_storage_or; |
There was a problem hiding this comment.
Nit: typename is not needed here I think. WarpReduce<uint32_t> is not dependent on any of your template params.
| { | ||
| // Bitmask with a 1 bit in the position of the current lane if current lane has a tile aggregate | ||
| const ::cuda::std::uint32_t lane_has_aggregate = | ||
| lanemaskEq * (regTmpStates[idx].state == scan_state::tile_aggregate); |
There was a problem hiding this comment.
Have you benchmarked this multiplication to be an improvement over predication? Otherwise I would stay with
| lanemaskEq * (regTmpStates[idx].state == scan_state::tile_aggregate); | |
| (regTmpStates[idx].state == scan_state::tile_aggregate) ? lanemaskEq : 0u; |
My (possibly wrong) intuition is that the multiplication will result in either the same output or still generate a predicated move in addition to the multiplication since it needs to transform a predicate register into an integer.
| lanemaskEq * (regTmpStates[idx].state == scan_state::tile_aggregate); | ||
|
|
||
| // Bitmask with 1 bits indicating which lane has a tile aggregate | ||
| const ::cuda::std::uint32_t warp_has_aggregate_mask = warp_reduce_or.Reduce(lane_has_aggregate, or_op); |
There was a problem hiding this comment.
An even easier (and faster?) way of getting this mask would be a call to __ballot_sync(). That would also completely avoid the issue above.
| // Bitmask with 1 bits indicating which lane has a tile aggregate | ||
| const ::cuda::std::uint32_t warp_has_aggregate_mask = warp_reduce_or.Reduce(lane_has_aggregate, or_op); | ||
|
|
||
| // Bitmask with 1 bits for all rightmost lanes having a tile aggregate |
There was a problem hiding this comment.
| // Bitmask with 1 bits for all rightmost lanes having a tile aggregate | |
| // Bitmask with 1 bits for the contiguous run of lanes having a tile aggregate starting from LSB |
| } | ||
|
|
||
| const bool use_value = lanemaskEq & warp_right_aggregates_mask; | ||
| const AccumT value = use_value ? regTmpStates[idx].value : cuda::identity_element<ScanOpT, AccumT>(); |
There was a problem hiding this comment.
In case there is no identity element, you could use the valid_items overload of Reduce(). Or is the assumption that it always exists because this path is only ever dispatched with primitive FP types (and maybe complex ones)?
There was a problem hiding this comment.
Yes, I think the deterministic path is only taken for FP32 and FP64.
Description
closes #7556
Checklist