run to run scan warpspeed impl sm100+ by srinivasyadav18 · Pull Request #9263 · NVIDIA/cccl

srinivasyadav18 · 2026-06-04T19:09:35Z

Description

closes #7556

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2026-06-04T19:09:38Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-04T19:17:36Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ea6c063b-5047-4b8b-afa0-fc5e45b22a54

📥 Commits

Reviewing files that changed from the base of the PR and between cbd13bb and 5550396.

📒 Files selected for processing (1)

cub/cub/device/dispatch/kernels/kernel_scan.cuh

🚧 Files skipped from review as they are similar to previous changes (1)

cub/cub/device/dispatch/kernels/kernel_scan.cuh

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Overview

This PR implements run-to-run support for the warpspeed scan optimization on SM100+ targets, enabling deterministic DeviceScan execution. The changes introduce a stable reduction order variant of the warpspeed lookahead logic and thread this stability setting through the scan dispatch pipeline.

Changes

Warpspeed Lookahead Stable Variant

Added warpIncrementalLookaheadStable() function template to cub/cub/detail/warpspeed/look_ahead.cuh that provides deterministic lookahead reduction by:

- Anchoring reduction progress to 32-tile boundaries
- Fixing reduction order by only reducing when an expected contiguous count of tile aggregates is available
- Updating previous-state variables (idxTilePrev, aggrExclusiveCtaPrev) via reference in-place
- Returning the computed exclusive aggregate for the last processed range
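The "reduce only when the expected contiguous count is available" rule in the list above can be sketched on the host as follows. This is a minimal illustration, not the code from look_ahead.cuh: `expected_count` and `window_complete` are hypothetical helper names, and the 32-bit ready mask stands in for whatever per-lane state the kernel actually inspects.

```cpp
#include <cstdint>

// Hypothetical helper: how many tile aggregates a 32-aligned window that starts
// at windowStart must deliver before it may be reduced. Only the final window,
// which ends at idxTileEnd, can be shorter than 32.
// Precondition: windowStart <= idxTileEnd.
constexpr int expected_count(int windowStart, int idxTileEnd)
{
  const int remaining = idxTileEnd - windowStart;
  return remaining < 32 ? remaining : 32;
}

// Hypothetical helper: true only if the lowest `expected` bits of readyMask are
// all set, i.e. every tile of the window has published its aggregate. Reducing
// only in that case means the same operands are always combined together.
// Precondition: 0 <= expected <= 32.
constexpr bool window_complete(std::uint32_t readyMask, int expected)
{
  const std::uint32_t need = expected >= 32 ? 0xFFFFFFFFu : ((1u << expected) - 1u);
  return (readyMask & need) == need;
}
```

Under these assumptions a window whose tiles 0 to 5 are required but whose tile 4 is still missing is simply retried later, rather than being reduced partially.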

Warpspeed Scan Pipeline Integration

Extended the warpspeed scan implementation in cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh with a new StableReductionOrder compile-time template parameter (default: false) that:

- Selects between stable (warpIncrementalLookaheadStable) and non-stable (warpIncrementalLookahead) reduction paths
- Modifies how previous-state updates and exclusive aggregates are computed based on the stability requirement
- Propagates the setting from device kernel dispatch through the scan closure implementation
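The compile-time selection described above can be illustrated with a small sketch. The names `lookahead_path` and `select_lookahead` are hypothetical stand-ins; the real helper calls the two lookahead functions instead of returning a tag.

```cpp
// Hypothetical tag that records which lookahead variant would be called.
enum class lookahead_path
{
  stable,
  non_stable
};

// A bool template parameter that defaults to false, mirroring the
// StableReductionOrder parameter described in the summary. `if constexpr`
// discards the branch that is not selected, so only one path is instantiated.
template <bool StableReductionOrder = false>
constexpr lookahead_path select_lookahead()
{
  if constexpr (StableReductionOrder)
  {
    return lookahead_path::stable; // stands in for warpIncrementalLookaheadStable
  }
  else
  {
    return lookahead_path::non_stable; // stands in for warpIncrementalLookahead
  }
}
```

Because the flag is a template parameter, the non-stable instantiation should carry no runtime cost from the stable path.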

Kernel Dispatch Update

Updated DeviceScanKernel in cub/cub/device/dispatch/kernels/kernel_scan.cuh to pass the StableReductionOrder template parameter to device_scan_warpspeed_body, ensuring the stability requirement flows to the warpspeed execution path.

Policy Selection for Stable Reduction

Modified cub/cub/device/dispatch/tuning/tuning_scan.cuh to allow warpspeed scan selection when stable reduction order is required, but only for compute capability >= 10.0 (SM100+). Previously, warpspeed was skipped entirely for stable reduction requirements.
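The gate described above can be modeled roughly as follows. This sketch covers only the stable-reduction condition; `compute_capability`, `at_least`, and `stable_gate_allows_warpspeed` are hypothetical names, and the real selector in tuning_scan.cuh applies further conditions before it produces a warpspeed policy.

```cpp
// Hypothetical stand-in for a compute capability pair such as {10, 0}.
struct compute_capability
{
  int major_ver;
  int minor_ver;
};

// Ordering on (major, minor) pairs.
constexpr bool at_least(compute_capability cc, compute_capability minimum)
{
  return cc.major_ver > minimum.major_ver
      || (cc.major_ver == minimum.major_ver && cc.minor_ver >= minimum.minor_ver);
}

// Models only the gate from the summary: when a stable reduction order is
// required, warpspeed stays eligible only for compute capability >= 10.0.
// When it is not required, this particular gate does not restrict anything.
constexpr bool stable_gate_allows_warpspeed(bool require_stable_reduction_order,
                                            compute_capability cc)
{
  return !require_stable_reduction_order || at_least(cc, compute_capability{10, 0});
}
```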

Related Issue

Closes #7556: Productize run-to-run DeviceScan

Walkthrough

Adds a deterministic warpIncrementalLookaheadStable lookahead to warpspeed scan, threads a compile-time StableReductionOrder flag into the warpspeed closure and dispatch, and updates policy gating to allow warpspeed on sm_100+ when stable reduction order is required.

Changes

Stable Warpspeed Scan Implementation

| Layer / File(s) | Summary |
| --- | --- |
| Stable lookahead function `cub/cub/detail/warpspeed/look_ahead.cuh` | New `warpIncrementalLookaheadStable` deterministically anchors reduction progress to 32-tile boundaries, enforces fixed reduction order via expected tile count, updates `idxTilePrev` and `aggrExclusiveCtaPrev` by reference, and returns the exclusive aggregate. |
| Warpspeed kernel stable reduction routing `cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh`, `cub/cub/device/dispatch/kernels/kernel_scan.cuh` | `warpspeed_scan_closure` and `device_scan_warpspeed_body` gain `StableReductionOrder` template parameter; `lookahead` helper conditionally calls `warpIncrementalLookaheadStable` (stable path) or `warpIncrementalLookahead` (non-stable), and updates previous-state variables in the path-specific location. |
| Dispatch threading and policy gating `cub/cub/device/dispatch/tuning/tuning_scan.cuh` | `DeviceScanKernel` now forwards `StableReductionOrder` to warpspeed dispatch; policy selector allows warpspeed for stable reduction when compute capability >= sm_100 instead of blocking it outright. |

Assessment against linked issues

| Objective | Addressed | Explanation |
| --- | --- | --- |
| Enable DeviceScan stable reduction path with warpspeed for run-to-run determinism [`#7556`] | ❓ | PR adds warpspeed-stable plumbing and policy gating, but does not show DeviceScan API overloads or env-based entry points required to expose run-to-run option at the public API layer. |

Possibly related PRs

NVIDIA/cccl#9169: Refactors warpspeed lookahead infrastructure; this PR adds the stable variant atop that refactor.
NVIDIA/cccl#9098: Propagates StableReductionOrder through DeviceScan dispatch; related to deterministic reduction-order plumbing.

Suggested reviewers

fbusato
bernhardmgruber
miscco

important: Confirm that warpspeed stable lookahead updates idxTilePrev and aggrExclusiveCtaPrev correctly across all lane widths and boundary conditions; the anchor-to-32-multiple alignment must not introduce off-by-one errors when tile indices are not 32-aligned.

important: Verify the if constexpr (StableReductionOrder) routing preserves memory-ordering and concurrent-access guarantees: stable path updates previous-state before shared-memory writeback while non-stable updates after, which changes leader/writeback timing.

suggestion: Consider adding a static_assert or comment that documents StableReductionOrder == true validity only for sm_100+ to catch accidental template misuse at compile-time.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cub/cub/device/dispatch/tuning/tuning_scan.cuh (1)

1038-1047: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

suggestion: Update the inline rationale for the require_stable_reduction_order → cc >= {10, 0} gate: warpIncrementalLookaheadStable is available for __cccl_ptx_isa >= 860 (sm_90+), but the scan policy selector only produces a scan_warpspeed_policy when cc >= {10, 0} (otherwise get_warpspeed_policy returns {}), so stable warpspeed on sm_90+ is blocked by warpspeed policy/tuning availability—not by stable lookahead codegen availability.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: dfcdb20c-106f-4ae5-a688-9e19e5475411

📥 Commits

Reviewing files that changed from the base of the PR and between 316f9cc and cbd13bb.

📒 Files selected for processing (4)

cub/cub/detail/warpspeed/look_ahead.cuh
cub/cub/device/dispatch/kernels/kernel_scan.cuh
cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh
cub/cub/device/dispatch/tuning/tuning_scan.cuh

srinivasyadav18 · 2026-06-05T14:41:35Z

/ok to test cbd13bb

github-actions · 2026-06-05T17:10:30Z

🥳 CI Workflow Results

🟩 Finished in 2h 26m: Pass: 100%/284 | Total: 11d 15h | Max: 2h 26m | Hits: 18%/1000913

See results here.

srinivasyadav18 · 2026-06-05T19:56:29Z

pre-commit.ci autofix

bernhardmgruber · 2026-06-05T20:12:36Z

+  const ::cuda::std::uint32_t lanemaskEq = ::cuda::ptx::get_sreg_lanemask_eq();
+
+  // Adjust the left pointer down to the nearest 32-multiple so we do batched sums
+  int idxTileCur             = (idxTilePrev / 32) * 32;


Suggestion: Use cuda::round_down.
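For reference, the arithmetic in the hunk above rounds a tile index down to the nearest multiple of 32. The sketch below spells that out with a hypothetical host-side helper; it is a stand-in for illustration and not the `cuda::round_down` function the suggestion refers to.

```cpp
// Hypothetical helper: rounds `value` down to the nearest multiple of
// `multiple` using truncating integer division.
// Preconditions: value >= 0 and multiple > 0.
constexpr int round_down_to_multiple(int value, int multiple)
{
  return (value / multiple) * multiple;
}
```

With this helper, tile index 70 maps to window start 64, and an index that is already aligned, such as 64, is left unchanged. A named function states the intent directly, which is presumably the point of the suggestion.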

Jacobfaib · 2026-06-05T20:20:45Z

+  AccumT aggrExclusiveCtaCur = aggrExclusiveCtaPrev;
+
+  using warp_reduce_t = WarpReduce<AccumT>;
+  static_assert(sizeof(typename warp_reduce_t::TempStorage) <= 4,


Why 4? I assume this is sizeof(uint32_t)? If so, best to say sizeof(uint32_t) instead (or better yet, refer to an actual type/value so that when that size is changed, the check automatically is as well).

Because the TempStorage is a struct with further nested types that have no value, but because there are data members it has a size of 1. For some reason @elstehle chose 4 here, but the check is basically that no temporary storage is required. Btw, is_empty also does not work here.

Could you please put this as a comment then in the src? 4 is quite a magic value to capture this, I would have expected 1 or something like that then

Because the TempStorage is a struct with further nested types that have no value, but because there are data members it has a size of 1.

Not sure about that. I think it is just inheriting from cub::Uninitialized<cub::NullType>.

Therefore the check that I came up with is

static_assert(::cuda::std::is_base_of_v<cub::Uninitialized<cub::NullType>, TempStorage>, "Code assumes empty TempStorage");

Pretty verbose/not super readable, but at least no magic number and a bit clearer in its motivation once one gets to the bottom of it? And no chance for this one to not trigger if we would start requiring temporary storage.

I would strongly suggest an inline variable of the form:

template <class>
inline constexpr bool __requires_temp_storage = true;
template <>
inline constexpr bool __requires_temp_storage<cub::Uninitialized<cub::NullType>> = false;

ok, here is my attempt: #9294
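The variable-template idea from this thread can be tried out in isolation. The sketch below uses hypothetical stand-in types (`null_type`, `uninitialized`, `derived_storage`) in place of the cub ones and a non-reserved name for the trait, so it says nothing about what #9294 finally does.

```cpp
// Hypothetical stand-ins for cub::NullType and cub::Uninitialized<T>.
struct null_type
{};

template <class T>
struct uninitialized
{};

// Primary template: assume temporary storage is required.
template <class>
inline constexpr bool requires_temp_storage = true;

// Explicit specialization: the "no storage" marker type opts out.
template <>
inline constexpr bool requires_temp_storage<uninitialized<null_type>> = false;

// A type that merely derives from the marker type does not match the explicit
// specialization, so the primary template (true) is used for it.
struct derived_storage : uninitialized<null_type>
{};
```

One consequence worth noting: an explicit specialization matches only the exact type. If `TempStorage` inherits from the marker type, as suggested earlier in the thread, the trait would have to be queried on the base class or combined with an `is_base_of` check.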

Jacobfaib · 2026-06-05T20:22:00Z

+  [[maybe_unused]] typename warp_reduce_t::TempStorage temp_storage;
+
+  using warp_reduce_or_t = WarpReduce<::cuda::std::uint32_t>;
+  typename warp_reduce_or_t::TempStorage temp_storage_or;


Nit: typename is not needed here I think. WarpReduce<uint32_t> is not dependent on any of your template params.
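The distinction behind this nit can be shown with a small sketch. `warp_reduce_like` and `storage_types` are hypothetical stand-ins for `WarpReduce` and the surrounding function template.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a class template with a nested TempStorage type.
template <class T>
struct warp_reduce_like
{
  struct TempStorage
  {};
};

template <class AccumT>
struct storage_types
{
  // Dependent name: which TempStorage is meant depends on AccumT, so C++17
  // requires `typename` here.
  using dependent_t = typename warp_reduce_like<AccumT>::TempStorage;

  // Non-dependent name: the argument is a concrete type, so the compiler can
  // look the member up immediately and `typename` can be omitted.
  using fixed_t = warp_reduce_like<std::uint32_t>::TempStorage;
};
```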

pauleonix · 2026-06-07T01:58:25Z

+    {
+      // Bitmask with a 1 bit in the position of the current lane if current lane has a tile aggregate
+      const ::cuda::std::uint32_t lane_has_aggregate =
+        lanemaskEq * (regTmpStates[idx].state == scan_state::tile_aggregate);


Have you benchmarked this multiplication to be an improvement over predication? Otherwise I would stay with

Suggested change

-        lanemaskEq * (regTmpStates[idx].state == scan_state::tile_aggregate);
+        (regTmpStates[idx].state == scan_state::tile_aggregate) ? lanemaskEq : 0u;

My (possibly wrong) intuition is that the multiplication will result in either the same output or still generate a predicated move in addition to the multiplication since it needs to transform a predicate register into an integer.
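Both forms compute the same value, which a host-side sketch can confirm. The function names are hypothetical, and the sketch says nothing about which form yields better machine code on the GPU; that would need a benchmark or a look at the generated SASS.

```cpp
#include <cstdint>

// Multiplication form: the bool is promoted to 0 or 1 before multiplying.
constexpr std::uint32_t mask_by_multiply(std::uint32_t lanemaskEq, bool has_aggregate)
{
  return lanemaskEq * has_aggregate;
}

// Select form from the suggested change.
constexpr std::uint32_t mask_by_select(std::uint32_t lanemaskEq, bool has_aggregate)
{
  return has_aggregate ? lanemaskEq : 0u;
}
```

For lane 5, whose `lanemaskEq` would be `1u << 5`, both functions return 32 when the lane has an aggregate and 0 otherwise.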

pauleonix · 2026-06-07T02:03:08Z

+        lanemaskEq * (regTmpStates[idx].state == scan_state::tile_aggregate);
+
+      // Bitmask with 1 bits indicating which lane has a tile aggregate
+      const ::cuda::std::uint32_t warp_has_aggregate_mask = warp_reduce_or.Reduce(lane_has_aggregate, or_op);


An even easier (and faster?) way of getting this mask would be a call to __ballot_sync(). That would also completely avoid the issue above.
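To see why a ballot yields the same mask as OR-reducing per-lane bits, the two can be emulated on the host. This is a sequential sketch with hypothetical function names; on the device the ballot would be a single `__ballot_sync(0xFFFFFFFFu, predicate)` call across the warp.

```cpp
#include <cstdint>

// Emulates a full-warp ballot: bit `lane` of the result is set exactly when
// that lane's predicate is true.
constexpr std::uint32_t emulate_ballot(const bool (&pred)[32])
{
  std::uint32_t mask = 0;
  for (int lane = 0; lane < 32; ++lane)
  {
    mask |= (pred[lane] ? 1u : 0u) << lane;
  }
  return mask;
}

// Emulates the approach in the diff: every lane contributes its own lane bit
// (or zero), and the contributions are combined with bitwise OR.
constexpr std::uint32_t emulate_or_reduce(const bool (&pred)[32])
{
  std::uint32_t mask = 0;
  for (int lane = 0; lane < 32; ++lane)
  {
    const std::uint32_t lanemaskEq = 1u << lane;
    mask |= pred[lane] ? lanemaskEq : 0u;
  }
  return mask;
}

// Sample warp in which lanes 0, 1 and 3 hold a tile aggregate.
constexpr std::uint32_t sample_ballot()
{
  bool pred[32] = {true, true, false, true}; // remaining lanes are false
  return emulate_ballot(pred);
}

constexpr std::uint32_t sample_or_reduce()
{
  bool pred[32] = {true, true, false, true}; // remaining lanes are false
  return emulate_or_reduce(pred);
}
```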

pauleonix · 2026-06-07T02:14:14Z

+      // Bitmask with 1 bits indicating which lane has a tile aggregate
+      const ::cuda::std::uint32_t warp_has_aggregate_mask = warp_reduce_or.Reduce(lane_has_aggregate, or_op);
+
+      // Bitmask with 1 bits for all rightmost lanes having a tile aggregate


Suggested change

-      // Bitmask with 1 bits for all rightmost lanes having a tile aggregate
+      // Bitmask with 1 bits for the contiguous run of lanes having a tile aggregate starting from LSB
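The reworded comment describes a mask of the contiguous run of set bits that starts at the least significant bit. One common bit trick for extracting such a run is sketched below; the PR's actual computation is not shown in this thread, so this is an illustration of the idea and `contiguous_low_run` is a hypothetical name.

```cpp
#include <cstdint>

// Keeps only the run of consecutive 1 bits that starts at bit 0.
// Adding 1 clears that run and sets the first 0 bit above it; complementing the
// sum therefore has 1s exactly in the run (and in other positions that the
// final AND with `mask` filters out).
constexpr std::uint32_t contiguous_low_run(std::uint32_t mask)
{
  return mask & ~(mask + 1u);
}
```

For example, `0b1011` yields `0b0011` because lane 2 interrupts the run, and `0b0110` yields 0 because lane 0 has no aggregate.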

pauleonix · 2026-06-07T02:17:14Z

+      }
+
+      const bool use_value    = lanemaskEq & warp_right_aggregates_mask;
+      const AccumT value      = use_value ? regTmpStates[idx].value : cuda::identity_element<ScanOpT, AccumT>();


In case there is no identity element, you could use the valid_items overload of Reduce(). Or is the assumption that it always exists because this path is only ever dispatched with primitive FP types (and maybe complex ones)?

Yes, I think the deterministic path is only taken for FP32 and FP64.
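The role of the identity element can be illustrated with a host-side sketch for floating-point addition, whose identity is 0. Lanes outside the mask contribute the identity and therefore do not change the result. The function names are hypothetical, and the loop adds sequentially, whereas a real warp reduction combines values in a different (tree-like) order.

```cpp
#include <cstdint>

// Sums only the lanes selected by use_mask; every other lane contributes the
// additive identity 0.0f instead of its value.
constexpr float masked_sum(const float (&values)[32], std::uint32_t use_mask)
{
  float sum = 0.0f;
  for (int lane = 0; lane < 32; ++lane)
  {
    const bool use_value = ((use_mask >> lane) & 1u) != 0u;
    sum += use_value ? values[lane] : 0.0f;
  }
  return sum;
}

// Sample warp: lanes 0..3 hold 1, 2, 3, 4; only lanes 0 and 1 are selected.
constexpr float sample_masked_sum()
{
  float values[32] = {1.0f, 2.0f, 3.0f, 4.0f}; // remaining lanes are 0.0f
  return masked_sum(values, 0b0011u);
}
```

For an operator without an identity element this substitution is not available, which is the case the `valid_items` question above is about.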

run to run warpspeed impl sm100+

cbd13bb

srinivasyadav18 requested a review from a team as a code owner June 4, 2026 19:09

srinivasyadav18 requested a review from pauleonix June 4, 2026 19:09

github-project-automation Bot added this to CCCL Jun 4, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026

coderabbitai Bot reviewed Jun 4, 2026

[pre-commit.ci] auto code formatting

5550396

srinivasyadav18 changed the title ~~run to run warpspeed impl sm100+~~ run to run scan warpspeed impl sm100+ Jun 5, 2026

bernhardmgruber reviewed Jun 5, 2026

Jacobfaib reviewed Jun 5, 2026

pauleonix reviewed Jun 7, 2026
