speed update tie correct by Intron7 · Pull Request #701 · scverse/rapids-singlecell

Intron7 · 2026-06-28T13:39:41Z

No description provided.

Signed-off-by: Intron7 <severin.dicks@icloud.com>

Intron7 · 2026-06-28T13:39:55Z

coderabbitai · 2026-06-28T13:40:01Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2026-06-28T13:47:41Z

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added support for a new sparse ranking mode that handles zero values more efficiently in large datasets.
- Expanded GPU processing so more group configurations can be handled directly during ranking.
Bug Fixes
- Improved ranking behavior for sparse inputs by better accounting for zero counts and positive values.
- Updated tier selection logic to make processing more consistent across dense and sparse paths.
Changes
- Added a new option exposed through ranking workflows to enable the updated sparse-zero handling automatically.

Walkthrough

Adds an analytic_zeros execution mode to all three OVO Wilcoxon rank tiers (MEDIUM, LARGE, HUGE). New CUDA kernels operate only on stored positive values while incorporating zero counts analytically. The flag propagates through tier planning, dispatch, dense/device-sparse/host-sparse streaming pipelines, C++ Python bindings, and the Python caller, which activates it when sparse negative fallback is disabled.

OVO Wilcoxon analytic-zeros mode

Layer / File(s)	Summary
New analytic-zero CUDA kernels (MEDIUM/LARGE/HUGE) `src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon_ovo.cuh`	Adds `ovo_rank_medium_analytic_kernel`, `ovo_rank_smem_analytic_kernel`, `compact_huge_nonzeros_kernel`, and `ovo_rank_huge_analytic_kernel`; removes old `ref_tie_sum_kernel`/`ovo_rank_medium_kernel`; extends `ovo_rank_sorted_kernel` with `skip_n_grp_gt` upper-bound pruning.
Tier planning and `ovo_dispatch_tiers` control flow `src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_kernels.cuh`	`OvoTierPlan` gains `huge_skip_le`; `make_ovo_tier_plan` sets tier flags from actual group sizes; `launch_ovo_medium` and `ovo_dispatch_tiers` gain `analytic_zeros` and select analytic vs non-analytic kernels per tier; HUGE non-analytic path uses `plan.huge_skip_le` with `INT_MAX`.
Dense streaming pipeline wiring `src/rapids_singlecell/_cuda/wilcoxon/wilcoxon.cu`	`launch_ovo_rank_dense_tiered_unsorted_ref` and `launch_ovo_rank_dense_host_streaming` switch huge-mode selection to plan-derived `run_huge`/`huge_skip_le` and pass `analytic_zeros=false` to `ovo_dispatch_tiers`.
Device-sparse (CSR/CSC) pipeline wiring `src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_device_sparse.cuh`	Both CSR and CSC device-sparse impls switch `run_huge` and sort-skip to plan-derived values and pass `analytic_zeros=false` to `ovo_dispatch_tiers`.
Host-sparse analytic_zeros propagation and scratch budgeting `src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_host_sparse.cuh`	`ovo_streaming_csc_host_impl` and `ovo_streaming_csr_host_impl` gain `analytic_zeros`; `StreamBuf` gains `grp_nz`/`d_grp_nz` scratch pointers allocated conditionally; per-stream memory budgets account for extra group slabs; tier dispatch calls include `analytic_zeros`.
Python bindings and caller activation `src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_sparse.cu`, `src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py`	`ovo_streaming_csc_host` and `ovo_streaming_csr_host` bindings expose `analytic_zeros` (default `false`); `_run_ovo_host_sparse` passes `analytic_zeros=not rg._sparse_negative_fallback` to both CSC and CSR calls.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

scverse/rapids-singlecell#636: Directly precedes this PR's work — introduces OVO tiered Wilcoxon ranking with medium/huge tier logic in the same kernels_wilcoxon_ovo.cuh, wilcoxon.cu, and sparse dispatch files that this PR extends with analytic-zero paths.

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title is related to the change, but it is too vague to describe the main update clearly.	Use a specific title like 'Add analytic-zero OVO tie-correction speedups for sparse Wilcoxon ranks'.
Description check	❓ Inconclusive	No description was provided, so there is no meaningful summary to evaluate.	Add a short description of the main behavior changes, especially the new analytic-zero and tiered ranking updates.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch speed-up-tie-correct

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (2)

src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon_ovo.cuh (1)

248-252: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use the named warp-reduction buffer size.

The new analytic kernels hard-code warp_buf[32]; use WARP_REDUCE_BUF to keep shared-memory sizing aligned with the reduction helper.

As per coding guidelines, src/rapids_singlecell/_cuda/**/*.{cu,cuh}: “All numeric literals for block sizes, tile dimensions, shared memory sizes, and heuristic thresholds MUST be defined as named constants”.
Proposed fix
-    __shared__ double warp_buf[32];
+    __shared__ double warp_buf[WARP_REDUCE_BUF];
-    __shared__ double warp_buf[32];
+    __shared__ double warp_buf[WARP_REDUCE_BUF];
Also applies to: 375-376
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon_ovo.cuh` around lines
248 - 252, The shared-memory warp reduction buffer is hard-coded in the analytic
Wilcoxon kernels, which violates the named-constant guideline and can drift from
the helper’s expected size. Update the declarations in the Wilcoxon OVO kernel
code to use WARP_REDUCE_BUF instead of the literal 32, and make the same change
in the other affected spot, keeping the shared-memory layout aligned with the
reduction helper and consistent with the kernel symbols in
kernels_wilcoxon_ovo.cuh.
Source: Coding guidelines

src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_kernels.cuh (1)

78-97: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Size LARGE shared memory from the actual largest LARGE group.

large_max = min(max_grp_size, OVO_LARGE_MAX) overestimates when a HUGE group exists alongside smaller LARGE groups, which can reduce occupancy or force an unnecessary HUGE fallback.

Proposed refactor

     OvoTierPlan c;
+    int large_max = 0;
     for (int g = 0; g < n_groups; g++) {
         int sz = h_grp_offsets[g + 1] - h_grp_offsets[g];
         if (sz > c.max_grp_size) c.max_grp_size = sz;
         if (sz <= OVO_MEDIUM_MAX)
             c.run_medium = true;
-        else if (sz <= OVO_LARGE_MAX)
+        else if (sz <= OVO_LARGE_MAX) {
             c.run_large = true;
-        else
+            large_max = std::max(large_max, sz);
+        } else {
             c.run_huge = true;
+        }
     }
@@
     // Size smem for the largest LARGE-band group.
     if (c.run_large) {
-        int large_max = std::min(c.max_grp_size, OVO_LARGE_MAX);
         c.large_padded = 1;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_kernels.cuh` around lines
78 - 97, The LARGE shared-memory sizing in the kernel config logic is using the
global max group size instead of the largest group that actually falls in the
LARGE range, which can over-allocate when a HUGE group is present. Update the
sizing in the configuration block around c.run_large/c.large_smem to track the
largest sz that is <= OVO_LARGE_MAX while iterating groups, and use that value
when computing large_max in this setup path.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_device_sparse.cuh`:
- Around line 24-29: Gate the huge-mode scratch allocation in the sparse
Wilcoxon OVO path so it only happens when tie correction is actually enabled;
currently `run_huge` and the huge-tier sort buffers are created in
`wilcoxon_ovo_device_sparse` even when `compute_tie_corr` will take the no-tie
fast path. Update the allocation/dispatch logic around `make_sort_group_ids` and
the later huge-buffer setup so `compute_tie_corr=false` skips all huge-mode
sort-ID/scratch setup, while preserving the existing huge path when tie
correction is on.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_kernels.cuh`:
- Around line 222-235: Validate the analytic-HUGE launch inputs before calling
compact_huge_nonzeros_kernel and cub_segmented_sortkeys in
wilcoxon_ovo_kernels.cuh: guard grp_nz against null before passing it to the
writing kernel, and check that sb_cols * n_all_grp fits the segment offset type
used by grp_seg_offsets and grp_seg_ends. Add these pre-launch validations
alongside the existing analytic_zeros branch so the kernel only runs with valid
scratch buffers and safe index ranges.
- Around line 3-4: The Wilcoxon kernel header uses std::min but only includes
<climits>, so add the missing direct <algorithm> include in
wilcoxon_ovo_kernels.cuh near the existing includes. Keep the change localized
to the header and ensure the std::min usage in the Wilcoxon kernel code remains
valid without relying on transitive includes.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon.cu`:
- Around line 369-375: The huge-tier scratch buffers are being enabled in the
Wilcoxon tier setup even when compute_tie_corr is false, but ovo_dispatch_tiers
exits early on that path, so avoid allocating the huge-mode data unless the
tie-correction path will actually run. Update the run_huge handling in
wilcoxon.cu around the tier setup and the later buffer preparation to gate
make_sort_group_ids and the related huge scratch allocation behind
compute_tie_corr, using the existing symbols run_huge, compute_tie_corr,
t1.run_huge, and ovo_dispatch_tiers to keep the fast path from triggering
unnecessary OOMs.

---

Nitpick comments:
In `@src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon_ovo.cuh`:
- Around line 248-252: The shared-memory warp reduction buffer is hard-coded in
the analytic Wilcoxon kernels, which violates the named-constant guideline and
can drift from the helper’s expected size. Update the declarations in the
Wilcoxon OVO kernel code to use WARP_REDUCE_BUF instead of the literal 32, and
make the same change in the other affected spot, keeping the shared-memory
layout aligned with the reduction helper and consistent with the kernel symbols
in kernels_wilcoxon_ovo.cuh.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_kernels.cuh`:
- Around line 78-97: The LARGE shared-memory sizing in the kernel config logic
is using the global max group size instead of the largest group that actually
falls in the LARGE range, which can over-allocate when a HUGE group is present.
Update the sizing in the configuration block around c.run_large/c.large_smem to
track the largest sz that is <= OVO_LARGE_MAX while iterating groups, and use
that value when computing large_max in this setup path.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b2c19453-e26f-4e4e-8a32-32957bea7b1c

📥 Commits

Reviewing files that changed from the base of the PR and between 8840275 and 3893261.

📒 Files selected for processing (7)

src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon_ovo.cuh
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon.cu
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_device_sparse.cuh
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_host_sparse.cuh
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_kernels.cuh
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_sparse.cu
src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py

Signed-off-by: Intron7 <severin.dicks@icloud.com>

codecov-commenter · 2026-06-28T14:13:07Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.31%. Comparing base (8840275) to head (8b87c82).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #701   +/-   ##
=======================================
  Coverage   90.31%   90.31%           
=======================================
  Files         104      104           
  Lines        9570     9570           
=======================================
  Hits         8643     8643           
  Misses        927      927

Files with missing lines	Coverage Δ
...s_singlecell/tools/_rank_genes_groups/_wilcoxon.py	`92.91% <ø> (ø)`

speed update tie correct

3893261

Signed-off-by: Intron7 <severin.dicks@icloud.com>

Intron7 added the run-gpu-ci label Jun 28, 2026

github-actions Bot removed the run-gpu-ci label Jun 28, 2026

coderabbitai Bot reviewed Jun 28, 2026

View reviewed changes

address coderabbit

8b87c82

Signed-off-by: Intron7 <severin.dicks@icloud.com>

Intron7 added the run-gpu-ci label Jun 28, 2026

github-actions Bot removed the run-gpu-ci label Jun 28, 2026

Intron7 merged commit 87470ae into main Jun 29, 2026
24 of 27 checks passed

Intron7 deleted the speed-up-tie-correct branch June 29, 2026 11:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

speed update tie correct#701

speed update tie correct#701
Intron7 merged 2 commits into
mainfrom
speed-up-tie-correct

Intron7 commented Jun 28, 2026

Uh oh!

Intron7 commented Jun 28, 2026

Uh oh!

coderabbitai Bot commented Jun 28, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 28, 2026

Summary by CodeRabbit

Walkthrough

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning, 2 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Jun 28, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Intron7 commented Jun 28, 2026

Uh oh!

Intron7 commented Jun 28, 2026

Uh oh!

coderabbitai Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 28, 2026

Summary by CodeRabbit

Walkthrough

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning, 2 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 28, 2026 •

edited

Loading

codecov-commenter commented Jun 28, 2026 •

edited

Loading