Skip to content

speed update tie correct#701

Merged
Intron7 merged 2 commits into
mainfrom
speed-up-tie-correct
Jun 29, 2026
Merged

speed update tie correct#701
Intron7 merged 2 commits into
mainfrom
speed-up-tie-correct

Conversation

@Intron7

@Intron7 Intron7 commented Jun 28, 2026

Copy link
Copy Markdown
Member

No description provided.

Signed-off-by: Intron7 <severin.dicks@icloud.com>
@Intron7

Intron7 commented Jun 28, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added support for a new sparse ranking mode that handles zero values more efficiently in large datasets.
    • Expanded GPU processing so more group configurations can be handled directly during ranking.
  • Bug Fixes

    • Improved ranking behavior for sparse inputs by better accounting for zero counts and positive values.
    • Updated tier selection logic to make processing more consistent across dense and sparse paths.
  • Changes

    • Added a new option exposed through ranking workflows to enable the updated sparse-zero handling automatically.

Walkthrough

Adds an analytic_zeros execution mode to all three OVO Wilcoxon rank tiers (MEDIUM, LARGE, HUGE). New CUDA kernels operate only on stored positive values while incorporating zero counts analytically. The flag propagates through tier planning, dispatch, dense/device-sparse/host-sparse streaming pipelines, C++ Python bindings, and the Python caller, which activates it when sparse negative fallback is disabled.

OVO Wilcoxon analytic-zeros mode

Layer / File(s) Summary
New analytic-zero CUDA kernels (MEDIUM/LARGE/HUGE)
src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon_ovo.cuh
Adds ovo_rank_medium_analytic_kernel, ovo_rank_smem_analytic_kernel, compact_huge_nonzeros_kernel, and ovo_rank_huge_analytic_kernel; removes old ref_tie_sum_kernel/ovo_rank_medium_kernel; extends ovo_rank_sorted_kernel with skip_n_grp_gt upper-bound pruning.
Tier planning and ovo_dispatch_tiers control flow
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_kernels.cuh
OvoTierPlan gains huge_skip_le; make_ovo_tier_plan sets tier flags from actual group sizes; launch_ovo_medium and ovo_dispatch_tiers gain analytic_zeros and select analytic vs non-analytic kernels per tier; HUGE non-analytic path uses plan.huge_skip_le with INT_MAX.
Dense streaming pipeline wiring
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon.cu
launch_ovo_rank_dense_tiered_unsorted_ref and launch_ovo_rank_dense_host_streaming switch huge-mode selection to plan-derived run_huge/huge_skip_le and pass analytic_zeros=false to ovo_dispatch_tiers.
Device-sparse (CSR/CSC) pipeline wiring
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_device_sparse.cuh
Both CSR and CSC device-sparse impls switch run_huge and sort-skip to plan-derived values and pass analytic_zeros=false to ovo_dispatch_tiers.
Host-sparse analytic_zeros propagation and scratch budgeting
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_host_sparse.cuh
ovo_streaming_csc_host_impl and ovo_streaming_csr_host_impl gain analytic_zeros; StreamBuf gains grp_nz/d_grp_nz scratch pointers allocated conditionally; per-stream memory budgets account for extra group slabs; tier dispatch calls include analytic_zeros.
Python bindings and caller activation
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_sparse.cu, src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
ovo_streaming_csc_host and ovo_streaming_csr_host bindings expose analytic_zeros (default false); _run_ovo_host_sparse passes analytic_zeros=not rg._sparse_negative_fallback to both CSC and CSR calls.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • scverse/rapids-singlecell#636: Directly precedes this PR's work — introduces OVO tiered Wilcoxon ranking with medium/huge tier logic in the same kernels_wilcoxon_ovo.cuh, wilcoxon.cu, and sparse dispatch files that this PR extends with analytic-zero paths.
🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title is related to the change, but it is too vague to describe the main update clearly. Use a specific title like 'Add analytic-zero OVO tie-correction speedups for sparse Wilcoxon ranks'.
Description check ❓ Inconclusive No description was provided, so there is no meaningful summary to evaluate. Add a short description of the main behavior changes, especially the new analytic-zero and tiered ranking updates.
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch speed-up-tie-correct

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (2)
src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon_ovo.cuh (1)

248-252: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use the named warp-reduction buffer size.

The new analytic kernels hard-code warp_buf[32]; use WARP_REDUCE_BUF to keep shared-memory sizing aligned with the reduction helper.

As per coding guidelines, src/rapids_singlecell/_cuda/**/*.{cu,cuh}: “All numeric literals for block sizes, tile dimensions, shared memory sizes, and heuristic thresholds MUST be defined as named constants”.

Proposed fix
-    __shared__ double warp_buf[32];
+    __shared__ double warp_buf[WARP_REDUCE_BUF];
-    __shared__ double warp_buf[32];
+    __shared__ double warp_buf[WARP_REDUCE_BUF];

Also applies to: 375-376

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon_ovo.cuh` around lines
248 - 252, The shared-memory warp reduction buffer is hard-coded in the analytic
Wilcoxon kernels, which violates the named-constant guideline and can drift from
the helper’s expected size. Update the declarations in the Wilcoxon OVO kernel
code to use WARP_REDUCE_BUF instead of the literal 32, and make the same change
in the other affected spot, keeping the shared-memory layout aligned with the
reduction helper and consistent with the kernel symbols in
kernels_wilcoxon_ovo.cuh.

Source: Coding guidelines

src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_kernels.cuh (1)

78-97: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Size LARGE shared memory from the actual largest LARGE group.

large_max = min(max_grp_size, OVO_LARGE_MAX) overestimates when a HUGE group exists alongside smaller LARGE groups, which can reduce occupancy or force an unnecessary HUGE fallback.

Proposed refactor
     OvoTierPlan c;
+    int large_max = 0;
     for (int g = 0; g < n_groups; g++) {
         int sz = h_grp_offsets[g + 1] - h_grp_offsets[g];
         if (sz > c.max_grp_size) c.max_grp_size = sz;
         if (sz <= OVO_MEDIUM_MAX)
             c.run_medium = true;
-        else if (sz <= OVO_LARGE_MAX)
+        else if (sz <= OVO_LARGE_MAX) {
             c.run_large = true;
-        else
+            large_max = std::max(large_max, sz);
+        } else {
             c.run_huge = true;
+        }
     }
@@
     // Size smem for the largest LARGE-band group.
     if (c.run_large) {
-        int large_max = std::min(c.max_grp_size, OVO_LARGE_MAX);
         c.large_padded = 1;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_kernels.cuh` around lines
78 - 97, The LARGE shared-memory sizing in the kernel config logic is using the
global max group size instead of the largest group that actually falls in the
LARGE range, which can over-allocate when a HUGE group is present. Update the
sizing in the configuration block around c.run_large/c.large_smem to track the
largest sz that is <= OVO_LARGE_MAX while iterating groups, and use that value
when computing large_max in this setup path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_device_sparse.cuh`:
- Around line 24-29: Gate the huge-mode scratch allocation in the sparse
Wilcoxon OVO path so it only happens when tie correction is actually enabled;
currently `run_huge` and the huge-tier sort buffers are created in
`wilcoxon_ovo_device_sparse` even when `compute_tie_corr` will take the no-tie
fast path. Update the allocation/dispatch logic around `make_sort_group_ids` and
the later huge-buffer setup so `compute_tie_corr=false` skips all huge-mode
sort-ID/scratch setup, while preserving the existing huge path when tie
correction is on.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_kernels.cuh`:
- Around line 222-235: Validate the analytic-HUGE launch inputs before calling
compact_huge_nonzeros_kernel and cub_segmented_sortkeys in
wilcoxon_ovo_kernels.cuh: guard grp_nz against null before passing it to the
writing kernel, and check that sb_cols * n_all_grp fits the segment offset type
used by grp_seg_offsets and grp_seg_ends. Add these pre-launch validations
alongside the existing analytic_zeros branch so the kernel only runs with valid
scratch buffers and safe index ranges.
- Around line 3-4: The Wilcoxon kernel header uses std::min but only includes
<climits>, so add the missing direct <algorithm> include in
wilcoxon_ovo_kernels.cuh near the existing includes. Keep the change localized
to the header and ensure the std::min usage in the Wilcoxon kernel code remains
valid without relying on transitive includes.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon.cu`:
- Around line 369-375: The huge-tier scratch buffers are being enabled in the
Wilcoxon tier setup even when compute_tie_corr is false, but ovo_dispatch_tiers
exits early on that path, so avoid allocating the huge-mode data unless the
tie-correction path will actually run. Update the run_huge handling in
wilcoxon.cu around the tier setup and the later buffer preparation to gate
make_sort_group_ids and the related huge scratch allocation behind
compute_tie_corr, using the existing symbols run_huge, compute_tie_corr,
t1.run_huge, and ovo_dispatch_tiers to keep the fast path from triggering
unnecessary OOMs.

---

Nitpick comments:
In `@src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon_ovo.cuh`:
- Around line 248-252: The shared-memory warp reduction buffer is hard-coded in
the analytic Wilcoxon kernels, which violates the named-constant guideline and
can drift from the helper’s expected size. Update the declarations in the
Wilcoxon OVO kernel code to use WARP_REDUCE_BUF instead of the literal 32, and
make the same change in the other affected spot, keeping the shared-memory
layout aligned with the reduction helper and consistent with the kernel symbols
in kernels_wilcoxon_ovo.cuh.

In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_kernels.cuh`:
- Around line 78-97: The LARGE shared-memory sizing in the kernel config logic
is using the global max group size instead of the largest group that actually
falls in the LARGE range, which can over-allocate when a HUGE group is present.
Update the sizing in the configuration block around c.run_large/c.large_smem to
track the largest sz that is <= OVO_LARGE_MAX while iterating groups, and use
that value when computing large_max in this setup path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b2c19453-e26f-4e4e-8a32-32957bea7b1c

📥 Commits

Reviewing files that changed from the base of the PR and between 8840275 and 3893261.

📒 Files selected for processing (7)
  • src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon_ovo.cuh
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon.cu
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_device_sparse.cuh
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_host_sparse.cuh
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_kernels.cuh
  • src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_sparse.cu
  • src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py

Comment thread src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_device_sparse.cuh Outdated
Comment thread src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_kernels.cuh
Comment thread src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo_kernels.cuh
Comment thread src/rapids_singlecell/_cuda/wilcoxon/wilcoxon.cu Outdated
Signed-off-by: Intron7 <severin.dicks@icloud.com>
@codecov-commenter

codecov-commenter commented Jun 28, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.31%. Comparing base (8840275) to head (8b87c82).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #701   +/-   ##
=======================================
  Coverage   90.31%   90.31%           
=======================================
  Files         104      104           
  Lines        9570     9570           
=======================================
  Hits         8643     8643           
  Misses        927      927           
Files with missing lines Coverage Δ
...s_singlecell/tools/_rank_genes_groups/_wilcoxon.py 92.91% <ø> (ø)

@Intron7 Intron7 merged commit 87470ae into main Jun 29, 2026
24 of 27 checks passed
@Intron7 Intron7 deleted the speed-up-tie-correct branch June 29, 2026 11:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants