
[REVIEW] Improve 1-NN performance with split GEMM/reduction kernels on Blackwell #1768

Open
vinaydes wants to merge 46 commits into rapidsai:main from vinaydes:distance-nn

Conversation

@vinaydes
Contributor

@vinaydes vinaydes commented Feb 4, 2026

cuVS currently implements 1-nearest-neighbor search using a fused-kernel approach, where the pairwise-distance GEMM and the subsequent reduction are combined into a single kernel. While this can be efficient, the fused implementation has limitations that prevent it from consistently achieving the best performance. Additionally, the separate-kernels implementation can be used for the half and int8 datatypes, unlike the fused implementation, which is restricted to float.

This PR adds a separate-kernel path in which GEMM and reduction run as two distinct kernels. On Blackwell, the separate-kernel approach performs better for certain M, N, and K configurations (see results below).
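For readers unfamiliar with the decomposition: the separate path amounts to expanding the squared-L2 distance as `||x||^2 + ||y||^2 - 2<x, y>`, so a GEMM produces the cross terms and a second "reduction" kernel adds the squared norms and takes the per-row argmin. The following plain CPU sketch illustrates the two stages; it is not the cuVS CUDA implementation, and all function names here are hypothetical.

```cpp
#include <cassert>
#include <cfloat>
#include <cstddef>
#include <vector>

// Stage 1 (the "GEMM kernel"): C[m][n] = <x_m, y_n>, with X as M x K and
// Y as N x K, both row-major. In cuVS this would be a library GEMM call.
static void gemm_xyT(const std::vector<float>& X, const std::vector<float>& Y,
                     std::vector<float>& C, size_t M, size_t N, size_t K) {
  for (size_t m = 0; m < M; ++m)
    for (size_t n = 0; n < N; ++n) {
      float acc = 0.f;
      for (size_t k = 0; k < K; ++k) acc += X[m * K + k] * Y[n * K + k];
      C[m * N + n] = acc;
    }
}

// Stage 2 (the "reduction kernel"): for each query row m, find
// argmin_n ||x_m||^2 + ||y_n||^2 - 2 * C[m][n], given precomputed norms.
static std::vector<size_t> argmin_reduce(const std::vector<float>& C,
                                         const std::vector<float>& x_norms,
                                         const std::vector<float>& y_norms,
                                         size_t M, size_t N) {
  std::vector<size_t> out(M);
  for (size_t m = 0; m < M; ++m) {
    float best = FLT_MAX;
    size_t best_idx = 0;
    for (size_t n = 0; n < N; ++n) {
      float d = x_norms[m] + y_norms[n] - 2.f * C[m * N + n];
      if (d < best) { best = d; best_idx = n; }
    }
    out[m] = best_idx;
  }
  return out;
}
```

The fused approach computes both stages in one kernel, avoiding the O(M*N) intermediate; the separate approach trades that memory traffic for the ability to use a highly tuned GEMM.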

In addition, this PR includes:

  • A simple heuristic to choose between the fused and separate paths
  • Unit tests covering both fused and separate execution paths
  • A benchmark that compares fused vs. separate performance and also reports GEMM-only time for reference
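The dispatch heuristic mentioned above could look something like the sketch below. The actual cuVS heuristic and its thresholds live in the PR source; the predicate here (prefer the separate path on Blackwell, i.e. compute capability 10.x, for large-enough problem shapes) is purely an assumed illustration.

```cpp
#include <cassert>
#include <cstdint>

struct Shape { int64_t m, n, k; };

// Hypothetical fused-vs-separate dispatch: the compute-capability check and
// the size threshold below are illustrative assumptions, not the PR's values.
static bool use_separate_path(const Shape& s, int cc_major) {
  if (cc_major < 10) return false;              // pre-Blackwell: keep fused path
  return s.m * s.n >= (int64_t{1} << 22);       // assumed size threshold
}
```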

End-to-end benchmarks

I ran CUVS_IVF_PQ_ANN_BENCH on a dataset with 10 million vectors and the following build parameters:

```json
"build_param": {
  "pq_dim": 128,
  "pq_bits": 8,
  "nlist": 10000,
  "niter": 10,
  "ratio": 100
},
```

I observed the following performance improvements:

Fused:
(benchmark screenshot)

Separate:
(benchmark screenshot)

1-NN compute benchmark:

The following table shows the performance of fused and separate 1-NN computation for various sizes of M, N, and K. The GEMM column shows the performance of a pure GEMM for comparison. Higher is better.

| M | N | K | Fused TFLOPS | Separate TFLOPS | GEMM TFLOPS |
|---|---|---|---|---|---|
| 16384 | 4096 | 128 | 28.28 | 37.14 | 44.53 |
| 16384 | 4096 | 64 | 22.04 | 25.86 | 33.66 |
| 8192 | 2048 | 128 | 26.43 | 35.33 | 40.96 |
| 8192 | 2048 | 64 | 18.57 | 24.77 | 30.54 |
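As a rough sanity check on the table, TFLOPS can be derived from kernel time assuming the conventional 2·M·N·K FLOP count for a GEMM-dominated distance computation (the reduction adds only O(M·N) work, negligible for K ≥ 64):

```cpp
#include <cassert>
#include <cstdint>

// Throughput in TFLOPS for an M x N x K GEMM-shaped workload, assuming the
// standard 2*M*N*K FLOP count. Illustrative helper, not from the PR.
static double tflops(int64_t m, int64_t n, int64_t k, double seconds) {
  return (2.0 * double(m) * double(n) * double(k)) / seconds / 1e12;
}
```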

@copy-pr-bot

copy-pr-bot Bot commented Feb 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@vinaydes vinaydes changed the title [WIP] Improve 1-NN performance with split GEMM/reduction kernels on Blackwell [REVIEW] Improve 1-NN performance with split GEMM/reduction kernels on Blackwell Feb 4, 2026
@cjnolet cjnolet moved this to In Progress in Unstructured Data Processing Feb 4, 2026
@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Feb 10, 2026
@cjnolet
Member

cjnolet commented Feb 10, 2026

Thanks for this change @vinaydes. There are two things we'll need to cover in a bit more depth:

  1. Are M and N tile sizes, and is K the number of dims? 16k and 4096 are not particularly realistic or representative of the types of data we encounter in practice; we're now seeing rows well into the millions and columns into the thousands (1536 and 2048 are becoming more and more common). We'll definitely need to demonstrate what this looks like in an algorithm like kmeans.

  2. We're experiencing a lot of challenges maintaining a reasonable binary size, especially as we keep adding new kernels, which get compiled for various supported architectures. Can you please verify the binary-size impact of your changes here?

Overall, these are welcome changes, and I think they help address some of the perf gaps we're seeing on Blackwell!

Member

@dantegd dantegd left a comment


PR looks good to me. I also think it's a good idea to keep the benchmark in a new bench/prims directory, separate from the ANN ones.

@vinaydes
Contributor Author

vinaydes commented Mar 3, 2026

@cjnolet @dantegd What is the next step for this PR? I have addressed all the comments. Does it need some kind of blessing for CI to run?

@aamijar
Member

aamijar commented Mar 4, 2026

/ok to test ec3487d

@aamijar
Member

aamijar commented Mar 4, 2026

/ok to test c5b51e0

@vinaydes
Contributor Author

vinaydes commented Mar 5, 2026

@aamijar Thanks for running the CI. I am not sure why I don't see the CI failures locally; maybe it is because I am using version 13.2 of the toolkit. I'll try to reproduce the failure locally with 13.1.

Contributor

@tfeher tfeher left a comment


Thanks Vinay for the PR, it is great to improve kmeans performance.

I would recommend moving the benchmark code to a separate PR, because there are a few issues there that need to be resolved, and they should not delay merging the improved 1-NN distance functions.

Regarding the new 1-NN distance function, please check if we could use raft's GEMM wrappers. Otherwise the code looks good.

@vinaydes
Contributor Author

@tfeher I have deleted the benchmarking code for now. We can reintroduce it after refactoring and addressing the changes you suggested. Thanks.

@vinaydes vinaydes requested a review from tfeher April 7, 2026 11:04
```cpp
    0.0f,
    stream);
  if (should_use_fused) {
    cuvs::distance::fusedDistanceNNMinReduce<MathT, raft::KeyValuePair<IdxT, MathT>, IdxT>(
```
Contributor


Rather than (or in addition to) doing this check here, can we do it in minClusterDistanceCompute.cu? In that case, even regular kmeans will benefit from this change. I have an open PR #2001 to call minClusterDistanceCompute from kmeans_balanced.cuh instead of directly instantiating fusedDistanceNNMinReduce.
