
[REVIEW] Improve 1-NN performance with split GEMM/reduction kernels on Blackwell #1768

Open
vinaydes wants to merge 46 commits into rapidsai:main from vinaydes:distance-nn

Conversation

@vinaydes
Contributor

@vinaydes vinaydes commented Feb 4, 2026

cuVS currently implements 1-nearest-neighbor search using a fused-kernel approach, where the pairwise-distance GEMM and the subsequent reduction are combined into a single kernel. While this can be efficient, the fused implementation has limitations that prevent it from consistently achieving the best performance. Additionally, the separate-kernels implementation can be used for the half and int8 datatypes, unlike the fused implementation, which is restricted to float.

This PR adds a separate-kernel path in which GEMM and reduction run as two distinct kernels. On Blackwell, the separate-kernel approach performs better for certain M, N, and K configurations (see results below).
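For readers unfamiliar with the decomposition: the separate path amounts to expanding the squared-L2 distance as `||x||^2 + ||y||^2 - 2<x, y>`, so a GEMM produces the cross terms and a second "reduction" kernel adds the squared norms and takes the per-row argmin. The following plain CPU sketch illustrates the two stages; it is not the cuVS CUDA implementation, and all function names here are hypothetical.

```cpp
#include <cassert>
#include <cfloat>
#include <cstddef>
#include <vector>

// Stage 1 (the "GEMM kernel"): C[m][n] = <x_m, y_n>, with X as M x K and
// Y as N x K, both row-major. In cuVS this would be a library GEMM call.
static void gemm_xyT(const std::vector<float>& X, const std::vector<float>& Y,
                     std::vector<float>& C, size_t M, size_t N, size_t K) {
  for (size_t m = 0; m < M; ++m)
    for (size_t n = 0; n < N; ++n) {
      float acc = 0.f;
      for (size_t k = 0; k < K; ++k) acc += X[m * K + k] * Y[n * K + k];
      C[m * N + n] = acc;
    }
}

// Stage 2 (the "reduction kernel"): for each query row m, find
// argmin_n ||x_m||^2 + ||y_n||^2 - 2 * C[m][n], given precomputed norms.
static std::vector<size_t> argmin_reduce(const std::vector<float>& C,
                                         const std::vector<float>& x_norms,
                                         const std::vector<float>& y_norms,
                                         size_t M, size_t N) {
  std::vector<size_t> out(M);
  for (size_t m = 0; m < M; ++m) {
    float best = FLT_MAX;
    size_t best_idx = 0;
    for (size_t n = 0; n < N; ++n) {
      float d = x_norms[m] + y_norms[n] - 2.f * C[m * N + n];
      if (d < best) { best = d; best_idx = n; }
    }
    out[m] = best_idx;
  }
  return out;
}
```

The fused approach computes both stages in one kernel, avoiding the O(M*N) intermediate; the separate approach trades that memory traffic for the ability to use a highly tuned GEMM.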

In addition, this PR includes:

  • A simple heuristic to choose between the fused and separate paths
  • Unit tests covering both fused and separate execution paths
  • A benchmark that compares fused vs. separate performance and also reports GEMM-only time for reference
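The dispatch heuristic mentioned above could look something like the sketch below. The actual cuVS heuristic and its thresholds live in the PR source; the predicate here (prefer the separate path on Blackwell, i.e. compute capability 10.x, for large-enough problem shapes) is purely an assumed illustration.

```cpp
#include <cassert>
#include <cstdint>

struct Shape { int64_t m, n, k; };

// Hypothetical fused-vs-separate dispatch: the compute-capability check and
// the size threshold below are illustrative assumptions, not the PR's values.
static bool use_separate_path(const Shape& s, int cc_major) {
  if (cc_major < 10) return false;              // pre-Blackwell: keep fused path
  return s.m * s.n >= (int64_t{1} << 22);       // assumed size threshold
}
```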

End-to-end benchmarks

I ran CUVS_IVF_PQ_ANN_BENCH on a dataset with 10 million vectors and the following build parameters:

```json
"build_param": {
  "pq_dim": 128,
  "pq_bits": 8,
  "nlist": 10000,
  "niter": 10,
  "ratio": 100
},
```

I observed the following performance improvements:

Fused:
(benchmark screenshot)

Separate:
(benchmark screenshot)

1-NN compute benchmark:

The following table shows the performance of fused and separate 1-NN computation for various sizes of M, N, and K. The GEMM column shows the performance of a pure GEMM for comparison. Higher is better.

| M | N | K | Fused TFLOPS | Separate TFLOPS | GEMM TFLOPS |
|---|---|---|---|---|---|
| 16384 | 4096 | 128 | 28.28 | 37.14 | 44.53 |
| 16384 | 4096 | 64 | 22.04 | 25.86 | 33.66 |
| 8192 | 2048 | 128 | 26.43 | 35.33 | 40.96 |
| 8192 | 2048 | 64 | 18.57 | 24.77 | 30.54 |
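As a rough sanity check on the table, TFLOPS can be derived from kernel time assuming the conventional 2·M·N·K FLOP count for a GEMM-dominated distance computation (the reduction adds only O(M·N) work, negligible for K ≥ 64):

```cpp
#include <cassert>
#include <cstdint>

// Throughput in TFLOPS for an M x N x K GEMM-shaped workload, assuming the
// standard 2*M*N*K FLOP count. Illustrative helper, not from the PR.
static double tflops(int64_t m, int64_t n, int64_t k, double seconds) {
  return (2.0 * double(m) * double(n) * double(k)) / seconds / 1e12;
}
```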

@copy-pr-bot

copy-pr-bot Bot commented Feb 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@vinaydes vinaydes changed the title [WIP] Improve 1-NN performance with split GEMM/reduction kernels on Blackwell [REVIEW] Improve 1-NN performance with split GEMM/reduction kernels on Blackwell Feb 4, 2026
@cjnolet cjnolet moved this to In Progress in Unstructured Data Processing Feb 4, 2026
@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Feb 10, 2026
@cjnolet
Member

cjnolet commented Feb 10, 2026

Thanks for this change @vinaydes. There are two things we'll need to cover in a bit more depth:

  1. Are M and N tile sizes, and is K the number of dims? 16k and 4096 are not particularly realistic or representative of the types of data we encounter in practice; we're now seeing rows well into the millions and columns into the thousands (1536 and 2048 are becoming more and more common). We'll definitely need to demonstrate what this looks like in an algorithm like kmeans.

  2. We're experiencing a lot of challenges maintaining a reasonable binary size, especially as we keep adding new kernels, which get compiled for various supported architectures. Can you please verify the binary-size impact of your changes here?

Overall, these are welcome changes, and I think they help address some of the perf gaps we're seeing on Blackwell!

Member

@dantegd dantegd left a comment


PR looks good to me. I also think it's a good idea to keep the benchmark in a new bench/prims directory, separate from the ANN ones.

@vinaydes
Contributor Author

vinaydes commented Mar 3, 2026

@cjnolet @dantegd What is the next step for this PR? I have addressed all the comments. Does it need some kind of blessing for CI to run?

@aamijar
Member

aamijar commented Mar 4, 2026

/ok to test ec3487d

@aamijar
Member

aamijar commented Mar 4, 2026

/ok to test c5b51e0

@vinaydes
Contributor Author

vinaydes commented Mar 5, 2026

@aamijar Thanks for running the CI. I am not sure why I don't see the CI failures locally; maybe it is because I am using version 13.2 of the toolkit. I'll try to reproduce the failure locally with 13.1.

Contributor

@tfeher tfeher left a comment


Thanks Vinay for the PR, it is great to improve kmeans performance.

I would recommend moving the benchmark code to a separate PR, because there are a few issues there that need to be resolved, and they should not delay merging the improved 1-NN distance functions.

Regarding the new 1-NN distance function, please check if we could use raft's GEMM wrappers. Otherwise the code looks good.

@vinaydes
Contributor Author

@tfeher I have deleted the benchmarking code for now. We can reintroduce it after refactoring and addressing the changes you suggested. Thanks.

@vinaydes vinaydes requested a review from tfeher April 7, 2026 11:04
```cpp
    0.0f,
    stream);
  if (should_use_fused) {
    cuvs::distance::fusedDistanceNNMinReduce<MathT, raft::KeyValuePair<IdxT, MathT>, IdxT>(
```
Contributor


Rather than (or in addition to) doing this check here, can we do it in minClusterDistanceCompute.cu? In that case, even regular kmeans will benefit from this change. I have an open PR #2001 to call minClusterDistanceCompute from kmeans_balanced.cuh instead of directly instantiating fusedDistanceNNMinReduce.
