[REVIEW] Improve 1-NN performance with split GEMM/reduction kernels on Blackwell #1768
vinaydes wants to merge 46 commits into rapidsai:main from
Conversation
Thanks for this change @vinaydes. There are two things we'll need to cover in a bit more depth:
Overall, these are welcome changes, and I think they help address some of the perf gaps we're seeing on Blackwell!
Co-authored-by: Dante Gama Dessavre <dante.gamadessavre@gmail.com>
dantegd
left a comment
PR looks good to me. I also think it's a good idea to keep the benchmark in a new bench/prims directory, separate from the ann ones.
/ok to test ec3487d

/ok to test c5b51e0
@aamijar Thanks for running the CI. I am not sure why I don't see the CI failures locally; maybe it's because I am using version 13.2 of the toolkit. I'll try to reproduce the failure locally with 13.1.
tfeher
left a comment
Thanks Vinay for the PR, it is great to improve kmeans performance.
I would recommend moving the benchmark code to a separate PR, because there are a few issues there that need to be resolved, and they should not delay merging the improved 1-NN distance functions.
Regarding the new 1-NN distance function, please check whether we could use raft's GEMM wrappers. Otherwise the code looks good.
@tfeher I have deleted the benchmarking code for now. We can reintroduce it after refactoring and addressing the changes you suggested. Thanks.
```cpp
      0.0f,
      stream);
    if (should_use_fused) {
      cuvs::distance::fusedDistanceNNMinReduce<MathT, raft::KeyValuePair<IdxT, MathT>, IdxT>(
```
Rather than (or in addition to) doing this check here, can we do it in minClusterDistanceCompute.cu? In that case, even regular kmeans will benefit from this change. I have an open PR #2001 to call minClusterDistanceCompute from kmeans_balanced.cuh instead of directly instantiating fusedDistanceNNMinReduce.
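For illustration only, a dispatch along the lines the snippet suggests could look like the sketch below. The function name `should_use_fused`, the compute-capability cut, and the size threshold are all hypothetical; they are not the actual cuVS heuristic.

```cpp
// Hypothetical heuristic sketch: prefer the separate GEMM + reduction path
// on Blackwell-class GPUs (SM major version >= 10) for large problems,
// where the split kernels were measured to win; otherwise keep the fused
// kernel. The threshold below is an arbitrary illustrative value.
bool should_use_fused(int sm_major, long long m, long long n, long long k)
{
  bool is_blackwell = sm_major >= 10;
  bool large_gemm   = m * n >= (1LL << 24);  // example cutoff, not tuned
  return !(is_blackwell && large_gemm);
}
```

Centralizing a check like this in one place means every caller of the 1-NN primitive (including regular kmeans) picks up the faster path automatically.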
cuVS currently implements 1-nearest neighbor using a fused-kernel approach, where the pairwise-distance GEMM and the subsequent reduction are combined into a single kernel. While this can be efficient, the fused implementation has limitations that prevent it from consistently achieving the best performance. Additionally, the separate-kernels implementation can be used for `half` and `int8` datatypes, unlike the fused implementation, which is restricted to `float` only.

This PR adds a separate-kernel path in which GEMM and reduction run as two distinct kernels. On Blackwell, the separate-kernel approach performs better for certain M, N, and K configurations (see results below).
In addition, this PR includes:
End-to-end benchmarks
I ran the CUVS_IVF_PQ_ANN_BENCH on a dataset with 10 million vectors, with the following build parameters.
I observed the following performance improvements:
Fused:
[benchmark screenshot]
Separate:
[benchmark screenshot]
1-NN compute benchmark:
The following table shows the performance of fused and separate 1-NN computation for various sizes of M, N, and K. The GEMM column shows the performance of pure GEMM for comparison. Higher is better here.