Support IVF-RaBitQ in cuVS Library #1866
Stardust-SJF wants to merge 146 commits into rapidsai:main from
Conversation
- Currently built as a separate library.
- To be merged with existing `cuvs_objs` library.
- Dependency on `Eigen` yet to be removed.
- RABITQ_BENCH_TEST for standalone testing; to be removed as integration work is completed.
- CUVS_IVF_RABITQ_ANN_BENCH for benchmarking as part of ANN benchmarking suite
- `bits_per_dim` = `ex_bits` + 1
- Also update supported range of `bits_per_dim` to 2-9 inclusive
* Fix cuVS build issues with RaBitQ
* Align line formatting && Delete unused variables in robust_prune.cuh
…q' into jamxia_cuvs_ivf_rabitq
* Download Eigen automatically by rapids-cmake
* Disable FAISS and DISKANN benchmarks
* add config files and update readme
* Update Readme and openai_1M config
* Update python bench command line
* update README
* update README

---------

Co-authored-by: James Xia <jamxia@nvidia.com>
- Error-checking
- Stream-ordered CUDA calls
- Remove commented-out code
- Rename a variable
- Update index type for device matrix views
Replace `kmeans_trainset_fraction` with `max_points_per_cluster`
Implement streaming index construction for IVF-RaBitQ to handle datasets that exceed available GPU memory. This enables building indices for large datasets by processing data in batches streamed from host memory.

Key features:
- Automatic detection based on dataset size vs available workspace
- Complete-cluster batching strategy (no partial clusters across batches)
- OpenMP parallel host data gathering with persistent thread pool
- Contiguous data handling in quantizer for improved performance
- Configurable batch size via streaming_batch_size parameter

The implementation uses omp_get_max_threads() to scale with available hardware while maintaining efficient memory bandwidth utilization.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
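Not part of the original commit — a minimal host-side sketch of the "complete-cluster batching" plus OpenMP gather idea described above. `Cluster`, `make_batches`, and `gather_batch` are illustrative names and simplifications, not the cuVS implementation.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>
#include <omp.h>

struct Cluster {
  std::vector<size_t> row_ids;  // dataset rows assigned to this cluster
};

// Group whole clusters into batches that fit a row budget; no cluster is split across batches.
std::vector<std::vector<const Cluster*>> make_batches(const std::vector<Cluster>& clusters,
                                                      size_t max_rows_per_batch)
{
  std::vector<std::vector<const Cluster*>> batches(1);
  size_t rows_in_batch = 0;
  for (const auto& c : clusters) {
    if (rows_in_batch + c.row_ids.size() > max_rows_per_batch && rows_in_batch > 0) {
      batches.emplace_back();
      rows_in_batch = 0;
    }
    batches.back().push_back(&c);
    rows_in_batch += c.row_ids.size();
  }
  return batches;
}

// Gather the rows of the selected clusters into a contiguous host staging buffer in parallel.
void gather_batch(const float* host_dataset, size_t dim,
                  const std::vector<const Cluster*>& batch, std::vector<float>& staging)
{
  // Prefix offsets so each cluster writes to a disjoint region of the staging buffer.
  std::vector<size_t> offsets(batch.size() + 1, 0);
  for (size_t c = 0; c < batch.size(); ++c)
    offsets[c + 1] = offsets[c] + batch[c]->row_ids.size();
  staging.resize(offsets.back() * dim);

#pragma omp parallel for schedule(dynamic)
  for (long c = 0; c < static_cast<long>(batch.size()); ++c) {
    float* dst = staging.data() + offsets[c] * dim;
    for (size_t r : batch[c]->row_ids) {
      std::memcpy(dst, host_dataset + r * dim, dim * sizeof(float));
      dst += dim;
    }
  }
}
```

Per the commit message, the real builder would additionally quantize each staged batch on the GPU and size the batch budget from the available device workspace (or from `streaming_batch_size`).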
Add an optional parameter to force streaming construction regardless of dataset size. This provides users with explicit control over the construction method for testing or specific use cases.

When force_streaming is enabled:
- Streaming construction is used even if dataset fits in GPU memory
- Distinct log message indicates explicit vs automatic decision to use streaming construction

Default behavior (force_streaming=false) remains unchanged, with automatic detection based on dataset size vs available workspace.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Enable control of the force_streaming parameter through JSON benchmark
configuration files. Users can now specify force_streaming in the
build_param section of their benchmark configs.
Example usage:
"build_param": {
"nlist": 10000,
"force_streaming": true,
...
}
This allows benchmark configurations to explicitly control streaming
construction for performance testing and comparison.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Update documentation to note that force_streaming has no effect when the dataset is already in device memory, as streaming construction is only applicable for host-to-device data transfer.
Adds build_forced_streaming test case that explicitly enables streaming construction even for small datasets that fit in GPU memory. This validates the streaming code path with dynamic batch sizing and ensures compatibility with serialization/deserialization.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove batch_flag member variables and associated dead code, then refactor
DataQuantizerGPU to move private methods to free functions for better
encapsulation.
Part 1: Remove batch_flag dead code
- Remove IVFGPU::batch_flag and DataQuantizerGPU::batch_flag_dq
- Remove dual code path conditionals (AoS vs SoA layouts)
- Simplify helper methods: first_block_batch() → first_block(),
ex_factor_batch() → ex_factor()
- Simplify GetExFactorBytes() and block_bytes() to single return
- Maintain backward compatibility in save/load (legacy flag handling)
- Remove dead methods: quantize(), quantize_contiguous(),
data_transformation(), data_transformation_contiguous()
- Remove 936 lines of dead code from quantizer_gpu_fast.cu (96% reduction)
Part 2: Move private methods to free functions
- Convert 5 private methods to free functions in anonymous namespace:
* data_transformation_batch_opt()
* data_transformation_batch_opt_contiguous()
* rabitq_codes_and_factors_fused()
* exrabitq_codes_and_factors_fused()
* exrabitq_codes_and_factors_fused_ori()
- Pass all needed class members as explicit parameters
- Remove ~40 lines from public header (quantizer_gpu.cuh)
- Clean up unused variables
Benefits:
- Eliminates confusing dual code paths
- Cleaner public API with implementation details hidden
- Faster compilation for files including headers
- Better separation of interface and implementation
Files modified:
- cpp/src/neighbors/ivf_rabitq/gpu_index/{ivf_gpu.cuh,ivf_gpu.cu}
- cpp/src/neighbors/ivf_rabitq/gpu_index/{quantizer_gpu.cuh,quantizer_gpu.cu,quantizer_gpu_fast.cu}
- cpp/src/neighbors/ivf_rabitq.cu
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
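As an aside (not from the PR itself): the "private method → free function in an anonymous namespace" step described in Part 2 of the commit above typically looks like the sketch below. The function name is taken from the commit; the parameter list is invented for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

namespace {

// Former private method of DataQuantizerGPU: everything it used to read from `this`
// is now passed explicitly, so the definition can stay local to the .cu file and the
// public header no longer has to declare it.
void rabitq_codes_and_factors_fused(const float* d_rotated_data,  // hypothetical parameters
                                    std::size_t num_rows,
                                    std::size_t padded_dim,
                                    std::uint8_t* d_codes,
                                    float* d_factors,
                                    cudaStream_t stream)
{
  // ... kernel launches that previously referenced class members ...
}

}  // anonymous namespace: the symbol has internal linkage and is invisible to other TUs
```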
/ok to test 52793fc
📝 Walkthrough

This pull request introduces a complete IVF-RaBitQ nearest neighbor search implementation for cuVS, spanning GPU index construction (with streaming support), multi-mode search (LUT32/LUT16/QUANT8/QUANT4), quantization, serialization, and comprehensive benchmark/test integration for both cuVS and FAISS CPU variants.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ 90+ minutes
Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 4 passed
Actionable comments posted: 7
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
🟠 Major comments (16)
cpp/CMakeLists.txt-1070-1070 (1)
1070-1070: ⚠️ Potential issue | 🟠 Major — Add `ivf_rabitq` to the static `cuvs_static` target.

The shared `cuvs` target links `ivf_rabitq`, but `cuvs_static` does not. Since `src/neighbors/ivf_rabitq.cu` is included in `cuvs_objs` for static builds, calls into the helper target become unresolved for `cuvs_static` consumers.

Proposed fix

```diff
 target_link_libraries(
   cuvs_static
   INTERFACE $<COMPILE_ONLY:rmm::rmm>
   PUBLIC raft::raft
          cuvs::cuvs_cpp_headers
          ${CUVS_CTK_MATH_DEPENDENCIES}
          $<TARGET_NAME_IF_EXISTS:NCCL::NCCL>
          $<BUILD_LOCAL_INTERFACE:$<TARGET_NAME_IF_EXISTS:hnswlib::hnswlib>>
   PRIVATE rmm::rmm
           $<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX>
           CUDA::nvJitLink
           CUDA::nvrtc
+          ivf_rabitq
           $<$<BOOL:${CUVS_NVTX}>:CUDA::nvtx3>
           $<COMPILE_ONLY:nvidia::cutlass::cutlass>
           $<COMPILE_ONLY:cuco::cuco>
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/CMakeLists.txt` at line 1070, The static target cuvs_static is missing linkage to ivf_rabitq causing unresolved symbols from src/neighbors/ivf_rabitq.cu; update the CMake target definition that builds cuvs_static (which aggregates cuvs_objs) to link the ivf_rabitq library the same way the shared cuvs target does (i.e., add ivf_rabitq to the target_link_libraries or equivalent linkage list for cuvs_static so consumers of cuvs_static get the helper resolved).cpp/CMakeLists.txt-228-253 (1)
228-253: ⚠️ Potential issue | 🟠 Major — Condition OpenMP linking and compilation flags on the `DISABLE_OPENMP` option.

The `ivf_rabitq` target unconditionally links `OpenMP::OpenMP_CXX` and injects `-fopenmp` compiler flags, even when `DISABLE_OPENMP=ON`. This breaks the build because OpenMP is not discovered when that option is enabled. Other targets in this file (lines 980–1130) correctly use the `$<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX>` pattern to make OpenMP optional.

Use conditional generator expressions for the OpenMP target dependency and remove the hardcoded `-fopenmp` flags from the compile options:

Proposed fix

```diff
-  target_link_libraries(ivf_rabitq PRIVATE OpenMP::OpenMP_CXX CUDA::cudart raft::raft rmm)
+  target_link_libraries(
+    ivf_rabitq
+    PRIVATE $<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX>
+            CUDA::cudart
+            raft::raft
+            rmm::rmm
+  )
   target_compile_options(
     ivf_rabitq
-    PRIVATE $<$<COMPILE_LANGUAGE:CUDA>: $<$<CONFIG:Debug>:-G;-g> --extended-lambda
-            --expt-relaxed-constexpr -Xcompiler=-fopenmp > $<$<COMPILE_LANGUAGE:CXX>:-fopenmp>
+    PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:$<$<CONFIG:Debug>:-G;-g> --extended-lambda
+            --expt-relaxed-constexpr>
   )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/CMakeLists.txt` around lines 228 - 253, The ivf_rabitq target currently always links OpenMP and always injects -fopenmp into compile options; change target_link_libraries(ivf_rabitq ...) to use the optional generator expression $<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX> instead of OpenMP::OpenMP_CXX, and remove the hardcoded -fopenmp from target_compile_options so the CXX and CUDA compile flags do not force OpenMP when DISABLE_OPENMP=ON (make any CUDA -Xcompiler=-fopenmp and CXX -fopenmp additions conditional on the OpenMP target existing or the option being enabled).cpp/bench/ann/src/faiss/faiss_cpu_wrapper.h-14-14 (1)
14-14: ⚠️ Potential issue | 🟠 Major — Guard the RaBitQ FAISS include and wrapper behind the build option.

The header is shared by all FAISS CPU benchmarks. Unconditional inclusion of `<faiss/IndexIVFRaBitQ.h>` and the `faiss_cpu_ivfrabitq` class definition will fail the entire build if FAISS lacks RaBitQ support, even when `CUVS_ANN_BENCH_USE_FAISS_CPU_IVF_RABITQ` is disabled.

Proposed fix

```diff
+#ifdef CUVS_ANN_BENCH_USE_FAISS_CPU_IVF_RABITQ
 #include <faiss/IndexIVFRaBitQ.h>
+#endif

+#ifdef CUVS_ANN_BENCH_USE_FAISS_CPU_IVF_RABITQ
 template <typename T>
 class faiss_cpu_ivfrabitq : public faiss_cpu<T> {
  public:
   struct build_param : public faiss_cpu<T>::build_param {};
@@
   }
 };
+#endif
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/bench/ann/src/faiss/faiss_cpu_wrapper.h` at line 14, Guard the RaBitQ-specific include and wrapper behind the build option by wrapping the `#include` <faiss/IndexIVFRaBitQ.h> and the faiss_cpu_ivfrabitq class definition with an `#if` defined(CUVS_ANN_BENCH_USE_FAISS_CPU_IVF_RABITQ) / `#endif` block; locate the unconditional include and the faiss_cpu_ivfrabitq symbol in faiss_cpu_wrapper.h and ensure both the header inclusion and the class (or any RaBitQ-specific functions/typedefs) are only compiled when CUVS_ANN_BENCH_USE_FAISS_CPU_IVF_RABITQ is defined to avoid build failures when FAISS lacks RaBitQ support.cpp/src/neighbors/ivf_rabitq/utils/tools.hpp-10-15 (1)
10-15: ⚠️ Potential issue | 🟠 Major — Header must include `<cstddef>` to be self-contained.

`size_t` is used without including a declaring header. Although this currently works in `memory.hpp` due to transitive includes (`<cstdlib>`), the header is not self-contained and including it directly in other translation units could fail depending on include order. Add `#include <cstddef>`.

The codebase convention uses bare `size_t`, not `std::size_t`, so no style change is needed.

Minimal fix

```diff
 #pragma once
+
+#include <cstddef>

 namespace cuvs::neighbors::ivf_rabitq::detail {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/utils/tools.hpp` around lines 10 - 15, The header uses size_t in the inline functions div_rd_up and rd_up_to_multiple_of but doesn't include the declaring header; add `#include` <cstddef> at the top of the file so the header is self-contained and other translation units can include tools.hpp directly without relying on transitive includes (keep using bare size_t as per project convention).cpp/bench/ann/CMakeLists.txt-51-51 (1)
51-51: ⚠️ Potential issue | 🟠 Major — Add IVF-RaBitQ to the cuVS benchmark option bookkeeping.

`CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ` defaults ON, but it is not disabled in `BUILD_CPU_ONLY` and is omitted from the `CUVS_ANN_BENCH_USE_CUVS` aggregate. CPU-only builds can still try to configure the `.cu` RabitQ benchmark / link `ivf_rabitq`, and cuVS benchmark detection can be wrong when only RabitQ is enabled.

Proposed fix

```diff
 if(BUILD_CPU_ONLY)
   set(CUVS_FAISS_ENABLE_GPU OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_IVF_FLAT OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_IVF_PQ OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_CAGRA OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_BRUTE_FORCE OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB OFF)
   set(CUVS_ANN_BENCH_USE_GGNN OFF)
   set(CUVS_KNN_BENCH_USE_CUVS_BRUTE_FORCE OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_MG OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_VAMANA OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_CAGRA_DISKANN OFF)
+  set(CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ OFF)
 else()
   set(CUVS_FAISS_ENABLE_GPU ON)
 endif()

 set(CUVS_ANN_BENCH_USE_CUVS OFF)
 if(CUVS_ANN_BENCH_USE_CUVS_IVF_PQ
    OR CUVS_ANN_BENCH_USE_CUVS_BRUTE_FORCE
    OR CUVS_ANN_BENCH_USE_CUVS_IVF_FLAT
+   OR CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ
    OR CUVS_ANN_BENCH_USE_CUVS_CAGRA
    OR CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB
    OR CUVS_KNN_BENCH_USE_CUVS_BRUTE_FORCE
```

Also applies to: 84-110, 251-256
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/bench/ann/CMakeLists.txt` at line 51, CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ is defaulted ON but not included in the CUVS_ANN_BENCH_USE_CUVS aggregate and not disabled under BUILD_CPU_ONLY, so CPU-only builds can attempt to configure/link the .cu ivf_rabitq target; update the CMake logic to (1) add CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ into the aggregated CUVS_ANN_BENCH_USE_CUVS option and (2) wrap/force-disable CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ when BUILD_CPU_ONLY is set (same treatment as other GPU-only CUVS flags) so ivf_rabitq is not considered/configured for CPU-only builds.cpp/src/neighbors/ivf_rabitq/defines.hpp-23-25 (1)
23-25: ⚠️ Potential issue | 🟠 Major — Make `operator>` false for equal distances.

`!(*this < other)` returns true when distances are equal, which breaks comparator expectations.

Proposed fix

```diff
-  bool operator>(const Candidate& other) const { return !(*this < other); }
+  bool operator>(const Candidate& other) const { return distance > other.distance; }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/defines.hpp` around lines 23 - 25, The current Candidate::operator>(const Candidate& other) uses '!(*this < other)' which yields true when distances are equal; update operator> to compare distances strictly (e.g., return distance > other.distance or return other.distance < distance) so it returns false for equal distances and preserves strict weak ordering; modify the Candidate struct's operator< and operator> pair accordingly.cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_common.cuh-1-10 (1)
1-10: ⚠️ Potential issue | 🟠 Major — Add an include guard to this CUDA header.

This `.cuh` defines constants, a struct, and a device function; including it twice in one translation unit will cause redefinition errors.

Proposed fix

```diff
 /*
  * SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION.
  * SPDX-License-Identifier: Apache-2.0
  */

+#pragma once
+
 //
 // Created by Stardust on 4/14/25.
 //
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_common.cuh` around lines 1 - 10, This CUDA header lacks an include guard and can be included multiple times causing redefinition of its constants, struct, and device function; fix it by adding a header guard (or a top-line `#pragma once`) around the entire contents of searcher_gpu.cuh using a unique macro name (e.g. SEARCHER_GPU_COMMON_CUH_) so the declarations inside (the constants, the struct, and the device function defined in this file) are only processed once per translation unit.cpp/src/neighbors/ivf_rabitq/defines.hpp-8-12 (1)
8-12: ⚠️ Potential issue | 🟠 Major — Include `<cstddef>` and use `std::size_t`.

The header only includes `<cstdint>`, which does not declare `std::size_t` per the C++17 standard. Line 12 uses bare `size_t` without a qualifying namespace or using declaration, which will not compile on conforming implementations.

Proposed fix

```diff
+#include <cstddef>
 #include <cstdint>

 namespace cuvs::neighbors::ivf_rabitq::detail {
-constexpr size_t FAST_SIZE = 32;
+constexpr std::size_t FAST_SIZE = 32;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/defines.hpp` around lines 8 - 12, The header defines FAST_SIZE using bare size_t but only includes <cstdint>, so add `#include` <cstddef> and change the type to std::size_t for FAST_SIZE in the cuvs::neighbors::ivf_rabitq::detail namespace (symbol: FAST_SIZE in defines.hpp) to ensure portable, standards-conforming compilation.cpp/src/neighbors/ivf_rabitq/utils/searcher_gpu_utils.cu-36-64 (1)
36-64: ⚠️ Potential issue | 🟠 Major — Fix type safety and synchronization for the probed vector reduction.

The reduction over `d_probed_vectors_count` (stored as `unsigned long long`) mixes incompatible types: the init `0` is `int`, the comparator is `thrust::maximum<size_t>()`, and the memset uses `sizeof(size_t)` instead of `sizeof(unsigned long long)`. Additionally, `raft::copy` on line 57 is asynchronous, yet the function returns without synchronizing the stream. Callers immediately access `max_probed_vectors_count` on the next line, creating a race condition.

Introduce a type alias for the vector count, use it consistently in the memset and reduce operations, and synchronize the stream before returning.

Proposed fix
+ using CountT = unsigned long long; + auto d_max_probed_cluster_size = raft::make_device_scalar<uint32_t>(handle, 0); - auto d_probed_vectors_count = raft::make_device_vector<unsigned long long, int64_t>( - handle, get_max_probed_vectors_count ? num_queries : 0); + auto d_probed_vectors_count = + raft::make_device_vector<CountT, int64_t>(handle, + get_max_probed_vectors_count ? num_queries : 0); // raw pointers for passing by value to device lambda auto d_max_probed_cluster_size_ptr = d_max_probed_cluster_size.data_handle(); auto d_probed_vectors_count_ptr = d_probed_vectors_count.data_handle(); if (get_max_probed_vectors_count) { RAFT_CUDA_TRY(cudaMemsetAsync( - d_probed_vectors_count_ptr, 0, num_queries * sizeof(size_t), stream)); // Initialize to 0 + d_probed_vectors_count_ptr, 0, num_queries * sizeof(CountT), stream)); // Initialize to 0 } auto count = thrust::make_counting_iterator<int64_t>(0); thrust::for_each( raft::resource::get_thrust_policy(handle), count, count + num_pairs, [=] __device__(int64_t i) { auto [cluster_idx, query_idx] = d_cluster_query_pairs[i]; auto cluster_size = d_cluster_meta[cluster_idx].num; atomicMax(d_max_probed_cluster_size_ptr, cluster_size); if (get_max_probed_vectors_count) atomicAdd(&d_probed_vectors_count_ptr[query_idx], static_cast<unsigned long long>(cluster_size)); }); raft::copy(&max_probed_cluster_size, d_max_probed_cluster_size_ptr, 1, stream); if (get_max_probed_vectors_count) { - max_probed_vectors_count = thrust::reduce(raft::resource::get_thrust_policy(handle), - d_probed_vectors_count_ptr, - d_probed_vectors_count_ptr + num_queries, - 0, - thrust::maximum<size_t>()); + auto max_count = thrust::reduce(raft::resource::get_thrust_policy(handle), + d_probed_vectors_count_ptr, + d_probed_vectors_count_ptr + num_queries, + CountT{0}, + thrust::maximum<CountT>()); + max_probed_vectors_count = static_cast<size_t>(max_count); } + RAFT_CUDA_TRY(cudaStreamSynchronize(stream));🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/utils/searcher_gpu_utils.cu` around lines 36 - 64, The reduction mixes types and is unsynchronized: introduce a vector-count alias (e.g., using vec_count_t = unsigned long long) and use it for d_probed_vectors_count / d_probed_vectors_count_ptr, change the cudaMemsetAsync size to num_queries * sizeof(vec_count_t) and initialize the reduce identity to static_cast<vec_count_t>(0) while using thrust::maximum<vec_count_t>() for thrust::reduce; after copying device scalar to max_probed_cluster_size and after the thrust::reduce that writes max_probed_vectors_count, synchronize the stream (e.g., cudaStreamSynchronize(stream)) before returning or before any host reads of max_probed_vectors_count to avoid the race; update references in the code paths guarded by get_max_probed_vectors_count as needed.cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h-120-125 (1)
120-125: ⚠️ Potential issue | 🟠 Major — Wire `refine_ratio` into search or remove the advertised refinement path.

`needs_dataset()` requests the dataset for `refine_ratio > 1.0f`, but `search()` never uses `refine_ratio_` or `dataset_`. Benchmark runs configured with refinement will silently report unrefined IVF-RaBitQ results.

Also applies to: 155-156
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h` around lines 120 - 125, The set_search_param stores refine_ratio_ but the search() path never uses refine_ratio_ or dataset_, so either wire the advertised refinement into the search pipeline or remove the refinement plumbing; specifically, in cuvs_ivf_rabitq<T, IdxT>::search() detect refine_ratio_ > 1.0f and perform the additional refinement step using dataset_ and search_params_ (e.g., extra re-ranking / distance computations on candidates) so needs_dataset() is valid, or if refinement is unimplemented remove refine_ratio_ handling from cuvs_ivf_rabitq<T, IdxT>::set_search_param and stop requesting the dataset in needs_dataset(); update references to search_params_, refine_ratio_, dataset_, needs_dataset(), and search() accordingly.cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu-1451-1457 (1)
1451-1457: ⚠️ Potential issue | 🟠 Major — Guard zero-norm query quantization before computing `delta`.

For an all-zero query, `norm` and `norm_quan` can be zero, making `delta` NaN. That value is later used as `query_width`, so distances can become NaN and poison top-k selection.

Proposed fix

```diff
-  float norm_quan      = sqrtf(fmaxf(xu_sq, 0.f));
-  float cos_similarity = ip_resi_xucb / (norm * norm_quan);
-  float delta          = norm / norm_quan * cos_similarity;
+  float norm_quan = sqrtf(fmaxf(xu_sq, 0.f));
+  float delta     = 0.0f;
+  if (norm > 0.0f && norm_quan > 0.0f) {
+    float cos_similarity = ip_resi_xucb / (norm * norm_quan);
+    delta                = norm / norm_quan * cos_similarity;
+  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu` around lines 1451 - 1457, The computation of delta can produce NaN when norm or norm_quan are zero (all-zero query); update the block around tid==0 to guard divisions: check if norm_quan <= 0 or norm <= 0 and in that case set delta to 0 (or another defined safe fallback) instead of performing the division, then store that safe value into d_delta[row]; ensure you reference and modify the delta computation using the existing symbols norm_quan, norm, cos_similarity, delta and d_delta (and row) so downstream use as query_width cannot become NaN.cpp/src/neighbors/ivf_rabitq/gpu_index/rotator_gpu.cu-41-58 (1)
41-58: ⚠️ Potential issue | 🟠 Major — Honor the documented row-major serialization layout.

`load()` and `save()` copy the matrix linearly, but the header says files are row-major and device storage is transposed for column-major GEMM use. This breaks compatibility with any writer/reader following the documented format.

Proposed fix direction

```diff
-  for (size_t i = 0; i < D * D; ++i) {
-    input.read(reinterpret_cast<char*>(&host_buf(i)), sizeof(float));
-  }
+  for (size_t r = 0; r < D; ++r) {
+    for (size_t c = 0; c < D; ++c) {
+      float v;
+      input.read(reinterpret_cast<char*>(&v), sizeof(float));
+      host_buf(c * D + r) = v;  // row-major file -> column-major/device layout
+    }
+  }
```

Apply the inverse transpose in `save()`, or update the header and all serializers to declare the raw layout.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/gpu_index/rotator_gpu.cu` around lines 41 - 58, The code currently copies the matrix bytes linearly between host_buf and rotation_matrix_ but device storage is transposed for column-major GEMM while files are documented as row-major; update RotatorGPU::load and RotatorGPU::save to perform the inverse transpose when moving between disk and device. Concretely: in load(), read the file into host_buf (row-major) then transpose that host buffer into the device layout expected by rotation_matrix_ (or copy host_buf to a temporary host_transposed and raft::copy that into rotation_matrix_); in save(), copy rotation_matrix_ back to a host buffer in device layout, transpose that buffer to row-major order, then write the row-major bytes to output. Use the existing symbols (RotatorGPU::load, RotatorGPU::save, rotation_matrix_, host_buf, D, stream_, handle_) and a host or GPU transpose utility (e.g., raft transpose or a small custom transpose) to perform the conversions so on-disk layout remains row-major while device storage stays transposed for GEMM.cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cu-961-968 (1)
961-968: ⚠️ Potential issue | 🟠 Major — Add stream synchronization after the D2H copy before reading the host buffer.

`raft::copy()` is asynchronous; the CPU loop can read `h_rand_row_normalized_abs` before the copy completes, causing stale or undefined data. The fully-GPU path at line 1450 correctly syncs before reading — apply the same pattern here.

Proposed fix

```diff
   raft::copy(h_rand_row_normalized_abs.data_handle(),
              rand.data_handle(),
              kConstNum * dim,
              raft::resource::get_cuda_stream(handle));
+  raft::resource::sync_stream(handle);

   double sum = 0;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cu` around lines 961 - 968, The host loop reads h_rand_row_normalized_abs immediately after an asynchronous raft::copy from rand; add a CUDA stream synchronization using the same stream from raft::resource::get_cuda_stream(handle) after the raft::copy and before the loop that calls best_rescale_factor so the D2H transfer is complete; locate the raft::copy(...) call and the subsequent use of h_rand_row_normalized_abs and insert a stream synchronization (using the stream handle retrieved from raft::resource::get_cuda_stream(handle) or an equivalent RAFT sync helper) between them.cpp/src/neighbors/ivf_rabitq/utils/IO.hpp-19-35 (1)
19-35: ⚠️ Potential issue | 🟠 Major — Return `-1` as `size_t` creates a potential SIZE_MAX allocation.

The function `get_filesize` returns `-1` on error, which as a `size_t` becomes `SIZE_MAX`. While the `file_exits` check mitigates this in the happy path, a TOCTOU race condition exists: the file could be deleted or become inaccessible between the existence check and the `stat64` call, causing `get_filesize` to return `SIZE_MAX` and subsequent division on line 88 to produce an enormous row count.

Additionally, these helper functions should be marked `inline` since they're defined in a header file used by multiple instantiations of template functions.

Proposed fix

```diff
-size_t get_filesize(const char* filename)
+inline std::optional<size_t> get_filesize(const char* filename)
 {
   struct stat64 stat_buf;
   int rc = stat64(filename, &stat_buf);
-  return rc == 0 ? stat_buf.st_size : -1;
+  if (rc != 0) { return std::nullopt; }
+  return static_cast<size_t>(stat_buf.st_size);
 }

-bool file_exits(const char* filename)
+inline bool file_exits(const char* filename)
```

Then handle the empty optional before computing `rows`.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/utils/IO.hpp` around lines 19 - 35, get_filesize currently returns size_t and uses -1 for errors which converts to SIZE_MAX (risking huge allocations if the file disappears between file_exits and stat64); mark both helpers inline, change get_filesize to return an optional-like error indicator (e.g., std::optional<size_t> or ssize_t) instead of size_t so errors are represented safely, update callers (the code that computes rows/divisions) to check the optional/negative value and bail out early rather than performing divisions, and keep file_exits inline as well; specifically update get_filesize and file_exits signatures and ensure the place that computes rows (where get_filesize() is used) validates the returned value before using it.

cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu-452-527 (1)
452-527: ⚠️ Potential issue | 🟠 Major — `int` grid/block arithmetic can overflow for large inputs.

Several launch-config computations mix `size_t` operands into `int` results, which silently truncates or overflows when `num_vectors`, `num_centroids`, or `batch_size * num_centroids` exceeds ~2³¹:

- Line 453 / 526 / 539 / 695: `int num_blocks = (num_vectors + block_size - 1) / block_size;`
- Line 468 / 618: `int num_levels = num_centroids + 1;` (passed as `int` to CUB)
- Lines 1238/1346/1455: `int grid = num_centroids + batch_size;`
- Lines 1254/1362/1471: `(batch_size * num_centroids + add_threads - 1) / add_threads` assigned to `int`

Given this index supports datasets too large to fit in GPU memory (streaming path), `num_vectors` in the tens-of-billions range is plausible. Please use `size_t`/`int64_t` for these intermediates (CUB histogram/scan use `int num_levels`, so a runtime check that `num_centroids` fits in `int` is also warranted).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu` around lines 452 - 527, The grid/block arithmetic and CUB parameters use int and can overflow; change intermediate counters like num_blocks, num_levels, and grid calculations to size_t or int64_t (compute e.g. num_blocks = (size_t(num_vectors) + block_size - 1) / block_size and similar for (batch_size * num_centroids + add_threads - 1) / add_threads), keep block_size as an int for launch, and cast to the appropriate launch type only when invoking kernels such as build_cluster_meta_kernel; additionally add runtime checks that values passed into CUB APIs (num_centroids -> num_levels) fit into int and fail/throw if not, and ensure d_offsets/ histogram allocations use matching wider index types (size_t) so no truncation occurs before any kernel or cub call.cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cuh-163-166 (1)
163-166: ⚠️ Potential issue | 🟠 Major — Default constructor leaves scalar members uninitialized.

The default constructor only initializes `handle_`, `initializer`, and `Rota`, but `num_vectors`, `num_dimensions`, `num_padded_dim`, `num_centroids`, `max_cluster_length`, and `ex_bits` (declared at lines 381-386) are left indeterminate. If anything reads these (e.g. a getter call, or an early exit in `save`/`load_transposed`) before `load_transposed` populates them, you get UB. Also, passing a hard-coded `128` to `RotatorGPU` is wasted work since `load_transposed` immediately replaces `Rota`.

🛡️ Proposed fix

```diff
   IVFGPU(raft::resources const& handle)
-    : handle_(handle), initializer(nullptr), Rota(std::make_unique<RotatorGPU>(handle_, 128))
+    : handle_(handle),
+      num_vectors(0),
+      num_dimensions(0),
+      num_padded_dim(0),
+      num_centroids(0),
+      max_cluster_length(0),
+      ex_bits(0),
+      initializer(nullptr),
+      DQ(nullptr),
+      Rota(nullptr)
   {
   }
```

Alternatively, give the data members default member initializers where declared.
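Not part of the review output — a minimal sketch of that alternative, wrapped in a standalone struct for illustration; the member names come from the comment above, the types are assumed.

```cpp
#include <cstddef>

// Sketch only: in the real IVFGPU class these would be plain data members with
// default member initializers, so every constructor starts from a defined state.
struct IVFGPUScalarsSketch {
  std::size_t num_vectors        = 0;
  std::size_t num_dimensions     = 0;
  std::size_t num_padded_dim     = 0;
  std::size_t num_centroids      = 0;
  std::size_t max_cluster_length = 0;
  std::size_t ex_bits            = 0;
};
```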
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cuh` around lines 163 - 166, The IVFGPU(raft::resources const& handle) constructor leaves scalar members num_vectors, num_dimensions, num_padded_dim, num_centroids, max_cluster_length, and ex_bits uninitialized and eagerly constructs Rota with a hard-coded 128; fix by initializing those scalars to safe defaults (e.g. 0) either via default member initializers where they are declared or by initializing them in the IVFGPU constructor initializer list, and avoid wasteful RotatorGPU construction by initializing Rota to nullptr or deferring its construction until load_transposed replaces it (or construct with the correct parameter), ensuring consistency with load_transposed and initializer usage.
🟡 Minor comments (12)
cpp/src/neighbors/ivf_rabitq/utils/StopW.hpp-18-40 (1)
18-40: ⚠️ Potential issue | 🟡 Minor — Use floating-point durations to preserve elapsed time precision.

All four methods cast to integral duration types before calling `.count()`, which truncates fractional time. For example, `duration_cast<seconds>` drops sub-second precision, causing short phases to report `0.0` seconds. Use floating-point duration types directly.

Proposed fix
- float getElapsedTimeSec() + float getElapsedTimeSec() const { - std::chrono::steady_clock::time_point time_end = std::chrono::steady_clock::now(); - return (std::chrono::duration_cast<std::chrono::seconds>(time_end - time_begin).count()); + return std::chrono::duration<float>(std::chrono::steady_clock::now() - time_begin).count(); } - float getElapsedTimeMili() + float getElapsedTimeMili() const { - std::chrono::steady_clock::time_point time_end = std::chrono::steady_clock::now(); - return (std::chrono::duration_cast<std::chrono::milliseconds>(time_end - time_begin).count()); + return std::chrono::duration<float, std::milli>(std::chrono::steady_clock::now() - time_begin) + .count(); } - float getElapsedTimeMicro() + float getElapsedTimeMicro() const { - std::chrono::steady_clock::time_point time_end = std::chrono::steady_clock::now(); - return (std::chrono::duration_cast<std::chrono::microseconds>(time_end - time_begin).count()); + return std::chrono::duration<float, std::micro>(std::chrono::steady_clock::now() - time_begin) + .count(); } - float getElapsedTimeNano() + float getElapsedTimeNano() const { - std::chrono::steady_clock::time_point time_end = std::chrono::steady_clock::now(); - return (std::chrono::duration_cast<std::chrono::nanoseconds>(time_end - time_begin).count()); + return std::chrono::duration<float, std::nano>(std::chrono::steady_clock::now() - time_begin) + .count(); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/utils/StopW.hpp` around lines 18 - 40, The current getElapsedTimeSec/getElapsedTimeMili/getElapsedTimeMicro/getElapsedTimeNano functions use duration_cast to integral duration types which truncates fractions; change each to compute elapsed = std::chrono::steady_clock::now() - time_begin and return elapsed.count() from a floating-point chrono::duration (e.g., std::chrono::duration<float> for seconds, std::chrono::duration<float, std::milli> for milliseconds, std::chrono::duration<float, std::micro> for microseconds, and std::chrono::duration<float, std::nano> for nanoseconds) so the functions return fractional values instead of being truncated to integers.cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h-122-126 (1)
122-126: ⚠️ Potential issue | 🟡 Minor — Replace the release-stripped `assert` with runtime validation.

`n_probes` comes from benchmark config, and `assert` is compiled out under `NDEBUG`; invalid configs can then reach the search path unchecked.

Proposed fix
- assert(search_params_.n_probes <= index_params_.n_lists); + if (search_params_.n_probes > index_params_.n_lists) { + throw std::runtime_error("ivf_rabitq n_probes must be <= n_lists"); + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h` around lines 122 - 126, Replace the release-only assert with a runtime validation: after assigning search_params_ and refine_ratio_ check if search_params_.n_probes > index_params_.n_lists and if so throw a std::invalid_argument (or another appropriate exception) with a clear message including the offending values (e.g., mention search_params_.n_probes and index_params_.n_lists); update the block that currently uses assert(...) to this if-check so invalid benchmark configs are rejected at runtime (refer to the variables search_params_, index_params_.n_lists, and refine_ratio_ to locate the code).cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu-459-460 (1)
459-460: ⚠️ Potential issue | 🟡 Minor — Remove redundant `uint32_t*` cast in `queue.store()` calls.

The explicit `(uint32_t*)` cast is unnecessary since `params.d_topk_pids` is already `PID*`, which is defined as `uint32_t*`. Removing the cast improves code clarity without affecting functionality.

Proposed fix
- queue.store(params.d_topk_dists + output_offset, - (uint32_t*)(params.d_topk_pids + output_offset)); + queue.store(params.d_topk_dists + output_offset, + params.d_topk_pids + output_offset);Also applies to: 787-788
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu` around lines 459 - 460, The calls to queue.store currently include an unnecessary cast (uint32_t*) on params.d_topk_pids; remove the redundant cast so the call becomes queue.store(params.d_topk_dists + output_offset, params.d_topk_pids + output_offset). Update both occurrences (the one around queue.store(...) at the shown location and the second occurrence around lines ~787-788) to drop the explicit (uint32_t*) cast; retain all other arguments and behavior unchanged and rely on PID being typedef'd to uint32_t*.cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu-37-75 (1)
37-75: ⚠️ Potential issue | 🟡 Minor — Silent UB when `D` is not a multiple of 4.

`ComputeDistancesKernelWarp` reinterprets centroid/query pointers as `float4*` and iterates over `D_vec = D/4`, but the host call site uses integer division `D / 4` and passes without checking. If `D` is ever not a multiple of 4, the last few scalar dimensions are silently dropped (and the `float4` cast reads past the valid range). In the current pipeline `D = num_padded_dim` is a multiple of 64, but nothing in this TU enforces that — add a defensive `RAFT_EXPECTS(D % 4 == 0, ...)` in `ComputeCentroidsDistances` (and/or at construction) to fail loudly if a future caller violates the assumption.

Also applies to: 115-134
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu` around lines 37 - 75, The kernel ComputeDistancesKernelWarp assumes vectors are packed as float4 (D_vec = D/4) which causes silent UB if D is not a multiple of 4; add a defensive runtime check using RAFT_EXPECTS(D % 4 == 0, "D must be multiple of 4") in the host-side function ComputeCentroidsDistances (and/or the class/constructor that sets num_padded_dim) before computing/passing D_vec, and add the same guard where similar kernels are invoked (the other occurrence around lines 115-134) so any future caller fails loudly instead of reading/padding past valid memory.cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu-560-594 (1)
560-594: ⚠️ Potential issue | 🟡 Minor — Threshold tightening loop skips legitimate zero distances.

In both block-sort variants, the "find max topk distance" scan filters with `dist > 0 && ... && dist < INFINITY`. Valid query-vector pairs can yield `dist == 0.0f` (e.g., identical vectors, or after numerical cancellation). When every top-k distance is 0, `max_topk_dist` remains `-INFINITY` and the threshold is never updated — which is safe but wastes the tightening opportunity; when only some slots are 0, they are excluded from the max, which is still correct but inconsistent. Consider `dist >= 0 && dist < INFINITY` (or just `dist < INFINITY`, since the init value is `INFINITY`) to let zero distances participate.

Also applies to: 891-924
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu` around lines 560 - 594, The max-top-k scan currently excludes zero distances via `if (dist > 0 && ...)` so zero-valued valid distances are ignored; update the condition in the scan that computes max_topk_dist (the loop that reads params.d_topk_dists at output_offset using query_idx, probe_slot and params.topk) to include zeros (e.g., `if (dist >= 0 && dist < INFINITY)`) or simply `if (dist < INFINITY)`) and apply the same change in the other block-sort variant (the similar loop around lines 891-924) so atomic threshold tightening uses zero distances too before performing the `atomicMin` on threshold_ptr.cpp/include/cuvs/neighbors/ivf_rabitq.hpp-91-97 (1)
91-97: ⚠️ Potential issue | 🟡 Minor — Document `search_mode` enumerators.

Each enumerator (`LUT16`, `LUT32`, `QUANT4`, `QUANT8`) affects recall/throughput/memory trade-offs materially, but the only inline doc is the group comment "A type for specifying the mode...". Please add per-value Doxygen explaining what each mode does and when to pick it; users will otherwise have to read the `.cu` implementation to choose a mode.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/include/cuvs/neighbors/ivf_rabitq.hpp` around lines 91 - 97, Add Doxygen comments for each enumerator of enum class search_mode (LUT16, LUT32, QUANT4, QUANT8) describing what the mode does, its impact on recall/throughput/memory, and guidance on when to choose it (e.g., LUT16: smaller LUT, fastest lookup, lower recall; LUT32: larger LUT, higher recall but more memory; QUANT4/QUANT8: quantized ADC trade-offs with QUANT4 being lowest memory/best speed and QUANT8 higher accuracy). Place the /// or /** */ comments directly above each enumerator so the public API header documents per-value behavior for consumers.cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cuh-65-91 (1)
65-91: ⚠️ Potential issue | 🟡 Minor — Doxygen for `SearchClusterQueryPairs` is out of sync with the signature.

The `@param` list references `all_topk_results`, `h_query`, `d_centroid`, and `stream`, none of which are parameters of the declared function. Conversely, `topk`, `d_final_dists`, and `d_final_pids` are undocumented. Please bring the doc comment in line with the actual signature (and apply the same check to `SearchClusterQueryPairsSharedMemOpt` / `SearchClusterQueryPairsQuantizeQuery`, which have no Doxygen at all).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cuh` around lines 65 - 91, The Doxygen block above SearchClusterQueryPairs is stale: remove references to non-existent params (all_topk_results, h_query, d_centroid, stream) and add documentation for the actual parameters topk, d_final_dists, and d_final_pids, describing their purpose and types to match the signature of SearchClusterQueryPairs(const IVFGPU& cur_ivf, IVFGPU::GPUClusterMeta* d_cluster_meta, ClusterQueryPair* d_sorted_pairs, size_t num_queries, const float* d_query, const float* d_G_k1xSumq, const float* d_G_kbxSumq, size_t nprobe, size_t topk, float* d_final_dists, PID* d_final_pids); also add or update matching Doxygen for the related functions SearchClusterQueryPairsSharedMemOpt and SearchClusterQueryPairsQuantizeQuery so their comments reflect their exact parameter lists.cpp/tests/neighbors/ann_ivf_rabitq.cuh-370-391 (1)
370-391: ⚠️ Potential issue | 🟡 Minor — Tests never exercise `build_only()` — only the post-serialize path is validated.

The comment on lines 372-373 states that deserialization is required to reorganize data for correct search. As a result, every `TEST_P` macro (build_serialize_search, build_host_input_serialize_search, build_forced_streaming) runs through serialize→deserialize before calling `search`. There is no test that confirms `build_only()` (the path users call when they don't persist the index) produces correct results — which means a regression that broke direct-build search would not be caught here.

If the library is expected to require a roundtrip for correctness, that should be enforced at the API level (e.g., `search` refuses to run on a non-finalized index) rather than documented only in a test comment. Otherwise, please add a `TEST_BUILD_SEARCH` variant that calls `build_only()` and asserts recall, so the no-persistence workflow has regression coverage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/neighbors/ann_ivf_rabitq.cuh` around lines 370 - 391, Add a test variant that exercises the direct-build path instead of always going through serialize/deserialize: create a TEST_BUILD_SEARCH(type) macro (mirroring TEST_BUILD_SERIALIZE_SEARCH) whose TEST_P registers build_search and inside calls this->run([this]() { return this->build_only(); }); then ensure the resulting test calls search/asserts recall the same way the other tests do; reference the existing TEST_P macros and the build_only(), build_serialize(), build_host_input_serialize(), build_with_forced_streaming(), and search() flow so the new macro is wired into the same test fixture and assertions.cpp/include/cuvs/neighbors/ivf_rabitq.hpp-293-293 (1)
293-293: ⚠️ Potential issue | 🟡 Minor — Doxygen typo: "IVF-PQ" should be "IVF-RaBitQ".

Copy-paste from the IVF-PQ docs. Also consider clarifying on the `index` parameter that `*index` must be a non-null, default-constructed `ivf_rabitq::index<int64_t>` (the implementation `RAFT_FAIL`s on `nullptr` but does not document this precondition).

📝 Proposed fix
indexparameter that*indexmust be a non-null, default-constructedivf_rabitq::index<int64_t>(the implementationRAFT_FAILs onnullptrbut does not document this precondition).📝 Proposed fix
- * `@param`[out] index IVF-PQ index + * `@param`[out] index IVF-RaBitQ index (must be non-null)As per coding guidelines: "For public C++ API headers, additionally check: Doxygen documentation for all public functions/classes".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/include/cuvs/neighbors/ivf_rabitq.hpp` at line 293, Update the Doxygen param description to correct the algorithm name from "IVF-PQ" to "IVF-RaBitQ" and clarify the precondition for the parameter `index`: document that the caller must pass a non-null pointer and that `*index` must be a default-constructed ivf_rabitq::index<int64_t> (the implementation currently RAFT_FAILs on nullptr). Locate the doc block that currently reads "@param[out] index IVF-PQ index" and replace it with a concise note stating "IVF-RaBitQ index; must be non-null and point to a default-constructed ivf_rabitq::index<int64_t>" so callers know the required precondition.cpp/src/neighbors/ivf_rabitq.cu-222-257 (1)
222-257: ⚠️ Potential issue | 🟡 Minor — Add a fall-through guard for unknown `search_mode` in the dispatch chain.

The if/else-if ladder has no terminal `else` — if `search_mode` ever gains a new enumerator, `search` will silently produce `final_ids` of uninitialized values (since `make_device_vector` at line 220 does not zero-initialize), and the subsequent `raft::linalg::map` at 260-264 will emit garbage neighbor IDs without any diagnostic. The `search_mode_to_string` lambda already covers the new-enumerator case with `RAFT_FAIL`, but it runs earlier (for SearcherGPU construction), so for readers of this function the invariant is non-local. A final `else { RAFT_FAIL(...); }` makes the dispatch self-checking.

🛡️ Proposed guard
} else if (params.mode == search_mode::QUANT4) { idx.rabitq_index().BatchClusterSearchQuantizeQuery(..., 4); + } else { + RAFT_FAIL("Invalid search mode"); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq.cu` around lines 222 - 257, The dispatch over params.mode lacks a terminal guard: after the existing branches handling search_mode::LUT32, LUT16, QUANT8, and QUANT4, add a final else that calls RAFT_FAIL (or similar) to fail fast for unknown search_mode values (use search_mode_to_string(params.mode) if available for a readable message); this ensures BatchClusterSearch* branches cannot silently leave final_ids uninitialized.cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu-947-952 (1)
947-952: ⚠️ Potential issue | 🟡 Minor — Use `0.0f` (or remove) the LUT pre-fill instead of `-infinity`.

Filling the LUT buffer with `-std::numeric_limits<float>::infinity()` is risky: if `precomputeAllLUTs` ever leaves any entry unwritten (e.g., due to a mismatch between host-side `cur_ivf.get_num_padded_dim()` and the searcher's `D`, kernel launch failure, or future refactor), those entries silently leak `-inf` into the IP accumulation and produce corrupted distances without any indication. Since the kernel currently writes every entry covered by the searcher's `D`, either drop the fill entirely or, if kept defensively, use `0.0f` which degrades gracefully.

🔒️ Proposed change
rmm::device_uvector<float> d_lut_for_queries(lut_size / sizeof(float), stream_); - thrust::fill(thrust::cuda::par.on(stream_), - d_lut_for_queries.data(), - d_lut_for_queries.data() + d_lut_for_queries.size(), - -std::numeric_limits<float>::infinity()); // precompute LUTS launchPrecomputeLUTs(d_query, d_lut_for_queries.data(), num_queries, D, stream_);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu` around lines 947 - 952, The LUT buffer is being pre-filled with -std::numeric_limits<float>::infinity() which can silently corrupt inner-product accumulations if any entry remains unwritten; change the pre-fill for d_lut_for_queries (used before launchPrecomputeLUTs) to use 0.0f instead (or remove the thrust::fill call entirely) so unwritten entries degrade gracefully; update the thrust::fill target d_lut_for_queries.data() to use 0.0f and keep launchPrecomputeLUTs(d_query, d_lut_for_queries.data(), num_queries, D, stream_) as-is.cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu-528-540 (1)
528-540: ⚠️ Potential issue | 🟡 Minor — Remove the commented debug line.

Line 530 contains a stale debug artifact (`// ex_dist = ex_dist+1;`) that should be deleted. The all-threads invariant for `queue.add()` is correctly implemented: the else branch feeds dummy values (INFINITY/0) to ensure every thread calls `queue.add()` exactly once per round, which matches the block-sort requirement. The comment at line 539 already affirms this invariant.
// ex_dist = ex_dist+1;) that should be deleted. The all-threads invariant forqueue.add()is correctly implemented: the else branch feeds dummy values (INFINITY/0) to ensure every thread callsqueue.add()exactly once per round, which matches the block-sort requirement. The comment at line 539 already affirms this invariant.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu` around lines 528 - 540, Remove the stale debug comment line that reads "// ex_dist = ex_dist+1;" located near the calculation of ex_dist and pid; this is a leftover artifact and should be deleted so the code only contains the real assignment to ex_dist, the pid assignment from params.d_pids[global_vec_idx], and the subsequent queue.add(ex_dist, pid) call (ensure references to ex_dist, pid and queue.add remain unchanged).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 3c486d41-58fb-412e-9cf6-6c82835e1bbf
📒 Files selected for processing (36)
- cpp/CMakeLists.txt
- cpp/bench/ann/CMakeLists.txt
- cpp/bench/ann/src/cuvs/cuvs_ann_bench_param_parser.h
- cpp/bench/ann/src/cuvs/cuvs_benchmark.cu
- cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq.cu
- cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h
- cpp/bench/ann/src/faiss/faiss_cpu_benchmark.cpp
- cpp/bench/ann/src/faiss/faiss_cpu_wrapper.h
- cpp/include/cuvs/neighbors/ivf_rabitq.hpp
- cpp/src/neighbors/ivf_rabitq.cu
- cpp/src/neighbors/ivf_rabitq/defines.hpp
- cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu
- cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cuh
- cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu
- cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cuh
- cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cu
- cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cuh
- cpp/src/neighbors/ivf_rabitq/gpu_index/rotator_gpu.cu
- cpp/src/neighbors/ivf_rabitq/gpu_index/rotator_gpu.cuh
- cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu
- cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cuh
- cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_common.cuh
- cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu
- cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu
- cpp/src/neighbors/ivf_rabitq/utils/IO.hpp
- cpp/src/neighbors/ivf_rabitq/utils/StopW.hpp
- cpp/src/neighbors/ivf_rabitq/utils/memory.hpp
- cpp/src/neighbors/ivf_rabitq/utils/searcher_gpu_utils.cu
- cpp/src/neighbors/ivf_rabitq/utils/searcher_gpu_utils.hpp
- cpp/src/neighbors/ivf_rabitq/utils/space.hpp
- cpp/src/neighbors/ivf_rabitq/utils/tools.hpp
- cpp/tests/CMakeLists.txt
- cpp/tests/neighbors/ann_ivf_rabitq.cuh
- cpp/tests/neighbors/ann_ivf_rabitq/test_float_int64_t.cu
- python/cuvs/cuvs/tests/test_compute_matrix_product.py
- python/cuvs_bench/cuvs_bench/config/algorithms.yaml
```cpp
// Replace the second block with:
{
  // Sort distances and candidate IDs by distance using CUB

  // Determine temporary device storage requirements
  size_t temp_storage_bytes = 0;
  cub::DeviceRadixSort::SortPairs(nullptr,
                                  temp_storage_bytes,
                                  d_distances.data(),
                                  d_distances.data(),  // keys (in-place sort)
                                  d_candidate_ids.data(),
                                  d_candidate_ids.data(),  // values (in-place sort)
                                  K,
                                  0,
                                  32,
                                  stream_);

  rmm::device_buffer d_temp_storage(temp_storage_bytes, stream_);

  // Run sorting operation
  cub::DeviceRadixSort::SortPairs(d_temp_storage.data(),
                                  temp_storage_bytes,
                                  d_distances.data(),
                                  d_distances.data(),  // keys (in-place sort)
                                  d_candidate_ids.data(),
                                  d_candidate_ids.data(),  // values (in-place sort)
                                  K,
                                  0,
                                  32,
                                  stream_);
}
```
🧩 Analysis chain
🌐 Web query:
Does cub::DeviceRadixSort::SortPairs support in-place sorting where d_keys_in and d_keys_out point to the same buffer?
💡 Result:
No, cub::DeviceRadixSort::SortPairs does not support in-place sorting where d_keys_in and d_keys_out point to the same buffer. The official documentation explicitly states: "In-place operations are not supported. There must be no overlap between any of the provided ranges: [d_keys_in, d_keys_in + num_items) and [d_keys_out, d_keys_out + num_items)." For in-place-like sorting with reduced temporary storage (~N auxiliary storage), use the overload with DoubleBuffer<KeyT> &d_keys and DoubleBuffer<ValueT> &d_values, which requires two separate buffers for keys (and values) that can be ping-ponged during the multi-pass radix sort.
Undefined behavior: cub::DeviceRadixSort::SortPairs does not support in-place (aliased) buffers.
Both calls pass d_distances.data() as both d_keys_in and d_keys_out, and d_candidate_ids.data() as both d_values_in and d_values_out. Per the CCCL documentation, "In-place operations are not supported. There must be no overlap between any of the provided ranges: [d_keys_in, d_keys_in + num_items) and [d_keys_out, d_keys_out + num_items)." Aliasing input and output produces undefined behavior — results may be silently wrong on some architectures, CUB versions, or under memory-pressure conditions, even if it appears to work today.
Either:
- Allocate distinct output buffers and swap, or
- Switch to the `cub::DoubleBuffer`-based overload, which is explicitly designed for ping-pong sorting (a sketch follows this list), or
- Use `thrust::sort_by_key` (which handles the buffering internally).
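A minimal sketch of the `cub::DoubleBuffer` route (the second option above), reusing the names from the snippet under review (`d_distances`, `d_candidate_ids`, `PID`, `K`, `stream_`) and the same includes as that file; the alternate buffers and the final copy-out are assumptions, not a drop-in patch:

```cpp
// Two alternate buffers back the ping-pong sort; CUB decides which one holds the result.
rmm::device_uvector<float> d_distances_alt(K, stream_);
rmm::device_uvector<PID>   d_candidate_ids_alt(K, stream_);

cub::DoubleBuffer<float> keys(d_distances.data(), d_distances_alt.data());
cub::DoubleBuffer<PID>   values(d_candidate_ids.data(), d_candidate_ids_alt.data());

size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortPairs(nullptr, temp_storage_bytes, keys, values, K, 0, 32, stream_);
rmm::device_buffer d_temp_storage(temp_storage_bytes, stream_);
cub::DeviceRadixSort::SortPairs(
  d_temp_storage.data(), temp_storage_bytes, keys, values, K, 0, 32, stream_);

// keys.Current()/values.Current() now point at the sorted data, which may live in either
// underlying buffer; copy or swap from there before the *_alt vectors go out of scope.
const float* sorted_dists = keys.Current();
const PID*   sorted_pids  = values.Current();
```

Per the web-query result above, this overload is what the CUB docs recommend for in-place-style sorts, since it needs only ~N elements of auxiliary storage instead of a second full output copy per array.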
🐛 Proposed fix using separate output buffers
+ rmm::device_uvector<float> d_distances_sorted(K, stream_);
+ rmm::device_uvector<PID> d_candidate_ids_sorted(K, stream_);
+
// Determine temporary device storage requirements
size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortPairs(nullptr,
temp_storage_bytes,
d_distances.data(),
- d_distances.data(), // keys (in-place sort)
+ d_distances_sorted.data(),
d_candidate_ids.data(),
- d_candidate_ids.data(), // values (in-place sort)
+ d_candidate_ids_sorted.data(),
K,
0,
32,
stream_);
rmm::device_buffer d_temp_storage(temp_storage_bytes, stream_);
// Run sorting operation
cub::DeviceRadixSort::SortPairs(d_temp_storage.data(),
temp_storage_bytes,
d_distances.data(),
- d_distances.data(), // keys (in-place sort)
+ d_distances_sorted.data(),
d_candidate_ids.data(),
- d_candidate_ids.data(), // values (in-place sort)
+ d_candidate_ids_sorted.data(),
K,
0,
32,
stream_);
+ // swap/use the sorted buffers downstream
+ d_distances.swap(d_distances_sorted);
+ d_candidate_ids.swap(d_candidate_ids_sorted);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
// Replace the second block with:
{
  // Sort distances and candidate IDs by distance using CUB
  rmm::device_uvector<float> d_distances_sorted(K, stream_);
  rmm::device_uvector<PID> d_candidate_ids_sorted(K, stream_);

  // Determine temporary device storage requirements
  size_t temp_storage_bytes = 0;
  cub::DeviceRadixSort::SortPairs(nullptr,
                                  temp_storage_bytes,
                                  d_distances.data(),
                                  d_distances_sorted.data(),
                                  d_candidate_ids.data(),
                                  d_candidate_ids_sorted.data(),
                                  K,
                                  0,
                                  32,
                                  stream_);

  rmm::device_buffer d_temp_storage(temp_storage_bytes, stream_);

  // Run sorting operation
  cub::DeviceRadixSort::SortPairs(d_temp_storage.data(),
                                  temp_storage_bytes,
                                  d_distances.data(),
                                  d_distances_sorted.data(),
                                  d_candidate_ids.data(),
                                  d_candidate_ids_sorted.data(),
                                  K,
                                  0,
                                  32,
                                  stream_);

  // swap/use the sorted buffers downstream
  d_distances.swap(d_distances_sorted);
  d_candidate_ids.swap(d_candidate_ids_sorted);
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu` around lines 150 -
180, The calls to cub::DeviceRadixSort::SortPairs use the same device pointers
for input and output (d_distances and d_candidate_ids), which is undefined
because SortPairs does not support in-place/aliased buffers; fix by providing
non-overlapping output buffers or using CUB's DoubleBuffer overload: allocate
separate device buffers (e.g., d_distances_out and d_candidate_ids_out) of size
K, pass those as the output pointers to the SortPairs temporary-storage and run
calls, then swap or copy results back into d_distances/d_candidate_ids as
needed; alternatively replace the two-call pattern with the cub::DeviceRadixSort
overload that accepts cub::DoubleBuffer for keys and values and use that to
perform ping-pong sorting without aliasing.
auto read_into_device_host = [&](void* d_ptr, void* h_ptr, size_t n_bytes) {
  std::vector<std::uint8_t> h_buf(n_bytes);  // host staging buffer
  auto before = input.tellg();
  input.read(reinterpret_cast<char*>(h_buf.data()), n_bytes);
  auto got = static_cast<size_t>(input.gcount());
  if (got != n_bytes) {
    std::ostringstream oss;
    oss << "unexpected EOF: wanted " << n_bytes << " bytes at offset " << before << ", got "
        << got << (input.eof() ? " (hit EOF)" : "") << (input.bad() ? " (I/O error)" : "");
  }

  raft::copy(static_cast<uint8_t*>(d_ptr), h_buf.data(), n_bytes, stream_);
  raft::resource::sync_stream(handle_);
  memcpy(h_ptr, h_buf.data(), n_bytes);
};
Silent short-read in read_into_device_host.
On an EOF/partial read, the error message is composed into a local std::ostringstream and then dropped on the floor — no throw, no log, no return. The lambda proceeds to raft::copy and memcpy of garbage host-staging data and the caller sees a corrupted index with no indication of failure. Compare with read_into_device and read_into_device_host_transposed_short at lines 129 and 157, which both throw.
🐛 Proposed fix
if (got != n_bytes) {
std::ostringstream oss;
oss << "unexpected EOF: wanted " << n_bytes << " bytes at offset " << before << ", got "
<< got << (input.eof() ? " (hit EOF)" : "") << (input.bad() ? " (I/O error)" : "");
+ throw std::runtime_error(oss.str());
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu` around lines 136 - 150,
The lambda read_into_device_host silently ignores partial/EOF reads: build the
error string exactly as done now but then throw an exception (e.g., throw
std::runtime_error(oss.str())) instead of dropping the ostringstream, and return
early so the subsequent raft::copy and memcpy are not executed on truncated
data; mirror the behavior used by read_into_device and
read_into_device_host_transposed_short to ensure callers receive an error on
short reads.
auto batch_pids = raft::make_host_vector<PID, int64_t>(batch_size_vectors);
auto batch_data = raft::make_host_vector<float, int64_t>(batch_size_vectors * num_dimensions);
auto d_batch_data =
  raft::make_device_vector<float, int64_t>(handle_, batch_size_vectors * num_dimensions);

size_t batch_count = 0;

// Calculate number of threads for host data gathering
const size_t num_threads = omp_get_max_threads();

size_t cluster_idx = 0;

while (cluster_idx < num_centroids) {
  batch_count++;
  // Determine batch: complete clusters only
  size_t batch_start_cluster = cluster_idx;
  size_t batch_vectors       = 0;
  size_t batch_start_offset  = h_cluster_meta[cluster_idx].start_index;

  while (cluster_idx < num_centroids &&
         batch_vectors + h_cluster_meta[cluster_idx].num <= batch_size_vectors) {
    batch_vectors += h_cluster_meta[cluster_idx].num;
    cluster_idx++;
  }

  // Handle case where single cluster exceeds batch size
  if (batch_vectors == 0 && cluster_idx < num_centroids) {
    batch_vectors = h_cluster_meta[cluster_idx].num;
    cluster_idx++;
  }

  if (batch_vectors == 0) break;  // No more clusters to process

  // -------------------------
  // 11. Gather and transfer batch data to GPU using reordered PIDs
  // -------------------------

  // First, copy the reordered PIDs for this batch to host (reuse preallocated buffer)
  raft::copy(batch_pids.data_handle(), d_flat_pids + batch_start_offset, batch_vectors, stream_);
  raft::resource::sync_stream(handle_);

  // Gather data on host using the reordered PIDs (reuse preallocated buffer)
  // Use OpenMP parallel region to leverage persistent thread pool across batches
#pragma omp parallel num_threads(num_threads)
  {
    int tid           = omp_get_thread_num();
    size_t chunk_size = batch_vectors / num_threads;
    size_t start      = tid * chunk_size;
    size_t end        = (tid == num_threads - 1) ? batch_vectors : start + chunk_size;

    for (size_t i = start; i < end; ++i) {
      std::memcpy(&batch_data.data_handle()[i * num_dimensions],
                  &host_data[batch_pids.data_handle()[i] * num_dimensions],
                  num_dimensions * sizeof(float));
    }
  }
  // OpenMP implicit barrier ensures all threads complete before continuing

  // Transfer batch to GPU (reuse preallocated buffer)
  raft::copy(d_batch_data.data_handle(),
             batch_data.data_handle(),
             batch_vectors * num_dimensions,
             stream_);
Out-of-bounds write when a single cluster exceeds batch_size_vectors.
batch_pids, batch_data, and d_batch_data are preallocated with capacity batch_size_vectors (and batch_size_vectors * num_dimensions for the data buffers). The cluster-packing loop at lines 747-757 has an explicit "single cluster exceeds batch size" fallback that sets batch_vectors = h_cluster_meta[cluster_idx].num, which may be larger than batch_size_vectors. The subsequent raft::copy (line 766), the OpenMP memcpy loop (lines 778-782), and the device copy (lines 787-790) will then write past the end of the preallocated buffers, resulting in a host heap overflow and a device OOB write.
Either grow the buffers on demand when an oversize cluster is encountered, or split that cluster across multiple batch iterations. If the contract is "caller must set batch_size_vectors ≥ max cluster size", that invariant needs to be validated explicitly (and max_cluster_length is already known on line 714).
🛡️ Proposed minimum safeguard
size_t cluster_idx = 0;
+ RAFT_EXPECTS(max_cluster_length <= batch_size_vectors,
+ "construct_on_gpu_streaming: batch_size_vectors (%zu) must be >= max cluster size (%zu)",
+ batch_size_vectors, max_cluster_length);
+
  while (cluster_idx < num_centroids) {
A more robust fix grows batch_pids / batch_data / d_batch_data to max(batch_size_vectors, max_cluster_length) up front, or reallocates lazily inside the fallback branch.
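For illustration, a sketch of the up-front sizing variant, reusing the names from the snippet above and assuming max_cluster_length is computed earlier as noted (not a committed change):

// Size the staging buffers for the worst case so an oversize cluster cannot overflow them.
// Requires <algorithm> for std::max.
const size_t staging_vectors = std::max<size_t>(batch_size_vectors, max_cluster_length);

auto batch_pids = raft::make_host_vector<PID, int64_t>(staging_vectors);
auto batch_data = raft::make_host_vector<float, int64_t>(staging_vectors * num_dimensions);
auto d_batch_data =
  raft::make_device_vector<float, int64_t>(handle_, staging_vectors * num_dimensions);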
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu` around lines 728 - 790,
The code allows a single cluster to set batch_vectors > batch_size_vectors
causing out-of-bounds writes into batch_pids, batch_data and d_batch_data; fix
by ensuring the host/device buffers are sized to at least
max(batch_size_vectors, max_cluster_length) (or grow/reallocate them lazily when
the fallback branch that sets batch_vectors = h_cluster_meta[cluster_idx].num
runs) before calling raft::copy or performing the OpenMP memcpy and the device
copy; update allocation of raft::make_host_vector for batch_pids and batch_data
and raft::make_device_vector for d_batch_data (or add a resize branch
referencing batch_size_vectors, max_cluster_length, batch_pids, batch_data,
d_batch_data) so no raft::copy or memcpy writes past the buffers.
if (is_candidate) {
  int candidate_slot = atomicAdd(&num_candidates, 1);
  if (candidate_slot < params.max_candidates_per_pair) {
    shared_candidate_ips[candidate_slot]     = local_ip_quantized;
    shared_candidate_indices[candidate_slot] = vec_idx;
  }
Clamp num_candidates to the number actually stored.
All block-sort variants increment num_candidates even when candidate_slot exceeds params.max_candidates_per_pair. Later loops index shared_candidate_indices[cand_idx] up to num_candidates, so overflowed candidates read unwritten shared memory.
Proposed fix pattern
__syncthreads();
+ const int stored_candidates = min(num_candidates, params.max_candidates_per_pair);
- if (num_candidates > 0) {
+ if (stored_candidates > 0) {
...
- const int candidates_per_thread = (num_candidates + num_threads - 1) / num_threads;
+ const int candidates_per_thread = (stored_candidates + num_threads - 1) / num_threads;
...
- if (cand_idx < num_candidates && cand_idx < params.max_candidates_per_pair) {
+  if (cand_idx < stored_candidates) {
Use stored_candidates consistently for the IP2 loops, queue rounds, and threshold updates.
Also applies to: 438-439, 471-479, 664-669, 691-699, 895-900, 982-983, 1015-1023, 1204-1209, 1231-1239
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu` around
lines 353 - 358, The code increments num_candidates via atomicAdd even when
candidate_slot >= params.max_candidates_per_pair, causing later loops to read
unwritten shared_candidate_indices; clamp num_candidates to the actual number
written (e.g., compute and use stored_candidates = min(num_candidates,
params.max_candidates_per_pair)) and replace uses of num_candidates in
subsequent IP2 loops, queue round iterations, and threshold-update logic with
stored_candidates so indexing into shared_candidate_indices/shared_candidate_ips
only goes up to the stored count; update all occurrences mentioned (the blocks
around the atomicAdd, the IP2 loops, queue rounds, and threshold update
sections) to use the stored_candidates variable consistently.
queue.store(params.d_topk_dists + output_offset,
            (uint32_t*)(params.d_topk_pids + output_offset));
Remove unsafe type casts in queue.store() calls that force PID pointers through uint32_t*.
The casts (uint32_t*) in queue.store() calls hide type mismatches when IdxT is templated as int64_t. With d_topk_pids allocated as raft::make_device_vector<PID, int64_t> where PID is typedef'd as uint32_t, these explicit casts bypass type safety. Removing the casts exposes type consistency issues at compile time rather than risking silent data corruption at runtime.
Affected locations
Lines 521, 755, 1065, 1295 in cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu
Similar issues exist in:
- cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu (lines 460, 788)
- cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu (lines 558, 888)
Proposed fix
- queue.store(params.d_topk_dists + output_offset,
- (uint32_t*)(params.d_topk_pids + output_offset));
+ queue.store(params.d_topk_dists + output_offset,
+              params.d_topk_pids + output_offset);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu` around
lines 520 - 521, Remove the unsafe (uint32_t*) casts in the queue.store(...)
calls and make the pointer types consistent: pass params.d_topk_pids +
output_offset (no cast) and update queue.store's signature/overloads or the
PID/IdxT typedefs so the stored pointer type matches the templated index type
(IdxT) used by the queue; ensure PID (the element type of params.d_topk_pids)
and the queue.store parameter use the same type (e.g., PID* or IdxT*) to expose
mismatches at compile time and prevent silent corruption—apply the same change
to all queue.store occurrences referenced (e.g., in
searcher_gpu_quantize_query.cu, searcher_gpu_shared_mem_opt.cu,
searcher_gpu.cu).
if (is_candidate) {
  int candidate_slot = atomicAdd(&num_candidates, 1);
  if (candidate_slot < params.max_candidates_per_pair) {
    shared_candidate_ips[candidate_slot]     = local_ip;
    shared_candidate_indices[candidate_slot] = vec_idx;
  }
}
Clamp candidate count before reading shared candidate arrays.
num_candidates keeps increasing after candidate_slot >= params.max_candidates_per_pair, but later loops iterate up to num_candidates and read entries that were never written. A dense cluster or loose threshold can turn this into shared-memory OOB reads and corrupt top-k results.
Proposed fix pattern
__syncthreads();
+ const int stored_candidates = min(num_candidates, params.max_candidates_per_pair);
- if (num_candidates > 0) {
+ if (stored_candidates > 0) {
...
- for (int cand_idx = warp_id; cand_idx < num_candidates; cand_idx += num_warps) {
+ for (int cand_idx = warp_id; cand_idx < stored_candidates; cand_idx += num_warps) {
...
}
...
- const int adds_per_thread = (num_candidates + num_threads - 1) / num_threads;
+  const int adds_per_thread = (stored_candidates + num_threads - 1) / num_threads;
Apply the same cap to the NoEX block-sort path and threshold-update checks.
Also applies to: 378-379, 411-419, 735-742, 758-766
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu` around
lines 332 - 338, The bug is that num_candidates can grow past
params.max_candidates_per_pair so later loops read unwritten
shared_candidate_ips/shared_candidate_indices; after the candidate-collection
phase, clamp num_candidates = min(num_candidates,
params.max_candidates_per_pair) (or compute a local bounded_count =
min(num_candidates, params.max_candidates_per_pair)) and use that bounded count
for any subsequent loops or threshold-update checks that iterate over
shared_candidate_ips/shared_candidate_indices; apply the same clamp in the NoEX
block-sort path and any threshold-update code paths that currently iterate up to
num_candidates to avoid shared-memory OOB reads.
template <size_t alignment, class T, bool HUGE_PAGE = false>
inline T* align_mm(size_t nbytes)
{
  size_t size = rd_up_to_multiple_of(nbytes, alignment);
  void* p     = std::aligned_alloc(alignment, size);
  if (HUGE_PAGE) { madvise(p, nbytes, MADV_HUGEPAGE); }
  std::memset(p, 0, size);
  return static_cast<T*>(p);
Add null pointer check after allocation and use correct size for madvise.
std::aligned_alloc returns nullptr on failure; the current code calls madvise and memset on this null pointer, causing a crash. Additionally, madvise should use the rounded allocation size, not the input nbytes, to cover the entire allocated region. The function should also enforce alignment preconditions at compile time and handle the edge case where rounding yields zero bytes.
Proposed fix
+#include <new>
+
namespace memory {
template <size_t alignment, class T, bool HUGE_PAGE = false>
inline T* align_mm(size_t nbytes)
{
+ static_assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
+ static_assert(alignment % alignof(void*) == 0);
+
size_t size = rd_up_to_multiple_of(nbytes, alignment);
+ if (size == 0) { size = alignment; }
void* p = std::aligned_alloc(alignment, size);
- if (HUGE_PAGE) { madvise(p, nbytes, MADV_HUGEPAGE); }
+ if (p == nullptr) { throw std::bad_alloc{}; }
+ if constexpr (HUGE_PAGE) { madvise(p, size, MADV_HUGEPAGE); }
std::memset(p, 0, size);
return static_cast<T*>(p);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/neighbors/ivf_rabitq/utils/memory.hpp` around lines 22 - 29, In
align_mm<size_t alignment, class T, bool HUGE_PAGE> (function align_mm) add a
compile-time check for valid alignment (e.g., static_assert that alignment is a
non-zero power of two and a multiple of alignof(T)), compute size =
rd_up_to_multiple_of(nbytes, alignment) and if size == 0 return nullptr
immediately, call std::aligned_alloc only after those checks, then check the
returned pointer for nullptr before calling madvise or std::memset, and when
calling madvise use the rounded allocation size (size) instead of the original
nbytes; ensure you still cast and return static_cast<T*>(p) only after the
nullptr check.
for (int j = tid; j < D; j += BlockSize) {
  s_xp_norm[j] = d_XP_norm[row * D + j];
}
__syncthreads();
Unnecessary sync: each thread accesses only the data it wrote to shared memory just before, so no thread reads values written by another thread.
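For context, a generic sketch of the distinction this comment relies on; these are placeholder kernels, not the cuVS code:

// Each thread writes a strided subset of shared memory and later reads back exactly
// the same subset: no inter-thread visibility is needed, so the barrier can be dropped.
__global__ void no_sync_needed(const float* g_in, float* g_out, int D)
{
  extern __shared__ float s_buf[];
  for (int j = threadIdx.x; j < D; j += blockDim.x) { s_buf[j] = g_in[j]; }
  for (int j = threadIdx.x; j < D; j += blockDim.x) { g_out[j] = s_buf[j] * s_buf[j]; }
}

// Threads read slots written by other threads (a full scan), so the
// __syncthreads() after the load is required here.
__global__ void sync_needed(const float* g_in, float* g_out, int D)
{
  extern __shared__ float s_buf[];
  for (int j = threadIdx.x; j < D; j += blockDim.x) { s_buf[j] = g_in[j]; }
  __syncthreads();
  float acc = 0.f;
  for (int j = 0; j < D; ++j) { acc += s_buf[j]; }
  if (threadIdx.x == 0) { g_out[blockIdx.x] = acc; }
}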
}

/**
 * @defgroup ivf_rabitq_cpp_index_params IVF-RaBitQ index build parameters
Add all declared groups to the Doxygen documentation.
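A rough sketch of what registering members under a declared group can look like; only the group name comes from the snippet above, the struct and its field are placeholders:

/**
 * @defgroup ivf_rabitq_cpp_index_params IVF-RaBitQ index build parameters
 * @{
 */
struct index_params {  // hypothetical group member, for illustration only
  uint32_t n_lists = 1024;
};
/**
 * @}
 */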
raft::device_matrix<float, int64_t, raft::row_major> centroids_ =
  raft::make_device_matrix<float, int64_t, raft::row_major>(
    handle_, K, D);  // Stored in GPU device memory. Points to the parent centroids' array
It should be fine to remove that default and only keep the constructor initialization of centroids_
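A minimal sketch of the suggested shape, with the enclosing class abridged and its name assumed for illustration:

// Declare the member without a default initializer ...
raft::device_matrix<float, int64_t, raft::row_major> centroids_;

// ... and initialize it exactly once, in the constructor's member-initializer list.
// "initializer_gpu" is a placeholder name for the enclosing class.
initializer_gpu(raft::resources const& handle, int64_t K, int64_t D)
  : centroids_(raft::make_device_matrix<float, int64_t, raft::row_major>(handle, K, D))
{
}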
centroids_(raft::make_device_matrix<float, int64_t, raft::row_major>(handle_, K, D))
{
  dist_func = L2SqrGPU;
  raft::resource::sync_stream(handle_);
// Initialize candidate IDs as a sequence 0,1,...,K-1 using a custom kernel
int block_size = 256;
int grid_size  = (K + block_size - 1) / block_size;
init_sequence_kernel<<<grid_size, block_size, 0, stream_>>>(d_candidate_ids, K);
RAFT_CUDA_TRY(cudaPeekAtLastError());
raft::resource::sync_stream(handle_);  // Wait for kernel completion
Prefer raft::linalg::range
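For illustration, a sketch assuming the raft::linalg::range overload from <raft/linalg/init.cuh>; the exact overload set should be checked against the RAFT version in use:

#include <raft/linalg/init.cuh>

// Fills d_candidate_ids with 0, 1, ..., K-1 on stream_, replacing init_sequence_kernel
// and the manual grid/block computation.
raft::linalg::range(d_candidate_ids, K, stream_);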
    const_cast<float*>(rotation_matrix_.data_handle()), D, D),
  raft::make_device_matrix_view<float, int64_t, raft::col_major>(const_cast<float*>(d_A), D, N),
  raft::make_device_matrix_view<float, int64_t, raft::col_major>(d_RAND_A, D, N));
raft::resource::sync_stream(handle_);
auto rd_up_to_multiple_of = [](uint32_t dim, uint32_t mult) -> size_t {
  return ((dim + mult - 1) / mult) * mult;
};
D = rd_up_to_multiple_of(dim, 64);
namespace cuvs::neighbors::ivf_rabitq::detail {

#define MAX_D 2048
Use static constexpr int ... or something else rather than a macro.
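A one-line sketch following the existing constants shown just below; the spelling of the constant is only illustrative:

static constexpr int MAX_D = 2048;  // replaces #define MAX_D 2048 inside the detail namespace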
static constexpr int BITS_PER_CHUNK = 4;
static constexpr int LUT_SIZE       = (1 << BITS_PER_CHUNK);  // 16
static constexpr int WARP_SIZE      = 32;
RAFT_CUDA_TRY(cudaFuncSetAttribute(
  fully_fused_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, shared_mem_size));
Why is this needed? Shouldn't the shared memory size specified in the <<<>>> launch configuration be enough?
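For context, a hedged sketch of the usual reason for such a call: dynamic shared-memory requests above the default per-block limit (48 KB on most architectures) must be opted into via cudaFuncSetAttribute before launch, and the size in the launch configuration alone is rejected in that case. Whether that applies here depends on the actual shared_mem_size; the kernel below is a placeholder, not the cuVS one:

// Requires <cuda_runtime.h>. Error handling omitted for brevity.
__global__ void fused_kernel_demo() { extern __shared__ unsigned char smem[]; (void)smem; }

void launch_demo(size_t shared_mem_size, cudaStream_t stream)
{
  if (shared_mem_size > 48 * 1024) {
    // Without this opt-in, launches requesting that much dynamic shared memory fail.
    cudaFuncSetAttribute(fused_kernel_demo,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         static_cast<int>(shared_mem_size));
  }
  fused_kernel_demo<<<1, 256, shared_mem_size, stream>>>();
}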
This PR introduces IVF-RaBitQ, a GPU-native ANNS solution that integrates the cluster-based IVF method with RaBitQ quantization into an efficient GPU index build/search pipeline. It can achieve a strong recall–throughput trade-off while offering fast index builds and a small storage footprint.