Support IVF-RaBitQ in cuVS Library #1866

Open

Stardust-SJF wants to merge 146 commits into rapidsai:main from Stardust-SJF:cuvs_ivf_rabitq
Conversation

@Stardust-SJF

This PR introduces IVF-RaBitQ, a GPU-native ANNS solution that integrates the cluster-based IVF method with RaBitQ quantization into an efficient GPU index build/search pipeline. It achieves a strong recall–throughput trade-off alongside fast index builds and a small storage footprint.
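For readers unfamiliar with RaBitQ, the core idea (one sign bit per dimension plus a per-vector scale factor used to estimate inner products) can be illustrated with a minimal, self-contained sketch. This is purely illustrative and not the PR's implementation, which adds random rotation, extended bits, and fused GPU kernels:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Quantize each dimension to its sign bit (1-bit RaBitQ-style code).
std::vector<uint8_t> sign_quantize(const std::vector<float>& v) {
  std::vector<uint8_t> bits(v.size());
  for (size_t i = 0; i < v.size(); ++i) bits[i] = v[i] >= 0.0f ? 1 : 0;
  return bits;
}

// Estimate <q, v> from the 1-bit code: each bit contributes +/- scale,
// where scale = ||v|| / sqrt(dim) so the reconstruction preserves v's norm.
float estimate_dot(const std::vector<float>& q,
                   const std::vector<uint8_t>& bits,
                   float v_norm) {
  float scale = v_norm / std::sqrt(static_cast<float>(bits.size()));
  float acc = 0.0f;
  for (size_t i = 0; i < bits.size(); ++i) acc += q[i] * (bits[i] ? scale : -scale);
  return acc;
}
```

The IVF part then restricts this estimation to vectors in the probed clusters only.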

jamxia155 and others added 30 commits November 3, 2025 08:03
- Currently built as a separate library.
- To be merged with existing `cuvs_objs` library.
- Dependency on `Eigen` yet to be removed.
- RABITQ_BENCH_TEST for standalone testing; to be removed as integration work
is completed.
- CUVS_IVF_RABITQ_ANN_BENCH for benchmarking as part of ANN benchmarking suite
- `bits_per_dim` = `ex_bits` + 1
- Also update supported range of `bits_per_dim` to 2-9 inclusive
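As a rough illustration of the `bits_per_dim` = `ex_bits` + 1 relation and the 2-9 range, a hypothetical helper computing the packed code size per vector might look like this (the PR's actual layout packs base and extended bits in separate structures; this helper is an assumption for the example):

```cpp
#include <cstddef>
#include <stdexcept>

// bits_per_dim = ex_bits + 1: one base sign bit plus ex_bits extended bits
// per dimension. Supported range per the commit message: 2-9 inclusive.
size_t code_bytes_per_vector(size_t dim, size_t bits_per_dim) {
  if (bits_per_dim < 2 || bits_per_dim > 9) {
    throw std::invalid_argument("bits_per_dim must be in [2, 9]");
  }
  size_t total_bits = dim * bits_per_dim;
  return (total_bits + 7) / 8;  // round up to whole bytes
}
```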
* Fix cuVS build issues with RaBitQ

* Align line formatting && Delete unused variables in robust_prune.cuh
* Download Eigen automatically by rapids-cmake

* Disable FAISS and DISKANN benchmarks

* add config files and update readme

* Update Readme and openai_1M config

* Update python bench command line

* update README

* update README

---------

Co-authored-by: James Xia <jamxia@nvidia.com>
- Error-checking
- Stream-ordered CUDA calls
jamxia155 and others added 20 commits March 18, 2026 06:13
- Remove commented-out code
- Rename a variable
- Update index type for device matrix views
Replace `kmeans_trainset_fraction` with `max_points_per_cluster`
  Implement streaming index construction for IVF-RaBitQ to handle datasets
  that exceed available GPU memory. This enables building indices for
  large datasets by processing data in batches streamed from host memory.

  Key features:
  - Automatic detection based on dataset size vs available workspace
  - Complete-cluster batching strategy (no partial clusters across batches)
  - OpenMP parallel host data gathering with persistent thread pool
  - Contiguous data handling in quantizer for improved performance
  - Configurable batch size via streaming_batch_size parameter

  The implementation uses omp_get_max_threads() to scale with available
  hardware while maintaining efficient memory bandwidth utilization.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
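The complete-cluster batching strategy described above (no partial clusters across batches) can be sketched as a simple host-side grouping. This is an illustrative stand-in under assumed names (`batch_clusters`, a greedy fill policy), not the PR's code:

```cpp
#include <cstddef>
#include <vector>

// Group whole clusters into batches whose total row count stays within
// `budget`, never splitting a cluster across batches. A cluster larger
// than the budget gets a batch of its own.
std::vector<std::vector<int>> batch_clusters(const std::vector<int>& sizes, int budget) {
  std::vector<std::vector<int>> batches;
  std::vector<int> cur;
  int used = 0;
  for (int c = 0; c < static_cast<int>(sizes.size()); ++c) {
    if (!cur.empty() && used + sizes[c] > budget) {
      batches.push_back(cur);  // flush: next cluster would overflow the budget
      cur.clear();
      used = 0;
    }
    cur.push_back(c);
    used += sizes[c];
  }
  if (!cur.empty()) batches.push_back(cur);
  return batches;
}
```

Each batch would then be gathered from host memory (in parallel) and quantized on the GPU before moving on.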
  Add an optional parameter to force streaming construction regardless of
  dataset size. This provides users with explicit control over the
  construction method for testing or specific use cases.

  When force_streaming is enabled:
  - Streaming construction is used even if dataset fits in GPU memory
  - Distinct log message indicates explicit vs automatic decision to
  use streaming construction

  Default behavior (force_streaming=false) remains unchanged, with
  automatic detection based on dataset size vs available workspace.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
  Enable control of the force_streaming parameter through JSON benchmark
  configuration files. Users can now specify force_streaming in the
  build_param section of their benchmark configs.

  Example usage:
    "build_param": {
      "nlist": 10000,
      "force_streaming": true,
      ...
    }

  This allows benchmark configurations to explicitly control streaming
  construction for performance testing and comparison.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
  Update documentation to note that force_streaming has no effect when
  the dataset is already in device memory, as streaming construction is
  only applicable for host-to-device data transfer.
  Adds build_forced_streaming test case that explicitly enables streaming
  construction even for small datasets that fit in GPU memory. This validates
  the streaming code path with dynamic batch sizing and ensures compatibility
  with serialization/deserialization.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove batch_flag member variables and associated dead code, then refactor
DataQuantizerGPU to move private methods to free functions for better
encapsulation.

Part 1: Remove batch_flag dead code
- Remove IVFGPU::batch_flag and DataQuantizerGPU::batch_flag_dq
- Remove dual code path conditionals (AoS vs SoA layouts)
- Simplify helper methods: first_block_batch() → first_block(),
  ex_factor_batch() → ex_factor()
- Simplify GetExFactorBytes() and block_bytes() to single return
- Maintain backward compatibility in save/load (legacy flag handling)
- Remove dead methods: quantize(), quantize_contiguous(),
  data_transformation(), data_transformation_contiguous()
- Remove 936 lines of dead code from quantizer_gpu_fast.cu (96% reduction)

Part 2: Move private methods to free functions
- Convert 5 private methods to free functions in anonymous namespace:
  * data_transformation_batch_opt()
  * data_transformation_batch_opt_contiguous()
  * rabitq_codes_and_factors_fused()
  * exrabitq_codes_and_factors_fused()
  * exrabitq_codes_and_factors_fused_ori()
- Pass all needed class members as explicit parameters
- Remove ~40 lines from public header (quantizer_gpu.cuh)
- Clean up unused variables

Benefits:
- Eliminates confusing dual code paths
- Cleaner public API with implementation details hidden
- Faster compilation for files including headers
- Better separation of interface and implementation

Files modified:
- cpp/src/neighbors/ivf_rabitq/gpu_index/{ivf_gpu.cuh,ivf_gpu.cu}
- cpp/src/neighbors/ivf_rabitq/gpu_index/{quantizer_gpu.cuh,quantizer_gpu.cu,quantizer_gpu_fast.cu}
- cpp/src/neighbors/ivf_rabitq.cu

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@tfeher
Contributor

tfeher commented Mar 25, 2026

/ok to test 52793fc

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • New Features
    • Added IVF-RaBitQ algorithm for approximate nearest neighbor search with GPU-accelerated support
    • Added FAISS CPU variant of IVF-RaBitQ for CPU-based searching
    • Added comprehensive benchmark support for both IVF-RaBitQ implementations
    • Added serialization and deserialization capabilities for trained indices

Walkthrough

This pull request introduces a complete IVF-RaBitQ nearest neighbor search implementation for cuVS, spanning GPU index construction (with streaming support), multi-mode search (LUT32/LUT16/QUANT8/QUANT4), quantization, serialization, and comprehensive benchmark/test integration for both cuVS and FAISS CPU variants.

Changes

Cohort / File(s) Summary
Build System Configuration
cpp/CMakeLists.txt, cpp/bench/ann/CMakeLists.txt, cpp/tests/CMakeLists.txt
Added CMake targets/options for ivf_rabitq static library, benchmark toggles (CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ, CUVS_ANN_BENCH_USE_FAISS_CPU_IVF_RABITQ), and test executable registration.
Public API Header
cpp/include/cuvs/neighbors/ivf_rabitq.hpp
New public C++ interface defining index_params, search_params, search_mode enum, templated index<IdxT> class, and top-level build/search/serialize/deserialize functions for IVF-RaBitQ index operations.
Core GPU Implementation
cpp/src/neighbors/ivf_rabitq.cu, cpp/src/neighbors/ivf_rabitq/gpu_index/*
Implements index building (device/host/streaming paths), k-means clustering, RaBitQ quantization on GPU, GPU rotator matrix handling, three GPU search variants with different memory/compute tradeoffs, and serialization/deserialization logic. Includes ~6,500 LOC of GPU kernels and supporting infrastructure.
Utility & Helper Components
cpp/src/neighbors/ivf_rabitq/defines.hpp, cpp/src/neighbors/ivf_rabitq/utils/*
Internal type definitions, aligned-memory allocation, IO helpers for loading vectors, stopwatch timing, vector normalization, space computations, and constraint tools.
Benchmark Framework
cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq*, cpp/bench/ann/src/cuvs/cuvs_benchmark.cu, cpp/bench/ann/src/faiss/faiss_cpu_benchmark.cpp, cpp/bench/ann/src/faiss/faiss_cpu_wrapper.h
Benchmark wrappers for cuVS and FAISS CPU ivf_rabitq variants, parameter parsing from JSON configurations, algorithm factory integration, persistence helpers (save/load), and search preference definitions.
Test Infrastructure
cpp/tests/neighbors/ann_ivf_rabitq.cuh, cpp/tests/neighbors/ann_ivf_rabitq/test_float_int64_t.cu
Comprehensive parameterized GoogleTest harness covering build variants (direct device, host input, serialization, streaming), reference KNN validation, and test generation across dimensions, probe counts, k values, bits-per-dimension, and search modes.
Configuration & Python
python/cuvs_bench/cuvs_bench/config/algorithms.yaml, python/cuvs/cuvs/tests/test_compute_matrix_product.py
Added benchmark algorithm entries for faiss_cpu_ivf_rabitq and cuvs_ivf_rabitq; new pytest module for matrix product iteration utilities.

Estimated code review effort

🎯 5 (Critical) | ⏱️ 90+ minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 28.28%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Title check — ✅ Passed: the title clearly and accurately describes the main change: adding support for the IVF-RaBitQ algorithm to the cuVS library.
  • Description check — ✅ Passed: the description explains the feature being added (IVF-RaBitQ algorithm), its benefits (recall–throughput trade-off, fast build, small storage), and is directly related to the changeset.
  • Linked Issues check — ✅ Passed: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: skipped because no linked issues were found for this pull request.


@coderabbitai Bot left a comment


Actionable comments posted: 7

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (16)
cpp/CMakeLists.txt-1070-1070 (1)

1070-1070: ⚠️ Potential issue | 🟠 Major

Add ivf_rabitq to the static cuvs_static target.

The shared cuvs target links ivf_rabitq, but cuvs_static does not. Since src/neighbors/ivf_rabitq.cu is included in cuvs_objs for static builds, calls into the helper target become unresolved for cuvs_static consumers.

Proposed fix
     target_link_libraries(
       cuvs_static
       INTERFACE $<COMPILE_ONLY:rmm::rmm>
       PUBLIC raft::raft
              cuvs::cuvs_cpp_headers
              ${CUVS_CTK_MATH_DEPENDENCIES}
              $<TARGET_NAME_IF_EXISTS:NCCL::NCCL>
              $<BUILD_LOCAL_INTERFACE:$<TARGET_NAME_IF_EXISTS:hnswlib::hnswlib>>
       PRIVATE rmm::rmm
               $<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX>
               CUDA::nvJitLink
               CUDA::nvrtc
+              ivf_rabitq
               $<$<BOOL:${CUVS_NVTX}>:CUDA::nvtx3>
               $<COMPILE_ONLY:nvidia::cutlass::cutlass>
               $<COMPILE_ONLY:cuco::cuco>
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/CMakeLists.txt` at line 1070, The static target cuvs_static is missing
linkage to ivf_rabitq causing unresolved symbols from
src/neighbors/ivf_rabitq.cu; update the CMake target definition that builds
cuvs_static (which aggregates cuvs_objs) to link the ivf_rabitq library the same
way the shared cuvs target does (i.e., add ivf_rabitq to the
target_link_libraries or equivalent linkage list for cuvs_static so consumers of
cuvs_static get the helper resolved).
cpp/CMakeLists.txt-228-253 (1)

228-253: ⚠️ Potential issue | 🟠 Major

Condition OpenMP linking and compilation flags on the DISABLE_OPENMP option.

The ivf_rabitq target unconditionally links OpenMP::OpenMP_CXX and injects -fopenmp compiler flags, even when DISABLE_OPENMP=ON. This breaks the build because OpenMP is not discovered when that option is enabled. Other targets in this file (lines 980–1130) correctly use the $<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX> pattern to make OpenMP optional.

Use conditional generator expressions for the OpenMP target dependency and remove the hardcoded -fopenmp flags from the compile options:

Proposed fix
-  target_link_libraries(ivf_rabitq PRIVATE OpenMP::OpenMP_CXX CUDA::cudart raft::raft rmm)
+  target_link_libraries(
+    ivf_rabitq
+    PRIVATE $<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX>
+            CUDA::cudart
+            raft::raft
+            rmm::rmm
+  )

   target_compile_options(
     ivf_rabitq
-    PRIVATE $<$<COMPILE_LANGUAGE:CUDA>: $<$<CONFIG:Debug>:-G;-g> --extended-lambda
-            --expt-relaxed-constexpr -Xcompiler=-fopenmp > $<$<COMPILE_LANGUAGE:CXX>:-fopenmp>
+    PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:$<$<CONFIG:Debug>:-G;-g> --extended-lambda
+            --expt-relaxed-constexpr>
   )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/CMakeLists.txt` around lines 228 - 253, The ivf_rabitq target currently
always links OpenMP and always injects -fopenmp into compile options; change
target_link_libraries(ivf_rabitq ...) to use the optional generator expression
$<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX> instead of OpenMP::OpenMP_CXX, and
remove the hardcoded -fopenmp from target_compile_options so the CXX and CUDA
compile flags do not force OpenMP when DISABLE_OPENMP=ON (make any CUDA
-Xcompiler=-fopenmp and CXX -fopenmp additions conditional on the OpenMP target
existing or the option being enabled).
cpp/bench/ann/src/faiss/faiss_cpu_wrapper.h-14-14 (1)

14-14: ⚠️ Potential issue | 🟠 Major

Guard the RaBitQ FAISS include and wrapper behind the build option.

The header is shared by all FAISS CPU benchmarks. Unconditional inclusion of <faiss/IndexIVFRaBitQ.h> and the faiss_cpu_ivfrabitq class definition will fail the entire build if FAISS lacks RaBitQ support, even when CUVS_ANN_BENCH_USE_FAISS_CPU_IVF_RABITQ is disabled.

Proposed fix
+#ifdef CUVS_ANN_BENCH_USE_FAISS_CPU_IVF_RABITQ
 #include <faiss/IndexIVFRaBitQ.h>
+#endif
+#ifdef CUVS_ANN_BENCH_USE_FAISS_CPU_IVF_RABITQ
 template <typename T>
 class faiss_cpu_ivfrabitq : public faiss_cpu<T> {
  public:
   struct build_param : public faiss_cpu<T>::build_param {};
@@
   }
 };
+#endif
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/bench/ann/src/faiss/faiss_cpu_wrapper.h` at line 14, Guard the
RaBitQ-specific include and wrapper behind the build option by wrapping the
`#include` <faiss/IndexIVFRaBitQ.h> and the faiss_cpu_ivfrabitq class definition
with an `#if` defined(CUVS_ANN_BENCH_USE_FAISS_CPU_IVF_RABITQ) / `#endif` block;
locate the unconditional include and the faiss_cpu_ivfrabitq symbol in
faiss_cpu_wrapper.h and ensure both the header inclusion and the class (or any
RaBitQ-specific functions/typedefs) are only compiled when
CUVS_ANN_BENCH_USE_FAISS_CPU_IVF_RABITQ is defined to avoid build failures when
FAISS lacks RaBitQ support.
cpp/src/neighbors/ivf_rabitq/utils/tools.hpp-10-15 (1)

10-15: ⚠️ Potential issue | 🟠 Major

Header must include <cstddef> to be self-contained.

size_t is used without including a declaring header. Although this currently works in memory.hpp due to transitive includes (<cstdlib>), the header is not self-contained and including it directly in other translation units could fail depending on include order. Add #include <cstddef>.

The codebase convention uses bare size_t, not std::size_t, so no style change is needed.

Minimal fix
 #pragma once
+
+#include <cstddef>
 
 namespace cuvs::neighbors::ivf_rabitq::detail {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/utils/tools.hpp` around lines 10 - 15, The
header uses size_t in the inline functions div_rd_up and rd_up_to_multiple_of
but doesn't include the declaring header; add #include <cstddef> at the top of
the file so the header is self-contained and other translation units can include
tools.hpp directly without relying on transitive includes (keep using bare
size_t as per project convention).
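For context, a self-contained version of the two rounding helpers named in this comment, with the suggested `#include <cstddef>` (sketched from the review text; the actual definitions live in tools.hpp):

```cpp
#include <cstddef>  // declares size_t, making the header self-contained

// Integer division rounding up: how many chunks of size y cover x.
inline size_t div_rd_up(size_t x, size_t y) { return (x + y - 1) / y; }

// Round x up to the next multiple of y (x itself if already a multiple).
inline size_t rd_up_to_multiple_of(size_t x, size_t y) { return div_rd_up(x, y) * y; }
```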
cpp/bench/ann/CMakeLists.txt-51-51 (1)

51-51: ⚠️ Potential issue | 🟠 Major

Add IVF-RaBitQ to the cuVS benchmark option bookkeeping.

CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ defaults ON, but it is not disabled in BUILD_CPU_ONLY and is omitted from the CUVS_ANN_BENCH_USE_CUVS aggregate. CPU-only builds can still try to configure the .cu RabitQ benchmark/link ivf_rabitq, and cuVS benchmark detection can be wrong when only RabitQ is enabled.

Proposed fix
 if(BUILD_CPU_ONLY)
   set(CUVS_FAISS_ENABLE_GPU OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_IVF_FLAT OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_IVF_PQ OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_CAGRA OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_BRUTE_FORCE OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB OFF)
   set(CUVS_ANN_BENCH_USE_GGNN OFF)
   set(CUVS_KNN_BENCH_USE_CUVS_BRUTE_FORCE OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_MG OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_VAMANA OFF)
   set(CUVS_ANN_BENCH_USE_CUVS_CAGRA_DISKANN OFF)
+  set(CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ OFF)
 else()
   set(CUVS_FAISS_ENABLE_GPU ON)
 endif()
 
 set(CUVS_ANN_BENCH_USE_CUVS OFF)
 if(CUVS_ANN_BENCH_USE_CUVS_IVF_PQ
    OR CUVS_ANN_BENCH_USE_CUVS_BRUTE_FORCE
    OR CUVS_ANN_BENCH_USE_CUVS_IVF_FLAT
+   OR CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ
    OR CUVS_ANN_BENCH_USE_CUVS_CAGRA
    OR CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB
    OR CUVS_KNN_BENCH_USE_CUVS_BRUTE_FORCE

Also applies to: 84-110, 251-256

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/bench/ann/CMakeLists.txt` at line 51, CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ
is defaulted ON but not included in the CUVS_ANN_BENCH_USE_CUVS aggregate and
not disabled under BUILD_CPU_ONLY, so CPU-only builds can attempt to
configure/link the .cu ivf_rabitq target; update the CMake logic to (1) add
CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ into the aggregated CUVS_ANN_BENCH_USE_CUVS
option and (2) wrap/force-disable CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ when
BUILD_CPU_ONLY is set (same treatment as other GPU-only CUVS flags) so
ivf_rabitq is not considered/configured for CPU-only builds.
cpp/src/neighbors/ivf_rabitq/defines.hpp-23-25 (1)

23-25: ⚠️ Potential issue | 🟠 Major

Make operator> false for equal distances.

!(*this < other) returns true when distances are equal, which breaks comparator expectations.

Proposed fix
-  bool operator>(const Candidate& other) const { return !(*this < other); }
+  bool operator>(const Candidate& other) const { return distance > other.distance; }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/defines.hpp` around lines 23 - 25, The current
Candidate::operator>(const Candidate& other) uses '!(*this < other)' which
yields true when distances are equal; update operator> to compare distances
strictly (e.g., return distance > other.distance or return other.distance <
distance) so it returns false for equal distances and preserves strict weak
ordering; modify the Candidate struct's operator< and operator> pair
accordingly.
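A standalone sketch of the corrected comparator pair (field names assumed for illustration) shows why the strict form matters: with the original `!(*this < other)`, equal distances would compare as "greater", violating strict weak ordering.

```cpp
// Minimal Candidate with both comparators strict, as the fix suggests.
// For equal distances, neither operator< nor operator> should return true.
struct Candidate {
  float distance;
  int id;
  bool operator<(const Candidate& other) const { return distance < other.distance; }
  bool operator>(const Candidate& other) const { return distance > other.distance; }
};
```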
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_common.cuh-1-10 (1)

1-10: ⚠️ Potential issue | 🟠 Major

Add an include guard to this CUDA header.

This .cuh defines constants, a struct, and a device function; including it twice in one translation unit will cause redefinition errors.

Proposed fix
 /*
  * SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION.
  * SPDX-License-Identifier: Apache-2.0
  */
 
+#pragma once
+
 //
 // Created by Stardust on 4/14/25.
 //
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_common.cuh` around lines
1 - 10, This CUDA header lacks an include guard and can be included multiple
times causing redefinition of its constants, struct, and device function; fix it
by adding a header guard (or a top-line `#pragma once`) around the entire
contents of searcher_gpu.cuh using a unique macro name (e.g.
SEARCHER_GPU_COMMON_CUH_) so the declarations inside (the constants, the struct,
and the device function defined in this file) are only processed once per
translation unit.
cpp/src/neighbors/ivf_rabitq/defines.hpp-8-12 (1)

8-12: ⚠️ Potential issue | 🟠 Major

Include <cstddef> and use std::size_t.

The header only includes <cstdint>, which does not declare std::size_t per C++17 standard. Line 12 uses bare size_t without a qualifying namespace or using declaration, which will not compile on conforming implementations.

Proposed fix
+#include <cstddef>
 #include <cstdint>
 
 namespace cuvs::neighbors::ivf_rabitq::detail {
 
-constexpr size_t FAST_SIZE = 32;
+constexpr std::size_t FAST_SIZE = 32;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/defines.hpp` around lines 8 - 12, The header
defines FAST_SIZE using bare size_t but only includes <cstdint>, so add #include
<cstddef> and change the type to std::size_t for FAST_SIZE in the
cuvs::neighbors::ivf_rabitq::detail namespace (symbol: FAST_SIZE in defines.hpp)
to ensure portable, standards-conforming compilation.
cpp/src/neighbors/ivf_rabitq/utils/searcher_gpu_utils.cu-36-64 (1)

36-64: ⚠️ Potential issue | 🟠 Major

Fix type safety and synchronization for the probed vector reduction.

The reduction over d_probed_vectors_count (stored as unsigned long long) mixes incompatible types: init 0 is int, comparator is thrust::maximum<size_t>(), and the memset uses sizeof(size_t) instead of sizeof(unsigned long long). Additionally, raft::copy on line 57 is asynchronous, yet the function returns without synchronizing the stream. Callers immediately access max_probed_vectors_count on the next line, creating a race condition.

Introduce a type alias for the vector count, use it consistently in memset and reduce operations, and synchronize the stream before returning.

Proposed fix
+  using CountT = unsigned long long;
+
   auto d_max_probed_cluster_size = raft::make_device_scalar<uint32_t>(handle, 0);
-  auto d_probed_vectors_count    = raft::make_device_vector<unsigned long long, int64_t>(
-    handle, get_max_probed_vectors_count ? num_queries : 0);
+  auto d_probed_vectors_count =
+    raft::make_device_vector<CountT, int64_t>(handle,
+                                              get_max_probed_vectors_count ? num_queries : 0);
   // raw pointers for passing by value to device lambda
   auto d_max_probed_cluster_size_ptr = d_max_probed_cluster_size.data_handle();
   auto d_probed_vectors_count_ptr    = d_probed_vectors_count.data_handle();
   if (get_max_probed_vectors_count) {
     RAFT_CUDA_TRY(cudaMemsetAsync(
-      d_probed_vectors_count_ptr, 0, num_queries * sizeof(size_t), stream));  // Initialize to 0
+      d_probed_vectors_count_ptr, 0, num_queries * sizeof(CountT), stream));  // Initialize to 0
   }
   
   auto count = thrust::make_counting_iterator<int64_t>(0);
   thrust::for_each(
     raft::resource::get_thrust_policy(handle), count, count + num_pairs, [=] __device__(int64_t i) {
       auto [cluster_idx, query_idx] = d_cluster_query_pairs[i];
       auto cluster_size             = d_cluster_meta[cluster_idx].num;
       atomicMax(d_max_probed_cluster_size_ptr, cluster_size);
       if (get_max_probed_vectors_count)
         atomicAdd(&d_probed_vectors_count_ptr[query_idx],
                   static_cast<unsigned long long>(cluster_size));
     });
   raft::copy(&max_probed_cluster_size, d_max_probed_cluster_size_ptr, 1, stream);
   if (get_max_probed_vectors_count) {
-    max_probed_vectors_count = thrust::reduce(raft::resource::get_thrust_policy(handle),
-                                              d_probed_vectors_count_ptr,
-                                              d_probed_vectors_count_ptr + num_queries,
-                                              0,
-                                              thrust::maximum<size_t>());
+    auto max_count = thrust::reduce(raft::resource::get_thrust_policy(handle),
+                                    d_probed_vectors_count_ptr,
+                                    d_probed_vectors_count_ptr + num_queries,
+                                    CountT{0},
+                                    thrust::maximum<CountT>());
+    max_probed_vectors_count = static_cast<size_t>(max_count);
   }
+  RAFT_CUDA_TRY(cudaStreamSynchronize(stream));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/utils/searcher_gpu_utils.cu` around lines 36 -
64, The reduction mixes types and is unsynchronized: introduce a vector-count
alias (e.g., using vec_count_t = unsigned long long) and use it for
d_probed_vectors_count / d_probed_vectors_count_ptr, change the cudaMemsetAsync
size to num_queries * sizeof(vec_count_t) and initialize the reduce identity to
static_cast<vec_count_t>(0) while using thrust::maximum<vec_count_t>() for
thrust::reduce; after copying device scalar to max_probed_cluster_size and after
the thrust::reduce that writes max_probed_vectors_count, synchronize the stream
(e.g., cudaStreamSynchronize(stream)) before returning or before any host reads
of max_probed_vectors_count to avoid the race; update references in the code
paths guarded by get_max_probed_vectors_count as needed.
cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h-120-125 (1)

120-125: ⚠️ Potential issue | 🟠 Major

Wire refine_ratio into search or remove the advertised refinement path.

needs_dataset() requests the dataset for refine_ratio > 1.0f, but search() never uses refine_ratio_ or dataset_. Benchmark runs configured with refinement will silently report unrefined IVF-RaBitQ results.

Also applies to: 155-156

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h` around lines 120 - 125, The
set_search_param stores refine_ratio_ but the search() path never uses
refine_ratio_ or dataset_, so either wire the advertised refinement into the
search pipeline or remove the refinement plumbing; specifically, in
cuvs_ivf_rabitq<T, IdxT>::search() detect refine_ratio_ > 1.0f and perform the
additional refinement step using dataset_ and search_params_ (e.g., extra
re-ranking / distance computations on candidates) so needs_dataset() is valid,
or if refinement is unimplemented remove refine_ratio_ handling from
cuvs_ivf_rabitq<T, IdxT>::set_search_param and stop requesting the dataset in
needs_dataset(); update references to search_params_, refine_ratio_, dataset_,
needs_dataset(), and search() accordingly.
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu-1451-1457 (1)

1451-1457: ⚠️ Potential issue | 🟠 Major

Guard zero-norm query quantization before computing delta.

For an all-zero query, norm and norm_quan can be zero, making delta NaN. That value is later used as query_width, so distances can become NaN and poison top-k selection.

Proposed fix
-    float norm_quan      = sqrtf(fmaxf(xu_sq, 0.f));
-    float cos_similarity = ip_resi_xucb / (norm * norm_quan);
-    float delta          = norm / norm_quan * cos_similarity;
+    float norm_quan = sqrtf(fmaxf(xu_sq, 0.f));
+    float delta     = 0.0f;
+    if (norm > 0.0f && norm_quan > 0.0f) {
+      float cos_similarity = ip_resi_xucb / (norm * norm_quan);
+      delta = norm / norm_quan * cos_similarity;
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu` around
lines 1451 - 1457, The computation of delta can produce NaN when norm or
norm_quan are zero (all-zero query); update the block around tid==0 to guard
divisions: check if norm_quan <= 0 or norm <= 0 and in that case set delta to 0
(or another defined safe fallback) instead of performing the division, then
store that safe value into d_delta[row]; ensure you reference and modify the
delta computation using the existing symbols norm_quan, norm, cos_similarity,
delta and d_delta (and row) so downstream use as query_width cannot become NaN.
cpp/src/neighbors/ivf_rabitq/gpu_index/rotator_gpu.cu-41-58 (1)

41-58: ⚠️ Potential issue | 🟠 Major

Honor the documented row-major serialization layout.

load() and save() copy the matrix linearly, but the header says files are row-major and device storage is transposed for column-major GEMM use. This breaks compatibility with any writer/reader following the documented format.

Proposed fix direction
-  for (size_t i = 0; i < D * D; ++i) {
-    input.read(reinterpret_cast<char*>(&host_buf(i)), sizeof(float));
-  }
+  for (size_t r = 0; r < D; ++r) {
+    for (size_t c = 0; c < D; ++c) {
+      float v;
+      input.read(reinterpret_cast<char*>(&v), sizeof(float));
+      host_buf(c * D + r) = v;  // row-major file -> column-major/device layout
+    }
+  }

Apply the inverse transpose in save(), or update the header and all serializers to declare the raw layout.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/rotator_gpu.cu` around lines 41 - 58,
The code currently copies the matrix bytes linearly between host_buf and
rotation_matrix_ but device storage is transposed for column-major GEMM while
files are documented as row-major; update RotatorGPU::load and RotatorGPU::save
to perform the inverse transpose when moving between disk and device.
Concretely: in load(), read the file into host_buf (row-major) then transpose
that host buffer into the device layout expected by rotation_matrix_ (or copy
host_buf to a temporary host_transposed and raft::copy that into
rotation_matrix_); in save(), copy rotation_matrix_ back to a host buffer in
device layout, transpose that buffer to row-major order, then write the
row-major bytes to output. Use the existing symbols (RotatorGPU::load,
RotatorGPU::save, rotation_matrix_, host_buf, D, stream_, handle_) and a host or
GPU transpose utility (e.g., raft transpose or a small custom transpose) to
perform the conversions so on-disk layout remains row-major while device storage
stays transposed for GEMM.
cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cu-961-968 (1)

961-968: ⚠️ Potential issue | 🟠 Major

Add stream synchronization after the D2H copy before reading the host buffer.

raft::copy() is asynchronous; the CPU loop can read h_rand_row_normalized_abs before the copy completes, causing stale or undefined data. The fully-GPU path at line 1450 correctly syncs before reading—apply the same pattern here.

Proposed fix
   raft::copy(h_rand_row_normalized_abs.data_handle(),
              rand.data_handle(),
              kConstNum * dim,
              raft::resource::get_cuda_stream(handle));
+  raft::resource::sync_stream(handle);
 
   double sum = 0;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cu` around lines
961-968: the host loop reads h_rand_row_normalized_abs immediately after an
asynchronous raft::copy from rand; add a CUDA stream synchronization using the
same stream from raft::resource::get_cuda_stream(handle) after the raft::copy
and before the loop that calls best_rescale_factor so the D2H transfer is
complete; locate the raft::copy(...) call and the subsequent use of
h_rand_row_normalized_abs and insert a stream synchronization (using the stream
handle retrieved from raft::resource::get_cuda_stream(handle) or an equivalent
RAFT sync helper) between them.
cpp/src/neighbors/ivf_rabitq/utils/IO.hpp-19-35 (1)

19-35: ⚠️ Potential issue | 🟠 Major

Returning -1 as size_t creates a potential SIZE_MAX allocation.

The function get_filesize returns -1 on error, which as a size_t becomes SIZE_MAX. While the file_exits check mitigates this in the happy path, a TOCTOU race condition exists: the file could be deleted or become inaccessible between the existence check and the stat64 call, causing get_filesize to return SIZE_MAX and subsequent division on line 88 to produce an enormous row count.

Additionally, these helper functions should be marked inline since they're defined in a header file included from multiple translation units; without inline, each including TU emits its own definition and the program violates the one-definition rule at link time.

Proposed fix
-size_t get_filesize(const char* filename)
+inline std::optional<size_t> get_filesize(const char* filename)
 {
   struct stat64 stat_buf;
   int rc = stat64(filename, &stat_buf);
-  return rc == 0 ? stat_buf.st_size : -1;
+  if (rc != 0) { return std::nullopt; }
+  return static_cast<size_t>(stat_buf.st_size);
 }
 
-bool file_exits(const char* filename)
+inline bool file_exits(const char* filename)

Then handle the empty optional before computing rows.
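As a sketch of the caller-side handling (hypothetical helper name; the actual call site computes rows around line 88):

```cpp
#include <cstddef>
#include <optional>

// Guarded row computation: with get_filesize returning
// std::optional<size_t>, the division can no longer see SIZE_MAX from a
// failed stat64; an empty optional (or a zero row size) propagates instead.
std::optional<std::size_t> rows_in_file(std::optional<std::size_t> file_bytes,
                                        std::size_t bytes_per_row)
{
  if (!file_bytes || bytes_per_row == 0) { return std::nullopt; }
  return *file_bytes / bytes_per_row;
}
```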

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/utils/IO.hpp` around lines 19-35: get_filesize
currently returns size_t and uses -1 for errors which converts to SIZE_MAX
(risking huge allocations if the file disappears between file_exits and stat64);
mark both helpers inline, change get_filesize to return an optional-like error
indicator (e.g., std::optional<size_t> or ssize_t) instead of size_t so errors
are represented safely, update callers (the code that computes rows/divisions)
to check the optional/negative value and bail out early rather than performing
divisions, and keep file_exits inline as well; specifically update get_filesize
and file_exits signatures and ensure the place that computes rows (where
get_filesize() is used) validates the returned value before using it.
cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu-452-527 (1)

452-527: ⚠️ Potential issue | 🟠 Major

int grid/block arithmetic can overflow for large inputs.

Several launch-config computations mix size_t operands into int results, which silently truncates or overflows when num_vectors, num_centroids, or batch_size * num_centroids exceeds ~2³¹:

  • Line 453 / 526 / 539 / 695: int num_blocks = (num_vectors + block_size - 1) / block_size;
  • Line 468 / 618: int num_levels = num_centroids + 1; (passed as int to CUB)
  • Lines 1238/1346/1455: int grid = num_centroids + batch_size;
  • Lines 1254/1362/1471: (batch_size * num_centroids + add_threads - 1) / add_threads assigned to int

Given this index supports datasets too large to fit in GPU memory (streaming path), num_vectors in the tens-of-billions range is plausible. Please use size_t/int64_t for these intermediates (CUB histogram/scan use int num_levels, so a runtime check that num_centroids fits in int is also warranted).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu` around lines 452-527:
the grid/block arithmetic and CUB parameters use int and can overflow; change
intermediate counters like num_blocks, num_levels, and grid calculations to
size_t or int64_t (compute e.g. num_blocks = (size_t(num_vectors) + block_size -
1) / block_size and similar for (batch_size * num_centroids + add_threads - 1) /
add_threads), keep block_size as an int for launch, and cast to the appropriate
launch type only when invoking kernels such as build_cluster_meta_kernel;
additionally add runtime checks that values passed into CUB APIs (num_centroids
-> num_levels) fit into int and fail/throw if not, and ensure d_offsets/
histogram allocations use matching wider index types (size_t) so no truncation
occurs before any kernel or cub call.
cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cuh-163-166 (1)

163-166: ⚠️ Potential issue | 🟠 Major

Default constructor leaves scalar members uninitialized.

The default constructor only initializes handle_, initializer, and Rota, but num_vectors, num_dimensions, num_padded_dim, num_centroids, max_cluster_length, and ex_bits (declared at lines 381-386) are left indeterminate. If anything reads these (e.g. a getter call, or an early exit in save/load_transposed) before load_transposed populates them, you get UB. Also, passing a hard-coded 128 to RotatorGPU is wasted work since load_transposed immediately replaces Rota.

🛡️ Proposed fix
   IVFGPU(raft::resources const& handle)
-    : handle_(handle), initializer(nullptr), Rota(std::make_unique<RotatorGPU>(handle_, 128))
+    : handle_(handle),
+      num_vectors(0),
+      num_dimensions(0),
+      num_padded_dim(0),
+      num_centroids(0),
+      max_cluster_length(0),
+      ex_bits(0),
+      initializer(nullptr),
+      DQ(nullptr),
+      Rota(nullptr)
   {
   }

Alternatively, give the data members default member initializers where declared.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cuh` around lines 163-166:
the IVFGPU(raft::resources const& handle) constructor leaves scalar members
num_vectors, num_dimensions, num_padded_dim, num_centroids, max_cluster_length,
and ex_bits uninitialized and eagerly constructs Rota with a hard-coded 128; fix
by initializing those scalars to safe defaults (e.g. 0) either via default
member initializers where they are declared or by initializing them in the
IVFGPU constructor initializer list, and avoid wasteful RotatorGPU construction
by initializing Rota to nullptr or deferring its construction until
load_transposed replaces it (or construct with the correct parameter), ensuring
consistency with load_transposed and initializer usage.
🟡 Minor comments (12)
cpp/src/neighbors/ivf_rabitq/utils/StopW.hpp-18-40 (1)

18-40: ⚠️ Potential issue | 🟡 Minor

Use floating-point durations to preserve elapsed time precision.

All four methods cast to integral duration types before calling .count(), which truncates fractional time. For example, duration_cast<seconds> drops sub-second precision, causing short phases to report 0.0 seconds. Use floating-point duration types directly.

Proposed fix
-  float getElapsedTimeSec()
+  float getElapsedTimeSec() const
   {
-    std::chrono::steady_clock::time_point time_end = std::chrono::steady_clock::now();
-    return (std::chrono::duration_cast<std::chrono::seconds>(time_end - time_begin).count());
+    return std::chrono::duration<float>(std::chrono::steady_clock::now() - time_begin).count();
   }
 
-  float getElapsedTimeMili()
+  float getElapsedTimeMili() const
   {
-    std::chrono::steady_clock::time_point time_end = std::chrono::steady_clock::now();
-    return (std::chrono::duration_cast<std::chrono::milliseconds>(time_end - time_begin).count());
+    return std::chrono::duration<float, std::milli>(std::chrono::steady_clock::now() - time_begin)
+      .count();
   }
 
-  float getElapsedTimeMicro()
+  float getElapsedTimeMicro() const
   {
-    std::chrono::steady_clock::time_point time_end = std::chrono::steady_clock::now();
-    return (std::chrono::duration_cast<std::chrono::microseconds>(time_end - time_begin).count());
+    return std::chrono::duration<float, std::micro>(std::chrono::steady_clock::now() - time_begin)
+      .count();
   }
 
-  float getElapsedTimeNano()
+  float getElapsedTimeNano() const
   {
-    std::chrono::steady_clock::time_point time_end = std::chrono::steady_clock::now();
-    return (std::chrono::duration_cast<std::chrono::nanoseconds>(time_end - time_begin).count());
+    return std::chrono::duration<float, std::nano>(std::chrono::steady_clock::now() - time_begin)
+      .count();
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/utils/StopW.hpp` around lines 18-40: the
current
getElapsedTimeSec/getElapsedTimeMili/getElapsedTimeMicro/getElapsedTimeNano
functions use duration_cast to integral duration types which truncates
fractions; change each to compute elapsed = std::chrono::steady_clock::now() -
time_begin and return elapsed.count() from a floating-point chrono::duration
(e.g., std::chrono::duration<float> for seconds, std::chrono::duration<float,
std::milli> for milliseconds, std::chrono::duration<float, std::micro> for
microseconds, and std::chrono::duration<float, std::nano> for nanoseconds) so
the functions return fractional values instead of being truncated to integers.
cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h-122-126 (1)

122-126: ⚠️ Potential issue | 🟡 Minor

Replace the release-stripped assert with runtime validation.

n_probes comes from benchmark config, and assert is compiled out under NDEBUG; invalid configs can then reach the search path unchecked.

Proposed fix
-  assert(search_params_.n_probes <= index_params_.n_lists);
+  if (search_params_.n_probes > index_params_.n_lists) {
+    throw std::runtime_error("ivf_rabitq n_probes must be <= n_lists");
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h` around lines 122-126:
replace the release-only assert with a runtime validation: after assigning
search_params_ and refine_ratio_ check if search_params_.n_probes >
index_params_.n_lists and if so throw a std::invalid_argument (or another
appropriate exception) with a clear message including the offending values
(e.g., mention search_params_.n_probes and index_params_.n_lists); update the
block that currently uses assert(...) to this if-check so invalid benchmark
configs are rejected at runtime (refer to the variables search_params_,
index_params_.n_lists, and refine_ratio_ to locate the code).
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu-459-460 (1)

459-460: ⚠️ Potential issue | 🟡 Minor

Remove redundant uint32_t* cast in queue.store() calls.

The explicit (uint32_t*) cast is unnecessary since params.d_topk_pids is already PID*, which is defined as uint32_t*. Removing the cast improves code clarity without affecting functionality.

Proposed fix
-      queue.store(params.d_topk_dists + output_offset,
-                  (uint32_t*)(params.d_topk_pids + output_offset));
+      queue.store(params.d_topk_dists + output_offset,
+                  params.d_topk_pids + output_offset);

Also applies to: 787-788

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu` around
lines 459-460: the calls to queue.store currently include an unnecessary cast
(uint32_t*) on params.d_topk_pids; remove the redundant cast so the call becomes
queue.store(params.d_topk_dists + output_offset, params.d_topk_pids +
output_offset). Update both occurrences (the one around queue.store(...) at the
shown location and the second occurrence around lines ~787-788) to drop the
explicit (uint32_t*) cast; retain all other arguments and behavior unchanged and
rely on PID being typedef'd to uint32_t.
cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu-37-75 (1)

37-75: ⚠️ Potential issue | 🟡 Minor

Silent UB when D is not a multiple of 4.

ComputeDistancesKernelWarp reinterprets centroid/query pointers as float4* and iterates over D_vec = D/4, but the host call site uses integer division D / 4 and passes without checking. If D is ever not a multiple of 4, the last few scalar dimensions are silently dropped (and the float4 cast reads past the valid range). In the current pipeline D = num_padded_dim is a multiple of 64, but nothing in this TU enforces that — add a defensive RAFT_EXPECTS(D % 4 == 0, ...) in ComputeCentroidsDistances (and/or at construction) to fail loudly if a future caller violates the assumption.

Also applies to: 115-134
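A scalar model of the hazard (plain C++ standing in for the float4 kernel; the function is hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Mimics the vectorized reduction: only D / 4 groups of four dimensions are
// visited, so when D % 4 != 0 the tail dimensions silently vanish from the
// result -- exactly what RAFT_EXPECTS(D % 4 == 0, ...) would reject up front.
float dot_vec4_model(const std::vector<float>& a,
                     const std::vector<float>& b,
                     std::size_t D)
{
  std::size_t D_vec = D / 4;  // integer division drops the tail
  float sum = 0.0f;
  for (std::size_t i = 0; i < D_vec * 4; ++i) { sum += a[i] * b[i]; }
  return sum;
}
```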

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu` around lines
37-75: the kernel ComputeDistancesKernelWarp assumes vectors are packed as float4
(D_vec = D/4) which causes silent UB if D is not a multiple of 4; add a
defensive runtime check using RAFT_EXPECTS(D % 4 == 0, "D must be multiple of
4") in the host-side function ComputeCentroidsDistances (and/or the
class/constructor that sets num_padded_dim) before computing/passing D_vec, and
add the same guard where similar kernels are invoked (the other occurrence
around lines 115-134) so any future caller fails loudly instead of
reading/padding past valid memory.
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu-560-594 (1)

560-594: ⚠️ Potential issue | 🟡 Minor

Threshold tightening loop skips legitimate zero distances.

In both block-sort variants, the "find max topk distance" scan filters with dist > 0 && ... && dist < INFINITY. Valid query-vector pairs can yield dist == 0.0f (e.g., identical vectors, or after numerical cancellation). When every top-k distance is 0, max_topk_dist remains -INFINITY and the threshold is never updated — which is safe but wastes the tightening opportunity; when only some slots are 0, they are excluded from the max, which is still correct but inconsistent. Consider dist >= 0 && dist < INFINITY (or just dist < INFINITY, since the init value is INFINITY) to let zero distances participate.

Also applies to: 891-924
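A host-side model of the suggested filter (illustrative only; the real scan runs per-thread over params.d_topk_dists):

```cpp
#include <limits>
#include <vector>

// Scan for the max finite top-k distance. Filtering with `d < inf` alone is
// enough: unfilled slots are initialized to +inf, and legitimate 0.0f
// distances (identical vectors) now participate instead of being skipped by
// the old `d > 0` test.
float max_topk_dist(const std::vector<float>& topk)
{
  const float inf = std::numeric_limits<float>::infinity();
  float m = -inf;
  for (float d : topk) {
    if (d < inf && d > m) { m = d; }
  }
  return m;
}
```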

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu` around lines
560-594: the max-top-k scan currently excludes zero distances via `if (dist > 0 &&
...)` so zero-valued valid distances are ignored; update the condition in the
scan that computes max_topk_dist (the loop that reads params.d_topk_dists at
output_offset using query_idx, probe_slot and params.topk) to include zeros
(e.g., `if (dist >= 0 && dist < INFINITY)`) or simply `if (dist < INFINITY)`)
and apply the same change in the other block-sort variant (the similar loop
around lines 891-924) so atomic threshold tightening uses zero distances too
before performing the `atomicMin` on threshold_ptr.
cpp/include/cuvs/neighbors/ivf_rabitq.hpp-91-97 (1)

91-97: ⚠️ Potential issue | 🟡 Minor

Document search_mode enumerators.

Each enumerator (LUT16, LUT32, QUANT4, QUANT8) affects recall/throughput/memory trade-offs materially, but the only inline doc is the group comment "A type for specifying the mode...". Please add per-value Doxygen explaining what each mode does and when to pick it; users will otherwise have to read the .cu implementation to choose a mode.

As per coding guidelines: "For public C++ API headers, additionally check: Doxygen documentation for all public functions/classes".
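A sketch of per-value documentation (the one-line descriptions below are inferred from the enumerator names and should be replaced with the actual semantics from the implementation):

```cpp
/** A type for specifying the mode used to search the IVF-RaBitQ index (sketch). */
enum class search_mode {
  /** 16-entry lookup table for distance estimation; smallest LUT footprint. */
  LUT16,
  /** 32-entry lookup table; more LUT memory, potentially better accuracy. */
  LUT32,
  /** 4-bit query quantization; favors throughput and memory. */
  QUANT4,
  /** 8-bit query quantization; favors recall over QUANT4. */
  QUANT8,
};
```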

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/cuvs/neighbors/ivf_rabitq.hpp` around lines 91-97: add Doxygen
comments for each enumerator of enum class search_mode (LUT16, LUT32, QUANT4,
QUANT8) describing what the mode does, its impact on recall/throughput/memory,
and guidance on when to choose it (e.g., LUT16: smaller LUT, fastest lookup,
lower recall; LUT32: larger LUT, higher recall but more memory; QUANT4/QUANT8:
quantized ADC trade-offs with QUANT4 being lowest memory/best speed and QUANT8
higher accuracy). Place the /// or /** */ comments directly above each
enumerator so the public API header documents per-value behavior for consumers.
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cuh-65-91 (1)

65-91: ⚠️ Potential issue | 🟡 Minor

Doxygen for SearchClusterQueryPairs is out of sync with the signature.

The @param list references all_topk_results, h_query, d_centroid, and stream, none of which are parameters of the declared function. Conversely, topk, d_final_dists, and d_final_pids are undocumented. Please bring the doc comment in line with the actual signature (and apply the same check to SearchClusterQueryPairsSharedMemOpt / SearchClusterQueryPairsQuantizeQuery, which have no Doxygen at all).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cuh` around lines
65-91: the Doxygen block above SearchClusterQueryPairs is stale: remove references
to non-existent params (all_topk_results, h_query, d_centroid, stream) and add
documentation for the actual parameters topk, d_final_dists, and d_final_pids,
describing their purpose and types to match the signature of
SearchClusterQueryPairs(const IVFGPU& cur_ivf, IVFGPU::GPUClusterMeta*
d_cluster_meta, ClusterQueryPair* d_sorted_pairs, size_t num_queries, const
float* d_query, const float* d_G_k1xSumq, const float* d_G_kbxSumq, size_t
nprobe, size_t topk, float* d_final_dists, PID* d_final_pids); also add or
update matching Doxygen for the related functions
SearchClusterQueryPairsSharedMemOpt and SearchClusterQueryPairsQuantizeQuery so
their comments reflect their exact parameter lists.
cpp/tests/neighbors/ann_ivf_rabitq.cuh-370-391 (1)

370-391: ⚠️ Potential issue | 🟡 Minor

Tests never exercise build_only() — only the post-serialize path is validated.

The comment on lines 372-373 states that deserialization is required to reorganize data for correct search. As a result, every TEST_P macro (build_serialize_search, build_host_input_serialize_search, build_forced_streaming) runs through serialize→deserialize before calling search. There is no test that confirms build_only() (the path users call when they don't persist the index) produces correct results — which means a regression that broke direct-build search would not be caught here.

If the library is expected to require a roundtrip for correctness, that should be enforced at the API level (e.g., search refuses to run on a non-finalized index) rather than documented only in a test comment. Otherwise, please add a TEST_BUILD_SEARCH variant that calls build_only() and asserts recall, so the no-persistence workflow has regression coverage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/neighbors/ann_ivf_rabitq.cuh` around lines 370-391: add a test
variant that exercises the direct-build path instead of always going through
serialize/deserialize: create a TEST_BUILD_SEARCH(type) macro (mirroring
TEST_BUILD_SERIALIZE_SEARCH) whose TEST_P registers build_search and inside
calls this->run([this]() { return this->build_only(); }); then ensure the
resulting test calls search/asserts recall the same way the other tests do;
reference the existing TEST_P macros and the build_only(), build_serialize(),
build_host_input_serialize(), build_with_forced_streaming(), and search() flow
so the new macro is wired into the same test fixture and assertions.
cpp/include/cuvs/neighbors/ivf_rabitq.hpp-293-293 (1)

293-293: ⚠️ Potential issue | 🟡 Minor

Doxygen typo: "IVF-PQ" should be "IVF-RaBitQ".

Copy-paste from the IVF-PQ docs. Also consider clarifying on the index parameter that *index must be a non-null, default-constructed ivf_rabitq::index<int64_t> (the implementation RAFT_FAILs on nullptr but does not document this precondition).

📝 Proposed fix
- * `@param`[out] index IVF-PQ index
+ * `@param`[out] index IVF-RaBitQ index (must be non-null)

As per coding guidelines: "For public C++ API headers, additionally check: Doxygen documentation for all public functions/classes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/cuvs/neighbors/ivf_rabitq.hpp` at line 293: update the Doxygen
param description to correct the algorithm name from "IVF-PQ" to "IVF-RaBitQ"
and clarify the precondition for the parameter `index`: document that the caller
must pass a non-null pointer and that `*index` must be a default-constructed
ivf_rabitq::index<int64_t> (the implementation currently RAFT_FAILs on nullptr).
Locate the doc block that currently reads "@param[out] index IVF-PQ index" and
replace it with a concise note stating "IVF-RaBitQ index; must be non-null and
point to a default-constructed ivf_rabitq::index<int64_t>" so callers know the
required precondition.
cpp/src/neighbors/ivf_rabitq.cu-222-257 (1)

222-257: ⚠️ Potential issue | 🟡 Minor

Add a fall-through guard for unknown search_mode in the dispatch chain.

The if/else-if ladder has no terminal else — if search_mode ever gains a new enumerator, search will silently produce final_ids of uninitialized values (since make_device_vector at line 220 does not zero-initialize), and the subsequent raft::linalg::map at 260-264 will emit garbage neighbor IDs without any diagnostic. The search_mode_to_string lambda already covers the new-enumerator case with RAFT_FAIL, but it runs earlier (for SearcherGPU construction), so for readers of this function the invariant is non-local. A final else { RAFT_FAIL(...); } makes the dispatch self-checking.

🛡️ Proposed guard
   } else if (params.mode == search_mode::QUANT4) {
     idx.rabitq_index().BatchClusterSearchQuantizeQuery(...,
                                                        4);
+  } else {
+    RAFT_FAIL("Invalid search mode");
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq.cu` around lines 222-257: the dispatch over
params.mode lacks a terminal guard: after the existing branches handling
search_mode::LUT32, LUT16, QUANT8, and QUANT4, add a final else that calls
RAFT_FAIL (or similar) to fail fast for unknown search_mode values (use
search_mode_to_string(params.mode) if available for a readable message); this
ensures BatchClusterSearch* branches cannot silently leave final_ids
uninitialized.
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu-947-952 (1)

947-952: ⚠️ Potential issue | 🟡 Minor

Pre-fill the LUT with 0.0f instead of -infinity, or remove the pre-fill entirely.

Filling the LUT buffer with -std::numeric_limits<float>::infinity() is risky: if precomputeAllLUTs ever leaves any entry unwritten (e.g., due to a mismatch between host-side cur_ivf.get_num_padded_dim() and the searcher's D, kernel launch failure, or future refactor), those entries silently leak -inf into the IP accumulation and produce corrupted distances without any indication. Since the kernel currently writes every entry covered by the searcher's D, either drop the fill entirely or, if kept defensively, use 0.0f which degrades gracefully.

🔒️ Proposed change
   rmm::device_uvector<float> d_lut_for_queries(lut_size / sizeof(float), stream_);
-  thrust::fill(thrust::cuda::par.on(stream_),
-               d_lut_for_queries.data(),
-               d_lut_for_queries.data() + d_lut_for_queries.size(),
-               -std::numeric_limits<float>::infinity());
   // precompute LUTS
   launchPrecomputeLUTs(d_query, d_lut_for_queries.data(), num_queries, D, stream_);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu` around lines
947-952: the LUT buffer is being pre-filled with
-std::numeric_limits<float>::infinity() which can silently corrupt inner-product
accumulations if any entry remains unwritten; change the pre-fill for
d_lut_for_queries (used before launchPrecomputeLUTs) to use 0.0f instead (or
remove the thrust::fill call entirely) so unwritten entries degrade gracefully;
update the thrust::fill target d_lut_for_queries.data() to use 0.0f and keep
launchPrecomputeLUTs(d_query, d_lut_for_queries.data(), num_queries, D, stream_)
as-is.
cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu-528-540 (1)

528-540: ⚠️ Potential issue | 🟡 Minor

Remove the commented debug line.

Line 530 contains a stale debug artifact (// ex_dist = ex_dist+1;) that should be deleted. The all-threads invariant for queue.add() is correctly implemented: the else branch feeds dummy values (INFINITY/0) to ensure every thread calls queue.add() exactly once per round, which matches the block-sort requirement. The comment at line 539 already affirms this invariant.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu` around lines
528-540: remove the stale debug comment line that reads "//                ex_dist =
ex_dist+1;" located near the calculation of ex_dist and pid; this is a leftover
artifact and should be deleted so the code only contains the real assignment to
ex_dist, the pid assignment from params.d_pids[global_vec_idx], and the
subsequent queue.add(ex_dist, pid) call (ensure references to ex_dist, pid and
queue.add remain unchanged).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 3c486d41-58fb-412e-9cf6-6c82835e1bbf

📥 Commits

Reviewing files that changed from the base of the PR and between f2bffb6 and e28491b.

📒 Files selected for processing (36)
  • cpp/CMakeLists.txt
  • cpp/bench/ann/CMakeLists.txt
  • cpp/bench/ann/src/cuvs/cuvs_ann_bench_param_parser.h
  • cpp/bench/ann/src/cuvs/cuvs_benchmark.cu
  • cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq.cu
  • cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h
  • cpp/bench/ann/src/faiss/faiss_cpu_benchmark.cpp
  • cpp/bench/ann/src/faiss/faiss_cpu_wrapper.h
  • cpp/include/cuvs/neighbors/ivf_rabitq.hpp
  • cpp/src/neighbors/ivf_rabitq.cu
  • cpp/src/neighbors/ivf_rabitq/defines.hpp
  • cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/quantizer_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/rotator_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/rotator_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_common.cuh
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu
  • cpp/src/neighbors/ivf_rabitq/utils/IO.hpp
  • cpp/src/neighbors/ivf_rabitq/utils/StopW.hpp
  • cpp/src/neighbors/ivf_rabitq/utils/memory.hpp
  • cpp/src/neighbors/ivf_rabitq/utils/searcher_gpu_utils.cu
  • cpp/src/neighbors/ivf_rabitq/utils/searcher_gpu_utils.hpp
  • cpp/src/neighbors/ivf_rabitq/utils/space.hpp
  • cpp/src/neighbors/ivf_rabitq/utils/tools.hpp
  • cpp/tests/CMakeLists.txt
  • cpp/tests/neighbors/ann_ivf_rabitq.cuh
  • cpp/tests/neighbors/ann_ivf_rabitq/test_float_int64_t.cu
  • python/cuvs/cuvs/tests/test_compute_matrix_product.py
  • python/cuvs_bench/cuvs_bench/config/algorithms.yaml

Comment on lines +150 to +180
// Replace the second block with:
{
  // Sort distances and candidate IDs by distance using CUB

  // Determine temporary device storage requirements
  size_t temp_storage_bytes = 0;
  cub::DeviceRadixSort::SortPairs(nullptr,
                                  temp_storage_bytes,
                                  d_distances.data(),
                                  d_distances.data(),  // keys (in-place sort)
                                  d_candidate_ids.data(),
                                  d_candidate_ids.data(),  // values (in-place sort)
                                  K,
                                  0,
                                  32,
                                  stream_);

  rmm::device_buffer d_temp_storage(temp_storage_bytes, stream_);

  // Run sorting operation
  cub::DeviceRadixSort::SortPairs(d_temp_storage.data(),
                                  temp_storage_bytes,
                                  d_distances.data(),
                                  d_distances.data(),  // keys (in-place sort)
                                  d_candidate_ids.data(),
                                  d_candidate_ids.data(),  // values (in-place sort)
                                  K,
                                  0,
                                  32,
                                  stream_);
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Does cub::DeviceRadixSort::SortPairs support in-place sorting where d_keys_in and d_keys_out point to the same buffer?

💡 Result:

No, cub::DeviceRadixSort::SortPairs does not support in-place sorting where d_keys_in and d_keys_out point to the same buffer. The official documentation explicitly states: "In-place operations are not supported. There must be no overlap between any of the provided ranges: [d_keys_in, d_keys_in + num_items) and [d_keys_out, d_keys_out + num_items)." For in-place-like sorting with reduced temporary storage (~N auxiliary storage), use the overload with DoubleBuffer<KeyT> &d_keys and DoubleBuffer<ValueT> &d_values, which requires two separate buffers for keys (and values) that can be ping-ponged during the multi-pass radix sort.



Undefined behavior: cub::DeviceRadixSort::SortPairs does not support in-place (aliased) buffers.

Both calls pass d_distances.data() as both d_keys_in and d_keys_out, and d_candidate_ids.data() as both d_values_in and d_values_out. Per the CCCL documentation, "In-place operations are not supported. There must be no overlap between any of the provided ranges: [d_keys_in, d_keys_in + num_items) and [d_keys_out, d_keys_out + num_items)." Aliasing input and output produces undefined behavior — results may be silently wrong on some architectures, CUB versions, or under memory-pressure conditions, even if it appears to work today.

Either:

  1. Allocate distinct output buffers and swap, or
  2. Switch to the cub::DoubleBuffer-based overload, which is explicitly designed for ping-pong sorting, or
  3. Use thrust::sort_by_key (which handles the buffering internally).
🐛 Proposed fix using separate output buffers
+    rmm::device_uvector<float> d_distances_sorted(K, stream_);
+    rmm::device_uvector<PID>   d_candidate_ids_sorted(K, stream_);
+
     // Determine temporary device storage requirements
     size_t temp_storage_bytes = 0;
     cub::DeviceRadixSort::SortPairs(nullptr,
                                     temp_storage_bytes,
                                     d_distances.data(),
-                                    d_distances.data(),  // keys (in-place sort)
+                                    d_distances_sorted.data(),
                                     d_candidate_ids.data(),
-                                    d_candidate_ids.data(),  // values (in-place sort)
+                                    d_candidate_ids_sorted.data(),
                                     K,
                                     0,
                                     32,
                                     stream_);

     rmm::device_buffer d_temp_storage(temp_storage_bytes, stream_);

     // Run sorting operation
     cub::DeviceRadixSort::SortPairs(d_temp_storage.data(),
                                     temp_storage_bytes,
                                     d_distances.data(),
-                                    d_distances.data(),  // keys (in-place sort)
+                                    d_distances_sorted.data(),
                                     d_candidate_ids.data(),
-                                    d_candidate_ids.data(),  // values (in-place sort)
+                                    d_candidate_ids_sorted.data(),
                                     K,
                                     0,
                                     32,
                                     stream_);
+    // swap/use the sorted buffers downstream
+    d_distances.swap(d_distances_sorted);
+    d_candidate_ids.swap(d_candidate_ids_sorted);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

    // Replace the second block with:
    {
      // Sort distances and candidate IDs by distance using CUB
      // Determine temporary device storage requirements
      size_t temp_storage_bytes = 0;
      cub::DeviceRadixSort::SortPairs(nullptr,
                                      temp_storage_bytes,
                                      d_distances.data(),
                                      d_distances.data(),  // keys (in-place sort)
                                      d_candidate_ids.data(),
                                      d_candidate_ids.data(),  // values (in-place sort)
                                      K,
                                      0,
                                      32,
                                      stream_);
      rmm::device_buffer d_temp_storage(temp_storage_bytes, stream_);
      // Run sorting operation
      cub::DeviceRadixSort::SortPairs(d_temp_storage.data(),
                                      temp_storage_bytes,
                                      d_distances.data(),
                                      d_distances.data(),  // keys (in-place sort)
                                      d_candidate_ids.data(),
                                      d_candidate_ids.data(),  // values (in-place sort)
                                      K,
                                      0,
                                      32,
                                      stream_);
    }

    // Replace the second block with:
    {
      // Sort distances and candidate IDs by distance using CUB
      rmm::device_uvector<float> d_distances_sorted(K, stream_);
      rmm::device_uvector<PID> d_candidate_ids_sorted(K, stream_);
      // Determine temporary device storage requirements
      size_t temp_storage_bytes = 0;
      cub::DeviceRadixSort::SortPairs(nullptr,
                                      temp_storage_bytes,
                                      d_distances.data(),
                                      d_distances_sorted.data(),
                                      d_candidate_ids.data(),
                                      d_candidate_ids_sorted.data(),
                                      K,
                                      0,
                                      32,
                                      stream_);
      rmm::device_buffer d_temp_storage(temp_storage_bytes, stream_);
      // Run sorting operation
      cub::DeviceRadixSort::SortPairs(d_temp_storage.data(),
                                      temp_storage_bytes,
                                      d_distances.data(),
                                      d_distances_sorted.data(),
                                      d_candidate_ids.data(),
                                      d_candidate_ids_sorted.data(),
                                      K,
                                      0,
                                      32,
                                      stream_);
      // swap/use the sorted buffers downstream
      d_distances.swap(d_distances_sorted);
      d_candidate_ids.swap(d_candidate_ids_sorted);
    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/initializer_gpu.cu` around lines 150 -
180, The calls to cub::DeviceRadixSort::SortPairs use the same device pointers
for input and output (d_distances and d_candidate_ids), which is undefined
because SortPairs does not support in-place/aliased buffers; fix by providing
non-overlapping output buffers or using CUB's DoubleBuffer overload: allocate
separate device buffers (e.g., d_distances_out and d_candidate_ids_out) of size
K, pass those as the output pointers to the SortPairs temporary-storage and run
calls, then swap or copy results back into d_distances/d_candidate_ids as
needed; alternatively replace the two-call pattern with the cub::DeviceRadixSort
overload that accepts cub::DoubleBuffer for keys and values and use that to
perform ping-pong sorting without aliasing.
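For option 2, the part that usually trips people up is that after a ping-pong pass the valid output lives in whichever buffer `Current()` selects. A host-side mock of the selector semantics (illustrative only; the real `cub::DoubleBuffer` lives in CUB's headers, and `mock_sort_pass` is a hypothetical stand-in for a `DeviceRadixSort` pass):

```cpp
#include <algorithm>
#include <cstddef>

// Host-side mock of cub::DoubleBuffer's selector semantics, illustrative only.
// The point: after a ping-pong sort pass the library flips `selector`, so
// valid output must always be read through Current(), never a raw pointer.
template <class T>
struct MockDoubleBuffer {
  T* bufs[2];
  int selector = 0;
  MockDoubleBuffer(T* a, T* b) : bufs{a, b} {}
  T* Current() { return bufs[selector]; }
  T* Alternate() { return bufs[selector ^ 1]; }
};

// Mimics one SortPairs pass: sorted keys land in the alternate buffer,
// then the selector flips, as the DoubleBuffer overload of DeviceRadixSort does.
template <class T>
void mock_sort_pass(MockDoubleBuffer<T>& keys, std::size_t n) {
  std::copy(keys.Current(), keys.Current() + n, keys.Alternate());
  std::sort(keys.Alternate(), keys.Alternate() + n);
  keys.selector ^= 1;
}
```

The same discipline applies on device: downstream kernels must consume `d_keys.Current()` after the sort, not the buffer that happened to hold the input.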

Comment on lines +136 to +150
  auto read_into_device_host = [&](void* d_ptr, void* h_ptr, size_t n_bytes) {
    std::vector<std::uint8_t> h_buf(n_bytes);  // host staging buffer
    auto before = input.tellg();
    input.read(reinterpret_cast<char*>(h_buf.data()), n_bytes);
    auto got = static_cast<size_t>(input.gcount());
    if (got != n_bytes) {
      std::ostringstream oss;
      oss << "unexpected EOF: wanted " << n_bytes << " bytes at offset " << before << ", got "
          << got << (input.eof() ? " (hit EOF)" : "") << (input.bad() ? " (I/O error)" : "");
    }

    raft::copy(static_cast<uint8_t*>(d_ptr), h_buf.data(), n_bytes, stream_);
    raft::resource::sync_stream(handle_);
    memcpy(h_ptr, h_buf.data(), n_bytes);
  };
⚠️ Potential issue | 🔴 Critical

Silent short-read in read_into_device_host.

On an EOF/partial read, the error message is composed into a local std::ostringstream and then dropped on the floor — no throw, no log, no return. The lambda proceeds to raft::copy and memcpy of garbage host-staging data and the caller sees a corrupted index with no indication of failure. Compare with read_into_device and read_into_device_host_transposed_short at lines 129 and 157, which both throw.

🐛 Proposed fix
     if (got != n_bytes) {
       std::ostringstream oss;
       oss << "unexpected EOF: wanted " << n_bytes << " bytes at offset " << before << ", got "
           << got << (input.eof() ? " (hit EOF)" : "") << (input.bad() ? " (I/O error)" : "");
+      throw std::runtime_error(oss.str());
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu` around lines 136 - 150,
The lambda read_into_device_host silently ignores partial/EOF reads: build the
error string exactly as done now but then throw an exception (e.g., throw
std::runtime_error(oss.str())) instead of dropping the ostringstream, and return
early so the subsequent raft::copy and memcpy are not executed on truncated
data; mirror the behavior used by read_into_device and
read_into_device_host_transposed_short to ensure callers receive an error on
short reads.
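The guard is easy to factor out so every reader shares it. A minimal host-side sketch (the helper name `read_exact` is hypothetical, not from the PR):

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>

// Read exactly n_bytes from the stream or throw, so callers never proceed
// with a partially filled staging buffer after an EOF or I/O error.
inline void read_exact(std::istream& in, char* dst, std::size_t n_bytes) {
  auto before = in.tellg();
  in.read(dst, static_cast<std::streamsize>(n_bytes));
  auto got = static_cast<std::size_t>(in.gcount());
  if (got != n_bytes) {
    std::ostringstream oss;
    oss << "unexpected EOF: wanted " << n_bytes << " bytes at offset " << before
        << ", got " << got;
    throw std::runtime_error(oss.str());  // surface the error instead of dropping it
  }
}
```

With this shape, the device copy and host memcpy simply never run on truncated data, matching the behavior of `read_into_device` and `read_into_device_host_transposed_short`.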

Comment on lines +728 to +790
  auto batch_pids = raft::make_host_vector<PID, int64_t>(batch_size_vectors);
  auto batch_data = raft::make_host_vector<float, int64_t>(batch_size_vectors * num_dimensions);
  auto d_batch_data =
    raft::make_device_vector<float, int64_t>(handle_, batch_size_vectors * num_dimensions);

  size_t batch_count = 0;

  // Calculate number of threads for host data gathering
  const size_t num_threads = omp_get_max_threads();

  size_t cluster_idx = 0;

  while (cluster_idx < num_centroids) {
    batch_count++;
    // Determine batch: complete clusters only
    size_t batch_start_cluster = cluster_idx;
    size_t batch_vectors = 0;
    size_t batch_start_offset = h_cluster_meta[cluster_idx].start_index;

    while (cluster_idx < num_centroids &&
           batch_vectors + h_cluster_meta[cluster_idx].num <= batch_size_vectors) {
      batch_vectors += h_cluster_meta[cluster_idx].num;
      cluster_idx++;
    }

    // Handle case where single cluster exceeds batch size
    if (batch_vectors == 0 && cluster_idx < num_centroids) {
      batch_vectors = h_cluster_meta[cluster_idx].num;
      cluster_idx++;
    }

    if (batch_vectors == 0) break;  // No more clusters to process

    // -------------------------
    // 11. Gather and transfer batch data to GPU using reordered PIDs
    // -------------------------

    // First, copy the reordered PIDs for this batch to host (reuse preallocated buffer)
    raft::copy(batch_pids.data_handle(), d_flat_pids + batch_start_offset, batch_vectors, stream_);
    raft::resource::sync_stream(handle_);

    // Gather data on host using the reordered PIDs (reuse preallocated buffer)
    // Use OpenMP parallel region to leverage persistent thread pool across batches
#pragma omp parallel num_threads(num_threads)
    {
      int tid = omp_get_thread_num();
      size_t chunk_size = batch_vectors / num_threads;
      size_t start = tid * chunk_size;
      size_t end = (tid == num_threads - 1) ? batch_vectors : start + chunk_size;

      for (size_t i = start; i < end; ++i) {
        std::memcpy(&batch_data.data_handle()[i * num_dimensions],
                    &host_data[batch_pids.data_handle()[i] * num_dimensions],
                    num_dimensions * sizeof(float));
      }
    }
    // OpenMP implicit barrier ensures all threads complete before continuing

    // Transfer batch to GPU (reuse preallocated buffer)
    raft::copy(d_batch_data.data_handle(),
               batch_data.data_handle(),
               batch_vectors * num_dimensions,
               stream_);
⚠️ Potential issue | 🔴 Critical

Out-of-bounds write when a single cluster exceeds batch_size_vectors.

batch_pids is preallocated with capacity batch_size_vectors, and batch_data / d_batch_data with capacity batch_size_vectors * num_dimensions. The cluster-packing loop at lines 747-757 has an explicit "single cluster exceeds batch size" fallback that sets batch_vectors = h_cluster_meta[cluster_idx].num, which may be larger than batch_size_vectors. The subsequent raft::copy (line 766), the OpenMP memcpy loop (lines 778-782), and the device copy (lines 787-790) will then write past the end of the preallocated buffers, resulting in a host heap overflow and a device OOB write.

Either grow the buffers on demand when an oversize cluster is encountered, or split that cluster across multiple batch iterations. If the contract is "caller must set batch_size_vectors ≥ max cluster size", that invariant needs to be validated explicitly (and max_cluster_length is already known on line 714).

🛡️ Proposed minimum safeguard
   size_t cluster_idx = 0;
 
+  RAFT_EXPECTS(max_cluster_length <= batch_size_vectors,
+               "construct_on_gpu_streaming: batch_size_vectors (%zu) must be >= max cluster size (%zu)",
+               batch_size_vectors, max_cluster_length);
+
   while (cluster_idx < num_centroids) {

A more robust fix grows batch_pids / batch_data / d_batch_data to max(batch_size_vectors, max_cluster_length) up front, or reallocates lazily inside the fallback branch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/ivf_gpu.cu` around lines 728 - 790,
The code allows a single cluster to set batch_vectors > batch_size_vectors
causing out-of-bounds writes into batch_pids, batch_data and d_batch_data; fix
by ensuring the host/device buffers are sized to at least
max(batch_size_vectors, max_cluster_length) (or grow/reallocate them lazily when
the fallback branch that sets batch_vectors = h_cluster_meta[cluster_idx].num
runs) before calling raft::copy or performing the OpenMP memcpy and the device
copy; update allocation of raft::make_host_vector for batch_pids and batch_data
and raft::make_device_vector for d_batch_data (or add a resize branch
referencing batch_size_vectors, max_cluster_length, batch_pids, batch_data,
d_batch_data) so no raft::copy or memcpy writes past the buffers.
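The "split oversize clusters" alternative can be verified in isolation. A host-side sketch of capacity-bounded packing (function name `pack_batches` is hypothetical; it models only the batch-size bookkeeping, not the actual copies):

```cpp
#include <cstddef>
#include <vector>

// Pack cluster sizes into batches of at most `capacity` vectors.
// Complete clusters are preferred; a cluster larger than the capacity is
// split across batches instead of overflowing the preallocated buffers.
std::vector<std::size_t> pack_batches(const std::vector<std::size_t>& cluster_sizes,
                                      std::size_t capacity) {
  std::vector<std::size_t> batches;
  std::size_t cur = 0;
  for (std::size_t n : cluster_sizes) {
    if (n > capacity) {
      // Oversize cluster: flush the current batch, then emit full-capacity chunks.
      if (cur > 0) { batches.push_back(cur); cur = 0; }
      while (n > capacity) { batches.push_back(capacity); n -= capacity; }
      cur = n;  // remainder starts the next batch
    } else if (cur + n > capacity) {
      batches.push_back(cur);  // cluster doesn't fit; start a new batch
      cur = n;
    } else {
      cur += n;
    }
  }
  if (cur > 0) { batches.push_back(cur); }
  return batches;
}
```

Every emitted batch is then provably within `batch_size_vectors`, so the preallocated `batch_pids` / `batch_data` / `d_batch_data` buffers are never overrun regardless of the cluster-size distribution.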

Comment on lines +353 to +358
        if (is_candidate) {
          int candidate_slot = atomicAdd(&num_candidates, 1);
          if (candidate_slot < params.max_candidates_per_pair) {
            shared_candidate_ips[candidate_slot] = local_ip_quantized;
            shared_candidate_indices[candidate_slot] = vec_idx;
          }
⚠️ Potential issue | 🔴 Critical

Clamp num_candidates to the number actually stored.

All block-sort variants increment num_candidates even when candidate_slot exceeds params.max_candidates_per_pair. Later loops index shared_candidate_indices[cand_idx] up to num_candidates, so overflowed candidates read unwritten shared memory.

Proposed fix pattern
   __syncthreads();
+  const int stored_candidates = min(num_candidates, params.max_candidates_per_pair);
 
-  if (num_candidates > 0) {
+  if (stored_candidates > 0) {
     ...
-    const int candidates_per_thread = (num_candidates + num_threads - 1) / num_threads;
+    const int candidates_per_thread = (stored_candidates + num_threads - 1) / num_threads;
     ...
-    if (cand_idx < num_candidates && cand_idx < params.max_candidates_per_pair) {
+    if (cand_idx < stored_candidates) {

Use stored_candidates consistently for IP2 loops, queue rounds, and threshold updates.

Also applies to: 438-439, 471-479, 664-669, 691-699, 895-900, 982-983, 1015-1023, 1204-1209, 1231-1239

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu` around
lines 353 - 358, The code increments num_candidates via atomicAdd even when
candidate_slot >= params.max_candidates_per_pair, causing later loops to read
unwritten shared_candidate_indices; clamp num_candidates to the actual number
written (e.g., compute and use stored_candidates = min(num_candidates,
params.max_candidates_per_pair)) and replace uses of num_candidates in
subsequent IP2 loops, queue round iterations, and threshold-update logic with
stored_candidates so indexing into shared_candidate_indices/shared_candidate_ips
only goes up to the stored count; update all occurrences mentioned (the blocks
around the atomicAdd, the IP2 loops, queue rounds, and threshold update
sections) to use the stored_candidates variable consistently.
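The collect-then-clamp pattern is easy to check on the host. A minimal CPU sketch (a plain counter stands in for the device-side atomicAdd on shared memory; names are hypothetical):

```cpp
#include <algorithm>
#include <vector>

// CPU model of the candidate-collection phase: the counter keeps growing past
// max_slots, but only slots < max_slots are actually written, so readers must
// iterate with the clamped count, never the raw counter.
int collect_candidates(int n_accepted, int max_slots, std::vector<int>& out) {
  out.assign(max_slots, -1);  // stands in for uninitialized shared memory
  int num_candidates = 0;
  for (int i = 0; i < n_accepted; ++i) {
    int slot = num_candidates++;          // stands in for atomicAdd(&num_candidates, 1)
    if (slot < max_slots) out[slot] = i;  // overflowed candidates are dropped
  }
  return std::min(num_candidates, max_slots);  // stored_candidates for later loops
}
```

Any loop bounded by the raw counter instead of the returned value would read the `-1` sentinel slots, which is exactly the shared-memory OOB read flagged above.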

Comment on lines +520 to +521
      queue.store(params.d_topk_dists + output_offset,
                  (uint32_t*)(params.d_topk_pids + output_offset));
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n -C3 'queue\.store\(.*\n\s*\(uint32_t\*\)\(params\.d_topk_pids'
rg -n -C3 'cuvs_ivf_rabitq<float,\s*int64_t>|index<int64_t>|using\s+PID|typedef.*PID'

Repository: rapidsai/cuvs

Length of output: 50578


🏁 Script executed:

# Check file size and read relevant sections
wc -l cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu

Repository: rapidsai/cuvs

Length of output: 131


🏁 Script executed:

# Read around line 520-521
sed -n '515,525p' cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu

# Read around line 754-755
sed -n '750,760p' cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu

# Read around line 1064-1065
sed -n '1060,1070p' cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu

# Read around line 1294-1295
sed -n '1290,1300p' cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu

Repository: rapidsai/cuvs

Length of output: 2026


🏁 Script executed:

# Look at file header to understand template parameters
head -100 cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu

Repository: rapidsai/cuvs

Length of output: 3375


🏁 Script executed:

# Check for the struct/class definition and IdxT usage
rg -n 'struct Searcher|class Searcher|template.*IdxT' cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu | head -20

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

# Look for template declaration and params structure
rg -n 'template.*Searcher|struct.*Params' cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu | head -30

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

# Look for d_topk_pids declaration in params or search params structure
rg -n 'd_topk_pids' cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu

Repository: rapidsai/cuvs

Length of output: 689


🏁 Script executed:

# Check the header files for params structure definition
fd -t f 'searcher_gpu.*\.c' cpp/src/neighbors/ivf_rabitq/gpu_index/ -x cat {}

Repository: rapidsai/cuvs

Length of output: 50370


🏁 Script executed:

# Look at the searcher_gpu.cuh header
head -200 cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cuh | tail -100

Repository: rapidsai/cuvs

Length of output: 4623


🏁 Script executed:

# Find params structure
rg -n 'struct.*Params.*topk_pids' cpp/src/neighbors/ivf_rabitq/ -A 5

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

# Get broader search for Params structure with d_topk_pids
rg -n 'd_topk_pids' cpp/src/neighbors/ivf_rabitq/ -B 5 -A 1 | head -100

Repository: rapidsai/cuvs

Length of output: 10177


Remove unsafe type casts in queue.store() calls that force PID pointers through uint32_t*.

The casts (uint32_t*) in queue.store() calls hide type mismatches when IdxT is templated as int64_t. With d_topk_pids allocated as raft::make_device_vector<PID, int64_t> where PID is typedef'd as uint32_t, these explicit casts bypass type safety. Removing the casts exposes type consistency issues at compile time rather than risking silent data corruption at runtime.

Affected locations

Lines 521, 755, 1065, 1295 in cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu

Similar issues exist in:

  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu (lines 460, 788)
  • cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu.cu (lines 558, 888)
Proposed fix
-      queue.store(params.d_topk_dists + output_offset,
-                  (uint32_t*)(params.d_topk_pids + output_offset));
+      queue.store(params.d_topk_dists + output_offset,
+                  params.d_topk_pids + output_offset);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_quantize_query.cu` around
lines 520 - 521, Remove the unsafe (uint32_t*) casts in the queue.store(...)
calls and make the pointer types consistent: pass params.d_topk_pids +
output_offset (no cast) and update queue.store's signature/overloads or the
PID/IdxT typedefs so the stored pointer type matches the templated index type
(IdxT) used by the queue; ensure PID (the element type of params.d_topk_pids)
and the queue.store parameter use the same type (e.g., PID* or IdxT*) to expose
mismatches at compile time and prevent silent corruption—apply the same change
to all queue.store occurrences referenced (e.g., in
searcher_gpu_quantize_query.cu, searcher_gpu_shared_mem_opt.cu,
searcher_gpu.cu).
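The hazard is concrete, not just stylistic: if the element type of the output array is ever widened to a 64-bit index, writing through a `uint32_t*` uses the wrong element stride. A small host demo (deliberately wrong code, illustrative only; not from the PR):

```cpp
#include <cstdint>
#include <vector>

// Demonstrates why forcing an id array through uint32_t* is dangerous when the
// storage element type is int64_t: the cast halves the element stride, so the
// writes pack two 32-bit ids into each 64-bit slot and leave the tail unwritten.
std::vector<int64_t> store_through_cast(const std::vector<uint32_t>& ids) {
  std::vector<int64_t> out(ids.size(), 0);
  auto* alias = reinterpret_cast<uint32_t*>(out.data());     // the unsafe cast
  for (std::size_t i = 0; i < ids.size(); ++i) alias[i] = ids[i];  // 4-byte stride
  return out;  // only the first half of the buffer was written
}
```

Removing the cast turns this silent corruption into a compile-time type mismatch, which is the whole point of the suggestion.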

Comment on lines +332 to +338
        if (is_candidate) {
          int candidate_slot = atomicAdd(&num_candidates, 1);
          if (candidate_slot < params.max_candidates_per_pair) {
            shared_candidate_ips[candidate_slot] = local_ip;
            shared_candidate_indices[candidate_slot] = vec_idx;
          }
        }
⚠️ Potential issue | 🔴 Critical

Clamp candidate count before reading shared candidate arrays.

num_candidates keeps increasing after candidate_slot >= params.max_candidates_per_pair, but later loops iterate up to num_candidates and read entries that were never written. A dense cluster or loose threshold can turn this into shared-memory OOB reads and corrupt top-k results.

Proposed fix pattern
   __syncthreads();
+  const int stored_candidates = min(num_candidates, params.max_candidates_per_pair);
 
-  if (num_candidates > 0) {
+  if (stored_candidates > 0) {
     ...
-    for (int cand_idx = warp_id; cand_idx < num_candidates; cand_idx += num_warps) {
+    for (int cand_idx = warp_id; cand_idx < stored_candidates; cand_idx += num_warps) {
       ...
     }
     ...
-    const int adds_per_thread = (num_candidates + num_threads - 1) / num_threads;
+    const int adds_per_thread = (stored_candidates + num_threads - 1) / num_threads;

Apply the same cap to the NoEX block-sort path and threshold-update checks.

Also applies to: 378-379, 411-419, 735-742, 758-766

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/gpu_index/searcher_gpu_shared_mem_opt.cu` around
lines 332 - 338, The bug is that num_candidates can grow past
params.max_candidates_per_pair so later loops read unwritten
shared_candidate_ips/shared_candidate_indices; after the candidate-collection
phase, clamp num_candidates = min(num_candidates,
params.max_candidates_per_pair) (or compute a local bounded_count =
min(num_candidates, params.max_candidates_per_pair)) and use that bounded count
for any subsequent loops or threshold-update checks that iterate over
shared_candidate_ips/shared_candidate_indices; apply the same clamp in the NoEX
block-sort path and any threshold-update code paths that currently iterate up to
num_candidates to avoid shared-memory OOB reads.

Comment on lines +22 to +29
template <size_t alignment, class T, bool HUGE_PAGE = false>
inline T* align_mm(size_t nbytes)
{
  size_t size = rd_up_to_multiple_of(nbytes, alignment);
  void* p     = std::aligned_alloc(alignment, size);
  if (HUGE_PAGE) { madvise(p, nbytes, MADV_HUGEPAGE); }
  std::memset(p, 0, size);
  return static_cast<T*>(p);
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

C++17 std::aligned_alloc requirements: alignment constraints, size multiple of alignment, and null return on allocation failure

💡 Result:

In C++17, std::aligned_alloc(alignment, size) has the following requirements:

  - alignment must be a valid alignment supported by the implementation (e.g., a power of two; a multiple of sizeof(void*) under POSIX; fundamental alignments are always supported).
  - size must be an integral multiple of alignment.
  - If either requirement is violated, or allocation fails, it returns a null pointer.
  - On success, it returns a pointer to size bytes of uninitialized storage aligned to at least alignment, which must be freed with std::free or std::realloc.


🏁 Script executed:

find . -name "memory.hpp" | grep -i ivf_rabitq

Repository: rapidsai/cuvs

Length of output: 104


🏁 Script executed:

cat -n cpp/src/neighbors/ivf_rabitq/utils/memory.hpp

Repository: rapidsai/cuvs

Length of output: 1098


🏁 Script executed:

# Search for any usages of align_mm to understand impact
rg "align_mm" --type cpp --type hpp -B 2 -A 2

Repository: rapidsai/cuvs

Length of output: 84


🏁 Script executed:

# Check if there's any null handling or error handling patterns in the codebase
rg "std::aligned_alloc" --type cpp --type hpp -B 3 -A 3

Repository: rapidsai/cuvs

Length of output: 84


🏁 Script executed:

# Search for align_mm usages
rg "align_mm" cpp/src/neighbors/ivf_rabitq/ -B 2 -A 2

Repository: rapidsai/cuvs

Length of output: 1113


🏁 Script executed:

# Check the rd_up_to_multiple_of function in tools.hpp
rg "rd_up_to_multiple_of" cpp/src/neighbors/ivf_rabitq/ -B 2 -A 5

Repository: rapidsai/cuvs

Length of output: 5418


🏁 Script executed:

# Check if there's any null handling pattern in the calling code
rg "std::aligned_alloc|align_mm" cpp/src/neighbors/ivf_rabitq/ --type-list | head -20

Repository: rapidsai/cuvs

Length of output: 446


🏁 Script executed:

# Simpler search for all files that use align_mm
find cpp/src/neighbors/ivf_rabitq -type f -name "*.cpp" -o -name "*.hpp" | xargs grep -l "align_mm" 2>/dev/null

Repository: rapidsai/cuvs

Length of output: 102


Add null pointer check after allocation and use correct size for madvise.

std::aligned_alloc returns nullptr on failure; the current code calls madvise and memset on this null pointer, causing a crash. Additionally, madvise should use the rounded allocation size, not the input nbytes, to cover the entire allocated region. The function should also enforce alignment preconditions at compile time and handle the edge case where rounding yields zero bytes.

Proposed fix
+#include <new>
+
 namespace memory {
 template <size_t alignment, class T, bool HUGE_PAGE = false>
 inline T* align_mm(size_t nbytes)
 {
+  static_assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
+  static_assert(alignment % alignof(void*) == 0);
+
   size_t size = rd_up_to_multiple_of(nbytes, alignment);
+  if (size == 0) { size = alignment; }
   void* p     = std::aligned_alloc(alignment, size);
-  if (HUGE_PAGE) { madvise(p, nbytes, MADV_HUGEPAGE); }
+  if (p == nullptr) { throw std::bad_alloc{}; }
+  if constexpr (HUGE_PAGE) { madvise(p, size, MADV_HUGEPAGE); }
   std::memset(p, 0, size);
   return static_cast<T*>(p);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_rabitq/utils/memory.hpp` around lines 22 - 29, In
align_mm<size_t alignment, class T, bool HUGE_PAGE> (function align_mm) add a
compile-time check for valid alignment (e.g., static_assert that alignment is a
non-zero power of two and a multiple of alignof(T)), compute size =
rd_up_to_multiple_of(nbytes, alignment) and if size == 0 return nullptr
immediately, call std::aligned_alloc only after those checks, then check the
returned pointer for nullptr before calling madvise or std::memset, and when
calling madvise use the rounded allocation size (size) instead of the original
nbytes; ensure you still cast and return static_cast<T*>(p) only after the
nullptr check.

  for (int j = tid; j < D; j += BlockSize) {
    s_xp_norm[j] = d_XP_norm[row * D + j];
  }
  __syncthreads();
Contributor

Unnecessary sync: each thread accesses only the data it wrote to smem just before. A thread won't read values written by another thread.

}

/**
 * @defgroup ivf_rabitq_cpp_index_params IVF-RaBitQ index build parameters
Contributor

Add all declared groups to doxygen

Comment on lines +113 to +115
  raft::device_matrix<float, int64_t, raft::row_major> centroids_ =
    raft::make_device_matrix<float, int64_t, raft::row_major>(
      handle_, K, D);  // Stored in GPU device memory. Points to the parent centroids' array
Contributor

It should be fine to remove that default and only keep the constructor initialization of centroids_

    centroids_(raft::make_device_matrix<float, int64_t, raft::row_major>(handle_, K, D))
  {
    dist_func = L2SqrGPU;
    raft::resource::sync_stream(handle_);
Contributor

Unnecessary sync

Comment on lines +146 to +151
    // Initialize candidate IDs as a sequence 0,1,...,K-1 using a custom kernel
    int block_size = 256;
    int grid_size  = (K + block_size - 1) / block_size;
    init_sequence_kernel<<<grid_size, block_size, 0, stream_>>>(d_candidate_ids, K);
    RAFT_CUDA_TRY(cudaPeekAtLastError());
    raft::resource::sync_stream(handle_);  // Wait for kernel completion
Contributor

Prefer raft::linalg::range

      const_cast<float*>(rotation_matrix_.data_handle()), D, D),
    raft::make_device_matrix_view<float, int64_t, raft::col_major>(const_cast<float*>(d_A), D, N),
    raft::make_device_matrix_view<float, int64_t, raft::col_major>(d_RAND_A, D, N));
  raft::resource::sync_stream(handle_);
Contributor

Remove sync

Comment on lines +27 to +30
  auto rd_up_to_multiple_of = [](uint32_t dim, uint32_t mult) -> size_t {
    return ((dim + mult - 1) / mult) * mult;
  };
  D = rd_up_to_multiple_of(dim, 64);
Contributor

raft::round_up_safe
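For reference, the main thing raft::round_up_safe adds over the plain lambda is an overflow check. A rough sketch of the idea (function name hypothetical; see RAFT's integer utilities for the real, templated implementation):

```cpp
#include <cstdint>
#include <stdexcept>

// Round value up to the nearest multiple of mult, detecting unsigned wraparound
// instead of silently returning a too-small result.
inline std::uint32_t round_up_checked(std::uint32_t value, std::uint32_t mult) {
  std::uint32_t remainder = value % mult;
  if (remainder == 0) { return value; }
  std::uint32_t rounded = value - remainder + mult;
  if (rounded < value) {  // unsigned wraparound means the true result overflows
    throw std::overflow_error("round_up overflows uint32_t");
  }
  return rounded;
}
```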


namespace cuvs::neighbors::ivf_rabitq::detail {

#define MAX_D 2048
Contributor

static constexpr int ... or something other than a macro


static constexpr int BITS_PER_CHUNK = 4;
static constexpr int LUT_SIZE = (1 << BITS_PER_CHUNK); // 16
static constexpr int WARP_SIZE = 32;
Contributor

use raft::WarpSize

Comment on lines +1423 to +1424
RAFT_CUDA_TRY(cudaFuncSetAttribute(
fully_fused_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, shared_mem_size));
Contributor

Why is it needed?
The shared mem size specified in <<<>>> should be enough?

Labels

C++ feature request New feature or request non-breaking Introduces a non-breaking change

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

6 participants