Multi-segment CAGRA search #2035
Introduces a globally-optimized path for searching across multiple CAGRA
index segments without per-segment device-to-host copies.
- c/include/cuvs/selection/select_k.h, c/src/selection/select_k.cpp:
new C API function cuvsSelectK that wraps cuvs::selection::select_k
for selecting the k smallest float values from a flat device array
- BufferedCagraSearch (internal interface): searchIntoBuffer() writes
per-segment CAGRA results into a slice of a caller-owned device buffer
without syncing the stream or copying to host
- CagraIndexImpl: implements BufferedCagraSearch; searchIntoBuffer()
computes byte offsets into the global buffer using segmentIdx * topK
- SelectKHelper: Panama FFI binding for cuvsSelectK
- MultiSegmentCagraSearch: orchestrates the full pipeline — queue all
per-segment searches, sync once, run cuvsSelectK on GPU, sync again,
single D2H copy, decode results
- MultiSegmentSearchResults: simple result carrier with count,
segmentIndices, ordinals, and distances arrays
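The pipeline's final select-and-decode step can be modeled on the CPU: per-segment top-k hits sit contiguously in one flat buffer (segment s occupies rows [s*k, (s+1)*k)), the k globally smallest distances are selected, and each winning flat position is mapped back to a (segmentIndex, ordinal) pair. A rough Java sketch, with invented class and record names; the real selection runs on the GPU via cuvsSelectK:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// CPU model of the GPU-side global selection step: per-segment top-k results
// live contiguously in one flat buffer, and the global top-k is picked from
// the whole buffer in one pass, then decoded back to (segment, ordinal).
public class GlobalTopKModel {

  public record Hit(int segment, int ordinal, float distance) {}

  public static List<Hit> selectK(float[] flatDistances, int[] flatOrdinals, int k) {
    int n = flatDistances.length;
    List<Integer> order = new ArrayList<>();
    for (int i = 0; i < n; i++) order.add(i);
    // Sort candidate positions by distance (stand-in for the device select-k).
    order.sort(Comparator.comparingDouble(i -> flatDistances[i]));
    List<Hit> out = new ArrayList<>();
    for (int j = 0; j < Math.min(k, n); j++) {
      int flat = order.get(j);
      // flat / k recovers the segment because each segment owns k slots.
      out.add(new Hit(flat / k, flatOrdinals[flat], flatDistances[flat]));
    }
    return out;
  }
}
```

The decode arithmetic is the interesting part: because every segment writes exactly k rows at a fixed offset, the flat winning index alone identifies both the segment and the within-segment hit.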
Implements concurrent per-segment GPU search that eliminates per-segment
device-to-host copies and CPU blocking on workspace deallocation.
Key changes:
- CudaStreamPool: fixed-size pool of non-blocking CUDA streams (one
cuvsResources_t per slot). Segments are assigned to slots via
round-robin so searches on different slots run concurrently. Pool size
defaults to 8 and is overridden via the system property
com.nvidia.cuvs.streamPoolSize.
- MultiSegmentCagraSearch: single-query search across N index segments
using the stream pool. All per-segment CAGRA kernels write into a
shared device buffer (no per-segment D2H copy or stream sync). A
single cuvsSelectK call finds the global top-k entirely on GPU, then
one D2H copy transfers the results.
- BufferedCagraSearch / CagraIndexImpl: new searchIntoBuffer() method
that queues a CAGRA search kernel on a caller-supplied stream and
writes results at a given row offset into a pre-allocated device
buffer.
- cuvsRMMAsyncMemoryResourceEnable() (C API + Java bindings): switches
the current device memory resource to cuda_async_memory_resource so
that workspace deallocations issued by CAGRA's search plan destructor
are stream-ordered and non-blocking. Without this, cudaFree serializes
kernel launches across streams regardless of stream assignment,
nullifying the stream pool benefit.
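The round-robin slot assignment can be sketched in plain Java. Only the system property name and the default of 8 come from the text above; everything else is an illustrative stand-in, not the real CudaStreamPool:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Round-robin assignment of segments to stream-pool slots. Each slot would
// own one non-blocking CUDA stream and one cuvsResources_t; here we only
// model the slot arithmetic.
public class SlotAssignment {
  private final int poolSize;
  private final AtomicInteger counter = new AtomicInteger();

  public SlotAssignment() {
    // Pool size defaults to 8, overridable via the system property.
    this.poolSize = Integer.getInteger("com.nvidia.cuvs.streamPoolSize", 8);
  }

  /** Returns the slot for the next segment; wraps around the pool. */
  public int nextSlot() {
    return Math.floorMod(counter.getAndIncrement(), poolSize);
  }

  public int size() {
    return poolSize;
  }
}
```

Segments assigned to different slots land on different streams and can run concurrently; segments sharing a slot serialize behind each other on that slot's stream.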
CudaStreamPool was a static singleton shared across all threads. With
multiple concurrent query threads, calls to MultiSegmentCagraSearch.search()
would alias onto the same pool slots after Math.floorMod, causing concurrent
cudaEventRecord and cudaStreamWaitEvent calls on the same event handle
(undefined behavior) and cross-thread stream interference. Additionally,
CudaStreamPool.closeInstance() was called from CuVSResourcesImpl.close(),
so whichever thread closed its resources first would destroy the shared pool
while other threads were still using it.
Fix: make CudaStreamPool a per-CuVSResources instance rather than a static
singleton. One pool is created and owned by each CuVSResourcesImpl; it is
closed when that instance is closed. Since CuVSResources is thread-local in
the Lucene integration, each query thread gets its own independent set of
streams and events with no sharing or locking required.
- CudaStreamPool: remove static singleton (getOrCreate, closeInstance, static
volatile instance, static AtomicInteger slotCounter); add package-private
constructor; replace static slotCounter with an instance int and a new
nextSlot(int count) method.
- CuVSResourcesImpl: add final CudaStreamPool streamPool field (sized from
com.nvidia.cuvs.streamPoolSize system property); close it directly in
close(); add static getStreamPool(CuVSResources) helper for
MultiSegmentCagraSearch to retrieve the per-resources pool.
- MultiSegmentCagraSearch: get pool via CuVSResourcesImpl.getStreamPool
and advance via pool.nextSlot. Remove redundant cuvsStreamSync after
cuvsSelectK — the D2H copies are enqueued on the same stream so CUDA
ordering already serializes them. Replace three separate hostArena.allocate
calls with one contiguous allocation (Long.BYTES-aligned) sliced into three
typed views, reducing OS-level allocation overhead per query.
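The one-allocation-three-views pattern can be sketched with the Panama FFM API (JDK 22). This is a hypothetical helper: the element types mirror the result carrier described earlier (int segment indices, int ordinals, float distances), but the real layout may differ:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

// One contiguous host allocation sliced into three typed views, replacing
// three separate Arena.allocate calls. Long.BYTES alignment satisfies all
// three element types, so the slices can start at any of the computed offsets.
public class ResultBuffers {
  public static MemorySegment[] sliceThree(Arena arena, int k) {
    long segBytes = (long) k * Integer.BYTES;   // segment indices
    long ordBytes = (long) k * Integer.BYTES;   // ordinals
    long distBytes = (long) k * Float.BYTES;    // distances
    MemorySegment all = arena.allocate(segBytes + ordBytes + distBytes, Long.BYTES);
    return new MemorySegment[] {
      all.asSlice(0, segBytes),
      all.asSlice(segBytes, ordBytes),
      all.asSlice(segBytes + ordBytes, distBytes)
    };
  }
}
```

A single allocation means one OS-level reservation per query instead of three, while the typed slices keep the call sites unchanged.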
The persistent kernel runner was previously keyed on (dataset_desc, graph, fixed search params), which forced a destroy/recreate cycle for every segment when searching a multi-segment Lucene index: each segment has a different graph pointer and potentially a different auto-computed max_iterations, producing a different hash on every call.

C++ changes (search_single_cta_kernel-inl.cuh):
- Move dataset_desc_ptr, graph_ptr, and graph_degree from fixed runner state into per-job fields in job_desc_t. The persistent kernel reads them from the job descriptor, so one runner instance can serve any number of segments without being rebuilt.
- Remove dataset_desc and graph arguments from persistent_runner_t constructor and calculate_parameter_hash; the runner is now keyed only on fixed kernel parameters (block_size, smem_size, itopk, etc.).
- Update select_and_run to initialize the device descriptor on the caller's stream and synchronize before submission, then pass dd_dev_ptr, graph.data_handle(), and graph_degree to runner::launch.
- Remove dd_host from persistent_runner_t; dataset upload is now the caller's responsibility on each launch.

Java changes:
- Add persistent, persistentLifetime, and persistentDeviceUsage fields, getters, and Builder methods to CagraSearchParams.
- Wire the three persistent params through CuVSParamsHelper into the Panama-generated cuvsCagraSearchParams struct.
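The re-keying can be pictured as a cache whose key contains only the fixed launch parameters, while per-segment state travels in the job descriptor. A hypothetical Java model (field and record names are invented; the real code is C++ in search_single_cta_kernel-inl.cuh):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Model of the runner-cache change: the cache key holds only fixed kernel
// parameters, while per-segment state (descriptor pointer, graph pointer,
// graph degree) rides along in the per-job descriptor instead.
public class RunnerCacheModel {
  // Fixed launch parameters: identical for every segment of a query.
  public record RunnerKey(int blockSize, int smemSize, int itopk) {}

  // Per-job fields the persistent kernel now reads from the job descriptor.
  public record JobDesc(long datasetDescPtr, long graphPtr, int graphDegree) {}

  private final Map<RunnerKey, String> runners = new ConcurrentHashMap<>();
  private int creations = 0;

  /** Returns a cached runner for the key, creating one only on first use. */
  public String runnerFor(RunnerKey key) {
    return runners.computeIfAbsent(key, k -> "runner-" + (++creations));
  }
}
```

Two segments with different graph pointers but identical launch parameters map to the same key, so the runner is created once per query shape rather than once per segment.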
…raSearch

In persistent mode, `searchIntoBuffer` blocks on the CPU until the GPU signals completion via a system-scope atomic. Previously, segments were searched sequentially, so the GPU processed one segment at a time per query, leaving its job queue mostly idle between segment dispatches.

Submit one async task per pool slot so all slots' segment searches are in-flight simultaneously. The persistent runner's job queue can hold all N segment jobs at once, allowing GPU workers to execute segments in parallel (bounded by worker_queue_size).

Segments are grouped by pool slot rather than submitted one-per-segment to prevent concurrent access to the same cuvsResources_t handle: the descriptor_cache stored inside the RAFT resources object is not thread-safe, and multiple threads calling cuvsCagraSearch with the same handle causes a SIGSEGV. Grouping ensures each cuvsResources_t is accessed by at most one thread at a time.

Effective parallelism is min(numSegments, pool.size()); increasing cuvsStreamPoolSize raises the ceiling. In non-persistent mode the existing sequential loop is unchanged: kernel launches are asynchronous and return immediately, so Java-level parallelism adds overhead without benefit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
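The grouping rule described above can be sketched in plain Java: segment seg goes to slot seg % poolSize, and one task per non-empty group would then walk its segments sequentially, so no two threads ever touch the same slot's cuvsResources_t. This is an illustrative model, not the actual MultiSegmentCagraSearch code:

```java
import java.util.ArrayList;
import java.util.List;

// Grouping segments by pool slot: slot s receives segments s, s+P, s+2P, ...
// One async task per group keeps each cuvsResources_t single-threaded while
// up to min(numSegments, poolSize) groups run concurrently.
public class SlotGrouping {
  public static List<List<Integer>> groupBySlot(int numSegments, int poolSize) {
    List<List<Integer>> groups = new ArrayList<>();
    for (int s = 0; s < poolSize; s++) groups.add(new ArrayList<>());
    for (int seg = 0; seg < numSegments; seg++) {
      groups.get(seg % poolSize).add(seg);
    }
    return groups;
  }
}
```

With 5 segments and an 8-slot pool, 5 groups are non-empty, matching the stated effective parallelism of min(numSegments, pool.size()).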
Multi-segment search (C++/C/Java):
- Add search_multi_segment() C++ function: builds a single search plan sized for the largest segment, packs per-segment descriptors, and launches search_kernel_ms — a new SINGLE_CTA kernel with grid dimensions (1, num_queries, num_segments) so each CTA independently searches one (query, segment) pair in a single kernel call.
- Add cuvsCagraSearchMultiSegment() C API wrapping the above.
- Simplify MultiSegmentCagraSearch.java to call cuvsCagraSearchMultiSegment() unconditionally, removing the previous per-segment stream pool, thread pool, CUDA event synchronization, and persistent/non-persistent branching. The search now completes in four phases: multi-segment kernel, GPU-side select-k, single D2H copy, result decoding.
- Add BufferedCagraSearch.getIndexHandle() to expose the raw cuvsCagraIndex_t handle needed by the multi-segment kernel dispatch.

Workspace pool (C/Java):
- Add cuvsResourcesSetWorkspacePool(): configures the per-resources temporary workspace as an uncapped RMM pool that grows without shrinking. After warmup, cuvsRMMAlloc/cuvsRMMFree hit the pool cache instead of calling cudaMallocAsync/cudaFreeAsync, eliminating CUDA context lock contention under concurrent query threads.
- Route cuvsRMMAlloc/cuvsRMMFree through the workspace resource so Java-side output buffer allocations also benefit from the pool.
- Expose setWorkspacePool() in the CuVSResources Java interface with implementations in CuVSResourcesImpl and SynchronizedCuVSResources.

Refactoring (search_single_cta_kernel-inl.cuh):
- Extract TopkVariant enum and select_topk_variant() helper, shared by search_kernel_config and search_kernel_config_ms, replacing duplicated if/else trees in both choose_itopk_and_mx_candidates() bodies.
- Extract kernel_dispatch_params::compute() to centralize max_candidates and max_itopk computation shared by select_and_run and select_and_run_multi_segment.
- Extract hashmap_element_count() static helper on the search struct, used by both set_params (single-segment) and run_multi_segment.
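The (1, num_queries, num_segments) grid assigns each CTA one (query, segment) pair. A small Java model of how CTA coordinates could map to distinct output offsets in the shared result buffer; the segment-major layout here is an assumption for illustration, not taken from the kernel source:

```java
// Model of the search_kernel_ms grid: gridDim = (1, numQueries, numSegments),
// so blockIdx.y selects the query and blockIdx.z the segment. Each CTA writes
// its k results at a distinct, non-overlapping offset in the shared buffer.
public class GridModel {
  /** Flat output offset (in elements) for CTA (query = y, segment = z). */
  public static long outputOffset(int y, int z, int numQueries, int k) {
    // Assumed segment-major layout: all of segment z's query rows precede
    // segment z+1's rows; each (query, segment) pair owns exactly k slots.
    return ((long) z * numQueries + y) * k;
  }
}
```

Because offsets are disjoint per (y, z), every CTA can write without coordination, which is what lets one kernel launch cover all segments.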
Add HALF(2) to the DataType enum and wire it through the Java layer so
callers can build and exchange float16 matrices without any Java-side
type conversion:
- CuVSMatrix.Builder: new addVector(short[]) overload; each element is
a raw IEEE 754 binary16 bit pattern held in a short
- LinkerHelper: add C_SHORT via canonicalLayouts("short")
- CuVSMatrixBaseImpl: map HALF → C_SHORT in valueLayoutFromType();
decode DLPack (kDLFloat, bits=16) → DataType.HALF in
dataTypeFromTensor()
- CuVSMatrixInternal: HALF maps to the same kDLFloat DLPack type code
as FLOAT; the bits field (16 vs 32) distinguishes them on the C side
- JDKProvider.MatrixBuilder: implement addVector(short[]) following the
same MemorySegment.ofArray pattern as the other overloads
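For callers producing the short[] bit patterns, JDK 20+ already provides Float.floatToFloat16 and Float.float16ToFloat, so no external half-float library is needed. A small illustrative helper (not part of the cuVS API):

```java
// Raw IEEE 754 binary16 bit patterns carried in Java shorts, matching the
// addVector(short[]) overload described above: each short is the bit pattern,
// not a numeric value.
public class HalfBits {
  public static short[] toHalf(float[] v) {
    short[] out = new short[v.length];
    for (int i = 0; i < v.length; i++) out[i] = Float.floatToFloat16(v[i]);
    return out;
  }

  public static float[] fromHalf(short[] bits) {
    float[] out = new float[bits.length];
    for (int i = 0; i < bits.length; i++) out[i] = Float.float16ToFloat(bits[i]);
    return out;
  }
}
```

This keeps the Java side conversion-free in the hot path: vectors can be converted once at ingest and handed to the builder as opaque bit patterns.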
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
📝 Walkthrough

This pull request introduces multi-segment CAGRA search functionality, GPU-based selection, and RMM memory pool management. It adds C and C++ APIs for multi-segment approximate nearest neighbor search, async memory allocation, and select-K operations, alongside corresponding Java bindings and internal implementations supporting concurrent per-segment GPU processing and stream-based result aggregation.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Actionable comments posted: 17
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java (1)
293-322: ⚠️ Potential issue | 🟡 Minor

The `toString()` method does not include the new persistent fields.

The three new fields (`persistent`, `persistentLifetime`, `persistentDeviceUsage`) are not included in the `toString()` output. While not critical, this creates an inconsistency where debugging output won't show the full configuration.

🔧 Proposed fix to include persistent fields in toString()
```diff
  + ", randXORMask=" + randXORMask
+ + ", persistent="
+ + persistent
+ + ", persistentLifetime="
+ + persistentLifetime
+ + ", persistentDeviceUsage="
+ + persistentDeviceUsage
  + "]";
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java` around lines 293 - 322, The toString() in class CagraSearchParams currently omits the new fields persistent, persistentLifetime, and persistentDeviceUsage; update the CagraSearchParams.toString() method to append these three fields (with their names and values) into the returned string (same formatting style as the other fields) so debugging output shows the full configuration.
🧹 Nitpick comments (3)
java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java (1)
59-62: Add destroyed-state guard in the new delegate method.
`setWorkspacePool(...)` should call `checkNotDestroyed()` before delegating, consistent with this wrapper's defensive contract.

Proposed change

```diff
  @Override
  public void setWorkspacePool(long sizeBytes) {
+   checkNotDestroyed();
    inner.setWorkspacePool(sizeBytes);
  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java` around lines 59 - 62, The new delegate method setWorkspacePool(long sizeBytes) is missing the wrapper's destroyed-state guard; modify CheckedCuVSResources.setWorkspacePool to call checkNotDestroyed() at the start (before delegating to inner.setWorkspacePool(sizeBytes)) so it matches the class's defensive contract and uses the same destroyed-state check as other methods.

java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java (1)
527-537: Consider adding validation for `persistentDeviceUsage` bounds.

The Javadoc states the value "must be greater than 0.0 and not greater than 1.0", but the setter doesn't validate this constraint. Invalid values would only fail at the native layer.
🛡️ Optional: Add validation in the builder
```diff
  public Builder withPersistentDeviceUsage(float persistentDeviceUsage) {
+   if (persistentDeviceUsage <= 0.0f || persistentDeviceUsage > 1.0f) {
+     throw new IllegalArgumentException(
+         "persistentDeviceUsage must be > 0.0 and <= 1.0, got: " + persistentDeviceUsage);
+   }
    this.persistentDeviceUsage = persistentDeviceUsage;
    return this;
  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java` around lines 527 - 537, The builder setter withPersistentDeviceUsage currently assigns persistentDeviceUsage without enforcing the Javadoc constraint; update Builder.withPersistentDeviceUsage to validate that persistentDeviceUsage > 0.0f and <= 1.0f and throw an IllegalArgumentException (including the invalid value in the message) when the check fails so invalid values are rejected early before reaching native code.

cpp/src/neighbors/detail/cagra/cagra_search.cuh (1)
388-417: Consider hoisting `query_norms` allocation outside the loop for the CosineExpanded metric.

When multiple segments use the `CosineExpanded` metric, `query_norms` is allocated and computed redundantly for each segment. Since queries are the same across segments (repeated query vector), the norms could be computed once.

This is a minor optimization opportunity. The current implementation is correct; the redundant computation is bounded by the number of segments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/cagra_search.cuh` around lines 388 - 417, Hoist the allocation and computation of query_norms out of the segment loop and reuse it for every segment whose indices[i]->metric() == cuvs::distance::DistanceType::CosineExpanded: allocate query_norms once (using raft::make_device_vector) before the for-loop, run raft::linalg::reduce on the shared queries data to compute the norms, then inside the loop call raft::linalg::matrix_vector_op (as currently done) using the precomputed query_norms for each CosineExpanded segment; after the loop free or let query_norms go out of scope. Ensure you still call cuvs::neighbors::ivf::detail::postprocess_distances for non-CosineExpanded branches and keep all existing ops (raft::compose_op, raft::sq_op, raft::div_const_op, raft::cast_op, raft::add_const_op, raft::div_checkzero_op) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@c/include/cuvs/core/c_api.h`:
- Around line 229-240: The current implementation of
cuvsRMMAsyncMemoryResourceEnable stores cuda_async_memory_resource in a
thread_local async_mr and passes it to rmm::mr::set_current_device_resource(),
creating a lifetime mismatch vs the API doc that says the change is global;
change the implementation so the async memory resource is process-scoped (not
thread_local) by replacing thread_local async_mr with a static/process-global
instance (e.g., a static unique_ptr or static object) so it outlives the thread
and remains valid for rmm::mr::set_current_device_resource(), and ensure
cuvsRMMAsyncMemoryResourceEnable and any cleanup use that same process-global
symbol (cuda_async_memory_resource / async_mr) when setting or resetting the
current device resource; alternatively, if thread-local semantics are intended,
update the cuvsRMMAsyncMemoryResourceEnable documentation to state it only
affects the calling thread and keep async_mr thread_local.
In `@c/src/neighbors/cagra.cpp`:
- Around line 710-733: The loop currently casts every indices[i]->addr and
builds device views without validating each segment's types/devices; update the
loop that fills idx_vec, q_vec, n_vec, d_vec to perform the same per-segment
checks used in cuvsCagraSearch: assert indices[i] != nullptr and addr != 0
(already present), then verify indices[i]->dtype.code == kDLFloat &&
indices[i]->dtype.bits == 32 for every i, and validate that queries[i],
neighbors[i], and distances[i] are device-backed DLPack tensors with the
expected element types (float for queries/distances, uint32 for neighbors)
before calling reinterpret_cast<const IndexT*>(indices[i]->addr) and
cuvs::core::from_dlpack to populate q_vec[i], n_vec[i], d_vec[i]; replace the
blind casts with RAFT_EXPECTS that include i in the error messages so a bad
segment fails fast and clearly.
- Around line 726-736: The loop that builds idx_vec/q_vec/n_vec/d_vec must also
validate that all segment indices use the same distance metric as the first
segment to prevent mixing incompatible metrics; inside the for-loop in cagra.cpp
(after the existing RAFT_EXPECTS that checks indices[i] non-null) compare
indices[i]->metric (or the actual metric field name on your IndexT struct/class)
to indices[0]->metric and fail fast with RAFT_EXPECTS (or equivalent) and a
clear message like "Mixed distance metrics across segments: expected %s but got
%s at segment %u"; keep this check before pushing idx_vec[i] and before calling
cuvs::neighbors::cagra::search_multi_segment so the function only runs when all
segments share the same metric.
In `@c/src/selection/select_k.cpp`:
- Around line 14-40: cuvsSelectK dereferences shape[1] and casts buffers without
validating the DLPack tensors; add explicit validation at the top of cuvsSelectK
for in_val, out_val, out_idx (non-null), then check each
DLManagedTensor->dl_tensor for expected ndim (==2), shapes (rows match expected
1 or compatible), dtype (in_val/out_val float32, out_idx int64), device type
(CUDA) and device id, byte_offset == 0, and contiguous row-major
strides/compatibility before creating device views with
raft::make_device_matrix_view; if any check fails return an appropriate
cuvsError_t (or throw inside translate_exceptions) instead of proceeding to
casts and calling cuvs::selection::select_k so malformed callers cannot crash or
corrupt memory.
In `@cpp/include/cuvs/neighbors/cagra.hpp`:
- Around line 1752-1806: Add Doxygen documentation for each overloaded
search_multi_segment declaration so they appear in generated API docs; either
add brief doxygen blocks above each overload or use `@copydoc` to reference the
primary search_multi_segment doc block (e.g., use `@copydoc`
search_multi_segment(raft::resources const&,
cuvs::neighbors::cagra::search_params const&, const std::vector<const
cuvs::neighbors::cagra::index<float, uint32_t>*>&, const
std::vector<raft::device_matrix_view<const float, int64_t, raft::row_major>>&,
const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>&,
const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>&))
for the overloads with half/int8_t/uint8_t and uint32_t/int64_t neighbor types
so each signature (the overloads of search_multi_segment) is documented.
- Around line 1744-1806: The header declares the template parameter order as <T,
OutputIdxT, IdxT> but the instantiation macro (used with <T, IdxT, OutputIdxT>
e.g. (data_t, uint32_t, int64_t)) expects <T, IdxT, OutputIdxT>, causing the
IdxT/OutputIdxT swap that trips the static_assert in cagra_search.cuh:276 and
breaks link-time overloads for int64_t; fix by changing the template parameter
order in the search_multi_segment declarations to <T, IdxT, OutputIdxT> (or
alternatively update the instantiation macro to match the declared order) so
types map correctly, and add missing Doxygen for overloads 2–8 by inserting a
`@copydoc` search_multi_segment (or equivalent documentation block) above each of
those overloaded search_multi_segment declarations to satisfy the public API
docs requirement.
In `@cpp/src/neighbors/cagra.cuh`:
- Around line 409-420: The wrapper search_multi_segment currently forwards
indices, queries, neighbors, and distances without validation; add the same
upfront shape checks used by search() before calling
cagra::detail::search_multi_segment: verify indices.size() == queries.size() ==
neighbors.size() == distances.size(), then for each segment i ensure
queries[i].n_rows == neighbors[i].n_rows, neighbors[i].n_cols ==
distances[i].n_cols (k matches), queries[i].n_cols equals the index dimension
for indices[i] (or indices[i]->dim() / appropriate accessor), and that k > 0 and
dims are consistent; if any check fails, return/throw a clear error (or use
RAFT/CUASSERT used elsewhere) rather than forwarding to the detail
implementation.
In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSResources.java`:
- Around line 60-76: The Java API should validate workspace pool sizes before
calling native code: update the Javadoc for setWorkspacePool to state the valid
range is > 0, and in the implementation of setWorkspacePool (the method that
currently invokes the native cuvsResourcesSetWorkspacePool) add a check that
rejects non-positive values (<= 0) by throwing an appropriate Java exception
(e.g., IllegalArgumentException) with a clear message; only call
cuvsResourcesSetWorkspacePool when the value is positive to avoid
signed->unsigned wraparound in native size_t.
In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/MultiSegmentSearchResults.java`:
- Around line 24-27: MultiSegmentSearchResults currently stores native uint32
ordinals in an int[] (field ordinals) which corrupts values > Integer.MAX_VALUE;
change ordinals from int[] to long[] (and any constructor/getter signatures) and
ensure code that decodes native uint32_t values writes unsigned values into the
long (e.g., value & 0xFFFFFFFFL) so ordinals remain non-negative; update any
consumers (notably MultiSegmentCagraSearch) to compare against a long sentinel
(e.g., -1L) or otherwise handle long ordinals instead of treating negative ints
as sentinels.
In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/SynchronizedCuVSResources.java`:
- Around line 43-46: The setWorkspacePool method in SynchronizedCuVSResources is
not using the shared lock and must be serialized like access(); modify
setWorkspacePool to acquire the same lock used by access() (e.g., wrap the call
to inner.setWorkspacePool(sizeBytes) in the synchronized block or lock guard
used by access()) so that mutations to workspace pool are protected by the same
synchronization as access().
In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CagraIndexImpl.java`:
- Around line 391-396: The offset calculation uses segmentIdx * topK
(neighborByteOffset/distanceByteOffset) but the code builds tensors with shape
{numQueries, topK} (numQueries from queryVectors.size()), so when numQueries > 1
the buffer offsets are wrong; either enforce single-query by adding a guard (if
(numQueries != 1) throw new IllegalArgumentException(...)) near where numQueries
is computed (queryVectors) or update the offsets to multiply by numQueries
(neighborByteOffset = segmentIdx * numQueries * topK * C_INT_BYTE_SIZE and
distanceByteOffset = segmentIdx * numQueries * topK * Float.BYTES) before
creating neighborSlice/distanceSlice (globalNeighborsDP/globalDistancesDP).
In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CudaStreamPool.java`:
- Around line 115-119: Make nextSlot atomic and wrap the returned slot index by
changing slotCounter to an AtomicInteger and using an atomic get-and-add plus
modulo; specifically, replace the non-atomic increment in nextSlot with
something like int start = slotCounter.getAndAdd(count); then return
Math.floorMod(start, poolSize) (or equivalent using your pool size field) so the
operation is thread-safe and the returned slot is bounded by the pool size.
- Around line 125-131: In CudaStreamPool.close(), wrap the cuvsResourcesDestroy
call with the same error validation used elsewhere by replacing the raw
cuvsResourcesDestroy(resources[i]) invocation with a call to
checkCuVSError(cuvsResourcesDestroy(resources[i]), "cuvsResourcesDestroy") so
cleanup failures are logged/handled; keep the surrounding loop and existing
calls to checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy") and
checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy") unchanged and
reference the close(), cuvsResourcesDestroy, checkCuVSError, events, resources,
streams, and size symbols to locate the change.
- Around line 59-85: The constructor CudaStreamPool currently leaks native
handles if a create call fails mid-loop; modify the CudaStreamPool(int size)
constructor to perform rollback cleanup on failure by tracking the current index
and, if any checkCudaError/checkCuVSError throws, iterating over
already-initialized entries in resources[], streams[], and events[] to call the
corresponding destroy functions (cudaStreamDestroy for streams[],
cudaEventDestroy for events[], and the cuvs resources destroy routine for
resources[]) before rethrowing the exception; implement this by wrapping the
allocation loop in try/catch (or try/finally with a success flag) and invoking
the same cleanup logic as close() for indices < currentIndex so no native
handles are leaked if construction aborts.
In
`@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSResourcesImpl.java`:
- Around line 31-33: The CudaStreamPool is currently allocated in the field
initializer (streamPool) which can leak native resources if
CuVSResourcesImpl(Path) throws before construction completes; move creation of
the CudaStreamPool from the field initializer into the CuVSResourcesImpl(Path)
constructor (use Integer.getInteger(CudaStreamPool.SIZE_PROPERTY,
CudaStreamPool.DEFAULT_SIZE) to determine size), assign it to the streamPool
field there, and in the constructor failure path ensure you call
streamPool.close() (or otherwise tear it down) before rethrowing so native
resources are not leaked; keep the streamPool field declaration but initialize
it only in the constructor and ensure close() is reachable on exceptions.
In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/MultiSegmentCagraSearch.java`:
- Around line 119-145: The code must enforce the single-query contract or size
outputs from the actual query row count: in MultiSegmentCagraSearch, after
obtaining var queryVectors = (CuVSMatrixInternal)
queries.get(i).getQueryVectors(), read its row count (e.g.,
queryVectors.getRowCount()/numRows()/size(0) — use the actual accessor on
CuVSMatrixInternal) into int nq; if nq > 1 throw an IllegalArgumentException
rejecting multi-row queries, or alternatively set segShape = new long[] {nq, k}
and compute neighbor/distance byte offsets and tensor sizes using nq*k (update
nByteOffset/dByteOffset and prepareTensor calls for
neighborsArray/distancesArray accordingly) so prepareTensor and the
globalNeighborsDP/globalDistancesDP slices match the query row count.
- Around line 96-111: The code in MultiSegmentCagraSearch currently uses
CagraSearchParams built from queries.get(0) (via
CuVSParamsHelper.buildCagraSearchParams) and thus drops per-segment settings
from subsequent CagraQuery entries; fix by either validating that all CagraQuery
instances in queries have identical search params and no per-segment filters
(throw IllegalArgumentException from MultiSegmentCagraSearch if any CagraQuery
differs from queries.get(0)), or plumb per-segment parameters through the native
call: extend the native wrapper (the cuvsCagraSearchMultiSegment binding) and
CuVSParamsHelper to accept/allocate an array of CagraSearchParams (build one
MemorySegment per CagraQuery) and pass that array/handle when invoking the
multi-segment search so each segment’s filters/params are applied.
---
Outside diff comments:
In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java`:
- Around line 293-322: The toString() in class CagraSearchParams currently omits
the new fields persistent, persistentLifetime, and persistentDeviceUsage; update
the CagraSearchParams.toString() method to append these three fields (with their
names and values) into the returned string (same formatting style as the other
fields) so debugging output shows the full configuration.
---
Nitpick comments:
In `@cpp/src/neighbors/detail/cagra/cagra_search.cuh`:
- Around line 388-417: Hoist the allocation and computation of query_norms out
of the segment loop and reuse it for every segment whose indices[i]->metric() ==
cuvs::distance::DistanceType::CosineExpanded: allocate query_norms once (using
raft::make_device_vector) before the for-loop, run raft::linalg::reduce on the
shared queries data to compute the norms, then inside the loop call
raft::linalg::matrix_vector_op (as currently done) using the precomputed
query_norms for each CosineExpanded segment; after the loop free or let
query_norms go out of scope. Ensure you still call
cuvs::neighbors::ivf::detail::postprocess_distances for non-CosineExpanded
branches and keep all existing ops (raft::compose_op, raft::sq_op,
raft::div_const_op, raft::cast_op, raft::add_const_op, raft::div_checkzero_op)
unchanged.
In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java`:
- Around line 527-537: The builder setter withPersistentDeviceUsage currently
assigns persistentDeviceUsage without enforcing the Javadoc constraint; update
Builder.withPersistentDeviceUsage to validate that persistentDeviceUsage > 0.0f
and <= 1.0f and throw an IllegalArgumentException (including the invalid value
in the message) when the check fails so invalid values are rejected early before
reaching native code.
In `@java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java`:
- Around line 59-62: The new delegate method setWorkspacePool(long sizeBytes) is
missing the wrapper's destroyed-state guard; modify
CheckedCuVSResources.setWorkspacePool to call checkNotDestroyed() at the start
(before delegating to inner.setWorkspacePool(sizeBytes)) so it matches the
class's defensive contract and uses the same destroyed-state check as other
methods.
📒 Files selected for processing (40)

- c/CMakeLists.txt
- c/include/cuvs/core/c_api.h
- c/include/cuvs/neighbors/cagra.h
- c/include/cuvs/selection/select_k.h
- c/src/core/c_api.cpp
- c/src/neighbors/cagra.cpp
- c/src/selection/select_k.cpp
- cpp/include/cuvs/neighbors/cagra.hpp
- cpp/src/neighbors/cagra.cuh
- cpp/src/neighbors/cagra_search_inst.cu.in
- cpp/src/neighbors/detail/cagra/cagra_search.cuh
- cpp/src/neighbors/detail/cagra/search_single_cta.cuh
- cpp/src/neighbors/detail/cagra/search_single_cta_inst.cuh
- cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh
- cpp/src/neighbors/detail/cagra/search_single_cta_kernel.cuh
- java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraIndex.java
- java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java
- java/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSAceParams.java
- java/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSMatrix.java
- java/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSResources.java
- java/cuvs-java/src/main/java/com/nvidia/cuvs/HnswAceParams.java
- java/cuvs-java/src/main/java/com/nvidia/cuvs/HnswIndex.java
- java/cuvs-java/src/main/java/com/nvidia/cuvs/HnswIndexParams.java
- java/cuvs-java/src/main/java/com/nvidia/cuvs/MultiSegmentSearchResults.java
- java/cuvs-java/src/main/java/com/nvidia/cuvs/SynchronizedCuVSResources.java
- java/cuvs-java/src/main/java/com/nvidia/cuvs/spi/CuVSProvider.java
- java/cuvs-java/src/main/java/com/nvidia/cuvs/spi/UnsupportedProvider.java
- java/cuvs-java/src/main/java22/com/nvidia/cuvs/MultiSegmentCagraSearch.java
- java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/BufferedCagraSearch.java
- java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CagraIndexImpl.java
- java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSMatrixBaseImpl.java
- java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSMatrixInternal.java
- java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSParamsHelper.java
- java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSResourcesImpl.java
- java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CudaStreamPool.java
- java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/HnswIndexImpl.java
- java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/SelectKHelper.java
- java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/common/LinkerHelper.java
- java/cuvs-java/src/main/java22/com/nvidia/cuvs/spi/JDKProvider.java
- java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java
/**
 * @brief Switches the working memory resource to use stream-ordered asynchronous allocation
 * (cudaMallocAsync / cudaFreeAsync). Unlike the pool resource, this resource returns memory to
 * the stream immediately without blocking the CPU, eliminating device-wide synchronization on
 * deallocation. This is especially beneficial when multiple CAGRA searches run concurrently on
 * separate CUDA streams, because the internal workspace allocations no longer serialize kernel
 * launches. Be aware that this function will change the memory resource for the whole process
 * and the new memory resource will be used until explicitly changed.
 *
 * @return cuvsError_t
 */
cuvsError_t cuvsRMMAsyncMemoryResourceEnable();
🧩 Analysis chain
🏁 Script executed:
find . -name "c_api.cpp" -path "*/c/src/core/*" | head -5
Repository: rapidsai/cuvs
Length of output: 79
🏁 Script executed:
cat -n c/src/core/c_api.cpp | sed -n '160,195p'
Repository: rapidsai/cuvs
Length of output: 1665
🏁 Script executed:
cd c/src/core && grep -n "thread_local" c_api.cpp
Repository: rapidsai/cuvs
Length of output: 269
🏁 Script executed:
cd c/include/cuvs/core && grep -n "cuvsRMMAsyncMemoryResourceEnable" c_api.h -A 15
Repository: rapidsai/cuvs
Length of output: 580
🏁 Script executed:
# Search for RMM set_current_device_resource usage patterns in the codebase
rg "set_current_device_resource" --type cpp -B 2 -A 2
Repository: rapidsai/cuvs
Length of output: 2101
🏁 Script executed:
# Check if there are any docs changes or migration guides in the PR
find . -name "*.md" -o -name "*.rst" | xargs grep -l "RMM\|AsyncMemory\|cuvsRMM" 2>/dev/null | head -10
Repository: rapidsai/cuvs
Length of output: 200
🏁 Script executed:
# Check for any deprecation warnings or migration documentation
find . -type f \( -name "*.h" -o -name "*.md" \) -path "*/docs/*" | xargs grep -l "cuvsRMM\|memory.*resource" 2>/dev/null | head -5
Repository: rapidsai/cuvs
Length of output: 89
🏁 Script executed:
# Check if there are any API change notes or migration guides in the docs
cat docs/source/api_basics.rst | head -100
Repository: rapidsai/cuvs
Length of output: 2518
🏁 Script executed:
# Check CHANGELOG for any RMM API changes mentioned
rg "Async|AsyncMemory|cuvsRMMAsync" CHANGELOG.md -B 2 -A 2
Repository: rapidsai/cuvs
Length of output: 39
🏁 Script executed:
# Check if reset_current_device_resource clears the async_mr
cat -n c/src/core/c_api.cpp | sed -n '192,210p'
Repository: rapidsai/cuvs
Length of output: 784
🏁 Script executed:
# Check how pool memory resource is handled - it passes rvalues
# Let's see if there's any documentation on RMM's set_current_device_resource behavior
rg "set_current_device_resource" --type h | head -20
Repository: rapidsai/cuvs
Length of output: 331
🏁 Script executed:
# Look for RMM in vendored code or requirements to understand its semantics
find . -name "*.txt" -o -name "*.cmake" | xargs grep -i "rmm" | head -10
Repository: rapidsai/cuvs
Length of output: 779
🏁 Script executed:
# Check if there's any documentation update about async memory resource being process-wide
rg "cuvsRMMAsyncMemoryResourceEnable" docs/ 2>/dev/null || echo "No docs match found"
Repository: rapidsai/cuvs
Length of output: 76
🏁 Script executed:
# Let's check how the pool resource implementation differs from async
# The pool passes a rvalue (temporary), while async passes a pointer to thread_local
# Search for any notes on RMM memory resource ownership in comments
rg "ownership|lifetime|thread" c/src/core/c_api.cpp | head -20
Repository: rapidsai/cuvs
Length of output: 275
🏁 Script executed:
# Check if cuvsRMMPoolMemoryResourceEnable has similar issues by checking if it's thread_local or static
rg "pool_mr" c/src/core/c_api.cpp -B 5 -A 5
Repository: rapidsai/cuvs
Length of output: 665
🏁 Script executed:
# Let's check the RMM version used and look for any behavior documentation
rg "rmm" vcpkg.json conanfile.txt requirements.txt 2>/dev/null | head -10
Repository: rapidsai/cuvs
Length of output: 39
🏁 Script executed:
# Check if there are any thread safety comments or warnings in the C API header
cat -n c/include/cuvs/core/c_api.h | sed -n '220,250p'
Repository: rapidsai/cuvs
Length of output: 1728
The async-memory resource owner cannot be thread_local when this API changes the current resource globally.
The implementation at c/src/core/c_api.cpp:188 stores cuda_async_memory_resource in a thread_local async_mr and passes it to rmm::mr::set_current_device_resource(), but the documentation explicitly states this function "will change the memory resource for the whole process" (line 235). This creates a critical lifetime mismatch:
- If set_current_device_resource() is device-scoped (affecting all threads), then when the enabling thread exits, its thread_local async_mr is destroyed while still registered as the current resource, leaving RMM with a dangling pointer.
- If thread-local semantics were intended, the documentation must be updated to reflect that this only affects the calling thread.
The pool resource avoids this issue by passing temporary rvalues to set_current_device_resource(), allowing RMM to manage the lifetime. Either make async_mr process/device-scoped (not thread_local), or narrow the documentation and implementation to clarify thread-local semantics.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@c/include/cuvs/core/c_api.h` around lines 229 - 240, The current
implementation of cuvsRMMAsyncMemoryResourceEnable stores
cuda_async_memory_resource in a thread_local async_mr and passes it to
rmm::mr::set_current_device_resource(), creating a lifetime mismatch vs the API
doc that says the change is global; change the implementation so the async
memory resource is process-scoped (not thread_local) by replacing thread_local
async_mr with a static/process-global instance (e.g., a static unique_ptr or
static object) so it outlives the thread and remains valid for
rmm::mr::set_current_device_resource(), and ensure
cuvsRMMAsyncMemoryResourceEnable and any cleanup use that same process-global
symbol (cuda_async_memory_resource / async_mr) when setting or resetting the
current device resource; alternatively, if thread-local semantics are intended,
update the cuvsRMMAsyncMemoryResourceEnable documentation to state it only
affects the calling thread and keep async_mr thread_local.
// Only float32 is supported for multi-segment search.
RAFT_EXPECTS(
  indices[0]->dtype.code == kDLFloat && indices[0]->dtype.bits == 32,
  "Multi-segment search only supports float32 indices");

using T         = float;
using IdxT      = uint32_t;
using OutIdxT   = uint32_t;
using DistanceT = float;
using IndexT    = cuvs::neighbors::cagra::index<T, IdxT>;

std::vector<const IndexT*> idx_vec(num_segments);
std::vector<raft::device_matrix_view<const T, int64_t, raft::row_major>> q_vec(num_segments);
std::vector<raft::device_matrix_view<OutIdxT, int64_t, raft::row_major>> n_vec(num_segments);
std::vector<raft::device_matrix_view<DistanceT, int64_t, raft::row_major>> d_vec(num_segments);

for (uint32_t i = 0; i < num_segments; i++) {
  RAFT_EXPECTS(indices[i] != nullptr && indices[i]->addr != 0,
               "Index at position %u is null or not built", i);
  idx_vec[i] = reinterpret_cast<const IndexT*>(indices[i]->addr);
  q_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(q_vec[i])>>(queries[i]);
  n_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(n_vec[i])>>(neighbors[i]);
  d_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(d_vec[i])>>(distances[i]);
}
Validate every segment before the reinterpret_cast.
Only indices[0] is checked for float32. The loop then blindly casts every indices[i]->addr to index<float, uint32_t>* and builds float32 device views from queries[i], neighbors[i], and distances[i]. If any later segment is half/int8/uint8 or any tensor is not device-backed, this path will feed the kernel mismatched types and can produce garbage results or GPU faults. Please mirror the per-input dtype/device/null checks from cuvsCagraSearch for each segment before from_dlpack.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@c/src/neighbors/cagra.cpp` around lines 710 - 733, The loop currently casts
every indices[i]->addr and builds device views without validating each segment's
types/devices; update the loop that fills idx_vec, q_vec, n_vec, d_vec to
perform the same per-segment checks used in cuvsCagraSearch: assert indices[i]
!= nullptr and addr != 0 (already present), then verify indices[i]->dtype.code
== kDLFloat && indices[i]->dtype.bits == 32 for every i, and validate that
queries[i], neighbors[i], and distances[i] are device-backed DLPack tensors with
the expected element types (float for queries/distances, uint32 for neighbors)
before calling reinterpret_cast<const IndexT*>(indices[i]->addr) and
cuvs::core::from_dlpack to populate q_vec[i], n_vec[i], d_vec[i]; replace the
blind casts with RAFT_EXPECTS that include i in the error messages so a bad
segment fails fast and clearly.
for (uint32_t i = 0; i < num_segments; i++) {
  RAFT_EXPECTS(indices[i] != nullptr && indices[i]->addr != 0,
               "Index at position %u is null or not built", i);
  idx_vec[i] = reinterpret_cast<const IndexT*>(indices[i]->addr);
  q_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(q_vec[i])>>(queries[i]);
  n_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(n_vec[i])>>(neighbors[i]);
  d_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(d_vec[i])>>(distances[i]);
}

cuvs::neighbors::cagra::search_multi_segment(
  *res_ptr, search_params, idx_vec, q_vec, n_vec, d_vec);
Reject mixed-distance metrics across segments.
This API combines raw distances from all segments into one global ranking, so all indices must use the same metric. Right now nothing checks that indices[i] matches indices[0] on metric, which means mixing L2/IP/Cosine segments will return nonsensical top-k ordering even though the header says the distances are comparable across segments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@c/src/neighbors/cagra.cpp` around lines 726 - 736, The loop that builds
idx_vec/q_vec/n_vec/d_vec must also validate that all segment indices use the
same distance metric as the first segment to prevent mixing incompatible
metrics; inside the for-loop in cagra.cpp (after the existing RAFT_EXPECTS that
checks indices[i] non-null) compare indices[i]->metric (or the actual metric
field name on your IndexT struct/class) to indices[0]->metric and fail fast with
RAFT_EXPECTS (or equivalent) and a clear message like "Mixed distance metrics
across segments: expected %s but got %s at segment %u"; keep this check before
pushing idx_vec[i] and before calling
cuvs::neighbors::cagra::search_multi_segment so the function only runs when all
segments share the same metric.
extern "C" cuvsError_t cuvsSelectK(cuvsResources_t res,
                                   DLManagedTensor* in_val,
                                   DLManagedTensor* out_val,
                                   DLManagedTensor* out_idx)
{
  return cuvs::core::translate_exceptions([=] {
    auto* res_ptr = reinterpret_cast<raft::resources*>(res);

    int64_t n = in_val->dl_tensor.shape[1];
    int64_t k = out_val->dl_tensor.shape[1];

    auto in_view = raft::make_device_matrix_view<const float, int64_t, raft::row_major>(
      static_cast<const float*>(in_val->dl_tensor.data), 1, n);

    auto out_val_view = raft::make_device_matrix_view<float, int64_t, raft::row_major>(
      static_cast<float*>(out_val->dl_tensor.data), 1, k);

    auto out_idx_view = raft::make_device_matrix_view<int64_t, int64_t, raft::row_major>(
      static_cast<int64_t*>(out_idx->dl_tensor.data), 1, k);

    cuvs::selection::select_k(
      *res_ptr,
      in_view,
      std::nullopt,  // implicit positions [0, n) as in_idx
      out_val_view,
      out_idx_view,
      true);  // select_min = true (smallest distance = nearest neighbor)
Validate the DLPack contract before dereferencing shape[1] and casting buffers.
cuvsSelectK currently assumes non-null 2D CUDA tensors with float32 / float32 / int64 dtypes, zero offset, and contiguous row-major layout. Without checking ndim, shape, device, dtype, byte_offset, and stride compatibility, malformed callers can crash here or feed corrupted views into select_k.
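As a hedged illustration of the contract being asked for, here is a host-side validator over a simplified DLTensor-like struct (field names echo DLPack but this is not the real dlpack.h, and the real check would also cover device type and dtype code). A row-major `1 x n` tensor is compact when `strides` is null (the DLPack convention for contiguous data) or equals `{n, 1}`:

```cpp
// Simplified stand-in for DLTensor; real code would validate DLDataType.code
// and DLDevice.device_type as well.
struct tensor {
  void* data;
  int ndim;
  long long shape[2];
  long long* strides;               // nullptr means compact row-major
  unsigned long long byte_offset;
  int dtype_bits;
};

bool is_valid_row_vector(const tensor& t, int expected_bits) {
  if (t.data == nullptr || t.ndim != 2 || t.shape[0] != 1) return false;
  if (t.dtype_bits != expected_bits) return false;
  if (t.byte_offset != 0) return false;  // views below assume a zero offset
  if (t.strides != nullptr &&
      (t.strides[0] != t.shape[1] || t.strides[1] != 1)) return false;
  return true;
}
```

Running such checks on `in_val`, `out_val`, and `out_idx` before dereferencing `shape[1]` turns a malformed caller into a clean error instead of a crash or a corrupted view.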
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@c/src/selection/select_k.cpp` around lines 14 - 40, cuvsSelectK dereferences
shape[1] and casts buffers without validating the DLPack tensors; add explicit
validation at the top of cuvsSelectK for in_val, out_val, out_idx (non-null),
then check each DLManagedTensor->dl_tensor for expected ndim (==2), shapes (rows
match expected 1 or compatible), dtype (in_val/out_val float32, out_idx int64),
device type (CUDA) and device id, byte_offset == 0, and contiguous row-major
strides/compatibility before creating device views with
raft::make_device_matrix_view; if any check fails return an appropriate
cuvsError_t (or throw inside translate_exceptions) instead of proceeding to
casts and calling cuvs::selection::select_k so malformed callers cannot crash or
corrupt memory.
void search_multi_segment(
  raft::resources const& res,
  cuvs::neighbors::cagra::search_params const& params,
  const std::vector<const cuvs::neighbors::cagra::index<float, uint32_t>*>& indices,
  const std::vector<raft::device_matrix_view<const float, int64_t, raft::row_major>>& queries,
  const std::vector<raft::device_matrix_view<uint32_t, int64_t, raft::row_major>>& neighbors,
  const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
  raft::resources const& res,
  cuvs::neighbors::cagra::search_params const& params,
  const std::vector<const cuvs::neighbors::cagra::index<float, uint32_t>*>& indices,
  const std::vector<raft::device_matrix_view<const float, int64_t, raft::row_major>>& queries,
  const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors,
  const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
  raft::resources const& res,
  cuvs::neighbors::cagra::search_params const& params,
  const std::vector<const cuvs::neighbors::cagra::index<half, uint32_t>*>& indices,
  const std::vector<raft::device_matrix_view<const half, int64_t, raft::row_major>>& queries,
  const std::vector<raft::device_matrix_view<uint32_t, int64_t, raft::row_major>>& neighbors,
  const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
  raft::resources const& res,
  cuvs::neighbors::cagra::search_params const& params,
  const std::vector<const cuvs::neighbors::cagra::index<half, uint32_t>*>& indices,
  const std::vector<raft::device_matrix_view<const half, int64_t, raft::row_major>>& queries,
  const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors,
  const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
  raft::resources const& res,
  cuvs::neighbors::cagra::search_params const& params,
  const std::vector<const cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices,
  const std::vector<raft::device_matrix_view<const int8_t, int64_t, raft::row_major>>& queries,
  const std::vector<raft::device_matrix_view<uint32_t, int64_t, raft::row_major>>& neighbors,
  const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
  raft::resources const& res,
  cuvs::neighbors::cagra::search_params const& params,
  const std::vector<const cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices,
  const std::vector<raft::device_matrix_view<const int8_t, int64_t, raft::row_major>>& queries,
  const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors,
  const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
  raft::resources const& res,
  cuvs::neighbors::cagra::search_params const& params,
  const std::vector<const cuvs::neighbors::cagra::index<uint8_t, uint32_t>*>& indices,
  const std::vector<raft::device_matrix_view<const uint8_t, int64_t, raft::row_major>>& queries,
  const std::vector<raft::device_matrix_view<uint32_t, int64_t, raft::row_major>>& neighbors,
  const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
  raft::resources const& res,
  cuvs::neighbors::cagra::search_params const& params,
  const std::vector<const cuvs::neighbors::cagra::index<uint8_t, uint32_t>*>& indices,
  const std::vector<raft::device_matrix_view<const uint8_t, int64_t, raft::row_major>>& queries,
  const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors,
  const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Declaration in cpp/src/neighbors/detail/cagra/cagra_search.cuh"
rg -n -A8 -B2 'template <typename T, typename OutputIdxT = uint32_t, typename IdxT = uint32_t' \
  cpp/src/neighbors/detail/cagra/cagra_search.cuh
echo
echo "Explicit instantiation site in cpp/src/neighbors/cagra_search_inst.cu.in"
rg -n -A8 -B2 'CUVS_INST_CAGRA_SEARCH_MULTI_SEGMENT|search_multi_segment<' \
  cpp/src/neighbors/cagra_search_inst.cu.in
Repository: rapidsai/cuvs
Length of output: 2317
🏁 Script executed:
# Check for static_assert on IdxT in the implementation
rg -n "static_assert.*IdxT" cpp/src/neighbors/detail/cagra/cagra_search.cuh
echo "---"
# Read the header to see all overloads and their documentation
cat -n cpp/include/cuvs/neighbors/cagra.hpp | sed -n '1710,1810p'
Repository: rapidsai/cuvs
Length of output: 6679
Fix template parameter order mismatch and add missing documentation.
The instantiation macro calls the template with parameter order <T, IdxT, OutputIdxT>, but the template declares them as <T, OutputIdxT, IdxT>. When the macro instantiates with (data_t, uint32_t, int64_t), it produces <data_t, uint32_t, int64_t>, incorrectly mapping IdxT to int64_t. This violates the static_assert at cagra_search.cuh:276 (IdxT == uint32_t) and will fail at link time for the int64_t output overloads.
Additionally, overloads 2–8 (lines 1752–1806) lack Doxygen documentation. Per the coding guidelines, all public functions must be documented. Add a @copydoc search_multi_segment directive before each additional overload, or provide explicit documentation blocks.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/include/cuvs/neighbors/cagra.hpp` around lines 1744 - 1806, The header
declares the template parameter order as <T, OutputIdxT, IdxT> but the
instantiation macro (used with <T, IdxT, OutputIdxT> e.g. (data_t, uint32_t,
int64_t)) expects <T, IdxT, OutputIdxT>, causing the IdxT/OutputIdxT swap that
trips the static_assert in cagra_search.cuh:276 and breaks link-time overloads
for int64_t; fix by changing the template parameter order in the
search_multi_segment declarations to <T, IdxT, OutputIdxT> (or alternatively
update the instantiation macro to match the declared order) so types map
correctly, and add missing Doxygen for overloads 2–8 by inserting a `@copydoc`
search_multi_segment (or equivalent documentation block) above each of those
overloaded search_multi_segment declarations to satisfy the public API docs
requirement.
public int nextSlot(int count) {
  int start = slotCounter;
  slotCounter += count;
  return start;
}
Make nextSlot atomic and modulo-bounded.
At Line 116–118, slotCounter updates are racy and the returned value is not wrapped despite the method contract; this can cause slot collisions and eventual overflow issues.
🔧 Proposed fix
 import java.lang.foreign.Arena;
 import java.lang.foreign.MemorySegment;
+import java.util.concurrent.atomic.AtomicInteger;
@@
-  private int slotCounter;
+  private final AtomicInteger slotCounter = new AtomicInteger(0);
@@
   public int nextSlot(int count) {
-    int start = slotCounter;
-    slotCounter += count;
-    return start;
+    if (count <= 0 || count > size) {
+      throw new IllegalArgumentException("count must be in [1, size]");
+    }
+    return Math.floorMod(slotCounter.getAndAdd(count), size);
   }

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
public int nextSlot(int count) {
  if (count <= 0 || count > size) {
    throw new IllegalArgumentException("count must be in [1, size]");
  }
  return Math.floorMod(slotCounter.getAndAdd(count), size);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CudaStreamPool.java`
around lines 115 - 119, Make nextSlot atomic and wrap the returned slot index by
changing slotCounter to an AtomicInteger and using an atomic get-and-add plus
modulo; specifically, replace the non-atomic increment in nextSlot with
something like int start = slotCounter.getAndAdd(count); then return
Math.floorMod(start, poolSize) (or equivalent using your pool size field) so the
operation is thread-safe and the returned slot is bounded by the pool size.
@Override
public void close() {
  for (int i = 0; i < size; i++) {
    checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy");
    cuvsResourcesDestroy(resources[i]);
    checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy");
  }
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify all cuvsResourcesDestroy call sites are validated.
rg -n -C2 '\bcuvsResourcesDestroy\s*\(' --type=java
Repository: rapidsai/cuvs
Length of output: 1298
Wrap cuvsResourcesDestroy with error validation during cleanup.
Line 129 calls cuvsResourcesDestroy without checkCuVSError, allowing cleanup failures to be silently ignored. Other call sites in the codebase (e.g., CuVSResourcesImpl.java) properly validate the return code. Apply the same pattern here for consistency:
Proposed fix
-      cuvsResourcesDestroy(resources[i]);
+      checkCuVSError(cuvsResourcesDestroy(resources[i]), "cuvsResourcesDestroy");

📝 Committable suggestion
@Override
public void close() {
  for (int i = 0; i < size; i++) {
    checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy");
    checkCuVSError(cuvsResourcesDestroy(resources[i]), "cuvsResourcesDestroy");
    checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy");
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CudaStreamPool.java`
around lines 125 - 131, In CudaStreamPool.close(), wrap the cuvsResourcesDestroy
call with the same error validation used elsewhere by replacing the raw
cuvsResourcesDestroy(resources[i]) invocation with a call to
checkCuVSError(cuvsResourcesDestroy(resources[i]), "cuvsResourcesDestroy") so
cleanup failures are logged/handled; keep the surrounding loop and existing
calls to checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy") and
checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy") unchanged and
reference the close(), cuvsResourcesDestroy, checkCuVSError, events, resources,
streams, and size symbols to locate the change.
private final CudaStreamPool streamPool =
    new CudaStreamPool(
        Integer.getInteger(CudaStreamPool.SIZE_PROPERTY, CudaStreamPool.DEFAULT_SIZE));
Construct the stream pool inside the constructor.
Allocating CudaStreamPool in a field initializer leaks native resources if CuVSResourcesImpl(Path) throws later during cuvsResourcesCreate/cuvsDeviceIdGet, because the object never finishes construction and close() is never reachable. Build the pool inside the constructor and tear it down in the constructor's failure path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSResourcesImpl.java`
around lines 31 - 33, The CudaStreamPool is currently allocated in the field
initializer (streamPool) which can leak native resources if
CuVSResourcesImpl(Path) throws before construction completes; move creation of
the CudaStreamPool from the field initializer into the CuVSResourcesImpl(Path)
constructor (use Integer.getInteger(CudaStreamPool.SIZE_PROPERTY,
CudaStreamPool.DEFAULT_SIZE) to determine size), assign it to the streamPool
field there, and in the constructor failure path ensure you call
streamPool.close() (or otherwise tear it down) before rethrowing so native
resources are not leaked; keep the streamPool field declaration but initialize
it only in the constructor and ensure close() is reachable on exceptions.
CagraSearchParams searchParameters = queries.get(0).getCagraSearchParameters();

try (var resourcesAccessor = resources.access()) {
  long cuvsRes = resourcesAccessor.handle();
  var cuvsStream = getStream(cuvsRes);

  try (var globalNeighborsDP = allocateRMMSegment(cuvsRes, neighborsBytes);
      var globalDistancesDP = allocateRMMSegment(cuvsRes, distancesBytes);
      var outIdxDP = allocateRMMSegment(cuvsRes, outIdxBytes);
      var outValDP = allocateRMMSegment(cuvsRes, outValBytes)) {

    // --- Phase 1: call cuvsCagraSearchMultiSegment ---
    // Single kernel launch covers all segments; results land in globalNeighborsDP /
    // globalDistancesDP on the same CUDA stream, so SelectK below sees them via ordering.
    try (var arena = Arena.ofConfined()) {
      MemorySegment sp = CuVSParamsHelper.buildCagraSearchParams(arena, searchParameters);
Per-segment query settings are being dropped.
This path always builds search params from queries.get(0), and the native multi-segment API you're calling has no filter argument, so any search params or prefilter carried by queries[1..] are silently ignored. Either validate that every CagraQuery is equivalent and unfiltered, or plumb per-segment filtering/params through the native API before exposing List<CagraQuery> here.
Also applies to: 147-156
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/MultiSegmentCagraSearch.java`
around lines 96 - 111, The code in MultiSegmentCagraSearch currently uses
CagraSearchParams built from queries.get(0) (via
CuVSParamsHelper.buildCagraSearchParams) and thus drops per-segment settings
from subsequent CagraQuery entries; fix by either validating that all CagraQuery
instances in queries have identical search params and no per-segment filters
(throw IllegalArgumentException from MultiSegmentCagraSearch if any CagraQuery
differs from queries.get(0)), or plumb per-segment parameters through the native
call: extend the native wrapper (the cuvsCagraSearchMultiSegment binding) and
CuVSParamsHelper to accept/allocate an array of CagraSearchParams (build one
MemorySegment per CagraQuery) and pass that array/handle when invoking the
multi-segment search so each segment’s filters/params are applied.
long[] segShape = {1, k};
for (int i = 0; i < numSegments; i++) {
  // Index handle
  indexArray.setAtIndex(ValueLayout.ADDRESS, i, buffered[i].getIndexHandle());

  // Query DLTensor
  var queryVectors = (CuVSMatrixInternal) queries.get(i).getQueryVectors();
  queriesArray.setAtIndex(ValueLayout.ADDRESS, i, queryVectors.toTensor(arena));

  // Neighbors DLTensor — slice of global buffer
  long nByteOffset = (long) i * k * Integer.BYTES;
  MemorySegment nSlice =
      MemorySegment.ofAddress(globalNeighborsDP.handle().address() + nByteOffset);
  neighborsArray.setAtIndex(
      ValueLayout.ADDRESS,
      i,
      prepareTensor(arena, nSlice, segShape, kDLUInt(), 32, kDLCUDA()));

  // Distances DLTensor — slice of global buffer
  long dByteOffset = (long) i * k * Float.BYTES;
  MemorySegment dSlice =
      MemorySegment.ofAddress(globalDistancesDP.handle().address() + dByteOffset);
  distancesArray.setAtIndex(
      ValueLayout.ADDRESS,
      i,
      prepareTensor(arena, dSlice, segShape, kDLFloat(), 32, kDLCUDA()));
}
Enforce the single-query contract before sizing output buffers.
Each segment gets a [1, k] neighbors/distances tensor, but the input query tensor comes straight from queries.get(i).getQueryVectors(). If any caller passes nq > 1, the native search will try to write nq * k results into storage sized for only k, which is an out-of-bounds device write. Please reject multi-row query tensors here or size the output slices from the actual query row count.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/MultiSegmentCagraSearch.java`
around lines 119 - 145, The code must enforce the single-query contract or size
outputs from the actual query row count: in MultiSegmentCagraSearch, after
obtaining var queryVectors = (CuVSMatrixInternal)
queries.get(i).getQueryVectors(), read its row count (e.g.,
queryVectors.getRowCount()/numRows()/size(0) — use the actual accessor on
CuVSMatrixInternal) into int nq; if nq > 1 throw an IllegalArgumentException
rejecting multi-row queries, or alternatively set segShape = new long[] {nq, k}
and compute neighbor/distance byte offsets and tensor sizes using nq*k (update
nByteOffset/dByteOffset and prepareTensor calls for
neighborsArray/distancesArray accordingly) so prepareTensor and the
globalNeighborsDP/globalDistancesDP slices match the query row count.
This PR adds a CAGRA search API and associated implementation for batching workloads involving a single query and multiple indexes (e.g. index segments built from one dataset by cuvs-lucene). This promotes GPU utilization for non-batched queries by parallelizing work over multiple CTAs, while minimizing the number of concurrent kernels for potential host-side parallelism.
Addresses cuvs-lucene issue 124.
Associated cuvs-lucene PR: rapidsai/cuvs-lucene#133.