
Multi segment cagra search #2035

Draft
jamxia155 wants to merge 10 commits into rapidsai:main from jamxia155:multi-segment-cagra-search

Conversation

@jamxia155 (Contributor) commented Apr 22, 2026

This PR adds a CAGRA search API and associated implementation for batching workloads involving a single query and multiple indexes (e.g., index segments built from one dataset by cuvs-lucene). It improves GPU utilization for non-batched queries by parallelizing work across multiple CTAs, while minimizing the number of concurrent kernels to leave room for host-side parallelism.

Addresses cuvs-lucene issue 124.

Associated cuvs-lucene PR: rapidsai/cuvs-lucene#133.

jamxia155 and others added 8 commits April 22, 2026 12:58
  Introduces a globally-optimized path for searching across multiple CAGRA
  index segments without per-segment device-to-host copies.

  - c/include/cuvs/selection/select_k.h, c/src/selection/select_k.cpp:
    new C API function cuvsSelectK that wraps cuvs::selection::select_k
    for selecting the k smallest float values from a flat device array

  - BufferedCagraSearch (internal interface): searchIntoBuffer() writes
    per-segment CAGRA results into a slice of a caller-owned device buffer
    without syncing the stream or copying to host

  - CagraIndexImpl: implements BufferedCagraSearch; searchIntoBuffer()
    computes byte offsets into the global buffer using segmentIdx * topK

  - SelectKHelper: Panama FFI binding for cuvsSelectK

  - MultiSegmentCagraSearch: orchestrates the full pipeline — queue all
    per-segment searches, sync once, run cuvsSelectK on GPU, sync again,
    single D2H copy, decode results

  - MultiSegmentSearchResults: simple result carrier with count,
    segmentIndices, ordinals, and distances arrays
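The decode step at the end of the pipeline can be sketched as follows. This is an illustrative mini-class, not the actual implementation: it only assumes the buffer layout implied by the description above (each segment's results occupy a contiguous block of topK entries, so a flat column index from cuvsSelectK maps back to a segment and a rank within it).

```java
public final class SelectKDecode {
  // Map a flat select-k column index back to (segment, rank-within-segment),
  // assuming each segment occupies a contiguous block of topK entries
  // in the shared device buffer.
  static int[] decode(int flatIdx, int topK) {
    return new int[] { flatIdx / topK, flatIdx % topK };
  }

  public static void main(String[] args) {
    int[] hit = decode(27, 10); // hypothetical flat index with topK = 10
    System.out.println("segment=" + hit[0] + " slot=" + hit[1]); // segment=2 slot=7
  }
}
```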
  Implements concurrent per-segment GPU search that eliminates per-segment
  device-to-host copies and CPU blocking on workspace deallocation.
  Key changes:
  - CudaStreamPool: fixed-size pool of non-blocking CUDA streams (one
    cuvsResources_t per slot). Segments are assigned to slots via
    round-robin so searches on different slots run concurrently. Pool size
    defaults to 8 and is overridden via the system property
    com.nvidia.cuvs.streamPoolSize.
  - MultiSegmentCagraSearch: single-query search across N index segments
    using the stream pool. All per-segment CAGRA kernels write into a
    shared device buffer (no per-segment D2H copy or stream sync). A
    single cuvsSelectK call finds the global top-k entirely on GPU, then
    one D2H copy transfers the results.
  - BufferedCagraSearch / CagraIndexImpl: new searchIntoBuffer() method
    that queues a CAGRA search kernel on a caller-supplied stream and
    writes results at a given row offset into a pre-allocated device
    buffer.
  - cuvsRMMAsyncMemoryResourceEnable() (C API + Java bindings): switches
    the current device memory resource to cuda_async_memory_resource so
    that workspace deallocations issued by CAGRA's search plan destructor
    are stream-ordered and non-blocking. Without this, cudaFree serializes
    kernel launches across streams regardless of stream assignment,
    nullifying the stream pool benefit.
  CudaStreamPool was a static singleton shared across all threads. With
  multiple concurrent query threads, calls to MultiSegmentCagraSearch.search()
  would alias onto the same pool slots after Math.floorMod, causing concurrent
  cudaEventRecord and cudaStreamWaitEvent calls on the same event handle
  (undefined behavior) and cross-thread stream interference. Additionally,
  CudaStreamPool.closeInstance() was called from CuVSResourcesImpl.close(),
  so whichever thread closed its resources first would destroy the shared pool
  while other threads were still using it.

  Fix: make CudaStreamPool a per-CuVSResources instance rather than a static
  singleton. One pool is created and owned by each CuVSResourcesImpl; it is
  closed when that instance is closed. Since CuVSResources is thread-local in
  the Lucene integration, each query thread gets its own independent set of
  streams and events with no sharing or locking required.

  - CudaStreamPool: remove static singleton (getOrCreate, closeInstance, static
    volatile instance, static AtomicInteger slotCounter); add package-private
    constructor; replace static slotCounter with an instance int and a new
    nextSlot(int count) method.

  - CuVSResourcesImpl: add final CudaStreamPool streamPool field (sized from
    com.nvidia.cuvs.streamPoolSize system property); close it directly in
    close(); add static getStreamPool(CuVSResources) helper for
    MultiSegmentCagraSearch to retrieve the per-resources pool.

  - MultiSegmentCagraSearch: get pool via CuVSResourcesImpl.getStreamPool
    and advance via pool.nextSlot. Remove redundant cuvsStreamSync after
    cuvsSelectK — the D2H copies are enqueued on the same stream so CUDA
    ordering already serializes them. Replace three separate hostArena.allocate
    calls with one contiguous allocation (Long.BYTES-aligned) sliced into three
    typed views, reducing OS-level allocation overhead per query.
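A minimal sketch of the per-instance round-robin slot assignment described above (names follow the commit message; the surrounding stream/event plumbing is omitted). Because each CuVSResources, and hence each pool, is thread-local in the Lucene integration, a plain instance int suffices with no locking:

```java
public final class SlotCounterSketch {
  private final int poolSize;
  private int slotCounter; // instance state, not static: one counter per pool

  public SlotCounterSketch(int poolSize) {
    this.poolSize = poolSize;
  }

  // Return the slot the caller should use, then advance the counter by `count`.
  // floorMod keeps the result in [0, poolSize) even if the counter overflows.
  public int nextSlot(int count) {
    int slot = Math.floorMod(slotCounter, poolSize);
    slotCounter += count;
    return slot;
  }

  public static void main(String[] args) {
    SlotCounterSketch pool = new SlotCounterSketch(8);
    System.out.println(pool.nextSlot(1)); // 0
    System.out.println(pool.nextSlot(1)); // 1
  }
}
```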
The persistent kernel runner was previously keyed on (dataset_desc,
graph, fixed search params), which forced a destroy/recreate cycle for
every segment when searching a multi-segment Lucene index: each segment
has a different graph pointer and potentially a different auto-computed
max_iterations, producing a different hash on every call.

C++ changes (search_single_cta_kernel-inl.cuh):
- Move dataset_desc_ptr, graph_ptr, and graph_degree from fixed runner
  state into per-job fields in job_desc_t. The persistent kernel reads
  them from the job descriptor, so one runner instance can serve any
  number of segments without being rebuilt.
- Remove dataset_desc and graph arguments from persistent_runner_t
  constructor and calculate_parameter_hash; the runner is now keyed
  only on fixed kernel parameters (block_size, smem_size, itopk, etc.).
- Update select_and_run to initialize the device descriptor on the
  caller's stream and synchronize before submission, then pass
  dd_dev_ptr, graph.data_handle(), and graph_degree to runner::launch.
- Remove dd_host from persistent_runner_t; dataset upload is now the
  caller's responsibility on each launch.

Java changes:
- Add persistent, persistentLifetime, and persistentDeviceUsage fields,
  getters, and Builder methods to CagraSearchParams.
- Wire the three persistent params through CuVSParamsHelper into the
  Panama-generated cuvsCagraSearchParams struct.
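A self-contained sketch of how the three new builder fields might look. This is a hypothetical mini-class, not the real CagraSearchParams (which has many more fields); the default values and the (0, 1] bound on persistentDeviceUsage are assumptions taken from the discussion on this PR.

```java
public final class PersistentParamsSketch {
  final boolean persistent;
  final float persistentLifetime;    // hypothetical: how long the persistent kernel stays alive
  final float persistentDeviceUsage; // hypothetical: fraction of the device the kernel may use

  private PersistentParamsSketch(Builder b) {
    this.persistent = b.persistent;
    this.persistentLifetime = b.persistentLifetime;
    this.persistentDeviceUsage = b.persistentDeviceUsage;
  }

  public static final class Builder {
    private boolean persistent;
    private float persistentLifetime = 2.0f;    // assumed default
    private float persistentDeviceUsage = 1.0f; // assumed default

    public Builder withPersistent(boolean v) { this.persistent = v; return this; }
    public Builder withPersistentLifetime(float v) { this.persistentLifetime = v; return this; }

    public Builder withPersistentDeviceUsage(float v) {
      // reject out-of-range values early rather than failing in native code
      if (v <= 0.0f || v > 1.0f) {
        throw new IllegalArgumentException("persistentDeviceUsage must be in (0, 1], got: " + v);
      }
      this.persistentDeviceUsage = v;
      return this;
    }

    public PersistentParamsSketch build() { return new PersistentParamsSketch(this); }
  }
}
```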
…raSearch

In persistent mode, `searchIntoBuffer` blocks on the CPU until the GPU
signals completion via a system-scope atomic. Previously, segments were
searched sequentially, so the GPU processed one segment at a time per
query, leaving its job queue mostly idle between segment dispatches.

Submit one async task per pool slot so all slots' segment searches are
in-flight simultaneously. The persistent runner's job queue can hold
all N segment jobs at once, allowing GPU workers to execute segments in
parallel (bounded by worker_queue_size).

Segments are grouped by pool slot rather than submitted one-per-segment
to prevent concurrent access to the same cuvsResources_t handle: the
descriptor_cache stored inside the RAFT resources object is not
thread-safe, and multiple threads calling cuvsCagraSearch with the same
handle causes a SIGSEGV. Grouping ensures each cuvsResources_t is
accessed by at most one thread at a time. Effective parallelism is
min(numSegments, pool.size()); increasing cuvsStreamPoolSize raises
the ceiling.

In non-persistent mode the existing sequential loop is unchanged:
kernel launches are asynchronous and return immediately, so Java-level
parallelism adds overhead without benefit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
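The group-by-slot submission scheme described above can be sketched like this. The actual search call and pool types are replaced by placeholders; only the grouping and the single join point are shown:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public final class GroupBySlotSketch {
  // Assign each segment to a pool slot, then collect segments per slot so that
  // one task (and thus one thread) owns each slot's cuvsResources_t handle.
  static Map<Integer, List<Integer>> groupBySlot(int numSegments, int poolSize) {
    Map<Integer, List<Integer>> bySlot = new HashMap<>();
    for (int seg = 0; seg < numSegments; seg++) {
      bySlot.computeIfAbsent(seg % poolSize, s -> new ArrayList<>()).add(seg);
    }
    return bySlot;
  }

  public static void main(String[] args) throws Exception {
    Map<Integer, List<Integer>> bySlot = groupBySlot(10, 4);
    ExecutorService pool = Executors.newFixedThreadPool(4);
    List<Future<?>> futures = new ArrayList<>();
    for (Map.Entry<Integer, List<Integer>> e : bySlot.entrySet()) {
      List<Integer> segs = e.getValue(); // effectively final copy for the lambda
      futures.add(pool.submit(() -> {
        for (int seg : segs) {
          // placeholder for searchIntoBuffer(seg, ...) on this slot's stream
        }
      }));
    }
    for (Future<?> f : futures) f.get(); // one join point after all slots finish
    pool.shutdown();
  }
}
```

Effective parallelism is min(numSegments, poolSize), matching the ceiling described in the commit message.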
Multi-segment search (C++/C/Java):
- Add search_multi_segment() C++ function: builds a single search plan
  sized for the largest segment, packs per-segment descriptors, and
  launches search_kernel_ms — a new SINGLE_CTA kernel with grid
  dimensions (1, num_queries, num_segments) so each CTA independently
  searches one (query, segment) pair in a single kernel call.
- Add cuvsCagraSearchMultiSegment() C API wrapping the above.
- Simplify MultiSegmentCagraSearch.java to call
  cuvsCagraSearchMultiSegment() unconditionally, removing the previous
  per-segment stream pool, thread pool, CUDA event synchronization, and
  persistent/non-persistent branching. The search now completes in four
  phases: multi-segment kernel, GPU-side select-k, single D2H copy,
  result decoding.
- Add BufferedCagraSearch.getIndexHandle() to expose the raw
  cuvsCagraIndex_t handle needed by the multi-segment kernel dispatch.

Workspace pool (C/Java):
- Add cuvsResourcesSetWorkspacePool(): configures the per-resources
  temporary workspace as an uncapped RMM pool that grows without
  shrinking. After warmup, cuvsRMMAlloc/cuvsRMMFree hit the pool cache
  instead of calling cudaMallocAsync/cudaFreeAsync, eliminating CUDA
  context lock contention under concurrent query threads.
- Route cuvsRMMAlloc/cuvsRMMFree through the workspace resource so Java-
  side output buffer allocations also benefit from the pool.
- Expose setWorkspacePool() in the CuVSResources Java interface with
  implementations in CuVSResourcesImpl and SynchronizedCuVSResources.

Refactoring (search_single_cta_kernel-inl.cuh):
- Extract TopkVariant enum and select_topk_variant() helper, shared by
  search_kernel_config and search_kernel_config_ms, replacing duplicated
  if/else trees in both choose_itopk_and_mx_candidates() bodies.
- Extract kernel_dispatch_params::compute() to centralize max_candidates
  and max_itopk computation shared by select_and_run and
  select_and_run_multi_segment.
- Extract hashmap_element_count() static helper on the search struct,
  used by both set_params (single-segment) and run_multi_segment.
Add HALF(2) to the DataType enum and wire it through the Java layer so
callers can build and exchange float16 matrices without any Java-side
type conversion:

- CuVSMatrix.Builder: new addVector(short[]) overload; each element is
  a raw IEEE 754 binary16 bit pattern held in a short
- LinkerHelper: add C_SHORT via canonicalLayouts("short")
- CuVSMatrixBaseImpl: map HALF → C_SHORT in valueLayoutFromType();
  decode DLPack (kDLFloat, bits=16) → DataType.HALF in
  dataTypeFromTensor()
- CuVSMatrixInternal: HALF maps to the same kDLFloat DLPack type code
  as FLOAT; the bits field (16 vs 32) distinguishes them on the C side
- JDKProvider.MatrixBuilder: implement addVector(short[]) following the
  same MemorySegment.ofArray pattern as the other overloads
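Since HALF vectors are exchanged as raw binary16 bit patterns in shorts, callers on Java 20+ can produce them with Float.floatToFloat16. For illustration on older runtimes, a simplified conversion for normal values (no rounding, subnormal, infinity, or NaN handling) looks like:

```java
public final class HalfBitsSketch {
  // Simplified float32 -> binary16 bit-pattern conversion for normal values.
  // Real code should handle rounding, subnormals, infinities and NaN
  // (or just use Float.floatToFloat16 on Java 20+).
  static short toHalfBits(float f) {
    int bits = Float.floatToIntBits(f);
    int sign = (bits >>> 16) & 0x8000;           // sign bit moves to bit 15
    int exp = ((bits >>> 23) & 0xFF) - 127 + 15; // rebias exponent: 8 -> 5 bits
    int mant = (bits >>> 13) & 0x3FF;            // truncate mantissa: 23 -> 10 bits
    return (short) (sign | (exp << 10) | mant);
  }

  public static void main(String[] args) {
    System.out.printf("0x%04X%n", toHalfBits(1.5f)); // 0x3E00
  }
}
```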
@jamxia155 jamxia155 requested review from a team as code owners April 22, 2026 23:02
@jamxia155 jamxia155 marked this pull request as draft April 22, 2026 23:15
copy-pr-bot Bot commented Apr 22, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


coderabbitai Bot commented Apr 22, 2026

📝 Walkthrough

This pull request introduces multi-segment CAGRA search functionality, GPU-based selection, and RMM memory pool management. It adds C and C++ APIs for multi-segment approximate nearest neighbor search, async memory allocation, and select-K operations, alongside corresponding Java bindings and internal implementations supporting concurrent per-segment GPU processing and stream-based result aggregation.

Changes

Cohort / File(s) Summary
Build Configuration
c/CMakeLists.txt
Added src/selection/select_k.cpp to the C API library source list.
C API: RMM Memory Management
c/include/cuvs/core/c_api.h, c/src/core/c_api.cpp
Introduced cuvsResourcesSetWorkspacePool to configure per-resources uncapped device memory pools and cuvsRMMAsyncMemoryResourceEnable to switch RMM to stream-ordered async allocation. Extended cuvsRMMMemoryResourceReset to clear thread-local async resources.
C API: Multi-Segment CAGRA
c/include/cuvs/neighbors/cagra.h, c/src/neighbors/cagra.cpp
Added cuvsCagraSearchMultiSegment to perform concurrent ANN searches across multiple CAGRA index segments, accepting per-segment index pointers and query/result tensors.
C API: Select-K
c/include/cuvs/selection/select_k.h, c/src/selection/select_k.cpp
Introduced cuvsSelectK to select k smallest values from GPU tensors, returning values and column indices.
C++ API: Multi-Segment CAGRA
cpp/include/cuvs/neighbors/cagra.hpp, cpp/src/neighbors/cagra.cuh, cpp/src/neighbors/cagra_search_inst.cu.in
Added public search_multi_segment function overloads for float, half, int8_t, uint8_t index types with configurable output neighbor types (uint32_t or int64_t). Included explicit template instantiations for multi-segment variants.
C++ Implementation: Multi-Segment Search Logic
cpp/src/neighbors/detail/cagra/cagra_search.cuh
Implemented multi-segment CAGRA search orchestration, handling dataset validation, kernel configuration, per-segment device descriptor management, and distance postprocessing (cosine-expanded normalization).
C++ Implementation: Multi-Segment Kernel Infrastructure
cpp/src/neighbors/detail/cagra/search_single_cta_kernel.cuh, cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh, cpp/src/neighbors/detail/cagra/search_single_cta_inst.cuh
Added multi-segment kernel execution path with per-segment descriptor structure, concurrent CTA scheduling via blockIdx.z, hashmap sizing across segments, persistent runner refactoring to use job-specific descriptors, and select_and_run_multi_segment launcher.
C++ Implementation: Multi-Segment Search Helpers
cpp/src/neighbors/detail/cagra/search_single_cta.cuh
Introduced run_multi_segment and hashmap_element_count helpers to orchestrate concurrent segment processing and allocate global hashmap buffers.
Java: Minor Updates
java/cuvs-java/src/main/java/com/nvidia/cuvs/{CagraIndex,HnswIndex}.java, java/cuvs-java/src/main/java/com/nvidia/cuvs/{CuVSAceParams,HnswAceParams,HnswIndexParams}.java
Updated SPDX copyright headers and reformatted method signatures for consistency.
Java: Core API Extensions
java/cuvs-java/src/main/java/com/nvidia/cuvs/{CagraSearchParams,CuVSMatrix,CuVSResources,SynchronizedCuVSResources}.java
Added persistent kernel configuration options to CagraSearchParams, HALF data type support to CuVSMatrix, and setWorkspacePool method to CuVSResources interface and implementations.
Java: SPI
java/cuvs-java/src/main/java/com/nvidia/cuvs/spi/{CuVSProvider,UnsupportedProvider}.java
Added enableRMMAsyncMemory() method to provider interface and stub implementation.
Java: Multi-Segment Search API
java/cuvs-java/src/main/java22/com/nvidia/cuvs/{MultiSegmentSearchResults,MultiSegmentCagraSearch}.java
Introduced result class to represent decoded multi-segment search outputs and public API for multi-segment search with global top-k selection via GPU-resident select-K.
Java: Buffered Search Interface
java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/{BufferedCagraSearch,CagraIndexImpl}.java
Defined interface for GPU-side buffered search into caller-provided device buffers and implemented support in CagraIndexImpl for per-segment search without host synchronization.
Java: Internal Helpers & Bindings
java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/{SelectKHelper,CuVSParamsHelper,CudaStreamPool,LinkerHelper}.java
Added native FFM bindings for select-K, CAGRA search parameters builder, fixed-size CUDA stream/resource pool with round-robin slot allocation, and C_SHORT layout constant.
Java: Internal Matrix & Resources
java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/{CuVSMatrixBaseImpl,CuVSMatrixInternal,CuVSResourcesImpl,HnswIndexImpl}.java
Added half-precision float tensor handling, integrated stream pool ownership in resources, workspace pool implementation via native API, and formatting adjustments.
Java: Provider Implementation
java/cuvs-java/src/main/java22/com/nvidia/cuvs/spi/JDKProvider.java
Implemented enableRMMAsyncMemory() and added short[] vector support for matrix building.
Java: Test Utilities
java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java
Implemented setWorkspacePool delegation in test wrapper.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage — ⚠️ Warning: docstring coverage is 42.39%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
- Title check — ✅ Passed: the title 'Multi segment cagra search' directly corresponds to the main feature being added, a CAGRA search API for multiple index segments.
- Linked Issues check — ✅ Passed: check skipped because no linked issues were found for this pull request.
- Out of Scope Changes check — ✅ Passed: check skipped because no linked issues were found for this pull request.
- Description check — ✅ Passed: the PR description explains the core feature (CAGRA multi-segment search API) and its purpose (batching workloads for a single query across multiple indexes).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai Bot left a comment

Actionable comments posted: 17


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java (1)

293-322: ⚠️ Potential issue | 🟡 Minor

The toString() method does not include the new persistent fields.

The three new fields (persistent, persistentLifetime, persistentDeviceUsage) are not included in the toString() output. While not critical, this creates an inconsistency where debugging output won't show the full configuration.

🔧 Proposed fix to include persistent fields in toString()
         + ", randXORMask="
         + randXORMask
+        + ", persistent="
+        + persistent
+        + ", persistentLifetime="
+        + persistentLifetime
+        + ", persistentDeviceUsage="
+        + persistentDeviceUsage
         + "]";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java` around
lines 293 - 322, The toString() in class CagraSearchParams currently omits the
new fields persistent, persistentLifetime, and persistentDeviceUsage; update the
CagraSearchParams.toString() method to append these three fields (with their
names and values) into the returned string (same formatting style as the other
fields) so debugging output shows the full configuration.
🧹 Nitpick comments (3)
java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java (1)

59-62: Add destroyed-state guard in the new delegate method.

setWorkspacePool(...) should call checkNotDestroyed() before delegating, consistent with this wrapper’s defensive contract.

Proposed change
   @Override
   public void setWorkspacePool(long sizeBytes) {
+    checkNotDestroyed();
     inner.setWorkspacePool(sizeBytes);
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java`
around lines 59 - 62, The new delegate method setWorkspacePool(long sizeBytes)
is missing the wrapper's destroyed-state guard; modify
CheckedCuVSResources.setWorkspacePool to call checkNotDestroyed() at the start
(before delegating to inner.setWorkspacePool(sizeBytes)) so it matches the
class's defensive contract and uses the same destroyed-state check as other
methods.
java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java (1)

527-537: Consider adding validation for persistentDeviceUsage bounds.

The Javadoc states the value "must be greater than 0.0 and not greater than 1.0", but the setter doesn't validate this constraint. Invalid values would only fail at the native layer.

🛡️ Optional: Add validation in the builder
     public Builder withPersistentDeviceUsage(float persistentDeviceUsage) {
+      if (persistentDeviceUsage <= 0.0f || persistentDeviceUsage > 1.0f) {
+        throw new IllegalArgumentException(
+            "persistentDeviceUsage must be > 0.0 and <= 1.0, got: " + persistentDeviceUsage);
+      }
       this.persistentDeviceUsage = persistentDeviceUsage;
       return this;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java` around
lines 527 - 537, The builder setter withPersistentDeviceUsage currently assigns
persistentDeviceUsage without enforcing the Javadoc constraint; update
Builder.withPersistentDeviceUsage to validate that persistentDeviceUsage > 0.0f
and <= 1.0f and throw an IllegalArgumentException (including the invalid value
in the message) when the check fails so invalid values are rejected early before
reaching native code.
cpp/src/neighbors/detail/cagra/cagra_search.cuh (1)

388-417: Consider hoisting query_norms allocation outside the loop for CosineExpanded metric.

When multiple segments use CosineExpanded metric, query_norms is allocated and computed redundantly for each segment. Since queries are the same across segments (repeated query vector), the norms could be computed once.

This is a minor optimization opportunity. The current implementation is correct; the redundant computation is bounded by the number of segments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/detail/cagra/cagra_search.cuh` around lines 388 - 417,
Hoist the allocation and computation of query_norms out of the segment loop and
reuse it for every segment whose indices[i]->metric() ==
cuvs::distance::DistanceType::CosineExpanded: allocate query_norms once (using
raft::make_device_vector) before the for-loop, run raft::linalg::reduce on the
shared queries data to compute the norms, then inside the loop call
raft::linalg::matrix_vector_op (as currently done) using the precomputed
query_norms for each CosineExpanded segment; after the loop free or let
query_norms go out of scope. Ensure you still call
cuvs::neighbors::ivf::detail::postprocess_distances for non-CosineExpanded
branches and keep all existing ops (raft::compose_op, raft::sq_op,
raft::div_const_op, raft::cast_op, raft::add_const_op, raft::div_checkzero_op)
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@c/include/cuvs/core/c_api.h`:
- Around line 229-240: The current implementation of
cuvsRMMAsyncMemoryResourceEnable stores cuda_async_memory_resource in a
thread_local async_mr and passes it to rmm::mr::set_current_device_resource(),
creating a lifetime mismatch vs the API doc that says the change is global;
change the implementation so the async memory resource is process-scoped (not
thread_local) by replacing thread_local async_mr with a static/process-global
instance (e.g., a static unique_ptr or static object) so it outlives the thread
and remains valid for rmm::mr::set_current_device_resource(), and ensure
cuvsRMMAsyncMemoryResourceEnable and any cleanup use that same process-global
symbol (cuda_async_memory_resource / async_mr) when setting or resetting the
current device resource; alternatively, if thread-local semantics are intended,
update the cuvsRMMAsyncMemoryResourceEnable documentation to state it only
affects the calling thread and keep async_mr thread_local.

In `@c/src/neighbors/cagra.cpp`:
- Around line 710-733: The loop currently casts every indices[i]->addr and
builds device views without validating each segment's types/devices; update the
loop that fills idx_vec, q_vec, n_vec, d_vec to perform the same per-segment
checks used in cuvsCagraSearch: assert indices[i] != nullptr and addr != 0
(already present), then verify indices[i]->dtype.code == kDLFloat &&
indices[i]->dtype.bits == 32 for every i, and validate that queries[i],
neighbors[i], and distances[i] are device-backed DLPack tensors with the
expected element types (float for queries/distances, uint32 for neighbors)
before calling reinterpret_cast<const IndexT*>(indices[i]->addr) and
cuvs::core::from_dlpack to populate q_vec[i], n_vec[i], d_vec[i]; replace the
blind casts with RAFT_EXPECTS that include i in the error messages so a bad
segment fails fast and clearly.
- Around line 726-736: The loop that builds idx_vec/q_vec/n_vec/d_vec must also
validate that all segment indices use the same distance metric as the first
segment to prevent mixing incompatible metrics; inside the for-loop in cagra.cpp
(after the existing RAFT_EXPECTS that checks indices[i] non-null) compare
indices[i]->metric (or the actual metric field name on your IndexT struct/class)
to indices[0]->metric and fail fast with RAFT_EXPECTS (or equivalent) and a
clear message like "Mixed distance metrics across segments: expected %s but got
%s at segment %u"; keep this check before pushing idx_vec[i] and before calling
cuvs::neighbors::cagra::search_multi_segment so the function only runs when all
segments share the same metric.

In `@c/src/selection/select_k.cpp`:
- Around line 14-40: cuvsSelectK dereferences shape[1] and casts buffers without
validating the DLPack tensors; add explicit validation at the top of cuvsSelectK
for in_val, out_val, out_idx (non-null), then check each
DLManagedTensor->dl_tensor for expected ndim (==2), shapes (rows match expected
1 or compatible), dtype (in_val/out_val float32, out_idx int64), device type
(CUDA) and device id, byte_offset == 0, and contiguous row-major
strides/compatibility before creating device views with
raft::make_device_matrix_view; if any check fails return an appropriate
cuvsError_t (or throw inside translate_exceptions) instead of proceeding to
casts and calling cuvs::selection::select_k so malformed callers cannot crash or
corrupt memory.

In `@cpp/include/cuvs/neighbors/cagra.hpp`:
- Around line 1752-1806: Add Doxygen documentation for each overloaded
search_multi_segment declaration so they appear in generated API docs; either
add brief doxygen blocks above each overload or use `@copydoc` to reference the
primary search_multi_segment doc block (e.g., use `@copydoc`
search_multi_segment(raft::resources const&,
cuvs::neighbors::cagra::search_params const&, const std::vector<const
cuvs::neighbors::cagra::index<float, uint32_t>*>&, const
std::vector<raft::device_matrix_view<const float, int64_t, raft::row_major>>&,
const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>&,
const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>&))
for the overloads with half/int8_t/uint8_t and uint32_t/int64_t neighbor types
so each signature (the overloads of search_multi_segment) is documented.
- Around line 1744-1806: The header declares the template parameter order as <T,
OutputIdxT, IdxT> but the instantiation macro (used with <T, IdxT, OutputIdxT>
e.g. (data_t, uint32_t, int64_t)) expects <T, IdxT, OutputIdxT>, causing the
IdxT/OutputIdxT swap that trips the static_assert in cagra_search.cuh:276 and
breaks link-time overloads for int64_t; fix by changing the template parameter
order in the search_multi_segment declarations to <T, IdxT, OutputIdxT> (or
alternatively update the instantiation macro to match the declared order) so
types map correctly, and add missing Doxygen for overloads 2–8 by inserting a
`@copydoc` search_multi_segment (or equivalent documentation block) above each of
those overloaded search_multi_segment declarations to satisfy the public API
docs requirement.

In `@cpp/src/neighbors/cagra.cuh`:
- Around line 409-420: The wrapper search_multi_segment currently forwards
indices, queries, neighbors, and distances without validation; add the same
upfront shape checks used by search() before calling
cagra::detail::search_multi_segment: verify indices.size() == queries.size() ==
neighbors.size() == distances.size(), then for each segment i ensure
queries[i].n_rows == neighbors[i].n_rows, neighbors[i].n_cols ==
distances[i].n_cols (k matches), queries[i].n_cols equals the index dimension
for indices[i] (or indices[i]->dim() / appropriate accessor), and that k > 0 and
dims are consistent; if any check fails, return/throw a clear error (or use
RAFT/CUASSERT used elsewhere) rather than forwarding to the detail
implementation.

In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSResources.java`:
- Around line 60-76: The Java API should validate workspace pool sizes before
calling native code: update the Javadoc for setWorkspacePool to state the valid
range is > 0, and in the implementation of setWorkspacePool (the method that
currently invokes the native cuvsResourcesSetWorkspacePool) add a check that
rejects non-positive values (<= 0) by throwing an appropriate Java exception
(e.g., IllegalArgumentException) with a clear message; only call
cuvsResourcesSetWorkspacePool when the value is positive to avoid
signed->unsigned wraparound in native size_t.

In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/MultiSegmentSearchResults.java`:
- Around line 24-27: MultiSegmentSearchResults currently stores native uint32
ordinals in an int[] (field ordinals) which corrupts values > Integer.MAX_VALUE;
change ordinals from int[] to long[] (and any constructor/getter signatures) and
ensure code that decodes native uint32_t values writes unsigned values into the
long (e.g., value & 0xFFFFFFFFL) so ordinals remain non-negative; update any
consumers (notably MultiSegmentCagraSearch) to compare against a long sentinel
(e.g., -1L) or otherwise handle long ordinals instead of treating negative ints
as sentinels.
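The unsigned widening the reviewer describes is the standard mask-to-long idiom (Integer.toUnsignedLong does the same thing):

```java
public final class UnsignedOrdinalSketch {
  // Widen a native uint32_t (read into a Java int) to a non-negative long.
  static long toUnsignedLong(int raw) {
    return raw & 0xFFFFFFFFL;
  }

  public static void main(String[] args) {
    int raw = -1; // bit pattern 0xFFFFFFFF, i.e. uint32_t max
    System.out.println(toUnsignedLong(raw)); // 4294967295
  }
}
```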

In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/SynchronizedCuVSResources.java`:
- Around line 43-46: The setWorkspacePool method in SynchronizedCuVSResources is
not using the shared lock and must be serialized like access(); modify
setWorkspacePool to acquire the same lock used by access() (e.g., wrap the call
to inner.setWorkspacePool(sizeBytes) in the synchronized block or lock guard
used by access()) so that mutations to workspace pool are protected by the same
synchronization as access().

In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CagraIndexImpl.java`:
- Around line 391-396: The offset calculation uses segmentIdx * topK
(neighborByteOffset/distanceByteOffset) but the code builds tensors with shape
{numQueries, topK} (numQueries from queryVectors.size()), so when numQueries > 1
the buffer offsets are wrong; either enforce single-query by adding a guard (if
(numQueries != 1) throw new IllegalArgumentException(...)) near where numQueries
is computed (queryVectors) or update the offsets to multiply by numQueries
(neighborByteOffset = segmentIdx * numQueries * topK * C_INT_BYTE_SIZE and
distanceByteOffset = segmentIdx * numQueries * topK * Float.BYTES) before
creating neighborSlice/distanceSlice (globalNeighborsDP/globalDistancesDP).

In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CudaStreamPool.java`:
- Around line 115-119: Make nextSlot atomic and wrap the returned slot index by
changing slotCounter to an AtomicInteger and using an atomic get-and-add plus
modulo; specifically, replace the non-atomic increment in nextSlot with
something like int start = slotCounter.getAndAdd(count); then return
Math.floorMod(start, poolSize) (or equivalent using your pool size field) so the
operation is thread-safe and the returned slot is bounded by the pool size.
- Around line 125-131: In CudaStreamPool.close(), wrap the cuvsResourcesDestroy
call with the same error validation used elsewhere by replacing the raw
cuvsResourcesDestroy(resources[i]) invocation with a call to
checkCuVSError(cuvsResourcesDestroy(resources[i]), "cuvsResourcesDestroy") so
cleanup failures are logged/handled; keep the surrounding loop and existing
calls to checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy") and
checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy") unchanged and
reference the close(), cuvsResourcesDestroy, checkCuVSError, events, resources,
streams, and size symbols to locate the change.
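A sketch of the check-and-throw wrapper pattern this refers to; the real checkCuVSError signature and success code are assumptions here:

```java
// Minimal stand-in for the error-check wrapper: a raw native return code is
// validated so cleanup failures surface instead of being silently dropped.
class NativeChecks {
    static final int CUVS_SUCCESS = 0; // assumed success code

    static void checkCuVSError(int returnCode, String callName) {
        if (returnCode != CUVS_SUCCESS) {
            throw new IllegalStateException(callName + " failed with code " + returnCode);
        }
    }
}
```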
- Around line 59-85: The constructor CudaStreamPool currently leaks native
handles if a create call fails mid-loop; modify the CudaStreamPool(int size)
constructor to perform rollback cleanup on failure by tracking the current index
and, if any checkCudaError/checkCuVSError throws, iterating over
already-initialized entries in resources[], streams[], and events[] to call the
corresponding destroy functions (cudaStreamDestroy for streams[],
cudaEventDestroy for events[], and the cuvs resources destroy routine for
resources[]) before rethrowing the exception; implement this by wrapping the
allocation loop in try/catch (or try/finally with a success flag) and invoking
the same cleanup logic as close() for indices < currentIndex so no native
handles are leaked if construction aborts.
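The rollback shape described above, reduced to a runnable sketch with placeholder create/destroy hooks standing in for the CUDA/cuVS calls:

```java
// Sketch: track how many slots were fully initialized and destroy them in
// reverse order if a later allocation throws, so no handle leaks on a
// partially constructed pool. Slot is a hypothetical abstraction.
class PoolRollback {
    interface Slot { void create(int i); void destroy(int i); }

    static void construct(int size, Slot slot) {
        int created = 0;
        try {
            for (; created < size; created++) {
                slot.create(created); // may throw, e.g. on a failed checkCudaError
            }
        } catch (RuntimeException e) {
            for (int i = created - 1; i >= 0; i--) {
                slot.destroy(i); // same cleanup as close() for finished slots
            }
            throw e;
        }
    }
}
```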

In
`@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSResourcesImpl.java`:
- Around line 31-33: The CudaStreamPool is currently allocated in the field
initializer (streamPool) which can leak native resources if
CuVSResourcesImpl(Path) throws before construction completes; move creation of
the CudaStreamPool from the field initializer into the CuVSResourcesImpl(Path)
constructor (use Integer.getInteger(CudaStreamPool.SIZE_PROPERTY,
CudaStreamPool.DEFAULT_SIZE) to determine size), assign it to the streamPool
field there, and in the constructor failure path ensure you call
streamPool.close() (or otherwise tear it down) before rethrowing so native
resources are not leaked; keep the streamPool field declaration but initialize
it only in the constructor and ensure close() is reachable on exceptions.
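A condensed sketch of constructor-scoped initialization with teardown on failure; the factory and Runnable stand in for the real CudaStreamPool construction and the remainder of the CuVSResourcesImpl constructor:

```java
// Sketch: the pool is created inside the constructor (not a field
// initializer) so the failure path can close it before rethrowing.
class ResourcesSketch implements AutoCloseable {
    private final AutoCloseable streamPool;

    ResourcesSketch(java.util.function.Supplier<AutoCloseable> poolFactory,
                    Runnable restOfConstruction) {
        this.streamPool = poolFactory.get(); // created here, not at field declaration
        try {
            restOfConstruction.run(); // later construction steps may throw
        } catch (RuntimeException e) {
            try {
                streamPool.close(); // release native handles before rethrowing
            } catch (Exception suppressed) {
                e.addSuppressed(suppressed);
            }
            throw e;
        }
    }

    @Override public void close() throws Exception { streamPool.close(); }
}
```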

In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/MultiSegmentCagraSearch.java`:
- Around line 119-145: The code must enforce the single-query contract or size
outputs from the actual query row count: in MultiSegmentCagraSearch, after
obtaining var queryVectors = (CuVSMatrixInternal)
queries.get(i).getQueryVectors(), read its row count (e.g.,
queryVectors.getRowCount()/numRows()/size(0) — use the actual accessor on
CuVSMatrixInternal) into int nq; if nq > 1 throw an IllegalArgumentException
rejecting multi-row queries, or alternatively set segShape = new long[] {nq, k}
and compute neighbor/distance byte offsets and tensor sizes using nq*k (update
nByteOffset/dByteOffset and prepareTensor calls for
neighborsArray/distancesArray accordingly) so prepareTensor and the
globalNeighborsDP/globalDistancesDP slices match the query row count.
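If the single-query contract is chosen, the guard is a one-liner; sketched here with an illustrative name:

```java
// Guard sketch for the single-query contract: reject multi-row query
// matrices up front rather than writing into mis-sized buffer slices.
class QueryGuard {
    static void requireSingleQuery(long numRows) {
        if (numRows != 1) {
            throw new IllegalArgumentException(
                "multi-segment search expects exactly one query row, got " + numRows);
        }
    }
}
```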
- Around line 96-111: The code in MultiSegmentCagraSearch currently uses
CagraSearchParams built from queries.get(0) (via
CuVSParamsHelper.buildCagraSearchParams), dropping per-segment settings from
subsequent CagraQuery entries; fix this either by validating that all
CagraQuery instances in queries have identical search params and no
per-segment filters (throw IllegalArgumentException from
MultiSegmentCagraSearch if any CagraQuery differs from queries.get(0)), or by
plumbing per-segment parameters through the native call: extend the native
wrapper (the cuvsCagraSearchMultiSegment binding) and CuVSParamsHelper to
accept/allocate an array of CagraSearchParams (build one MemorySegment per
CagraQuery) and pass that array/handle when invoking the multi-segment search
so each segment’s filters/params are applied.
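The validation route — rejecting queries whose search params differ from the first — can be sketched generically; the params type below is a placeholder for CagraSearchParams and relies on a meaningful equals():

```java
// Sketch: require that every segment's search params match segment 0,
// failing fast with the offending index in the message.
class ParamsValidator {
    static <P> void requireUniformParams(java.util.List<P> paramsPerSegment) {
        P first = paramsPerSegment.get(0);
        for (int i = 1; i < paramsPerSegment.size(); i++) {
            if (!java.util.Objects.equals(first, paramsPerSegment.get(i))) {
                throw new IllegalArgumentException(
                    "segment " + i + " has search params differing from segment 0");
            }
        }
    }
}
```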

---

Outside diff comments:
In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java`:
- Around line 293-322: The toString() in class CagraSearchParams currently omits
the new fields persistent, persistentLifetime, and persistentDeviceUsage; update
the CagraSearchParams.toString() method to append these three fields (with their
names and values) into the returned string (same formatting style as the other
fields) so debugging output shows the full configuration.

---

Nitpick comments:
In `@cpp/src/neighbors/detail/cagra/cagra_search.cuh`:
- Around line 388-417: Hoist the allocation and computation of query_norms out
of the segment loop and reuse it for every segment whose indices[i]->metric() ==
cuvs::distance::DistanceType::CosineExpanded: allocate query_norms once (using
raft::make_device_vector) before the for-loop, run raft::linalg::reduce on the
shared queries data to compute the norms, then inside the loop call
raft::linalg::matrix_vector_op (as currently done) using the precomputed
query_norms for each CosineExpanded segment; after the loop free or let
query_norms go out of scope. Ensure you still call
cuvs::neighbors::ivf::detail::postprocess_distances for non-CosineExpanded
branches and keep all existing ops (raft::compose_op, raft::sq_op,
raft::div_const_op, raft::cast_op, raft::add_const_op, raft::div_checkzero_op)
unchanged.

In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java`:
- Around line 527-537: The builder setter withPersistentDeviceUsage currently
assigns persistentDeviceUsage without enforcing the Javadoc constraint; update
Builder.withPersistentDeviceUsage to validate that persistentDeviceUsage > 0.0f
and <= 1.0f and throw an IllegalArgumentException (including the invalid value
in the message) when the check fails so invalid values are rejected early before
reaching native code.
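A sketch of the requested range check; writing the predicate as `!(v > 0.0f && v <= 1.0f)` also rejects NaN:

```java
// Range check sketch for withPersistentDeviceUsage: accept only values in
// (0, 1] and report the offending value, per the Javadoc constraint.
class UsageValidator {
    static float validatePersistentDeviceUsage(float v) {
        if (!(v > 0.0f && v <= 1.0f)) { // negated form catches NaN too
            throw new IllegalArgumentException(
                "persistentDeviceUsage must be in (0, 1], got " + v);
        }
        return v;
    }
}
```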

In `@java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java`:
- Around line 59-62: The new delegate method setWorkspacePool(long sizeBytes) is
missing the wrapper's destroyed-state guard; modify
CheckedCuVSResources.setWorkspacePool to call checkNotDestroyed() at the start
(before delegating to inner.setWorkspacePool(sizeBytes)) so it matches the
class's defensive contract and uses the same destroyed-state check as other
methods.
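The destroyed-state guard pattern, reduced to a self-contained sketch (field and method names are illustrative):

```java
// Sketch: every delegating method checks the destroyed flag before touching
// the inner resource, matching the wrapper's defensive contract.
class CheckedWrapper {
    private boolean destroyed;
    private long workspacePoolBytes;

    private void checkNotDestroyed() {
        if (destroyed) throw new IllegalStateException("resource already destroyed");
    }

    void setWorkspacePool(long sizeBytes) {
        checkNotDestroyed(); // guard first, then delegate
        this.workspacePoolBytes = sizeBytes;
    }

    long getWorkspacePool() {
        checkNotDestroyed();
        return workspacePoolBytes;
    }

    void destroy() { destroyed = true; }
}
```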

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 626e7fa1-95fc-4f40-b30d-d7fbcb6521a6

📥 Commits

Reviewing files that changed from the base of the PR and between f2bffb6 and 49e5a14.

📒 Files selected for processing (40)
  • c/CMakeLists.txt
  • c/include/cuvs/core/c_api.h
  • c/include/cuvs/neighbors/cagra.h
  • c/include/cuvs/selection/select_k.h
  • c/src/core/c_api.cpp
  • c/src/neighbors/cagra.cpp
  • c/src/selection/select_k.cpp
  • cpp/include/cuvs/neighbors/cagra.hpp
  • cpp/src/neighbors/cagra.cuh
  • cpp/src/neighbors/cagra_search_inst.cu.in
  • cpp/src/neighbors/detail/cagra/cagra_search.cuh
  • cpp/src/neighbors/detail/cagra/search_single_cta.cuh
  • cpp/src/neighbors/detail/cagra/search_single_cta_inst.cuh
  • cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh
  • cpp/src/neighbors/detail/cagra/search_single_cta_kernel.cuh
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraIndex.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSAceParams.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSMatrix.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSResources.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/HnswAceParams.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/HnswIndex.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/HnswIndexParams.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/MultiSegmentSearchResults.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/SynchronizedCuVSResources.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/spi/CuVSProvider.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/spi/UnsupportedProvider.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/MultiSegmentCagraSearch.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/BufferedCagraSearch.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CagraIndexImpl.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSMatrixBaseImpl.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSMatrixInternal.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSParamsHelper.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSResourcesImpl.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CudaStreamPool.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/HnswIndexImpl.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/SelectKHelper.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/common/LinkerHelper.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/spi/JDKProvider.java
  • java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java

Comment on lines +229 to +240
/**
* @brief Switches the working memory resource to use stream-ordered asynchronous allocation
* (cudaMallocAsync / cudaFreeAsync). Unlike the pool resource, this resource returns memory to
* the stream immediately without blocking the CPU, eliminating device-wide synchronization on
* deallocation. This is especially beneficial when multiple CAGRA searches run concurrently on
* separate CUDA streams, because the internal workspace allocations no longer serialize kernel
* launches. Be aware that this function will change the memory resource for the whole process
* and the new memory resource will be used until explicitly changed.
*
* @return cuvsError_t
*/
cuvsError_t cuvsRMMAsyncMemoryResourceEnable();

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "c_api.cpp" -path "*/c/src/core/*" | head -5

Repository: rapidsai/cuvs

Length of output: 79


🏁 Script executed:

cat -n c/src/core/c_api.cpp | sed -n '160,195p'

Repository: rapidsai/cuvs

Length of output: 1665


🏁 Script executed:

cd c/src/core && grep -n "thread_local" c_api.cpp

Repository: rapidsai/cuvs

Length of output: 269


🏁 Script executed:

cd c/include/cuvs/core && grep -n "cuvsRMMAsyncMemoryResourceEnable" c_api.h -A 15

Repository: rapidsai/cuvs

Length of output: 580


🏁 Script executed:

# Search for RMM set_current_device_resource usage patterns in the codebase
rg "set_current_device_resource" --type cpp -B 2 -A 2

Repository: rapidsai/cuvs

Length of output: 2101


🏁 Script executed:

# Check if there are any docs changes or migration guides in the PR
find . -name "*.md" -o -name "*.rst" | xargs grep -l "RMM\|AsyncMemory\|cuvsRMM" 2>/dev/null | head -10

Repository: rapidsai/cuvs

Length of output: 200


🏁 Script executed:

# Check for any deprecation warnings or migration documentation
find . -type f \( -name "*.h" -o -name "*.md" \) -path "*/docs/*" | xargs grep -l "cuvsRMM\|memory.*resource" 2>/dev/null | head -5

Repository: rapidsai/cuvs

Length of output: 89


🏁 Script executed:

# Check if there are any API change notes or migration guides in the docs
cat docs/source/api_basics.rst | head -100

Repository: rapidsai/cuvs

Length of output: 2518


🏁 Script executed:

# Check CHANGELOG for any RMM API changes mentioned
rg "Async|AsyncMemory|cuvsRMMAsync" CHANGELOG.md -B 2 -A 2

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

# Check if reset_current_device_resource clears the async_mr
cat -n c/src/core/c_api.cpp | sed -n '192,210p'

Repository: rapidsai/cuvs

Length of output: 784


🏁 Script executed:

# Check how pool memory resource is handled - it passes rvalues
# Let's see if there's any documentation on RMM's set_current_device_resource behavior
rg "set_current_device_resource" --type h | head -20

Repository: rapidsai/cuvs

Length of output: 331


🏁 Script executed:

# Look for RMM in vendored code or requirements to understand its semantics
find . -name "*.txt" -o -name "*.cmake" | xargs grep -i "rmm" | head -10

Repository: rapidsai/cuvs

Length of output: 779


🏁 Script executed:

# Check if there's any documentation update about async memory resource being process-wide
rg "cuvsRMMAsyncMemoryResourceEnable" docs/ 2>/dev/null || echo "No docs match found"

Repository: rapidsai/cuvs

Length of output: 76


🏁 Script executed:

# Let's check how the pool resource implementation differs from async
# The pool passes a rvalue (temporary), while async passes a pointer to thread_local
# Search for any notes on RMM memory resource ownership in comments
rg "ownership|lifetime|thread" c/src/core/c_api.cpp | head -20

Repository: rapidsai/cuvs

Length of output: 275


🏁 Script executed:

# Check if cuvsRMMPoolMemoryResourceEnable has similar issues by checking if it's thread_local or static
rg "pool_mr" c/src/core/c_api.cpp -B 5 -A 5

Repository: rapidsai/cuvs

Length of output: 665


🏁 Script executed:

# Let's check the RMM version used and look for any behavior documentation
rg "rmm" vcpkg.json conanfile.txt requirements.txt 2>/dev/null | head -10

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

# Check if there are any thread safety comments or warnings in the C API header
cat -n c/include/cuvs/core/c_api.h | sed -n '220,250p'

Repository: rapidsai/cuvs

Length of output: 1728


The async-memory resource owner cannot be thread_local when this API changes the current resource globally.

The implementation at c/src/core/c_api.cpp:188 stores cuda_async_memory_resource in thread_local async_mr and passes it to rmm::mr::set_current_device_resource(), but the documentation explicitly states this function "will change the memory resource for the whole process" (line 235). This creates a critical lifetime mismatch:

  • If set_current_device_resource() is device-scoped (affecting all threads), then when the enabling thread exits, its thread_local async_mr is destroyed while still registered as the current resource, leaving RMM with a dangling pointer.
  • If thread-local semantics were intended, the documentation must be updated to reflect that this only affects the calling thread.

The pool resource avoids this issue by passing temporary rvalues to set_current_device_resource(), allowing RMM to manage the lifetime. Either make async_mr process/device-scoped (not thread_local), or narrow the documentation and implementation to clarify thread-local semantics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@c/include/cuvs/core/c_api.h` around lines 229 - 240, The current
implementation of cuvsRMMAsyncMemoryResourceEnable stores
cuda_async_memory_resource in a thread_local async_mr and passes it to
rmm::mr::set_current_device_resource(), creating a lifetime mismatch vs the API
doc that says the change is global; change the implementation so the async
memory resource is process-scoped (not thread_local) by replacing thread_local
async_mr with a static/process-global instance (e.g., a static unique_ptr or
static object) so it outlives the thread and remains valid for
rmm::mr::set_current_device_resource(), and ensure
cuvsRMMAsyncMemoryResourceEnable and any cleanup use that same process-global
symbol (cuda_async_memory_resource / async_mr) when setting or resetting the
current device resource; alternatively, if thread-local semantics are intended,
update the cuvsRMMAsyncMemoryResourceEnable documentation to state it only
affects the calling thread and keep async_mr thread_local.

Comment thread c/src/neighbors/cagra.cpp
Comment on lines +710 to +733
// Only float32 is supported for multi-segment search.
RAFT_EXPECTS(
  indices[0]->dtype.code == kDLFloat && indices[0]->dtype.bits == 32,
  "Multi-segment search only supports float32 indices");

using T         = float;
using IdxT      = uint32_t;
using OutIdxT   = uint32_t;
using DistanceT = float;
using IndexT    = cuvs::neighbors::cagra::index<T, IdxT>;

std::vector<const IndexT*> idx_vec(num_segments);
std::vector<raft::device_matrix_view<const T, int64_t, raft::row_major>> q_vec(num_segments);
std::vector<raft::device_matrix_view<OutIdxT, int64_t, raft::row_major>> n_vec(num_segments);
std::vector<raft::device_matrix_view<DistanceT, int64_t, raft::row_major>> d_vec(num_segments);

for (uint32_t i = 0; i < num_segments; i++) {
  RAFT_EXPECTS(indices[i] != nullptr && indices[i]->addr != 0,
               "Index at position %u is null or not built", i);
  idx_vec[i] = reinterpret_cast<const IndexT*>(indices[i]->addr);
  q_vec[i]   = cuvs::core::from_dlpack<std::remove_reference_t<decltype(q_vec[i])>>(queries[i]);
  n_vec[i]   = cuvs::core::from_dlpack<std::remove_reference_t<decltype(n_vec[i])>>(neighbors[i]);
  d_vec[i]   = cuvs::core::from_dlpack<std::remove_reference_t<decltype(d_vec[i])>>(distances[i]);
}

⚠️ Potential issue | 🔴 Critical

Validate every segment before the reinterpret_cast.

Only indices[0] is checked for float32. The loop then blindly casts every indices[i]->addr to index<float, uint32_t>* and builds float32 device views from queries[i], neighbors[i], and distances[i]. If any later segment is half/int8/uint8 or any tensor is not device-backed, this path will feed the kernel mismatched types and can produce garbage results or GPU faults. Please mirror the per-input dtype/device/null checks from cuvsCagraSearch for each segment before from_dlpack.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@c/src/neighbors/cagra.cpp` around lines 710 - 733, The loop currently casts
every indices[i]->addr and builds device views without validating each segment's
types/devices; update the loop that fills idx_vec, q_vec, n_vec, d_vec to
perform the same per-segment checks used in cuvsCagraSearch: assert indices[i]
!= nullptr and addr != 0 (already present), then verify indices[i]->dtype.code
== kDLFloat && indices[i]->dtype.bits == 32 for every i, and validate that
queries[i], neighbors[i], and distances[i] are device-backed DLPack tensors with
the expected element types (float for queries/distances, uint32 for neighbors)
before calling reinterpret_cast<const IndexT*>(indices[i]->addr) and
cuvs::core::from_dlpack to populate q_vec[i], n_vec[i], d_vec[i]; replace the
blind casts with RAFT_EXPECTS that include i in the error messages so a bad
segment fails fast and clearly.

Comment thread c/src/neighbors/cagra.cpp
Comment on lines +726 to +736
for (uint32_t i = 0; i < num_segments; i++) {
  RAFT_EXPECTS(indices[i] != nullptr && indices[i]->addr != 0,
               "Index at position %u is null or not built", i);
  idx_vec[i] = reinterpret_cast<const IndexT*>(indices[i]->addr);
  q_vec[i]   = cuvs::core::from_dlpack<std::remove_reference_t<decltype(q_vec[i])>>(queries[i]);
  n_vec[i]   = cuvs::core::from_dlpack<std::remove_reference_t<decltype(n_vec[i])>>(neighbors[i]);
  d_vec[i]   = cuvs::core::from_dlpack<std::remove_reference_t<decltype(d_vec[i])>>(distances[i]);
}

cuvs::neighbors::cagra::search_multi_segment(
  *res_ptr, search_params, idx_vec, q_vec, n_vec, d_vec);

⚠️ Potential issue | 🟠 Major

Reject mixed-distance metrics across segments.

This API combines raw distances from all segments into one global ranking, so all indices must use the same metric. Right now nothing checks that indices[i] matches indices[0] on metric, which means mixing L2/IP/Cosine segments will return nonsensical top-k ordering even though the header says the distances are comparable across segments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@c/src/neighbors/cagra.cpp` around lines 726 - 736, The loop that builds
idx_vec/q_vec/n_vec/d_vec must also validate that all segment indices use the
same distance metric as the first segment to prevent mixing incompatible
metrics; inside the for-loop in cagra.cpp (after the existing RAFT_EXPECTS that
checks indices[i] non-null) compare indices[i]->metric (or the actual metric
field name on your IndexT struct/class) to indices[0]->metric and fail fast with
RAFT_EXPECTS (or equivalent) and a clear message like "Mixed distance metrics
across segments: expected %s but got %s at segment %u"; keep this check before
pushing idx_vec[i] and before calling
cuvs::neighbors::cagra::search_multi_segment so the function only runs when all
segments share the same metric.

Comment on lines +14 to +40
extern "C" cuvsError_t cuvsSelectK(cuvsResources_t res,
                                   DLManagedTensor* in_val,
                                   DLManagedTensor* out_val,
                                   DLManagedTensor* out_idx)
{
  return cuvs::core::translate_exceptions([=] {
    auto* res_ptr = reinterpret_cast<raft::resources*>(res);

    int64_t n = in_val->dl_tensor.shape[1];
    int64_t k = out_val->dl_tensor.shape[1];

    auto in_view = raft::make_device_matrix_view<const float, int64_t, raft::row_major>(
      static_cast<const float*>(in_val->dl_tensor.data), 1, n);

    auto out_val_view = raft::make_device_matrix_view<float, int64_t, raft::row_major>(
      static_cast<float*>(out_val->dl_tensor.data), 1, k);

    auto out_idx_view = raft::make_device_matrix_view<int64_t, int64_t, raft::row_major>(
      static_cast<int64_t*>(out_idx->dl_tensor.data), 1, k);

    cuvs::selection::select_k(
      *res_ptr,
      in_view,
      std::nullopt,  // implicit positions [0, n) as in_idx
      out_val_view,
      out_idx_view,
      true);  // select_min = true (smallest distance = nearest neighbor)

⚠️ Potential issue | 🔴 Critical

Validate the DLPack contract before dereferencing shape[1] and casting buffers.

cuvsSelectK currently assumes non-null 2D CUDA tensors with float32 / float32 / int64 dtypes, zero offset, and contiguous row-major layout. Without checking ndim, shape, device, dtype, byte_offset, and stride compatibility, malformed callers can crash here or feed corrupted views into select_k.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@c/src/selection/select_k.cpp` around lines 14 - 40, cuvsSelectK dereferences
shape[1] and casts buffers without validating the DLPack tensors; add explicit
validation at the top of cuvsSelectK for in_val, out_val, out_idx (non-null),
then check each DLManagedTensor->dl_tensor for expected ndim (==2), shapes (rows
match expected 1 or compatible), dtype (in_val/out_val float32, out_idx int64),
device type (CUDA) and device id, byte_offset == 0, and contiguous row-major
strides/compatibility before creating device views with
raft::make_device_matrix_view; if any check fails return an appropriate
cuvsError_t (or throw inside translate_exceptions) instead of proceeding to
casts and calling cuvs::selection::select_k so malformed callers cannot crash or
corrupt memory.

Comment on lines +1744 to +1806
void search_multi_segment(
  raft::resources const& res,
  cuvs::neighbors::cagra::search_params const& params,
  const std::vector<const cuvs::neighbors::cagra::index<float, uint32_t>*>& indices,
  const std::vector<raft::device_matrix_view<const float, int64_t, raft::row_major>>& queries,
  const std::vector<raft::device_matrix_view<uint32_t, int64_t, raft::row_major>>& neighbors,
  const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
  raft::resources const& res,
  cuvs::neighbors::cagra::search_params const& params,
  const std::vector<const cuvs::neighbors::cagra::index<float, uint32_t>*>& indices,
  const std::vector<raft::device_matrix_view<const float, int64_t, raft::row_major>>& queries,
  const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors,
  const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
  raft::resources const& res,
  cuvs::neighbors::cagra::search_params const& params,
  const std::vector<const cuvs::neighbors::cagra::index<half, uint32_t>*>& indices,
  const std::vector<raft::device_matrix_view<const half, int64_t, raft::row_major>>& queries,
  const std::vector<raft::device_matrix_view<uint32_t, int64_t, raft::row_major>>& neighbors,
  const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
  raft::resources const& res,
  cuvs::neighbors::cagra::search_params const& params,
  const std::vector<const cuvs::neighbors::cagra::index<half, uint32_t>*>& indices,
  const std::vector<raft::device_matrix_view<const half, int64_t, raft::row_major>>& queries,
  const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors,
  const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
  raft::resources const& res,
  cuvs::neighbors::cagra::search_params const& params,
  const std::vector<const cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices,
  const std::vector<raft::device_matrix_view<const int8_t, int64_t, raft::row_major>>& queries,
  const std::vector<raft::device_matrix_view<uint32_t, int64_t, raft::row_major>>& neighbors,
  const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
  raft::resources const& res,
  cuvs::neighbors::cagra::search_params const& params,
  const std::vector<const cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices,
  const std::vector<raft::device_matrix_view<const int8_t, int64_t, raft::row_major>>& queries,
  const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors,
  const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
  raft::resources const& res,
  cuvs::neighbors::cagra::search_params const& params,
  const std::vector<const cuvs::neighbors::cagra::index<uint8_t, uint32_t>*>& indices,
  const std::vector<raft::device_matrix_view<const uint8_t, int64_t, raft::row_major>>& queries,
  const std::vector<raft::device_matrix_view<uint32_t, int64_t, raft::row_major>>& neighbors,
  const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
  raft::resources const& res,
  cuvs::neighbors::cagra::search_params const& params,
  const std::vector<const cuvs::neighbors::cagra::index<uint8_t, uint32_t>*>& indices,
  const std::vector<raft::device_matrix_view<const uint8_t, int64_t, raft::row_major>>& queries,
  const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors,
  const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Declaration in cpp/src/neighbors/detail/cagra/cagra_search.cuh"
rg -n -A8 -B2 'template <typename T, typename OutputIdxT = uint32_t, typename IdxT = uint32_t' \
  cpp/src/neighbors/detail/cagra/cagra_search.cuh

echo
echo "Explicit instantiation site in cpp/src/neighbors/cagra_search_inst.cu.in"
rg -n -A8 -B2 'CUVS_INST_CAGRA_SEARCH_MULTI_SEGMENT|search_multi_segment<' \
  cpp/src/neighbors/cagra_search_inst.cu.in

Repository: rapidsai/cuvs

Length of output: 2317


🏁 Script executed:

# Check for static_assert on IdxT in the implementation
rg -n "static_assert.*IdxT" cpp/src/neighbors/detail/cagra/cagra_search.cuh

echo "---"

# Read the header to see all overloads and their documentation
cat -n cpp/include/cuvs/neighbors/cagra.hpp | sed -n '1710,1810p'

Repository: rapidsai/cuvs

Length of output: 6679


Fix template parameter order mismatch and add missing documentation.

The instantiation macro calls the template with parameter order <T, IdxT, OutputIdxT>, but the template declares them as <T, OutputIdxT, IdxT>. When the macro instantiates with (data_t, uint32_t, int64_t), it produces <data_t, uint32_t, int64_t>, incorrectly mapping IdxT to int64_t. This violates the static_assert at cagra_search.cuh:276 (IdxT == uint32_t) and will fail at link time for the int64_t output overloads.

Additionally, overloads 2–8 (lines 1752–1806) lack Doxygen documentation. Per the coding guidelines, all public functions must be documented. Add a @copydoc search_multi_segment directive before each additional overload, or provide explicit documentation blocks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/cuvs/neighbors/cagra.hpp` around lines 1744 - 1806, The header
declares the template parameter order as <T, OutputIdxT, IdxT> but the
instantiation macro (used with <T, IdxT, OutputIdxT> e.g. (data_t, uint32_t,
int64_t)) expects <T, IdxT, OutputIdxT>, causing the IdxT/OutputIdxT swap that
trips the static_assert in cagra_search.cuh:276 and breaks link-time overloads
for int64_t; fix by changing the template parameter order in the
search_multi_segment declarations to <T, IdxT, OutputIdxT> (or alternatively
update the instantiation macro to match the declared order) so types map
correctly, and add missing Doxygen for overloads 2–8 by inserting a `@copydoc`
search_multi_segment (or equivalent documentation block) above each of those
overloaded search_multi_segment declarations to satisfy the public API docs
requirement.

Comment on lines +115 to +119
public int nextSlot(int count) {
  int start = slotCounter;
  slotCounter += count;
  return start;
}

⚠️ Potential issue | 🔴 Critical

Make nextSlot atomic and modulo-bounded.

At Line 116–118, slotCounter updates are racy and the returned value is not wrapped despite the method contract; this can cause slot collisions and eventual overflow issues.

🔧 Proposed fix
 import java.lang.foreign.Arena;
 import java.lang.foreign.MemorySegment;
+import java.util.concurrent.atomic.AtomicInteger;
@@
-  private int slotCounter;
+  private final AtomicInteger slotCounter = new AtomicInteger(0);
@@
   public int nextSlot(int count) {
-    int start = slotCounter;
-    slotCounter += count;
-    return start;
+    if (count <= 0 || count > size) {
+      throw new IllegalArgumentException("count must be in [1, size]");
+    }
+    return Math.floorMod(slotCounter.getAndAdd(count), size);
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
public int nextSlot(int count) {
  int start = slotCounter;
  slotCounter += count;
  return start;
}
public int nextSlot(int count) {
  if (count <= 0 || count > size) {
    throw new IllegalArgumentException("count must be in [1, size]");
  }
  return Math.floorMod(slotCounter.getAndAdd(count), size);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CudaStreamPool.java`
around lines 115–119: make nextSlot atomic and wrap the returned slot index by
changing slotCounter to an AtomicInteger and using an atomic get-and-add plus
modulo; specifically, replace the non-atomic increment in nextSlot with
something like int start = slotCounter.getAndAdd(count); then return
Math.floorMod(start, poolSize) (or equivalent using your pool size field) so the
operation is thread-safe and the returned slot is bounded by the pool size.
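The fix the comment proposes can be sketched in isolation. This is a minimal illustration of the `AtomicInteger.getAndAdd` plus `Math.floorMod` pattern, not the actual `CudaStreamPool` class; the `size` field and `SlotAllocator` name are stand-ins.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative thread-safe, wrap-around slot allocator.
class SlotAllocator {
    private final AtomicInteger slotCounter = new AtomicInteger(0);
    private final int size;

    SlotAllocator(int size) {
        this.size = size;
    }

    // Atomically reserves `count` consecutive logical slots and returns the
    // starting index wrapped into [0, size). floorMod keeps the result
    // non-negative even after the int counter eventually overflows.
    int nextSlot(int count) {
        if (count <= 0 || count > size) {
            throw new IllegalArgumentException("count must be in [1, size]");
        }
        return Math.floorMod(slotCounter.getAndAdd(count), size);
    }
}
```

`getAndAdd` makes the read-then-increment a single atomic operation, so two threads can never observe the same starting slot, and `floorMod` (unlike `%`) stays non-negative once the counter wraps past `Integer.MAX_VALUE`.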

Comment on lines +125 to +131
@Override
public void close() {
  for (int i = 0; i < size; i++) {
    checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy");
    cuvsResourcesDestroy(resources[i]);
    checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy");
  }

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify all cuvsResourcesDestroy call sites are validated.
rg -n -C2 '\bcuvsResourcesDestroy\s*\(' --type=java

Repository: rapidsai/cuvs



Wrap cuvsResourcesDestroy with error validation during cleanup.

Line 129 calls cuvsResourcesDestroy without checkCuVSError, allowing cleanup failures to be silently ignored. Other call sites in the codebase (e.g., CuVSResourcesImpl.java) properly validate the return code. Apply the same pattern here for consistency:

Proposed fix
-      cuvsResourcesDestroy(resources[i]);
+      checkCuVSError(cuvsResourcesDestroy(resources[i]), "cuvsResourcesDestroy");
📝 Committable suggestion


Suggested change
   @Override
   public void close() {
     for (int i = 0; i < size; i++) {
       checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy");
-      cuvsResourcesDestroy(resources[i]);
+      checkCuVSError(cuvsResourcesDestroy(resources[i]), "cuvsResourcesDestroy");
       checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy");
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CudaStreamPool.java`
around lines 125–131: in CudaStreamPool.close(), wrap the cuvsResourcesDestroy
call with the same error validation used elsewhere by replacing the raw
cuvsResourcesDestroy(resources[i]) invocation with a call to
checkCuVSError(cuvsResourcesDestroy(resources[i]), "cuvsResourcesDestroy") so
cleanup failures are logged/handled; keep the surrounding loop and existing
calls to checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy") and
checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy") unchanged and
reference the close(), cuvsResourcesDestroy, checkCuVSError, events, resources,
streams, and size symbols to locate the change.
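Beyond validating each destroy call, a cleanup loop like this usually wants one failed step to be reported without skipping the remaining teardown. A hedged sketch of that pattern, using suppressed exceptions; `PoolCleanup` and the `Destroy` interface are illustrative stand-ins, not the real `CudaStreamPool` bindings.

```java
// Illustrative close() that runs every cleanup step, collects failures,
// and rethrows the first one with the rest attached as suppressed.
class PoolCleanup implements AutoCloseable {
    interface Destroy { void run(); }

    private final Destroy[] steps;

    PoolCleanup(Destroy... steps) { this.steps = steps; }

    @Override
    public void close() {
        RuntimeException first = null;
        for (Destroy step : steps) {
            try {
                // e.g. checkCuVSError(cuvsResourcesDestroy(...), "cuvsResourcesDestroy")
                step.run();
            } catch (RuntimeException e) {
                if (first == null) first = e;
                else first.addSuppressed(e);
            }
        }
        if (first != null) throw first;
    }
}
```

With this shape, a failing `cuvsResourcesDestroy` for segment 0 no longer prevents the event and stream of segment 1 from being destroyed, yet the failure still surfaces to the caller.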

Comment on lines +31 to +33
private final CudaStreamPool streamPool =
    new CudaStreamPool(
        Integer.getInteger(CudaStreamPool.SIZE_PROPERTY, CudaStreamPool.DEFAULT_SIZE));

⚠️ Potential issue | 🟠 Major

Construct the stream pool inside the constructor.

Allocating CudaStreamPool in a field initializer leaks native resources if CuVSResourcesImpl(Path) throws later during cuvsResourcesCreate/cuvsDeviceIdGet, because the object never finishes construction and close() is never reachable. Build the pool inside the constructor and tear it down in the constructor's failure path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSResourcesImpl.java`
around lines 31–33: the CudaStreamPool is currently allocated in the field
initializer (streamPool) which can leak native resources if
CuVSResourcesImpl(Path) throws before construction completes; move creation of
the CudaStreamPool from the field initializer into the CuVSResourcesImpl(Path)
constructor (use Integer.getInteger(CudaStreamPool.SIZE_PROPERTY,
CudaStreamPool.DEFAULT_SIZE) to determine size), assign it to the streamPool
field there, and in the constructor failure path ensure you call
streamPool.close() (or otherwise tear it down) before rethrowing so native
resources are not leaked; keep the streamPool field declaration but initialize
it only in the constructor and ensure close() is reachable on exceptions.
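The constructor-failure pattern being requested can be sketched as follows. `NativePool` and the failing init step are hypothetical stand-ins for `CudaStreamPool` and `cuvsResourcesCreate`/`cuvsDeviceIdGet`; the point is only the shape of the cleanup path.

```java
// Illustrative constructor that builds the pool itself and tears it down
// if a later init step throws, so no native state leaks when construction
// never completes.
class ResourcesSketch implements AutoCloseable {
    static class NativePool implements AutoCloseable {
        static int closedCount = 0;
        @Override public void close() { closedCount++; }
    }

    private final NativePool streamPool;

    ResourcesSketch(boolean initFails) {
        NativePool pool = new NativePool(); // previously a field initializer
        try {
            if (initFails) { // stands in for cuvsResourcesCreate failing
                throw new RuntimeException("cuvsResourcesCreate failed");
            }
            this.streamPool = pool;
        } catch (RuntimeException e) {
            // close() is unreachable on a half-built object, so release
            // the pool here before rethrowing.
            pool.close();
            throw e;
        }
    }

    @Override public void close() { streamPool.close(); }
}
```

A field initializer runs before the constructor body, so an exception thrown mid-constructor leaves the already-allocated pool with no owner; moving allocation into the constructor makes the failure path explicit.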

Comment on lines +96 to +111
CagraSearchParams searchParameters = queries.get(0).getCagraSearchParameters();

try (var resourcesAccessor = resources.access()) {
  long cuvsRes = resourcesAccessor.handle();
  var cuvsStream = getStream(cuvsRes);

  try (var globalNeighborsDP = allocateRMMSegment(cuvsRes, neighborsBytes);
      var globalDistancesDP = allocateRMMSegment(cuvsRes, distancesBytes);
      var outIdxDP = allocateRMMSegment(cuvsRes, outIdxBytes);
      var outValDP = allocateRMMSegment(cuvsRes, outValBytes)) {

    // --- Phase 1: call cuvsCagraSearchMultiSegment ---
    // Single kernel launch covers all segments; results land in globalNeighborsDP /
    // globalDistancesDP on the same CUDA stream, so SelectK below sees them via ordering.
    try (var arena = Arena.ofConfined()) {
      MemorySegment sp = CuVSParamsHelper.buildCagraSearchParams(arena, searchParameters);

⚠️ Potential issue | 🟠 Major

Per-segment query settings are being dropped.

This path always builds search params from queries.get(0), and the native multi-segment API you're calling has no filter argument, so any search params or prefilter carried by queries[1..] are silently ignored. Either validate that every CagraQuery is equivalent and unfiltered, or plumb per-segment filtering/params through the native API before exposing List<CagraQuery> here.

Also applies to: 147-156

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/MultiSegmentCagraSearch.java`
around lines 96–111: the code in MultiSegmentCagraSearch currently uses
CagraSearchParams built from queries.get(0) (via
CuVSParamsHelper.buildCagraSearchParams) and thus drops per-segment settings
from subsequent CagraQuery entries; fix by either validating that all CagraQuery
instances in queries have identical search params and no per-segment filters
(throw IllegalArgumentException from MultiSegmentCagraSearch if any CagraQuery
differs from queries.get(0)), or plumb per-segment parameters through the native
call: extend the native wrapper (the cuvsCagraSearchMultiSegment binding) and
CuVSParamsHelper to accept/allocate an array of CagraSearchParams (build one
MemorySegment per CagraQuery) and pass that array/handle when invoking the
multi-segment search so each segment’s filters/params are applied.
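The validation option (the first of the two fixes suggested) can be sketched in isolation. `QuerySketch` and its `searchParams()`/`prefilter()` accessors are illustrative stand-ins for the real `CagraQuery` API, which the review does not spell out.

```java
import java.util.List;
import java.util.Objects;

// Illustrative query carrier: search params plus an optional prefilter.
record QuerySketch(String searchParams, Object prefilter) {}

class MultiSegmentValidation {
    // Fail fast unless every query matches queries.get(0) and none carries
    // a prefilter, since the native multi-segment call applies only the
    // first query's params and has no filter argument.
    static void requireUniformAndUnfiltered(List<QuerySketch> queries) {
        QuerySketch first = queries.get(0);
        for (int i = 0; i < queries.size(); i++) {
            QuerySketch q = queries.get(i);
            if (!Objects.equals(q.searchParams(), first.searchParams())) {
                throw new IllegalArgumentException(
                    "query " + i + " differs from query 0; the multi-segment path"
                        + " applies queries[0]'s search params to every segment");
            }
            if (q.prefilter() != null) {
                throw new IllegalArgumentException(
                    "query " + i + " carries a prefilter, which the native"
                        + " multi-segment API cannot apply");
            }
        }
    }
}
```

Rejecting divergent queries up front turns a silent correctness gap into an immediate, diagnosable error; the alternative of plumbing per-segment params through the native binding is the more invasive fix.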

Comment on lines +119 to +145
long[] segShape = {1, k};
for (int i = 0; i < numSegments; i++) {
  // Index handle
  indexArray.setAtIndex(ValueLayout.ADDRESS, i, buffered[i].getIndexHandle());

  // Query DLTensor
  var queryVectors = (CuVSMatrixInternal) queries.get(i).getQueryVectors();
  queriesArray.setAtIndex(ValueLayout.ADDRESS, i, queryVectors.toTensor(arena));

  // Neighbors DLTensor — slice of global buffer
  long nByteOffset = (long) i * k * Integer.BYTES;
  MemorySegment nSlice =
      MemorySegment.ofAddress(globalNeighborsDP.handle().address() + nByteOffset);
  neighborsArray.setAtIndex(
      ValueLayout.ADDRESS,
      i,
      prepareTensor(arena, nSlice, segShape, kDLUInt(), 32, kDLCUDA()));

  // Distances DLTensor — slice of global buffer
  long dByteOffset = (long) i * k * Float.BYTES;
  MemorySegment dSlice =
      MemorySegment.ofAddress(globalDistancesDP.handle().address() + dByteOffset);
  distancesArray.setAtIndex(
      ValueLayout.ADDRESS,
      i,
      prepareTensor(arena, dSlice, segShape, kDLFloat(), 32, kDLCUDA()));
}

⚠️ Potential issue | 🔴 Critical

Enforce the single-query contract before sizing output buffers.

Each segment gets a [1, k] neighbors/distances tensor, but the input query tensor comes straight from queries.get(i).getQueryVectors(). If any caller passes nq > 1, the native search will try to write nq * k results into storage sized for only k, which is an out-of-bounds device write. Please reject multi-row query tensors here or size the output slices from the actual query row count.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/MultiSegmentCagraSearch.java`
around lines 119–145: the code must enforce the single-query contract or size
outputs from the actual query row count: in MultiSegmentCagraSearch, after
obtaining var queryVectors = (CuVSMatrixInternal)
queries.get(i).getQueryVectors(), read its row count (e.g.,
queryVectors.getRowCount()/numRows()/size(0) — use the actual accessor on
CuVSMatrixInternal) into int nq; if nq > 1 throw an IllegalArgumentException
rejecting multi-row queries, or alternatively set segShape = new long[] {nq, k}
and compute neighbor/distance byte offsets and tensor sizes using nq*k (update
nByteOffset/dByteOffset and prepareTensor calls for
neighborsArray/distancesArray accordingly) so prepareTensor and the
globalNeighborsDP/globalDistancesDP slices match the query row count.
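The row-aware offset math the comment asks for is small enough to state directly. This is a hedged sketch of the arithmetic only; `SliceOffsets` and its method names are illustrative, and the real code applies these offsets to RMM device buffers.

```java
// Illustrative slice arithmetic for the global neighbors/distances buffers,
// taking the query row count nq into account instead of hard-coding [1, k].
class SliceOffsets {
    // Byte offset of segment `i`'s slice in the global uint32 neighbors
    // buffer, with each segment holding nq rows of k results.
    static long neighborsByteOffset(int i, int nq, int k) {
        if (nq < 1) throw new IllegalArgumentException("nq must be >= 1");
        return (long) i * nq * k * Integer.BYTES;
    }

    // Alternative fix: enforce the single-row contract up front so a
    // multi-row query tensor can never trigger an out-of-bounds write.
    static void requireSingleRow(int nq) {
        if (nq != 1) {
            throw new IllegalArgumentException(
                "multi-segment search expects a single-row query tensor, got nq=" + nq);
        }
    }
}
```

Either guard closes the hole: with `nq` folded into the offsets the buffers simply grow to fit, while `requireSingleRow` keeps the current [1, k] layout but rejects the inputs it cannot hold.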
