
Add ACE CAGRA Graph Reordering Option#2032

Draft
julianmi wants to merge 8 commits into rapidsai:main from julianmi:ace-reorder-graph

Conversation

@julianmi
Contributor

The disk-backed ACE algorithm partitions the dataset, so the CAGRA graph it builds uses a reordered index space. Building an HNSW index with from_cagra therefore consumes the reordered dataset along with the CAGRA graph. Downstream consumers building an HNSW index would need that reordered dataset, which is typically large whenever the disk-backed ACE algorithm is needed in the first place. Remapping the graph back to the original index space can thus minimize network transfers for downstream consumers that already have the original dataset locally available.

This PR proposes an option to remap the CAGRA graph to the original index space using remap_disk_graph_to_original_ids. When the original dataset is passed to from_cagra for a disk-backed index, the graph is first remapped to the original index space, and the HNSW index is then assembled from the remapped graph and the original dataset using serialize_to_hnswlib_with_original_dataset. See examples/cpp/src/cagra_hnsw_ace_example.cu with REMAP_GRAPH_TO_ORIGINAL_IDS for an example.

hnsw::build() is unchanged. To use this option, run cagra::build with ACE disk parameters, then call from_cagra, passing the original dataset.

The cuvs_cagra_hnswlib bench wrapper and its JSON config gain a use_original_id_graph flag to benchmark these changes. The remapping is mostly I/O bound and scales roughly with disk throughput and graph size. Its overhead is typically smaller than the cost of sending the reordered dataset over a slow network interface.

- Passing the original dataset to `from_cagra` remaps the CAGRA graph on disk to the original ID space.
- Added a new benchmarking parameter `use_original_id_graph` to control the remapping behavior.
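For benchmarking, the new flag would sit alongside the other build parameters of a cuvs_cagra_hnswlib entry. The fragment below is illustrative only: the entry name and the other parameter keys are assumptions about the bench config schema; only `use_original_id_graph` comes from this PR.

```json
{
  "name": "cuvs_cagra_hnswlib.ace_remap",
  "algo": "cuvs_cagra_hnswlib",
  "build_param": {
    "graph_degree": 64,
    "use_original_id_graph": true
  }
}
```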
@copy-pr-bot

copy-pr-bot Bot commented Apr 17, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Member

@cjnolet cjnolet left a comment


We are planning to start using our new "Dataset" abstraction to represent datasets, and will be deprecating the mdspan-based datasets. Let's please wait until that lands so we can adjust accordingly.

For reordering, we will want to make a new dataset and update that accordingly.

#include <cuvs/neighbors/hnsw.hpp>
#include <cuvs/util/file_io.hpp>

#include <raft/core/detail/mdspan_numpy_serializer.hpp>
Member


Definitely should not be using detail APIs from RAFT.

If this is really needed in cuVS, it should be properly exposed through a public API.

Contributor Author


Thanks, this makes sense to me. We use parse_descr, read_header, get_numpy_dtype, and write_header from raft::detail::numpy_serializer throughout the codebase. These would be good candidates to expose.

Member


Yeah, so my point is that we can't be using internal APIs. They will need to be exposed before the PR can be merged. But it's also important that public APIs are generally reusable, and that we aren't exposing them merely to avoid using internal APIs.

@aamijar aamijar added non-breaking Introduces a non-breaking change improvement Improves an existing functionality labels Apr 21, 2026
@aamijar aamijar moved this to In Progress in Unstructured Data Processing Apr 21, 2026
