Add ACE CAGRA Graph Reordering Option #2032
- Passing the original dataset to `from_cagra` reorders the CAGRA graph on disk to original ID space.
- Added a new benchmarking parameter `use_original_id_graph` to control the remapping behavior.
cjnolet
left a comment
We are planning to start using our new "Dataset" abstraction to represent datasets, and will be deprecating the mdspan-based datasets. Let's please wait until that lands so we can adjust accordingly.
For reordering, we will want to make a new dataset and update that accordingly.
    #include <cuvs/neighbors/hnsw.hpp>
    #include <cuvs/util/file_io.hpp>

    #include <raft/core/detail/mdspan_numpy_serializer.hpp>
Definitely should not be using detail APIs from RAFT.
If this is really needed in cuVS, it should be properly exposed through a public API.
Thanks, this makes sense to me. We use `parse_descr`, `read_header`, `get_numpy_dtype`, and `write_header` from `raft::detail::numpy_serializer` throughout the codebase. These would be good candidates to expose.
Yeah, so my point is that we can't be using internal APIs. They will need to be exposed before the PR can be merged. But it's also important that public APIs are generally reusable and that we aren't just exposing them to avoid using internal APIs.
The disk-backed ACE algorithm partitions the dataset while building the CAGRA graph, so the resulting graph uses the reordered index space. Building an HNSW index using `from_cagra` uses the reordered dataset and CAGRA graph. Downstream consumers building an HNSW index would therefore require the reordered dataset, which is typically large when the disk-backed ACE algorithm is needed. Remapping the graph to the original index space can thus minimize network transfers for downstream consumers that have the original dataset available locally.

This PR proposes an option to remap the CAGRA graph to the original index space using `remap_disk_graph_to_original_ids`. Passing the original dataset to `from_cagra` for a disk-backed index first reorders the graph to the original index space and then assembles the HNSW index from the remapped graph and the original dataset using `serialize_to_hnswlib_with_original_dataset`. See `examples/cpp/src/cagra_hnsw_ace_example.cu` using `REMAP_GRAPH_TO_ORIGINAL_IDS` for an example.

`hnsw::build()` is unchanged. To use this option, call `cagra::build` with ACE disk parameters, followed by `from_cagra` with the original dataset.

The `cuvs_cagra_hnswlib` bench wrapper and JSON config gain a `use_original_id_graph` flag to benchmark these changes. The reordering is mostly I/O bound and scales roughly with the disk specifications and graph size. The remapping overhead is typically smaller than the cost of sending the reordered dataset over a slow network interface.
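To illustrate what remapping a graph to the original index space involves, here is a minimal in-memory sketch. It is not the PR's `remap_disk_graph_to_original_ids` implementation, which works on disk-backed data. The function name `remap_graph_to_original_ids_sketch`, the flat row-major graph layout, and the `reordered_to_original` mapping vector are all assumptions made for illustration. Both the row position and every neighbor ID have to be translated:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper for illustration only; not part of the cuVS API.
//
// graph:                 row-major [n_rows x degree] neighbor lists, where both
//                        the row index and the stored neighbor IDs are in the
//                        reordered (partitioned) ID space.
// reordered_to_original: reordered_to_original[r] is the original ID of the
//                        vector stored at reordered position r (assumed to be
//                        a permutation of 0..n_rows-1).
std::vector<std::uint32_t> remap_graph_to_original_ids_sketch(
  const std::vector<std::uint32_t>& graph,
  std::size_t degree,
  const std::vector<std::uint32_t>& reordered_to_original)
{
  const std::size_t n_rows = reordered_to_original.size();
  if (degree == 0 || graph.size() != n_rows * degree) {
    throw std::invalid_argument("graph size must equal n_rows * degree");
  }

  std::vector<std::uint32_t> remapped(graph.size());
  for (std::size_t r = 0; r < n_rows; ++r) {
    // Row r of the reordered graph belongs at row orig_row in original space.
    const std::size_t orig_row = reordered_to_original[r];
    assert(orig_row < n_rows);
    for (std::size_t k = 0; k < degree; ++k) {
      const std::uint32_t neighbor = graph[r * degree + k];
      assert(neighbor < n_rows);
      // Translate the neighbor ID itself into original space as well.
      remapped[orig_row * degree + k] = reordered_to_original[neighbor];
    }
  }
  return remapped;
}
```

In this sketch the writes land at scattered row offsets, which is consistent with the description's note that the real, disk-backed remapping is mostly I/O bound. An actual implementation would presumably process the graph in chunks rather than hold it all in memory.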