Add HNSW Layered Index Support #2148

Open
julianmi wants to merge 11 commits into rapidsai:main from julianmi:hnsw-layered-index

@julianmi julianmi commented Jun 1, 2026

Contributor

The disk-backed ACE algorithm partitions the dataset while building the CAGRA graph, so the resulting graph uses the reordered index space. Building an HNSW index with hnsw::from_cagra therefore consumes both the reordered dataset and the CAGRA graph. Downstream consumers that build an HNSW index would consequently need the reordered dataset, which is typically large whenever the disk-backed ACE algorithm is required. Building only the layers of the HNSW index, without the dataset, and shipping just those layers to the search node minimizes network transfers for consumers that already have the original dataset available locally. The hnsw::deserialize step then combines the layered index with the local dataset to form an hnswlib-compatible search index.

Artifact Layout

hnsw_index.cuvs
  fixed_header
    magic = CUVS_HNSW_LAYERED
    version
    metadata_offset
    metadata_size

  metadata_binary
    dataset shape, dtype, metric
    hnsw parameters
    section sizes
    upper-layer descriptors

  levels
    uint8[n_rows]
    indexed by original dataset row ID

  base_nodes
    uint32[n_rows]
    maps each base topology row to original row ID

  base_links
    n_rows fixed-size hnswlib-ready rows
    [count:uint32][neighbors:uint32[maxM0]]
    neighbors are original row IDs

  upper_nodes
    concatenated uint32 original row IDs for layers 1..maxlevel

  upper_links
    fixed-size hnswlib-ready rows
    [count:uint32][neighbors:uint32[maxM]]
    neighbors are original row IDs
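
The row and section sizes implied by this layout can be sketched as follows. This is a minimal illustration, assuming the sections are packed back to back with no alignment padding (the description does not say); `link_row_bytes`, `section_offsets_sketch`, and `compute_offsets` are hypothetical names, not identifiers from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One hnswlib-ready link row: [count:uint32][neighbors:uint32[max_degree]].
constexpr std::size_t link_row_bytes(std::size_t max_degree)
{
  return sizeof(std::uint32_t) * (1 + max_degree);
}

// Byte offsets of the first four data sections (hypothetical helper type).
struct section_offsets_sketch {
  std::size_t levels;
  std::size_t base_nodes;
  std::size_t base_links;
  std::size_t upper_nodes;
};

// Assumption: sections follow the metadata back to back, in the listed order.
constexpr section_offsets_sketch compute_offsets(std::size_t metadata_end,
                                                 std::size_t n_rows,
                                                 std::size_t max_m0)
{
  section_offsets_sketch o{};
  o.levels      = metadata_end;                                   // uint8[n_rows]
  o.base_nodes  = o.levels + n_rows * sizeof(std::uint8_t);       // uint32[n_rows]
  o.base_links  = o.base_nodes + n_rows * sizeof(std::uint32_t);  // n_rows padded rows
  o.upper_nodes = o.base_links + n_rows * link_row_bytes(max_m0);
  return o;
}
```

Because every link row has the same size, the byte position of row i in base_links is simply the section offset plus i times the row size, which is what allows batches to be written at computed offsets.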

Layered HNSW Serialization

The layered serializer creates hnsw_index.cuvs from the disk-backed ACE graph.

  1. Create the .cuvs file and write the fixed header and metadata.
  2. Generate HNSW levels in original ID space.
  3. Write levels sequentially.
  4. Read dataset_mapping.npy sequentially into reordered_to_original.
  5. Read cagra_graph.npy source-sequentially in ACE reordered row order.
  6. For each ACE graph row:
    • write base_nodes[row] = reordered_to_original[ace_reordered_row]
    • convert each neighbor from ACE reordered ID to original ID
    • write a padded hnswlib-ready row to base_links[row]
  7. Gather promoted vectors from the original dataset.
  8. Build upper-layer graphs using temporary HNSW promoted order.
  9. Write upper_nodes and upper_links with node IDs and neighbor IDs converted back to original IDs.

This keeps remapping, link padding, and upper-layer KNN work on the build node.
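
Step 6 above can be sketched for a single graph row as follows. This is an illustrative, in-memory version only: `make_base_link_row` is a hypothetical helper, `reordered_to_original` stands in for the contents of dataset_mapping.npy, and zero is assumed as the padding value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build one padded row: [count][neighbors as original IDs][zero padding].
inline std::vector<std::uint32_t> make_base_link_row(
  const std::vector<std::uint32_t>& ace_neighbors,          // ACE reordered IDs
  const std::vector<std::uint32_t>& reordered_to_original,  // mapping table
  std::size_t max_m0)
{
  assert(ace_neighbors.size() <= max_m0);
  std::vector<std::uint32_t> row(1 + max_m0, 0u);
  row[0] = static_cast<std::uint32_t>(ace_neighbors.size());
  for (std::size_t j = 0; j < ace_neighbors.size(); ++j) {
    // Convert each neighbor from ACE reordered ID to original ID.
    row[1 + j] = reordered_to_original.at(ace_neighbors[j]);
  }
  return row;
}
```

The base_nodes entry for the same row would be `reordered_to_original[ace_reordered_row]`, so the loader can tell which original ID each sequentially written row belongs to.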

Deserialization

The search node reads:

  • hnsw_index.cuvs
  • the external original-order dataset from index_params.dataset_path

The loader:

  1. Reads the fixed header and metadata.
  2. Validates artifact shape, section sizes, and dataset shape.
  3. Reads levels sequentially.
  4. Allocates hnswlib storage.
  5. Reads the external dataset sequentially in original row order.
  6. Initializes hnswlib with:
    • internal ID = original ID
    • label = original ID
    • level = levels[original_id]
  7. Reads base_nodes and base_links sequentially.
  8. Copies each base link row into get_linklist0(base_node_id).
  9. Reads upper_nodes and upper_links sequentially by layer.
  10. Copies each upper link row into get_linklist(node_id, level).

The search node does no graph remapping, no level generation, no link padding, and no KNN work.
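
Steps 7 and 8 of the loader amount to a sequential read followed by a scatter by original ID. The sketch below shows that pattern with a plain byte vector standing in for hnswlib's level-0 link memory; `scatter_base_links` and `storage` are hypothetical, and the PR itself copies into `get_linklist0(base_node_id)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copy each sequentially read link row to the slot of its original ID.
inline void scatter_base_links(const std::vector<std::uint32_t>& base_nodes,
                               const std::vector<unsigned char>& base_links,
                               std::size_t row_bytes,
                               std::vector<unsigned char>& storage)
{
  assert(base_links.size() == base_nodes.size() * row_bytes);
  for (std::size_t i = 0; i < base_nodes.size(); ++i) {
    const std::size_t node_id = base_nodes[i];  // original row ID of file row i
    assert((node_id + 1) * row_bytes <= storage.size());
    std::memcpy(storage.data() + node_id * row_bytes,
                base_links.data() + i * row_bytes,
                row_bytes);
  }
}
```

The rows are already padded and already carry original IDs, so the copy is a plain memcpy with no per-neighbor work.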

Disk Access Patterns

Build node:

  • Sequential scan of the original dataset for ACE partitioning.
  • Buffered partition writes for reordered and augmented datasets.
  • Contiguous per-partition reads from reordered_dataset.npy and augmented_dataset.npy.
  • Source-sequential reads from cagra_graph.npy when creating the final layered artifact.
  • Sequential writes to hnsw_index.cuvs.

Search node:

  • Sequential reads from hnsw_index.cuvs.
  • Sequential reads from the external original-order dataset.
  • Scatter writes only into in-memory hnswlib link storage by original ID.

Runtime Requirements

Only hnsw_index.cuvs is copied to the search node. ACE temporary files remain build-node-only.

The search node must have the original dataset in original row order and must provide that path through index_params.dataset_path.

Misc

Unifies the logging format of the ACE algorithm.

copy-pr-bot Bot commented Jun 1, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@tfeher tfeher requested a review from mfoerste4 June 1, 2026 09:49

@mfoerste4 mfoerste4 left a comment

Contributor

Thanks @julianmi for the PR. The structure looks good to me. Only a few comments that are not necessarily actionable.

Comment on lines +31 to +33
} else if (conf.at("hierarchy") == "gpu_layered_on_disk" ||
           conf.at("hierarchy") == "gpu_layered" || conf.at("hierarchy") == "layered") {
  hnsw_params.hierarchy = cuvs::neighbors::hnsw::HnswHierarchy::GPU_LAYERED_ON_DISK;

Contributor

Are these all just synonyms for GPU_LAYERED_ON_DISK?

Contributor Author

Yes, I've removed them to better align with the other formats.
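
For illustration, a strict mapping with one accepted spelling per hierarchy could look like the sketch below. The enum and function are hypothetical stand-ins rather than the cuVS API, and the "none", "cpu", and "gpu" strings are assumed from the existing benchmark configuration; only "gpu_layered_on_disk" appears in the snippet under review.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Local stand-in for cuvs::neighbors::hnsw::HnswHierarchy.
enum class hierarchy_sketch { none, cpu, gpu, gpu_layered_on_disk };

// Exactly one spelling per mode; anything else is rejected.
inline hierarchy_sketch parse_hierarchy(const std::string& value)
{
  if (value == "none") { return hierarchy_sketch::none; }
  if (value == "cpu") { return hierarchy_sketch::cpu; }
  if (value == "gpu") { return hierarchy_sketch::gpu; }
  if (value == "gpu_layered_on_disk") { return hierarchy_sketch::gpu_layered_on_disk; }
  throw std::invalid_argument("unknown hierarchy: " + value);
}
```

Rejecting the former synonyms outright surfaces configuration typos early instead of silently selecting a mode.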

"Layered HNSW artifact '%s' does not exist.",
src_artifact.c_str());

copy_file_overwrite(src_artifact, std::filesystem::path(file));

Contributor

Why do we copy here instead of moving (like in the cagra_ace block below)?

Contributor Author

Good catch, thanks. I've changed this to a move and now reuse the same helper in the cagra_ace block.
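
A move-with-overwrite helper of this kind can be sketched with std::filesystem as below. This is not the PR's implementation: `move_file_overwrite_sketch` is a hypothetical name, and the copy-then-remove fallback for renames that fail (for example across filesystems) is an assumption about what such a helper would want.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Move src to dst, replacing dst if it already exists.
inline void move_file_overwrite_sketch(const fs::path& src, const fs::path& dst)
{
  std::error_code ec;
  fs::rename(src, dst, ec);  // cheap when both paths share a filesystem
  if (!ec) { return; }
  // Fallback: copy over the destination, then delete the source.
  fs::copy_file(src, dst, fs::copy_options::overwrite_existing);
  fs::remove(src);
}

// Small helpers used only to exercise the sketch.
inline void write_text(const fs::path& p, const std::string& s)
{
  std::ofstream out(p);
  out << s;
}

inline std::string read_text(const fs::path& p)
{
  std::ifstream in(p);
  std::string s;
  std::getline(in, s);
  return s;
}
```

Moving avoids duplicating a potentially large artifact on the build node, which is the concern behind the question above.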

Comment thread cpp/src/neighbors/detail/hnsw.hpp Outdated
Comment on lines +648 to +663
inline auto json_parse_double(const std::string& json, const std::string& key) -> double
{
  auto pos = json_find_key(json, key);
  auto colon = json.find(':', pos);
  RAFT_EXPECTS(colon != std::string::npos, "Malformed JSON near key '%s'", key.c_str());
  auto begin = json.find_first_of("0123456789-.", colon + 1);
  auto end = json.find_first_not_of("0123456789-.eE+", begin);
  RAFT_EXPECTS(begin != std::string::npos, "Malformed double JSON value for key '%s'", key.c_str());
  return std::stod(json.substr(begin, end - begin));
}

inline auto json_parse_layer_field(const std::string& object, const std::string& key) -> size_t
{
  return json_parse_size(object, key);
}

Contributor

If we use JSON here - why don't we use a common json utility for parsing? This whole section feels very verbose for a standard. Is there a cleaner alternative?

Contributor Author

Using a common JSON utility would pull the JSON dependency (nlohmann/json) into libcuvs; it is currently only used in the benchmarks. Since the metadata is simple, I've switched to a binary header format instead. Let me know what you think.
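
As a rough illustration of the binary alternative, a fixed header can be written and validated with plain memcpy. The struct below is a sketch: the field names follow the artifact layout in the PR description, while the field widths, the `reserved` member, and the helper names are assumptions. The bytes are in host byte order, so this version is not portable across endianness.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <type_traits>
#include <vector>

constexpr char kMagic[] = "CUVS_HNSW_LAYERED";  // 17 characters plus NUL

// Hypothetical fixed header; widths are illustrative only.
struct fixed_header_sketch {
  char magic[24];
  std::uint32_t version;
  std::uint32_t reserved;
  std::uint64_t metadata_offset;
  std::uint64_t metadata_size;
};
static_assert(std::is_trivially_copyable<fixed_header_sketch>::value,
              "header must be safe to memcpy");

inline std::vector<unsigned char> encode_header(std::uint32_t version,
                                                std::uint64_t metadata_offset,
                                                std::uint64_t metadata_size)
{
  fixed_header_sketch h{};  // zero-initialized, including padding fields
  std::memcpy(h.magic, kMagic, sizeof(kMagic));
  h.version         = version;
  h.metadata_offset = metadata_offset;
  h.metadata_size   = metadata_size;
  std::vector<unsigned char> bytes(sizeof(h));
  std::memcpy(bytes.data(), &h, sizeof(h));
  return bytes;
}

inline fixed_header_sketch decode_header(const std::vector<unsigned char>& bytes)
{
  if (bytes.size() < sizeof(fixed_header_sketch)) {
    throw std::runtime_error("truncated header");
  }
  fixed_header_sketch h{};
  std::memcpy(&h, bytes.data(), sizeof(h));
  if (std::memcmp(h.magic, kMagic, sizeof(kMagic)) != 0) {
    throw std::runtime_error("bad magic");
  }
  return h;
}
```

Compared with hand-rolled JSON parsing, the loader needs only a size check and a magic comparison before it can trust the offsets.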

Comment on lines +907 to +916
const auto current_node_bytes = current_batch_size * sizeof(IdxT);
const auto current_link_bytes = current_batch_size * base_link_row_bytes;
cuvs::util::write_large_file(output_fd,
                             base_node_buffer.data(),
                             current_node_bytes,
                             base_nodes_offset + source_start * sizeof(IdxT));
cuvs::util::write_large_file(output_fd,
                             base_link_buffer.data(),
                             current_link_bytes,
                             base_links_offset + source_start * base_link_row_bytes);

Contributor

I see that the graph connections / links are shifted to reflect the original ordering, but I don't see the rows/nodes themselves being reordered. Is this intended?

Contributor Author

Yes. The idea is to write the rows sequentially in source order and scatter them into memory by original ID on read. I've added a comment.

Comment on lines +477 to +484
template <typename T, typename IdxT, typename Callback>
void build_hnsw_upper_layer_graphs(
  raft::resources const& res,
  raft::host_matrix_view<const T, int64_t, raft::row_major> promoted_dataset,
  const hnsw_level_plan& plan,
  size_t M,
  cuvs::distance::DistanceType metric,
  Callback&& callback)

Contributor

Good to have this extracted as utility.

Comment on lines +1127 to +1142
for (int64_t batch_idx = 0; batch_idx < static_cast<int64_t>(current_batch_size);
     ++batch_idx) {
  const auto row = batch_start + static_cast<size_t>(batch_idx);
  node_buffer[batch_idx] = static_cast<IdxT>(hierarchy.order[start_idx + row]);
  auto* link_row = link_buffer.data() + batch_idx * metadata.upper_link_row_bytes;
  hnswlib::linklistsizeint list_count =
    static_cast<hnswlib::linklistsizeint>(layer.degree);
  std::memcpy(link_row, &list_count, sizeof(list_count));
  auto* dst = reinterpret_cast<IdxT*>(link_row + sizeof(hnswlib::linklistsizeint));
  if (layer.degree > 0) {
    auto* src = host_neighbors.data_handle() + row * layer.degree;
    for (size_t j = 0; j < layer.degree; ++j) {
      dst[j] = static_cast<IdxT>(hierarchy.order[src[j] + start_idx]);
    }
  }
}

Contributor

OK, here we seem to write in both the modified data order (not the original) and by level connections instead of by row. I assume this gets reordered upon deserialization.

Contributor Author

Yes. I've added a comment.

Comment on lines +2326 to +2329
auto ll0 = appr_algo->get_linklist0(node_id);
memcpy(ll0,
       base_link_buffer.data() + batch_idx * metadata.base_link_row_bytes,
       metadata.base_link_row_bytes);

Contributor

Ah ok - this is where we re-order upon insertion.

Comment on lines +2388 to +2392
  auto ll = appr_algo->get_linklist(node_id, layer.level);
  memcpy(ll,
         link_buffer.data() + batch_idx * metadata.upper_link_row_bytes,
         metadata.upper_link_row_bytes);
}

Contributor

I see, this way we can re-arrange the levels upon load as well. I guess this works as long as we can provide our own deserialize-loader and don't have to mimic the serialized file structure of the original format.

But is this really the use-case? Do we eventually need to support a disk-layered+dataset->disk conversion?

Contributor Author

Yes, this minimizes the I/O for reading and constructing the HNSW index in memory. However, the same approach can be used for the memory-bound case, where we need to support the disk-layered+dataset->disk conversion you mentioned. I'm working on a follow-up PR that enables it. The idea is to reorder the base links into original ID space and interleave the dataset vectors using one of:

  1. Simple scatter, if we have enough host memory.
  2. File-backed mmap: probably very slow.
  3. Write [id][link row] records to temporary files in buckets, then replay each bucket into a small per-bucket scatter buffer. The base section has to be written and re-read, but all disk I/O stays sequential. This seems to be the most promising approach when memory is constrained. Please let me know if you have other ideas.

Contributor Author

I've added #2241 on top of this that implements strategies (1) and (3).
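
The bucketed replay of strategy (3) can be illustrated in memory as below. This is only a sketch of the idea described in this thread, not the code in #2241: each inner vector stands in for a temporary bucket file, and `link_record` and `bucketed_reorder` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One [id][link row] record as it arrives in source (reordered) order.
struct link_record {
  std::uint32_t id;                // original row ID
  std::vector<std::uint32_t> row;  // padded link row
};

// Return rows ordered by original ID, holding at most `bucket_rows`
// rows in the scatter buffer at any time.
inline std::vector<std::vector<std::uint32_t>> bucketed_reorder(
  const std::vector<link_record>& source_order_records,
  std::size_t n_rows,
  std::size_t bucket_rows)
{
  assert(bucket_rows > 0);
  const std::size_t n_buckets = (n_rows + bucket_rows - 1) / bucket_rows;

  // Pass 1: append each record to the bucket that owns its ID range.
  std::vector<std::vector<link_record>> buckets(n_buckets);
  for (const auto& rec : source_order_records) {
    assert(rec.id < n_rows);
    buckets[rec.id / bucket_rows].push_back(rec);
  }

  // Pass 2: replay buckets in order, scatter within a small buffer,
  // then append the buffer to the output sequentially.
  std::vector<std::vector<std::uint32_t>> output;
  output.reserve(n_rows);
  for (std::size_t b = 0; b < n_buckets; ++b) {
    const std::size_t first = b * bucket_rows;
    const std::size_t count = std::min(bucket_rows, n_rows - first);
    std::vector<std::vector<std::uint32_t>> scatter(count);
    for (const auto& rec : buckets[b]) { scatter[rec.id - first] = rec.row; }
    for (auto& row : scatter) { output.push_back(std::move(row)); }
  }
  return output;
}
```

With real files, both passes read and write sequentially; the only random access happens inside the per-bucket buffer, whose size is bounded by `bucket_rows`.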

@julianmi julianmi marked this pull request as ready for review June 12, 2026 13:25
@julianmi julianmi requested review from a team as code owners June 12, 2026 13:25

coderabbitai Bot commented Jun 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0dcd4a3a-36b4-4988-af04-b71db4a05dbe

📥 Commits

Reviewing files that changed from the base of the PR and between 6672103 and 265206b.

📒 Files selected for processing (12)
  • cpp/bench/ann/src/cuvs/cuvs_cagra_hnswlib.cu
  • cpp/bench/ann/src/cuvs/cuvs_cagra_hnswlib_wrapper.h
  • cpp/include/cuvs/neighbors/hnsw.hpp
  • cpp/src/neighbors/detail/cagra/cagra_build.cuh
  • cpp/src/neighbors/detail/hnsw.hpp
  • cpp/tests/neighbors/ann_hnsw_ace.cuh
  • cpp/tests/neighbors/ann_hnsw_ace/test_float_uint32_t.cu
  • cpp/tests/neighbors/ann_hnsw_ace/test_half_uint32_t.cu
  • cpp/tests/neighbors/ann_hnsw_ace/test_int8_t_uint32_t.cu
  • cpp/tests/neighbors/ann_hnsw_ace/test_uint8_t_uint32_t.cu
  • examples/cpp/CMakeLists.txt
  • examples/cpp/src/hnsw_ace_layered_example.cu

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • New Features

    • Added GPU_LAYERED_ON_DISK hierarchy mode for HNSW indexes, enabling efficient on-disk data storage with topology-only in-memory artifacts.
    • Introduced dataset_path parameter in index configuration for layered-on-disk deserialization.
  • Tests

    • Added comprehensive test coverage for layered HNSW-ACE build, serialization, and search workflows.
  • Documentation

    • Added example demonstrating layered HNSW index construction with optional quantization.

Walkthrough

This PR introduces GPU_LAYERED_ON_DISK support for HNSW, enabling disk-efficient serialization of hierarchical indices built from disk-backed CAGRA with ACE. The implementation includes a custom artifact file format, serialization/deserialization paths, safe file handling, and comprehensive test coverage across multiple data types with an example demonstrating usage.

Changes

Layered HNSW GPU Disk Implementation

| Layer | File(s) | Summary |
|---|---|---|
| Public API Extension | cpp/include/cuvs/neighbors/hnsw.hpp | New HnswHierarchy::GPU_LAYERED_ON_DISK enum option and std::string dataset_path field in index_params for deserializing layered artifacts. |
| Layered Artifact Format & Planning | cpp/src/neighbors/detail/hnsw.hpp | Introduces hnsw_level_plan structure, layered file format enums/structs (layered_hnsw_file_header, layered_hnsw_layer_descriptor, layered_hnsw_dtype), and helper functions for hierarchy planning and file format encoding/decoding. |
| Disk I/O and File Helpers | cpp/src/neighbors/detail/hnsw.hpp, cpp/bench/ann/src/cuvs/cuvs_cagra_hnswlib_wrapper.h | Implements dataset loading from .npy and benchmark formats, move_file_overwrite() for safe file movement, and base-layer topology writing utilities. |
| Layered HNSW Serialization from Disk | cpp/src/neighbors/detail/hnsw.hpp | Core serialize_to_layered_hnsw_from_disk() that builds random hierarchy levels, writes artifact header/descriptors/levels, base topology, gathers promoted vectors in original order, computes upper-layer neighbor graphs, and streams final hnsw_index.cuvs. |
| Layered HNSW Deserialization | cpp/src/neighbors/detail/hnsw.hpp | Full deserialize_layered_hnsw() implementation validating artifact header/descriptor sections (magic/version/dtype/metric), dataset shape, loading levels, and reconstructing hnswlib storage from topology batches. |
| Build & Index Integration | cpp/src/neighbors/detail/hnsw.hpp | Integrates layered serialization into from_cagra() disk-backed path and build(), including GPU_LAYERED_ON_DISK-specific validation requiring ACE disk mode and build_dir configuration. |
| Serialize/Deserialize Entry Points | cpp/src/neighbors/detail/hnsw.hpp | Adds GPU_LAYERED_ON_DISK branches to serialize() and deserialize() routing to layered implementations with artifact copying and overwrite safety. |
| Search & Index Loading Wiring | cpp/src/neighbors/detail/hnsw.hpp | Updates ensure_loaded() to require hnsw::deserialize with dataset_path when hierarchy is GPU_LAYERED_ON_DISK. |
| Disk Serialization Refactoring | cpp/src/neighbors/detail/hnsw.hpp | Refactors serialize_to_hnswlib_from_disk() to use new hnsw_level_plan for hierarchy planning instead of local histogram/order vectors, rewiring upper-layer computation and linklist assembly. |
| Benchmark Parameter Parsing & File Handling | cpp/bench/ann/src/cuvs/cuvs_cagra_hnswlib.cu, cpp/bench/ann/src/cuvs/cuvs_cagra_hnswlib_wrapper.h | Parses gpu_layered_on_disk hierarchy from config, sets dataset_path from base file when not provided, extracts ACE parameters with ace_ prefix, and uses move_file_overwrite() in save() path. |
| Test Infrastructure & Harness | cpp/tests/neighbors/ann_hnsw_ace.cuh | Test helpers for layered HNSW: .npy dataset writing, artifact validation, deserialization/search workflows, and negative test cases for corrupted files and parameter mismatches. |
| Test Suite Instantiation | cpp/tests/neighbors/ann_hnsw_ace/test_*.cu | Parameterized GoogleTest suites across float/half/int8_t/uint8_t types with layered build/deserialize/search and memory fallback test cases. |
| Example Code | examples/cpp/src/hnsw_ace_layered_example.cu, examples/cpp/CMakeLists.txt | Demonstrates layered HNSW with optional int8 quantization: generates synthetic dataset, writes to .npy, builds via ACE, deserializes, and searches with configurable ef. |
| ACE Build Logging & Validation | cpp/src/neighbors/detail/cagra/cagra_build.cuh | Standardizes ACE progress reporting with progress_step_10 helper, restructures log messages with "ACE build:" prefix, adds validation for non-empty dataset/positive dimensions, and provides detailed memory/timing breakdowns. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

feature request, improvement, non-breaking

Suggested reviewers

  • cjnolet
  • KyleFromNVIDIA
  • dantegd

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding support for layered HNSW indexing, which is the core feature across all modified files. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, detailing the artifact layout, serialization process, deserialization logic, and disk access patterns for the new layered HNSW feature. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

