Add HNSW Layered Index Support by julianmi · Pull Request #2148 · rapidsai/cuvs

julianmi · 2026-06-01T08:36:25Z

The disk-backed ACE algorithm partitions the dataset, so the CAGRA graph it builds uses the reordered index space. Building an HNSW index with hnsw::from_cagra therefore uses both the reordered dataset and the CAGRA graph. Downstream consumers building an HNSW index would need the reordered dataset, which is typically large in the cases that require the disk-backed ACE algorithm. Building only the layers of the HNSW index, without the dataset, and moving the final assembly to the search node minimizes network transfers for downstream consumers that have the original dataset available locally. The hnsw::deserialize step then takes the layered index and combines it with the local dataset to form an hnswlib-compatible search index.

Artifact Layout

hnsw_index.cuvs
  fixed_header
    magic = CUVS_HNSW_LAYERED
    version
    metadata_offset
    metadata_size

  metadata_binary
    dataset shape, dtype, metric
    hnsw parameters
    section sizes
    upper-layer descriptors

  levels
    uint8[n_rows]
    indexed by original dataset row ID

  base_nodes
    uint32[n_rows]
    maps each base topology row to original row ID

  base_links
    n_rows fixed-size hnswlib-ready rows
    [count:uint32][neighbors:uint32[maxM0]]
    neighbors are original row IDs

  upper_nodes
    concatenated uint32 original row IDs for layers 1..maxlevel

  upper_links
    fixed-size hnswlib-ready rows
    [count:uint32][neighbors:uint32[maxM]]
    neighbors are original row IDs
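
As a rough illustration of the layout above, the size of each fixed-size link row and of the per-row base sections can be derived from n_rows and maxM0 alone. This is a minimal sketch: the helper and struct names are hypothetical, and it assumes sections are packed without extra padding, which the layout does not state.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers; only the row shapes are taken from the layout above.

// Bytes in one fixed-size link row: [count:uint32][neighbors:uint32[max_degree]].
constexpr std::size_t link_row_bytes(std::size_t max_degree)
{
  return sizeof(std::uint32_t) * (1 + max_degree);
}

// Sizes of the three per-row base sections for a dataset with n_rows rows.
struct base_section_sizes {
  std::size_t levels_bytes;      // uint8[n_rows]
  std::size_t base_nodes_bytes;  // uint32[n_rows]
  std::size_t base_links_bytes;  // n_rows rows of link_row_bytes(max_m0)
};

constexpr base_section_sizes compute_base_section_sizes(std::size_t n_rows, std::size_t max_m0)
{
  return {n_rows * sizeof(std::uint8_t),
          n_rows * sizeof(std::uint32_t),
          n_rows * link_row_bytes(max_m0)};
}
```

Because every base_links row has the same size under this assumption, row i sits at a fixed offset of i * link_row_bytes(maxM0) from the start of the section, which is what allows batched sequential reads and writes.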

Layered HNSW Serialization

The layered serializer creates hnsw_index.cuvs from the disk-backed ACE graph.

1. Create the .cuvs file and write the fixed header and metadata.
2. Generate HNSW levels in original ID space.
3. Write levels sequentially.
4. Read dataset_mapping.npy sequentially into reordered_to_original.
5. Read cagra_graph.npy source-sequentially in ACE reordered row order.
6. For each ACE graph row:
   - write base_nodes[row] = reordered_to_original[ace_reordered_row]
   - convert each neighbor from ACE reordered ID to original ID
   - write a padded hnswlib-ready row to base_links[row]
7. Gather promoted vectors from the original dataset.
8. Build upper-layer graphs using temporary HNSW promoted order.
9. Write upper_nodes and upper_links with node IDs and neighbor IDs converted back to original IDs.

This keeps remapping, link padding, and upper-layer KNN work on the build node.
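
The per-row remapping and padding step can be sketched as follows. This is not the PR's code: make_base_link_row is a hypothetical helper, and zero-filling the unused neighbor slots is an assumption, since the description only says the row is padded.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert one ACE graph row from reordered IDs to original IDs and pad it to a
// fixed-size row: [count][neighbors[max_m0]]. Hypothetical helper for illustration.
std::vector<std::uint32_t> make_base_link_row(
  const std::vector<std::uint32_t>& reordered_neighbors,
  const std::vector<std::uint32_t>& reordered_to_original,
  std::size_t max_m0)
{
  // Assumption: slots past `count` are zero-filled.
  std::vector<std::uint32_t> row(1 + max_m0, 0u);
  const std::size_t count = std::min(reordered_neighbors.size(), max_m0);
  row[0] = static_cast<std::uint32_t>(count);
  for (std::size_t j = 0; j < count; ++j) {
    // Translate each neighbor from ACE reordered ID space to original ID space.
    row[1 + j] = reordered_to_original[reordered_neighbors[j]];
  }
  return row;
}
```

The matching base_nodes entry for the same row would be reordered_to_original[ace_reordered_row], so rows can be written in source order while still carrying their original IDs.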

Deserialization

The search node reads:

hnsw_index.cuvs
the external original-order dataset from index_params.dataset_path

The loader:

1. Reads the fixed header and metadata.
2. Validates artifact shape, section sizes, and dataset shape.
3. Reads levels sequentially.
4. Allocates hnswlib storage.
5. Reads the external dataset sequentially in original row order.
6. Initializes hnswlib with:
   - internal ID = original ID
   - label = original ID
   - level = levels[original_id]
7. Reads base_nodes and base_links sequentially.
8. Copies each base link row into get_linklist0(base_node_id).
9. Reads upper_nodes and upper_links sequentially by layer.
10. Copies each upper link row into get_linklist(node_id, level).

The search node does no graph remapping, no level generation, no link padding, and no KNN work.
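
The base-layer part of the loader amounts to a sequential read followed by a scatter copy keyed by original ID. The sketch below illustrates that pattern only: link_storage is a hypothetical stand-in for hnswlib's level-0 link memory, and scatter_batch is not a function from the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for hnswlib's level-0 link storage: one fixed-size row per internal ID.
struct link_storage {
  std::size_t row_bytes;
  std::vector<char> data;

  link_storage(std::size_t n_rows, std::size_t row_bytes_)
    : row_bytes(row_bytes_), data(n_rows * row_bytes_, 0)
  {
  }

  char* row(std::uint32_t id) { return data.data() + static_cast<std::size_t>(id) * row_bytes; }
};

// Copy a batch that was read sequentially from the artifact into storage,
// scattering each row to the slot of its original ID.
void scatter_batch(link_storage& storage,
                   const std::vector<std::uint32_t>& batch_nodes,
                   const std::vector<char>& batch_links)
{
  for (std::size_t i = 0; i < batch_nodes.size(); ++i) {
    std::memcpy(storage.row(batch_nodes[i]),
                batch_links.data() + i * storage.row_bytes,
                storage.row_bytes);
  }
}
```

Since the rows already hold original IDs and are already padded, the copy is a plain memcpy per row with no per-neighbor work on the search node.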

Disk Access Patterns

Build node:

Sequential scan of the original dataset for ACE partitioning.
Buffered partition writes for reordered and augmented datasets.
Contiguous per-partition reads from reordered_dataset.npy and augmented_dataset.npy.
Source-sequential reads from cagra_graph.npy when creating the final layered artifact.
Sequential writes to hnsw_index.cuvs.

Search node:

Sequential reads from hnsw_index.cuvs.
Sequential reads from the external original-order dataset.
Scatter writes only into in-memory hnswlib link storage by original ID.

Runtime Requirements

Only hnsw_index.cuvs is copied to the search node. ACE temporary files remain build-node-only.

The search node must have the original dataset in original row order and must provide that path through index_params.dataset_path.

Misc

Unifies the logging format of the ACE algorithm.

- Add base node IDs for sequential access.
- Scattered writes happen only in the deserialization step using host memory.

copy-pr-bot · 2026-06-01T08:36:28Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

mfoerste4

Thanks @julianmi for the PR. The structure looks good to me. Only a few comments that are not necessarily actionable.

mfoerste4 · 2026-06-08T12:50:57Z

+    } else if (conf.at("hierarchy") == "gpu_layered_on_disk" ||
+               conf.at("hierarchy") == "gpu_layered" || conf.at("hierarchy") == "layered") {
+      hnsw_params.hierarchy = cuvs::neighbors::hnsw::HnswHierarchy::GPU_LAYERED_ON_DISK;


Are these all just synonyms for GPU_LAYERED_ON_DISK?

Yes, I've removed them to better align with the other formats.

mfoerste4 · 2026-06-08T12:53:05Z

+                 "Layered HNSW artifact '%s' does not exist.",
+                 src_artifact.c_str());
+
+    copy_file_overwrite(src_artifact, std::filesystem::path(file));


Why do we copy here instead of moving (like in the cagra_ace block below?)

Good catch, thanks. I've changed this and reuse the helper in the cagra_ace block.

mfoerste4 · 2026-06-08T13:30:51Z

+inline auto json_parse_double(const std::string& json, const std::string& key) -> double
+{
+  auto pos   = json_find_key(json, key);
+  auto colon = json.find(':', pos);
+  RAFT_EXPECTS(colon != std::string::npos, "Malformed JSON near key '%s'", key.c_str());
+  auto begin = json.find_first_of("0123456789-.", colon + 1);
+  auto end   = json.find_first_not_of("0123456789-.eE+", begin);
+  RAFT_EXPECTS(begin != std::string::npos, "Malformed double JSON value for key '%s'", key.c_str());
+  return std::stod(json.substr(begin, end - begin));
+}
+
+inline auto json_parse_layer_field(const std::string& object, const std::string& key) -> size_t
+{
+  return json_parse_size(object, key);
+}
+


If we use JSON here - why don't we use a common json utility for parsing? This whole section feels very verbose for a standard. Is there a cleaner alternative?

This would pull the JSON dependency (nlohmann/json) into libcuvs. It is currently only used in the benchmarking. I've changed to a binary header format instead given the simplicity. Let me know what you think.
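
For readers following the thread, a fixed binary header of the kind mentioned here could be written and validated roughly as below, with no JSON dependency. This is a sketch under assumptions: sketch_header, its field widths, the reserved field, and the version value are all hypothetical. Only the field names magic, version, metadata_offset, and metadata_size and the magic string come from the PR description.

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical header; field widths and the reserved field are assumptions.
struct sketch_header {
  char magic[24];  // zero-padded "CUVS_HNSW_LAYERED"
  std::uint32_t version;
  std::uint32_t reserved;  // explicit padding so the struct has no hidden gaps
  std::uint64_t metadata_offset;
  std::uint64_t metadata_size;
};

constexpr char kMagic[]          = "CUVS_HNSW_LAYERED";
constexpr std::uint32_t kVersion = 1;  // assumption, not the PR's value

sketch_header make_header(std::uint64_t metadata_size)
{
  sketch_header h{};  // zero-initialize every field
  std::memcpy(h.magic, kMagic, sizeof(kMagic));
  h.version         = kVersion;
  h.metadata_offset = sizeof(sketch_header);
  h.metadata_size   = metadata_size;
  return h;
}

// Validate before trusting any size fields, so a corrupt file cannot drive allocations.
sketch_header parse_header(const std::vector<char>& file_bytes)
{
  if (file_bytes.size() < sizeof(sketch_header)) {
    throw std::runtime_error("file too small for header");
  }
  sketch_header h;
  std::memcpy(&h, file_bytes.data(), sizeof(h));
  if (std::memcmp(h.magic, kMagic, sizeof(kMagic)) != 0) { throw std::runtime_error("bad magic"); }
  if (h.version != kVersion) { throw std::runtime_error("unsupported version"); }
  if (h.metadata_offset > file_bytes.size() ||
      h.metadata_size > file_bytes.size() - h.metadata_offset) {
    throw std::runtime_error("metadata section out of bounds");
  }
  return h;
}
```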

mfoerste4 · 2026-06-08T14:37:02Z

+    const auto current_node_bytes = current_batch_size * sizeof(IdxT);
+    const auto current_link_bytes = current_batch_size * base_link_row_bytes;
+    cuvs::util::write_large_file(output_fd,
+                                 base_node_buffer.data(),
+                                 current_node_bytes,
+                                 base_nodes_offset + source_start * sizeof(IdxT));
+    cuvs::util::write_large_file(output_fd,
+                                 base_link_buffer.data(),
+                                 current_link_bytes,
+                                 base_links_offset + source_start * base_link_row_bytes);


I see that the graph connections / links are shifted to reflect the original ordering, but I don't see the rows/nodes themselves being reordered. Is this intended?

Yes. The idea is to write sequentially on the build node and scatter into memory when reading on the search node. I've added a comment.

mfoerste4 · 2026-06-08T14:51:59Z

+template <typename T, typename IdxT, typename Callback>
+void build_hnsw_upper_layer_graphs(
+  raft::resources const& res,
+  raft::host_matrix_view<const T, int64_t, raft::row_major> promoted_dataset,
+  const hnsw_level_plan& plan,
+  size_t M,
+  cuvs::distance::DistanceType metric,
+  Callback&& callback)


Good to have this extracted as utility.

mfoerste4 · 2026-06-08T15:10:39Z

+          for (int64_t batch_idx = 0; batch_idx < static_cast<int64_t>(current_batch_size);
+               ++batch_idx) {
+            const auto row         = batch_start + static_cast<size_t>(batch_idx);
+            node_buffer[batch_idx] = static_cast<IdxT>(hierarchy.order[start_idx + row]);
+            auto* link_row         = link_buffer.data() + batch_idx * metadata.upper_link_row_bytes;
+            hnswlib::linklistsizeint list_count =
+              static_cast<hnswlib::linklistsizeint>(layer.degree);
+            std::memcpy(link_row, &list_count, sizeof(list_count));
+            auto* dst = reinterpret_cast<IdxT*>(link_row + sizeof(hnswlib::linklistsizeint));
+            if (layer.degree > 0) {
+              auto* src = host_neighbors.data_handle() + row * layer.degree;
+              for (size_t j = 0; j < layer.degree; ++j) {
+                dst[j] = static_cast<IdxT>(hierarchy.order[src[j] + start_idx]);
+              }
+            }
+          }


Ok, here we seem to write in both the modified data order (not original) and by level connections instead of by row. I assume this gets reordered upon deserialization.

Yes. I've added a comment.

mfoerste4 · 2026-06-08T15:17:01Z

+      auto ll0 = appr_algo->get_linklist0(node_id);
+      memcpy(ll0,
+             base_link_buffer.data() + batch_idx * metadata.base_link_row_bytes,
+             metadata.base_link_row_bytes);


Ah ok - this is where we re-order upon insertion.

mfoerste4 · 2026-06-08T15:23:22Z

+        auto ll = appr_algo->get_linklist(node_id, layer.level);
+        memcpy(ll,
+               link_buffer.data() + batch_idx * metadata.upper_link_row_bytes,
+               metadata.upper_link_row_bytes);
+      }


I see, this way we can re-arrange the levels upon load as well. I guess this works as long as we can provide our own deserialize-loader and don't have to mimic the serialized file structure of the original format.

But is this really the use-case? Do we eventually need to support a disk-layered+dataset->disk conversion?

Yes, this minimizes the I/O for reading and constructing the HNSW index in memory. However, the same approach can be used for the memory-bound case where we need to support the disk-layered+dataset->disk conversion you've mentioned. I'm working on a follow-up PR that enables it. The idea is to reorder the base links to original ID space and interleave the dataset vectors using one of:

1. Simple scatter if we have enough host memory.

2. File-backed mmap: probably very slow.

3. Write [id][link row] records to temporary files in buckets, then replay each bucket into a small per-bucket scatter buffer. The base section then needs to be written and re-read, but all disk I/O is sequential. This seems to be the most promising approach when memory constrained. Let me know if you have other ideas please.

I've added #2241 on top of this that implements strategies (1) and (3).
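
The bucketed strategy from the list above can be sketched in a few lines. This is only an illustration of the idea, not the implementation in #2241: in-memory vectors stand in for the temporary bucket files and the output file, and link_record and bucketed_reorder are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct link_record {
  std::uint32_t id;                // original row ID
  std::vector<std::uint32_t> row;  // fixed-size link row of row_len entries
};

// Reorder records into original-ID order using ID-range buckets.
// Assumes bucket_rows > 0 and that every record's row has row_len entries.
std::vector<std::uint32_t> bucketed_reorder(const std::vector<link_record>& records,
                                            std::size_t n_rows,
                                            std::size_t row_len,
                                            std::size_t bucket_rows)
{
  const std::size_t n_buckets = (n_rows + bucket_rows - 1) / bucket_rows;
  std::vector<std::vector<link_record>> buckets(n_buckets);
  // Pass 1: read the source sequentially and append each record to its bucket.
  for (const auto& rec : records) {
    buckets[rec.id / bucket_rows].push_back(rec);
  }

  std::vector<std::uint32_t> output;
  output.reserve(n_rows * row_len);
  // Pass 2: replay each bucket into a small scatter buffer, then append it to the output.
  for (std::size_t b = 0; b < n_buckets; ++b) {
    const std::size_t first     = b * bucket_rows;
    const std::size_t rows_here = std::min(bucket_rows, n_rows - first);
    std::vector<std::uint32_t> scatter(rows_here * row_len, 0u);
    for (const auto& rec : buckets[b]) {
      std::copy(rec.row.begin(), rec.row.end(), scatter.begin() + (rec.id - first) * row_len);
    }
    output.insert(output.end(), scatter.begin(), scatter.end());
  }
  return output;
}
```

Peak memory for the scatter step is bounded by bucket_rows * row_len entries instead of the whole base section, at the cost of writing and re-reading every record once.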

coderabbitai · 2026-06-12T13:32:46Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0dcd4a3a-36b4-4988-af04-b71db4a05dbe

📥 Commits

Reviewing files that changed from the base of the PR and between 6672103 and 265206b.

📒 Files selected for processing (12)

cpp/bench/ann/src/cuvs/cuvs_cagra_hnswlib.cu
cpp/bench/ann/src/cuvs/cuvs_cagra_hnswlib_wrapper.h
cpp/include/cuvs/neighbors/hnsw.hpp
cpp/src/neighbors/detail/cagra/cagra_build.cuh
cpp/src/neighbors/detail/hnsw.hpp
cpp/tests/neighbors/ann_hnsw_ace.cuh
cpp/tests/neighbors/ann_hnsw_ace/test_float_uint32_t.cu
cpp/tests/neighbors/ann_hnsw_ace/test_half_uint32_t.cu
cpp/tests/neighbors/ann_hnsw_ace/test_int8_t_uint32_t.cu
cpp/tests/neighbors/ann_hnsw_ace/test_uint8_t_uint32_t.cu
examples/cpp/CMakeLists.txt
examples/cpp/src/hnsw_ace_layered_example.cu

📝 Walkthrough

Summary by CodeRabbit

Release Notes

New Features
- Added GPU_LAYERED_ON_DISK hierarchy mode for HNSW indexes, which serializes a topology-only artifact that is combined with the original on-disk dataset at deserialization.
- Introduced dataset_path parameter in index configuration for layered-on-disk deserialization.
Tests
- Added comprehensive test coverage for layered HNSW-ACE build, serialization, and search workflows.
Documentation
- Added example demonstrating layered HNSW index construction with optional quantization.

Walkthrough

This PR introduces GPU_LAYERED_ON_DISK support for HNSW, enabling disk-efficient serialization of hierarchical indices built from disk-backed CAGRA with ACE. The implementation includes a custom artifact file format, serialization/deserialization paths, safe file handling, and comprehensive test coverage across multiple data types with an example demonstrating usage.

Changes

Layered HNSW GPU Disk Implementation

Layer / File(s)	Summary
Public API Extension `cpp/include/cuvs/neighbors/hnsw.hpp`	New `HnswHierarchy::GPU_LAYERED_ON_DISK` enum option and `std::string dataset_path` field in `index_params` for deserializing layered artifacts.
Layered Artifact Format & Planning `cpp/src/neighbors/detail/hnsw.hpp`	Introduces `hnsw_level_plan` structure, layered file format enums/structs (`layered_hnsw_file_header`, `layered_hnsw_layer_descriptor`, `layered_hnsw_dtype`), and helper functions for hierarchy planning and file format encoding/decoding.
Disk I/O and File Helpers `cpp/src/neighbors/detail/hnsw.hpp`, `cpp/bench/ann/src/cuvs/cuvs_cagra_hnswlib_wrapper.h`	Implements dataset loading from .npy and benchmark formats, `move_file_overwrite()` for safe file movement, and base-layer topology writing utilities.
Layered HNSW Serialization from Disk `cpp/src/neighbors/detail/hnsw.hpp`	Core `serialize_to_layered_hnsw_from_disk()` that builds random hierarchy levels, writes artifact header/descriptors/levels, base topology, gathers promoted vectors in original order, computes upper-layer neighbor graphs, and streams final `hnsw_index.cuvs`.
Layered HNSW Deserialization `cpp/src/neighbors/detail/hnsw.hpp`	Full `deserialize_layered_hnsw()` implementation validating artifact header/descriptor sections (magic/version/dtype/metric), dataset shape, loading levels, and reconstructing hnswlib storage from topology batches.
Build & Index Integration `cpp/src/neighbors/detail/hnsw.hpp`	Integrates layered serialization into `from_cagra()` disk-backed path and `build()`, including GPU_LAYERED_ON_DISK-specific validation requiring ACE disk mode and `build_dir` configuration.
Serialize/Deserialize Entry Points `cpp/src/neighbors/detail/hnsw.hpp`	Adds GPU_LAYERED_ON_DISK branches to `serialize()` and `deserialize()` routing to layered implementations with artifact copying and overwrite safety.
Search & Index Loading Wiring `cpp/src/neighbors/detail/hnsw.hpp`	Updates `ensure_loaded()` to require `hnsw::deserialize` with `dataset_path` when hierarchy is GPU_LAYERED_ON_DISK.
Disk Serialization Refactoring `cpp/src/neighbors/detail/hnsw.hpp`	Refactors `serialize_to_hnswlib_from_disk()` to use new `hnsw_level_plan` for hierarchy planning instead of local histogram/order vectors, rewiring upper-layer computation and linklist assembly.
Benchmark Parameter Parsing & File Handling `cpp/bench/ann/src/cuvs/cuvs_cagra_hnswlib.cu`, `cpp/bench/ann/src/cuvs/cuvs_cagra_hnswlib_wrapper.h`	Parses `gpu_layered_on_disk` hierarchy from config, sets `dataset_path` from base file when not provided, extracts ACE parameters with `ace_` prefix, and uses `move_file_overwrite()` in save() path.
Test Infrastructure & Harness `cpp/tests/neighbors/ann_hnsw_ace.cuh`	Test helpers for layered HNSW: .npy dataset writing, artifact validation, deserialization/search workflows, and negative test cases for corrupted files and parameter mismatches.
Test Suite Instantiation `cpp/tests/neighbors/ann_hnsw_ace/test_*.cu`	Parameterized GoogleTest suites across float/half/int8_t/uint8_t types with layered build/deserialize/search and memory fallback test cases.
Example Code `examples/cpp/src/hnsw_ace_layered_example.cu`, `examples/cpp/CMakeLists.txt`	Demonstrates layered HNSW with optional int8 quantization: generates synthetic dataset, writes to .npy, builds via ACE, deserializes, and searches with configurable ef.
ACE Build Logging & Validation `cpp/src/neighbors/detail/cagra/cagra_build.cuh`	Standardizes ACE progress reporting with `progress_step_10` helper, restructures log messages with "ACE build:" prefix, adds validation for non-empty dataset/positive dimensions, and provides detailed memory/timing breakdowns.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

feature request, improvement, non-breaking

Suggested reviewers

cjnolet
KyleFromNVIDIA
dantegd

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 6.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding support for layered HNSW indexing, which is the core feature across all modified files.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, detailing the artifact layout, serialization process, deserialization logic, and disk access patterns for the new layered HNSW feature.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


julianmi added 6 commits June 1, 2026 10:15

Add HNSW layered hierarchy

52997da

Improve deserialization logging

a11d9ee

Use ace prefix in benchmarking consistently

a95b0e0

Validate metadata before allocating

47e2d85

Store layered base topology by original node ID

21ce339

- Add base node IDs for sequential access.
- Scattered writes happen only in the deserialization step using host memory.

Unify the ACE logging format

4514bc8

github-project-automation Bot added this to Unstructured Data Processing Jun 1, 2026

tfeher requested a review from mfoerste4 June 1, 2026 09:49

Merge branch 'main' into hnsw-layered-index

0e9accd

mfoerste4 approved these changes Jun 8, 2026


julianmi added 4 commits June 9, 2026 09:51

Address review feedback

7a761cf

Replace JSON header with binary header

ea1a96c

Merge branch 'main' into hnsw-layered-index

153a82d

Merge branch 'main' into hnsw-layered-index

265206b

julianmi mentioned this pull request Jun 12, 2026

Materialize HNSW Layered Index to hnswlib Index #2241

Draft

julianmi marked this pull request as ready for review June 12, 2026 13:25

julianmi requested review from a team as code owners June 12, 2026 13:25
