Add HNSW Layered Index Support #2148
- Add base node IDs for sequential access.
- Scattered writes happen only in the deserialization step, using host memory.
    } else if (conf.at("hierarchy") == "gpu_layered_on_disk" ||
               conf.at("hierarchy") == "gpu_layered" || conf.at("hierarchy") == "layered") {
      hnsw_params.hierarchy = cuvs::neighbors::hnsw::HnswHierarchy::GPU_LAYERED_ON_DISK;
Are these all just synonyms for GPU_LAYERED_ON_DISK?
Yes, I've removed them to better align with the other formats.
                 "Layered HNSW artifact '%s' does not exist.",
                 src_artifact.c_str());

    copy_file_overwrite(src_artifact, std::filesystem::path(file));
Why do we copy here instead of moving (like in the cagra_ace block below)?
Good catch, thanks. I've changed this and reuse the helper in the cagra_ace block.
    inline auto json_parse_double(const std::string& json, const std::string& key) -> double
    {
      auto pos = json_find_key(json, key);
      auto colon = json.find(':', pos);
      RAFT_EXPECTS(colon != std::string::npos, "Malformed JSON near key '%s'", key.c_str());
      auto begin = json.find_first_of("0123456789-.", colon + 1);
      auto end = json.find_first_not_of("0123456789-.eE+", begin);
      RAFT_EXPECTS(begin != std::string::npos, "Malformed double JSON value for key '%s'", key.c_str());
      return std::stod(json.substr(begin, end - begin));
    }

    inline auto json_parse_layer_field(const std::string& object, const std::string& key) -> size_t
    {
      return json_parse_size(object, key);
    }
If we use JSON here, why don't we use a common JSON utility for parsing? This whole section feels very verbose for a standard format. Is there a cleaner alternative?
This would pull the JSON dependency (nlohmann/json) into libcuvs; it is currently used only in the benchmarks. Given the simplicity of the metadata, I've changed to a binary header format instead. Let me know what you think.
    const auto current_node_bytes = current_batch_size * sizeof(IdxT);
    const auto current_link_bytes = current_batch_size * base_link_row_bytes;
    cuvs::util::write_large_file(output_fd,
                                 base_node_buffer.data(),
                                 current_node_bytes,
                                 base_nodes_offset + source_start * sizeof(IdxT));
    cuvs::util::write_large_file(output_fd,
                                 base_link_buffer.data(),
                                 current_link_bytes,
                                 base_links_offset + source_start * base_link_row_bytes);
I see that the graph connections / links are shifted to reflect the original ordering, but I don't see the rows/nodes themselves being reordered. Is this intended?
Yes. The idea is to write sequentially and scatter into memory on read. I've added a comment.
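The write-sequentially, scatter-on-read pattern can be sketched as follows. In-memory vectors stand in for the on-disk section and for the hnswlib link storage; the type and function names are hypothetical, and the sketch assumes `node_ids` is a permutation of `0..n-1`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Rows are stored in build (reordered) order, next to the original ID of each row.
struct sequential_section {
  std::vector<std::uint32_t> node_ids;  // node_ids[row] = original ID of that row
  std::vector<std::uint32_t> links;     // row-major, `degree` neighbor IDs per row
  std::size_t degree;
};

// One sequential pass over the section; random access happens only in host memory.
inline std::vector<std::uint32_t> scatter_to_original_order(const sequential_section& s)
{
  std::vector<std::uint32_t> out(s.links.size());
  for (std::size_t row = 0; row < s.node_ids.size(); ++row) {
    const std::size_t dst_row = s.node_ids[row];  // destination slot in original ID space
    for (std::size_t j = 0; j < s.degree; ++j) {
      out[dst_row * s.degree + j] = s.links[row * s.degree + j];
    }
  }
  return out;
}
```

The writer never seeks, and the reader pays for the permutation with memory writes rather than disk seeks.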
    template <typename T, typename IdxT, typename Callback>
    void build_hnsw_upper_layer_graphs(
      raft::resources const& res,
      raft::host_matrix_view<const T, int64_t, raft::row_major> promoted_dataset,
      const hnsw_level_plan& plan,
      size_t M,
      cuvs::distance::DistanceType metric,
      Callback&& callback)
Good to have this extracted as utility.
    for (int64_t batch_idx = 0; batch_idx < static_cast<int64_t>(current_batch_size);
         ++batch_idx) {
      const auto row = batch_start + static_cast<size_t>(batch_idx);
      node_buffer[batch_idx] = static_cast<IdxT>(hierarchy.order[start_idx + row]);
      auto* link_row = link_buffer.data() + batch_idx * metadata.upper_link_row_bytes;
      hnswlib::linklistsizeint list_count =
        static_cast<hnswlib::linklistsizeint>(layer.degree);
      std::memcpy(link_row, &list_count, sizeof(list_count));
      auto* dst = reinterpret_cast<IdxT*>(link_row + sizeof(hnswlib::linklistsizeint));
      if (layer.degree > 0) {
        auto* src = host_neighbors.data_handle() + row * layer.degree;
        for (size_t j = 0; j < layer.degree; ++j) {
          dst[j] = static_cast<IdxT>(hierarchy.order[src[j] + start_idx]);
        }
      }
    }
Ok, here we seem to write in both the modified data order (not the original) and by level connections instead of by row. I assume this gets reordered upon deserialization.
Yes. I've added a comment.
    auto ll0 = appr_algo->get_linklist0(node_id);
    memcpy(ll0,
           base_link_buffer.data() + batch_idx * metadata.base_link_row_bytes,
           metadata.base_link_row_bytes);
Ah ok - this is where we re-order upon insertion.
      auto ll = appr_algo->get_linklist(node_id, layer.level);
      memcpy(ll,
             link_buffer.data() + batch_idx * metadata.upper_link_row_bytes,
             metadata.upper_link_row_bytes);
    }
I see, this way we can re-arrange the levels upon load as well. I guess this works as long as we can provide our own deserialize-loader and don't have to mimic the serialized file structure of the original format.
But is this really the use-case? Do we eventually need to support a disk-layered+dataset->disk conversion?
Yes, this minimizes the I/O for reading and constructing the HNSW index in memory. However, the same approach can be used for the memory-bound case, where we need to support the disk-layered+dataset->disk conversion you've mentioned. I'm working on a follow-up PR that enables it. The idea is to reorder the base links to the original ID space and interleave the dataset vectors by one of:
1. Simple scatter, if we have enough host memory.
2. File-backed mmap: probably very slow.
3. Write `[id][link row]` records to temporary files in buckets, then replay each bucket into a small per-bucket scatter buffer. The base section has to be written and re-read, but all disk I/O is sequential. This seems to be the most promising approach when memory is constrained.

Let me know if you have other ideas.
I've added #2241 on top of this that implements strategies (1) and (3).
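The bucketed strategy described in this thread can be sketched as below. This is not the implementation from the follow-up PR: in-memory vectors stand in for the temporary bucket files, all names are hypothetical, and the sketch assumes `ids` is a permutation of `0..n-1`, a fixed `degree`, and `bucket_rows > 0`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// ids[row] is the original ID of the row; links holds `degree` neighbor IDs per row.
inline std::vector<std::uint32_t> bucketed_scatter(const std::vector<std::uint32_t>& ids,
                                                   const std::vector<std::uint32_t>& links,
                                                   std::size_t degree,
                                                   std::size_t bucket_rows)
{
  const std::size_t n         = ids.size();
  const std::size_t n_buckets = (n + bucket_rows - 1) / bucket_rows;

  // Pass 1: append each [id][link row] record to the bucket owning its ID range.
  // With real files, every bucket is written append-only (sequential I/O).
  std::vector<std::vector<std::uint32_t>> buckets(n_buckets);
  for (std::size_t row = 0; row < n; ++row) {
    auto& bucket = buckets[ids[row] / bucket_rows];
    bucket.push_back(ids[row]);
    for (std::size_t j = 0; j < degree; ++j) { bucket.push_back(links[row * degree + j]); }
  }

  // Pass 2: replay one bucket at a time into a small scatter buffer, then
  // append the buffer to the output, which is also written sequentially.
  std::vector<std::uint32_t> out;
  out.reserve(n * degree);
  std::vector<std::uint32_t> scratch;
  for (std::size_t b = 0; b < n_buckets; ++b) {
    const std::size_t first = b * bucket_rows;
    const std::size_t rows  = std::min(bucket_rows, n - first);
    scratch.assign(rows * degree, 0);
    const auto& records = buckets[b];
    for (std::size_t pos = 0; pos < records.size(); pos += degree + 1) {
      const std::size_t local = records[pos] - first;  // row index inside this bucket
      for (std::size_t j = 0; j < degree; ++j) {
        scratch[local * degree + j] = records[pos + 1 + j];
      }
    }
    out.insert(out.end(), scratch.begin(), scratch.end());
  }
  return out;
}
```

Peak memory for the scatter is bounded by `bucket_rows * degree` entries, at the cost of writing and re-reading the base section once.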
Walkthrough: This PR introduces GPU_LAYERED_ON_DISK support for HNSW, enabling disk-efficient serialization of hierarchical indices built from disk-backed CAGRA with ACE. The implementation includes a custom artifact file format, serialization/deserialization paths, safe file handling, and comprehensive test coverage across multiple data types with an example demonstrating usage.
The disk-backed ACE algorithm partitions the dataset, so the CAGRA graph it builds uses the reordered index space. Building an HNSW index with `hnsw::from_cagra` therefore uses the reordered dataset and the reordered CAGRA graph. Downstream consumers building an HNSW index would then need the reordered dataset, which is typically large whenever the disk-backed ACE algorithm is required. Building only the layers of the HNSW index, without the dataset, and moving that artifact to the search node minimizes network transfers for downstream consumers that have the original dataset available locally. The `hnsw::deserialize` step then takes the layered index and combines it with the local dataset to form an hnswlib-compatible search index.

Artifact Layout
Layered HNSW Serialization
The layered serializer creates `hnsw_index.cuvs` from the disk-backed ACE graph.
- […] `.cuvs` file and write the fixed header and metadata.
- […] `levels` sequentially.
- […] `dataset_mapping.npy` sequentially into `reordered_to_original`.
- […] `cagra_graph.npy` source-sequentially in ACE reordered row order.
- […] `base_nodes[row] = reordered_to_original[ace_reordered_row]`
- […] `base_links[row]`
- […] `upper_nodes` and `upper_links` with node IDs and neighbor IDs converted back to original IDs.

This keeps remapping, link padding, and upper-layer KNN work on the build node.
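The build-side remapping can be sketched as follows, assuming (as the review discussion suggests) that both the node ID and the neighbor IDs of each base row are converted through `reordered_to_original`. The struct and function names are hypothetical, and link padding and the per-row count prefix are omitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct base_section {
  std::vector<std::uint32_t> base_nodes;  // original ID of each reordered row
  std::vector<std::uint32_t> base_links;  // neighbor IDs, expressed as original IDs
};

// graph is row-major in ACE reordered row order, `degree` neighbors per row.
inline base_section remap_base_layer(const std::vector<std::uint32_t>& reordered_to_original,
                                     const std::vector<std::uint32_t>& graph,
                                     std::size_t degree)
{
  const std::size_t n_rows = reordered_to_original.size();
  base_section out;
  out.base_nodes.resize(n_rows);
  out.base_links.resize(n_rows * degree);
  for (std::size_t row = 0; row < n_rows; ++row) {
    // Rows stay in reordered order so that the output is written sequentially.
    out.base_nodes[row] = reordered_to_original[row];
    for (std::size_t j = 0; j < degree; ++j) {
      out.base_links[row * degree + j] = reordered_to_original[graph[row * degree + j]];
    }
  }
  return out;
}
```

Because every ID in the artifact is already in the original ID space, the loader can copy rows into place without consulting any mapping.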

Deserialization
The search node reads:
- […] `hnsw_index.cuvs`
- […] `index_params.dataset_path`

The loader:
- […] `levels` sequentially.
- […] `levels[original_id]`
- […] `base_nodes` and `base_links` sequentially.
- […] `get_linklist0(base_node_id)`.
- […] `upper_nodes` and `upper_links` sequentially by layer.
- […] `get_linklist(node_id, level)`.

The search node does no graph remapping, no level generation, no link padding, and no KNN work.

Disk Access Patterns
Build node:
- […] `reordered_dataset.npy` and `augmented_dataset.npy`.
- […] `cagra_graph.npy` when creating the final layered artifact.
- […] `hnsw_index.cuvs`.

Search node:
- […] `hnsw_index.cuvs`.

Runtime Requirements
Only `hnsw_index.cuvs` is copied to the search node. ACE temporary files remain build-node-only.

The search node must have the original dataset in original row order and must provide that path through `index_params.dataset_path`.

Misc
Unifies the logging format of the ACE algorithm.