feat: add BuildWithCache support for HGraph incremental index building by misaka0714 · Pull Request #2095 · antgroup/vsag

misaka0714 · 2026-05-26T03:00:45Z

Summary

Add BuildWithCache to support incremental HGraph index building from a cache exported by a previous build. This reduces rebuild time when the dataset changes between builds (for example, day1 → day2 updates), while keeping recall stable.

Change Type

Linked Issue

Fixes:

What Changed

Core implementation

Added hgraph_build_cache.cpp with the main BuildWithCache workflow:
- cache loading
- warm start with hit/miss classification
- two-phase parallel refine
- route graph building

Performance optimizations

Optimized reverse-edge construction using a scatter-materialize parallel pattern in the group phase
Added shard-parallel merge_prune
Reused distances computed in the select phase
Replaced quadratic deduplication with unordered_set-based deduplication (O(M) vs. O(M^2))
Added self-entry optimization: hit nodes use their own inner_id as the search entry for faster convergence

Public API additions

Added BuildCacheOptions and BuildCacheStats to index.h
Added virtual methods:
- SupportsBuildCache
- ExportBuildCache
- BuildWithCache
- PrepareFeatureIdsForBuildCache
- GetBuildCacheStats
Added feature ID tracking via:
- Dataset::FeatureIds
- Dataset::GetFeatureIds

Tools

build_hgraph_with_cache: build an index with cache
export_hgraph_build_cache: export build cache from an existing index
eval_recall_brute_force: brute-force recall evaluation
compare_hgraph_ids: compare IDs across indexes

Performance and Quality

Tested on 30M 768-dim vectors with day1 → day2 incremental updates:

Reverse phase: 858s -> 23s (-97.3%)
Group sub-phase: 408s -> 12s (-97.1%)
Total build time: 4920s -> 4070s (-17.3%)
Recall: stable within ±0.35pp

Test Evidence

Test details:

<!-- Paste commands and key output here -->

Compatibility Impact

API/ABI compatibility: new public API added; no breaking change intended
Behavior changes: optional cache-based build path added for HGraph

Performance and Concurrency Impact

Performance impact: improved for incremental rebuild scenarios
Concurrency/thread-safety impact: parallel build path updated; no intended thread-safety regression

Documentation Impact

Risk and Rollback

Risk level: medium
Rollback plan: revert this PR and fall back to the existing full-build path without cache

Checklist

I have linked the relevant issue (required for kind/bug and kind/feature; see "Linked Issue" above)
I have added/updated tests for new behavior or bug fixes
I have considered API compatibility impact
I have updated docs if behavior/workflow changed
My commit messages follow project conventions (Conventional Commits, optional [skip ci] prefix)

Add BuildWithCache feature that enables incremental HGraph index building using a cache from a previous build. This significantly reduces build time when the dataset changes between builds (e.g., day1 -> day2 updates). Core implementation: - hgraph_build_cache.cpp: Main BuildWithCache logic including cache loading, warm_start (hit/missed classification), two-phase parallel refine with optimized reverse-edge installation, and route graph building - Reverse-edge optimization: scatter-materialise parallel pattern for group phase, shard-parallel merge_prune, distance reuse from select phase, unordered_set dedup (O(M) vs O(M^2)) - Self-entry optimization: hit nodes use their own inner_id as search entry point for faster convergence Public API additions: - BuildCacheOptions/BuildCacheStats structs in index.h - SupportsBuildCache/ExportBuildCache/BuildWithCache/ PrepareFeatureIdsForBuildCache/GetBuildCacheStats virtual methods - Dataset::FeatureIds/GetFeatureIds for feature ID tracking Tools: - build_hgraph_with_cache: Build index using cache - export_hgraph_build_cache: Export cache from existing index - eval_recall_brute_force: Brute-force recall evaluation - compare_hgraph_ids: Compare IDs between indexes Performance (30M 768-dim vectors, day1->day2): - Reverse phase: 858s -> 23s (-97.3%) - Group sub-phase: 408s -> 12s (-97.1%) - Total build: 4920s -> 4070s (-17.3%) - Recall: stable within +/-0.35pp Signed-off-by: baoyuan <129934985+misaka0714@users.noreply.github.com>

mergify · 2026-05-26T03:01:28Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require kind label

Waiting for

label~=^kind/

This rule is failing.

label~=^kind/

🔴 Require version label

Waiting for

label~=^version/

This rule is failing.

label~=^version/

gemini-code-assist

Code Review

This pull request introduces build cache support for HGraph, allowing indexes to be exported and rebuilt efficiently using warm starts and refinement phases. It also adds several utility tools for cache export, building, and recall evaluation. The code review identified several critical issues that need to be addressed: a concurrency data race in HGraph::Add when modifying feature_ids_, potential Denial of Service (DoS) vulnerabilities in cache loading due to unvalidated header fields (feature_id_bytes and max_degree), a logic bug in cache preparation when duplicate labels are enabled, a metadata handling bug in HGraph::move_id, and dead code in the form of the unused refine_nodes_for_build_cache function.

gemini-code-assist · 2026-05-26T03:03:06Z

+        if (feature_ids != nullptr) {
+            if (inner_id >= this->feature_ids_.size()) {
+                this->feature_ids_.resize(this->total_count_.load());
+            }
+            this->feature_ids_[inner_id] = feature_ids[j];
+        }


Critical Concurrency Bug (Data Race):\n\nIn HGraph::Add, this->feature_ids_ is resized and written to without any lock protection. Since HGraph supports concurrent Add operations, multiple threads can concurrently call Add, leading to a severe data race on std::vector<std::string> feature_ids_ (which is undefined behavior and can cause crashes or memory corruption).\n\nTo fix this, the resizing and assignment of feature_ids_ must be protected by this->add_mutex_.

if (feature_ids != nullptr) { std::scoped_lock lock(this->add_mutex_); if (inner_id >= this->feature_ids_.size()) { this->feature_ids_.resize(this->total_count_.load()); } this->feature_ids_[inner_id] = feature_ids[j]; }

gemini-code-assist · 2026-05-26T03:03:07Z

+    std::string feature_blob(header.feature_id_bytes, '\0');
+    if (header.feature_id_bytes > 0) {
+        reader.Read(feature_blob.data(), header.feature_id_bytes);
+    }


Security/Robustness Issue (Uncontrolled Resource Consumption / OOM):\n\nheader.feature_id_bytes is read directly from the input stream and used to allocate a std::string buffer. If the cache file is corrupted or maliciously crafted with a huge feature_id_bytes value, this will trigger a std::bad_alloc exception, causing a Denial of Service (crash) of the application.\n\nWe should validate header.feature_id_bytes before allocating memory.

Suggested change

std::string feature_blob(header.feature_id_bytes, '\0');

if (header.feature_id_bytes > 0) {

reader.Read(feature_blob.data(), header.feature_id_bytes);

}

constexpr uint64_t kMaxReasonableFeatureIdBytes = 1024ULL * 1024 * 1024; // 1 GB limit

CHECK_ARGUMENT(header.feature_id_bytes <= kMaxReasonableFeatureIdBytes, "feature_id_bytes is unreasonably large");

std::string feature_blob(header.feature_id_bytes, '\0');

if (header.feature_id_bytes > 0) {

reader.Read(feature_blob.data(), header.feature_id_bytes);

}

gemini-code-assist · 2026-05-26T03:03:07Z

+    BuildCacheHeader header;
+    StreamReader::ReadObj(reader, header);


Security/Robustness Issue (Infinite Loop / DoS):\n\nheader.max_degree is read directly from the input stream and used to control nested loops that read neighbor records. If header.max_degree is corrupted or maliciously set to a very large value (e.g., UINT32_MAX), the nested loops will run billions of times, causing an infinite loop or severe CPU hang (Denial of Service).\n\nWe should validate header.max_degree immediately after reading the header.

Suggested change

BuildCacheHeader header;

StreamReader::ReadObj(reader, header);

BuildCacheHeader header;

StreamReader::ReadObj(reader, header);

constexpr uint32_t kMaxReasonableMaxDegree = 1024;

CHECK_ARGUMENT(header.max_degree <= kMaxReasonableMaxDegree, "max_degree is unreasonably large");

gemini-code-assist · 2026-05-26T03:03:07Z

+    std::unordered_map<LabelType, InnerIdType> label_to_inner_id;
+    label_to_inner_id.reserve(static_cast<size_t>(this->total_count_.load()));
+    for (uint64_t inner_id = 0; inner_id < static_cast<uint64_t>(this->total_count_.load()); ++inner_id) {
+        auto label = this->label_table_->GetLabelById(static_cast<InnerIdType>(inner_id));
+        auto [iter, inserted] = label_to_inner_id.emplace(label, static_cast<InnerIdType>(inner_id));
+        CHECK_ARGUMENT(inserted || this->support_duplicate_,
+                       fmt::format("duplicate label {} found in index when preparing feature ids",
+                                   label));
+        if (!inserted && this->support_duplicate_) {
+            iter->second = static_cast<InnerIdType>(inner_id);
+        }
+    }


Logic Bug with Duplicate Labels:\n\nWhen this->support_duplicate_ is true, the index can contain duplicate labels. However, label_to_inner_id is defined as a std::unordered_map<LabelType, InnerIdType>, which only maps a label to a single inner_id (overwriting previous ones with iter->second = inner_id).\n\nDuring the mapping phase, only the last inner_id for a duplicate label will be resolved, leaving other duplicate inner_ids empty. This will cause mapped_count != expected_count and trigger a fatal CHECK_ARGUMENT failure, completely breaking the build cache preparation when duplicate labels exist.\n\nRecommended Fix:\nUse a std::unordered_map<LabelType, std::vector<InnerIdType>> to store all inner_ids for duplicate labels, and consume them sequentially (e.g., using a tracking index or popping from the back) during the mapping loop.

gemini-code-assist · 2026-05-26T03:03:07Z

+    if (from < this->feature_ids_.size()) {
+        if (to >= this->feature_ids_.size()) {
+            this->feature_ids_.resize(to + 1);
+        }
+        this->feature_ids_[to] = std::move(this->feature_ids_[from]);
+        this->feature_ids_[from].clear();
+    }


Logic Bug in move_id:\n\nIn HGraph::move_id, if from >= feature_ids_.size() but to < feature_ids_.size(), the old feature ID at to is not cleared. Since from is moving to to, to should have no feature ID after the move. Leaving the old feature ID at to intact results in stale/incorrect metadata.

if (to < this->feature_ids_.size()) { if (from < this->feature_ids_.size()) { this->feature_ids_[to] = std::move(this->feature_ids_[from]); this->feature_ids_[from].clear(); } else { this->feature_ids_[to].clear(); } } else if (from < this->feature_ids_.size()) { this->feature_ids_.resize(to + 1); this->feature_ids_[to] = std::move(this->feature_ids_[from]); this->feature_ids_[from].clear(); }

gemini-code-assist · 2026-05-26T03:03:07Z

+HGraph::RefineExecutionStats
+HGraph::refine_nodes_for_build_cache(


Dead Code Cleanup:\n\nThe function refine_nodes_for_build_cache is defined but never called anywhere in the codebase (it was likely replaced by refine_nodes_two_phase). Additionally, its effective_parallelism is unconditionally forced to 1, making its parallel execution branch dead code.\n\nRemove the unused refine_nodes_for_build_cache function from both hgraph.h and hgraph_build_cache.cpp to improve maintainability and reduce code bloat.

misaka0714 requested review from LHT129, inabao, jiaweizone and wxyucs as code owners May 26, 2026 03:00

pull-request-size Bot added the size/XXL label May 26, 2026

mergify Bot added module/api module/index module/tools labels May 26, 2026

gemini-code-assist Bot reviewed May 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add BuildWithCache support for HGraph incremental index building#2095

feat: add BuildWithCache support for HGraph incremental index building#2095
misaka0714 wants to merge 1 commit into
antgroup:mainfrom
misaka0714:build_with_cache_latest

misaka0714 commented May 26, 2026

Uh oh!

mergify Bot commented May 26, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Uh oh!

gemini-code-assist Bot May 26, 2026

Uh oh!

gemini-code-assist Bot May 26, 2026

Uh oh!

gemini-code-assist Bot May 26, 2026

Uh oh!

gemini-code-assist Bot May 26, 2026

Uh oh!

gemini-code-assist Bot May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		BuildCacheHeader header;
		StreamReader::ReadObj(reader, header);

		HGraph::RefineExecutionStats
		HGraph::refine_nodes_for_build_cache(

Conversation

misaka0714 commented May 26, 2026

Summary

Change Type

Linked Issue

What Changed

Core implementation

Performance optimizations

Public API additions

Tools

Performance and Quality

Test Evidence

Compatibility Impact

Performance and Concurrency Impact

Documentation Impact

Risk and Rollback

Checklist

Uh oh!

mergify Bot commented May 26, 2026

Merge Protections

🔴 Require kind label

🔴 Require version label

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant