feat: add BuildWithCache support for HGraph incremental index building#2095
feat: add BuildWithCache support for HGraph incremental index building#2095misaka0714 wants to merge 1 commit into
Conversation
Add BuildWithCache feature that enables incremental HGraph index building using a cache from a previous build. This significantly reduces build time when the dataset changes between builds (e.g., day1 -> day2 updates). Core implementation: - hgraph_build_cache.cpp: Main BuildWithCache logic including cache loading, warm_start (hit/missed classification), two-phase parallel refine with optimized reverse-edge installation, and route graph building - Reverse-edge optimization: scatter-materialise parallel pattern for group phase, shard-parallel merge_prune, distance reuse from select phase, unordered_set dedup (O(M) vs O(M^2)) - Self-entry optimization: hit nodes use their own inner_id as search entry point for faster convergence Public API additions: - BuildCacheOptions/BuildCacheStats structs in index.h - SupportsBuildCache/ExportBuildCache/BuildWithCache/ PrepareFeatureIdsForBuildCache/GetBuildCacheStats virtual methods - Dataset::FeatureIds/GetFeatureIds for feature ID tracking Tools: - build_hgraph_with_cache: Build index using cache - export_hgraph_build_cache: Export cache from existing index - eval_recall_brute_force: Brute-force recall evaluation - compare_hgraph_ids: Compare IDs between indexes Performance (30M 768-dim vectors, day1->day2): - Reverse phase: 858s -> 23s (-97.3%) - Group sub-phase: 408s -> 12s (-97.1%) - Total build: 4920s -> 4070s (-17.3%) - Recall: stable within +/-0.35pp Signed-off-by: baoyuan <129934985+misaka0714@users.noreply.github.com>
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require kind labelWaiting for
This rule is failing.
🔴 Require version labelWaiting for
This rule is failing.
|
There was a problem hiding this comment.
Code Review
This pull request introduces build cache support for HGraph, allowing indexes to be exported and rebuilt efficiently using warm starts and refinement phases. It also adds several utility tools for cache export, building, and recall evaluation. The code review identified several critical issues that need to be addressed: a concurrency data race in HGraph::Add when modifying feature_ids_, potential Denial of Service (DoS) vulnerabilities in cache loading due to unvalidated header fields (feature_id_bytes and max_degree), a logic bug in cache preparation when duplicate labels are enabled, a metadata handling bug in HGraph::move_id, and dead code in the form of the unused refine_nodes_for_build_cache function.
| if (feature_ids != nullptr) { | ||
| if (inner_id >= this->feature_ids_.size()) { | ||
| this->feature_ids_.resize(this->total_count_.load()); | ||
| } | ||
| this->feature_ids_[inner_id] = feature_ids[j]; | ||
| } |
There was a problem hiding this comment.
Critical Concurrency Bug (Data Race):\n\nIn HGraph::Add, this->feature_ids_ is resized and written to without any lock protection. Since HGraph supports concurrent Add operations, multiple threads can concurrently call Add, leading to a severe data race on std::vector<std::string> feature_ids_ (which is undefined behavior and can cause crashes or memory corruption).\n\nTo fix this, the resizing and assignment of feature_ids_ must be protected by this->add_mutex_.
if (feature_ids != nullptr) {
std::scoped_lock lock(this->add_mutex_);
if (inner_id >= this->feature_ids_.size()) {
this->feature_ids_.resize(this->total_count_.load());
}
this->feature_ids_[inner_id] = feature_ids[j];
}| std::string feature_blob(header.feature_id_bytes, '\0'); | ||
| if (header.feature_id_bytes > 0) { | ||
| reader.Read(feature_blob.data(), header.feature_id_bytes); | ||
| } |
There was a problem hiding this comment.
Security/Robustness Issue (Uncontrolled Resource Consumption / OOM):\n\nheader.feature_id_bytes is read directly from the input stream and used to allocate a std::string buffer. If the cache file is corrupted or maliciously crafted with a huge feature_id_bytes value, this will trigger a std::bad_alloc exception, causing a Denial of Service (crash) of the application.\n\nWe should validate header.feature_id_bytes before allocating memory.
| std::string feature_blob(header.feature_id_bytes, '\0'); | |
| if (header.feature_id_bytes > 0) { | |
| reader.Read(feature_blob.data(), header.feature_id_bytes); | |
| } | |
| constexpr uint64_t kMaxReasonableFeatureIdBytes = 1024ULL * 1024 * 1024; // 1 GB limit | |
| CHECK_ARGUMENT(header.feature_id_bytes <= kMaxReasonableFeatureIdBytes, "feature_id_bytes is unreasonably large"); | |
| std::string feature_blob(header.feature_id_bytes, '\0'); | |
| if (header.feature_id_bytes > 0) { | |
| reader.Read(feature_blob.data(), header.feature_id_bytes); | |
| } |
| BuildCacheHeader header; | ||
| StreamReader::ReadObj(reader, header); |
There was a problem hiding this comment.
Security/Robustness Issue (Infinite Loop / DoS):\n\nheader.max_degree is read directly from the input stream and used to control nested loops that read neighbor records. If header.max_degree is corrupted or maliciously set to a very large value (e.g., UINT32_MAX), the nested loops will run billions of times, causing an infinite loop or severe CPU hang (Denial of Service).\n\nWe should validate header.max_degree immediately after reading the header.
| BuildCacheHeader header; | |
| StreamReader::ReadObj(reader, header); | |
| BuildCacheHeader header; | |
| StreamReader::ReadObj(reader, header); | |
| constexpr uint32_t kMaxReasonableMaxDegree = 1024; | |
| CHECK_ARGUMENT(header.max_degree <= kMaxReasonableMaxDegree, "max_degree is unreasonably large"); |
| std::unordered_map<LabelType, InnerIdType> label_to_inner_id; | ||
| label_to_inner_id.reserve(static_cast<size_t>(this->total_count_.load())); | ||
| for (uint64_t inner_id = 0; inner_id < static_cast<uint64_t>(this->total_count_.load()); ++inner_id) { | ||
| auto label = this->label_table_->GetLabelById(static_cast<InnerIdType>(inner_id)); | ||
| auto [iter, inserted] = label_to_inner_id.emplace(label, static_cast<InnerIdType>(inner_id)); | ||
| CHECK_ARGUMENT(inserted || this->support_duplicate_, | ||
| fmt::format("duplicate label {} found in index when preparing feature ids", | ||
| label)); | ||
| if (!inserted && this->support_duplicate_) { | ||
| iter->second = static_cast<InnerIdType>(inner_id); | ||
| } | ||
| } |
There was a problem hiding this comment.
Logic Bug with Duplicate Labels:\n\nWhen this->support_duplicate_ is true, the index can contain duplicate labels. However, label_to_inner_id is defined as a std::unordered_map<LabelType, InnerIdType>, which only maps a label to a single inner_id (overwriting previous ones with iter->second = inner_id).\n\nDuring the mapping phase, only the last inner_id for a duplicate label will be resolved, leaving other duplicate inner_ids empty. This will cause mapped_count != expected_count and trigger a fatal CHECK_ARGUMENT failure, completely breaking the build cache preparation when duplicate labels exist.\n\nRecommended Fix:\nUse a std::unordered_map<LabelType, std::vector<InnerIdType>> to store all inner_ids for duplicate labels, and consume them sequentially (e.g., using a tracking index or popping from the back) during the mapping loop.
| if (from < this->feature_ids_.size()) { | ||
| if (to >= this->feature_ids_.size()) { | ||
| this->feature_ids_.resize(to + 1); | ||
| } | ||
| this->feature_ids_[to] = std::move(this->feature_ids_[from]); | ||
| this->feature_ids_[from].clear(); | ||
| } |
There was a problem hiding this comment.
Logic Bug in move_id:\n\nIn HGraph::move_id, if from >= feature_ids_.size() but to < feature_ids_.size(), the old feature ID at to is not cleared. Since from is moving to to, to should have no feature ID after the move. Leaving the old feature ID at to intact results in stale/incorrect metadata.
if (to < this->feature_ids_.size()) {
if (from < this->feature_ids_.size()) {
this->feature_ids_[to] = std::move(this->feature_ids_[from]);
this->feature_ids_[from].clear();
} else {
this->feature_ids_[to].clear();
}
} else if (from < this->feature_ids_.size()) {
this->feature_ids_.resize(to + 1);
this->feature_ids_[to] = std::move(this->feature_ids_[from]);
this->feature_ids_[from].clear();
}| HGraph::RefineExecutionStats | ||
| HGraph::refine_nodes_for_build_cache( |
There was a problem hiding this comment.
Dead Code Cleanup:\n\nThe function refine_nodes_for_build_cache is defined but never called anywhere in the codebase (it was likely replaced by refine_nodes_two_phase). Additionally, its effective_parallelism is unconditionally forced to 1, making its parallel execution branch dead code.\n\nRemove the unused refine_nodes_for_build_cache function from both hgraph.h and hgraph_build_cache.cpp to improve maintainability and reduce code bloat.
Summary
Add BuildWithCache to support incremental HGraph index building from a cache exported by a previous build. This reduces rebuild time when the dataset changes between builds (for example, day1 → day2 updates), while keeping recall stable.
Change Type
Linked Issue
What Changed
Core implementation
hgraph_build_cache.cppwith the mainBuildWithCacheworkflow:Performance optimizations
merge_pruneunordered_set-based deduplication (O(M)vs.O(M^2))inner_idas the search entry for faster convergencePublic API additions
BuildCacheOptionsandBuildCacheStatstoindex.hSupportsBuildCacheExportBuildCacheBuildWithCachePrepareFeatureIdsForBuildCacheGetBuildCacheStatsDataset::FeatureIdsDataset::GetFeatureIdsTools
build_hgraph_with_cache: build an index with cacheexport_hgraph_build_cache: export build cache from an existing indexeval_recall_brute_force: brute-force recall evaluationcompare_hgraph_ids: compare IDs across indexesPerformance and Quality
Tested on 30M 768-dim vectors with day1 → day2 incremental updates:
858s -> 23s(-97.3%)408s -> 12s(-97.1%)4920s -> 4070s(-17.3%)±0.35ppTest Evidence
make fmtmake lintmake testmake cov, run tests, and collect coverageTest details:
Compatibility Impact
Performance and Concurrency Impact
Documentation Impact
No docs update needed
Updated docs:
README.mdDEVELOPMENT.mdCONTRIBUTING.mdRisk and Rollback
Checklist
kind/bugandkind/feature; see "Linked Issue" above)[skip ci]prefix)