Skip to content

feat: add BuildWithCache support for HGraph incremental index building#2095

Open
misaka0714 wants to merge 1 commit into
antgroup:mainfrom
misaka0714:build_with_cache_latest
Open

feat: add BuildWithCache support for HGraph incremental index building#2095
misaka0714 wants to merge 1 commit into
antgroup:mainfrom
misaka0714:build_with_cache_latest

Conversation

@misaka0714
Copy link
Copy Markdown
Contributor

Summary

Add BuildWithCache to support incremental HGraph index building from a cache exported by a previous build. This reduces rebuild time when the dataset changes between builds (for example, day1 → day2 updates), while keeping recall stable.

Change Type

  • Bug fix
  • New feature
  • Improvement/Refactor
  • Documentation
  • CI/Build/Infra

Linked Issue

  • Fixes:

What Changed

Core implementation

  • Added hgraph_build_cache.cpp with the main BuildWithCache workflow:
    • cache loading
    • warm start with hit/miss classification
    • two-phase parallel refine
    • route graph building

Performance optimizations

  • Optimized reverse-edge construction using a scatter-materialize parallel pattern in the group phase
  • Added shard-parallel merge_prune
  • Reused distances computed in the select phase
  • Replaced quadratic deduplication with unordered_set-based deduplication (O(M) vs. O(M^2))
  • Added self-entry optimization: hit nodes use their own inner_id as the search entry for faster convergence

Public API additions

  • Added BuildCacheOptions and BuildCacheStats to index.h
  • Added virtual methods:
    • SupportsBuildCache
    • ExportBuildCache
    • BuildWithCache
    • PrepareFeatureIdsForBuildCache
    • GetBuildCacheStats
  • Added feature ID tracking via:
    • Dataset::FeatureIds
    • Dataset::GetFeatureIds

Tools

  • build_hgraph_with_cache: build an index with cache
  • export_hgraph_build_cache: export build cache from an existing index
  • eval_recall_brute_force: brute-force recall evaluation
  • compare_hgraph_ids: compare IDs across indexes

Performance and Quality

Tested on 30M 768-dim vectors with day1 → day2 incremental updates:

  • Reverse phase: 858s -> 23s (-97.3%)
  • Group sub-phase: 408s -> 12s (-97.1%)
  • Total build time: 4920s -> 4070s (-17.3%)
  • Recall: stable within ±0.35pp

Test Evidence

  • make fmt
  • make lint
  • make test
  • make cov, run tests, and collect coverage
  • Other (describe below)

Test details:

<!-- Paste commands and key output here -->

Compatibility Impact

  • API/ABI compatibility: new public API added; no breaking change intended
  • Behavior changes: optional cache-based build path added for HGraph

Performance and Concurrency Impact

  • Performance impact: improved for incremental rebuild scenarios
  • Concurrency/thread-safety impact: parallel build path updated; no intended thread-safety regression

Documentation Impact

  • No docs update needed

  • Updated docs:

    • README.md
    • DEVELOPMENT.md
    • CONTRIBUTING.md
    • Other:

Risk and Rollback

  • Risk level: medium
  • Rollback plan: revert this PR and fall back to the existing full-build path without cache

Checklist

  • I have linked the relevant issue (required for kind/bug and kind/feature; see "Linked Issue" above)
  • I have added/updated tests for new behavior or bug fixes
  • I have considered API compatibility impact
  • I have updated docs if behavior/workflow changed
  • My commit messages follow project conventions (Conventional Commits, optional [skip ci] prefix)

Add BuildWithCache feature that enables incremental HGraph index building
using a cache from a previous build. This significantly reduces build time
when the dataset changes between builds (e.g., day1 -> day2 updates).

Core implementation:
- hgraph_build_cache.cpp: Main BuildWithCache logic including cache
  loading, warm_start (hit/missed classification), two-phase parallel
  refine with optimized reverse-edge installation, and route graph
  building
- Reverse-edge optimization: scatter-materialise parallel pattern for
  group phase, shard-parallel merge_prune, distance reuse from select
  phase, unordered_set dedup (O(M) vs O(M^2))
- Self-entry optimization: hit nodes use their own inner_id as search
  entry point for faster convergence

Public API additions:
- BuildCacheOptions/BuildCacheStats structs in index.h
- SupportsBuildCache/ExportBuildCache/BuildWithCache/
  PrepareFeatureIdsForBuildCache/GetBuildCacheStats virtual methods
- Dataset::FeatureIds/GetFeatureIds for feature ID tracking

Tools:
- build_hgraph_with_cache: Build index using cache
- export_hgraph_build_cache: Export cache from existing index
- eval_recall_brute_force: Brute-force recall evaluation
- compare_hgraph_ids: Compare IDs between indexes

Performance (30M 768-dim vectors, day1->day2):
- Reverse phase: 858s -> 23s (-97.3%)
- Group sub-phase: 408s -> 12s (-97.1%)
- Total build: 4920s -> 4070s (-17.3%)
- Recall: stable within +/-0.35pp

Signed-off-by: baoyuan <129934985+misaka0714@users.noreply.github.com>
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 26, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require kind label

Waiting for

  • label~=^kind/
This rule is failing.
  • label~=^kind/

🔴 Require version label

Waiting for

  • label~=^version/
This rule is failing.
  • label~=^version/

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces build cache support for HGraph, allowing indexes to be exported and rebuilt efficiently using warm starts and refinement phases. It also adds several utility tools for cache export, building, and recall evaluation. The code review identified several critical issues that need to be addressed: a concurrency data race in HGraph::Add when modifying feature_ids_, potential Denial of Service (DoS) vulnerabilities in cache loading due to unvalidated header fields (feature_id_bytes and max_degree), a logic bug in cache preparation when duplicate labels are enabled, a metadata handling bug in HGraph::move_id, and dead code in the form of the unused refine_nodes_for_build_cache function.

Comment on lines +300 to +305
if (feature_ids != nullptr) {
if (inner_id >= this->feature_ids_.size()) {
this->feature_ids_.resize(this->total_count_.load());
}
this->feature_ids_[inner_id] = feature_ids[j];
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Critical Concurrency Bug (Data Race):\n\nIn HGraph::Add, this->feature_ids_ is resized and written to without any lock protection. Since HGraph supports concurrent Add operations, multiple threads can concurrently call Add, leading to a severe data race on std::vector<std::string> feature_ids_ (which is undefined behavior and can cause crashes or memory corruption).\n\nTo fix this, the resizing and assignment of feature_ids_ must be protected by this->add_mutex_.

        if (feature_ids != nullptr) {
            std::scoped_lock lock(this->add_mutex_);
            if (inner_id >= this->feature_ids_.size()) {
                this->feature_ids_.resize(this->total_count_.load());
            }
            this->feature_ids_[inner_id] = feature_ids[j];
        }

Comment on lines +966 to +969
std::string feature_blob(header.feature_id_bytes, '\0');
if (header.feature_id_bytes > 0) {
reader.Read(feature_blob.data(), header.feature_id_bytes);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

Security/Robustness Issue (Uncontrolled Resource Consumption / OOM):\n\nheader.feature_id_bytes is read directly from the input stream and used to allocate a std::string buffer. If the cache file is corrupted or maliciously crafted with a huge feature_id_bytes value, this will trigger a std::bad_alloc exception, causing a Denial of Service (crash) of the application.\n\nWe should validate header.feature_id_bytes before allocating memory.

Suggested change
std::string feature_blob(header.feature_id_bytes, '\0');
if (header.feature_id_bytes > 0) {
reader.Read(feature_blob.data(), header.feature_id_bytes);
}
constexpr uint64_t kMaxReasonableFeatureIdBytes = 1024ULL * 1024 * 1024; // 1 GB limit
CHECK_ARGUMENT(header.feature_id_bytes <= kMaxReasonableFeatureIdBytes, "feature_id_bytes is unreasonably large");
std::string feature_blob(header.feature_id_bytes, '\0');
if (header.feature_id_bytes > 0) {
reader.Read(feature_blob.data(), header.feature_id_bytes);
}

Comment on lines +945 to +946
BuildCacheHeader header;
StreamReader::ReadObj(reader, header);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

Security/Robustness Issue (Infinite Loop / DoS):\n\nheader.max_degree is read directly from the input stream and used to control nested loops that read neighbor records. If header.max_degree is corrupted or maliciously set to a very large value (e.g., UINT32_MAX), the nested loops will run billions of times, causing an infinite loop or severe CPU hang (Denial of Service).\n\nWe should validate header.max_degree immediately after reading the header.

Suggested change
BuildCacheHeader header;
StreamReader::ReadObj(reader, header);
BuildCacheHeader header;
StreamReader::ReadObj(reader, header);
constexpr uint32_t kMaxReasonableMaxDegree = 1024;
CHECK_ARGUMENT(header.max_degree <= kMaxReasonableMaxDegree, "max_degree is unreasonably large");

Comment on lines +147 to +158
std::unordered_map<LabelType, InnerIdType> label_to_inner_id;
label_to_inner_id.reserve(static_cast<size_t>(this->total_count_.load()));
for (uint64_t inner_id = 0; inner_id < static_cast<uint64_t>(this->total_count_.load()); ++inner_id) {
auto label = this->label_table_->GetLabelById(static_cast<InnerIdType>(inner_id));
auto [iter, inserted] = label_to_inner_id.emplace(label, static_cast<InnerIdType>(inner_id));
CHECK_ARGUMENT(inserted || this->support_duplicate_,
fmt::format("duplicate label {} found in index when preparing feature ids",
label));
if (!inserted && this->support_duplicate_) {
iter->second = static_cast<InnerIdType>(inner_id);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Logic Bug with Duplicate Labels:\n\nWhen this->support_duplicate_ is true, the index can contain duplicate labels. However, label_to_inner_id is defined as a std::unordered_map<LabelType, InnerIdType>, which only maps a label to a single inner_id (overwriting previous ones with iter->second = inner_id).\n\nDuring the mapping phase, only the last inner_id for a duplicate label will be resolved, leaving other duplicate inner_ids empty. This will cause mapped_count != expected_count and trigger a fatal CHECK_ARGUMENT failure, completely breaking the build cache preparation when duplicate labels exist.\n\nRecommended Fix:\nUse a std::unordered_map<LabelType, std::vector<InnerIdType>> to store all inner_ids for duplicate labels, and consume them sequentially (e.g., using a tracking index or popping from the back) during the mapping loop.

Comment on lines +149 to +155
if (from < this->feature_ids_.size()) {
if (to >= this->feature_ids_.size()) {
this->feature_ids_.resize(to + 1);
}
this->feature_ids_[to] = std::move(this->feature_ids_[from]);
this->feature_ids_[from].clear();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Logic Bug in move_id:\n\nIn HGraph::move_id, if from >= feature_ids_.size() but to < feature_ids_.size(), the old feature ID at to is not cleared. Since from is moving to to, to should have no feature ID after the move. Leaving the old feature ID at to intact results in stale/incorrect metadata.

    if (to < this->feature_ids_.size()) {
        if (from < this->feature_ids_.size()) {
            this->feature_ids_[to] = std::move(this->feature_ids_[from]);
            this->feature_ids_[from].clear();
        } else {
            this->feature_ids_[to].clear();
        }
    } else if (from < this->feature_ids_.size()) {
        this->feature_ids_.resize(to + 1);
        this->feature_ids_[to] = std::move(this->feature_ids_[from]);
        this->feature_ids_[from].clear();
    }

Comment on lines +313 to +314
HGraph::RefineExecutionStats
HGraph::refine_nodes_for_build_cache(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Dead Code Cleanup:\n\nThe function refine_nodes_for_build_cache is defined but never called anywhere in the codebase (it was likely replaced by refine_nodes_two_phase). Additionally, its effective_parallelism is unconditionally forced to 1, making its parallel execution branch dead code.\n\nRemove the unused refine_nodes_for_build_cache function from both hgraph.h and hgraph_build_cache.cpp to improve maintainability and reduce code bloat.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant