feat: Support sampled subgraph matching with an extension #260

Open
ShunyangLi wants to merge 29 commits into alibaba:main from ShunyangLi:main

ShunyangLi (Collaborator) commented Apr 20, 2026

What do these changes do?

Adds an extension that uses sampling algorithms to perform subgraph matching on very large graphs in reasonable time.

Related issue number

Fix #35, #345

ShunyangLi requested a review from Copilot April 20, 2026 11:54
Copilot AI left a comment

Pull request overview

This PR adds a new sampled_match extension to NeuG that performs sampled subgraph matching / cardinality estimation on large graphs (based on the FaSTest approach), along with a minimal end-to-end smoke test and build integration.

Changes:

  • Adds a new sampled_match extension library that registers table functions (INITIALIZE / SAMPLED_MATCH / property accessors / checkpointing).
  • Vendors FaSTest-related sources under extension/sampled_match/include/fastest_lib and adapts them to NeuG’s graph storage interfaces.
  • Adds a CTest smoke test that creates a tiny graph, loads the extension, and runs CALL SAMPLED_MATCH.

Reviewed changes

Copilot reviewed 30 out of 31 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| extension/CMakeLists.txt | Enables building the new sampled_match extension when selected in BUILD_EXTENSIONS. |
| extension/sampled_match/CMakeLists.txt | Builds and links the extension with Boost/GSL and includes FaSTest sources. |
| extension/sampled_match/README.md | Documents how to build and run the extension smoke test. |
| extension/sampled_match/src/sampled_match_extension.cpp | Extension entrypoint registering functions + extension metadata. |
| extension/sampled_match/src/sampled_match_data_graph_meta.cpp | Implements preprocessing + checkpoint (de)serialization for DataGraphMeta. |
| extension/sampled_match/tests/CMakeLists.txt | Builds and registers the smoke test as a CTest. |
| extension/sampled_match/tests/sampled_match_test.cpp | End-to-end smoke test that loads the extension and runs SAMPLED_MATCH. |
| extension/sampled_match/include/sampled_match_data_graph_meta.h | Declares DataGraphMeta and graph-access helpers used by sampling/matching. |
| extension/sampled_match/include/sampled_match_functions.h | Implements pattern parsing, graph cache/checkpoint logic, and main match execution path. |
| extension/sampled_match/include/sampled_match_value.h | Adds missing relational operators for neug::execution::Value used by FaSTest code. |
| extension/sampled_match/include/fastest_lib/CMakeLists.txt | Vendored FaSTest CMake (not used by NeuG build, but included with sources). |
| extension/sampled_match/include/fastest_lib/LICENSE | Vendored license text for FaSTest code. |
| extension/sampled_match/include/fastest_lib/README.md | Vendored FaSTest README and usage notes. |
| extension/sampled_match/include/fastest_lib/driver/subgraph-cardinality-estimation.cc | Vendored FaSTest standalone driver (not wired into extension build). |
| extension/sampled_match/include/fastest_lib/src/Base/base.h | Vendored FaSTest base utilities used by the algorithm. |
| extension/sampled_match/include/fastest_lib/src/Base/basic_algorithms.h | Vendored union-find + bipartite matching utilities used by filters. |
| extension/sampled_match/include/fastest_lib/src/Base/metrics.h | Vendored metrics helpers (unused in extension path). |
| extension/sampled_match/include/fastest_lib/src/Base/timer.h | Vendored timer utility used by sampling code paths. |
| extension/sampled_match/include/fastest_lib/src/DataStructure/graph.h | Vendored/adapted directed graph structure used by FaSTest components. |
| extension/sampled_match/include/fastest_lib/src/DataStructure/graph.cpp | Adds schema-aware “no-edge-pairs” construction for pattern constraints. |
| extension/sampled_match/include/fastest_lib/src/SpecialSubgraphs/small_cycle.h | Vendored small-cycle enumeration utilities (mostly disabled/unused here). |
| extension/sampled_match/include/fastest_lib/src/SubgraphCounting/cardinality_estimation.h | FaSTest cardinality estimation orchestration adapted to DataGraphMeta. |
| extension/sampled_match/include/fastest_lib/src/SubgraphCounting/candidate_graph_sampling.h | Graph sampling strategy used when tree sampling is insufficient. |
| extension/sampled_match/include/fastest_lib/src/SubgraphCounting/candidate_tree_sampling.h | Tree sampling strategy used for estimation + sample collection. |
| extension/sampled_match/include/fastest_lib/src/SubgraphCounting/option.h | Options for FaSTest estimation/sampling configuration. |
| extension/sampled_match/include/fastest_lib/src/SubgraphMatching/candidate_filter.h | Placeholder header for candidate filtering (currently empty). |
| extension/sampled_match/include/fastest_lib/src/SubgraphMatching/candidate_space.h | Core candidate space construction/refinement adapted for directed graphs + metadata access. |
| extension/sampled_match/include/fastest_lib/src/SubgraphMatching/candidate_space.cpp | Empty translation unit placeholder for candidate space. |
| extension/sampled_match/include/fastest_lib/src/SubgraphMatching/data_graph.h | Vendored FaSTest data graph wrapper (mostly used by standalone driver). |
| extension/sampled_match/include/fastest_lib/src/SubgraphMatching/pattern_graph.h | Pattern graph processing and adjacency index construction for directed support. |

Comment on lines +108 to +110
const std::string db_path = "/tmp/neug_smd_test";
const std::string pattern_file = "pattern.json";
Cleanup cleanup{db_path, pattern_file};

Copilot AI Apr 20, 2026

The smoke test uses a fixed DB path (/tmp/neug_smd_test) and writes pattern.json in the current working directory. This can cause flaky failures when tests run in parallel or under restricted sandboxes. Prefer creating a unique temporary directory (std::filesystem::temp_directory_path + random/uuid/mkdtemp) and write the pattern file under that directory; also consider setting the test WORKING_DIRECTORY in CTest if you want deterministic cleanup.
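
One way the unique-directory suggestion could look is sketched below. `MakeUniqueTempDir` and `ScopedTempDir` are hypothetical names, not part of the PR; the sketch assumes C++17 `std::filesystem` is available to the test and relies on `create_directory` reporting whether the directory was newly created.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <random>
#include <stdexcept>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: creates a fresh directory under the system temp path.
inline fs::path MakeUniqueTempDir(const std::string& prefix) {
  std::random_device rd;
  std::mt19937_64 rng(rd());
  for (int attempt = 0; attempt < 16; ++attempt) {
    fs::path candidate =
        fs::temp_directory_path() / (prefix + "_" + std::to_string(rng()));
    std::error_code ec;
    // create_directory returns true only if it created the directory,
    // so a true result means no other test already owns this path.
    if (fs::create_directory(candidate, ec) && !ec) return candidate;
  }
  throw std::runtime_error("could not create a unique temp directory");
}

// Hypothetical RAII guard: removes the directory tree when the test ends.
struct ScopedTempDir {
  fs::path path;
  explicit ScopedTempDir(const std::string& prefix)
      : path(MakeUniqueTempDir(prefix)) {}
  ~ScopedTempDir() {
    std::error_code ec;  // best-effort cleanup, errors are ignored
    fs::remove_all(path, ec);
  }
  ScopedTempDir(const ScopedTempDir&) = delete;
  ScopedTempDir& operator=(const ScopedTempDir&) = delete;
};
```

With a guard like this, the test could place both the database and pattern.json under `guard.path`, so parallel runs would not share any file.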

Comment on lines +484 to +497
auto readIntVec = [&](std::vector<int>& vec) {
  int32_t sz = readInt();
  vec.resize(sz);
  if (sz > 0) {
    ifs.read(reinterpret_cast<char*>(vec.data()), sz * sizeof(int));
  }
};
auto readDoubleVec = [&](std::vector<double>& vec) {
  int32_t sz = readInt();
  vec.resize(sz);
  if (sz > 0) {
    ifs.read(reinterpret_cast<char*>(vec.data()), sz * sizeof(double));
  }
};

Copilot AI Apr 20, 2026

LoadFromFile reads vector sizes from the checkpoint file and resizes vectors without any sanity checks (e.g., negative/huge sizes, inconsistent counts). A corrupted or malicious checkpoint can trigger unbounded allocations/OOM and potentially crash the process. Add validation for sizes (non-negative, reasonable upper bounds, and consistency with num_vertex_/num_labels_) and fail fast if the file is invalid.
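
A minimal sketch of the kind of validation meant here, assuming the checkpoint stores an int32 length prefix followed by raw ints as in the snippet above. `ReadIntVecChecked` and its `max_elems` bound are hypothetical; a real bound would presumably come from num_vertex_ or num_labels_.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper: reads a length-prefixed int vector and rejects
// negative, oversized, or truncated input instead of allocating blindly.
inline bool ReadIntVecChecked(std::istream& in, std::vector<int>& vec,
                              int32_t max_elems) {
  int32_t sz = 0;
  if (!in.read(reinterpret_cast<char*>(&sz), sizeof(sz))) return false;
  if (sz < 0 || sz > max_elems) return false;  // fail fast on bad sizes
  vec.resize(static_cast<std::size_t>(sz));
  if (sz > 0) {
    const std::streamsize bytes =
        static_cast<std::streamsize>(sizeof(int)) * sz;
    if (!in.read(reinterpret_cast<char*>(vec.data()), bytes)) {
      vec.clear();  // payload shorter than the declared size
      return false;
    }
  }
  return true;
}
```

The caller would then abort loading and rebuild the cache when the function returns false, rather than continuing with partially read state.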

Comment on lines +481 to +520
if (!cached_data.preprocessed) {
  std::cout << "[match] Graph not initialized, calling DoGraphInitialization..." << std::endl;
  DoGraphInitialization(graph_, true);
} else {
  std::cout << "[0-1] Using cached graph data..." << std::endl;
  std::cout << " Vertices: " << cached_data.data_meta->GetNumVertices() << std::endl;
  std::cout << " Edges: " << cached_data.data_meta->GetNumEdges() << std::endl;
}
std::cout << std::endl;

// Step 2: always reload the pattern — callers can vary it per invocation.
std::cout << "[2] Loading pattern graph from: " << pattern_file_ << std::endl;
pattern_graph_ = CreatePatternFromJson(pattern_file_);
if (!pattern_graph_ || pattern_graph_->GetNumVertices() == 0) {
  std::cerr << " ERROR: Failed to load pattern!" << std::endl;
  return -1;
}
std::cout << " Pattern: " << pattern_graph_->GetNumVertices() << " vertices, "
          << pattern_graph_->GetNumEdges() << " edges" << std::endl;

// Print pattern details
std::cout << " Pattern vertices:" << std::endl;
for (int i = 0; i < pattern_graph_->GetNumVertices(); i++) {
  int label = pattern_graph_->vertex_label[i];
  std::cout << " v" << i << ": label=" << label
            << " (out_deg=" << pattern_graph_->GetOutDegree(i)
            << ", in_deg=" << pattern_graph_->GetInDegree(i) << ")" << std::endl;
}
std::cout << " Pattern edges:" << std::endl;
for (int i = 0; i < pattern_graph_->GetNumEdges(); i++) {
  auto& [src, dst] = pattern_graph_->edge_list[i];
  int label = pattern_graph_->edge_label[i];
  std::cout << " e" << i << ": " << src << " -[label=" << label << "]-> " << dst << std::endl;
}
std::cout << std::endl;

// Step 3: Process pattern (compute core numbers, build incidence list, etc.)
std::cout << "[3] Processing pattern..." << std::endl;
pattern_graph_->ProcessPattern(*cached_data.data_meta, cached_data.schema_graph);
std::cout << " Done." << std::endl;

Copilot AI Apr 20, 2026

SampledSubgraphMatcher::match() prints extensive progress and pattern details to stdout unconditionally. In production this can severely impact performance and pollute logs for every CALL SAMPLED_MATCH. Consider using LOG(INFO)/VLOG with a verbosity flag, or gate these prints behind a debug option passed via function parameters.
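
To illustrate the "debug option" variant without depending on a particular logging library, here is a small sketch. `MatchLogOptions` and `DebugLog` are hypothetical names; the message is built lazily so a disabled log costs only one branch.

```cpp
#include <cassert>
#include <iostream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical options that could be threaded through from a function
// parameter of CALL SAMPLED_MATCH; nothing here is NeuG's actual API.
struct MatchLogOptions {
  bool verbose = false;            // off by default in production
  std::ostream* sink = &std::clog; // where debug output goes when enabled
};

// Invokes make_message only when verbose logging is on, so formatting
// pattern details is skipped entirely for normal queries.
template <typename MakeMessage>
inline void DebugLog(const MatchLogOptions& opts, MakeMessage&& make_message) {
  if (!opts.verbose || opts.sink == nullptr) return;
  (*opts.sink) << make_message() << '\n';
}
```

A call site might then read `DebugLog(opts, [&] { return "[3] Processing pattern..."; });` in place of the unconditional `std::cout` lines. If the project already uses glog, `VLOG` would achieve the same effect with less code.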

Comment on lines +193 to +199
inline std::string GenerateOutputFilePath(const std::string& prefix) {
  auto now = std::chrono::system_clock::now();
  auto timestamp = std::chrono::duration_cast<std::chrono::milliseconds>(
      now.time_since_epoch()).count();
  std::filesystem::create_directories("/tmp/p/neug_sample");
  return "/tmp/p/neug_sample/" + prefix + "_" + std::to_string(timestamp) + ".csv";
}

Copilot AI Apr 20, 2026

GenerateOutputFilePath always writes under a hard-coded /tmp path and uses only millisecond timestamps, which can collide under high concurrency. Prefer std::filesystem::temp_directory_path() and a collision-resistant unique suffix (random/uuid or mkstemp-style) and propagate filesystem errors back to the caller.

Comment on lines +145 to +162
for (const auto& edge : GetAllOutIncidentEdges(global_id)) {
  int dst_global = std::get<1>(edge);
  if (!nbr_seen_[dst_global]) {
    nbr_seen_[dst_global] = true;
    nbr_seen_reset_.push_back(dst_global);
    result.push_back(dst_global);
  }
}
for (const auto& edge : GetAllInIncidentEdges(global_id)) {
  int src_global = std::get<0>(edge);
  if (!nbr_seen_[src_global]) {
    nbr_seen_[src_global] = true;
    nbr_seen_reset_.push_back(src_global);
    result.push_back(src_global);
  }
}
for (int v : nbr_seen_reset_) nbr_seen_[v] = false;
nbr_seen_reset_.clear();

Copilot AI Apr 20, 2026

DataGraphMeta::GetNeighbors/GetOutNeighbors/GetInNeighbors mutate shared mutable scratch state (nbr_seen_/nbr_seen_reset_) inside const methods. Because DataGraphMeta is cached globally (GraphDataCache) and can be used concurrently across queries/sessions, this introduces data races and can corrupt results. Consider making the dedup scratch space thread-local/per-call (e.g., pass in a scratch struct, or use a per-thread epoch array) or guard these methods with a mutex (prefer scratch to avoid contention).

Suggested change
std::unordered_set<int> seen;
for (const auto& edge : GetAllOutIncidentEdges(global_id)) {
  int dst_global = std::get<1>(edge);
  if (seen.insert(dst_global).second) {
    result.push_back(dst_global);
  }
}
for (const auto& edge : GetAllInIncidentEdges(global_id)) {
  int src_global = std::get<0>(edge);
  if (seen.insert(src_global).second) {
    result.push_back(src_global);
  }
}
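
If the per-call `std::unordered_set` turns out to be too slow on hot paths, the "per-thread epoch array" alternative mentioned in the comment could be sketched roughly as follows. `EpochScratch` and `DedupNeighbors` are hypothetical, and the neighbor lists are passed in as plain vectors purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-thread scratch: a vertex counts as "seen" in the current
// call when its stamp equals the current epoch, so no reset pass is needed.
struct EpochScratch {
  std::vector<uint32_t> stamp;
  uint32_t epoch = 0;

  void Begin(std::size_t num_vertices) {
    if (stamp.size() < num_vertices) stamp.resize(num_vertices, 0);
    if (++epoch == 0) {  // epoch wrapped around: clear stamps once
      std::fill(stamp.begin(), stamp.end(), 0);
      epoch = 1;
    }
  }
  bool MarkFirstVisit(int v) {
    auto& s = stamp[static_cast<std::size_t>(v)];
    if (s == epoch) return false;
    s = epoch;
    return true;
  }
};

// Deduplicates out- and in-neighbors without touching shared object state.
inline std::vector<int> DedupNeighbors(const std::vector<int>& out_nbrs,
                                       const std::vector<int>& in_nbrs,
                                       std::size_t num_vertices) {
  thread_local EpochScratch scratch;  // one instance per thread, no races
  scratch.Begin(num_vertices);
  std::vector<int> result;
  for (int v : out_nbrs) if (scratch.MarkFirstVisit(v)) result.push_back(v);
  for (int v : in_nbrs) if (scratch.MarkFirstVisit(v)) result.push_back(v);
  return result;
}
```

The trade-off is memory: each thread keeps one stamp per vertex, which may or may not be acceptable for very large graphs.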

Comment on lines +46 to +48
inline BipartiteMaximumMatching BPSolver;
enum STRUCTURE_FILTER {
  NO_STRUCTURE_FILTER,

Copilot AI Apr 20, 2026

BPSolver is a single global inline instance shared across all CandidateSpace objects and all threads. CandidateSpace::CandidateSpace calls BPSolver.Initialize(), which overwrites internal pointers without freeing prior allocations (leak) and makes the solver state non-reentrant; BuildCS/EdgeBipartiteSafety also mutate it. Make BipartiteMaximumMatching an instance member of CandidateSpace (or allocate per call / thread_local) so concurrent queries cannot race and repeated calls do not leak memory.

longbinlai changed the title from "Support sampled subgraph matching with an extension" to "feat: Support sampled subgraph matching with an extension" Apr 21, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 31 out of 32 changed files in this pull request and generated 13 comments.


Comment on lines +1022 to +1033
// Helper function: check whether a value satisfies the constraint
// Note: neug::execution::Value only supports == operator, other comparisons not yet implemented
inline bool CandidateSpace::CheckValueConstraint(const Value& data_value, CompType comp_type, const Value& constraint_value) {
  switch (comp_type) {
    case CompType::COMP_EQUAL:
      return data_value == constraint_value;
    // TODO: Implement other comparisons when Value supports them
    // Currently Value class doesn't have >, <, >=, <= operators
    case CompType::COMP_GREATER:
      return data_value > constraint_value;
    case CompType::COMP_LESS:
      return data_value < constraint_value;

Copilot AI Apr 29, 2026

The comment above CheckValueConstraint() says Value doesn't support > < >= <=, but this file now includes sampled_match_value.h specifically to add those operators. Please update/remove the outdated comment so it doesn't mislead future changes.

Comment on lines +172 to +191
~BipartiteMaximumMatching() {
  delete[] left;
  delete[] right;
  delete[] used;
  for (int i = 0; i < left_len; i++) {
    delete[] adj[i];
    delete[] matchable[i];
    delete[] upper_graph[i];
  }
  delete[] matchable;
  delete[] adj;
  delete[] adj_size;
  delete[] lower_graph;
  delete[] upper_graph;
  delete[] lower_graph_size;
  delete[] upper_graph_size;
  delete[] right_order;
  delete[] inverse_right_order;
  delete[] bfs_visited;
}

Copilot AI Apr 29, 2026

~BipartiteMaximumMatching() does not free several buffers allocated in Initialize() (e.g., adj_index and its rows, lower_graph[i] rows, and Q/S/dfsn/scch/scc_idx). This causes per-query leaks now that CandidateSpace owns the solver. Ensure every allocation in Initialize() is paired with a corresponding delete[] in the destructor (and consider switching to std::vector for ownership).
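
To show what `std::vector` ownership buys here, the following is a deliberately simplified matcher (plain augmenting paths), not the vendored FaSTest solver. `VectorBackedMatching` is a hypothetical class; the point is only that `Initialize()` can be called repeatedly and no destructor is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified bipartite matcher whose buffers are all vectors.
class VectorBackedMatching {
 public:
  // Safe to call many times: assign() reuses or releases old storage.
  void Initialize(int left_len, int right_len) {
    adj_.assign(static_cast<std::size_t>(left_len), std::vector<int>());
    match_right_.assign(static_cast<std::size_t>(right_len), -1);
    used_.assign(static_cast<std::size_t>(right_len), 0);
  }
  void AddEdge(int l, int r) { adj_[l].push_back(r); }

  // Returns the size of a maximum matching.
  int Solve() {
    std::fill(match_right_.begin(), match_right_.end(), -1);
    int matched = 0;
    for (int l = 0; l < static_cast<int>(adj_.size()); ++l) {
      std::fill(used_.begin(), used_.end(), 0);
      if (Augment(l)) ++matched;
    }
    return matched;
  }

 private:
  // Tries to find an augmenting path starting from left vertex l.
  bool Augment(int l) {
    for (int r : adj_[l]) {
      if (used_[r]) continue;
      used_[r] = 1;
      if (match_right_[r] == -1 || Augment(match_right_[r])) {
        match_right_[r] = l;
        return true;
      }
    }
    return false;
  }
  std::vector<std::vector<int>> adj_;
  std::vector<int> match_right_;
  std::vector<char> used_;
};
```

Because every buffer is a member vector, the compiler-generated destructor releases everything, including rows, which is exactly the class of leak described above.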

Comment on lines +123 to +125
std::fill(sample.begin(), sample.end(), -1);
memset(local_candidate_size, 0, query_->GetNumVertices());
sample[root] = (i % root_candidates_.size());

Copilot AI Apr 29, 2026

memset(local_candidate_size, 0, query_->GetNumVertices()); again passes an element count where memset expects a byte count; the size should be query_->GetNumVertices() * sizeof(int). As written, only the first part of the array is cleared, which can leave stale sizes for later vertices and cause incorrect sampling / memory corruption.
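
The difference is easy to demonstrate in isolation. The three helpers below are hypothetical and only contrast the buggy call with two correct alternatives on a plain `int` buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Buggy pattern: n is an element count, but memset's third argument is bytes,
// so only the first n bytes (about n / sizeof(int) elements) are cleared.
inline void ClearByElementCount(int* buf, std::size_t n) {
  std::memset(buf, 0, n);
}

// Fixed pattern: scale the element count by the element size.
inline void ClearByByteCount(int* buf, std::size_t n) {
  std::memset(buf, 0, n * sizeof(int));
}

// Type-safe alternative: fill_n takes an element count directly.
inline void ClearWithFill(int* buf, std::size_t n) {
  std::fill_n(buf, n, 0);
}
```

With an 8-element buffer pre-filled with 7, the first helper leaves the last element at 7, while the other two zero the whole buffer.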


void Add(const Timer &other) { time += other.time; }

double Peek() { Stop(); return std::chrono::duration<double, std::milli>(e - s).count(); }

Copilot AI Apr 29, 2026

Timer::Peek() calls Stop(), but Stop() resets s to now before Peek() computes (e - s), so Peek() will typically return ~0 instead of the elapsed time. Either compute the delta without resetting s, or have Peek() just return time + (now - s) without mutating state.

Suggested change
double Peek() {
  auto now = std::chrono::high_resolution_clock::now();
  return time + std::chrono::duration<double, std::milli>(now - s).count();
}
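
For context, a self-contained timer with a non-mutating `Peek()` might look like the sketch below. `PeekableTimer` is a hypothetical stand-in, not the vendored Timer; it assumes accumulated time is kept in milliseconds and uses `steady_clock` so readings never go backwards.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical timer: Peek() reports accumulated time plus the running
// interval without changing any state, so it can be called repeatedly.
class PeekableTimer {
 public:
  using Clock = std::chrono::steady_clock;

  void Start() {
    start_ = Clock::now();
    running_ = true;
  }
  void Stop() {
    if (!running_) return;
    total_ms_ += ElapsedMs(Clock::now());
    running_ = false;
  }
  double Peek() const {
    return running_ ? total_ms_ + ElapsedMs(Clock::now()) : total_ms_;
  }

 private:
  double ElapsedMs(Clock::time_point now) const {
    return std::chrono::duration<double, std::milli>(now - start_).count();
  }
  Clock::time_point start_{};
  double total_ms_ = 0.0;
  bool running_ = false;
};
```

Tracking a `running_` flag also avoids double-counting when `Peek()` is called after `Stop()`, a case the one-line suggestion above does not address.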

Comment on lines +278 to +279
// For each neighbor of v (using neighbors_)
for (int u : GetNeighbors(v)) {

Copilot AI Apr 29, 2026

ComputeCoreNum() calls GetNeighbors(v) inside the main decomposition loop. GetNeighbors() allocates and rebuilds the neighbor list by scanning incident edges each time, which can make preprocessing far more expensive than necessary on large graphs. Consider materializing an undirected adjacency list once (or caching per-vertex neighbors) for the k-core computation.

Suggested change
// For each neighbor of v, iterate the precomputed adjacency list directly.
for (int u : neighbors_[v]) {
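
As a rough illustration of why a materialized adjacency list helps, here is a standalone k-core peeling sketch that touches each adjacency entry once. `CoreNumbers` is hypothetical and assumes an undirected, deduplicated adjacency list without self-loops; whether `neighbors_` in the PR has that shape is not confirmed by the diff shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Bucket-based peeling: repeatedly remove a vertex of minimum remaining
// degree; its core number is the degree level at which it is removed.
inline std::vector<int> CoreNumbers(
    const std::vector<std::vector<int>>& neighbors) {
  const int n = static_cast<int>(neighbors.size());
  std::vector<int> deg(n), core(n, 0);
  int max_deg = 0;
  for (int v = 0; v < n; ++v) {
    deg[v] = static_cast<int>(neighbors[v].size());
    max_deg = std::max(max_deg, deg[v]);
  }
  std::vector<std::vector<int>> buckets(max_deg + 1);
  for (int v = 0; v < n; ++v) buckets[deg[v]].push_back(v);

  std::vector<char> removed(n, 0);
  for (int d = 0; d <= max_deg; ++d) {
    // Index loop: the bucket may grow while it is being scanned.
    for (std::size_t i = 0; i < buckets[d].size(); ++i) {
      const int v = buckets[d][i];
      if (removed[v] || deg[v] != d) continue;  // stale bucket entry
      removed[v] = 1;
      core[v] = d;
      for (int u : neighbors[v]) {
        if (!removed[u] && deg[u] > d) {
          --deg[u];
          buckets[deg[u]].push_back(u);
        }
      }
    }
  }
  return core;
}
```

For a triangle with one pendant vertex attached, this yields core number 2 for the triangle and 1 for the pendant.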

Comment on lines +117 to +119

std::shuffle(root_candidates_.begin(), root_candidates_.end(), gen);
double est = 0.0;

Copilot AI Apr 29, 2026

This header uses gen for std::shuffle, but gen is not declared here; it currently relies on another header defining a global RNG and on a specific include order. Make the dependency explicit (e.g., own an std::mt19937 in CandidateGraphSampler, or declare an extern RNG in a dedicated header) so including this file alone compiles reliably.
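
The "own an std::mt19937" option could be as small as the sketch below. `RootShuffler` is a hypothetical stand-in for the relevant part of the sampler; taking the seed in the constructor also makes runs reproducible in tests.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical sampler fragment that owns its RNG instead of relying on a
// global `gen` that happens to be visible through include order.
class RootShuffler {
 public:
  explicit RootShuffler(std::uint32_t seed) : gen_(seed) {}

  void Shuffle(std::vector<int>& root_candidates) {
    std::shuffle(root_candidates.begin(), root_candidates.end(), gen_);
  }

 private:
  std::mt19937 gen_;  // per-instance state, so concurrent samplers don't share it
};
```

Two instances built with the same seed produce the same permutation, which is convenient when comparing estimates across runs.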

Comment on lines +1 to +66
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.

"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.

2. Grant of Copyright License. Subject to the terms and conditions of

Copilot AI Apr 29, 2026

Multiple source headers in fastest_lib state the code is derived from FaSTest under the MIT License, but this directory’s LICENSE file contains the Apache 2.0 text. For license compliance, include the MIT license text (and preserve upstream copyright/notice requirements), or clarify the intended dual-licensing explicitly.

Suggested change
FaSTest-derived code in this directory is provided under the MIT License.
Preserve all upstream copyright and permission notices in the source files.

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Comment on lines +16 to +19
#include "neug/compiler/extension/extension_api.h"
#include "neug/utils/exception/exception.h"

#include "sampled_match_functions.h"

Copilot AI Apr 29, 2026

std::cout is used in this file but <iostream> is not included (and extension_api.h doesn't include it). This will fail to compile on a clean build; add the proper header include.

Comment on lines +435 to +439
writeInt(static_cast<int32_t>(global_to_local_.size()));
for (const auto& [label, vid] : global_to_local_) {
  writeInt(static_cast<int32_t>(label));
  writeInt(static_cast<int32_t>(vid));
}

Copilot AI Apr 29, 2026

SaveToFile() serializes vid_t (which is uint32_t) by casting to int32_t. Vertex IDs above INT32_MAX will be corrupted on disk and then mis-loaded. Serialize vid_t as uint32_t (or uint64_t) and add corresponding range/consistency checks in LoadFromFile().
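
A sketch of keeping the ID unsigned end to end, with a range check on load, is shown below. `WriteU32`, `ReadU32`, and `ReadVidChecked` are hypothetical helpers, and `vid_t` is aliased locally to `uint32_t` to match the description above; byte order is left as the host's, as in the existing checkpoint code.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

using vid_t = std::uint32_t;  // assumption: mirrors the type named in the review

inline void WriteU32(std::ostream& out, std::uint32_t v) {
  out.write(reinterpret_cast<const char*>(&v), sizeof(v));
}

inline bool ReadU32(std::istream& in, std::uint32_t& v) {
  return static_cast<bool>(in.read(reinterpret_cast<char*>(&v), sizeof(v)));
}

// Reads a vertex ID and rejects values outside [0, num_vertices).
inline bool ReadVidChecked(std::istream& in, vid_t num_vertices, vid_t& out) {
  vid_t v = 0;
  if (!ReadU32(in, v) || v >= num_vertices) return false;
  out = v;
  return true;
}
```

IDs above INT32_MAX then round-trip without relying on any signed reinterpretation, and an out-of-range ID in a damaged checkpoint is reported instead of being used as an index.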

Comment on lines +251 to +257
  memset(BitsetCS[i], false, data_meta_.GetNumVertices());
}
// Clear edge candidate sets
for (int i = 0; i < query_->GetNumEdges(); i++) {
  BitsetEdgeCS[i].clear();
}
memset(num_visit_cs_, 0, data_meta_.GetNumVertices());

Copilot AI Apr 29, 2026

memset(BitsetCS[i], false, data_meta_.GetNumVertices()); passes an element count, not a byte count. Use data_meta_.GetNumVertices() * sizeof(bool) (or std::fill_n) to avoid partial initialization on platforms where sizeof(bool) != 1.

Suggested change
  std::fill_n(BitsetCS[i], data_meta_.GetNumVertices(), false);
}
// Clear edge candidate sets
for (int i = 0; i < query_->GetNumEdges(); i++) {
  BitsetEdgeCS[i].clear();
}
std::fill_n(num_visit_cs_, data_meta_.GetNumVertices(), 0);

Comment thread doc/source/extensions/index.md Outdated
| --------------- | ----------------------------- | ------------------------------------------------------------------------ | ------------- |
| Data Source | [JSON](load_json.md) | Import & export data from JSON file format | v0.1 |
| Data Source | [PARQUET](load_parquet.md) | Import & Export data from PARQUET format files | v0.1.1 |
| Graph Algorithm | [SAMPLED_MATCH](sampled_match.md) | Subgraph matching cardinality estimation (FaSTest, VLDB 2024) | v0.1.1 |
Collaborator

This should be written as v0.1.2.

Comment thread doc/source/extensions/sampled_match.md Outdated
],
"required_props": ["name"]
},
{"id": 1, "label": "Person", "required_props": ["name"]},
Collaborator

Have corner cases (and tests for them) been considered here? For example: what happens if a label is given that does not exist in the graph? What happens if a property is used that does not exist in the graph? Within constraints, is it not possible to combine multiple constraints (and, or)? And what happens if the type of a constraint's property does not match the type of its value?

ShunyangLi requested a review from shirly121 May 13, 2026 06:28
Development

Successfully merging this pull request may close these issues.

Support sampled subgraph matching with an extension

3 participants