feat: Support sampled subgraph matching with an extension #260
ShunyangLi wants to merge 29 commits into
Pull request overview
This PR adds a new sampled_match extension to NeuG that performs sampled subgraph matching / cardinality estimation on large graphs (based on the FaSTest approach), along with a minimal end-to-end smoke test and build integration.
Changes:
- Adds a new `sampled_match` extension library that registers table functions (INITIALIZE / SAMPLED_MATCH / property accessors / checkpointing).
- Vendors FaSTest-related sources under `extension/sampled_match/include/fastest_lib` and adapts them to NeuG's graph storage interfaces.
- Adds a CTest smoke test that creates a tiny graph, loads the extension, and runs `CALL SAMPLED_MATCH`.
Reviewed changes
Copilot reviewed 30 out of 31 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| extension/CMakeLists.txt | Enables building the new sampled_match extension when selected in BUILD_EXTENSIONS. |
| extension/sampled_match/CMakeLists.txt | Builds and links the extension with Boost/GSL and includes FaSTest sources. |
| extension/sampled_match/README.md | Documents how to build and run the extension smoke test. |
| extension/sampled_match/src/sampled_match_extension.cpp | Extension entrypoint registering functions + extension metadata. |
| extension/sampled_match/src/sampled_match_data_graph_meta.cpp | Implements preprocessing + checkpoint (de)serialization for DataGraphMeta. |
| extension/sampled_match/tests/CMakeLists.txt | Builds and registers the smoke test as a CTest. |
| extension/sampled_match/tests/sampled_match_test.cpp | End-to-end smoke test that loads the extension and runs SAMPLED_MATCH. |
| extension/sampled_match/include/sampled_match_data_graph_meta.h | Declares DataGraphMeta and graph-access helpers used by sampling/matching. |
| extension/sampled_match/include/sampled_match_functions.h | Implements pattern parsing, graph cache/checkpoint logic, and main match execution path. |
| extension/sampled_match/include/sampled_match_value.h | Adds missing relational operators for neug::execution::Value used by FaSTest code. |
| extension/sampled_match/include/fastest_lib/CMakeLists.txt | Vendored FaSTest CMake (not used by NeuG build, but included with sources). |
| extension/sampled_match/include/fastest_lib/LICENSE | Vendored license text for FaSTest code. |
| extension/sampled_match/include/fastest_lib/README.md | Vendored FaSTest README and usage notes. |
| extension/sampled_match/include/fastest_lib/driver/subgraph-cardinality-estimation.cc | Vendored FaSTest standalone driver (not wired into extension build). |
| extension/sampled_match/include/fastest_lib/src/Base/base.h | Vendored FaSTest base utilities used by the algorithm. |
| extension/sampled_match/include/fastest_lib/src/Base/basic_algorithms.h | Vendored union-find + bipartite matching utilities used by filters. |
| extension/sampled_match/include/fastest_lib/src/Base/metrics.h | Vendored metrics helpers (unused in extension path). |
| extension/sampled_match/include/fastest_lib/src/Base/timer.h | Vendored timer utility used by sampling code paths. |
| extension/sampled_match/include/fastest_lib/src/DataStructure/graph.h | Vendored/adapted directed graph structure used by FaSTest components. |
| extension/sampled_match/include/fastest_lib/src/DataStructure/graph.cpp | Adds schema-aware “no-edge-pairs” construction for pattern constraints. |
| extension/sampled_match/include/fastest_lib/src/SpecialSubgraphs/small_cycle.h | Vendored small-cycle enumeration utilities (mostly disabled/unused here). |
| extension/sampled_match/include/fastest_lib/src/SubgraphCounting/cardinality_estimation.h | FaSTest cardinality estimation orchestration adapted to DataGraphMeta. |
| extension/sampled_match/include/fastest_lib/src/SubgraphCounting/candidate_graph_sampling.h | Graph sampling strategy used when tree sampling is insufficient. |
| extension/sampled_match/include/fastest_lib/src/SubgraphCounting/candidate_tree_sampling.h | Tree sampling strategy used for estimation + sample collection. |
| extension/sampled_match/include/fastest_lib/src/SubgraphCounting/option.h | Options for FaSTest estimation/sampling configuration. |
| extension/sampled_match/include/fastest_lib/src/SubgraphMatching/candidate_filter.h | Placeholder header for candidate filtering (currently empty). |
| extension/sampled_match/include/fastest_lib/src/SubgraphMatching/candidate_space.h | Core candidate space construction/refinement adapted for directed graphs + metadata access. |
| extension/sampled_match/include/fastest_lib/src/SubgraphMatching/candidate_space.cpp | Empty translation unit placeholder for candidate space. |
| extension/sampled_match/include/fastest_lib/src/SubgraphMatching/data_graph.h | Vendored FaSTest data graph wrapper (mostly used by standalone driver). |
| extension/sampled_match/include/fastest_lib/src/SubgraphMatching/pattern_graph.h | Pattern graph processing and adjacency index construction for directed support. |

    const std::string db_path = "/tmp/neug_smd_test";
    const std::string pattern_file = "pattern.json";
    Cleanup cleanup{db_path, pattern_file};

The smoke test uses a fixed DB path (/tmp/neug_smd_test) and writes pattern.json in the current working directory. This can cause flaky failures when tests run in parallel or under restricted sandboxes. Prefer creating a unique temporary directory (std::filesystem::temp_directory_path + random/uuid/mkdtemp) and write the pattern file under that directory; also consider setting the test WORKING_DIRECTORY in CTest if you want deterministic cleanup.
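One way to follow this suggestion is sketched below, assuming C++17. `MakeUniqueTempDir` and `TempDirGuard` are hypothetical names that do not appear in the PR; the sketch only shows how a per-run directory could be created under `std::filesystem::temp_directory_path()` and removed when the test ends.

```cpp
#include <filesystem>
#include <random>
#include <string>
#include <system_error>
#include <utility>

namespace fs = std::filesystem;

// Hypothetical helper (not in the PR): creates a fresh directory under the
// system temp path. The name combines a prefix with a random 64-bit number,
// and creation is retried if that name already exists.
inline fs::path MakeUniqueTempDir(const std::string& prefix) {
  std::random_device rd;
  std::mt19937_64 rng(rd());
  for (int attempt = 0; attempt < 16; ++attempt) {
    fs::path dir =
        fs::temp_directory_path() / (prefix + "_" + std::to_string(rng()));
    std::error_code ec;
    // create_directory returns false, without an error, if dir already exists.
    if (fs::create_directory(dir, ec) && !ec) {
      return dir;
    }
  }
  throw fs::filesystem_error("could not create a unique temp directory",
                             std::make_error_code(std::errc::file_exists));
}

// RAII guard that removes the directory tree when the test scope ends.
struct TempDirGuard {
  fs::path path;
  explicit TempDirGuard(fs::path p) : path(std::move(p)) {}
  ~TempDirGuard() {
    std::error_code ec;
    fs::remove_all(path, ec);  // best effort; errors are ignored on cleanup
  }
  TempDirGuard(const TempDirGuard&) = delete;
  TempDirGuard& operator=(const TempDirGuard&) = delete;
};
```

In the smoke test, both the database path and `pattern.json` could then live under the guarded directory, so parallel test runs would not share state.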

    auto readIntVec = [&](std::vector<int>& vec) {
      int32_t sz = readInt();
      vec.resize(sz);
      if (sz > 0) {
        ifs.read(reinterpret_cast<char*>(vec.data()), sz * sizeof(int));
      }
    };
    auto readDoubleVec = [&](std::vector<double>& vec) {
      int32_t sz = readInt();
      vec.resize(sz);
      if (sz > 0) {
        ifs.read(reinterpret_cast<char*>(vec.data()), sz * sizeof(double));
      }
    };

LoadFromFile reads vector sizes from the checkpoint file and resizes vectors without any sanity checks (e.g., negative/huge sizes, inconsistent counts). A corrupted or malicious checkpoint can trigger unbounded allocations/OOM and potentially crash the process. Add validation for sizes (non-negative, reasonable upper bounds, and consistency with num_vertex_/num_labels_) and fail fast if the file is invalid.

    if (!cached_data.preprocessed) {
      std::cout << "[match] Graph not initialized, calling DoGraphInitialization..." << std::endl;
      DoGraphInitialization(graph_, true);
    } else {
      std::cout << "[0-1] Using cached graph data..." << std::endl;
      std::cout << " Vertices: " << cached_data.data_meta->GetNumVertices() << std::endl;
      std::cout << " Edges: " << cached_data.data_meta->GetNumEdges() << std::endl;
    }
    std::cout << std::endl;

    // Step 2: always reload the pattern — callers can vary it per invocation.
    std::cout << "[2] Loading pattern graph from: " << pattern_file_ << std::endl;
    pattern_graph_ = CreatePatternFromJson(pattern_file_);
    if (!pattern_graph_ || pattern_graph_->GetNumVertices() == 0) {
      std::cerr << " ERROR: Failed to load pattern!" << std::endl;
      return -1;
    }
    std::cout << " Pattern: " << pattern_graph_->GetNumVertices() << " vertices, "
              << pattern_graph_->GetNumEdges() << " edges" << std::endl;

    // Print pattern details
    std::cout << " Pattern vertices:" << std::endl;
    for (int i = 0; i < pattern_graph_->GetNumVertices(); i++) {
      int label = pattern_graph_->vertex_label[i];
      std::cout << " v" << i << ": label=" << label
                << " (out_deg=" << pattern_graph_->GetOutDegree(i)
                << ", in_deg=" << pattern_graph_->GetInDegree(i) << ")" << std::endl;
    }
    std::cout << " Pattern edges:" << std::endl;
    for (int i = 0; i < pattern_graph_->GetNumEdges(); i++) {
      auto& [src, dst] = pattern_graph_->edge_list[i];
      int label = pattern_graph_->edge_label[i];
      std::cout << " e" << i << ": " << src << " -[label=" << label << "]-> " << dst << std::endl;
    }
    std::cout << std::endl;

    // Step 3: Process pattern (compute core numbers, build incidence list, etc.)
    std::cout << "[3] Processing pattern..." << std::endl;
    pattern_graph_->ProcessPattern(*cached_data.data_meta, cached_data.schema_graph);
    std::cout << " Done." << std::endl;

SampledSubgraphMatcher::match() prints extensive progress and pattern details to stdout unconditionally. In production this can severely impact performance and pollute logs for every CALL SAMPLED_MATCH. Consider using LOG(INFO)/VLOG with a verbosity flag, or gate these prints behind a debug option passed via function parameters.

    inline std::string GenerateOutputFilePath(const std::string& prefix) {
      auto now = std::chrono::system_clock::now();
      auto timestamp = std::chrono::duration_cast<std::chrono::milliseconds>(
          now.time_since_epoch()).count();
      std::filesystem::create_directories("/tmp/p/neug_sample");
      return "/tmp/p/neug_sample/" + prefix + "_" + std::to_string(timestamp) + ".csv";
    }

GenerateOutputFilePath always writes under a hard-coded /tmp path and uses only millisecond timestamps, which can collide under high concurrency. Prefer std::filesystem::temp_directory_path() and a collision-resistant unique suffix (random/uuid or mkstemp-style) and propagate filesystem errors back to the caller.
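A possible shape for this, assuming C++17, is shown below. `MakeOutputFilePath` is a hypothetical replacement, and the `neug_sample` subdirectory name is carried over from the reviewed code; filesystem errors are reported through a `std::error_code` rather than ignored.

```cpp
#include <atomic>
#include <cstdint>
#include <filesystem>
#include <random>
#include <string>
#include <system_error>

// Hypothetical sketch (not in the PR). The file name combines a random 64-bit
// token with a process-wide counter, so two calls in the same millisecond
// still receive different names. Errors are reported through `ec`.
inline std::filesystem::path MakeOutputFilePath(const std::string& prefix,
                                                std::error_code& ec) {
  static std::atomic<std::uint64_t> counter{0};
  thread_local std::mt19937_64 rng{std::random_device{}()};

  std::filesystem::path dir = std::filesystem::temp_directory_path(ec);
  if (ec) return {};
  dir /= "neug_sample";
  std::filesystem::create_directories(dir, ec);  // no error if it exists
  if (ec) return {};

  const std::uint64_t seq = counter.fetch_add(1, std::memory_order_relaxed);
  return dir / (prefix + "_" + std::to_string(rng()) + "_" +
                std::to_string(seq) + ".csv");
}
```

The counter alone guarantees uniqueness within one process; the random token is there to make collisions across processes unlikely.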

    for (const auto& edge : GetAllOutIncidentEdges(global_id)) {
      int dst_global = std::get<1>(edge);
      if (!nbr_seen_[dst_global]) {
        nbr_seen_[dst_global] = true;
        nbr_seen_reset_.push_back(dst_global);
        result.push_back(dst_global);
      }
    }
    for (const auto& edge : GetAllInIncidentEdges(global_id)) {
      int src_global = std::get<0>(edge);
      if (!nbr_seen_[src_global]) {
        nbr_seen_[src_global] = true;
        nbr_seen_reset_.push_back(src_global);
        result.push_back(src_global);
      }
    }
    for (int v : nbr_seen_reset_) nbr_seen_[v] = false;
    nbr_seen_reset_.clear();

DataGraphMeta::GetNeighbors/GetOutNeighbors/GetInNeighbors mutate shared mutable scratch state (nbr_seen_/nbr_seen_reset_) inside const methods. Because DataGraphMeta is cached globally (GraphDataCache) and can be used concurrently across queries/sessions, this introduces data races and can corrupt results. Consider making the dedup scratch space thread-local/per-call (e.g., pass in a scratch struct, or use a per-thread epoch array) or guard these methods with a mutex (prefer scratch to avoid contention).
Suggested change (replaces the lines quoted above):

    std::unordered_set<int> seen;
    for (const auto& edge : GetAllOutIncidentEdges(global_id)) {
      int dst_global = std::get<1>(edge);
      if (seen.insert(dst_global).second) {
        result.push_back(dst_global);
      }
    }
    for (const auto& edge : GetAllInIncidentEdges(global_id)) {
      int src_global = std::get<0>(edge);
      if (seen.insert(src_global).second) {
        result.push_back(src_global);
      }
    }

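The review also mentions a per-thread epoch array as an alternative to a per-call `std::unordered_set`. A rough sketch of that idea follows; `DedupScratch` and `MergeNeighbors` are hypothetical names, and the plain id vectors stand in for the incident-edge ranges used in the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-thread (or per-query) scratch space, not part of the PR.
// A vertex counts as seen in the current call iff stamp[v] == epoch, so the
// reset between calls is one increment instead of a pass over the array.
struct DedupScratch {
  std::vector<std::uint32_t> stamp;
  std::uint32_t epoch = 0;

  explicit DedupScratch(std::size_t num_vertices) : stamp(num_vertices, 0) {}

  void NextCall() {
    if (++epoch == 0) {  // counter wrapped: clear all stamps once and restart
      std::fill(stamp.begin(), stamp.end(), 0);
      epoch = 1;
    }
  }

  // Returns true the first time v is offered within the current call.
  bool Insert(int v) {
    if (stamp[v] == epoch) return false;
    stamp[v] = epoch;
    return true;
  }
};

// Merges out- and in-neighbor ids into one duplicate-free vector, keeping the
// order of first appearance.
inline std::vector<int> MergeNeighbors(const std::vector<int>& out_nbrs,
                                       const std::vector<int>& in_nbrs,
                                       DedupScratch& scratch) {
  scratch.NextCall();
  std::vector<int> result;
  for (int v : out_nbrs) {
    if (scratch.Insert(v)) result.push_back(v);
  }
  for (int v : in_nbrs) {
    if (scratch.Insert(v)) result.push_back(v);
  }
  return result;
}
```

Because the scratch object is owned by the caller, the shared `DataGraphMeta` would no longer need mutable members, at the cost of one stamp array per thread or query.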

    inline BipartiteMaximumMatching BPSolver;
    enum STRUCTURE_FILTER {
      NO_STRUCTURE_FILTER,

BPSolver is a single global inline instance shared across all CandidateSpace objects and all threads. CandidateSpace::CandidateSpace calls BPSolver.Initialize(), which overwrites internal pointers without freeing prior allocations (leak) and makes the solver state non-reentrant; BuildCS/EdgeBipartiteSafety also mutate it. Make BipartiteMaximumMatching an instance member of CandidateSpace (or allocate per call / thread_local) so concurrent queries cannot race and repeated calls do not leak memory.
Pull request overview
Copilot reviewed 31 out of 32 changed files in this pull request and generated 13 comments.

    // Helper: check whether a value satisfies the constraint
    // Note: neug::execution::Value only supports == operator, other comparisons not yet implemented
    inline bool CandidateSpace::CheckValueConstraint(const Value& data_value, CompType comp_type, const Value& constraint_value) {
      switch (comp_type) {
        case CompType::COMP_EQUAL:
          return data_value == constraint_value;
        // TODO: Implement other comparisons when Value supports them
        // Currently Value class doesn't have >, <, >=, <= operators
        case CompType::COMP_GREATER:
          return data_value > constraint_value;
        case CompType::COMP_LESS:
          return data_value < constraint_value;

The comment above CheckValueConstraint() says Value doesn't support > < >= <=, but this file now includes sampled_match_value.h specifically to add those operators. Please update/remove the outdated comment so it doesn't mislead future changes.

    ~BipartiteMaximumMatching() {
      delete[] left;
      delete[] right;
      delete[] used;
      for (int i = 0; i < left_len; i++) {
        delete[] adj[i];
        delete[] matchable[i];
        delete[] upper_graph[i];
      }
      delete[] matchable;
      delete[] adj;
      delete[] adj_size;
      delete[] lower_graph;
      delete[] upper_graph;
      delete[] lower_graph_size;
      delete[] upper_graph_size;
      delete[] right_order;
      delete[] inverse_right_order;
      delete[] bfs_visited;
    }

~BipartiteMaximumMatching() does not free several buffers allocated in Initialize() (e.g., adj_index and its rows, lower_graph[i] rows, and Q/S/dfsn/scch/scc_idx). This causes per-query leaks now that CandidateSpace owns the solver. Ensure every allocation in Initialize() is paired with a corresponding delete[] in the destructor (and consider switching to std::vector for ownership).
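To illustrate the `std::vector` ownership idea, here is a deliberately simplified stand-in. It uses Kuhn's augmenting-path algorithm and is not the vendored FaSTest solver; `SimpleBipartiteMatcher` and its methods are hypothetical. The point is only that vector members make the destructor implicit and let repeated initialization reuse or release memory automatically.

```cpp
#include <vector>

// Simplified stand-in (Kuhn's augmenting-path algorithm), NOT the vendored
// FaSTest solver. Every buffer is a std::vector member, so there is no
// hand-written destructor and calling Reset() repeatedly cannot leak.
class SimpleBipartiteMatcher {
 public:
  void Reset(int left_size, int right_size) {
    adj_.assign(left_size, std::vector<int>{});
    match_right_.assign(right_size, -1);
  }

  void AddEdge(int l, int r) { adj_[l].push_back(r); }

  // Returns the size of a maximum matching for the current edge set.
  int Solve() {
    match_right_.assign(match_right_.size(), -1);
    int matched = 0;
    for (int l = 0; l < static_cast<int>(adj_.size()); ++l) {
      used_.assign(match_right_.size(), false);
      if (TryAugment(l)) ++matched;
    }
    return matched;
  }

 private:
  // Depth-first search for an augmenting path starting at left vertex l.
  bool TryAugment(int l) {
    for (int r : adj_[l]) {
      if (used_[r]) continue;
      used_[r] = true;
      if (match_right_[r] == -1 || TryAugment(match_right_[r])) {
        match_right_[r] = l;
        return true;
      }
    }
    return false;
  }

  std::vector<std::vector<int>> adj_;  // adjacency of left vertices
  std::vector<int> match_right_;       // left partner of each right vertex
  std::vector<bool> used_;             // right vertices visited in this search
};
```

Holding such an object as a member of each `CandidateSpace`, rather than as a global, would also address the reentrancy concern raised for `BPSolver`.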

    std::fill(sample.begin(), sample.end(), -1);
    memset(local_candidate_size, 0, query_->GetNumVertices());
    sample[root] = (i % root_candidates_.size());

memset(local_candidate_size, 0, query_->GetNumVertices()); again passes an element count where memset expects a byte count; the size should be query_->GetNumVertices() * sizeof(int). This can leave stale sizes for later vertices and cause incorrect sampling / memory corruption.
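The difference is easy to reproduce in isolation. The three helpers below are hypothetical and only contrast the reviewed pattern with two possible fixes; they assume an `int` array, as in the comment above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Pattern from the review: the third argument of memset is a byte count, so
// passing the element count clears only the first n bytes of an int array.
inline void ClearBuggy(int* sizes, int n) { std::memset(sizes, 0, n); }

// Fix 1: scale the element count by the element size.
inline void ClearWithMemset(int* sizes, int n) {
  std::memset(sizes, 0, static_cast<std::size_t>(n) * sizeof(int));
}

// Fix 2: std::fill_n counts elements, so the mistake cannot occur.
inline void ClearWithFill(int* sizes, int n) { std::fill_n(sizes, n, 0); }
```

On a platform where `sizeof(int)` is 4, `ClearBuggy` on an 8-element array resets only the first two entries and leaves the other six untouched.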

    void Add(const Timer &other) { time += other.time; }

    double Peek() { Stop(); return std::chrono::duration<double, std::milli>(e - s).count(); }

Timer::Peek() calls Stop(), but Stop() resets s to now before Peek() computes (e - s), so Peek() will typically return ~0 instead of the elapsed time. Either compute the delta without resetting s, or have Peek() just return time + (now - s) without mutating state.
Suggested change (replaces the Peek() line quoted above):

    double Peek() {
      auto now = std::chrono::high_resolution_clock::now();
      return time + std::chrono::duration<double, std::milli>(now - s).count();
    }

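For comparison, a self-contained timer with a non-mutating `Peek()` might look like the sketch below. `StopWatch` is hypothetical and is not the vendored `Timer`; it additionally tracks whether the timer is running, which is an assumption about the desired behavior when `Peek()` is called after `Stop()`.

```cpp
#include <chrono>

// Hypothetical minimal timer, not the vendored FaSTest Timer. Peek() is const:
// it reports the accumulated time plus the currently running interval without
// modifying any state.
class StopWatch {
  using Clock = std::chrono::steady_clock;

 public:
  void Start() {
    start_ = Clock::now();
    running_ = true;
  }

  void Stop() {
    if (running_) {
      total_ms_ += Since(start_);
      running_ = false;
    }
  }

  // Elapsed milliseconds so far; safe to call whether running or stopped.
  double Peek() const {
    return running_ ? total_ms_ + Since(start_) : total_ms_;
  }

 private:
  static double Since(Clock::time_point t) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t).count();
  }

  Clock::time_point start_{};
  double total_ms_ = 0.0;
  bool running_ = false;
};
```

`std::chrono::steady_clock` is used here because it is monotonic, which keeps successive `Peek()` values from going backwards.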

    // For each neighbor of v (using neighbors_)
    for (int u : GetNeighbors(v)) {

ComputeCoreNum() calls GetNeighbors(v) inside the main decomposition loop. GetNeighbors() allocates and rebuilds the neighbor list by scanning incident edges each time, which can make preprocessing far more expensive than necessary on large graphs. Consider materializing an undirected adjacency list once (or caching per-vertex neighbors) for the k-core computation.
Suggested change (replaces the two lines quoted above):

    // For each neighbor of v, iterate the precomputed adjacency list directly.
    for (int u : neighbors_[v]) {

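The "materialize once, then peel" approach can be sketched as follows, assuming C++17. `BuildUndirectedAdjacency` and `ComputeCoreNumbers` are hypothetical free functions over a plain edge list; the PR's `ComputeCoreNum()` works on `DataGraphMeta` and may use a different peeling order or data structure.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Builds a duplicate-free undirected adjacency list once from directed edges.
inline std::vector<std::vector<int>> BuildUndirectedAdjacency(
    int num_vertices, const std::vector<std::pair<int, int>>& edges) {
  std::vector<std::vector<int>> adj(num_vertices);
  for (const auto& [src, dst] : edges) {
    if (src == dst) continue;  // self-loops are ignored in this sketch
    adj[src].push_back(dst);
    adj[dst].push_back(src);
  }
  for (auto& nbrs : adj) {
    std::sort(nbrs.begin(), nbrs.end());
    nbrs.erase(std::unique(nbrs.begin(), nbrs.end()), nbrs.end());
  }
  return adj;
}

// Core numbers by repeatedly peeling a minimum-degree vertex. A min-heap with
// lazy deletion keeps the sketch short; stale entries are skipped on pop.
inline std::vector<int> ComputeCoreNumbers(
    const std::vector<std::vector<int>>& adj) {
  const int n = static_cast<int>(adj.size());
  std::vector<int> degree(n), core(n, 0);
  std::vector<bool> removed(n, false);
  using Entry = std::pair<int, int>;  // (current degree, vertex)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  for (int v = 0; v < n; ++v) {
    degree[v] = static_cast<int>(adj[v].size());
    heap.emplace(degree[v], v);
  }
  int k = 0;
  while (!heap.empty()) {
    auto [d, v] = heap.top();
    heap.pop();
    if (removed[v] || d != degree[v]) continue;  // stale heap entry
    k = std::max(k, d);
    core[v] = k;
    removed[v] = true;
    for (int u : adj[v]) {
      if (!removed[u]) heap.emplace(--degree[u], u);
    }
  }
  return core;
}
```

The adjacency list is built a single time, so the decomposition loop touches each neighbor list directly instead of rescanning incident edges per vertex.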

    std::shuffle(root_candidates_.begin(), root_candidates_.end(), gen);
    double est = 0.0;

This header uses gen for std::shuffle, but gen is not declared here; it currently relies on another header defining a global RNG and on a specific include order. Make the dependency explicit (e.g., own an std::mt19937 in CandidateGraphSampler, or declare an extern RNG in a dedicated header) so including this file alone compiles reliably.
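One of the options named here, a generator owned by the sampler, could look like this. `RootShuffler` is a hypothetical, trimmed-down class and not the PR's `CandidateGraphSampler`; the seed parameter is likewise an assumption.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical, trimmed-down sampler: it owns its generator instead of relying
// on a global `gen` that happens to be defined in another header.
class RootShuffler {
 public:
  explicit RootShuffler(std::uint64_t seed) : gen_(seed) {}

  // Shuffles the candidates in place using the sampler's own generator.
  void Shuffle(std::vector<int>& root_candidates) {
    std::shuffle(root_candidates.begin(), root_candidates.end(), gen_);
  }

 private:
  std::mt19937_64 gen_;
};
```

A per-instance generator also makes runs reproducible from a seed and avoids sharing RNG state between concurrent queries.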

    Apache License
    Version 2.0, January 2004
    http://www.apache.org/licenses/

    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

    1. Definitions.

    "License" shall mean the terms and conditions for use, reproduction,
    and distribution as defined by Sections 1 through 9 of this document.

    "Licensor" shall mean the copyright owner or entity authorized by
    the copyright owner that is granting the License.

    "Legal Entity" shall mean the union of the acting entity and all
    other entities that control, are controlled by, or are under common
    control with that entity. For the purposes of this definition,
    "control" means (i) the power, direct or indirect, to cause the
    direction or management of such entity, whether by contract or
    otherwise, or (ii) ownership of fifty percent (50%) or more of the
    outstanding shares, or (iii) beneficial ownership of such entity.

    "You" (or "Your") shall mean an individual or Legal Entity
    exercising permissions granted by this License.

    "Source" form shall mean the preferred form for making modifications,
    including but not limited to software source code, documentation
    source, and configuration files.

    "Object" form shall mean any form resulting from mechanical
    transformation or translation of a Source form, including but
    not limited to compiled object code, generated documentation,
    and conversions to other media types.

    "Work" shall mean the work of authorship, whether in Source or
    Object form, made available under the License, as indicated by a
    copyright notice that is included in or attached to the work
    (an example is provided in the Appendix below).

    "Derivative Works" shall mean any work, whether in Source or Object
    form, that is based on (or derived from) the Work and for which the
    editorial revisions, annotations, elaborations, or other modifications
    represent, as a whole, an original work of authorship. For the purposes
    of this License, Derivative Works shall not include works that remain
    separable from, or merely link (or bind by name) to the interfaces of,
    the Work and Derivative Works thereof.

    "Contribution" shall mean any work of authorship, including
    the original version of the Work and any modifications or additions
    to that Work or Derivative Works thereof, that is intentionally
    submitted to Licensor for inclusion in the Work by the copyright owner
    or by an individual or Legal Entity authorized to submit on behalf of
    the copyright owner. For the purposes of this definition, "submitted"
    means any form of electronic, verbal, or written communication sent
    to the Licensor or its representatives, including but not limited to
    communication on electronic mailing lists, source code control systems,
    and issue tracking systems that are managed by, or on behalf of, the
    Licensor for the purpose of discussing and improving the Work, but
    excluding communication that is conspicuously marked or otherwise
    designated in writing by the copyright owner as "Not a Contribution."

    "Contributor" shall mean Licensor and any individual or Legal Entity
    on behalf of whom a Contribution has been received by Licensor and
    subsequently incorporated within the Work.

    2. Grant of Copyright License. Subject to the terms and conditions of

Multiple source headers in fastest_lib state the code is derived from FaSTest under the MIT License, but this directory’s LICENSE file contains the Apache 2.0 text. For license compliance, include the MIT license text (and preserve upstream copyright/notice requirements), or clarify the intended dual-licensing explicitly.
Suggested change (replaces the Apache License 2.0 text quoted above):

    FaSTest-derived code in this directory is provided under the MIT License.
    Preserve all upstream copyright and permission notices in the source files.

    MIT License

    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the "Software"), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    SOFTWARE.


    #include "neug/compiler/extension/extension_api.h"
    #include "neug/utils/exception/exception.h"

    #include "sampled_match_functions.h"

std::cout is used in this file but <iostream> is not included (and extension_api.h doesn't include it). This will fail to compile on a clean build; add the proper header include.

    writeInt(static_cast<int32_t>(global_to_local_.size()));
    for (const auto& [label, vid] : global_to_local_) {
      writeInt(static_cast<int32_t>(label));
      writeInt(static_cast<int32_t>(vid));
    }

SaveToFile() serializes vid_t (which is uint32_t) by casting to int32_t. Vertex IDs above INT32_MAX will be corrupted on disk and then mis-loaded. Serialize vid_t as uint32_t (or uint64_t) and add corresponding range/consistency checks in LoadFromFile().

      memset(BitsetCS[i], false, data_meta_.GetNumVertices());
    }
    // Clear edge candidate sets
    for (int i = 0; i < query_->GetNumEdges(); i++) {
      BitsetEdgeCS[i].clear();
    }
    memset(num_visit_cs_, 0, data_meta_.GetNumVertices());

memset(BitsetCS[i], false, data_meta_.GetNumVertices()); passes an element count, not a byte count. Use data_meta_.GetNumVertices() * sizeof(bool) (or std::fill_n) to avoid partial initialization on platforms where sizeof(bool) != 1.
Suggested change (replaces the lines quoted above):

      std::fill_n(BitsetCS[i], data_meta_.GetNumVertices(), false);
    }
    // Clear edge candidate sets
    for (int i = 0; i < query_->GetNumEdges(); i++) {
      BitsetEdgeCS[i].clear();
    }
    std::fill_n(num_visit_cs_, data_meta_.GetNumVertices(), 0);

| | --------------- | ----------------------------- | ------------------------------------------------------------------------ | ------------- | | ||
| | Data Source | [JSON](load_json.md) | Import & export data from JSON file format | v0.1 | | ||
| | Data Source | [PARQUET](load_parquet.md) | Import & Export data from PARQUET format files | v0.1.1 | | ||
| | Graph Algorithm | [SAMPLED_MATCH](sampled_match.md) | Subgraph matching cardinality estimation (FaSTest, VLDB 2024) | v0.1.1 | |

        ],
        "required_props": ["name"]
      },
      {"id": 1, "label": "Person", "required_props": ["name"]},

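Have corner cases (and tests for them) been considered here? For example: what happens if the pattern uses a label that does not exist in the graph, or a property that does not exist in the graph? Can multiple constraints not be combined (with and/or)? And what happens when the type of a constraint's property does not match the type of its value?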
What do these changes do?
Adds an extension that uses sampling algorithms to perform subgraph matching in reasonable time on very large graphs.
Related issue number
Fix #35, #345