feat: add ST_KNN spatial join operator #797
The Copy() method returns a unique_ptr<BindData>, which must be converted to unique_ptr<FunctionData>. Under GCC 11+ this conversion requires an explicit std::move(), because newer compilers reject the implicit conversion as a copy-initialization.
Fixes: error: could not convert 'copy' from 'unique_ptr<ST_Distance_Sphere::BindData>' to 'unique_ptr<FunctionData>'
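A minimal sketch of the pattern this fix describes, using plain `std::unique_ptr`. The `FunctionData` and `BindData` types here are hypothetical stand-ins, not the real DuckDB classes (DuckDB ships its own `unique_ptr` wrapper, which may behave differently from the standard one):

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-ins for the extension's types; not the real DuckDB classes.
struct FunctionData {
    virtual ~FunctionData() = default;
    virtual std::unique_ptr<FunctionData> Copy() const = 0;
};

struct BindData : FunctionData {
    double radius = 0.0;

    std::unique_ptr<FunctionData> Copy() const override {
        auto copy = std::make_unique<BindData>();
        copy->radius = radius;
        // Explicit move: converts unique_ptr<BindData> to unique_ptr<FunctionData>
        // through the converting move constructor instead of relying on an
        // implicit conversion in the return statement.
        return std::move(copy);
    }
};
```

Some compilers may flag the `std::move` as redundant under stricter warning settings; the sketch only shows the form of the fix, not which toolchains need it.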
Pull request overview
Adds a new ST_KNN(geom1, geom2, k) join-marker function and corresponding optimizer + execution pipeline to support K-nearest-neighbor spatial joins in the DuckDB spatial extension.
Changes:
- Introduces `ST_KNN` scalar stub + bind-time constant-`k` extraction helper.
- Extends the spatial join optimizer to detect `ST_KNN` in `JOIN ON` and rewrite to a new logical KNN join operator (with optional filter wrapping for extra `AND` predicates on `INNER` joins).
- Adds new logical + physical KNN join operators, and wires them into extension operator (de)serialization and build configuration.
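The filter wrapping for extra `AND` predicates can be pictured as splitting the join condition's conjuncts into the `ST_KNN` marker and everything else. The sketch below is illustrative only: `Predicate`, `KNNRewrite`, and `SplitJoinConjuncts` are hypothetical names, expressions are reduced to a function name, and requiring exactly one `ST_KNN` conjunct is an assumption rather than something the PR states.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, heavily simplified expression: only the function name is kept.
struct Predicate {
    std::string function_name;
};

// Result of inspecting the conjuncts of a JOIN ... ON a AND b AND c condition.
struct KNNRewrite {
    bool matched = false;            // true if exactly one ST_KNN conjunct was found
    Predicate knn;                   // the ST_KNN marker that drives the KNN join
    std::vector<Predicate> residual; // remaining predicates, applied as a filter
};

KNNRewrite SplitJoinConjuncts(const std::vector<Predicate> &conjuncts) {
    KNNRewrite out;
    std::size_t knn_count = 0;
    for (const auto &predicate : conjuncts) {
        if (predicate.function_name == "ST_KNN") {
            knn_count++;
            out.knn = predicate;
        } else {
            out.residual.push_back(predicate);
        }
    }
    // Assumption: zero or several ST_KNN conjuncts are left to other rules.
    out.matched = (knn_count == 1);
    return out;
}
```

If the residual predicates run as a filter above the KNN join, they are evaluated after the k neighbors have been chosen, so a probe row can end up with fewer than k matches.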
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/spatial/util/knn_extract.hpp | Declares helper for extracting constant k from bound ST_KNN bind data. |
| src/spatial/modules/main/spatial_functions_scalar.cpp | Registers ST_KNN and implements constant-k extraction helper. |
| src/spatial/operators/spatial_join_optimizer.cpp | Adds detection/rewrite rule for ST_KNN joins before existing spatial join rewrites. |
| src/spatial/operators/spatial_knn_join_logical.hpp / .cpp | Adds logical operator for KNN join with projection maps + serialization. |
| src/spatial/operators/spatial_knn_join_physical.hpp / .cpp | Adds physical operator that builds a FlatRTree and probes it using KNN search + refinement. |
| src/spatial/operators/spatial_operator_extension.cpp | Registers logical operator type deserialization for the new KNN join. |
| src/spatial/operators/flat_rtree.hpp | Introduces shared FlatRTree + scan state + KNN search state/implementation. |
| src/spatial/operators/spatial_join_physical.cpp | Switches spatial join to include shared FlatRTree header. |
| src/spatial/operators/CMakeLists.txt | Adds new KNN logical/physical sources to the build. |
| src/spatial/geometry/bbox.hpp | Adds MinDistanceSquared helpers used by KNN traversal. |
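For context on the `MinDistanceSquared` helpers in the last row, a lower bound on the distance between two axis-aligned boxes can be computed per axis. This is a sketch under assumed names and field layout (`Box2D` with `min_x`/`min_y`/`max_x`/`max_y`, and a hypothetical `AxisGap` helper); the actual signatures in `bbox.hpp` may differ.

```cpp
#include <algorithm>

// Assumed layout; the real Box2D in bbox.hpp may use different members.
struct Box2D {
    double min_x, min_y, max_x, max_y;
};

// Gap between intervals [a_min, a_max] and [b_min, b_max]; 0 when they overlap.
inline double AxisGap(double a_min, double a_max, double b_min, double b_max) {
    return std::max({0.0, b_min - a_max, a_min - b_max});
}

// Squared minimum distance between two boxes. Staying in squared space avoids
// a sqrt on every node visit and preserves ordering.
inline double MinDistanceSquared(const Box2D &a, const Box2D &b) {
    const double dx = AxisGap(a.min_x, a.max_x, b.min_x, b.max_x);
    const double dy = AxisGap(a.min_y, a.max_y, b.min_y, b.max_y);
    return dx * dx + dy * dy;
}

// A point is treated as a degenerate box.
inline double MinDistanceSquared(const Box2D &box, double x, double y) {
    return MinDistanceSquared(box, Box2D{x, y, x, y});
}
```

Because no geometry inside a box can be closer than the box itself, this value can serve as the priority in a best-first traversal.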
Move FlatRTree from anonymous namespace in spatial_join_physical.cpp to a shared header (flat_rtree.hpp) so it can be reused by the upcoming KNN join operator.

Changes:
- Extract FlatRTree class to src/spatial/operators/flat_rtree.hpp
- Add KNNSearch method using Hjaltason-Samet priority queue traversal
- Add FlatRTreeKNNState for KNN search state management
- Add Box2D::MinDistanceSquared for KNN distance lower bound
- Use Allocator instead of BufferManager (in-memory only)
- No functional changes to existing SPATIAL_JOIN behavior
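The Hjaltason-Samet traversal named above is a best-first search: nodes and entries share one min-priority queue keyed by their lower-bound distance, and entries come off the queue in nearest-first order. The sketch below illustrates the idea for a point query over a toy node array. `Node`, `MinDistSq`, and this `KNNSearch` signature are assumptions made for the example and do not mirror the real `FlatRTree` layout or API.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Box {
    double min_x, min_y, max_x, max_y;
};

// Hypothetical node layout: internal nodes list child indices,
// leaf entries carry a row id (>= 0) and have no children.
struct Node {
    Box box;
    std::vector<std::size_t> children;
    int row_id; // -1 for internal nodes
};

// Squared distance from a point to a box; 0 if the point is inside.
inline double MinDistSq(const Box &b, double x, double y) {
    const double dx = std::max({0.0, b.min_x - x, x - b.max_x});
    const double dy = std::max({0.0, b.min_y - y, y - b.max_y});
    return dx * dx + dy * dy;
}

// Returns up to k (squared box distance, row id) pairs, nearest first.
std::vector<std::pair<double, int>> KNNSearch(const std::vector<Node> &nodes, std::size_t root,
                                              double x, double y, std::size_t k) {
    using Item = std::pair<double, std::size_t>; // (lower bound, node index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> queue;
    std::vector<std::pair<double, int>> result;
    if (nodes.empty() || k == 0) {
        return result;
    }
    queue.push({MinDistSq(nodes[root].box, x, y), root});
    while (!queue.empty() && result.size() < k) {
        const Item top = queue.top();
        queue.pop();
        const Node &node = nodes[top.second];
        if (node.row_id >= 0) {
            // A leaf entry at the head of the queue is closer than anything unexplored.
            result.push_back({top.first, node.row_id});
            continue;
        }
        for (std::size_t child : node.children) {
            queue.push({MinDistSq(nodes[child].box, x, y), child});
        }
    }
    return result;
}
```

The ordering is correct because a child's box lies inside its parent's box, so a child's lower bound is never smaller than the parent's. The distances returned here are box distances; exact geometry distances would still need a refinement step.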
Adds a K-nearest neighbor spatial join to duckdb-spatial:
- ST_KNN(geom1, geom2, k): scalar stub recognized by the optimizer
- Optimizer rewrites JOIN ON ST_KNN(...) into SPATIAL_KNN_JOIN
- Hjaltason-Samet priority queue KNN on FlatRTree
- Exact distance refinement with adaptive overfetch (2x, retry 8x)
- INNER and LEFT JOIN support
- JOO child swap handling via build_child_idx

Stripped for review (follow-up PRs):
- No out-of-core / spill-to-disk (tree always in memory)
- No spheroid support (planar distance only)
- No partitioning (single R-tree)

The FlatRTree is compact (~24 bytes/entry), so in-memory handles up to ~100M entries (2.4GB tree) on commodity hardware.
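One way to read "adaptive overfetch (2x, retry 8x)": bounding-box distances are only lower bounds, so the search fetches more than k candidates, ranks them by exact distance, and retries with a larger batch if an unfetched candidate could still beat the current k-th result. The sketch below is a guess at that logic, not the PR's code. `Candidate`, `RefineTopK`, and the acceptance test are hypothetical, and the input vector stands in for the R-tree's nearest-first candidate stream.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Candidate {
    double lower_bound; // bounding-box distance (never larger than exact)
    double exact;       // exact geometry distance
    int row_id;
};

// `by_lower_bound` must be sorted by ascending lower_bound.
std::vector<Candidate> RefineTopK(const std::vector<Candidate> &by_lower_bound, std::size_t k) {
    const std::size_t factors[] = {2, 8}; // first attempt, then one retry
    std::vector<Candidate> best;
    for (std::size_t factor : factors) {
        const std::size_t fetch = std::min(by_lower_bound.size(), k * factor);
        best.assign(by_lower_bound.begin(), by_lower_bound.begin() + fetch);
        // stable_sort keeps scan order among candidates with equal exact distance.
        std::stable_sort(best.begin(), best.end(),
                         [](const Candidate &a, const Candidate &b) { return a.exact < b.exact; });
        if (best.size() > k) {
            best.resize(k);
        }
        const bool exhausted = (fetch == by_lower_bound.size());
        // Assumed acceptance test: the next unfetched candidate cannot be closer
        // than the current k-th exact distance.
        if (exhausted || best.empty() || by_lower_bound[fetch].lower_bound >= best.back().exact) {
            break;
        }
    }
    // If the 8x batch is still not provably complete, this sketch returns it as-is.
    return best;
}
```

Using a stable sort here keeps ties in scan order, which matches the tie behavior the PR description documents.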
b31f019 to ae53272
Addressed all items in the amended commit (
Adds a K-nearest neighbor spatial join to duckdb-spatial. SedonaDB-compatible syntax, pragmatic scope.
Depends on: #796 (FlatRTree extraction), which itself depends on #795 (GCC 11+ build fix). Please merge #795 → #796 → this PR in order.
Usage
Returns `count(buildings) * 5` rows. Ties at the k-th position follow the R-tree scan order (documented, not random).

Pipeline
Implementation notes
- `ST_KNN(geom1, geom2, k)`: 3-arg scalar stub. Returns `BOOLEAN`. Throws `InvalidInputException` if executed outside a `JOIN ON` clause. This is intentional: `ST_KNN` is a join marker the optimizer recognizes, not a computable predicate. Same pattern as SedonaDB's `st_knn.rs`. Documented in the function description.
- Optimizer (`spatial_join_optimizer.cpp`): matches `LogicalJoin` with `ST_KNN` condition, rewrites to `LogicalSpatialKNNJoin`. Handles `JoinOrderOptimizer` child swaps by tracking `build_child_idx`. Rejects explicit `RIGHT JOIN`; converts `INNER` with extra `AND` predicates into `Join(LogicalSpatialKNNJoin, filter)`.
- `FlatRTree` from #796 (refactor: extract FlatRTree to shared header + add KNN search primitive). Hjaltason-Samet priority queue traversal, exact distance refinement with 2× overfetch (adaptive retry at 8× if needed).
- `INNER JOIN` and `LEFT JOIN` supported. `LEFT JOIN` emits `NULL` rows for unmatched probe rows.
- Serialization (`LogicalSpatialKNNJoin::Serialize`/`Deserialize`).
- `FlatRTree` is built once in `Finalize`, read-only during `Execute`. Probe parallelism via DuckDB's standard pipeline parallelism.

Explicitly out of scope (follow-up PRs)
- Out-of-core / spill-to-disk: `FlatRTree` is compact (~24 bytes/entry, so 10M entries = 240MB). In-memory covers virtually all real workloads. A spill-and-rebuild design (per-partition rebuild from spilled source batches, similar to SedonaDB's approach) is ready but will be a separate PR integrated with DuckDB's `TemporaryMemoryManager`.
- Spheroid support: `ST_DWithin_Spheroid` in the regular spatial join is a better first step.
- `LATERAL` syntax recognition: `ST_KNN` requires the optimizer rule. If `disabled_optimizers='spatial_join'`, the query errors (same as SedonaDB's behavior). A `LATERAL` optimizer rule can be added when the core hook for lateral-join rewrites lands.

Files
Cross-engine validation
On the NYC benchmark (1.2M buildings × 109K hydrants, k=5):
300/300 identical `(probe_id, distance)` pairs between the two engines.

Verification