
feat: add ST_KNN spatial join operator #797

Open
pierre-warnier wants to merge 3 commits into duckdb:v1.5-variegata from pierre-warnier:pr/knn-2-operator


@pierre-warnier

Adds a K-nearest neighbor spatial join to duckdb-spatial. SedonaDB-compatible syntax, pragmatic scope.

Depends on: #796 (FlatRTree extraction), which itself depends on #795 (GCC 11+ build fix). Please merge in order: #795 → #796 → this PR.

Usage

-- For each building, find the 5 nearest hydrants
SELECT b.id, h.id, ST_Distance(b.geom, h.geom) AS d
FROM buildings b
JOIN hydrants h ON ST_KNN(b.geom, h.geom, 5);

Returns count(buildings) * 5 rows. Ties at the k-th position follow the R-tree scan order (documented, not random).

Pipeline

LogicalJoin(ST_KNN)
  → spatial_join_optimizer rewrites to
LogicalSpatialKNNJoin
  → physical_planner creates
PhysicalSpatialKNNJoin(FlatRTree from #796 + Hjaltason-Samet KNN)

Implementation notes

  • ST_KNN(geom1, geom2, k) — 3-arg scalar stub. Returns BOOLEAN. Throws InvalidInputException if executed outside a JOIN ON clause. This is intentional: ST_KNN is a join marker the optimizer recognizes, not a computable predicate. Same pattern as SedonaDB's st_knn.rs. Documented in the function description.
  • Optimizer rule (spatial_join_optimizer.cpp): matches LogicalJoin with ST_KNN condition, rewrites to LogicalSpatialKNNJoin. Handles JoinOrderOptimizer child swaps by tracking build_child_idx. Rejects explicit RIGHT JOIN; converts INNER with extra AND predicates into Join(LogicalSpatialKNNJoin, filter).
  • Physical operator uses the shared FlatRTree from #796 ("refactor: extract FlatRTree to shared header + add KNN search primitive"). Hjaltason-Samet priority queue traversal, exact distance refinement with 2× overfetch (adaptive retry at 8× if needed).
  • INNER JOIN and LEFT JOIN supported. LEFT JOIN emits NULL rows for unmatched probe rows.
  • Serialization support for prepared statements (LogicalSpatialKNNJoin::Serialize/Deserialize).
  • Thread-safe: FlatRTree is built once in Finalize, read-only during Execute. Probe parallelism via DuckDB's standard pipeline parallelism.
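
The Hjaltason-Samet traversal named above can be sketched as a best-first search over bounding boxes. This is a minimal illustration on a toy tree, not the PR's code: `Box`, `Node`, `MinDistSq`, and `KNearest` are hypothetical names, and the node layout does not reflect how FlatRTree stores its entries. The sketch ranks leaf entries by bounding-box distance only, so for non-point geometries an exact-distance refinement pass would still be needed on its output.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical toy types for illustration; not the extension's FlatRTree layout.
struct Box { double min_x, min_y, max_x, max_y; };

struct Node {
    Box box;
    std::vector<std::size_t> children; // indices into the node array; empty for leaf entries
    long row_id;                       // payload for leaf entries, -1 for inner nodes
};

// Squared distance from a point to a box (0 if the point lies inside the box).
static double MinDistSq(double px, double py, const Box &b) {
    const double dx = std::max({b.min_x - px, 0.0, px - b.max_x});
    const double dy = std::max({b.min_y - py, 0.0, py - b.max_y});
    return dx * dx + dy * dy;
}

// Best-first traversal: always expand the queued item with the smallest lower bound.
// A leaf entry popped from the queue is guaranteed to be no farther (by box distance)
// than anything still queued, so results come out in ascending order.
static std::vector<long> KNearest(const std::vector<Node> &nodes, std::size_t root,
                                  double px, double py, std::size_t k) {
    using Item = std::pair<double, std::size_t>; // (lower bound, node index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> queue;
    queue.push({MinDistSq(px, py, nodes[root].box), root});

    std::vector<long> result;
    while (!queue.empty() && result.size() < k) {
        const std::size_t idx = queue.top().second;
        queue.pop();
        const Node &node = nodes[idx];
        if (node.children.empty()) {
            result.push_back(node.row_id); // next nearest entry by box distance
            continue;
        }
        for (std::size_t child : node.children) {
            queue.push({MinDistSq(px, py, nodes[child].box), child});
        }
    }
    return result;
}
```

Because inner nodes and leaf entries share one queue, a subtree is opened only when its lower bound is smaller than every pending candidate, which is what lets the search stop after k entries without visiting the whole tree.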

Explicitly out of scope (follow-up PRs)

  • Out-of-core / spill-to-disk — the FlatRTree is compact (~24 bytes/entry, so 10M entries = 240MB). In-memory covers virtually all real workloads. A spill-and-rebuild design (per-partition rebuild from spilled source batches, similar to SedonaDB's approach) is ready but will be a separate PR integrated with DuckDB's TemporaryMemoryManager.
  • Spheroid distance — planar only. Spheroidal KNN needs either ECEF coordinate transform or latitude-adaptive bbox overfetch to be correct at extreme latitudes. Per prior discussion, ST_DWithin_Spheroid in the regular spatial join is a better first step.
  • LATERAL syntax recognition: ST_KNN requires the optimizer rule. If disabled_optimizers='spatial_join', the query errors (same as SedonaDB's behavior). A LATERAL optimizer rule can be added when the core hook for lateral-join rewrites lands.

Files

src/spatial/modules/main/spatial_functions_scalar.cpp    |  87 ++  (ST_KNN function)
src/spatial/operators/CMakeLists.txt                     |   2 ++
src/spatial/operators/spatial_join_optimizer.cpp         | 163 ++  (rewrite rule)
src/spatial/operators/spatial_knn_join_logical.cpp       |  77 ++  (new)
src/spatial/operators/spatial_knn_join_logical.hpp       |  58 ++  (new)
src/spatial/operators/spatial_knn_join_physical.cpp      | 713 ++  (new)
src/spatial/operators/spatial_knn_join_physical.hpp      |  72 ++  (new)
src/spatial/operators/spatial_operator_extension.cpp     |   4 ++
src/spatial/util/knn_extract.hpp                         |  11 ++  (new)

Cross-engine validation

On the NYC benchmark (1.2M buildings × 109K hydrants, k=5):

| Engine | Time | Notes |
|---|---|---|
| DuckDB (this branch) | 2.01s | 867% CPU, pipeline-parallel |
| SedonaDB 0.3.0 | 5.74s | 136% CPU, mostly single-threaded |

300/300 identical (probe_id, distance) pairs between the two engines.

Verification

  • All existing spatial join tests green (133 assertions)
  • KNN-specific tests in the follow-up PR (115 assertions, 4 test files)
  • Built on GCC 11.4.0, no warnings

The Copy() method returns a unique_ptr<BindData> which must be
converted to unique_ptr<FunctionData>. Under GCC 11+ this
conversion requires explicit std::move() — newer compilers
reject the implicit conversion as a copy-initialization.

Fixes: error: could not convert 'copy' from
'unique_ptr<ST_Distance_Sphere::BindData>' to 'unique_ptr<FunctionData>'
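
A minimal sketch of the pattern this commit message describes. The two structs below are hypothetical stand-ins for DuckDB's `FunctionData` and `ST_Distance_Sphere::BindData`, the `radius` member is invented for illustration, and `std::make_unique` (C++14) is used for brevity, which may not match what the extension itself calls.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-in for the base class.
struct FunctionData {
    virtual ~FunctionData() = default;
    virtual std::unique_ptr<FunctionData> Copy() const = 0;
};

// Hypothetical stand-in for the derived bind data.
struct BindData : FunctionData {
    double radius = 0.0; // illustrative payload only

    std::unique_ptr<FunctionData> Copy() const override {
        auto copy = std::make_unique<BindData>(*this); // unique_ptr<BindData>
        // Explicit move into the converting constructor of unique_ptr<FunctionData>,
        // so the derived-to-base conversion does not rely on implicit move-on-return.
        return std::move(copy);
    }
};
```

Spelling out the move keeps the return statement valid regardless of how a given compiler treats the implicit conversion from a local `unique_ptr` of the derived type.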

Copilot AI left a comment


Pull request overview

Adds a new ST_KNN(geom1, geom2, k) join-marker function and corresponding optimizer + execution pipeline to support K-nearest-neighbor spatial joins in the DuckDB spatial extension.

Changes:

  • Introduces ST_KNN scalar stub + bind-time constant-k extraction helper.
  • Extends the spatial join optimizer to detect ST_KNN in JOIN ON and rewrite to a new logical KNN join operator (with optional filter wrapping for extra AND predicates on INNER joins).
  • Adds new logical + physical KNN join operators, and wires them into extension operator (de)serialization and build configuration.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

| File | Description |
|---|---|
| src/spatial/util/knn_extract.hpp | Declares helper for extracting constant k from bound ST_KNN bind data. |
| src/spatial/modules/main/spatial_functions_scalar.cpp | Registers ST_KNN and implements constant-k extraction helper. |
| src/spatial/operators/spatial_join_optimizer.cpp | Adds detection/rewrite rule for ST_KNN joins before existing spatial join rewrites. |
| src/spatial/operators/spatial_knn_join_logical.hpp / .cpp | Adds logical operator for KNN join with projection maps + serialization. |
| src/spatial/operators/spatial_knn_join_physical.hpp / .cpp | Adds physical operator that builds a FlatRTree and probes it using KNN search + refinement. |
| src/spatial/operators/spatial_operator_extension.cpp | Registers logical operator type deserialization for the new KNN join. |
| src/spatial/operators/flat_rtree.hpp | Introduces shared FlatRTree + scan state + KNN search state/implementation. |
| src/spatial/operators/spatial_join_physical.cpp | Switches spatial join to include shared FlatRTree header. |
| src/spatial/operators/CMakeLists.txt | Adds new KNN logical/physical sources to the build. |
| src/spatial/geometry/bbox.hpp | Adds MinDistanceSquared helpers used by KNN traversal. |


Comment thread src/spatial/operators/flat_rtree.hpp
Comment thread src/spatial/operators/spatial_operator_extension.cpp Outdated
Comment thread src/spatial/operators/spatial_join_optimizer.cpp
Comment thread src/spatial/operators/flat_rtree.hpp
Comment thread src/spatial/operators/spatial_knn_join_physical.cpp
Comment thread src/spatial/operators/spatial_knn_join_physical.cpp
Comment thread src/spatial/operators/spatial_knn_join_physical.cpp Outdated
Move FlatRTree from anonymous namespace in spatial_join_physical.cpp
to a shared header (flat_rtree.hpp) so it can be reused by the
upcoming KNN join operator.

Changes:
- Extract FlatRTree class to src/spatial/operators/flat_rtree.hpp
- Add KNNSearch method using Hjaltason-Samet priority queue traversal
- Add FlatRTreeKNNState for KNN search state management
- Add Box2D::MinDistanceSquared for KNN distance lower bound
- Use Allocator instead of BufferManager (in-memory only)
- No functional changes to existing SPATIAL_JOIN behavior
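
As a rough illustration of the distance lower bound mentioned in this commit, a box-to-box minimum squared distance could look like the sketch below. The struct and free function are hypothetical; the actual `Box2D::MinDistanceSquared` in bbox.hpp may have a different signature and member names.

```cpp
#include <algorithm>

// Hypothetical axis-aligned box; field names are assumptions.
struct Box2D {
    double min_x, min_y, max_x, max_y;
};

// Squared gap between two axis-aligned boxes; 0 when they touch or overlap.
// It never exceeds the true squared distance between geometries contained in
// the boxes, which is what makes it usable as a lower bound for KNN pruning.
inline double MinDistanceSquared(const Box2D &a, const Box2D &b) {
    const double dx = std::max({a.min_x - b.max_x, 0.0, b.min_x - a.max_x});
    const double dy = std::max({a.min_y - b.max_y, 0.0, b.min_y - a.max_y});
    return dx * dx + dy * dy;
}
```

Working with squared distances avoids a square root per comparison while preserving the ordering the priority queue needs.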
Adds a K-nearest neighbor spatial join to duckdb-spatial:

- ST_KNN(geom1, geom2, k) — scalar stub recognized by the optimizer
- Optimizer rewrites JOIN ON ST_KNN(...) into SPATIAL_KNN_JOIN
- Hjaltason-Samet priority queue KNN on FlatRTree
- Exact distance refinement with adaptive overfetch (2x, retry 8x)
- INNER and LEFT JOIN support
- JOO child swap handling via build_child_idx

Stripped for review (follow-up PRs):
- No out-of-core / spill-to-disk (tree always in memory)
- No spheroid support (planar distance only)
- No partitioning (single R-tree)

The FlatRTree is compact (~24 bytes/entry), so in-memory handles
up to ~100M entries (2.4GB tree) on commodity hardware.
@pierre-warnier
Author

Addressed all items in the amended commit (ae53272), including one genuine critical bug:

  1. compute_exact_distance returning 0.0 unconditionally — this was a real bug. Now calls sgl::ops::get_euclidean_distance (the same path ST_DWithin uses). Returns infinity for empty-geometry pairs so they rank last. Verified manually with polygon inputs:

    -- 10 polygons at x=0,2,4,...,18, probe at origin, k=3
    SELECT b.id, round(ST_Distance(p.geom, b.geom), 2) AS d
    FROM probe p JOIN build b ON ST_KNN(p.geom, b.geom, 3) ORDER BY d;
    -- id=0 d=0.0, id=1 d=1.5, id=2 d=3.5  ← correct true-distance ordering

    Our NYC point-benchmark masked this — for points, bbox distance equals true distance so the FlatRTree priority queue already returned correct order, and the broken refinement was a no-op. For polygons/linestrings this would have returned lower-bound-ordered candidates instead of the true k-nearest.

  2. Dead ternary false ? 0xFFFFFFFF : Count() — removed. Leftover scaffolding.

  3. std::sort → std::stable_sort (both occurrences). Tie order is now deterministic and matches the documented R-tree scan order.

  4. Malformed SerializationException format string — fixed the unterminated quote.

  5. KNN optimizer only handles LOGICAL_ANY_JOIN — acknowledged as a limitation. The existing SPATIAL_JOIN optimizer has the same scope; broadening both together is a reasonable follow-up. For now, ST_KNN AND equi_pred on a comparison join will error with the "cannot be used outside JOIN ON" message rather than silently skip the rewrite — which at least fails loudly rather than returning wrong results.
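
Items 1 and 3 can be illustrated together with a small sketch of the refinement step, under the assumption that candidates arrive in R-tree scan order with their exact distances already computed. `Candidate` and `RefineTopK` are hypothetical names, not identifiers from the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical candidate record: a build-side row and its exact distance to the probe.
struct Candidate {
    long row_id;
    double exact_dist;
};

// Keep the k closest candidates by exact distance. The input is assumed to be in
// scan order; std::stable_sort preserves that order among equal distances, so
// ties at the k-th position resolve deterministically.
inline std::vector<Candidate> RefineTopK(std::vector<Candidate> candidates, std::size_t k) {
    std::stable_sort(candidates.begin(), candidates.end(),
                     [](const Candidate &a, const Candidate &b) {
                         return a.exact_dist < b.exact_dist;
                     });
    if (candidates.size() > k) {
        candidates.erase(candidates.begin() + static_cast<std::ptrdiff_t>(k), candidates.end());
    }
    return candidates;
}
```

With plain `std::sort`, two candidates at the same distance could come out in either order from run to run or across standard library implementations; the stable variant ties the result to the input order instead.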
