Skip to content

Farthest Point Sampling#1696

Open
mnabian wants to merge 3 commits into
NVIDIA:mainfrom
mnabian:fps
Open

Farthest Point Sampling#1696
mnabian wants to merge 3 commits into
NVIDIA:mainfrom
mnabian:fps

Conversation

@mnabian
Copy link
Copy Markdown
Collaborator

@mnabian mnabian commented Jun 5, 2026

PhysicsNeMo Pull Request

Description

Adds farthest-point sampling (FPS) — a core point-cloud primitive the library was missing — as a FunctionSpec functional under nn/functional/geometry/, with a pure-PyTorch baseline and a fused, Warp-accelerated CUDA backend.

case torch (ms) warp (ms) speedup match
small-p1024-d3-k128 5.384 0.221 24.31× ok
medium-p4096-d3-k512 21.491 1.771 12.13× ok
large-p16384-d3-k1024 43.121 15.273 2.82× ok
batched-b4-p4096-d3-k512 21.407 1.772 12.08× ok

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@mnabian mnabian requested a review from loliverhennigh as a code owner June 5, 2026 01:02
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jun 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mnabian mnabian self-assigned this Jun 5, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 5, 2026

Greptile Summary

This PR adds farthest_point_sampling to physicsnemo.nn.functional, implementing greedy FPS for point clouds with a pure-PyTorch CPU fallback and a single-launch fused Warp CUDA kernel. It follows the existing FunctionSpec dispatch pattern established by MeshPoissonDiskSample and MeshToVoxelFraction.

  • Warp kernel (kernels.py): one block per cloud with strided lane ownership of min_dist avoids cross-lane aliasing; block-wide tile_max/tile_min reductions correctly implement argmax with smallest-index tie-breaking matching torch.argmax. Algorithm is correct for all valid num_samples ∈ [1, N].
  • Dispatch (farthest_point_sampling.py): auto-selects warp on CUDA, torch on CPU; explicit implementation overrides are validated against availability; torch.library.custom_op + register_fake enable torch.compile compatibility for the warp path.
  • Test suite (test_farthest_point_sampling.py): good coverage including known-answer collinear/outlier cases, batching, backend parity, opcheck, and fullgraph=True compile verification.

Important Files Changed

Filename Overview
physicsnemo/nn/functional/geometry/farthest_point_sampling/_warp_impl.py Warp-accelerated FPS using @torch.library.custom_op for torch.compile compatibility; follows the established pattern of unconditional warp import and wp.init() at module level.
physicsnemo/nn/functional/geometry/farthest_point_sampling/kernels.py Single-launch fused Warp kernel using tile_max/tile_min for block-wide argmax; algorithm is correct — strided lane ownership avoids cross-lane aliasing on min_dist, and the sentinel num_points correctly handles tie-breaking by smallest index.
physicsnemo/nn/functional/geometry/farthest_point_sampling/_torch_impl.py Pure-PyTorch FPS implementation; correct vectorized algorithm — inf-initialized min_dist with torch.minimum update prevents re-selection; Python for-loop will cause graph breaks under torch.compile (expected, documented).
physicsnemo/nn/functional/geometry/farthest_point_sampling/farthest_point_sampling.py FunctionSpec dispatch class with warp/torch backends; auto-dispatch correctly routes CUDA tensors to warp and CPU tensors to torch; compare_forward uses sorted set comparison which is weaker than order comparison for a deterministic algorithm.
physicsnemo/nn/functional/geometry/farthest_point_sampling/utils.py Input validation correctly normalizes 2D/3D inputs, enforces num_samples bounds, and returns was_unbatched flag.
test/nn/functional/geometry/test_farthest_point_sampling.py Comprehensive test suite with known-answer, parity, batching, opcheck, and compile tests; unconditional top-level import of private _warp_impl module means any warp load failure prevents collection of all tests including pure-torch ones.
benchmarks/physicsnemo/nn/functional/registry.py Correctly registers FarthestPointSampling in the benchmark FUNCTIONAL_SPECS tuple.
physicsnemo/nn/functional/geometry/farthest_point_sampling/init.py Clean package init, exports both class and functional form.

Comments Outside Diff (2)

  1. test/nn/functional/geometry/test_farthest_point_sampling.py, line 640-642 (link)

    P2 Unconditional top-level import of private _warp_impl module

    fps_warp_op is imported unconditionally at module collection time and is only used in test_fps_opcheck. If Warp fails to load for any reason (CUDA toolkit mismatch, compilation error), this import failure prevents pytest from collecting the entire test file — including tests like test_fps_higher_dim and test_fps_error_handling that only exercise the torch backend and don't need warp at all. Since torch.library.opcheck requires the raw registered op, consider moving this import inside test_fps_opcheck so a warp load failure only skips that one test.

  2. physicsnemo/nn/functional/geometry/farthest_point_sampling/farthest_point_sampling.py, line 459-468 (link)

    P2 compare_forward sorted-set comparison is weaker than order comparison for a deterministic algorithm

    FPS is fully deterministic for tie-free inputs and both backends are documented to traverse in the same order. Sorting before assert_close means this method would silently pass even if the two backends produce identical index sets but in different orders — which would indicate a tie-breaking divergence. Since the benchmark framework relies on compare_forward to validate backend parity, a direct (un-sorted) comparison would be a stricter and more informative check for tie-free inputs. If sorted comparison is intentional for robustness with near-tie float64 inputs, that's worth a comment clarifying the trade-off.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "add FPS" | Re-trigger Greptile

@mnabian
Copy link
Copy Markdown
Collaborator Author

mnabian commented Jun 5, 2026

/blossom-ci

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant