Sync NEON optimized kernels to current upstream PR (#19119)#7
Merged
Sync NEON optimized kernels to current upstream PR (#19119)#7
Conversation
Replace the in-tree NEON grid_sampler_2d / sum implementation on polycam with the version that's gone through review on pytorch#19119. The diff is essentially a fast-forward of three rounds of upstream review feedback that we hadn't carried back here: 1. Hardware fp16 vs software fp16 split. The original polycam version compiled the entire op_grid_sampler_2d.cpp with -march=armv8.2-a+fp16, which raises SIGILL on ARMv8 chips that lack the +fp16 extension. The new version moves the hardware-fp16 path into a dedicated TU (op_grid_sampler_2d_fp16_hw.cpp + .h) compiled as a separate buck cxx_library / CMake OBJECT library so the -march flag stays scoped to that single TU. The dispatcher in op_grid_sampler_2d.cpp picks between the HW path and a software-convert NEON path at runtime via cpuinfo_has_arm_neon_fp16(). Plain ARMv8 chips run the SW path with no illegal instructions; ARMv8.2+ chips get the faster HW path. 2. Forward declaration of grid_sampler_2d_bilinear_fp16_hw extracted into op_grid_sampler_2d_fp16_hw.h so the definition site has a visible prototype and -Wmissing-prototypes builds stay green. 3. dtypes_match gate at the top of the fast-path eligibility check (input.scalar_type() == grid.scalar_type() == out.scalar_type()). The reinterpret_cast<__fp16*> calls in the HW kernel and the data_ptr<T>() calls in the other paths each assume a single dtype across all three tensors; mismatched buffers would silently corrupt output. Falls back to portable when types disagree. CMake side switches from set_source_files_properties on a single source to a dedicated OBJECT library (grid_sampler_2d_fp16_hw_impl) linked into optimized_kernels via $<BUILD_LOCAL_INTERFACE:>; mirrors the buck cxx_library of the same name. Verify binary (kernels/optimized/verify.cpp) and the preset.cmake DESCRIPTION quoting fix are unchanged — those are polycam-specific tooling, not part of the NEON sync. Build verified on macOS arm64 with EXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON; op_grid_sampler_2d_fp16_hw.cpp.o is archived into liboptimized_kernels.a as expected.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replaces the in-tree NEON
grid_sampler_2d/sumimplementation onpolycamwith the version that's been through three rounds of review on pytorch/executorch#19119. No new functionality — purely catching us up to the version that's about to land upstream.What changed vs the version on
polycamHardware fp16 vs software fp16 split. The original
polycamversion compiled the entireop_grid_sampler_2d.cppwith-march=armv8.2-a+fp16, which raises SIGILL on ARMv8 chips that lack the+fp16extension. The new version moves the HW-fp16 path into a dedicated TU (op_grid_sampler_2d_fp16_hw.cpp+.h) compiled as a separate buckcxx_library/ CMakeOBJECTlibrary so the-marchflag stays scoped to that one TU. The dispatcher inop_grid_sampler_2d.cpppicks between the HW path and a software-convert NEON path at runtime viacpuinfo_has_arm_neon_fp16(). Plain ARMv8 chips run the SW path with no illegal instructions; ARMv8.2+ chips get the faster HW path.Forward declaration of
grid_sampler_2d_bilinear_fp16_hwextracted intoop_grid_sampler_2d_fp16_hw.hso the definition site has a visible prototype (fixes-Wmissing-prototypes).dtypes_matchgate at the top of the fast-path eligibility check (input.scalar_type() == grid.scalar_type() == out.scalar_type()). Thereinterpret_cast<__fp16*>in the HW kernel and thedata_ptr<T>()calls in the other paths each assume a single dtype across all three tensors; mismatched buffers would silently corrupt output. Falls back to portable when types disagree.CMake side switches from
set_source_files_propertieson a single source to a dedicatedOBJECTlibrary (grid_sampler_2d_fp16_hw_impl) linked intooptimized_kernelsvia$<BUILD_LOCAL_INTERFACE:>. Mirrors the buckcxx_libraryof the same name.What is NOT touched
kernels/optimized/verify.cpp— polycam-specific on-device verification binary, stays as-is.tools/cmake/common/preset.cmake— polycam's DESCRIPTION quoting fix, stays as-is.Test plan
cmake --build cmake-out --target optimized_kernels(macOS arm64,EXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON).op_grid_sampler_2d_fp16_hw.cpp.oarchived intoliboptimized_kernels.a(verified viaar -t).verify_optimized_kernels) on a Pixel 9 to confirm fp16 HW path still produces matching outputs.🤖 Generated with Claude Code