
Commit b384173

jgibson2 and claude authored
optimized: add grid_sampler_2d.out (NEON) and sum.IntList_out (Vectorized<float>) (#19119)
## Summary

Two new optimized CPU kernels registered alongside the existing `optimized_kernels` library. Both replace the portable reference kernel (still available as fallback for unsupported inputs) with vectorized implementations that accumulate in fp32, which also sidesteps the fp16 precision issue noted in #19117 for `grid_sampler_2d` bilinear.

Measured end-to-end on a real depth model (Pixel 9 / arm64-v8a, fp16 inputs, shapes representative of the model's hot path):

| Op | Portable | This PR | Speedup |
|---|---:|---:|---:|
| `grid_sampler_2d.out` | 17.3 ms | **3.4 ms** | **5.1×** |
| `sum.IntList_out` (5 calls, aggregate) | 3.0 ms | **0.56 ms** | **5.4×** |

## `grid_sampler_2d.out`

aarch64 NEON, bilinear + zeros padding only (the dominant mode for depth / MVS / spatial transformer networks). Processes 4 channels per iteration with a vectorized FMA chain. fp16 inputs are promoted to fp32 for weight computation and accumulation, then cast back to fp16 on store — the portable kernel's fp16 weight subtractions like `(ix_se - ix)` otherwise suffer catastrophic cancellation (same concern as #19117). Unsupported modes and non-aarch64 targets delegate to the portable kernel.

## `sum.IntList_out`

`at::vec::Vectorized<float>`-based implementation of the single-dim reduction fast path (both innermost-contiguous and strided cases). Cross-architecture SIMD via PyTorch's existing vector abstraction; always accumulates in fp32 regardless of input dtype. Multi-dim reductions, dtype-converting reductions, and complex types delegate to portable.

## Integration

- Sources added to `OPTIMIZED_KERNELS_SRCS` in `build_variables.bzl` and to `OPTIMIZED_ATEN_OPS` in `op_registration_util.bzl`. Single source of truth for both Buck and CMake builds.
- `optimized.yaml` registers the ops with the standard `opt_*` naming convention used by sibling kernels.
- `kernels/optimized/CMakeLists.txt` scopes the `-march=armv8.2-a+fp16` flag to just the hardware-fp16 translation unit (`op_grid_sampler_2d_fp16_hw.cpp`, built as a separate OBJECT library), so x86_64 builds are unaffected. The kernel has `#ifdef __aarch64__` guards and falls through to portable on non-arm64 targets.

## Test plan

- [x] Builds cleanly for Android arm64-v8a, Android x86_64 (via `scripts/build_android_library.sh`), and host (macOS / Apple Clang 21).
- [x] Existing `kernels/test/op_grid_sampler_2d_test.cpp` and `op_sum_test.cpp` unit tests continue to pass — both target the `aten::sum_outf` / `aten::grid_sampler_2d_outf` codegen-dispatched entry points, so they automatically exercise the optimized kernels when linked.
- [x] Numerical verification against an fp32 reference (run portable in fp32, cast to fp16) on the shapes the polycam depth model uses — all cases pass within fp16 ULP.
- [x] End-to-end Pixel 9 latency on a representative trained model matches the handwritten-NEON reference implementation to within run-to-run noise while producing more accurate fp16 outputs (fp32 accumulation).

Candidate successor to #19117 for the grid_sampler half — applies the same precision fix but at the optimized-kernel layer, so callers who link `optimized_ops_lib` get both the correctness fix and the speedup.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
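For context on the `grid_sampler_2d.out` section above, here is a purely illustrative scalar sketch of the bilinear math it describes. The function name, signature, and loop below are invented for this example and are not the PR's kernel source; the actual kernel vectorizes the channel loop with NEON FMAs, four channels per iteration. The point shown is that the interpolation weights and the accumulator stay in fp32 even when `scalar_t` is fp16:

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical sketch: bilinear sampling with zeros padding, weights and
// accumulation in fp32. With fp16 weights, subtractions like (ix_se - ix)
// lose precision, which is the concern the PR description cites.
template <typename scalar_t>
float sample_bilinear_zeros(
    const scalar_t* img,  // one [H, W] channel plane
    int64_t H, int64_t W,
    float ix, float iy) {  // unnormalized source coords, already fp32
  const int64_t ix_nw = static_cast<int64_t>(std::floor(ix));
  const int64_t iy_nw = static_cast<int64_t>(std::floor(iy));
  const int64_t ix_se = ix_nw + 1;
  const int64_t iy_se = iy_nw + 1;

  // fp32 corner weights (area of the opposite sub-rectangle).
  const float nw = (ix_se - ix) * (iy_se - iy);
  const float ne = (ix - ix_nw) * (iy_se - iy);
  const float sw = (ix_se - ix) * (iy - iy_nw);
  const float se = (ix - ix_nw) * (iy - iy_nw);

  auto at = [&](int64_t y, int64_t x) -> float {
    // zeros padding: out-of-bounds taps contribute 0
    if (x < 0 || x >= W || y < 0 || y >= H) return 0.0f;
    return static_cast<float>(img[y * W + x]);
  };

  // fp32 accumulation; the caller casts back to fp16 on store.
  return at(iy_nw, ix_nw) * nw + at(iy_nw, ix_se) * ne +
         at(iy_se, ix_nw) * sw + at(iy_se, ix_se) * se;
}
```

Likewise, a hedged sketch of the `sum.IntList_out` fast path described above, written against the `at::vec::Vectorized<float>` abstraction the summary names (header path, function name, and signature are assumptions for illustration; only the load-wide, accumulate-in-fp32, reduce-lanes, finish-the-tail structure reflects the description):

```cpp
#include <ATen/cpu/vec/vec.h>
#include <cstdint>

// Hypothetical sketch of the innermost-contiguous single-dim sum: accumulate
// full SIMD lanes in fp32, then reduce the lanes and handle the scalar tail.
// Other input dtypes would be converted to fp32 before this loop.
float sum_contiguous_fp32(const float* data, int64_t n) {
  using Vec = at::vec::Vectorized<float>;
  Vec acc(0.0f);
  int64_t i = 0;
  for (; i + Vec::size() <= n; i += Vec::size()) {
    acc = acc + Vec::loadu(data + i);
  }
  // Horizontal reduction of the SIMD accumulator.
  float lanes[Vec::size()];
  acc.store(lanes);
  float sum = 0.0f;
  for (int k = 0; k < Vec::size(); ++k) {
    sum += lanes[k];
  }
  // Scalar tail.
  for (; i < n; ++i) {
    sum += data[i];
  }
  return sum;
}
```

Both sketches mirror the delegation rule stated above: anything outside the fast path (other padding or interpolation modes, multi-dim or dtype-converting reductions, complex types) stays on the portable kernel.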
1 parent d767516 · commit b384173

10 files changed

Lines changed: 969 additions & 0 deletions

File tree

kernels/optimized/CMakeLists.txt

Lines changed: 32 additions & 0 deletions
@@ -75,6 +75,38 @@ target_link_libraries(
   kernels_util_all_deps
 )
 target_compile_options(optimized_kernels PUBLIC ${_common_compile_options})
+
+# op_grid_sampler_2d_fp16_hw.cpp uses hardware fp16 NEON intrinsics
+# (vcvt_f32_f16 / vld1_f16). Those are part of the ARMv8.2-a+fp16 extension and
+# raise SIGILL on chips without it. Build it as a separate OBJECT library so the
+# `-march=armv8.2-a+fp16` flag stays strictly scoped to that translation unit
+# and never reaches the dispatcher / fallback code in op_grid_sampler_2d.cpp
+# (which would otherwise risk auto-vectorizing into fp16 NEON instructions). The
+# dispatcher chooses between this entry point and the fp16 software-convert path
+# at runtime via cpuinfo_has_arm_neon_fp16(). Mirrors the buck
+# `grid_sampler_2d_fp16_hw_impl` library.
+if(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|arm64" OR ANDROID_ABI STREQUAL
+                                                     "arm64-v8a"
+)
+  add_library(
+    grid_sampler_2d_fp16_hw_impl OBJECT
+    ${EXECUTORCH_ROOT}/kernels/optimized/cpu/op_grid_sampler_2d_fp16_hw.cpp
+  )
+  target_compile_options(
+    grid_sampler_2d_fp16_hw_impl PRIVATE -march=armv8.2-a+fp16
+    ${_common_compile_options}
+  )
+  target_link_libraries(grid_sampler_2d_fp16_hw_impl PRIVATE executorch_core)
+  # BUILD_LOCAL_INTERFACE: object files are baked into optimized_kernels.a at
+  # archive time, so this OBJECT target stays out of the install EXPORT set and
+  # downstream consumers of the installed optimized_kernels need no separate
+  # link entry.
+  target_link_libraries(
+    optimized_kernels
+    PRIVATE $<BUILD_LOCAL_INTERFACE:grid_sampler_2d_fp16_hw_impl>
+  )
+endif()
+
 # Build a library for _optimized_kernels_srcs
 #
 # optimized_ops_lib: Register optimized ops kernels into Executorch runtime
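The comment in the diff above describes a runtime choice between the hardware-fp16 entry point and a software-convert fallback. As a minimal sketch of that dispatch shape only: the function names below are hypothetical (the real dispatcher lives in `op_grid_sampler_2d.cpp`), while `cpuinfo_initialize()` and `cpuinfo_has_arm_neon_fp16()` are the cpuinfo calls the comment names.

```cpp
#include <cpuinfo.h>

// Hypothetical stand-ins for the two entry points: the hardware variant is the
// one compiled in the separately-flagged object library, the software variant
// converts fp16 <-> fp32 without fp16 NEON instructions.
void grid_sampler_2d_fp16_hw() { /* hardware fp16 NEON path */ }
void grid_sampler_2d_fp16_sw() { /* fp16 <-> fp32 software-convert fallback */ }

// Query the CPU once via cpuinfo and fall back when ARMv8.2 fp16 NEON is
// unavailable, so the fp16-hw object code is never reached on older chips.
void grid_sampler_2d_fp16_dispatch() {
#if defined(__aarch64__)
  if (cpuinfo_initialize() && cpuinfo_has_arm_neon_fp16()) {
    grid_sampler_2d_fp16_hw();
    return;
  }
#endif
  grid_sampler_2d_fp16_sw();
}
```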
