Three changes consolidated for review:
1. Move the forward declaration of grid_sampler_2d_bilinear_fp16_hw out
of op_grid_sampler_2d.cpp into a new header
kernels/optimized/cpu/op_grid_sampler_2d_fp16_hw.h. The function has
external linkage (the dispatcher in op_grid_sampler_2d.cpp calls it
across translation units), but until now no prototype was visible at
its definition site, which trips -Wmissing-prototypes on build
configurations that enable it. Both .cpp files now include the shared
header. The function body stays in op_grid_sampler_2d_fp16_hw.cpp
because that TU is the only one compiled with -march=armv8.2-a+fp16,
so it cannot be inlined into a header. The header itself uses void* for
the input/output buffers and is fp16-free, so callers don't need the
+fp16 march flag just to declare or call it.
2. Split the fp16 HW path into its own CMake target. Previously the
-march=armv8.2-a+fp16 flag was scoped per-source-file via
set_source_files_properties on the sole TU inside the optimized_kernels
library. That works for a clean non-LTO build, but with ThinLTO or
cross-TU optimizations the flag boundary becomes fuzzy and the fallback
path in op_grid_sampler_2d.cpp could in principle be auto-vectorized
into fp16 NEON instructions — exactly the SIGILL hazard the runtime
dispatch is meant to prevent. Build the file as an OBJECT library
(grid_sampler_2d_fp16_hw_impl) with a target-scoped -march flag and link
it into optimized_kernels via $<BUILD_LOCAL_INTERFACE:...>, so the
object code is baked into liboptimized_kernels.a at archive time and
the OBJECT target stays out of the install EXPORT set. This mirrors the
existing buck `grid_sampler_2d_fp16_hw_impl` cxx_library.
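A CMake sketch of the arrangement (target and path names follow the
description above; the exact source path and any extra properties on the
real targets are assumptions):

```cmake
# The fp16 HW kernel gets its own OBJECT target so the march flag is
# scoped to exactly one TU, even under ThinLTO.
add_library(grid_sampler_2d_fp16_hw_impl OBJECT
  kernels/optimized/cpu/op_grid_sampler_2d_fp16_hw.cpp)
target_compile_options(grid_sampler_2d_fp16_hw_impl
  PRIVATE -march=armv8.2-a+fp16)

# $<BUILD_LOCAL_INTERFACE:...> (CMake 3.26+) bakes the objects into
# liboptimized_kernels.a at archive time while keeping the OBJECT
# target out of the install EXPORT set.
target_link_libraries(optimized_kernels
  PRIVATE $<BUILD_LOCAL_INTERFACE:grid_sampler_2d_fp16_hw_impl>)
```

The target-scoped flag gives the optimizer a hard boundary: nothing
outside that one object file is ever compiled with +fp16 enabled.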
3. Gate the optimized fast paths on input/grid/out dtype match. Each fast
path assumes a single dtype across all three tensors:
   - fp32 NEON path: data_ptr<float>() on all three
   - fp16 HW path: void* pointers reinterpret_cast<__fp16*> on all three
   - fp16 SW NEON path: data_ptr<c10::Half>() on all three
Until now the dispatcher gated only on input.scalar_type(). The
reinterpret_casts in the fp16 HW kernel are particularly load-bearing
because on a mismatched dtype they would silently corrupt data
(reading int64/double bytes at __fp16 stride). The data_ptr<T>()
runtime dtype check exists but is not guaranteed in release
builds. Add a dtypes_match clause at the top of the fast-path
eligibility check that requires all three scalar types equal; fall
back to the portable kernel otherwise. The portable kernel handles
arbitrary dtype combinations correctly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>