Commit a5862bd

fix(where): gate x86-specific SIMD path on Sse41/Avx2 for ARM64 compatibility
CI failure on macos-latest (ARM64/Apple Silicon) reported 31 np.where tests throwing PlatformNotSupportedException at runtime:

    PlatformNotSupportedException: Operation is not supported on this platform.
       at System.Runtime.Intrinsics.X86.Sse41.ConvertToVector128Int64(Vector128`1 value)
       at IL_Where_Int64(...)

Root cause: the SIMD-emit path was gated only on `VectorBits >= 128`. On ARM64, `Vector128.IsHardwareAccelerated` is true (maps to Neon), so VectorBits is 128, and the kernel emits calls to Sse41/Avx2 byte-lane expansion intrinsics which are x86-only.

Breakdown of the byte-mask expansion path by element size:
- 1-byte (byte): portable Vector*.Load/GreaterThan, safe on any SIMD platform
- 2-byte: Sse41.ConvertToVector128Int16 / Avx2.ConvertToVector256Int16
- 4-byte: Sse41.ConvertToVector128Int32 / Avx2.ConvertToVector256Int32
- 8-byte: Sse41.ConvertToVector128Int64 / Avx2.ConvertToVector256Int64

Fix: in GenerateWhereKernelIL, compute `useV256`/`useV128` with an additional Sse41.IsSupported / Avx2.IsSupported guard, but only when elementSize > 1, since the 1-byte path is portable. If neither x86 intrinsic set is available for the required lane size, skip SIMD emission entirely; the scalar IL loop that follows handles correctness.

Also passes the useV256 decision to EmitWhereSIMDLoop explicitly instead of recomputing it from VectorBits inside the loop, which was both duplicative and ignored the IsSupported guard.

Result: on ARM64, byte-typed arrays still use Neon-backed SIMD; int/long/float/double/short fall back to the scalar IL kernel. On x86 nothing changes.

Verified: 180 np.where + np.asanyarray tests pass on Windows x64 (net8.0 + net10.0). ARM path awaits CI verification.
1 parent 21d7eec commit a5862bd

1 file changed

Lines changed: 15 additions & 8 deletions

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Where.cs

@@ -115,8 +115,17 @@ private static unsafe WhereKernel<T> GenerateWhereKernelIL<T>() where T : unmana
         {
             int elementSize = Unsafe.SizeOf<T>();

-            // Determine if we can use SIMD
-            bool canSimd = elementSize <= 8 && IsSimdSupported<T>();
+            // SIMD eligibility:
+            //   - 1-byte types (byte) only touch portable Vector128/Vector256 APIs, so they work
+            //     on any SIMD-capable platform (including ARM64/Neon).
+            //   - 2/4/8-byte types need Sse41.ConvertToVector128Int* (V128 path) or
+            //     Avx2.ConvertToVector256Int* (V256 path) to expand the bool-mask lanes.
+            //     These x86 intrinsics throw PlatformNotSupportedException on ARM64.
+            bool canSimdDtype = elementSize <= 8 && IsSimdSupported<T>();
+            bool needsX86 = elementSize > 1;
+            bool useV256 = VectorBits >= 256 && (!needsX86 || Avx2.IsSupported);
+            bool useV128 = !useV256 && VectorBits >= 128 && (!needsX86 || Sse41.IsSupported);
+            bool emitSimd = canSimdDtype && (useV256 || useV128);

             var dm = new DynamicMethod(
                 name: $"IL_Where_{typeof(T).Name}",
@@ -139,10 +148,9 @@ private static unsafe WhereKernel<T> GenerateWhereKernelIL<T>() where T : unmana
             il.Emit(OpCodes.Ldc_I8, 0L);
             il.Emit(OpCodes.Stloc, locI);

-            if (canSimd && VectorBits >= 128)
+            if (emitSimd)
             {
-                // Generate SIMD path
-                EmitWhereSIMDLoop<T>(il, locI);
+                EmitWhereSIMDLoop<T>(il, locI, useV256);
             }

             // Scalar loop for remainder
@@ -170,13 +178,12 @@ private static unsafe WhereKernel<T> GenerateWhereKernelIL<T>() where T : unmana
             return (WhereKernel<T>)dm.CreateDelegate(typeof(WhereKernel<T>));
         }

-        private static void EmitWhereSIMDLoop<T>(ILGenerator il, LocalBuilder locI) where T : unmanaged
+        private static void EmitWhereSIMDLoop<T>(ILGenerator il, LocalBuilder locI, bool useV256) where T : unmanaged
         {
             long elementSize = Unsafe.SizeOf<T>();
-            long vectorCount = VectorBits >= 256 ? (32 / elementSize) : (16 / elementSize);
+            long vectorCount = useV256 ? (32 / elementSize) : (16 / elementSize);
             long unrollFactor = 4;
             long unrollStep = vectorCount * unrollFactor;
-            bool useV256 = VectorBits >= 256;

             var locUnrollEnd = il.DeclareLocal(typeof(long));
             var locVectorEnd = il.DeclareLocal(typeof(long));
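
The eligibility logic in the first hunk can be exercised outside the IL generator. The following is a minimal sketch, not code from this commit: `WhereSimdGate` and `Choose` are hypothetical names, the hardware capabilities are passed in as parameters so an ARM64 machine can be simulated on x64, and the `IsSimdSupported<T>()` dtype check is left out. It also assumes that `VectorBits` is derived from `Vector256.IsHardwareAccelerated` / `Vector128.IsHardwareAccelerated`, which the commit message implies but the diff does not show.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class WhereSimdGate
{
    // Hypothetical helper (not part of NumSharp): same boolean logic as the diff,
    // with VectorBits and the x86 IsSupported flags supplied by the caller.
    public static (bool useV256, bool useV128) Choose(
        int elementSize, int vectorBits, bool hasSse41, bool hasAvx2)
    {
        bool needsX86 = elementSize > 1; // only the 1-byte path is portable
        bool useV256 = vectorBits >= 256 && (!needsX86 || hasAvx2);
        bool useV128 = !useV256 && vectorBits >= 128 && (!needsX86 || hasSse41);
        return (useV256, useV128);
    }

    public static void Main()
    {
        // Assumption: this is how VectorBits is computed; the diff does not show it.
        int vectorBits = Vector256.IsHardwareAccelerated ? 256
                       : Vector128.IsHardwareAccelerated ? 128
                       : 0;

        // Decision on the current machine. The IsSupported properties return
        // false on non-x86 hardware instead of throwing.
        foreach (int size in new[] { 1, 2, 4, 8 })
        {
            var (v256, v128) = Choose(size, vectorBits, Sse41.IsSupported, Avx2.IsSupported);
            Console.WriteLine($"elementSize={size}: useV256={v256}, useV128={v128}");
        }

        // Simulated ARM64: 128-bit vectors, no x86 intrinsics.
        Console.WriteLine(Choose(1, 128, false, false)); // (False, True): byte stays on SIMD
        Console.WriteLine(Choose(8, 128, false, false)); // (False, False): scalar fallback
    }
}
```

When both flags come back false, the caller skips SIMD emission and relies on the scalar loop alone, which matches the ARM64 behaviour the commit message describes for 2/4/8-byte types.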
