fix(where): gate x86-specific SIMD path on Sse41/Avx2 for ARM64 compatibility

Nucs · Nucs · commit a5862bd226f2 · 2026-04-20T21:11:22.000+03:00

CI failure on macos-latest (ARM64/Apple Silicon) reported 31 np.where tests
throwing PlatformNotSupportedException at runtime:

    PlatformNotSupportedException: Operation is not supported on this platform.
      at System.Runtime.Intrinsics.X86.Sse41.ConvertToVector128Int64(Vector128`1 value)
      at IL_Where_Int64(...)

Root cause: the SIMD-emit path was gated only on `VectorBits >= 128`. On ARM64,
`Vector128.IsHardwareAccelerated` is true (maps to Neon), so VectorBits is 128,
and the kernel emits calls to Sse41/Avx2 byte-lane expansion intrinsics which
are x86-only.
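
To see the mismatch on a given machine, the capability flags can be printed
side by side. This is a minimal diagnostic sketch and not part of the commit;
`SimdProbe` and `Describe` are hypothetical names, and only the standard
`IsHardwareAccelerated` / `IsSupported` properties are used.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper (not in NumSharp): reports which SIMD facilities the
// current process can actually use.
public static class SimdProbe
{
    public static string Describe() =>
        $"arch={RuntimeInformation.ProcessArchitecture} " +
        // Portable vector widths: true on x86 (SSE/AVX) and on ARM64 (Neon).
        $"V128={Vector128.IsHardwareAccelerated} " +
        $"V256={Vector256.IsHardwareAccelerated} " +
        // x86-only instruction sets: false on ARM64.
        $"Sse41={Sse41.IsSupported} Avx2={Avx2.IsSupported} " +
        // ARM-only instruction set: false on x86.
        $"AdvSimd={AdvSimd.IsSupported}";
}
```

Calling `Console.WriteLine(SimdProbe.Describe())` on Apple Silicon should show
`V128=True` together with `Sse41=False`, which is the combination the old gate
did not account for.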

Breakdown of the byte-mask expansion path by element size:
  - 1-byte (byte): portable Vector*.Load/GreaterThan — safe on any SIMD platform
  - 2-byte: Sse41.ConvertToVector128Int16 / Avx2.ConvertToVector256Int16
  - 4-byte: Sse41.ConvertToVector128Int32 / Avx2.ConvertToVector256Int32
  - 8-byte: Sse41.ConvertToVector128Int64 / Avx2.ConvertToVector256Int64
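
For illustration, the 8-byte case widens the low two mask bytes into two
64-bit lanes. The sketch below guards the x86 intrinsic with
`Sse41.IsSupported` and otherwise does the same widening in scalar code.
`MaskExpandDemo` and `ExpandLow2BytesToInt64` are hypothetical names; the real
kernel emits IL instead of calling a helper like this.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical demo (not in NumSharp) of the byte-lane expansion for 8-byte
// element types.
public static class MaskExpandDemo
{
    // Zero-extends the first two bytes of a 16-byte mask to two Int64 values.
    public static long[] ExpandLow2BytesToInt64(byte[] mask16)
    {
        if (mask16 is null || mask16.Length != 16)
            throw new ArgumentException("expected exactly 16 mask bytes", nameof(mask16));

        if (Sse41.IsSupported)
        {
            // x86 path: this intrinsic throws PlatformNotSupportedException
            // when called on a platform where IsSupported is false.
            Vector128<byte> v = Vector128.Create(mask16);
            Vector128<long> wide = Sse41.ConvertToVector128Int64(v);
            return new[] { wide.GetElement(0), wide.GetElement(1) };
        }

        // Scalar fallback: same result on any platform, including ARM64.
        return new long[] { mask16[0], mask16[1] };
    }
}
```

Both branches zero-extend, so a mask byte of 255 becomes 255 in the wide lane
rather than -1.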

Fix: in GenerateWhereKernelIL, compute `useV256`/`useV128` with an additional
Sse41.IsSupported / Avx2.IsSupported guard, but only when elementSize > 1,
since the 1-byte path is portable. If neither x86 intrinsic set is available
for the required lane size, skip SIMD emission entirely; the scalar IL loop
that follows handles correctness.
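
The gating decision can be sketched as a pure function so each platform
combination is easy to check. The boolean logic mirrors the diff in this
commit; the `WhereSimdGate` class, the `SimdPath` enum, and the parameters
standing in for `VectorBits`, `IsSimdSupported<T>()`, `Sse41.IsSupported` and
`Avx2.IsSupported` are hypothetical and exist only for this example.

```csharp
// Hypothetical standalone restatement of the gate in GenerateWhereKernelIL.
public static class WhereSimdGate
{
    public enum SimdPath { Scalar, V128, V256 }

    public static SimdPath Choose(
        int elementSize,      // Unsafe.SizeOf<T>() in the real code
        int vectorBits,       // stands in for VectorBits
        bool dtypeSupported,  // stands in for IsSimdSupported<T>()
        bool sse41,           // stands in for Sse41.IsSupported
        bool avx2)            // stands in for Avx2.IsSupported
    {
        bool canSimdDtype = elementSize <= 8 && dtypeSupported;

        // Only multi-byte element types need the x86 lane-expansion intrinsics.
        bool needsX86 = elementSize > 1;

        bool useV256 = vectorBits >= 256 && (!needsX86 || avx2);
        bool useV128 = !useV256 && vectorBits >= 128 && (!needsX86 || sse41);

        if (!canSimdDtype || !(useV256 || useV128))
            return SimdPath.Scalar; // scalar IL loop handles everything

        return useV256 ? SimdPath.V256 : SimdPath.V128;
    }
}
```

With ARM64-like inputs (`vectorBits: 128`, no Sse41, no Avx2),
`Choose(8, 128, true, false, false)` yields `Scalar` while
`Choose(1, 128, true, false, false)` yields `V128`.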

Also passes the useV256 decision to EmitWhereSIMDLoop explicitly instead of
recomputing it from VectorBits inside the loop; the recomputation duplicated
the check and ignored the IsSupported guard.

Result: on ARM64, byte-typed arrays still use Neon-backed SIMD; int/long/float/
double/short fall back to the scalar IL kernel. On x86 nothing changes.

Verified: 180 np.where + np.asanyarray tests pass on Windows x64 (net8.0 +
net10.0). ARM path awaits CI verification.

diff --git a/src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Where.cs b/src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Where.cs
@@ -115,8 +115,17 @@ private static unsafe WhereKernel<T> GenerateWhereKernelIL<T>() where T : unmana
         {
             int elementSize = Unsafe.SizeOf<T>();
 
-            // Determine if we can use SIMD
-            bool canSimd = elementSize <= 8 && IsSimdSupported<T>();
+            // SIMD eligibility:
+            //  - 1-byte types (byte) only touch portable Vector128/Vector256 APIs, so they work
+            //    on any SIMD-capable platform (including ARM64/Neon).
+            //  - 2/4/8-byte types need Sse41.ConvertToVector128Int* (V128 path) or
+            //    Avx2.ConvertToVector256Int* (V256 path) to expand the bool-mask lanes.
+            //    These x86 intrinsics throw PlatformNotSupportedException on ARM64.
+            bool canSimdDtype = elementSize <= 8 && IsSimdSupported<T>();
+            bool needsX86 = elementSize > 1;
+            bool useV256 = VectorBits >= 256 && (!needsX86 || Avx2.IsSupported);
+            bool useV128 = !useV256 && VectorBits >= 128 && (!needsX86 || Sse41.IsSupported);
+            bool emitSimd = canSimdDtype && (useV256 || useV128);
 
             var dm = new DynamicMethod(
                 name: $"IL_Where_{typeof(T).Name}",
@@ -139,10 +148,9 @@ private static unsafe WhereKernel<T> GenerateWhereKernelIL<T>() where T : unmana
             il.Emit(OpCodes.Ldc_I8, 0L);
             il.Emit(OpCodes.Stloc, locI);
 
-            if (canSimd && VectorBits >= 128)
+            if (emitSimd)
             {
-                // Generate SIMD path
-                EmitWhereSIMDLoop<T>(il, locI);
+                EmitWhereSIMDLoop<T>(il, locI, useV256);
             }
 
             // Scalar loop for remainder
@@ -170,13 +178,12 @@ private static unsafe WhereKernel<T> GenerateWhereKernelIL<T>() where T : unmana
             return (WhereKernel<T>)dm.CreateDelegate(typeof(WhereKernel<T>));
         }
 
-        private static void EmitWhereSIMDLoop<T>(ILGenerator il, LocalBuilder locI) where T : unmanaged
+        private static void EmitWhereSIMDLoop<T>(ILGenerator il, LocalBuilder locI, bool useV256) where T : unmanaged
         {
             long elementSize = Unsafe.SizeOf<T>();
-            long vectorCount = VectorBits >= 256 ? (32 / elementSize) : (16 / elementSize);
+            long vectorCount = useV256 ? (32 / elementSize) : (16 / elementSize);
             long unrollFactor = 4;
             long unrollStep = vectorCount * unrollFactor;
-            bool useV256 = VectorBits >= 256;
 
             var locUnrollEnd = il.DeclareLocal(typeof(long));
             var locVectorEnd = il.DeclareLocal(typeof(long));