LostBeard
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 7 additions & 0 deletions
diff --git a/ILGPU.Algorithms/ILGPU.Algorithms.csproj b/ILGPU.Algorithms/ILGPU.Algorithms.csproj
Lines changed: 1 addition & 1 deletion
diff --git a/ILGPU/Backends/OpenCL/CLBackend.cs b/ILGPU/Backends/OpenCL/CLBackend.cs
Lines changed: 5 additions & 3 deletions
diff --git a/ILGPU/HalfConversion.cs b/ILGPU/HalfConversion.cs
Lines changed: 43 additions & 10 deletions
diff --git a/ILGPU/HalfConversion.tt b/ILGPU/HalfConversion.tt
Lines changed: 43 additions & 10 deletions
diff --git a/ILGPU/ILGPU.csproj b/ILGPU/ILGPU.csproj
Lines changed: 1 addition & 1 deletion
diff --git a/SpawnDev.ILGPU.Demo.Shared/UnitTests/BackendTestBase.GenericPrecision.cs b/SpawnDev.ILGPU.Demo.Shared/UnitTests/BackendTestBase.GenericPrecision.cs
Lines changed: 54 additions & 0 deletions
diff --git a/SpawnDev.ILGPU/SpawnDev.ILGPU.csproj b/SpawnDev.ILGPU/SpawnDev.ILGPU.csproj
Lines changed: 4 additions & 4 deletions
@@ -2,6 +2,13 @@
 
 This file tracks notable changes per release. The README's "Recent Highlights" section links here for the full version history.
 
+## 4.14.0-local.3 (2026-06-17) - Half float→half is now IEEE round-to-nearest-even on every backend (was truncating)
+
+Fixes a real conversion-correctness + cross-backend-consistency bug in the most-used low-precision type, found by validating against the authoritative references (numpy.float16 / ml_dtypes.bfloat16). Forks bump to `2.0.30`.
+
+- **`ILGPU.Half` `float→half` now uses IEEE round-to-nearest-even (incl. proper subnormal rounding + overflow→Inf) on CPU + WebGPU + WebGL + Wasm** - bit-exact to `numpy.float16` / PyTorch / CUDA (`cvt.rn.f16.f32`) / OpenCL (`vstore_half`). **Before:** the managed conversion used the van der Zijp TABLE method which **truncates toward zero** (`HalfConversion.tt`: shift with no round bit), and the WebGPU/WebGL/Wasm emitters **truncated AND flushed every subnormal to signed zero**. That diverged from numpy/PyTorch in ~half of all values (truncation can be off by almost 1 ULP, where RNE is off by at most ½ ULP) AND from ILGPU's own CUDA/OpenCL backends (which were already round-to-nearest) - so a Half model produced different results on WebGPU vs CUDA. Replaced the managed conversion with a direct RNE bit-manip (mirrors the bf16/FP8 conversions) and rewrote the WGSL/GLSL `_f32_to_f16` + the Wasm `EmitF32ToF16` inline bytecode to match. CUDA/OpenCL unchanged (already correct). The "f16 emulation is lossless / matches numpy byte-for-byte" doc claims (which were false for encode) are corrected.
+- **Validated exhaustively:** new `DemoConsole -- bf16-f16-oracle` checks managed BFloat16 + Half vs `ml_dtypes.bfloat16` / `numpy.float16` over **all 65536 patterns** (decode + round-trip identity) + RNE/overflow/subnormal probes. `BFloat16`: bit-exact (decode 65536/65536, round-trip 65536/65536, probes 67503/67503) - was already correct. `Half`: now decode 65536/65536, round-trip 65536/65536, probes 64060/64060 (was ~32294/64060 - the subnormal region + RNE midpoints). Cross-backend gate: new PMT `Half_FloatToHalf_RoundToNearestEven` (kernel `(Half)x` over subnormals/midpoints/overflow/specials, bit-exact vs the managed=numpy reference) **9/0 all backend lanes**; existing Half suite `PMT_FILTER=Half` **204/0/8** (no regression).
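The difference between the old truncating behaviour and round-to-nearest-even can be seen on a handful of values near 1.0. The sketch below is illustrative only: `TruncateVsRneDemo`, `TruncateNormal`, and `RoundNearestEven` are hypothetical names, not ILGPU APIs, and .NET's `System.Half` cast is used as a stand-in RNE reference rather than the numpy oracle named above. `TruncateNormal` handles only inputs in the f16 normal range.

```csharp
using System;

// Hypothetical demo class; not part of ILGPU or SpawnDev.ILGPU.
public static class TruncateVsRneDemo
{
    // Truncating float->half, valid only for 2^-14 <= |x| < 65536:
    // drops the low 13 mantissa bits without looking at a round bit.
    public static ushort TruncateNormal(float x)
    {
        uint bits = BitConverter.SingleToUInt32Bits(x);
        uint sign = (bits >> 16) & 0x8000u;
        uint exp = ((bits >> 23) & 0xFFu) - 127u + 15u;   // rebias 127 -> 15
        uint mant = (bits & 0x7FFFFFu) >> 13;
        return (ushort)(sign | (exp << 10) | mant);
    }

    // Stand-in reference: System.Half's explicit conversion rounds to nearest, ties to even.
    public static ushort RoundNearestEven(float x) =>
        BitConverter.HalfToUInt16Bits((Half)x);

    public static void Main()
    {
        float[] probes =
        {
            1.0f,              // exact:            truncate 0x3C00, RNE 0x3C00
            1.000732421875f,   // 0.75 ULP above 1: truncate 0x3C00, RNE 0x3C01
            1.00048828125f,    // tie, even below:  truncate 0x3C00, RNE 0x3C00
            1.00146484375f,    // tie, odd below:   truncate 0x3C01, RNE 0x3C02
        };
        foreach (float x in probes)
            Console.WriteLine(
                $"{x:R}: truncate 0x{TruncateNormal(x):X4}, RNE 0x{RoundNearestEven(x):X4}");
    }
}
```

Two of the four probes differ between the two modes, which is the "~half of all values" effect the entry describes.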
+
 ## 4.14.0-local.2 (2026-06-17) - Float8E4M3 is now bit-exact to float8_e4m3fn (overflow → NaN), saturating opt-in
 
 `Float8E4M3` float→fp8 conversion changed from saturating to the `fn` (`float8_e4m3fn`) convention as the DEFAULT, matching the dtype it is named after. Forks bump to `2.0.29`.
 
@@ -12,7 +12,7 @@
          SpawnDev.ILGPU.Fork* PackageReference Versions inside SpawnDev.ILGPU.csproj.
          Run `_check-fork-version-sync.bat` at repo root. See the banner comment in
          SpawnDev.ILGPU.csproj for the full procedure. -->
-    <Version>2.0.29</Version>
+    <Version>2.0.30</Version>
     <IsPackable>true</IsPackable>
     <GeneratePackageOnBuild>true</GeneratePackageOnBuild>
   </PropertyGroup>
 
@@ -97,9 +97,11 @@ public CLBackend(
             // which AscendingHalf / DescendingHalf radix-sort encodings depend on. The
             // hardware path uses `as_short(half)` directly when shader-fp16 is on; the
             // emulated path calls these helpers instead. They are tiny, no-op when unused,
-            // and let the OpenCL compiler optimize out the call when inlined. Mirrors WGSL's
-            // _f32_to_f16 / _f16_to_f32 byte-for-byte (denormals flush to signed zero,
-            // overflow clamps exp to 31 with mantissa preserved so NaN stays NaN).
+            // and let the OpenCL compiler optimize out the call when inlined. This helper is the
+            // radix FloatAsInt(Half) bit-encoder - its inputs are already representable Half values
+            // (widened to f32), so the encoding is exact regardless of rounding mode. General
+            // float->half conversion on OpenCL goes through vstore_half (IEEE round-to-nearest, like
+            // CUDA's cvt.rn and the managed/WGSL/GLSL/Wasm RNE path as of 4.14.0).
             if (!Capabilities.Float16Native)
             {
                 extensionBuilder.AppendLine();
 
@@ -332,22 +332,55 @@ public static float ConvertHalfToFloat(Half halfValue)
         }
 
         /// <summary>
-        /// Converts a float value to a half value by using van der Zijp's algorithm.
+        /// Converts a float value to a half value using IEEE round-to-nearest-even. Matches
+        /// numpy.float16 / PyTorch / CUDA (cvt.rn.f16.f32) / OpenCL (vstore_half) bit-for-bit -
+        /// including subnormals and overflow-to-Inf. (Replaced the van der Zijp TABLE method, which
+        /// truncated toward zero and flushed sub-smallest-normal values incorrectly - it diverged
+        /// from every ML reference and from CUDA/OpenCL; verified via DemoConsole -- bf16-f16-oracle.)
+        /// f16: 1 sign / 5 exponent / 10 mantissa, bias 15; has +-Inf + NaN; max normal 65504.
         /// </summary>
         /// <param name="floatValue">The value to convert.</param>
         /// <returns>The converted half value.</returns>
         [MethodImpl(MethodImplOptions.AggressiveInlining)]
         public static Half ConvertFloatToHalf(float floatValue)
         {
-            uint rawValue = Interop.FloatAsInt(floatValue);
-            uint rawUpperValue = rawValue >> 23;
-
-            uint baseEntry = BaseTable[rawUpperValue];
-            int shiftAmount = ShiftTable[rawUpperValue];
-            uint mantissaOffset = rawValue & 0x7FFFFF;
-
-            uint result = baseEntry + (mantissaOffset >> shiftAmount);
-            return new Half((ushort)result);
+            uint bits = Interop.FloatAsInt(floatValue);
+            uint sign = (bits >> 16) & 0x8000u;          // f16 sign bit (bit 15)
+            uint rest = bits & 0x7FFFFFFFu;
+
+            // NaN or Inf input.
+            if (rest >= 0x7F800000u)
+                return new Half((ushort)(sign | (rest > 0x7F800000u ? 0x7E00u : 0x7C00u))); // NaN : Inf
+
+            int e = (int)((rest >> 23) & 0xFFu) - 127;   // unbiased f32 exponent
+            uint f32Mant = rest & 0x7FFFFFu;
+
+            if (e > 15)                                  // overflow -> +-Inf
+                return new Half((ushort)(sign | 0x7C00u));
+
+            if (e < -14)
+            {
+                // Subnormal or zero. f16 subnormal value = mant * 2^-24.
+                if (e < -25)
+                    return new Half((ushort)sign);       // below half the smallest subnormal -> +-0
+                uint signif = f32Mant | 0x800000u;       // implicit leading 1 (24-bit significand)
+                int shift = (-14 - e) + 13;              // align to the 10-bit subnormal field (in [14,24])
+                uint m = signif >> shift;
+                uint roundBit = (signif >> (shift - 1)) & 1u;
+                uint sticky = (signif & ((1u << (shift - 1)) - 1u)) != 0u ? 1u : 0u;
+                if (roundBit == 1u && (sticky == 1u || (m & 1u) == 1u))
+                    m += 1u;                             // RNE; may carry to 0x400 = smallest normal (correct)
+                return new Half((ushort)(sign | m));
+            }
+
+            // Normal. Rebias and round the mantissa 23 -> 10 bits (RNE).
+            uint mant10 = f32Mant >> 13;
+            uint round = (f32Mant >> 12) & 1u;
+            uint sticky2 = (f32Mant & 0xFFFu) != 0u ? 1u : 0u;
+            uint outBits = ((uint)(e + 15) << 10) | mant10;
+            if (round == 1u && (sticky2 == 1u || (mant10 & 1u) == 1u))
+                outBits += 1u;                           // RNE; may carry into the exponent, up to Inf
+            return new Half((ushort)(sign | outBits));
         }
 
         #endregion
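A standalone way to sanity-check the rounding logic above, without ILGPU or the numpy oracle, is to port it onto `BitConverter` and compare against .NET's `System.Half`. This is a minimal sketch under stated assumptions: `HalfRneCheck`, `FloatToHalfBits`, and `CountMismatches` are hypothetical names, the port returns raw `ushort` bits instead of `ILGPU.Half`, and `System.Half` stands in for the reference. It covers every finite f16 value (round-trip) and every midpoint between adjacent finite f16 values (ties), not the full f32 input space.

```csharp
using System;

// Hypothetical harness; FloatToHalfBits is a standalone port of the method in this diff.
public static class HalfRneCheck
{
    public static ushort FloatToHalfBits(float value)
    {
        uint bits = BitConverter.SingleToUInt32Bits(value);
        uint sign = (bits >> 16) & 0x8000u;
        uint rest = bits & 0x7FFFFFFFu;
        if (rest >= 0x7F800000u)                          // NaN or Inf input
            return (ushort)(sign | (rest > 0x7F800000u ? 0x7E00u : 0x7C00u));
        int e = (int)(rest >> 23) - 127;                  // unbiased f32 exponent
        uint mant = rest & 0x7FFFFFu;
        if (e > 15) return (ushort)(sign | 0x7C00u);      // overflow -> +-Inf
        if (e < -14)                                      // f16 subnormal or zero
        {
            if (e < -25) return (ushort)sign;             // rounds to +-0
            uint signif = mant | 0x800000u;               // restore implicit leading 1
            int shift = (-14 - e) + 13;                   // in [14, 24]
            uint m = signif >> shift;
            uint roundBit = (signif >> (shift - 1)) & 1u;
            bool sticky = (signif & ((1u << (shift - 1)) - 1u)) != 0u;
            if (roundBit == 1u && (sticky || (m & 1u) == 1u)) m++;
            return (ushort)(sign | m);
        }
        uint outBits = ((uint)(e + 15) << 10) | (mant >> 13);
        bool round = ((mant >> 12) & 1u) == 1u;
        bool sticky2 = (mant & 0xFFFu) != 0u;
        if (round && (sticky2 || (outBits & 1u) == 1u)) outBits++;   // may carry up to Inf
        return (ushort)(sign | outBits);
    }

    // Counts disagreements with System.Half over (a) every finite f16 value widened to f32
    // and (b) the f32 midpoint between each pair of adjacent same-sign finite f16 values.
    public static int CountMismatches()
    {
        int bad = 0;
        for (int h = 0; h < 0x10000; h++)
        {
            if ((h & 0x7C00) == 0x7C00) continue;         // skip Inf/NaN patterns
            float lo = (float)BitConverter.UInt16BitsToHalf((ushort)h);
            if (FloatToHalfBits(lo) != h) bad++;          // round-trip identity
            if ((h & 0x7FFF) == 0x7BFF) continue;         // max finite: no finite neighbour
            float hi = (float)BitConverter.UInt16BitsToHalf((ushort)(h + 1));
            float mid = (lo + hi) * 0.5f;                 // exactly representable in f32
            if (FloatToHalfBits(mid) != BitConverter.HalfToUInt16Bits((Half)mid)) bad++;
        }
        return bad;
    }

    public static void Main() =>
        Console.WriteLine($"mismatches vs System.Half: {CountMismatches()}");
}
```

The midpoints matter because they are the only inputs where ties-to-even and ties-away-from-zero disagree; a round-trip check alone would also pass for a truncating encoder.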
 
@@ -220,22 +220,55 @@ namespace ILGPU
         }
 
         /// <summary>
-        /// Converts a float value to a half value by using van der Zijp's algorithm.
+        /// Converts a float value to a half value using IEEE round-to-nearest-even. Matches
+        /// numpy.float16 / PyTorch / CUDA (cvt.rn.f16.f32) / OpenCL (vstore_half) bit-for-bit -
+        /// including subnormals and overflow-to-Inf. (Replaced the van der Zijp TABLE method, which
+        /// truncated toward zero and flushed sub-smallest-normal values incorrectly - it diverged
+        /// from every ML reference and from CUDA/OpenCL; verified via DemoConsole -- bf16-f16-oracle.)
+        /// f16: 1 sign / 5 exponent / 10 mantissa, bias 15; has +-Inf + NaN; max normal 65504.
         /// </summary>
         /// <param name="floatValue">The value to convert.</param>
         /// <returns>The converted half value.</returns>
         [MethodImpl(MethodImplOptions.AggressiveInlining)]
         public static Half ConvertFloatToHalf(float floatValue)
         {
-            uint rawValue = Interop.FloatAsInt(floatValue);
-            uint rawUpperValue = rawValue >> <#= FloatMantissaBits #>;
-
-            uint baseEntry = BaseTable[rawUpperValue];
-            int shiftAmount = ShiftTable[rawUpperValue];
-            uint mantissaOffset = rawValue & <#= $"0x{FloatMantissaMask:X}" #>;
-
-            uint result = baseEntry + (mantissaOffset >> shiftAmount);
-            return new Half((ushort)result);
+            uint bits = Interop.FloatAsInt(floatValue);
+            uint sign = (bits >> 16) & 0x8000u;          // f16 sign bit (bit 15)
+            uint rest = bits & 0x7FFFFFFFu;
+
+            // NaN or Inf input.
+            if (rest >= 0x7F800000u)
+                return new Half((ushort)(sign | (rest > 0x7F800000u ? 0x7E00u : 0x7C00u))); // NaN : Inf
+
+            int e = (int)((rest >> 23) & 0xFFu) - 127;   // unbiased f32 exponent
+            uint f32Mant = rest & 0x7FFFFFu;
+
+            if (e > 15)                                  // overflow -> +-Inf
+                return new Half((ushort)(sign | 0x7C00u));
+
+            if (e < -14)
+            {
+                // Subnormal or zero. f16 subnormal value = mant * 2^-24.
+                if (e < -25)
+                    return new Half((ushort)sign);       // below half the smallest subnormal -> +-0
+                uint signif = f32Mant | 0x800000u;       // implicit leading 1 (24-bit significand)
+                int shift = (-14 - e) + 13;              // align to the 10-bit subnormal field (in [14,24])
+                uint m = signif >> shift;
+                uint roundBit = (signif >> (shift - 1)) & 1u;
+                uint sticky = (signif & ((1u << (shift - 1)) - 1u)) != 0u ? 1u : 0u;
+                if (roundBit == 1u && (sticky == 1u || (m & 1u) == 1u))
+                    m += 1u;                             // RNE; may carry to 0x400 = smallest normal (correct)
+                return new Half((ushort)(sign | m));
+            }
+
+            // Normal. Rebias and round the mantissa 23 -> 10 bits (RNE).
+            uint mant10 = f32Mant >> 13;
+            uint round = (f32Mant >> 12) & 1u;
+            uint sticky2 = (f32Mant & 0xFFFu) != 0u ? 1u : 0u;
+            uint outBits = ((uint)(e + 15) << 10) | mant10;
+            if (round == 1u && (sticky2 == 1u || (mant10 & 1u) == 1u))
+                outBits += 1u;                           // RNE; may carry into the exponent, up to Inf
+            return new Half((ushort)(sign | outBits));
         }
 
         #endregion
 
@@ -16,7 +16,7 @@
          check on push. Skipping (b) means consumers ship rc.N still pulling old Fork
          transitively and the fix is invisible. See the banner comment in
          SpawnDev.ILGPU.csproj for the full procedure. -->
-    <Version>2.0.29</Version>
+    <Version>2.0.30</Version>
     <IsPackable>true</IsPackable>
     <GeneratePackageOnBuild>true</GeneratePackageOnBuild>
   </PropertyGroup>
 
@@ -173,6 +173,60 @@ private static void Float8OverflowKernel(Index1D i,
             sat[i] = global::ILGPU.Float8E4M3.FromSingleSaturating(x[i]);   // opt-in saturating
         }
 
+        // ILGPU.Half float->half must be round-to-nearest-even on EVERY backend (Geordi 2026-06-17,
+        // bf16/Half oracle validation). The managed conversion is bit-exact to numpy.float16 (verified
+        // DemoConsole -- bf16-f16-oracle, all 65536 patterns); this proves the WebGPU/WebGL/Wasm
+        // emitters match it - they previously TRUNCATED + flushed all subnormals to zero (diverging
+        // from numpy AND from CUDA/OpenCL which use native round-to-nearest). Subnormals + overflow +
+        // RNE midpoints are the point.
+        private static void HalfConvertKernel(Index1D i,
+            ArrayView1D<float, Stride1D.Dense> x, ArrayView1D<global::ILGPU.Half, Stride1D.Dense> y) =>
+            y[i] = (global::ILGPU.Half)x[i];
+
+        [TestMethod]
+        public async Task Half_FloatToHalf_RoundToNearestEven() => await RunTest(async accelerator =>
+        {
+            float[] inputs =
+            {
+                1f, 1.5f, -2.5f, 100.3f, 0.333f, 1024.7f,                 // normals
+                1.00048828125f, 1.0014648438f, 0.99975586f,              // normal RNE midpoints near 1.0
+                65504f, 65519f, 65520f, 65535f, 70000f, 1e30f,           // overflow region -> 65504 / Inf
+                float.PositiveInfinity, float.NegativeInfinity, -65504f, -70000f,
+                (float)Math.Pow(2, -24), (float)Math.Pow(2, -23), (float)Math.Pow(2, -15), // exact subnormals/boundary
+                (float)Math.Pow(2, -25),                                  // tie -> +0 (even)
+                (float)(Math.Pow(2, -25) * 1.5),                          // -> smallest subnormal
+                (float)Math.Pow(2, -26),                                  // -> +0
+                -2.9831426E-08f,                                          // the original failing case -> -smallest subnormal
+                0f, -0f, float.NaN, 5.96e-8f, 1e-7f,
+            };
+            int n = inputs.Length;
+            var expected = new ushort[n];
+            for (int i = 0; i < n; i++)
+            {
+                var v = (global::ILGPU.Half)inputs[i];   // managed = bit-exact numpy.float16 (oracle-proven)
+                expected[i] = System.Runtime.CompilerServices.Unsafe.As<global::ILGPU.Half, ushort>(ref v);
+            }
+
+            using var inBuf = accelerator.Allocate1D(inputs);
+            using var outBuf = accelerator.Allocate1D<global::ILGPU.Half>(n);
+            var k = accelerator.LoadAutoGroupedStreamKernel<Index1D,
+                ArrayView1D<float, Stride1D.Dense>, ArrayView1D<global::ILGPU.Half, Stride1D.Dense>>(HalfConvertKernel);
+            k(n, inBuf.View, outBuf.View);
+            await accelerator.SynchronizeAsync();
+            var got = await outBuf.CopyToHostAsync<global::ILGPU.Half>();
+
+            for (int i = 0; i < n; i++)
+            {
+                ushort g = System.Runtime.CompilerServices.Unsafe.As<global::ILGPU.Half, ushort>(ref got[i]);
+                bool bothNaN = (g & 0x7C00) == 0x7C00 && (g & 0x03FF) != 0
+                    && (expected[i] & 0x7C00) == 0x7C00 && (expected[i] & 0x03FF) != 0;
+                if (!bothNaN && g != expected[i])
+                    throw new Exception(
+                        $"float->Half kernel @{i} ({BackendName}): input {inputs[i]} -> got 0x{g:X4}, " +
+                        $"want 0x{expected[i]:X4} (must be round-to-nearest-even incl subnormals, matching numpy/managed).");
+            }
+        });
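The subnormal and overflow probes in the test above each have one expected encoding under round-to-nearest-even. The sketch below works them out on the host using .NET's `System.Half` as a stand-in reference (an assumption; the test itself compares against the managed `ILGPU.Half` conversion). `SubnormalProbeDemo` and `Bits` are hypothetical names, and `MathF.ScaleB` is used so the powers of two are exact.

```csharp
using System;

// Hypothetical demo class; not part of ILGPU or SpawnDev.ILGPU.
public static class SubnormalProbeDemo
{
    // f16 bit pattern of x under System.Half's round-to-nearest-even conversion.
    public static ushort Bits(float x) => BitConverter.HalfToUInt16Bits((Half)x);

    public static void Main()
    {
        float smallest = MathF.ScaleB(1f, -24);       // smallest f16 subnormal
        float tie = MathF.ScaleB(1f, -25);            // exactly halfway between 0 and 2^-24

        Console.WriteLine($"0x{Bits(smallest):X4}");             // 0x0001
        Console.WriteLine($"0x{Bits(tie):X4}");                  // 0x0000 (tie -> even)
        Console.WriteLine($"0x{Bits(tie * 1.5f):X4}");           // 0x0001 (above the tie)
        Console.WriteLine($"0x{Bits(MathF.ScaleB(1f, -26)):X4}"); // 0x0000 (below the tie)
        Console.WriteLine($"0x{Bits(-2.9831426E-08f):X4}");      // 0x8001 (just above 2^-25)
        Console.WriteLine($"0x{Bits(65519f):X4}");               // 0x7BFF (65504)
        Console.WriteLine($"0x{Bits(65520f):X4}");               // 0x7C00 (tie -> +Inf)
    }
}
```

A truncating encoder that also flushes subnormals would return a signed zero for every subnormal probe here and `0x7BFF` for `65520f`, which is why these inputs separate the two behaviours.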
+
         [TestMethod]
         public async Task Float8E4M3_FromSingleFn_OverflowToNaN() => await RunTest(async accelerator =>
         {
 
@@ -4,9 +4,9 @@
 		<TargetFramework>net10.0</TargetFramework>
 		<ImplicitUsings>enable</ImplicitUsings>
 		<Nullable>enable</Nullable>
-		<Version>4.14.0-local.2</Version>
+		<Version>4.14.0-local.3</Version>
 		<!-- Brief current-version highlights only. Full per-version history with code samples lives in CHANGELOG.md (linked from the README). -->
-		<PackageReleaseNotes>4.14.0 makes Float8E4M3 bit-exact to PyTorch/JAX/ml_dtypes float8_e4m3fn (the dtype it is named after): the cast operator and the IR-level convert now use the fn convention - finite overflow AND +-Inf map to NaN (was: saturate to +-448) - verified against the ml_dtypes oracle and on all 6 backends. The saturating (NVIDIA TE / OCP) cast is available opt-in via Float8E4M3.FromSingleSaturating / FromSingle(x, saturate: true). 4.13.2 is a packaging fix over 4.13.1: removes stray Wasm/repro JSON files that the Razor SDK swept into the package, and bundles the precompiled-shaders precompiler tool (tools/) that 4.13.0/4.13.1 were missing. The 4.13.x line brings full low-precision floating-point support across ALL 6 backends (CPU, OpenCL, WebGPU, WebGL, Wasm, CUDA): Half, BFloat16, and FP8 (Float8E4M3 + Float8E5M2) - including FP8 radix-sort keys (4.13.1) - plus generic INumber&lt;T&gt; mixed-precision kernels, PrecisionConvert, and bf16/FP8 portability to pre-Ampere CUDA cards (GTX 1080 / RTX 2060). Full per-version history with code samples: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
+		<PackageReleaseNotes>4.14.0 fixes low-precision float CONVERSION CORRECTNESS against the references. (1) Half (float-&gt;half) is now IEEE round-to-nearest-even on every backend (CPU + WebGPU + WebGL + Wasm), bit-exact to numpy.float16 / PyTorch / CUDA / OpenCL - it previously truncated toward zero and flushed subnormals to zero (diverging from numpy AND from CUDA/OpenCL). (2) Float8E4M3 is now bit-exact to PyTorch/JAX/ml_dtypes float8_e4m3fn: the cast + IR convert use the fn convention (finite overflow AND +-Inf -&gt; NaN; was saturate to +-448); saturating is opt-in via Float8E4M3.FromSingleSaturating. BFloat16 was already bit-exact to ml_dtypes.bfloat16 (verified). All validated exhaustively against ml_dtypes/numpy oracles + cross-backend PMT gates. 4.13.2 is a packaging fix over 4.13.1: removes stray Wasm/repro JSON files that the Razor SDK swept into the package, and bundles the precompiled-shaders precompiler tool (tools/) that 4.13.0/4.13.1 were missing. The 4.13.x line brings full low-precision floating-point support across ALL 6 backends (CPU, OpenCL, WebGPU, WebGL, Wasm, CUDA): Half, BFloat16, and FP8 (Float8E4M3 + Float8E5M2) - including FP8 radix-sort keys (4.13.1) - plus generic INumber&lt;T&gt; mixed-precision kernels, PrecisionConvert, and bf16/FP8 portability to pre-Ampere CUDA cards (GTX 1080 / RTX 2060). Full per-version history with code samples: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
 		<GeneratePackageOnBuild>True</GeneratePackageOnBuild>
 		<GenerateDocumentationFile>true</GenerateDocumentationFile>
 		<EmbedAllSources>true</EmbedAllSources>
@@ -65,8 +65,8 @@
 		<ProjectReference Include="..\ILGPU.Algorithms\ILGPU.Algorithms.csproj" />
 	</ItemGroup>
 	<ItemGroup Condition="!Exists('$(MSBuildThisFileDirectory)..\ILGPU\ILGPU.csproj')">
-		<PackageReference Include="SpawnDev.ILGPU.Fork" Version="2.0.29" />
-		<PackageReference Include="SpawnDev.ILGPU.Algorithms.Fork" Version="2.0.29" />
+		<PackageReference Include="SpawnDev.ILGPU.Fork" Version="2.0.30" />
+		<PackageReference Include="SpawnDev.ILGPU.Algorithms.Fork" Version="2.0.30" />
 	</ItemGroup>
 
 	<ItemGroup>