LostBeard
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 8 additions & 0 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 1 addition & 1 deletion
diff --git a/Docs/data-type-support.md b/Docs/data-type-support.md
Lines changed: 22 additions & 6 deletions
diff --git a/ILGPU.Algorithms/ILGPU.Algorithms.csproj b/ILGPU.Algorithms/ILGPU.Algorithms.csproj
Lines changed: 1 addition & 1 deletion
diff --git a/ILGPU/Float8E4M3.cs b/ILGPU/Float8E4M3.cs
Lines changed: 62 additions & 6 deletions
diff --git a/ILGPU/ILGPU.csproj b/ILGPU/ILGPU.csproj
Lines changed: 1 addition & 1 deletion
diff --git a/SpawnDev.ILGPU.Demo.Shared/UnitTests/BackendTestBase.GenericPrecision.cs b/SpawnDev.ILGPU.Demo.Shared/UnitTests/BackendTestBase.GenericPrecision.cs
Lines changed: 48 additions & 0 deletions
@@ -2,6 +2,14 @@
 
 This file tracks notable changes per release. The README's "Recent Highlights" section links here for the full version history.
 
+## 4.14.0-local.1 (2026-06-17) - Float8E4M3 selectable overflow convention (float8_e4m3fn parity)
+
+Additive new API on `Float8E4M3` (forks bump to `2.0.28`). No change to any existing behavior - the cast operator is unchanged.
+
+- **Validated `Float8E4M3` / `Float8E5M2` conversions against the `ml_dtypes` reference** (the implementation PyTorch / JAX `float8_e4m3fn` / `float8_e5m2` share) with a new evidence harness `DemoConsole -- fp8-oracle` (generators in `_research/fp8_oracle/`). Result: decode is bit-exact (0 divergences across all 256 byte values) and encode rounding/subnormal handling is bit-exact (0 divergences across 1099 E4M3 / 723 E5M2 probes) for both types. The **only** divergence was E4M3 finite overflow: ILGPU saturated to ±448 (the NVIDIA Transformer Engine / OCP saturating cast), whereas the dtype literally named `e4m3fn` overflows to **NaN**. Both are real-world conventions; they agree everywhere except `|x| > 464` (the region that rounds up past the 448 slot).
+- **Made the overflow convention selectable.** The bare cast `(Float8E4M3)x`, `Float8E4M3.FromSingleSaturating(x)`, and `FromSingle(x, saturate: true)` keep the **saturating** behavior (finite overflow → ±448, ±Inf → NaN). New **`Float8E4M3.FromSingleFn(x)`** / `FromSingle(x, saturate: false)` use the **fn** convention (finite overflow AND ±Inf → NaN), **bit-exact to PyTorch / JAX / ml_dtypes `float8_e4m3fn`** - use it for reference-matching ML (loading/comparing FP8 checkpoints). `FromSingleFn` is composed only of existing intrinsics (compare + the saturating cast + Neg + cast-of-NaN), so it transpiles with **no per-backend conversion codegen** and is bit-exact on all 6 backends.
+- Gates: `DemoConsole -- fp8-oracle` (managed `FromSingleFn` 1099/1099 vs `float8_e4m3fn`; saturating cast's 62 overflow points reported as the documented convention) + `fp8-verify` desktop kernel (`FromSingleFn` 24/24 bit-exact on CPU/OpenCL/CUDA) + **PMT `Float8E4M3_FromSingleFn_OverflowToNaN` 9/0 across all backend lanes** (CPU/CUDA/OpenCL/WebGPU/WebGPU-NoSubgroups/WebGL/Wasm). No regression to existing FP8/bf16/Half gates. `Float8E5M2` already matched its reference (overflow → ±Inf); its canonical NaN byte is `0x7F` (ml_dtypes uses `0x7E` - both valid NaN patterns).
+
 ## 4.13.2 (2026-06-16) - Packaging fix (no code changes)
 
 Wrapper-package-only fix over 4.13.1 (forks unchanged at `2.0.27`). No library/runtime behavior changed.
 
@@ -147,7 +147,7 @@ If total > 10: `InvalidOperationException` at dispatch time (v4.9.1+). Before v4
 
 `ArrayView<byte>`, `ArrayView<sbyte>`, `ArrayView<short>`, `ArrayView<ushort>`, `ArrayView<Half>` (ILGPU.Half), `ArrayView<BFloat16>` (ILGPU.BFloat16) supported on all 6 backends.
 
-**Use `ILGPU.Half`, NOT `System.Half`** in kernel signatures. Implicit conversion operators exist for interop. Same for **`ILGPU.BFloat16`** (the "brain float": top 16 bits of an fp32, so fp32's full dynamic range - the ML-weights trade vs `Half`) and the two FP8 types **`ILGPU.Float8E4M3`** (forward/inference, no Inf, sat ±448) + **`ILGPU.Float8E5M2`** (backward/gradient, IEEE Inf/NaN). bf16/FP8 detail: [Docs/data-type-support.md](Docs/data-type-support.md). On CUDA bf16 + FP8 use an f32-register-compute model (no native PTX bf16/fp8 arithmetic); the load/store conversion is **PORTABLE bit-manipulation (basic integer ops on every CUDA arch incl. pre-Ampere)** - 4.13.0+ replaced the sm_80-only `cvt.*.bf16` shortcut that broke on older cards. The browser/OpenCL/Wasm backends emulate the same exact conversion, byte-identical to CUDA.
+**Use `ILGPU.Half`, NOT `System.Half`** in kernel signatures. Implicit conversion operators exist for interop. Same for **`ILGPU.BFloat16`** (the "brain float": top 16 bits of an fp32, so fp32's full dynamic range - the ML-weights trade vs `Half`) and the two FP8 types **`ILGPU.Float8E4M3`** (forward/inference, no Inf; **selectable overflow**: the cast/`FromSingleSaturating` clamps to ±448 = NVIDIA TE/OCP, `FromSingleFn` → NaN = bit-exact PyTorch/JAX `float8_e4m3fn`) + **`ILGPU.Float8E5M2`** (backward/gradient, IEEE Inf/NaN). bf16/FP8 detail: [Docs/data-type-support.md](Docs/data-type-support.md). On CUDA bf16 + FP8 use an f32-register-compute model (no native PTX bf16/fp8 arithmetic); the load/store conversion is **PORTABLE bit-manipulation (basic integer ops on every CUDA arch incl. pre-Ampere)** - 4.13.0+ replaced the sm_80-only `cvt.*.bf16` shortcut that broke on older cards. The browser/OpenCL/Wasm backends emulate the same exact conversion, byte-identical to CUDA.
 
 **Per-backend implementation:**
 - **WebGPU:** Packed into `array<atomic<u32>>`. Load via atomicLoad + shift + mask. Store via atomicAnd + atomicOr (thread-safe sub-word writes). Float16 load/store calls `_f16_to_f32` / `_f32_to_f16` helpers from `WGSLEmulationLibrary.F16Functions` when `!shader-f16`; native WGSL `f16` type otherwise. `WebGPUBackend.ForceEmulatedF16` test flag forces the emulation path for verification.
 
@@ -164,9 +164,9 @@ NaN-preservation guard. Values compute as f32 everywhere; only the storage is 2-
 6 backends.**
 
 - **`Float8E4M3`** - 1 sign / 4 exponent / 3 mantissa, bias 7. The "E4M3FN" finite variant: **no
-  infinities** (the only non-finite value is NaN at `0x7F`/`0xFF`), max finite magnitude **448**, finite
-  overflow **saturates** to ±448. The FP8 **forward / inference** format (one extra mantissa bit vs E5M2,
-  at the cost of range).
+  infinities** (the only non-finite value is NaN at `0x7F`/`0xFF`), max finite magnitude **448**. The
+  overflow convention is **selectable** (see the convention note below). The FP8 **forward / inference**
+  format (one extra mantissa bit vs E5M2, at the cost of range).
 - **`Float8E5M2`** - 1 sign / 5 exponent / 2 mantissa, bias 15. IEEE-754-style: **has infinities and
   NaNs** (like fp16 but with 8 fewer mantissa bits). The FP8 **backward / gradient** format (fp16-class
   dynamic range, which gradients need).
@@ -187,9 +187,25 @@ backend** (CPU-verified idempotence 0/256 for all representable values).
 | **CUDA** | f32-register model. The FP8<->f32 conversion is **inline PTX bit-manipulation** (branchless `setp`/`selp`, unrolled normalize) using only basic integer ops - FP8 has no portable native PTX cvt (`cvt.*.e4m3` is sm_89/Hopper only), so this works on every CUDA arch. Load = `ld.global.u8` + convert; store = convert + `st.global.u8`. |
 | **CPU** | Native - the managed `Float8E4M3`/`Float8E5M2` structs run directly. |
 
-> **Convention note (E4M3 overflow):** out-of-range *inputs* to `Float8E4M3` saturate finite overflow to
-> ±448 and map ±Inf -> NaN (the OCP / NVIDIA Transformer Engine saturating-forward convention). Only the
-> out-of-range input behavior is convention-dependent; every *representable* value round-trips exactly.
+> **Convention note (E4M3 overflow) - SELECTABLE.** E4M3 has two real-world overflow conventions and
+> both are exposed; the conversion is otherwise **bit-exact** to the `ml_dtypes` reference (the
+> implementation PyTorch / JAX `float8_e4m3fn` share), as verified by `DemoConsole -- fp8-oracle`:
+> decode has 0 divergences across 256 byte values, and encode rounding/subnormal has 0 divergences
+> across 1099 probes.
+>
+> | Entry point | Finite overflow | ±Inf | Matches |
+> |---|---|---|---|
+> | `(Float8E4M3)x` cast / `FromSingleSaturating(x)` / `FromSingle(x, saturate: true)` | clamps to ±448 | → NaN | NVIDIA Transformer Engine default cast / OCP saturating-forward |
+> | `FromSingleFn(x)` / `FromSingle(x, saturate: false)` | → NaN | → NaN | **PyTorch / JAX / ml_dtypes `float8_e4m3fn`** (bit-exact) |
+>
+> The two agree everywhere except `|x| > 464` (the region that rounds up past the 448 slot): saturating
+> gives ±448, fn gives NaN. Every *representable* value round-trips exactly under both. `FromSingleFn` is
+> composed only of existing intrinsics (compare + the saturating cast + Neg + cast-of-NaN), so it
+> transpiles and is bit-exact on **all 6 backends** (PMT `Float8E4M3_FromSingleFn_OverflowToNaN`). Use
+> `FromSingleFn` for reference-matching ML (e.g. loading/comparing PyTorch FP8 checkpoints); use the
+> saturating cast when you want overflow clamped rather than NaN-poisoning a downstream reduction.
+>
+> `Float8E5M2` is IEEE-754-style (has ±Inf): overflow → ±Inf, bit-exact to `float8_e5m2` (decode: 0
+> divergences across 256 byte values; encode: 723/723 probes match); its canonical NaN byte is `0x7F`
+> (ml_dtypes uses `0x7E` - both are valid NaN patterns).
 
 ### Sub-Word Usage Notes
 
 
@@ -12,7 +12,7 @@
          SpawnDev.ILGPU.Fork* PackageReference Versions inside SpawnDev.ILGPU.csproj.
          Run `_check-fork-version-sync.bat` at repo root. See the banner comment in
          SpawnDev.ILGPU.csproj for the full procedure. -->
-    <Version>2.0.27</Version>
+    <Version>2.0.28</Version>
     <IsPackable>true</IsPackable>
     <GeneratePackageOnBuild>true</GeneratePackageOnBuild>
   </PropertyGroup>
 
@@ -10,12 +10,18 @@
 // training recipe (E4M3 forward, E5M2 backward): it trades dynamic range for an extra mantissa
 // bit vs E5M2, which is what forward activations/weights want.
 //
-// CONVENTION (flagged for ML-oracle confirmation, plan §9 risk #2 - confirm vs PyTorch
-// float8_e4m3fn / NVIDIA Transformer Engine when wired into the ML lane): finite overflow
-// SATURATES to +-448; a real +-Inf input maps to NaN (E4M3 has no Inf); NaN -> NaN. This
-// matches the OCP/TE saturating-forward convention. Only the out-of-range INPUT behavior is
-// convention-dependent; every REPRESENTABLE value round-trips exactly (verified by the CPU
-// idempotence harness, `DemoConsole -- fp8-verify`).
+// OVERFLOW CONVENTION (verified vs the ml_dtypes reference, `DemoConsole -- fp8-oracle` -
+// ml_dtypes is the impl PyTorch / JAX float8_e4m3fn share). E4M3 has two real-world conventions
+// and BOTH are selectable here; the conversion is otherwise bit-exact to the reference (decode
+// 0/256, encode rounding/subnormal 0 divergences across 1099 probes):
+//   * SATURATING (the bare cast operator + FromSingleSaturating): finite overflow clamps to
+//     +-448; +-Inf -> NaN; NaN -> NaN. Matches the NVIDIA Transformer Engine default cast /
+//     OCP saturating-forward mode. Avoids NaN propagation when activations overflow unscaled.
+//   * fn / non-saturating (FromSingleFn): finite overflow AND +-Inf -> NaN; NaN -> NaN. Bit-
+//     exact to PyTorch/JAX/ml_dtypes float8_e4m3fn (the dtype this layout is named after). Use
+//     this for reference-matching ML. The two conventions agree everywhere except |x|>464 (the
+//     region that rounds up past the 448 slot): saturating gives +-448, fn gives NaN.
+// Every REPRESENTABLE value round-trips exactly under both (CPU idempotence harness fp8-verify).
 //
 // Modeled on ILGPU.Half / BFloat16 / Float8E5M2: FP32-based [MathIntrinsic]/[CompareIntrinisc]/
 // [ConvertIntrinisc] operators (transpiled on every backend). 1-byte storage.
@@ -61,6 +67,35 @@ namespace ILGPU
         [MethodImpl(MethodImplOptions.AggressiveInlining)]
         public static bool IsFinite(Float8E4M3 value) => Float8E4M3Extensions.IsFinite(value);
 
+        /// <summary>
+        /// Converts a float to E4M3 with a selectable overflow convention. When
+        /// <paramref name="saturate"/> is true (the library's default convention, matching the cast
+        /// operator): finite overflow clamps to +-448 and +-Inf maps to NaN (NVIDIA Transformer
+        /// Engine / OCP saturating cast). When false: finite overflow and +-Inf map to NaN,
+        /// bit-exact to PyTorch/JAX/ml_dtypes float8_e4m3fn.
+        /// </summary>
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        public static Float8E4M3 FromSingle(float value, bool saturate) =>
+            saturate ? Float8E4M3Extensions.ConvertFloatToFloat8E4M3(value)
+                     : Float8E4M3Extensions.FromSingleFn(value);
+
+        /// <summary>
+        /// Converts a float to E4M3 using the SATURATING convention: finite overflow clamps to
+        /// +-448; +-Inf -> NaN; NaN -> NaN. Identical to the explicit cast operator. Matches the
+        /// NVIDIA Transformer Engine default cast / OCP saturating-forward mode.
+        /// </summary>
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        public static Float8E4M3 FromSingleSaturating(float value) =>
+            Float8E4M3Extensions.ConvertFloatToFloat8E4M3(value);
+
+        /// <summary>
+        /// Converts a float to E4M3 using the fn (non-saturating) convention: finite overflow AND
+        /// +-Inf map to NaN; NaN -> NaN. Bit-exact to PyTorch/JAX/ml_dtypes float8_e4m3fn - use
+        /// this for reference-matching ML. Differs from the saturating cast only for |value|>464.
+        /// </summary>
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        public static Float8E4M3 FromSingleFn(float value) =>
+            Float8E4M3Extensions.FromSingleFn(value);
+
         #endregion
 
         #region Constants
@@ -324,6 +359,27 @@ public static Float8E4M3 ConvertFloatToFloat8E4M3(float value)
             return new Float8E4M3((byte)(sign | (outBits & 0x7Fu)));
         }
 
+        /// <summary>
+        /// Converts a float to E4M3 using the fn (float8_e4m3fn) convention: finite overflow and
+        /// +-Inf map to NaN (NOT saturation); NaN -> NaN. Bit-exact to PyTorch / JAX / ml_dtypes
+        /// (verified, <c>DemoConsole -- fp8-oracle</c>). Composed only of existing intrinsics
+        /// (compare, the saturating cast, Neg, cast-of-NaN) so it transpiles on every backend with
+        /// no per-backend conversion codegen.
+        /// </summary>
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        public static Float8E4M3 FromSingleFn(float value)
+        {
+            // |value| <= 464 rounds to <= 448 (bit-exact to the reference) via the saturating base
+            // convert; |value| > 464 is the round-up-past-448 region -> NaN. A NaN input fails both
+            // ordered compares (NaN > / < are false) and falls through to the base convert, which
+            // already maps NaN -> NaN. +-Inf trip the compares -> signed NaN.
+            if (value > 464.0f)
+                return (Float8E4M3)float.NaN;          // +overflow / +Inf -> +NaN (0x7F)
+            if (value < -464.0f)
+                return -(Float8E4M3)float.NaN;         // -overflow / -Inf -> -NaN (0xFF)
+            return (Float8E4M3)value;
+        }
+
         #endregion
 
         #region Predicates
 
@@ -16,7 +16,7 @@
          check on push. Skipping (b) means consumers ship rc.N still pulling old Fork
          transitively and the fix is invisible. See the banner comment in
          SpawnDev.ILGPU.csproj for the full procedure. -->
-    <Version>2.0.27</Version>
+    <Version>2.0.28</Version>
     <IsPackable>true</IsPackable>
     <GeneratePackageOnBuild>true</GeneratePackageOnBuild>
   </PropertyGroup>
 
@@ -157,5 +157,53 @@ private async Task RunGenericPrecision<T>(Func<float, T> toT, Func<T, float> toF
                         "must transpile + marshal correctly.");
             }
         });
+
+        // Float8E4M3.FromSingleFn (the float8_e4m3fn convention: finite overflow AND +-Inf -> NaN,
+        // NOT saturation - Geordi 2026-06-17, from the FP8 ML-oracle validation). FromSingleFn is
+        // composed only of existing intrinsics (compare + the saturating cast + Neg + cast-of-NaN)
+        // so it must transpile and produce the SAME byte as the managed fn result (which fp8-oracle
+        // proved bit-exact to ml_dtypes/PyTorch float8_e4m3fn) on EVERY backend. The overflow region
+        // is the point: a correct kernel emits NaN (0x7F/0xFF) there, not 0x7E/0xFE (+-448).
+        private static void Float8FromSingleFnKernel(Index1D i,
+            ArrayView1D<float, Stride1D.Dense> x, ArrayView1D<global::ILGPU.Float8E4M3, Stride1D.Dense> y) =>
+            y[i] = global::ILGPU.Float8E4M3.FromSingleFn(x[i]);
+
+        [TestMethod]
+        public async Task Float8E4M3_FromSingleFn_OverflowToNaN() => await RunTest(async accelerator =>
+        {
+            float[] inputs =
+            {
+                480f, 512f, 1000f, 1e30f, float.PositiveInfinity,        // +overflow -> +NaN
+                -480f, -512f, -1e30f, float.NegativeInfinity,            // -overflow -> -NaN
+                448f, 449f, 463f, 464f, -448f, -464f,                    // round-to-448 region (finite)
+                1f, 1.25f, 256f, -2.5f, 0.5f, 0.001953125f, 0f, -0f, float.NaN,
+            };
+            int n = inputs.Length;
+            var expected = new byte[n];
+            for (int i = 0; i < n; i++)
+            {
+                var v = global::ILGPU.Float8E4M3.FromSingleFn(inputs[i]);   // managed = proven reference
+                expected[i] = System.Runtime.CompilerServices.Unsafe.As<global::ILGPU.Float8E4M3, byte>(ref v);
+            }
+
+            using var inBuf = accelerator.Allocate1D(inputs);
+            using var outBuf = accelerator.Allocate1D<global::ILGPU.Float8E4M3>(n);
+            var k = accelerator.LoadAutoGroupedStreamKernel<Index1D,
+                ArrayView1D<float, Stride1D.Dense>, ArrayView1D<global::ILGPU.Float8E4M3, Stride1D.Dense>>(
+                Float8FromSingleFnKernel);
+            k(n, inBuf.View, outBuf.View);
+            await accelerator.SynchronizeAsync();
+            var got = await outBuf.CopyToHostAsync<global::ILGPU.Float8E4M3>();
+
+            for (int i = 0; i < n; i++)
+            {
+                byte g = System.Runtime.CompilerServices.Unsafe.As<global::ILGPU.Float8E4M3, byte>(ref got[i]);
+                bool bothNaN = (g & 0x7F) == 0x7F && (expected[i] & 0x7F) == 0x7F;   // NaN-slot tolerant
+                if (!bothNaN && g != expected[i])
+                    throw new Exception(
+                        $"FromSingleFn kernel @{i} ({BackendName}): input {inputs[i]} -> got 0x{g:X2}, " +
+                        $"want 0x{expected[i]:X2} (fn: overflow/+-Inf must be NaN, not saturated +-448).");
+            }
+        });
     }
 }
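
To make the two overflow conventions concrete outside ILGPU, here is a minimal standalone sketch. It is not the library's implementation: it assumes a brute-force nearest-value search (ties go to the even code) in place of the bit-manipulation convert in `ConvertFloatToFloat8E4M3`, and the names `E4M3Sketch`, `Decode`, and `Encode` are hypothetical. Giving a NaN input's sign bit to the resulting NaN byte is also an assumption of this sketch, not something the diff specifies.

```csharp
using System;

// Hypothetical standalone model of E4M3 (1 sign / 4 exponent / 3 mantissa, bias 7).
// NOT ILGPU code: it only illustrates saturating vs fn overflow handling.
static class E4M3Sketch
{
    // Decodes one E4M3 byte. 0x7F / 0xFF are the only NaN encodings; there is no Inf.
    public static double Decode(byte b)
    {
        int e = (b >> 3) & 0xF;
        int m = b & 0x7;
        if (e == 15 && m == 7) return double.NaN;
        double mag = e == 0
            ? Math.ScaleB(m, -9)            // subnormal: m * 2^-9
            : Math.ScaleB(8 + m, e - 10);   // normal: (1 + m/8) * 2^(e-7)
        return (b & 0x80) != 0 ? -mag : mag;
    }

    // Encodes a float. saturate = true clamps finite overflow to +-448 (0x7E / 0xFE);
    // saturate = false follows the fn convention and returns NaN (0x7F / 0xFF).
    public static byte Encode(float x, bool saturate)
    {
        int sign = BitConverter.SingleToInt32Bits(x) < 0 ? 0x80 : 0x00;

        // NaN and +-Inf become NaN under both conventions (assumption: sign is kept).
        if (float.IsNaN(x) || float.IsInfinity(x)) return (byte)(sign | 0x7F);

        double a = Math.Abs((double)x);

        // 464 is the midpoint between 448 and the 480 slot that e4m3fn uses for NaN.
        // Only magnitudes strictly above it count as overflow.
        if (a > 464.0) return (byte)(sign | (saturate ? 0x7E : 0x7F));

        // Nearest finite code in 0x00..0x7E; an exact tie goes to the even code.
        int best = 0;
        for (int c = 1; c <= 0x7E; c++)
        {
            double dBest = Math.Abs(a - Decode((byte)best));
            double dC = Math.Abs(a - Decode((byte)c));
            if (dC < dBest || (dC == dBest && (c & 1) == 0)) best = c;
        }
        return (byte)(sign | best);
    }

    static void Main()
    {
        foreach (float v in new[] { 1f, 448f, 464f, 480f, -1000f, float.PositiveInfinity })
            Console.WriteLine(
                $"{v}: saturating 0x{Encode(v, true):X2}, fn 0x{Encode(v, false):X2}");
    }
}
```

Under these assumptions, 464 encodes to `0x7E` in both modes, 480 gives `0x7E` when saturating but `0x7F` under fn, and -1000 gives `0xFE` versus `0xFF`. That is the same split the `Float8E4M3_FromSingleFn_OverflowToNaN` test inputs are chosen to exercise.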