
Commit 999c8dd

LostBeard and claude committed
Float8E4M3: selectable overflow convention (float8_e4m3fn parity) - 4.14.0-local.1
Validated Float8E4M3/Float8E5M2 conversions against the ml_dtypes reference (the impl PyTorch/JAX float8_e4m3fn / float8_e5m2 share) with a new evidence harness (DemoConsole -- fp8-oracle, generators in _research/fp8_oracle/). Decode is bit-exact (0 divergences over all 256 byte values); encode rounding/subnormal is bit-exact (0 divergences over 1099 E4M3 / 723 E5M2 probes).

The ONLY divergence: E4M3 finite overflow saturated to +-448 (NVIDIA TE / OCP saturating cast) vs the e4m3fn reference's overflow->NaN. Made it selectable (no change to existing behavior - the cast operator is unchanged, still saturating):
- (Float8E4M3)x / FromSingleSaturating / FromSingle(x, saturate:true) = saturate to +-448
- FromSingleFn(x) / FromSingle(x, saturate:false) = overflow->NaN, BIT-EXACT to float8_e4m3fn

FromSingleFn is composed only of existing intrinsics (compare + the saturating cast + Neg + cast-of-NaN), so it transpiles with NO per-backend conversion codegen.

Gates: fp8-oracle FromSingleFn 1099/1099 vs float8_e4m3fn; fp8-verify desktop kernel 24/24 (CPU/OpenCL/CUDA); PMT Float8E4M3_FromSingleFn_OverflowToNaN 9/0 all backend lanes (incl WebGPU/WebGL/Wasm).

Forks 2.0.28; wrapper 4.14.0-local.1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent e14fc34 commit 999c8dd

15 files changed

Lines changed: 567 additions & 19 deletions

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -2,6 +2,14 @@

 This file tracks notable changes per release. The README's "Recent Highlights" section links here for the full version history.

+## 4.14.0-local.1 (2026-06-17) - Float8E4M3 selectable overflow convention (float8_e4m3fn parity)
+
+Additive new API on `Float8E4M3` (forks bump to `2.0.28`). No change to any existing behavior - the cast operator is unchanged.
+
+- **Validated `Float8E4M3` / `Float8E5M2` conversions against the `ml_dtypes` reference** (the impl PyTorch / JAX `float8_e4m3fn` / `float8_e5m2` share) with a new evidence harness `DemoConsole -- fp8-oracle` (generators in `_research/fp8_oracle/`). Result: decode is bit-exact (0/256) and encode rounding/subnormal is bit-exact (0 divergences across 1099 / 723 probes) for both types. The **only** divergence was E4M3 finite overflow: ILGPU saturated to ±448 (the NVIDIA Transformer Engine / OCP saturating cast), whereas the dtype literally named `e4m3fn` overflows to **NaN**. Both are real-world conventions; they agree everywhere except `|x| > 464` (the region that rounds up past the 448 slot).
+- **Made the overflow convention selectable.** The bare cast `(Float8E4M3)x`, `Float8E4M3.FromSingleSaturating(x)`, and `FromSingle(x, saturate: true)` keep the **saturating** behavior (finite overflow → ±448, ±Inf → NaN). New **`Float8E4M3.FromSingleFn(x)`** / `FromSingle(x, saturate: false)` use the **fn** convention (finite overflow AND ±Inf → NaN), **bit-exact to PyTorch / JAX / ml_dtypes `float8_e4m3fn`** - use it for reference-matching ML (loading/comparing FP8 checkpoints). `FromSingleFn` is composed only of existing intrinsics (compare + the saturating cast + Neg + cast-of-NaN), so it transpiles with **no per-backend conversion codegen** and is bit-exact on all 6 backends.
+- Gates: `DemoConsole -- fp8-oracle` (managed `FromSingleFn` 1099/1099 vs `float8_e4m3fn`; saturating cast's 62 overflow points reported as the documented convention) + `fp8-verify` desktop kernel (`FromSingleFn` 24/24 bit-exact on CPU/OpenCL/CUDA) + **PMT `Float8E4M3_FromSingleFn_OverflowToNaN` 9/0 across all backend lanes** (CPU/CUDA/OpenCL/WebGPU/WebGPU-NoSubgroups/WebGL/Wasm). No regression to existing FP8/bf16/Half gates. `Float8E5M2` already matched its reference (overflow → ±Inf); its canonical NaN byte is `0x7F` (ml_dtypes uses `0x7E` - both valid NaN patterns).
+
 ## 4.13.2 (2026-06-16) - Packaging fix (no code changes)

 Wrapper-package-only fix over 4.13.1 (forks unchanged at `2.0.27`). No library/runtime behavior changed.

CLAUDE.md

Lines changed: 1 addition & 1 deletion
@@ -147,7 +147,7 @@ If total > 10: `InvalidOperationException` at dispatch time (v4.9.1+). Before v4

 `ArrayView<byte>`, `ArrayView<sbyte>`, `ArrayView<short>`, `ArrayView<ushort>`, `ArrayView<Half>` (ILGPU.Half), `ArrayView<BFloat16>` (ILGPU.BFloat16) supported on all 6 backends.

-**Use `ILGPU.Half`, NOT `System.Half`** in kernel signatures. Implicit conversion operators exist for interop. Same for **`ILGPU.BFloat16`** (the "brain float": top 16 bits of an fp32, so fp32's full dynamic range - the ML-weights trade vs `Half`) and the two FP8 types **`ILGPU.Float8E4M3`** (forward/inference, no Inf, sat ±448) + **`ILGPU.Float8E5M2`** (backward/gradient, IEEE Inf/NaN). bf16/FP8 detail: [Docs/data-type-support.md](Docs/data-type-support.md). On CUDA bf16 + FP8 use an f32-register-compute model (no native PTX bf16/fp8 arithmetic); the load/store conversion is **PORTABLE bit-manipulation (basic integer ops on every CUDA arch incl. pre-Ampere)** - 4.13.0+ replaced the sm_80-only `cvt.*.bf16` shortcut that broke on older cards. The browser/OpenCL/Wasm backends emulate the same exact conversion, byte-identical to CUDA.
+**Use `ILGPU.Half`, NOT `System.Half`** in kernel signatures. Implicit conversion operators exist for interop. Same for **`ILGPU.BFloat16`** (the "brain float": top 16 bits of an fp32, so fp32's full dynamic range - the ML-weights trade vs `Half`) and the two FP8 types **`ILGPU.Float8E4M3`** (forward/inference, no Inf; **selectable overflow**: the cast/`FromSingleSaturating` clamps to ±448 = NVIDIA TE/OCP, `FromSingleFn` → NaN = bit-exact PyTorch/JAX `float8_e4m3fn`) + **`ILGPU.Float8E5M2`** (backward/gradient, IEEE Inf/NaN). bf16/FP8 detail: [Docs/data-type-support.md](Docs/data-type-support.md). On CUDA bf16 + FP8 use an f32-register-compute model (no native PTX bf16/fp8 arithmetic); the load/store conversion is **PORTABLE bit-manipulation (basic integer ops on every CUDA arch incl. pre-Ampere)** - 4.13.0+ replaced the sm_80-only `cvt.*.bf16` shortcut that broke on older cards. The browser/OpenCL/Wasm backends emulate the same exact conversion, byte-identical to CUDA.

 **Per-backend implementation:**
 - **WebGPU:** Packed into `array<atomic<u32>>`. Load via atomicLoad + shift + mask. Store via atomicAnd + atomicOr (thread-safe sub-word writes). Float16 load/store calls `_f16_to_f32` / `_f32_to_f16` helpers from `WGSLEmulationLibrary.F16Functions` when `!shader-f16`; native WGSL `f16` type otherwise. `WebGPUBackend.ForceEmulatedF16` test flag forces the emulation path for verification.

Docs/data-type-support.md

Lines changed: 22 additions & 6 deletions
@@ -164,9 +164,9 @@ NaN-preservation guard. Values compute as f32 everywhere; only the storage is 2-
 6 backends.**

 - **`Float8E4M3`** - 1 sign / 4 exponent / 3 mantissa, bias 7. The "E4M3FN" finite variant: **no
-infinities** (the only non-finite value is NaN at `0x7F`/`0xFF`), max finite magnitude **448**, finite
-overflow **saturates** to ±448. The FP8 **forward / inference** format (one extra mantissa bit vs E5M2,
-at the cost of range).
+infinities** (the only non-finite value is NaN at `0x7F`/`0xFF`), max finite magnitude **448**. The
+overflow convention is **selectable** (see the convention note below). The FP8 **forward / inference**
+format (one extra mantissa bit vs E5M2, at the cost of range).
 - **`Float8E5M2`** - 1 sign / 5 exponent / 2 mantissa, bias 15. IEEE-754-style: **has infinities and
 NaNs** (like fp16 but with 8 fewer mantissa bits). The FP8 **backward / gradient** format (fp16-class
 dynamic range, which gradients need).
@@ -187,9 +187,25 @@ backend** (CPU-verified idempotence 0/256 for all representable values).
 | **CUDA** | f32-register model. The FP8<->f32 conversion is **inline PTX bit-manipulation** (branchless `setp`/`selp`, unrolled normalize) using only basic integer ops - FP8 has no portable native PTX cvt (`cvt.*.e4m3` is sm_89/Hopper only), so this works on every CUDA arch. Load = `ld.global.u8` + convert; store = convert + `st.global.u8`. |
 | **CPU** | Native - the managed `Float8E4M3`/`Float8E5M2` structs run directly. |

-> **Convention note (E4M3 overflow):** out-of-range *inputs* to `Float8E4M3` saturate finite overflow to
-> ±448 and map ±Inf -> NaN (the OCP / NVIDIA Transformer Engine saturating-forward convention). Only the
-> out-of-range input behavior is convention-dependent; every *representable* value round-trips exactly.
+> **Convention note (E4M3 overflow) - SELECTABLE.** E4M3 has two real-world overflow conventions and
+> both are exposed; the conversion is otherwise **bit-exact** to the `ml_dtypes` reference (the impl
+> PyTorch / JAX `float8_e4m3fn` share) - verified by `DemoConsole -- fp8-oracle`: decode 0/256, encode
+> rounding/subnormal 0 divergences across 1099 probes.
+>
+> | Entry point | Finite overflow | ±Inf | Matches |
+> |---|---|---|---|
+> | `(Float8E4M3)x` cast / `FromSingleSaturating(x)` / `FromSingle(x, saturate: true)` | clamps to ±448 | → NaN | NVIDIA Transformer Engine default cast / OCP saturating-forward |
+> | `FromSingleFn(x)` / `FromSingle(x, saturate: false)` | → NaN | → NaN | **PyTorch / JAX / ml_dtypes `float8_e4m3fn`** (bit-exact) |
+>
+> The two agree everywhere except `|x| > 464` (the region that rounds up past the 448 slot): saturating
+> gives ±448, fn gives NaN. Every *representable* value round-trips exactly under both. `FromSingleFn` is
+> composed only of existing intrinsics (compare + the saturating cast + Neg + cast-of-NaN), so it
+> transpiles and is bit-exact on **all 6 backends** (PMT `Float8E4M3_FromSingleFn_OverflowToNaN`). Use
+> `FromSingleFn` for reference-matching ML (e.g. loading/comparing PyTorch FP8 checkpoints); use the
+> saturating cast when you want overflow clamped rather than NaN-poisoning a downstream reduction.
+>
+> `Float8E5M2` is IEEE-754-style (has ±Inf): overflow → ±Inf, bit-exact to `float8_e5m2` (decode 0/256,
+> encode 723/723); its canonical NaN byte is `0x7F` (ml_dtypes uses `0x7E` - both are valid NaN patterns).

 ### Sub-Word Usage Notes
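
To make the byte layout in the convention note concrete, here is a minimal standalone C# sketch of decoding one E4M3 byte. It is not ILGPU's implementation: `E4M3Layout` and `Decode` are hypothetical names introduced only for illustration, and the sketch assumes nothing beyond the layout stated in the doc text (1 sign / 4 exponent / 3 mantissa, bias 7, NaN at `0x7F`/`0xFF`).

```csharp
using System;

// Illustrative only: E4M3Layout and Decode are hypothetical names, not ILGPU API.
public static class E4M3Layout
{
    // Decodes one E4M3 byte (1 sign / 4 exponent / 3 mantissa, bias 7) to a float.
    public static float Decode(byte bits)
    {
        float sign = (bits & 0x80) != 0 ? -1f : 1f;
        int exp = (bits >> 3) & 0xF;
        int man = bits & 0x7;

        if (exp == 0xF && man == 0x7)
            return float.NaN;            // 0x7F / 0xFF: the only non-finite codes
        if (exp == 0)
            return sign * (man / 512f);  // subnormal: man * 2^-9
        // Normal: (1 + man/8) * 2^(exp-7), written as (8 + man) * 2^exp / 1024.
        return sign * (((8 + man) << exp) / 1024f);
    }

    public static void Main()
    {
        Console.WriteLine(Decode(0x7E));               // largest finite magnitude (448)
        Console.WriteLine(Decode(0x01));               // smallest positive subnormal (2^-9)
        Console.WriteLine(float.IsNaN(Decode(0x7F)));  // the NaN code
    }
}
```

Under this layout the code after `0x7E` would decode to 480 if it were not reserved for NaN, so 464 is the midpoint between 448 and that missing slot. Round-to-nearest-even sends the exact tie at 464 down to 448 (mantissa `110` is even), which is consistent with the two conventions differing only for `|x| > 464`.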

ILGPU.Algorithms/ILGPU.Algorithms.csproj

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 SpawnDev.ILGPU.Fork* PackageReference Versions inside SpawnDev.ILGPU.csproj.
 Run `_check-fork-version-sync.bat` at repo root. See the banner comment in
 SpawnDev.ILGPU.csproj for the full procedure. -->
-<Version>2.0.27</Version>
+<Version>2.0.28</Version>
 <IsPackable>true</IsPackable>
 <GeneratePackageOnBuild>true</GeneratePackageOnBuild>
 </PropertyGroup>

ILGPU/Float8E4M3.cs

Lines changed: 62 additions & 6 deletions
@@ -10,12 +10,18 @@
 // training recipe (E4M3 forward, E5M2 backward): it trades dynamic range for an extra mantissa
 // bit vs E5M2, which is what forward activations/weights want.
 //
-// CONVENTION (flagged for ML-oracle confirmation, plan §9 risk #2 - confirm vs PyTorch
-// float8_e4m3fn / NVIDIA Transformer Engine when wired into the ML lane): finite overflow
-// SATURATES to +-448; a real +-Inf input maps to NaN (E4M3 has no Inf); NaN -> NaN. This
-// matches the OCP/TE saturating-forward convention. Only the out-of-range INPUT behavior is
-// convention-dependent; every REPRESENTABLE value round-trips exactly (verified by the CPU
-// idempotence harness, `DemoConsole -- fp8-verify`).
+// OVERFLOW CONVENTION (verified vs the ml_dtypes reference, `DemoConsole -- fp8-oracle` -
+// ml_dtypes is the impl PyTorch / JAX float8_e4m3fn share). E4M3 has two real-world conventions
+// and BOTH are selectable here; the conversion is otherwise bit-exact to the reference (decode
+// 0/256, encode rounding/subnormal 0 divergences across 1099 probes):
+//   * SATURATING (the bare cast operator + FromSingleSaturating): finite overflow clamps to
+//     +-448; +-Inf -> NaN; NaN -> NaN. Matches the NVIDIA Transformer Engine default cast /
+//     OCP saturating-forward mode. Avoids NaN propagation when activations overflow unscaled.
+//   * fn / non-saturating (FromSingleFn): finite overflow AND +-Inf -> NaN; NaN -> NaN. Bit-
+//     exact to PyTorch/JAX/ml_dtypes float8_e4m3fn (the dtype this layout is named after). Use
+//     this for reference-matching ML. The two conventions agree everywhere except |x|>464 (the
+//     region that rounds up past the 448 slot): saturating gives +-448, fn gives NaN.
+// Every REPRESENTABLE value round-trips exactly under both (CPU idempotence harness fp8-verify).
 //
 // Modeled on ILGPU.Half / BFloat16 / Float8E5M2: FP32-based [MathIntrinsic]/[CompareIntrinisc]/
 // [ConvertIntrinisc] operators (transpiled on every backend). 1-byte storage.
@@ -61,6 +67,35 @@ namespace ILGPU
         [MethodImpl(MethodImplOptions.AggressiveInlining)]
         public static bool IsFinite(Float8E4M3 value) => Float8E4M3Extensions.IsFinite(value);

+        /// <summary>
+        /// Converts a float to E4M3 with a selectable overflow convention. When
+        /// <paramref name="saturate"/> is true (the default, matching the cast operator): finite
+        /// overflow clamps to +-448 (NVIDIA Transformer Engine / OCP saturating cast). When false:
+        /// finite overflow and +-Inf map to NaN, bit-exact to PyTorch/JAX/ml_dtypes float8_e4m3fn.
+        /// </summary>
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        public static Float8E4M3 FromSingle(float value, bool saturate) =>
+            saturate ? Float8E4M3Extensions.ConvertFloatToFloat8E4M3(value)
+                     : Float8E4M3Extensions.FromSingleFn(value);
+
+        /// <summary>
+        /// Converts a float to E4M3 using the SATURATING convention: finite overflow clamps to
+        /// +-448; +-Inf -> NaN; NaN -> NaN. Identical to the explicit cast operator. Matches the
+        /// NVIDIA Transformer Engine default cast / OCP saturating-forward mode.
+        /// </summary>
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        public static Float8E4M3 FromSingleSaturating(float value) =>
+            Float8E4M3Extensions.ConvertFloatToFloat8E4M3(value);
+
+        /// <summary>
+        /// Converts a float to E4M3 using the fn (non-saturating) convention: finite overflow AND
+        /// +-Inf map to NaN; NaN -> NaN. Bit-exact to PyTorch/JAX/ml_dtypes float8_e4m3fn - use
+        /// this for reference-matching ML. Differs from the saturating cast only for |value|>464.
+        /// </summary>
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        public static Float8E4M3 FromSingleFn(float value) =>
+            Float8E4M3Extensions.FromSingleFn(value);
+
         #endregion

         #region Constants
@@ -324,6 +359,27 @@ public static Float8E4M3 ConvertFloatToFloat8E4M3(float value)
             return new Float8E4M3((byte)(sign | (outBits & 0x7Fu)));
         }

+        /// <summary>
+        /// Converts a float to E4M3 using the fn (float8_e4m3fn) convention: finite overflow and
+        /// +-Inf map to NaN (NOT saturation); NaN -> NaN. Bit-exact to PyTorch / JAX / ml_dtypes
+        /// (verified, <c>DemoConsole -- fp8-oracle</c>). Composed only of existing intrinsics
+        /// (compare, the saturating cast, Neg, cast-of-NaN) so it transpiles on every backend with
+        /// no per-backend conversion codegen.
+        /// </summary>
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        public static Float8E4M3 FromSingleFn(float value)
+        {
+            // |value| <= 464 rounds to <= 448 (bit-exact to the reference) via the saturating base
+            // convert; |value| > 464 is the round-up-past-448 region -> NaN. A NaN input fails both
+            // ordered compares (NaN > / < are false) and falls through to the base convert, which
+            // already maps NaN -> NaN. +-Inf trip the compares -> signed NaN.
+            if (value > 464.0f)
+                return (Float8E4M3)float.NaN; // +overflow / +Inf -> +NaN (0x7F)
+            if (value < -464.0f)
+                return -(Float8E4M3)float.NaN; // -overflow / -Inf -> -NaN (0xFF)
+            return (Float8E4M3)value;
+        }
+
         #endregion

         #region Predicates
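
The composition used by `FromSingleFn` (two ordered compares in front of the saturating convert) can be tried outside ILGPU with the following standalone C# sketch. Everything in it is an assumption made for illustration: `E4M3Sketch`, `Magnitude`, `EncodeSaturating`, and `EncodeFn` are hypothetical names, the saturating encoder is a brute-force nearest-code search rather than ILGPU's bit-manipulation, and the sign chosen for the NaN byte on ±Inf and NaN inputs is a guess, since the source only states that they map to NaN.

```csharp
using System;

// Illustrative only: E4M3Sketch and its members are hypothetical names, not ILGPU API.
public static class E4M3Sketch
{
    // Magnitude of a finite positive code 0x00..0x7E (bias 7, 3 mantissa bits).
    static double Magnitude(int code)
    {
        int exp = code >> 3, man = code & 0x7;
        return exp == 0 ? man / 512.0 : ((8 + man) << exp) / 1024.0;
    }

    // Saturating convention: finite overflow clamps to +-448; +-Inf and NaN give a NaN byte.
    // Brute-force nearest-code search with ties going to the even code; clear, not fast.
    public static byte EncodeSaturating(float x)
    {
        if (float.IsNaN(x)) return 0x7F;                  // assumption: positive NaN byte
        byte sign = (byte)(BitConverter.SingleToInt32Bits(x) < 0 ? 0x80 : 0x00);
        if (float.IsInfinity(x)) return (byte)(sign | 0x7F);

        double a = Math.Min(Math.Abs((double)x), 448.0);  // the clamp is the saturation
        int best = 0;
        for (int code = 1; code <= 0x7E; code++)
        {
            double dBest = Math.Abs(a - Magnitude(best));
            double dCode = Math.Abs(a - Magnitude(code));
            if (dCode < dBest || (dCode == dBest && (code & 1) == 0))
                best = code;
        }
        return (byte)(sign | best);
    }

    // fn convention as a wrapper: two ordered compares in front of the saturating encode.
    public static byte EncodeFn(float x)
    {
        if (x > 464f) return 0x7F;   // +overflow / +Inf -> +NaN
        if (x < -464f) return 0xFF;  // -overflow / -Inf -> -NaN
        return EncodeSaturating(x);  // a NaN input fails both compares and lands here
    }

    public static void Main()
    {
        foreach (float x in new[] { 448f, 464f, 464.5f, 480f, -1e30f })
            Console.WriteLine($"{x}: sat=0x{EncodeSaturating(x):X2} fn=0x{EncodeFn(x):X2}");
    }
}
```

Because the wrapper only adds compares and constant returns, the two functions return the same byte for every input with `|x| <= 464`. The sketch is useful for checking expectations by hand, not as a replacement for the oracle harness.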

ILGPU/ILGPU.csproj

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@
 check on push. Skipping (b) means consumers ship rc.N still pulling old Fork
 transitively and the fix is invisible. See the banner comment in
 SpawnDev.ILGPU.csproj for the full procedure. -->
-<Version>2.0.27</Version>
+<Version>2.0.28</Version>
 <IsPackable>true</IsPackable>
 <GeneratePackageOnBuild>true</GeneratePackageOnBuild>
 </PropertyGroup>

SpawnDev.ILGPU.Demo.Shared/UnitTests/BackendTestBase.GenericPrecision.cs

Lines changed: 48 additions & 0 deletions
@@ -157,5 +157,53 @@ private async Task RunGenericPrecision<T>(Func<float, T> toT, Func<T, float> toF
                     "must transpile + marshal correctly.");
             }
         });
+
+        // Float8E4M3.FromSingleFn (the float8_e4m3fn convention: finite overflow AND +-Inf -> NaN,
+        // NOT saturation - Geordi 2026-06-17, from the FP8 ML-oracle validation). FromSingleFn is
+        // composed only of existing intrinsics (compare + the saturating cast + Neg + cast-of-NaN)
+        // so it must transpile and produce the SAME byte as the managed fn result (which fp8-oracle
+        // proved bit-exact to ml_dtypes/PyTorch float8_e4m3fn) on EVERY backend. The overflow region
+        // is the point: a correct kernel emits NaN (0x7F/0xFF) there, not 0x7E (+-448).
+        private static void Float8FromSingleFnKernel(Index1D i,
+            ArrayView1D<float, Stride1D.Dense> x, ArrayView1D<global::ILGPU.Float8E4M3, Stride1D.Dense> y) =>
+            y[i] = global::ILGPU.Float8E4M3.FromSingleFn(x[i]);
+
+        [TestMethod]
+        public async Task Float8E4M3_FromSingleFn_OverflowToNaN() => await RunTest(async accelerator =>
+        {
+            float[] inputs =
+            {
+                480f, 512f, 1000f, 1e30f, float.PositiveInfinity, // +overflow -> +NaN
+                -480f, -512f, -1e30f, float.NegativeInfinity, // -overflow -> -NaN
+                448f, 449f, 463f, 464f, -448f, -464f, // round-to-448 region (finite)
+                1f, 1.25f, 256f, -2.5f, 0.5f, 0.001953125f, 0f, -0f, float.NaN,
+            };
+            int n = inputs.Length;
+            var expected = new byte[n];
+            for (int i = 0; i < n; i++)
+            {
+                var v = global::ILGPU.Float8E4M3.FromSingleFn(inputs[i]); // managed = proven reference
+                expected[i] = System.Runtime.CompilerServices.Unsafe.As<global::ILGPU.Float8E4M3, byte>(ref v);
+            }
+
+            using var inBuf = accelerator.Allocate1D(inputs);
+            using var outBuf = accelerator.Allocate1D<global::ILGPU.Float8E4M3>(n);
+            var k = accelerator.LoadAutoGroupedStreamKernel<Index1D,
+                ArrayView1D<float, Stride1D.Dense>, ArrayView1D<global::ILGPU.Float8E4M3, Stride1D.Dense>>(
+                Float8FromSingleFnKernel);
+            k(n, inBuf.View, outBuf.View);
+            await accelerator.SynchronizeAsync();
+            var got = await outBuf.CopyToHostAsync<global::ILGPU.Float8E4M3>();
+
+            for (int i = 0; i < n; i++)
+            {
+                byte g = System.Runtime.CompilerServices.Unsafe.As<global::ILGPU.Float8E4M3, byte>(ref got[i]);
+                bool bothNaN = (g & 0x7F) == 0x7F && (expected[i] & 0x7F) == 0x7F; // NaN-slot tolerant
+                if (!bothNaN && g != expected[i])
+                    throw new Exception(
+                        $"FromSingleFn kernel @{i} ({BackendName}): input {inputs[i]} -> got 0x{g:X2}, " +
+                        $"want 0x{expected[i]:X2} (fn: overflow/+-Inf must be NaN, not saturated +-448).");
+            }
+        });
     }
 }
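
The NaN-slot-tolerant comparison in the test's final loop can be isolated as a small helper. This is a sketch of that one check only; `Fp8ByteCompare`, `IsNaNByte`, and `Matches` are hypothetical names that do not appear in the commit.

```csharp
using System;

// Illustrative only: Fp8ByteCompare and its members are hypothetical names, not ILGPU API.
public static class Fp8ByteCompare
{
    // True when the byte is an E4M3 NaN code (0x7F or 0xFF); the sign bit is ignored.
    public static bool IsNaNByte(byte b) => (b & 0x7F) == 0x7F;

    // Bytes match when both are NaN codes (either sign) or when they are identical.
    public static bool Matches(byte got, byte expected) =>
        (IsNaNByte(got) && IsNaNByte(expected)) || got == expected;

    public static void Main()
    {
        Console.WriteLine(Matches(0x7F, 0xFF)); // both NaN codes, signs differ
        Console.WriteLine(Matches(0x7E, 0x7F)); // saturated 448 vs NaN
    }
}
```

Masking off the sign bit accepts either NaN byte from a backend, while a saturated `0x7E` or `0xFE` in the overflow region still fails the comparison, which is the regression the test is meant to catch.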
