Bump to 4.13.0-local.10 (forks 2.0.26): FP8 on all 6 backends + bf16 pre-Ampere CUDA fix

LostBeard · claude · LostBeard · commit efa88f41b17b · 2026-06-16T18:48:24.000-04:00
Four-package bundle bump. Ships complete FP8 (Float8E4M3/Float8E5M2) support on CPU/OpenCL/WebGPU/WebGL/Wasm/CUDA
plus the bf16 pre-Ampere PTX fix (unblocks GTX 1080 / RTX 2060). Also adds FP8 PTX struct-field IO
(EmitIOLoad/EmitIOStore) for completeness (CUDA fp8-verify still 257/257).
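
For readers unfamiliar with the two FP8 layouts, the decode direction (1-byte pattern to f32) can be sketched in managed C#. This is a minimal illustration under stated assumptions, not the code the backends emit: the class and method names (`Fp8DecodeSketch`, `E4M3ToSingle`, `E5M2ToSingle`, `Decode`) are hypothetical, and only the field widths, biases, and special-value rules come from the changelog entry (E4M3FN: 1/4/3, bias 7, no Inf; E5M2: 1/5/2, bias 15, IEEE Inf/NaN).

```csharp
using System;

// Hypothetical reference decoder; names are illustrative, not part of ILGPU.
public static class Fp8DecodeSketch
{
    // E4M3FN: 4 exponent bits, 3 mantissa bits, bias 7, NaN only (no Inf).
    public static float E4M3ToSingle(byte bits) => Decode(bits, 4, 3, 7, false);

    // E5M2: 5 exponent bits, 2 mantissa bits, bias 15, IEEE-style Inf/NaN.
    public static float E5M2ToSingle(byte bits) => Decode(bits, 5, 2, 15, true);

    private static float Decode(byte bits, int expBits, int manBits, int bias, bool ieeeSpecials)
    {
        int sign = bits >> 7;
        int expMax = (1 << expBits) - 1;
        int manMax = (1 << manBits) - 1;
        int exp = (bits >> manBits) & expMax;
        int man = bits & manMax;

        float magnitude;
        if (ieeeSpecials && exp == expMax)
        {
            // E5M2: all-ones exponent is Inf (mantissa 0) or NaN (mantissa != 0).
            magnitude = man == 0 ? float.PositiveInfinity : float.NaN;
        }
        else if (!ieeeSpecials && exp == expMax && man == manMax)
        {
            // E4M3FN: the single all-ones pattern is NaN; other top-exponent codes are finite.
            magnitude = float.NaN;
        }
        else if (exp == 0)
        {
            // Zero or subnormal: man * 2^(1 - bias - manBits), exact in f32.
            float unit = BitConverter.Int32BitsToSingle((127 + 1 - bias - manBits) << 23);
            magnitude = man * unit;
        }
        else
        {
            // Normal: rebias the exponent to 127 and left-align the mantissa in the f32 field.
            int f32Bits = ((exp - bias + 127) << 23) | (man << (23 - manBits));
            magnitude = BitConverter.Int32BitsToSingle(f32Bits);
        }
        return sign == 1 ? -magnitude : magnitude;
    }
}
```

Under these assumptions, `Fp8DecodeSketch.E4M3ToSingle(0x7E)` is the largest finite E4M3 value (448) and `Fp8DecodeSketch.E5M2ToSingle(0x7C)` is positive infinity. The encode direction (RNE rounding and E4M3 saturation) is deliberately left out of this sketch.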

- ILGPU.Fork + ILGPU.Algorithms.Fork: 2.0.25 -> 2.0.26
- SpawnDev.ILGPU: 4.13.0-local.9 -> 4.13.0-local.10 (+ PackageReference lines, release notes)
- CHANGELOG local.10 entry

Gates: PrecisionConvert (incl FP8 round-trip all 6 backends) 37/0; BFloat16 107/0 (incl CUDA);
fp8-verify CPU+OpenCL+CUDA 257/257.
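
The portable bf16 conversion that the BFloat16 gate exercises can be sketched in managed C# as follows. This is a hedged illustration, not the emitted PTX: the class and method names (`BFloat16BitsSketch`, `BitsToSingle`, `SingleToBits`) are hypothetical, the `if` stands in for the branchless setp/selp form, and the quiet-NaN handling (forcing mantissa bit 6) is an assumption about how a NaN-preservation guard might look rather than the exact bit pattern the fork produces.

```csharp
using System;

// Hypothetical managed model of the portable bf16 <-> f32 bit manipulation.
public static class BFloat16BitsSketch
{
    // bf16 is the top 16 bits of an f32, so widening is a shift plus reinterpret (exact).
    public static float BitsToSingle(ushort bits) =>
        BitConverter.Int32BitsToSingle(bits << 16);

    // Narrowing with round-to-nearest-even (RNE) and a NaN guard.
    public static ushort SingleToBits(float value)
    {
        uint u = (uint)BitConverter.SingleToInt32Bits(value);

        // NaN guard (assumed form): rounding could carry a NaN payload into Inf,
        // so truncate instead and force a quiet-NaN mantissa bit.
        if ((u & 0x7FFFFFFFu) > 0x7F800000u)
            return (ushort)((u >> 16) | 0x0040u);

        // RNE: add 0x7FFF plus the lowest kept bit, then drop the low 16 bits.
        uint roundingBias = 0x7FFFu + ((u >> 16) & 1u);
        return (ushort)((u + roundingBias) >> 16);
    }
}
```

Only integer shifts, masks, adds, and a reinterpret are needed, which is why this style of conversion does not depend on sm_80-only instructions. In this sketch, values that overflow the bf16 range (for example `float.MaxValue`) round to infinity, as RNE requires.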

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,11 @@ This file tracks notable changes per release. The README's "Recent Highlights" s
 
 ## 4.13.0 (unreleased) - BFloat16 (bfloat16) Phases 0-3b: core type (CPU) + WebGPU + WebGL + Wasm + OpenCL + CUDA codegen (all 6 backends)
 
+### local.10 - FP8 complete on ALL 6 backends + bf16 pre-Ampere CUDA fix
+
+- **FP8 (`ILGPU.Float8E4M3` + `ILGPU.Float8E5M2`) now works on all 6 backends** (CPU, OpenCL, WebGPU, WebGL, Wasm, CUDA). The two OCP FP8 8-bit float formats - E4M3FN (1/4/3, bias 7, no Inf, saturates to +-448, the forward/inference format) and E5M2 (1/5/2, bias 15, IEEE Inf/NaN, the backward/gradient format) - as full `INumber<T>` value types (FP32-based `[MathIntrinsic]`/`[CompareIntrinisc]`/`[ConvertIntrinisc]` operators) + `BasicValueType.Float8E4M3`/`Float8E5M2` IR primitives (append-only). FP8 computes as f32 in-register (the f32-register model) and is converted at the 1-byte load/store boundary. The FP8<->f32 conversion (exponent rebias 127->7/15, RNE rounding, subnormal normalize, variant specials) is emitted per backend: callable helper functions on OpenCL (`_e4m3/_e5m2_bits_to_f32` + inverse), WGSL, and GLSL; inline WebAssembly bytecode on Wasm (`EmitFP8ToF32`/`EmitF32ToFP8`, subnormal-normalize unrolled for bit-exactness); inline PTX bit-manipulation on CUDA (`EmitFP8BitsToF32`/`EmitF32ToFP8Bits`, branchless setp/selp - FP8 has no portable native PTX cvt). All byte-identical to the CPU-verified managed conversion. Gate: `BackendTestBase.PrecisionConvert_Float8E{4M3,5M2}_RoundTripBitExact` (pure `ConvertFromSingle(ConvertToSingle(x))` bit-exact vs the concrete `(T)(float)x` cast) on every backend; the `relu(x*scale+bias)` generic `INumber<T>` kernel 257/257 on CPU+OpenCL+CUDA. FP8 radix-sort keys (`Interop.FloatAsInt`) are a tracked follow-up.
+- **bf16 fix for PRE-AMPERE CUDA cards.** The PTX bf16 path emitted `cvt.f32.bf16` / `cvt.rn.bf16.f32` **unconditionally** at all 7 sites (load, store, scalar param, FloatAsInt, IntAsFloat, struct-field IO load+store). Those instructions are **sm_80+ (Ampere/Ada/Hopper) ONLY**, so any bf16 kernel **failed to compile on older CUDA cards** - Pascal (GTX 1080 = sm_61), Volta (sm_70), Turing (RTX 2060 = sm_75). Since bf16 is consumed by the ML stack, that path was broken on pre-Ampere. Fixed with portable bit-manipulation (`EmitBF16BitsToF32` = zero-extend + shl 16 + reinterpret, exact since bf16 is the top 16 bits of an fp32; `EmitF32ToBF16Bits` = RNE + NaN-preservation guard, branchless via setp/selp) using only basic integer ops available on **every CUDA arch**, replacing the native cvt at all 7 sites. Byte-identical to the managed/WGSL/GLSL/Wasm/OpenCL bf16 conversion (which already used bit-manip - only PTX had taken the sm_80 native shortcut). Gate: `PMT_FILTER=BFloat16` **107/0 across all 6 backends incl. CUDA** (radix keys, struct fields, range/specials, arithmetic). **Lesson:** native-cvt shortcuts (bf16's sm_80, FP8's sm_89) silently gate out older hardware - prefer portable bit-manip unless explicitly capability-gated.
+
 ### local.9 - `PrecisionConvert` (generic in-kernel float<->T conversion) + FP8 foundation
 
 - **`PrecisionConvert.ConvertToSingle<T>(T)` + `ConvertFromSingle<T>(float)`** - a transpilable GENERIC conversion between `float` and any `INumber<T>` (Half/BFloat16/Float8E4M3/Float8E5M2/...). Inside a generic `where T : INumber<T>` kernel there is no C# way to write `(float)t`/`(T)f` (no cast constraint exists), so callers reached for `float.CreateChecked(t)`/`T.CreateChecked(f)` - which lower to System.Numerics range/identity checks that touch `System.Type`, and the kernel transpiler rejects that ("Class type 'System.Type' is not supported") on every GPU backend. The two new methods are tagged `[ConvertIntrinisc]`, so the frontend lowers each call to the SAME `ConvertValue` IR node the concrete `(float)Half`/`(Half)float` cast emits (resolving `T` per instantiation via the method return type) - no `System.Type`, reusing the convert path that already handles Half/bf16/fp8. This collapses every precision-aware op (read low-precision input, accumulate in float, write low-precision output - Conv/GroupNorm/SiLU/MatMul, the Rule-4 zero-fp32-temp path) to ONE generic kernel for float/Half/bf16 instead of N per-type variants. Gate: new `BackendTestBase.PrecisionConvert` round-trip (float/Half/bf16) - a pure `ConvertFromSingle(ConvertToSingle(x))` with no accumulation is **bit-exact vs the concrete `(T)(float)x` cast, 23/0 across all 6 backends** (incl. WebGPU/WebGL/Wasm); `GenericPrecision` still 23/0 (no regression).
diff --git a/ILGPU.Algorithms/ILGPU.Algorithms.csproj b/ILGPU.Algorithms/ILGPU.Algorithms.csproj
@@ -12,7 +12,7 @@
          SpawnDev.ILGPU.Fork* PackageReference Versions inside SpawnDev.ILGPU.csproj.
          Run `_check-fork-version-sync.bat` at repo root. See the banner comment in
          SpawnDev.ILGPU.csproj for the full procedure. -->
-    <Version>2.0.25</Version>
+    <Version>2.0.26</Version>
     <IsPackable>true</IsPackable>
     <GeneratePackageOnBuild>true</GeneratePackageOnBuild>
   </PropertyGroup>
diff --git a/ILGPU/Backends/PTX/PTXCodeGenerator.Emitter.cs b/ILGPU/Backends/PTX/PTXCodeGenerator.Emitter.cs
@@ -904,6 +904,18 @@ public void EmitIOLoad<TIOEmitter, T>(
                 return;
             }
 
+            // FP8 field/value: 1-byte storage, f32 register. Load the byte, widen via portable
+            // bit-manip (every CUDA arch). Same model as bf16.
+            if (register.BasicValueType == BasicValueType.Float8E4M3 ||
+                register.BasicValueType == BasicValueType.Float8E5M2)
+            {
+                var rawReg = AllocateRegister(BasicValueType.Int16, PTXRegisterKind.Int16);
+                emitter.Emit(this, command, rawReg, userState);
+                EmitFP8BitsToF32(rawReg, register, register.BasicValueType == BasicValueType.Float8E4M3);
+                FreeRegister(rawReg);
+                return;
+            }
+
             HardwareRegister? originalRegister = null;
             // We need a temporary 32bit register for predicate conversion at this point:
             // 1) load value into temporary register
@@ -962,6 +974,18 @@ public void EmitIOStore<TIOEmitter, T>(
                 return;
             }
 
+            // FP8 field/value store: round the f32 value to its 1-byte pattern via portable bit-manip.
+            if (register.BasicValueType == BasicValueType.Float8E4M3 ||
+                register.BasicValueType == BasicValueType.Float8E5M2)
+            {
+                var f32Register = EnsureHardwareRegister(register);
+                var rawReg = AllocateRegister(BasicValueType.Int16, PTXRegisterKind.Int16);
+                EmitF32ToFP8Bits(f32Register, rawReg, register.BasicValueType == BasicValueType.Float8E4M3);
+                emitter.Emit(this, command, rawReg, userState);
+                FreeRegister(rawReg);
+                return;
+            }
+
             // We need a temporary 32bit register for predicate conversion at this point:
             // 1) convert current predicate into 32bit integer
             // 2) store the converted value from the temporary register
diff --git a/ILGPU/ILGPU.csproj b/ILGPU/ILGPU.csproj
@@ -16,7 +16,7 @@
          check on push. Skipping (b) means consumers ship rc.N still pulling old Fork
          transitively and the fix is invisible. See the banner comment in
          SpawnDev.ILGPU.csproj for the full procedure. -->
-    <Version>2.0.25</Version>
+    <Version>2.0.26</Version>
     <IsPackable>true</IsPackable>
     <GeneratePackageOnBuild>true</GeneratePackageOnBuild>
   </PropertyGroup>
diff --git a/SpawnDev.ILGPU/SpawnDev.ILGPU.csproj b/SpawnDev.ILGPU/SpawnDev.ILGPU.csproj