Docs: trim PackageReleaseNotes to CHANGELOG; FP8 + bf16-portable in README + data-type-support

LostBeard · claude · LostBeard · commit aad9e62dd5bb · 2026-06-16T18:58:46.000-04:00
- SpawnDev.ILGPU.csproj: PackageReleaseNotes was carrying the full local.5->local.10 history inline
  (huge). Trimmed to a concise 4.13.0 summary (low-precision types on all 6 backends + the bf16
  pre-Ampere fix) pointing at CHANGELOG.md for the per-version detail.
- README "Recent Highlights" 4.13.0: rewritten to cover the full release - BFloat16 + FP8 (Float8E4M3 +
  Float8E5M2) on all 6 backends, generic INumber<T> mixed-precision kernels, PrecisionConvert, and
  the portable-bit-manip CUDA fix that brings bf16/FP8 to pre-Ampere cards. Features list updated.
- Docs/data-type-support.md: FP8 rows in Read/Write/EndToEnd tables (all 6 backends), a full FP8
  per-backend implementation section, and the bf16 CUDA mechanism corrected from native sm_80 cvt to
  portable bit-manip (works on every CUDA arch).
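
For reference, the portable bf16 conversion described above can be sketched outside PTX. This is a minimal
Python illustration, not the emitted PTX: the function names `bf16_bits_to_f32` and `f32_to_bf16_bits` are
hypothetical, and the rounding bias and NaN guard shown are the common integer-only formulation, assumed
here rather than copied from the backend. It assumes the input is already an fp32-representable value.

```python
import struct


def bf16_bits_to_f32(bits: int) -> float:
    """Load side: bf16 is the top 16 bits of an fp32, so zero-extend,
    shift left by 16 and reinterpret the result as a float (exact)."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def f32_to_bf16_bits(value: float) -> int:
    """Store side: round-to-nearest-even on the discarded low 16 bits,
    with a NaN guard (assumed formulation, integer ops only)."""
    u = struct.unpack("<I", struct.pack("<f", value))[0]
    if (u & 0x7FFFFFFF) > 0x7F800000:
        # NaN: skip the rounding add and set a mantissa bit that survives
        # the truncation, so the result cannot collapse to Inf.
        return ((u >> 16) | 0x0040) & 0xFFFF
    # Adding 0x7FFF plus the lowest kept bit rounds ties to even.
    rounding_bias = 0x7FFF + ((u >> 16) & 1)
    return ((u + rounding_bias) >> 16) & 0xFFFF
```

Only shifts, masks, adds and one compare are involved, which is why the same logic needs no
architecture-specific instruction.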

GitHub release description for 4.13.0 written separately (outside the repo) for posting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
diff --git a/Docs/data-type-support.md b/Docs/data-type-support.md
@@ -1,7 +1,7 @@
 # Data Type Support by Backend
 
 Tracks verified support for all data types across all 7 backends.
-Updated: 2026-06-15
+Updated: 2026-06-16
 
 **Legend:**
 - [x] PASS - verified with unit tests (real data, real kernels, real verification)
@@ -26,6 +26,8 @@ Updated: 2026-06-15
 | UInt64 | ulong | 8B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
 | Float16 | Half | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
 | BFloat16 | BFloat16 | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
+| Float8E4M3 | Float8E4M3 | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
+| Float8E5M2 | Float8E5M2 | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
 | Float32 | float | 4B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
 | Float64 | double | 8B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
 
@@ -43,6 +45,8 @@ Updated: 2026-06-15
 | UInt64 | ulong | 8B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
 | Float16 | Half | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
 | BFloat16 | BFloat16 | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
+| Float8E4M3 | Float8E4M3 | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
+| Float8E5M2 | Float8E5M2 | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
 | Float32 | float | 4B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
 | Float64 | double | 8B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
 
@@ -60,6 +64,8 @@ Updated: 2026-06-15
 | UInt64 | ulong | 8B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
 | Float16 | Half | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
 | BFloat16 | BFloat16 | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
+| Float8E4M3 | Float8E4M3 | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
+| Float8E5M2 | Float8E5M2 | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
 | Float32 | float | 4B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
 | Float64 | double | 8B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
 
@@ -148,9 +154,43 @@ NaN-preservation guard. Values compute as f32 everywhere; only the storage is 2-
 | **Wasm** | `EmitBF16ToF32` / `EmitF32ToBF16` emit the conversion as inline WebAssembly bytecode; 2-byte `i32.load16_u` / `i32.store16` (atomic in barrier kernels). |
 | **WebGL** | Packed-u16 in an R32I texel; `texelFetch` + shift/mask load, Transform-Feedback varying store; `_bf16_to_f32` / `_f32_to_bf16` GLSL helpers. |
 | **OpenCL** | Emulated (no common native bf16 extension; `cl_khr_fp16` is fp16, not bf16). View params are `ushort*` (2-byte storage stride - a `float*` typedef silently corrupts), `_bf16_bits_to_f32` / `_f32_to_bf16_bits` OpenCL-C helpers + tracked LEA base pointer. |
-| **CUDA** | **f32-register-compute model** (PTX has no native bf16 *arithmetic*, only `cvt.*.bf16`): the value lives in an `.f32` register and computes as f32; arithmetic/compare route through the f32 tables; `ConvertValue` bf16<->f32 is a register no-op. Load = `ld.global.b16` + `cvt.f32.bf16`; store = `cvt.rn.bf16.f32` (RNE) + `st.global.b16`. `cvt.*.bf16` is native on sm_80+ (Ampere/Ada/Hopper). |
+| **CUDA** | **f32-register-compute model** (PTX has no native bf16 *arithmetic*): the value lives in an `.f32` register and computes as f32; arithmetic/compare route through the f32 tables; `ConvertValue` bf16<->f32 is a register no-op. **The bf16<->f32 conversion at the load/store boundary uses PORTABLE bit-manipulation (basic integer ops on EVERY CUDA arch), NOT the native `cvt.*.bf16`** - those `cvt` instructions are sm_80+ (Ampere) only, so the earlier native-cvt path failed to compile on pre-Ampere cards (Pascal sm_61 / Volta sm_70 / Turing sm_75). Load = `ld.global.b16` + zero-extend + `shl 16` + reinterpret (exact, bf16 = top 16 bits of fp32); store = RNE round + NaN-guard + `st.global.b16`. Byte-identical to every other backend. (4.13.0+; pre-4.13.0 used the sm_80 native cvt and broke on older cards.) |
 | **CPU** | Native - the managed `BFloat16` struct runs directly (`DefaultILBackend`). |
 
+### FP8 (`Float8E4M3` + `Float8E5M2`) buffer access
+
+`ILGPU.Float8E4M3` and `ILGPU.Float8E5M2` add the two OCP 8-bit floating-point formats, each with the
+`BasicValueType.Float8E4M3` / `Float8E5M2` IR primitive. **Complete Read/Write/EndToEnd support on ALL
+6 backends.**
+
+- **`Float8E4M3`** - 1 sign / 4 exponent / 3 mantissa, bias 7. The "E4M3FN" finite variant: **no
+  infinities** (the only non-finite value is NaN at `0x7F`/`0xFF`), max finite magnitude **448**, finite
+  overflow **saturates** to ±448. The FP8 **forward / inference** format (one extra mantissa bit vs E5M2,
+  at the cost of range).
+- **`Float8E5M2`** - 1 sign / 5 exponent / 2 mantissa, bias 15. IEEE-754-style: **has infinities and
+  NaNs** (like fp16 but with 8 fewer mantissa bits). The FP8 **backward / gradient** format (fp16-class
+  dynamic range, which gradients need).
+
+Like `Half`/`BFloat16`, FP8 uses the **f32-register model**: values compute as f32 in-register and are
+converted to the 1-byte FP8 grid only at the load/store boundary, so accumulation stays full-precision
+(matching how real FP8 tensor-core hardware accumulates). Unlike bf16 (a trivial top-16-bits shift), the
+FP8 conversion needs exponent rebias (127 -> 7/15), round-to-nearest-even from 23 to 3/2 mantissa bits,
+subnormal normalization, and the per-format specials. The conversion is **byte-identical across every
+backend** (CPU-verified idempotence 0/256 for all representable values).
+
+| Backend | Mechanism |
+|---------|-----------|
+| **WebGPU** | Always emulated. Packed **4 FP8 per `array<atomic<u32>>` word** (1-byte sub-word storage); `_e4m3_to_f32`/`_e5m2_to_f32` + inverse WGSL helpers at the load/store boundary. |
+| **Wasm** | Conversion emitted as **inline WebAssembly bytecode** (`EmitFP8ToF32`/`EmitF32ToFP8`, the subnormal-normalize loop unrolled for bit-exactness); 1-byte `i32.load8_u` / `i32.store8` (verified-atomic in barrier kernels). |
+| **WebGL** | Packed 4 FP8 per R32I texel; `texelFetch` + shift/mask load, Transform-Feedback varying store; `_e4m3/_e5m2` GLSL helpers. |
+| **OpenCL** | Emulated as `uchar*` storage (1-byte stride); `_e4m3_bits_to_f32` / `_f32_to_e4m3_bits` (+ E5M2) OpenCL-C helpers + tracked LEA base pointer. |
+| **CUDA** | f32-register model. The FP8<->f32 conversion is **inline PTX bit-manipulation** (branchless `setp`/`selp`, unrolled normalize) using only basic integer ops - FP8 has no portable native PTX cvt (`cvt.*.e4m3` is sm_89+ only, i.e. Ada/Hopper), so this works on every CUDA arch. Load = `ld.global.u8` + convert; store = convert + `st.global.u8`. |
+| **CPU** | Native - the managed `Float8E4M3`/`Float8E5M2` structs run directly. |
+
+> **Convention note (E4M3 overflow):** out-of-range *inputs* to `Float8E4M3` saturate finite overflow to
+> ±448 and map ±Inf -> NaN (the OCP / NVIDIA Transformer Engine saturating-forward convention). Only the
+> out-of-range input behavior is convention-dependent; every *representable* value round-trips exactly.
+
 ### Sub-Word Usage Notes
 
 These apply to any kernel using `ArrayView<byte>`, `ArrayView<sbyte>`, `ArrayView<short>`, `ArrayView<ushort>`, or `ArrayView<Half>`:
diff --git a/README.md b/README.md
@@ -9,7 +9,7 @@ Write parallel compute code in C# and let the library pick the best available ba
 
 ## Recent Highlights
 
-**4.13.0 (newest):** **BFloat16 (`ILGPU.BFloat16`) on all 6 backends** - a kernel-native "brain float" (1 sign / 8 exponent / 7 mantissa = the top 16 bits of an fp32, so it keeps fp32's *full dynamic range* - the right trade for ML weights/activations where fp16's tiny range overflows/underflows). `BasicValueType.BFloat16` IR primitive + `INumber<BFloat16>` for generic-math kernels. The bf16<->f32 conversion (exact zero-extend + round-to-nearest-even) is byte-identical across every backend; CUDA uses native `cvt.*.bf16` (sm_80+) with an f32-register-compute model since PTX has no native bf16 arithmetic. Bundles forks **2.0.18**.
+**4.13.0 (newest):** **Full low-precision floating-point support across all 6 backends.** Three new kernel-native types join `Half` - **`BFloat16`** ("brain float", top 16 bits of an fp32, so it keeps fp32's full dynamic range), and the two **FP8** formats **`Float8E4M3`** (forward/inference: no Inf, saturates to ±448) and **`Float8E5M2`** (backward/gradient: IEEE Inf/NaN) - each with a `BasicValueType` IR primitive and full `INumber<T>` support. Plus: **generic `INumber<T>` mixed-precision kernels** (one `where T : INumber<T>` kernel for float/Half/bf16/fp8 instead of N per-type variants) and **`PrecisionConvert.ConvertToSingle<T>` / `ConvertFromSingle<T>`** for transpilable generic `float`<->`T` conversion *inside* a kernel (the read-low-precision / accumulate-in-float / write-low-precision path, as one generic op). All conversions are byte-identical across every backend (helper functions on OpenCL/WGSL/GLSL, inline WebAssembly bytecode on Wasm, inline PTX bit-manipulation on CUDA). **bf16 + FP8 now run on every CUDA architecture** including pre-Ampere cards (GTX 1080 / RTX 2060): the PTX path uses portable bit-manipulation rather than the sm_80/sm_89-only native cvt instructions that previously failed to compile on older cards.
 
 **4.12.0:** **Sync/async contract** - operations that *wait* or *read a result back* are async-only; the sync form now throws on the browser (`Synchronize()` -> use `await SynchronizeAsync()`) instead of silently returning wrong data, while fire-and-forget work (dispatch / alloc / upload / `Flush`-submit) stays sync. Plus `AcceleratorRequirements.RequiresScatterStores` (4.12.1) to gate WebGL out of in-kernel scatter kernels at selection time.
 
@@ -87,7 +87,8 @@ All support sub-word + `Half` + `BFloat16` types, 64-bit (native on Wasm/CUDA/Op
 
 ## Features
 
-- **Sub-word data types** - `Int8`, `UInt8`, `Int16`, `UInt16`, `Float16` (`ILGPU.Half`), and `BFloat16` (`ILGPU.BFloat16`) buffer access on all 6 backends. Packed storage with correct stride handling per backend. `Half.Abs`, `Half.Min`, `Half.Max`, `Half.Clamp` intrinsics. `BFloat16` keeps fp32's full dynamic range (top 16 bits of an fp32) - the ML-weights trade
+- **Sub-word & low-precision float data types** - `Int8`, `UInt8`, `Int16`, `UInt16`, `Float16` (`ILGPU.Half`), `BFloat16` (`ILGPU.BFloat16`), and FP8 (`ILGPU.Float8E4M3` + `ILGPU.Float8E5M2`) buffer access on all 6 backends. Packed storage with correct stride handling per backend. `Half.Abs/Min/Max/Clamp` intrinsics. `BFloat16` keeps fp32's full dynamic range (top 16 bits of an fp32); FP8 gives 1-byte storage in the two OCP formats (E4M3 forward/inference, E5M2 backward/gradient) - the ML-weights/activations trade
+- **Generic mixed-precision kernels** - one `where T : INumber<T>` kernel runs for float/Half/bf16/fp8 instead of N hand-written per-type variants; `PrecisionConvert.ConvertToSingle<T>` / `ConvertFromSingle<T>` give transpilable generic `float`<->`T` conversion inside a kernel (read low-precision, accumulate in float, write low-precision)
 - **CopyFromJS** - Write JavaScript `TypedArray` or `ArrayBuffer` data directly to GPU memory without .NET heap allocation. Available on all browser backends
 - **Lambda kernels** - Write kernels as capturing C# lambdas - captured scalar values are automatically passed to the GPU at dispatch time. No boilerplate, all 6 backends
 - **Higher-order kernels** - `DelegateSpecialization<Func<T,R>>` lets you pass operations as kernel parameters. The delegate is resolved and inlined at compile time - one kernel, many behaviors
diff --git a/SpawnDev.ILGPU/SpawnDev.ILGPU.csproj b/SpawnDev.ILGPU/SpawnDev.ILGPU.csproj
@@ -6,7 +6,7 @@
 		<Nullable>enable</Nullable>
 		<Version>4.13.0-local.10</Version>
 		<!-- Brief current-version highlights only. Full per-version history with code samples lives in CHANGELOG.md (linked from the README). -->
-		<PackageReleaseNotes>4.13.0-local.10: FP8 (Float8E4M3 + Float8E5M2) now works on ALL 6 BACKENDS (CPU/OpenCL/WebGPU/WebGL/Wasm/CUDA), AND a bf16 fix for PRE-AMPERE CUDA cards. (1) **FP8 complete:** the two OCP FP8 8-bit float types (E4M3FN = forward/inference, sat to +-448, no Inf; E5M2 = backward/gradient, IEEE Inf/NaN) as full INumber&lt;T&gt; values + BasicValueType IR primitives, with the FP8&lt;-&gt;f32 conversion emitted per-backend (helper fns on OpenCL/WGSL/GLSL; inline WASM bytecode; inline PTX bit-manip). Gate: BackendTestBase.PrecisionConvert_Float8E{4M3,5M2}_RoundTripBitExact bit-exact vs the concrete cast on every backend; relu(x*scale+bias) generic kernel 257/257 on CPU+OpenCL+CUDA. (2) **bf16 PRE-AMPERE FIX (important):** the PTX bf16 path emitted cvt.f32.bf16 / cvt.rn.bf16.f32 unconditionally - those are sm_80+ (Ampere) ONLY, so bf16 kernels FAILED TO COMPILE on older CUDA cards (GTX 1080 = sm_61, RTX 2060 = sm_75). Replaced with portable bit-manip (basic integer ops, every CUDA arch) at all 7 sites; PMT BFloat16 107/0 all 6 backends incl CUDA. bf16 (consumed by ML) now runs on pre-Ampere. LESSON: native-cvt shortcuts silently gate out older hardware - FP8 likewise uses portable bit-manip (FP8 cvt is sm_89-only). Forks bump to 2.0.26. --- 4.13.0-local.9: `PrecisionConvert` - transpilable GENERIC float&lt;-&gt;T conversion inside a `where T:INumber&lt;T&gt;` kernel (`PrecisionConvert.ConvertToSingle&lt;T&gt;(T)` + `ConvertFromSingle&lt;T&gt;(float)`). Inside a generic kernel there is no C# way to write `(float)t`/`(T)f`, so callers reach for `float.CreateChecked(t)`/`T.CreateChecked(f)` - which touch System.Type and the transpiler rejects on every GPU backend. These two methods are tagged [ConvertIntrinisc] so the frontend lowers each call to the SAME ConvertValue the concrete `(float)Half`/`(Half)float` cast emits (resolved per instantiation), bypassing System.Type. 
This lets every precision-aware op (read low-p, accumulate in float, write low-p - Conv/GroupNorm/SiLU/MatMul) be ONE generic kernel for float/Half/bf16 instead of N per-type variants. Gate: new BackendTestBase.PrecisionConvert round-trip (float/Half/bf16) **23/0 all 6 backends** (pure ConvertFromSingle(ConvertToSingle(x)) bit-exact vs the concrete cast); GenericPrecision still 23/0. Also lands the FP8 (Float8E4M3 + Float8E5M2) FOUNDATION: the two OCP FP8 core types as full INumber&lt;T&gt; values + BasicValueType IR primitives (append-only, no ordinal shift) wired through the IR core + desktop type tables - FP8 kernels run bit-exact on the CPU backend; GPU FP8 codegen is in progress (additive, opt-in, zero effect on existing types). Forks bump to 2.0.25. --- 4.13.0-local.8: Generic `INumber&lt;T&gt;` mixed-precision kernels (float/Half/bf16) now transpile + run correctly on ALL 6 backends - one `where T:INumber&lt;T&gt;` kernel instead of N hand-written per-type variants. Fixes the codegen gaps the concrete-typed bf16/Half work didn't cover (distinct from sub-word BUFFER elements): (1) PTX bf16 SELECT (the `v>0?v:0` ternary `selp`) was missing the bf16-&gt;f32 remap -&gt; KeyNotFoundException at compile. (2) By-value sub-word SCALAR params (e.g. a kernel's scale/bias): PTX declared the bf16 param .f32 (4B) but the host packs 2B storage -&gt; arrived as 0; OpenCL declared emulated Half/bf16 as 4B `float` -&gt; CL_INVALID_ARG_SIZE; WebGPU/WebGL read the scalar without the sub-word conversion -&gt; 0; Wasm struct-serialized the value -&gt; raw bits. All now declare/pass the 2-byte storage and convert to f32 at the boundary (the same conversion buffer ELEMENTS use), keyed on the type so non-sub-word params are byte-identical. Gate: new BackendTestBase.GenericPrecision (float/Half/bf16) **23/0 all 6 backends**; no regression (BFloat16 100/0, Half 190/0/8). Forks bump to 2.0.24. --- 4.13.0-local.7: (1) Wasm device-copy ORDERING fix - REVISES local.6. 
Sync device-to-device CopyFrom/CopyTo now WORKS on browser again (no longer throws): the Wasm backend ENQUEUES the copy into its serialized work stream so it runs AFTER the producing kernel (the real fix - the worker pool was not a single ordered queue, so the old immediate SharedArrayBuffer copy read stale data; WebGPU/WebGL were always queue-ordered). New virtual MemoryBuffer.CopyFromBufferOrdered (Wasm overrides to defer; default = immediate); the device-to-device throw + the RequiresAsyncDeviceCopy guard are removed (the flag is now informational - device copies are deferred/queue-ordered, completed at the next drain/dispatch). CopyFromAsync stays as an optional eager-completion convenience. No consumer migration needed - sync CopyFrom is correct again. (2) Wasm SIMD128 Stage-3a: additive v128 kernel_simd (4 lanes/call) for the f32 unit-stride elementwise class on SIMD-capable browsers (PackedSimd.IsSupported; no-SIMD builds stay first-class on the byte-identical scalar path). The lane-variance analysis seeds the index AND the thread-position intrinsics (Grid.GlobalIndex/Group.Idx/Grid.Idx/LaneIdx); a structural guard refuses to emit kernel_simd without a lane-variant v128 store (so the by-4 dispatch can never skip lanes). Full Wasm PMT lane 537/0/17. Forks bump to 2.0.23. --- 4.13.0-local.6: Completes the sync/async contract for device-to-device copies. A SYNCHRONOUS CopyFrom/CopyTo between two device buffers now THROWS NotSupportedException on the browser backends (Wasm/WebGPU/WebGL) - it cannot be ordered against a producing kernel at the async GPU boundary and silently read stale data (a real gemma4 KV argmax flip on Wasm). Use `await CopyFromAsync(...)` (drains the producer, then copies); host&lt;-&gt;device transfers are unaffected. 
New Accelerator.RequiresAsyncDeviceCopy flag (true on the 3 browser backends), the guard in MemoryBuffer.CopyFromBuffer, public CopyFromBufferAfterDrain + CopyFromUnchecked for library code that orders the copy by other means (the 2 internal scan-fence sites migrated). Locked by BackendTestBase.SyncDeviceCopyContractTest (23/0 all lanes). Forks bump to 2.0.22. --- 4.13.0-local.5: CPU backend - cooperative multi-multiprocessor execution. The CPU accelerator simulated a GPU group of 64 threads as 64 OS threads in ONE multiprocessor; on a machine with fewer cores every in-kernel Group.Barrier oversubscribed and thrashed (~1.4s/launch with multi-second tails on barrier/reduction kernels - e.g. GGUF KV-cache decode timed out). Two fixes: (1) the default CPU device now uses NumMultiprocessors = logical-core-count (was hardcoded 1) so thread-groups run one-per-core in parallel; (2) Auto mode now selects cooperative (sequential-within-group) scheduling - one active thread per multiprocessor, cheap O(1) targeted-pulse barrier handoff (was an O(N^2) Monitor.PulseAll storm) - eliminating oversubscription while keeping cross-core parallelism. Measured: a 256-group x64 shared-memory reduction went 1381ms -> 146ms/launch with zero timing variance, results bit-identical. Explicit Parallel mode still available for barrier-free kernels. Forks bump to 2.0.21. Earlier in 4.13.0: BFloat16 full type parity on all 6 backends (see CHANGELOG). Full per-version history: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
+		<PackageReleaseNotes>4.13.0 brings full low-precision floating-point support across ALL 6 backends (CPU, OpenCL, WebGPU, WebGL, Wasm, CUDA): Half, BFloat16, and now FP8 (Float8E4M3 + Float8E5M2), plus generic INumber&lt;T&gt; mixed-precision kernels and PrecisionConvert for transpilable generic float&lt;-&gt;T conversion inside a kernel. This release also fixes bf16 on PRE-AMPERE CUDA cards (GTX 1080 / RTX 2060 etc.): the PTX bf16 path used sm_80+ cvt instructions and failed to compile on older cards; it now uses portable bit-manipulation that works on every CUDA architecture (FP8 likewise). Full per-version history with code samples: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
 		<GeneratePackageOnBuild>True</GeneratePackageOnBuild>
 		<GenerateDocumentationFile>true</GenerateDocumentationFile>
 		<EmbedAllSources>true</EmbedAllSources>