CLAUDE.md: correct bf16 CUDA mechanism (portable bit-manip, not sm_80 cvt) + add FP8

LostBeard · claude · LostBeard · commit bd5581881e3b · 2026-06-16T19:08:53.000-04:00
The bf16 pre-Ampere fix (4.13.0) left two current-state facts in CLAUDE.md stale: the feature matrix
and the sub-word-types note both claimed that bf16 on CUDA uses native cvt.*.bf16 (sm_80+). It now uses
portable bit-manipulation, which works on every CUDA arch. Updated both, and added the two FP8 types
(Float8E4M3 + Float8E5M2) to the matrix and the note. Doc-only.
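
For readers unfamiliar with the term, here is a rough sketch of what "portable bit-manipulation" can look like for the bf16<->f32 conversion with round-to-nearest-even (RNE). It is plain C for illustration, not the PTX the CUDA backend actually emits. The function names are hypothetical, and the NaN handling (forcing a quiet NaN) is an assumption rather than something this commit states. The point is that only integer shifts, masks, and adds are needed, so nothing depends on sm_80 instructions.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: f32 -> bf16 using integer ops only, with RNE rounding. */
static uint16_t f32_to_bf16_rne(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret the float's bits */

    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        /* NaN: keep the sign and set the quiet bit so that dropping the
           low 16 bits can never turn the value into Inf (assumed policy). */
        return (uint16_t)((bits >> 16) | 0x0040u);
    }

    /* Round to nearest, ties to even: add 0x7FFF plus the lowest kept bit,
       then drop the low 16 bits. A carry may overflow to Inf, which is the
       correct RNE result for values just below the f32 maximum. */
    uint32_t lsb = (bits >> 16) & 1u;
    bits += 0x7FFFu + lsb;
    return (uint16_t)(bits >> 16);
}

/* Hypothetical helper: bf16 -> f32 is exact, since bf16 is the top 16 bits. */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Under this scheme, arithmetic would happen in f32 registers, with `bf16_to_f32` applied at load and `f32_to_bf16_rne` at store, matching the f32-register-compute model described in the note below.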

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -147,7 +147,7 @@ If total > 10: `InvalidOperationException` at dispatch time (v4.9.1+). Before v4
 
 `ArrayView<byte>`, `ArrayView<sbyte>`, `ArrayView<short>`, `ArrayView<ushort>`, `ArrayView<Half>` (ILGPU.Half), `ArrayView<BFloat16>` (ILGPU.BFloat16) supported on all 6 backends.
 
-**Use `ILGPU.Half`, NOT `System.Half`** in kernel signatures. Implicit conversion operators exist for interop. Same for **`ILGPU.BFloat16`** (the "brain float": top 16 bits of an fp32, so fp32's full dynamic range - the ML-weights trade vs `Half`). bf16 detail: [Docs/data-type-support.md](Docs/data-type-support.md). On CUDA bf16 uses an f32-register-compute model (no native PTX bf16 arithmetic; `cvt.*.bf16` at load/store, sm_80+); the browser/OpenCL backends emulate it via the same exact bf16<->f32 + RNE conversion, byte-identical to CUDA.
+**Use `ILGPU.Half`, NOT `System.Half`** in kernel signatures. Implicit conversion operators exist for interop. Same for **`ILGPU.BFloat16`** (the "brain float": top 16 bits of an fp32, so fp32's full dynamic range - the ML-weights trade vs `Half`) and the two FP8 types **`ILGPU.Float8E4M3`** (forward/inference, no Inf, sat ±448) + **`ILGPU.Float8E5M2`** (backward/gradient, IEEE Inf/NaN). bf16/FP8 detail: [Docs/data-type-support.md](Docs/data-type-support.md). On CUDA bf16 + FP8 use an f32-register-compute model (no native PTX bf16/fp8 arithmetic); the load/store conversion is **PORTABLE bit-manipulation (basic integer ops on every CUDA arch incl. pre-Ampere)** - 4.13.0+ replaced the sm_80-only `cvt.*.bf16` shortcut that broke on older cards. The browser/OpenCL/Wasm backends emulate the same exact conversion, byte-identical to CUDA.
 
 **Per-backend implementation:**
 - **WebGPU:** Packed into `array<atomic<u32>>`. Load via atomicLoad + shift + mask. Store via atomicAnd + atomicOr (thread-safe sub-word writes). Float16 load/store calls `_f16_to_f32` / `_f32_to_f16` helpers from `WGSLEmulationLibrary.F16Functions` when `!shader-f16`; native WGSL `f16` type otherwise. `WebGPUBackend.ForceEmulatedF16` test flag forces the emulation path for verification.
@@ -208,7 +208,8 @@ var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<float>>(
 | f64 native | No (emulated) | No (emulated) | Yes | Yes | Yes | Yes |
 | i64 native | No (emulated) | No (emulated) | Yes | Yes | Yes | Yes |
 | f16 native | Native or emulated** | No (emulated)*** | No (emulated) | Yes | Native or emulated**** | Yes |
-| bf16 | Emulated | Emulated | Emulated | f32-reg + native cvt (sm_80+) | Emulated | Native (managed) |
+| bf16 | Emulated | Emulated | Emulated | f32-reg + portable bit-manip (all arch) | Emulated | Native (managed) |
+| FP8 (E4M3 + E5M2) | Emulated | Emulated | Emulated | f32-reg + portable bit-manip (all arch) | Emulated | Native (managed) |
 
 *Subgroups: WebGPU requires browser support + adapter feature. OpenCL: device-dependent.
 *****Wasm subgroups are EMULATED (no hardware warps): `WarpSize = 8` (mirrors CPU), `Warp.Shuffle`/`SubWarpShuffle` lower to a shared-memory exchange (write per-lane slot → barrier → read source-lane slot) — see `WasmBackend.WasmWarpSize` + `EmitWarpShuffle` in `WasmKernelFunctionGenerator.cs`. `LaneIdx = threadIdX % WarpSize`; `WarpIdx`/`IsFirstLane` derive in IL. Algorithm-layer warp Reduce/Scan route through `WasmWarpExtensions` (also shared-memory). Verified vs the CPU oracle (`SubgroupShuffleTest`, `ReduceMinMaxTest`). Fixed 2026-06-07 (`116c789`).