Commit bd55818

LostBeard and claude committed
CLAUDE.md: correct bf16 CUDA mechanism (portable bit-manip, not sm_80 cvt) + add FP8
The bf16 pre-Ampere fix (4.13.0) made two CLAUDE.md current-state facts stale: the feature matrix and the sub-word-types note both claimed bf16 on CUDA uses native cvt.*.bf16 (sm_80+). It now uses portable bit-manipulation (every CUDA arch). Updated both, and added the two FP8 types (Float8E4M3 + Float8E5M2) to the matrix + note. Doc-only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent aad9e62 commit bd55818

1 file changed

Lines changed: 3 additions & 2 deletions

CLAUDE.md

@@ -147,7 +147,7 @@ If total > 10: `InvalidOperationException` at dispatch time (v4.9.1+). Before v4
 
 `ArrayView<byte>`, `ArrayView<sbyte>`, `ArrayView<short>`, `ArrayView<ushort>`, `ArrayView<Half>` (ILGPU.Half), `ArrayView<BFloat16>` (ILGPU.BFloat16) supported on all 6 backends.
 
-**Use `ILGPU.Half`, NOT `System.Half`** in kernel signatures. Implicit conversion operators exist for interop. Same for **`ILGPU.BFloat16`** (the "brain float": top 16 bits of an fp32, so fp32's full dynamic range - the ML-weights trade vs `Half`). bf16 detail: [Docs/data-type-support.md](Docs/data-type-support.md). On CUDA bf16 uses an f32-register-compute model (no native PTX bf16 arithmetic; `cvt.*.bf16` at load/store, sm_80+); the browser/OpenCL backends emulate it via the same exact bf16<->f32 + RNE conversion, byte-identical to CUDA.
+**Use `ILGPU.Half`, NOT `System.Half`** in kernel signatures. Implicit conversion operators exist for interop. Same for **`ILGPU.BFloat16`** (the "brain float": top 16 bits of an fp32, so fp32's full dynamic range - the ML-weights trade vs `Half`) and the two FP8 types **`ILGPU.Float8E4M3`** (forward/inference, no Inf, sat ±448) + **`ILGPU.Float8E5M2`** (backward/gradient, IEEE Inf/NaN). bf16/FP8 detail: [Docs/data-type-support.md](Docs/data-type-support.md). On CUDA bf16 + FP8 use an f32-register-compute model (no native PTX bf16/fp8 arithmetic); the load/store conversion is **PORTABLE bit-manipulation (basic integer ops on every CUDA arch incl. pre-Ampere)** - 4.13.0+ replaced the sm_80-only `cvt.*.bf16` shortcut that broke on older cards. The browser/OpenCL/Wasm backends emulate the same exact conversion, byte-identical to CUDA.
 
 **Per-backend implementation:**
 - **WebGPU:** Packed into `array<atomic<u32>>`. Load via atomicLoad + shift + mask. Store via atomicAnd + atomicOr (thread-safe sub-word writes). Float16 load/store calls `_f16_to_f32` / `_f32_to_f16` helpers from `WGSLEmulationLibrary.F16Functions` when `!shader-f16`; native WGSL `f16` type otherwise. `WebGPUBackend.ForceEmulatedF16` test flag forces the emulation path for verification.
@@ -208,7 +208,8 @@ var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<float>>(
 | f64 native | No (emulated) | No (emulated) | Yes | Yes | Yes | Yes |
 | i64 native | No (emulated) | No (emulated) | Yes | Yes | Yes | Yes |
 | f16 native | Native or emulated** | No (emulated)*** | No (emulated) | Yes | Native or emulated**** | Yes |
-| bf16 | Emulated | Emulated | Emulated | f32-reg + native cvt (sm_80+) | Emulated | Native (managed) |
+| bf16 | Emulated | Emulated | Emulated | f32-reg + portable bit-manip (all arch) | Emulated | Native (managed) |
+| FP8 (E4M3 + E5M2) | Emulated | Emulated | Emulated | f32-reg + portable bit-manip (all arch) | Emulated | Native (managed) |
 
 *Subgroups: WebGPU requires browser support + adapter feature. OpenCL: device-dependent.
 *****Wasm subgroups are EMULATED (no hardware warps): `WarpSize = 8` (mirrors CPU), `Warp.Shuffle`/`SubWarpShuffle` lower to a shared-memory exchange (write per-lane slot → barrier → read source-lane slot) — see `WasmBackend.WasmWarpSize` + `EmitWarpShuffle` in `WasmKernelFunctionGenerator.cs`. `LaneIdx = threadIdX % WarpSize`; `WarpIdx`/`IsFirstLane` derive in IL. Algorithm-layer warp Reduce/Scan route through `WasmWarpExtensions` (also shared-memory). Verified vs the CPU oracle (`SubgroupShuffleTest`, `ReduceMinMaxTest`). Fixed 2026-06-07 (`116c789`).
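
Note on the "portable bit-manip" wording in the diff: since bf16 is the top 16 bits of an fp32, a bf16<->f32 conversion with round-to-nearest-even (RNE) can be written with integer shifts, masks, and adds alone, which is why it does not need sm_80's `cvt.*.bf16`. The following is a minimal sketch in plain C of one common way to do this. It is not taken from the ILGPU source; the function names (`f32_to_bf16_rne`, `bf16_to_f32`, and the two bit-cast helpers) are hypothetical, and the NaN handling (forcing a quiet bit so truncation cannot turn a NaN into Inf) is an assumption that may differ from what ILGPU emits.

```c
#include <stdint.h>
#include <string.h>

/* Bit-cast helpers (hypothetical names); memcpy avoids aliasing issues. */
static uint32_t f32_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_f32(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* f32 -> bf16 with round-to-nearest-even, integer ops only. */
static uint16_t f32_to_bf16_rne(float f) {
    uint32_t u = f32_bits(f);
    if ((u & 0x7FFFFFFFu) > 0x7F800000u) {
        /* NaN: keep sign and upper payload, set a mantissa bit that survives
           truncation so the result stays NaN (assumed policy). */
        return (uint16_t)((u >> 16) | 0x0040u);
    }
    /* Add 0x7FFF plus the lowest kept bit: an exact tie rounds up only when
       that bit is 1, which lands on the even neighbor. A carry out of the
       mantissa bumps the exponent, so the largest finite f32 rounds to Inf. */
    uint32_t lsb = (u >> 16) & 1u;
    u += 0x7FFFu + lsb;
    return (uint16_t)(u >> 16);
}

/* bf16 -> f32 is exact: place the 16 bits in the upper half of the word. */
static float bf16_to_f32(uint16_t h) {
    return bits_f32((uint32_t)h << 16);
}
```

With this sketch, `f32_to_bf16_rne(1.0f)` yields `0x3F80`, and the exact tie `0x3F808000` rounds down to `0x3F80` while `0x3F818000` rounds up to `0x3F82`, in both cases the even neighbor.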
