Docs: trim PackageReleaseNotes to CHANGELOG; FP8 + bf16-portable in README + data-type-support
- SpawnDev.ILGPU.csproj: PackageReleaseNotes was carrying the full local.5->local.10 history inline
(huge). Trimmed to a concise 4.13.0 summary (low-precision types on all 6 backends + the bf16
pre-Ampere fix) pointing at CHANGELOG.md for the per-version detail.
- README "Recent Highlights" 4.13.0: rewritten to the full release - BFloat16 + FP8 (Float8E4M3 +
Float8E5M2) on all 6 backends, generic INumber<T> mixed-precision kernels, PrecisionConvert, and
the portable-bit-manip CUDA fix that brings bf16/FP8 to pre-Ampere cards. Features list updated.
- Docs/data-type-support.md: FP8 rows in Read/Write/EndToEnd tables (all 6 backends), a full FP8
per-backend implementation section, and the bf16 CUDA mechanism corrected from native sm_80 cvt to
portable bit-manip (works on every CUDA arch).
GitHub release description for 4.13.0 written separately (outside the repo) for posting.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -148,9 +154,43 @@ NaN-preservation guard. Values compute as f32 everywhere; only the storage is 2-
|**Wasm**|`EmitBF16ToF32` / `EmitF32ToBF16` emit the conversion as inline WebAssembly bytecode; 2-byte `i32.load16_u` / `i32.store16` (atomic in barrier kernels). |
|**WebGL**| Packed-u16 in an R32I texel; `texelFetch` + shift/mask load, Transform-Feedback varying store; `_bf16_to_f32` / `_f32_to_bf16` GLSL helpers. |
|**OpenCL**| Emulated (no common native bf16 extension; `cl_khr_fp16` is fp16, not bf16). View params are `ushort*` (2-byte storage stride - a `float*` typedef silently corrupts), `_bf16_bits_to_f32` / `_f32_to_bf16_bits` OpenCL-C helpers + tracked LEA base pointer. |
-
|**CUDA**|**f32-register-compute model** (PTX has no native bf16 *arithmetic*, only `cvt.*.bf16`): the value lives in an `.f32` register and computes as f32; arithmetic/compare route through the f32 tables; `ConvertValue` bf16<->f32 is a register no-op. Load = `ld.global.b16` + `cvt.f32.bf16`; store = `cvt.rn.bf16.f32` (RNE) + `st.global.b16`. `cvt.*.bf16` is native on sm_80+ (Ampere/Ada/Hopper).|
+
|**CUDA**|**f32-register-compute model** (PTX has no native bf16 *arithmetic*): the value lives in an `.f32` register and computes as f32; arithmetic/compare route through the f32 tables; `ConvertValue` bf16<->f32 is a register no-op. **The bf16<->f32 conversion at the load/store boundary uses PORTABLE bit-manipulation (basic integer ops on EVERY CUDA arch), NOT the native `cvt.*.bf16`** - those `cvt` instructions are sm_80+ (Ampere) only, so the earlier native-cvt path failed to compile on pre-Ampere cards (Pascal sm_61 / Volta sm_70 / Turing sm_75). Load = `ld.global.b16` + zero-extend + `shl 16` + reinterpret (exact, since bf16 is the top 16 bits of an fp32); store = RNE round + NaN-guard + `st.global.b16`. Byte-identical to every other backend. (4.13.0+; pre-4.13.0 used the sm_80 native cvt and broke on older cards.)|
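
The portable bf16<->f32 conversion described in the CUDA row can be sketched in plain C. This is only an illustration of the technique (zero-extend + shift on load, round-to-nearest-even + NaN guard on store), not the PTX the backend emits; the function names are hypothetical and merely echo the OpenCL-C helper names listed above.

```c
#include <assert.h>   /* static_assert */
#include <stdint.h>
#include <string.h>   /* memcpy, used as a safe bit reinterpret */

static_assert(sizeof(float) == sizeof(uint32_t), "IEEE-754 binary32 float expected");

/* bf16 -> f32 is exact: a bf16 value is the top 16 bits of an fp32. */
static float bf16_bits_to_f32(uint16_t bits) {
    uint32_t u = (uint32_t)bits << 16;      /* zero-extend, then shl 16 */
    float f;
    memcpy(&f, &u, sizeof f);               /* reinterpret the bits as f32 */
    return f;
}

/* f32 -> bf16 with round-to-nearest-even (RNE) and a NaN guard. */
static uint16_t f32_to_bf16_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    if ((u & 0x7FFFFFFFu) > 0x7F800000u) {
        /* NaN: force a mantissa bit that survives truncation, so a NaN
           whose payload sits only in the low 16 bits cannot become Inf. */
        return (uint16_t)((u >> 16) | 0x0040u);
    }
    uint32_t lsb = (u >> 16) & 1u;          /* lowest bit that will be kept */
    u += 0x7FFFu + lsb;                     /* ties go to the even result */
    return (uint16_t)(u >> 16);
}
```

Because both directions use only shifts, adds, and masks, the same logic can be expressed on any target with basic integer operations, which is the property the CUDA row relies on for pre-Ampere cards.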
|**OpenCL**| Emulated as `uchar*` storage (1-byte stride); `_e4m3_bits_to_f32` / `_f32_to_e4m3_bits` (+ E5M2) OpenCL-C helpers + tracked LEA base pointer. |
+
|**CUDA**| f32-register model. The FP8<->f32 conversion is **inline PTX bit-manipulation** (branchless `setp`/`selp`, unrolled normalize) using only basic integer ops - FP8 has no portable native PTX cvt (`cvt.*.e4m3` is sm_89/Hopper only), so this works on every CUDA arch. Load = `ld.global.u8` + convert; store = convert + `st.global.u8`. |
+
|**CPU**| Native - the managed `Float8E4M3`/`Float8E5M2` structs run directly. |
+
+
> **Convention note (E4M3 overflow):** out-of-range *inputs* to `Float8E4M3` saturate finite overflow to
+
> ±448 and map ±Inf -> NaN (the OCP / NVIDIA Transformer Engine saturating-forward convention). Only the
+
> out-of-range input behavior is convention-dependent; every *representable* value round-trips exactly.
+
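
The E4M3 convention in the note above (finite overflow saturates to ±448, ±Inf maps to NaN, representable values round-trip exactly) can be illustrated with a small reference sketch in C. It assumes the OCP E4M3 layout (1 sign / 4 exponent / 3 mantissa bits, bias 7, a single NaN pattern per sign) and encodes by a nearest-code search with ties to even. That search is a deliberately simple stand-in, not the branchless bit-manipulation the backends emit, and the function names are hypothetical.

```c
#include <assert.h>   /* static_assert */
#include <stdint.h>
#include <string.h>   /* memcpy, used as a safe bit reinterpret */

static_assert(sizeof(float) == sizeof(uint32_t), "IEEE-754 binary32 float expected");

/* E4M3 -> f32: exact for every code. */
static float e4m3_bits_to_f32(uint8_t b) {
    uint32_t sign = (uint32_t)(b & 0x80u) << 24;
    uint32_t exp  = (b >> 3) & 0xFu;
    uint32_t man  = b & 0x7u;
    uint32_t u;
    if (exp == 0xFu && man == 0x7u) {
        u = sign | 0x7FC00000u;                     /* the only NaN pattern */
    } else if (exp == 0u) {
        float sub = (float)man * (1.0f / 512.0f);   /* subnormal: man * 2^-9 */
        memcpy(&u, &sub, sizeof u);
        u |= sign;
    } else {
        u = sign | ((exp + 120u) << 23) | (man << 20); /* rebias 7 -> 127 */
    }
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* f32 -> E4M3: nearest finite code, ties to even, saturating. */
static uint8_t f32_to_e4m3_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    uint8_t  sign = (uint8_t)((u >> 24) & 0x80u);
    uint32_t mag  = u & 0x7FFFFFFFu;
    if (mag >= 0x7F800000u) {
        return (uint8_t)(sign | 0x7Fu);             /* NaN and +-Inf -> NaN */
    }
    float absf;
    memcpy(&absf, &mag, sizeof absf);
    double a = absf;
    uint8_t best = 0;
    double best_err = a;                            /* distance to code 0 */
    for (uint8_t code = 1; code <= 0x7Eu; ++code) { /* 0x7E encodes 448 */
        double v = e4m3_bits_to_f32(code);
        double err = v > a ? v - a : a - v;
        /* Adjacent codes alternate parity, so an even code wins a tie. */
        if (err < best_err || (err == best_err && (code & 1u) == 0u)) {
            best = code;
            best_err = err;
        }
    }
    return (uint8_t)(sign | best);                  /* inputs above 448 land on 0x7E */
}
```

In this sketch, saturation falls out of the search itself: the largest finite code is 448, so any larger finite input resolves to it, while Inf is handled separately and becomes NaN.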
### Sub-Word Usage Notes
These apply to any kernel using `ArrayView<byte>`, `ArrayView<sbyte>`, `ArrayView<short>`, `ArrayView<ushort>`, or `ArrayView<Half>`:
README.md (3 additions & 2 deletions)
@@ -9,7 +9,7 @@ Write parallel compute code in C# and let the library pick the best available ba
## Recent Highlights
-
**4.13.0 (newest):** **BFloat16 (`ILGPU.BFloat16`) on all 6 backends** - a kernel-native "brain float" (1 sign / 8 exponent / 7 mantissa = the top 16 bits of an fp32, so it keeps fp32's *full dynamic range* - the right trade for ML weights/activations where fp16's tiny range overflows/underflows). `BasicValueType.BFloat16` IR primitive + `INumber<BFloat16>` for generic-math kernels. The bf16<->f32 conversion (exact zero-extend + round-to-nearest-even) is byte-identical across every backend; CUDA uses native `cvt.*.bf16` (sm_80+) with an f32-register-compute model since PTX has no native bf16 arithmetic. Bundles forks **2.0.18**.
+
**4.13.0 (newest):** **Full low-precision floating-point support across all 6 backends.** Three new kernel-native types join `Half` - **`BFloat16`** ("brain float", top 16 bits of an fp32, so it keeps fp32's full dynamic range), and the two **FP8** formats **`Float8E4M3`** (forward/inference: no Inf, saturates to ±448) and **`Float8E5M2`** (backward/gradient: IEEE Inf/NaN) - each with a `BasicValueType` IR primitive and full `INumber<T>` support. Plus: **generic `INumber<T>` mixed-precision kernels** (one `where T : INumber<T>` kernel for float/Half/bf16/fp8 instead of N per-type variants) and **`PrecisionConvert.ConvertToSingle<T>` / `ConvertFromSingle<T>`** for transpilable generic `float`<->`T` conversion *inside* a kernel (the read-low-precision / accumulate-in-float / write-low-precision path, as one generic op). All conversions are byte-identical across every backend (helper functions on OpenCL/WGSL/GLSL, inline WebAssembly bytecode on Wasm, inline PTX bit-manipulation on CUDA). **bf16 + FP8 now run on every CUDA architecture** including pre-Ampere cards (GTX 1080 / RTX 2060): the PTX path uses portable bit-manipulation rather than the sm_80/sm_89-only native cvt instructions that previously failed to compile on older cards.
**4.12.0:** **Sync/async contract** - operations that *wait* or *read a result back* are async-only; the sync form now throws on the browser (`Synchronize()` -> use `await SynchronizeAsync()`) instead of silently returning wrong data, while fire-and-forget work (dispatch / alloc / upload / `Flush`-submit) stays sync. Plus `AcceleratorRequirements.RequiresScatterStores` (4.12.1) to gate WebGL out of in-kernel scatter kernels at selection time.
@@ -87,7 +87,8 @@ All support sub-word + `Half` + `BFloat16` types, 64-bit (native on Wasm/CUDA/Op
## Features
-
- **Sub-word data types** - `Int8`, `UInt8`, `Int16`, `UInt16`, `Float16` (`ILGPU.Half`), and `BFloat16` (`ILGPU.BFloat16`) buffer access on all 6 backends. Packed storage with correct stride handling per backend. `Half.Abs`, `Half.Min`, `Half.Max`, `Half.Clamp` intrinsics. `BFloat16` keeps fp32's full dynamic range (top 16 bits of an fp32) - the ML-weights trade
+
- **Sub-word & low-precision float data types** - `Int8`, `UInt8`, `Int16`, `UInt16`, `Float16` (`ILGPU.Half`), `BFloat16` (`ILGPU.BFloat16`), and FP8 (`ILGPU.Float8E4M3` + `ILGPU.Float8E5M2`) buffer access on all 6 backends. Packed storage with correct stride handling per backend. `Half.Abs/Min/Max/Clamp` intrinsics. `BFloat16` keeps fp32's full dynamic range (top 16 bits of an fp32); FP8 gives 1-byte storage in the two OCP formats (E4M3 forward/inference, E5M2 backward/gradient) - the ML-weights/activations trade
+
- **Generic mixed-precision kernels** - one `where T : INumber<T>` kernel runs for float/Half/bf16/fp8 instead of N hand-written per-type variants; `PrecisionConvert.ConvertToSingle<T>` / `ConvertFromSingle<T>` give transpilable generic `float`<->`T` conversion inside a kernel (read low-precision, accumulate in float, write low-precision)
- **CopyFromJS** - Write JavaScript `TypedArray` or `ArrayBuffer` data directly to GPU memory without .NET heap allocation. Available on all browser backends
- **Lambda kernels** - Write kernels as capturing C# lambdas - captured scalar values are automatically passed to the GPU at dispatch time. No boilerplate, all 6 backends
- **Higher-order kernels** - `DelegateSpecialization<Func<T,R>>` lets you pass operations as kernel parameters. The delegate is resolved and inlined at compile time - one kernel, many behaviors
SpawnDev.ILGPU/SpawnDev.ILGPU.csproj (1 addition & 1 deletion)
@@ -6,7 +6,7 @@
<Nullable>enable</Nullable>
<Version>4.13.0-local.10</Version>
<!-- Brief current-version highlights only. Full per-version history with code samples lives in CHANGELOG.md (linked from the README). -->
-
<PackageReleaseNotes>4.13.0-local.10: FP8 (Float8E4M3 + Float8E5M2) now works on ALL 6 BACKENDS (CPU/OpenCL/WebGPU/WebGL/Wasm/CUDA), AND a bf16 fix for PRE-AMPERE CUDA cards. (1) **FP8 complete:** the two OCP FP8 8-bit float types (E4M3FN = forward/inference, sat to +-448, no Inf; E5M2 = backward/gradient, IEEE Inf/NaN) as full INumber<T> values + BasicValueType IR primitives, with the FP8<->f32 conversion emitted per-backend (helper fns on OpenCL/WGSL/GLSL; inline WASM bytecode; inline PTX bit-manip). Gate: BackendTestBase.PrecisionConvert_Float8E{4M3,5M2}_RoundTripBitExact bit-exact vs the concrete cast on every backend; relu(x*scale+bias) generic kernel 257/257 on CPU+OpenCL+CUDA. (2) **bf16 PRE-AMPERE FIX (important):** the PTX bf16 path emitted cvt.f32.bf16 / cvt.rn.bf16.f32 unconditionally - those are sm_80+ (Ampere) ONLY, so bf16 kernels FAILED TO COMPILE on older CUDA cards (GTX 1080 = sm_61, RTX 2060 = sm_75). Replaced with portable bit-manip (basic integer ops, every CUDA arch) at all 7 sites; PMT BFloat16 107/0 all 6 backends incl CUDA. bf16 (consumed by ML) now runs on pre-Ampere. LESSON: native-cvt shortcuts silently gate out older hardware - FP8 likewise uses portable bit-manip (FP8 cvt is sm_89-only). Forks bump to 2.0.26. --- 4.13.0-local.9: `PrecisionConvert` - transpilable GENERIC float<->T conversion inside a `where T:INumber<T>` kernel (`PrecisionConvert.ConvertToSingle<T>(T)` + `ConvertFromSingle<T>(float)`). Inside a generic kernel there is no C# way to write `(float)t`/`(T)f`, so callers reach for `float.CreateChecked(t)`/`T.CreateChecked(f)` - which touch System.Type and the transpiler rejects on every GPU backend. These two methods are tagged [ConvertIntrinisc] so the frontend lowers each call to the SAME ConvertValue the concrete `(float)Half`/`(Half)float` cast emits (resolved per instantiation), bypassing System.Type. 
This lets every precision-aware op (read low-p, accumulate in float, write low-p - Conv/GroupNorm/SiLU/MatMul) be ONE generic kernel for float/Half/bf16 instead of N per-type variants. Gate: new BackendTestBase.PrecisionConvert round-trip (float/Half/bf16) **23/0 all 6 backends** (pure ConvertFromSingle(ConvertToSingle(x)) bit-exact vs the concrete cast); GenericPrecision still 23/0. Also lands the FP8 (Float8E4M3 + Float8E5M2) FOUNDATION: the two OCP FP8 core types as full INumber<T> values + BasicValueType IR primitives (append-only, no ordinal shift) wired through the IR core + desktop type tables - FP8 kernels run bit-exact on the CPU backend; GPU FP8 codegen is in progress (additive, opt-in, zero effect on existing types). Forks bump to 2.0.25. --- 4.13.0-local.8: Generic `INumber<T>` mixed-precision kernels (float/Half/bf16) now transpile + run correctly on ALL 6 backends - one `where T:INumber<T>` kernel instead of N hand-written per-type variants. Fixes the codegen gaps the concrete-typed bf16/Half work didn't cover (distinct from sub-word BUFFER elements): (1) PTX bf16 SELECT (the `v>0?v:0` ternary `selp`) was missing the bf16->f32 remap -> KeyNotFoundException at compile. (2) By-value sub-word SCALAR params (e.g. a kernel's scale/bias): PTX declared the bf16 param .f32 (4B) but the host packs 2B storage -> arrived as 0; OpenCL declared emulated Half/bf16 as 4B `float` -> CL_INVALID_ARG_SIZE; WebGPU/WebGL read the scalar without the sub-word conversion -> 0; Wasm struct-serialized the value -> raw bits. All now declare/pass the 2-byte storage and convert to f32 at the boundary (the same conversion buffer ELEMENTS use), keyed on the type so non-sub-word params are byte-identical. Gate: new BackendTestBase.GenericPrecision (float/Half/bf16) **23/0 all 6 backends**; no regression (BFloat16 100/0, Half 190/0/8). Forks bump to 2.0.24. --- 4.13.0-local.7: (1) Wasm device-copy ORDERING fix - REVISES local.6. 
Sync device-to-device CopyFrom/CopyTo now WORKS on browser again (no longer throws): the Wasm backend ENQUEUES the copy into its serialized work stream so it runs AFTER the producing kernel (the real fix - the worker pool was not a single ordered queue, so the old immediate SharedArrayBuffer copy read stale data; WebGPU/WebGL were always queue-ordered). New virtual MemoryBuffer.CopyFromBufferOrdered (Wasm overrides to defer; default = immediate); the device-to-device throw + the RequiresAsyncDeviceCopy guard are removed (the flag is now informational - device copies are deferred/queue-ordered, completed at the next drain/dispatch). CopyFromAsync stays as an optional eager-completion convenience. No consumer migration needed - sync CopyFrom is correct again. (2) Wasm SIMD128 Stage-3a: additive v128 kernel_simd (4 lanes/call) for the f32 unit-stride elementwise class on SIMD-capable browsers (PackedSimd.IsSupported; no-SIMD builds stay first-class on the byte-identical scalar path). The lane-variance analysis seeds the index AND the thread-position intrinsics (Grid.GlobalIndex/Group.Idx/Grid.Idx/LaneIdx); a structural guard refuses to emit kernel_simd without a lane-variant v128 store (so the by-4 dispatch can never skip lanes). Full Wasm PMT lane 537/0/17. Forks bump to 2.0.23. --- 4.13.0-local.6: Completes the sync/async contract for device-to-device copies. A SYNCHRONOUS CopyFrom/CopyTo between two device buffers now THROWS NotSupportedException on the browser backends (Wasm/WebGPU/WebGL) - it cannot be ordered against a producing kernel at the async GPU boundary and silently read stale data (a real gemma4 KV argmax flip on Wasm). Use `await CopyFromAsync(...)` (drains the producer, then copies); host<->device transfers are unaffected. 
New Accelerator.RequiresAsyncDeviceCopy flag (true on the 3 browser backends), the guard in MemoryBuffer.CopyFromBuffer, public CopyFromBufferAfterDrain + CopyFromUnchecked for library code that orders the copy by other means (the 2 internal scan-fence sites migrated). Locked by BackendTestBase.SyncDeviceCopyContractTest (23/0 all lanes). Forks bump to 2.0.22. --- 4.13.0-local.5: CPU backend - cooperative multi-multiprocessor execution. The CPU accelerator simulated a GPU group of 64 threads as 64 OS threads in ONE multiprocessor; on a machine with fewer cores every in-kernel Group.Barrier oversubscribed and thrashed (~1.4s/launch with multi-second tails on barrier/reduction kernels - e.g. GGUF KV-cache decode timed out). Two fixes: (1) the default CPU device now uses NumMultiprocessors = logical-core-count (was hardcoded 1) so thread-groups run one-per-core in parallel; (2) Auto mode now selects cooperative (sequential-within-group) scheduling - one active thread per multiprocessor, cheap O(1) targeted-pulse barrier handoff (was an O(N^2) Monitor.PulseAll storm) - eliminating oversubscription while keeping cross-core parallelism. Measured: a 256-group x64 shared-memory reduction went 1381ms -> 146ms/launch with zero timing variance, results bit-identical. Explicit Parallel mode still available for barrier-free kernels. Forks bump to 2.0.21. Earlier in 4.13.0: BFloat16 full type parity on all 6 backends (see CHANGELOG). Full per-version history: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
+
<PackageReleaseNotes>4.13.0 brings full low-precision floating-point support across ALL 6 backends (CPU, OpenCL, WebGPU, WebGL, Wasm, CUDA): Half, BFloat16, and now FP8 (Float8E4M3 + Float8E5M2), plus generic INumber<T> mixed-precision kernels and PrecisionConvert for transpilable generic float<->T conversion inside a kernel. This release also fixes bf16 on PRE-AMPERE CUDA cards (GTX 1080 / RTX 2060 etc.): the PTX bf16 path used sm_80+ cvt instructions and failed to compile on older cards; it now uses portable bit-manipulation that works on every CUDA architecture (FP8 likewise). Full per-version history with code samples: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>