
Commit efa88f4

LostBeard and claude committed
Bump to 4.13.0-local.10 (forks 2.0.26): FP8 on all 6 backends + bf16 pre-Ampere CUDA fix
Four-package bundle bump. Ships FP8 (Float8E4M3/E5M2) complete on CPU/OpenCL/WebGPU/WebGL/Wasm/CUDA + the bf16 pre-Ampere PTX fix (1080/2060 unblocked). Also adds FP8 PTX struct-field IO (EmitIOLoad/Store) for completeness (CUDA fp8-verify still 257/257).

- ILGPU.Fork + ILGPU.Algorithms.Fork: 2.0.25 -> 2.0.26
- SpawnDev.ILGPU: 4.13.0-local.9 -> 4.13.0-local.10 (+ PackageReference lines, release notes)
- CHANGELOG local.10 entry

Gates: PrecisionConvert (incl FP8 round-trip all 6 backends) 37/0; BFloat16 107/0 (incl CUDA); fp8-verify CPU+OpenCL+CUDA 257/257.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 1dd4d5b commit efa88f4

5 files changed

Lines changed: 35 additions & 6 deletions


CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -4,6 +4,11 @@ This file tracks notable changes per release. The README's "Recent Highlights" s
 
 ## 4.13.0 (unreleased) - BFloat16 (bfloat16) Phases 0-3b: core type (CPU) + WebGPU + WebGL + Wasm + OpenCL + CUDA codegen (all 6 backends)
 
+### local.10 - FP8 complete on ALL 6 backends + bf16 pre-Ampere CUDA fix
+
+- **FP8 (`ILGPU.Float8E4M3` + `ILGPU.Float8E5M2`) now works on all 6 backends** (CPU, OpenCL, WebGPU, WebGL, Wasm, CUDA). The two OCP FP8 8-bit float formats - E4M3FN (1/4/3, bias 7, no Inf, saturates to +-448, the forward/inference format) and E5M2 (1/5/2, bias 15, IEEE Inf/NaN, the backward/gradient format) - as full `INumber<T>` value types (FP32-based `[MathIntrinsic]`/`[CompareIntrinisc]`/`[ConvertIntrinisc]` operators) + `BasicValueType.Float8E4M3`/`Float8E5M2` IR primitives (append-only). FP8 computes as f32 in-register (the f32-register model) and is converted at the 1-byte load/store boundary. The FP8<->f32 conversion (exponent rebias 127->7/15, RNE rounding, subnormal normalize, variant specials) is emitted per backend: callable helper functions on OpenCL (`_e4m3/_e5m2_bits_to_f32` + inverse), WGSL, and GLSL; inline WebAssembly bytecode on Wasm (`EmitFP8ToF32`/`EmitF32ToFP8`, subnormal-normalize unrolled for bit-exactness); inline PTX bit-manipulation on CUDA (`EmitFP8BitsToF32`/`EmitF32ToFP8Bits`, branchless setp/selp - FP8 has no portable native PTX cvt). All byte-identical to the CPU-verified managed conversion. Gate: `BackendTestBase.PrecisionConvert_Float8E{4M3,5M2}_RoundTripBitExact` (pure `ConvertFromSingle(ConvertToSingle(x))` bit-exact vs the concrete `(T)(float)x` cast) on every backend; the `relu(x*scale+bias)` generic `INumber<T>` kernel 257/257 on CPU+OpenCL+CUDA. FP8 radix-sort keys (`Interop.FloatAsInt`) are a tracked follow-up.
+- **bf16 fix for PRE-AMPERE CUDA cards.** The PTX bf16 path emitted `cvt.f32.bf16` / `cvt.rn.bf16.f32` **unconditionally** at all 7 sites (load, store, scalar param, FloatAsInt, IntAsFloat, struct-field IO load+store). Those instructions are **sm_80+ (Ampere/Ada/Hopper) ONLY**, so any bf16 kernel **failed to compile on older CUDA cards** - Pascal (GTX 1080 = sm_61), Volta (sm_70), Turing (RTX 2060 = sm_75). Since bf16 is consumed by the ML stack, that path was broken on pre-Ampere. Fixed with portable bit-manipulation (`EmitBF16BitsToF32` = zero-extend + shl 16 + reinterpret, exact since bf16 is the top 16 bits of an fp32; `EmitF32ToBF16Bits` = RNE + NaN-preservation guard, branchless via setp/selp) using only basic integer ops available on **every CUDA arch**, replacing the native cvt at all 7 sites. Byte-identical to the managed/WGSL/GLSL/Wasm/OpenCL bf16 conversion (which already used bit-manip - only PTX had taken the sm_80 native shortcut). Gate: `PMT_FILTER=BFloat16` **107/0 across all 6 backends incl. CUDA** (radix keys, struct fields, range/specials, arithmetic). **Lesson:** native-cvt shortcuts (bf16's sm_80, FP8's sm_89) silently gate out older hardware - prefer portable bit-manip unless explicitly capability-gated.
+
 ### local.9 - `PrecisionConvert` (generic in-kernel float<->T conversion) + FP8 foundation
 
 - **`PrecisionConvert.ConvertToSingle<T>(T)` + `ConvertFromSingle<T>(float)`** - a transpilable GENERIC conversion between `float` and any `INumber<T>` (Half/BFloat16/Float8E4M3/Float8E5M2/...). Inside a generic `where T : INumber<T>` kernel there is no C# way to write `(float)t`/`(T)f` (no cast constraint exists), so callers reached for `float.CreateChecked(t)`/`T.CreateChecked(f)` - which lower to System.Numerics range/identity checks that touch `System.Type`, and the kernel transpiler rejects that ("Class type 'System.Type' is not supported") on every GPU backend. The two new methods are tagged `[ConvertIntrinisc]`, so the frontend lowers each call to the SAME `ConvertValue` IR node the concrete `(float)Half`/`(Half)float` cast emits (resolving `T` per instantiation via the method return type) - no `System.Type`, reusing the convert path that already handles Half/bf16/fp8. This collapses every precision-aware op (read low-precision input, accumulate in float, write low-precision output - Conv/GroupNorm/SiLU/MatMul, the Rule-4 zero-fp32-temp path) to ONE generic kernel for float/Half/bf16 instead of N per-type variants. Gate: new `BackendTestBase.PrecisionConvert` round-trip (float/Half/bf16) - a pure `ConvertFromSingle(ConvertToSingle(x))` with no accumulation is **bit-exact vs the concrete `(T)(float)x` cast, 23/0 across all 6 backends** (incl. WebGPU/WebGL/Wasm); `GenericPrecision` still 23/0 (no regression).
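
As a reading aid for the local.10 bf16 entry in this hunk, the portable conversion it describes (widen by shifting the 16 stored bits into the top half of an fp32; narrow with round-to-nearest-even plus a NaN guard) can be sketched in Python. This is an illustration under assumptions, not the PTX the fork emits: every function name below is hypothetical, and the exact form of the NaN guard (here, forcing the quiet bit) is assumed rather than taken from the source.

```python
import struct


def float_to_f32_bits(x: float) -> int:
    """Reinterpret a Python float as its IEEE-754 binary32 bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def f32_bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def bf16_bits_to_f32_bits(h: int) -> int:
    """Widen: zero-extend the 16 bits and shift left by 16.

    Exact, because bf16 is the top 16 bits of an fp32.
    """
    return (h & 0xFFFF) << 16


def f32_bits_to_bf16_bits(bits: int) -> int:
    """Narrow with round-to-nearest-even and a NaN guard (assumed form)."""
    bits &= 0xFFFFFFFF
    if (bits & 0x7FFFFFFF) > 0x7F800000:
        # NaN: plain rounding could carry a low-payload NaN into Inf,
        # so keep the top bits and force the quiet bit instead.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    lsb = (bits >> 16) & 1  # lowest bit that survives; breaks ties to even
    return ((bits + 0x7FFF + lsb) >> 16) & 0xFFFF


def float_to_bf16(x: float) -> int:
    return f32_bits_to_bf16_bits(float_to_f32_bits(x))


def bf16_to_float(h: int) -> float:
    return f32_bits_to_float(bf16_bits_to_f32_bits(h))


if __name__ == "__main__":
    print(hex(float_to_bf16(1.0)))  # 0x3f80
    print(bf16_to_float(0x3F80))    # 1.0
```

Only integer add, shift, mask, and compare are used, which is the point of the fix: nothing here depends on an `sm_80` instruction.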

ILGPU.Algorithms/ILGPU.Algorithms.csproj

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
     SpawnDev.ILGPU.Fork* PackageReference Versions inside SpawnDev.ILGPU.csproj.
     Run `_check-fork-version-sync.bat` at repo root. See the banner comment in
     SpawnDev.ILGPU.csproj for the full procedure. -->
-    <Version>2.0.25</Version>
+    <Version>2.0.26</Version>
     <IsPackable>true</IsPackable>
     <GeneratePackageOnBuild>true</GeneratePackageOnBuild>
   </PropertyGroup>

ILGPU/Backends/PTX/PTXCodeGenerator.Emitter.cs

Lines changed: 24 additions & 0 deletions
@@ -904,6 +904,18 @@ public void EmitIOLoad<TIOEmitter, T>(
     return;
 }
 
+// FP8 field/value: 1-byte storage, f32 register. Load the byte, widen via portable
+// bit-manip (every CUDA arch). Same model as bf16.
+if (register.BasicValueType == BasicValueType.Float8E4M3 ||
+    register.BasicValueType == BasicValueType.Float8E5M2)
+{
+    var rawReg = AllocateRegister(BasicValueType.Int16, PTXRegisterKind.Int16);
+    emitter.Emit(this, command, rawReg, userState);
+    EmitFP8BitsToF32(rawReg, register, register.BasicValueType == BasicValueType.Float8E4M3);
+    FreeRegister(rawReg);
+    return;
+}
+
 HardwareRegister? originalRegister = null;
 // We need a temporary 32bit register for predicate conversion at this point:
 // 1) load value into temporary register
@@ -962,6 +974,18 @@ public void EmitIOStore<TIOEmitter, T>(
     return;
 }
 
+// FP8 field/value store: round the f32 value to its 1-byte pattern via portable bit-manip.
+if (register.BasicValueType == BasicValueType.Float8E4M3 ||
+    register.BasicValueType == BasicValueType.Float8E5M2)
+{
+    var f32Register = EnsureHardwareRegister(register);
+    var rawReg = AllocateRegister(BasicValueType.Int16, PTXRegisterKind.Int16);
+    EmitF32ToFP8Bits(f32Register, rawReg, register.BasicValueType == BasicValueType.Float8E4M3);
+    emitter.Emit(this, command, rawReg, userState);
+    FreeRegister(rawReg);
+    return;
+}
+
 // We need a temporary 32bit register for predicate conversion at this point:
 // 1) convert current predicate into 32bit integer
 // 2) store the converted value from the temporary register
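
For orientation, the widening direction used by the load path in this file (one stored byte becoming an f32 value) can be sketched in Python from the format parameters the changelog gives: E4M3FN is 1/4/3 with bias 7, no Inf, and a maximum of 448; E5M2 is 1/5/2 with bias 15 and IEEE-style Inf/NaN. This is a reference decode under assumptions, not the emitted PTX: `fp8_to_float` is a hypothetical name, it scales with `math.ldexp` instead of rebiasing the exponent field bit by bit, and the E4M3FN NaN encoding (exponent and mantissa all ones) is taken from the OCP format definition rather than from this diff.

```python
import math


def fp8_to_float(byte: int, e4m3: bool) -> float:
    """Decode one FP8 byte (E4M3FN if e4m3 is True, else E5M2) to a float."""
    exp_bits, man_bits, bias = (4, 3, 7) if e4m3 else (5, 2, 15)
    byte &= 0xFF
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1

    if e4m3:
        # E4M3FN has no Inf; only exponent=1111, mantissa=111 is NaN.
        if exp == exp_max and man == (1 << man_bits) - 1:
            return math.nan
    elif exp == exp_max:
        # E5M2 follows IEEE: mantissa 0 is Inf, anything else is NaN.
        return sign * math.inf if man == 0 else math.nan

    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of (1 - bias).
        return sign * math.ldexp(man, 1 - bias - man_bits)
    # Normal: implicit leading 1 in front of the mantissa bits.
    return sign * math.ldexp((1 << man_bits) | man, exp - bias - man_bits)


if __name__ == "__main__":
    print(fp8_to_float(0x7E, e4m3=True))   # 448.0, the E4M3FN maximum
    print(fp8_to_float(0x3C, e4m3=False))  # 1.0 in E5M2
```

Every finite FP8 value is exactly representable in fp32, so a table built from this decode over all 256 byte values could serve as an oracle when checking a backend's bit-manipulation path.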

ILGPU/ILGPU.csproj

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@
     check on push. Skipping (b) means consumers ship rc.N still pulling old Fork
     transitively and the fix is invisible. See the banner comment in
     SpawnDev.ILGPU.csproj for the full procedure. -->
-    <Version>2.0.25</Version>
+    <Version>2.0.26</Version>
     <IsPackable>true</IsPackable>
     <GeneratePackageOnBuild>true</GeneratePackageOnBuild>
   </PropertyGroup>
