Data Type Support by Backend

Tracks verified support for all data types across all 7 backends. Updated: 2026-06-16

Legend:

  • [x] PASS - verified with unit tests (real data, real kernels, real verification)
  • FAIL - tests exist, currently failing
  • [!] KNOWN LIMITATION - architectural constraint, not a bug
  • [-] NOT TESTED - no tests yet, status unknown
  • [N/A] - not applicable to this backend

Buffer Read (Load from ArrayView)

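| Type | C# Type | Size | WebGPU | WebGPU NoSub | Wasm | WebGL | CUDA | OpenCL | CPU |
|---|---|---|---|---|---|---|---|---|---|
| Int8 | sbyte | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| UInt8 | byte | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Int16 | short | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| UInt16 | ushort | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Int32 | int | 4B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| UInt32 | uint | 4B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Int64 | long | 8B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| UInt64 | ulong | 8B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Float16 | Half | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| BFloat16 | BFloat16 | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Float8E4M3 | Float8E4M3 | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Float8E5M2 | Float8E5M2 | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Float32 | float | 4B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Float64 | double | 8B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |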

Buffer Write (Store to ArrayView)

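| Type | C# Type | Size | WebGPU | WebGPU NoSub | Wasm | WebGL | CUDA | OpenCL | CPU |
|---|---|---|---|---|---|---|---|---|---|
| Int8 | sbyte | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| UInt8 | byte | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Int16 | short | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| UInt16 | ushort | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Int32 | int | 4B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| UInt32 | uint | 4B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Int64 | long | 8B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| UInt64 | ulong | 8B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Float16 | Half | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| BFloat16 | BFloat16 | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Float8E4M3 | Float8E4M3 | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Float8E5M2 | Float8E5M2 | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Float32 | float | 4B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Float64 | double | 8B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |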

End-to-End (Read + Kernel Process + Write)

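| Type | C# Type | Size | WebGPU | WebGPU NoSub | Wasm | WebGL | CUDA | OpenCL | CPU |
|---|---|---|---|---|---|---|---|---|---|
| Int8 | sbyte | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| UInt8 | byte | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Int16 | short | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| UInt16 | ushort | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Int32 | int | 4B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| UInt32 | uint | 4B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Int64 | long | 8B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| UInt64 | ulong | 8B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Float16 | Half | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| BFloat16 | BFloat16 | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Float8E4M3 | Float8E4M3 | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Float8E5M2 | Float8E5M2 | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Float32 | float | 4B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Float64 | double | 8B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |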

Buffer RoundTrip (CopyFromCPU -> CopyToHostAsync, no kernel)

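| Type | C# Type | Size | WebGPU | WebGPU NoSub | Wasm | WebGL | CUDA | OpenCL | CPU |
|---|---|---|---|---|---|---|---|---|---|
| Int8 | sbyte | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| UInt8 | byte | 1B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Int16 | short | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| UInt16 | ushort | 2B | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Float16 | Half | 2B | [-] | [-] | [-] | [-] | [-] | [-] | [-] |
| BFloat16 | BFloat16 | 2B | [-] | [-] | [-] | [-] | [-] | [-] | [-] |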

Half Math Intrinsics

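| Function | WebGPU | WebGPU NoSub | Wasm | WebGL | CUDA | OpenCL | CPU |
|---|---|---|---|---|---|---|---|
| Abs | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Min/Max | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Clamp | [-] | [-] | [-] | [-] | [-] | [-] | [-] |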

BFloat16 Arithmetic / Min-Max (kernel-side, all compute as f32)

ILGPU.BFloat16 carries fp32's full dynamic range (1 sign / 8 exponent / 7 mantissa bits, i.e. the top 16 bits of an fp32), so magnitudes around 1e30 or 1e-30, which overflow or underflow Half, stay representable at bf16 precision. Verified end-to-end by the 4 BFloat16_* tests: round-trip storage; + - * / cross-checked against the true f64 result with round-to-nearest-even; min/max; and range plus ±Inf/NaN/zero/RNE-tie specials.

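| Op | WebGPU | WebGPU NoSub | Wasm | WebGL | CUDA | OpenCL | CPU |
|---|---|---|---|---|---|---|---|
| Add/Sub/Mul/Div | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| Min/Max | [x] | [x] | [x] | [x] | [x] | [x] | [x] |
| (float)/(BFloat16) convert | [x] | [x] | [x] | [x] | [x] | [x] | [x] |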

CopyFromJS (Browser-only: JS TypedArray/ArrayBuffer -> GPU)

| Type | C# Type | Size | WebGPU | Wasm | WebGL |
|---|---|---|---|---|---|
| Int32 | int | 4B | [x] | [x] | [x] |
| Float32 | float | 4B | [x] | [x] | [x] |

Atomic Operations

See Docs/atomic-operations.md for the complete per-operation support matrix.

| Type | C# Type | WebGPU | Wasm | WebGL | CUDA | OpenCL | CPU |
|---|---|---|---|---|---|---|---|
| Int32 | int | [x] | [x] | [!] Add only (vote TF) | [x] | [x] | [x] |
| UInt32 | uint | [x] | [x] | [!] Add only (vote TF) | [x] | [x] | [x] |
| Int64 | long | [x] Add/bitwise, [!] Min/Max/Exch/CAS | [x] | [!] | [x] | [x] | [x] |
| UInt64 | ulong | [x] Add/bitwise, [!] Min/Max/Exch/CAS | [x] | [!] | [x] | [x] | [x] |
| Float32 | float | [x] CAS loop | [x] CAS loop | [!] | [x] | [x] | [x] |
| Float64 | double | [!] | [x] CAS loop | [!] | [x] | [x] | [x] |

[!] = Throws NotSupportedException at kernel compilation time. See atomic-operations.md for details.


Implementation Summary

Sub-word buffer access (Int8, UInt8, Int16, UInt16, Float16)

All sub-word types now have complete Read/Write/EndToEnd support on ALL 7 backends.

| Backend | Mechanism | Signed/Unsigned Detection |
|---|---|---|
| WebGPU | array<atomic<u32>> + atomicAnd/atomicOr for Store, atomicLoad for Read. IEEE 754 f16<->f32 inline conversion for Float16. | EntryPoint.Parameters[N].GetGenericArguments()[0] CLR type check |
| Wasm | Native i32.load8_s/u, i32.load16_s/u, i32.store8, i32.store16 opcodes. Float16 via EmitF16ToF32/EmitF32ToF16. | CLR type trace via _generatorArgs.EntryPoint.Parameters |
| WebGL | texelFetch from R32I texture, shift+mask extraction. TF output with sub-word packing in glWorker.js. Float16 via GLSL f16<->f32 bit manipulation. | EntryPoint.Parameters[N] CLR type check |
| OpenCL | Native types for Int8/UInt8/Int16/UInt16. Float16 via vload_half/vstore_half with tracked LEA base pointer. | Native type support |
| CPU/CUDA | Native sub-word support, no special handling needed. | Native |

BFloat16 (bf16 / "brain float") buffer access

ILGPU.BFloat16 and the BasicValueType.BFloat16 IR primitive add a second 16-bit float that, unlike Half, keeps fp32's full dynamic range (it is literally the top 16 bits of an fp32). That is the right trade for ML weights/activations, where fp16's narrow range overflows or underflows. Complete Read/Write/EndToEnd support on ALL 7 backends. The bf16<->f32 conversion is byte-identical across every backend: bf16->f32 is an exact zero-extend followed by <<16; f32->bf16 rounds to nearest-even, keeps the top 16 bits, and applies a NaN-preservation guard. Values compute as f32 everywhere; only the storage is 2-byte.

| Backend | Mechanism |
|---|---|
| WebGPU | Always emulated (no native WGSL bf16). Packed 2 bf16 per array<atomic<u32>> word (reuses f16's sub-word storage via a parallel _subWordBFloat16Params set); _bf16_to_f32 / _f32_to_bf16 WGSL helpers at the load/store boundary. |
| Wasm | EmitBF16ToF32 / EmitF32ToBF16 emit the conversion as inline WebAssembly bytecode; 2-byte i32.load16_u / i32.store16 (atomic in barrier kernels). |
| WebGL | Packed-u16 in an R32I texel; texelFetch + shift/mask load, Transform-Feedback varying store; _bf16_to_f32 / _f32_to_bf16 GLSL helpers. |
| OpenCL | Emulated (no common native bf16 extension; cl_khr_fp16 is fp16, not bf16). View params are ushort* (2-byte storage stride; a float* typedef silently corrupts), _bf16_bits_to_f32 / _f32_to_bf16_bits OpenCL-C helpers + tracked LEA base pointer. |
| CUDA | f32-register-compute model (PTX has no native bf16 arithmetic): the value lives in an .f32 register and computes as f32; arithmetic/compare route through the f32 tables; ConvertValue bf16<->f32 is a register no-op. The bf16<->f32 conversion at the load/store boundary uses portable bit manipulation (basic integer ops available on every CUDA arch), not the native cvt.*.bf16: those cvt instructions are sm_80+ (Ampere) only, so the earlier native-cvt path failed to compile on pre-Ampere cards (Pascal sm_61 / Volta sm_70 / Turing sm_75). Load = ld.global.b16 + zero-extend + shl 16 + reinterpret (exact, bf16 = top 16 bits of fp32); store = RNE round + NaN-guard + st.global.b16. Byte-identical to every other backend. (4.13.0+; pre-4.13.0 used the sm_80 native cvt and broke on older cards.) |
| CPU | Native - the managed BFloat16 struct runs directly (DefaultILBackend). |
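
The conversion described above is small enough to model on the host. The following is a minimal Python sketch of it, not the code any backend emits: the function names are hypothetical, and the exact form of the NaN guard (forcing the quiet bit) is an assumption, since the text only says that NaN is preserved.

```python
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x after rounding to IEEE binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    """Reinterpret a 32-bit pattern as a binary32 value."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def bf16_to_f32(h: int) -> float:
    # Exact: a bf16 is the top 16 bits of an fp32, so widening is a shift.
    return bits_f32((h & 0xFFFF) << 16)

def f32_to_bf16(x: float) -> int:
    b = f32_bits(x)
    if (b & 0x7FFFFFFF) > 0x7F800000:
        # NaN guard (assumed form): without it, a NaN whose payload sits only
        # in the low 16 bits would be rounded into an Inf bit pattern.
        return ((b >> 16) | 0x0040) & 0xFFFF
    # Round-to-nearest-even: add half an ulp minus one, plus the LSB of the
    # part that is kept, then drop the low 16 bits.
    rounding = 0x7FFF + ((b >> 16) & 1)
    return ((b + rounding) >> 16) & 0xFFFF

# 1e30 is far outside Half's range but survives a bf16 round trip
# to within bf16 precision.
approx = bf16_to_f32(f32_to_bf16(1e30))
```

Because widening is exact, every bf16 bit pattern that is not NaN maps back to itself through `f32_to_bf16(bf16_to_f32(h))`, which is a convenient property to test on any backend.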

FP8 (Float8E4M3 + Float8E5M2) buffer access

ILGPU.Float8E4M3 and ILGPU.Float8E5M2 add the two OCP 8-bit floating-point formats, each with the BasicValueType.Float8E4M3 / Float8E5M2 IR primitive. Complete Read/Write/EndToEnd support on ALL 6 backends.

  • Float8E4M3 - 1 sign / 4 exponent / 3 mantissa, bias 7. The "E4M3FN" finite variant: no infinities (the only non-finite value is NaN at 0x7F/0xFF), max finite magnitude 448. The overflow convention is selectable (see the convention note below). The FP8 forward / inference format (one extra mantissa bit vs E5M2, at the cost of range).
  • Float8E5M2 - 1 sign / 5 exponent / 2 mantissa, bias 15. IEEE-754-style: has infinities and NaNs (like fp16 but with 8 fewer mantissa bits). The FP8 backward / gradient format (fp16-class dynamic range, which gradients need).

Like Half/BFloat16, FP8 uses the f32-register model: values compute as f32 in-register and are converted to the 1-byte FP8 grid only at the load/store boundary, so accumulation stays full-precision (matching how real FP8 tensor-core hardware accumulates). Unlike bf16 (a trivial top-16-bits shift), the FP8 conversion needs an exponent rebias (127 -> 7 for E4M3, 127 -> 15 for E5M2), round-to-nearest-even from 23 mantissa bits down to 3 (E4M3) or 2 (E5M2), subnormal normalization, and the per-format specials. The conversion is byte-identical across every backend (CPU-verified idempotence: 0 mismatches across all 256 encodings).

| Backend | Mechanism |
|---|---|
| WebGPU | Always emulated. Packed 4 FP8 per array<atomic<u32>> word (1-byte sub-word storage); _e4m3_to_f32/_e5m2_to_f32 + inverse WGSL helpers at the load/store boundary. |
| Wasm | Conversion emitted as inline WebAssembly bytecode (EmitFP8ToF32/EmitF32ToFP8, the subnormal-normalize loop unrolled for bit-exactness); 1-byte i32.load8_u / i32.store8 (verified-atomic in barrier kernels). |
| WebGL | Packed 4 FP8 per R32I texel; texelFetch + shift/mask load, Transform-Feedback varying store; _e4m3/_e5m2 GLSL helpers. |
| OpenCL | Emulated as uchar* storage (1-byte stride); _e4m3_bits_to_f32 / _f32_to_e4m3_bits (+ E5M2) OpenCL-C helpers + tracked LEA base pointer. |
| CUDA | f32-register model. The FP8<->f32 conversion is inline PTX bit-manipulation (branchless setp/selp, unrolled normalize) using only basic integer ops. FP8 has no portable native PTX cvt (cvt.*.e4m3 requires sm_89 or newer, i.e. Ada/Hopper), so this works on every CUDA arch. Load = ld.global.u8 + convert; store = convert + st.global.u8. |
| CPU | Native - the managed Float8E4M3/Float8E5M2 structs run directly. |
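
The decode direction (FP8 byte to f32) follows directly from the two bit layouts given above. Here is a minimal Python sketch of it as a host-side reference model, not the per-backend helper code; the function names and the `fn` flag are hypothetical, chosen for illustration only.

```python
import math

def fp8_to_float(byte: int, exp_bits: int, man_bits: int, bias: int, fn: bool) -> float:
    """Decode one FP8 byte. fn=True selects the finite-only (E4M3FN) specials."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    if fn:
        # E4M3FN: no infinities; only exponent and mantissa all-ones is NaN.
        if exp == exp_max and man == (1 << man_bits) - 1:
            return math.nan
    elif exp == exp_max:
        # IEEE-style (E5M2): all-ones exponent is Inf (mantissa 0) or NaN.
        return sign * math.inf if man == 0 else math.nan
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed minimum exponent.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    # Normal: implicit leading 1, rebias the exponent.
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def e4m3_to_float(byte: int) -> float:
    return fp8_to_float(byte, exp_bits=4, man_bits=3, bias=7, fn=True)

def e5m2_to_float(byte: int) -> float:
    return fp8_to_float(byte, exp_bits=5, man_bits=2, bias=15, fn=False)

# Largest finite magnitudes: 0x7E for E4M3 (448), 0x7B for E5M2 (57344).
largest = (e4m3_to_float(0x7E), e5m2_to_float(0x7B))
```

With only 256 encodings per format, a table built from a decoder like this can serve as an exhaustive oracle for checking a device-side implementation.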

Convention note (E4M3 overflow). The conversion is bit-exact to the ml_dtypes reference (the impl PyTorch / JAX float8_e4m3fn share) - verified by DemoConsole -- fp8-oracle: decode 0/256, encode rounding/subnormal/overflow 0 divergences across 1099 probes, on all 6 backends. The overflow behavior is selectable, with the reference-matching fn convention as the default:

| Entry point | Finite overflow (magnitude > 464) | ±Inf | Matches |
|---|---|---|---|
| (Float8E4M3)x cast / FromSingleFn(x) / FromSingle(x, saturate: false) (DEFAULT) | → NaN | → NaN | PyTorch / JAX / ml_dtypes float8_e4m3fn (bit-exact) |
| FromSingleSaturating(x) / FromSingle(x, saturate: true) | clamps to ±448 | → NaN | NVIDIA Transformer Engine saturating cast / OCP saturating-forward |

The cast operator and the IR-level convert (so PrecisionConvert and the generic INumber<T> path too) are all fn. 449..464 round down to 448 under both conventions; the two differ only for |x|>464, which rounds up past the 448 slot (fn → NaN, saturating → ±448). Every representable value round-trips exactly. FromSingleSaturating is composed only of existing intrinsics (a bit-level finite check + the fn cast + a >464 redirect), so it transpiles and is bit-exact on all 6 backends (PMT Float8E4M3_FromSingleFn_OverflowToNaN). Use the default for reference-matching ML (loading/comparing PyTorch FP8 checkpoints); use FromSingleSaturating when you want overflow clamped rather than NaN-poisoning a downstream reduction.

Float8E5M2 is IEEE-754-style (has ±Inf): overflow → ±Inf, bit-exact to float8_e5m2 (decode 0/256, encode 723/723); its canonical NaN byte is 0x7F (ml_dtypes uses 0x7E - both are valid NaN patterns).
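
The two E4M3 overflow conventions can be illustrated with a small reference encoder. The sketch below is a Python model that rounds by searching the decode table rather than by bit manipulation, so it is not how any backend implements the cast; `e4m3_decode`, `e4m3_encode`, and the `saturate` parameter are hypothetical names, and the input is assumed to be a value already representable as f32.

```python
import math

def e4m3_decode(byte: int) -> float:
    sign = -1.0 if byte & 0x80 else 1.0
    exp, man = (byte >> 3) & 0xF, byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan                      # 0x7F / 0xFF are the only NaNs
    if exp == 0:
        return sign * man * 2.0 ** -9        # subnormals
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# Magnitudes of codes 0x00..0x7E, plus 480.0: the value slot 0x7F would hold
# if it were finite. Rounding "up past 448" means rounding into that slot.
_GRID = [e4m3_decode(c) for c in range(0x7F)] + [480.0]

def e4m3_encode(x: float, saturate: bool = False) -> int:
    sign = 0x80 if math.copysign(1.0, x) < 0 else 0x00
    if math.isnan(x) or math.isinf(x):
        return sign | 0x7F                   # NaN and ±Inf map to NaN in both conventions
    mag = abs(x)
    if mag >= 480.0:
        code = 0x7F
    else:
        lo = max(c for c in range(0x80) if _GRID[c] <= mag)
        hi = lo + 1
        d_lo, d_hi = mag - _GRID[lo], _GRID[hi] - mag
        if d_lo != d_hi:
            code = lo if d_lo < d_hi else hi
        else:
            code = lo if lo % 2 == 0 else hi  # tie: pick the even mantissa
    if code == 0x7F:
        # Finite overflow: fn convention yields NaN, saturating clamps to 448.
        code = 0x7E if saturate else 0x7F
    return sign | code
```

In this model 464 is an exact tie between 448 and the 480 slot and resolves to 448 (the even code), which is why the two conventions only diverge for magnitudes strictly above 464.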

All low-precision conversions are validated against the authoritative references

Every float→low-precision conversion is bit-exact to its reference, verified exhaustively and pinned in CI (DemoConsole -- bf16-f16-oracle / fp8-oracle + the PMT LowPrecision_ConversionPinnedToExternalReference gate, which pins each backend's on-device convert to hardcoded numpy/ml_dtypes values):

| Type | Reference | float→type rounding |
|---|---|---|
| Half | numpy.float16 (IEEE binary16) | round-to-nearest-even incl. subnormals + overflow→Inf (was truncating + flushing subnormals before 4.14.0) |
| BFloat16 | ml_dtypes.bfloat16 | round-to-nearest-even (NaN-preserving) |
| Float8E4M3 | PyTorch/JAX/ml_dtypes float8_e4m3fn | RNE; overflow→NaN (fn, default) |
| Float8E5M2 | float8_e5m2 | RNE; overflow→±Inf |

Selectable saturating cast (all four types). Each type exposes FromSingle(float, bool saturate) and FromSingleSaturating(float) (E4M3 additionally has FromSingleFn, its non-saturating name). The saturating cast clamps finite overflow to the max finite magnitude instead of the default (→NaN for E4M3, →±Inf for the IEEE types) - the NVIDIA Transformer Engine / OCP mode for activations you don't want producing Inf/NaN. Each is composed only of existing intrinsics (a bit-level finite check + the default cast + a max-finite-constant cast), so it transpiles with no per-backend codegen and is bit-exact on all 6 backends.

Radix-sort: complete for all four types on all 6 backends. Keys-only and key/value pairs, ascending and descending, plus body-struct key fields - every type × {keys, pairs} × {asc, desc} cell is covered (Interop.FloatAsInt(T) + Ascending/Descending{Half,BFloat16,Float8E4M3,Float8E5M2} + per-backend FloatAsIntCast; PMT RadixGrid_* + Fp8Radix_* + BFloat16_RadixSort*). On WebGL the FP8/Half/bf16 keys route through the unpacked-f32 working representation (the whole-texel scatter can't move a sub-word value); on the other 5 backends they sort as native packed sub-word keys.
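
For readers unfamiliar with sorting floats through an integer radix sort, the usual trick is an order-preserving remap of the float's bit pattern. The Python sketch below shows that common technique for 16-bit Half keys; it is an illustration only, and it is an assumption that the library's Interop.FloatAsInt and Ascending/Descending comparers work this way. All function names here are hypothetical.

```python
import struct

def half_bits(x: float) -> int:
    """Bit pattern of x as an IEEE binary16 value."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def ascending_key(bits: int, width: int = 16) -> int:
    """Map float bits to an unsigned key whose integer order matches float order."""
    sign = 1 << (width - 1)
    mask = (1 << width) - 1
    if bits & sign:
        return ~bits & mask   # negatives: flip everything so larger magnitude sorts first
    return bits | sign        # non-negatives: set the top bit so they sort after negatives

def descending_key(bits: int, width: int = 16) -> int:
    # Inverting the ascending key reverses the order.
    return ascending_key(bits, width) ^ ((1 << width) - 1)

values = [1.5, -2.0, 0.0, -0.5, 65504.0, -65504.0, 0.25]
ascending = sorted(values, key=lambda v: ascending_key(half_bits(v)))
descending = sorted(values, key=lambda v: descending_key(half_bits(v)))
```

The same remap applies to any sign/exponent/mantissa layout by changing `width` (8 for the FP8 formats), with NaN patterns landing at the extremes of the key range.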

Sub-Word Usage Notes

These apply to any kernel using ArrayView<byte>, ArrayView<sbyte>, ArrayView<short>, ArrayView<ushort>, or ArrayView<Half>:

  • Use ILGPU.Half, NOT System.Half, in kernel signatures. Implicit conversion operators are defined for interop, so you can mix the two on the host side; inside the kernel signature the ILGPU.Half type is what the IR + codegen expect.
  • Sub-word writes on WebGPU lower to atomic RMW. Two threads writing different halves of the same u32 word would race without RMW; the codegen always synthesizes atomicAnd mask + atomicOr set so the writes are thread-safe. Setting RequiresAtomics = true in AcceleratorRequirements (or pinning to a backend with atomics) is therefore mandatory whenever a kernel writes a sub-word view — WebGL has no atomics and rejects sub-word writes at compile time. See capabilities-and-backend-selection.md.
  • Sub-word view reads can return stale data on WebGPU if you wrote to the same slot in the same kernel invocation. Byte writes lower to atomic RMW on WebGPU; reading a byte slot you just wrote may observe pre-RMW state in the same dispatch. Treat ArrayView<byte> and ArrayView<sbyte> as write-only within a kernel invocation — buffer the value in a register and route results through that register, not back through the view.
  • arrayLength() on sub-word buffers returns the u32-count, not the element-count. A 256-byte buffer reports arrayLength = 64 (256/4 u32s). Multiply by elements-per-word (4 for byte/sbyte, 2 for short/ushort/Half) when computing element bounds inside the kernel.
  • Sign extension on load is automatic. ArrayView<sbyte> and ArrayView<short> reads sign-extend the narrow value to int when used in arithmetic (unsigned views zero-extend). The codegen emits extractBits(x, 0u, 16u) (WGSL, sign-extends a signed i32) / ((x & 0xFFFF) ^ 0x8000) - 0x8000 (GLSL - GLSL ES 3.0 has no int16_t, and the obvious (x << 16) >> 16 is undefined behavior when bit 15 is set, so this (v ^ signbit) - signbit idiom is used) / i32.extend16_s (Wasm).
  • Signedness reinterprets ((short)someUshort, (ushort)someShort, Int8 analogues) re-extend on the browser backends (4.9.13+). Signed and unsigned sub-word types collapse to one BasicValueType, so the reinterpret's conv is elided in the IR; the browser ConvertValue codegen therefore re-applies sign/zero extension (per the convert's source signedness) when a sub-word value is widened to int. Before 4.9.13 this was dropped, silently corrupting the high bits of a reinterpreted sub-word value on WebGPU/WebGL/Wasm (e.g. (short)bits >> 15 on a value that came from a ushort). Desktop backends use native sub-word registers and were never affected.
  • Wasm minimum buffer size is 4 bytes. Allocating an ArrayView<byte> of length 1, 2, or 3 throws Invalid typed array length: 4 on Wasm. Pad per-block scalar buffers to Math.Max(blockCount, 4L) if your kernel writes one byte per block.
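
Several of the notes above (the atomicAnd + atomicOr write, automatic sign extension, and the u32-count reported by arrayLength()) come from the same packed layout. The following Python sketch models that layout on the host under stated assumptions: lanes are little-endian within a word (element i lives at bit offset (i % 4) * 8), the function names are hypothetical, and the two read-modify-write steps are plain list updates rather than real atomics.

```python
def store_byte(words: list, index: int, value: int) -> None:
    """Write one byte into a u32-packed buffer using a clear-then-set pair."""
    word, shift = index // 4, (index % 4) * 8
    words[word] &= ~(0xFF << shift) & 0xFFFFFFFF   # models atomicAnd: clear the lane
    words[word] |= (value & 0xFF) << shift          # models atomicOr: set the new bits

def load_byte(words: list, index: int, signed: bool) -> int:
    """Read one byte, sign- or zero-extending it to an int."""
    word, shift = index // 4, (index % 4) * 8
    raw = (words[word] >> shift) & 0xFF
    # (v ^ signbit) - signbit sign-extends without relying on shift behavior.
    return (raw ^ 0x80) - 0x80 if signed else raw

def sign_extend16(x: int) -> int:
    """The 16-bit form of the same idiom quoted in the sign-extension note."""
    return ((x & 0xFFFF) ^ 0x8000) - 0x8000

# A 256-byte buffer is 64 u32 words, so a length query on it reports 64.
words = [0] * 64
element_count = len(words) * 4      # 4 byte elements per word

store_byte(words, 5, -3)
as_signed = load_byte(words, 5, signed=True)
as_unsigned = load_byte(words, 5, signed=False)
```

The clear-then-set pair leaves the other three lanes of the word untouched, which is what makes concurrent writes to neighboring bytes safe when both steps are atomic.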

Test Coverage

182 tests total across the sub-word test methods + Half intrinsics + BFloat16, all x 7 backends:

  • Int8: 28 tests (RoundTrip + Read + Write + EndToEnd x 7 backends)
  • UInt8: 28 tests
  • Int16: 35 tests (+ existing CopyFromJS tests)
  • UInt16: 28 tests
  • Float16: 21 tests (Read + Write + EndToEnd x 7 backends)
  • Half Abs: 7 tests
  • Half MinMax: 7 tests
  • BFloat16: 28 tests (BufferRoundTrip + Arithmetic + MinMax + RangeAndSpecials x 7 backends)

Test Files

  • SpawnDev.ILGPU.Demo.Shared/UnitTests/BackendTestBase.Tests17.BrowserBuffer.cs (sub-word + Half)
  • SpawnDev.ILGPU.Demo.Shared/UnitTests/BackendTestBase.BFloat16.cs (bf16)

How to Run

# All sub-word + Half tests
dotnet test PlaywrightMultiTest/PlaywrightMultiTest.csproj --filter "FullyQualifiedName~Int8|FullyQualifiedName~UInt8|FullyQualifiedName~Int16|FullyQualifiedName~UInt16|FullyQualifiedName~Float16|FullyQualifiedName~Half_"

# All BFloat16 tests
dotnet test PlaywrightMultiTest/PlaywrightMultiTest.csproj --filter "FullyQualifiedName~BFloat16"