Bump to 4.13.0-local.8 (forks 2.0.24): generic INumber<T> Half/bf16 on all 6 backends
CHANGELOG + version bump for the generic mixed-precision work (PTX bf16 select + the
by-value sub-word scalar param ABI across PTX/OpenCL/WebGPU/WebGL/Wasm). Four-package
bundle: forks 2.0.23->2.0.24, wrapper local.7->local.8, PackageReference lines updated.
Gates: GenericPrecision 23/0, BFloat16 100/0, Half 190/0/8 (all 6 backends).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
+- **PTX bf16 select.** `GetSelectValueOperation` indexed the op table directly with no bf16->f32 remap
+  (unlike `GetCompareOperation`/`GetArithmeticOperation`), so a generic kernel selecting a bf16 (the
+  `v > T.Zero ? v : T.Zero` ternary -> `selp`) threw `KeyNotFoundException 'BFloat16'` at compile. bf16
+  computes in an f32 register, so the select uses the f32 `selp` - now remapped like the others.
+- **By-value sub-word SCALAR params.** A Half/bf16 passed BY VALUE (e.g. a kernel's `scale`/`bias`) is
+  2-byte storage but computes as f32; every backend mishandled the boundary. The uniform fix: declare/pass
+  the 2-byte STORAGE and convert to f32 at the kernel boundary using the SAME conversion a buffer-element
+  load uses (the host packs storage bytes, unchanged), keyed on the type so non-sub-word params are
+  byte-identical.
+  - **PTX:** declared the bf16 param `.f32` (4B) while the host packs 2B -> garbage (bias=0). Now declares
+    `.b16` + `cvt.f32.bf16` at `BindParameters` (mirrors the bf16 buffer-element load). Half is native f16
+    (2B register) and was already fine.
+  - **OpenCL:** emulated sub-word scalars (bf16 always; Half without `cl_khr_fp16`) were declared 4-byte
+    `float` -> `CL_INVALID_ARG_SIZE`. Now declared `ushort` storage + converted via `_bf16_bits_to_f32` /
+    `_half_bits_to_f32` at body entry.
+  - **WebGPU (WGSL):** read the packed scalar as `bitcast<f32>(slot)` but the 2-byte bits sit in the
+    slot's low 16 -> near-zero denormal. Now widens via `_bf16_to_f32` / `_f16_to_f32` (both the
+    direct-param and struct-field read sites).
+  - **WebGL:** the dispatch derived the scalar type from the C# arg with no Half/bf16 case -> the uniform
+    was never sent (arrived as 0). Added Half/bf16 -> `float` (widen via the explicit operator; they don't
+    implement `IConvertible`, so `Convert.ToSingle` had returned 0).
+  - **Wasm:** a Half/bf16 scalar (a CLR struct) fell into the struct-serialize-to-scratch path and the
+    kernel read its raw 16 bits (got 38656). They compute as f32 (`GetWasmTypeFromIR` -> F32), so the host
+    now passes the widened f32 value, like a float scalar.
+
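The `selp` fix in the first bullet comes down to a table lookup keyed on the wrong type. The following is a minimal Python sketch of that shape, not the ILGPU source: `SELECT_OPS`, `COMPUTE_TYPE`, and both function names are hypothetical stand-ins, and the table entries are illustrative. Only the idea (a direct lookup fails for bf16, a remap to the f32 compute type succeeds) is taken from the changelog text.

```python
# Hypothetical per-type select-op table; entries are illustrative, not ILGPU's real table.
SELECT_OPS = {
    "Float32": "selp.f32",
    "Float64": "selp.f64",
}

# Assumption for this sketch: bf16 is stored narrow but computed in an f32 register.
COMPUTE_TYPE = {"BFloat16": "Float32"}


def select_op_direct(value_type):
    """Pre-fix shape: index the table with the storage type as-is."""
    return SELECT_OPS[value_type]


def select_op_remapped(value_type):
    """Post-fix shape: remap to the compute type first, then index the table."""
    return SELECT_OPS[COMPUTE_TYPE.get(value_type, value_type)]


if __name__ == "__main__":
    try:
        select_op_direct("BFloat16")
    except KeyError as err:
        # Python's analogue of the KeyNotFoundException described above.
        print("direct lookup failed:", err)
    print(select_op_remapped("BFloat16"))
```

Types that have no entry in the remap table pass through unchanged, so existing lookups behave exactly as before.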
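The by-value scalar bullets all describe the same boundary: 16 storage bits on the host side, an f32 value inside the kernel. As a rough illustration in Python (the helper names below are made up for this sketch and only loosely mirror `_bf16_bits_to_f32` / `_half_bits_to_f32`; the truncating pack is an assumption, since real host code may round), the widening and the pre-fix WebGPU-style misread look like this:

```python
import struct


def bf16_bits_to_f32(bits):
    """Widen bf16 storage bits: bf16 is the upper 16 bits of an IEEE f32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def half_bits_to_f32(bits):
    """Widen IEEE binary16 storage bits ('e' is struct's half-precision format)."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def f32_to_bf16_bits(value):
    """Pack by truncation (assumption: simplest possible host-side packing)."""
    return struct.unpack("<I", struct.pack("<f", value))[0] >> 16


def bitcast_slot_as_f32(bits):
    """Misread: treat a 32-bit slot whose low 16 bits hold the scalar as an f32."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF))[0]


if __name__ == "__main__":
    bias_bits = f32_to_bf16_bits(0.5)
    print(hex(bias_bits))                  # storage bits the host would pack
    print(bf16_bits_to_f32(bias_bits))     # widened correctly at the boundary
    print(half_bits_to_f32(0x3800))        # binary16 encoding of 0.5
    print(bitcast_slot_as_f32(bias_bits))  # denormal, effectively zero
```

With the low 16 bits set and the exponent field all zero, the bitcast result is a positive denormal far below any useful `scale` or `bias`, which is consistent with the near-zero values the WebGPU bullet reports.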
### 4.13.0-local.7 - Wasm device-copy ORDERING fix (sync `CopyFrom` works again on browser) + Wasm SIMD128 Stage-3a elementwise vectorization
Two Wasm-backend changes. Forks bump to 2.0.23 (the device-copy fix touches the `ILGPU/` fork: `MemoryBuffer.CopyFromBufferOrdered` + `Accelerator.RequiresAsyncDeviceCopy`). Full Wasm PMT lane **537/0/17**.
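The device-copy ordering fix for local.7 can be pictured with a toy FIFO work stream. This Python sketch is an assumption-laden model, not the Wasm backend: `SerializedStream` and `demo` are hypothetical names, and the real implementation involves a worker pool and shared memory. It only shows why a copy performed immediately can read stale data while a copy enqueued behind the producing kernel cannot.

```python
from collections import deque


class SerializedStream:
    """Toy work stream: submitted work runs only on drain, in FIFO order."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, work):
        self._pending.append(work)

    def drain(self):
        while self._pending:
            self._pending.popleft()()


def demo():
    src = [0, 0, 0, 0]
    stale_dst = [None] * 4
    ordered_dst = [None] * 4
    stream = SerializedStream()

    def kernel():
        # Producer: fills src, but only when the stream actually runs it.
        for i in range(len(src)):
            src[i] = i * i

    def ordered_copy():
        ordered_dst[:] = src

    stream.enqueue(kernel)
    stale_dst[:] = src            # immediate copy: the kernel has not run yet
    stream.enqueue(ordered_copy)  # deferred copy: queued behind the kernel
    stream.drain()
    return stale_dst, ordered_dst


if __name__ == "__main__":
    print(demo())
```

The deferred copy completes at the drain, which matches the changelog's description of device copies being completed at the next drain or dispatch.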
SpawnDev.ILGPU/SpawnDev.ILGPU.csproj
4 additions & 4 deletions
@@ -4,9 +4,9 @@
<TargetFramework>net10.0</TargetFramework>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>
-<Version>4.13.0-local.7</Version>
+<Version>4.13.0-local.8</Version>
<!-- Brief current-version highlights only. Full per-version history with code samples lives in CHANGELOG.md (linked from the README). -->
-
<PackageReleaseNotes>4.13.0-local.7: (1) Wasm device-copy ORDERING fix - REVISES local.6. Sync device-to-device CopyFrom/CopyTo now WORKS on browser again (no longer throws): the Wasm backend ENQUEUES the copy into its serialized work stream so it runs AFTER the producing kernel (the real fix - the worker pool was not a single ordered queue, so the old immediate SharedArrayBuffer copy read stale data; WebGPU/WebGL were always queue-ordered). New virtual MemoryBuffer.CopyFromBufferOrdered (Wasm overrides to defer; default = immediate); the device-to-device throw + the RequiresAsyncDeviceCopy guard are removed (the flag is now informational - device copies are deferred/queue-ordered, completed at the next drain/dispatch). CopyFromAsync stays as an optional eager-completion convenience. No consumer migration needed - sync CopyFrom is correct again. (2) Wasm SIMD128 Stage-3a: additive v128 kernel_simd (4 lanes/call) for the f32 unit-stride elementwise class on SIMD-capable browsers (PackedSimd.IsSupported; no-SIMD builds stay first-class on the byte-identical scalar path). The lane-variance analysis seeds the index AND the thread-position intrinsics (Grid.GlobalIndex/Group.Idx/Grid.Idx/LaneIdx); a structural guard refuses to emit kernel_simd without a lane-variant v128 store (so the by-4 dispatch can never skip lanes). Full Wasm PMT lane 537/0/17. Forks bump to 2.0.23. --- 4.13.0-local.6: Completes the sync/async contract for device-to-device copies. A SYNCHRONOUS CopyFrom/CopyTo between two device buffers now THROWS NotSupportedException on the browser backends (Wasm/WebGPU/WebGL) - it cannot be ordered against a producing kernel at the async GPU boundary and silently read stale data (a real gemma4 KV argmax flip on Wasm). Use `await CopyFromAsync(...)` (drains the producer, then copies); host<->device transfers are unaffected. 
New Accelerator.RequiresAsyncDeviceCopy flag (true on the 3 browser backends), the guard in MemoryBuffer.CopyFromBuffer, public CopyFromBufferAfterDrain + CopyFromUnchecked for library code that orders the copy by other means (the 2 internal scan-fence sites migrated). Locked by BackendTestBase.SyncDeviceCopyContractTest (23/0 all lanes). Forks bump to 2.0.22. --- 4.13.0-local.5: CPU backend - cooperative multi-multiprocessor execution. The CPU accelerator simulated a GPU group of 64 threads as 64 OS threads in ONE multiprocessor; on a machine with fewer cores every in-kernel Group.Barrier oversubscribed and thrashed (~1.4s/launch with multi-second tails on barrier/reduction kernels - e.g. GGUF KV-cache decode timed out). Two fixes: (1) the default CPU device now uses NumMultiprocessors = logical-core-count (was hardcoded 1) so thread-groups run one-per-core in parallel; (2) Auto mode now selects cooperative (sequential-within-group) scheduling - one active thread per multiprocessor, cheap O(1) targeted-pulse barrier handoff (was an O(N^2) Monitor.PulseAll storm) - eliminating oversubscription while keeping cross-core parallelism. Measured: a 256-group x64 shared-memory reduction went 1381ms -> 146ms/launch with zero timing variance, results bit-identical. Explicit Parallel mode still available for barrier-free kernels. Forks bump to 2.0.21. Earlier in 4.13.0: BFloat16 full type parity on all 6 backends (see CHANGELOG). Full per-version history: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
+
<PackageReleaseNotes>4.13.0-local.8: Generic `INumber<T>` mixed-precision kernels (float/Half/bf16) now transpile + run correctly on ALL 6 backends - one `where T:INumber<T>` kernel instead of N hand-written per-type variants. Fixes the codegen gaps the concrete-typed bf16/Half work didn't cover (distinct from sub-word BUFFER elements): (1) PTX bf16 SELECT (the `v>0?v:0` ternary `selp`) was missing the bf16->f32 remap -> KeyNotFoundException at compile. (2) By-value sub-word SCALAR params (e.g. a kernel's scale/bias): PTX declared the bf16 param .f32 (4B) but the host packs 2B storage -> arrived as 0; OpenCL declared emulated Half/bf16 as 4B `float` -> CL_INVALID_ARG_SIZE; WebGPU/WebGL read the scalar without the sub-word conversion -> 0; Wasm struct-serialized the value -> raw bits. All now declare/pass the 2-byte storage and convert to f32 at the boundary (the same conversion buffer ELEMENTS use), keyed on the type so non-sub-word params are byte-identical. Gate: new BackendTestBase.GenericPrecision (float/Half/bf16) **23/0 all 6 backends**; no regression (BFloat16 100/0, Half 190/0/8). Forks bump to 2.0.24. --- 4.13.0-local.7: (1) Wasm device-copy ORDERING fix - REVISES local.6. Sync device-to-device CopyFrom/CopyTo now WORKS on browser again (no longer throws): the Wasm backend ENQUEUES the copy into its serialized work stream so it runs AFTER the producing kernel (the real fix - the worker pool was not a single ordered queue, so the old immediate SharedArrayBuffer copy read stale data; WebGPU/WebGL were always queue-ordered). New virtual MemoryBuffer.CopyFromBufferOrdered (Wasm overrides to defer; default = immediate); the device-to-device throw + the RequiresAsyncDeviceCopy guard are removed (the flag is now informational - device copies are deferred/queue-ordered, completed at the next drain/dispatch). CopyFromAsync stays as an optional eager-completion convenience. No consumer migration needed - sync CopyFrom is correct again. 
(2) Wasm SIMD128 Stage-3a: additive v128 kernel_simd (4 lanes/call) for the f32 unit-stride elementwise class on SIMD-capable browsers (PackedSimd.IsSupported; no-SIMD builds stay first-class on the byte-identical scalar path). The lane-variance analysis seeds the index AND the thread-position intrinsics (Grid.GlobalIndex/Group.Idx/Grid.Idx/LaneIdx); a structural guard refuses to emit kernel_simd without a lane-variant v128 store (so the by-4 dispatch can never skip lanes). Full Wasm PMT lane 537/0/17. Forks bump to 2.0.23. --- 4.13.0-local.6: Completes the sync/async contract for device-to-device copies. A SYNCHRONOUS CopyFrom/CopyTo between two device buffers now THROWS NotSupportedException on the browser backends (Wasm/WebGPU/WebGL) - it cannot be ordered against a producing kernel at the async GPU boundary and silently read stale data (a real gemma4 KV argmax flip on Wasm). Use `await CopyFromAsync(...)` (drains the producer, then copies); host<->device transfers are unaffected. New Accelerator.RequiresAsyncDeviceCopy flag (true on the 3 browser backends), the guard in MemoryBuffer.CopyFromBuffer, public CopyFromBufferAfterDrain + CopyFromUnchecked for library code that orders the copy by other means (the 2 internal scan-fence sites migrated). Locked by BackendTestBase.SyncDeviceCopyContractTest (23/0 all lanes). Forks bump to 2.0.22. --- 4.13.0-local.5: CPU backend - cooperative multi-multiprocessor execution. The CPU accelerator simulated a GPU group of 64 threads as 64 OS threads in ONE multiprocessor; on a machine with fewer cores every in-kernel Group.Barrier oversubscribed and thrashed (~1.4s/launch with multi-second tails on barrier/reduction kernels - e.g. GGUF KV-cache decode timed out). 
Two fixes: (1) the default CPU device now uses NumMultiprocessors = logical-core-count (was hardcoded 1) so thread-groups run one-per-core in parallel; (2) Auto mode now selects cooperative (sequential-within-group) scheduling - one active thread per multiprocessor, cheap O(1) targeted-pulse barrier handoff (was an O(N^2) Monitor.PulseAll storm) - eliminating oversubscription while keeping cross-core parallelism. Measured: a 256-group x64 shared-memory reduction went 1381ms -> 146ms/launch with zero timing variance, results bit-identical. Explicit Parallel mode still available for barrier-free kernels. Forks bump to 2.0.21. Earlier in 4.13.0: BFloat16 full type parity on all 6 backends (see CHANGELOG). Full per-version history: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>