Bump to 4.13.0-local.8 (forks 2.0.24): generic INumber<T> Half/bf16 on all 6 backends

LostBeard · claude · LostBeard · commit 2eeffa5b2b69 · 2026-06-16T16:10:20.000-04:00
CHANGELOG + version bump for the generic mixed-precision work (PTX bf16 select + the
by-value sub-word scalar param ABI across PTX/OpenCL/WebGPU/WebGL/Wasm). Four-package
bundle: forks 2.0.23->2.0.24, wrapper local.7->local.8, PackageReference lines updated.
Gates: GenericPrecision 23/0, BFloat16 100/0, Half 190/0/8 (all 6 backends).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -21,6 +21,42 @@ First phase of `ILGPU.BFloat16` ("brain float") support, mirroring the `ILGPU.Ha
   - **IR const-fold** of bf16 literal arithmetic/compare/convert (the `.tt`-generated `ArithmeticOperations`/`CompareOperations` + `Convert.cs` got bf16 cases; was throwing `NotSupportedException`). **`AcceleratorRequirements.RequiresBFloat16`** (no-op documentation filter; bf16 is always available, no native-vs-emulated split).
   - PMT (`PMT_FILTER=BFloat16`): radix (keys-only asc/desc + pairs), GPU-vs-CPU bucket-compare, less-than ordering, const-fold, range/specials - **all green on all 6 backends**. No Half regression.
 
+### 4.13.0-local.8 - Generic `INumber<T>` mixed-precision kernels (float/Half/bf16) on all 6 backends
+
+A single generic `where T : INumber<T>` compute kernel instantiated for `float` / `ILGPU.Half` /
+`ILGPU.BFloat16` now transpiles AND runs correctly on every backend - one kernel instead of N
+hand-written per-type variants, the foundation for adding further low-precision types. This closes the
+codegen gaps the concrete-typed bf16/Half parity work didn't cover (the generic-specialization compile
+path AND by-value SUB-WORD SCALAR params - distinct from sub-word BUFFER elements, which already worked).
+Forks bump to 2.0.24 (PTX + OpenCL are in the `ILGPU/` fork). Gate: new
+`BackendTestBase.GenericPrecision` (float/Half/bf16) **23/0 across all 6 backends**; no regression
+(`PMT_FILTER=BFloat16` 100/0, `PMT_FILTER=Half` 190/0/8).
+
+- **PTX bf16 select.** `GetSelectValueOperation` indexed the op table directly with no bf16->f32 remap
+  (unlike `GetCompareOperation`/`GetArithmeticOperation`), so a generic kernel selecting a bf16 (the
+  `v > T.Zero ? v : T.Zero` ternary -> `selp`) threw `KeyNotFoundException 'BFloat16'` at compile. bf16
+  computes in an f32 register, so the select uses the f32 `selp` - now remapped like the others.
+- **By-value sub-word SCALAR params.** A Half/bf16 passed BY VALUE (e.g. a kernel's `scale`/`bias`) is
+  2-byte storage but computes as f32; every backend mishandled the boundary. The uniform fix: declare/pass
+  the 2-byte STORAGE and convert to f32 at the kernel boundary using the SAME conversion a buffer-element
+  load uses (the host packs storage bytes, unchanged), keyed on the type so non-sub-word params are
+  byte-identical.
+  - **PTX:** declared the bf16 param `.f32` (4B) while the host packs 2B -> garbage (bias=0). Now declares
+    `.b16` + `cvt.f32.bf16` at `BindParameters` (mirrors the bf16 buffer-element load). Half is native f16
+    (2B register) and was already fine.
+  - **OpenCL:** emulated sub-word scalars (bf16 always; Half without `cl_khr_fp16`) were declared 4-byte
+    `float` -> `CL_INVALID_ARG_SIZE`. Now declared `ushort` storage + converted via `_bf16_bits_to_f32` /
+    `_half_bits_to_f32` at body entry.
+  - **WebGPU (WGSL):** read the packed scalar as `bitcast<f32>(slot)` but the 2-byte bits sit in the
+    slot's low 16 -> near-zero denormal. Now widens via `_bf16_to_f32` / `_f16_to_f32` (both the
+    direct-param and struct-field read sites).
+  - **WebGL:** the dispatch derived the scalar type from the C# arg with no Half/bf16 case -> the uniform
+    was never sent (arrived as 0). Added Half/bf16 -> `float` (widen via the explicit operator; they don't
+    implement `IConvertible`, so `Convert.ToSingle` had returned 0).
+  - **Wasm:** a Half/bf16 scalar (a CLR struct) fell into the struct-serialize-to-scratch path and the
+    kernel read its raw 16 bits (got 38656). They compute as f32 (`GetWasmTypeFromIR` -> F32), so the host
+    now passes the widened f32 value, like a float scalar.
+
 ### 4.13.0-local.7 - Wasm device-copy ORDERING fix (sync `CopyFrom` works again on browser) + Wasm SIMD128 Stage-3a elementwise vectorization
 
 Two Wasm-backend changes. Forks bump to 2.0.23 (the device-copy fix touches the `ILGPU/` fork: `MemoryBuffer.CopyFromBufferOrdered` + `Accelerator.RequiresAsyncDeviceCopy`). Full Wasm PMT lane **537/0/17**.
diff --git a/ILGPU.Algorithms/ILGPU.Algorithms.csproj b/ILGPU.Algorithms/ILGPU.Algorithms.csproj
@@ -12,7 +12,7 @@
          SpawnDev.ILGPU.Fork* PackageReference Versions inside SpawnDev.ILGPU.csproj.
          Run `_check-fork-version-sync.bat` at repo root. See the banner comment in
          SpawnDev.ILGPU.csproj for the full procedure. -->
-    <Version>2.0.23</Version>
+    <Version>2.0.24</Version>
     <IsPackable>true</IsPackable>
     <GeneratePackageOnBuild>true</GeneratePackageOnBuild>
   </PropertyGroup>
diff --git a/ILGPU/ILGPU.csproj b/ILGPU/ILGPU.csproj
@@ -16,7 +16,7 @@
          check on push. Skipping (b) means consumers ship rc.N still pulling old Fork
          transitively and the fix is invisible. See the banner comment in
          SpawnDev.ILGPU.csproj for the full procedure. -->
-    <Version>2.0.23</Version>
+    <Version>2.0.24</Version>
     <IsPackable>true</IsPackable>
     <GeneratePackageOnBuild>true</GeneratePackageOnBuild>
   </PropertyGroup>
diff --git a/SpawnDev.ILGPU/SpawnDev.ILGPU.csproj b/SpawnDev.ILGPU/SpawnDev.ILGPU.csproj
@@ -4,9 +4,9 @@
 		<TargetFramework>net10.0</TargetFramework>
 		<ImplicitUsings>enable</ImplicitUsings>
 		<Nullable>enable</Nullable>
-		<Version>4.13.0-local.7</Version>
+		<Version>4.13.0-local.8</Version>
 		<!-- Brief current-version highlights only. Full per-version history with code samples lives in CHANGELOG.md (linked from the README). -->
-		<PackageReleaseNotes>4.13.0-local.7: (1) Wasm device-copy ORDERING fix - REVISES local.6. Sync device-to-device CopyFrom/CopyTo now WORKS on browser again (no longer throws): the Wasm backend ENQUEUES the copy into its serialized work stream so it runs AFTER the producing kernel (the real fix - the worker pool was not a single ordered queue, so the old immediate SharedArrayBuffer copy read stale data; WebGPU/WebGL were always queue-ordered). New virtual MemoryBuffer.CopyFromBufferOrdered (Wasm overrides to defer; default = immediate); the device-to-device throw + the RequiresAsyncDeviceCopy guard are removed (the flag is now informational - device copies are deferred/queue-ordered, completed at the next drain/dispatch). CopyFromAsync stays as an optional eager-completion convenience. No consumer migration needed - sync CopyFrom is correct again. (2) Wasm SIMD128 Stage-3a: additive v128 kernel_simd (4 lanes/call) for the f32 unit-stride elementwise class on SIMD-capable browsers (PackedSimd.IsSupported; no-SIMD builds stay first-class on the byte-identical scalar path). The lane-variance analysis seeds the index AND the thread-position intrinsics (Grid.GlobalIndex/Group.Idx/Grid.Idx/LaneIdx); a structural guard refuses to emit kernel_simd without a lane-variant v128 store (so the by-4 dispatch can never skip lanes). Full Wasm PMT lane 537/0/17. Forks bump to 2.0.23. --- 4.13.0-local.6: Completes the sync/async contract for device-to-device copies. A SYNCHRONOUS CopyFrom/CopyTo between two device buffers now THROWS NotSupportedException on the browser backends (Wasm/WebGPU/WebGL) - it cannot be ordered against a producing kernel at the async GPU boundary and silently read stale data (a real gemma4 KV argmax flip on Wasm). Use `await CopyFromAsync(...)` (drains the producer, then copies); host&lt;-&gt;device transfers are unaffected. 
New Accelerator.RequiresAsyncDeviceCopy flag (true on the 3 browser backends), the guard in MemoryBuffer.CopyFromBuffer, public CopyFromBufferAfterDrain + CopyFromUnchecked for library code that orders the copy by other means (the 2 internal scan-fence sites migrated). Locked by BackendTestBase.SyncDeviceCopyContractTest (23/0 all lanes). Forks bump to 2.0.22. --- 4.13.0-local.5: CPU backend - cooperative multi-multiprocessor execution. The CPU accelerator simulated a GPU group of 64 threads as 64 OS threads in ONE multiprocessor; on a machine with fewer cores every in-kernel Group.Barrier oversubscribed and thrashed (~1.4s/launch with multi-second tails on barrier/reduction kernels - e.g. GGUF KV-cache decode timed out). Two fixes: (1) the default CPU device now uses NumMultiprocessors = logical-core-count (was hardcoded 1) so thread-groups run one-per-core in parallel; (2) Auto mode now selects cooperative (sequential-within-group) scheduling - one active thread per multiprocessor, cheap O(1) targeted-pulse barrier handoff (was an O(N^2) Monitor.PulseAll storm) - eliminating oversubscription while keeping cross-core parallelism. Measured: a 256-group x64 shared-memory reduction went 1381ms -> 146ms/launch with zero timing variance, results bit-identical. Explicit Parallel mode still available for barrier-free kernels. Forks bump to 2.0.21. Earlier in 4.13.0: BFloat16 full type parity on all 6 backends (see CHANGELOG). Full per-version history: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
+		<PackageReleaseNotes>4.13.0-local.8: Generic `INumber&lt;T&gt;` mixed-precision kernels (float/Half/bf16) now transpile + run correctly on ALL 6 backends - one `where T:INumber&lt;T&gt;` kernel instead of N hand-written per-type variants. Fixes the codegen gaps the concrete-typed bf16/Half work didn't cover (distinct from sub-word BUFFER elements): (1) PTX bf16 SELECT (the `v>0?v:0` ternary `selp`) was missing the bf16-&gt;f32 remap -&gt; KeyNotFoundException at compile. (2) By-value sub-word SCALAR params (e.g. a kernel's scale/bias): PTX declared the bf16 param .f32 (4B) but the host packs 2B storage -&gt; arrived as 0; OpenCL declared emulated Half/bf16 as 4B `float` -&gt; CL_INVALID_ARG_SIZE; WebGPU/WebGL read the scalar without the sub-word conversion -&gt; 0; Wasm struct-serialized the value -&gt; raw bits. All now declare/pass the 2-byte storage and convert to f32 at the boundary (the same conversion buffer ELEMENTS use), keyed on the type so non-sub-word params are byte-identical. Gate: new BackendTestBase.GenericPrecision (float/Half/bf16) **23/0 all 6 backends**; no regression (BFloat16 100/0, Half 190/0/8). Forks bump to 2.0.24. --- 4.13.0-local.7: (1) Wasm device-copy ORDERING fix - REVISES local.6. Sync device-to-device CopyFrom/CopyTo now WORKS on browser again (no longer throws): the Wasm backend ENQUEUES the copy into its serialized work stream so it runs AFTER the producing kernel (the real fix - the worker pool was not a single ordered queue, so the old immediate SharedArrayBuffer copy read stale data; WebGPU/WebGL were always queue-ordered). New virtual MemoryBuffer.CopyFromBufferOrdered (Wasm overrides to defer; default = immediate); the device-to-device throw + the RequiresAsyncDeviceCopy guard are removed (the flag is now informational - device copies are deferred/queue-ordered, completed at the next drain/dispatch). CopyFromAsync stays as an optional eager-completion convenience. 
No consumer migration needed - sync CopyFrom is correct again. (2) Wasm SIMD128 Stage-3a: additive v128 kernel_simd (4 lanes/call) for the f32 unit-stride elementwise class on SIMD-capable browsers (PackedSimd.IsSupported; no-SIMD builds stay first-class on the byte-identical scalar path). The lane-variance analysis seeds the index AND the thread-position intrinsics (Grid.GlobalIndex/Group.Idx/Grid.Idx/LaneIdx); a structural guard refuses to emit kernel_simd without a lane-variant v128 store (so the by-4 dispatch can never skip lanes). Full Wasm PMT lane 537/0/17. Forks bump to 2.0.23. --- 4.13.0-local.6: Completes the sync/async contract for device-to-device copies. A SYNCHRONOUS CopyFrom/CopyTo between two device buffers now THROWS NotSupportedException on the browser backends (Wasm/WebGPU/WebGL) - it cannot be ordered against a producing kernel at the async GPU boundary and silently read stale data (a real gemma4 KV argmax flip on Wasm). Use `await CopyFromAsync(...)` (drains the producer, then copies); host&lt;-&gt;device transfers are unaffected. New Accelerator.RequiresAsyncDeviceCopy flag (true on the 3 browser backends), the guard in MemoryBuffer.CopyFromBuffer, public CopyFromBufferAfterDrain + CopyFromUnchecked for library code that orders the copy by other means (the 2 internal scan-fence sites migrated). Locked by BackendTestBase.SyncDeviceCopyContractTest (23/0 all lanes). Forks bump to 2.0.22. --- 4.13.0-local.5: CPU backend - cooperative multi-multiprocessor execution. The CPU accelerator simulated a GPU group of 64 threads as 64 OS threads in ONE multiprocessor; on a machine with fewer cores every in-kernel Group.Barrier oversubscribed and thrashed (~1.4s/launch with multi-second tails on barrier/reduction kernels - e.g. GGUF KV-cache decode timed out). 
Two fixes: (1) the default CPU device now uses NumMultiprocessors = logical-core-count (was hardcoded 1) so thread-groups run one-per-core in parallel; (2) Auto mode now selects cooperative (sequential-within-group) scheduling - one active thread per multiprocessor, cheap O(1) targeted-pulse barrier handoff (was an O(N^2) Monitor.PulseAll storm) - eliminating oversubscription while keeping cross-core parallelism. Measured: a 256-group x64 shared-memory reduction went 1381ms -> 146ms/launch with zero timing variance, results bit-identical. Explicit Parallel mode still available for barrier-free kernels. Forks bump to 2.0.21. Earlier in 4.13.0: BFloat16 full type parity on all 6 backends (see CHANGELOG). Full per-version history: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
 		<GeneratePackageOnBuild>True</GeneratePackageOnBuild>
 		<GenerateDocumentationFile>true</GenerateDocumentationFile>
 		<EmbedAllSources>true</EmbedAllSources>
@@ -65,8 +65,8 @@
 		<ProjectReference Include="..\ILGPU.Algorithms\ILGPU.Algorithms.csproj" />
 	</ItemGroup>
 	<ItemGroup Condition="!Exists('$(MSBuildThisFileDirectory)..\ILGPU\ILGPU.csproj')">
-		<PackageReference Include="SpawnDev.ILGPU.Fork" Version="2.0.23" />
-		<PackageReference Include="SpawnDev.ILGPU.Algorithms.Fork" Version="2.0.23" />
+		<PackageReference Include="SpawnDev.ILGPU.Fork" Version="2.0.24" />
+		<PackageReference Include="SpawnDev.ILGPU.Algorithms.Fork" Version="2.0.24" />
 	</ItemGroup>
 
 	<ItemGroup>