Commit 972dcda

LostBeard and claude committed
Half float->half: IEEE round-to-nearest-even on every backend (was truncating) - 4.14.0-local.3
Found by validating against numpy.float16 / ml_dtypes.bfloat16: ILGPU.Half float->half TRUNCATED toward zero (managed van der Zijp table) and the WebGPU/WebGL/Wasm emitters truncated AND flushed all subnormals to signed zero - diverging from numpy/PyTorch in ~half of all values AND from ILGPU's own CUDA/OpenCL (which were already round-to-nearest). So a Half model gave different results on WebGPU vs CUDA, and truncation can err by almost 1 ULP where round-to-nearest-even is bounded by 1/2 ULP.

Fix: replaced the managed conversion (HalfConversion.tt) with a direct RNE bit-manip (rebias + RNE mantissa rounding + proper subnormal rounding + overflow->Inf, mirrors the bf16/FP8 conversions) and rewrote WGSL/GLSL _f32_to_f16 + the Wasm EmitF32ToF16 inline bytecode to match. CUDA (cvt.rn) + OpenCL (vstore_half) unchanged - already correct. Corrected the false "lossless / flush-to-zero" f16 doc claims. bf16 was already bit-exact to ml_dtypes.bfloat16 (verified, no change).

Validation: new DemoConsole -- bf16-f16-oracle (all 65536 patterns, decode + round-trip + RNE/subnormal/overflow probes): Half now decode 65536/65536, round-trip 65536/65536, probes 64060/64060 (was ~32294); bf16 perfect. New PMT Half_FloatToHalf_RoundToNearestEven 9/0 all backend lanes; PMT_FILTER=Half 204/0/8 (no regression).

Forks 2.0.30.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent b12174f commit 972dcda

13 files changed

Lines changed: 302 additions & 121 deletions

CHANGELOG.md

Lines changed: 7 additions & 0 deletions
@@ -2,6 +2,13 @@
 
 This file tracks notable changes per release. The README's "Recent Highlights" section links here for the full version history.
 
+## 4.14.0-local.3 (2026-06-17) - Half float→half is now IEEE round-to-nearest-even on every backend (was truncating)
+
+Fixes a real conversion-correctness + cross-backend-consistency bug in the most-used low-precision type, found by validating against the authoritative references (numpy.float16 / ml_dtypes.bfloat16). Forks bump to `2.0.30`.
+
+- **`ILGPU.Half` `float→half` now uses IEEE round-to-nearest-even (incl. proper subnormal rounding + overflow→Inf) on CPU + WebGPU + WebGL + Wasm** - bit-exact to `numpy.float16` / PyTorch / CUDA (`cvt.rn.f16.f32`) / OpenCL (`vstore_half`). **Before:** the managed conversion used the van der Zijp table method, which **truncates toward zero** (`HalfConversion.tt`: shift with no round bit), and the WebGPU/WebGL/Wasm emitters **truncated AND flushed every subnormal to signed zero**. That diverged from numpy/PyTorch in ~half of all values (truncation can err by almost 1 ULP, where RNE is bounded by ½ ULP) AND from ILGPU's own CUDA/OpenCL backends (which were already round-to-nearest) - so a Half model produced different results on WebGPU vs CUDA. Replaced the managed conversion with a direct RNE bit-manip (mirrors the bf16/FP8 conversions) and rewrote the WGSL/GLSL `_f32_to_f16` + the Wasm `EmitF32ToF16` inline bytecode to match. CUDA/OpenCL unchanged (already correct). The "f16 emulation is lossless / matches numpy byte-for-byte" doc claims (which were false for encode) are corrected.
+- **Validated exhaustively:** new `DemoConsole -- bf16-f16-oracle` checks managed BFloat16 + Half vs `ml_dtypes.bfloat16` / `numpy.float16` over **all 65536 patterns** (decode + round-trip identity) + RNE/overflow/subnormal probes. `BFloat16`: bit-exact (decode 65536/65536, round-trip 65536/65536, probes 67503/67503) - was already correct. `Half`: now decode 65536/65536, round-trip 65536/65536, probes 64060/64060 (was ~32294/64060 - the subnormal region + RNE midpoints). Cross-backend gate: new PMT `Half_FloatToHalf_RoundToNearestEven` (kernel `(Half)x` over subnormals/midpoints/overflow/specials, bit-exact vs the managed=numpy reference) **9/0 all backend lanes**; existing Half suite `PMT_FILTER=Half` **204/0/8** (no regression).
+
 ## 4.14.0-local.2 (2026-06-17) - Float8E4M3 is now bit-exact to float8_e4m3fn (overflow → NaN), saturating opt-in
 
 `Float8E4M3` float→fp8 conversion changed from saturating to the `fn` (`float8_e4m3fn`) convention as the DEFAULT, matching the dtype it is named after. Forks bump to `2.0.29`.
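
The difference between the old truncating behavior and round-to-nearest-even described in the changelog entry above can be seen on a few values near 1.0. The following is a minimal sketch, not the repository's code: `RoundingDemo`, `Split`, `Truncate`, and `RoundNearestEven` are hypothetical names, only finite inputs in the f16 normal range are handled, and `System.Half` (not `ILGPU.Half`) is used as an assumed independent reference.

```csharp
using System;

static class RoundingDemo
{
    // Hypothetical helper: splits a finite, normal-range float into the f16 sign bit,
    // the rebiased exponent already shifted into bits 10..14, and the raw 23-bit mantissa.
    static (uint sign, uint exp, uint mant) Split(float value)
    {
        uint bits = BitConverter.SingleToUInt32Bits(value);
        uint sign = (bits >> 16) & 0x8000u;
        int e = (int)((bits >> 23) & 0xFFu) - 127;
        return (sign, (uint)(e + 15) << 10, bits & 0x7FFFFFu);
    }

    // Truncation: drop the low 13 mantissa bits (round toward zero).
    public static ushort Truncate(float value)
    {
        var (sign, exp, mant) = Split(value);
        return (ushort)(sign | exp | (mant >> 13));
    }

    // Round-to-nearest-even: round up when the first dropped bit is set and either
    // a lower dropped bit is set or the kept result is odd.
    public static ushort RoundNearestEven(float value)
    {
        var (sign, exp, mant) = Split(value);
        uint outBits = exp | (mant >> 13);
        bool roundBit = ((mant >> 12) & 1u) == 1u;
        bool sticky = (mant & 0xFFFu) != 0u;
        if (roundBit && (sticky || (outBits & 1u) == 1u))
            outBits += 1u;
        return (ushort)(sign | outBits);
    }

    public static void Main()
    {
        // 1 + 2^-11 and 1 + 1.5 * 2^-10 are exact midpoints between adjacent f16 values.
        float[] samples = { 1f, 1.00048828125f, 1.0014648438f, 1.0007f };
        foreach (float x in samples)
        {
            ushort reference = BitConverter.HalfToUInt16Bits((Half)x); // System.Half
            Console.WriteLine(
                $"{x:R}: truncate=0x{Truncate(x):X4} rne=0x{RoundNearestEven(x):X4} System.Half=0x{reference:X4}");
        }
    }
}
```

The two midpoint samples show the tie rule: 1 + 2^-11 stays at the even neighbor 0x3C00, while 1 + 1.5 * 2^-10 moves up to the even neighbor 0x3C02, where truncation leaves it at 0x3C01.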

ILGPU.Algorithms/ILGPU.Algorithms.csproj

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
          SpawnDev.ILGPU.Fork* PackageReference Versions inside SpawnDev.ILGPU.csproj.
          Run `_check-fork-version-sync.bat` at repo root. See the banner comment in
          SpawnDev.ILGPU.csproj for the full procedure. -->
-    <Version>2.0.29</Version>
+    <Version>2.0.30</Version>
     <IsPackable>true</IsPackable>
     <GeneratePackageOnBuild>true</GeneratePackageOnBuild>
   </PropertyGroup>

ILGPU/Backends/OpenCL/CLBackend.cs

Lines changed: 5 additions & 3 deletions
@@ -97,9 +97,11 @@ public CLBackend(
             // which AscendingHalf / DescendingHalf radix-sort encodings depend on. The
             // hardware path uses `as_short(half)` directly when shader-fp16 is on; the
             // emulated path calls these helpers instead. They are tiny, no-op when unused,
-            // and let the OpenCL compiler optimize out the call when inlined. Mirrors WGSL's
-            // _f32_to_f16 / _f16_to_f32 byte-for-byte (denormals flush to signed zero,
-            // overflow clamps exp to 31 with mantissa preserved so NaN stays NaN).
+            // and let the OpenCL compiler optimize out the call when inlined. This helper is the
+            // radix FloatAsInt(Half) bit-encoder - its inputs are already representable Half values
+            // (widened to f32), so the encoding is exact regardless of rounding mode. General
+            // float->half conversion on OpenCL goes through vstore_half (IEEE round-to-nearest, like
+            // CUDA's cvt.rn and the managed/WGSL/GLSL/Wasm RNE path as of 4.14.0).
             if (!Capabilities.Float16Native)
             {
                 extensionBuilder.AppendLine();
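
The comment's claim that already-representable Half values encode exactly under any rounding mode can be checked directly: widening an f16 value to f32 leaves the 13 mantissa bits that the narrowing step would drop at zero, so there is nothing to round. The following is a sketch under stated assumptions: it uses `System.Half` as a stand-in for the backend helpers (the OpenCL helper itself is not shown in this diff), and `RepresentableCheck` / `CountViolations` are hypothetical names.

```csharp
using System;

static class RepresentableCheck
{
    // Counts non-NaN binary16 patterns whose widened f32 value either carries nonzero
    // bits in the 13 positions dropped by f32 -> f16, or fails to re-encode to the
    // original pattern. NaNs are skipped because their payload bits may be altered.
    public static int CountViolations()
    {
        int violations = 0;
        for (int p = 0; p <= 0xFFFF; p++)
        {
            ushort pattern = (ushort)p;
            Half h = BitConverter.UInt16BitsToHalf(pattern);
            if (Half.IsNaN(h))
                continue;

            float widened = (float)h;                        // exact: every f16 value is an f32 value
            uint f32Bits = BitConverter.SingleToUInt32Bits(widened);
            bool lowBitsClear = (f32Bits & 0x1FFFu) == 0u;   // the 13 bits a narrowing would drop
            ushort back = BitConverter.HalfToUInt16Bits((Half)widened);

            if (!lowBitsClear || back != pattern)
                violations++;
        }
        return violations;
    }

    public static void Main() =>
        Console.WriteLine($"violations: {CountViolations()}");
}
```

This is the same round-trip identity idea as the oracle's decode + round-trip pass, only with a different (assumed) reference implementation.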

ILGPU/HalfConversion.cs

Lines changed: 43 additions & 10 deletions
@@ -332,22 +332,55 @@ public static float ConvertHalfToFloat(Half halfValue)
         }
 
         /// <summary>
-        /// Converts a float value to a half value by using van der Zijp's algorithm.
+        /// Converts a float value to a half value using IEEE round-to-nearest-even. Matches
+        /// numpy.float16 / PyTorch / CUDA (cvt.rn.f16.f32) / OpenCL (vstore_half) bit-for-bit -
+        /// including subnormals and overflow-to-Inf. (Replaced the van der Zijp TABLE method, which
+        /// truncated toward zero and flushed sub-smallest-normal values incorrectly - it diverged
+        /// from every ML reference and from CUDA/OpenCL; verified via DemoConsole -- bf16-f16-oracle.)
+        /// f16: 1 sign / 5 exponent / 10 mantissa, bias 15; has +-Inf + NaN; max normal 65504.
         /// </summary>
         /// <param name="floatValue">The value to convert.</param>
         /// <returns>The converted half value.</returns>
         [MethodImpl(MethodImplOptions.AggressiveInlining)]
         public static Half ConvertFloatToHalf(float floatValue)
         {
-            uint rawValue = Interop.FloatAsInt(floatValue);
-            uint rawUpperValue = rawValue >> 23;
-
-            uint baseEntry = BaseTable[rawUpperValue];
-            int shiftAmount = ShiftTable[rawUpperValue];
-            uint mantissaOffset = rawValue & 0x7FFFFF;
-
-            uint result = baseEntry + (mantissaOffset >> shiftAmount);
-            return new Half((ushort)result);
+            uint bits = Interop.FloatAsInt(floatValue);
+            uint sign = (bits >> 16) & 0x8000u;            // f16 sign bit (bit 15)
+            uint rest = bits & 0x7FFFFFFFu;
+
+            // NaN or Inf input.
+            if (rest >= 0x7F800000u)
+                return new Half((ushort)(sign | (rest > 0x7F800000u ? 0x7E00u : 0x7C00u))); // NaN : Inf
+
+            int e = (int)((rest >> 23) & 0xFFu) - 127;     // unbiased f32 exponent
+            uint f32Mant = rest & 0x7FFFFFu;
+
+            if (e > 15)                                    // overflow -> +-Inf
+                return new Half((ushort)(sign | 0x7C00u));
+
+            if (e < -14)
+            {
+                // Subnormal or zero. f16 subnormal value = mant * 2^-24.
+                if (e < -25)
+                    return new Half((ushort)sign);         // below half the smallest subnormal -> +-0
+                uint signif = f32Mant | 0x800000u;         // implicit leading 1 (24-bit significand)
+                int shift = (-14 - e) + 13;                // align to the 10-bit subnormal field (in [14,24])
+                uint m = signif >> shift;
+                uint roundBit = (signif >> (shift - 1)) & 1u;
+                uint sticky = (signif & ((1u << (shift - 1)) - 1u)) != 0u ? 1u : 0u;
+                if (roundBit == 1u && (sticky == 1u || (m & 1u) == 1u))
+                    m += 1u;                               // RNE; may carry to 0x400 = smallest normal (correct)
+                return new Half((ushort)(sign | m));
+            }
+
+            // Normal. Rebias and round the mantissa 23 -> 10 bits (RNE).
+            uint mant10 = f32Mant >> 13;
+            uint round = (f32Mant >> 12) & 1u;
+            uint sticky2 = (f32Mant & 0xFFFu) != 0u ? 1u : 0u;
+            uint outBits = ((uint)(e + 15) << 10) | mant10;
+            if (round == 1u && (sticky2 == 1u || (mant10 & 1u) == 1u))
+                outBits += 1u;                             // RNE; may carry into the exponent, up to Inf
+            return new Half((ushort)(sign | outBits));
         }
 
         #endregion
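
The subnormal branch in the hunk above is the least obvious part, so a small trace may help. This sketch restates only that branch (unbiased exponent in [-25, -15]) as a standalone function and runs it on a few boundary inputs; `SubnormalTrace` and `Encode` are hypothetical names, and `BitConverter.SingleToUInt32Bits` stands in for `Interop.FloatAsInt`.

```csharp
using System;

static class SubnormalTrace
{
    // Restates the subnormal branch of the conversion for inputs whose unbiased
    // f32 exponent lies in [-25, -15]; returns the resulting f16 bit pattern.
    public static ushort Encode(float value)
    {
        uint bits = BitConverter.SingleToUInt32Bits(value);
        uint sign = (bits >> 16) & 0x8000u;
        int e = (int)((bits >> 23) & 0xFFu) - 127;
        if (e < -25 || e > -15)
            throw new ArgumentOutOfRangeException(nameof(value), "sketch covers only the subnormal branch");

        uint signif = (bits & 0x7FFFFFu) | 0x800000u;  // 24-bit significand with the implicit 1
        int shift = (-14 - e) + 13;                    // 14 (e = -15) up to 24 (e = -25)
        uint m = signif >> shift;                      // kept bits of the subnormal mantissa
        uint roundBit = (signif >> (shift - 1)) & 1u;  // first dropped bit
        bool sticky = (signif & ((1u << (shift - 1)) - 1u)) != 0u; // any lower dropped bit
        if (roundBit == 1u && (sticky || (m & 1u) == 1u))
            m += 1u;                                   // may carry to 0x400, the smallest normal
        return (ushort)(sign | m);
    }

    public static void Main()
    {
        (string label, float x)[] cases =
        {
            ("2^-25 (tie with zero)", MathF.ScaleB(1f, -25)),
            ("1.5 * 2^-25", MathF.ScaleB(1.5f, -25)),
            ("2^-24 (smallest subnormal)", MathF.ScaleB(1f, -24)),
            ("-2.9831426E-08", -2.9831426E-08f),
            ("2^-15", MathF.ScaleB(1f, -15)),
            ("largest float below 2^-14", BitConverter.UInt32BitsToSingle(0x387FFFFFu)),
        };
        foreach (var (label, x) in cases)
            Console.WriteLine($"{label}: 0x{Encode(x):X4}");
    }
}
```

For 2^-25 the shift is 24, so the kept mantissa is 0, the round bit is the implicit 1, and no sticky bit is set: a tie that resolves to the even value 0. Any input slightly larger (such as the -2.9831426E-08 case from the test) sets a sticky bit and rounds to the smallest subnormal instead.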

ILGPU/HalfConversion.tt

Lines changed: 43 additions & 10 deletions
@@ -220,22 +220,55 @@ namespace ILGPU
         }
 
         /// <summary>
-        /// Converts a float value to a half value by using van der Zijp's algorithm.
+        /// Converts a float value to a half value using IEEE round-to-nearest-even. Matches
+        /// numpy.float16 / PyTorch / CUDA (cvt.rn.f16.f32) / OpenCL (vstore_half) bit-for-bit -
+        /// including subnormals and overflow-to-Inf. (Replaced the van der Zijp TABLE method, which
+        /// truncated toward zero and flushed sub-smallest-normal values incorrectly - it diverged
+        /// from every ML reference and from CUDA/OpenCL; verified via DemoConsole -- bf16-f16-oracle.)
+        /// f16: 1 sign / 5 exponent / 10 mantissa, bias 15; has +-Inf + NaN; max normal 65504.
         /// </summary>
         /// <param name="floatValue">The value to convert.</param>
         /// <returns>The converted half value.</returns>
         [MethodImpl(MethodImplOptions.AggressiveInlining)]
         public static Half ConvertFloatToHalf(float floatValue)
         {
-            uint rawValue = Interop.FloatAsInt(floatValue);
-            uint rawUpperValue = rawValue >> <#= FloatMantissaBits #>;
-
-            uint baseEntry = BaseTable[rawUpperValue];
-            int shiftAmount = ShiftTable[rawUpperValue];
-            uint mantissaOffset = rawValue & <#= $"0x{FloatMantissaMask:X}" #>;
-
-            uint result = baseEntry + (mantissaOffset >> shiftAmount);
-            return new Half((ushort)result);
+            uint bits = Interop.FloatAsInt(floatValue);
+            uint sign = (bits >> 16) & 0x8000u;            // f16 sign bit (bit 15)
+            uint rest = bits & 0x7FFFFFFFu;
+
+            // NaN or Inf input.
+            if (rest >= 0x7F800000u)
+                return new Half((ushort)(sign | (rest > 0x7F800000u ? 0x7E00u : 0x7C00u))); // NaN : Inf
+
+            int e = (int)((rest >> 23) & 0xFFu) - 127;     // unbiased f32 exponent
+            uint f32Mant = rest & 0x7FFFFFu;
+
+            if (e > 15)                                    // overflow -> +-Inf
+                return new Half((ushort)(sign | 0x7C00u));
+
+            if (e < -14)
+            {
+                // Subnormal or zero. f16 subnormal value = mant * 2^-24.
+                if (e < -25)
+                    return new Half((ushort)sign);         // below half the smallest subnormal -> +-0
+                uint signif = f32Mant | 0x800000u;         // implicit leading 1 (24-bit significand)
+                int shift = (-14 - e) + 13;                // align to the 10-bit subnormal field (in [14,24])
+                uint m = signif >> shift;
+                uint roundBit = (signif >> (shift - 1)) & 1u;
+                uint sticky = (signif & ((1u << (shift - 1)) - 1u)) != 0u ? 1u : 0u;
+                if (roundBit == 1u && (sticky == 1u || (m & 1u) == 1u))
+                    m += 1u;                               // RNE; may carry to 0x400 = smallest normal (correct)
+                return new Half((ushort)(sign | m));
+            }
+
+            // Normal. Rebias and round the mantissa 23 -> 10 bits (RNE).
+            uint mant10 = f32Mant >> 13;
+            uint round = (f32Mant >> 12) & 1u;
+            uint sticky2 = (f32Mant & 0xFFFu) != 0u ? 1u : 0u;
+            uint outBits = ((uint)(e + 15) << 10) | mant10;
+            if (round == 1u && (sticky2 == 1u || (mant10 & 1u) == 1u))
+                outBits += 1u;                             // RNE; may carry into the exponent, up to Inf
+            return new Half((ushort)(sign | outBits));
         }
 
         #endregion

ILGPU/ILGPU.csproj

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@
          check on push. Skipping (b) means consumers ship rc.N still pulling old Fork
          transitively and the fix is invisible. See the banner comment in
          SpawnDev.ILGPU.csproj for the full procedure. -->
-    <Version>2.0.29</Version>
+    <Version>2.0.30</Version>
     <IsPackable>true</IsPackable>
     <GeneratePackageOnBuild>true</GeneratePackageOnBuild>
   </PropertyGroup>

SpawnDev.ILGPU.Demo.Shared/UnitTests/BackendTestBase.GenericPrecision.cs

Lines changed: 54 additions & 0 deletions
@@ -173,6 +173,60 @@ private static void Float8OverflowKernel(Index1D i,
             sat[i] = global::ILGPU.Float8E4M3.FromSingleSaturating(x[i]);   // opt-in saturating
         }
 
+        // ILGPU.Half float->half must be round-to-nearest-even on EVERY backend (Geordi 2026-06-17,
+        // bf16/Half oracle validation). The managed conversion is bit-exact to numpy.float16 (verified
+        // DemoConsole -- bf16-f16-oracle, all 65536 patterns); this proves the WebGPU/WebGL/Wasm
+        // emitters match it - they previously TRUNCATED + flushed all subnormals to zero (diverging
+        // from numpy AND from CUDA/OpenCL which use native round-to-nearest). Subnormals + overflow +
+        // RNE midpoints are the point.
+        private static void HalfConvertKernel(Index1D i,
+            ArrayView1D<float, Stride1D.Dense> x, ArrayView1D<global::ILGPU.Half, Stride1D.Dense> y) =>
+            y[i] = (global::ILGPU.Half)x[i];
+
+        [TestMethod]
+        public async Task Half_FloatToHalf_RoundToNearestEven() => await RunTest(async accelerator =>
+        {
+            float[] inputs =
+            {
+                1f, 1.5f, -2.5f, 100.3f, 0.333f, 1024.7f,             // normals
+                1.00048828125f, 1.0014648438f, 0.99975586f,           // normal RNE midpoints near 1.0
+                65504f, 65519f, 65520f, 65535f, 70000f, 1e30f,        // overflow region -> 65504 / Inf
+                float.PositiveInfinity, float.NegativeInfinity, -65504f, -70000f,
+                (float)Math.Pow(2, -24), (float)Math.Pow(2, -23), (float)Math.Pow(2, -15), // exact subnormals/boundary
+                (float)Math.Pow(2, -25),                              // tie -> +0 (even)
+                (float)(Math.Pow(2, -25) * 1.5),                      // -> smallest subnormal
+                (float)Math.Pow(2, -26),                              // -> +0
+                -2.9831426E-08f,                                      // the original failing case -> -smallest subnormal
+                0f, -0f, float.NaN, 5.96e-8f, 1e-7f,
+            };
+            int n = inputs.Length;
+            var expected = new ushort[n];
+            for (int i = 0; i < n; i++)
+            {
+                var v = (global::ILGPU.Half)inputs[i];                // managed = bit-exact numpy.float16 (oracle-proven)
+                expected[i] = System.Runtime.CompilerServices.Unsafe.As<global::ILGPU.Half, ushort>(ref v);
+            }
+
+            using var inBuf = accelerator.Allocate1D(inputs);
+            using var outBuf = accelerator.Allocate1D<global::ILGPU.Half>(n);
+            var k = accelerator.LoadAutoGroupedStreamKernel<Index1D,
+                ArrayView1D<float, Stride1D.Dense>, ArrayView1D<global::ILGPU.Half, Stride1D.Dense>>(HalfConvertKernel);
+            k(n, inBuf.View, outBuf.View);
+            await accelerator.SynchronizeAsync();
+            var got = await outBuf.CopyToHostAsync<global::ILGPU.Half>();
+
+            for (int i = 0; i < n; i++)
+            {
+                ushort g = System.Runtime.CompilerServices.Unsafe.As<global::ILGPU.Half, ushort>(ref got[i]);
+                bool bothNaN = (g & 0x7C00) == 0x7C00 && (g & 0x03FF) != 0
+                    && (expected[i] & 0x7C00) == 0x7C00 && (expected[i] & 0x03FF) != 0;
+                if (!bothNaN && g != expected[i])
+                    throw new Exception(
+                        $"float->Half kernel @{i} ({BackendName}): input {inputs[i]} -> got 0x{g:X4}, " +
+                        $"want 0x{expected[i]:X4} (must be round-to-nearest-even incl subnormals, matching numpy/managed).");
+            }
+        });
+
         [TestMethod]
         public async Task Float8E4M3_FromSingleFn_OverflowToNaN() => await RunTest(async accelerator =>
         {

SpawnDev.ILGPU/SpawnDev.ILGPU.csproj

Lines changed: 4 additions & 4 deletions
@@ -4,9 +4,9 @@
     <TargetFramework>net10.0</TargetFramework>
     <ImplicitUsings>enable</ImplicitUsings>
     <Nullable>enable</Nullable>
-    <Version>4.14.0-local.2</Version>
+    <Version>4.14.0-local.3</Version>
     <!-- Brief current-version highlights only. Full per-version history with code samples lives in CHANGELOG.md (linked from the README). -->
-    <PackageReleaseNotes>4.14.0 makes Float8E4M3 bit-exact to PyTorch/JAX/ml_dtypes float8_e4m3fn (the dtype it is named after): the cast operator and the IR-level convert now use the fn convention - finite overflow AND +-Inf map to NaN (was: saturate to +-448) - verified against the ml_dtypes oracle and on all 6 backends. The saturating (NVIDIA TE / OCP) cast is available opt-in via Float8E4M3.FromSingleSaturating / FromSingle(x, saturate: true). 4.13.2 is a packaging fix over 4.13.1: removes stray Wasm/repro JSON files that the Razor SDK swept into the package, and bundles the precompiled-shaders precompiler tool (tools/) that 4.13.0/4.13.1 were missing. The 4.13.x line brings full low-precision floating-point support across ALL 6 backends (CPU, OpenCL, WebGPU, WebGL, Wasm, CUDA): Half, BFloat16, and FP8 (Float8E4M3 + Float8E5M2) - including FP8 radix-sort keys (4.13.1) - plus generic INumber&lt;T&gt; mixed-precision kernels, PrecisionConvert, and bf16/FP8 portability to pre-Ampere CUDA cards (GTX 1080 / RTX 2060). Full per-version history with code samples: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
+    <PackageReleaseNotes>4.14.0 fixes low-precision float CONVERSION CORRECTNESS against the references. (1) Half (float-&gt;half) is now IEEE round-to-nearest-even on every backend (CPU + WebGPU + WebGL + Wasm), bit-exact to numpy.float16 / PyTorch / CUDA / OpenCL - it previously truncated toward zero and flushed subnormals to zero (diverging from numpy AND from CUDA/OpenCL). (2) Float8E4M3 is now bit-exact to PyTorch/JAX/ml_dtypes float8_e4m3fn: the cast + IR convert use the fn convention (finite overflow AND +-Inf -&gt; NaN; was saturate to +-448); saturating is opt-in via Float8E4M3.FromSingleSaturating. BFloat16 was already bit-exact to ml_dtypes.bfloat16 (verified). All validated exhaustively against ml_dtypes/numpy oracles + cross-backend PMT gates. 4.13.2 is a packaging fix over 4.13.1: removes stray Wasm/repro JSON files that the Razor SDK swept into the package, and bundles the precompiled-shaders precompiler tool (tools/) that 4.13.0/4.13.1 were missing. The 4.13.x line brings full low-precision floating-point support across ALL 6 backends (CPU, OpenCL, WebGPU, WebGL, Wasm, CUDA): Half, BFloat16, and FP8 (Float8E4M3 + Float8E5M2) - including FP8 radix-sort keys (4.13.1) - plus generic INumber&lt;T&gt; mixed-precision kernels, PrecisionConvert, and bf16/FP8 portability to pre-Ampere CUDA cards (GTX 1080 / RTX 2060). Full per-version history with code samples: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
     <GeneratePackageOnBuild>True</GeneratePackageOnBuild>
     <GenerateDocumentationFile>true</GenerateDocumentationFile>
     <EmbedAllSources>true</EmbedAllSources>
@@ -65,8 +65,8 @@
     <ProjectReference Include="..\ILGPU.Algorithms\ILGPU.Algorithms.csproj" />
   </ItemGroup>
   <ItemGroup Condition="!Exists('$(MSBuildThisFileDirectory)..\ILGPU\ILGPU.csproj')">
-    <PackageReference Include="SpawnDev.ILGPU.Fork" Version="2.0.29" />
-    <PackageReference Include="SpawnDev.ILGPU.Algorithms.Fork" Version="2.0.29" />
+    <PackageReference Include="SpawnDev.ILGPU.Fork" Version="2.0.30" />
+    <PackageReference Include="SpawnDev.ILGPU.Algorithms.Fork" Version="2.0.30" />
   </ItemGroup>
 
   <ItemGroup>
