Date: 2026-04-12

Scope: WebGPU backend - int8, int16, uint16, float16 math and buffer access

Finding: Sub-word buffer access infrastructure EXISTS but only handles 8-bit. 16-bit is broken.

⚠ STATUS: HISTORICAL / RESOLVED (do not cite as current). This is the PRE-FIX audit (Apr 2026). Everything below, including "Float16 without shader-f16 ... LIKELY BROKEN" and the "What Needs Fixing" list, has SHIPPED. f16 is supported on EVERY backend; where native `shader-f16` is unavailable it is EMULATED losslessly (`_f16_to_f32` / `_f32_to_f16`), so `Capabilities.Float16` is always `true` and only `Capabilities.Float16Native` distinguishes native vs emulated. See `Plans/f16-emulation-plan.md` "Shipping status" (Phases 1-4 SHIPPED) and `SpawnDev.ILGPU/WebGPU/CLAUDE.md` "Float16 (Half) — Native and Emulated". Kept for historical context only.

`WGSLKernelFunctionGenerator.cs`, lines 1126-1128:

    if (paramElemType is PrimitiveType pt &&
        (pt.BasicValueType == BasicValueType.Int8 || pt.BasicValueType == BasicValueType.Int16))
        _byteElementParams.Add(param.Index);

Both Int8 AND Int16 get added to `_byteElementParams`. But the extraction code at line 3708 only handles BYTE extraction:

    // Extracts ONE BYTE from a u32 word - divides by 4, shifts by 8 bits, masks 0xFF
    var extractExpr = $"i32((param{byteParamIdx}[u32({byteIdx}) / 4u] >> ((u32({byteIdx}) % 4u) * 8u)) & 0xFFu)";

For Int16, this reads ONE BYTE instead of TWO BYTES. Data corruption.

What Int16 needs instead:

    // Extracts ONE SHORT (2 bytes) from a u32 word - divides by 2, shifts by 16 bits, masks 0xFFFF
    var extractExpr = $"i32((param{paramIdx}[u32({idx}) / 2u] >> ((u32({idx}) % 2u) * 16u)) & 0xFFFFu)";

- Separate tracking: `_byteElementParams` for Int8, new `_shortElementParams` for Int16/UInt16
- LEA codegen: different address math for 1-byte vs 2-byte elements
- Load codegen: byte extraction (`/ 4`, `% 4`, `* 8`, `& 0xFF`) vs short extraction (`/ 2`, `% 2`, `* 16`, `& 0xFFFF`)
- Store codegen: same pattern for writes (atomic or plain read-modify-write)

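
The byte and short extraction patterns above can be modelled on the CPU to check the index math. The following is a minimal C# sketch, not backend code: `SubWordLoad` and its methods are hypothetical names, and the `uint[]` stands in for the `array<u32>` binding. It assumes element 0 sits in the low bits of each word, which is what the shift expressions imply.

```csharp
using System;

// Hypothetical CPU-side model of the WGSL load expressions (not part of the backend).
public static class SubWordLoad
{
    // Byte path: 4 elements per u32 word (/ 4, % 4, * 8, & 0xFF).
    public static int LoadByteRaw(uint[] words, uint index) =>
        (int)((words[index / 4u] >> (int)((index % 4u) * 8u)) & 0xFFu);

    // Short path: 2 elements per u32 word (/ 2, % 2, * 16, & 0xFFFF).
    // The result is zero-extended, exactly like the masked WGSL expression.
    public static int LoadShortRaw(uint[] words, uint index) =>
        (int)((words[index / 2u] >> (int)((index % 2u) * 16u)) & 0xFFFFu);

    // Signed Int16: reinterpret the low 16 bits as a short so negative values survive.
    public static int LoadInt16(uint[] words, uint index) =>
        unchecked((short)LoadShortRaw(words, index));
}
```

With the single word `0xFFFE0005`, `LoadInt16(words, 1)` gives -2 while the byte path `LoadByteRaw(words, 1)` gives 0, which is the corruption described above. The masked `& 0xFFFF` form alone gives 65534 for the same element, so a signed Int16 load would presumably also need a sign-extension step.
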
| Line | Mapping | Status |
|---|---|---|
| 116 | Int8 -> "i32" | OK (promoted) |
| 117 | Int16 -> "i32" | OK (promoted) |
| 120 | Float16 -> "f16" or "f32" | OK (conditional native) |
| 135 | ArithmeticInt8 -> "i32" | OK |
| 136 | ArithmeticInt16 -> "i32" | OK |
| 139 | ArithmeticUInt8 -> "u32" | OK |
| 140 | ArithmeticUInt16 -> "u32" | OK |
| 143 | ArithmeticFloat16 -> "f16" or "f32" | OK |
Type PROMOTION is handled. Types become i32/u32/f32 in WGSL (or f16 when shader-f16 is native). The issue is only in BUFFER ACCESS.
| Line | What | Issue |
|---|---|---|
| 69-73 | _byteElementParams tracking | BUG: Int16 lumped with Int8 |
| 1126-1128 | Adding Int8 + Int16 to same set | BUG: should be separate |
| 3553-3565 | LEA for byte-element views | BUG: address math is byte-only |
| 3704-3708 | Load extraction | BUG: extracts 1 byte, not 2 for Int16 |
| 3564 | Cross-block pointer expression | BUG: byte extraction only |

Store packing: NOT FOUND. Load extraction exists, but there is no corresponding Store packing. If a kernel writes to an `ArrayView<short>`, the Store codegen likely writes a full i32 to the buffer, overwriting the adjacent 16-bit value. This needs an atomic read-modify-write or, at minimum, a pack-and-write.

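
To make the pack-and-write idea concrete, here is a minimal CPU-side sketch in C#. `SubWordStore.StoreInt16` is a hypothetical name, not an existing backend function, and the `uint[]` again stands in for the `array<u32>` binding. It shows only the non-atomic read-modify-write; on the GPU, two invocations writing the two halves of the same word would race, so a real implementation would likely need WGSL atomics on an `atomic<u32>` buffer.

```csharp
using System;

// Hypothetical CPU-side model of a 16-bit store into a u32-packed buffer.
public static class SubWordStore
{
    // Writes the low 16 bits of value into element `index`,
    // leaving the other half of the containing word untouched.
    public static void StoreInt16(uint[] words, uint index, int value)
    {
        int shift = (int)((index % 2u) * 16u);      // 0 for the low half, 16 for the high half
        uint mask = 0xFFFFu << shift;               // bits owned by this element
        uint packed = (unchecked((uint)value) & 0xFFFFu) << shift;
        uint wordIndex = index / 2u;

        // Non-atomic read-modify-write: clear the target half, then OR in the new bits.
        words[wordIndex] = (words[wordIndex] & ~mask) | packed;
    }
}
```

Starting from the word `0x11112222`, storing -2 at element 1 yields `0xFFFE2222`, and then storing 5 at element 0 yields `0xFFFE0005`; in both cases the neighbouring 16-bit value is preserved.
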
| Function | short | sbyte | Status |
|---|---|---|---|
| Abs | line 164 | line 158 | OK - C# level, promoted to i32 in WGSL |
| Min | line 190 | line 183 | OK |
| Max | line 218 | line 213 | OK |
These work because they're C# intrinsics that get compiled to i32 WGSL operations after type promotion. No buffer access involved.
| Line | What | Status |
|---|---|---|
| 1594 | Int8 constant emission | OK |
| 1595 | Int16 constant emission | OK |
| 1598 | Float16 constant emission | OK (uses float cast) |
Constants are fine - they're scalar values, not buffer reads.
| Line | What | Issue |
|---|---|---|
| 1188-1189 | f16 bit packing for buffer upload | OK for native f16 |
| Buffer alloc | MemoryBuffer1D | NEEDS CHECK: is buffer size correct? |
When allocating `MemoryBuffer1D<short, Dense>(256)`, does WebGPU allocate 256 * 2 = 512 bytes? Or 256 * 4 = 1024 bytes? If the WGSL binding declares `array<u32>` (128 elements for 256 shorts), the buffer MUST be 128 * 4 = 512 bytes. Check that `AllocateRawInternal` uses the element size correctly.
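
The expected sizes can be written down as a small calculation. This is a sketch under the assumption that sub-word buffers are tightly packed and bound as `array<u32>`; `SubWordSizing` is a hypothetical helper and does not describe what `AllocateRawInternal` actually does.

```csharp
// Hypothetical sizing helper for tightly packed sub-word buffers bound as array<u32>.
public static class SubWordSizing
{
    // Packed size in bytes, rounded up to a whole number of 4-byte u32 words.
    public static long PackedByteSize(long elementCount, int elementSize)
    {
        long rawBytes = elementCount * elementSize;
        return (rawBytes + 3) / 4 * 4;
    }

    // Number of u32 words the WGSL binding would expose.
    public static long WordCount(long elementCount, int elementSize) =>
        PackedByteSize(elementCount, elementSize) / 4;
}
```

For 256 shorts this gives 512 bytes and 128 words, matching the figures above. An odd count such as 255 shorts rounds up from 510 to 512 bytes so that the last element still falls inside a full word.
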
The IR level handles Int8, Int16, Float16 for constant folding (Neg, Not, Abs, PopCount, LeadingZeroCount, etc.). These are compile-time operations, not runtime buffer access. No issues here.
RadixSort uses ArrayView<int> internally for histograms and scatter. If someone calls RadixSort on ArrayView<short>, the algorithm would need to handle sub-word access. Check: does RadixSort accept non-int element types? If not, it would fail at compile time (type mismatch), which is safe. If it does, it would hit the same buffer access bug.

Float16 with shader-f16 (native):

- Type: `f16` in WGSL
- Buffer: `array<f16>` is valid when shader-f16 is enabled
- No sub-word extraction needed - native f16 buffer access works
- HalfExtensions intrinsics registered (lines 731-745)
- Status: SHOULD WORK on GPUs with shader-f16

Float16 without shader-f16:

- Type: `f32` in WGSL (promoted)
- Buffer: would need sub-word access like Int16
- Is Float16 added to `_byteElementParams`? CHECK - line 1126 only checks Int8 and Int16, NOT Float16
- If Float16 buffers are NOT in `_byteElementParams`, the Load codegen treats them as regular f32 reads from a buffer packed with 16-bit floats = same stride mismatch bug as Int16
- Status: LIKELY BROKEN on GPUs without shader-f16

Open question at line 1126:

    // Does this line also need Float16?
    if (paramElemType is PrimitiveType pt &&
        (pt.BasicValueType == BasicValueType.Int8 || pt.BasicValueType == BasicValueType.Int16))
        _byteElementParams.Add(param.Index);

    // Should it be:
    if (paramElemType is PrimitiveType pt &&
        (pt.BasicValueType == BasicValueType.Int8 ||
         pt.BasicValueType == BasicValueType.Int16 ||
         (!Backend.HasShaderF16 && pt.BasicValueType == BasicValueType.Float16)))
        _byteElementParams.Add(param.Index);

What needs fixing:

- Separate Int16 from Int8 tracking - new `_shortElementParams` HashSet
- Int16 Load extraction - `/2u`, `*16u`, `&0xFFFFu` instead of `/4u`, `*8u`, `&0xFFu`
- Int16 Store packing - write 16 bits into the correct half of a u32 word
- Int16 LEA address math - element index * 2 bytes, not * 1 byte
- Float16 without shader-f16 - add to sub-word tracking when native f16 unavailable
- Float16 Load/Store - same sub-word extraction but with f16<->f32 conversion
- Float16 buffer allocation - correct byte size for packed f16 data
- Int8/UInt8 Store - verify Store codegen handles byte writes (Load exists, Store may not)
- RadixSort type check - ensure algorithms reject or handle sub-word element types
- Unit tests - int16 read, int16 write, int16 kernel, f16 emulated read/write/kernel
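
The "Float16 Load/Store" item combines the 16-bit sub-word access with an f16<->f32 conversion. As a rough CPU-side illustration only, the sketch below uses .NET's `System.Half` and `BitConverter.UInt16BitsToHalf` / `BitConverter.HalfToUInt16Bits` (available from .NET 6) for the conversion. `EmulatedHalfAccess` is a hypothetical name, and this is not how the WGSL side would be written; it only shows the data flow of extracting or packing 16 bits and converting.

```csharp
using System;

// Hypothetical CPU-side model of emulated f16 buffer access through u32 words.
public static class EmulatedHalfAccess
{
    // Load: extract 16 bits (same math as Int16), then widen f16 -> f32 (exact).
    public static float LoadHalfAsFloat(uint[] words, uint index)
    {
        ushort bits = (ushort)((words[index / 2u] >> (int)((index % 2u) * 16u)) & 0xFFFFu);
        return (float)BitConverter.UInt16BitsToHalf(bits);
    }

    // Store: narrow f32 -> f16 (rounds if needed), then pack into the correct half.
    public static void StoreFloatAsHalf(uint[] words, uint index, float value)
    {
        uint bits = BitConverter.HalfToUInt16Bits((Half)value);
        int shift = (int)((index % 2u) * 16u);
        uint wordIndex = index / 2u;
        words[wordIndex] = (words[wordIndex] & ~(0xFFFFu << shift)) | (bits << shift);
    }
}
```

The word `0xC0003C00` holds 1.0 in element 0 and -2.0 in element 1; storing 0.5 into element 0 changes the word to `0xC0003800` and leaves element 1 intact.
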

Files affected:

- `WebGPU/Backend/WGSLKernelFunctionGenerator.cs` - Load/Store/LEA for int16 + f16
- `WebGPU/Backend/WGSLTypeGenerator.cs` - no changes needed (types already promoted)
- `WebGPU/WebGPUAccelerator.cs` - verify buffer sizing for sub-word types
- `WebGPU/Backend/WebGPUBackend.cs` - possibly register f16 emulation intrinsics for non-shader-f16
- Tests: int16 + f16 buffer access tests on WebGPU backend