Skip to content

Commit 3975c4b

Browse files
LostBeardclaude
andcommitted
Packed QInt4 nibble LOAD on the CPU (IL) backend - packed-qint4-verify now CPU+OpenCL+CUDA 32/32
The CPU accelerator uses DefaultILBackend, which runs the literal managed kernel method - so x[i] over an ArrayView<QInt4> invokes the real ArrayView<QInt4> indexer, not lowered codegen. That indexer returns ref T at byte base+index*ElementSize (ElementSize=1 for QInt4), which reads the WRONG byte for packed 2-nibble-per-byte storage (and over-reads past the ceil(N/2) buffer). Empirically: 31/32 wrong, e.g. i=1 read -6 (low nibble of byte 1) instead of -7 (high nibble of byte 0). A managed ref cannot address a nibble, so the fix decodes the packed element BY VALUE: the indexer body (which only ever runs on the CPU/IL backend - GPU backends replace it with the GetViewElementAddress view-intrinsic) branches on BitsPerElement < 8 and calls a new LoadPackedElement that computes byte = (Index+index)*BitsPerElement/8, shift = bitOffset%8, extracts the nibble, writes it into a [ThreadStatic] scratch T and returns a ref to it. Correct for by-value reads (int v = x[i]) on every parity; thread-static so concurrent CPU kernel threads don't clobber. The branch is statically false for every whole-byte type (BitsPerElement = 8/16/32/ 64), so whole-byte indexing is byte-for-byte unchanged. Generalized over BitsPerElement (not hardcoded to 4-bit) so future sub-byte widths reuse it. Note: the separate Velocity SIMD accelerator (AcceleratorType.Velocity) is NOT exercised by any test and transpiles the indexer via its own Specializer.Load - it remains a tracked follow-on, distinct from this CPU/IL path. Packed in-kernel WRITES (x[i] = v) on the CPU backend are a separate concern handled with the store work (atomic nibble RMW), not this load. Verified: packed-qint4-verify CPU+OpenCL+CUDA 32/32; fp4-verify CPU+CUDA PASS (Float4E2M1 stays 1-byte/unpacked, normal path untouched); packed-alloc-verify PASS; representative CPU array tests (bf16 ArrayView round-trip, CopyFromStream, bf16/FP4 radix sort) all Success. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 1990af6 commit 3975c4b

2 files changed

Lines changed: 50 additions & 6 deletions

File tree

ILGPU/ArrayView.cs

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -198,6 +198,16 @@ is PackedBitsAttribute packed && packed.Bits > 0
198198
/// </summary>
199199
public static readonly ArrayView<T> Empty;
200200

201+
/// <summary>
202+
/// Per-thread scratch element used by the CPU (IL) backend to return a by-value
203+
/// decoded element for packed sub-byte views (<see cref="BitsPerElement"/> &lt; 8).
204+
/// The managed ref model cannot address a nibble in place, so a packed read decodes
205+
/// the nibble into this scratch and returns a ref to it. Thread-static so concurrent
206+
/// CPU kernel threads do not clobber each other.
207+
/// </summary>
208+
[ThreadStatic]
209+
private static T PackedElementScratch;
210+
201211
#endregion
202212

203213
#region Instance
@@ -392,11 +402,43 @@ public readonly unsafe ref T this[long index]
392402
{
393403
Trace.Assert(index >= 0 && index < Length, "Index out of range");
394404
EnsureCPUBuffer();
405+
// Packed sub-byte views (4-bit QInt4/QUInt4/Float4E2M1): the element lives in a
406+
// nibble that cannot be addressed by a managed ref, so decode it by value into a
407+
// per-thread scratch. (On GPU backends this whole body is replaced by the
408+
// GetViewElementAddress view-intrinsic and is never executed.)
409+
if (BitsPerElement < 8)
410+
return ref LoadPackedElement(index);
395411
ref var ptr = ref LoadEffectiveAddress(index);
396412
return ref Unsafe.As<byte, T>(ref ptr);
397413
}
398414
}
399415

416+
/// <summary>
417+
/// Decodes a packed sub-byte element (<see cref="BitsPerElement"/> &lt; 8) by value into
418+
/// the per-thread <see cref="PackedElementScratch"/> and returns a ref to it. Used only by
419+
/// the CPU (IL) backend, which runs this managed body directly; the GPU backends lower the
420+
/// indexer to a nibble-addressing load and never reach here.
421+
/// </summary>
422+
/// <param name="index">The relative element index.</param>
423+
/// <returns>A ref to the thread-static scratch holding the decoded element.</returns>
424+
[MethodImpl(MethodImplOptions.AggressiveInlining)]
425+
private readonly unsafe ref T LoadPackedElement(long index)
426+
{
427+
// bit offset of the element within the buffer; byte = offset/8, shift = offset%8.
428+
long bitOffset = (Index + index) * BitsPerElement;
429+
long byteIndex = bitOffset >> 3;
430+
int shift = (int)(bitOffset & 7L);
431+
int mask = (1 << BitsPerElement) - 1;
432+
ref byte packedByte = ref Unsafe.Add(
433+
ref Unsafe.AsRef<byte>(Buffer.NativePtr.ToPointer()),
434+
(nint)byteIndex);
435+
// Keep only this element's bits in the low part of the byte; the consuming
436+
// conversion operator (e.g. QInt4->int) sign/zero-extends from there.
437+
byte raw = (byte)((packedByte >> shift) & mask);
438+
PackedElementScratch = Unsafe.As<byte, T>(ref raw);
439+
return ref PackedElementScratch;
440+
}
441+
400442
#endregion
401443

402444
#region Methods

SpawnDev.ILGPU.DemoConsole/PackedQInt4Verify.cs

Lines changed: 8 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -45,13 +45,15 @@ public static Task<int> Run()
4545
if (type != AcceleratorType.CPU && type != AcceleratorType.Cuda && type != AcceleratorType.OpenCL)
4646
continue;
4747

48-
// WIRED backends (packed QInt4 nibble load implemented + asserted). CPU/Velocity is a
49-
// tracked follow-on: its Specializer.Load dispatches by width (no sub-byte path) and the
50-
// vectorized LEA has no per-lane parity channel - a deeper Velocity-SIMD gather change.
51-
bool wired = type == AcceleratorType.Cuda || type == AcceleratorType.OpenCL;
48+
// WIRED backends (packed QInt4 nibble load implemented + asserted). The CPU (IL) backend
49+
// runs the managed ArrayView<QInt4> indexer directly, which decodes the packed nibble by
50+
// value (ArrayView.LoadPackedElement). The separate Velocity SIMD accelerator
51+
// (AcceleratorType.Velocity, not exercised here) is a tracked follow-on.
52+
bool wired = type == AcceleratorType.Cuda || type == AcceleratorType.OpenCL
53+
|| type == AcceleratorType.CPU;
5254
if (!wired)
5355
{
54-
Console.WriteLine($" [{type}] PENDING - packed nibble load not yet wired (Velocity sub-byte gather; tracked)");
56+
Console.WriteLine($" [{type}] PENDING - packed nibble load not yet wired (tracked)");
5557
continue;
5658
}
5759

@@ -96,7 +98,7 @@ public static Task<int> Run()
9698
}
9799

98100
Console.WriteLine(totalFails == 0
99-
? "=== PACKED QInt4 LOAD PASS (wired: OpenCL + CUDA; CPU/Velocity pending) ==="
101+
? "=== PACKED QInt4 LOAD PASS (wired: CPU + OpenCL + CUDA) ==="
100102
: $"=== PACKED QInt4 LOAD: {totalFails} problems ===");
101103
return Task.FromResult(totalFails == 0 ? 0 : 1);
102104
}

0 commit comments

Comments
 (0)