Packed QInt4 nibble LOAD on the CPU (IL) backend - packed-qint4-verify now CPU+OpenCL+CUDA 32/32

LostBeard · claude · LostBeard · commit 3975c4bfa760 · 2026-06-18T18:24:21.000-04:00
The CPU accelerator uses DefaultILBackend, which runs the literal managed kernel method - so
x[i] over an ArrayView<QInt4> invokes the real ArrayView<QInt4> indexer, not lowered codegen.
That indexer returns ref T at byte base+index*ElementSize (ElementSize=1 for QInt4), which reads
the WRONG byte for packed 2-nibble-per-byte storage (and over-reads past the ceil(N/2) buffer).
Empirically: 31/32 wrong, e.g. i=1 read -6 (low nibble of byte 1) instead of -7 (high nibble of
byte 0).

A managed ref cannot address a nibble, so the fix decodes the packed element BY VALUE: the
indexer body (which only ever runs on the CPU/IL backend - GPU backends replace it with the
GetViewElementAddress view-intrinsic) branches on BitsPerElement < 8 and calls a new
LoadPackedElement that computes bitOffset = (Index+index)*BitsPerElement, byte = bitOffset/8,
shift = bitOffset%8,
extracts the nibble, writes it into a [ThreadStatic] scratch T and returns a ref to it. Correct
for by-value reads (int v = x[i]) on every parity; thread-static so concurrent CPU kernel threads
don't clobber each other. The branch is statically false for every whole-byte type
(BitsPerElement = 8/16/32/64), so whole-byte indexing is byte-for-byte unchanged.

Generalized over BitsPerElement (not hardcoded to 4-bit) so future sub-byte widths reuse it.
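
The decode arithmetic described above can be sketched outside ILGPU as follows. This is a minimal
illustration, not the ArrayView code: the class PackedLoadSketch and its methods Pack, Load and
LoadBytewise are hypothetical names, sign extension is done inline (in the source that is the job of
the QInt4->int conversion), and the sketch assumes the element width divides 8 so no element
straddles a byte. The sample values are chosen only so that element 1 is -7 and element 2 is -6, as
in the example above; the actual test data is not shown in this commit.

```csharp
using System;

// Hypothetical stand-alone sketch of a packed sub-byte load; not the ILGPU implementation.
public static class PackedLoadSketch
{
    // Packs signed values into `bits`-wide fields, lowest index in the low bits of each byte.
    // Assumes bits is 1, 2 or 4 so that a field never crosses a byte boundary.
    public static byte[] Pack(int[] values, int bits)
    {
        var buffer = new byte[(values.Length * bits + 7) / 8];
        int mask = (1 << bits) - 1;
        for (int i = 0; i < values.Length; i++)
        {
            long bitOffset = (long)i * bits;
            buffer[bitOffset >> 3] |= (byte)((values[i] & mask) << (int)(bitOffset & 7L));
        }
        return buffer;
    }

    // Decodes element `index` by value: byte = bitOffset / 8, shift = bitOffset % 8.
    public static int Load(byte[] buffer, long index, int bits)
    {
        long bitOffset = index * bits;
        int shift = (int)(bitOffset & 7L);
        int mask = (1 << bits) - 1;
        int raw = (buffer[bitOffset >> 3] >> shift) & mask;
        // Sign-extend the field from `bits` bits to 32 bits.
        return (raw << (32 - bits)) >> (32 - bits);
    }

    // The pre-fix addressing: one byte per element, then the low nibble is interpreted.
    public static int LoadBytewise(byte[] buffer, long index)
    {
        int raw = buffer[index] & 0xF;
        return (raw << 28) >> 28;
    }
}
```

With Pack(new[] { -8, -7, -6, -5 }, 4) the buffer is { 0x98, 0xBA }. Load(buffer, 1, 4) takes the
high nibble of byte 0 and yields -7, while LoadBytewise(buffer, 1) takes the low nibble of byte 1 and
yields -6, which is the mismatch reported above. LoadBytewise(buffer, 2) would index past the 2-byte
buffer, mirroring the over-read.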

Note: the separate Velocity SIMD accelerator (AcceleratorType.Velocity) is NOT exercised by any
test and transpiles the indexer via its own Specializer.Load - it remains a tracked follow-on,
distinct from this CPU/IL path. Packed in-kernel WRITES (x[i] = v) on the CPU backend are a
separate concern handled with the store work (atomic nibble RMW), not this load.

Verified: packed-qint4-verify CPU+OpenCL+CUDA 32/32; fp4-verify CPU+CUDA PASS (Float4E2M1 stays
1-byte/unpacked, normal path untouched); packed-alloc-verify PASS; representative CPU array tests
(bf16 ArrayView round-trip, CopyFromStream, bf16/FP4 radix sort) all Success.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
diff --git a/ILGPU/ArrayView.cs b/ILGPU/ArrayView.cs
@@ -198,6 +198,16 @@ is PackedBitsAttribute packed && packed.Bits > 0
         /// </summary>
         public static readonly ArrayView<T> Empty;
 
+        /// <summary>
+        /// Per-thread scratch element used by the CPU (IL) backend to return a by-value
+        /// decoded element for packed sub-byte views (<see cref="BitsPerElement"/> &lt; 8).
+        /// The managed ref model cannot address a nibble in place, so a packed read decodes
+        /// the nibble into this scratch and returns a ref to it. Thread-static so concurrent
+        /// CPU kernel threads do not clobber each other.
+        /// </summary>
+        [ThreadStatic]
+        private static T PackedElementScratch;
+
         #endregion
 
         #region Instance
@@ -392,11 +402,43 @@ public readonly unsafe ref T this[long index]
             {
                 Trace.Assert(index >= 0 && index < Length, "Index out of range");
                 EnsureCPUBuffer();
+                // Packed sub-byte views (e.g. 4-bit QInt4): the element lives in a
+                // nibble that cannot be addressed by a managed ref, so decode it by value into a
+                // per-thread scratch. (On GPU backends this whole body is replaced by the
+                // GetViewElementAddress view-intrinsic and is never executed.)
+                if (BitsPerElement < 8)
+                    return ref LoadPackedElement(index);
                 ref var ptr = ref LoadEffectiveAddress(index);
                 return ref Unsafe.As<byte, T>(ref ptr);
             }
         }
 
+        /// <summary>
+        /// Decodes a packed sub-byte element (<see cref="BitsPerElement"/> &lt; 8) by value into
+        /// the per-thread <see cref="PackedElementScratch"/> and returns a ref to it. Used only by
+        /// the CPU (IL) backend, which runs this managed body directly; the GPU backends lower the
+        /// indexer to a nibble-addressing load and never reach here.
+        /// </summary>
+        /// <param name="index">The relative element index.</param>
+        /// <returns>A ref to the thread-static scratch holding the decoded element.</returns>
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        private readonly unsafe ref T LoadPackedElement(long index)
+        {
+            // bit offset of the element within the buffer; byte = offset/8, shift = offset%8.
+            long bitOffset = (Index + index) * BitsPerElement;
+            long byteIndex = bitOffset >> 3;
+            int shift = (int)(bitOffset & 7L);
+            int mask = (1 << BitsPerElement) - 1;
+            ref byte packedByte = ref Unsafe.Add(
+                ref Unsafe.AsRef<byte>(Buffer.NativePtr.ToPointer()),
+                (nint)byteIndex);
+            // Keep only this element's bits in the low part of the byte; the consuming
+            // conversion operator (e.g. QInt4->int) sign/zero-extends from there.
+            byte raw = (byte)((packedByte >> shift) & mask);
+            PackedElementScratch = Unsafe.As<byte, T>(ref raw);
+            return ref PackedElementScratch;
+        }
+
         #endregion
 
         #region Methods
diff --git a/SpawnDev.ILGPU.DemoConsole/PackedQInt4Verify.cs b/SpawnDev.ILGPU.DemoConsole/PackedQInt4Verify.cs
@@ -45,13 +45,15 @@ public static Task<int> Run()
             if (type != AcceleratorType.CPU && type != AcceleratorType.Cuda && type != AcceleratorType.OpenCL)
                 continue;
 
-            // WIRED backends (packed QInt4 nibble load implemented + asserted). CPU/Velocity is a
-            // tracked follow-on: its Specializer.Load dispatches by width (no sub-byte path) and the
-            // vectorized LEA has no per-lane parity channel - a deeper Velocity-SIMD gather change.
-            bool wired = type == AcceleratorType.Cuda || type == AcceleratorType.OpenCL;
+            // WIRED backends (packed QInt4 nibble load implemented + asserted). The CPU (IL) backend
+            // runs the managed ArrayView<QInt4> indexer directly, which decodes the packed nibble by
+            // value (ArrayView.LoadPackedElement). The separate Velocity SIMD accelerator
+            // (AcceleratorType.Velocity, not exercised here) is a tracked follow-on.
+            bool wired = type == AcceleratorType.Cuda || type == AcceleratorType.OpenCL
+                || type == AcceleratorType.CPU;
             if (!wired)
             {
-                Console.WriteLine($"  [{type}] PENDING - packed nibble load not yet wired (Velocity sub-byte gather; tracked)");
+                Console.WriteLine($"  [{type}] PENDING - packed nibble load not yet wired (tracked)");
                 continue;
             }
 
@@ -96,7 +98,7 @@ public static Task<int> Run()
         }
 
         Console.WriteLine(totalFails == 0
-            ? "=== PACKED QInt4 LOAD PASS (wired: OpenCL + CUDA; CPU/Velocity pending) ==="
+            ? "=== PACKED QInt4 LOAD PASS (wired: CPU + OpenCL + CUDA) ==="
             : $"=== PACKED QInt4 LOAD: {totalFails} problems ===");
         return Task.FromResult(totalFails == 0 ? 0 : 1);
     }