Commit 3975c4b
Packed QInt4 nibble LOAD on the CPU (IL) backend - packed-qint4-verify now CPU+OpenCL+CUDA 32/32
The CPU accelerator uses DefaultILBackend, which runs the literal managed kernel method - so
x[i] over an ArrayView<QInt4> invokes the real ArrayView<QInt4> indexer, not lowered codegen.
That indexer returns ref T at byte base+index*ElementSize (ElementSize=1 for QInt4), which reads
the WRONG byte for packed 2-nibble-per-byte storage (and over-reads past the ceil(N/2) buffer).
Empirically: 31/32 wrong, e.g. i=1 read -6 (low nibble of byte 1) instead of -7 (high nibble of
byte 0).
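
The mis-addressing can be reproduced with a small model. This is only a sketch under assumptions, not the ILGPU code: it assumes QInt4 is a signed two's-complement nibble, that element 2k sits in the low nibble and element 2k+1 in the high nibble of byte k (consistent with the i=1 example above), and that the verify data is (i % 16) - 8, which is a guess. The helper names are hypothetical.

```python
# Model of the pre-fix behaviour: a byte-addressed read over packed nibbles.
# Assumptions (not taken from the ILGPU source): two's-complement 4-bit values,
# low nibble = even element, high nibble = odd element, test data (i % 16) - 8.

def from_nibble(n):
    """Sign-extend a raw 4-bit value to a Python int."""
    return n - 16 if n >= 8 else n

def pack(values):
    """Pack signed 4-bit values two per byte into ceil(N/2) bytes."""
    out = bytearray((len(values) + 1) // 2)
    for i, v in enumerate(values):
        out[i // 2] |= (v & 0xF) << (4 * (i % 2))
    return bytes(out)

def byte_addressed_read(buf, i):
    """Models base + index*ElementSize with ElementSize = 1: reads byte i."""
    if i >= len(buf):
        return None  # past the ceil(N/2) buffer: an over-read
    return from_nibble(buf[i] & 0xF)

values = [(i % 16) - 8 for i in range(32)]
buf = pack(values)
wrong = sum(1 for i, v in enumerate(values) if byte_addressed_read(buf, i) != v)
```

Under these assumptions only i=0 reads back correctly: every other index either lands on a different element's byte or falls past the end of the 16-byte buffer.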
A managed ref cannot address a nibble, so the fix decodes the packed element BY VALUE: the
indexer body (which only ever runs on the CPU/IL backend - GPU backends replace it with the
GetViewElementAddress view-intrinsic) branches on BitsPerElement < 8 and calls a new
LoadPackedElement that computes bitOffset = (Index+index)*BitsPerElement, byte = bitOffset/8,
shift = bitOffset%8, extracts the nibble, writes it into a [ThreadStatic] scratch T and returns a
ref to it. This is correct for by-value reads (int v = x[i]) at both even and odd indices; the
scratch is thread-static so concurrent CPU kernel threads don't clobber each other's value. The
branch is statically false for every whole-byte type (BitsPerElement = 8/16/32/64), so whole-byte
indexing is byte-for-byte unchanged.
Generalized over BitsPerElement (not hardcoded to 4-bit) so future sub-byte widths reuse it.
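
The decode arithmetic can be sketched in isolation. This is a language-neutral model, not the actual LoadPackedElement: the function and parameter names are hypothetical, it assumes signed two's-complement elements whose width divides 8 (so an element never straddles two bytes), and it leaves out the [ThreadStatic] scratch and the ref return.

```python
# Sketch of a packed sub-byte load generalized over the element width.
# Assumptions: bits_per_element is 1, 2 or 4; values are two's-complement;
# elements are laid out from the least significant bits of each byte upward.

def load_packed_element(buf, base_index, index, bits_per_element):
    """Return element (base_index + index) of a packed buffer, by value."""
    bit_offset = (base_index + index) * bits_per_element
    byte = bit_offset // 8           # which byte holds the element
    shift = bit_offset % 8           # where the element starts inside that byte
    mask = (1 << bits_per_element) - 1
    raw = (buf[byte] >> shift) & mask
    sign_bit = 1 << (bits_per_element - 1)
    # Sign-extend: a set top bit means the value is negative.
    return raw - (1 << bits_per_element) if raw & sign_bit else raw

packed = bytes([0x98, 0xBA])         # nibbles -8, -7, -6, -5 (low nibble first)
decoded = [load_packed_element(packed, 0, i, 4) for i in range(4)]
```

The base_index argument stands in for the view's own Index offset, so a sub-view that starts at an odd element still resolves to the right nibble.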
Note: the separate Velocity SIMD accelerator (AcceleratorType.Velocity) is NOT exercised by any
test and transpiles the indexer via its own Specializer.Load - it remains a tracked follow-on,
distinct from this CPU/IL path. Packed in-kernel WRITES (x[i] = v) on the CPU backend are a
separate concern handled with the store work (atomic nibble RMW), not this load.
Verified: packed-qint4-verify CPU+OpenCL+CUDA 32/32; fp4-verify CPU+CUDA PASS (Float4E2M1 stays
1-byte/unpacked, normal path untouched); packed-alloc-verify PASS; representative CPU array tests
(bf16 ArrayView round-trip, CopyFromStream, bf16/FP4 radix sort) all Success.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 1990af6 commit 3975c4b
2 files changed
Lines changed: 50 additions & 6 deletions