Commit b9259d9
ggml-ve : Q4_K direct kernel — vgtlzx HBM gather (opt-in)
Adds a gather-based qs load variant that reads u32 lanes directly
from raw HBM via _vel_vgtlzx_vvssl + _vel_vsfa_vvssl, eliminating
the per-row scratch pack (16 vld+vst per block).
Opt-in: GGML_VE_Q4K_STD_GATHER=1 (in addition to _DIRECT + _STD_CHUNK).
How it works:
- One-shot init of g_qs_gather_offsets[256] holding the byte-offset
  pattern (i/32)*144 + 16 + (i%32)*4 for i in 0..255.
- Per chunk:
    chunk_base = row_start + chunk_start*144;
    abs_addrs  = vsfa(off_v, shift=0, chunk_base, VL);
    qs_chunk   = vgtlzx(abs_addrs, 0, 0, VL);
- The address pattern is monotonically increasing (eight 128-byte
  runs separated by 16-byte block headers), so the VE gather behaves
  like a near-coalesced load rather than paying random-access cost.
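
The offset table and per-chunk gather can be illustrated with a portable scalar sketch. This is not the kernel from this commit: the plain loops stand in for the _vel_vsfa_vvssl and _vel_vgtlzx_vvssl intrinsics, and the names init_qs_gather_offsets, gather_qs_chunk and the enum constants are hypothetical. It assumes a 144-byte block with the qs bytes starting at byte 16, as the offset formula implies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    Q4K_BLOCK_BYTES  = 144, /* block stride taken from the offset formula */
    Q4K_HEADER_BYTES = 16,  /* bytes that precede qs in each block */
    Q4K_QS_LANES     = 32,  /* 128 qs bytes = 32 u32 lanes per block */
    GATHER_LANES     = 256  /* lanes per chunk, i.e. 8 blocks */
};

/* Hypothetical stand-in for g_qs_gather_offsets[256]. */
static uint64_t qs_gather_offsets[GATHER_LANES];

/* One-shot init: lane i maps to the byte offset of its u32 in the chunk. */
static void init_qs_gather_offsets(void) {
    for (int i = 0; i < GATHER_LANES; ++i) {
        qs_gather_offsets[i] =
            (uint64_t)(i / Q4K_QS_LANES) * Q4K_BLOCK_BYTES
            + Q4K_HEADER_BYTES
            + (uint64_t)(i % Q4K_QS_LANES) * 4;
    }
}

/* Scalar emulation of the per-chunk step: add the chunk base to each
 * offset (the vsfa with shift=0), then load a 32-bit value per lane and
 * zero-extend it to 64 bits (the vgtlzx). */
static void gather_qs_chunk(const uint8_t *row_start, size_t chunk_start,
                            int vl, uint64_t *qs_chunk) {
    assert(vl >= 0 && vl <= GATHER_LANES);
    const uint8_t *chunk_base = row_start + chunk_start * Q4K_BLOCK_BYTES;
    for (int i = 0; i < vl; ++i) {
        uint32_t lane;
        memcpy(&lane, chunk_base + qs_gather_offsets[i], sizeof lane);
        qs_chunk[i] = lane; /* upper 32 bits stay zero */
    }
}
```

Calling init_qs_gather_offsets() once and then gather_qs_chunk(row, 0, 256, out) fills out[] with the 256 qs lanes of the first eight blocks. Lanes 31 and 32 sit at offsets 140 and 160, which is the 16-byte header skip between consecutive runs.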
Saves nb vld + nb vst of HBM<->LLC traffic per row and removes the
need for the per-thread g_qs_pool scratch buffer.
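
To make the difference between the two routes concrete, here is a scalar sketch that packs qs into a contiguous scratch buffer and compares it with direct reads from the raw row. The scratch-pack code is not shown in this commit message, so pack_qs_scratch, load_qs_lane_direct and routes_agree are hypothetical stand-ins, and the 144/16/128-byte layout is assumed from the offset formula above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK_BYTES = 144, HEADER_BYTES = 16, QS_BYTES = 128 };

/* Scratch-pack route: copy each block's qs bytes into a contiguous
 * buffer (one load + one store pass per block). */
static void pack_qs_scratch(const uint8_t *row, size_t nb, uint8_t *scratch) {
    for (size_t b = 0; b < nb; ++b) {
        memcpy(scratch + b * QS_BYTES,
               row + b * BLOCK_BYTES + HEADER_BYTES, QS_BYTES);
    }
}

/* Gather route: read u32 lane i straight from the raw row, no scratch. */
static uint32_t load_qs_lane_direct(const uint8_t *row, size_t i) {
    uint32_t lane;
    size_t off = (i / 32) * BLOCK_BYTES + HEADER_BYTES + (i % 32) * 4;
    memcpy(&lane, row + off, sizeof lane);
    return lane;
}

/* Returns 1 when both routes yield identical lanes for nb blocks. */
static int routes_agree(const uint8_t *row, size_t nb, uint8_t *scratch) {
    assert(row != NULL && scratch != NULL);
    pack_qs_scratch(row, nb, scratch);
    for (size_t i = 0; i < nb * 32; ++i) {
        uint32_t packed;
        memcpy(&packed, scratch + i * 4, sizeof packed);
        if (packed != load_qs_lane_direct(row, i)) {
            return 0;
        }
    }
    return 1;
}
```

The direct route needs no scratch buffer at all, which is the requirement the gather variant drops. The packed route pays the extra copy before any lane can be read.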
Measured:
- Standalone test_q4k_std_matvec: ALL OK for both the gather and
  no-gather variants on 12 shapes, incl. K=17408.
- 1B Q4_K_M: gather is roughly +12% pp and +3% tg over scratch-pack,
  but run-to-run variance is high and both deltas are within noise.
- 27B Q4_K_M N>1: 0.50/0.46 t/s (vs 0.50/0.44 with scratch-pack),
  a modest +5% tg.
The win is modest at best because the scratch pack was already fast
(sequential vector vld+vst at MVL). For the current dense chunked
path it is basically a wash; the gather route mainly gives headroom
for future kernels that may want to read partial chunks or skip
blocks.
Task ggml-org#64.1
parent bff568d, commit b9259d9
2 files changed
Lines changed: 168 additions & 7 deletions