Commit dc98e54
ggml-ve : Q4_K direct kernel — packed-fp32 pvfmad (+43% tg on 27B)
Adds a packed-FP32 variant that uses _vel_pvfmad_vvvvl (2 fp32 per
64-bit lane) to halve the FMA chain. Opt-in via
GGML_VE_Q4K_STD_PACKED=1 (compatible with _STD_CHUNK + _DIRECT).
Codex called this the biggest remaining win for direct Q4_K, and
the bench bears that out on 27B Q4_K_M:
    Direct chunked (baseline)     : 0.50 pp / 0.44 tg t/s
    Direct chunked + packed (NEW) : 0.68 pp / 0.63 tg t/s
                                    +36% pp, +43% tg
On 1B Q4_K_M (3-run averages, high variance):
    Direct chunked          : 20.25 pp / 9.00 tg
    Direct chunked + packed : 21.89 pp / 9.50 tg (+8% pp / +5% tg)
How it works:
- Per lane, pack (d_low, d_high) into one 64-bit dlane_pk word,
similarly (-m_low, -m_high) into mlane_pk (negated for the
pvfmad encoding w = -m + d*nib = d*nib - m).
- Per byte position bp, build packed nibbles:
low_nib = (qs >> (8*bp)) & 0x0F (bits 0..3)
high_nib = ((qs >> (8*bp + 4)) & 0x0F) << 32 (bits 32..35)
nib_pk = low_nib | high_nib
- pvcvtsw converts packed int32 -> packed fp32.
- pvfmad: w_pk = -m_pk + d_pk*nib_pk, then acc_pk = pvfmad(acc_pk, w_pk, x_pk).
- Reduce the packed accumulator by extracting low+high halves
of each lane and summing (pattern mirrors q4k_full_intrin.c:698-705).
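The steps above can be sketched in portable scalar C for a single lane. This is only an illustration of the dataflow, not the kernel itself: the helpers pack_f32x2, pk_fma, pk_cvt_i32_f32 and q4k_lane_packed are hypothetical names, pk_fma is a scalar stand-in for the packed FMA (a + b*c on each 32-bit half), and one (d, m) pair per lane across all four byte positions is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack two fp32 values into one 64-bit word: lo in bits 0..31, hi in bits 32..63. */
static uint64_t pack_f32x2(float lo, float hi) {
    uint32_t l, h;
    memcpy(&l, &lo, sizeof l);
    memcpy(&h, &hi, sizeof h);
    return (uint64_t)l | ((uint64_t)h << 32);
}

static float unpack_lo(uint64_t pk) {
    uint32_t l = (uint32_t)pk;
    float f;
    memcpy(&f, &l, sizeof f);
    return f;
}

static float unpack_hi(uint64_t pk) {
    uint32_t h = (uint32_t)(pk >> 32);
    float f;
    memcpy(&f, &h, sizeof f);
    return f;
}

/* Scalar stand-in for a packed fp32 FMA: each half computes a + b*c. */
static uint64_t pk_fma(uint64_t a, uint64_t b, uint64_t c) {
    return pack_f32x2(unpack_lo(a) + unpack_lo(b) * unpack_lo(c),
                      unpack_hi(a) + unpack_hi(b) * unpack_hi(c));
}

/* Low nibble of byte bp goes to bits 0..3, high nibble to bits 32..35. */
static uint64_t build_nib_pk(uint32_t qs, int bp) {
    assert(bp >= 0 && bp < 4);
    uint64_t low_nib  = (uint64_t)((qs >> (8 * bp)) & 0x0Fu);
    uint64_t high_nib = (uint64_t)((qs >> (8 * bp + 4)) & 0x0Fu) << 32;
    return low_nib | high_nib;
}

/* Packed int32 -> packed fp32, one conversion per 32-bit half. */
static uint64_t pk_cvt_i32_f32(uint64_t pk) {
    return pack_f32x2((float)(int32_t)(uint32_t)pk,
                      (float)(int32_t)(uint32_t)(pk >> 32));
}

/* One lane: dequantize 4 byte positions and accumulate against x.
 * mlane_pk must already hold the negated mins (-m_low, -m_high). */
static float q4k_lane_packed(uint32_t qs, uint64_t dlane_pk,
                             uint64_t mlane_pk, const uint64_t x_pk[4]) {
    uint64_t acc_pk = pack_f32x2(0.0f, 0.0f);
    for (int bp = 0; bp < 4; ++bp) {
        uint64_t nib_pk = pk_cvt_i32_f32(build_nib_pk(qs, bp));
        uint64_t w_pk   = pk_fma(mlane_pk, dlane_pk, nib_pk); /* -m + d*nib */
        acc_pk          = pk_fma(acc_pk, w_pk, x_pk[bp]);     /* acc + w*x  */
    }
    /* Final reduction: sum the low and high halves of the accumulator. */
    return unpack_lo(acc_pk) + unpack_hi(acc_pk);
}
```

Each pk_fma call here corresponds to one packed instruction in the real kernel, which is why the per-chunk FMA count halves relative to handling the low and high nibbles separately.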
x_perm builder: new q4k_std_build_x_perm_packed_extern produces
[bp][b][i] u64 layout, each u64 = (x_low | x_high << 32). Same
total bytes as the two unpacked float arrays. One pass per matvec.
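A minimal sketch of the packing step, assuming the low-half and high-half activations are already available as two float arrays in the desired [bp][b][i] order. pack_x_pairs is a hypothetical helper, not the actual q4k_std_build_x_perm_packed_extern, and the permutation that produces that order is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: interleave two float arrays into u64 words,
 * with x_low[i] in bits 0..31 and x_high[i] in bits 32..63. */
static void pack_x_pairs(const float *x_low, const float *x_high,
                         uint64_t *out, size_t n) {
    assert(x_low && x_high && out);
    for (size_t i = 0; i < n; ++i) {
        uint32_t lo, hi;
        memcpy(&lo, &x_low[i], sizeof lo);   /* bit-exact copy, no conversion */
        memcpy(&hi, &x_high[i], sizeof hi);
        out[i] = (uint64_t)lo | ((uint64_t)hi << 32);
    }
}
```

The output holds n 64-bit words, which is the same number of bytes as the two n-element float arrays it replaces.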
Per chunk:
- Before: 4 bp × 2 halves = 8 VL=cn*32 FMAs.
- Now: 4 bp × 1 packed = 4 VL=cn*32 packed FMAs (each does 2
fp32 multiply-adds per lane, so the same 8 per lane in half
the instructions).
Net: 2x arithmetic density per FMA instruction, a real win on FMA-bound paths.
Standalone test_q4k_std_matvec ALL OK on packed variant, 12
shapes incl. K=17408; max_abs 5.7e-6 (tighter than unpacked 8.1e-6).
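For reference, a max_abs figure of this kind is typically the largest element-wise absolute difference between the kernel output and a reference result. The sketch below assumes that definition; max_abs_diff is a hypothetical helper and is not taken from test_q4k_std_matvec.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: largest |got[i] - ref[i]| over n elements. */
static float max_abs_diff(const float *got, const float *ref, size_t n) {
    assert(got && ref);
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float d = got[i] - ref[i];
        if (d < 0.0f) {
            d = -d;
        }
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}
```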
Task ggml-org#63.1 parent b9259d9 commit dc98e54
2 files changed
Lines changed: 203 additions & 4 deletions