Commit 717f362
perf(native q6k): block-outer loop order (sequential weight reads)

Apply the same cache-locality reorder as q4k/q5k/q8_0 to the Q6_K kernel:
iterate block-OUTER / output-row-INNER so that the block-major weights, stored
at byte offset (blockIdx*output_dim + o)*210, are read sequentially.
out_base[o] accumulates across blocks; results are numerically identical
(NativeQ6KMatmulKernel parity green).
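
The reorder described above can be sketched as follows. This is a minimal illustration, not the kernel from this commit: it assumes a toy block format (4 signed bytes per block instead of Q6_K's 256 weights in 210 bytes), and the names `toy_dequant_block`, `matvec_row_outer`, and `matvec_block_outer` are hypothetical. Only the loop nesting and the `(b*output_dim + o)` block-major indexing mirror the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in sizes (assumption); Q6_K uses 256 elements in 210 bytes. */
enum { TOY_BLOCK_ELEMS = 4, TOY_BLOCK_BYTES = 4 };

/* Hypothetical toy dequant: one signed byte per weight, no scales. */
static void toy_dequant_block(const uint8_t *blk, float *scratch) {
    for (int i = 0; i < TOY_BLOCK_ELEMS; ++i) scratch[i] = (float)(int8_t)blk[i];
}

static float dot(const float *a, const float *b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

/* Row-outer: consecutive reads are output_dim * TOY_BLOCK_BYTES apart. */
void matvec_row_outer(const uint8_t *w, const float *x, float *out,
                      size_t n_blocks, size_t output_dim) {
    float scratch[TOY_BLOCK_ELEMS];
    assert(w && x && out);
    for (size_t o = 0; o < output_dim; ++o) {
        float acc = 0.0f;
        for (size_t b = 0; b < n_blocks; ++b) {
            toy_dequant_block(w + (b * output_dim + o) * TOY_BLOCK_BYTES, scratch);
            acc += dot(scratch, x + b * TOY_BLOCK_ELEMS, TOY_BLOCK_ELEMS);
        }
        out[o] = acc;
    }
}

/* Block-outer: the weight pointer advances by one block per inner iteration,
 * so the whole weight buffer is walked front to back. */
void matvec_block_outer(const uint8_t *w, const float *x, float *out,
                        size_t n_blocks, size_t output_dim) {
    float scratch[TOY_BLOCK_ELEMS];
    assert(w && x && out);
    for (size_t o = 0; o < output_dim; ++o) out[o] = 0.0f;
    for (size_t b = 0; b < n_blocks; ++b) {
        const float *xb = x + b * TOY_BLOCK_ELEMS;  /* input slice reused for every row */
        for (size_t o = 0; o < output_dim; ++o) {
            toy_dequant_block(w + (b * output_dim + o) * TOY_BLOCK_BYTES, scratch);
            out[o] += dot(scratch, xb, TOY_BLOCK_ELEMS);
        }
    }
}
```

In both orders each `out[o]` receives its per-block partial sums in ascending block order, which is why the results can match exactly rather than only approximately.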

NOTE: unlike Q4_K (memory-stall-bound, where the reorder gave 2.07×), Q6_K
showed NO speedup on the board (matmul 20133 → 20168 ms, within noise). Q6_K
materializes a full 256-float scratch buffer via a scalar 6-bit unpack
(skainet_q6k_dequant_block) before the dot product, so it is
dequant-COMPUTE-bound, not weight-read-bound, and sequential reads don't help.
The reorder is kept for consistency and because it cannot hurt; the real Q6_K
lever is vectorizing or fusing the 6-bit dequant (NEON unpack or Q8 int-dot),
which is a separate rewrite. Q6_K accounts for ~13% of tensors (10 ffn_down
[5632,2048], 10 attn_v, output [2048,32000]).
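
To illustrate why the scalar unpack is compute-heavy, here is a rough sketch of a Q6_K block dequant. It is not the body of `skainet_q6k_dequant_block`; it assumes the ggml reference layout (128 bytes of low nibbles `ql`, 64 bytes of high 2-bit pairs `qh`, 16 signed sub-block scales `sc`, plus a super-scale), and it takes the super-scale `d` as an already-converted `float` to avoid an fp16 helper. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a scalar Q6_K unpack into a 256-float scratch buffer.
 * Assumed layout: ql[128], qh[64], sc[16], d passed as float. */
void q6k_dequant_block_sketch(const uint8_t *ql, const uint8_t *qh,
                              const int8_t *sc, float d, float *y) {
    assert(ql && qh && sc && y);
    for (int half = 0; half < 2; ++half) {      /* two halves of 128 weights */
        const uint8_t *l4 = ql + 64 * half;     /* low 4 bits, two weights per byte */
        const uint8_t *h2 = qh + 32 * half;     /* high 2 bits, four weights per byte */
        const int8_t  *s  = sc + 8 * half;      /* one scale per 16 weights */
        float *out = y + 128 * half;
        for (int l = 0; l < 32; ++l) {
            int is = l / 16;
            /* Each weight is rebuilt from a nibble and a 2-bit field, then
             * re-centred from [0, 63] to [-32, 31]. */
            int q1 = ((l4[l]      & 0xF) | (((h2[l] >> 0) & 3) << 4)) - 32;
            int q2 = ((l4[l + 32] & 0xF) | (((h2[l] >> 2) & 3) << 4)) - 32;
            int q3 = ((l4[l]      >> 4)  | (((h2[l] >> 4) & 3) << 4)) - 32;
            int q4 = ((l4[l + 32] >> 4)  | (((h2[l] >> 6) & 3) << 4)) - 32;
            out[l]      = d * (float)s[is + 0] * (float)q1;
            out[l + 32] = d * (float)s[is + 2] * (float)q2;
            out[l + 64] = d * (float)s[is + 4] * (float)q3;
            out[l + 96] = d * (float)s[is + 6] * (float)q4;
        }
    }
}
```

Every one of the 256 weights costs a handful of shifts, masks, and integer-to-float conversions before the dot product even starts, which is consistent with the message's point that read order is not the bottleneck here.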

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

1 parent 86365b2 commit 717f362
1 file changed
Lines changed: 19 additions & 10 deletions
Diff hunk: original lines 115-146, new lines 115-155 (line contents not captured).