Commit 86365b2
perf(native q5k,q8_0): block-outer loop order for sequential weight reads
Apply the same cache-locality fix as q4k_matmul (d998feb) to the Q5_K
and Q8_0 kernels: iterate block-OUTER / output-row-INNER so the
block-major weight at byte offset (blockIdx*output_dim + o)*bytes is read
sequentially (stride = one block) instead of striding output_dim*bytes
per step. The strided pattern makes every weight read a cold miss on the
in-order A55.
out_base[o] accumulates across blocks; the per-row accumulation order is
unchanged (blocks are still summed in ascending blockIdx), so results are
numerically identical.
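The loop-order change described above can be sketched as follows. This is a minimal illustration, not the kernel code from this commit: `block_q8_sketch`, `block_dot`, `matmul_row_outer` and `matmul_block_outer` are hypothetical names, and the block uses a plain `float` scale where real Q8_0 stores an fp16 scale. It only assumes the block-major layout stated in the message, with the block for (blockIdx, o) at index `blockIdx*output_dim + o`.

```c
#include <stddef.h>
#include <stdint.h>

#define QK 32 /* values per block (Q8_0 uses 32) */

/* Simplified stand-in for a quantized block (assumption: float scale). */
typedef struct {
    float  d;       /* per-block scale */
    int8_t qs[QK];  /* quantized values */
} block_q8_sketch;

/* Dot product of one weight block with the matching slice of the input. */
static float block_dot(const block_q8_sketch *w, const float *x) {
    float sum = 0.0f;
    for (int i = 0; i < QK; ++i) {
        sum += (float)w->qs[i] * x[i];
    }
    return w->d * sum;
}

/* Old order: output row outer, block inner.
 * Consecutive weight reads are output_dim blocks apart. */
void matmul_row_outer(const block_q8_sketch *w, const float *x, float *out,
                      size_t n_blocks, size_t output_dim) {
    for (size_t o = 0; o < output_dim; ++o) {
        float acc = 0.0f;
        for (size_t b = 0; b < n_blocks; ++b) {
            acc += block_dot(&w[b * output_dim + o], x + b * QK);
        }
        out[o] = acc;
    }
}

/* New order: block outer, output row inner.
 * Consecutive weight reads are adjacent blocks (stride = one block).
 * Each out[o] still receives its blocks in ascending b, so the per-row
 * summation order matches the row-outer version. */
void matmul_block_outer(const block_q8_sketch *w, const float *x, float *out,
                        size_t n_blocks, size_t output_dim) {
    for (size_t o = 0; o < output_dim; ++o) {
        out[o] = 0.0f;
    }
    for (size_t b = 0; b < n_blocks; ++b) {
        const block_q8_sketch *row = &w[b * output_dim]; /* contiguous run */
        const float *xb = x + b * QK;
        for (size_t o = 0; o < output_dim; ++o) {
            out[o] += block_dot(&row[o], xb);
        }
    }
}
```

The trade-off in this sketch is that the accumulator moves from a register into `out[o]`, in exchange for weight reads that walk memory linearly. The output vector is much smaller than the weight matrix, so it is the better candidate to stay cache-resident.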
Both validated on host against the Panama reference
(NativeQ5KMatmulKernelParityTest, NativeQ8_0MatmulKernelParityTest green).
Not exercised by TinyLlama Q4_K_M (Q4_K + Q6_K + F32 only), so no board
delta for that model. This keeps the K-quant kernels consistent and
benefits any model that uses Q5_K/Q8_0 weights.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 453ff40 commit 86365b2
2 files changed: 41 additions & 19 deletions
First file: 21 additions & 9 deletions
Second file: 20 additions & 10 deletions