Skip to content

Commit 717f362

Browse files
michalharakalclaude
andcommitted
perf(native q6k): block-outer loop order (sequential weight reads)
Apply the same cache-locality reorder as q4k/q5k/q8_0 to the Q6_K kernel: iterate block-OUTER / output-row-INNER so the block-major weight (blockIdx*output_dim + o)*210 is read sequentially. out_base[o] accumulates across blocks; numerically identical (NativeQ6KMatmulKernel parity green). NOTE: unlike Q4_K (memory-stall-bound → reorder gave 2.07×), Q6_K showed NO board speedup (matmul 20133 → 20168 ms, within noise). Q6_K materializes a full 256-float scratch via scalar 6-bit unpack (skainet_q6k_dequant_block) before the dot, so it is dequant-COMPUTE-bound, not weight-read-bound — sequential reads don't help. The reorder is kept for consistency and because it cannot hurt; the real Q6_K lever is vectorizing/fusing the 6-bit dequant (NEON unpack or Q8 int-dot), a separate rewrite. Q6_K is ~13% of tensors (10 ffn_down [5632,2048], 10 attn_v, output [2048,32000]). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 86365b2 commit 717f362

1 file changed

Lines changed: 19 additions & 10 deletions

File tree

  • skainet-backends/skainet-backend-native-cpu/native/src

skainet-backends/skainet-backend-native-cpu/native/src/q6k_matmul.c

Lines changed: 19 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -115,32 +115,41 @@ SKAINET_API void skainet_q6k_matmul(
115115

116116
float scratch[Q6K_BLOCK_SIZE];
117117

118-
for (int32_t o = 0; o < output_dim; ++o) {
119-
float acc = 0.0f;
118+
/*
119+
* Loop order: block OUTER, output row INNER — see q4k_matmul.c for the
120+
* rationale. The weight is block-major (blockIdx*output_dim + o)*210, so for
121+
* a fixed block consecutive `o` are 210 bytes apart: the weight bytes are
122+
* read sequentially (cache/prefetch friendly) instead of striding
123+
* output_dim*210 per step, which on the in-order A55 makes every read a cold
124+
* miss. The big Q6_K `output` projection (hidden→vocab, hit every token) is
125+
* the main beneficiary. out_base[o] accumulates across blocks; the order
126+
* over blocks is unchanged ⇒ numerically identical to the o-outer form.
127+
*/
128+
for (int32_t o = 0; o < output_dim; ++o) out_base[o] = 0.0f;
120129

121-
for (int32_t block_idx = 0; block_idx < blocks_per_input_dim; ++block_idx) {
122-
const uint8_t* block = weight + weight_byte_offset
123-
+ (size_t)(block_idx * output_dim + o) * Q6K_BYTES_PER_BLOCK;
130+
for (int32_t block_idx = 0; block_idx < blocks_per_input_dim; ++block_idx) {
131+
const float* in_block = in_base + (size_t) block_idx * Q6K_BLOCK_SIZE;
132+
const uint8_t* block = weight + weight_byte_offset
133+
+ (size_t)(block_idx * output_dim) * Q6K_BYTES_PER_BLOCK;
124134

135+
for (int32_t o = 0; o < output_dim; ++o, block += Q6K_BYTES_PER_BLOCK) {
125136
skainet_q6k_dequant_block(block, scratch);
126137

127-
const float* in_block = in_base + (size_t) block_idx * Q6K_BLOCK_SIZE;
128-
138+
float acc = 0.0f;
129139
#ifdef SKAINET_HAVE_NEON
130140
float32x4_t vacc = vdupq_n_f32(0.0f);
131141
for (int i = 0; i < Q6K_BLOCK_SIZE; i += 4) {
132142
const float32x4_t vi = vld1q_f32(in_block + i);
133143
const float32x4_t vw = vld1q_f32(scratch + i);
134144
vacc = vfmaq_f32(vacc, vi, vw);
135145
}
136-
acc += skainet_neon_hadd_f32(vacc);
146+
acc = skainet_neon_hadd_f32(vacc);
137147
#else
138148
for (int i = 0; i < Q6K_BLOCK_SIZE; ++i) {
139149
acc += in_block[i] * scratch[i];
140150
}
141151
#endif
152+
out_base[o] += acc;
142153
}
143-
144-
out_base[o] = acc;
145154
}
146155
}

0 commit comments

Comments
 (0)