Commit 18e4404

ggml-ve : Q4_K N>1 — HBM-aware cache upload (no more SEGV)
ROOT CAUSE of the N>1 SEGV:
- With GGML_VE_Q4K_N_GT_1=1 the scheduler decides Q4_K weights are best placed on VE_HBM (because our backend now consumes them in batched form too), so w->data becomes an HBM pointer.
- The host-side canonical-pack in get_or_upload_q4k_canon reads src_blocks via memcpy. That SEGVs on an HBM pointer.
- Earlier sessions only ever saw CPU_Mapped Q4_K weights (N=1 path), so the cache had never been exposed to HBM-resident inputs.

Fix: before calling the cache, check ggml_backend_buffer_is_host on the weight buffer. If non-host (= HBM), download the weight to a host bounce buffer with vedaMemcpyDtoH first, then pass that to the cache. One-time cost per unique weight at first lookup; subsequent calls hit the cache.

Confirmed working with GGML_VE_Q4K_N_GT_1=1 on MiniCPM5-1B-Q4_K_M:
- prompt eval (N=5): prints "Q4K-DEVICE-BOUNCE" once per weight, then no crashes; correct output
- decode (N=1): hits the now-populated canon cache, no bounce, runs the existing 8-row tile kernel

Perf right now (1B, full N>1 path enabled):

| path                          | pp t/s   | tg t/s |
| ----------------------------- | -------- | ------ |
| default (N>1 → CPU)           | 21.9     | 14.4   |
| GGML_VE_Q4K_N_GT_1=1 (this)   | 7.8      | 8.5    |

VE path slower than CPU because each N>1 op is LOOPED as N sequential matvecs (1.3 ms × N), where CPU's AVX2 Q4_K does a true batched SGEMM. So this commit gets us to "N>1 works on VE" but not "N>1 is fast on VE".

Real win comes from the next commit: tile-batched matmul. Dequant a row tile of weights to F32 once, then run N matvecs against that cached tile (or one cblas_sgemm). Amortizes dequant across N x-columns.

For now N>1 stays GATED behind GGML_VE_Q4K_N_GT_1=1 because the sequential path is slower than CPU. Will flip default when batched matmul lands.
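To illustrate why the looped N>1 path loses to a batched one, here is a minimal sketch comparing the two strategies the message describes. It is not the planned VE kernel. It assumes a toy quantized format (one float scale per row plus 8-bit codes), not the real Q4_K layout, and every name in it (ToyQuantMatrix, matmul_looped, matmul_tiled, the rows_dequantized counter) is hypothetical. Activations use the layout mentioned in the diff's trailing comment: x advances by K per column and y by M.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy quantized matrix, NOT the real Q4_K layout: one float scale per row
// and one signed 8-bit code per weight, so w[m][k] = scale[m] * code[m*K + k].
struct ToyQuantMatrix {
    std::size_t M = 0;               // rows (output features)
    std::size_t K = 0;               // columns (input features)
    std::vector<float>       scale;  // M entries
    std::vector<std::int8_t> code;   // M * K entries, row-major
};

// Dequantizes row m into out[0..K) and bumps a counter so the two
// strategies can be compared by how much dequant work they perform.
static void dequant_row(const ToyQuantMatrix & w, std::size_t m,
                        float * out, std::size_t & rows_dequantized) {
    for (std::size_t k = 0; k < w.K; ++k) {
        out[k] = w.scale[m] * static_cast<float>(w.code[m * w.K + k]);
    }
    ++rows_dequantized;
}

static float dot(const float * a, const float * b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

// Looped strategy: N independent matvecs, each dequantizing every row again.
std::vector<float> matmul_looped(const ToyQuantMatrix & w,
                                 const std::vector<float> & x, std::size_t N,
                                 std::size_t & rows_dequantized) {
    assert(x.size() == N * w.K);
    std::vector<float> y(N * w.M), row(w.K);
    for (std::size_t n = 0; n < N; ++n) {
        for (std::size_t m = 0; m < w.M; ++m) {
            dequant_row(w, m, row.data(), rows_dequantized);
            y[n * w.M + m] = dot(row.data(), x.data() + n * w.K, w.K);
        }
    }
    return y;
}

// Tile-batched strategy: dequantize a tile of rows once, then reuse the
// F32 tile for all N activation columns before moving to the next tile.
std::vector<float> matmul_tiled(const ToyQuantMatrix & w,
                                const std::vector<float> & x, std::size_t N,
                                std::size_t tile_rows,
                                std::size_t & rows_dequantized) {
    assert(tile_rows > 0 && x.size() == N * w.K);
    std::vector<float> y(N * w.M), tile(tile_rows * w.K);
    for (std::size_t m0 = 0; m0 < w.M; m0 += tile_rows) {
        const std::size_t rows = std::min(tile_rows, w.M - m0);
        for (std::size_t r = 0; r < rows; ++r) {
            dequant_row(w, m0 + r, tile.data() + r * w.K, rows_dequantized);
        }
        for (std::size_t n = 0; n < N; ++n) {
            for (std::size_t r = 0; r < rows; ++r) {
                y[n * w.M + m0 + r] =
                    dot(tile.data() + r * w.K, x.data() + n * w.K, w.K);
            }
        }
    }
    return y;
}
```

Both functions return the same y, but the looped version dequantizes N × M rows while the tiled version dequantizes only M. The dequant cost is therefore paid once per weight row instead of once per row per column, which is the amortization the message refers to.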
1 parent bc64991 commit 18e4404

1 file changed

Lines changed: 33 additions & 1 deletion

ggml/src/ggml-ve/ops/mul_mat_q.cpp
@@ -14,6 +14,8 @@
 
 #include "ggml.h"
 
+#include <memory>
+
 namespace ggml_ve {
 namespace ops {
 
@@ -219,8 +221,38 @@ bool mul_mat_q(backend_context * ctx, ggml_tensor * dst) {
         fprintf(stderr, "[Q4K-DEBUG] name=%s M=%ld K=%ld N=%ld\n",
                 name ? name : "?", (long)M, (long)K, (long)N);
     }
+    // The host-side canonical-pack reads from src_blocks via memcpy. If
+    // the weight buffer is device-resident (VE_HBM), w->data is an HBM
+    // pointer and host memcpy SEGVs. Download to a host bounce buffer
+    // first.
+    //
+    // This happens with GGML_VE_Q4K_N_GT_1: once our backend accepts
+    // Q4_K MUL_MATs in batched form, the scheduler decides Q4_K weights
+    // are best placed on VE_HBM (the buffer they're consumed on). The
+    // cache then needs to read them BACK to host to do the canonical
+    // pack + pre-decode. One-time cost per weight at first lookup.
+    const void * src_for_cache = w->data;
+    std::unique_ptr<uint8_t[]> bounce;
+    const int64_t weight_bytes = (int64_t) M * (K / 256) * 144;
+    if (w->buffer && !ggml_backend_buffer_is_host(w->buffer)) {
+        bounce.reset(new uint8_t[weight_bytes]);
+        if (vedaMemcpyDtoH(bounce.get(),
+                           (VEDAdeviceptr)(uintptr_t) w->data,
+                           weight_bytes) != VEDA_SUCCESS) {
+            if (std::getenv("GGML_VE_Q4K_DEBUG")) {
+                fprintf(stderr, "[Q4K-FAIL] DtoH bounce for %s failed\n",
+                        name ? name : "?");
+            }
+            return false;
+        }
+        src_for_cache = bounce.get();
+        if (std::getenv("GGML_VE_Q4K_DEBUG")) {
+            fprintf(stderr, "[Q4K-DEVICE-BOUNCE] %s: %ld bytes downloaded for canon pack\n",
+                    name ? name : "?", (long) weight_bytes);
+        }
+    }
     if (!ctx->cache().get_or_upload_q4k_canon(
-            name, w->data, (uint64_t) M, (uint64_t) K, &qs_v, &hdr_v)) {
+            name, src_for_cache, (uint64_t) M, (uint64_t) K, &qs_v, &hdr_v)) {
         return false;
     }
     /* N>1 (batched matmul): loop the matvec, advancing y by M and x by K
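The constants in weight_bytes come from the Q4_K super-block geometry in ggml: 256 weights packed into 144 bytes. As an illustration, a 2048 × 2048 weight (dimensions chosen only for the example) occupies 2048 × 8 × 144 = 2,359,296 bytes. The sketch below isolates the size calculation and the bounce pattern from the hunk so it can run without VEDA. FakeBuffer, DtoHCopy, host_view and fake_dtoh are hypothetical stand-ins, not ggml or VEDA API. In the commit the residency check is ggml_backend_buffer_is_host and the copy is vedaMemcpyDtoH.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>

// Q4_K super-block geometry: 256 weights are stored in 144 bytes.
constexpr std::int64_t kQ4KBlockWeights = 256;
constexpr std::int64_t kQ4KBlockBytes   = 144;

// Byte size of an M x K Q4_K weight; K must be a multiple of 256.
std::int64_t q4k_weight_bytes(std::int64_t M, std::int64_t K) {
    assert(M >= 0 && K >= 0 && K % kQ4KBlockWeights == 0);
    return M * (K / kQ4KBlockWeights) * kQ4KBlockBytes;
}

// Hypothetical stand-in for a tensor's storage: a residency flag plus a
// pointer that is only safe to dereference on the host when is_host is true.
struct FakeBuffer {
    bool                 is_host;
    const std::uint8_t * data;
};

// Hypothetical device-to-host copy signature; returns true on success.
using DtoHCopy = bool (*)(void * dst, const void * src, std::size_t n);

// Simulated device copy for this sketch. A real backend would call its
// device runtime here instead of memcpy.
bool fake_dtoh(void * dst, const void * src, std::size_t n) {
    std::memcpy(dst, src, n);
    return true;
}

// Returns a pointer the host may read. Host-resident data is returned as-is;
// device-resident data is first copied into `bounce`, which then owns the
// temporary. Returns nullptr if the copy fails.
const void * host_view(const FakeBuffer & buf, std::int64_t bytes,
                       DtoHCopy copy,
                       std::unique_ptr<std::uint8_t[]> & bounce) {
    if (buf.is_host) {
        return buf.data;
    }
    bounce.reset(new std::uint8_t[static_cast<std::size_t>(bytes)]);
    if (!copy(bounce.get(), buf.data, static_cast<std::size_t>(bytes))) {
        bounce.reset();
        return nullptr;
    }
    return bounce.get();
}
```

Keeping the bounce buffer in a unique_ptr owned by the caller, as the hunk does, means the temporary is released on every exit path, including the early return when the copy fails.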
