Commit b724705

ggml-ve : Q4_K auto-route canon vs direct by free-HBM size gate
Makes the direct-std + packed kernel the automatic default for big models
while keeping canon (faster per-matvec) for models that fit. No env flags
needed.

q4k_route_is_direct() samples free HBM once (after weights are resident),
memoizes the decision:

  free <  total/2 -> DIRECT (big model; canon's 2x footprint OOMs)
  free >= total/2 -> canon  (small model; canon is ~2x faster)

Measured with NO flags:

  1B  Q4_K_M: free 42.8 GB -> canon,  16.9 pp / 11.7 tg
  27B Q4_K_M: free 15.8 GB -> direct,  3.2 pp /  0.5 tg
              (was: canon N=1 0.27 tg, and N>1 OOM crash)

Routing also gates N>1 acceptance: direct models accept N>1 (prompt-eval on
VE via per-column matvec loop); canon models stay N=1 (N>1 prompt-eval falls
to CPU, avoiding canon's HBM doubling).

Escape hatches:

  GGML_VE_Q4K_FORCE_DIRECT=1 / GGML_VE_Q4K_DIRECT=1 -> always direct
  GGML_VE_Q4K_FORCE_CANON=1                         -> always canon
  GGML_VE_Q4K_DIRECT_FREE_MB=<n>                    -> custom threshold
  GGML_VE_NO_Q4K=1                                  -> disable VE Q4_K entirely

Note: decision is process-global (first model wins). For model-swap servers,
use the force flags. The very first Q4_K op of the first graph may route to
CPU before the sample memoizes -- negligible.
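The size gate described above can be sketched as a pure function, which makes the two measured cases easy to check by hand. This is only an illustration: `Q4KRoute`, `gib`, and `pick_q4k_route` are hypothetical names that do not appear in the commit, and treating the card as exactly 48 GiB is an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical names for illustration; the commit itself returns a bool.
enum class Q4KRoute { Canon, Direct };

// Helper: GiB (possibly fractional) -> bytes.
inline std::uint64_t gib(double x) {
    return static_cast<std::uint64_t>(x * 1024.0 * 1024.0 * 1024.0);
}

// thresh_override_mb < 0 means "no override": cut at total/2, as in the
// commit message. A non-negative value plays the role of
// GGML_VE_Q4K_DIRECT_FREE_MB (interpreted as MiB).
inline Q4KRoute pick_q4k_route(std::uint64_t free_bytes,
                               std::uint64_t total_bytes,
                               std::int64_t  thresh_override_mb = -1) {
    const std::uint64_t thresh = thresh_override_mb >= 0
        ? static_cast<std::uint64_t>(thresh_override_mb) * 1024ull * 1024ull
        : total_bytes / 2;
    // Little free HBM => big model => direct; otherwise canon.
    return free_bytes < thresh ? Q4KRoute::Direct : Q4KRoute::Canon;
}
```

With a 48 GiB total, `pick_q4k_route(gib(42.8), gib(48))` lands on canon and `pick_q4k_route(gib(15.8), gib(48))` on direct, matching the 1B and 27B rows above. Passing a third argument moves the cut, which is what the custom-threshold flag is for.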
1 parent 47337a9 commit b724705

1 file changed

Lines changed: 64 additions & 12 deletions

ggml/src/ggml-ve/ops/mul_mat_q.cpp

@@ -22,6 +22,57 @@ namespace ops {
 
 namespace {
 
+/* ---- Q4_K routing: canon (default) vs direct-std (big models) ----
+ *
+ * canon path  : pre-reordered nibbles + 64 B/blk decoded headers cached in
+ *               HBM. Fastest per-matvec, but DOUBLES the Q4_K HBM footprint
+ *               (raw + canon). Great for models that fit; OOMs ~27B-class.
+ * direct path : reads the standard 144 B/blk layout in place, packed pvfmad
+ *               VL=256 kernel. ~2x SLOWER per-matvec than canon on small
+ *               models, but no HBM doubling -> the only thing that fits big
+ *               models, and 2.2x faster than canon-N=1 there.
+ *
+ * Decision is made ONCE (memoized) by sampling free HBM after weights are
+ * resident: a big model leaves little free, so route it to direct; a small
+ * model leaves lots free, so route it to canon. Cut at total/2 by default
+ * (27B leaves ~22 GB free of 48; 1B leaves ~43 GB -- wide margin).
+ *
+ * Escape hatches:
+ *   GGML_VE_Q4K_FORCE_DIRECT=1 / GGML_VE_Q4K_DIRECT=1 -> always direct
+ *   GGML_VE_Q4K_FORCE_CANON=1                         -> always canon
+ *   GGML_VE_Q4K_DIRECT_FREE_MB=<n>                    -> custom threshold
+ *
+ * Returns true => use direct-std path (and accept N>1). */
+bool q4k_route_is_direct() {
+    static std::atomic<int> route{-1}; // -1 undecided, 0 canon, 1 direct
+    int d = route.load(std::memory_order_relaxed);
+    if (d >= 0) return d == 1;
+
+    if (std::getenv("GGML_VE_Q4K_FORCE_CANON")) { route.store(0); return false; }
+    if (std::getenv("GGML_VE_Q4K_FORCE_DIRECT") ||
+        std::getenv("GGML_VE_Q4K_DIRECT")) { route.store(1); return true; }
+
+    size_t mem_free = 0, mem_total = 0;
+    if (vedaMemGetInfo(&mem_free, &mem_total) != VEDA_SUCCESS || mem_total == 0) {
+        // No live context yet (e.g. called from schedule before any compute).
+        // Don't memoize -- stay on the safe canon default and let a later
+        // call (with a live context) make the real decision.
+        return false;
+    }
+    const char * th = std::getenv("GGML_VE_Q4K_DIRECT_FREE_MB");
+    const size_t thresh = th
+        ? (size_t) std::strtoull(th, nullptr, 10) * 1024ull * 1024ull
+        : mem_total / 2;
+    const int dec = (mem_free < thresh) ? 1 : 0;
+    route.store(dec);
+    if (std::getenv("GGML_VE_Q4K_DEBUG")) {
+        fprintf(stderr, "[Q4K-ROUTE] free=%zu MB total=%zu MB thresh=%zu MB -> %s\n",
+                mem_free / (1024 * 1024), mem_total / (1024 * 1024),
+                thresh / (1024 * 1024), dec ? "DIRECT" : "canon");
+    }
+    return dec == 1;
+}
+
 bool is_supported_quant_type(ggml_type t) {
     // Q8_0: all-HBM fused kernel.
     // Q4_K: canonical-nibble + qs/hdr-split HBM kernel. Microbench: 94x
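The memoization in this hunk has one subtle property: a failed memory probe returns the canon default without storing it, so a later call can still make the real decision, while a successful probe is sticky ("first model wins"). A minimal sketch of that pattern follows. It swaps the function-local static and the `vedaMemGetInfo` call for a small class and an injected callback so the behaviour can be exercised without a device; `RouteMemo` and `MemProbe` are hypothetical names, not part of the commit.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>

// Hypothetical probe: fills free/total bytes and returns false when no
// device context is live (standing in for a failed memory query).
using MemProbe = std::function<bool(std::size_t & free_b, std::size_t & total_b)>;

class RouteMemo {
public:
    // Returns true for "direct". The first successful probe fixes the answer.
    bool is_direct(const MemProbe & probe) {
        int d = route_.load(std::memory_order_relaxed);
        if (d >= 0) return d == 1;              // already decided

        std::size_t free_b = 0, total_b = 0;
        if (!probe(free_b, total_b) || total_b == 0) {
            return false;                       // canon default, NOT memoized
        }
        d = (free_b < total_b / 2) ? 1 : 0;     // default cut at total/2
        route_.store(d, std::memory_order_relaxed);
        return d == 1;
    }

    bool decided() const { return route_.load(std::memory_order_relaxed) >= 0; }

private:
    std::atomic<int> route_{-1};                // -1 undecided, 0 canon, 1 direct
};
```

Two threads racing through the undecided branch may both probe and both store; that is harmless only as long as they sample roughly the same free-memory figure, which is the situation the commit message describes (weights already resident).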
@@ -134,12 +185,14 @@ bool mul_mat_q_supports(const ggml_tensor * op) {
         return false;
     };
 
-    /* N>1 currently CRASHES the kernel chain (likely cache reentry or
-     * VEDA arg lifetime issue across queued launches). Reverting to
-     * N=1 only until investigated. The dispatch code that loops N times
-     * is left in place for when the underlying issue is fixed.
-     * Opt-in for testing: GGML_VE_Q4K_N_GT_1=1 to try N>1 path. */
-    if (w->type == GGML_TYPE_Q4_K && std::getenv("GGML_VE_Q4K_N_GT_1") != nullptr) {
+    /* N>1 acceptance for Q4_K is tied to routing: the direct-std path
+     * handles N>1 (loops matvec per column) without the canon HBM
+     * doubling that OOMs big models. When routed to canon we keep N=1
+     * only (prompt-eval N>1 falls to CPU -- correct, and canon's 2x HBM
+     * blowup at N>1 is exactly what we avoid). GGML_VE_Q4K_N_GT_1=1
+     * forces N>1 acceptance regardless (legacy testing hook). */
+    if (w->type == GGML_TYPE_Q4_K &&
+        (q4k_route_is_direct() || std::getenv("GGML_VE_Q4K_N_GT_1") != nullptr)) {
         if (N < 1) return reject("N<1");
     } else {
         if (N != 1) return reject("N!=1");
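For Q4_K weights, the acceptance rule in this hunk reduces to a small predicate over three inputs. The sketch below restates it in isolation so the cases can be enumerated; `q4k_accepts_n` is a hypothetical helper, and the two booleans stand in for `q4k_route_is_direct()` and the presence of `GGML_VE_Q4K_N_GT_1`.

```cpp
#include <cstdint>

// Hypothetical pure restatement of the Q4_K branch of the N check.
//   route_direct  : what q4k_route_is_direct() returned
//   legacy_n_gt_1 : whether GGML_VE_Q4K_N_GT_1 is set in the environment
inline bool q4k_accepts_n(std::int64_t n, bool route_direct, bool legacy_n_gt_1) {
    if (route_direct || legacy_n_gt_1) {
        return n >= 1;   // direct path (or forced): any positive batch width
    }
    return n == 1;       // canon path: matvec only; N>1 is left to the CPU
}
```

So a prompt-eval op with N=8 is taken by the VE backend on a direct-routed model and rejected on a canon-routed one, unless the legacy hook is set.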
@@ -215,12 +268,11 @@ bool mul_mat_q(backend_context * ctx, ggml_tensor * dst) {
     const char * name = (w->name && w->name[0]) ? w->name : nullptr;
     const int64_t N = dst->ne[1];
 
-    /* GGML_VE_Q4K_DIRECT=1: direct-dispatch on the standard 144-B/blk
-     * layout. NO canon cache, NO 192/144 storage blow-up, NO host
-     * bounce. Slower per-call than the canon path (low VL=8 vs 256)
-     * but it's the only way 27B Q4_K_M fits on a single 48 GB VE when
-     * we also need raw weights on VE_HBM. */
-    if (std::getenv("GGML_VE_Q4K_DIRECT") != nullptr) {
+    /* Direct-dispatch on the standard 144-B/blk layout (chunked+packed
+     * VL=256). NO canon cache, NO 192/144 storage blow-up. Routed here
+     * automatically for big models (q4k_route_is_direct() samples free
+     * HBM); small models fall through to canon below. */
+    if (q4k_route_is_direct()) {
         const VEDAdeviceptr w_raw_hbm = ctx->resolve_in(w);
         if (w_raw_hbm == 0) return false;
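The commit message says the direct path serves N>1 by looping the matvec kernel once per column. The shape of that loop can be sketched with plain floats as below. This is not the VE kernel: `matvec` is an unquantized stand-in, and the buffer layouts (row-major `W`, each activation column stored contiguously) are assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the device matvec: y[M] = W[M x K] * x[K], W row-major.
inline void matvec(const float * W, const float * x, float * y,
                   std::size_t M, std::size_t K) {
    for (std::size_t m = 0; m < M; ++m) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < K; ++k) acc += W[m * K + k] * x[k];
        y[m] = acc;
    }
}

// N>1 handled as N independent matvecs. Column n of X starts at X + n*K and
// column n of the result at Y + n*M (assumed layout). The weights are read
// in place for every column; no second copy of W is ever materialized.
inline std::vector<float> mul_mat_by_columns(const std::vector<float> & W,
                                             const std::vector<float> & X,
                                             std::size_t M, std::size_t K,
                                             std::size_t N) {
    std::vector<float> Y(M * N, 0.0f);
    for (std::size_t n = 0; n < N; ++n) {
        matvec(W.data(), X.data() + n * K, Y.data() + n * M, M, K);
    }
    return Y;
}
```

The cost is N passes over the weights instead of one batched pass, which is the trade the commit accepts in exchange for keeping prompt-eval on the device without the extra HBM footprint.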
