ggml-ve : Q4_K auto-route canon vs direct by free-HBM size gate
Makes the direct-std + packed kernel the automatic default for big
models while keeping canon (faster per-matvec) for models that fit.
No env flags needed.
q4k_route_is_direct() samples free HBM once (after weights are
resident) and memoizes the decision:
free < total/2 -> DIRECT (big model; canon's 2x footprint OOMs)
free >= total/2 -> canon (small model; canon is ~2x faster)
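
A minimal sketch of the size gate above, for illustration only. The names
q4k_pick_direct and q4k_route_memo are hypothetical, and the free/total
figures are passed in as arguments here, whereas the real
q4k_route_is_direct() samples the device itself. The sketch is not
thread-safe.

```c
#include <stdbool.h>
#include <stddef.h>

/* Pure rule from the message: free < total/2 -> direct, else canon. */
static bool q4k_pick_direct(size_t free_bytes, size_t total_bytes) {
    return free_bytes < total_bytes / 2;
}

/* Memoized wrapper (hypothetical): the first call decides, and every
 * later call returns that same answer regardless of its arguments. */
static bool q4k_route_memo(size_t free_bytes, size_t total_bytes) {
    static int cached = -1; /* -1 = not decided yet, 0 = canon, 1 = direct */
    if (cached < 0) {
        cached = q4k_pick_direct(free_bytes, total_bytes) ? 1 : 0;
    }
    return cached == 1;
}
```

Because the result is cached on first use, a second model loaded into the
same process inherits the first model's route.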
Measured with NO flags:
1B Q4_K_M: free 42.8 GB -> canon, 16.9 pp / 11.7 tg
27B Q4_K_M: free 15.8 GB -> direct, 3.2 pp / 0.5 tg
(was: canon N=1 0.27 tg, and N>1 OOM crash)
Routing also gates N>1 acceptance: direct models accept N>1
(prompt-eval on VE via per-column matvec loop); canon models stay
N=1 (N>1 prompt-eval falls to CPU, avoiding canon's HBM doubling).
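
A rough sketch of the N>1 gating and the per-column loop described above,
under simplifying assumptions: weights are plain row-major floats rather
than Q4_K blocks, and all function names are hypothetical rather than taken
from the backend.

```c
#include <stdbool.h>
#include <stddef.h>

/* Acceptance rule from the message: N=1 is always taken on VE;
 * N>1 is taken only when the route is direct. */
static bool q4k_ve_accepts(bool route_is_direct, size_t n_cols) {
    return n_cols == 1 || route_is_direct;
}

/* y = W * x, with W stored row-major as rows x cols floats.
 * (Stand-in for the real kernel, which reads quantized weights.) */
static void matvec_f32(const float *w, const float *x, float *y,
                       size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; ++c) {
            acc += w[r * cols + c] * x[c];
        }
        y[r] = acc;
    }
}

/* N>1 handled as N independent matvecs, one per input column.
 * Input column j starts at x + j*cols; output column j at y + j*rows. */
static void matmul_by_columns(const float *w, const float *x, float *y,
                              size_t rows, size_t cols, size_t n_cols) {
    for (size_t j = 0; j < n_cols; ++j) {
        matvec_f32(w, x + j * cols, y + j * rows, rows, cols);
    }
}
```

The loop reuses the same weights for every column, so it needs no second
copy of them in HBM.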
Escape hatches:
GGML_VE_Q4K_FORCE_DIRECT=1 / GGML_VE_Q4K_DIRECT=1 -> always direct
GGML_VE_Q4K_FORCE_CANON=1 -> always canon
GGML_VE_Q4K_DIRECT_FREE_MB=<n> -> custom threshold
GGML_VE_NO_Q4K=1 -> disable VE Q4_K entirely
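
One way the flags above could be resolved, as a sketch. The precedence
order (disable, then force-direct, then force-canon, then threshold) is an
assumption, as is treating GGML_VE_Q4K_DIRECT_FREE_MB as a replacement for
the total/2 cutoff in MB. The enum and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef enum { Q4K_ROUTE_CPU, Q4K_ROUTE_CANON, Q4K_ROUTE_DIRECT } q4k_route;

/* True when the variable is set to exactly "1". */
static bool env_flag(const char *name) {
    const char *v = getenv(name);
    return v != NULL && strcmp(v, "1") == 0;
}

/* Resolve the route from env overrides, falling back to the size gate.
 * Assumed precedence: NO_Q4K > force direct > force canon > threshold. */
static q4k_route q4k_resolve_route(size_t free_mb, size_t total_mb) {
    if (env_flag("GGML_VE_NO_Q4K")) {
        return Q4K_ROUTE_CPU;
    }
    if (env_flag("GGML_VE_Q4K_FORCE_DIRECT") || env_flag("GGML_VE_Q4K_DIRECT")) {
        return Q4K_ROUTE_DIRECT;
    }
    if (env_flag("GGML_VE_Q4K_FORCE_CANON")) {
        return Q4K_ROUTE_CANON;
    }
    size_t threshold_mb = total_mb / 2; /* default gate */
    const char *custom = getenv("GGML_VE_Q4K_DIRECT_FREE_MB");
    if (custom != NULL && *custom != '\0') {
        /* Assumed meaning: go direct when free HBM is below <n> MB. */
        threshold_mb = (size_t)strtoull(custom, NULL, 10);
    }
    return free_mb < threshold_mb ? Q4K_ROUTE_DIRECT : Q4K_ROUTE_CANON;
}
```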
Note: decision is process-global (first model wins). For model-swap
servers, use the force flags. The very first Q4_K op of the first
graph may route to CPU before the sample memoizes -- negligible.