ggml-ve: run F16 weights as BF16 + compile the VEBP/Q4K N>1 prompt graph (ggml-org#72)
VEBP keeps token_embd as F16 (tied, so it's both the embedding and the lm_head).
The VE has no F16 path, so GET_ROWS + that MUL_MAT were refused and the graph
fragmented. Two changes make VEBP fully self-contained and let its prompt graph
compile:
1. F16 -> BF16 on HBM upload (same 2-byte size, strides unchanged), served by
the existing BF16 GET_ROWS / matvec / colmajor paths. The get_rows/mul_mat
supports checks and dispatch now accept F16; the graph compiler maps src_type
F16->BF16 and uploads the converted copy. Conversion uses the F16C row helpers
in 1M chunks (a per-element call loop took ~10 s for the 621M-elem token_embd,
charged to the first prompt eval).
2. Removed a stale guard that refused Q4_K/VEBP MUL_MAT when ne[1] != 1. It
dated from when N>1 was refused entirely; now the codegen loops the matvec
_inner over the n_tok columns, so it handles N>1. The guard was rejecting
EVERY VEBP prompt graph (Qcur is VEBP, ne[1]=N) -> the prompt silently ran
on the interpreter. (A new first-execute verbose log made this visible.)
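
For illustration, the chunked F16 -> BF16 conversion in item 1 could look roughly
like the sketch below. It is not the code from this commit: the helper names
(f16_to_f32, f32_to_bf16, f16_to_bf16_chunked) are hypothetical, the scalar bit
manipulation stands in for the F16C row helpers, and round-to-nearest-even is an
assumed rounding mode. It only shows the shape of the loop: widen one chunk into
an F32 scratch buffer, then narrow it to BF16 at the same 2-byte stride.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode one IEEE 754 binary16 value to float. */
static float f16_to_f32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0x1Fu) {                 /* inf or NaN */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {              /* normal: rebias 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0) {              /* signed zero */
        bits = sign;
    } else {                            /* subnormal: renormalize */
        uint32_t shift = 0;
        while ((man & 0x400u) == 0) { man <<= 1; shift++; }
        man &= 0x3FFu;
        bits = sign | ((113u - shift) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical helper: narrow float to BF16, round-to-nearest-even (assumed). */
static uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {   /* NaN: keep it a quiet NaN */
        return (uint16_t)((bits >> 16) | 0x0040u);
    }
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)(bits >> 16);
}

/* Convert n F16 values to BF16, `chunk` elements at a time.
 * `scratch` must hold at least `chunk` floats. */
static void f16_to_bf16_chunked(const uint16_t *src, uint16_t *dst, size_t n,
                                float *scratch, size_t chunk) {
    assert(chunk > 0);
    for (size_t base = 0; base < n; base += chunk) {
        size_t len = (n - base < chunk) ? (n - base) : chunk;
        for (size_t i = 0; i < len; i++) scratch[i] = f16_to_f32(src[base + i]);
        for (size_t i = 0; i < len; i++) dst[base + i] = f32_to_bf16(scratch[i]);
    }
}
```

Working a chunk at a time keeps the F32 scratch buffer small and bounded and
replaces one call per element with two tight loops per chunk. Note that BF16 has
7 explicit mantissa bits against F16's 10, so the narrowing step rounds.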
Also fixed N>1 codegen vectorization issues found while profiling (the .L showed
it was NOT a table overflow but unvectorized loops): the MUL `src1[e % period]`
modulo forced scalar code -> restored the nested broadcast form; added `restrict`
to the element-wise / RMS_NORM / GLU pointers (void*-cast aliasing).
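
As a rough illustration of the two MUL forms, the sketch below contrasts a flat
loop that indexes the broadcast operand with a modulo against a nested loop over
rows of `period` contiguous elements with `restrict`-qualified pointers. The
function names and the float element type are assumptions made for the example,
not the generated code; whether a given compiler vectorizes either form depends
on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* Flat form: one loop, broadcast operand indexed with a per-element modulo. */
static void mul_bcast_modulo(float *dst, const float *src0, const float *src1,
                             size_t n, size_t period) {
    assert(period > 0);
    for (size_t e = 0; e < n; e++) {
        dst[e] = src0[e] * src1[e % period];
    }
}

/* Nested form: the inner loop is a plain unit-stride multiply over one row,
 * and restrict tells the compiler the three buffers do not overlap. */
static void mul_bcast_nested(float *restrict dst, const float *restrict src0,
                             const float *restrict src1,
                             size_t rows, size_t period) {
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < period; c++) {
            dst[r * period + c] = src0[r * period + c] * src1[c];
        }
    }
}
```

Both functions compute the same result when n == rows * period; the nested form
simply removes the modulo from the innermost index expression.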
Measured (GGML_VE_HBM=1, -fa on, -ctk/-ctv bf16, warm, run twice):
Llama-3.2-3B prompt 57 -> 65, decode 48 -> 56 (vectorization fixes)
Bonsai-VEBP prompt 9.8 -> 12.65 (1.29x, V.OP 12% -> 92%), decode 33.5 -> 38.7
All outputs token-for-token identical to the interpreter.
VEBP's prompt gain is smaller than Llama's 16x because its per-token ternary
matvec is compute-bound (V.OP already 92%), so the fork/join fusion saves a
smaller fraction. Further gains need a batched ternary matmul (read the weight
once across N columns), which is left as a follow-up.
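
To make the difference in loop order concrete, here is a minimal sketch using
plain float weights rather than the ternary VEBP format. The function names and
the memory layout (weight rows of k contiguous elements, activation and output
columns contiguous per token) are assumptions for the example only. The first
function repeats a matvec for each of the n_tok columns, so the whole weight
matrix is walked once per token; the second visits each weight row once and
reuses it for every column.

```c
#include <stddef.h>

/* Per-column matvec loop: the weight matrix is traversed n_tok times. */
static void matvec_per_column(float *restrict y, const float *restrict w,
                              const float *restrict x,
                              size_t rows, size_t k, size_t n_tok) {
    for (size_t t = 0; t < n_tok; t++) {
        for (size_t r = 0; r < rows; r++) {
            float acc = 0.0f;
            for (size_t i = 0; i < k; i++) acc += w[r * k + i] * x[t * k + i];
            y[t * rows + r] = acc;
        }
    }
}

/* Batched form: each weight row is visited once and reused for all columns. */
static void matmul_batched(float *restrict y, const float *restrict w,
                           const float *restrict x,
                           size_t rows, size_t k, size_t n_tok) {
    for (size_t r = 0; r < rows; r++) {
        const float *wr = w + r * k;   /* this row stays hot across all t */
        for (size_t t = 0; t < n_tok; t++) {
            float acc = 0.0f;
            for (size_t i = 0; i < k; i++) acc += wr[i] * x[t * k + i];
            y[t * rows + r] = acc;
        }
    }
}
```

The arithmetic per output element is identical in both; only the order in which
weights are touched changes, which is where a batched kernel would save weight
traffic.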