Commit 0fb4f29

ggml-ve graph compiler: wildcard the attn-mask ne[0] in the cgraph signature
Companion to the mask-CPY skip.

The cgraph signature already treats the growing KV/mask dims as wildcards, but only ne[1..2]. That is correct for the KV cache (growing dim is ne[1]; ne[0] = head_dim is stable) but NOT for the attention mask, whose growing dim is ne[0] (= padded n_kv). The decode signature therefore still changed at every mask-pad boundary, forcing a needless re-trace + .so reload (no NCC recompile after the source fix, but still per-boundary churn).

Since the mask no longer affects the generated source at all (its CPY is skipped; VE flash-attn masks causally via seq_len), wildcard ALL of the mask's spatial dims. The decode graph is now traced + compiled ONCE and reused for the whole generation, regardless of context length.

Verified: a cold 250-token VEBP generation does 4 NCC compiles + 4 re-traces total (the distinct with-/without-output-projection shapes + 2 trivial), and the count does not grow with length. Previously it re-traced at every pad boundary.

Not pushed.
1 parent 1685dd2 commit 0fb4f29

1 file changed

Lines changed: 10 additions & 1 deletion

ggml/src/ggml-ve/ggml-ve.cpp

@@ -133,12 +133,21 @@ static uint64_t cgraph_signature(const ggml_cgraph * g) {
         if (t == nullptr) { mix(0); return; }
         mix((uint64_t) t->type | (uint64_t)(slot + 1) << 32);
         const bool wild = is_kv_or_mask_tensor(t);
+        // The attention mask's GROWING dim is ne[0] (= padded n_kv), unlike the
+        // KV cache whose growing dim is ne[1] (ne[0] = stable head_dim). The
+        // mask no longer affects the generated source at all (its CPY is
+        // skipped — VE flash-attn masks causally via seq_len), so wildcard ALL
+        // of the mask's spatial dims; otherwise the decode signature changed
+        // at every mask-pad boundary, forcing a needless re-trace + .so reload.
+        const bool is_mask = t->name &&
+            std::strncmp(t->name, "attn_inp_kq_mask", 16) == 0;
         for (int d = 0; d < GGML_MAX_DIMS; ++d) {
             // Hash ne[0] always (head_dim / embed_dim — stable per model).
             // For KV / mask tensors skip ne[1..2] (the seq + mask dims
             // that grow as KV occupancy crosses power-of-2 chunk
             // boundaries); ne[3] back in since batch is meaningful.
-            if (wild && (d == 1 || d == 2)) {
+            // For the mask also skip ne[0] (its growing n_kv dim).
+            if ((wild && (d == 1 || d == 2)) || (is_mask && d == 0)) {
                 mix((uint64_t) 0xFFFFFFFFFFFFFFFFULL); // wildcard marker
             } else {
                 mix((uint64_t) t->ne[d]);
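
The effect of the new condition can be checked in isolation. The following is a minimal standalone sketch, not the code in ggml-ve.cpp: ToyTensor, toy_signature, the is_kv flag, and the FNV-1a style mix() are all hypothetical stand-ins (the real mix() and is_kv_or_mask_tensor() are not shown in this diff), and the shapes are made up for illustration. It only demonstrates that a per-tensor signature built this way ignores growth in the mask's ne[0] while still distinguishing a KV tensor's ne[0].

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for ggml_tensor: only the fields the signature reads.
struct ToyTensor {
    std::string            name;
    std::array<int64_t, 4> ne;     // ne[0..3], as in ggml
    bool                   is_kv;  // assumption: marks KV-cache tensors
};

constexpr uint64_t kWildcard = 0xFFFFFFFFFFFFFFFFULL;

// Assumed mixer (FNV-1a over the 8 bytes of v); the real mix() may differ.
inline void mix(uint64_t & h, uint64_t v) {
    for (int i = 0; i < 8; ++i) {
        h ^= (v >> (8 * i)) & 0xFF;
        h *= 0x100000001B3ULL;
    }
}

// Same prefix test as in the diff: first 16 chars equal "attn_inp_kq_mask".
inline bool is_mask(const ToyTensor & t) {
    return std::strncmp(t.name.c_str(), "attn_inp_kq_mask", 16) == 0;
}

// Per-tensor signature: wildcard ne[1..2] for KV/mask, plus ne[0] for the mask.
inline uint64_t toy_signature(const ToyTensor & t) {
    uint64_t   h    = 0xCBF29CE484222325ULL;
    const bool mask = is_mask(t);
    const bool wild = t.is_kv || mask;
    for (int d = 0; d < 4; ++d) {
        if ((wild && (d == 1 || d == 2)) || (mask && d == 0)) {
            mix(h, kWildcard);
        } else {
            mix(h, (uint64_t) t.ne[d]);
        }
    }
    return h;
}

// Returns true if growing the mask's ne[0] leaves the signature unchanged.
inline bool mask_growth_is_ignored() {
    const ToyTensor small{"attn_inp_kq_mask", {256, 1, 1, 1}, false};
    const ToyTensor large{"attn_inp_kq_mask", {512, 1, 1, 1}, false};
    assert(is_mask(small) && is_mask(large));
    return toy_signature(small) == toy_signature(large);
}
```

Because only ne[0..2] are wildcarded for the mask, a change in ne[3] (batch) still produces a different signature in this sketch, matching the "ne[3] back in since batch is meaningful" comment in the diff.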
