
Commit 737cd47

davide221 and mrciffa authored
perf(laguna): spec-decode verify path — bonus fold, fused domino head, draft-graph replay, AUTO width, fused QK/shexp weights (#479)
* perf(laguna): verify-path optimizations: bonus fold, fused domino, draft padding, AUTO width

  - chain greedy: fold the bonus token into the next verify batch as the seed (DDTree next_token contract); 1 target forward per step, not 2
  - fused domino head: one GPU graph (lm_head proj + unrolled GRU + in-graph argmax->get_rows), one-time f16 embedding table (392MiB), runs on the draft backend stream, async token readbacks
  - draft graph: pad ctx to 64-aligned topology + persistent metadata arena so ggml-cuda graph cache can replay (cache keys on tensor addresses); full-layer pad mask; positions plumbed for padding
  - AUTO verify width: round(EWMA)+1 capped at 3 (DFLASH_LAGUNA_VERIFY_WIDTH_MAX); old formula drifted to w4/5 on high-AL workloads (HE 172 -> 188 tok/s)
  - all changes env-gated; outputs byte-identical vs base (19-output harness)

* perf(laguna): fused adjacent QK and shexp gate/up weights for decode-width matmuls

  The loader lays attn_q|attn_k and ffn_gate_shexp|ffn_up_shexp adjacently in the weight buffer and binds a fused tensor over each pair (zero extra VRAM; the split tensors remain valid views for all other paths). At decode widths (n_tokens <= 8, MMVQ) the attention builder then runs:

  - ONE matmul for Q|K, ONE rms_norm+mul with a fused per-head norm weight, and ONE rope over all heads (Q and K use identical rope params), splitting with views afterwards
  - ONE matmul + ggml_swiglu for the shared expert instead of gate+up+glu

  Bit-identical by construction at MMVQ widths: matmul rows, rms_norm rows, the norm mul and rope are all per-row/per-head independent (verified with a standalone concat-vs-split bitwise test, /tmp/test_concat_mm.cpp, and the 19-output e2e hash harness). MMQ (prefill, n_tokens > 8) partitions work by total row count and is NOT bit-stable under row-concat, so prefill keeps the split weights; that path is unchanged and byte-identical too.

  Kill switch: DFLASH_LAGUNA_FUSED_QK=0 (loader). Debug: LUCE_QK_FUSE_MODE, LUCE_QK_FUSE_LAYERS.

  Measured (laguna-xs2 Q4_K_M + v23 drafter, RTX 3090, greedy): HumanEval 187.5 -> 192.6 tok/s, GSM8K ~177 -> 181.6, all 19 harness outputs byte-identical to the pre-change reference.

* deps: bump llama.cpp to perf/luce-verify-kernels (graph stats, q8 memo, diagnostics)

* fix(laguna): address review findings on the verify-path PR

  - guard the full-layer pad-mask write (null on all-SWA drafts; the causal SWA mask covers those layers)
  - release fused-domino resources (f16 embedding table buffer/ctx, dedicated backend instance) in ~LagunaDFlashTarget
  - release the persistent draft metadata arena and reserve state in step_graph_destroy (park/unpark kept ~32 MiB host per graph)
  - fused domino: same vocab-compatibility guard as the legacy path, clean fallback + one-shot warning on mismatch

  Unit suite 2022/0; spec-decode output hash unchanged.

---------

Co-authored-by: mrciffa <davide@cifarelli.tech>
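The AUTO verify width rule in the message above ("round(EWMA)+1 capped at 3") can be sketched in a few lines. This is only an illustration under stated assumptions: the commit message does not say what the EWMA tracks or which smoothing factor is used, so `AcceptEwma`, its `alpha`, and `auto_verify_width` are hypothetical names, and the sketch assumes the EWMA follows the number of accepted draft tokens per step.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical tracker (not from the commit): exponentially weighted moving
// average of the accepted length per verify step. alpha is an assumed value.
struct AcceptEwma {
    double alpha  = 0.5;
    double value  = 0.0;
    bool   seeded = false;

    void update(int accepted) {
        if (!seeded) {
            value  = accepted;   // first sample seeds the average
            seeded = true;
        } else {
            value = alpha * accepted + (1.0 - alpha) * value;
        }
    }
};

// Width rule as worded in the commit message: round(EWMA) + 1, capped at
// width_max (3 by default in the message). The lower bound of 1 is an
// assumption of this sketch.
inline int auto_verify_width(double ewma, int width_max = 3) {
    const int w = static_cast<int>(std::lround(ewma)) + 1;
    return std::max(1, std::min(w, width_max));
}
```

With the cap in place, a high average accepted length such as 4.7 still yields width 3, which matches the stated goal of stopping the drift to widths 4 and 5.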
1 parent f7fc944 commit 737cd47

15 files changed

Lines changed: 697 additions & 55 deletions

server/deps/llama.cpp

server/src/common/dflash_draft_graph.cpp

Lines changed: 103 additions & 12 deletions
@@ -24,18 +24,25 @@ static bool draft_has_swa_layers(const DraftWeights & dw) {
 
 // Build draft graph at a given ctx_len into sg. Does NOT touch sg.alloc.
 // mirror_view: if true, uses a view into mirror->target_feat at slot0.
+// ctx_alloc: allocation/topology size of the ctx dimension (>= ctx_len).
+// When ctx_alloc > 0 and differs from the legacy behavior, a full-layer pad
+// mask input is created so the graph topology stays stable while ctx_len
+// grows (CUDA-graph replay for the draft forward).
 static bool build_draft_graph_internal(
     StepGraph & sg,
     const DraftWeights & dw,
     ggml_tensor * lm_head,
     int ctx_len,
     const DraftFeatureMirror * mirror,
     int mirror_slot0,
-    bool mirror_view) {
+    bool mirror_view,
+    bool pad_masked = false) {
 
+    const size_t arena_sz = 32u * 1024 * 1024;
+    if (sg.meta_arena.size() < arena_sz) sg.meta_arena.resize(arena_sz);
     ggml_init_params ip{};
-    ip.mem_size = 256 * 1024 * 1024;
-    ip.mem_buffer = nullptr;
+    ip.mem_size = sg.meta_arena.size();
+    ip.mem_buffer = sg.meta_arena.data();
     ip.no_alloc = true;
     sg.ctx = ggml_init(ip);
     if (!sg.ctx) return false;
@@ -86,6 +93,18 @@ static bool build_draft_graph_internal(
         ggml_set_input(sg.attn_mask);
     }
 
+    bool any_full_layer = false;
+    for (int i = 0; i < dw.n_layer; i++)
+        if (!dw.layers[i].is_swa) { any_full_layer = true; break; }
+    sg.pad_mask_full = nullptr;
+    if (pad_masked && any_full_layer) {
+        const int total_k = ctx_len + q_len;
+        const int kv_pad = mask_align_up(total_k, MASK_KV_PAD);
+        sg.pad_mask_full = ggml_new_tensor_2d(sg.ctx, GGML_TYPE_F16, kv_pad, q_len);
+        ggml_set_name(sg.pad_mask_full, "pad_mask_full");
+        ggml_set_input(sg.pad_mask_full);
+    }
+
     sg.gf = ggml_new_graph_custom(sg.ctx, 4096, false);
 
     DraftGraphInputs gi{};
@@ -96,6 +115,7 @@ static bool build_draft_graph_internal(
     gi.positions_k = sg.positions_k;
     gi.lm_head = lm_head;
     gi.causal_mask_swa = sg.attn_mask;
+    gi.pad_mask_full = sg.pad_mask_full;
     DraftGraphOutputs go = build_draft_graph(sg.ctx, dw, gi);
     sg.hidden_states = go.hidden_states;
     sg.logits = go.logits;
@@ -123,16 +143,42 @@ bool build_draft_step(
     int ctx_len,
     const DraftFeatureMirror * mirror,
     int committed,
-    int /*ctx_len_max*/) {
+    int /*ctx_len_max*/,
+    bool pad_ctx) {
     step_graph_free(sg);
 
     if (!sg.alloc) {
         sg.alloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
     }
 
+    // Padded-ctx mode: build the graph at the 64-aligned ctx size and mask the
+    // pad keys, so the topology (and gallocr layout) stays IDENTICAL across
+    // ~64 tokens of context growth and ggml-cuda can replay the draft forward
+    // as a CUDA graph. Requires masking, so only usable when the draft has no
+    // SWA layers (SWA windowing would slide into the pad region).
+    // Padding is safe as long as no layer does actual SWA WINDOWING at this
+    // context size (windowing would slide the K view into the pad region).
+    // Layers flagged is_swa below the window size just mean "causal noise
+    // mask" and pad fine with the pad rows masked out.
+    const int ctx_pad_cand = (ctx_len + 63) & ~63;
+    const bool swa_windowing = draft_has_swa_layers(dw) && dw.swa_window > 0 &&
+                               ctx_pad_cand > dw.swa_window;
+    const bool do_pad = pad_ctx && !swa_windowing;
+    const int ctx_alloc = do_pad ? ctx_pad_cand : ctx_len;
+    static bool s_pad_logged = false;
+    if (!s_pad_logged) {
+        s_pad_logged = true;
+        std::fprintf(stderr, "[draft-pad] pad_ctx=%d has_swa=%d do_pad=%d ctx_len=%d ctx_alloc=%d\n",
+                     (int)pad_ctx, (int)draft_has_swa_layers(dw), (int)do_pad, ctx_len, ctx_alloc);
+    }
+
     int mirror_slot0 = 0;
-    const bool use_view = mirror &&
+    bool use_view = mirror &&
         draft_feature_mirror_can_view(*mirror, committed, ctx_len, mirror_slot0);
+    if (use_view && do_pad &&
+        mirror_slot0 + ctx_alloc > mirror->cap) {
+        use_view = false; // padded view would run past the ring
+    }
 
     // If ctx_len exceeds our cached reserve, re-reserve at next 64 boundary.
     // This makes all subsequent alloc_graph calls within the 64-token window
@@ -142,29 +188,74 @@ bool build_draft_step(
         // Build a dummy graph at ctx_padded just for sizing.
         // Use non-view path for reserve (view tensors don't need allocation).
         if (!build_draft_graph_internal(sg, dw, lm_head, ctx_padded,
-                                        nullptr, 0, false)) {
+                                        nullptr, 0, false, do_pad)) {
             return false;
         }
         ggml_gallocr_reserve(sg.alloc, sg.gf);
         sg.alloc_reserved_ctx = ctx_padded;
         step_graph_free(sg);
     }
 
-    // Build real graph at ctx_len for actual computation.
-    if (!build_draft_graph_internal(sg, dw, lm_head, ctx_len,
-                                    mirror, mirror_slot0, use_view)) {
+    // Build real graph. Padded mode: topology at ctx_alloc, real rows = ctx_len.
+    if (!build_draft_graph_internal(sg, dw, lm_head,
+                                    do_pad ? ctx_alloc : ctx_len,
+                                    mirror, mirror_slot0, use_view, do_pad)) {
         return false;
     }
+    sg.ctx_alloc = do_pad ? ctx_alloc : 0;
 
     if (!ggml_gallocr_alloc_graph(sg.alloc, sg.gf)) {
         return false;
     }
 
+    if (do_pad) {
+        const int q_len = dw.block_size;
+        const int total_k = ctx_alloc + q_len;
+        const int kv_pad = mask_align_up(total_k, MASK_KV_PAD);
+        // Full-layer mask: real ctx keys + all noise keys visible (the DFlash
+        // block is non-causal on full layers), pad keys and alignment columns
+        // -inf.
+        static constexpr uint16_t ZERO = 0x0000;
+        static constexpr uint16_t NEG_INF = 0xFC00;
+        std::vector<uint16_t> mask_data((size_t)kv_pad * q_len, NEG_INF);
+        for (int q = 0; q < q_len; q++) {
+            for (int k = 0; k < ctx_len; k++)
+                mask_data[(size_t)q * kv_pad + k] = ZERO;
+            for (int j = 0; j < q_len; j++)
+                mask_data[(size_t)q * kv_pad + (ctx_alloc + j)] = ZERO;
+        }
+        if (sg.pad_mask_full) {
+            ggml_backend_tensor_set(sg.pad_mask_full, mask_data.data(), 0,
+                                    sizeof(uint16_t) * mask_data.size());
+        }
+
+        // The pad rows of the ctx features must be FINITE (they are masked
+        // out, but NaN/Inf would still poison flash-attn). Zero them.
+        if (ctx_alloc > ctx_len) {
+            if (use_view) {
+                const size_t row_bytes = (size_t)mirror->target_feat->nb[1];
+                std::vector<uint8_t> zeros((size_t)(ctx_alloc - ctx_len) * row_bytes, 0);
+                ggml_backend_tensor_set(mirror->target_feat, zeros.data(),
+                                        (size_t)(mirror_slot0 + ctx_len) * row_bytes,
+                                        zeros.size());
+            } else {
+                const size_t row_bytes = (size_t)sg.target_hidden_cat->nb[1];
+                std::vector<uint8_t> zeros((size_t)(ctx_alloc - ctx_len) * row_bytes, 0);
+                ggml_backend_tensor_set(sg.target_hidden_cat, zeros.data(),
+                                        (size_t)ctx_len * row_bytes,
+                                        zeros.size());
+            }
+        }
+    }
+
     // Fill causal mask data for SWA layers (after allocation gives memory to the tensor).
     if (sg.attn_mask) {
         const int q_len = dw.block_size;
-        const bool swa_active = dw.swa_window > 0 && ctx_len > dw.swa_window;
-        const int eff_ctx = swa_active ? dw.swa_window : ctx_len;
+        const bool swa_active = !do_pad && dw.swa_window > 0 && ctx_len > dw.swa_window;
+        // Padded mode: keys span ctx_alloc rows; only the first ctx_len are
+        // real (visible), the pad rows stay -inf. Noise keys sit at ctx_alloc.
+        const int eff_ctx = do_pad ? ctx_alloc : (swa_active ? dw.swa_window : ctx_len);
+        const int vis_ctx = do_pad ? ctx_len : eff_ctx;
         const int eff_total_k = eff_ctx + q_len;
         const int kv_pad = mask_align_up(eff_total_k, MASK_KV_PAD);
 
@@ -175,7 +266,7 @@ bool build_draft_step(
         static constexpr uint16_t NEG_INF = 0xFC00;
         std::vector<uint16_t> mask_data((size_t)kv_pad * q_len, NEG_INF);
         for (int q = 0; q < q_len; q++) {
-            for (int k = 0; k < eff_ctx; k++)
+            for (int k = 0; k < vis_ctx; k++)
                 mask_data[(size_t)q * kv_pad + k] = ZERO;
             for (int j = 0; j <= q; j++)
                 mask_data[(size_t)q * kv_pad + (eff_ctx + j)] = ZERO;
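To make the padded full-layer mask in this file easier to follow, here is a standalone sketch of the same layout without any ggml dependency. The helper names `align_up_pow2` and `build_full_layer_pad_mask` are hypothetical, and the `kv_align` parameter stands in for `MASK_KV_PAD`, whose value is not shown in the diff. The bit patterns 0x0000 and 0xFC00 are the IEEE 754 half-precision encodings of 0 and negative infinity, as used by the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Round n up to a multiple of `align` (align must be a power of two).
// With align = 64 this is the same arithmetic as (ctx_len + 63) & ~63.
inline int align_up_pow2(int n, int align) {
    return (n + align - 1) & ~(align - 1);
}

constexpr uint16_t F16_ZERO    = 0x0000;  // half-precision 0.0
constexpr uint16_t F16_NEG_INF = 0xFC00;  // half-precision -inf

// Row-major [q_len][kv_pad] mask for the non-causal (full-layer) case:
//   columns [0, ctx_len)                   real context keys  -> visible
//   columns [ctx_len, ctx_alloc)           pad keys           -> -inf
//   columns [ctx_alloc, ctx_alloc + q_len) noise keys         -> visible
//   columns beyond ctx_alloc + q_len       alignment columns  -> -inf
std::vector<uint16_t> build_full_layer_pad_mask(int ctx_len, int ctx_alloc,
                                                int q_len, int kv_align) {
    const int kv_pad = align_up_pow2(ctx_alloc + q_len, kv_align);
    std::vector<uint16_t> mask((size_t)kv_pad * q_len, F16_NEG_INF);
    for (int q = 0; q < q_len; q++) {
        for (int k = 0; k < ctx_len; k++)
            mask[(size_t)q * kv_pad + k] = F16_ZERO;
        for (int j = 0; j < q_len; j++)
            mask[(size_t)q * kv_pad + (ctx_alloc + j)] = F16_ZERO;
    }
    return mask;
}
```

Because the mask shape depends only on `ctx_alloc`, it stays fixed while `ctx_len` grows inside one aligned window; only the contents change, which is what lets the graph topology remain stable.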

server/src/common/dflash_draft_graph.h

Lines changed: 2 additions & 1 deletion
@@ -31,6 +31,7 @@ bool build_draft_step(
     int ctx_len,
     const DraftFeatureMirror * mirror = nullptr,
     int committed = 0,
-    int ctx_len_max = 0);
+    int ctx_len_max = 0,
+    bool pad_ctx = false);
 
 } // namespace dflash::common

server/src/common/dflash_target.h

Lines changed: 9 additions & 0 deletions
@@ -16,6 +16,9 @@
 
 #include "ddtree.h"
 
+struct ggml_tensor;
+struct ggml_backend;
+
 namespace dflash::common {
 
 struct DFlashTarget {
@@ -110,6 +113,12 @@ struct DFlashTarget {
 
     // Embed token IDs using the target's embedding table.
     // Output: `out` must have space for `n * hidden_size()` floats.
+    // Optional GPU handles for the fused domino draft head. A target that
+    // returns non-null for all three enables the single-graph draft-side path.
+    virtual ggml_tensor * lm_head_tensor() { return nullptr; }
+    virtual ggml_tensor * gpu_embd_table() { return nullptr; }
+    virtual ggml_backend * fused_head_backend() { return nullptr; }
+
     virtual bool embed_tokens(const int32_t * tokens, int n,
                               float * out) const = 0;
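The "non-null for all three" rule in the comment above could be checked by a caller roughly as follows. This is a sketch with stand-in types: `TargetHandles`, `FakeFusedTarget` and `can_use_fused_head` are hypothetical and are not part of the commit; only the three accessor names and their nullptr defaults come from the diff.

```cpp
#include <cassert>

struct ggml_tensor;   // opaque, forward-declared as in the header
struct ggml_backend;  // opaque, forward-declared as in the header

// Stand-in for the three optional accessors (NOT the real DFlashTarget).
struct TargetHandles {
    virtual ~TargetHandles() = default;
    virtual ggml_tensor  * lm_head_tensor()     { return nullptr; }
    virtual ggml_tensor  * gpu_embd_table()     { return nullptr; }
    virtual ggml_backend * fused_head_backend() { return nullptr; }
};

// Illustrative target that opts in. The pointers are dummies that are only
// compared against nullptr and never dereferenced.
struct FakeFusedTarget : TargetHandles {
    int storage = 0;
    ggml_tensor  * lm_head_tensor()     override { return reinterpret_cast<ggml_tensor *>(&storage); }
    ggml_tensor  * gpu_embd_table()     override { return reinterpret_cast<ggml_tensor *>(&storage); }
    ggml_backend * fused_head_backend() override { return reinterpret_cast<ggml_backend *>(&storage); }
};

// The fused path is usable only when all three handles are present.
inline bool can_use_fused_head(TargetHandles & t) {
    return t.lm_head_tensor() != nullptr &&
           t.gpu_embd_table() != nullptr &&
           t.fused_head_backend() != nullptr;
}
```

Defaulting the accessors to nullptr means existing targets keep the legacy path without any code change, which fits the commit's claim that the new behavior is opt-in.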

server/src/common/domino_head.cpp

Lines changed: 140 additions & 0 deletions
@@ -179,4 +179,144 @@ bool domino_correct_greedy_chain(const DraftWeights & dw,
     return true;
 }
 
+bool domino_correct_greedy_chain_fused(const DraftWeights & dw,
+                                       ggml_backend_t backend,
+                                       ggml_tensor * lm_head,
+                                       ggml_tensor * embd_table,
+                                       const float * local_hidden,
+                                       int q_len,
+                                       int32_t last_tok,
+                                       std::vector<int32_t> & draft_tok) {
+    if (!dw.domino.enabled || q_len <= 1 || !local_hidden ||
+        !backend || !lm_head || !embd_table) {
+        return false;
+    }
+    const int hidden = dw.n_embd;
+    const int H = dw.domino.gru_hidden_dim;
+    const int E = dw.domino.emb_dim;
+    const int n_cand = q_len - 1;
+    const int vocab = (int)lm_head->ne[1];
+    if (hidden <= 0 || H <= 0 || E <= 0 || vocab <= 0) return false;
+    if (dw.domino.vocab_size > 0 && vocab != dw.domino.vocab_size) {
+        static bool s_vocab_warned = false;
+        if (!s_vocab_warned) {
+            s_vocab_warned = true;
+            std::fprintf(stderr,
+                         "domino_fused: vocab mismatch lm_head=%d domino=%d; falling back\n",
+                         vocab, dw.domino.vocab_size);
+        }
+        return false;
+    }
+
+    static const bool zero_start = std::getenv("DFLASH_DOMINO_ZERO_START") != nullptr;
+
+    const size_t arena_size = ggml_tensor_overhead() * (size_t)(96 + 48 * n_cand) +
+                              ggml_graph_overhead_custom(1024, false) + 4 * 1024 * 1024;
+    static thread_local std::vector<uint8_t> g_arena_fused;
+    if (g_arena_fused.size() < arena_size) g_arena_fused.resize(arena_size);
+
+    ggml_init_params ip{};
+    ip.mem_size = g_arena_fused.size();
+    ip.mem_buffer = g_arena_fused.data();
+    ip.no_alloc = true;
+    ggml_context * ctx = ggml_init(ip);
+    if (!ctx) return false;
+    ggml_cgraph * gf = ggml_new_graph_custom(ctx, 1024, false);
+
+    ggml_tensor * inp_hidden = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, hidden, n_cand);
+    ggml_tensor * inp_seed = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 1);
+    ggml_set_input(inp_hidden);
+    ggml_set_input(inp_seed);
+
+    // Base logits for every candidate in one matmul: [vocab, n_cand].
+    ggml_tensor * base = ggml_mul_mat(ctx, lm_head, inp_hidden);
+
+    ggml_tensor * state = ggml_reshape_2d(ctx, dw.domino.start, H, 1);
+    if (zero_start) state = ggml_scale(ctx, state, 0.0f);
+    ggml_tensor * prev_embed = ggml_get_rows(ctx, embd_table, inp_seed); // [hidden,1] f32
+    ggml_set_name(prev_embed, "dom_embed_seed");
+
+    std::vector<ggml_tensor *> toks((size_t)n_cand, nullptr);
+    for (int i = 0; i < n_cand; ++i) {
+        ggml_tensor * gi = ggml_mul_mat(ctx, dw.domino.gru_w_ih, prev_embed);
+        gi = ggml_add(ctx, gi, ggml_reshape_2d(ctx, dw.domino.gru_b_ih, 3 * H, 1));
+        ggml_tensor * gh = ggml_mul_mat(ctx, dw.domino.gru_w_hh, state);
+        gh = ggml_add(ctx, gh, ggml_reshape_2d(ctx, dw.domino.gru_b_hh, 3 * H, 1));
+
+        const size_t gate_bytes = (size_t)H * ggml_element_size(gi);
+        ggml_tensor * i_r = ggml_view_2d(ctx, gi, H, 1, gi->nb[1], 0);
+        ggml_tensor * i_z = ggml_view_2d(ctx, gi, H, 1, gi->nb[1], gate_bytes);
+        ggml_tensor * i_n = ggml_view_2d(ctx, gi, H, 1, gi->nb[1], 2 * gate_bytes);
+        ggml_tensor * h_r = ggml_view_2d(ctx, gh, H, 1, gh->nb[1], 0);
+        ggml_tensor * h_z = ggml_view_2d(ctx, gh, H, 1, gh->nb[1], gate_bytes);
+        ggml_tensor * h_n = ggml_view_2d(ctx, gh, H, 1, gh->nb[1], 2 * gate_bytes);
+
+        ggml_tensor * reset = ggml_sigmoid(ctx, ggml_add(ctx, i_r, h_r));
+        ggml_tensor * update = ggml_sigmoid(ctx, ggml_add(ctx, i_z, h_z));
+        ggml_tensor * cand = ggml_tanh(ctx, ggml_add(ctx, i_n, ggml_mul(ctx, reset, h_n)));
+        ggml_tensor * h_new = ggml_add(ctx, cand,
+                                       ggml_mul(ctx, update,
+                                                ggml_sub(ctx, state, cand)));
+
+        ggml_tensor * hid_i = ggml_view_2d(ctx, inp_hidden, hidden, 1,
+                                           inp_hidden->nb[1],
+                                           (size_t)i * inp_hidden->nb[1]);
+        ggml_tensor * zcat = ggml_concat(ctx, hid_i, h_new, 0);
+        ggml_tensor * bias = ggml_mul_mat(ctx, dw.domino.head_w1, zcat);
+        bias = ggml_add(ctx, bias, ggml_reshape_2d(ctx, dw.domino.head_b1, E, 1));
+        bias = ggml_silu(ctx, bias);
+        bias = ggml_mul_mat(ctx, dw.domino.head_w2, bias);
+        bias = ggml_add(ctx, bias, ggml_reshape_2d(ctx, dw.domino.head_b2, vocab, 1));
+
+        ggml_tensor * base_i = ggml_view_2d(ctx, base, vocab, 1,
+                                            base->nb[1], (size_t)i * base->nb[1]);
+        ggml_tensor * corrected = ggml_add(ctx, base_i, bias);
+        ggml_tensor * tok = ggml_argmax(ctx, corrected);
+        ggml_set_output(tok);
+        ggml_build_forward_expand(gf, tok);
+        toks[(size_t)i] = tok;
+
+        if (i + 1 < n_cand) {
+            prev_embed = ggml_get_rows(ctx, embd_table, tok);
+            ggml_set_name(prev_embed, "dom_embed_tok");
+        }
+        state = h_new;
+    }
+
+    static thread_local ggml_gallocr_t galloc_fused = nullptr;
+    if (!galloc_fused) {
+        galloc_fused = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
+    }
+    if (!ggml_gallocr_alloc_graph(galloc_fused, gf)) {
+        std::fprintf(stderr, "domino_fused: gallocr_alloc_graph failed\n");
+        ggml_free(ctx);
+        return false;
+    }
+
+    ggml_backend_tensor_set(inp_hidden, local_hidden + (size_t)hidden, 0,
+                            sizeof(float) * (size_t)hidden * (size_t)n_cand);
+    ggml_backend_tensor_set(inp_seed, &last_tok, 0, sizeof(int32_t));
+
+    if (ggml_backend_graph_compute(backend, gf) != GGML_STATUS_SUCCESS) {
+        std::fprintf(stderr, "domino_fused: graph_compute failed\n");
+        ggml_free(ctx);
+        return false;
+    }
+
+    draft_tok.assign((size_t)q_len, 0);
+    draft_tok[0] = last_tok;
+    // One synchronize instead of n_cand blocking readbacks.
+    int32_t t_out[16];
+    const int n_get = n_cand < 16 ? n_cand : 16;
+    for (int i = 0; i < n_get; ++i) {
+        ggml_backend_tensor_get_async(backend, toks[(size_t)i], &t_out[i], 0, sizeof(int32_t));
+    }
+    ggml_backend_synchronize(backend);
+    for (int i = 0; i < n_get; ++i) {
+        draft_tok[(size_t)i + 1] = t_out[i];
+    }
+    ggml_free(ctx);
+    return true;
+}
+
 } // namespace dflash::common
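As a reading aid for the unrolled GRU step above, the gate arithmetic can be written on plain vectors. This sketch assumes the `[r | z | n]` packing implied by the three `ggml_view_2d` offsets (0, `gate_bytes`, `2 * gate_bytes`) and takes the already-projected `gi = W_ih x + b_ih` and `gh = W_hh h + b_hh` as inputs; `gru_combine` is a hypothetical name and nothing here runs on the GPU.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

inline float sigmoid_f32(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// gi and gh hold 3*H values packed as [reset | update | new]; state holds H.
// Mirrors the graph: reset  = sigmoid(i_r + h_r)
//                    update = sigmoid(i_z + h_z)
//                    cand   = tanh(i_n + reset * h_n)
//                    h_new  = cand + update * (state - cand)
std::vector<float> gru_combine(const std::vector<float> & gi,
                               const std::vector<float> & gh,
                               const std::vector<float> & state) {
    const size_t H = state.size();
    std::vector<float> h_new(H);
    for (size_t k = 0; k < H; ++k) {
        const float reset  = sigmoid_f32(gi[k] + gh[k]);
        const float update = sigmoid_f32(gi[H + k] + gh[H + k]);
        const float cand   = std::tanh(gi[2 * H + k] + reset * gh[2 * H + k]);
        h_new[k] = cand + update * (state[k] - cand);
    }
    return h_new;
}
```

The last line is algebraically the same as `(1 - update) * cand + update * state`, the usual GRU blend; the diff's form needs one fewer elementwise op in the graph.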

server/src/common/domino_head.h

Lines changed: 14 additions & 0 deletions
@@ -8,6 +8,20 @@
 
 namespace dflash::common {
 
+// Fused variant: one GPU graph = lm_head projection of the candidate hidden
+// states + unrolled GRU correction chain with in-graph argmax -> get_rows
+// token feedback. Requires the target to expose its lm_head and a GPU (f16)
+// token-embedding table. Runs on a dedicated CUDA backend instance so the
+// ggml-cuda graph cache can replay it across steps.
+bool domino_correct_greedy_chain_fused(const DraftWeights & dw,
+                                       ggml_backend_t backend,
+                                       ggml_tensor * lm_head,
+                                       ggml_tensor * embd_table,
+                                       const float * local_hidden,
+                                       int q_len,
+                                       int32_t last_tok,
+                                       std::vector<int32_t> & draft_tok);
+
 bool domino_correct_greedy_chain(const DraftWeights & dw,
                                  ggml_backend_t backend,
                                  DFlashTarget & target,
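The "argmax -> get_rows token feedback" described in the header comment is a sequential dependency: each chosen token selects the input for the next correction. The toy below shows only that control flow on the CPU. It deliberately swaps the GRU and head MLP for a lookup table `bias_by_prev` indexed by the previous token, which is an assumption made for brevity and not how the commit computes the bias; `greedy_chain` and `argmax_row` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Index of the largest element (first one wins on ties).
inline int32_t argmax_row(const std::vector<float> & row) {
    size_t best = 0;
    for (size_t v = 1; v < row.size(); ++v)
        if (row[v] > row[best]) best = v;
    return static_cast<int32_t>(best);
}

// base:         [n_cand][vocab] base logits, one row per candidate position
// bias_by_prev: [vocab][vocab]  toy correction keyed by the previous token
//               (stand-in for the GRU + head in the real graph)
// Returns q_len = n_cand + 1 tokens with the seed at index 0, the same
// output layout as draft_tok in the diff.
std::vector<int32_t> greedy_chain(const std::vector<std::vector<float>> & base,
                                  const std::vector<std::vector<float>> & bias_by_prev,
                                  int32_t seed) {
    std::vector<int32_t> out;
    out.reserve(base.size() + 1);
    out.push_back(seed);
    int32_t prev = seed;
    for (const std::vector<float> & row : base) {
        std::vector<float> corrected(row.size());
        for (size_t v = 0; v < row.size(); ++v)
            corrected[v] = row[v] + bias_by_prev[(size_t)prev][v];
        prev = argmax_row(corrected);  // feedback: this token drives the next step
        out.push_back(prev);
    }
    return out;
}
```

Keeping the argmax inside the graph, as the fused variant does, avoids a host round trip per candidate; the CPU loop here makes explicit why the steps cannot run in parallel.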
