Commit 5e7594c
qwen35: fix do_spec_decode argmax OOB on prefix-cache partial restore
`n_last_chunk = committed % PREFILL_UBATCH` only equals the last prefill
chunk's actual size when prefill started at kv_offset=0. With prefix-cache
partial restore, `restore_and_generate` runs delta-prefill from kv_offset>0,
so the last chunk's `n_tokens` is `prompt_len - kv_offset`, not the modulo
of `committed` over PREFILL_UBATCH. The read offset was then larger than
sg_.argmax_tokens->ne[0], firing the "tensor read out of bounds" assert
on the first DFlash spec-decode request against any prompt the cache had
already seen.
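
The mismatch described above can be reproduced with plain arithmetic. The following is a minimal sketch, not the project's code: the value 512 for `PREFILL_UBATCH`, the helper names, and the assumption that delta-prefill is split into chunks of at most `PREFILL_UBATCH` tokens are all hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical value; the real PREFILL_UBATCH is not shown in this commit.
constexpr int64_t kPrefillUbatch = 512;

// Old assumption: the last chunk size is the remainder of the total
// committed token count over the micro-batch size.
inline int64_t last_chunk_by_modulo(int64_t committed) {
    return committed % kPrefillUbatch;
}

// Size of the last chunk actually prefilled when prefill starts at
// kv_offset and is split into chunks of at most kPrefillUbatch tokens
// (the chunking scheme is an assumption of this sketch).
inline int64_t last_chunk_actual(int64_t prompt_len, int64_t kv_offset) {
    assert(kv_offset >= 0 && kv_offset < prompt_len);
    const int64_t delta = prompt_len - kv_offset;
    const int64_t rem = delta % kPrefillUbatch;
    return rem == 0 ? kPrefillUbatch : rem;
}

// True when a chunk size guessed by the caller fits inside a tensor whose
// first dimension (ne[0]) holds ne0 elements.
inline bool read_in_bounds(int64_t guessed_size, int64_t ne0) {
    return guessed_size >= 1 && guessed_size <= ne0;
}
```

With a 1000-token prompt of which 900 tokens were restored from the prefix cache, `last_chunk_by_modulo(1000)` gives 488 while `last_chunk_actual(1000, 900)` gives 100, so a read sized by the modulo overruns the tensor. With `kv_offset == 0` both helpers return 488 and the read stays in bounds.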
Read the actual last-chunk size from sg_.argmax_tokens->ne[0], which the
graph builder sized to match the bound chunk. No-op when kv_offset==0
(`committed % UBATCH == ne[0]`).

1 parent 230c303 commit 5e7594c
1 file changed
Lines changed: 10 additions & 4 deletions
Diff hunk (line contents not preserved): original lines 832-842 become lines 832-848. Original lines 835-837 are replaced by new lines 835-843, and original line 839 is replaced by new line 845.