Fix sliding-window chunked prefill in the gemma4-31B runner
The runner split prefill into chunks of get_max_prefill_chunk =
2*sliding_window (2048) tokens.
A chunk larger than the window overflows the 2*window ring KV cache across
chunk boundaries: after writing a 2048-token chunk the ring holds only the most
recent 2048 positions, so the first ~(chunk - window) queries of every chunk
after the first lose the tail of the previous chunk that is still inside their
1024 window. Those sliding-layer queries then attend over a truncated window,
which propagates into their hidden states and the global KV those positions
write, changing the output. The global flat-cache layers are unaffected.
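The overflow can be reproduced with a small position-only model. This is a
Python sketch for illustration, not the runner's code; it assumes the whole
chunk is written to the ring before attention runs and that the window counts
the query's own position. The function name and parameters are hypothetical.

```python
def truncated_queries(chunk: int, window: int, ring: int, start: int) -> int:
    """Count queries in the chunk starting at `start` whose sliding window
    reaches back to positions the ring cache no longer holds."""
    end = start + chunk                # one past the last position written
    oldest_kept = max(0, end - ring)   # ring keeps only the last `ring` positions
    count = 0
    for pos in range(start, end):
        # Assumption: the window covers positions pos-window+1 .. pos.
        oldest_needed = max(0, pos - window + 1)
        if oldest_needed < oldest_kept:
            count += 1
    return count


if __name__ == "__main__":
    window, ring = 1024, 2048
    # Second chunk with the old chunk size (2*window) and the capped size (window).
    print(truncated_queries(chunk=2048, window=window, ring=ring, start=2048))
    print(truncated_queries(chunk=1024, window=window, ring=ring, start=1024))
```

Under these assumptions the 2048-token chunk truncates the window for roughly
chunk - window of its queries, while a chunk no larger than the window
truncates none.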
Cap the prefill chunk at the sliding window: use get_sliding_window from
metadata (now exported), else fall back to max_prefill/2, since the export sets
max_prefill = 2*sliding_window. Decode is unaffected. Add --max_prefill_chunk
to override the chunk size for testing.
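The selection order described above could look roughly like the following
Python sketch. It is not the runner's implementation: pick_prefill_chunk and
its parameters are hypothetical names, and it assumes the override simply
replaces the default, which this message does not spell out.

```python
from typing import Optional


def pick_prefill_chunk(
    sliding_window: Optional[int],
    max_prefill: int,
    override: Optional[int] = None,
) -> int:
    """Choose a prefill chunk size (hypothetical helper, for illustration)."""
    if override is not None:
        # Assumption: a --max_prefill_chunk value wins outright.
        return override
    if sliding_window is not None:
        # Metadata exports the sliding window: cap the chunk at it.
        return sliding_window
    # Older exports: max_prefill = 2*sliding_window, so halve it.
    return max_prefill // 2


if __name__ == "__main__":
    print(pick_prefill_chunk(sliding_window=1024, max_prefill=2048))
    print(pick_prefill_chunk(sliding_window=None, max_prefill=2048))
```

Both calls yield a 1024-token chunk for the 2048-token max_prefill described
here, with or without the exported metadata.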
Authored with assistance from Claude Code.
ghstack-source-id: 1c793dd
ghstack-comment-id: 4734206312
Pull-Request: #20346