Commit 501ba12
Allow chunked prefill when num_prompt_tokens > max_seq_len
Summary:
Remove the early `num_prompt_tokens <= max_seq_len` check in TextLLMRunner. `TextPrefiller::prefill()` already supports chunked prefill — when the prompt is longer than `max_seq_len` it splits the input into `max_seq_len`-sized chunks and prefills them sequentially. The previous check rejected this valid case, breaking models exported with `max_seq_len < max_context_len` (e.g. a 1024 prefill chunk over a 4096 KV cache).
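For context, a minimal sketch of the chunking behavior the summary describes (hypothetical names and types; the real `TextPrefiller::prefill()` operates on ExecuTorch tensors, not plain vectors):

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical sketch of chunked prefill: when the prompt exceeds
// max_seq_len, feed it to the model in max_seq_len-sized chunks,
// advancing the KV-cache position after each chunk.
int64_t prefill_chunked(
    const std::vector<int64_t>& prompt_tokens,
    int64_t max_seq_len,
    int64_t& pos,
    // Model forward pass: (chunk pointer, chunk length, start position).
    const std::function<void(const int64_t*, int64_t, int64_t)>& forward) {
  const int64_t num_tokens = static_cast<int64_t>(prompt_tokens.size());
  int64_t offset = 0;
  while (offset < num_tokens) {
    const int64_t chunk = std::min(max_seq_len, num_tokens - offset);
    forward(prompt_tokens.data() + offset, chunk, pos);
    pos += chunk;
    offset += chunk;
  }
  return pos;  // position after the full prompt has been prefilled
}
```

Because each chunk is at most `max_seq_len` tokens, a prompt longer than `max_seq_len` is valid as long as the whole prompt fits in the KV cache, which is exactly the bound the checks below enforce.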
The total-capacity bound is preserved (see the sketch after this list):
- For non-sliding-window models (`max_seq_len >= max_context_len`), the existing `pos_ + num_prompt_tokens < max_context_len` check is unchanged.
- For sliding-window models (`max_seq_len < max_context_len`), a new per-call check `num_prompt_tokens < max_context_len` ensures the prompt itself fits in KV cache; `pos_` doesn't represent consumed capacity for these models since the model handles position wrapping internally.
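A hedged sketch of the resulting validation, assuming names like `pos`, `max_seq_len`, `max_context_len`, and the error type for illustration (not the exact ExecuTorch code):

```cpp
#include <cstdint>

enum class Error { Ok, InvalidArgument };

// Sketch of the two capacity checks described above.
Error check_prompt_fits(
    int64_t num_prompt_tokens,
    int64_t pos,                 // tokens already consumed (pos_)
    int64_t max_seq_len,         // longest single forward pass
    int64_t max_context_len) {   // total KV-cache capacity
  if (max_seq_len >= max_context_len) {
    // Non-sliding-window model: pos_ tracks consumed KV-cache capacity,
    // so history plus prompt must leave room for generated tokens.
    if (pos + num_prompt_tokens >= max_context_len) {
      return Error::InvalidArgument;
    }
  } else {
    // Sliding-window model: the model wraps positions internally, so
    // pos_ is not a capacity measure; only require that the prompt
    // itself fits in the KV cache. Prompts longer than max_seq_len
    // are handled by chunked prefill.
    if (num_prompt_tokens >= max_context_len) {
      return Error::InvalidArgument;
    }
  }
  return Error::Ok;
}
```

Note that neither branch compares `num_prompt_tokens` against `max_seq_len`, which is the point of the change: per-pass length is now the prefiller's concern, not the runner's.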
Differential Revision: D1017287201
Parent: 32702ac · Commit: 501ba12
1 file changed
Lines changed: 19 additions & 10 deletions
(Diff table not captured in this export. From the surviving line numbers: one hunk replaces original lines 141-150 with ten new lines 141-150, presumably removing the early `num_prompt_tokens <= max_seq_len` check; a second hunk inserts nine new lines 161-169 after line 160, presumably adding the sliding-window capacity check described in the summary.)