Optimize TTFT #4695
Open
grimoire wants to merge 17 commits into
Pull request overview
This PR improves TTFT for the PyTorch engine under mixed short/long prompt pressure by adding a bounded opt-TTFT prefill scheduling policy. It introduces a long-prefill “lane” with size-aware selection + aging, interleaves bounded short/normal prefills between long-context chunks, and preserves chunk carry state across interleaved prefills/decodes.
Changes:
- Add scheduler support for chunk-limited KV ownership (kv_token_limit), plus long-prefill ordering policies (size/fifo) with aging.
- Add bounded prefill interleaving logic in InputsMakerAsync, including reserving KV blocks per long chunk and making “pending long chunk” visible to the engine runnable gate.
- Add regression tests across scheduler, inputs maker, model agent, and spec agent to ensure correctness of chunk carry, gate rollbacks, and scheduling behavior.
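As a rough illustration of what chunk-limited KV ownership could mean for block accounting, the sketch below caps the token count at kv_token_limit before computing how many blocks are still needed. The function name, its parameters, and the clamping to zero are assumptions made for this example; they are not copied from the PR's block managers.

```python
from typing import Optional


def required_blocks(num_tokens: int,
                    allocated_blocks: int,
                    block_size: int,
                    kv_token_limit: Optional[int] = None) -> int:
    """Return how many more KV blocks a sequence needs (hypothetical helper).

    When kv_token_limit is set, only that many tokens may own KV blocks, so a
    non-final long chunk never allocates past its chunk boundary.
    """
    if kv_token_limit is not None:
        num_tokens = min(num_tokens, kv_token_limit)
    # Integer ceiling division: blocks needed to cover num_tokens.
    needed = (num_tokens + block_size - 1) // block_size
    # Never report a negative requirement if more blocks are already owned.
    return max(0, needed - allocated_blocks)
```

With block_size=64, a 1000-token sequence needs 16 blocks without a limit, but only 4 when kv_token_limit=256.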
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/pytorch/spec_decode/test_spec_agent.py | Adds tests ensuring SpecModelAgent preserves/clears chunk carry correctly across decode/prefill interleaving and DP edge cases. |
| tests/pytorch/paging/test_scheduler.py | Adds extensive scheduler regression tests for prefill gates, prefix-hit rollback behavior, long-chunk allocation limits, and long-lane policy/aging. |
| tests/pytorch/paging/test_block_manager.py | Adds a window block manager test ensuring allocation respects kv_token_limit under sliding-window accounting. |
| tests/pytorch/engine/test_model_agent.py | Adds tests verifying chunk model_metas carry behavior across interleaved prefills and final-chunk consumption. |
| tests/pytorch/engine/test_inputs_maker.py | Adds tests for opt-TTFT env reading/clamping, runnable gating for pending long chunks, and bounded interleaving policy behavior. |
| lmdeploy/pytorch/spec_decode/spec_agent.py | Keeps chunk carry state across interleaved non-chunk decode/prefill (cleared only on first chunk; consumed on final). |
| lmdeploy/pytorch/paging/seq_states/states.py | Clears kv_token_limit when freeing a sequence to avoid leaking chunk-limited ownership across lifetimes. |
| lmdeploy/pytorch/paging/scheduler.py | Implements bounded prefill admission gates with tentative prefix-hit rollback, long-prefill lane ordering (size/fifo + aging), and chunk KV limiting/reservation. |
| lmdeploy/pytorch/paging/block_trie.py | Clamps trie allocation visibility to kv_token_limit so chunk-limited KV ownership doesn’t publish beyond the chunk. |
| lmdeploy/pytorch/paging/block_manager/window_block_manager.py | Fixes sliding-window required-block accounting with chunk limits (and clamps negative required blocks to 0). |
| lmdeploy/pytorch/paging/block_manager/default_block_manager.py | Applies kv_token_limit when computing required blocks so allocation can be bounded per chunk. |
| lmdeploy/pytorch/messages.py | Adds SchedulerSequence.kv_token_limit metadata used to bound temporary KV ownership for non-final long chunks. |
| lmdeploy/pytorch/envs.py | Adds opt-TTFT env parsing (LMDEPLOY_PT_TTFT_POLICY, ...SHORT_TURNS, ...AGING_SEC) and a generic env_to_choice() helper. |
| lmdeploy/pytorch/engine/model_agent/agent.py | Changes chunk-meta carry logic so only chunk inputs consume chunk state; interleaved normal prefills don’t clear it. |
| lmdeploy/pytorch/engine/inputs_maker.py | Adds bounded opt-TTFT prefill interleaving policy, long-chunk KV reservation, and richer module/class documentation. |
| lmdeploy/pytorch/engine/engine_loop.py | Extends runnable gating to include engine-local pending long-chunk work (not only scheduler queues). |
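The env parsing described for lmdeploy/pytorch/envs.py might look roughly like the sketch below. Only the name env_to_choice and the variable names come from this page; the signature, the fallback-to-default behavior on invalid values, the env_to_clamped_int helper, and every default and bound shown are assumptions for illustration.

```python
import os
from typing import Sequence


def env_to_choice(name: str, default: str, choices: Sequence[str]) -> str:
    """Read an env var restricted to a fixed set of choices.

    Assumed behavior: unknown values fall back to the default. The real
    helper may raise or warn instead.
    """
    value = os.environ.get(name, default).strip().lower()
    return value if value in choices else default


def env_to_clamped_int(name: str, default: int, lo: int, hi: int) -> int:
    """Read an integer env var and clamp it into [lo, hi] (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return max(lo, min(hi, value))


# Example usage; the defaults and bounds here are placeholders, not the PR's values.
policy = env_to_choice('LMDEPLOY_PT_TTFT_POLICY', 'size', ('size', 'fifo'))
short_turns = env_to_clamped_int('LMDEPLOY_PT_TTFT_SHORT_TURNS', 4, 0, 64)
```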
Comment on lines +539 to +559

    def _split_waiting_by_prefill_kind(waiting: SeqList):
        """Split waiting requests into normal/final and non-final long
        prefill."""
        normal_waiting: SeqList = []
        long_waiting: SeqList = []
        for seq in waiting:
            if self._prefill_kv_token_limit(seq) is None:
                normal_waiting.append(seq)
            else:
                long_waiting.append(seq)
        return normal_waiting, long_waiting

    def _sort_normal_prefills(waiting: SeqList):
        return sorted(waiting, key=lambda seq: (self._prefill_admission_token_count(seq), seq.arrive_time))

    def _sort_long_prefills_for_long_turn(waiting: SeqList):
        if self._long_prefill_policy != 'size':
            return waiting
        now = time.perf_counter()
        return sorted(waiting, key=lambda seq: self._long_prefill_priority_key(seq, now))
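The excerpt does not show `_long_prefill_priority_key`, so the following is only one plausible shape for a size-aware key with aging, not the PR's implementation. The Seq dataclass, the aging_sec parameter, and the two-tier tuple layout are all assumptions: sequences that have waited at least aging_sec sort first in arrival order, and the rest sort by size.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Seq:
    """Stand-in for a waiting sequence (hypothetical, not SchedulerSequence)."""
    num_tokens: int
    arrive_time: float


def long_prefill_priority_key(seq: Seq, now: float, aging_sec: float) -> Tuple[int, float, float]:
    """Smaller keys are scheduled first."""
    aged = aging_sec > 0 and (now - seq.arrive_time) >= aging_sec
    if aged:
        # Aged sequences jump the size ordering and are served oldest first.
        return (0, seq.arrive_time, seq.num_tokens)
    # Everything else is ordered by prompt size, ties broken by arrival.
    return (1, seq.num_tokens, seq.arrive_time)


def sort_long_prefills(waiting: List[Seq], now: float, aging_sec: float) -> List[Seq]:
    return sorted(waiting, key=lambda s: long_prefill_priority_key(s, now, aging_sec))
```

Under this sketch a very large prompt that has waited long enough is no longer starved by a stream of smaller long prompts, which matches the stated intent of the aging term.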
Requirements
Summary
Improve TTFT under mixed short/long prompt pressure by adding bounded opt-TTFT prefill scheduling for the PyTorch engine.
This keeps decode protected, allows a bounded number of short/normal prefill turns between long-context chunks, and schedules one long-work turn after the quota is reached. Waiting long prefills use a size-aware long-lane policy with aging so smaller long prompts are not blocked by extreme outliers forever, while old huge prompts can still make progress.
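The quota behavior described above can be pictured with a small turn counter. This is a simplified sketch under stated assumptions: the TurnBudget class and its next_turn method are hypothetical, decode protection and KV reservation are not modeled, and the real logic in InputsMakerAsync is likely more involved.

```python
from dataclasses import dataclass


@dataclass
class TurnBudget:
    """Tracks short/normal prefill turns taken since the last long chunk."""
    short_turns: int  # assumed meaning: max short turns between long chunks
    used: int = 0

    def next_turn(self, has_short: bool, has_long: bool) -> str:
        """Pick the next prefill turn: 'short', 'long', or 'idle'."""
        # Run a long chunk once the quota is spent, or when nothing short waits.
        if has_long and (self.used >= self.short_turns or not has_short):
            self.used = 0
            return 'long'
        if has_short:
            # Only charge the quota while a long chunk is actually waiting.
            if has_long:
                self.used += 1
            return 'short'
        return 'idle'
```

With short_turns=2 and both kinds of work pending, the sequence of turns is short, short, long, and then the pattern repeats.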
Changes
- LMDEPLOY_PT_TTFT_POLICY=size|fifo
- LMDEPLOY_PT_TTFT_SHORT_TURNS
- LMDEPLOY_PT_TTFT_AGING_SEC
Benchmark Notes
Known tradeoff: extremely large prompts can still dominate the global tail, and the >65536 bucket may regress because the policy intentionally favors smaller long prompts inside the long lane.
Other
Both InputsMakerAsync._make_forward_inputs and Scheduler._schedule_prefill have several code smells. A refactor for them will be added after we finish this PR and the prefix caching PR.