Optimize TTFT#4695

Open
grimoire wants to merge 17 commits into InternLM:main from grimoire:opt-ttft-token-aware-prefill

Conversation

@grimoire grimoire commented Jun 22, 2026

Collaborator

Requirements

Summary

Improve TTFT under mixed short/long prompt pressure by adding bounded opt-TTFT prefill scheduling for the PyTorch engine.

This keeps decode protected, allows a bounded number of short/normal prefill turns between long-context chunks, and schedules one long-work turn after the quota is reached. Waiting long prefills use a size-aware long-lane policy with aging so smaller long prompts are not blocked by extreme outliers forever, while old huge prompts can still make progress.
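
The bounded interleaving described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `TurnPicker`, `next_turn`, and the `has_short`/`has_long` flags are hypothetical names, and decode protection is not modeled.

```python
from dataclasses import dataclass


@dataclass
class TurnPicker:
    """Hypothetical helper that alternates short and long prefill turns."""
    short_turns: int  # quota of short/normal turns between long chunks
    used: int = 0     # short turns taken since the last long chunk

    def next_turn(self, has_short: bool, has_long: bool):
        # A long chunk runs when the quota is spent or nothing short waits.
        if has_long and (not has_short or self.used >= self.short_turns):
            self.used = 0
            return "long"
        if has_short:
            # Only count short turns that actually delay waiting long work.
            if has_long:
                self.used += 1
            return "short"
        return None  # nothing to prefill


picker = TurnPicker(short_turns=2)
turns = [picker.next_turn(has_short=True, has_long=True) for _ in range(6)]
print(turns)
```

With a quota of 2 and both kinds of work always waiting, the sketch yields two short turns followed by one long turn, repeating.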

Changes

  • Add bounded short/normal prefill turns between long-context chunks.
  • Add size-aware long-lane selection with aging.
  • Add private env controls:
    • LMDEPLOY_PT_TTFT_POLICY=size|fifo
    • LMDEPLOY_PT_TTFT_SHORT_TURNS
    • LMDEPLOY_PT_TTFT_AGING_SEC
  • Preserve ModelAgent and SpecAgent chunk carry across interleaved normal prefills.
  • Add scheduler/input-maker/model-agent/spec-agent regression tests.
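
For readers who want to try the env controls, here is a hedged sketch of how such values could be read and validated. The helper names `read_choice` and `read_clamped`, the defaults, and the clamping bounds are assumptions for illustration; the actual parsing lives in `lmdeploy/pytorch/envs.py` and may differ.

```python
import os


def read_choice(name, default, choices):
    """Return the env value if it is an allowed choice, else the default."""
    value = os.environ.get(name, default).strip().lower()
    return value if value in choices else default


def read_clamped(name, default, lo, hi, cast=int):
    """Parse a numeric env value and clamp it into [lo, hi]."""
    try:
        value = cast(os.environ.get(name, default))
    except (TypeError, ValueError):
        return default  # unparsable input falls back to the default
    return max(lo, min(hi, value))


# Defaults and bounds below are hypothetical, not taken from the PR.
policy = read_choice("LMDEPLOY_PT_TTFT_POLICY", "size", ("size", "fifo"))
short_turns = read_clamped("LMDEPLOY_PT_TTFT_SHORT_TURNS", 4, 0, 64)
aging_sec = read_clamped("LMDEPLOY_PT_TTFT_AGING_SEC", 30.0, 0.0, 3600.0, cast=float)
print(policy, short_turns, aging_sec)
```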

Benchmark Notes

Known tradeoff: extremely large prompts can still dominate the global tail latency, and the >65536-token bucket may regress because the policy intentionally favors smaller long prompts inside the long lane.
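
To make the tradeoff concrete, here is one possible shape for a size-aware ordering with aging. It is only a sketch under stated assumptions: `LongReq`, `priority_key`, and the rule "promote anything that has waited at least `aging_sec`" are illustrative, and the real `_long_prefill_priority_key` may use a different formula.

```python
from dataclasses import dataclass


@dataclass
class LongReq:
    name: str
    tokens: int         # prompt tokens still to prefill
    arrive_time: float  # arrival timestamp in seconds


def priority_key(req, now, aging_sec):
    """Smaller prompts first, unless a request has aged past the limit."""
    waited = now - req.arrive_time
    if waited >= aging_sec:
        # Aged requests jump the lane, oldest first.
        return (0, req.arrive_time, 0)
    return (1, req.tokens, req.arrive_time)


reqs = [
    LongReq("huge", 100_000, 0.0),
    LongReq("mid", 20_000, 5.0),
    LongReq("small", 9_000, 8.0),
]
early = [r.name for r in sorted(reqs, key=lambda r: priority_key(r, 10.0, 30.0))]
late = [r.name for r in sorted(reqs, key=lambda r: priority_key(r, 31.0, 30.0))]
print(early, late)
```

Before the aging threshold the huge prompt sorts last, which is the regression risk noted above; once it has waited long enough it moves to the front.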

Other

Both InputsMakerAsync._make_forward_inputs and Scheduler._schedule_prefill have several code smells. A refactor of both will follow once this PR and the prefix caching PR are finished.

@grimoire grimoire marked this pull request as ready for review June 24, 2026 03:36
Copilot AI review requested due to automatic review settings June 24, 2026 03:36

Copilot AI left a comment

Contributor

Pull request overview

This PR improves TTFT for the PyTorch engine under mixed short/long prompt pressure by adding a bounded opt-TTFT prefill scheduling policy. It introduces a long-prefill “lane” with size-aware selection + aging, interleaves bounded short/normal prefills between long-context chunks, and preserves chunk carry state across interleaved prefills/decodes.

Changes:

  • Add scheduler support for chunk-limited KV ownership (kv_token_limit), plus long-prefill ordering policies (size/fifo) with aging.
  • Add bounded prefill interleaving logic in InputsMakerAsync, including reserving KV blocks per long chunk and making “pending long chunk” visible to the engine runnable gate.
  • Add regression tests across scheduler, inputs maker, model agent, and spec agent to ensure correctness of chunk carry, gate rollbacks, and scheduling behavior.
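
As a rough illustration of what bounding allocation with kv_token_limit could look like, consider the sketch below. The function name `required_blocks`, its signature, and the block size of 64 are assumptions; the sliding-window manager in this PR uses different accounting.

```python
def required_blocks(num_tokens, allocated_blocks, block_size, kv_token_limit=None):
    """Blocks still needed to hold the tokens visible under the chunk limit."""
    if kv_token_limit is None:
        visible = num_tokens                       # normal or final prefill
    else:
        visible = min(num_tokens, kv_token_limit)  # non-final long chunk
    needed = -(-visible // block_size)             # ceiling division
    # Never report a negative requirement when enough blocks are already held.
    return max(0, needed - allocated_blocks)


print(required_blocks(70_000, 0, 64))                         # unbounded
print(required_blocks(70_000, 0, 64, kv_token_limit=8_192))   # bounded per chunk
```

The point of the limit is that a non-final chunk of a very long prompt reserves only the blocks for that chunk rather than for the whole prompt.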

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/pytorch/spec_decode/test_spec_agent.py | Adds tests ensuring SpecModelAgent preserves/clears chunk carry correctly across decode/prefill interleaving and DP edge cases. |
| tests/pytorch/paging/test_scheduler.py | Adds extensive scheduler regression tests for prefill gates, prefix-hit rollback behavior, long-chunk allocation limits, and long-lane policy/aging. |
| tests/pytorch/paging/test_block_manager.py | Adds a window block manager test ensuring allocation respects kv_token_limit under sliding-window accounting. |
| tests/pytorch/engine/test_model_agent.py | Adds tests verifying chunk model_metas carry behavior across interleaved prefills and final-chunk consumption. |
| tests/pytorch/engine/test_inputs_maker.py | Adds tests for opt-TTFT env reading/clamping, runnable gating for pending long chunks, and bounded interleaving policy behavior. |
| lmdeploy/pytorch/spec_decode/spec_agent.py | Keeps chunk carry state across interleaved non-chunk decode/prefill (cleared only on first chunk; consumed on final). |
| lmdeploy/pytorch/paging/seq_states/states.py | Clears kv_token_limit when freeing a sequence to avoid leaking chunk-limited ownership across lifetimes. |
| lmdeploy/pytorch/paging/scheduler.py | Implements bounded prefill admission gates with tentative prefix-hit rollback, long-prefill lane ordering (size/fifo + aging), and chunk KV limiting/reservation. |
| lmdeploy/pytorch/paging/block_trie.py | Clamps trie allocation visibility to kv_token_limit so chunk-limited KV ownership doesn’t publish beyond the chunk. |
| lmdeploy/pytorch/paging/block_manager/window_block_manager.py | Fixes sliding-window required-block accounting with chunk limits (and clamps negative required blocks to 0). |
| lmdeploy/pytorch/paging/block_manager/default_block_manager.py | Applies kv_token_limit when computing required blocks so allocation can be bounded per chunk. |
| lmdeploy/pytorch/messages.py | Adds SchedulerSequence.kv_token_limit metadata used to bound temporary KV ownership for non-final long chunks. |
| lmdeploy/pytorch/envs.py | Adds opt-TTFT env parsing (LMDEPLOY_PT_TTFT_POLICY, ...SHORT_TURNS, ...AGING_SEC) and a generic env_to_choice() helper. |
| lmdeploy/pytorch/engine/model_agent/agent.py | Changes chunk-meta carry logic so only chunk inputs consume chunk state; interleaved normal prefills don’t clear it. |
| lmdeploy/pytorch/engine/inputs_maker.py | Adds bounded opt-TTFT prefill interleaving policy, long-chunk KV reservation, and richer module/class documentation. |
| lmdeploy/pytorch/engine/engine_loop.py | Extends runnable gating to include engine-local pending long-chunk work (not only scheduler queues). |
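
The chunk-carry rule summarized for the model agent and spec agent (cleared only on the first chunk, consumed on the final chunk, untouched by interleaved non-chunk steps) might be modeled like this. `ChunkCarry` and `step` are hypothetical names for illustration and are not the classes changed in this PR.

```python
class ChunkCarry:
    """Hypothetical holder for state carried between long-prefill chunks."""

    def __init__(self):
        self.meta = None

    def step(self, is_chunk, is_first=False, is_final=False, new_meta=None):
        """Return the carried meta visible to this step, then update it."""
        if not is_chunk:
            return None       # interleaved normal prefill/decode: state untouched
        if is_first:
            self.meta = None  # a new chunked prefill starts from a clean slate
        carried = self.meta
        # The final chunk consumes the carry; earlier chunks replace it.
        self.meta = None if is_final else new_meta
        return carried


carry = ChunkCarry()
carry.step(True, is_first=True, new_meta="m1")  # first chunk
carry.step(False)                               # interleaved short prefill
print(carry.step(True, new_meta="m2"))          # middle chunk still sees "m1"
```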

Comment on lines +539 to +559
def _split_waiting_by_prefill_kind(waiting: SeqList):
    """Split waiting requests into normal/final and non-final long
    prefill."""
    normal_waiting: SeqList = []
    long_waiting: SeqList = []
    for seq in waiting:
        if self._prefill_kv_token_limit(seq) is None:
            normal_waiting.append(seq)
        else:
            long_waiting.append(seq)
    return normal_waiting, long_waiting

def _sort_normal_prefills(waiting: SeqList):
    return sorted(waiting, key=lambda seq: (self._prefill_admission_token_count(seq), seq.arrive_time))

def _sort_long_prefills_for_long_turn(waiting: SeqList):
    if self._long_prefill_policy != 'size':
        return waiting
    now = time.perf_counter()
    return sorted(waiting, key=lambda seq: self._long_prefill_priority_key(seq, now))
