Optimize TTFT by grimoire · Pull Request #4695 · InternLM/lmdeploy

grimoire · 2026-06-22T10:53:55Z

Requirements

Interleave long-context prefill chunks with decode #4631

Summary

Improve TTFT under mixed short/long prompt pressure by adding bounded opt-TTFT prefill scheduling for the PyTorch engine.

This keeps decode protected, allows a bounded number of short/normal prefill turns between long-context chunks, and schedules one long-work turn after the quota is reached. Waiting long prefills use a size-aware long-lane policy with aging so smaller long prompts are not blocked by extreme outliers forever, while old huge prompts can still make progress.
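
To illustrate the bounded interleaving described above, here is a minimal sketch rather than the PR's implementation. It assumes each short/normal prefill takes one turn and each long prompt advances one chunk per long turn; decode protection is not modeled. The function name `plan_prefill_turns` and its parameters are hypothetical.

```python
from collections import deque


def plan_prefill_turns(short_queue, long_chunks, short_turns):
    """Return the turn order under a bounded short-turn quota.

    Hypothetical model: `short_turns` is the number of short/normal prefill
    turns allowed between two long-chunk turns.
    """
    short_queue = deque(short_queue)
    long_chunks = deque(long_chunks)
    order = []
    used = 0  # short turns spent since the last long turn
    while short_queue or long_chunks:
        quota_reached = used >= short_turns
        # Take a long turn when the quota is spent or nothing short is waiting.
        if long_chunks and (quota_reached or not short_queue):
            order.append(long_chunks.popleft())
            used = 0
        else:
            order.append(short_queue.popleft())
            used += 1
    return order


turns = plan_prefill_turns(['s1', 's2', 's3', 's4', 's5'], ['L0', 'L1', 'L2'], short_turns=2)
print(turns)
```

With a quota of 2, at most two short prefills run between consecutive long chunks, so the long prompt keeps making progress while short prompts are not stuck behind it.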

Changes

- Add bounded short/normal prefill turns between long-context chunks.
- Add size-aware long-lane selection with aging.
- Add private env controls:
  - LMDEPLOY_PT_TTFT_POLICY=size|fifo
  - LMDEPLOY_PT_TTFT_SHORT_TURNS
  - LMDEPLOY_PT_TTFT_AGING_SEC
- Preserve ModelAgent and SpecAgent chunk carry across interleaved normal prefills.
- Add scheduler/input-maker/model-agent/spec-agent regression tests.
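
A rough sketch of how env controls like these could be read with validation and clamping. The helper names `env_choice` and `env_clamped_int`, the defaults, and the bounds are all assumptions for illustration; they are not the helpers or values in `lmdeploy/pytorch/envs.py`.

```python
import os


def env_choice(name, default, choices, environ=None):
    """Read a string env var and fall back to `default` if it is not a valid choice."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, default).strip().lower()
    return value if value in choices else default


def env_clamped_int(name, default, low, high, environ=None):
    """Read an integer env var, clamp it to [low, high], and ignore bad input."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return max(low, min(high, value))


# Placeholder defaults and bounds, chosen only for this example.
policy = env_choice('LMDEPLOY_PT_TTFT_POLICY', 'size', ('size', 'fifo'))
short_turns = env_clamped_int('LMDEPLOY_PT_TTFT_SHORT_TURNS', 2, 0, 8)
print(policy, short_turns)
```

Passing an explicit `environ` mapping makes the parsing testable without touching the process environment.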

Benchmark Notes

Known tradeoff: extremely large prompts can still dominate the global tail latency, and the >65536 bucket may regress because the policy intentionally favors smaller long prompts inside the long lane.

Other

Both InputsMakerAsync._make_forward_inputs and Scheduler._schedule_prefill have several code smells. A refactor of both will follow once this PR and the prefix caching PR are finished.

Copilot

Pull request overview

This PR improves TTFT for the PyTorch engine under mixed short/long prompt pressure by adding a bounded opt-TTFT prefill scheduling policy. It introduces a long-prefill “lane” with size-aware selection + aging, interleaves bounded short/normal prefills between long-context chunks, and preserves chunk carry state across interleaved prefills/decodes.

Changes:

- Add scheduler support for chunk-limited KV ownership (kv_token_limit), plus long-prefill ordering policies (size/fifo) with aging.
- Add bounded prefill interleaving logic in InputsMakerAsync, including reserving KV blocks per long chunk and making “pending long chunk” visible to the engine runnable gate.
- Add regression tests across scheduler, inputs maker, model agent, and spec agent to ensure correctness of chunk carry, gate rollbacks, and scheduling behavior.
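
The chunk carry rule (cleared only on the first chunk, consumed on the final chunk, untouched by interleaved non-chunk work) can be pictured with a toy state holder. This is a sketch under assumptions: the class `ChunkCarry`, the `kind` labels, and the `step` method are hypothetical and do not mirror the ModelAgent or SpecAgent code.

```python
class ChunkCarry:
    """Toy holder for metadata carried between chunks of one long prefill."""

    def __init__(self):
        self._metas = None

    def step(self, kind, metas=None):
        """Return the carry visible to this step and update the stored carry.

        kind: 'first_chunk', 'middle_chunk', 'final_chunk', or 'normal'.
        metas: output of this chunk to hand to the next chunk.
        """
        if kind == 'normal':
            # Interleaved decode or normal prefill neither reads nor clears carry.
            return None
        if kind == 'first_chunk':
            self._metas = None  # drop any stale carry from an earlier request
        carried = self._metas
        # The final chunk consumes the carry; other chunks store their output.
        self._metas = None if kind == 'final_chunk' else metas
        return carried


carry = ChunkCarry()
carry.step('first_chunk', metas='m0')
carry.step('normal')  # interleaved short prefill
print(carry.step('middle_chunk', metas='m1'))
```

The point of the sketch is that the `'normal'` branch returns early, so a short prefill scheduled between two chunks cannot wipe the state the next chunk depends on.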

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

Show a summary per file

| File | Description |
| --- | --- |
| `tests/pytorch/spec_decode/test_spec_agent.py` | Adds tests ensuring SpecModelAgent preserves/clears chunk carry correctly across decode/prefill interleaving and DP edge cases. |
| `tests/pytorch/paging/test_scheduler.py` | Adds extensive scheduler regression tests for prefill gates, prefix-hit rollback behavior, long-chunk allocation limits, and long-lane policy/aging. |
| `tests/pytorch/paging/test_block_manager.py` | Adds a window block manager test ensuring allocation respects `kv_token_limit` under sliding-window accounting. |
| `tests/pytorch/engine/test_model_agent.py` | Adds tests verifying chunk `model_metas` carry behavior across interleaved prefills and final-chunk consumption. |
| `tests/pytorch/engine/test_inputs_maker.py` | Adds tests for opt-TTFT env reading/clamping, runnable gating for pending long chunks, and bounded interleaving policy behavior. |
| `lmdeploy/pytorch/spec_decode/spec_agent.py` | Keeps chunk carry state across interleaved non-chunk decode/prefill (cleared only on first chunk; consumed on final). |
| `lmdeploy/pytorch/paging/seq_states/states.py` | Clears `kv_token_limit` when freeing a sequence to avoid leaking chunk-limited ownership across lifetimes. |
| `lmdeploy/pytorch/paging/scheduler.py` | Implements bounded prefill admission gates with tentative prefix-hit rollback, long-prefill lane ordering (size/fifo + aging), and chunk KV limiting/reservation. |
| `lmdeploy/pytorch/paging/block_trie.py` | Clamps trie allocation visibility to `kv_token_limit` so chunk-limited KV ownership doesn’t publish beyond the chunk. |
| `lmdeploy/pytorch/paging/block_manager/window_block_manager.py` | Fixes sliding-window required-block accounting with chunk limits (and clamps negative required blocks to 0). |
| `lmdeploy/pytorch/paging/block_manager/default_block_manager.py` | Applies `kv_token_limit` when computing required blocks so allocation can be bounded per chunk. |
| `lmdeploy/pytorch/messages.py` | Adds `SchedulerSequence.kv_token_limit` metadata used to bound temporary KV ownership for non-final long chunks. |
| `lmdeploy/pytorch/envs.py` | Adds opt-TTFT env parsing (`LMDEPLOY_PT_TTFT_POLICY`, `...SHORT_TURNS`, `...AGING_SEC`) and a generic `env_to_choice()` helper. |
| `lmdeploy/pytorch/engine/model_agent/agent.py` | Changes chunk-meta carry logic so only chunk inputs consume chunk state; interleaved normal prefills don’t clear it. |
| `lmdeploy/pytorch/engine/inputs_maker.py` | Adds bounded opt-TTFT prefill interleaving policy, long-chunk KV reservation, and richer module/class documentation. |
| `lmdeploy/pytorch/engine/engine_loop.py` | Extends runnable gating to include engine-local pending long-chunk work (not only scheduler queues). |
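
Several of the files above bound block allocation by `kv_token_limit`. The arithmetic can be sketched as below, assuming a simple non-windowed layout where blocks hold `block_size` tokens. The function `required_blocks` and its parameters are hypothetical, not the block manager's API.

```python
def required_blocks(num_tokens, allocated_blocks, block_size, kv_token_limit=None):
    """Blocks still needed for a sequence, optionally capped at a chunk limit.

    Hypothetical helper: `kv_token_limit` caps how many tokens of the sequence
    may own KV blocks for the current (non-final) chunk.
    """
    if kv_token_limit is not None:
        num_tokens = min(num_tokens, kv_token_limit)
    needed = -(-num_tokens // block_size)  # ceiling division
    # A sequence may already own more blocks than the capped count requires.
    return max(0, needed - allocated_blocks)


# A 100000-token prompt limited to an 8192-token chunk with 64-token blocks.
print(required_blocks(100000, 0, 64, kv_token_limit=8192))  # 128
```

Clamping at zero matters once a limit exists, because the capped requirement can fall below what the sequence already holds.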


+        def _split_waiting_by_prefill_kind(waiting: SeqList):
+            """Split waiting requests into normal/final and non-final long
+            prefill."""
+            normal_waiting: SeqList = []
+            long_waiting: SeqList = []
+            for seq in waiting:
+                if self._prefill_kv_token_limit(seq) is None:
+                    normal_waiting.append(seq)
+                else:
+                    long_waiting.append(seq)
+            return normal_waiting, long_waiting
+
+        def _sort_normal_prefills(waiting: SeqList):
+            return sorted(waiting, key=lambda seq: (self._prefill_admission_token_count(seq), seq.arrive_time))
+
+        def _sort_long_prefills_for_long_turn(waiting: SeqList):
+            if self._long_prefill_policy != 'size':
+                return waiting
+            now = time.perf_counter()
+            return sorted(waiting, key=lambda seq: self._long_prefill_priority_key(seq, now))
+
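
The body of `_long_prefill_priority_key` is not shown in the hunk above. One way a size-aware key with aging could work is sketched here, as an assumption rather than the PR's logic: sequences that have waited at least `aging_sec` are served first in arrival order, and the rest are ordered by size. The `Seq` dataclass, `long_priority_key`, `sort_long_lane`, and the 30-second threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    name: str
    num_tokens: int
    arrive_time: float


def long_priority_key(seq, now, aging_sec):
    """Sort key: aged sequences first (oldest first), then smallest first."""
    waited = now - seq.arrive_time
    if waited >= aging_sec:
        return (0, seq.arrive_time, seq.num_tokens)
    return (1, seq.num_tokens, seq.arrive_time)


def sort_long_lane(waiting, now, policy='size', aging_sec=30.0):
    """Order the long lane; any policy other than 'size' keeps the given order."""
    if policy != 'size':
        return list(waiting)
    return sorted(waiting, key=lambda s: long_priority_key(s, now, aging_sec))


waiting = [Seq('huge', 200000, 0.0), Seq('mid', 40000, 5.0), Seq('small', 20000, 8.0)]
print([s.name for s in sort_long_lane(waiting, now=10.0)])
print([s.name for s in sort_long_lane(waiting, now=32.0)])
```

In this sketch, small long prompts overtake the huge one while everything is fresh, but once the huge prompt has waited past the threshold it moves to the front, which matches the stated goal that old huge prompts can still make progress.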


grimoire added 16 commits June 22, 2026 12:22

first slice chunked prefill

95d9730

update dp ep mpt

7450288

force last chunk prefill

9c28988

fix lint

0323cba

milestone

be3b82f

fix long context

d2771d1

add comment and flags

c6e455d

fix

83281ac

fix

df2811b

fix

214ad98

remove clear

2d03862

fix

f5392fe

improve readability

4cf43bc

fix

91ad7e2

fix unfinished

37dd886

better readbility

4e18b3e

grimoire marked this pull request as ready for review June 24, 2026 03:36

Copilot AI review requested due to automatic review settings June 24, 2026 03:36

Copilot started reviewing on behalf of grimoire June 24, 2026 03:36

Copilot AI reviewed Jun 24, 2026

fix copilot review

7acf019

grimoire wants to merge 17 commits into InternLM:main from grimoire:opt-ttft-token-aware-prefill
