
Interleave long-context prefill chunks with decode#4631

Open: grimoire wants to merge 4 commits into InternLM:main from grimoire:refactor-chunked-prefill


@grimoire grimoire commented May 28, 2026

Collaborator

requirement

Interleave long-context prefill chunks with decoding. Real prefix caching will be added in a future PR.

@grimoire grimoire force-pushed the refactor-chunked-prefill branch 3 times, most recently from 0f3284c to c76bae2 Compare June 8, 2026 04:11
@grimoire grimoire force-pushed the refactor-chunked-prefill branch from c76bae2 to a5cb8a7 Compare June 12, 2026 12:46
@grimoire grimoire force-pushed the refactor-chunked-prefill branch from a5cb8a7 to 24d5eea Compare June 16, 2026 08:32
@grimoire grimoire changed the title from "[WIP] Interleave long-context prefill chunks with decode" to "Interleave long-context prefill chunks with decode" Jun 17, 2026
@grimoire grimoire marked this pull request as ready for review June 17, 2026 06:23
Copilot AI review requested due to automatic review settings June 17, 2026 06:23

Copilot AI left a comment

Contributor


Pull request overview

This PR adds long-context chunk prefill interleaving with decode in the PyTorch engine loop, primarily by introducing a temporary KV allocation cap for non-final chunks and by updating the engine input policy to defer chunk forwards behind decode when appropriate. It also adjusts speculative decoding’s chunk carry behavior so interleaved decode does not disrupt pending chunk state, and adds targeted tests for these behaviors.
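The interleaving idea in the overview can be illustrated with a toy policy. This is only a sketch under stated assumptions: `pick_next_forward`, `simulate`, and the strict alternation rule are hypothetical and are not taken from `inputs_maker.py`; the only detail borrowed from the PR is that the engine tracks the kind of the last forward.

```python
from typing import List, Optional

CHUNK = "chunk"
DECODE = "decode"


def pick_next_forward(last_kind: Optional[str],
                      has_pending_chunk: bool,
                      has_decode: bool) -> Optional[str]:
    """Pick the next forward kind (hypothetical policy, not lmdeploy's)."""
    if has_pending_chunk and has_decode:
        # Both kinds of work are waiting: run decode right after a chunk
        # forward, otherwise run the next chunk.
        return DECODE if last_kind == CHUNK else CHUNK
    if has_pending_chunk:
        return CHUNK
    if has_decode:
        return DECODE
    return None


def simulate(num_chunks: int, decode_steps: int) -> List[str]:
    """Return the order of forwards for a fixed amount of pending work."""
    order: List[str] = []
    last: Optional[str] = None
    while num_chunks > 0 or decode_steps > 0:
        kind = pick_next_forward(last, num_chunks > 0, decode_steps > 0)
        if kind == CHUNK:
            num_chunks -= 1
        else:
            decode_steps -= 1
        order.append(kind)
        last = kind
    return order


print(simulate(3, 2))
```

With three pending chunks and two decode steps, the toy policy never runs two chunk forwards back to back while decode work is waiting. The real policy defers chunks "when appropriate", so it may apply additional conditions that this sketch does not model.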

Changes:

  • Add kv_token_limit to SchedulerSequence and apply it across block allocation and prefix-cache trie allocation to keep non-final long-context chunks from over-allocating KV.
  • Introduce scheduler support for reserving KV blocks chunk-by-chunk (reserve_long_context_chunk) and update the engine input maker to interleave decode between chunk forwards.
  • Add tests covering chunk/decode interleaving, chunk KV limiting (including sliding-window), and spec-agent chunk carry semantics.
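To make the effect of a KV token cap concrete, here is a minimal sketch of clamping a block requirement by a limit. The function name `blocks_to_allocate` and its parameters are assumptions made for illustration; only the idea of bounding allocation by `kv_token_limit` comes from the summary above, and the actual block managers may compute this differently.

```python
from typing import Optional


def blocks_to_allocate(num_tokens: int,
                       num_allocated_blocks: int,
                       block_size: int,
                       kv_token_limit: Optional[int] = None) -> int:
    """Number of extra KV blocks needed (hypothetical helper).

    When kv_token_limit is set, the sequence may only own KV for that many
    tokens, so the requirement is computed from the clamped token count.
    """
    if kv_token_limit is not None:
        num_tokens = min(num_tokens, kv_token_limit)
    # Ceiling division without floating point.
    needed = -(-num_tokens // block_size)
    # Never report a negative requirement if more blocks are already held.
    return max(0, needed - num_allocated_blocks)


# A 1000-token prompt with 64-token blocks, with and without a 256-token cap.
print(blocks_to_allocate(1000, 0, 64))
print(blocks_to_allocate(1000, 0, 64, kv_token_limit=256))
```

Under these assumptions, a non-final chunk capped at 256 tokens asks for 4 blocks instead of the 16 the whole prompt would need, which leaves the remainder available to decoding sequences.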

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| lmdeploy/pytorch/engine/inputs_maker.py | Implements the interleaving policy between long-context chunk forwards and decode; tracks last forward kind. |
| lmdeploy/pytorch/messages.py | Adds SchedulerSequence.kv_token_limit to represent temporary KV ownership bounds. |
| lmdeploy/pytorch/paging/block_manager/default_block_manager.py | Clamps required block computation by kv_token_limit. |
| lmdeploy/pytorch/paging/block_manager/window_block_manager.py | Updates sliding-window required-block math to remain valid under kv_token_limit. |
| lmdeploy/pytorch/paging/block_trie.py | Prevents prefix-cache trie allocation beyond kv_token_limit. |
| lmdeploy/pytorch/paging/scheduler.py | Applies chunk KV limits during prefill scheduling and adds reserve_long_context_chunk for incremental KV growth. |
| lmdeploy/pytorch/paging/seq_states/states.py | Clears kv_token_limit when freeing sequences. |
| lmdeploy/pytorch/spec_decode/spec_agent.py | Keeps long-chunk carry state across interleaved decode (and DP dummy placeholders). |
| tests/pytorch/engine/test_inputs_maker.py | Adds interleaving-policy tests and fakes needed to exercise it. |
| tests/pytorch/paging/test_block_manager.py | Adds coverage for kv_token_limit behavior under windowed allocation. |
| tests/pytorch/paging/test_scheduler.py | Adds coverage for decode growth reclaiming, chunk-limited prefill scheduling, and reservation behavior. |
| tests/pytorch/spec_decode/test_spec_agent.py | Adds coverage for chunk carry preservation/clearing across decode and prefill cases. |


Comment on lines +250 to +251
evictable = self.hanging + self.waiting
if not self.eviction_helper.evict_for_seq(seq, evictable, prealloc_size):
Comment on lines +250 to +257
evictable = self.hanging + self.waiting
if not self.eviction_helper.evict_for_seq(seq, evictable, prealloc_size):
    seq.kv_token_limit = old_kv_token_limit
    return False

self.block_manager.allocate(seq, prealloc_size)
self.block_trie.allocate(seq)
return True
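The excerpt above restores `seq.kv_token_limit` when eviction cannot free enough room. A minimal standalone sketch of that raise-then-roll-back pattern follows; `ToySeq`, `ToyPool`, and `reserve_chunk` are hypothetical stand-ins rather than lmdeploy classes, and the toy pool has no eviction at all, so it simply fails when it runs short of free blocks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToySeq:
    """Hypothetical sequence: blocks currently owned plus its KV token cap."""
    num_blocks: int = 0
    kv_token_limit: Optional[int] = None


class ToyPool:
    """Hypothetical fixed-size block pool (no eviction)."""

    def __init__(self, free_blocks: int, block_size: int) -> None:
        self.free_blocks = free_blocks
        self.block_size = block_size

    def reserve_chunk(self, seq: ToySeq, new_limit: int) -> bool:
        """Raise the cap to new_limit and grow the allocation to match.

        On failure the old cap is restored, so the sequence is left exactly
        as it was before the call.
        """
        old_limit = seq.kv_token_limit
        seq.kv_token_limit = new_limit
        needed = max(0, -(-new_limit // self.block_size) - seq.num_blocks)
        if needed > self.free_blocks:
            seq.kv_token_limit = old_limit  # roll back the tentative cap
            return False
        self.free_blocks -= needed
        seq.num_blocks += needed
        return True


pool = ToyPool(free_blocks=3, block_size=64)
seq = ToySeq()
print(pool.reserve_chunk(seq, 128), seq)
print(pool.reserve_chunk(seq, 320), seq)
```

The second call fails because growing to 320 tokens would need three more blocks while only one is free, and the sequence keeps its earlier 128-token cap and two blocks.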