fix prefix default on #784
Merged
Pull request overview
Fixes a prefix-caching bug where, when an entire prompt is fully cached, prefill ends up with zero tokens to forward and cannot produce logits for the next-token sampler. The fix forces the last full block to be recomputed when all blocks would otherwise be cache hits, and also removes the fp4x2-specific gating that disabled the prefix-cache path in MLA attention.
Changes:
- In `BlockManager.can_allocate` and `allocate`, force the final full block to be treated as a cache miss when every block is a cache hit, ensuring at least one token remains for prefill.
- In `attention_mla.forward_impl_server_mode`, drop the `kv_b_proj.weight.dtype != fp4x2` guard so the prefix-cache attention branch is taken whenever `attn_metadata.has_cached` is true.
- Update `tests/test_block_manager.py` to reflect the new expected `num_cached_tokens` values (4 instead of 8; 0 instead of 4) and rename a test accordingly.
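The last-block recompute rule can be illustrated with a small standalone sketch. This is not the code from `atom/model_engine/block_manager.py`: the function name `cached_prefix_tokens`, its parameters, and the block size of 4 are assumptions chosen so that the numbers match the updated test expectations (4 instead of 8; 0 instead of 4).

```python
from typing import Sequence


def cached_prefix_tokens(
    num_tokens: int, block_size: int, block_hits: Sequence[bool]
) -> int:
    """Return how many prompt tokens may be served from the prefix cache.

    Hypothetical helper: block_hits[i] says whether the i-th full block of
    the prompt was found in the cache. Only a leading run of hits counts,
    because a prefix cache cannot skip over a miss.
    """
    num_full_blocks = num_tokens // block_size

    # Count the leading run of cache hits among the full blocks.
    hits = 0
    for hit in block_hits[:num_full_blocks]:
        if not hit:
            break
        hits += 1

    cached = hits * block_size

    # If the cache would cover the entire prompt, prefill would have zero
    # tokens to forward and no logits for the sampler. Treat the final full
    # block as a miss so that it is recomputed.
    if hits > 0 and cached == num_tokens:
        cached -= block_size

    return cached


if __name__ == "__main__":
    # Two full blocks, both cached: only the first is reused.
    print(cached_prefix_tokens(8, 4, [True, True]))  # 4
    # One full block, cached: nothing is reused.
    print(cached_prefix_tokens(4, 4, [True]))  # 0
    # A trailing partial block already leaves a token to prefill.
    print(cached_prefix_tokens(9, 4, [True, True]))  # 8
```

In this sketch a prompt that ends in a partial block is left unchanged, since the uncached tail already gives prefill something to forward; whether the real `BlockManager` handles that case the same way is not stated in this overview.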
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| atom/model_engine/block_manager.py | Force last full block to recompute when prompt would be fully cached, in both can_allocate and allocate. |
| atom/model_ops/attention_mla.py | Remove fp4x2 dtype gating; use attn_metadata.has_cached directly to choose the prefix-cache attention path. |
| tests/test_block_manager.py | Adjust expected cache-hit counts and rename test_exact_block_size_fully_cached to reflect last-block recompute behavior. |
valarLip approved these changes on May 14, 2026