Refactor prefix caching for pytorch engine #4618
Merged
47 commits
bfd25d5
finish vlm
grimoire ec8e772
add context hash
grimoire a86ae37
early hash
grimoire 6d13d65
ssm prefix caching prefill
grimoire 9ca3a3b
decoding ssm
grimoire 96f6db4
refactor sequence
grimoire 1aa8773
add comment
grimoire 3866b42
enable when prefix_cache_decode_state_interval=0
grimoire 9da79a7
optimize copy state
grimoire 64991ff
better copy cache
grimoire a6c2dde
easy engine loop func
grimoire baef92a
more fix
grimoire 90ddc95
fix end states
grimoire 79838a4
update block trie
grimoire 22167ae
refactor block trie
grimoire e93959b
add hit rate metrics
grimoire 235b43d
add longbenchv2
grimoire efd2ffc
fix
grimoire 8c8a0ba
Merge branch 'main' into refactor-prefix-caching
grimoire ec728f7
add check and raise
grimoire d6abb5f
Merge branch 'main' of github.com:InternLM/lmdeploy into refactor-pre…
grimoire 145d5d5
Merge remote-tracking branch 'upstream/main' into refactor-prefix-cac…
grimoire 56e3b7d
Merge remote-tracking branch 'upstream/main' into refactor-prefix-cac…
grimoire 49f42e1
Fix Qwen3 Omni fake model dtype fixture
grimoire 4bbcf03
Merge remote-tracking branch 'upstream/main' into refactor-prefix-cac…
grimoire e1fadea
Merge remote-tracking branch 'upstream/main' into refactor-prefix-cac…
grimoire 240aae0
fix gdr kernel for tilelang>=0.1.9
grimoire 79e5706
add flag
grimoire 48ea0ea
Merge remote-tracking branch 'upstream/main' into refactor-prefix-cac…
grimoire 8f550e6
Merge branch 'fix-gdr-tilelang019' into refactor-prefix-caching
grimoire 9938afc
fix bugs
grimoire ff1f74a
Merge remote-tracking branch 'upstream/main' into refactor-prefix-cac…
grimoire 636b629
update comment
grimoire 292e9c9
Merge remote-tracking branch 'upstream/main' into refactor-prefix-cac…
grimoire 9d54645
remove init state
grimoire 6a96306
remove unrelated
grimoire 3546b7c
fix duplicate node
grimoire db1590e
fix reserve_decode_state_checkpoint_for_seq
grimoire 126c3d1
fix 27b
grimoire a0426f2
fix
grimoire b2979b0
Merge remote-tracking branch 'upstream/main' into refactor-prefix-cac…
grimoire 5ca1229
update
grimoire 8c0ff32
fix long chunk
grimoire 3979c8b
fix lint
grimoire 665fcc2
add acache_tokens
grimoire b13846f
fix metric
grimoire 3185fa1
Merge remote-tracking branch 'upstream/main' into refactor-prefix-cac…
grimoire
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -367,6 +367,17 @@ class PytorchEngineConfig: | |
| max_batch_size is always captured. | ||
| thread_safe: thread safe engine instance. | ||
| enable_prefix_caching: Enable token match and sharing caches. | ||
| prefix_cache_state_budget: Extra SSM state-cache slots budgeted for | ||
| prefix-cache checkpoints. 0 adds no extra slots, but SSM | ||
| checkpoints may still borrow idle runtime state slots. | ||
| prefix_cache_decode_state_interval: Token interval for SSM decode | ||
| state checkpoints. 0 disables decode-state checkpoint saves; prefill | ||
| and chunk checkpoints may still be saved. Keep 0 unless the workload | ||
| has long SSM decoding and repeated continuations that can reuse | ||
| decode checkpoints. Smaller positive values create more hit points | ||
| but use more checkpoint memory and copy work; larger values reduce | ||
| overhead but make decode-prefix hits less likely. Positive values | ||
| must be multiples of the cache block size. | ||
| device_type: The inference device type, options ['cuda'] | ||
| eager_mode: Enable "eager" mode or not | ||
| custom_module_map: nn module map customized by users. Once | ||
|
|
@@ -428,6 +439,8 @@ class PytorchEngineConfig: | |
| cudagraph_capture_batch_sizes: list[int] | None = None | ||
| thread_safe: bool = False | ||
| enable_prefix_caching: bool = False | ||
| prefix_cache_state_budget: int = 0 | ||
| prefix_cache_decode_state_interval: int = 0 | ||
| device_type: str = 'cuda' | ||
| eager_mode: bool = False | ||
| custom_module_map: dict[str, str] = None | ||
|
|
@@ -472,6 +485,8 @@ def __post_init__(self): | |
| assert self.max_prefill_token_num >= 0, \ | ||
| 'invalid max_prefill_token_num' | ||
| assert self.num_gpu_blocks >= 0, 'invalid num_gpu_blocks' | ||
| assert self.prefix_cache_state_budget >= 0, 'invalid prefix_cache_state_budget' | ||
| assert self.prefix_cache_decode_state_interval >= 0, 'invalid prefix_cache_decode_state_interval' | ||
| try: | ||
| self.quant_policy = QuantPolicy(self.quant_policy) | ||
| except ValueError as e: | ||
|
|
@@ -485,6 +500,9 @@ def __post_init__(self): | |
| (f'block_size must be >= kernel_block_size and an integer multiple ' | ||
| f'of kernel_block_size, but got block_size {self.block_size} ' | ||
| f'and kernel_block_size {self.kernel_block_size}') | ||
| if self.prefix_cache_decode_state_interval > 0: | ||
| assert self.prefix_cache_decode_state_interval % self.block_size == 0, ( | ||
| 'prefix_cache_decode_state_interval must be a multiple of block_size') | ||
|
Comment on lines +503 to +505 (Collaborator): Better to update the help information of prefix_cache_decode_state_interval. |
||
| if self.quant_policy > 0 and self.device_type not in ['cuda', 'ascend']: | ||
| assert False, \ | ||
| 'kv cache quantization only works for CUDA and ASCEND.' | ||
|
|
||
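The checks added to `__post_init__` above can be sketched in isolation. This is a minimal illustration, not the lmdeploy implementation: `PrefixCacheSettings` and `round_up_to_block` are hypothetical names, the default `block_size` of 64 is an assumption, and the sketch raises `ValueError` where the diff uses `assert`.

```python
from dataclasses import dataclass


@dataclass
class PrefixCacheSettings:
    """Hypothetical stand-in for the two new PytorchEngineConfig fields."""
    block_size: int = 64  # assumed value, for illustration only
    prefix_cache_state_budget: int = 0
    prefix_cache_decode_state_interval: int = 0

    def __post_init__(self):
        # Both values must be non-negative, mirroring the asserts in the diff.
        if self.prefix_cache_state_budget < 0:
            raise ValueError('invalid prefix_cache_state_budget')
        if self.prefix_cache_decode_state_interval < 0:
            raise ValueError('invalid prefix_cache_decode_state_interval')
        # 0 disables decode-state checkpoints; positive values must be
        # multiples of the cache block size.
        interval = self.prefix_cache_decode_state_interval
        if interval > 0 and interval % self.block_size != 0:
            raise ValueError(
                'prefix_cache_decode_state_interval must be a multiple of block_size')


def round_up_to_block(interval: int, block_size: int) -> int:
    """Round a desired positive interval up to the next multiple of block_size."""
    return -(-interval // block_size) * block_size
```

A caller who wants a checkpoint roughly every 1000 tokens could pass `round_up_to_block(1000, 64)` as the interval so that the multiple-of-block-size check passes.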
Please recommend values of prefix_cache_state_budget and prefix_cache_decode_state_interval for qwen3.5-35b/qwen3.5-35b-fp8/qwen3.5-397b/qwen3.5-397b-fp8
prefix_cache_state_budget is the number of extra states reserved for caching only; it can be set to the same value as max-batch-size if GPU memory is sufficient.
prefix_cache_decode_state_interval is the interval between checkpoint saves. It should be chosen based on the input distribution. If most session lengths are larger than 1024, you can use 1024 as the prefix_cache_decode_state_interval.
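The advice in this reply can be written down as a small heuristic. This is a rough sketch under stated assumptions, not part of the PR: `suggest_prefix_cache_settings` is a hypothetical helper, "most sessions" is read as "more than half", the default `block_size` of 64 is assumed, and falling back to 0 follows the docstring's "Keep 0 unless..." guidance.

```python
def suggest_prefix_cache_settings(session_lengths, max_batch_size,
                                  block_size=64, threshold=1024):
    """Hypothetical heuristic based on the reviewer reply above.

    session_lengths: observed token lengths of past sessions.
    max_batch_size: engine batch size; reused as the state budget,
        which assumes GPU memory is sufficient.
    """
    # Assumption: "most" means strictly more than half of the sessions.
    long_sessions = sum(1 for n in session_lengths if n > threshold)
    mostly_long = long_sessions * 2 > len(session_lengths)

    # Keep the interval at 0 (disabled) unless sessions are mostly long.
    interval = threshold if mostly_long else 0

    # Positive intervals must be multiples of the cache block size.
    if interval % block_size != 0:
        raise ValueError('interval must be a multiple of block_size')

    return {
        'prefix_cache_state_budget': max_batch_size,
        'prefix_cache_decode_state_interval': interval,
    }


settings = suggest_prefix_cache_settings([2048, 4096, 512], max_batch_size=128)
```

With an lmdeploy build that includes this PR, the resulting dictionary could presumably be passed as keyword arguments alongside `enable_prefix_caching=True` when constructing `PytorchEngineConfig`; the field names are taken from the diff above.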