feat: TurboMind linear GDN prefix caching #4465
TurboMind now treats Qwen3.5 hybrid attention as two cache families:

- normal paged KV cache blocks for the full-attention layers
- periodic Gated DeltaNet (GDN) state checkpoints for the linear-attention layers

On a prefix hit, TurboMind restores both:

- the matched KV blocks
- a compatible GDN checkpoint at or before the matched boundary
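The restore path can be pictured with a small sketch (names like `PrefixHit` and `GdnCheckpoint` are illustrative, not TurboMind's actual types): reuse is bounded by the newest GDN checkpoint, since the linear-attention state cannot be rebuilt from KV blocks alone.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GdnCheckpoint:
    """GDN conv/recurrent state snapshot at a KV-block boundary (illustrative)."""
    block_boundary: int  # number of KV blocks covered by this snapshot

@dataclass
class PrefixHit:
    """What a hybrid prefix hit restores (illustrative names)."""
    matched_kv_blocks: int
    gdn_checkpoint: Optional[GdnCheckpoint]

def reusable_tokens(hit: PrefixHit, tokens_per_block: int) -> int:
    # Without a checkpoint the GDN layers must replay from scratch, so no
    # prefix tokens can be skipped; with one, reuse stops at the checkpoint
    # boundary even if more KV blocks matched.
    if hit.gdn_checkpoint is None:
        return 0
    return min(hit.matched_kv_blocks, hit.gdn_checkpoint.block_boundary) * tokens_per_block
```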
Key changes
New findings

The important new finding is that hybrid prefix caching must budget for three separate memory buckets:

- normal KV cache blocks
- live GDN state for active sequences
- the cached GDN checkpoint pool
Before this change, TurboMind effectively budgeted only KV blocks plus live GDN state. The cached GDN checkpoint pool was allocated lazily and was not included in the initial capacity estimate. For Qwen3.5-27B AWQ on
So while a single live GDN state is constant-size, dense GDN checkpointing grows linearly with cached prefix length and can significantly reduce available context capacity once it is included in the budget. `linear_prefix_cache_interval_blocks` controls how many KV blocks elapse between GDN snapshots. Example: with the default interval of 64, a GDN snapshot (37.9 MiB) is saved every 64 int8 KV blocks (65 MiB), a reasonable tradeoff between reuse granularity and memory footprint.

Real-model observations

Validated on

Observed hybrid prefix hit on repeated prompt:
Observed context-capacity impact when GDN checkpoint memory is included in the budget:
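As a rough cross-check, the magnitude of that impact follows from the figures quoted above (a 37.9 MiB snapshot per 64 int8 KV blocks holding 65 MiB); the sketch below is back-of-envelope arithmetic, not engine code:

```python
# Figures taken from the PR description (assumed representative):
SNAPSHOT_MIB = 37.9        # one GDN snapshot
KV_PER_INTERVAL_MIB = 65.0 # memory of the 64 int8 KV blocks it covers
INTERVAL_BLOCKS = 64       # default checkpoint interval

def effective_kv_budget(total_mib: float, interval_blocks: int = INTERVAL_BLOCKS) -> float:
    """KV memory actually usable once checkpoint overhead is budgeted.
    Overhead scales inversely with the interval: doubling the interval
    halves the number of snapshots per cached token."""
    snapshot_per_kv = (SNAPSHOT_MIB / KV_PER_INTERVAL_MIB) * (INTERVAL_BLOCKS / interval_blocks)
    return total_mib / (1.0 + snapshot_per_kv)

# ~58% overhead at interval 64; drops to ~29% at interval 128.
for interval in (64, 128, 256):
    print(interval, round(effective_kv_budget(1000.0, interval), 1))
```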
This showed the practical impact of the interval choice.

Runtime behavior
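Concretely, the interval determines how far back a hit must roll: only checkpointed boundaries are resumable, so KV blocks matched past the last boundary are re-prefilled. A sketch of the semantics (not the C++ implementation):

```python
def reusable_boundary(matched_blocks: int, interval: int) -> int:
    """Largest checkpointed KV-block boundary not exceeding the KV match."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return (matched_blocks // interval) * interval

def blocks_to_recompute(matched_blocks: int, interval: int) -> int:
    """KV blocks that matched but must be re-prefilled for the GDN layers."""
    return matched_blocks - reusable_boundary(matched_blocks, interval)

# With interval 64, a 150-block KV match resumes from block 128 and
# re-prefills the remaining 22 blocks.
```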
Validation
Pull request overview
Adds hybrid prefix caching support in TurboMind for Qwen3.5-style hybrid attention models by extending the existing KV prefix cache with periodic Gated DeltaNet (linear attention) state checkpoints, and wires the new option through Python config + CLI into the C++ engine.
Changes:
- Introduces `linear_prefix_cache_interval_blocks` across Python config/CLI and TurboMind engine params to control linear-attention checkpoint cadence.
- Extends TurboMind prefix-cache matching/caching to additionally capture and restore GDN conv/recurrent states at interval boundaries.
- Adds Python tests to validate config defaults/validation and API server forwarding behavior.
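The config defaults/validation these tests exercise can be mirrored with a minimal stand-in (the class below is illustrative, not the real `TurbomindEngineConfig` from `lmdeploy/messages.py`; the default of 64 is taken from the example in the description):

```python
from dataclasses import dataclass

@dataclass
class EngineConfigSketch:
    """Illustrative stand-in for the prefix-cache fields described in this PR."""
    enable_prefix_caching: bool = False
    linear_prefix_cache_interval_blocks: int = 64  # checkpoint cadence in KV blocks

    def __post_init__(self):
        # Mirrors the described validation: values < 1 are rejected.
        if self.linear_prefix_cache_interval_blocks < 1:
            raise ValueError("linear_prefix_cache_interval_blocks must be >= 1")
```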
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/test_lmdeploy/test_turbomind/test_engine_config.py | Adds tests for default/validation/override of the new interval config. |
| tests/test_lmdeploy/test_turbomind/test_api_server.py | Ensures API server forwards the new option into TurbomindEngineConfig and preserves default CUDA batch sizing behavior. |
| src/turbomind/turbomind.cc | Parses linear_prefix_cache_interval_blocks and removes the previous hard block on prefix caching with linear attention. |
| src/turbomind/models/llama/SequenceManager.h | Adds per-sequence pending checkpoint tensors/metadata and threads the interval into SequenceManager. |
| src/turbomind/models/llama/SequenceManager.cc | Budgets cache blocks considering checkpoint overhead; integrates trie verify hooks; restores linear states on prefix hits. |
| src/turbomind/models/llama/llama_params.h | Adds the new engine parameter to EngineParam. |
| src/turbomind/models/llama/GatedDeltaNetLayer.h | Adds capture staging buffers and bookkeeping for checkpoint capture during prefill. |
| src/turbomind/models/llama/GatedDeltaNetLayer.cc | Computes per-request capture counts, allocates staging opportunistically, and publishes captured checkpoint slices to sequences for caching. |
| src/turbomind/models/llama/gated_delta_net_kernels.h | Extends kernel launcher APIs to optionally write checkpoint captures. |
| src/turbomind/models/llama/gated_delta_net_kernels.cu | Implements conv/recurrent checkpoint capture paths and adds new overloads for the launchers. |
| src/turbomind/models/llama/BlockTrie.h | Extends trie nodes to optionally own a linear-state slot; returns a richer match result including linear checkpoint state. |
| src/turbomind/models/llama/BlockTrie.cc | Stores/retrieves linear checkpoint state in trie nodes and releases it when nodes are invalidated. |
| src/turbomind/engine/engine.cc | Passes the new interval into SequenceManager. |
| lmdeploy/messages.py | Adds config field, docs, and validation for linear_prefix_cache_interval_blocks. |
| lmdeploy/cli/utils.py | Adds --linear-prefix-cache-interval-blocks CLI option. |
| lmdeploy/cli/serve.py | Wires the CLI option into TurbomindEngineConfig for API server. |
| lmdeploy/cli/cli.py | Exposes the CLI option on the chat CLI path. |
This looks promising, prefix-caching is the last piece towards a working lmdeploy production deployment of Qwen3.5
…ntation; add related tests. This change enhances memory management for hybrid models by increasing the checkpoint interval, which may reduce memory usage but requires more recompute after prefix hits.
ad6573b to 8f1b581
Qwen3.5 Hybrid Prefix Caching in TurboMind
Summary
This change adds prefix caching support for Qwen3.5 hybrid-attention models in TurboMind.
The implementation keeps `quant_policy` scoped to KV cache quantization. GDN prefix checkpoints remain in the model/state dtypes in this version.

User-Facing Changes
- Adds `linear_prefix_cache_interval_blocks` to `TurbomindEngineConfig`
- Adds `--linear-prefix-cache-interval-blocks` to the TurboMind CLI surface
- The interval is measured in KV blocks; values < 1 are rejected by config validation

Runtime Design
Hybrid cache structure
Prefix matching
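The trie extension described for `BlockTrie` can be sketched as follows: each node may own an optional linear-state slot, and a match walk returns both the KV match depth and the deepest checkpoint seen (illustrative Python, not the C++ structures):

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # block hash -> child node
        self.linear_state = None  # optional GDN checkpoint slot

def insert(root, block_hashes, checkpoints):
    """checkpoints: {depth: state} -- attach states at checkpointed depths."""
    node = root
    for depth, h in enumerate(block_hashes, start=1):
        node = node.children.setdefault(h, TrieNode())
        if depth in checkpoints:
            node.linear_state = checkpoints[depth]

def match(root, block_hashes):
    """Return (matched KV blocks, deepest checkpoint state, its depth)."""
    node, matched = root, 0
    state, state_depth = None, 0
    for h in block_hashes:
        node = node.children.get(h)
        if node is None:
            break
        matched += 1
        if node.linear_state is not None:
            state, state_depth = node.linear_state, matched
    return matched, state, state_depth
```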
On a prefix hit:
Cache maintenance
Main Code Areas
- `lmdeploy/messages.py`
- `lmdeploy/cli/utils.py`
- `lmdeploy/cli/cli.py`
- `lmdeploy/cli/serve.py`
- `src/turbomind/turbomind.cc`
- `src/turbomind/models/llama/llama_params.h`
- `src/turbomind/engine/engine.cc`
- `src/turbomind/models/llama/BlockTrie.h`
- `src/turbomind/models/llama/BlockTrie.cc`
- `src/turbomind/models/llama/SequenceManager.h`
- `src/turbomind/models/llama/SequenceManager.cc`
- `src/turbomind/models/llama/GatedDeltaNetLayer.h`
- `src/turbomind/models/llama/GatedDeltaNetLayer.cc`
- `src/turbomind/models/llama/gated_delta_net_kernels.h`
- `src/turbomind/models/llama/gated_delta_net_kernels.cu`

Test Coverage Added
Python tests
- `tests/test_lmdeploy/test_turbomind/test_engine_config.py`
- `tests/test_lmdeploy/test_turbomind/test_api_server.py`
  - `api_server` forwards hybrid prefix-cache options into `TurbomindEngineConfig`
  - `api_server` uses the normal default CUDA `max_batch_size` when the user does not set one explicitly

Test commands run
Observed results
- `test_engine_config.py` + `test_api_server.py`: 5 passed
- `test_converter.py`: 5 passed
- `_turbomind` rebuilt successfully

Real-Model Validation
Model: `QuantTrio/Qwen3.5-27B-AWQ`

Command used:
Observed startup details:
- `max cached tokens: 533248`
- 8320 tokens

Observed hybrid prefix-cache hit on repeated request:
Request details:
- first request: `prompt_tokens=626, completion_tokens=24`
- repeated request: `prompt_tokens=626, completion_tokens=24`

This confirms that the second request reused both normal cached KV blocks and a compatible linear-attention checkpoint.
Notes
`quant_policy` remains KV-only in this PR.