[feat][breaking] Enable prefix caching by default#741
Conversation
Closes ROCm#672. Prefix caching now defaults to on for both the `Config` dataclass and the `EngineArgs` CLI surface, matching the default of every other major serving runtime. The CLI flag uses argparse's `BooleanOptionalAction`, so users can opt out with `--no-enable-prefix-caching` (or the underscore alias). The original `--enable_prefix_caching` form is preserved for backward compatibility. Note: `Qwen3NextMTP` asserts `enable_prefix_caching=False`, so users serving that model must pass `--no-enable-prefix-caching` explicitly. The plugin (sgl OOT) path is unchanged and continues to force-disable prefix caching since sgl manages its own. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
oh thanks |
|
Thanks for the contribution. Direction LGTM — defaulting prefix caching on and adding Must-fix1. You missed
|
|
thanks @ChuanLi1101 for the review! will have opus4.7 look at ur comments & implement it. why does Qwen3NextMTPModel", "Qwen3_5MTPModel not work with prefix caching, this is an very strange behavior, never the less, will have my claude implement ur suggestion |
- Config.__post_init__ auto-disables enable_prefix_caching with a warning when the speculative draft architecture is Qwen3NextMTPModel or Qwen3_5MTPModel. Both models still assert defensively in __init__. - Flip tests/conftest.py MockConfig default to match the new production default (enable_prefix_caching=True). - Add tests/test_arg_utils.py covering all five CLI parsing forms for the BooleanOptionalAction + double-alias combo. - docs/scheduling_kv_cache_guide.md: document the MTP auto-fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@ChuanLi1101 thanks for the thorough review — addressed in c738556: Must-fix1. 2. Auto-fallback in 3. PR title — Updated to Nice-to-have4. 5. CLI regression test — Added 6. Stale line numbers in PR description — Refreshed. 7. Default duplicated in two places — Noted as out of scope per your call. One callout from my side: my local branch was behind |
|
i would like to fix it instead of workaround for special cases.. @jiayyu |
|
@functionstackx Verified
PR description, doc note (§4.5), and PR title all updated. From a review-mechanics standpoint this is good to go. @valarLip @jiayyu — leaving the call on workaround-vs-root-cause to you. Sharing one piece of context I happened to come across while reviewing, in case it's useful for the decision:
All four MTPs share the same architectural shape (take If you'd prefer to root-cause the Qwen MTP issue first, that's totally fine — the |
|
Some failed cases has been fixed in this pr and prefix cache of DeepSeek V4 will be enabled later |
Summary
Closes #672.
Config.enable_prefix_cachingandEngineArgs.enable_prefix_cachingdefaults fromFalsetoTrue, aligning ATOM with every other major serving runtime.argparse.BooleanOptionalActionso users can opt out with--no-enable-prefix-caching. Both the dashed (--enable-prefix-caching) and the legacy underscore (--enable_prefix_caching) forms remain accepted.Config.__post_init__when the speculative draft architecture is one of the MTP variants that does not support it (Qwen3NextMTPModel,Qwen3_5MTPModel). The model-sideraise ValueErrorin each MTP__init__stays as a defensive last-line check. This keeps--method mtp --model Qwen3-Next-*/Qwen3.5-*working out of the box under the new default.docs/configuration_guide.md,docs/serving_benchmarking_guide.md, anddocs/scheduling_kv_cache_guide.mdto reflect the new default, the opt-out flag, and the MTP auto-fallback.Notes
atom/plugin/config.py:260is intentionally unchanged — it continues to forceenable_prefix_caching=Falsebecause sglang manages its own prefix caching. The vLLM plugin path atatom/plugin/config.py:133passes through the user-provided value, unchanged.raise ValueErrorfor direct model construction):Qwen3NextMTP—atom/models/qwen3_next_mtp.py:131-132Qwen3_5MTP—atom/models/qwen3_5_mtp.py:139-140DeepSeekMTPandMiMoV2FlashMTPdo not have the assertion, so they are correctly excluded from_MTP_NO_PREFIX_CACHEand run with prefix caching enabled.Falsedefault (e.g. benchmark scripts comparing on/off). The MTP auto-fallback avoids breakage of the documented happy path, but please flag in the release notes.Test plan
pytest tests/test_block_manager.py, 31 passed).tests/test_arg_utils.pyand passing — covers all five forms: bare,--enable-prefix-caching,--enable_prefix_caching,--no-enable-prefix-caching,--no-enable_prefix_caching.tests/conftest.pyMockConfig.defaultsflipped toenable_prefix_caching=Trueto match production; all existing tests still pass.--method mtp --model Qwen3-Next-*and--method mtp --model Qwen3.5-*boot without--no-enable-prefix-cachingand log the auto-disable warning.🤖 Generated with Claude Code