
[feat][breaking] Enable prefix caching by default#741

Open
functionstackx wants to merge 4 commits into ROCm:main from functionstackx:enable-prefix-caching-by-default

Conversation

@functionstackx (Contributor) commented May 11, 2026

Summary

Closes #672.

  • Flip Config.enable_prefix_caching and EngineArgs.enable_prefix_caching defaults from False to True, aligning ATOM with every other major serving runtime.
  • Switch the CLI flag to argparse.BooleanOptionalAction so users can opt out with --no-enable-prefix-caching. Both the dashed (--enable-prefix-caching) and the legacy underscore (--enable_prefix_caching) forms remain accepted.
  • Auto-disable prefix caching (with a warning) in Config.__post_init__ when the speculative draft architecture is one of the MTP variants that does not support it (Qwen3NextMTPModel, Qwen3_5MTPModel). The model-side raise ValueError in each MTP __init__ stays as a defensive last-line check. This keeps --method mtp --model Qwen3-Next-* / Qwen3.5-* working out of the box under the new default.
  • Update docs/configuration_guide.md, docs/serving_benchmarking_guide.md, and docs/scheduling_kv_cache_guide.md to reflect the new default, the opt-out flag, and the MTP auto-fallback.
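The flag wiring described above can be sketched in isolation (a minimal standalone example, assuming Python 3.9+ for BooleanOptionalAction; the real flag lives in EngineArgs.add_cli_args, not in this snippet):

```python
import argparse

parser = argparse.ArgumentParser()
# Listing both spellings on one action makes argparse accept four explicit
# forms: --enable-prefix-caching / --enable_prefix_caching plus the --no-
# variants that BooleanOptionalAction generates for each long option.
# The dest is derived from the first long option: enable_prefix_caching.
parser.add_argument(
    "--enable-prefix-caching",
    "--enable_prefix_caching",
    action=argparse.BooleanOptionalAction,
    default=True,  # the new default: prefix caching on
)

assert parser.parse_args([]).enable_prefix_caching is True
assert parser.parse_args(["--no-enable_prefix_caching"]).enable_prefix_caching is False
```

Bare (no flag) yields the default True; either --no- spelling flips it to False.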

Notes

  • The plugin (sgl OOT) path at atom/plugin/config.py:260 is intentionally unchanged — it continues to force enable_prefix_caching=False because sglang manages its own prefix caching. The vLLM plugin path at atom/plugin/config.py:133 passes through the user-provided value, unchanged.
  • MTP architectures that reject prefix caching (kept as defensive raise ValueError for direct model construction):
    • Qwen3NextMTP — atom/models/qwen3_next_mtp.py:131-132
    • Qwen3_5MTP — atom/models/qwen3_5_mtp.py:139-140
  • DeepSeekMTP and MiMoV2FlashMTP do not have the assertion, so they are correctly excluded from _MTP_NO_PREFIX_CACHE and run with prefix caching enabled.
  • This is technically a behavior change for anyone who relied on the old False default (e.g. benchmark scripts comparing on/off). The MTP auto-fallback avoids breakage of the documented happy path, but please flag it in the release notes.

Test plan

  • Local block-manager tests pass (pytest tests/test_block_manager.py, 31 passed).
  • Local CLI-parsing regression test added at tests/test_arg_utils.py and passing — covers all five forms: bare, --enable-prefix-caching, --enable_prefix_caching, --no-enable-prefix-caching, --no-enable_prefix_caching.
  • tests/conftest.py MockConfig.defaults flipped to enable_prefix_caching=True to match production; all existing tests still pass.
  • CI on the full suite (requires AITER / transformers / GPU runners, which are not available locally).
  • Sanity-check a Qwen serving run with default-on prefix caching.
  • Sanity-check --method mtp --model Qwen3-Next-* and --method mtp --model Qwen3.5-* boot without --no-enable-prefix-caching and log the auto-disable warning.

🤖 Generated with Claude Code

Closes ROCm#672. Prefix caching now defaults to on for both the `Config`
dataclass and the `EngineArgs` CLI surface, matching the default of every
other major serving runtime. The CLI flag uses argparse's
`BooleanOptionalAction`, so users can opt out with
`--no-enable-prefix-caching` (or the underscore alias). The original
`--enable_prefix_caching` form is preserved for backward compatibility.

Note: `Qwen3NextMTP` asserts `enable_prefix_caching=False`, so users
serving that model must pass `--no-enable-prefix-caching` explicitly.
The plugin (sgl OOT) path is unchanged and continues to force-disable
prefix caching since sgl manages its own.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx changed the title from [feat] Enable prefix caching by default to [bug fix] Enable prefix caching by default on May 11, 2026
@valarLip (Collaborator)

oh thanks

@ChuanLi1101 (Collaborator) commented May 12, 2026

Thanks for the contribution. Direction LGTM — defaulting prefix caching on and adding BooleanOptionalAction for opt-out is in line with vLLM/SGLang/TensorRT-LLM. The CLI parsing is correct (I verified all five forms locally: bare, --enable-prefix-caching, --enable_prefix_caching, --no-enable-prefix-caching, --no-enable_prefix_caching). A few things to address before merge:


Must-fix

1. You missed Qwen3_5MTP — same assertion as Qwen3NextMTP

atom/models/qwen3_5_mtp.py:139-140 raises the exact same ValueError as qwen3_next_mtp.py:131-132:

# atom/models/qwen3_5_mtp.py
if atom_config.enable_prefix_caching:
    raise ValueError("Qwen3_5MTP currently does not support prefix caching")

A quick rg "enable_prefix_caching" atom/models/ would have caught this.

Fix-then-sweep: after fixing a bug, immediately grep for the same pattern across the codebase and fix all occurrences in one pass.

Please update the PR description, release-note, and docs to mention both Qwen3NextMTP and Qwen3_5MTP.

2. Don't break MTP out-of-the-box — auto-fallback in Config.__post_init__

With the new default, --method mtp --model <Qwen3-Next-* | Qwen3.5-*> will crash on init after tokenizer + draft-model HF config are loaded. The CLAUDE.md quickstart command itself uses Qwen, so this regresses the documented happy path.

Either move the assertion earlier and improve the error, or (preferred) make Config auto-fallback when the user didn't explicitly opt in. Concretely, I'd add the following in Config.__post_init__ right after the existing speculative_config validation block (around atom/config.py:988):

# Auto-disable prefix caching for MTP draft architectures that don't
# support it. Keeps `enable_prefix_caching=True` as a safe default
# without forcing users to remember `--no-enable-prefix-caching` for
# every Qwen3-Next-MTP / Qwen3.5-MTP run. The model-side assertion in
# qwen3_next_mtp.py / qwen3_5_mtp.py stays as a defensive last-line check.
_MTP_NO_PREFIX_CACHE = {"Qwen3NextMTPModel", "Qwen3_5MTPModel"}
if self.speculative_config is not None and self.enable_prefix_caching:
    draft_arch = (
        getattr(self.speculative_config.draft_model_hf_config, "architectures", None)
        or [""]
    )[0]
    if draft_arch in _MTP_NO_PREFIX_CACHE:
        logger.warning(
            f"{draft_arch} does not support prefix caching; auto-disabling "
            "for this run. Pass --no-enable-prefix-caching explicitly to "
            "silence this warning."
        )
        self.enable_prefix_caching = False

Notes on this patch:

  • SpeculativeConfig.__post_init__ already populates draft_model_hf_config and rewrites the architecture via _MTP_TYPE_MAP / hf_config_override (atom/config.py:749-757, 780-826), so by the time Config.__post_init__ runs the architecture string is already canonical (Qwen3NextMTPModel / Qwen3_5MTPModel).
  • DeepSeekMTPModel and MiMoV2FlashMTPModel don't have the assertion, so they're correctly excluded from the set.
  • Leave the existing raise ValueError(...) in both MTP __init__s in place as a defensive check for anyone who constructs the model class directly.

If you'd rather keep the hard-fail, then this PR needs a clear BREAKING CHANGE banner in the title, description, and release notes listing the affected architectures.

3. PR title is [bug fix] — this is a behavior change

[feat] or [breaking] would be more accurate. Even with the auto-fallback in #2, users who relied on the old False default (e.g. benchmark scripts comparing prefix-cache on/off) will see different behavior. Worth flagging in the release log.


Nice-to-have

4. Test mock default diverges from production

tests/conftest.py:118 MockConfig.defaults still has enable_prefix_caching=False. Not a bug — every existing test overrides it explicitly — but it's a footgun for whoever writes the next test. Either flip it to True to match production, or add a one-line comment explaining the intentional divergence.

5. Add a CLI parsing regression test

You wrote "Confirmed CLI parsing" in the test plan but didn't add a test to the suite. The double-alias + BooleanOptionalAction combo is fragile (it's a less-trodden corner of argparse). A tiny tests/test_arg_utils.py covering the five forms would prevent silent regressions:

import argparse
import pytest
from atom.model_engine.arg_utils import EngineArgs


@pytest.mark.parametrize(
    "argv, expected",
    [
        ([], True),
        (["--enable-prefix-caching"], True),
        (["--enable_prefix_caching"], True),
        (["--no-enable-prefix-caching"], False),
        (["--no-enable_prefix_caching"], False),
    ],
)
def test_enable_prefix_caching_cli(argv, expected):
    parser = argparse.ArgumentParser()
    EngineArgs.add_cli_args(parser)
    ns = parser.parse_args(argv)
    assert ns.enable_prefix_caching is expected

6. PR description line numbers are stale

Qwen3NextMTP asserts enable_prefix_caching=False (atom/models/qwen3_next_mtp.py:115-117)

Lines 115-117 in the current file are inside remap_mtp_weight_name. The actual assertion is at :131-132. (Looks like generated content that wasn't re-verified against main.) Please fix the line ranges so future readers don't get sent to the wrong spot.

7. Default value duplicated in two places

EngineArgs.enable_prefix_caching: bool = True (dataclass field) and default=True in add_cli_args must be kept in sync by hand. The rest of the file follows the same pattern, so this is not something to fix in this PR — just flagging that an additional default=cls.enable_prefix_caching style pass would be a nice cleanup later.
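The cleanup hinted at above could look like this (a hypothetical sketch, not the real arg_utils code; the class and method names mirror the PR's but the body is illustrative):

```python
import argparse
from dataclasses import dataclass

@dataclass
class EngineArgs:
    enable_prefix_caching: bool = True

    @classmethod
    def add_cli_args(cls, parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
        parser.add_argument(
            "--enable-prefix-caching",
            "--enable_prefix_caching",
            action=argparse.BooleanOptionalAction,
            # single source of truth: the CLI default is read off the
            # dataclass field instead of being hand-duplicated as True
            default=cls.enable_prefix_caching,
        )
        return parser

parser = EngineArgs.add_cli_args(argparse.ArgumentParser())
assert parser.parse_args([]).enable_prefix_caching is True
```

With this pattern, flipping the dataclass default automatically flips the CLI default too.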


What's already correct (no changes needed)

  • Plugin paths — vLLM (atom/plugin/config.py:133) passes through vllm_cache_config.enable_prefix_caching; SGLang (atom/plugin/config.py:260) hard-codes False. Both untouched, both correct, and your PR description correctly explains why.
  • All three docs files (configuration_guide.md, scheduling_kv_cache_guide.md, serving_benchmarking_guide.md) updated consistently.
  • EngineArgs.from_cli_args doesn't need any change — BooleanOptionalAction derives dest=enable_prefix_caching from the first long option, which already matches the dataclass field name.

Once #1–#3 are addressed, happy to re-review.

@functionstackx (Contributor, Author)

thanks @ChuanLi1101 for the review! will have opus4.7 look at your comments & implement them.

why do Qwen3NextMTPModel and Qwen3_5MTPModel not work with prefix caching? this is very strange behavior. nevertheless, will have my claude implement your suggestion

- Config.__post_init__ auto-disables enable_prefix_caching with a
  warning when the speculative draft architecture is Qwen3NextMTPModel
  or Qwen3_5MTPModel. Both models still assert defensively in __init__.
- Flip tests/conftest.py MockConfig default to match the new production
  default (enable_prefix_caching=True).
- Add tests/test_arg_utils.py covering all five CLI parsing forms for
  the BooleanOptionalAction + double-alias combo.
- docs/scheduling_kv_cache_guide.md: document the MTP auto-fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx changed the title from [bug fix] Enable prefix caching by default to [feat] Enable prefix caching by default on May 12, 2026
@functionstackx changed the title from [feat] Enable prefix caching by default to [feat][breaking] Enable prefix caching by default on May 12, 2026

functionstackx commented May 12, 2026

@ChuanLi1101 thanks for the thorough review — addressed in c738556:

Must-fix

1. Qwen3_5MTP — Fixed. _MTP_NO_PREFIX_CACHE = {"Qwen3NextMTPModel", "Qwen3_5MTPModel"}. PR description rewritten to list both with correct line numbers (qwen3_next_mtp.py:131-132, qwen3_5_mtp.py:139-140). Doc note in docs/scheduling_kv_cache_guide.md §4.5 names both. DeepSeekMTP and MiMoV2FlashMTP correctly excluded (verified neither has the assertion).

2. Auto-fallback in Config.__post_init__ — Implemented as suggested, inside the existing speculative_config validation block. Pulls draft_model_hf_config.architectures[0] (canonical by that point per SpeculativeConfig.__post_init__'s hf_config_override handling), warns, and sets enable_prefix_caching = False. Model-side raise ValueError kept in both MTPs as a defensive last-line check.

3. PR title — Updated to [feat][breaking] Enable prefix caching by default. Added a behavior-change note to the PR description for the release log (no CHANGELOG file in the repo, so the PR body is the release note surface).

Nice-to-have

4. MockConfig default — Flipped to True to match production. All existing tests still pass (turns out a handful didn't override it explicitly, but they were prefix-caching-agnostic and still pass under the new default).

5. CLI regression test — Added tests/test_arg_utils.py with your exact parametrize block, covering all five forms. Required extending the conftest stubs (LLMEngine, CompilationConfig, SpeculativeConfig) since arg_utils imports them at module load. All five pass.

6. Stale line numbers in PR description — Refreshed.

7. Default duplicated in two places — Noted as out of scope per your call.

One callout from my side: my local branch was behind main when I first read your comment, so I initially thought atom/models/qwen3_5_mtp.py didn't exist. A git fetch would have shown me what rg would have shown me. My miss — your sweep rule applies to me too.

@valarLip (Collaborator)

i would like to fix it instead of workaround for special cases.. @jiayyu

@functionstackx (Contributor, Author)

i would like to fix it instead of workaround for special cases.. @jiayyu

hi @valarLip feel free to take over enabling prefix caching by default! would appreciate your help.

@ChuanLi1101 (Collaborator) commented May 13, 2026

@functionstackx Verified c7385562 — all five review items landed correctly:

  • _MTP_NO_PREFIX_CACHE = {"Qwen3NextMTPModel", "Qwen3_5MTPModel"} defined at module scope above SpeculativeConfig (good placement).
  • Auto-fallback nested inside the existing if self.speculative_config is not None: block — cleaner than my original sketch (avoids re-checking speculative_config twice).
  • Defensive raise ValueError kept in both MTP __init__s — correct.
  • tests/conftest.py MockConfig flipped to True, LLMEngine / CompilationConfig / SpeculativeConfig stubbed so EngineArgs import works in the test stub environment.
  • tests/test_arg_utils.py covers all five CLI forms.

PR description, doc note (§4.5), and PR title all updated. From a review-mechanics standpoint this is good to go.


@valarLip @jiayyu — leaving the call on workaround-vs-root-cause to you. Sharing one piece of context I happened to come across while reviewing, in case it's useful for the decision:

| Draft model | Assertion in __init__? | Origin |
| --- | --- | --- |
| DeepSeekMTP | no | |
| MiMoV2FlashMTP | no | |
| Qwen3NextMTP | raises ValueError | added in #239 (initial impl) |
| Qwen3_5MTP | raises ValueError | added in #600 (initial impl) |

All four MTPs share the same architectural shape (take target_hidden_states, concat with inputs_embeds, project through fc, run a single transformer layer), but only the two Qwen ones assert. The assertions were added at initial-merge time by the original authors of those modules, not as later patches, so the underlying issue is presumably known to whoever wrote them but never written down in the codebase.

If you'd prefer to root-cause the Qwen MTP issue first, that's totally fine — the BooleanOptionalAction / --no-enable-prefix-caching plumbing in this PR is independently useful and could land on its own if you'd rather decouple the default flip from the MTP fix. Or if you'd prefer the workaround-now path, happy to file a tracking issue for the Qwen MTP root cause so it doesn't get lost. Whichever you and @jiayyu prefer.

@jiayyu (Contributor) commented May 13, 2026

Some failed cases have been fixed in this PR, and prefix caching for DeepSeek V4 will be enabled later.



Successfully merging this pull request may close these issues.

[Issue]: prefix caching should be enabled by default
