[TRTLLM-12958][feat] Enable speculative decoding for dis-agg gen-only requests#14546

Merged
bo-nv merged 4 commits into NVIDIA:main from bo-nv:trtllm-12958 on Jun 11, 2026

bo-nv (Collaborator) commented May 26, 2026

Summary by CodeRabbit

  • Bug Fixes
    • Improved compatibility for disaggregated execution with varying peer layer configurations.
    • Enhanced handling of speculative decoding batch processing when draft tokens are unavailable, ensuring uniform input shapes across batch operations.
    • Increased robustness of generation in disaggregated execution scenarios.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@bo-nv bo-nv self-assigned this May 26, 2026
@bo-nv bo-nv requested review from a team as code owners May 26, 2026 04:17
@bo-nv bo-nv marked this pull request as draft May 26, 2026 04:17
coderabbitai Bot (Contributor) commented May 26, 2026

📝 Walkthrough

This PR enables uniform draft token handling across disaggregated CTX and GEN peers by relaxing peer compatibility checks, padding empty draft token lists, and pre-filling missing context draft tokens in the executor to ensure consistent speculative decoding input shapes.

Changes

Uniform draft token handling in disaggregated generation

Layer: Peer compatibility, transmission unpacking, and executor draft setup
File(s):
  • tensorrt_llm/_torch/disaggregation/native/peer.py
  • tensorrt_llm/_torch/disaggregation/native/transfer.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
Summary:
  • PeerRegistrar._check_peer_compatible logs an info message and allows partial layer transfer on a layer-count mismatch instead of failing.
  • RxSession.unpack_aux pads empty draft_tokens with zeros to match _max_draft_len.
  • PyExecutor._prepare_disagg_gen_transmission_complete pre-fills missing ctx_draft_tokens with dummy tokens when speculative decoding is enabled, ensuring GEN receives uniform draft-length inputs.
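
The padding step in the summary above can be pictured with a minimal sketch. It is not the code from transfer.py or py_executor.py: `pad_draft_tokens` and `pad_id` are hypothetical names, and padding a short (rather than only an empty) list is an assumption made for illustration.

```python
from typing import List, Optional


def pad_draft_tokens(
    draft_tokens: Optional[List[int]],
    max_draft_len: int,
    pad_id: int = 0,  # hypothetical: the summary only says "zeros" / "dummy tokens"
) -> List[int]:
    """Return a draft-token list with exactly max_draft_len entries."""
    tokens = list(draft_tokens or [])
    if len(tokens) > max_draft_len:
        raise ValueError(
            f"got {len(tokens)} draft tokens, expected at most {max_draft_len}"
        )
    # Fill the missing positions so every request in a batch has the same shape.
    return tokens + [pad_id] * (max_draft_len - len(tokens))


print(pad_draft_tokens([], 3))      # empty list, as sent by a CTX peer without drafts
print(pad_draft_tokens([5, 6], 3))  # short list, padded on the right
```

With a uniform length per request, the generation side can batch requests that arrived with and without draft tokens.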

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Description check: ⚠️ Warning
    Explanation: The pull request description is largely empty, with only template placeholders and uncompleted sections.
    Resolution: Fill in the Description and Test Coverage sections with specific details about what changes are made and why. Provide at least one relevant test case safeguarding the changes.
  • Title check: ❓ Inconclusive
    Explanation: The title clearly references the JIRA ticket and feature type, but lacks a descriptive summary of what 'gen-only spec dec' enables or the actual changes made.
    Resolution: Expand the title to be more descriptive of the actual changes, such as '[TRTLLM-12958][feat] Allow speculative decoding in generation-only phase' or similar, to clarify the change for developers scanning history.

✅ Passed checks (3 passed)

  • Docstring Coverage: ✅ Passed
    Explanation: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Linked Issues check: ✅ Passed
    Explanation: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed
    Explanation: Check skipped because no linked issues were found for this pull request.

coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/disaggregation/native/peer.py`:
- Around line 100-107: The current unconditional allowance for differing layer
counts in PeerRegistrar can register incompatible peers; change the logic so you
compute the actual transferable overlap (e.g., derive overlapping layer
names/ids via set(self_layers) ∩ set(peer_layers) or by using the existing
pool_mapping logic to compute transferable_layers) and only proceed with the
partial-transfer path when that overlap is non-empty or the difference matches
an explicitly allowed MTP-only delta; otherwise log an error and raise/fail
compatibility instead of silently allowing registration (update the code around
PeerRegistrar and the pool_mapping check and the logger call that currently
emits "layer count differs ... allowing partial layer transfer").
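
A minimal sketch of the overlap check suggested in this comment, assuming layers are identified by integer ids. `transferable_layers` and `IncompatiblePeerError` are hypothetical names, not part of peer.py, and the "explicitly allowed MTP-only delta" branch is left out for brevity.

```python
from typing import Iterable, Set


class IncompatiblePeerError(RuntimeError):
    """Hypothetical error raised when two peers share no transferable layers."""


def transferable_layers(
    self_layers: Iterable[int], peer_layers: Iterable[int]
) -> Set[int]:
    """Return the layers both peers hold, or fail if there are none."""
    overlap = set(self_layers) & set(peer_layers)
    if not overlap:
        # Fail loudly instead of silently registering an incompatible peer.
        raise IncompatiblePeerError("peers share no layers; refusing to register")
    return overlap


# One peer holds an extra layer: partial transfer of the shared layers is allowed.
print(sorted(transferable_layers([0, 1, 2, 3], [0, 1, 2])))
```

Under this sketch a layer-count mismatch is accepted only when some shared layers remain to transfer; whether that matches the merged behavior is not confirmed by this thread.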

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 05dfc83a-4bcf-4d42-9a00-bb3a8e3859b2

📥 Commits

Reviewing files that changed from the base of the PR and between 526d7ee and 2b57dd9.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/disaggregation/native/peer.py
  • tensorrt_llm/_torch/disaggregation/native/transfer.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py

Comment thread tensorrt_llm/_torch/disaggregation/native/peer.py
bo-nv (Collaborator, Author) commented May 26, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

bo-nv (Collaborator, Author) commented May 27, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd (Collaborator) commented

PR_Github #50415 [ run ] triggered by Bot. Commit: 2cebbef

tensorrt-cicd (Collaborator) commented

PR_Github #50415 [ run ] completed with state SUCCESS. Commit: 2cebbef
/LLM/main/L0_MergeRequest_PR pipeline #39939 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

bo-nv (Collaborator, Author) commented Jun 2, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd (Collaborator) commented

PR_Github #51474 [ run ] triggered by Bot. Commit: a26e28d

tensorrt-cicd (Collaborator) commented

PR_Github #51474 [ run ] completed with state SUCCESS. Commit: a26e28d
/LLM/main/L0_MergeRequest_PR pipeline #40880 completed with status: 'FAILURE'

bo-nv (Collaborator, Author) commented Jun 2, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd (Collaborator) commented

PR_Github #51609 [ run ] triggered by Bot. Commit: a26e28d

tensorrt-cicd (Collaborator) commented

PR_Github #51609 [ run ] completed with state SUCCESS. Commit: a26e28d
/LLM/main/L0_MergeRequest_PR pipeline #40992 completed with status: 'FAILURE'

bo-nv (Collaborator, Author) commented Jun 3, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd (Collaborator) commented

PR_Github #51714 [ run ] triggered by Bot. Commit: a26e28d

tensorrt-cicd (Collaborator) commented

PR_Github #51714 [ run ] completed with state SUCCESS. Commit: a26e28d
/LLM/main/L0_MergeRequest_PR pipeline #41090 completed with status: 'FAILURE'

bo-nv (Collaborator, Author) commented Jun 3, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd (Collaborator) commented

PR_Github #51783 [ run ] triggered by Bot. Commit: a26e28d

tensorrt-cicd (Collaborator) commented

PR_Github #51783 [ run ] completed with state SUCCESS. Commit: a26e28d
/LLM/main/L0_MergeRequest_PR pipeline #41149 completed with status: 'FAILURE'

bo-nv (Collaborator, Author) commented Jun 3, 2026

/bot run

tensorrt-cicd (Collaborator) commented

PR_Github #51823 [ run ] triggered by Bot. Commit: a26e28d

bo-nv (Collaborator, Author) commented Jun 9, 2026

/bot run

tensorrt-cicd (Collaborator) commented

PR_Github #53138 [ run ] triggered by Bot. Commit: 12f0890

tensorrt-cicd (Collaborator) commented

PR_Github #53138 [ run ] completed with state FAILURE. Commit: 12f0890
/LLM/main/L0_MergeRequest_PR pipeline #42341 completed with status: 'FAILURE'

bo-nv (Collaborator, Author) commented Jun 9, 2026

/bot run

tensorrt-cicd (Collaborator) commented

PR_Github #53149 [ run ] triggered by Bot. Commit: 12f0890

tensorrt-cicd (Collaborator) commented

PR_Github #53149 [ run ] completed with state FAILURE. Commit: 12f0890
/LLM/main/L0_MergeRequest_PR pipeline #42352 completed with status: 'FAILURE'

bo-nv (Collaborator, Author) commented Jun 9, 2026

/bot run

tensorrt-cicd (Collaborator) commented

PR_Github #53162 [ run ] triggered by Bot. Commit: 12f0890

bo-nv added 4 commits June 10, 2026 02:19
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
bo-nv (Collaborator, Author) commented Jun 10, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd (Collaborator) commented

PR_Github #53197 [ run ] triggered by Bot. Commit: 3ee277e

tensorrt-cicd (Collaborator) commented

PR_Github #53162 [ run ] completed with state ABORTED. Commit: 12f0890

@chuangz0 chuangz0 self-requested a review June 10, 2026 07:04
# Limit to prompt_len blocks, matching C++ cacheFormatter behavior.
# Extra blocks from num_extra_kv_tokens (speculative decoding) have
# uninitialized KV data and must not be transferred.
total_blocks = (req.prompt_len + tpb - 1) // tpb

Collaborator

Here, along with token_range, the intent is to transfer only the blocks that cover prompt_len. In practice, however, blocks for prompt_len + num_extra_kv_tokens are allocated.

If MTP is enabled for both the context phase and the generation phase, this change transfers only the prompt_len blocks, so any extra block that was allocated will not be transferred. The questions are:

  1. Will the KV cache for num_extra_kv_tokens be written to during the context phase?
  2. Will the KV cache written during the context phase be used by the generation phase?
  3. When both context and generation have MTP enabled, do we need to transfer prompt_len + num_extra_kv_tokens KV blocks?

cc @lfr-0531
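
As a worked illustration of the block arithmetic under discussion (it does not answer the questions above), the sketch below applies the same ceiling division as the quoted `total_blocks` line. The helper name `blocks_needed` and the numeric values are hypothetical.

```python
def blocks_needed(num_tokens: int, tokens_per_block: int) -> int:
    """Ceiling division: number of KV blocks required to hold num_tokens."""
    return (num_tokens + tokens_per_block - 1) // tokens_per_block


tpb = 32                # hypothetical tokens per block
num_extra_kv_tokens = 3  # hypothetical extra tokens reserved for speculative decoding

for prompt_len in (64, 100):
    transferred = blocks_needed(prompt_len, tpb)
    allocated = blocks_needed(prompt_len + num_extra_kv_tokens, tpb)
    print(prompt_len, transferred, allocated)
```

The two counts differ only when the extra tokens spill past a block boundary: with these values a 64-token prompt transfers 2 blocks while 3 are allocated, and a 100-token prompt needs 4 blocks either way.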

bo-nv (Collaborator, Author) commented Jun 10, 2026

@chuangz0 These changes fix the accuracy issue with py-transceiver + eagle3. Without them, accuracy drops even when eagle3 is enabled for both ctx and gen.

Collaborator Author

Merge first

tensorrt-cicd (Collaborator) commented

PR_Github #53197 [ run ] completed with state SUCCESS. Commit: 3ee277e
/LLM/main/L0_MergeRequest_PR pipeline #42394 completed with status: 'FAILURE'

bo-nv (Collaborator, Author) commented Jun 10, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd (Collaborator) commented

PR_Github #53286 [ run ] triggered by Bot. Commit: 3ee277e

tensorrt-cicd (Collaborator) commented

PR_Github #53286 [ run ] completed with state SUCCESS. Commit: 3ee277e
/LLM/main/L0_MergeRequest_PR pipeline #42476 completed with status: 'SUCCESS'

@bo-nv bo-nv enabled auto-merge (squash) June 11, 2026 02:47
@bo-nv bo-nv merged commit 228829c into NVIDIA:main Jun 11, 2026
8 checks passed
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py
Shixiaowei02 added a commit to Shixiaowei02/TensorRT-LLM that referenced this pull request Jun 11, 2026
ADEngine subclasses the abstract ModelEngine and does not run
PyTorchModelEngine.__init__, so it never set `enable_spec_decode`. After
NVIDIA#14546 added an unguarded `self.model_engine.enable_spec_decode` read in
`_prepare_disagg_gen_transmission_complete` (the disagg generation handoff
path that ADEngine traverses via NVIDIA#14057 AutoDeploy Basic Disagg Support),
AutoDeploy disaggregated runs crash with:
  AttributeError: 'ADEngine' object has no attribute 'enable_spec_decode'

NVIDIA#14546 and NVIDIA#14057 each passed CI independently but conflict semantically
once both are on main. Set `is_spec_decode`/`enable_spec_decode` in
ADEngine.__init__, mirroring PyTorchModelEngine
(enable_spec_decode == spec_config is not None), so ADEngine satisfies the
ModelEngine attribute contract that shared PyExecutor code relies on.

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
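
The failure mode in this commit message can be reproduced with a small, self-contained sketch. The classes below are stand-ins, not the real ModelEngine, PyTorchModelEngine, or ADEngine; `needs_draft_prefill` is a hypothetical function that mimics an unguarded attribute read in shared executor code.

```python
class ModelEngine:
    """Stand-in for the abstract base; it declares no attributes itself."""


class EngineWithoutFlag(ModelEngine):
    def __init__(self, spec_config=None):
        self.spec_config = spec_config  # enable_spec_decode is never set


class EngineWithFlag(ModelEngine):
    def __init__(self, spec_config=None):
        self.spec_config = spec_config
        # Mirrors the rule quoted above: enabled iff a spec config is present.
        self.is_spec_decode = spec_config is not None
        self.enable_spec_decode = self.is_spec_decode


def needs_draft_prefill(engine: ModelEngine) -> bool:
    # Unguarded read: raises AttributeError if the subclass never set the flag.
    return engine.enable_spec_decode


print(needs_draft_prefill(EngineWithFlag()))
print(needs_draft_prefill(EngineWithFlag(spec_config={"max_draft_len": 3})))
try:
    needs_draft_prefill(EngineWithoutFlag())
except AttributeError as err:
    print("AttributeError:", err)
```

Setting the attribute in every subclass constructor, as the commit describes, keeps the shared code unchanged; a `getattr(engine, "enable_spec_decode", False)` guard at the call site would be an alternative that this commit does not take.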
@Shixiaowei02 Shixiaowei02 changed the title from "[TRTLLM-12958][feat] Enable gen-only spec dec" to "[TRTLLM-12958][feat] Enable speculative decoding for dis-agg gen-only requests" Jun 29, 2026
7 participants