Skip to content

[TRTLLM-11558][feat] Acceptance rate based speculation off in one model path#12905

Open
zheyuf wants to merge 1 commit into
NVIDIA:mainfrom
zheyuf:0408_AR_based
Open

[TRTLLM-11558][feat] Acceptance rate based speculation off in one model path#12905
zheyuf wants to merge 1 commit into
NVIDIA:mainfrom
zheyuf:0408_AR_based

Conversation

@zheyuf

@zheyuf zheyuf commented Apr 9, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • Updates
    • Enhanced speculative decoding gating mechanism to use acceptance rate-based metrics for improved decision-making on when to disable speculation.
    • Renamed speculative decoding configuration parameters: acceptance_windowacceptance_rate_window_size and acceptance_length_thresholdacceptance_rate_threshold.
    • New acceptance rate threshold now enforced between 0.0 and 1.0 for consistency.

Description

This PR ports AR-based speculation-off to the one-model path and removes the previous implementation from the two-model path.

Highlights

  • Users can configure acceptance_rate_window_size and acceptance_rate_threshold through the LLM API.
  • If the rolling acceptance rate over the recent window drops below acceptance_rate_threshold, speculative decoding is permanently disabled.
  • This is compatible with batch-size-based dynamic speculation, so both mechanisms can be enabled at the same time.

Main changes compared with the previous two-model-path implementation

  • The gate is now based on true acceptance rate, rather than acceptance length. This is important for compatibility with dynamic draft length, where the draft length can vary over time.
  • AR history now records one batch-level acceptance-rate sample per speculation-enabled iteration. Previously, it recorded one request-level average after each request completed.
  • Parameter naming is now clearer and more consistent.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@zheyuf zheyuf marked this pull request as ready for review April 9, 2026 23:35
@zheyuf zheyuf requested review from a team as code owners April 9, 2026 23:35
@zheyuf zheyuf requested review from dongxuy04, hchings, mikeiovine and ziyixiong-nv and removed request for dongxuy04 and ziyixiong-nv April 9, 2026 23:36
@coderabbitai

coderabbitai Bot commented Apr 9, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This change refactors speculative decoding acceptance-rate gating from per-request average-decoded-tokens tracking to batch-based true acceptance-rate sampling. Configuration parameters and gate API methods are renamed accordingly.

Changes

Cohort / File(s) Summary
Speculative Decoding Core
tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/_torch/speculative/speculation_gate.py
Refactored acceptance tracking from per-request record_avg_decoded(...) to batch-based record_acceptance_rate(...). Gate now records true acceptance-rate samples (acceptance_rate = total_accepted_tokens / total_draft_tokens) instead of derived metrics from average decoded tokens. Internal state tracking fields renamed (acceptance_historyacceptance_rate_history, etc.). Added proactive clearing of draft tokens when speculation is permanently disabled.
Configuration Schema
tensorrt_llm/llmapi/llm_args.py
Renamed acceptance window/threshold fields in DecodingBaseConfig: acceptance_windowacceptance_rate_window_size, acceptance_length_thresholdacceptance_rate_threshold. New threshold is constrained to [0.0, 1.0] and now represents true acceptance rate rather than acceptance length.
Test Suite
tests/unittest/_torch/speculative/test_spec_gate.py
Updated all speculation gate tests to use new record_acceptance_rate API instead of record_avg_decoded. Adjusted mock parameters, tracked fields, configuration references, and assertions to align with acceptance-rate semantics. Modified test configuration values for thresholds and cache settings.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: implementing acceptance rate-based speculation disabling in the one-model execution path.
Description check ✅ Passed The PR description explains the key changes and highlights but lacks detailed Test Coverage section and some checklist items are incomplete.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@zheyuf

zheyuf commented Apr 9, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1167-1171: Remove unused new_tokens_lens retrieval.

The new_tokens_lens variable is retrieved from sample_state.host and converted to a list, but it is never used in the method body. This appears to be leftover code from a previous implementation.

♻️ Proposed fix to remove dead code
         if (getattr(self.dist, 'has_pp', False)
                 and not self.dist.is_last_pp_rank):
             return False, None
-        new_tokens_lens = getattr(sample_state.host, 'new_tokens_lens', None)
-        if new_tokens_lens is None:
-            return False, None
-        new_tokens_lens_list = (new_tokens_lens.tolist() if hasattr(
-            new_tokens_lens, 'tolist') else list(new_tokens_lens))
         total_draft_tokens = 0
         total_accepted_tokens = 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 1167 - 1171,
Remove the dead retrieval and conversion of new_tokens_lens: delete the lines
that get new_tokens_lens from sample_state.host, the conditional that checks if
new_tokens_lens is None and returns (False, None), and the conversion to
new_tokens_lens_list (the references to new_tokens_lens, new_tokens_lens_list
and their tolist/list conversion). Ensure no other code in the enclosing
function (e.g., methods using sample_state.host) depends on new_tokens_lens
before removing these statements.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 1167-1171: Remove the dead retrieval and conversion of
new_tokens_lens: delete the lines that get new_tokens_lens from
sample_state.host, the conditional that checks if new_tokens_lens is None and
returns (False, None), and the conversion to new_tokens_lens_list (the
references to new_tokens_lens, new_tokens_lens_list and their tolist/list
conversion). Ensure no other code in the enclosing function (e.g., methods using
sample_state.host) depends on new_tokens_lens before removing these statements.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fcb3fcdc-4b58-4922-b217-d415972c41fe

📥 Commits

Reviewing files that changed from the base of the PR and between 2a0be45 and c12fe3b.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/speculative/speculation_gate.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/unittest/_torch/speculative/test_spec_gate.py

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #42591 [ run ] triggered by Bot. Commit: c12fe3b Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #42591 [ run ] completed with state SUCCESS. Commit: c12fe3b
/LLM/main/L0_MergeRequest_PR pipeline #33317 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@zheyuf

zheyuf commented Apr 11, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #42761 [ run ] triggered by Bot. Commit: cdda7c8 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #42761 [ run ] completed with state SUCCESS. Commit: cdda7c8
/LLM/main/L0_MergeRequest_PR pipeline #33438 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@zheyuf

zheyuf commented Apr 13, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #42966 [ run ] triggered by Bot. Commit: 79c0653 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #42966 [ run ] completed with state SUCCESS. Commit: 79c0653
/LLM/main/L0_MergeRequest_PR pipeline #33622 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@zheyuf

zheyuf commented Apr 14, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #43290 [ run ] triggered by Bot. Commit: ddd812e Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #43290 [ run ] completed with state FAILURE. Commit: ddd812e
/LLM/main/L0_MergeRequest_PR pipeline #33837 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@zheyuf zheyuf enabled auto-merge (squash) April 15, 2026 04:28
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53662 [ run ] triggered by Bot. Commit: 672132c Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53662 [ run ] completed with state SUCCESS. Commit: 672132c
/LLM/main/L0_MergeRequest_PR pipeline #42803 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@zheyuf

zheyuf commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53835 [ run ] triggered by Bot. Commit: 8098694 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53835 [ run ] completed with state SUCCESS. Commit: 8098694
/LLM/main/L0_MergeRequest_PR pipeline #42948 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@zheyuf

zheyuf commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53948 [ run ] triggered by Bot. Commit: eaa50b7 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53948 [ run ] completed with state FAILURE. Commit: eaa50b7
/LLM/main/L0_MergeRequest_PR pipeline #43038 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@zheyuf

zheyuf commented Jun 13, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54002 [ run ] triggered by Bot. Commit: 734d277 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54002 [ run ] completed with state SUCCESS. Commit: 734d277
/LLM/main/L0_MergeRequest_PR pipeline #43087 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@zheyuf

zheyuf commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55059 [ run ] triggered by Bot. Commit: a695708 Link to invocation

@zheyuf

zheyuf commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55070 [ run ] triggered by Bot. Commit: 9cd2fe6 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55059 [ run ] completed with state ABORTED. Commit: a695708

Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55070 [ run ] completed with state FAILURE. Commit: 9cd2fe6
/LLM/main/L0_MergeRequest_PR pipeline #44057 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@zheyuf

zheyuf commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55192 [ run ] triggered by Bot. Commit: d30d2e3 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55192 [ run ] completed with state SUCCESS. Commit: d30d2e3
/LLM/main/L0_MergeRequest_PR pipeline #44155 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Gate speculative decoding on a rolling per-request acceptance rate. When
the moving-average AR over acceptance_rate_window_size falls below
acceptance_rate_threshold, speculation is disabled for that request;
when it rises back above the threshold, speculation re-engages.

- Wire SpeculationGate through the three hot-path call sites (PP, non-
  overlap, overlap scheduler) and hoist the early-return so the
  no-op-when-disabled guarantee is visible in the diff.
- Rename _record_batch_acceptance_rate -> _update_batch_acceptance_rate
  and document the overlap-scheduler interaction.
- Add the acceptance_rate_window_size / acceptance_rate_threshold knobs
  to TorchLlmArgs / DecodingBaseConfig.
- Add unit tests under tests/unittest/_torch/speculative/test_spec_gate.py.

Squashed from:
  0e4906a AR based speculation off
  e099d11 fix CI
  ef245d7 Rename _record_batch_acceptance_rate and add overlap scheduler comment.
  955bd69 [None][perf] Make AR gate no-op explicit at call sites

Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
@zheyuf

zheyuf commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55277 [ run ] triggered by Bot. Commit: fc0be10 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55277 [ run ] completed with state FAILURE. Commit: fc0be10
/LLM/main/L0_MergeRequest_PR pipeline #44230 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants