[TRTLLM-12669][refactor] Eagle3 sampling: auto-detect greedy fast-path, mixed-batch rejection sampling, draft honors target params#14745

Open
zhaoyangwang-nvidia wants to merge 21 commits into
NVIDIA:mainfrom
zhaoyangwang-nvidia:TRTLLM-12669-remove-allow-advanced-sampling

Conversation

@zhaoyangwang-nvidia

@zhaoyangwang-nvidia zhaoyangwang-nvidia commented May 29, 2026

Collaborator

Replace the static config flag with a per-step uses_advanced_sampling flag that is auto-detected from the actual sampling params. The flag is included in the CUDA graph key, so two graph variants (argmax fast-path vs. advanced sampling kernel) are captured lazily and the matching one is replayed.

Summary by CodeRabbit

  • New Features

    • Speculative decoding now automatically detects all-greedy sampling for optimized performance.
    • Draft token generation enhanced to support per-request sampling parameters in advanced configurations.
  • Bug Fixes

    • Removed unnecessary fallback warnings in speculative decoding setup.
  • Documentation

    • Updated speculative decoding configuration examples and test settings with improved sampling support.

Description

Refactors Eagle3 one-model speculative decoding sampling. Four logical changes, sequenced as separate commits:

1. Replace allow_advanced_sampling with auto-detected is_all_greedy_sample

Removes the allow_advanced_sampling config flag from DecodingBaseConfig and replaces it with a per-step is_all_greedy_sample flag derived from the actual temperature / top_k / top_p of the requests in the batch. The flag is part of the CUDA graph cache key, so two graph variants (argmax fast-path vs. advanced-sampling kernel) are captured lazily, and the variant matching the batch composition is replayed.
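The detection and lazy dual capture can be sketched in plain Python. This is only an illustration under stated assumptions: `SamplingParams`, `LazyGraphCache`, and the greedy rule (no temperature, temperature 0, or top_k == 1) are hypothetical stand-ins, not the actual TensorRT-LLM classes or the exact rule used in this PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class SamplingParams:
    # Hypothetical container; only the field names come from the description.
    temperature: Optional[float] = None
    top_k: Optional[int] = None
    top_p: Optional[float] = None


def is_greedy(p: SamplingParams) -> bool:
    # Assumed rule: unset or zero temperature, or top_k == 1, reduces to argmax.
    return p.temperature is None or p.temperature == 0 or p.top_k == 1


def is_all_greedy_sample(batch: List[SamplingParams]) -> bool:
    return all(is_greedy(p) for p in batch)


GraphKey = Tuple[int, bool]  # (batch_size, is_all_greedy_sample)


class LazyGraphCache:
    """Captures a graph variant the first time its key is requested."""

    def __init__(self, capture: Callable[[GraphKey], str]):
        self._capture = capture
        self._graphs: Dict[GraphKey, str] = {}

    def get(self, batch: List[SamplingParams]) -> str:
        key = (len(batch), is_all_greedy_sample(batch))
        if key not in self._graphs:
            # First time this (batch_size, greedy?) combination is seen.
            self._graphs[key] = self._capture(key)
        return self._graphs[key]


captured: List[GraphKey] = []


def fake_capture(key: GraphKey) -> str:
    # Stand-in for real CUDA graph capture; records which variants were built.
    captured.append(key)
    return "argmax" if key[1] else "advanced"


cache = LazyGraphCache(fake_capture)
greedy_batch = [SamplingParams(), SamplingParams(temperature=0.7, top_k=1)]
mixed_batch = [SamplingParams(), SamplingParams(temperature=0.7, top_k=50, top_p=0.9)]
```

A single non-greedy request is enough to route the whole batch to the advanced variant, and each variant is captured at most once per batch size.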

2. Eagle3 drafter honors target's sampling params

Previously the Eagle3 draft model always ran greedy, regardless of the target's sampling configuration. This change propagates the target's temperature / top_k / top_p into the draft loop, so draft tokens are sampled under the same sampling parameters as the target and the stored draft probabilities describe the distribution the draft tokens were actually drawn from. This is a correctness prerequisite for non-greedy rejection sampling: the Leviathan acceptance test u * p_draft(x) < p_target(x) only holds when both probabilities come from the same conditioning.
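For reference, the acceptance test quoted above can be sketched for a single token position. This is a minimal pure-Python illustration of standard speculative rejection sampling, not the kernel used in this PR; the function names are hypothetical, and `p_draft` is assumed to be the distribution the draft token was actually drawn from.

```python
import random
from typing import List


def accept_or_resample(x: int, p_draft: List[float], p_target: List[float],
                       rng: random.Random) -> int:
    """Return draft token x if accepted, otherwise a token from the residual."""
    u = rng.random()  # uniform in [0, 1)
    # Acceptance test in the form quoted above: u * p_draft(x) < p_target(x).
    if u * p_draft[x] < p_target[x]:
        return x
    # Rejected: resample from the normalized residual max(0, p_target - p_draft).
    residual = [max(0.0, t - d) for t, d in zip(p_target, p_draft)]
    return rng.choices(range(len(residual)), weights=residual, k=1)[0]


def speculative_sample(p_draft: List[float], p_target: List[float],
                       rng: random.Random) -> int:
    # Draw the draft token from p_draft itself; the acceptance test is only
    # valid when p_draft really is the proposal distribution.
    x = rng.choices(range(len(p_draft)), weights=p_draft, k=1)[0]
    return accept_or_resample(x, p_draft, p_target, rng)
```

If the draft token were instead chosen by argmax while `p_draft` held the full softmax, the proposal and the recorded probabilities would disagree and the output would no longer follow `p_target`.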

3. Slot-indexed draft_probs to support mixed batches

Previously, _can_use_rejection_sampling bailed out whenever the batch contained context requests and fell back to exact-match for the whole batch. Root cause: draft_probs was indexed by batch position, but batch position is unstable across iterations (chunked prefill, finishing generation requests, and newly joining context requests all shift positions). Fix:

  • Reshape draft_probs from flat [total_draft_tokens, vocab] to slot-indexed [max_num_requests, max_draft_len, vocab], keyed by stable py_seq_slot.
  • Scatter on write (_compute_and_store_draft_probs), gather on read (_accept_draft_tokens) using a precomputed batch_slot_ids tensor.
  • Drop the num_contexts == 0 constraint in _can_use_rejection_sampling — ctx subset goes through _sample_tokens_for_batch, gen subset goes through the rejection kernel.
  • Reset draft_probs_valid = False whenever the draft loop writes no probs, so stale data is never read.
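The scatter-on-write / gather-on-read scheme in the list above can be sketched with nested lists standing in for the [max_num_requests, max_draft_len, vocab] tensor. The class and method names here are hypothetical; only the slot-indexed layout, batch_slot_ids, and the validity flag come from the description.

```python
from typing import List, Optional

Probs = List[List[float]]  # [max_draft_len][vocab] for one request


class SlotIndexedDraftProbs:
    def __init__(self, max_num_requests: int, max_draft_len: int, vocab: int):
        # One row per stable sequence slot, independent of batch position.
        self.buf: List[Probs] = [
            [[0.0] * vocab for _ in range(max_draft_len)]
            for _ in range(max_num_requests)
        ]
        self.valid = False

    def scatter(self, batch_slot_ids: List[int], probs: List[Probs]) -> None:
        # probs[i] belongs to the request at batch position i; store it by slot.
        for slot, rows in zip(batch_slot_ids, probs):
            self.buf[slot] = [list(step) for step in rows]
        # If the draft loop wrote nothing, mark the buffer as stale.
        self.valid = len(batch_slot_ids) > 0

    def gather(self, batch_slot_ids: List[int]) -> Optional[List[Probs]]:
        if not self.valid:
            return None  # caller should not use rejection sampling this step
        return [self.buf[slot] for slot in batch_slot_ids]
```

Because reads are keyed by slot, a request that moves from batch position 1 to position 0 between iterations still gets its own probabilities back.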

Mixed-batch rejection sampling gains ~18% system TPS (sys-tps) on llama70b at bs=32 compared with the exact-match fallback.


Test Coverage

Unit tests (B200)

tests/unittest/_torch/speculative/test_eagle3.py
  test_eagle3_cuda_graph_padding[True]  PASSED  130.54s
  test_eagle3_cuda_graph_padding[False] PASSED  103.53s

End-to-end correctness — Qwen3-8B (H100 SXM5 80G)

Greedy path verification (no temp / top_p / top_k → both paths take greedy branch):

  • baseline (83ec591, allow_advanced_sampling=False) vs new (d237690, auto-detected is_all_greedy_sample=True)
  • token-level identity: 80/80 prompts identical
  • total_output_tokens: 113,432 vs 113,432, Δ=0
  • mean_acceptance_rate: 0.4918 vs 0.4918, Δ=+0.00%
  • mean_acceptance_length: 2.475 vs 2.475, Δ=0
  • mean throughput: 224.19 vs 224.02 tok/s, Δ=-0.07%
  • wall clock: 505.99s vs 506.38s, Δ=+0.08%

Performance — rejection sampling ON vs OFF (non-greedy)

CUDA graph enabled. mtbench dataset. Sampling params: temperature=0.7, top_k=50, top_p=0.9.

Llama-3.3-70B-Instruct + EAGLE-3 (mean over 3 rounds)

| bs | TPS off | TPS on | ΔTPS | AR off | AR on | ΔAR |
|---|---|---|---|---|---|---|
| 32 | 3408.2 | 3111.3 | −8.70% | 0.3475 | 0.3581 | +3.04% |
| 16 | 2203.7 | 2069.3 | −6.09% | 0.3499 | 0.3568 | +1.96% |
| 8 | 1278.9 | 1232.4 | −3.58% | 0.3444 | 0.3586 | +4.15% |
| 4 | 704.1 | 694.9 | −1.29% | 0.3451 | 0.3585 | +3.88% |
| 2 | 390.2 | 384.3 | −1.51% | 0.3448 | 0.3547 | +2.89% |
| 1 | 204.2 | 203.4 | −0.43% | 0.3490 | 0.3561 | +2.05% |

Per-round detail (3 rounds, llama-70b)

| bs | round | TPS off | TPS on | ΔTPS | AR off | AR on | ΔAR |
|---|---|---|---|---|---|---|---|
| 32 | R1 | 3413.8 | 3113.9 | −8.78% | 0.3479 | 0.3579 | +2.88% |
| 32 | R2 | 3441.7 | 3111.2 | −9.60% | 0.3512 | 0.3584 | +2.04% |
| 32 | R3 | 3369.0 | 3108.9 | −7.72% | 0.3435 | 0.3579 | +4.19% |
| 16 | R1 | 2175.9 | 2078.4 | −4.48% | 0.3524 | 0.3559 | +0.99% |
| 16 | R2 | 2219.6 | 2101.3 | −5.33% | 0.3492 | 0.3591 | +2.82% |
| 16 | R3 | 2215.7 | 2028.2 | −8.46% | 0.3481 | 0.3553 | +2.08% |
| 8 | R1 | 1281.1 | 1230.6 | −3.94% | 0.3428 | 0.3586 | +4.62% |
| 8 | R2 | 1255.5 | 1266.1 | +0.85% | 0.3438 | 0.3613 | +5.10% |
| 8 | R3 | 1300.2 | 1200.6 | −7.66% | 0.3466 | 0.3560 | +2.73% |
| 4 | R1 | 709.1 | 708.0 | −0.15% | 0.3475 | 0.3618 | +4.14% |
| 4 | R2 | 694.5 | 701.5 | +1.01% | 0.3422 | 0.3607 | +5.40% |
| 4 | R3 | 708.6 | 675.1 | −4.72% | 0.3457 | 0.3530 | +2.11% |
| 2 | R1 | 390.1 | 380.3 | −2.51% | 0.3420 | 0.3487 | +1.95% |
| 2 | R2 | 390.0 | 379.8 | −2.61% | 0.3489 | 0.3539 | +1.42% |
| 2 | R3 | 390.4 | 392.7 | +0.59% | 0.3434 | 0.3616 | +5.31% |
| 1 | R1 | 205.5 | 205.7 | +0.07% | 0.3429 | 0.3553 | +3.62% |
| 1 | R2 | 202.5 | 203.4 | +0.48% | 0.3530 | 0.3555 | +0.70% |
| 1 | R3 | 204.7 | 201.0 | −1.84% | 0.3511 | 0.3575 | +1.83% |

Qwen3-235B-A22B + EAGLE-3

| bs | TPS off | TPS on | ΔTPS | AR off | AR on | ΔAR |
|---|---|---|---|---|---|---|
| 16 | 1280.8 | 1337.3 | +4.42% | 0.3089 | 0.3510 | +13.62% |
| 8 | 777.3 | 805.9 | +3.68% | 0.3098 | 0.3502 | +13.03% |
| 4 | 469.0 | 492.4 | +5.00% | 0.3068 | 0.3539 | +15.36% |
| 2 | 278.0 | 288.8 | +3.89% | 0.3078 | 0.3519 | +14.32% |
| 1 | 154.1 | 159.7 | +3.64% | 0.3097 | 0.3488 | +12.62% |

Observation

Acceptance rate improves across all tested configurations under non-greedy sampling (+2–4% on llama-70b, +12–15% on qwen-235b). The AR uplift translates to a TPS win on qwen-235b at every batch size, but not on llama-70b, where rejection sampling currently costs 0–9% TPS. Lower batch sizes also show enough run-to-run noise that some signs flip across rounds.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@zhaoyangwang-nvidia zhaoyangwang-nvidia force-pushed the TRTLLM-12669-remove-allow-advanced-sampling branch from 903b453 to d237690 Compare May 29, 2026 10:05
@zhaoyangwang-nvidia zhaoyangwang-nvidia marked this pull request as ready for review May 29, 2026 10:10
@zhaoyangwang-nvidia zhaoyangwang-nvidia requested review from a team as code owners May 29, 2026 10:10
@zhaoyangwang-nvidia

Collaborator Author

/bot run

@zhaoyangwang-nvidia

Collaborator Author

Hi @mikeiovine please help to review this PR, thanks~

@tensorrt-cicd

Collaborator

PR_Github #51043 [ run ] triggered by Bot. Commit: d237690 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #51043 [ run ] completed with state SUCCESS. Commit: d237690
/LLM/main/L0_MergeRequest_PR pipeline #40490 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@zhaoyangwang-nvidia

Collaborator Author

Hi @NVIDIA/trt-llm-doc-owners @NVIDIA/trt-llm-llmapi-devs @NVIDIA/trt-llm-qa-function @NVIDIA/trt-llm-torch-models-devs @NVIDIA/trt-llm-torch-runtime-devs @NVIDIA/trt-llm-torch-spec-decoding please help to review this PR, thanks a lot.

@zhaoyangwang-nvidia

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #51297 [ run ] triggered by Bot. Commit: d237690 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #51297 [ run ] completed with state SUCCESS. Commit: d237690
/LLM/main/L0_MergeRequest_PR pipeline #40712 completed with status: 'SUCCESS'

CI Report

Link to invocation

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Contributor

Walkthrough

This PR replaces the allow_advanced_sampling configuration flag with a runtime-computed is_all_greedy_sample flag. Speculative decoding now automatically selects greedy or advanced sampling paths based on whether all batch requests are greedy, enabling CUDA graph caching and rejection-sampling optimizations without manual user configuration.

Changes

Speculative Decoding Sampling Configuration Refactoring

Layer / File(s) Summary
Configuration schema refactoring
tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/_torch/speculative/interface.py, tensorrt_llm/_torch/speculative/utils.py
Removed allow_advanced_sampling from DecodingBaseConfig and replaced it with is_all_greedy_sample in SpecMetadata. Updated all five speculative metadata constructors (MTPSpecMetadata, Eagle3OneModelSpecMetadata, PARDSpecMetadata, DFlashSpecMetadata, DraftTargetOneModelSpecMetadata) to remove the allow_advanced_sampling argument from instantiation calls.
CUDA graph caching key expansion
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
Extended KeyType with a fifth boolean dimension for the new is_all_greedy_sample flag. Updated get_graph_key to accept spec_resource_manager and spec_metadata, derive is_all_greedy_sample from the metadata, and include it in the graph key for both Eagle3 and non-Eagle3 paths. Updated the maybe_get_cuda_graph call site to pass these new parameters.
Speculative sampling dispatch and metadata computation
tensorrt_llm/_torch/speculative/interface.py
Refactored populate_sampling_params_for_one_model to compute is_all_greedy_sample dynamically during three phases: initialization of per-request sampling storage, accumulation of normalized sampling parameters, and early-return buffer preparation when fully greedy. Added new _draft_sampler_advanced method for draft token generation with per-request temperatures/top-k/top-p and optional FlashInfer routing. Updated rejection-sampling eligibility and _sample_tokens_for_batch branching to use the derived flag instead of the removed config knob.
Eagle3 draft decoder parametrization and rejection sampling gating
tensorrt_llm/_torch/speculative/eagle3.py, tensorrt_llm/_torch/speculative/eagle3_dynamic_tree.py
Updated Eagle3OneModelWorker.draft_decoder signature to accept optional spec_metadata and batch_size, enabling per-request sampling when both are provided or greedy fallback otherwise. Updated draft loop call site to pass these parameters. Tightened rejection-sampling conditions in both Eagle3 and Eagle3DynamicTree workers to skip rejection-sampling computations when the batch is all-greedy, even if rejection sampling is otherwise enabled.
Executor warning removal and example/test updates
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py, examples/llm-api/quickstart_advanced.py, examples/models/core/nemotron/README_nemotron_super_v3.md, tests/integration/defs/accuracy/test_llm_api_pytorch.py, tests/integration/defs/examples/serve/test_configs/Nemotron3_Super_120B_NVFP4.yml, tests/integration/defs/perf/pytorch_model_config.py, tests/unittest/_torch/speculative/test_eagle3.py
Removed deprecated "falling back to greedy decoding" warning and the --allow_advanced_sampling CLI argument from examples. Removed allow_advanced_sampling=True from all speculative decoding config instantiations in integration and unit tests. Updated YAML configuration examples and test setup to reflect the new speculative metadata structure.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 65.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: removing allow_advanced_sampling, auto-detecting greedy fast-paths, enabling mixed-batch rejection sampling, and making draft honor target sampling parameters. |
| Description check | ✅ Passed | PR description comprehensively explains the refactoring: removal of allow_advanced_sampling, auto-detection of is_all_greedy_sample, CUDA graph dual variants, and draft sampling parameter propagation. Includes test coverage, performance metrics, and completes all PR checklist items. |


@zhaoyangwang-nvidia zhaoyangwang-nvidia changed the title [TRTLLM-12669][refactor] Remove allow_advanced_sampling and capture dual CUDA graphs [TRTLLM-12669][refactor] Eagle3 sampling: auto-detect greedy fast-path, mixed-batch rejection sampling, draft honors target params Jun 3, 2026

@mikeiovine mikeiovine left a comment

Collaborator

Are any other spec algos always using greedy for draft sampling? That will need to be fixed too in a follow up

Comment thread tensorrt_llm/llmapi/llm_args.py
Comment thread tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py Outdated
Comment thread tensorrt_llm/_torch/pyexecutor/model_engine.py Outdated
Comment thread tensorrt_llm/_torch/speculative/eagle3.py Outdated
Comment thread tensorrt_llm/llmapi/llm_args.py
@zhaoyangwang-nvidia zhaoyangwang-nvidia force-pushed the TRTLLM-12669-remove-allow-advanced-sampling branch 2 times, most recently from 775bae6 to d6fd852 Compare June 4, 2026 02:54
'MTPForCausalLM' does not store its constructor's 'model' argument as
self.model, so getattr(draft_model.model, "d2t", None) raised
AttributeError when draft_decoder was called in MTP Eagle mode. Use
nested getattr to safely return None when draft_model has no 'model'
attribute (MTP Eagle never uses a compressed vocabulary so d2t is
always None for that mode anyway).

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
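
The nested getattr pattern described in this commit can be illustrated with two stand-in classes. `Inner`, `EagleLikeDraft`, `MtpLikeDraft`, and `get_d2t` are hypothetical names for the sketch; only the `model` and `d2t` attribute names come from the commit message.

```python
class Inner:
    # Stand-in for a draft model that carries a draft-to-target vocab mapping.
    d2t = [0, 2, 4]


class EagleLikeDraft:
    def __init__(self):
        self.model = Inner()


class MtpLikeDraft:
    # Stand-in for a wrapper that does not expose a `model` attribute.
    pass


def get_d2t(draft_model):
    # The inner getattr yields None when `model` is missing, and
    # getattr(None, "d2t", None) then safely returns None as well.
    return getattr(getattr(draft_model, "model", None), "d2t", None)
```

By contrast, `getattr(draft_model.model, "d2t", None)` evaluates `draft_model.model` first, so the default never gets a chance to apply when `model` is absent.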
…ble top_k_max

Two bugs exposed by the new forced non-greedy CUDA graph capture pass:

1. SpecMetadata.populate_sampling_params_for_one_model: buffer size is
   tokens_per_request * max_num_requests, but warmup batches can have
   more total tokens when batch_size > max_num_requests. Fix by using
   max(static_required, actual_flat_size) for buffer allocation.

2. Eagle3 dynamic tree rejection: verify_dynamic_tree_rejection_from_logits_out
   computed top_k_max via boolean tensor indexing + .item(), both
   CUDA-graph-incompatible. Fix by:
   - Pre-computing top_k_max CPU-side in populate_sampling_params_for_one_model
   - Passing top_k_max=0 during stream capture (forces full-sort path,
     always correct) and the pre-computed value during eager execution
   - Adding top_k_max optional param to verify_dynamic_tree_rejection_from_logits_out

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
…ure and fix PARD num_tokens shape mismatch

- eagle3_dynamic_tree.py: _can_use_rejection_sampling now returns False
  when spec_metadata.is_cuda_graph is True. The rejection ops
  (compute_draft_probs_for_dynamic_tree_rejection_op) use a full-sort
  fallback with dynamic allocation that is incompatible with CUDA stream
  capture, causing cudaErrorStreamCaptureUnsupported.

- interface.py: _sample_tokens_for_batch now derives num_tokens from
  logits.shape[0] instead of computing it from runtime_draft_len.
  For PARD under CUDA graph capture runtime_draft_len can be the PARD-max
  while the graph was built for a shorter draft_len, causing a shape
  mismatch in the torch.compiled sampling_batch_spec_dec_one_model.

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
…tp_in_adp

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
…g in ADP+LM-head-TP

In the MTP-Eagle ADP + LM-head-TP path, draft logits are zero-padded to
max_num_requests so every TP rank produces an identically-shaped tensor for
the LM-head-TP all-gather. The refactored draft sampler applies per-request
temperature/top_k/top_p tensors sized to token_count (== batch_size), so the
padded logits ([max_num_requests, vocab]) failed to broadcast against the
[batch_size, 1] temperature in apply_temperature, crashing torch.compile fake
tensor tracing during executor worker init.

Drop the padded rows before sampling (logits = logits[:token_count]) instead
of trimming the sampled tokens afterwards. This keeps logits, next_draft_tokens
and the draft_probs buffer token_count-sized and lets the per-request sampling
params broadcast correctly.

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
…y selection

The one-engine CUDA graph key includes is_all_greedy_sample to dispatch
between the argmax fast-path and the advanced-sampling graph variant. The flag
was only (re)computed inside populate_sampling_params_for_one_model, which runs
in _prepare_inputs AFTER maybe_get_cuda_graph has already built the key. The key
therefore used the previous iteration's stale flag, and warmup left it False
(from the advanced-sampling capture pass). On the first real decode iteration a
greedy batch would then replay the advanced-sampling graph while populate skips
filling the sampling/draft_probs buffers, reading uninitialized slot-indexed
data. For MTP with num_nextn>=2 this hung the executor (Hang detected on rank 0).

Fix:
- Extract the greediness detection into _scan_one_model_sampling (single source
  of truth) and add update_is_all_greedy_sample, called before the graph key is
  built so the key matches the buffers populate fills. populate now reuses the
  same scan.
- Defensively reset spec_metadata.is_all_greedy_sample to True after CUDA graph
  warmup so the stale capture-only False does not seed the first iteration.

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
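
The ordering problem fixed by this commit can be reproduced in a toy model. Everything below is a hypothetical simplification: `SpecMetadataSketch`, `graph_key`, and `step` are illustrative names, and greediness is reduced to a temperature check; only `is_all_greedy_sample` and `update_is_all_greedy_sample` are names taken from the commit message.

```python
from typing import List, Optional, Tuple


class SpecMetadataSketch:
    def __init__(self) -> None:
        self.is_all_greedy_sample = True

    def update_is_all_greedy_sample(self, temperatures: List[Optional[float]]) -> None:
        # Simplified greediness scan: unset or zero temperature means greedy.
        self.is_all_greedy_sample = all(t is None or t == 0 for t in temperatures)


def graph_key(batch_size: int, meta: SpecMetadataSketch) -> Tuple[int, bool]:
    return (batch_size, meta.is_all_greedy_sample)


def step(meta: SpecMetadataSketch, temperatures: List[Optional[float]],
         refresh_before_key: bool) -> Tuple[int, bool]:
    if refresh_before_key:
        # Fixed ordering: refresh the flag before the key is built.
        meta.update_is_all_greedy_sample(temperatures)
    key = graph_key(len(temperatures), meta)
    # The populate step runs after key selection and recomputes the flag,
    # which is too late to influence the key chosen for this iteration.
    meta.update_is_all_greedy_sample(temperatures)
    return key
```

With the flag left at False by a warmup capture pass, a greedy batch selects the `(2, False)` key unless the refresh happens first.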
@zhaoyangwang-nvidia zhaoyangwang-nvidia force-pushed the TRTLLM-12669-remove-allow-advanced-sampling branch from 9f39541 to 764edb7 Compare June 13, 2026 03:16
@xxi-nv

xxi-nv commented Jun 13, 2026

Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54001 [ run ] triggered by Bot. Commit: 764edb7 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54001 [ run ] completed with state SUCCESS. Commit: 764edb7
/LLM/main/L0_MergeRequest_PR pipeline #43086 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

…avoid multi-GPU hang

draft_decoder routed the all-greedy fast path to _draft_sampler_greedy, a
plain torch.argmax. For MTP-Eagle with a tensor-parallel draft LM head
(tp_size>1 without attention DP, or LM-head-TP in ADP) the draft logits are
sharded along the vocab dim, so a per-rank argmax selects a different token on
each rank. The ranks then desync on the speculative-decoding control flow and
the next collective deadlocks, observed as a generation hang on rank 0
(e.g. DeepSeek-V3-Lite tp4 + mtp_nextn>=2 + cuda_graph + torch_compile).

Restore the TP-aware path: for the all-greedy case, MTP-Eagle now uses
draft_sampler(), which all-gathers the sharded draft logits before argmax (and
falls back to a plain argmax when no TP gather is needed). Eagle3 (non-MTP)
keeps its d2t-aware argmax. This matches the pre-refactor behavior.

Root-caused and verified by local reproduction (DeepSeek-V3-Lite, tp4,
mtp_nextn=2, cuda_graph, torch_compile): baseline passes, the refactor hangs,
and this fix restores passing.

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
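
Why a per-rank argmax diverges under a vocab-sharded LM head, and how gathering one (max, index) pair per rank fixes it, can be simulated without any GPUs. This is a sketch only: the list of shards stands in for the ranks, the list of candidates stands in for the all-gather, and the function names are hypothetical rather than the actual draft_sampler() implementation.

```python
from typing import List, Tuple


def local_argmax(shard: List[float]) -> int:
    # Index of the largest logit within one rank's vocab shard.
    return max(range(len(shard)), key=shard.__getitem__)


def per_rank_argmax(shards: List[List[float]]) -> List[int]:
    # What each rank would pick on its own: a shard-local index, which
    # generally differs from rank to rank.
    return [local_argmax(s) for s in shards]


def gathered_argmax(shards: List[List[float]]) -> int:
    # Each rank contributes (local max value, global vocab index); collecting
    # these pairs simulates the all-gather, after which every rank can make
    # the same choice.
    candidates: List[Tuple[float, int]] = []
    offset = 0
    for shard in shards:
        i = local_argmax(shard)
        candidates.append((shard[i], offset + i))
        offset += len(shard)
    return max(candidates, key=lambda c: c[0])[1]
```

Only one value and one index per rank need to be exchanged for the greedy case, which is why it is cheaper than gathering the full vocabulary.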
…non-greedy) sampling under TP

The non-greedy draft sampling path (_draft_sampler_advanced) has the same
multi-GPU hazard as the greedy path that was just fixed. With a plain
tensor-parallel draft LM head (tp_size>1 without attention DP) each rank only
holds a vocab shard of the draft logits, so per-rank random sampling draws a
different token on each rank, desyncs the speculative-decoding control flow and
deadlocks the next collective (generation hang).

Greedy could be repaired with draft_sampler()'s lightweight max+index
all-gather, but random sampling needs the full distribution, so all-gather the
sharded draft logits into the full vocab before advanced sampling. Every rank
then samples from the same distribution with the shared seed. The
LM-head-TP-in-ADP path is gathered upstream and is intentionally excluded.

Verified by local reproduction (DeepSeek-V3-Lite, tp4, mtp_nextn=2, cuda_graph,
torch_compile, non-greedy temperature/top_k/top_p): hangs without this gather,
passes with it.

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
@xxi-nv

xxi-nv commented Jun 14, 2026

Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54084 [ run ] triggered by Bot. Commit: 85468e1 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54084 [ run ] completed with state SUCCESS. Commit: 85468e1
/LLM/main/L0_MergeRequest_PR pipeline #43168 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@xxi-nv

xxi-nv commented Jun 14, 2026

Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54110 [ run ] triggered by Bot. Commit: 85468e1 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54110 [ run ] completed with state SUCCESS. Commit: 85468e1
/LLM/main/L0_MergeRequest_PR pipeline #43194 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

…ng through draft_sampler

The previous fix routed every greedy MTP-Eagle draft step through
draft_sampler(), but that call does not forward mapping_lm_head_tp. For the
LM-head-TP-in-ADP configuration draft_sampler() then takes its ADP branch with
a None mapping and crashes during warmup with
"'NoneType' object has no attribute 'tp_group'" (Executor worker returned
error), e.g. DeepSeek-R1 nvfp4 latency_adp_lmtp_tp4.

Only plain tensor parallelism (tp_size>1 without attention DP) shards the draft
logits over the vocab dim and needs draft_sampler()'s all-gather argmax. The
LM-head-TP-in-ADP case already yields full-vocab logits per rank (gathered
upstream) and the no-TP / Eagle3 cases need nothing, so all of those take the
plain d2t-aware argmax (_draft_sampler_greedy), restoring the pre-regression
behavior for ADP while keeping the plain-TP hang fix.

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
@xxi-nv

xxi-nv commented Jun 14, 2026

Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54140 [ run ] triggered by Bot. Commit: 16e577c Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54140 [ run ] completed with state SUCCESS. Commit: 16e577c
/LLM/main/L0_MergeRequest_PR pipeline #43223 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@xxi-nv

xxi-nv commented Jun 15, 2026

Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54165 [ run ] triggered by Bot. Commit: 16e577c Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54165 [ run ] completed with state SUCCESS. Commit: 16e577c
/LLM/main/L0_MergeRequest_PR pipeline #43248 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@xxi-nv

xxi-nv commented Jun 15, 2026

Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54182 [ run ] triggered by Bot. Commit: 16e577c Link to invocation

@xxi-nv

xxi-nv commented Jun 15, 2026

Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54250 [ run ] triggered by Bot. Commit: 16e577c Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54182 [ run ] completed with state ABORTED. Commit: 16e577c

Link to invocation


7 participants