[TRTLLM-12669][refactor] Eagle3 sampling: auto-detect greedy fast-path, mixed-batch rejection sampling, draft honors target params by zhaoyangwang-nvidia · Pull Request #14745 · NVIDIA/TensorRT-LLM

zhaoyangwang-nvidia · 2026-05-29T08:59:59Z

Replace static config flag with auto-detected per-step uses_advanced_sampling based on actual sampling params. Include this in CUDA graph key so we lazily capture two graph variants (argmax fast-path vs advanced sampling kernel) and dispatch by replaying the right one.

Summary by CodeRabbit

New Features
- Speculative decoding now automatically detects all-greedy sampling for optimized performance.
- Draft token generation enhanced to support per-request sampling parameters in advanced configurations.
Bug Fixes
- Removed unnecessary fallback warnings in speculative decoding setup.
Documentation
- Updated speculative decoding configuration examples and test settings with improved sampling support.

Description

Refactors Eagle3 one-model speculative decoding sampling. The logical changes below are sequenced as separate commits:

1. Replace `allow_advanced_sampling` with auto-detected `is_all_greedy_sample`

Removes the allow_advanced_sampling config flag from DecodingBaseConfig. Replaced with a per-step is_all_greedy_sample derived from the actual temperature / top_k / top_p of requests in the batch. The flag is included in the CUDA graph cache key, so two graph variants are lazily captured (argmax fast-path vs. advanced-sampling kernel) and dispatched at replay time based on batch composition.
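The dispatch idea can be sketched as follows. This is a minimal illustration, not the TensorRT-LLM implementation: `Request`, `is_greedy`, `GraphCache`, and `fake_capture` are hypothetical names, strings stand in for captured CUDA graphs, and the rule used to classify a request as greedy (unset or zero temperature, or `top_k == 1`) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Request:
    # Hypothetical stand-in for per-request sampling params.
    temperature: Optional[float] = None
    top_k: Optional[int] = None
    top_p: Optional[float] = None


def is_greedy(req: Request) -> bool:
    # Assumed rule: top_k == 1, or an unset / zero temperature, means argmax.
    if req.top_k == 1:
        return True
    return req.temperature is None or req.temperature == 0.0


def is_all_greedy(batch: List[Request]) -> bool:
    return all(is_greedy(r) for r in batch)


class GraphCache:
    """Lazily captures one graph variant per (batch_size, is_all_greedy) key."""

    def __init__(self, capture: Callable[[int, bool], str]):
        self._capture = capture
        self._graphs: Dict[Tuple[int, bool], str] = {}

    def get(self, batch: List[Request]) -> str:
        key = (len(batch), is_all_greedy(batch))
        if key not in self._graphs:
            self._graphs[key] = self._capture(*key)  # first use: capture
        return self._graphs[key]  # later uses: replay the cached variant


def fake_capture(batch_size: int, all_greedy: bool) -> str:
    # A string stands in for a captured CUDA graph.
    kind = "argmax" if all_greedy else "advanced"
    return f"graph(bs={batch_size}, {kind})"


cache = GraphCache(fake_capture)
greedy_batch = [Request(), Request(temperature=0.0)]
mixed_batch = [Request(), Request(temperature=0.7, top_k=50, top_p=0.9)]
print(cache.get(greedy_batch))
print(cache.get(mixed_batch))
```

Because the flag is part of the key, a batch with even one non-greedy request selects the advanced-sampling variant, and each variant is captured only the first time it is needed.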

2. Eagle3 drafter honors target's sampling params

Previously the Eagle3 draft model always ran greedy regardless of the target's sampling configuration. This change propagates the target's temperature / top_k / top_p into the draft loop so that draft tokens are sampled under the same sampling parameters as the target. This is a correctness prerequisite for non-greedy rejection sampling: the Leviathan acceptance test u * p_draft(x) < p_target(x), with u drawn uniformly from [0, 1), only holds when p_draft is the distribution the draft token was actually sampled from and both probabilities come from the same conditioning.
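For reference, the acceptance test can be written out for a single drafted token. This is a plain-Python sketch of the standard speculative sampling rule, not the rejection kernel used in this PR; `accept_or_resample` is a hypothetical name, and both inputs are assumed to be normalized distributions over the same vocabulary.

```python
import random
from typing import List, Tuple


def accept_or_resample(
    token: int,
    p_draft: List[float],
    p_target: List[float],
    rng: random.Random,
) -> Tuple[bool, int]:
    """Return (accepted, token) for one drafted token."""
    u = rng.random()  # u ~ Uniform[0, 1)
    if u * p_draft[token] < p_target[token]:
        return True, token
    # Rejected: resample from the residual max(0, p_target - p_draft).
    # With normalized inputs the residual has positive mass after a rejection.
    residual = [max(0.0, t - d) for t, d in zip(p_target, p_draft)]
    new_token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return False, new_token


rng = random.Random(0)
p_draft = [0.5, 0.5]
p_target = [1.0, 0.0]
print(accept_or_resample(0, p_draft, p_target, rng))
print(accept_or_resample(1, p_draft, p_target, rng))
```

If the drafter samples greedily while `p_draft` is computed with a temperature (or the reverse), the drafted token no longer follows `p_draft` and the test stops reproducing the target distribution, which is why the drafter has to honor the target's parameters.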

3. Slot-indexed `draft_probs` to support mixed batches

Previously _can_use_rejection_sampling bailed out when the batch contained context requests, falling back to exact-match for the whole batch. Root cause: draft_probs was indexed by batch position, but batch position is unstable across iterations (chunked prefill, finishing generation requests, and newly joining context requests all shift positions). Fix:

- Reshape draft_probs from flat [total_draft_tokens, vocab] to slot-indexed [max_num_requests, max_draft_len, vocab], keyed by stable py_seq_slot.
- Scatter on write (_compute_and_store_draft_probs), gather on read (_accept_draft_tokens) using a precomputed batch_slot_ids tensor.
- Drop the num_contexts == 0 constraint in _can_use_rejection_sampling: the ctx subset goes through _sample_tokens_for_batch, and the gen subset goes through the rejection kernel.
- Reset draft_probs_valid = False whenever the draft loop writes no probs, so stale data is never read.
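The scatter-on-write / gather-on-read pattern can be illustrated in plain Python. This is a sketch under stated assumptions: nested lists stand in for the `[max_num_requests, max_draft_len, vocab]` tensor, the sizes are made up, and `SlotIndexedDraftProbs` with its method names is hypothetical rather than taken from the PR.

```python
from typing import List

MAX_NUM_REQUESTS = 4  # illustrative sizes, not values from the PR
MAX_DRAFT_LEN = 2
VOCAB = 3


class SlotIndexedDraftProbs:
    """Nested lists stand in for a [max_num_requests, max_draft_len, vocab] tensor."""

    def __init__(self) -> None:
        self.buf = [[[0.0] * VOCAB for _ in range(MAX_DRAFT_LEN)]
                    for _ in range(MAX_NUM_REQUESTS)]
        self.valid = False

    def scatter(self, batch_slot_ids: List[int],
                probs: List[List[List[float]]]) -> None:
        # Write path: batch row i lands in its request's stable slot.
        for slot, per_request in zip(batch_slot_ids, probs):
            self.buf[slot] = [row[:] for row in per_request]
        self.valid = bool(batch_slot_ids)  # nothing written: mark as invalid

    def gather(self, batch_slot_ids: List[int]) -> List[List[List[float]]]:
        # Read path: rebuild the current batch order from the slots.
        assert self.valid, "draft probs are stale"
        return [self.buf[slot] for slot in batch_slot_ids]


store = SlotIndexedDraftProbs()
probs_a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
probs_b = [[0.3, 0.3, 0.4], [0.5, 0.25, 0.25]]
# Step t: request A (slot 2) sits at batch position 0, request B (slot 0) at position 1.
store.scatter([2, 0], [probs_a, probs_b])
# Step t+1: positions shifted, so the generation subset is now ordered [B, A].
gathered = store.gather([0, 2])
```

Each request still reads back its own probabilities after the reorder, which is the property that indexing by batch position could not guarantee.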

Mixed-batch rejection captures ~18% sys-tps on llama70b bs=32 vs. the exact-match fallback.

Test Coverage

Unit tests (B200)

tests/unittest/_torch/speculative/test_eagle3.py
  test_eagle3_cuda_graph_padding[True]  PASSED  130.54s
  test_eagle3_cuda_graph_padding[False] PASSED  103.53s

End-to-end correctness — Qwen3-8B (H100 SXM5 80G)

Greedy path verification (no temp / top_p / top_k → both paths take greedy branch):

baseline (83ec591, allow_advanced_sampling=False) vs new (d237690, auto-detected is_all_greedy_sample=True)
token-level identity: 80/80 prompts identical
total_output_tokens: 113,432 vs 113,432, Δ=0
mean_acceptance_rate: 0.4918 vs 0.4918, Δ=+0.00%
mean_acceptance_length: 2.475 vs 2.475, Δ=0
mean throughput: 224.19 vs 224.02 tok/s, Δ=-0.07%
wall clock: 505.99s vs 506.38s, Δ=+0.08%

Performance — rejection sampling ON vs OFF (non-greedy)

CUDA graph enabled. mtbench dataset. Sampling params: temperature=0.7, top_k=50, top_p=0.9.

Llama-3.3-70B-Instruct + EAGLE-3 (mean over 3 rounds)

bs	TPS off	TPS on	ΔTPS	AR off	AR on	ΔAR
32	3408.2	3111.3	−8.70%	0.3475	0.3581	+3.04%
16	2203.7	2069.3	−6.09%	0.3499	0.3568	+1.96%
8	1278.9	1232.4	−3.58%	0.3444	0.3586	+4.15%
4	704.1	694.9	−1.29%	0.3451	0.3585	+3.88%
2	390.2	384.3	−1.51%	0.3448	0.3547	+2.89%
1	204.2	203.4	−0.43%	0.3490	0.3561	+2.05%

Per-round detail (3 rounds, llama-70b)

bs	round	TPS off	TPS on	ΔTPS	AR off	AR on	ΔAR
32	R1	3413.8	3113.9	−8.78%	0.3479	0.3579	+2.88%
32	R2	3441.7	3111.2	−9.60%	0.3512	0.3584	+2.04%
32	R3	3369.0	3108.9	−7.72%	0.3435	0.3579	+4.19%
16	R1	2175.9	2078.4	−4.48%	0.3524	0.3559	+0.99%
16	R2	2219.6	2101.3	−5.33%	0.3492	0.3591	+2.82%
16	R3	2215.7	2028.2	−8.46%	0.3481	0.3553	+2.08%
8	R1	1281.1	1230.6	−3.94%	0.3428	0.3586	+4.62%
8	R2	1255.5	1266.1	+0.85%	0.3438	0.3613	+5.10%
8	R3	1300.2	1200.6	−7.66%	0.3466	0.3560	+2.73%
4	R1	709.1	708.0	−0.15%	0.3475	0.3618	+4.14%
4	R2	694.5	701.5	+1.01%	0.3422	0.3607	+5.40%
4	R3	708.6	675.1	−4.72%	0.3457	0.3530	+2.11%
2	R1	390.1	380.3	−2.51%	0.3420	0.3487	+1.95%
2	R2	390.0	379.8	−2.61%	0.3489	0.3539	+1.42%
2	R3	390.4	392.7	+0.59%	0.3434	0.3616	+5.31%
1	R1	205.5	205.7	+0.07%	0.3429	0.3553	+3.62%
1	R2	202.5	203.4	+0.48%	0.3530	0.3555	+0.70%
1	R3	204.7	201.0	−1.84%	0.3511	0.3575	+1.83%
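As a cross-check on how the summary rows relate to the per-round rows, the bs=32 numbers can be recomputed. This sketch only uses values copied from the tables above; that the summary ΔTPS is the mean of the per-round deltas (rather than the delta of the mean TPS values) is an inference from the arithmetic, not something the PR states.

```python
from statistics import mean

# (TPS off, TPS on) for bs=32, rounds R1..R3, copied from the per-round table.
rounds_bs32 = [(3413.8, 3113.9), (3441.7, 3111.2), (3369.0, 3108.9)]


def pct_change(off: float, on: float) -> float:
    return (on - off) / off * 100.0


per_round = [pct_change(off, on) for off, on in rounds_bs32]
mean_of_deltas = mean(per_round)
delta_of_means = pct_change(mean(off for off, _ in rounds_bs32),
                            mean(on for _, on in rounds_bs32))
print([f"{d:.2f}" for d in per_round],
      f"{mean_of_deltas:.2f}", f"{delta_of_means:.2f}")
```

The mean of the per-round deltas comes out at about −8.70%, matching the summary row, while the delta of the mean TPS values is about −8.71%.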

Qwen3-235B-A22B + EAGLE-3

bs	TPS off	TPS on	ΔTPS	AR off	AR on	ΔAR
16	1280.8	1337.3	+4.42%	0.3089	0.3510	+13.62%
8	777.3	805.9	+3.68%	0.3098	0.3502	+13.03%
4	469.0	492.4	+5.00%	0.3068	0.3539	+15.36%
2	278.0	288.8	+3.89%	0.3078	0.3519	+14.32%
1	154.1	159.7	+3.64%	0.3097	0.3488	+12.62%

Observation

Acceptance rate improves across all tested configurations under non-greedy sampling (+2–4% on llama-70b, +12–15% on qwen-235b). The AR uplift translates to a TPS win on qwen-235b at every batch size, but not on llama-70b, where rejection sampling currently costs 0–9% TPS. Lower batch sizes also show enough run-to-run noise that some signs flip across rounds.

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

zhaoyangwang-nvidia · 2026-05-29T12:39:04Z

/bot run

zhaoyangwang-nvidia · 2026-05-29T12:39:27Z

Hi @mikeiovine please help to review this PR, thanks~

tensorrt-cicd · 2026-05-29T12:44:41Z

PR_Github #51043 [ run ] triggered by Bot. Commit: d237690 Link to invocation

tensorrt-cicd · 2026-05-29T17:20:11Z

PR_Github #51043 [ run ] completed with state SUCCESS. Commit: d237690
/LLM/main/L0_MergeRequest_PR pipeline #40490 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

zhaoyangwang-nvidia · 2026-06-01T02:50:21Z

Hi @NVIDIA/trt-llm-doc-owners @NVIDIA/trt-llm-llmapi-devs @NVIDIA/trt-llm-qa-function @NVIDIA/trt-llm-torch-models-devs @NVIDIA/trt-llm-torch-runtime-devs @NVIDIA/trt-llm-torch-spec-decoding please help to review this PR, thanks a lot.

zhaoyangwang-nvidia · 2026-06-01T02:57:55Z

/bot run

tensorrt-cicd · 2026-06-01T03:04:37Z

PR_Github #51297 [ run ] triggered by Bot. Commit: d237690 Link to invocation

tensorrt-cicd · 2026-06-01T04:02:32Z

PR_Github #51297 [ run ] completed with state SUCCESS. Commit: d237690
/LLM/main/L0_MergeRequest_PR pipeline #40712 completed with status: 'SUCCESS'

CI Report

Link to invocation

coderabbitai · 2026-06-01T09:50:12Z

📝 Walkthrough

Walkthrough

This PR replaces the allow_advanced_sampling configuration flag with a runtime-computed is_all_greedy_sample flag. Speculative decoding now automatically selects greedy or advanced sampling paths based on whether all batch requests are greedy, enabling CUDA graph caching and rejection-sampling optimizations without manual user configuration.

Changes

Speculative Decoding Sampling Configuration Refactoring

Layer / File(s)	Summary
Configuration schema refactoring `tensorrt_llm/llmapi/llm_args.py`, `tensorrt_llm/_torch/speculative/interface.py`, `tensorrt_llm/_torch/speculative/utils.py`	Removed `allow_advanced_sampling` from `DecodingBaseConfig` and replaced it with `is_all_greedy_sample` in `SpecMetadata`. Updated all five speculative metadata constructors (`MTPSpecMetadata`, `Eagle3OneModelSpecMetadata`, `PARDSpecMetadata`, `DFlashSpecMetadata`, `DraftTargetOneModelSpecMetadata`) to remove the `allow_advanced_sampling` argument from instantiation calls.
CUDA graph caching key expansion `tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py`	Extended `KeyType` with a fifth boolean dimension for the new `is_all_greedy_sample` flag. Updated `get_graph_key` to accept `spec_resource_manager` and `spec_metadata`, derive `is_all_greedy_sample` from the metadata, and include it in the graph key for both Eagle3 and non-Eagle3 paths. Updated the `maybe_get_cuda_graph` call site to pass these new parameters.
Speculative sampling dispatch and metadata computation `tensorrt_llm/_torch/speculative/interface.py`	Refactored `populate_sampling_params_for_one_model` to compute `is_all_greedy_sample` dynamically during three phases: initialization of per-request sampling storage, accumulation of normalized sampling parameters, and early-return buffer preparation when fully greedy. Added new `_draft_sampler_advanced` method for draft token generation with per-request temperatures/top-k/top-p and optional FlashInfer routing. Updated rejection-sampling eligibility and `_sample_tokens_for_batch` branching to use the derived flag instead of the removed config knob.
Eagle3 draft decoder parametrization and rejection sampling gating `tensorrt_llm/_torch/speculative/eagle3.py`, `tensorrt_llm/_torch/speculative/eagle3_dynamic_tree.py`	Updated `Eagle3OneModelWorker.draft_decoder` signature to accept optional `spec_metadata` and `batch_size`, enabling per-request sampling when both are provided or greedy fallback otherwise. Updated draft loop call site to pass these parameters. Tightened rejection-sampling conditions in both Eagle3 and Eagle3DynamicTree workers to skip rejection-sampling computations when the batch is all-greedy, even if rejection sampling is otherwise enabled.
Executor warning removal and example/test updates `tensorrt_llm/_torch/pyexecutor/py_executor_creator.py`, `examples/llm-api/quickstart_advanced.py`, `examples/models/core/nemotron/README_nemotron_super_v3.md`, `tests/integration/defs/accuracy/test_llm_api_pytorch.py`, `tests/integration/defs/examples/serve/test_configs/Nemotron3_Super_120B_NVFP4.yml`, `tests/integration/defs/perf/pytorch_model_config.py`, `tests/unittest/_torch/speculative/test_eagle3.py`	Removed deprecated "falling back to greedy decoding" warning and the `--allow_advanced_sampling` CLI argument from examples. Removed `allow_advanced_sampling=True` from all speculative decoding config instantiations in integration and unit tests. Updated YAML configuration examples and test setup to reflect the new speculative metadata structure.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 65.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title accurately summarizes the main changes: removing allow_advanced_sampling, auto-detecting greedy fast-paths, enabling mixed-batch rejection sampling, and making draft honor target sampling parameters.
Description check	✅ Passed	PR description comprehensively explains the refactoring: removal of allow_advanced_sampling, auto-detection of is_all_greedy_sample, CUDA graph dual variants, and draft sampling parameter propagation. Includes test coverage, performance metrics, and completes all PR checklist items.

mikeiovine

Are any other spec algos always using greedy for draft sampling? That will need to be fixed too in a follow up

'MTPForCausalLM' does not store its constructor's 'model' argument as self.model, so getattr(draft_model.model, "d2t", None) raised AttributeError when draft_decoder was called in MTP Eagle mode. Use nested getattr to safely return None when draft_model has no 'model' attribute (MTP Eagle never uses a compressed vocabulary so d2t is always None for that mode anyway). Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

…ble top_k_max

Two bugs exposed by the new forced non-greedy CUDA graph capture pass:

1. SpecMetadata.populate_sampling_params_for_one_model: buffer size is tokens_per_request * max_num_requests, but warmup batches can have more total tokens when batch_size > max_num_requests. Fix by using max(static_required, actual_flat_size) for buffer allocation.
2. Eagle3 dynamic tree rejection: verify_dynamic_tree_rejection_from_logits_out computed top_k_max via boolean tensor indexing + .item(), both CUDA-graph-incompatible. Fix by:
   - Pre-computing top_k_max CPU-side in populate_sampling_params_for_one_model
   - Passing top_k_max=0 during stream capture (forces full-sort path, always correct) and the pre-computed value during eager execution
   - Adding top_k_max optional param to verify_dynamic_tree_rejection_from_logits_out

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
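The two fixes in this commit message can be sketched in a few lines. The function names `sampling_buffer_size` and `top_k_max_for_step` are hypothetical, and treating `None` or `0` as "top-k disabled" is an assumption; only the `max(static_required, actual_flat_size)` sizing and the `top_k_max=0` convention during capture come from the message itself.

```python
from typing import List, Optional


def sampling_buffer_size(tokens_per_request: int, max_num_requests: int,
                         actual_flat_size: int) -> int:
    # Static sizing can be too small for warmup batches, so take the larger.
    static_required = tokens_per_request * max_num_requests
    return max(static_required, actual_flat_size)


def top_k_max_for_step(top_ks: List[Optional[int]], capturing: bool) -> int:
    # Computed from host-side Python values, so no device sync such as .item().
    if capturing:
        return 0  # per the commit message: 0 forces the full-sort path
    # Assumption: None or 0 means top-k is disabled for that request.
    return max((k for k in top_ks if k), default=0)


print(sampling_buffer_size(4, 8, 40))
print(top_k_max_for_step([50, None, 10], capturing=False))
print(top_k_max_for_step([50, None, 10], capturing=True))
```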

…ure and fix PARD num_tokens shape mismatch

- eagle3_dynamic_tree.py: _can_use_rejection_sampling now returns False when spec_metadata.is_cuda_graph is True. The rejection ops (compute_draft_probs_for_dynamic_tree_rejection_op) use a full-sort fallback with dynamic allocation that is incompatible with CUDA stream capture, causing cudaErrorStreamCaptureUnsupported.
- interface.py: _sample_tokens_for_batch now derives num_tokens from logits.shape[0] instead of computing it from runtime_draft_len. For PARD under CUDA graph capture runtime_draft_len can be the PARD-max while the graph was built for a shorter draft_len, causing a shape mismatch in the torch.compiled sampling_batch_spec_dec_one_model.

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

…tp_in_adp Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

…g in ADP+LM-head-TP

In the MTP-Eagle ADP + LM-head-TP path, draft logits are zero-padded to max_num_requests so every TP rank produces an identically-shaped tensor for the LM-head-TP all-gather. The refactored draft sampler applies per-request temperature/top_k/top_p tensors sized to token_count (== batch_size), so the padded logits ([max_num_requests, vocab]) failed to broadcast against the [batch_size, 1] temperature in apply_temperature, crashing torch.compile fake tensor tracing during executor worker init.

Drop the padded rows before sampling (logits = logits[:token_count]) instead of trimming the sampled tokens afterwards. This keeps logits, next_draft_tokens and the draft_probs buffer token_count-sized and lets the per-request sampling params broadcast correctly.

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

…y selection

The one-engine CUDA graph key includes is_all_greedy_sample to dispatch between the argmax fast-path and the advanced-sampling graph variant. The flag was only (re)computed inside populate_sampling_params_for_one_model, which runs in _prepare_inputs AFTER maybe_get_cuda_graph has already built the key. The key therefore used the previous iteration's stale flag, and warmup left it False (from the advanced-sampling capture pass). On the first real decode iteration a greedy batch would then replay the advanced-sampling graph while populate skips filling the sampling/draft_probs buffers, reading uninitialized slot-indexed data. For MTP with num_nextn>=2 this hung the executor (Hang detected on rank 0).

Fix:

- Extract the greediness detection into _scan_one_model_sampling (single source of truth) and add update_is_all_greedy_sample, called before the graph key is built so the key matches the buffers populate fills. populate now reuses the same scan.
- Defensively reset spec_metadata.is_all_greedy_sample to True after CUDA graph warmup so the stale capture-only False does not seed the first iteration.

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
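The ordering problem described in this commit message can be reproduced with a toy model. Everything here is a simplified sketch: `SpecMetadataSketch`, `run_step`, and the `temperature == 0` greediness rule are assumptions for illustration, and a boolean stands in for "the sampling buffers were filled".

```python
from typing import List


class SpecMetadataSketch:
    def __init__(self) -> None:
        self.is_all_greedy_sample = True
        self.buffers_filled_for_advanced = False

    def _scan(self, temperatures: List[float]) -> bool:
        # Single source of truth for greediness (assumed rule: temperature == 0).
        return all(t == 0.0 for t in temperatures)

    def update_is_all_greedy_sample(self, temperatures: List[float]) -> None:
        self.is_all_greedy_sample = self._scan(temperatures)

    def populate(self, temperatures: List[float]) -> None:
        # Reuses the same scan; buffers are only filled for non-greedy batches.
        greedy = self._scan(temperatures)
        self.is_all_greedy_sample = greedy
        self.buffers_filled_for_advanced = not greedy


def run_step(meta: SpecMetadataSketch, temperatures: List[float],
             refresh_before_key: bool) -> bool:
    if refresh_before_key:
        meta.update_is_all_greedy_sample(temperatures)
    key_is_greedy = meta.is_all_greedy_sample  # the graph key is built here
    meta.populate(temperatures)  # runs later, during input preparation
    # Consistent if an advanced-sampling graph always sees filled buffers.
    return key_is_greedy or meta.buffers_filled_for_advanced


stale = SpecMetadataSketch()
stale.is_all_greedy_sample = False  # left over from the warmup capture pass
fixed = SpecMetadataSketch()
fixed.is_all_greedy_sample = False
print(run_step(stale, [0.0, 0.0], refresh_before_key=False))
print(run_step(fixed, [0.0, 0.0], refresh_before_key=True))
```

Without the refresh, the key selects the advanced-sampling variant for a greedy batch whose buffers were never filled; refreshing the flag before the key is built removes the mismatch.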

xxi-nv · 2026-06-13T03:16:13Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-13T03:21:42Z

PR_Github #54001 [ run ] triggered by Bot. Commit: 764edb7 Link to invocation

tensorrt-cicd · 2026-06-13T10:16:13Z

PR_Github #54001 [ run ] completed with state SUCCESS. Commit: 764edb7
/LLM/main/L0_MergeRequest_PR pipeline #43086 completed with status: 'FAILURE'

…avoid multi-GPU hang

draft_decoder routed the all-greedy fast path to _draft_sampler_greedy, a plain torch.argmax. For MTP-Eagle with a tensor-parallel draft LM head (tp_size>1 without attention DP, or LM-head-TP in ADP) the draft logits are sharded along the vocab dim, so a per-rank argmax selects a different token on each rank. The ranks then desync on the speculative-decoding control flow and the next collective deadlocks, observed as a generation hang on rank 0 (e.g. DeepSeek-V3-Lite tp4 + mtp_nextn>=2 + cuda_graph + torch_compile).

Restore the TP-aware path: for the all-greedy case, MTP-Eagle now uses draft_sampler(), which all-gathers the sharded draft logits before argmax (and falls back to a plain argmax when no TP gather is needed). Eagle3 (non-MTP) keeps its d2t-aware argmax. This matches the pre-refactor behavior.

Root-caused and verified by local reproduction (DeepSeek-V3-Lite, tp4, mtp_nextn=2, cuda_graph, torch_compile): baseline passes, the refactor hangs, and this fix restores passing.

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
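Why a per-rank argmax diverges on vocab-sharded logits can be seen in a single-process simulation. This is only a sketch: lists of floats stand in for each rank's logit shard, a Python loop stands in for the all-gather, and `local_max` / `tp_aware_argmax` are hypothetical names rather than the `draft_sampler()` implementation.

```python
from typing import List, Tuple


def local_max(shard: List[float], vocab_offset: int) -> Tuple[float, int]:
    # Each rank reports its best logit and that token's global vocab index.
    idx = max(range(len(shard)), key=shard.__getitem__)
    return shard[idx], vocab_offset + idx


def tp_aware_argmax(shards: List[List[float]]) -> int:
    # Stand-in for an all-gather of (max, index) pairs across ranks.
    candidates = []
    offset = 0
    for shard in shards:
        candidates.append(local_max(shard, offset))
        offset += len(shard)
    # Every rank reduces the same gathered list, so all ranks agree.
    return max(candidates, key=lambda c: c[0])[1]


full_logits = [0.1, 2.0, -1.0, 0.3, 3.5, 0.0, 1.2, 0.9]
shards = [full_logits[0:2], full_logits[2:4], full_logits[4:6], full_logits[6:8]]
# Naive path: each rank takes argmax over its own shard only.
per_rank_naive = [local_max(s, 0)[1] for s in shards]
print(per_rank_naive, tp_aware_argmax(shards))
```

The naive per-rank results disagree with each other, while the gathered reduction yields one token for every rank, which is what keeps the control flow in sync.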

…non-greedy) sampling under TP

The non-greedy draft sampling path (_draft_sampler_advanced) has the same multi-GPU hazard as the greedy path that was just fixed. With a plain tensor-parallel draft LM head (tp_size>1 without attention DP) each rank only holds a vocab shard of the draft logits, so per-rank random sampling draws a different token on each rank, desyncs the speculative-decoding control flow and deadlocks the next collective (generation hang).

Greedy could be repaired with draft_sampler()'s lightweight max+index all-gather, but random sampling needs the full distribution, so all-gather the sharded draft logits into the full vocab before advanced sampling. Every rank then samples from the same distribution with the shared seed. The LM-head-TP-in-ADP path is gathered upstream and is intentionally excluded.

Verified by local reproduction (DeepSeek-V3-Lite, tp4, mtp_nextn=2, cuda_graph, torch_compile, non-greedy temperature/top_k/top_p): hangs without this gather, passes with it.

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

xxi-nv · 2026-06-14T04:20:49Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-14T04:27:43Z

PR_Github #54084 [ run ] triggered by Bot. Commit: 85468e1 Link to invocation

tensorrt-cicd · 2026-06-14T08:53:45Z

PR_Github #54084 [ run ] completed with state SUCCESS. Commit: 85468e1
/LLM/main/L0_MergeRequest_PR pipeline #43168 completed with status: 'FAILURE'

xxi-nv · 2026-06-14T10:12:43Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-14T10:18:40Z

PR_Github #54110 [ run ] triggered by Bot. Commit: 85468e1 Link to invocation

tensorrt-cicd · 2026-06-14T16:45:12Z

PR_Github #54110 [ run ] completed with state SUCCESS. Commit: 85468e1
/LLM/main/L0_MergeRequest_PR pipeline #43194 completed with status: 'FAILURE'

…ng through draft_sampler

The previous fix routed every greedy MTP-Eagle draft step through draft_sampler(), but that call does not forward mapping_lm_head_tp. For the LM-head-TP-in-ADP configuration draft_sampler() then takes its ADP branch with a None mapping and crashes during warmup with "'NoneType' object has no attribute 'tp_group'" (Executor worker returned error), e.g. DeepSeek-R1 nvfp4 latency_adp_lmtp_tp4.

Only plain tensor parallelism (tp_size>1 without attention DP) shards the draft logits over the vocab dim and needs draft_sampler()'s all-gather argmax. The LM-head-TP-in-ADP case already yields full-vocab logits per rank (gathered upstream) and the no-TP / Eagle3 cases need nothing, so all of those take the plain d2t-aware argmax (_draft_sampler_greedy), restoring the pre-regression behavior for ADP while keeping the plain-TP hang fix.

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

xxi-nv · 2026-06-14T16:52:11Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-14T16:57:47Z

PR_Github #54140 [ run ] triggered by Bot. Commit: 16e577c Link to invocation

tensorrt-cicd · 2026-06-14T21:40:29Z

PR_Github #54140 [ run ] completed with state SUCCESS. Commit: 16e577c
/LLM/main/L0_MergeRequest_PR pipeline #43223 completed with status: 'FAILURE'

xxi-nv · 2026-06-15T00:29:33Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-15T00:35:23Z

PR_Github #54165 [ run ] triggered by Bot. Commit: 16e577c Link to invocation

tensorrt-cicd · 2026-06-15T01:25:58Z

PR_Github #54165 [ run ] completed with state SUCCESS. Commit: 16e577c
/LLM/main/L0_MergeRequest_PR pipeline #43248 completed with status: 'FAILURE'

xxi-nv · 2026-06-15T02:03:43Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-15T02:09:20Z

PR_Github #54182 [ run ] triggered by Bot. Commit: 16e577c Link to invocation

xxi-nv · 2026-06-15T07:15:23Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-15T07:21:06Z

PR_Github #54250 [ run ] triggered by Bot. Commit: 16e577c Link to invocation

tensorrt-cicd · 2026-06-15T07:25:45Z

PR_Github #54182 [ run ] completed with state ABORTED. Commit: 16e577c

Link to invocation

github-actions Bot assigned zhaoyangwang-nvidia May 29, 2026

zhaoyangwang-nvidia force-pushed the TRTLLM-12669-remove-allow-advanced-sampling branch from 903b453 to d237690 Compare May 29, 2026 10:05

zhaoyangwang-nvidia marked this pull request as ready for review May 29, 2026 10:10

zhaoyangwang-nvidia requested review from a team as code owners May 29, 2026 10:10

zhaoyangwang-nvidia requested review from nv-guomingz, sunnyqgg, syuoni, venkywonka and zhenhuaw-me May 29, 2026 10:10

ruodil approved these changes Jun 1, 2026

jieli-matrix approved these changes Jun 3, 2026

zhaoyangwang-nvidia changed the title ~~[TRTLLM-12669][refactor] Remove allow_advanced_sampling and capture dual CUDA graphs~~ [TRTLLM-12669][refactor] Eagle3 sampling: auto-detect greedy fast-path, mixed-batch rejection sampling, draft honors target params Jun 3, 2026

mikeiovine reviewed Jun 3, 2026

zhaoyangwang-nvidia force-pushed the TRTLLM-12669-remove-allow-advanced-sampling branch 2 times, most recently from 775bae6 to d6fd852 Compare June 4, 2026 02:54

zhaoyangwang-nvidia added 6 commits June 12, 2026 20:15

[TRTLLM-11508][fix] trim draft token to token_count when use_lm_head_…

8c16eed

…tp_in_adp Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

zhaoyangwang-nvidia force-pushed the TRTLLM-12669-remove-allow-advanced-sampling branch from 9f39541 to 764edb7 Compare June 13, 2026 03:16

zhaoyangwang-nvidia added 2 commits June 13, 2026 10:01
