[TRTLLM-12669][feat] Enable rejection sampling by default for Eagle3 one-model

zhaoyangwang-nvidia · zhaoyangwang-nvidia · commit d599fb530c5f · 2026-06-02T03:06:42.000-07:00
Flip the default of `use_rejection_sampling` from `False` to `True` on
DecodingBaseConfig. With the refactor of the all-greedy fast path in
place, this is safe: the runtime guard in `_can_use_rejection_sampling`
still requires a non-greedy batch, so all-greedy batches keep taking the
argmax fast path unchanged. Only batches that contain at least one
non-greedy request now use rejection-sampling acceptance instead of
exact-match verification.
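
The path selection described above can be sketched as follows. This is
a minimal illustration, not the runtime code: `RequestSketch`,
`is_greedy`, and `choose_verification_path` are hypothetical names, and
treating `temperature == 0.0` as "greedy" is an assumption made only
for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestSketch:
    # Hypothetical stand-in for a request's sampling parameters.
    temperature: float = 0.0


def is_greedy(req: RequestSketch) -> bool:
    # Assumption for this sketch: temperature 0.0 means greedy decoding.
    return req.temperature == 0.0


def choose_verification_path(use_rejection_sampling: bool,
                             batch: List[RequestSketch]) -> str:
    # All-greedy batches take the argmax fast path regardless of the flag.
    if all(is_greedy(r) for r in batch):
        return "argmax_fast_path"
    # Batches with any non-greedy request use rejection sampling
    # only when the flag is enabled.
    if use_rejection_sampling:
        return "rejection_sampling"
    # Otherwise they fall back to exact-match verification.
    return "exact_match"


greedy_batch = [RequestSketch(), RequestSketch()]
mixed_batch = [RequestSketch(), RequestSketch(temperature=0.8)]
```

With the new default, `greedy_batch` is unaffected and only
`mixed_batch` changes behavior.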

Benchmark results on Qwen3-235B-A22B + Eagle3 (tp=8) show consistent
+6.4% to +9.4% throughput and +3.4 to +4.3 pp acceptance rate across
batch sizes 1-16 vs the exact-match baseline. Other Eagle3 deployments
see smaller but uniformly positive acceptance-rate gains.
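
For intuition on why acceptance can rise relative to exact-match, the
sketch below compares per-token acceptance probabilities for one toy
distribution pair. It assumes the textbook speculative-sampling rule
(accept a draft token x with probability min(1, p(x)/q(x))) and assumes
exact-match means the draft token must equal an independently sampled
target token. Neither assumption is confirmed by this commit, the
function names are hypothetical, and the numbers are illustrative only;
they are unrelated to the benchmark above.

```python
from typing import Sequence


def exact_match_accept_prob(p: Sequence[float], q: Sequence[float]) -> float:
    # Draft token x ~ q, target token y ~ p sampled independently;
    # the draft is accepted only when x == y.
    return sum(pi * qi for pi, qi in zip(p, q))


def rejection_sampling_accept_prob(p: Sequence[float],
                                   q: Sequence[float]) -> float:
    # Draft token x ~ q is accepted with probability min(1, p(x) / q(x)),
    # so the overall acceptance probability is sum_x min(p(x), q(x)).
    return sum(min(pi, qi) for pi, qi in zip(p, q))


target = [0.6, 0.3, 0.1]  # toy target distribution p
draft = [0.5, 0.4, 0.1]   # toy draft distribution q

exact = exact_match_accept_prob(target, draft)
rejection = rejection_sampling_accept_prob(target, draft)
```

Because min(p, q) >= p * q for probabilities, the rejection-sampling
value is never below the exact-match value under these assumptions.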

Two prior `raise ValueError` paths are converted to silent fallbacks so
the new default does not break existing users:
- Non-Eagle3 spec configs (PARD, DFlash, MTP, ...) silently disable the
  flag in TorchLlmArgs post-validation, since rejection sampling is only
  wired up for Eagle3 one-model paths.
- SA-enhanced Eagle3 configs disable the flag in the per-config
  validator, since SA may override proposed draft tokens.
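
The two fallbacks above can be sketched with a plain dataclass. This is
an illustration only: `SpecConfigSketch`, its `kind` field, and
`resolve_rejection_sampling` are hypothetical and do not mirror the
pydantic validators in the diff line for line.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpecConfigSketch:
    # Hypothetical stand-in for a speculative decoding config.
    kind: str                            # e.g. "eagle3", "pard", "mtp"
    use_rejection_sampling: bool = True  # new default
    sa_config: Optional[dict] = None


def resolve_rejection_sampling(cfg: SpecConfigSketch) -> SpecConfigSketch:
    # SA enhancement may override proposed draft tokens, so the flag is
    # turned off instead of raising an error.
    if cfg.use_rejection_sampling and cfg.sa_config is not None:
        cfg.use_rejection_sampling = False
    # Rejection sampling is only wired up for Eagle3 one-model paths;
    # other spec types fall back silently as well.
    if cfg.use_rejection_sampling and cfg.kind != "eagle3":
        cfg.use_rejection_sampling = False
    return cfg


eagle3 = resolve_rejection_sampling(SpecConfigSketch(kind="eagle3"))
eagle3_sa = resolve_rejection_sampling(
    SpecConfigSketch(kind="eagle3", sa_config={}))
pard = resolve_rejection_sampling(SpecConfigSketch(kind="pard"))
```

Note that a silent fallback also applies when a user sets the flag to
`True` explicitly on an unsupported config; no error is raised in that
case either.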

Users who want the prior exact-match behavior can still pass
`use_rejection_sampling=False` explicitly.

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
diff --git a/tensorrt_llm/llmapi/llm_args.py b/tensorrt_llm/llmapi/llm_args.py
@@ -897,11 +897,13 @@ class DecodingBaseConfig(StrictBaseModel):
         "PyTorch backend only.")
 
     use_rejection_sampling: bool = Field(
-        default=False,
+        default=True,
         status="prototype",
         description=
-        "If true, enables rejection sampling for one-model speculative decoding paths. "
-        "This is intended for non-greedy sampling configurations on the PyTorch backend. "
+        "If true (default), enables rejection sampling for one-model speculative "
+        "decoding paths when the batch contains any non-greedy request. All-greedy "
+        "batches always take the argmax fast path regardless of this flag. Set to "
+        "false to fall back to exact-match verification on non-greedy batches. "
         "The non-dynamic-tree one-model path requires FlashInfer.")
 
     # If set, drafting is allowed to use chain drafter.
@@ -958,13 +960,14 @@ def validate_draft_len_schedule_and_sort(cls, v, info):
 
     @model_validator(mode='after')
     def validate_rejection_sampling_config(self):
-        """Reject SA-enhanced configurations that invalidate rejection sampling."""
+        """Disable rejection sampling when SA-enhanced configurations are
+        active, since SA may override the proposed draft tokens. This is a
+        silent fallback so the new default (True) does not break sa_config
+        users.
+        """
         if self.use_rejection_sampling and getattr(self, 'sa_config',
                                                    None) is not None:
-            raise ValueError(
-                "use_rejection_sampling is incompatible with sa_config "
-                "because SA enhancement may override the proposed draft tokens."
-            )
+            self.use_rejection_sampling = False
         return self
 
     @model_validator(mode='after')
@@ -4140,12 +4143,12 @@ def validate_speculative_config(self):
                     exclude={"decoding_type"})
                 self.speculative_config = Eagle3DecodingConfig(**eagle_data)
 
-            if self.speculative_config.use_rejection_sampling:
-                if not isinstance(self.speculative_config,
-                                  Eagle3DecodingConfig):
-                    raise ValueError(
-                        "use_rejection_sampling is only supported for "
-                        "PyTorch Eagle3 one-model speculative decoding paths.")
+            if self.speculative_config.use_rejection_sampling and not isinstance(
+                    self.speculative_config, Eagle3DecodingConfig):
+                # Rejection sampling is only wired up for Eagle3 one-model paths.
+                # Silently fall back for other spec types so the new default
+                # (True) does not break them.
+                self.speculative_config.use_rejection_sampling = False
 
             if isinstance(self.speculative_config, PARDDecodingConfig):
                 assert self.speculative_config.max_draft_len > 0, "PARD max_draft_len must be > 0"