
Commit d3a16b3

[TRTLLM-11045][feat] Integrate SA with EAGLE3 and PARD (#11878)
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
1 parent ae2dc3d commit d3a16b3

14 files changed: 518 additions & 98 deletions


docs/source/features/speculative-decoding.md

Lines changed: 71 additions & 0 deletions
@@ -48,6 +48,8 @@ speculative_config = Eagle3DecodingConfig(
 llm = LLM(model, speculative_config=speculative_config)
 ```
 
+EAGLE 3 can be combined with the [Suffix Automaton enhancement](#suffix-automaton-sa-enhancement) for improved acceptance rates on repetitive content. See the SA section below for details.
+
 ### NGram
 
 The NGram method is an implementation of [this Prompt Lookup Decoding algorithm](https://github.com/apoorvumang/prompt-lookup-decoding).
@@ -88,6 +90,29 @@ speculative_config = MTPDecodingConfig(
 llm = LLM("/path/to/deepseek_model", speculative_config=speculative_config)
 ```
 
+MTP can be combined with the [Suffix Automaton enhancement](#suffix-automaton-sa-enhancement) for improved acceptance rates on repetitive content. See the SA section below for details.
+
+### PARD
+
+PARD (PARallel Draft) is a target-independent speculative decoding method that predicts all draft tokens in a single forward pass using mask tokens. Unlike MTP or EAGLE 3, which generate drafts one token at a time, PARD produces K draft tokens in parallel.
+
+Reference: [PARD: Parallel Drafting for Speculative Decoding](https://arxiv.org/pdf/2504.18583)
+
+* `max_draft_len`: Maximum draft candidate length.
+* `speculative_model`: Path or HuggingFace model ID for the PARD draft model.
+* `mask_token_id`: Token ID used as the mask token for parallel prediction. If not set, it is read from the draft model config.
+
+```python
+from tensorrt_llm.llmapi import PARDDecodingConfig
+
+speculative_config = PARDDecodingConfig(
+    max_draft_len=4, speculative_model="/path/to/pard_model")
+
+llm = LLM("/path/to/target_model", speculative_config=speculative_config)
+```
+
+PARD can be combined with the [Suffix Automaton enhancement](#suffix-automaton-sa-enhancement) for improved acceptance rates on repetitive content. See the SA section below for details.
+
 ### User-provided drafting
 A completely user-defined drafting method can be supplied with a `UserProvidedDecodingConfig` that includes
 * `max_draft_len`: Maximum draft candidate length.
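To make the parallel-drafting idea in the hunk above concrete, here is a minimal sketch of a PARD-style draft step. It is not the TensorRT-LLM implementation: the `pard_draft_step` name and the HuggingFace-style `draft_model(input_ids).logits` call are assumptions, and the exact mask-position convention depends on how the draft model was trained.

```python
import torch

def pard_draft_step(draft_model, prefix_ids: torch.Tensor,
                    mask_token_id: int, k: int) -> torch.Tensor:
    """Sketch: draft K tokens in one forward pass via mask tokens.

    With the usual causal-LM shift, logits at the last real token predict
    draft token 1 and logits at each of the K-1 mask positions predict
    the token after it, so one pass yields K draft tokens instead of K
    sequential decode steps.
    """
    # prefix_ids: [1, seq_len] tokens accepted so far.
    masks = torch.full((1, k - 1), mask_token_id,
                       dtype=prefix_ids.dtype, device=prefix_ids.device)
    input_ids = torch.cat([prefix_ids, masks], dim=1)
    logits = draft_model(input_ids).logits  # [1, seq_len + k - 1, vocab]
    return logits[0, -k:].argmax(dim=-1)    # greedy draft tokens, [k]
```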
@@ -103,6 +128,40 @@ speculative_config = UserProvidedDecodingConfig(
 llm = LLM("/path/to/target_model", speculative_config=speculative_config)
 ```
 
+## Suffix Automaton (SA) Enhancement
+
+The Suffix Automaton (SA) is a model-free, GPU-based pattern-matching draft enhancer. It finds suffix matches in previously generated tokens and proposes draft tokens when the match is long enough. SA is very accurate when it matches (exact pattern repetition), while neural methods are better for novel content; combining them gives the best of both worlds.
+
+SA can be combined with the following speculative decoding techniques:
+
+* **MTP** (`MTPDecodingConfig`)
+* **EAGLE 3** (`Eagle3DecodingConfig`)
+* **PARD** (`PARDDecodingConfig`)
+
+To enable SA combination, set `use_sa_spec=True` on the speculative config. The `sa_spec_threshold` parameter controls the minimum suffix match length required to override the neural draft (default: 4).
+
+```python
+from tensorrt_llm.llmapi import Eagle3DecodingConfig
+
+speculative_config = Eagle3DecodingConfig(
+    max_draft_len=4,
+    speculative_model="/path/to/eagle3_model",
+    use_sa_spec=True,
+    sa_spec_threshold=4)
+
+llm = LLM("/path/to/target_model", speculative_config=speculative_config)
+```
+
+SA can also be used as a standalone speculative decoding technique via `SADecodingConfig`:
+
+```python
+from tensorrt_llm.llmapi import SADecodingConfig
+
+speculative_config = SADecodingConfig(max_draft_len=4)
+
+llm = LLM("/path/to/target_model", speculative_config=speculative_config)
+```
+
 ## Usage with `trtllm-bench` and `trtllm-serve`
 
 ```{eval-rst}
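The following brute-force sketch illustrates the suffix-match query that SA answers and how `sa_spec_threshold` gates the override. `sa_propose` is a hypothetical name for illustration only; a real suffix automaton maintains the query incrementally (and, per the doc hunk above, on the GPU) rather than rescanning history.

```python
from typing import List, Optional

def sa_propose(tokens: List[int], k: int, threshold: int) -> Optional[List[int]]:
    """Propose draft tokens by matching the current suffix against history.

    Brute force for clarity (O(n^2)); a suffix automaton built over the
    generated tokens answers the same longest-suffix-match query in
    amortized O(1) per appended token.
    """
    n = len(tokens)
    best_len, best_end = 0, -1
    for end in range(n - 1):  # candidate earlier occurrence ends at `end`
        m = 0
        while m <= end and tokens[end - m] == tokens[n - 1 - m]:
            m += 1
        if m > best_len:
            best_len, best_end = m, end
    if best_len < threshold:
        return None  # match too short: keep the neural draft
    # Propose the tokens that followed the matched occurrence last time.
    return tokens[best_end + 1:best_end + 1 + k] or None
```

For example, with `tokens = [5, 1, 2, 3, 9, 1, 2, 3]` the current suffix `[1, 2, 3]` also ends at index 3, so with `threshold=3` the proposal is `tokens[4:4+k]`, i.e. `[9, 1, 2, 3][:k]`.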
@@ -117,6 +176,8 @@ Speculative decoding options must be specified via `--config config.yaml` for bo
 * `Eagle3`
 * `NGram`
 * `DraftTarget`
+* `PARD`
+* `SA`
 
 > Note: The PyTorch backend supports only `Eagle3`. `decoding_type: Eagle` is accepted as a backward-compatible alias for `Eagle3`, but EAGLE (v1/v2) draft checkpoints are incompatible.
@@ -138,6 +199,16 @@ speculative_config:
   speculative_model: /path/to/draft/model
 ```
 
+```yaml
+# SA combination: enable Suffix Automaton enhancement with any supported technique
+speculative_config:
+  decoding_type: Eagle3
+  max_draft_len: 4
+  speculative_model: /path/to/draft/model
+  use_sa_spec: true
+  sa_spec_threshold: 4
+```
+
 ```{note}
 The field name `speculative_model_dir` can also be used as an alias for `speculative_config.speculative_model`. For example:

tensorrt_llm/_torch/pyexecutor/model_engine.py

Lines changed: 20 additions & 14 deletions
@@ -3,7 +3,6 @@
 import functools
 import gc
 import inspect
-import itertools
 import math
 import os
 import weakref
@@ -3489,21 +3488,28 @@ def _prepare_inputs(
             raise NotImplementedError(
                 f"Unsupported cp_type {getattr(cp_type, 'name', cp_type)}.")
 
-        # Initialize SA state for new requests (MTP+SA path)
+        # Initialize SA state for new requests (MTP+SA, EAGLE3+SA, PARD+SA, etc.)
         use_sa_spec = (self.spec_config is not None
                        and getattr(self.spec_config, 'use_sa_spec', False))
-        if (use_sa_spec and spec_metadata is not None
-                and hasattr(spec_metadata, 'sa_manager')
-                and spec_metadata.sa_manager is not None
-                and self.mapping.is_last_pp_rank()):
-            sa_manager = spec_metadata.sa_manager
-            for request in itertools.chain(
-                    scheduled_requests.context_requests,
-                    scheduled_requests.generation_requests):
-                if request.py_request_id not in sa_manager._initialized_requests:
-                    sa_manager.add_request(request.py_request_id,
-                                           request.get_tokens(0))
-                    sa_manager._initialized_requests.add(request.py_request_id)
+        if use_sa_spec and resource_manager is not None and self.mapping.is_last_pp_rank(
+        ):
+            from tensorrt_llm._torch.speculative.suffix_automaton import \
+                SuffixAutomatonManager
+            spec_rm = resource_manager.get_resource_manager(
+                ResourceManagerType.SPEC_RESOURCE_MANAGER)
+            sa_manager = None
+            if spec_rm is not None:
+                if isinstance(spec_rm, SuffixAutomatonManager):
+                    sa_manager = spec_rm
+                else:
+                    sa_manager = getattr(spec_rm, 'sa_manager', None)
+            if sa_manager is not None:
+                for request in scheduled_requests.all_requests():
+                    if request.py_request_id not in sa_manager._initialized_requests:
+                        sa_manager.add_request(request.py_request_id,
+                                               request.get_tokens(0))
+                        sa_manager._initialized_requests.add(
+                            request.py_request_id)
 
         return self._prepare_tp_inputs(
             scheduled_requests, kv_cache_manager, attn_metadata, spec_metadata,
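The SA-manager lookup in this hunk has to handle two deployment shapes. Condensed into a standalone helper for readability (the logic mirrors the hunk; `resolve_sa_manager` itself is an illustrative name, not part of the change):

```python
def resolve_sa_manager(spec_rm, SuffixAutomatonManager):
    """Resolve the SA manager for both deployment shapes:

    - standalone SA: the spec resource manager *is* the SA manager;
    - combined modes (MTP/EAGLE3/PARD + SA): the technique's own resource
      manager carries it as an optional `.sa_manager` attribute.
    """
    if spec_rm is None:
        return None
    if isinstance(spec_rm, SuffixAutomatonManager):
        return spec_rm
    return getattr(spec_rm, 'sa_manager', None)  # may be absent -> None
```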

tensorrt_llm/_torch/speculative/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,7 @@
 from .mtp import MTPEagleWorker, MTPSampler, MTPSpecMetadata, MTPWorker
 from .ngram import NGramDrafter, NGramPoolManager
 from .pard import PARDSpecMetadata, PARDWorker
+from .sa_enhancer import SADraftEnhancer
 from .sa_worker import SASampler, SASpecMetadata, SAWorker
 from .save_hidden_state import (SaveHiddenStatesResourceManager,
                                 SaveHiddenStatesSpecMetadata)
@@ -31,6 +32,7 @@
     "NGramPoolManager",
     "PARDSpecMetadata",
     "PARDWorker",
+    "SADraftEnhancer",
     "SASampler",
     "SASpecMetadata",
     "SAWorker",

tensorrt_llm/_torch/speculative/eagle3.py

Lines changed: 52 additions & 8 deletions
@@ -14,6 +14,7 @@
 from ..pyexecutor.scheduler import ScheduledRequests
 from .interface import SpecMetadata, SpecWorkerBase
 from .mtp import MTPSampler
+from .sa_enhancer import SADraftEnhancer
 from .spec_tree_manager import SpecTreeManager
 
 if TYPE_CHECKING:
@@ -27,14 +28,21 @@ class Eagle3ResourceManager(BaseResourceManager):
     and one for the draft model. Use this class to manage the hidden states.
     """
 
-    def __init__(self, config: "EagleDecodingConfig", dtype: torch.dtype,
-                 hidden_size: int, max_num_requests: int, max_seq_len: int,
-                 max_num_tokens: int):
+    def __init__(self,
+                 config: "EagleDecodingConfig",
+                 dtype: torch.dtype,
+                 hidden_size: int,
+                 max_num_requests: int,
+                 max_seq_len: int,
+                 max_num_tokens: int,
+                 sa_manager=None):
         self.dtype = dtype
         self.max_draft_len = config.max_draft_len
         self.hidden_size = hidden_size
         self.max_num_requests = max_num_requests
         self.max_seq_len = max_seq_len
+        # Optional SA manager for EAGLE3+SA mode
+        self.sa_manager = sa_manager
         # There could be dummy request for padding batch when using CUDA graph.
         # Reserve one more slot for the dummy request.
         slot_size = self.max_seq_len + 1
@@ -94,13 +102,18 @@ def free_resources(self, request: LlmRequest):
         self.seq_lens[slot_id] = 0
         self.start_indices[slot_id] = 0
         self.slot_manager.remove_slot(request.request_id)
+        if self.sa_manager is not None:
+            self.sa_manager.remove_request(request.request_id)
 
     def add_dummy_requests(self, request_ids: List[int]):
         for rid in request_ids:
             self.slot_manager.add_slot(rid)
+        if self.sa_manager is not None:
+            self.sa_manager.add_dummy_requests(request_ids)
 
     def shutdown(self):
-        pass
+        if self.sa_manager is not None:
+            self.sa_manager.shutdown()
 
     def get_max_resource_count(self) -> int:
         return self.max_num_requests
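Between this hunk and the `prepare` hook below, `Eagle3ResourceManager` delegates a fixed lifecycle to the optional SA manager. Collected as a typing `Protocol` for reference (illustrative, a sketch; the method names and arguments are exactly the calls made in this file):

```python
from typing import List, Protocol

class SAManagerLike(Protocol):
    """Surface the EAGLE3+SA integration expects of `sa_manager`."""

    def remove_request(self, request_id: int) -> None: ...       # free_resources
    def add_dummy_requests(self, request_ids: List[int]) -> None: ...  # CUDA-graph padding
    def shutdown(self) -> None: ...                               # teardown
    def prepare(self, request_ids: List[int], max_draft_len: int) -> None: ...  # per-step prep
```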
@@ -298,6 +311,8 @@ class Eagle3OneModelSpecMetadata(SpecMetadata):
     dtype: torch.dtype = torch.bfloat16
     # The index of the batch inputs
     batch_indices_cuda: Optional[torch.Tensor] = None
+    # Optional resource manager (used to access SA manager for EAGLE3+SA)
+    spec_resource_manager: Optional[Eagle3ResourceManager] = None
 
     def __post_init__(self):
         if self.layers_to_capture is None:
@@ -345,6 +360,12 @@ def prepare(self):
             non_blocking=True)
         self.num_tokens -= (self.num_generations) * self.max_draft_len
 
+        sa_manager = getattr(self.spec_resource_manager, 'sa_manager', None)
+        if sa_manager is not None:
+            gen_request_ids = self.request_ids[num_seqs - self.num_generations:]
+            if gen_request_ids:
+                sa_manager.prepare(gen_request_ids, self.max_draft_len)
+
     def maybe_capture_hidden_states(
         self,
         layer_id: int,
@@ -375,6 +396,9 @@ def __init__(self,
         super().__init__(use_separate_draft_kv_cache)
         self.spec_config = spec_config
         self.mapping = mapping
+        self.sa_enhancer: Optional[SADraftEnhancer] = None
+        if getattr(spec_config, 'use_sa_spec', False):
+            self.sa_enhancer = SADraftEnhancer(spec_config.sa_spec_threshold)
 
     @property
     def max_draft_len(self) -> int:
@@ -424,6 +448,19 @@ def forward(self,
         accepted_tokens, num_accepted_tokens = self.sample_and_accept_draft_tokens(
             logits, attn_metadata, spec_metadata)
 
+        sa_manager = getattr(spec_metadata.spec_resource_manager, 'sa_manager',
+                             None)
+        if self.sa_enhancer is not None and sa_manager is not None:
+            self.sa_enhancer.extend_and_prepare(
+                sa_manager=sa_manager,
+                request_ids=spec_metadata.request_ids,
+                accepted_tokens=accepted_tokens,
+                num_accepted_tokens=num_accepted_tokens,
+                num_gens=num_gens,
+                num_contexts=num_contexts,
+                max_draft_len=self.max_draft_len,
+            )
+
         # Save the old attn_metadata and spec_metadata
         self._prepare_attn_metadata_for_spec_dec(attn_metadata)
 
@@ -528,6 +565,14 @@ def forward(self,
         }
         next_draft_tokens = torch.stack(next_draft_tokens, dim=1)
 
+        # Override with SA draft tokens after all draft layers have run,
+        # so that draft layers never see SA tokens in their inputs.
+        if self.sa_enhancer is not None:
+            gen_draft_tokens = next_draft_tokens[num_contexts:]
+            gen_draft_tokens = self.sa_enhancer.maybe_override_all_draft_tokens(
+                gen_draft_tokens)
+            next_draft_tokens[num_contexts:] = gen_draft_tokens
+
         # restore attn_metadata to support cuda graph
         self._restore_attn_metadata_from_spec_dec(attn_metadata)
         # restore all_rank_num_tokens for attention DP
@@ -588,11 +633,10 @@ def draft_decoder(
             Draft token ids. Flattened.
         '''
 
-        # Note: using greedy for draft tokens is a bit easier to implement and
-        # faster. It doesn't affect the final output and seems to have a negligible
-        # impact on AR.
         d2t = getattr(draft_model.model, "d2t", None)
-        return self._draft_sampler_greedy(logits, d2t)
+        draft_tokens = self._draft_sampler_greedy(logits, d2t)
+
+        return draft_tokens
 
     def prepare_1st_drafter_inputs(
         self,
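`SADraftEnhancer` itself lives in the new `sa_enhancer` module, which this commit excerpt does not show. As a mental model for `maybe_override_all_draft_tokens`, consistent with the documented `sa_spec_threshold` semantics (the SA draft replaces the neural draft only when the suffix match is long enough), a hypothetical core might look like this; the function name, the `sa_draft` and `sa_match_len` inputs, and the whole-row override policy are assumptions for illustration:

```python
import torch

def override_draft_tokens(neural_draft: torch.Tensor,
                          sa_draft: torch.Tensor,
                          sa_match_len: torch.Tensor,
                          threshold: int) -> torch.Tensor:
    """Hypothetical core of the SA override step.

    neural_draft: [num_gens, max_draft_len] EAGLE3 draft tokens.
    sa_draft:     [num_gens, max_draft_len] SA-proposed tokens.
    sa_match_len: [num_gens] longest suffix-match length per request.
    Rows whose match meets the threshold take the SA draft wholesale;
    all other rows keep the neural draft unchanged.
    """
    use_sa = (sa_match_len >= threshold).unsqueeze(1)  # [num_gens, 1]
    return torch.where(use_sa, sa_draft, neural_draft)
```

Applying the override after all draft layers have run, as the hunk above notes, keeps the draft layers' inputs free of SA tokens, so CUDA-graph shapes and the drafting loop are unaffected by whether SA fires.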
