EAGLE 3 can be combined with the [Suffix Automaton enhancement](#suffix-automaton-sa-enhancement) for improved acceptance rates on repetitive content. See the SA section below for details.
### NGram
The NGram method is an implementation of [this Prompt Lookup Decoding algorithm](https://github.com/apoorvumang/prompt-lookup-decoding).
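The core idea of prompt lookup decoding can be sketched in a few lines: match the most recent n-gram against earlier occurrences in the context and propose the tokens that followed it as the draft. This is an illustrative sketch, not the library's implementation; the function name and parameters are hypothetical.

```python
def prompt_lookup_draft(tokens, ngram_size=3, num_draft=5):
    """Propose draft tokens by matching the last `ngram_size` tokens
    against an earlier occurrence in the context (prompt lookup decoding)."""
    if len(tokens) < ngram_size:
        return []
    pattern = tokens[-ngram_size:]
    # Scan backwards for the most recent earlier occurrence of the pattern.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == pattern:
            continuation = tokens[start + ngram_size:start + ngram_size + num_draft]
            if continuation:
                return continuation
    return []

# Repetitive context: the tokens that followed "the quick" earlier are proposed.
ctx = "the quick brown fox jumps over the quick".split()
print(prompt_lookup_draft(ctx, ngram_size=2, num_draft=3))  # → ['brown', 'fox', 'jumps']
```

On repetitive content (code, summarization, multi-turn chat) this cheap lookup often yields long accepted drafts despite using no model at all.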
MTP can be combined with the [Suffix Automaton enhancement](#suffix-automaton-sa-enhancement) for improved acceptance rates on repetitive content. See the SA section below for details.
### PARD
PARD (PARallel Draft) is a target-independent speculative decoding method that predicts all draft tokens in a single forward pass using mask tokens. Unlike MTP or EAGLE 3, which generate drafts one token at a time, PARD produces K draft tokens in parallel.
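Conceptually, the draft step appends K mask tokens to the context so a single forward pass can fill all K draft positions at once, instead of K sequential autoregressive steps. The sketch below is illustrative only; `MASK_ID` and the helper are hypothetical names, not the library's API (the real mask token ID comes from the draft model config).

```python
MASK_ID = -1  # hypothetical placeholder for the draft model's mask token id

def pard_draft_input(context_ids, k):
    """Append K mask tokens so one forward pass of the draft model can
    predict all K draft positions in parallel."""
    return context_ids + [MASK_ID] * k

print(pard_draft_input([10, 11, 12], 4))  # → [10, 11, 12, -1, -1, -1, -1]
```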
Reference: [PARD: Parallel Drafting for Speculative Decoding](https://arxiv.org/pdf/2504.18583)
* `max_draft_len`: Maximum draft candidate length.
* `speculative_model`: Path or HuggingFace model ID for the PARD draft model.
* `mask_token_id`: Token ID used as the mask token for parallel prediction. If not set, it is read from the draft model config.
```python
from tensorrt_llm.llmapi import PARDDecodingConfig

# Illustrative values; the model path is a placeholder.
speculative_config = PARDDecodingConfig(
    max_draft_len=4,
    speculative_model="/path/to/draft/model",
)
```
PARD can be combined with the [Suffix Automaton enhancement](#suffix-automaton-sa-enhancement) for improved acceptance rates on repetitive content. See the SA section below for details.
### User-provided drafting
A completely user-defined drafting method can be supplied with a `UserProvidedDecodingConfig`.
### Suffix Automaton (SA) Enhancement

The Suffix Automaton (SA) is a model-free, GPU-based pattern-matching draft enhancer. It finds suffix matches in previously generated tokens and proposes draft tokens when the match is long enough. SA is very accurate when it matches (exact pattern repetition), while neural methods are better for novel content; combining them gives the best of both worlds.
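The combination logic can be sketched as follows: when the suffix match is long enough, the SA draft overrides the neural one; otherwise the neural draft is used. The function and parameter names here are hypothetical, not the library's API.

```python
def choose_draft(sa_match_len, sa_draft, neural_draft, threshold=4):
    """Combine SA with a neural drafter: a sufficiently long exact suffix
    match means exact repetition, where SA's draft is highly reliable."""
    if sa_match_len >= threshold and sa_draft:
        return sa_draft      # exact pattern repetition: trust the SA draft
    return neural_draft      # novel content: fall back to the neural draft

print(choose_draft(5, [7, 8, 9], [1, 2, 3]))  # → [7, 8, 9]
print(choose_draft(2, [7, 8, 9], [1, 2, 3]))  # → [1, 2, 3]
```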
SA can be combined with the following speculative decoding techniques:
* **MTP** (`MTPDecodingConfig`)
* **EAGLE 3** (`Eagle3DecodingConfig`)
* **PARD** (`PARDDecodingConfig`)
To enable SA combination, set `use_sa_spec=True` on the speculative config. The `sa_spec_threshold` parameter controls the minimum suffix match length required to override the neural draft (default: 4).
```python
from tensorrt_llm.llmapi import Eagle3DecodingConfig

# Illustrative values; the model path is a placeholder.
speculative_config = Eagle3DecodingConfig(
    max_draft_len=4,
    speculative_model="/path/to/draft/model",
    use_sa_spec=True,      # enable the Suffix Automaton enhancement
    sa_spec_threshold=4,   # minimum suffix match length to override the neural draft
)
```
Speculative decoding options must be specified via `--config config.yaml`. The supported `decoding_type` values are:
* `Eagle3`
* `NGram`
* `DraftTarget`
* `PARD`
* `SA`
> Note: The PyTorch backend supports only `Eagle3`. `decoding_type: Eagle` is accepted as a backward-compatible alias for `Eagle3`, but EAGLE (v1/v2) draft checkpoints are incompatible.
```yaml
speculative_config:
  speculative_model: /path/to/draft/model
```
```yaml
# SA combination: enable Suffix Automaton enhancement with any supported technique
speculative_config:
  decoding_type: Eagle3
  max_draft_len: 4
  speculative_model: /path/to/draft/model
  use_sa_spec: true
  sa_spec_threshold: 4
```
```{note}
The field name `speculative_model_dir` can also be used as an alias for `speculative_config.speculative_model`.
```
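For instance, the alias can appear in the YAML config like this (the path is a placeholder):

```yaml
speculative_config:
  decoding_type: Eagle3
  max_draft_len: 4
  speculative_model_dir: /path/to/draft/model
```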