23 commits
7cde870
feat: add guided decoding support for speculative decoding (MTP)
windreamer Apr 27, 2026
200cbf4
refactor(spec_decode): skip guided-processor fork for Eagle3
windreamer Apr 28, 2026
2da7d83
docs(spec_decoding): use MyST colon-fence admonitions instead of GitH…
windreamer Apr 28, 2026
631299b
feat(spec_decode): add Eagle3 guided decoding with d2t bitmask transl…
windreamer Apr 28, 2026
399fde5
fix: fix issues raised by copilot
windreamer Apr 28, 2026
b89fe80
fix: batch sync GPU tensors to CPU before loop to avoid per-iteration…
windreamer May 6, 2026
10a9afe
fix: advance forked GrammarMatchers with draft tokens instead of argmax
windreamer May 6, 2026
1f651f9
fix: use asyncio.to_thread for spec_agent guided decoding CPU ops
windreamer May 15, 2026
a446cf6
fix: add missing accept_token in spec_agent prefill path
windreamer May 15, 2026
a68193d
fix: session_ctx expand/slice, cache bitmask constants, fix tests, up…
windreamer May 15, 2026
3a4fa93
refactor: make proposer get_outputs async, wrap xgrammar calls in to_…
windreamer May 18, 2026
3b57e64
test: skip MTP guided decoding integration tests when FA3 unavailable
windreamer May 18, 2026
5fe0e73
fix: wrap async get_outputs() calls with asyncio.run() in test
windreamer May 18, 2026
1fbfa47
fix: remove usage of flash_attn_v3_available
windreamer May 22, 2026
5321b55
docs: add missing __main__ module test in examples in docs
windreamer Jun 2, 2026
14e7192
fix: update test_spec_agent for guided decoding support
windreamer Jun 2, 2026
db75fb6
fix: address Copilot review feedback — CPU tensor, redundant assignme…
windreamer Jun 2, 2026
b7c4f36
fix: correct misleading docstrings and rename _apply_guided_bitmask →…
windreamer Jun 5, 2026
18c355d
refactor: consolidate guided spec decoding logic into GuidedSpecHelper
windreamer Jun 5, 2026
9f63ee5
docs: fix example codes
windreamer Jun 9, 2026
f455380
Move .cpu() calls into asyncio.to_thread worker closures
windreamer Jun 9, 2026
4ed439c
Move cpu_draft .cpu() into asyncio.to_thread in apply_serial_bitmask
windreamer Jun 9, 2026
586196c
fix: increase max_batch_size to 2 for MTP guided decoding tests
windreamer Jun 17, 2026
95 changes: 93 additions & 2 deletions docs/en/advance/spec_decoding.md
@@ -2,8 +2,9 @@

Speculative decoding is an optimization technique that introduces a lightweight draft model to propose multiple next tokens; the main model then verifies them in a single forward pass and keeps the longest matching run of tokens. Compared with standard autoregressive decoding, this method lets the system generate multiple tokens at once.

> \[!NOTE\]
> This is an experimental feature in lmdeploy.
:::{note}
This is an experimental feature in lmdeploy.
:::

## Examples

@@ -104,3 +105,93 @@ deepseek-ai/DeepSeek-V3 \
--max-batch-size 128 \
--enable-metrics
```

## Guided Decoding with Speculative Decoding

Speculative decoding (MTP) can be combined with [structured output](./structed_output.md) so that the draft tokens proposed by the spec model also respect the grammar constraints (e.g. JSON schema, regex). This significantly improves the acceptance rate compared to running spec decoding without grammar masks.

:::{note}
This feature is supported for spec methods that inherit from `DeepseekMTP`, including `deepseek_mtp`, `qwen3_5_mtp`, and `eagle3`. Only the PyTorch backend is supported.
:::

### How it works

The grammar mask is applied at two stages:

1. **Draft model** — forked grammar matchers are used to mask each draft position serially. Each position's mask depends on the token accepted at the previous position, ensuring the draft model proposes grammatically valid tokens.
2. **Target model verification** — position-serial grammar masking is applied to the target model's logits. After rejection sampling, only the accepted tokens are fed back to the original (un-forked) grammar matchers, keeping them in sync for the next step.
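
The two stages above can be sketched with a toy matcher. This is a minimal illustration under stated assumptions, not lmdeploy's implementation: `ToyMatcher`, `propose_draft`, and the six-token vocabulary are hypothetical stand-ins for the real grammar matchers and draft loop, and the draft logits are supplied up front instead of being produced step by step by a draft model.

```python
import copy


class ToyMatcher:
    """Hypothetical stand-in for a grammar matcher: a tiny finite-state machine."""

    def __init__(self, transitions, state=0):
        self.transitions = transitions  # state -> {token_id: next_state}
        self.state = state

    def allowed(self):
        """Token ids the grammar permits in the current state."""
        return set(self.transitions.get(self.state, {}))

    def accept_token(self, token_id):
        self.state = self.transitions[self.state][token_id]

    def fork(self):
        # Shallow copy: the transition table is shared, the state is independent.
        return copy.copy(self)


def propose_draft(matcher, draft_logits):
    """Mask each draft position serially on a fork, leaving `matcher` untouched."""
    forked = matcher.fork()
    draft = []
    for logits in draft_logits:
        allowed = forked.allowed()
        # Masked argmax: disallowed tokens are never considered.
        token = max(allowed, key=lambda t: logits[t])
        forked.accept_token(token)  # the next position's mask depends on this token
        draft.append(token)
    return draft


# Toy vocabulary: 0='{', 1='"k"', 2=':', 3='1', 4='}', 5='2'
transitions = {0: {0: 1}, 1: {1: 2}, 2: {2: 3}, 3: {3: 4, 5: 4}, 4: {4: 5}}
matcher = ToyMatcher(transitions)

# Unmasked argmax at position 0 would pick token 1, which the grammar forbids.
draft_logits = [
    [0.1, 0.9, 0.0, 0.0, 0.0, 0.0],
    [0.2, 0.3, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.8, 0.0, 0.1],
]
draft = propose_draft(matcher, draft_logits)
state_after_draft = matcher.state  # only the fork advanced

# Suppose verification accepts the first two draft tokens; only those
# are fed back to the original matcher.
for token in draft[:2]:
    matcher.accept_token(token)
```

Working on a fork means a rejected draft token never has to be undone on the original matcher, which only ever sees tokens that survived verification.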

When the draft model uses a different vocabulary from the target model (e.g. Eagle 3 with a compressed draft vocabulary), the target-vocab bitmask produced by xgrammar is translated to a draft-vocab bitmask via an efficient scatter-add kernel before being applied to the draft logits.
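
The vocabulary translation can be pictured with a small pure-Python sketch. It assumes a packed bitmask with 32 tokens per word and a set bit meaning "allowed", and it uses a plain loop in place of the scatter-add kernel mentioned above. `translate_bitmask` and `draft_to_target` are hypothetical names chosen for illustration, with `draft_to_target[i]` giving the target-vocab id of draft token `i`.

```python
BITS = 32  # assumed packing: one word covers 32 token ids


def get_bit(packed, token_id):
    """Read the allow-bit of `token_id` from a packed bitmask."""
    return (packed[token_id // BITS] >> (token_id % BITS)) & 1


def translate_bitmask(target_packed, draft_to_target):
    """Build a draft-vocab bitmask: draft token i is allowed iff its target token is."""
    n_words = (len(draft_to_target) + BITS - 1) // BITS
    draft_packed = [0] * n_words
    for draft_id, target_id in enumerate(draft_to_target):
        if get_bit(target_packed, target_id):
            draft_packed[draft_id // BITS] |= 1 << (draft_id % BITS)
    return draft_packed


# Target vocab of 40 tokens; the grammar allows target ids 3 and 35.
target_packed = [1 << 3, 1 << (35 - 32)]

# Draft vocab of 3 tokens mapping to target ids 35, 7 and 3.
draft_to_target = [35, 7, 3]

draft_packed = translate_bitmask(target_packed, draft_to_target)
print(bin(draft_packed[0]))  # 0b101: draft tokens 0 and 2 are allowed
```

The draft logits can then be masked with `draft_packed` exactly as the target logits are masked with `target_packed`.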

### pipeline

```python
from lmdeploy import PytorchEngineConfig, pipeline
from lmdeploy.messages import GenerationConfig, SpeculativeConfig

if __name__ == '__main__':

    model_path = 'deepseek-ai/DeepSeek-V3'
    spec_cfg = SpeculativeConfig(method='deepseek_mtp', num_speculative_tokens=3)
    pipe = pipeline(
        model_path,
        backend_config=PytorchEngineConfig(tp=16, max_batch_size=128),
        speculative_config=spec_cfg,
    )

    schema = {
        'type': 'object',
        'properties': {
            'name': {'type': 'string'},
            'age': {'type': 'integer'},
        },
        'required': ['name', 'age'],
    }
    gen_config = GenerationConfig(
        response_format=dict(type='json_schema', json_schema=dict(name='person', schema=schema)),
        max_new_tokens=256,
    )

    response = pipe(['Introduce yourself as JSON.'], gen_config=gen_config)
    print(response)
```

### api_server

```shell
lmdeploy serve api_server \
deepseek-ai/DeepSeek-V3 \
--backend pytorch \
--server-port 24545 \
--tp 16 \
--speculative-algorithm deepseek_mtp \
--speculative-num-draft-tokens 3 \
--max-batch-size 128
```

The client can then use `response_format` as described in the [structured output](./structed_output.md) documentation:

```python
from openai import OpenAI

if __name__ == '__main__':

    schema = {
        'type': 'object',
        'properties': {
            'name': {'type': 'string'},
            'age': {'type': 'integer'},
        },
        'required': ['name', 'age'],
    }
    response_format = dict(type='json_schema', json_schema=dict(name='person', schema=schema))

    client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:24545/v1')
    model_name = client.models.list().data[0].id
    response = client.chat.completions.create(
        model=model_name,
        messages=[{'role': 'user', 'content': 'Introduce yourself as JSON.'}],
        response_format=response_format,
    )
    print(response)
```
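
Once a reply arrives, a quick sanity check can confirm that it matches the `person` schema used in the examples. The helper below is a hypothetical, hand-written check for that one schema and not part of lmdeploy; it would be applied to the generated text, for instance `response.choices[0].message.content` from the OpenAI client.

```python
import json


def is_valid_person(text):
    """Return True if `text` is a JSON object with a string `name` and integer `age`."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    name, age = obj.get('name'), obj.get('age')
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(name, str) and isinstance(age, int) and not isinstance(age, bool)


print(is_valid_person('{"name": "Ada", "age": 36}'))  # True
print(is_valid_person('{"name": "Ada"}'))             # False
```

For anything beyond a toy schema, a general JSON Schema validator would be a better fit than a hand-written check like this one.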
95 changes: 93 additions & 2 deletions docs/zh_cn/advance/spec_decoding.md
@@ -2,8 +2,9 @@

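Speculative decoding is an optimization technique that introduces a lightweight draft model to predict several upcoming tokens; the main model then verifies them during its forward pass and keeps the longest matching token sequence. Compared with standard autoregressive decoding, this lets the system generate multiple tokens at once.

> \[!NOTE\]
> Note that this is an experimental feature in lmdeploy.
:::{note}
Note that this is an experimental feature in lmdeploy.
:::

## Examples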

@@ -103,3 +104,93 @@ deepseek-ai/DeepSeek-V3 \
--max-batch-size 128 \
--enable-metrics
```

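## Speculative Decoding with Structured Output

Speculative decoding (MTP) can be combined with [structured output](./structed_output.md) so that the tokens proposed by the draft model also follow the grammar constraints (e.g. JSON Schema, regular expressions), which significantly improves the acceptance rate.

:::{note}
This feature is supported for speculative methods that inherit from `DeepseekMTP`, including `deepseek_mtp`, `qwen3_5_mtp`, and `eagle3`. Only the PyTorch backend is supported.
:::

### How it works

The grammar mask is applied at two stages:

1. **Draft model**: forked grammar matchers mask each draft position serially. Each position's mask depends on the token accepted at the previous position, so the draft model proposes grammatically valid tokens.
2. **Main model verification**: position-serial grammar masking is applied to the main model's logits. After rejection sampling, only the accepted tokens are fed back to the original (un-forked) grammar matchers, keeping them in the correct state for the next step.

When the draft model uses a different vocabulary from the main model (e.g. Eagle 3 with a compressed draft vocabulary), the target-vocab bitmask produced by xgrammar is translated to a draft-vocab bitmask via an efficient scatter-add kernel before being applied to the draft logits.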

### pipeline

```python
from lmdeploy import PytorchEngineConfig, pipeline
from lmdeploy.messages import GenerationConfig, SpeculativeConfig

if __name__ == '__main__':

    model_path = 'deepseek-ai/DeepSeek-V3'
    spec_cfg = SpeculativeConfig(method='deepseek_mtp', num_speculative_tokens=3)
    pipe = pipeline(
        model_path,
        backend_config=PytorchEngineConfig(tp=16, max_batch_size=128),
        speculative_config=spec_cfg,
    )

    schema = {
        'type': 'object',
        'properties': {
            'name': {'type': 'string'},
            'age': {'type': 'integer'},
        },
        'required': ['name', 'age'],
    }
    gen_config = GenerationConfig(
        response_format=dict(type='json_schema', json_schema=dict(name='person', schema=schema)),
        max_new_tokens=256,
    )

    response = pipe(['Introduce yourself as JSON.'], gen_config=gen_config)
    print(response)
```

### api_server

```shell
lmdeploy serve api_server \
deepseek-ai/DeepSeek-V3 \
--backend pytorch \
--server-port 24545 \
--tp 16 \
--speculative-algorithm deepseek_mtp \
--speculative-num-draft-tokens 3 \
--max-batch-size 128
```

The client can then use `response_format` as described in the [structured output](./structed_output.md) documentation:

```python
from openai import OpenAI

if __name__ == '__main__':

    schema = {
        'type': 'object',
        'properties': {
            'name': {'type': 'string'},
            'age': {'type': 'integer'},
        },
        'required': ['name', 'age'],
    }
    response_format = dict(type='json_schema', json_schema=dict(name='person', schema=schema))

    client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:24545/v1')
    model_name = client.models.list().data[0].id
    response = client.chat.completions.create(
        model=model_name,
        messages=[{'role': 'user', 'content': 'Introduce yourself as JSON.'}],
        response_format=response_format,
    )
    print(response)
```
5 changes: 3 additions & 2 deletions lmdeploy/pytorch/engine/guided_process.py
@@ -11,7 +11,6 @@


 class GuidedDecodingManager:
-    processors = {}

     def __init__(self, tokenizer: PreTrainedTokenizerBase, vocab_size: int | None):
         if vocab_size is None:
@@ -20,6 +19,7 @@ def __init__(self, tokenizer: PreTrainedTokenizerBase, vocab_size: int | None):
         tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer, vocab_size=vocab_size)
         self.compiler = xgr.GrammarCompiler(tokenizer_info)
         self.vocab_size = vocab_size
+        self.processors: dict[int, dict[int, xgr.GrammarMatcher]] = {}

     def get_processors(self, session_ctx: list[dict[str, Any]],
                        response_formats: tuple[dict]) -> dict[int, xgr.GrammarMatcher]:
@@ -32,7 +32,8 @@ def get_processors(self, session_ctx: list[dict[str, Any]],
                 if isinstance(schema, dict):
                     for key in ['json_schema', 'schema']:
                         if key in schema:
-                            schema = json.dumps(schema[key], ensure_ascii=False)
+                            val = schema[key]
+                            schema = val if isinstance(val, str) else json.dumps(val, ensure_ascii=False)

                 if not isinstance(schema, str):
                     raise ValueError(f'Cannot parse schema {schema}. The schema must be '
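
Reviewer-style note on the `json.dumps` change in this file: the sketch below shows what appears to be the motivation, namely that serializing a schema already supplied as JSON text would encode it a second time. `to_schema_str` is a hypothetical helper that mirrors only the changed expression, not the surrounding loop.

```python
import json


def to_schema_str(val):
    """Pass JSON text through unchanged; serialize anything else."""
    return val if isinstance(val, str) else json.dumps(val, ensure_ascii=False)


schema_text = '{"type": "object"}'

# Serializing a string again wraps it in quotes, so it no longer parses to an object.
double_encoded = json.dumps(schema_text, ensure_ascii=False)
print(type(json.loads(double_encoded)).__name__)              # str
print(type(json.loads(to_schema_str(schema_text))).__name__)  # dict
print(to_schema_str({'type': 'object'}))                      # {"type": "object"}
```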
5 changes: 5 additions & 0 deletions lmdeploy/pytorch/engine/model_agent/agent.py
@@ -307,6 +307,11 @@ def __init__(
                                            self.agent_strategy,
                                            misc_config=misc_config,
                                            device=device)
+        if self.spec_agent.is_enabled():
+            from lmdeploy.pytorch.spec_decode.guided_spec_helper import GuidedSpecHelper
+            helper = GuidedSpecHelper(self.guided_decoding_manager)
+            self.spec_agent.guided_helper = helper
+            self.spec_agent.proposer.guided_helper = helper
         # sleep wakeup state
         self.state: SleepWakeupState = SleepWakeupState()
