InternLM · windreamer · Apr 27, 2026 · Apr 28, 2026 · Apr 28, 2026 · Apr 28, 2026
diff --git a/docs/en/advance/spec_decoding.md b/docs/en/advance/spec_decoding.md
@@ -2,8 +2,9 @@
 
 Speculative decoding is an optimization technique that introduces a lightweight draft model to propose multiple next tokens; the main model then verifies them in a single forward pass and keeps the longest matching prefix. Compared with standard auto-regressive decoding, this method lets the system generate multiple tokens at once.
 
-> \[!NOTE\]
-> This is an experimental feature in lmdeploy.
+:::{note}
+This is an experimental feature in lmdeploy.
+:::
 
 ## Examples
 
@@ -104,3 +105,93 @@ deepseek-ai/DeepSeek-V3 \
 --max-batch-size 128 \
 --enable-metrics
 ```
+
+## Guided Decoding with Speculative Decoding
+
+Speculative decoding (MTP) can be combined with [structured output](./structed_output.md) so that the draft tokens proposed by the spec model also respect the grammar constraints (e.g. JSON schema, regex). This significantly improves the acceptance rate compared to running spec decoding without grammar masks.
+
+:::{note}
+This feature is supported for spec methods that inherit from `DeepseekMTP`, including `deepseek_mtp`, `qwen3_5_mtp`, and `eagle3`. Only the PyTorch backend is supported.
+:::
+
+### How it works
+
+The grammar mask is applied at two stages:
+
+1. **Draft model** — forked grammar matchers are used to mask each draft position serially. Each position's mask depends on the token accepted at the previous position, ensuring the draft model proposes grammatically valid tokens.
+2. **Target model verification** — position-serial grammar masking is applied to the target model's logits. After rejection sampling, only the accepted tokens are fed back to the original (un-forked) grammar matchers, keeping them in sync for the next step.
+
+When the draft model uses a different vocabulary from the target model (e.g. Eagle 3 with a compressed draft vocabulary), the target-vocab bitmask produced by xgrammar is translated to a draft-vocab bitmask via an efficient scatter-add kernel before being applied to the draft logits.
+
+### pipeline
+
+```python
+from lmdeploy import PytorchEngineConfig, pipeline
+from lmdeploy.messages import GenerationConfig, SpeculativeConfig
+
+if __name__ == '__main__':
+
+  model_path = 'deepseek-ai/DeepSeek-V3'
+  spec_cfg = SpeculativeConfig(method='deepseek_mtp', num_speculative_tokens=3)
+  pipe = pipeline(
+      model_path,
+      backend_config=PytorchEngineConfig(tp=16, max_batch_size=128),
+      speculative_config=spec_cfg,
+  )
+
+  schema = {
+      'type': 'object',
+      'properties': {
+          'name': {'type': 'string'},
+          'age': {'type': 'integer'},
+      },
+      'required': ['name', 'age'],
+  }
+  gen_config = GenerationConfig(
+      response_format=dict(type='json_schema', json_schema=dict(name='person', schema=schema)),
+      max_new_tokens=256,
+  )
+
+  response = pipe(['Introduce yourself as JSON.'], gen_config=gen_config)
+  print(response)
+```
+
+### api_server
+
+```shell
+lmdeploy serve api_server \
+deepseek-ai/DeepSeek-V3 \
+--backend pytorch \
+--server-port 24545 \
+--tp 16 \
+--speculative-algorithm deepseek_mtp \
+--speculative-num-draft-tokens 3 \
+--max-batch-size 128
+```
+
+The client can then use `response_format` as described in the [structured output](./structed_output.md) documentation:
+
+```python
+from openai import OpenAI
+
+if __name__ == '__main__':
+
+  schema = {
+      'type': 'object',
+      'properties': {
+          'name': {'type': 'string'},
+          'age': {'type': 'integer'},
+      },
+      'required': ['name', 'age'],
+  }
+  response_format = dict(type='json_schema', json_schema=dict(name='person', schema=schema))
+
+  client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:24545/v1')
+  model_name = client.models.list().data[0].id
+  response = client.chat.completions.create(
+      model=model_name,
+      messages=[{'role': 'user', 'content': 'Introduce yourself as JSON.'}],
+      response_format=response_format,
+  )
+  print(response)
+```
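The two-stage masking described under "How it works" can be illustrated with a toy model. The sketch below is not lmdeploy's implementation: `ToyMatcher`, `draft`, and `verify` are hypothetical names, the "grammar" simply forbids repeating the previous token, and a greedy comparison stands in for rejection sampling (the bonus token a real engine would emit after a rejection is omitted). It only shows why forks are used and why the original matcher advances by the accepted tokens alone.

```python
class ToyMatcher:
    """Hypothetical stand-in for a grammar matcher: a token may not repeat the previous one."""

    def __init__(self, vocab_size, last=None):
        self.vocab_size = vocab_size
        self.last = last

    def fork(self):
        # Independent copy, so speculation never mutates the original state.
        return ToyMatcher(self.vocab_size, self.last)

    def allowed(self):
        return [t for t in range(self.vocab_size) if t != self.last]

    def accept(self, token):
        if token not in self.allowed():
            raise ValueError(f'token {token} violates the toy grammar')
        self.last = token


def masked_argmax(logits, allowed):
    # Greedy choice restricted to grammatically allowed token ids.
    return max(allowed, key=lambda t: logits[t])


def draft(matcher, draft_logits):
    """Stage 1: mask each draft position serially on a forked matcher."""
    fork = matcher.fork()
    tokens = []
    for logits in draft_logits:
        token = masked_argmax(logits, fork.allowed())
        fork.accept(token)  # the next position's mask depends on this token
        tokens.append(token)
    return tokens


def verify(matcher, draft_tokens, target_logits):
    """Stage 2: mask target logits serially, then sync the original matcher."""
    fork = matcher.fork()
    accepted = []
    for token, logits in zip(draft_tokens, target_logits):
        if masked_argmax(logits, fork.allowed()) != token:
            break  # first mismatch ends the accepted prefix
        fork.accept(token)
        accepted.append(token)
    for token in accepted:
        matcher.accept(token)  # only accepted tokens reach the un-forked matcher
    return accepted


matcher = ToyMatcher(vocab_size=3)
draft_tokens = draft(matcher, [[3.0, 2.0, 1.0]] * 3)
accepted = verify(matcher, draft_tokens,
                  [[3.0, 2.0, 1.0], [1.0, 3.0, 2.0], [1.0, 2.0, 3.0]])
print(draft_tokens, accepted, matcher.last)
```

In this run the draft proposes three tokens, the target agrees with the first two, and the original matcher ends up in the state reached after those two tokens.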
diff --git a/docs/zh_cn/advance/spec_decoding.md b/docs/zh_cn/advance/spec_decoding.md
@@ -2,8 +2,9 @@
 
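 Speculative decoding is an optimization technique that introduces a lightweight draft model to predict multiple subsequent tokens; the main model then verifies them during a forward pass and selects the longest matching token sequence. Compared with standard auto-regressive decoding, this approach lets the system generate multiple tokens at once.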
 
-> \[!NOTE\]
-> Please note that this is an experimental feature in lmdeploy.
+:::{note}
+Please note that this is an experimental feature in lmdeploy.
+:::
 
 ## Examples
 
@@ -103,3 +104,93 @@ deepseek-ai/DeepSeek-V3 \
 --max-batch-size 128 \
 --enable-metrics
 ```
+
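+## Speculative Decoding with Structured Output
+
+Speculative decoding (MTP) can be combined with [structured output](./structed_output.md) so that the tokens proposed by the draft model also follow the grammar constraints (e.g. JSON Schema, regular expressions), which significantly improves the acceptance rate.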
+
+:::{note}
+This feature supports speculative methods that inherit from `DeepseekMTP`, including `deepseek_mtp`, `qwen3_5_mtp`, and `eagle3`. Only the PyTorch backend is supported.
+:::
+
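+### How it works
+
+The grammar mask is applied at two separate stages:
+
+1. **Draft model**: forked grammar matchers apply the mask serially, one position at a time. Each position's mask depends on the token accepted at the previous position, which ensures that the draft model proposes grammatically valid tokens.
+2. **Target model verification**: the target model's logits are grammar-masked serially, position by position. After rejection sampling, only the accepted tokens are fed back to the original (un-forked) grammar matchers, so that they hold the correct state for the next step.
+
+When the draft model uses a different vocabulary from the target model (e.g. Eagle 3 with a compressed draft vocabulary), the target-vocab bitmask generated by xgrammar is converted to a draft-vocab bitmask by an efficient scatter-add kernel before it is applied to the draft logits.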
+
+### pipeline
+
+```python
+from lmdeploy import PytorchEngineConfig, pipeline
+from lmdeploy.messages import GenerationConfig, SpeculativeConfig
+
+if __name__ == '__main__':
+
+  model_path = 'deepseek-ai/DeepSeek-V3'
+  spec_cfg = SpeculativeConfig(method='deepseek_mtp', num_speculative_tokens=3)
+  pipe = pipeline(
+      model_path,
+      backend_config=PytorchEngineConfig(tp=16, max_batch_size=128),
+      speculative_config=spec_cfg,
+  )
+
+  schema = {
+      'type': 'object',
+      'properties': {
+          'name': {'type': 'string'},
+          'age': {'type': 'integer'},
+      },
+      'required': ['name', 'age'],
+  }
+  gen_config = GenerationConfig(
+      response_format=dict(type='json_schema', json_schema=dict(name='person', schema=schema)),
+      max_new_tokens=256,
+  )
+
+  response = pipe(['请用 JSON 格式做自我介绍。'], gen_config=gen_config)
+  print(response)
+```
+
+### api_server
+
+```shell
+lmdeploy serve api_server \
+deepseek-ai/DeepSeek-V3 \
+--backend pytorch \
+--server-port 24545 \
+--tp 16 \
+--speculative-algorithm deepseek_mtp \
+--speculative-num-draft-tokens 3 \
+--max-batch-size 128
+```
+
+The client can then use `response_format` as described in the [structured output](./structed_output.md) documentation:
+
+```python
+from openai import OpenAI
+
+if __name__ == '__main__':
+
+  schema = {
+      'type': 'object',
+      'properties': {
+          'name': {'type': 'string'},
+          'age': {'type': 'integer'},
+      },
+      'required': ['name', 'age'],
+  }
+  response_format = dict(type='json_schema', json_schema=dict(name='person', schema=schema))
+
+  client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:24545/v1')
+  model_name = client.models.list().data[0].id
+  response = client.chat.completions.create(
+      model=model_name,
+      messages=[{'role': 'user', 'content': '请用 JSON 格式做自我介绍。'}],
+      response_format=response_format,
+  )
+  print(response)
+```
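Both versions of the doc mention translating a target-vocab bitmask into a draft-vocab bitmask with a scatter-add kernel. The pure-Python sketch below shows the idea only. It assumes the bitmask packs 32 tokens per word, with bit `j` of word `i` standing for token `32 * i + j` and a set bit meaning "allowed"; `draft_to_target` is a hypothetical list mapping each draft token id to its target token id. The real kernel, its tensor layout, and the way the draft model stores its mapping are not shown in this diff.

```python
def pack_bitmask(allowed_ids, vocab_size):
    """Pack allowed token ids into 32-bit words (assumed layout, see lead-in)."""
    words = [0] * ((vocab_size + 31) // 32)
    for token in allowed_ids:
        words[token // 32] |= 1 << (token % 32)
    return words


def translate_bitmask(target_words, draft_to_target, draft_vocab_size):
    """Build a draft-vocab bitmask by scatter-adding one bit per draft token."""
    draft_words = [0] * ((draft_vocab_size + 31) // 32)
    for draft_id, target_id in enumerate(draft_to_target):
        # Gather the target token's bit ...
        bit = (target_words[target_id // 32] >> (target_id % 32)) & 1
        # ... and scatter it to the draft token's slot. Each (word, bit position)
        # pair is written exactly once, so addition behaves like bitwise OR.
        draft_words[draft_id // 32] += bit << (draft_id % 32)
    return draft_words


target_vocab_size = 100
draft_to_target = [5, 40, 41, 99]  # hypothetical compressed draft vocabulary
target_mask = pack_bitmask({5, 41, 70}, target_vocab_size)
draft_mask = translate_bitmask(target_mask, draft_to_target, len(draft_to_target))
allowed_draft_ids = [d for d in range(len(draft_to_target))
                     if (draft_mask[d // 32] >> (d % 32)) & 1]
print(draft_mask, allowed_draft_ids)
```

Target token 70 is allowed by the grammar but has no draft counterpart, so it simply drops out of the draft mask.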
diff --git a/lmdeploy/pytorch/engine/guided_process.py b/lmdeploy/pytorch/engine/guided_process.py
@@ -11,7 +11,6 @@
 
 
 class GuidedDecodingManager:
-    processors = {}
 
     def __init__(self, tokenizer: PreTrainedTokenizerBase, vocab_size: int | None):
         if vocab_size is None:
@@ -20,6 +19,7 @@ def __init__(self, tokenizer: PreTrainedTokenizerBase, vocab_size: int | None):
         tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer, vocab_size=vocab_size)
         self.compiler = xgr.GrammarCompiler(tokenizer_info)
         self.vocab_size = vocab_size
+        self.processors: dict[int, dict[int, xgr.GrammarMatcher]] = {}
 
     def get_processors(self, session_ctx: list[dict[str, Any]],
                        response_formats: tuple[dict]) -> dict[int, xgr.GrammarMatcher]:
@@ -32,7 +32,8 @@ def get_processors(self, session_ctx: list[dict[str, Any]],
                     if isinstance(schema, dict):
                         for key in ['json_schema', 'schema']:
                             if key in schema:
-                                schema = json.dumps(schema[key], ensure_ascii=False)
+                                val = schema[key]
+                                schema = val if isinstance(val, str) else json.dumps(val, ensure_ascii=False)
 
                     if not isinstance(schema, str):
                         raise ValueError(f'Cannot parse schema {schema}. The schema must be '
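The last hunk above lets the schema value be either a ready-made JSON string or a Python object that still has to be serialized. A standalone sketch of that behavior follows. `normalize_schema` is a hypothetical helper, not a function in lmdeploy, and the `break` after the first matching key is an assumption of this sketch rather than something the diff shows.

```python
import json
from typing import Any


def normalize_schema(schema: Any) -> str:
    """Return a JSON-schema string from a wrapper dict or pass a string through."""
    if isinstance(schema, dict):
        for key in ('json_schema', 'schema'):
            if key in schema:
                val = schema[key]
                # A string is used as-is; anything else is serialized.
                schema = val if isinstance(val, str) else json.dumps(val, ensure_ascii=False)
                break  # assumption of this sketch: stop at the first matching key
    if not isinstance(schema, str):
        raise ValueError(f'Cannot parse schema {schema!r}')
    return schema


from_dict = normalize_schema({'name': 'person', 'schema': {'type': 'object'}})
from_str = normalize_schema({'name': 'person', 'schema': '{"type": "object"}'})
print(from_dict == from_str)
```

With the pre-change code, the string case would have been passed through `json.dumps` a second time, producing a quoted JSON string instead of a schema.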

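The `guided_process.py` diff also moves `processors` from a class attribute to an instance attribute. The diff does not state the motivation; one plausible reason is that a mutable class-level dict is shared by every instance of the class. The sketch below uses two hypothetical classes to show the difference; neither is lmdeploy code.

```python
class SharedManager:
    processors = {}  # class attribute: one dict shared by all instances


class IsolatedManager:
    def __init__(self):
        self.processors = {}  # instance attribute: one dict per instance


a, b = SharedManager(), SharedManager()
a.processors[0] = 'matcher'
shared_leak = 0 in b.processors  # b sees the entry written through a

c, d = IsolatedManager(), IsolatedManager()
c.processors[0] = 'matcher'
isolated_leak = 0 in d.processors  # d has its own, still empty dict

print(shared_leak, isolated_leak)
```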
diff --git a/lmdeploy/pytorch/engine/model_agent/agent.py b/lmdeploy/pytorch/engine/model_agent/agent.py
@@ -307,6 +307,11 @@ def __init__(
                                            self.agent_strategy,
                                            misc_config=misc_config,
                                            device=device)
+        if self.spec_agent.is_enabled():
+            from lmdeploy.pytorch.spec_decode.guided_spec_helper import GuidedSpecHelper
+            helper = GuidedSpecHelper(self.guided_decoding_manager)
+            self.spec_agent.guided_helper = helper
+            self.spec_agent.proposer.guided_helper = helper
         # sleep wakeup state
         self.state: SleepWakeupState = SleepWakeupState()