Commit 81f06aa: Merge remote-tracking branch 'upstream/develop' into abort_requests_fix
2 parents: 250eab6 + e327673 · 221 files changed: +19830 −3687 lines

New file (297 additions, 0 deletions):
---
name: fastdeploy-llm-integration
description: >
  Guides you through adding inference deployment support for a new open-source LLM to the FastDeploy repository.
  Given a model path (local or HuggingFace/ModelScope hub), this skill walks through analyzing the model architecture,
  choosing the right base class, generating the model implementation file, updating registries, writing docs, and
  producing a deployment test script.

  Use this skill whenever the user wants to: add a new model to FastDeploy, integrate an open-source LLM for inference,
  support a new model architecture in PaddlePaddle's inference framework, port a HuggingFace model to FastDeploy,
  or asks "如何在FastDeploy中支持XX模型" / "帮我给FastDeploy新增XX模型支持".

  IMPORTANT: Always use this skill when the user mentions FastDeploy and a model name/path together, even if they
  just ask "how do I add X to FastDeploy" — this skill has all the patterns and templates needed.
---
# FastDeploy LLM Integration Skill

Your task: given the path to an open-source LLM, implement complete inference deployment support for that model in the FastDeploy repository, including the model implementation code, documentation, and a test script.

---

## Workflow Overview

```
Step 1: Analyze the model architecture
Step 2: Choose an inheritance strategy (reuse vs. build new)
Step 3: Generate the model implementation file
Step 4: Update registration and configuration
Step 5: Write documentation
Step 6: Generate a deployment test script
```

---
## Step 1: Analyze the Model Architecture

First, read the model's `config.json`:

```bash
cat /path/to/model/config.json
# Or fetch it from HuggingFace:
curl https://huggingface.co/<org>/<model>/raw/main/config.json
```

**Key fields to extract:**

| Field | Purpose |
|-------|---------|
| `architectures` | Architecture name used for registration, e.g. `["Qwen2ForCausalLM"]` |
| `model_type` | Branch condition for attention/MLP path selection |
| `hidden_size` | Model width |
| `num_hidden_layers` | Number of layers |
| `num_attention_heads` | Number of attention heads |
| `num_key_value_heads` | Number of KV heads (if < `num_attention_heads`, the model uses GQA) |
| `intermediate_size` | FFN intermediate size |
| `num_experts` / `num_routed_experts` | Number of MoE experts (present only for MoE models) |
| `rope_theta` / `rope_scaling` | Positional-encoding configuration |
| `attention_bias` | Whether attention layers have a bias |
| `qk_norm` | Whether QK normalization is used (a GLM4.5+ feature) |
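The checklist above can be condensed into a short helper. This is an illustrative sketch, not FastDeploy code; the `summarize_config` helper and the sample values are hypothetical, and the field names follow the HuggingFace convention used in the table.

```python
def summarize_config(config: dict) -> dict:
    """Summarize the config.json fields relevant to choosing a base class."""
    n_heads = config.get("num_attention_heads")
    # HuggingFace convention: missing num_key_value_heads means MHA (kv == heads)
    n_kv = config.get("num_key_value_heads", n_heads)
    return {
        "architecture": config["architectures"][0],
        "model_type": config.get("model_type"),
        "uses_gqa": n_heads is not None and n_kv is not None and n_kv < n_heads,
        "is_moe": any(k in config for k in ("num_experts", "num_routed_experts")),
        "has_qk_norm": bool(config.get("qk_norm", False)),
    }

# Example: a Qwen2-7B-style config (values illustrative)
sample = {
    "architectures": ["Qwen2ForCausalLM"],
    "model_type": "qwen2",
    "hidden_size": 3584,
    "num_hidden_layers": 28,
    "num_attention_heads": 28,
    "num_key_value_heads": 4,
    "intermediate_size": 18944,
}
print(summarize_config(sample))
```

For a real model, load the dict with `json.load(open(".../config.json"))` and feed it to the same helper.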
---
## Step 2: Choose an Inheritance Strategy

Based on the analysis, pick the best strategy with the following decision tree:

```
config.json analysis
│
├── Closely matches the DeepSeekV3 architecture (MLA/DSA attention + MoE)?
│   └── YES → inherit from DeepseekV3ForCausalLM or DeepseekV32ForCausalLM
│             Reference: glm_moe_dsa.py (PR #6863)
│
├── Similar to GLM4 MoE (standard MHA + MoE + QK Norm)?
│   └── YES → inherit from Glm4MoeForCausalLM, or implement from scratch using glm4_moe.py as a reference
│
├── Similar to the Qwen2/3 architecture (GQA + RoPE + SwiGLU)?
│   └── YES → inherit from Qwen2ForCausalLM / Qwen3ForCausalLM
│             Reference: qwen3.py
│
└── Entirely new architecture → start from the ModelForCasualLM base class
              Reference: qwen2.py (the most complete reference implementation)
```

**Benefits of inheritance:**
- Reduces code volume by 80%+
- Tensor parallelism and weight sharding are inherited automatically
- Only the differing parts need to be overridden (e.g. a different attention implementation or different MoE routing)
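The decision tree above can be sketched as a small classifier. This is a hypothetical helper with deliberately simplified matching rules, not FastDeploy's actual dispatch logic; only the returned class names come from this guide.

```python
def choose_base_class(config: dict) -> str:
    """Map a config.json dict to a suggested FastDeploy base class (heuristic)."""
    is_moe = any(k in config for k in ("num_experts", "num_routed_experts"))
    # DeepSeekV3-like: MLA/DSA attention + MoE (approximated here via model_type)
    if config.get("model_type", "").startswith("deepseek") and is_moe:
        return "DeepseekV3ForCausalLM"
    # GLM4-MoE-like: standard MHA + MoE + QK Norm
    if is_moe and config.get("qk_norm"):
        return "Glm4MoeForCausalLM"
    # Qwen2/3-like: GQA (fewer KV heads than attention heads)
    n_heads = config.get("num_attention_heads", 0)
    if 0 < config.get("num_key_value_heads", n_heads) < n_heads:
        return "Qwen2ForCausalLM"
    # Anything else: start from the base class
    return "ModelForCasualLM"

print(choose_base_class({"model_type": "qwen2",
                         "num_attention_heads": 28,
                         "num_key_value_heads": 4}))
```

Treat the output as a starting suggestion; always confirm against the reference files named in the tree.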
---
## Step 3: Generate the Model Implementation File

File path: `fastdeploy/model_executor/models/<model_name>.py`

See `references/model_templates.md` for the complete code templates. Pick the template matching the inheritance strategy chosen in Step 2:

### Template A: Inherit from an Existing Model (recommended; fits ~90% of cases)

```python
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License")
# ... (standard Apache 2.0 header)

"""<ModelName> model implementation for FastDeploy."""

from __future__ import annotations

from fastdeploy.model_executor.models.model_base import ModelCategory, ModelRegistry

# Inherit from the most similar existing model
from fastdeploy.model_executor.models.<base_model> import <BaseForCausalLM>


@ModelRegistry.register_model_class(
    architecture="<NewModelArchitecture>ForCausalLM",  # must exactly match architectures[0] in config.json
    module_name="<model_name>",
    category=ModelCategory.TEXT_GENERATION,
)
class <NewModel>ForCausalLM(<BaseForCausalLM>):
    """<NewModel> causal language model.

    Reuses <BaseModel> infrastructure with <description of differences>.
    """

    @classmethod
    def name(cls) -> str:
        return "<NewModelArchitecture>ForCausalLM"

    # Override only the parts that differ from the base model,
    # e.g. a different attention type, a different MLP structure, or extra normalization.
```

### Template B: Full New Model Implementation

See `references/model_templates.md` for the complete code structure, which includes:
- Standard MLP layer (SwiGLU / GeGLU / ReLU variants)
- Attention layer (MHA / GQA / MLA)
- DecoderLayer
- Main Model class (with `@support_graph_optimization`)
- ForCausalLM registration class
- PretrainedModel class (tensor parallel configuration)

---
## Step 4: Update Registration and Configuration

### 4a. Verify automatic registration

FastDeploy auto-scans the `models/` directory via `__init__.py`, so **no manual registration is needed**. As long as the file lives under `fastdeploy/model_executor/models/`, the decorator is loaded automatically.

Verification command:
```python
from fastdeploy.model_executor.models import ModelRegistry
print(ModelRegistry.get_supported_archs())
# Your new architecture name should appear in the output
```

### 4b. Update the model_type branches (if needed)

If your new model shares a Python file with an existing model (e.g. glm_moe_dsa reuses deepseek_v3.py), update the `model_type` check in that shared file:

```python
# In deepseek_v3.py or another shared file
if model_type in ["deepseek_v3", "deepseek_v32", "<your_new_model_type>"]:
    self.attn = DeepseekV3MLAAttention(...)
```

### 4c. Update supported_models.md

Add a new row to the table in `docs/supported_models.md`:

```markdown
| <ModelName> | <ModelSize> | BF16 ||| - |
```

---
## Step 5: Write Documentation

Create or update the model's documentation under the `docs/` directory. Use `references/doc_template.md` to generate a standard document covering:

1. Model overview (architectural highlights)
2. Deployment commands (a minimal runnable example)
3. Performance numbers (if benchmarks exist)
4. Caveats (quantization compatibility, TP limitations, etc.)

---
## Step 6: Generate a Deployment Test Script

Generate two kinds of test scripts:

### Quick sanity-check script (for local debugging)

```python
# test_<model_name>_inference.py
"""Quick sanity check for <ModelName> integration in FastDeploy."""
import subprocess

MODEL_PATH = "<model_path>"  # path supplied by the user

def test_model_loads():
    """Test that the model architecture is correctly registered."""
    from fastdeploy.model_executor.models import ModelRegistry
    archs = ModelRegistry.get_supported_archs()
    assert "<NewModelArchitecture>ForCausalLM" in archs, \
        f"Model not registered! Available: {archs}"
    print("✅ Model registration: PASS")

def test_basic_inference():
    """Run a simple single-GPU inference test."""
    result = subprocess.run([
        "python", "-m", "fastdeploy.entrypoints.openai.api_server",
        "--model", MODEL_PATH,
        "--max-model-len", "1024",
        "--tensor-parallel-size", "1",
        # Add --dry-run or a short test here if supported
    ], capture_output=True, text=True, timeout=120)
    print(result.stdout[-2000:])  # last 2000 chars
    print("✅ Server startup: PASS" if result.returncode == 0 else "❌ FAIL")

if __name__ == "__main__":
    test_model_loads()
    test_basic_inference()
```

### Full deployment commands (for production)

```bash
# Single GPU
python -m fastdeploy.entrypoints.openai.api_server \
    --model <model_path> \
    --tensor-parallel-size 1 \
    --max-model-len 32768

# Multi-GPU (8-way TP)
python -m fastdeploy.entrypoints.openai.api_server \
    --model <model_path> \
    --tensor-parallel-size 8 \
    --max-model-len 131072

# MoE with expert parallelism
python -m fastdeploy.entrypoints.openai.api_server \
    --model <model_path> \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 1 \
    --max-model-len 131072

# curl test
curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "<model_name>", "prompt": "Hello, I am", "max_tokens": 50}'
```

---
## References

- **Code templates**: read `references/model_templates.md` for the complete code boilerplate
- **Architecture decision tree**: `references/architecture_guide.md` — a more detailed guide to architecture selection
- **Reference PRs**:
  - [PR #6863](https://github.com/PaddlePaddle/FastDeploy/pull/6863) — GLM-MoE-DSA (inherits DeepSeekV3; the simplest inheritance example)
  - [PR #7139](https://github.com/PaddlePaddle/FastDeploy/pull/7139) — GLM4.7 Flash (ForwardMeta parameterization pattern)
  - [PR #6689](https://github.com/PaddlePaddle/FastDeploy/pull/6689) — DeepSeek-v3.2 (custom CUDA kernel integration)

---
## Deliverables Checklist

When finished, provide the user with the following files:

1. `fastdeploy/model_executor/models/<model_name>.py` — the model implementation
2. `docs/<model_name>_deployment.md` — deployment documentation
3. `test_<model_name>_inference.py` — test script
4. (if needed) A note on changes: the new model_type branch added to `deepseek_v3.py` or another shared file

---
## Common Pitfalls

**Pitfall 1: architecture name mismatch**
The string in `@ModelRegistry.register_model_class(architecture=...)` must exactly match `architectures[0]` in the model's `config.json`; the comparison is case-sensitive.
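One way to catch this early is to compare the registered string against the model's `config.json` before launching. `check_architecture` below is a hypothetical pre-flight helper, not a FastDeploy API:

```python
import json

def check_architecture(config_path: str, registered: str) -> None:
    """Fail fast if the registered architecture string does not match config.json."""
    with open(config_path) as f:
        expected = json.load(f)["architectures"][0]
    # Case-sensitive, exact comparison, mirroring the registry lookup
    assert registered == expected, (
        f"Registered architecture {registered!r} != config.json {expected!r}"
    )
```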
**Pitfall 2: forgetting the model_type branch**
If your model inherits from DeepSeekV3 but uses a different attention type, you must add a `model_type` check in the parent class; otherwise the wrong attention path will be taken.

**Pitfall 3: tensor parallelism configuration**
`num_key_value_heads` must be divisible by `tensor_parallel_size`; otherwise head padding is required (see the padding logic in PR #7139).
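The padding idea can be sketched in a few lines: when the KV head count does not divide evenly across TP ranks, round it up to the next multiple of `tp_size`. `padded_kv_heads` is a hypothetical illustration; the real logic lives in PR #7139.

```python
def padded_kv_heads(num_kv_heads: int, tp_size: int) -> int:
    """Round the KV head count up to a multiple of the TP degree."""
    if num_kv_heads % tp_size == 0:
        return num_kv_heads  # already evenly divisible, no padding needed
    return ((num_kv_heads // tp_size) + 1) * tp_size

print(padded_kv_heads(4, 8))  # 4 KV heads on 8-way TP -> padded to 8
```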
**Pitfall 4: MoE expert weight format**
Expert weights in MoE models must be mapped with `FusedMoE.make_expert_params_mapping()`; the standard `stacked_params_mapping` cannot be used directly.

**Pitfall 5: unregistered PretrainedModel**
If you do not create a `PretrainedModel` subclass, the tensor parallelism mapping will be missing, and multi-GPU inference may shard the weights incorrectly.
