---
name: fastdeploy-llm-integration
description: >
  Guides you through adding inference deployment support for a new open-source LLM to the FastDeploy repository.
  Given a model path (local or HuggingFace/ModelScope hub), this skill walks through analyzing the model architecture,
  choosing the right base class, generating the model implementation file, updating registries, writing docs, and
  producing a deployment test script.

  Use this skill whenever the user wants to: add a new model to FastDeploy, integrate an open-source LLM for inference,
  support a new model architecture in PaddlePaddle's inference framework, port a HuggingFace model to FastDeploy,
  or asks "如何在FastDeploy中支持XX模型" / "帮我给FastDeploy新增XX模型支持".

  IMPORTANT: Always use this skill when the user mentions FastDeploy and a model name/path together, even if they
  just ask "how do I add X to FastDeploy" — this skill has all the patterns and templates needed.
---

# FastDeploy LLM Integration Skill

Your task: given the path to an open-source LLM, fully implement inference deployment support for that model in the FastDeploy repository, including the model implementation code, documentation, and a test script.

---

## Workflow Overview

```
Step 1: Analyze the model architecture
Step 2: Choose an inheritance strategy (reuse vs. build new)
Step 3: Generate the model implementation file
Step 4: Update registration and configuration
Step 5: Write documentation
Step 6: Generate a deployment test script
```

---

## Step 1: Analyze the Model Architecture

First, read the model's `config.json`:

```bash
cat /path/to/model/config.json
# Or fetch it from HuggingFace:
curl https://huggingface.co/<org>/<model>/raw/main/config.json
```

**Key fields to extract:**

| Field | Purpose |
|-------|---------|
| `architectures` | Architecture name used for registration, e.g. `["Qwen2ForCausalLM"]` |
| `model_type` | Branch condition for selecting the attention/MLP code path |
| `hidden_size` | Model width |
| `num_hidden_layers` | Number of layers |
| `num_attention_heads` | Number of attention heads |
| `num_key_value_heads` | Number of KV heads (the model uses GQA if this is < `num_attention_heads`) |
| `intermediate_size` | FFN intermediate size |
| `num_experts` / `num_routed_experts` | Number of MoE experts (present only for MoE models) |
| `rope_theta` / `rope_scaling` | Positional-encoding configuration |
| `attention_bias` | Whether attention layers use a bias |
| `qk_norm` | Whether QK normalization is used (a GLM4.5+ feature) |

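As a sketch of this step, the checklist can be scripted. This is a minimal helper assuming the usual HuggingFace field names are present; `summarize_config` is an illustrative name, not a FastDeploy API:

```python
import json

def summarize_config(path: str) -> dict:
    """Extract the integration-relevant fields from a model's config.json."""
    with open(path) as f:
        cfg = json.load(f)
    heads = cfg.get("num_attention_heads")
    kv_heads = cfg.get("num_key_value_heads", heads)  # absent => MHA
    return {
        "architecture": cfg["architectures"][0],
        "model_type": cfg.get("model_type"),
        "num_layers": cfg.get("num_hidden_layers"),
        "is_gqa": heads is not None and kv_heads < heads,
        "is_moe": any(k in cfg for k in ("num_experts", "num_routed_experts", "n_routed_experts")),
        "rope_theta": cfg.get("rope_theta"),
    }
```

The output of this summary feeds directly into the Step 2 decision tree below.
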
---

## Step 2: Choose an Inheritance Strategy

Based on the analysis, pick the best strategy using this decision tree:

```
config.json analysis
 │
 ├── Closely matches the DeepSeekV3 architecture (MLA/DSA attention + MoE)?
 │   └── YES → inherit from DeepseekV3ForCausalLM or DeepseekV32ForCausalLM
 │            Reference: glm_moe_dsa.py (PR #6863)
 │
 ├── Similar to GLM4 MoE (standard MHA + MoE + QK norm)?
 │   └── YES → inherit from Glm4MoeForCausalLM, or implement from scratch with glm4_moe.py as a reference
 │
 ├── Similar to the Qwen2/3 architecture (GQA + RoPE + SwiGLU)?
 │   └── YES → inherit from Qwen2ForCausalLM / Qwen3ForCausalLM
 │            Reference: qwen3.py
 │
 └── Entirely new architecture → start from the ModelForCasualLM base class
      Reference: qwen2.py (the most complete reference implementation)
```

**Benefits of inheritance:**
- Cuts the amount of code by 80%+
- Tensor parallelism and weight sharding come for free
- You only override what differs (e.g. a different attention implementation or MoE routing)

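The decision tree can be encoded as a quick triage helper. This is an illustrative sketch only: the class names are the FastDeploy bases named above, but the matching heuristics (e.g. using `kv_lora_rank` to detect MLA) are assumptions to verify against the actual model config:

```python
def suggest_base_class(cfg: dict) -> str:
    """Map config.json contents to a suggested FastDeploy base class (heuristic)."""
    is_moe = any(k in cfg for k in ("num_experts", "num_routed_experts", "n_routed_experts"))
    has_mla = "kv_lora_rank" in cfg  # low-rank KV projection is the MLA tell
    has_qk_norm = bool(cfg.get("qk_norm") or cfg.get("use_qk_norm"))
    heads = cfg.get("num_attention_heads", 0)
    kv_heads = cfg.get("num_key_value_heads", heads)
    if is_moe and has_mla:
        return "DeepseekV3ForCausalLM"   # MLA/DSA attention + MoE
    if is_moe and has_qk_norm:
        return "Glm4MoeForCausalLM"      # standard MHA + MoE + QK norm
    if kv_heads < heads:
        return "Qwen3ForCausalLM"        # GQA + RoPE + SwiGLU family
    return "ModelForCasualLM"            # new architecture: start from the base class
```

Treat the result as a starting suggestion, then confirm against the model card and the reference implementations listed above.
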
---

## Step 3: Generate the Model Implementation File

File path: `fastdeploy/model_executor/models/<model_name>.py`

See `references/model_templates.md` for the complete code templates, and pick the one matching the inheritance strategy from Step 2:

### Template A: Inherit from an Existing Model (recommended; fits ~90% of cases)

```python
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License")
# ... (standard Apache 2.0 header)

"""<ModelName> model implementation for FastDeploy."""

from __future__ import annotations

from fastdeploy.model_executor.models.model_base import ModelCategory, ModelRegistry

# Inherit from the most similar existing model
from fastdeploy.model_executor.models.<base_model> import <BaseForCausalLM>


@ModelRegistry.register_model_class(
    architecture="<NewModelArchitecture>ForCausalLM",  # must exactly match architectures[0] in config.json
    module_name="<model_name>",
    category=ModelCategory.TEXT_GENERATION,
)
class <NewModel>ForCausalLM(<BaseForCausalLM>):
    """<NewModel> causal language model.

    Reuses <BaseModel> infrastructure with <description of differences>.
    """

    @classmethod
    def name(cls) -> str:
        return "<NewModelArchitecture>ForCausalLM"

    # Override only what differs from the base model,
    # e.g. a different attention type, MLP structure, or extra normalization.
```

### Template B: Implement a New Model from Scratch

See `references/model_templates.md` for the full code structure, which includes:
- A standard MLP layer (SwiGLU / GeGLU / ReLU variants)
- An attention layer (MHA / GQA / MLA)
- A DecoderLayer
- The main Model class (with `@support_graph_optimization`)
- The ForCausalLM registration class
- A PretrainedModel class (tensor-parallel configuration)

---

## Step 4: Update Registration and Configuration

### 4a. Verify Auto-Registration

FastDeploy's `__init__.py` automatically scans the `models/` directory, so **no manual registration is needed**. As long as the file sits under `fastdeploy/model_executor/models/`, the decorator is picked up automatically.

To verify:
```python
from fastdeploy.model_executor.models import ModelRegistry
print(ModelRegistry.get_supported_archs())
# Your new architecture name should appear in the output
```

### 4b. Update model_type Branches (if needed)

If your new model shares a Python file with an existing model (e.g. glm_moe_dsa reuses deepseek_v3.py), update the `model_type` check in that shared file:

```python
# In deepseek_v3.py or another shared file
if model_type in ["deepseek_v3", "deepseek_v32", "<your_new_model_type>"]:
    self.attn = DeepseekV3MLAAttention(...)
```

### 4c. Update supported_models.md

Add a new row to the table in `docs/supported_models.md`:

```markdown
| <ModelName> | <ModelSize> | BF16 | ✅ | ✅ | - |
```

---

## Step 5: Write Documentation

Create or update the model's documentation under `docs/`. Follow `references/doc_template.md` to produce a standard document covering:

1. Model overview (architectural highlights)
2. Deployment commands (a minimal runnable example)
3. Performance numbers (if benchmarks exist)
4. Caveats (quantization compatibility, TP constraints, etc.)

---

## Step 6: Generate Deployment Test Scripts

Generate two kinds of test scripts:

### Quick Sanity-Check Script (for local debugging)

```python
# test_<model_name>_inference.py
"""Quick sanity check for <ModelName> integration in FastDeploy."""
import subprocess

MODEL_PATH = "<model_path>"  # path supplied by the user

def test_model_loads():
    """Test that the model architecture is correctly registered."""
    from fastdeploy.model_executor.models import ModelRegistry
    archs = ModelRegistry.get_supported_archs()
    assert "<NewModelArchitecture>ForCausalLM" in archs, \
        f"Model not registered! Available: {archs}"
    print("✅ Model registration: PASS")

def test_basic_inference():
    """Run a simple single-GPU inference test."""
    result = subprocess.run([
        "python", "-m", "fastdeploy.entrypoints.openai.api_server",
        "--model", MODEL_PATH,
        "--max-model-len", "1024",
        "--tensor-parallel-size", "1",
        # Add --dry-run or a short test here if supported
    ], capture_output=True, text=True, timeout=120)
    print(result.stdout[-2000:])  # last 2000 chars
    print("✅ Server startup: PASS" if result.returncode == 0 else "❌ FAIL")

if __name__ == "__main__":
    test_model_loads()
    test_basic_inference()
```

### Full Deployment Commands (for production)

```bash
# Single GPU
python -m fastdeploy.entrypoints.openai.api_server \
    --model <model_path> \
    --tensor-parallel-size 1 \
    --max-model-len 32768

# Multi-GPU (8-way TP)
python -m fastdeploy.entrypoints.openai.api_server \
    --model <model_path> \
    --tensor-parallel-size 8 \
    --max-model-len 131072

# MoE with expert parallelism
python -m fastdeploy.entrypoints.openai.api_server \
    --model <model_path> \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 1 \
    --max-model-len 131072

# curl smoke test
curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "<model_name>", "prompt": "Hello, I am", "max_tokens": 50}'
```

---

## References

- **Code templates**: read `references/model_templates.md` for the complete code boilerplate
- **Architecture decision tree**: `references/architecture_guide.md` has a more detailed architecture-selection guide
- **Reference PRs**:
  - [PR #6863](https://github.com/PaddlePaddle/FastDeploy/pull/6863): GLM-MoE-DSA (inherits DeepSeekV3; the simplest inheritance example)
  - [PR #7139](https://github.com/PaddlePaddle/FastDeploy/pull/7139): GLM4.7 Flash (ForwardMeta parameterization pattern)
  - [PR #6689](https://github.com/PaddlePaddle/FastDeploy/pull/6689): DeepSeek-v3.2 (custom CUDA kernel integration)

---

## Deliverables Checklist

When finished, provide the user with the following files:

1. `fastdeploy/model_executor/models/<model_name>.py`: the model implementation
2. `docs/<model_name>_deployment.md`: the deployment documentation
3. `test_<model_name>_inference.py`: the test script
4. (If needed) a note describing the new model_type branch added to `deepseek_v3.py` or another shared file

---

## Common Pitfalls

**Pitfall 1: Architecture name mismatch**
The string passed to `@ModelRegistry.register_model_class(architecture=...)` must exactly match `architectures[0]` in the model's `config.json`; the comparison is case-sensitive.

**Pitfall 2: Forgetting the model_type branch**
If your model inherits from DeepSeekV3 but uses a different attention type, add a `model_type` check in the parent class; otherwise the wrong attention path will be taken.

**Pitfall 3: Tensor parallelism configuration**
`num_key_value_heads` must be divisible by `tensor_parallel_size`; otherwise head padding is required (see the padding logic in PR #7139).
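
A quick divisibility check can catch this before launch. The padding formula below is an illustrative sketch of the idea, not FastDeploy's actual implementation (see PR #7139 for that):

```python
def kv_heads_per_rank(num_key_value_heads: int, tensor_parallel_size: int) -> int:
    """KV heads each rank holds, padding up when heads don't divide evenly."""
    if num_key_value_heads % tensor_parallel_size == 0:
        return num_key_value_heads // tensor_parallel_size
    # Pad to the next multiple of tensor_parallel_size (illustrative only)
    padded = -(-num_key_value_heads // tensor_parallel_size) * tensor_parallel_size
    return padded // tensor_parallel_size
```
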

**Pitfall 4: MoE expert weight format**
Expert weights in MoE models must be mapped with `FusedMoE.make_expert_params_mapping()`; the standard `stacked_params_mapping` cannot be used directly.

**Pitfall 5: Missing PretrainedModel registration**
If you don't create a `PretrainedModel` subclass, the tensor parallelism mapping will be missing and multi-GPU inference may shard weights incorrectly.