[Models] add fleet model fallback 2 by xiaoguoguo626807 · Pull Request #7964 · PaddlePaddle/FastDeploy

xiaoguoguo626807 · 2026-06-02T02:46:06Z

Motivation

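Add PaddleFleet as a model inference backend (--model-impl paddlefleet). The core_attention in PaddleFleet's TransformerLayer is replaced with the FastDeploy Attention kernel, so that FastDeploy's KV Cache and high-performance attention computation can be reused on the PaddleFleet model structure. This is a cleaned-up version of #7732.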

Modifications

config.py: add paddlefleet to the ModelImpl type definition
engine/args_utils.py: support the --model-impl paddlefleet CLI argument and add validation logic
model_executor/models/paddleformers/base_fleet.py: add the PaddleFleetModelBase base class, the FastDeployAttention layer, and the patch_paddlefleet_core_attention replacement function
model_executor/models/paddleformers/__init__.py: register the PaddleFleetForCausalLM model class
test_fallback_fleet_model.py: requires separate PaddleFormers and PaddleFleet dependencies; a pytest conftest.py hook installs them dynamically at test time to avoid polluting the global environment
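
The attention replacement described above can be pictured with a minimal, framework-free sketch. Everything here is hypothetical: the class names, the `self_attention` attribute, and `patch_core_attention` are stand-ins chosen for illustration and are not the actual PaddleFleet or FastDeploy code; the real `patch_paddlefleet_core_attention` may differ in structure and signature.

```python
class CoreAttention:
    """Stand-in for the original attention core of a layer (hypothetical)."""

    def __call__(self, q, k, v):
        return "reference"


class FastAttention:
    """Stand-in for a replacement attention kernel (hypothetical)."""

    def __init__(self, layer_id):
        self.layer_id = layer_id

    def __call__(self, q, k, v):
        return f"fast:{self.layer_id}"


class SelfAttention:
    def __init__(self):
        self.core_attention = CoreAttention()


class TransformerLayer:
    def __init__(self):
        self.self_attention = SelfAttention()


def patch_core_attention(layers):
    """Swap each layer's core_attention for the replacement; return how many were patched."""
    patched = 0
    for layer_id, layer in enumerate(layers):
        attn = getattr(layer, "self_attention", None)
        if attn is None or not hasattr(attn, "core_attention"):
            continue  # leave layers without a core_attention untouched
        attn.core_attention = FastAttention(layer_id)
        patched += 1
    return patched


layers = [TransformerLayer() for _ in range(3)]
num_patched = patch_core_attention(layers)
outputs = [layer.self_attention.core_attention(None, None, None) for layer in layers]
```

Using `getattr`/`hasattr` rather than direct attribute access keeps the patch from failing on layers that lack the expected submodule, which is the kind of AttributeError the review later reports as fixed.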

Usage or Command

python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --model-impl paddlefleet
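
As a rough illustration of the kind of CLI validation mentioned for engine/args_utils.py, the sketch below restricts `--model-impl` to a fixed set of values with `argparse`. It is not the FastDeploy implementation: the parser, the `MODEL_IMPL_CHOICES` tuple, and the `"auto"` value are assumptions; only `paddlefleet` is confirmed by this PR.

```python
import argparse

# Assumed value set for illustration; only "paddlefleet" is confirmed by this PR.
MODEL_IMPL_CHOICES = ("auto", "paddlefleet")


def build_parser():
    """Build a small parser that rejects unknown --model-impl values."""
    parser = argparse.ArgumentParser(prog="sketch")
    parser.add_argument("--model", required=True)
    parser.add_argument("--model-impl", choices=MODEL_IMPL_CHOICES, default="auto")
    return parser


args = build_parser().parse_args(
    ["--model", "/path/to/model", "--model-impl", "paddlefleet"]
)

try:
    build_parser().parse_args(["--model", "/path/to/model", "--model-impl", "unknown"])
    rejected = False
except SystemExit:
    # argparse reports the invalid choice on stderr and exits
    rejected = True
```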

Accuracy Tests

N/A (this PR adds the PaddleFleet inference backend; logits alignment data against a reference implementation has not been provided yet)

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-06-02T02:46:11Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-06-02T03:01:38Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-02 20:14:56

The CI report is generated from the following code (updated every 30 minutes):
PR commit: 80e85f8 | Merge base: 4474188 (branch: develop)

1 Required jobs: 8/10 passed

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Waiting	Skipped
41(0)	41	37	2	1	1	0

Job	Error type	Confidence	Log
`Approval`	Approval required	High	Job

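2 Failure details

🔴 Approval: approval required (confidence: high)

This job requires manual approval; CI continues only after the approval is granted.

Suggested action: have a maintainer approve the job manually.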

codecov-commenter · 2026-06-02T03:25:48Z

Codecov Report

❌ Patch coverage is 96.91358% with 10 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@3fe8f7c). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
.../model_executor/models/paddleformers/base_fleet.py	96.68%	1 Missing and 9 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7964   +/-   ##
==========================================
  Coverage           ?   67.56%           
==========================================
  Files              ?      468           
  Lines              ?    65903           
  Branches           ?    10169           
==========================================
  Hits               ?    44525           
  Misses             ?    18540           
  Partials           ?     2838

Flag	Coverage Δ
GPU	`77.67% <96.91%> (?)`
XPU	`7.02% <0.30%> (?)`


…into fleet

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-03 13:56:29

📋 Review summary

PR overview: adds paddlefleet as a model inference backend; replaces core_attention in PaddleFleet's TransformerLayer with the FastDeploy Attention kernel so that the KV Cache and high-performance attention can be reused
Change scope: model_executor/models/paddleformers/, model_executor/graph_optimization/, engine/args_utils.py, config.py, tests/model_executor/fallback/
Affected tags: [Models] [FDConfig] [Graph Optimization] [CI] [Engine]

Issues

Severity	File	Summary
🟡 Suggestion	`fastdeploy/model_executor/models/paddleformers/base_fleet.py`	`assert` is used for runtime validation and is silently skipped under Python `-O`

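Status of previous findings

Finding	Issue	Status
F1	Help string concatenation in `args_utils.py` is missing a separating space	⚠️ Still present
F2	`graph_opt_backend` does not support `*args`; may raise TypeError when CUDAGraph is enabled	🔄 Partially fixed
F3	Direct access to `multi_latent_attention` may raise AttributeError	✅ Fixed
F4	Version number logged in conftest.py does not match the version actually installed	⚠️ Still present
F5	`graph_opt_backend(args, kwargs)` passes `args`, but `GraphOptBackend.__call__` accepts only `**kwargs`; still raises TypeError when CUDAGraph is enabled	⚠️ Still present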
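
The failure mode behind F2 and F5 can be reproduced in isolation. The sketch below uses two hypothetical wrapper classes (`KwargsOnlyBackend` and `FlexibleBackend`, neither of which is the real `GraphOptBackend`) to show why passing a positional argument to a `__call__` that declares only `**kwargs` raises TypeError, and how accepting `*args` as well avoids it.

```python
class KwargsOnlyBackend:
    """Hypothetical wrapper whose __call__ accepts keyword arguments only."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, **kwargs):
        return self.fn(**kwargs)


class FlexibleBackend:
    """Hypothetical wrapper that forwards positional and keyword arguments."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, *args, **kwargs):
        return self.fn(*args, **kwargs)


def forward(ids, forward_meta=None):
    return (ids, forward_meta)


try:
    # A positional argument cannot bind to a **kwargs-only signature.
    KwargsOnlyBackend(forward)([1, 2], forward_meta="meta")
    positional_failed = False
except TypeError:
    positional_failed = True

keyword_result = KwargsOnlyBackend(forward)(ids=[1, 2], forward_meta="meta")
flexible_result = FlexibleBackend(forward)([1, 2], forward_meta="meta")
```

Either side can resolve the mismatch: the caller can pass everything by keyword, or the wrapper can forward `*args`. Which of the two the PR should adopt is not stated in the review.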

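📝 PR convention check

Compliant.

Overall assessment

F3 (MLA AttributeError) is fixed, and the overall implementation logic is clear. The previous findings F1 and F4 remain unfixed, and the core problem in F5 (GraphOptBackend.__call__ does not accept *args) persists: enabling paddlefleet on the CUDAGraph path raises TypeError. Either follow up on it or state explicitly in the PR that paddlefleet does not yet support CUDAGraph mode. In the new code, assert forward_meta is not None should be replaced with an explicit raise.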
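
To make the `assert` concern concrete, here is a small self-contained sketch. It compiles the same function with and without optimization (the built-in `compile` accepts an `optimize` argument; level 1 corresponds to running Python with `-O`) and compares it with an explicit check. The function names and the choice of `ValueError` are illustrative assumptions, not code from base_fleet.py.

```python
SRC = '''
def check_with_assert(forward_meta):
    assert forward_meta is not None, "forward_meta is required"
    return "ok"
'''


def load(optimize):
    """Compile SRC at the given optimization level and return the function."""
    namespace = {}
    exec(compile(SRC, "<sketch>", "exec", optimize=optimize), namespace)
    return namespace["check_with_assert"]


def check_with_raise(forward_meta):
    # Explicit check: survives any optimization level.
    if forward_meta is None:
        raise ValueError("forward_meta is required")
    return "ok"


def raises(fn, exc_type):
    """Return True if fn(None) raises exc_type, False if it returns normally."""
    try:
        fn(None)
    except exc_type:
        return True
    return False


assert_fires_normally = raises(load(0), AssertionError)
assert_fires_optimized = raises(load(1), AssertionError)  # assert is stripped here
raise_fires = raises(check_with_raise, ValueError)
```

With optimization enabled the assert-based check lets `None` through and the failure surfaces later, further from its cause; the explicit `raise` fails at the same point in both modes.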

gongshaotian

LGTM for GraphOptBackend

xiaoguoguo626807 · 2026-06-03T06:21:40Z

/re-run all-failed

fd fallback fleet model clean commit

ffcb10c

xiaoguoguo626807 had a problem deploying to Metax_ci June 2, 2026 02:46 — with GitHub Actions Failure