[Cherry-Pick][Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc(#7777) by ShaneGZhu · Pull Request #7832 · PaddlePaddle/FastDeploy

ShaneGZhu · 2026-05-15T08:42:59Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Add the --enable-moe-scores-elementwise-fuse argument when starting the service.

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

…dle#7777) [Cherry-Pick]

paddle-bot · 2026-05-15T08:51:49Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-15T09:43:06Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-19 19:56:40

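CI report generated from the following code (refreshed every 30 minutes):

PR commit: 0725fb4
Merge base: 88a7479 (branch: release/online/20260415)
View full diff
CI details

1 Job overview

❌ 4 Required jobs failed and 2 Required jobs were cancelled. The PR cannot be merged yet; fix the failed jobs first.

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
43(8)	35	27	6	0	0	0

Note: 2 additional jobs were cancelled and are not counted in the table above.

2 Job status summary

2.1 Required jobs: 2/8 passed

Required jobs block merging; fix failures first.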

Status	Job	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Run Base Tests / base_tests`	13m34s	PR issue: TypeError at mxfp4.py:38, worker process crashed	Fix the enable_torch_proxy compatibility call at mxfp4.py:38	Job	-
❌	`Run Four Cards Tests / run_4_cards_tests`	26m2s	PR issue: mxfp4.py TypeError, API server failed to start	Fix mxfp4.py:38 so the 4-card test service starts normally	Job	-
❌	`xpu_8cards_case_test / run_xpu_8cards_cases`	47m6s	PR issue: noaux_tc import failed + mxfp4.py TypeError	Fix the noauxtc build and the compatibility call at mxfp4.py:38	Job	🔄×1
❌	`Extracted partial CE model tasks / run_ce_cases`	8m3s	PR issue: mxfp4.py TypeError, CE service failed to start	Fix paddle.compat.enable_torch_proxy at mxfp4.py:38	Job	-
⚪	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	-	Cancelled	-	-	-
⚪	`xpu_4cards_case_test / run_xpu_4cards_cases`	-	Cancelled	-	-	-
✅	The other 2 required jobs passed (logprob, stable_tests)	-	-	-	-	-

2.2 Optional jobs: 25/27 passed

Optional jobs do not block merging; failures are informational only.

Status	Job	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	19m2s	Job	-
❌	`Trigger Jenkins for PR`	53s	Job	-
✅	The other 25 optional jobs passed	-	-	-

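3 Failure details (Required only)

Run Base Tests / base_tests: test case failure (confidence: high)

Status: ❌ Failed
Error type: test case failure (service crashed on startup)
Confidence: high
Root cause summary: PR issue: enable_torch_proxy TypeError at mxfp4.py:38, worker process crashed
Analyzer: generic analysis (fallback)

Root cause details:
Line 38 of fastdeploy/model_executor/layers/quantization/mxfp4.py calls paddle.compat.enable_torch_proxy(scope={"flashinfer"}) at module level. The call raises TypeError: 'NoneType' object is not callable, so the worker process cannot initialize the quantization config and the service fails to start with exit code 8.

Key logs: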

File ".../quantization/mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
ERROR engine.py: Failed to launch worker processes
##[error]Process completed with exit code 8.

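Suggested fix:

Check line 38 of fastdeploy/model_executor/layers/quantization/mxfp4.py: paddle.compat.enable_torch_proxy is None in the current Paddle version, so add a callable() guard or use the correct API call.
Confirm that the mxfp4.py changes brought in by this cherry-pick ([Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc #7777) are compatible with the Paddle version in the target environment.

Suggested fix summary: Fix the enable_torch_proxy call at mxfp4.py:38 by adding a callable guard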
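
A minimal sketch of the callable guard this report asks for. It does not import Paddle: the two `SimpleNamespace` objects are hypothetical stand-ins for `paddle.compat` on a build where the hook exists and on a build where it is `None`, and `enable_proxy_if_available` is an illustrative helper name, not a FastDeploy function.

```python
from types import SimpleNamespace


def enable_proxy_if_available(compat, scope):
    """Call compat.enable_torch_proxy(scope=...) only when it is callable.

    Returns True if the hook was invoked, False if it was missing or None.
    """
    hook = getattr(compat, "enable_torch_proxy", None)
    if not callable(hook):
        return False
    hook(scope=scope)
    return True


# Hypothetical stand-ins for paddle.compat in two environments.
calls = []
compat_with_hook = SimpleNamespace(enable_torch_proxy=lambda scope: calls.append(scope))
compat_without_hook = SimpleNamespace(enable_torch_proxy=None)

print(enable_proxy_if_available(compat_with_hook, {"flashinfer"}))     # True
print(enable_proxy_if_available(compat_without_hook, {"flashinfer"}))  # False
```

Returning a boolean instead of silently skipping lets the caller decide whether a missing hook is fatal for the code path that needs it.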

Link: View log

Run Four Cards Tests / run_4_cards_tests: test case failure (confidence: high)

Status: ❌ Failed
Error type: test case failure (API server did not start)
Confidence: high
Root cause summary: PR issue: mxfp4.py TypeError kept the API server from starting; all 3 tests failed
Analyzer: generic analysis (fallback)

Failed cases:

Test	Error	Root cause
`test_ernie_21b_tp1_dp4.py::*`	RuntimeError: API server did not start on port 8088	TypeError at mxfp4.py:38 crashed the service
`test_ernie_21b_tp1_dp4_mtp.py::*`	RuntimeError: API server did not start on port 8088	Same as above
`test_determinism_long.py::*`	Service failed to start	Same as above

Root cause details:
Same root cause as base_tests. The paddle.compat.enable_torch_proxy(scope={"flashinfer"}) call at mxfp4.py:38 fails (TypeError: 'NoneType' object is not callable), and the worker log records this error explicitly. The API server never came up on port 8088, so all 3 test files that depend on the service raised RuntimeError.

Key logs:

File ".../mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
RuntimeError: API server did not start on port 8088
3 test file(s) failed: test_determinism_long.py / test_ernie_21b_tp1_dp4.py / test_ernie_21b_tp1_dp4_mtp.py

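Suggested fix:

Same as base_tests: fix line 38 of mxfp4.py and make sure paddle.compat.enable_torch_proxy is callable before it is called.
After the fix, verify that the TP1DP4 service starts normally on port 8088.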
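
One way to do the port check locally is a small TCP probe. This is only a sketch using the standard library; `wait_for_port` and its defaults are illustrative names and values, not part of the FastDeploy test suite.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=30.0, interval_s=0.5):
    """Return True once a TCP connection to (host, port) succeeds, else False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry until the deadline.
            time.sleep(interval_s)
    return False
```

For the job above, the call would be along the lines of `wait_for_port("127.0.0.1", 8088, timeout_s=360)`. A listening port only shows that the server process is up, not that the model has finished loading.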

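Suggested fix summary: Fix mxfp4.py:38 so the 4-card API server starts normally

Link: View log

xpu_8cards_case_test / run_xpu_8cards_cases: test case failure (confidence: high)

Status: ❌ Failed (still failing after 1 rerun)
Error type: test case failure (PD disaggregation service failed to start)
Confidence: high
Root cause summary: PR issue: import noaux_tc failed + mxfp4.py TypeError, PD disaggregation service crashed
Analyzer: generic analysis (fallback)

Failed cases:

Test	Error	Root cause
`test_pd_21b_ep4tp1.py::test_pd_separation`	PD disaggregation service failed to start	mxfp4.py TypeError
`test_pd_21b_ep4tp4.py::test_pd_separation`	PD disaggregation service failed to start	Same as above
`test_pd_21b_ep4tp4_cudagraph.py::test_pd_separation`	PD disaggregation service failed to start	Same as above
`test_pd_p_tp4ep4_d_tp1ep4.py::test_pd_separation`	PD disaggregation service failed to start	Same as above

Root cause details:
This PR (cherry-pick of #7777, noauxtc kernel fusion) directly causes two problems: (1) moe.py:45 reports "import noaux_tc Failed!", meaning the noauxtc module fails to build or import in the XPU environment; (2) it also triggers the TypeError at mxfp4.py:38, which crashes the worker process, so all 4 PD disaggregation tests exit with "service failed to start". The job still fails after 1 rerun, so the failure is not intermittent.

Key logs: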

WARNING moe.py[line:45] import noaux_tc Failed!
File ".../mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
4 failed in 2653.10s (0:44:13)
============================8卡cases测试失败,请检查日志!============================

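Suggested fix:

Investigate why noauxtc fails to build in the XPU environment, and make sure the cherry-picked noauxtc kernel supports XPU hardware.
Fix the paddle.compat.enable_torch_proxy call at mxfp4.py:38 (same problem as in the GPU environment).

Suggested fix summary: Fix the noauxtc XPU build and the compatibility call at mxfp4.py:38

Link: View log

Extracted partial CE model tasks / run_ce_cases: test case failure (confidence: high)

Status: ❌ Failed
Error type: test case failure (CE service failed to start)
Confidence: high
Root cause summary: PR issue: TypeError at mxfp4.py:38, CE service (test_EB_Lite_serving) crashed
Analyzer: generic analysis (fallback)

Root cause details:
When test_EB_Lite_serving.py starts a service with the wint4 quantization config, it goes through parse_quant_config → get_quantization_config("wint4") → from .mxfp4 import MXFP4Config. The import crashes at mxfp4.py:38 because paddle.compat.enable_torch_proxy is None when it is called with scope={"flashinfer"}. The service exits with exit_code=1 and the CE model test fails as a whole.

Key logs: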

File ".../quantization/mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
[ERROR] test_EB_Lite_serving.py 起服务或执行异常，exit_code=1
##[error]Process completed with exit code 1.

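Suggested fix:

Add a check before mxfp4.py:38 and only make the call inside if callable(getattr(paddle.compat, 'enable_torch_proxy', None)):, which avoids the NoneType error.
Alternatively, move the call inside a function to avoid the module-level side effect.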
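
The second option, moving the call out of module scope, could be sketched as follows. Everything here is illustrative: `compat` is a hypothetical stand-in for `paddle.compat` on a build where the hook is `None`, and `MXFP4ConfigSketch` and `ensure_flashinfer_proxy` are made-up names, not the real FastDeploy classes.

```python
import functools
from types import SimpleNamespace

# Hypothetical stand-in for paddle.compat on a build where the hook is None.
compat = SimpleNamespace(enable_torch_proxy=None)


@functools.lru_cache(maxsize=None)
def ensure_flashinfer_proxy():
    """Enable the proxy at most once, on first use; report whether it was enabled."""
    hook = getattr(compat, "enable_torch_proxy", None)
    if not callable(hook):
        return False
    hook(scope={"flashinfer"})
    return True


class MXFP4ConfigSketch:
    """Illustrative config class: importing or constructing it has no side effects."""

    def create_method(self):
        # The hook is only touched when the MXFP4 path is actually used.
        if not ensure_flashinfer_proxy():
            raise RuntimeError("enable_torch_proxy is unavailable in this Paddle build")
        return "mxfp4-method"
```

With this shape, importing the module while resolving another quantization method (such as wint4 in the logs above) no longer executes the hook, and only an actual MXFP4 request fails on an incompatible build.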

Suggested fix summary: Add a callable guard before mxfp4.py:38 to avoid the TypeError

Link: View log

PaddlePaddle-bot · 2026-05-15T10:03:03Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-15 17:56:21

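CI report generated from the following code (refreshed every 30 minutes):

PR commit: d43e7d0
Merge base: 88a7479 (branch: release/online/20260415)
View full diff
CI details

1 Job overview

⚠️ 4 Required jobs failed and need to be fixed first; 2 Required jobs are still running.

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
30(0)	30	22	6	2	0	0

2 Job status summary

2.1 Required jobs: 2/8 passed

Required jobs block merging; fix failures first.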

Status	Job	Duration	Root cause	Suggested fix	Log	Rerun
❌	`xpu_8cards_case_test / run_xpu_8cards_cases`	48m2s	Environment issue: paddle.compat.enable_torch_proxy is None at mxfp4.py:38	Add a callable check at mxfp4.py:38	Job	-
❌	`Run Base Tests / base_tests`	14m2s	Environment issue: TypeError at mxfp4.py:38, worker process crashed with exit code 8	Add a callable check at mxfp4.py:38 or upgrade Paddle	Job	-
❌	`Run Four Cards Tests / run_4_cards_tests`	25m32s	Environment issue: 3 e2e tests failed, inferred to be the same mxfp4.py problem	Same as base_tests: fix the mxfp4 environment issue	Job	-
❌	`Extracted partial CE model tasks to run in CI. / run_ce_cases`	8m4s	Environment issue: test_EB_Lite_serving service cannot start, TypeError at mxfp4.py:38	Add a callable check at mxfp4.py:38	Job	-
⏳	`xpu_4cards_case_test / run_xpu_4cards_cases`	-	Running	-	Job	-
⏳	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	-	Running	-	Job	-
✅	The other 2 required jobs passed (`run_tests_logprob`, `stable_tests`)	-	-	-	-	-

2.2 Optional jobs: 20/22 passed

Optional jobs do not block merging; failures are informational only.

Status	Job	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	12m28s	Job	-
❌	`Trigger Jenkins for PR` (CI_METAX)	1m4s	Job	-
✅	The other 20 optional jobs passed	-	-	-

3 Failure details (Required only)

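xpu_8cards_case_test / run_xpu_8cards_cases: environment issue (confidence: high)

Status: ❌ Failed
Error type: environment issue: module import crash + XPU op warning introduced by the PR
Confidence: high
Root cause summary: In the XPU environment, paddle.compat.enable_torch_proxy is None at mxfp4.py:38, which crashes the worker; the grouped_topk CUDA op added by the PR is not compiled for XPU (guarded by try-except, not the direct cause of the crash)
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_pd_21b_ep4tp1.py::test_pd_separation`	Failed: PD disaggregation service failed to start	TypeError at mxfp4.py:38
`test_pd_21b_ep4tp4.py::test_pd_separation`	Failed: PD disaggregation service failed to start	TypeError at mxfp4.py:38
`test_pd_21b_ep4tp4_cudagraph.py::test_pd_separation`	Failed: PD disaggregation service failed to start	TypeError at mxfp4.py:38
`test_pd_p_tp4ep4_d_tp1ep4.py::test_pd_separation`	Failed: PD disaggregation service failed to start	TypeError at mxfp4.py:38

Root cause details:
The worker process triggers the lazy import of mxfp4.py in the initialize_fd_config → parse_quant_config → get_quantization_config call chain. mxfp4.py:38 calls paddle.compat.enable_torch_proxy(scope={"flashinfer"}) at module level, but that attribute is None (not callable) in the XPU build of Paddle, which raises TypeError. None of the 4 PD disaggregation tests can start a worker, so all of them end with "PD disaggregation service failed to start". mxfp4.py is not among the files changed by this PR. Separately, the grouped_topk CUDA op added by the PR triggers "WARNING: import noaux_tc Failed!" in the XPU environment (already guarded by try-except, not the direct cause of the crash).

Key logs: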

WARNING  moe.py: import noaux_tc Failed!
File "/workspace/FastDeploy/fastdeploy/model_executor/layers/quantization/mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
FAILED tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py::test_pd_separation
4 failed in 2653.31s (0:44:13)

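Suggested fix:

Add a platform compatibility check at fastdeploy/model_executor/layers/quantization/mxfp4.py:38: if callable(getattr(paddle.compat, 'enable_torch_proxy', None)):
Confirm whether grouped_topk needs to be compiled for XPU (if it is CUDA-only, the existing try-except guard is sufficient and no further handling is needed).
Verify whether the merge base (88a7479) also fails, to determine whether the problem pre-dates this PR.

Suggested fix summary: Add a callable check at mxfp4.py:38; the XPU grouped_topk warning can be ignored

Related changes: The PR adds a grouped_topk import at fastdeploy/model_executor/layers/moe/moe.py:36, which triggers a warning in the XPU environment (already guarded)
Link: View log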
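
For readers unfamiliar with the pattern, a guarded optional import with a fallback path might look like the sketch below. The module name `hypothetical_gpu_ops` and both helper functions are assumptions made for illustration; the actual import in moe.py is not shown in this report.

```python
import importlib
import logging

logger = logging.getLogger("moe_sketch")


def try_import(module_name, attr):
    """Return module.attr, or None (with a warning) when the import fails."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr)
    except (ImportError, AttributeError):
        logger.warning("import %s Failed!", attr)
        return None


# Hypothetical module path; on a platform without the compiled op this yields None.
grouped_topk = try_import("hypothetical_gpu_ops", "grouped_topk")


def select_routing_path(use_fused_cast):
    """Fall back to the unfused path when the fused op is not available."""
    if use_fused_cast and grouped_topk is not None:
        return "fused"
    return "unfused"
```

The guard only protects the import. Whether a user who passes the fuse flag on such a platform should get a silent fallback or an explicit error is a separate design decision.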

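Run Base Tests / base_tests: environment issue (confidence: high)

Status: ❌ Failed
Error type: environment issue: module import crash
Confidence: high
Root cause summary: paddle.compat.enable_torch_proxy is None at mxfp4.py:38, which crashes the worker with exit code 8
Analyzer: ci_analyze_unittest_fastdeploy

Root cause details:
The test uses the installed FastDeploy package (/usr/local/lib/python3.10/dist-packages/fastdeploy/...). When it starts the ernie-4_5-21b-a3b-bf16-paddle model (wint4 quantization), the worker process's initialize_fd_config → parse_quant_config → get_quantization_config call chain imports the mxfp4.py module. Line 38 calls paddle.compat.enable_torch_proxy(scope={"flashinfer"}), but that attribute is None in the current CI Paddle version, so a TypeError is raised. mxfp4.py is not among the files changed by this PR, so this is judged to be an environment compatibility issue between the installed package and the Paddle version.

Key logs: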

File ".../fastdeploy/model_executor/layers/quantization/mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
ERROR: Failed to launch worker processes
+ exit 8

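Suggested fix:

fastdeploy/model_executor/layers/quantization/mxfp4.py:38: add an if callable(getattr(paddle.compat, 'enable_torch_proxy', None)): guard.
Check the Paddle version on the CI runner and confirm whether paddle.compat.enable_torch_proxy is exported correctly.
Verify whether the merge base also has this failure.

Suggested fix summary: Add a callable check at mxfp4.py:38 or upgrade the CI Paddle version

Related changes: This PR does not modify mxfp4.py; the failure is not directly related to the code changes in this PR
Link: View log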

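Run Four Cards Tests / run_4_cards_tests: environment issue (confidence: medium)

Status: ❌ Failed
Error type: environment issue (inferred): 3 e2e test files failed
Confidence: medium (no detailed traceback)
Root cause summary: 3 ERNIE 21B TP1-DP4 e2e tests failed, inferred to be the mxfp4.py:38 environment compatibility issue
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_determinism_long.py`	exit code 1	Inferred: TypeError at mxfp4.py:38 (no detailed log)
`test_ernie_21b_tp1_dp4.py`	exit code 1	Inferred: TypeError at mxfp4.py:38 (no detailed log)
`test_ernie_21b_tp1_dp4_mtp.py`	exit code 1	Inferred: TypeError at mxfp4.py:38 (no detailed log)

Root cause details:
Only 1 test passes in step_log (test_vocab_parallel_embedding_deterministic, which does not involve wint4 quantization). The other 3 are ERNIE 21B TP1-DP4 e2e service tests; starting a quantized model service in the same environment very likely triggers the same TypeError at mxfp4.py:38 as base_tests. The detailed log is not shown in step_log, so confidence is medium.

Key logs: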

3 test file(s) failed:
/workspace/FastDeploy/tests/e2e/4cards_cases/test_determinism_long.py
/workspace/FastDeploy/tests/e2e/4cards_cases/test_ernie_21b_tp1_dp4.py
/workspace/FastDeploy/tests/e2e/4cards_cases/test_ernie_21b_tp1_dp4_mtp.py
##[error]Process completed with exit code 1.

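Suggested fix:

Fix mxfp4.py:38 with a callable check (same as base_tests).
Check the detailed 4cards test logs (GitHub Artifacts) to confirm the root cause.

Suggested fix summary: Fix mxfp4.py:38; download the Artifacts to see the detailed logs

Link: View log

Extracted partial CE model tasks to run in CI. / run_ce_cases: environment issue (confidence: high)

Status: ❌ Failed
Error type: environment issue: worker crashed on startup
Confidence: high
Root cause summary: The test_EB_Lite_serving.py service cannot start; TypeError at mxfp4.py:38 crashes the worker
Analyzer: ci_analyze_unittest_fastdeploy

Root cause details:
When the CE model service test test_EB_Lite_serving.py starts the FastDeploy LLM service, the worker process cannot finish initialization because of the TypeError at mxfp4.py:38. The service reports ERROR: Failed to initialize FastDeploy LLM engine, and the test framework reports an error after receiving the non-zero exit code. The root cause is identical to base_tests, and mxfp4.py is not among the files changed by this PR.

Key logs: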

File ".../mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
[ERROR] api_server.py: Failed to initialize FastDeploy LLM engine, service exit now!
[ERROR] test_EB_Lite_serving.py 起服务或执行异常，exit_code=1

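Suggested fix:

Add a callable(getattr(paddle.compat, 'enable_torch_proxy', None)) check at mxfp4.py:38.
Verify whether the merge base fails in the same way.

Suggested fix summary: Add a callable check at mxfp4.py:38 to fix the environment compatibility issue

Related changes: This PR does not modify mxfp4.py
Link: View log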

PaddlePaddle-bot · 2026-05-16T09:41:53Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-16 17:39:24

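CI report generated from the following code (refreshed every 30 minutes):

PR commit: d43e7d0
Merge base: 88a7479 (branch: release/online/20260415)
View full diff
CI details

1 Job overview

1 Required job failed; it must be fixed before the PR can be merged.

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
12(0)	12	9	2	0	0	0

2 Job status summary

2.1 Required jobs: 0/2 passed

Required jobs block merging; fix failures first.

Status	Job	Duration	Root cause	Suggested fix	Log	Rerun
❌	`xpu_8cards_case_test / run_xpu_8cards_cases`	48m2s	PR issue: mxfp4.py L38 calls enable_torch_proxy, which is unavailable on XPU	Add a callable check for enable_torch_proxy at mxfp4.py L38	Job	-
⛔	`xpu_4cards_case_test / run_xpu_4cards_cases`	-	Cancelled (because a job it depends on failed)	-	-	-

2.2 Optional jobs: 9/10 passed

Optional jobs do not block merging; failures are informational only.

Status	Job	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	12m28s	Job	-
✅	The other 9 optional jobs passed	-	-	-

3 Failure details (Required only)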

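xpu_8cards_case_test / run_xpu_8cards_cases: test failure (confidence: medium)

Status: ❌ Failed
Error type: test failure
Confidence: medium
Root cause summary: mxfp4.py, introduced by the PR, calls paddle.compat.enable_torch_proxy, which is unavailable in the XPU environment
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_pd_21b_ep4tp1.py::test_pd_separation`	PD disaggregation service failed to start	Worker process crashed during initialization
`test_pd_21b_ep4tp4.py::test_pd_separation`	PD disaggregation service failed to start	Worker process crashed during initialization
`test_pd_21b_ep4tp4_cudagraph.py::test_pd_separation`	PD disaggregation service failed to start	Worker process crashed during initialization
`test_pd_p_tp4ep4_d_tp1ep4.py::test_pd_separation`	PD disaggregation service failed to start	Worker process crashed during initialization

Root cause details:
This PR introduces the noaux_tc operator fusion (cast+sigmoid+bias+noauxtc), and the noaux_tc import at moe.py L45 fails in the XPU environment. More importantly, the module-level initialization code on line 38 of mxfp4.py calls paddle.compat.enable_torch_proxy(scope={"flashinfer"}), but that API is None (not callable) in the Paddle version used in the XPU environment. This raises TypeError: 'NoneType' object is not callable, the worker process crashes, and none of the 4 PD disaggregation service test cases can start.

Key logs: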

WARNING  moe.py[line:45] import noaux_tc Failed!
File "worker_process.py", line 1176, in initialize_fd_config
    quant_config = parse_quant_config(...)
File "mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable

Suggested fix:

fastdeploy/model_executor/layers/quantization/mxfp4.py L38: add a callable check by changing paddle.compat.enable_torch_proxy(scope={"flashinfer"}) to _fn = getattr(paddle.compat, 'enable_torch_proxy', None); if callable(_fn): _fn(scope={"flashinfer"})
Also check the noaux_tc import at moe.py L45 and make sure it degrades gracefully in the XPU environment (there is currently a WARNING; confirm whether functionality is affected).

Suggested fix summary: Add a callable check for enable_torch_proxy at mxfp4.py L38

Related changes: fastdeploy/model_executor/layers/quantization/mxfp4.py (L38), fastdeploy/model_executor/models/moe.py (L45, noaux_tc import)

Link: View log

PaddlePaddle-bot · 2026-05-16T11:08:56Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-16 19:06:06

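CI report generated from the following code (refreshed every 30 minutes):

PR commit: d43e7d0
Merge base: 88a7479 (branch: release/online/20260415)
View full diff
CI details

1 Job overview

❌ 4 Required jobs failed and must be fixed before the PR can be merged. 2 more Required jobs were cancelled (including the main test job run_tests_with_coverage).

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
30(0)	30	23	5	0	0	0

2 Job status summary

2.1 Required jobs: 2/8 passed

Required jobs block merging; fix failures first.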

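Status	Job	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Run Base Tests / base_tests`	14m2s	PR issue: enable_torch_proxy is None at mxfp4.py:38	Add a callable guard at mxfp4.py:38	Job	-
❌	`Run Four Cards Tests / run_4_cards_tests`	25m32s	PR issue: TypeError at mxfp4.py:38, service failed to start	Same as above: fix the mxfp4.py:38 compatibility issue	Job	-
❌	`xpu_8cards_case_test / run_xpu_8cards_cases`	48m2s	PR issue: TypeError at mxfp4.py:38 in the XPU environment	Check whether the Paddle compat API is available on XPU	Job	-
❌	`run_ce_cases`	8m4s	PR issue: TypeError at mxfp4.py:38, CE service timed out	Fix the incompatible module-level call in mxfp4.py	Job	-
⏸️	`run_tests_with_coverage` (cancelled)	-	Cancelled	-	-	-
⏸️	`xpu_4cards_case_test / run_xpu_4cards_cases` (cancelled)	-	Cancelled	-	-	-
✅	The other 2 required jobs passed	-	-	-	-	-

2.2 Optional jobs: 21/22 passed

Optional jobs do not block merging; failures are informational only.

Status	Job	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	12m28s	Job	-
✅	The other 21 optional jobs passed	-	-	-

3 Failure details (Required only)

The following 4 failed Required jobs share the same root cause: line 38 of mxfp4.py calls paddle.compat.enable_torch_proxy(scope={"flashinfer"}) at module load time, but that attribute is None (not callable) in the current CI environments (GPU H20 and XPU), so no model service that uses wint4 quantization can start.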

Run Base Tests / base_tests: test failure (confidence: high)

Status: ❌ Failed
Error type: test failure (service failed to start, exit code 8)
Confidence: high
Root cause summary: enable_torch_proxy is None when called at mxfp4.py:38, which crashes the wint4-quantized service
Analyzer: ci_analyze_unittest_fastdeploy

Root cause details:
Line 38 of fastdeploy/model_executor/layers/quantization/mxfp4.py executes paddle.compat.enable_torch_proxy(scope={"flashinfer"}) at module import time. Because get_quantization_config() triggers from .mxfp4 import MXFP4Config when it resolves wint4, any service started with the wint4 quantization config crashes immediately.

Key logs:

File ".../quantization/mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
ERROR: Failed to initialize FastDeploy LLM engine
exit code 8

Suggested fix:

fastdeploy/model_executor/layers/quantization/mxfp4.py L38: add a callable guard:

if callable(getattr(paddle.compat, 'enable_torch_proxy', None)):
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})

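Confirm whether this cherry-picked PR ([Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc #7777) introduced the mxfp4.py change; if so, check the Paddle version dependency.

Suggested fix summary: Add a callable guard at mxfp4.py:38 to avoid crashing on incompatible Paddle versions

Link: View log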

Run Four Cards Tests / run_4_cards_tests: test failure (confidence: high)

Status: ❌ Failed
Error type: test failure (API service failed to start)
Confidence: high
Root cause summary: Same root cause: all 4-card worker processes exit because the mxfp4.py import fails
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_ernie_21b_tp1_dp4.py`	API server did not start on port 8088	mxfp4.py import failed
`test_ernie_21b_tp1_dp4_mtp.py`	API server did not start on port 8088	mxfp4.py import failed
`test_determinism_long.py`	API server did not start on port 8088	mxfp4.py import failed

Key logs:

workerlog.0/1/2/3: TypeError: 'NoneType' object is not callable
RuntimeError: API server did not start on port 8088
3 test file(s) failed

Suggested fix summary: Fix mxfp4.py:38, same as base_tests

Link: View log

xpu_8cards_case_test / run_xpu_8cards_cases: test failure (confidence: high)

Status: ❌ Failed
Error type: test failure (PD disaggregation service failed to start)
Confidence: high
Root cause summary: mxfp4.py:38 in the XPU workspace source raises the same TypeError; all 4 PD disaggregation cases fail
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_pd_21b_ep4tp1.py::test_pd_separation`	PD disaggregation service failed to start	TypeError at mxfp4.py:38
`test_pd_21b_ep4tp4.py::test_pd_separation`	PD disaggregation service failed to start	TypeError at mxfp4.py:38
`test_pd_21b_ep4tp4_cudagraph.py::test_pd_separation`	PD disaggregation service failed to start	TypeError at mxfp4.py:38
`test_pd_p_tp4ep4_d_tp1ep4.py::test_pd_separation`	PD disaggregation service failed to start	TypeError at mxfp4.py:38

Key logs:

File "/workspace/FastDeploy/fastdeploy/.../mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
====8卡cases测试失败,请检查日志!====

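Suggested fix summary: Same as base_tests; the XPU source path confirms this is a problem in the repository code

Link: View log

Extracted partial CE model tasks to run in CI. / run_ce_cases: test failure (confidence: high)

Status: ❌ Failed
Error type: test failure (CE service startup timed out)
Confidence: high
Root cause summary: TypeError at mxfp4.py:38 crashes the CE service immediately; the test waits 360s and times out
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_EB_Lite_serving.py`	Service startup or execution error, exit_code=1	TypeError at mxfp4.py:38

Key logs: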

TypeError: 'NoneType' object is not callable
{"status": "服务启动中，耗时：[360s]", ...}
服务启动超时，耗时：[360s]
[ERROR] test_EB_Lite_serving.py 起服务或执行异常，exit_code=1

Suggested fix summary: Once mxfp4.py:38 is fixed, the CE service can start normally

Link: View log

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-18 12:56:55

📋 Review summary

PR overview: Fuses the four-step cast→sigmoid→add bias→noaux_tc sequence in the MoE routing hot path into a single CUDA kernel (grouped_topk), enabled by the launch argument --enable-moe-scores-elementwise-fuse.
Scope of changes: custom_ops/gpu_ops/, fastdeploy/model_executor/layers/moe/, fastdeploy/engine/, fastdeploy/scheduler/
Affected tags: [OP] [Optimization] [FDConfig]
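
To make the four steps concrete, here is an unfused pure-Python reference for a single token. It is a sketch, not the PR's kernel: it assumes DeepSeek-V3-style group-limited routing (groups ranked by the sum of their two largest biased scores, final weights taken from the unbiased sigmoid scores and normalized), and the names `route_token`, `n_group`, `topk_group` and `top_k` are illustrative.

```python
import math


def route_token(logits, bias, n_group, topk_group, top_k):
    """Unfused reference for one token: cast -> sigmoid -> add bias -> grouped top-k.

    Returns (expert_ids, weights); weights come from the unbiased sigmoid scores.
    """
    # Step 1: cast to float (Python floats stand in for float32 here).
    x = [float(v) for v in logits]
    # Step 2: sigmoid, written as 1 / (1 + exp(-x)).
    scores = [1.0 / (1.0 + math.exp(-v)) for v in x]
    # Step 3: add the per-expert correction bias (assumed to affect selection only).
    biased = [s + b for s, b in zip(scores, bias)]

    # Step 4a: score each group by the sum of its two largest biased scores.
    group_size = len(biased) // n_group
    group_scores = []
    for g in range(n_group):
        members = sorted(biased[g * group_size:(g + 1) * group_size], reverse=True)
        group_scores.append(sum(members[:2]))
    # Step 4b: keep the best `topk_group` groups.
    kept = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group]
    # Step 4c: pick the top_k experts among the kept groups.
    candidates = [e for g in kept for e in range(g * group_size, (g + 1) * group_size)]
    expert_ids = sorted(candidates, key=lambda e: biased[e], reverse=True)[:top_k]
    # Normalize the unbiased scores of the chosen experts.
    total = sum(scores[e] for e in expert_ids)
    weights = [scores[e] / total for e in expert_ids]
    return expert_ids, weights


ids, weights = route_token(
    [2.0, 1.0, -1.0, 0.0, 0.5, 0.4, 3.0, -2.0], [0.0] * 8,
    n_group=4, topk_group=2, top_k=2,
)
print(ids)  # [0, 1]
```

In this example expert 6 has the largest individual score but is not selected, because its group ranks third once group scores are compared. A fused kernel would perform steps 1 to 3 in registers instead of materializing three intermediate tensors, which is where the memory-traffic saving described in the overview comes from.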

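Issues

Severity	File	Summary
🟡 Suggestion	`fastdeploy/model_executor/layers/moe/fused_moe_marlin_backend.py`	MoE backend implementations are out of sync: the marlin/wint2 backends do not support use_fused_cast
❓ Question	`fastdeploy/worker/worker_process.py:861`	The CLI argument name uses underscores, inconsistent with the hyphenated form in args_utils.py
📝 PR conventions	-	The title contains two tags ([Op][Optimization]), [Op] is not an official tag, and the Motivation/Modifications/Accuracy Tests sections of the description are all empty

📝 PR convention check

Title issues: The current title [Cherry-Pick][Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc(#7777) has three problems: (1) it contains both [Op] and [Optimization] (the convention requires exactly one tag); (2) [Op] is not an official tag (the official one is [OP]); (3) there is no space between [Optimization] and the title text.

Title suggestion (can be copied directly):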

[Cherry-Pick][Optimization] Kernel fusion: cast+sigmoid+bias+noauxtc(#7777)
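
The three title findings can be checked mechanically. The sketch below encodes only the rules stated on this page (a leading [Cherry-Pick] tag with the original PR ID at the end, exactly one tag as the reviewer reads the convention, tags from the checklist's list, a space before the description). `check_title` is a hypothetical helper, not an existing FastDeploy tool, and the checklist itself allows new tags, so the "official list" check is stricter than the template.

```python
import re

# Tag list copied from the PR checklist on this page.
OFFICIAL_TAGS = {
    "FDConfig", "APIServer", "Engine", "Scheduler", "PD Disaggregation", "Executor",
    "Graph Optimization", "Speculative Decoding", "RL", "Models", "Quantization",
    "Loader", "OP", "KVCache", "DataProcessor", "BugFix", "Docs", "CI", "Optimization",
    "Feature", "Benchmark", "Others", "XPU", "HPU", "GCU", "DCU", "Iluvatar", "Metax",
}


def check_title(title):
    """Return a list of human-readable problems; an empty list means the title passes."""
    match = re.match(r"^((?:\[[^\]]+\])+)(\s*)(.*)$", title)
    if not match:
        return ["title does not start with a [Tag]"]
    problems = []
    tags = re.findall(r"\[([^\]]+)\]", match.group(1))
    is_cherry_pick = tags[0] == "Cherry-Pick"
    content_tags = tags[1:] if is_cherry_pick else tags
    if len(content_tags) != 1:
        problems.append(f"expected exactly one tag, found {len(content_tags)}")
    for tag in content_tags:
        if tag not in OFFICIAL_TAGS:
            problems.append(f"[{tag}] is not in the official tag list")
    if not match.group(2):
        problems.append("missing space between the tags and the description")
    if is_cherry_pick and not re.search(r"\(#\d+\)$", title):
        problems.append("cherry-pick title must end with the original PR ID")
    return problems


current = "[Cherry-Pick][Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc(#7777)"
suggested = "[Cherry-Pick][Optimization] Kernel fusion: cast+sigmoid+bias+noauxtc(#7777)"
print(len(check_title(current)), check_title(suggested))  # 3 []
```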

PR description suggestion (can be copied directly; reproduces the full structure of checklist §D2):

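## Motivation
In the MoE routing hot path (for example DeepSeekV3), cast to float32, sigmoid, add bias, and noaux_tc run as separate operators, which incurs multiple rounds of GPU memory reads and writes. This PR fuses these four steps into a single CUDA kernel (`grouped_topk`), improving MoE routing efficiency by reducing memory bandwidth consumption. The feature is controlled by the `--enable-moe-scores-elementwise-fuse` switch and is off by default, which preserves the existing behavior.

## Modifications
1. Add `custom_ops/gpu_ops/grouped_topk_kernels.cu`: implements the fused cast+sigmoid+add_bias+grouped_topk kernel, supports float/float16/bfloat16 inputs, and is registered via `PD_BUILD_STATIC_OP(grouped_topk)`;
2. `custom_ops/gpu_ops/cpp_extensions.cc`: declares the `grouped_topk` function signature and registers the Python binding;
3. `custom_ops/setup_ops.py`: adds the new kernel to both source-file build lists;
4. `fastdeploy/engine/args_utils.py`: adds the `--enable-moe-scores-elementwise-fuse` CLI argument;
5. `fastdeploy/scheduler/config.py`: adds the `enable_moe_scores_elementwise_fuse` field to `SchedulerConfig` (default False);
6. `fastdeploy/engine/engine.py`: passes the new flag to the worker via worker_store_true_flag;
7. `fastdeploy/model_executor/layers/moe/moe.py`: adds a `use_fused_cast` parameter to `get_moe_scores`; when True, it calls the `grouped_topk` fused kernel and skips the separate sigmoid + add operations;
8. `fused_moe_{blackwell,cutlass,deepgemm,triton}_backend.py`: each backend uses the `use_fused` flag to decide whether to run `.cast("float32")` before entering `get_moe_scores`;
9. `fastdeploy/model_executor/layers/moe/ep.py`: the EP path passes `use_fused_cast=False` (fusion is not yet supported in the EPLB scenario; a TODO comment has been added);
10. `tests/operators/test_grouped_topk_op.py`: adds an operator unit test that covers numerical correctness.

## Usage or Command
Add the `--enable-moe-scores-elementwise-fuse` argument when starting the service.

## Accuracy Tests
N/A. The kernel uses a formula that is numerically equivalent to the original sigmoid path (`1/(1+exp(-x))`); bit-level consistency is verified by `tests/operators/test_grouped_topk_op.py`.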

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Added `tests/operators/test_grouped_topk_op.py`
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

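Overall assessment

The implementation approach is clear, the parallel design of the fused kernel (one warp per group) is reasonable, and the coordinated changes across the MoE backends are fairly complete. The main concern is whether fused_moe_marlin_backend.py and fused_moe_wint2_backend.py also use the noaux_tc path. If they do, they need matching use_fused_cast support; otherwise, with the optimization enabled, these two backends would still take the old path and behave inconsistently.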

PaddlePaddle-bot · 2026-05-18T05:01:02Z

        help="chunk size of moe input",
    )
+    parser.add_argument(
+        "--enable_moe_scores_elementwise_fuse",


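❓ Question: the CLI argument name uses underscores, inconsistent with the hyphenated form in args_utils.py

args_utils.py registers --enable-moe-scores-elementwise-fuse (hyphens), while this code uses --enable_moe_scores_elementwise_fuse (underscores). Both forms will then show up in user documentation and logs; consider standardizing on the hyphenated form: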

"--enable-moe-scores-elementwise-fuse",

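As background for the naming question: argparse derives the attribute name from the first long option by replacing hyphens with underscores, so the hyphenated spelling already yields `enable_moe_scores_elementwise_fuse` on the parsed namespace. The sketch below also registers the underscore spelling as an alias, which is one possible way to stay backward compatible; the help text is a placeholder, and whether FastDeploy wants an alias at all is not stated in the review.

```python
import argparse

parser = argparse.ArgumentParser()
# Both spellings are aliases of one option; dest comes from the first long name.
parser.add_argument(
    "--enable-moe-scores-elementwise-fuse",
    "--enable_moe_scores_elementwise_fuse",
    action="store_true",
    help="placeholder help text for this sketch",
)

hyphen = parser.parse_args(["--enable-moe-scores-elementwise-fuse"])
underscore = parser.parse_args(["--enable_moe_scores_elementwise_fuse"])
default = parser.parse_args([])

print(
    hyphen.enable_moe_scores_elementwise_fuse,
    underscore.enable_moe_scores_elementwise_fuse,
    default.enable_moe_scores_elementwise_fuse,
)  # True True False
```
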
PaddlePaddle-bot · 2026-05-18T05:33:15Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-18 13:30:12

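CI report generated from the following code (refreshed every 30 minutes):

PR commit: 0725fb4
Merge base: 88a7479 (branch: release/online/20260415)
View full diff
CI details

1 Job overview

⚠️ CI has not finished: 3 Required jobs failed and 3 Required jobs are still running. Wait for them to finish and address the failures.

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
31(0)	31	23	5	3	0	0

2 Job status summary

2.1 Required jobs: 2/8 passed

Required jobs block merging; fix failures first.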

Status	Job	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Run Four Cards Tests / run_4_cards_tests`	26m2s	PR issue: Ernie 21B TP1 DP4/MTP/determinism tests failed, MoE config suspected to be out of sync	Check whether the 4-card test config is missing the enable_moe_scores_elementwise_fuse field	Job	-
❌	`Run Base Tests / base_tests`	13m34s	Environment issue: enable_torch_proxy is None when mxfp4.py is imported	Environment issue, please rerun	Job	-
❌	`Extracted partial CE model tasks to run in CI. / run_ce_cases`	8m3s	Failed to fetch the log, root cause cannot be analyzed	Analysis unavailable, see CI details	Job	-
⏳	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	-	Running	-	Job	-
⏳	`xpu_4cards_case_test / run_xpu_4cards_cases`	-	Running	-	Job	-
⏳	`xpu_8cards_case_test / run_xpu_8cards_cases`	-	Running	-	Job	-
✅	The other 2 required jobs passed (run_tests_logprob, stable_tests)	-	-	-	-	-

2.2 Optional jobs: 21/23 passed

Optional jobs do not block merging; failures are informational only.

Status	Job	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	19m2s	Job	-
❌	`Trigger Jenkins for PR`	53s	Job	-
✅	The other 21 optional jobs passed	-	-	-

3 Failure details (Required only)

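Run Four Cards Tests / run_4_cards_tests: test failure (confidence: medium)

Status: ❌ Failed
Error type: test failure
Confidence: medium
Root cause summary: The PR adds the enable_moe_scores_elementwise_fuse attribute, and the 4-card integration test config was not updated to match

Failed cases:

Test	Error	Root cause
`test_ernie_21b_tp1_dp4.py`	Unknown (detailed log unavailable)	Missing MoE config attribute or grouped_topk compatibility issue
`test_ernie_21b_tp1_dp4_mtp.py`	Unknown (detailed log unavailable)	Missing MoE config attribute or grouped_topk compatibility issue
`test_determinism_long.py`	Unknown (detailed log unavailable)	MoE numerical determinism affected by grouped_topk

Root cause details:
The PR adds the enable_moe_scores_elementwise_fuse field in fastdeploy/scheduler/config.py and worker_process.py, and introduces the grouped_topk operator and the use_fused_cast path in moe.py. The PR already adds the field to tests/layers/test_deepgemm_fused_moe.py and similar files (SimpleNamespace(..., enable_moe_scores_elementwise_fuse=False)), but 4-card integration tests such as tests/e2e/4cards_cases/test_ernie_21b_tp1_dp4.py may still lack it, which would cause an AttributeError or abnormal behavior.

Key logs: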

3 test file(s) failed:
/workspace/FastDeploy/tests/e2e/4cards_cases/test_determinism_long.py
/workspace/FastDeploy/tests/e2e/4cards_cases/test_ernie_21b_tp1_dp4.py
/workspace/FastDeploy/tests/e2e/4cards_cases/test_ernie_21b_tp1_dp4_mtp.py
##[error]Process completed with exit code 1.

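Suggested fix:

Check the code that creates scheduler_config in tests/e2e/4cards_cases/test_ernie_21b_tp1_dp4.py and related files, and add the enable_moe_scores_elementwise_fuse=False field (following the change made in tests/layers/test_deepgemm_fused_moe.py).
If the 4-card tests use the full SchedulerConfig rather than SimpleNamespace, confirm that the scheduler/config.py change has been merged correctly.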
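
If hand-built test configs are a concern, the reader of the flag can also tolerate its absence. This is a sketch of that alternative, not what the PR does: `moe_fuse_enabled` is a hypothetical helper and `max_num_seqs` is just a filler field for the example configs.

```python
from types import SimpleNamespace


def moe_fuse_enabled(scheduler_config):
    """Read the flag, treating configs that predate it as 'disabled'."""
    return bool(getattr(scheduler_config, "enable_moe_scores_elementwise_fuse", False))


# Hypothetical test configs: one written before the field existed, one after.
old_style = SimpleNamespace(max_num_seqs=8)
new_style = SimpleNamespace(max_num_seqs=8, enable_moe_scores_elementwise_fuse=True)

print(moe_fuse_enabled(old_style), moe_fuse_enabled(new_style))  # False True
```

Direct attribute access on `old_style` would raise AttributeError, which is the failure mode this report hypothesizes for the 4-card tests.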

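Suggested fix summary: Add the enable_moe_scores_elementwise_fuse=False field to the 4-card test config

Related changes: fastdeploy/scheduler/config.py, fastdeploy/worker/worker_process.py, fastdeploy/model_executor/layers/moe/moe.py
Link: View log

Run Base Tests / base_tests: infrastructure/environment (confidence: medium)

Status: ❌ Failed
Error type: infrastructure
Confidence: medium
Root cause summary: Environment issue: paddle.compat.enable_torch_proxy is None (not callable) when mxfp4.py is imported

Root cause details:
When the worker process initializes the quantization config, quantization/__init__.py:220 runs from .mxfp4 import MXFP4Config, which executes paddle.compat.enable_torch_proxy(scope={"flashinfer"}) at mxfp4.py:38. That attribute is None in this environment, which raises TypeError: 'NoneType' object is not callable. This PR does not modify mxfp4.py or quantization/__init__.py, so this is judged to be a Paddle version/environment compatibility issue.

Key logs: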

File "fastdeploy/model_executor/layers/quantization/mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
ERROR 2026-05-18 13:03:24 engine.py[line:163] Failed to launch worker processes
+ exit 8
##[error]Process completed with exit code 8.

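Suggested fix:

This is an environment issue unrelated to this PR; rerun the job directly.
If it keeps failing, investigate paddle.compat version compatibility (whether paddle.compat.enable_torch_proxy is None in the CI environment).

Suggested fix summary: Environment issue, please rerun

Related changes: None (this PR does not modify quantization-related files)
Link: View log

Extracted partial CE model tasks to run in CI. / run_ce_cases: unknown (confidence: low)

Status: ❌ Failed
Error type: unknown
Confidence: low
Root cause summary: Failed to fetch the CI log, so no in-depth analysis is possible

Root cause details:
The full log for this job could not be fetched (error_snippet is "日志获取失败，无法提取错误信息", i.e. "failed to fetch the log, cannot extract error information"), so the specific cause of failure cannot be determined. Check the GitHub Actions log page directly for details.

Suggested fix:

Open the CI log directly to see the failure details.
If the failure is intermittent, try a rerun.

Suggested fix summary: Analysis unavailable, see CI details

Link: View log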

PaddlePaddle-bot · 2026-05-18T14:07:30Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-18 22:02:54

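CI report generated from the following code (refreshed every 30 minutes):

PR commit: 0725fb4
Merge base: 88a7479 (branch: release/online/20260415)
View full diff
CI details

1 Job overview

⚠️ Required job failures: 4 Required jobs failed and need to be fixed first; 2 more Required jobs were cancelled (including the main test job run_tests_with_coverage).

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
32(0)	32	24	6	0	0	0

2 Job status summary

2.1 Required jobs: 2/8 passed

Required jobs block merging; fix failures first.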

Status	Job	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Run Base Tests / base_tests`	13m34s	PR issue: `enable_torch_proxy` is None at `mxfp4.py` L38, worker crashed, exit code 8	Add a callable() guard at `mxfp4.py` L38 or upgrade paddle	Job	-
❌	`Run Four Cards Tests / run_4_cards_tests`	26m2s	PR issue: ernie-21b service cannot start, 3 e2e tests failed	Same as above: fix the `mxfp4.py` L38 compatibility issue	Job	-
❌	`xpu_8cards_case_test / run_xpu_8cards_cases`	47m6s	PR issue: noauxtc not compiled for XPU + TypeError at `mxfp4.py` L38	Check noauxtc XPU compatibility and fix mxfp4.py L38	Job	-
❌	`Extracted partial CE model tasks / run_ce_cases`	8m3s	PR issue: TypeError at `mxfp4.py` L38, test_EB_Lite_serving failed to start its service	Add a callable() guard at `mxfp4.py` L38	Job	-
⚪	`run_tests_with_coverage` (main test job)	-	Cancelled (not run)	-	-	-
⚪	`xpu_4cards_case_test / run_xpu_4cards_cases`	-	Cancelled (not run)	-	-	-
✅	The other 2 required jobs passed	-	-	-	-	-

2.2 Optional jobs: 22/24 passed

Optional jobs do not block merging; failures are informational only.

Status	Job	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	19m2s	Job	-
❌	`Trigger Jenkins for PR`	53s	Job	-
✅	The other 22 optional jobs passed	-	-	-

3 Failure details (Required only)

Run Base Tests / base_tests: service startup failure (confidence: medium)

Status: ❌ Failed
Error type: service startup failure
Confidence: medium
Root cause summary: paddle.compat.enable_torch_proxy is None when called at mxfp4.py L38; worker crashed, exit code 8
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
ernie45t_21b_sot_wint4 service startup	TypeError: 'NoneType' object is not callable	`enable_torch_proxy` at `mxfp4.py` L38 is not callable

Root cause details:
The worker process crashes while loading the quantization config in initialize_fd_config: parse_quant_config → get_quantization_config("mxfp4") → from .mxfp4 import MXFP4Config → paddle.compat.enable_torch_proxy(scope={"flashinfer"}) at L38 raises TypeError. The API is None (not callable) in the current CI paddle environment, so no model service that uses mxfp4/wint4 quantization can start. The test uses the installed wheel (/usr/local/lib/python3.10/dist-packages), and the mxfp4.py change was very likely introduced by this cherry-pick PR.

Key logs:

File ".../fastdeploy/model_executor/layers/quantization/mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
ERROR: Failed to launch worker processes, check log/workerlog.* for more details.
+ exit 8

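Suggested fix:

fastdeploy/model_executor/layers/quantization/mxfp4.py L38: add a callable check before the call: if callable(getattr(paddle.compat, 'enable_torch_proxy', None)): paddle.compat.enable_torch_proxy(scope={"flashinfer"})
Check whether the paddle version in the CI environment supports the enable_torch_proxy API, and either upgrade paddle or switch to a compatible call.

Suggested fix summary: Add a callable guard at mxfp4.py L38 or upgrade the paddle version

Related changes: mxfp4.py quantization config change introduced by the cherry-pick
Link: View log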

Extracted partial CE model tasks to run in CI. / run_ce_cases: service startup failure (confidence: high)

Status: ❌ Failed
Error type: service startup failure
Confidence: high
Root cause summary: Same as base_tests: TypeError at mxfp4.py L38 makes test_EB_Lite_serving fail to start its service
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_EB_Lite_serving.py`	exit_code=1 (service startup error)	`enable_torch_proxy` at `mxfp4.py` L38 is None

Root cause details:
Exactly the same error stack as base_tests. worker_process.py triggers TypeError: 'NoneType' object is not callable while loading mxfp4.py, so fastdeploy engine initialization fails. The log explicitly shows Failed to initialize FastDeploy LLM engine, service exit now!

Key logs:

File ".../fastdeploy/model_executor/layers/quantization/mxfp4.py", line 38
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
ERROR api_server.py:140: Failed to initialize FastDeploy LLM engine, service exit now!
[ERROR] test_EB_Lite_serving.py 起服务或执行异常，exit_code=1

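Suggested fixes:

Same as base_tests: add the callable compatibility guard at mxfp4.py L38

Fix summary: add the callable check at mxfp4.py L38; same root cause as base_tests

Related changes: the mxfp4.py change introduced by the cherry-pick
Link: View logs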

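Run Four Cards Tests / run_4_cards_tests - Test failure (confidence: Medium)

run_4_cards_tests

Status: ❌ Failed
Error type: Test failure
Confidence: Medium
Root cause summary: 3 ernie-21b e2e tests failed; presumably the same mxfp4.py service startup failure
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_determinism_long.py`	exit code 1	ernie-21b service may have failed to start
`test_ernie_21b_tp1_dp4.py`	exit code 1	ernie-21b service may have failed to start
`test_ernie_21b_tp1_dp4_mtp.py`	exit code 1	ernie-21b service may have failed to start

Root cause details:
test_vocab_parallel_embedding_deterministic passed (a pure compute test that does not depend on the inference server). The 3 failed tests are all ernie-21b end-to-end serving tests that need to start the FastDeploy inference service, and they most likely fail to start because of the same mxfp4.py L38 TypeError. The full traceback is consistent with the pattern seen in base_tests/run_ce_cases.

Key logs: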

3 test file(s) failed:
/workspace/FastDeploy/tests/e2e/4cards_cases/test_determinism_long.py
/workspace/FastDeploy/tests/e2e/4cards_cases/test_ernie_21b_tp1_dp4.py
/workspace/FastDeploy/tests/e2e/4cards_cases/test_ernie_21b_tp1_dp4_mtp.py
##[error]Process completed with exit code 1.

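Suggested fixes:

Fix the callable compatibility issue at mxfp4.py L38 (same as base_tests)
If the tests still fail after that fix, check the detailed logs of the 3 failed test files under tests/e2e/4cards_cases

Fix summary: fix mxfp4.py L38 first; if failures persist, inspect the individual e2e logs

Related changes: quantization-related changes introduced by the cherry-pick
Link: View logs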

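xpu_8cards_case_test / run_xpu_8cards_cases - Service startup failure (confidence: High)

xpu_8cards_case_test / run_xpu_8cards_cases

Status: ❌ Failed
Error type: Service startup failure
Confidence: High
Root cause summary: noauxtc is not compiled for XPU + mxfp4.py L38 TypeError; all 4 PD disaggregation tests fail
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_pd_21b_ep4tp1.py::test_pd_separation`	PD disaggregation service failed to start	mxfp4.py L38 TypeError
`test_pd_21b_ep4tp4.py::test_pd_separation`	PD disaggregation service failed to start	mxfp4.py L38 TypeError
`test_pd_21b_ep4tp4_cudagraph.py::test_pd_separation`	PD disaggregation service failed to start	mxfp4.py L38 TypeError
`test_pd_p_tp4ep4_d_tp1ep4.py::test_pd_separation`	PD disaggregation service failed to start	mxfp4.py L38 TypeError

Root cause details:
This job uses the PR source (/workspace/FastDeploy/). There are two direct errors: (1) WARNING: import noaux_tc Failed! (moe.py L45): the noauxtc kernel fusion added by this PR's cherry-pick is not compiled in the XPU environment; (2) mxfp4.py L38 paddle.compat.enable_torch_proxy(scope={"flashinfer"}) raises TypeError (the API is None in the XPU paddle version). Together they crash the worker process, and all 4 PD disaggregation tests report "PD分离服务启动失败" (PD disaggregation service failed to start).

Key logs: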

WARNING  moe.py[line:45] import noaux_tc Failed!
File ".../fastdeploy/model_executor/layers/quantization/mxfp4.py", line 38
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
FAILED tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py::test_pd_separation - Failed: PD分离服务启动失败
======================== 4 failed in 2653.10s (0:44:13) ========================

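Suggested fixes:

fastdeploy/model_executor/layers/quantization/mxfp4.py L38: add a callable guard for compatibility with the XPU paddle version
fastdeploy/model_executor/layers/moe.py L45: the noauxtc kernel needs an XPU platform guard so that the import is skipped on platforms where it is not compiled
Confirm whether the noauxtc CUDA kernel applies only to NVIDIA GPUs; the XPU path should perform a platform check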
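
A platform-guarded optional import of the kind suggested above might look like the sketch below. The helper name `load_optional_op`, its `platform_ok` parameter, and the placeholder module name are all hypothetical; the real import in moe.py is not shown in this report. Standard-library modules are used so the example runs anywhere.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def load_optional_op(module_name, attr_name, platform_ok=True):
    """Return the named attribute, or None when the op is unavailable.

    platform_ok lets the caller skip the import entirely on platforms
    where the op is known not to be compiled.
    """
    if not platform_ok:
        return None
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        logger.warning("import %s failed; using the fallback path", module_name)
        return None
    return getattr(module, attr_name, None)


# `math.sqrt` stands in for an op that is available on this platform.
sqrt = load_optional_op("math", "sqrt")
# A module name that does not exist stands in for an op that was not compiled.
missing = load_optional_op("fd_hypothetical_uncompiled_ops", "noaux_tc")
# Platform check fails, so the import is not even attempted.
skipped = load_optional_op("math", "sqrt", platform_ok=False)
```

Callers then branch on `is None` rather than assuming the op exists, which keeps the failure local to the feature that needs the op.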

Fix summary: add a callable guard at mxfp4.py L38 + add an XPU compatibility condition for noauxtc

Related changes: the noauxtc kernel fusion and the mxfp4.py change introduced by cherry-pick #7777
Link: View logs

PaddlePaddle-bot · 2026-05-19T00:38:21Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-19 08:33:18

CI report generated from the following code (updated every 30 minutes):

PR commit: 0725fb4
Merge base: 88a7479 (branch: release/online/20260415)
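View full diff
CI details

1 Task overview

4 Required tasks failed and 2 Required tasks were cancelled (not run); the current Required pass rate is 2/8. Rerun the failed tasks first to confirm whether this is an environment issue before deciding whether to merge.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
30(0)	30	23	5	0	0	0

2 Task status summary

2.1 Required tasks: 2/8 passed

Required tasks block merging; failures must be handled first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Run Base Tests / base_tests`	13m34s	Environment issue: TypeError on mxfp4.py module import	Environment issue, please rerun	Job	-
❌	`Run Four Cards Tests / run_4_cards_tests`	26m2s	Environment issue: 3 service tests failed, suspected mxfp4.py TypeError	Environment issue, please rerun	Job	-
❌	`xpu_8cards_case_test / run_xpu_8cards_cases`	47m6s	PR + environment: noauxtc not compiled for XPU + mxfp4 incompatible with XPU	Fix mxfp4.py XPU compatibility	Job	-
❌	`Extracted partial CE model tasks / run_ce_cases`	8m3s	Environment issue: CE service mxfp4.py import TypeError	Environment issue, please rerun	Job	-
🚫	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	-	Cancelled (not run)	-	-	-
🚫	`xpu_4cards_case_test / run_xpu_4cards_cases`	-	Cancelled (not run)	-	-	-
✅	The other 2 required tasks passed	-	-	-	-	-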

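2.2 Optional tasks: 21/22 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	19m2s	Job	-
✅	The other 21 optional tasks passed	-	-	-

3 Failure details (required only)

Run Base Tests / base_tests - Environment issue (confidence: High)

Run Base Tests / base_tests

Status: ❌ Failed
Error type: Infrastructure
Confidence: High
Root cause summary: mxfp4.py:38 paddle.compat.enable_torch_proxy is None; the service exits with code 8
Analyzer: ci_analyze_unittest_fastdeploy

Root cause details:
When worker_process.py initializes the quantization config, get_quantization_config triggers from .mxfp4 import MXFP4Config. mxfp4.py:38 then executes paddle.compat.enable_torch_proxy(scope={"flashinfer"}), but the function is None, which raises TypeError and makes service startup fail (exit code 8). mxfp4.py is not among the files changed by this PR; this is a Paddle version compatibility issue in the CI environment.

Key logs: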

File ".../quantization/mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
ERROR: Failed to launch worker processes
+ exit 8

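Suggested fixes:

Please rerun; if the failure persists, check whether the Paddle version in the CI environment supports paddle.compat.enable_torch_proxy
Consider adding an if callable(getattr(paddle.compat, 'enable_torch_proxy', None)) check before mxfp4.py:38

Fix summary: environment issue, please rerun; optionally add a Paddle compatibility check to mxfp4.py

Related changes: mxfp4.py is not in this PR's changes; not introduced by the PR
Link: View logs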

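Run Four Cards Tests / run_4_cards_tests - Environment issue (confidence: Medium)

Run Four Cards Tests / run_4_cards_tests

Status: ❌ Failed
Error type: Infrastructure
Confidence: Medium
Root cause summary: 3 service-startup test files failed; presumably the same mxfp4.py environment issue
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_determinism_long.py`	Service startup failure (exit 1)	Suspected mxfp4.py TypeError
`test_ernie_21b_tp1_dp4.py`	Service startup failure (exit 1)	Suspected mxfp4.py TypeError
`test_ernie_21b_tp1_dp4_mtp.py`	Service startup failure (exit 1)	Suspected mxfp4.py TypeError

Root cause details:
step_log does not contain detailed error stacks for the 3 failed tests. However, base_tests and run_ce_cases in the same workflow both hit the mxfp4.py TypeError on the same runner, so these tests presumably share that root cause. test_vocab_parallel_embedding_deterministic (which does not start a service) passed, which supports the view that the problem is systemic to the service startup stage.

Key logs: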

============================== 1 passed in 9.57s ===============================
3 test file(s) failed:
  .../test_determinism_long.py
  .../test_ernie_21b_tp1_dp4.py
  .../test_ernie_21b_tp1_dp4_mtp.py
##[error]Process completed with exit code 1.

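Suggested fixes:

Please rerun; if the failure persists, inspect the full logs inside the container to confirm whether it is the mxfp4.py TypeError

Fix summary: suspected environment issue; please rerun and confirm the full error stack

Related changes: fastdeploy/model_executor/layers/moe/ (the PR modifies the MoE layer, which may affect the MTP tests)
Link: View logs

Extracted partial CE model tasks to run in CI. / run_ce_cases - Environment issue (confidence: High)

Extracted partial CE model tasks to run in CI. / run_ce_cases

Status: ❌ Failed
Error type: Infrastructure
Confidence: High
Root cause summary: mxfp4.py:38 TypeError; the CE test test_EB_Lite_serving.py fails to start the service
Analyzer: ci_analyze_unittest_fastdeploy

Root cause details:
test_EB_Lite_serving.py starts the FastDeploy service (wint4 quantization). When worker_process.py initializes the quantization config, it triggers the mxfp4 module import; mxfp4.py:38 executes paddle.compat.enable_torch_proxy(scope={"flashinfer"}), but the function is None, and the service exits with exit_code=1. This is exactly the same root cause as base_tests; mxfp4.py is not among the files changed by this PR.

Key logs: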

File ".../site-packages/fastdeploy/.../mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
[ERROR] test_EB_Lite_serving.py 起服务或执行异常，exit_code=1

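Suggested fixes:

Please rerun; if the failure persists, add a callable check at mxfp4.py:38
Check Paddle version compatibility in the CI environment

Fix summary: environment issue, please rerun; mxfp4.py needs a Paddle compatibility fix

Related changes: mxfp4.py is not in this PR's changes; not introduced by the PR
Link: View logs

xpu_8cards_case_test / run_xpu_8cards_cases - PR issue + environment issue (confidence: High)

xpu_8cards_case_test / run_xpu_8cards_cases

Status: ❌ Failed
Error type: Test failure
Confidence: High
Root cause summary: mxfp4.py TypeError on XPU + the noauxtc op introduced by the PR is not compiled for XPU
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_pd_21b_ep4tp1.py::test_pd_separation`	PD disaggregation service failed to start	mxfp4.py XPU compatibility
`test_pd_21b_ep4tp4.py::test_pd_separation`	PD disaggregation service failed to start	mxfp4.py XPU compatibility
`test_pd_21b_ep4tp4_cudagraph.py::test_pd_separation`	PD disaggregation service failed to start	mxfp4.py XPU compatibility
`test_pd_p_tp4ep4_d_tp1ep4.py::test_pd_separation`	PD disaggregation service failed to start	mxfp4.py XPU compatibility

Root cause details:
There are two issues: (1) PR-related: this PR adds import noaux_tc at moe.py:45 (custom_ops/gpu_ops/ is compiled for GPU only). On XPU the import fails and prints a WARNING, but this is not the direct cause of the crash. (2) Main crash: in the XPU environment mxfp4.py:38 calls paddle.compat.enable_torch_proxy(scope={"flashinfer"}), but the function is None (XPU has no flashinfer). This triggers TypeError and prevents all 4 PD disaggregation service tests from starting.

Key logs: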

WARNING  moe.py[line:45] import noaux_tc Failed!
...
File "/workspace/FastDeploy/fastdeploy/.../mxfp4.py", line 38, in <module>
    paddle.compat.enable_torch_proxy(scope={"flashinfer"})
TypeError: 'NoneType' object is not callable
FAILED tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py::test_pd_separation
4 failed in 2653.10s (0:44:13)

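Suggested fixes:

Fix mxfp4.py:38: add a platform check and skip the flashinfer proxy setup on non-CUDA platforms (if paddle.is_compiled_with_cuda(): ...)
Confirm whether the noauxtc custom op needs XPU build support, or ensure that the XPU fallback path is functionally complete (moe.py already falls back with a WARNING)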

Fix summary: fix mxfp4.py XPU compatibility; confirm that the noauxtc XPU fallback is complete

Related changes:

fastdeploy/model_executor/layers/moe/moe.py (the PR adds the noauxtc import)
custom_ops/gpu_ops/ (the PR adds the noauxtc op, GPU only)

Link: View logs

ShaneGZhu added 2 commits May 15, 2026 10:09

[Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc (PaddlePad…

1ff1715

…dle#7777) [Cherry-Pick]

Kernel fusion for blackwell and deepgemm backend in non-EPLB scenarios

d43e7d0

ShaneGZhu had a problem deploying to Metax_ci May 15, 2026 08:43 — with GitHub Actions Failure

fix hard code in ep.py

0725fb4

ShaneGZhu had a problem deploying to Metax_ci May 18, 2026 04:48 — with GitHub Actions Failure

PaddlePaddle-bot reviewed May 18, 2026

freeliuzc approved these changes May 19, 2026

freeliuzc merged commit 18665da into PaddlePaddle:release/online/20260415 May 19, 2026
31 of 41 checks passed

PaddlePaddle-bot commented May 15, 2026 (edited)

1 Task overview

2 Task status summary

2.1 Required tasks: 2/8 passed

2.2 Optional tasks: 25/27 passed

3 Failure details (Required only)

Run Base Tests / base_tests

Run Four Cards Tests / run_4_cards_tests

xpu_8cards_case_test / run_xpu_8cards_cases

Extracted partial CE model tasks / run_ce_cases

PaddlePaddle-bot commented May 15, 2026

1 Task overview

2 Task status summary

2.1 Required tasks: 2/8 passed

2.2 Optional tasks: 20/22 passed

3 Failure details (required only)

PaddlePaddle-bot commented May 16, 2026

1 Task overview

2 Task status summary

2.1 Required tasks: 0/2 passed

2.2 Optional tasks: 9/10 passed

3 Failure details (required only)

xpu_8cards_case_test / run_xpu_8cards_cases

PaddlePaddle-bot commented May 16, 2026

1 Task overview

2 Task status summary

2.1 Required tasks: 2/8 passed

2.2 Optional tasks: 21/22 passed

3 Failure details (required only)

Run Base Tests / base_tests

Run Four Cards Tests / run_4_cards_tests

xpu_8cards_case_test / run_xpu_8cards_cases

Extracted partial CE model tasks to run in CI. / run_ce_cases

PaddlePaddle-bot left a comment

📋 Review summary

Issues

📝 PR convention check

Overall assessment

PaddlePaddle-bot May 18, 2026

PaddlePaddle-bot commented May 18, 2026

1 Task overview

2 Task status summary

2.1 Required tasks: 2/8 passed

2.2 Optional tasks: 21/23 passed

3 Failure details (required only)

Run Four Cards Tests / run_4_cards_tests

Run Base Tests / base_tests

Extracted partial CE model tasks to run in CI. / run_ce_cases

PaddlePaddle-bot commented May 18, 2026

1 Task overview

2 Task status summary

2.1 Required tasks: 2/8 passed

2.2 Optional tasks: 22/24 passed

3 Failure details (required only)

Run Base Tests / base_tests

run_ce_cases

run_4_cards_tests
