[test]not merge by Liujie0926 · Pull Request #4487 · PaddlePaddle/PaddleFormers

Liujie0926 · 2026-05-19T06:22:01Z

Before submitting

Lint code. If there are lint issues, please format the code first.

# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --files XXXX.py

Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

PR changes

Description

paddle-bot · 2026-05-19T06:22:11Z

Thanks for your contribution!

Paddle-CI-Bot · 2026-05-19T12:26:04Z

🤖 PaddleFormers Bot Log Analysis

Run #26096898223 · Attempt 1

Log analysis report

Pipeline	Issue label	Suggested fix
Unittest GPU CI	Unit tests have a bug	The following unit tests fail: `tests/peft/test_lora.py::TestLoraModel::test_fuse_moe_lora` and `tests/transformers/qwen3_vl_moe/test_modeling.py::Qwen3VLMoeIntegrationTest` (4 cases). All of them fail because the `flex-ckpt.auto_generated.metadata` file cannot be found. Weight files in flex checkpoint format need to be added for the `tiny-random-qwen3vlmoev2` model on the CI machine.
CI_ILUVATAR	Environment issue (container failed)	`test_ernie_21b_sft_training` fails. The root cause is that Paddle in the Iluvatar environment is not compiled with CUDA, so calling `paddle.device.cuda.synchronize()` raises `RuntimeError: Paddle is not compiled with CUDA`. Either replace all `paddle.device.cuda.synchronize()` calls with `paddle.device.synchronize()`, or check the Paddle build configuration of the Iluvatar image.

Failed test cases:

# Unittest GPU CI (job/76679557139)
tests/peft/test_lora.py::TestLoraModel::test_fuse_moe_lora
tests/transformers/qwen3_vl_moe/test_modeling.py::Qwen3VLMoeIntegrationTest::test_model_tiny_logits
tests/transformers/qwen3_vl_moe/test_modeling.py::Qwen3VLMoeIntegrationTest::test_model_tiny_logits_batch
tests/transformers/qwen3_vl_moe/test_modeling.py::Qwen3VLMoeIntegrationTest::test_model_tiny_logits_batch_wo_image
tests/transformers/qwen3_vl_moe/test_modeling.py::Qwen3VLMoeIntegrationTest::test_model_tiny_logits_with_video

# CI_ILUVATAR (job/76679545766)
scripts/iluvatar_ci/test_ernie_21b_sft.py::test_ernie_21b_sft_training

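Root cause analysis:

Unittest GPU CI (5 cases): all failing cases share one root cause. When from_pretrained is called with load_checkpoint_format="flex_checkpoint", it calls paddle.load(.../flex-ckpt.auto_generated.metadata) under the hood, but that metadata file is missing from /home/models/PaddleFormers/tiny-random-qwen3vlmoev2/ on the CI machine. The model directory contains only model.safetensors and no pre-generated flex checkpoint artifacts. The failure comes from a mismatch between the load_checkpoint_format="flex_checkpoint" usage newly introduced by this PR (fix_ops, commit 14f2daf) and the existing CI data.
CI_ILUVATAR (1 case): the Iluvatar machine uses a custom Paddle build (Paddle-iluvatar) that is not compiled with CUDA support. In the pipeline-parallel path of the training flow, timer_helper.py calls the deprecated paddle.device.cuda.synchronize(), which crashes immediately in the Iluvatar environment.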
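The first root cause comes down to which weight files are present in the model directory. A minimal sketch of that check is shown below. It uses only the Python standard library, and the helper name `detect_checkpoint_formats` is hypothetical and not part of PaddleFormers. The metadata file name is the one quoted in the CI log, and the `*.safetensors` pattern is an assumption based on the `model.safetensors` file mentioned above.

```python
from pathlib import Path

# File name taken from the CI log; everything else here is illustrative.
FLEX_METADATA_NAME = "flex-ckpt.auto_generated.metadata"


def detect_checkpoint_formats(model_dir):
    """Return the set of weight formats that appear to exist in model_dir.

    Hypothetical helper: it only inspects file names and does not load
    or validate any weights.
    """
    root = Path(model_dir)
    formats = set()
    if (root / FLEX_METADATA_NAME).is_file():
        formats.add("flex_checkpoint")
    if any(root.glob("*.safetensors")):
        formats.add("safetensors")
    return formats
```

Run against a directory that holds only `model.safetensors`, a check like this would report `safetensors` alone, which matches the situation the report describes.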

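Suggested fixes:

Unittest GPU CI: add flex checkpoint files to the tiny-random-qwen3vlmoev2 model directory on the CI machine (run dist.save_state_dict once to generate flex-ckpt.auto_generated.metadata and the related shards), or change load_checkpoint_format="flex_checkpoint" in the test setUp to the default safetensors loading so that it matches the weight format already present on CI.
CI_ILUVATAR: replace paddle.device.cuda.synchronize() with paddle.device.synchronize() (the recommended API in Paddle 2.5+) so that the training flow runs correctly on Iluvatar and other non-CUDA backends.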
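The CI_ILUVATAR suggestion can be sketched as a small fallback wrapper. This is an illustration only, not the code in timer_helper.py: the helper name `backend_agnostic_synchronize` is hypothetical, and it takes the device namespace as an argument (for example `paddle.device`) so the sketch itself does not import Paddle. It assumes, as the report states, that the namespace exposes a generic `synchronize()` on recent Paddle versions.

```python
def backend_agnostic_synchronize(device_ns):
    """Synchronize through the most general API that device_ns offers.

    device_ns is expected to look like `paddle.device`: it may expose
    `synchronize()` directly and/or a CUDA-only `cuda.synchronize()`.
    Returns the name of the call that was used, for logging.
    """
    generic = getattr(device_ns, "synchronize", None)
    if callable(generic):
        # Preferred path: works for any backend the build supports.
        generic()
        return "device.synchronize"

    cuda_ns = getattr(device_ns, "cuda", None)
    legacy = getattr(cuda_ns, "synchronize", None)
    if callable(legacy):
        # Fallback for builds that only ship the CUDA-specific call.
        legacy()
        return "device.cuda.synchronize"

    raise RuntimeError("no synchronize() found on the given device namespace")


# Intended usage (requires Paddle to be installed):
#   import paddle
#   backend_agnostic_synchronize(paddle.device)
```

Preferring the generic call keeps the CUDA-specific one off the code path on builds such as Paddle-iluvatar, where the report shows it raising `RuntimeError`.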

🔄 Updated automatically after each re-run

Liujie0926 and others added 4 commits May 18, 2026 15:44

fix

28439e7

fix

8202abf

Merge branch 'develop' into fix_ops

6f60f27

test

14f2daf

Liujie0926 closed this May 19, 2026

Liujie0926 deleted the fix_ops branch May 19, 2026 08:43

Liujie0926 wants to merge 4 commits into PaddlePaddle:develop from Liujie0926:fix_ops
