
[test]not merge #4487

Closed
Liujie0926 wants to merge 4 commits into PaddlePaddle:develop from Liujie0926:fix_ops

Conversation

@Liujie0926
Collaborator

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --files XXXX.py
  • Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

PR changes

Description

@paddle-bot

paddle-bot Bot commented May 19, 2026

Thanks for your contribution!

@Liujie0926 Liujie0926 closed this May 19, 2026
@Liujie0926 Liujie0926 deleted the fix_ops branch May 19, 2026 08:43
@Paddle-CI-Bot

🤖 PaddleFormers Bot Log Analysis

Run #26096898223 · Attempt 1

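Log analysis report

| Pipeline | Issue label | Suggested fix |
| --- | --- | --- |
| Unittest GPU CI | Bug in unit tests | The following unit tests have bugs: tests/peft/test_lora.py::TestLoraModel::test_fuse_moe_lora and tests/transformers/qwen3_vl_moe/test_modeling.py::Qwen3VLMoeIntegrationTest (4 cases). All of them fail because the file flex-ckpt.auto_generated.metadata cannot be found. Flex checkpoint format weight files need to be added for the tiny-random-qwen3vlmoev2 model on the CI machine. |
| CI_ILUVATAR | Environment issue (container failed) | test_ernie_21b_sft_training failed. The root cause is that Paddle in the Iluvatar environment is not compiled with CUDA, so calling paddle.device.cuda.synchronize() raises RuntimeError: Paddle is not compiled with CUDA. Replace all paddle.device.cuda.synchronize() calls with paddle.device.synchronize(), or check the Paddle build configuration of the Iluvatar image. |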

Failed test cases:

# Unittest GPU CI (job/76679557139)
tests/peft/test_lora.py::TestLoraModel::test_fuse_moe_lora
tests/transformers/qwen3_vl_moe/test_modeling.py::Qwen3VLMoeIntegrationTest::test_model_tiny_logits
tests/transformers/qwen3_vl_moe/test_modeling.py::Qwen3VLMoeIntegrationTest::test_model_tiny_logits_batch
tests/transformers/qwen3_vl_moe/test_modeling.py::Qwen3VLMoeIntegrationTest::test_model_tiny_logits_batch_wo_image
tests/transformers/qwen3_vl_moe/test_modeling.py::Qwen3VLMoeIntegrationTest::test_model_tiny_logits_with_video

# CI_ILUVATAR (job/76679545766)
scripts/iluvatar_ci/test_ernie_21b_sft.py::test_ernie_21b_sft_training

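Root cause analysis:

  • Unittest GPU CI (5 cases): all failing cases share a single root cause. When from_pretrained is called with load_checkpoint_format="flex_checkpoint", it calls paddle.load(.../flex-ckpt.auto_generated.metadata) under the hood, but the directory /home/models/PaddleFormers/tiny-random-qwen3vlmoev2/ on the CI machine does not contain that metadata file. The model directory holds only model.safetensors and has no pre-generated flex checkpoint artifacts. The failure is caused by the load_checkpoint_format="flex_checkpoint" usage newly introduced in this PR (fix_ops, commit 14f2daf), which does not match the existing CI data.

  • CI_ILUVATAR (1 case): the Iluvatar machine uses a custom Paddle build (Paddle-iluvatar) that is not compiled with CUDA support, but timer_helper.py, on the pipeline-parallel path of the training flow, calls the deprecated paddle.device.cuda.synchronize(), which crashes immediately in the Iluvatar environment.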
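
The first root cause comes down to a file that is expected on disk but absent. Below is a minimal sketch of a guard a test could use before requesting the flex checkpoint format. It assumes only the metadata file name quoted in the log; `pick_checkpoint_format` is a hypothetical helper and is not part of PaddleFormers.

```python
from pathlib import Path
from typing import Optional

# File name taken from the CI log above.
FLEX_METADATA_NAME = "flex-ckpt.auto_generated.metadata"


def pick_checkpoint_format(model_dir: str) -> Optional[str]:
    """Return "flex_checkpoint" only if the flex metadata file is present.

    Hypothetical helper: a return value of None means the directory has no
    flex checkpoint metadata, so the caller should leave
    load_checkpoint_format unset and rely on the default loader.
    """
    if (Path(model_dir) / FLEX_METADATA_NAME).is_file():
        return "flex_checkpoint"
    return None
```

A test could then pass load_checkpoint_format only when the helper returns a value, so a directory that holds just model.safetensors still loads through the default path.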


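Suggested fixes:

  1. Unittest GPU CI: add flex checkpoint files to the tiny-random-qwen3vlmoev2 model directory on the CI machine (run dist.save_state_dict once to generate flex-ckpt.auto_generated.metadata and the related shards), or change load_checkpoint_format="flex_checkpoint" in the test setUp to the default safetensors loading so that it matches the weight format already present on CI.

  2. CI_ILUVATAR: replace paddle.device.cuda.synchronize() with paddle.device.synchronize() (the recommended API in Paddle 2.5+) so that the training flow runs correctly on non-CUDA backends such as Iluvatar.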
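
The second suggestion can be illustrated with a small wrapper that prefers the backend-agnostic call and falls back to the CUDA-specific one only when it has to. This is a sketch under assumptions: `synchronize_any_backend` is a hypothetical name, the argument is expected to look like `paddle.device`, and the demo uses a stand-in object rather than a real Paddle install.

```python
from types import SimpleNamespace


def synchronize_any_backend(device_module) -> str:
    """Synchronize through the backend-agnostic entry point when available.

    `device_module` is expected to look like `paddle.device`. Returns the name
    of the entry point that was called, which is handy for logging.
    """
    sync = getattr(device_module, "synchronize", None)
    if callable(sync):
        sync()
        return "device.synchronize"
    # Fallback for builds that only expose the CUDA-specific call.
    cuda_sync = getattr(getattr(device_module, "cuda", None), "synchronize", None)
    if callable(cuda_sync):
        cuda_sync()
        return "device.cuda.synchronize"
    raise RuntimeError("no synchronize() entry point found on device module")


def _cuda_sync_unavailable():
    # Mimics the error message reported in the CI log for a non-CUDA build.
    raise RuntimeError("Paddle is not compiled with CUDA")


# Stand-in for a non-CUDA build: the CUDA-specific call would raise.
fake_device = SimpleNamespace(
    synchronize=lambda: None,
    cuda=SimpleNamespace(synchronize=_cuda_sync_unavailable),
)
print(synchronize_any_backend(fake_device))  # prints "device.synchronize"
```

With a real install the call would presumably be `synchronize_any_backend(paddle.device)`. A direct replacement of the call in timer_helper.py, as the report suggests, is simpler if older Paddle versions do not need to be supported.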


🔄 Automatically updated after each re-run
