add dsa index aoa and log #4490

Merged
xingmingyyj merged 4 commits into PaddlePaddle:develop from xingmingyyj:fix_glm5
May 20, 2026

Conversation

@xingmingyyj
Collaborator

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
  • Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

Others

PR changes

Others

Description

add dsa index aoa and log

@paddle-bot

paddle-bot Bot commented May 19, 2026

Thanks for your contribution!

@codecov-commenter

codecov-commenter commented May 19, 2026

Codecov Report

❌ Patch coverage is 26.08696% with 17 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@79332b4). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| paddleformers/trainer/trainer.py | 23.07% | 10 Missing ⚠️ |
| paddleformers/transformers/aoa_config_base.py | 22.22% | 7 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (26.08%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.
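
The percentages in this report can be checked with a few lines of arithmetic. The sketch below is only an illustration: the per-file line totals (13 and 9) and the patch total (23) are not stated in the report and are inferred here from the missing-line counts and percentages, and the truncation to two decimals is an assumption that happens to match the reported figures. `coverage_percent` and `TARGET` are hypothetical names.

```python
from decimal import Decimal, ROUND_DOWN

TARGET = Decimal("75.00")  # target patch coverage quoted in the report


def coverage_percent(hits: int, total: int) -> Decimal:
    """Return hits/total as a percentage, truncated (not rounded) to 2 decimals."""
    pct = Decimal(hits) * 100 / Decimal(total)
    return pct.quantize(Decimal("0.01"), rounding=ROUND_DOWN)


# Patch: 17 missing lines at 26.08696% implies 23 changed lines, 6 of them hit
# (inferred, not stated in the report).
patch = coverage_percent(23 - 17, 23)

# Per-file totals inferred the same way: 13 lines with 10 missing, 9 lines with 7 missing.
trainer = coverage_percent(13 - 10, 13)
aoa_config = coverage_percent(9 - 7, 9)

# Project totals are taken directly from the coverage diff: 42098 hits of 90639 lines.
project = coverage_percent(42098, 90639)

print(patch, trainer, aoa_config, project)
print("patch status passes:", patch >= TARGET)
```

Under these assumptions the two listed files account for 22 of the 23 patch lines, which would be consistent with one further changed line living in a fully covered file that the "missing lines" table omits.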

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #4490   +/-   ##
==========================================
  Coverage           ?   46.44%           
==========================================
  Files              ?      475           
  Lines              ?    90639           
  Branches           ?        0           
==========================================
  Hits               ?    42098           
  Misses             ?    48541           
  Partials           ?        0           

@Paddle-CI-Bot

Paddle-CI-Bot commented May 19, 2026

PaddleFormers Log Analysis

Run #26150815783 · Attempt 1

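Log Analysis Report

| Pipeline | Issue label | Suggested fix |
|---|---|---|
| CI_ILUVATAR | Hardware compatibility issue: paddle.device.cuda.synchronize is unavailable in the Iluvatar environment | The Pipeline Parallel timer triggered in training_pipeline_step calls paddle.device.cuda.synchronize(). Iluvatar is a non-CUDA device, so replace the call with paddle.device.synchronize() or skip it based on a device-type check |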

Failed test case:

scripts/iluvatar_ci/test_ernie_21b_sft.py::test_ernie_21b_sft_training

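Root cause analysis:

This PR (branch fix_glm5) adds 18 lines of code to paddleformers/trainer/trainer.py. The newly added training_pipeline_step (trainer.py:3891) calls model.forward_backward_pipeline(), which triggers the P2P communication timer of Fleet's internal Pipeline Parallel implementation (_timers("recv_forward").start()).

Internally, that timer calls the deprecated paddle.device.cuda.synchronize() at timer_helper.py:62. Iluvatar machines use a custom device backend (libpaddle-iluvatar-gpu.so), and Paddle is compiled without CUDA support there, so the call raises immediately: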

RuntimeError: (Unavailable) Paddle is not compiled with CUDA.
Cannot visit device synchronize.

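Suggested fix:

In training_pipeline_step in trainer.py, before calling model.forward_backward_pipeline(), either switch the Fleet timer's device synchronization to a device-agnostic interface, or check the current device type with paddle.device.get_device() before entering Pipeline Parallel and disable the timer on non-CUDA devices (set --use_timer False or an equivalent option):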

# Option: use the device-agnostic interface
paddle.device.synchronize()  # instead of paddle.device.cuda.synchronize()

# Or disable the Fleet timer in the ERNIE-21B-SFT.yaml config (if supported)
# use_fleet_executor_timer: false
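
The device-type check suggested above could look roughly like the following. This is a minimal sketch, not the fix that was merged: `device_type` and `pick_sync` are hypothetical helpers, the synchronize functions are passed in as callables so the example runs without Paddle installed, and the `"iluvatar_gpu:0"` device string is an assumption about how the Iluvatar custom backend reports itself.

```python
from typing import Callable


def device_type(device: str) -> str:
    """Extract the backend name from a Paddle-style device string.

    For example "gpu:0" -> "gpu" and "cpu" -> "cpu".
    """
    return device.split(":", 1)[0].strip().lower()


def pick_sync(
    device: str,
    cuda_sync: Callable[[], None],
    generic_sync: Callable[[], None],
) -> Callable[[], None]:
    """Choose the CUDA-specific sync only on real CUDA devices ("gpu").

    Any other backend (cpu, xpu, custom devices) gets the device-agnostic sync.
    """
    return cuda_sync if device_type(device) == "gpu" else generic_sync


# Stand-in callables that record which path was taken.
calls = []
sync = pick_sync(
    "iluvatar_gpu:0",  # assumed device string for the Iluvatar backend
    cuda_sync=lambda: calls.append("cuda"),
    generic_sync=lambda: calls.append("generic"),
)
sync()
print(calls)  # the generic path is taken for the non-CUDA device
```

With Paddle available, the arguments would presumably be `paddle.device.get_device()`, `paddle.device.cuda.synchronize`, and `paddle.device.synchronize`. Note that the failing call sits inside Fleet's timer_helper.py rather than in trainer.py, so this only illustrates the guard logic, not where it would have to be applied.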

🔄 Updated automatically after each re-run

@xingmingyyj
Collaborator Author

/re-run all-failed

@xingmingyyj xingmingyyj merged commit 95c3c8a into PaddlePaddle:develop May 20, 2026
17 of 18 checks passed
