[CI]add allure for ci #4510

Open
Liujie0926 wants to merge 1 commit into PaddlePaddle:develop from Liujie0926:fix

@Liujie0926
Collaborator

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --files XXXX.py
  • Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

PR changes

Description

@paddle-bot

paddle-bot Bot commented May 22, 2026

Thanks for your contribution!

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@b94eb73). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #4510   +/-   ##
==========================================
  Coverage           ?   46.43%           
==========================================
  Files              ?      475           
  Lines              ?    90657           
  Branches           ?        0           
==========================================
  Hits               ?    42098           
  Misses             ?    48559           
  Partials           ?        0           
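
The figures in the table can be cross-checked by hand: Hits plus Misses (42098 + 48559) equals Lines (90657), and Hits divided by Lines gives the coverage percentage. The sketch below redoes that arithmetic with shell integer math. The function name `coverage_pct` is hypothetical, and the truncation to two decimals is an assumption chosen because it reproduces the reported 46.43% (rounding to nearest would give 46.44%).

```shell
#!/usr/bin/env bash
# coverage_pct HITS LINES: print HITS/LINES as a percentage,
# truncated (not rounded) to two decimal places.
coverage_pct() {
  local hits="$1" lines="$2"
  # Scale by 10000 so integer division keeps two decimal places.
  local scaled=$(( hits * 10000 / lines ))
  printf '%d.%02d\n' $(( scaled / 100 )) $(( scaled % 100 ))
}

# Values taken from the Codecov table above.
coverage_pct 42098 90657
```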

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Paddle-CI-Bot

PaddleFormers Log Analysis

Run #26280768867 · Attempt 1

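Log Analysis Report

| Pipeline name | Issue label | Suggested fix | Log snippet |
| --- | --- | --- | --- |
| Unittest GPU CI (upload-allure) | Report upload failed | peaceiris/actions-gh-pages@v3 has no deploy key / token; configure github_token or a valid deploy key secret in the workflow | Error code |
| Model Unittest GPU CI (upload-allure) | Report upload failed | Same as above: peaceiris/actions-gh-pages@v3 has no deploy key / token | Error code |
| CI_ILUVATAR | LossNan | ERNIE-21B SFT training produces Loss=nan at step 2, which triggers the _check_loss_valid error (PaddleRecall error(102): LossNan); investigate data type promotion or the learning rate configuration | Error code |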

Failed test case:

scripts/iluvatar_ci/test_ernie_21b_sft.py::test_ernie_21b_sft_training

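Root cause analysis:

This PR (#4510, title: [CI]add allure for ci) modifies 5 workflow files to integrate the Allure reporting system, but:

  1. upload-allure (both pipelines): the newly added peaceiris/actions-gh-pages@v3 deployment step has no valid github_token or deploy key, so it cannot write to the gh-pages branch. In addition, the Download report step returns exit code 8 (wget download failure) because BOS does not yet hold a report package for this PR's commit, which in turn causes the subsequent Deploy to GitHub Pages step to fail.
  2. CI_ILUVATAR: not directly related to the changes in this PR; it is an independent numerical problem in training. After ERNIE-21B SFT training reaches global_step=1, the second step triggers automatic data type promotion (Got different data type, run type promotion automatically), after which the loss becomes nan and is caught and raised by _check_loss_valid.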

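Suggested fixes:

The two upload-allure pipelines

  1. In the repository's Settings → Secrets, confirm that the token/key needed to deploy to gh-pages exists (for example ACTIONS_DEPLOY_KEY, or use secrets.GITHUB_TOKEN directly and grant it contents: write permission, which pushing to the gh-pages branch requires).
  2. In the workflow's peaceiris/actions-gh-pages@v3 step, add github_token: ${{ secrets.GITHUB_TOKEN }} (or the corresponding key field), for example: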
    - uses: peaceiris/actions-gh-pages@v3
      with:
        github_token: ${{ secrets.GITHUB_TOKEN }}
        publish_dir: ./report
        destination_dir: unit_CI/${{ env.PR_ID }}
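  3. The BOS download failure (exit 8) in the Download report step indicates that the upstream CI job did not upload the report package successfully. Confirm that the BOS upload step in the unittest-gpu-ci / model-unittest-gpu-ci jobs runs correctly and completes before the upload-allure job.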
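
One way to keep a missing report package from cascading into a deploy failure is to make the download step report its outcome explicitly. The following is a minimal sketch, not the workflow's actual script: the function names `describe_wget_exit` and `download_report` are hypothetical, and the URL and output path are left as parameters because the real BOS location is not shown here. The exit status meanings follow the EXIT STATUS section of the wget manual.

```shell
#!/usr/bin/env bash
# Map a wget exit status to a short description.
describe_wget_exit() {
  case "$1" in
    0) echo "ok" ;;
    4) echo "network failure" ;;
    8) echo "server issued an error response" ;;
    *) echo "other wget error ($1)" ;;
  esac
}

# download_report URL OUT: fetch the report archive and return wget's status.
# On failure, print the reason to stderr and remove the partial file so a
# later step cannot mistake it for a valid report.
download_report() {
  local url="$1" out="$2" status=0
  wget -q -O "$out" "$url" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "Download report failed: $(describe_wget_exit "$status")" >&2
    rm -f "$out"
  fi
  return "$status"
}
```

In a workflow step, the return value could be written to `$GITHUB_OUTPUT` and the deploy step gated on it with an `if:` condition, so that a missing package skips deployment instead of failing it.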

CI_ILUVATAR

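  1. Check the learning_rate and fp16/bf16 settings in scripts/iluvatar_ci/config/ERNIE-21B-SFT.yaml. The global_norm=30.47 at training step 1 is already high; consider lowering the initial learning rate or enabling gradient clipping (max_grad_norm).
  2. Watch for the warning W dygraph_functions.cc:125086] Got different data type, run type promotion automatically, and check whether the input data types of the MoE Gate ops changed because of the PR or a dependency upgrade (in this CI environment tokenizers was upgraded to 0.22.2, which conflicts with the <0.22 dependency declared by paddleformers 0.3.1; fix the version constraint in requirements.txt, then rerun to verify).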
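
The version conflict mentioned above can be checked mechanically before a rerun. This is a rough sketch under stated assumptions: `version_lt` and `check_upper_bound` are hypothetical helper names, the comparison relies on `sort -V` from GNU coreutils, and it only approximates Python's PEP 440 ordering (plain dotted release numbers such as 0.22.2 work; pre-release tags are not handled).

```shell
#!/usr/bin/env bash
# version_lt A B: succeed when version A sorts strictly before version B.
# Assumes GNU `sort -V` (version sort) is available.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# check_upper_bound NAME INSTALLED BOUND: report whether INSTALLED
# satisfies the constraint "<BOUND"; return 1 when it does not.
check_upper_bound() {
  if version_lt "$2" "$3"; then
    echo "$1 $2 satisfies <$3"
  else
    echo "$1 $2 violates <$3"
    return 1
  fi
}
```

With the versions from the log, `check_upper_bound tokenizers 0.22.2 0.22` reports a violation. The installed version would normally come from something like `pip show tokenizers` rather than being typed in by hand.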

🔄 Updated automatically after every re-run


3 participants