
cleanup gists #4509

Merged
zjjlivein merged 2 commits into develop from add_cleanup_gits
May 22, 2026

Conversation

@zjjlivein
Collaborator

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --files XXXX.py
  • Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

PR changes

Description

@paddle-bot

paddle-bot Bot commented May 22, 2026

Thanks for your contribution!

@zjjlivein zjjlivein merged commit da1ffab into develop May 22, 2026
13 of 14 checks passed
@Paddle-CI-Bot

PaddleFormers Log Analysis

Run #26277714633 · Attempt 1

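Log analysis report

| Pipeline | Issue label | Suggested fix | Log snippet |
| --- | --- | --- | --- |
| CI_ILUVATAR (iluvatar_test) | LossNan | Check whether code introduced by this PR causes mixed data types during ERNIE-21B SFT training (Got different data type, run type promotion automatically) and triggers nan; compare the loss curve against develop, then consider a rerun or a rollback. | Error code |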

Failed test case:

scripts/iluvatar_ci/test_ernie_21b_sft.py::test_ernie_21b_sft_training

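Root cause analysis:

This PR (#4509, cleanup gists) changes only .github/workflows/cleanup_ci_gists.yml and touches no training code, model code, or configuration.

The failure chain is as follows:

  1. trainer.py:2248 _inner_training_loop_check_loss_valid(tr_loss) detects loss = nan and raises PaddleRecall error(102): LossNan.
  2. Before the nan appears, the log repeatedly shows W dygraph_functions.cc:125086] Got different data type, run type promotion automatically, which indicates that a multiply operation goes through implicit type promotion (for example, fp16/bf16 mixed with fp32). On Iluvatar hardware this triggers numerical overflow or underflow, and the loss eventually becomes nan.
  3. This behavior is unrelated to the changes in this PR; it is an existing, intermittent numerical stability issue on the Iluvatar CI machines.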
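
The loss check in step 1 of the failure chain can be pictured with a minimal sketch. The names LossNanError and check_loss_valid below are hypothetical stand-ins, not the PaddleFormers implementation; the sketch only assumes that a non-finite loss aborts the training step.

```python
import math


class LossNanError(RuntimeError):
    """Hypothetical error type standing in for the LossNan failure."""


def check_loss_valid(loss: float, step: int) -> float:
    """Return the loss unchanged, or raise if it is nan or inf."""
    if not math.isfinite(loss):
        raise LossNanError(f"step {step}: loss is {loss}")
    return loss


# A finite loss passes through; a nan loss aborts the run.
check_loss_valid(2.31, step=1)
try:
    check_loss_valid(float("nan"), step=2)
except LossNanError as err:
    print(err)
```

A check of this kind reports the symptom only; the step that produced the nan has to be found from earlier log lines, such as the type promotion warning.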

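Suggested fixes:

  1. Rerun directly: the PR does not modify any training code, so this is an intermittent nan in the CI environment; rerun first to check whether it reproduces.
  2. If the rerun still fails, check the fp16/bf16 mixed-precision settings in scripts/iluvatar_ci/config/ERNIE-21B-SFT.yaml and confirm that the initial loss_scale is reasonable (the current global_norm: 30.47 is on the high side; consider lowering the initial learning rate or increasing loss_scale).
  3. Trace the operator behind the multiply_fwd_func.cc trigger point, and consider an explicit cast to a common data type before that operation so that implicit type promotion does not cause overflow on Iluvatar cards.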
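
The reasoning behind suggestion 3 can be illustrated without any accelerator. The sketch below is not Paddle code: it simulates half-precision storage with the standard library's struct "e" format, and Python's own float (fp64) stands in for the wider type. The helper names to_fp16, multiply_fp16, and multiply_wide are hypothetical.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct rejects finite values beyond the fp16 range;
        # hardware would typically produce inf instead.
        return math.copysign(math.inf, x)


def multiply_fp16(a: float, b: float) -> float:
    """Multiply two values and store the product in fp16."""
    return to_fp16(to_fp16(a) * to_fp16(b))


def multiply_wide(a: float, b: float) -> float:
    """Multiply with both operands kept in the wider type."""
    return float(a) * float(b)


low = multiply_fp16(300.0, 300.0)    # 90000 exceeds the fp16 maximum of 65504
wide = multiply_wide(300.0, 300.0)   # stays finite in the wider type
print(low, wide)
print(math.isnan(low - low))         # inf - inf is nan
```

Once one intermediate value overflows to inf, a later subtraction or normalization can turn it into nan, which is how a single low-precision multiply may end up as a nan loss.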

🔄 Automatically updated after each re-run



2 participants