Revert "Remove timer training argument" #4505

Merged
huangjiyi merged 3 commits into develop from revert-4458-codex/default-enable-pipeline-timer
May 21, 2026

Conversation

@huangjiyi
Member

Reverts #4458

@paddle-bot

paddle-bot Bot commented May 21, 2026

Thanks for your contribution!

@Paddle-CI-Bot

Paddle-CI-Bot commented May 21, 2026

PaddleFormers Log Analysis

Run #26228391839 · Attempt 1

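Log analysis report

| Pipeline | Issue label | Fix suggestion |
| --- | --- | --- |
| CI_ILUVATAR | Unit test bug (LossNan) | test_ernie_21b_sft_training hits Loss=nan at training step 2; check whether the trainer.py changes in PR #4505 introduced numerical instability |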

Failed test case:

scripts/iluvatar_ci/test_ernie_21b_sft.py::test_ernie_21b_sft_training

Root cause analysis:

PR #4505 (Revert "Remove timer training argument") modifies the following files:

  • paddleformers/trainer/trainer.py
  • paddleformers/trainer/training_args.py
  • paddleformers/cli/train/ernie_pretrain/workflow.py

After global_step=1 completed, the forward pass of step 2 triggered:

W0521 20:23:02 dygraph_functions.cc:125086] Got different data type, run type promotion automatically

_check_loss_valid then detected Loss=nan and raised:

ValueError: PaddleRecall error(102): LossNan. Loss contains inf or nan values, its value is nan
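
To make the failure mode concrete, here is a minimal sketch of a loss validity check in plain Python. It is not the actual `_check_loss_valid` from trainer.py: the names `check_loss_valid`, `first_invalid_step`, and `LossNanError` are hypothetical and only mirror the behavior visible in the error message above.

```python
import math


class LossNanError(ValueError):
    """Hypothetical error type standing in for the LossNan failure in the CI log."""


def check_loss_valid(loss: float, step: int) -> float:
    """Return the loss unchanged if it is finite, otherwise raise LossNanError."""
    if not math.isfinite(loss):
        raise LossNanError(
            f"step {step}: loss contains inf or nan values, its value is {loss}"
        )
    return loss


def first_invalid_step(losses):
    """Return the 1-based step of the first non-finite loss, or None if all are finite."""
    for step, loss in enumerate(losses, start=1):
        if not math.isfinite(loss):
            return step
    return None
```

With the two values reported in this run, `first_invalid_step([11.34, float("nan")])` points at step 2, matching the log.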

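The root cause points to the trainer.py changes introducing a data type mismatch (the type promotion warning), which caused the step-2 computation to overflow to nan on Iluvatar hardware. Loss was normal at step=1 (11.34) and became nan at step=2, which is consistent with a regression introduced by the revert.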
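
As a rough illustration of how a half-precision overflow can turn into nan, the sketch below emulates float16 rounding with Python's `struct` module. It is an assumption-laden toy, not a reproduction of the Iluvatar failure: the helper `to_fp16` is hypothetical, and the value 70000.0 is chosen only because it exceeds the float16 maximum.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x: float) -> float:
    """Round x to the nearest float16 value, saturating to +/-inf on overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values; IEEE 754
        # round-to-nearest would yield a signed infinity instead.
        return math.copysign(math.inf, x)


def overflow_then_nan(value: float) -> float:
    """Show that subtracting two overflowed float16 values gives nan."""
    big = to_fp16(value)  # inf when value exceeds FP16_MAX after rounding
    return big - big      # inf - inf is nan
```

A loss near 11.34 survives the float16 round trip with only a small rounding error, while any intermediate value beyond `FP16_MAX` becomes inf and can poison later arithmetic.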

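Fix suggestions:

  1. Review the trainer.py changes in PR #4505 (Revert "Remove timer training argument"), focusing on whether any logic that affects mixed-precision (fp16/bf16) computation was reverted back in, causing a data type mismatch on Iluvatar hardware.
  2. In trainer.py, search the code around _check_loss_valid and confirm whether float32 and float16 are mixed anywhere in the loss computation path; add an explicit cast where necessary.
  3. If the revert is confirmed to have introduced the regression, temporarily disable fp16/bf16 in the ERNIE-21B-SFT.yaml config to verify whether loss returns to normal, to determine whether this is a precision issue.
  4. Alternatively, rerun once first to confirm whether the failure reproduces reliably; if it does, fix the numerical stability issue in trainer.py and resubmit.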
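
The dtype audit in suggestion 2 could be sketched as follows. This is a hypothetical helper, not part of PaddleFormers: it assumes you have already collected the dtype of each tensor feeding the loss as a plain string (the tensor names below are made up), and it only reports where dtypes differ so you know where an explicit cast might be needed.

```python
from collections import defaultdict


def find_mixed_dtypes(tensor_dtypes):
    """Group tensor names by dtype; return {} when only one dtype is present."""
    groups = defaultdict(list)
    for name, dtype in tensor_dtypes.items():
        groups[dtype].append(name)
    return dict(groups) if len(groups) > 1 else {}


# Example with made-up tensor names: a float16/float32 mix is reported.
report = find_mixed_dtypes({"logits": "float16", "loss_mask": "float32"})
```

An empty result means every tensor in the audited path shares one dtype, so the type promotion warning would have to originate elsewhere.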

🔄 Updated automatically after each re-run

@huangjiyi huangjiyi merged commit dccf241 into develop May 21, 2026
21 of 23 checks passed

3 participants