[NOT MERGE]debug CI/CE by Liujie0926 · Pull Request #4497 · PaddlePaddle/PaddleFormers

Liujie0926 · 2026-05-20T02:54:46Z

Before submitting

Lint code. If there are lint issues, please format the code first.

# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --files XXXX.py

Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

PR changes

Description

paddle-bot · 2026-05-20T02:54:53Z

Thanks for your contribution!

Paddle-CI-Bot · 2026-05-20T04:07:44Z

PaddleFormers Log Analysis

Run #26148613051 · Attempt 1

Log analysis report

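Pipeline	Job	Issue label	Suggested fix
CI_ILUVATAR	76902154399	Environment issue (container failed)	A script failed to run after container initialization; contact the self-hosted runner administrator to check the container configuration on node `iluvatar-gpu-2-nczzk-runner-tgszg`
Unittest GPU CI	76902317011	Bug in unit tests (ImportError)	`paddleformers/transformers/tokenizer_utils.py:48` fails to import `PreTrainedTokenizerFast` from `transformers.tokenization_utils_fast`; check the API compatibility of the installed `transformers` version
Fleet Model Test (H20 multi-card)	76902269958	Intermittent issue under high concurrency	Qwen pre-train hit a Blocking queue exception; this is intermittent, so a rerun is recommended. The sft/lora failures are cascade errors caused by the missing checkpoint
Fleet Model Test (H20 single-card)	76902270009	Intermittent issue under high concurrency	GLM4.5 single-card training hit a Blocking queue exception; this is intermittent, so a rerun is recommended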

Failed test cases:

# CI_ILUVATAR
iluvatar_test → step: Print current runner name
  Error: Executing the custom container implementation failed (exit code 1)

# Unittest GPU CI
tests/ai_edited_test/cli/test_ai_deepseek_v3_workflow.py::TestPreTrainingArguments::test_autotuner_benchmark_post_init
  Error: ImportError: cannot import name 'PreTrainedTokenizerFast'
         from 'transformers.tokenization_utils_fast'
  Call chain: tokenizer_utils.py:48 ← feature_extraction_utils.py ← image_processing_utils.py ← trainer.py

# Fleet Model Test (H20 multi-card)
- Qwen pre-train: SystemError: Blocking queue is killed (data reader exception)
- Qwen sft: FileNotFoundError: /workspace/checkpoints/qwen-pt (cascade)
- Qwen lora: HFValidationError: /workspace/checkpoints/qwen-sft (cascade)

# Fleet Model Test (H20 single-card)
- GLM4.5 single-card: SystemError: Blocking queue is killed (data reader exception)

Root cause analysis:

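CI_ILUVATAR: The container on the iluvatar-gpu-2 node malfunctioned and failed early, during the "Print current runner name" step (failed to run script step: [object Object]). All subsequent steps were skipped. This is a runner infrastructure problem and is unrelated to the code changes in this PR.
Unittest GPU CI: After torch-2.12.0 was installed, the PreTrainedTokenizerFast interface in the tokenization_utils_fast module of the transformers package that came with it is incompatible with the import path paddleformers expects, so the test cases could not be collected (exit code 4). The problem also reproduces on the base commit (the log shows the same failure after switching to the develop branch).
Fleet Model Test (Blocking queue): Both jobs hit SystemError: Blocking queue is killed because the data reader raises an exception. This error is an intermittent DataLoader problem under high concurrency and is not directly related to the code changes in this PR. The multi-card sft/lora failures are cascade errors caused by pre-train not producing a checkpoint.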
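
The root-versus-cascade distinction drawn above for the multi-card job can be illustrated with a minimal sketch. It is not the bot's actual logic: the `StageResult` type, the `split_root_and_cascade` helper, and the stage names are hypothetical, and the sketch assumes each stage consumes the checkpoint of at most one upstream stage.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StageResult:
    name: str                  # hypothetical stage label, e.g. "qwen-pretrain"
    passed: bool
    depends_on: Optional[str]  # upstream stage whose checkpoint this stage consumes


def split_root_and_cascade(results: List[StageResult]) -> Dict[str, List[str]]:
    """Separate failures that started a chain from failures caused upstream."""
    failed = {r.name for r in results if not r.passed}
    report: Dict[str, List[str]] = {"root": [], "cascade": []}
    for r in results:
        if r.passed:
            continue
        # A failure counts as a cascade when the stage it depends on also failed.
        kind = "cascade" if r.depends_on in failed else "root"
        report[kind].append(r.name)
    return report


if __name__ == "__main__":
    # Mirrors the reported chain: pre-train fails, so sft and lora find no checkpoint.
    run = [
        StageResult("qwen-pretrain", passed=False, depends_on=None),
        StageResult("qwen-sft", passed=False, depends_on="qwen-pretrain"),
        StageResult("qwen-lora", passed=False, depends_on="qwen-sft"),
    ]
    print(split_root_and_cascade(run))
```

Under these assumptions only the pre-train failure needs investigation; the sft and lora failures should clear once pre-train produces its checkpoint.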

Suggested fixes:

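CI_ILUVATAR: Contact the CI maintainers to check the runner status on the iluvatar-gpu-2 node. The container execution environment on that node is abnormal and needs to be investigated, followed by a rerun.
Unittest GPU CI: Check whether the import of transformers.tokenization_utils_fast.PreTrainedTokenizerFast at paddleformers/transformers/tokenizer_utils.py:48 needs a version-compatibility adaptation (the transformers built into the current torch-2.12.0 causes an API conflict). The same error exists on the base develop branch, so merging develop first and then re-checking is recommended.
Fleet Model Test: A direct rerun is recommended, since both Blocking queue failures are intermittent. If the rerun still fails, investigate the stability of the DataLoader data pipeline under concurrent multi-process reads.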
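
Before adapting the import for the Unittest GPU CI failure, it may help to confirm what the CI environment actually provides. The following is a minimal diagnostic sketch, not part of PaddleFormers: `installed_version` and `probe_import` are hypothetical helpers built only on the standard library, and checking the top-level `transformers` package as an alternative location is an assumption about where the class might still be exported.

```python
import importlib
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def probe_import(module_name: str, attr: str) -> str:
    """Report whether `attr` can be obtained from `module_name`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return "module not importable"
    try:
        # Lazily loading packages may raise errors other than AttributeError here,
        # so any exception is treated as "not available".
        getattr(module, attr)
    except Exception:
        return "attribute missing"
    return "ok"


if __name__ == "__main__":
    print("transformers version:", installed_version("transformers"))
    # The first path is the one named in the report; the second is an assumed alternative.
    for mod in ("transformers.tokenization_utils_fast", "transformers"):
        print(mod, "->", probe_import(mod, "PreTrainedTokenizerFast"))
```

Running this in the failing CI image and on a known-good image would show whether the installed `transformers` version differs and which import path, if any, still resolves.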

🔄 Automatically updated after each re-run

add debug

864aa37

Liujie0926 added 3 commits May 20, 2026 14:13

fix

b415b3d

fix

14caf55

test

d1eff3c

Liujie0926 closed this May 20, 2026

Liujie0926 deleted the debug branch May 20, 2026 07:21

Liujie0926 wants to merge 4 commits into PaddlePaddle:develop from Liujie0926:debug


2 participants
