[NOT MERGE]debug CI/CE #4497

Closed
Liujie0926 wants to merge 4 commits into PaddlePaddle:develop from Liujie0926:debug

Conversation

@Liujie0926
Collaborator

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --files XXXX.py
  • Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

PR changes

Description

@paddle-bot

paddle-bot Bot commented May 20, 2026

Thanks for your contribution!

@Paddle-CI-Bot

Paddle-CI-Bot commented May 20, 2026

PaddleFormers Log Analysis

Run #26148613051 · Attempt 1

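Log analysis report

| Pipeline | Job | Issue label | Suggested fix |
| --- | --- | --- | --- |
| CI_ILUVATAR | 76902154399 | Environment issue (container failed) | The script failed to run after container initialization; contact the self-hosted runner administrator to check the container configuration on node iluvatar-gpu-2-nczzk-runner-tgszg |
| Unittest GPU CI | 76902317011 | Bug in unit tests (ImportError) | paddleformers/transformers/tokenizer_utils.py:48 fails to import PreTrainedTokenizerFast from transformers.tokenization_utils_fast; check the API compatibility of the installed transformers version |
| Fleet Model Test (H20 multi-card) | 76902269958 | Intermittent issue under high concurrency | Qwen pre-train hit a Blocking queue exception; this is intermittent, so a rerun is recommended. The sft/lora failures are cascade errors caused by the missing checkpoint |
| Fleet Model Test (H20 single-card) | 76902270009 | Intermittent issue under high concurrency | GLM4.5 single-card training hit a Blocking queue exception; this is intermittent, so a rerun is recommended |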

Failed test cases

# CI_ILUVATAR
iluvatar_test → step: Print current runner name
  Error: Executing the custom container implementation failed (exit code 1)

# Unittest GPU CI
tests/ai_edited_test/cli/test_ai_deepseek_v3_workflow.py::TestPreTrainingArguments::test_autotuner_benchmark_post_init
  Error: ImportError: cannot import name 'PreTrainedTokenizerFast'
        from 'transformers.tokenization_utils_fast'
  Call chain: tokenizer_utils.py:48 ← feature_extraction_utils.py ← image_processing_utils.py ← trainer.py

# Fleet Model Test (H20 multi-card)
- Qwen pre-train: SystemError: Blocking queue is killed (data reader exception)
- Qwen sft: FileNotFoundError: /workspace/checkpoints/qwen-pt (cascade)
- Qwen lora: HFValidationError: /workspace/checkpoints/qwen-sft (cascade)

# Fleet Model Test (H20 single-card)
- GLM4.5 single-card: SystemError: Blocking queue is killed (data reader exception)
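
The "(cascade)" labels in the multi-card list can be illustrated with a small sketch. It assumes a linear pipeline in which each stage consumes the checkpoint written by the previous one (pre-train, then sft, then lora, as the checkpoint paths above suggest). The `PIPELINE` list and the `split_root_and_cascade` helper are hypothetical names used only for illustration; they are not part of the CI tooling.

```python
# Hypothetical sketch: the stage order is inferred from the report;
# the names below are illustrative and not part of any CI tool.
PIPELINE = ["pre-train", "sft", "lora"]  # each stage reads the previous stage's checkpoint


def split_root_and_cascade(failed_stages):
    """Split failed stages into root failures and cascade failures."""
    failed = set(failed_stages)
    root, cascade = [], []
    upstream_failed = False
    for stage in PIPELINE:
        if stage in failed:
            # A failure right after a failed upstream stage is treated as a cascade.
            (cascade if upstream_failed else root).append(stage)
            upstream_failed = True
        else:
            # A stage that passed produced its checkpoint, so later failures are independent.
            upstream_failed = False
    return root, cascade


root, cascade = split_root_and_cascade(["pre-train", "sft", "lora"])
print("root:", root)
print("cascade:", cascade)
```

Under this assumption only the pre-train failure needs investigation; the sft and lora failures should disappear once pre-train produces its checkpoint.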

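Root cause analysis

  • CI_ILUVATAR: The container on the iluvatar-gpu-2 node malfunctioned and failed early, during the "Print current runner name" step (failed to run script step: [object Object]). All subsequent steps were skipped. This is a runner infrastructure problem and is unrelated to the code changes in this PR.
  • Unittest GPU CI: After torch-2.12.0 was installed, the PreTrainedTokenizerFast interface in the tokenization_utils_fast module of the transformers package that came with it was incompatible with the import path paddleformers expects, so the test cases could not be collected (exit code 4). The problem also reproduces on the base commit (the log shows the same failure after switching to the develop branch).
  • Fleet Model Test (Blocking queue): Both jobs hit SystemError: Blocking queue is killed because the data reader raises an exception. This error is an intermittent DataLoader problem under high concurrency and is not directly related to the code changes in this PR. The multi-card sft/lora failures are cascade errors caused by pre-train not producing a checkpoint.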
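
For the ImportError case, one common way to tolerate a name that moves between library versions is to try several import locations in order. The sketch below is a generic illustration, not the paddleformers implementation: `import_first_available` is a hypothetical helper, and it is demonstrated with standard-library names so that it runs without transformers installed.

```python
import importlib


def import_first_available(candidates):
    """Return the first attribute importable from a list of (module, name) pairs."""
    errors = []
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr_name)
        except (ImportError, AttributeError) as exc:
            # Record why this location failed and move on to the next one.
            errors.append(f"{module_name}.{attr_name}: {exc}")
    raise ImportError("none of the candidates could be imported:\n" + "\n".join(errors))


# Demonstration with standard-library names (assumption: stand-ins only).
OrderedDict = import_first_available([
    ("collections", "NoSuchName"),   # stands in for a location that no longer exports the name
    ("collections", "OrderedDict"),  # fallback location
])
print(OrderedDict.__name__)
```

In the situation described above, the first candidate would be `("transformers.tokenization_utils_fast", "PreTrainedTokenizerFast")`. Which fallback location is correct, if any, depends on the installed transformers version and would need to be confirmed against that version.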

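Suggested fixes

  1. CI_ILUVATAR: Contact the CI maintainers to check the runner status on the iluvatar-gpu-2 node. The container execution environment on that node is abnormal and needs to be investigated before a rerun.
  2. Unittest GPU CI: Check whether the import of transformers.tokenization_utils_fast.PreTrainedTokenizerFast at paddleformers/transformers/tokenizer_utils.py:48 needs a version-compatibility adaptation (the transformers bundled with the current torch-2.12.0 causes the API conflict). The same error also exists on the base develop branch, so merge develop first and then confirm.
  3. Fleet Model Test: Rerun directly; both Blocking queue failures are intermittent. If the rerun still fails, investigate the stability of the data pipeline when DataLoader reads concurrently from multiple processes.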
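
The rerun suggestion in item 3 amounts to retrying only when a failure matches a known intermittent signature. A minimal sketch of that policy follows, assuming the error message contains the "Blocking queue is killed" text quoted in the report. `run_with_rerun` and `FLAKY_MARKERS` are hypothetical names; a real CI setup would more likely use the rerun mechanism of its workflow system.

```python
import time

# Signature quoted in the report; treating it as "safe to rerun" is an assumption.
FLAKY_MARKERS = ("Blocking queue is killed",)


def run_with_rerun(step, max_attempts=2, delay_seconds=0.0):
    """Call step(); retry only when the error matches a known flaky marker."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:
            is_flaky = any(marker in str(exc) for marker in FLAKY_MARKERS)
            # Re-raise real failures at once, and flaky ones after the last attempt.
            if not is_flaky or attempt == max_attempts:
                raise
            time.sleep(delay_seconds)
```

Failures that do not match a marker, such as the ImportError in item 2, are re-raised immediately, so a rerun cannot hide a deterministic problem.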

🔄 Automatically updated after each re-run

@Liujie0926 Liujie0926 closed this May 20, 2026
@Liujie0926 Liujie0926 deleted the debug branch May 20, 2026 07:21

2 participants