Skip to content

feat: add datasets_v2 module and SFT-V2 training pipeline#4435

Open
weiyixuanxx wants to merge 8 commits into
PaddlePaddle:developfrom
weiyixuanxx:dev_dataset_v2
Open

feat: add datasets_v2 module and SFT-V2 training pipeline#4435
weiyixuanxx wants to merge 8 commits into
PaddlePaddle:developfrom
weiyixuanxx:dev_dataset_v2

Conversation

@weiyixuanxx
Copy link
Copy Markdown
Contributor

Introduce a new data loading and encoding pipeline (datasets_v2) with:

  • Schema-based dataset registry with preprocessor auto-detection
  • Independent template system (chatml, llama3, deepseek3, etc.)
  • Lazy encoding dataset with packing and flashmask support
  • SFT-V2 workflow (workflow2.py) integrated via stage routing

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
  • Add test cases into tests folder. If there are codecov issues, please add tests cases first.

PR types

PR changes

Description

Introduce a new data loading and encoding pipeline (datasets_v2) with:
- Schema-based dataset registry with preprocessor auto-detection
- Independent template system (chatml, llama3, deepseek3, etc.)
- Lazy encoding dataset with packing and flashmask support
- SFT-V2 workflow (workflow2.py) integrated via stage routing

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@paddle-bot
Copy link
Copy Markdown

paddle-bot Bot commented May 12, 2026

Thanks for your contribution!

@weiyixuanxx
Copy link
Copy Markdown
Contributor Author

only test

…: SFT_v2. Not all features are supported yet, and the overall pipeline still needs to be reviewed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@weiyixuanxx
Copy link
Copy Markdown
Contributor Author

/re-run all-failed

weiyixuanxx and others added 2 commits May 18, 2026 11:11
Previously blocked by .gitignore global `dataset/` rule.
This fixes the CI ModuleNotFoundError for paddleformers.datasets_v2.dataset.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 18, 2026

Codecov Report

❌ Patch coverage is 22.06219% with 2381 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@cdac2ff). Learn more about missing BASE report.

Files with missing lines Patch % Lines
paddleformers/datasets_v2/mm_plugin.py 16.79% 654 Missing ⚠️
paddleformers/datasets_v2/datapipe/template.py 21.64% 257 Missing ⚠️
paddleformers/datasets_v2/datapipe/encode.py 16.04% 225 Missing ⚠️
paddleformers/datasets_v2/datapipe/collate.py 5.95% 221 Missing ⚠️
paddleformers/cli/train/sft/workflow2.py 9.29% 205 Missing ⚠️
paddleformers/cli/train/sft/workflow_vl_v2.py 12.68% 179 Missing ⚠️
paddleformers/datasets_v2/loaders.py 12.94% 121 Missing ⚠️
paddleformers/datasets_v2/datapipe/tool_utils.py 46.89% 94 Missing ⚠️
...addleformers/datasets_v2/preprocessors/messages.py 17.33% 62 Missing ⚠️
paddleformers/datasets_v2/preprocessors/base.py 28.75% 57 Missing ⚠️
... and 12 more

❌ Your patch status has failed because the patch coverage (22.06%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #4435   +/-   ##
==========================================
  Coverage           ?   45.66%           
==========================================
  Files              ?      500           
  Lines              ?    93656           
  Branches           ?        0           
==========================================
  Hits               ?    42770           
  Misses             ?    50886           
  Partials           ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

weiyixuanxx and others added 2 commits May 21, 2026 16:15
…enhancements

Migrate all missing template features from old datasets/ pipeline into
datasets_v2/ as independent code (no cross-imports):
- ReasoningTemplate (encode_multiturn_reasoning, thought tag management)
- Tool calling (tool_utils.py with 9 model-specific formatters)
- Function/Observation role support in encode_multiturn
- fix_special_tokens and parse_template utilities
- Grounding plugin (grounding_plugin.py)
- mm_plugin for VL-SFT support
- 25 registered templates (qwen3/3.5/vl, glm4/moe/v, ernie/vl, etc.)

Also includes: streaming dataset, packing improvements, collate
enhancements, workflow2 updates, and comprehensive tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The file was imported in __init__.py but never committed, causing
ModuleNotFoundError in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@weiyixuanxx
Copy link
Copy Markdown
Contributor Author

/re-run all-failed

3 similar comments
@weiyixuanxx
Copy link
Copy Markdown
Contributor Author

/re-run all-failed

@weiyixuanxx
Copy link
Copy Markdown
Contributor Author

/re-run all-failed

@weiyixuanxx
Copy link
Copy Markdown
Contributor Author

/re-run all-failed

…g features

- Extract 8 shared helpers (_dispatch_encode, _flatten_turns, _apply_dynamic_eos,
  _apply_efficient_eos, _apply_label_shift, _apply_truncation, _apply_auto_bos,
  _validate_and_build) to eliminate duplication between encode_sft and encode_vl_sft
- Align template.py with old pipeline: reasoning dispatch, GLM5 close-tag-only thought
- Switch CI config yamls to stage: SFT-V2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@PaddlePaddle PaddlePaddle deleted a comment from github-actions Bot May 21, 2026
@weiyixuanxx
Copy link
Copy Markdown
Contributor Author

/re-run all-failed

…aset_type default

- Add ErnieKitPreprocessor to convert src/tgt format to messages
- Expand schema _ALLOWED_MESSAGE_KEYS to include tool_calls/tool_call_id/name/tools
- Add dataset_format parameter to load_dataset() with priority-based dispatch
- Pass train_dataset_type/eval_dataset_type as format hints in workflow2
- Change DataArguments.dataset_type default from "iterable" to "map" (fixes packing conflict)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Paddle-CI-Bot
Copy link
Copy Markdown

PaddleFormers Log Analysis

Run #26246200452 · Attempt 1

日志分析报告

流水线名称 问题标签 修复建议
Fleet Model Test (A100, Integration test - A100) 单测存在 Bug qwen3vl_sft_moe_a100.yaml 训练时 hidden_states + pos_embeds 形状不匹配 [2268,256] vs [2320,256],需修复 modeling_fleet.py 中 position embedding 的 shape 对齐逻辑
Fleet Model Test (H20, multi-card, FSDP) 单测存在 Bug qwen3vl_sft_fsdp.yaml 在 save_pretrained 阶段 all_gather_object 时发生 _pickle.UnpicklingError: pickle data was truncated,需排查 FSDP 多卡保存时 tensor 序列化/反序列化逻辑
Fleet Model Test (H20, single card) 单测存在 Loss Diff(需精度 Approve) qwen3vl_sft_single_card Loss Diff 检测触发精度变更保护(exit 6),需 @XieYunshen / @From00 / @risemeup1 / @tianlef / @lugimzzz / @zjjlivein / @swgu98 进行 approve
CI_XPU 单测存在 Loss Diff test_ernie_28b_thinking_sft_training Loss 验证失败(LOSS PRECISION COMPARISON FAILED),本次 PR 对 datasets_v2 encode 逻辑的修改可能影响了 ERNIE 28B 的 token 分布,建议对比 diff 前后数据编码输出
CI_ILUVATAR 环境问题(container failed) k8s 自定义容器实现执行失败(Executing the custom container implementation failed),与本次 PR 修改无关,建议 QA 联系 runner 管理员检查 iluvatar 机器 k8s 环境
Unittest GPU CI 单测存在 Bug tests/ai_edited_test/cli/test_ai_data_args.py::TestDataArguments::test_default_values 断言失败:DataArguments().dataset_type 期望 iterable,实际为 map,本次 PR 修改了 dataset_type 默认值导致

失败的测试 case:

1. Fleet Model Test (A100, MoE multi-card):
   - tests/config/ci/qwen3vl_sft_moe_a100.yaml
     -> ValueError: Broadcast dimension mismatch [2268, 256] vs [2320, 256]
        at paddleformers/transformers/qwen3_vl/modeling_fleet.py:821

2. Fleet Model Test (H20, FSDP multi-card):
   - tests/config/ci/qwen3vl_sft_fsdp.yaml
     -> _pickle.UnpicklingError: pickle data was truncated
        at paddle/distributed/communication/serialization_utils.py:34
        (all_gather_object in model_utils.py:3960)

3. Fleet Model Test (H20, single card):
   - qwen3vl_sft_single_card(精度变更保护)
     -> Log Loss: 11.93298912 / GT Loss: 11.93281651
        Max absolute diff: 0.00017261,exit code 6(需 approve)

4. CI_XPU:
   - scripts/xpu_ci/test_ernie_28b_thinking_sft.py::test_ernie_28b_thinking_sft_training
     -> LOSS PRECISION COMPARISON FAILED

5. CI_ILUVATAR:
   - iluvatar_test(k8s container 环境失败,非代码问题)
     -> Error: failed to run script step / custom container implementation failed

6. Unittest GPU CI:
   - tests/ai_edited_test/cli/test_ai_data_args.py::TestDataArguments::test_default_values
     -> AssertionError: 'map' != 'iterable'
        DataArguments().dataset_type 默认值被改为 'map'

根本原因分析:

本次 PR(dev_dataset_v2,PR #4435)主要修改了 datasets_v2 相关逻辑,涉及编码(encode.py)耦合度降低及数据集格式支持扩展。根因如下:

  1. dataset_type 默认值变更:PR commit 5e06bb8a 描述为 fix(datasets_v2): support erniekit format, tool_calls schema, and dataset_type default,明确修改了 DataArguments.dataset_type 的默认值(由 iterable 改为 map),与 test_default_values 中的期望值不符,是本次 PR 直接引入的 Bug。

  2. Qwen3VL MoE A100 shape mismatchhidden_states shape [2268, 256]pos_embeds shape [2320, 256] 不一致,推测 PR 对 token 数计算或 image/video patch 数处理逻辑有修改,导致 position embedding 长度与实际 token 序列长度不对齐(2268 ≠ 2320,差 52 个 token),需检查 qwen3_vl/modeling_fleet.py 中 visual token 拼接部分是否受 encode 流程变化影响。

  3. H20 FSDP pickle 截断all_gather_object 在多卡 FSDP save 时因某个 rank 的 shard 文件数序列化数据被截断失败,可能与 FSDP 分片逻辑中文件名列表在新数据格式下产生异常长度有关。

  4. XPU ERNIE 28B Loss Diff:encode 逻辑变更导致 token 序列结构发生变化,影响 ERNIE 28B 训练 Loss 精度,属于本次 PR 数据处理改动的连锁影响。

  5. H20 single card 精度 approve:属于已知精度变更保护机制,需相关人员 approve 后方可合入。

  6. ILUVATAR 环境失败:k8s runner 自定义容器问题,与本次 PR 无关。


修复建议:

  1. 修复 dataset_type 默认值:将 DataArguments.dataset_type 默认值改回 "iterable",或更新 test_ai_data_args.pytest_default_values 的期望值为 "map"(取决于设计意图,确认后与 PR 描述对齐)。

  2. 修复 Qwen3VL position embedding shape mismatch:检查 paddleformers/transformers/qwen3_vl/modeling_fleet.py:821 附近 pos_embeds 的生成逻辑,确认 visual token 数量计算(patch grid size)是否在新 encode 流程下因 padding/truncation 策略变化导致偏差,对齐 hidden_statespos_embeds 的序列长度。

  3. 修复 FSDP pickle 截断:排查 model_utils.py:3960all_gather_object 调用,检查各 rank 的 files_num 是否在新格式下出现非预期的大对象导致序列化 buffer 溢出,考虑增加 buffer size 或改用 broadcast_object_list

  4. XPU Loss Diff:对比本次 PR encode 前后的 token 序列输出,确认 ERNIE 28B SFT 场景下 token 分布变化是预期的还是 side-effect,若为 side-effect 需在 encode 逻辑中对该模型单独处理。

  5. 精度 approve:H20 single card 精度变更需 @XieYunshen / @From00 / @risemeup1 / @tianlef 其中之一进行 approve。

  6. ILUVATAR 环境:联系 runner 管理员检查 iluvatar-gpu-2-nczzk-runner-wvtpl 机器的 k8s 自定义容器配置,与本次 PR 修改无关,可单独 rerun。


🔄 每次 Re-run 后自动更新

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants