feat: add datasets_v2 module and SFT-V2 training pipeline by weiyixuanxx · Pull Request #4435 · PaddlePaddle/PaddleFormers

weiyixuanxx · 2026-05-12T11:48:05Z

Introduce a new data loading and encoding pipeline (datasets_v2) with:

Schema-based dataset registry with preprocessor auto-detection
Independent template system (chatml, llama3, deepseek3, etc.)
Lazy encoding dataset with packing and flashmask support
SFT-V2 workflow (workflow2.py) integrated via stage routing

Before submitting

Lint code. If there are lint issues, please format the code first.

# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py

Add test cases into tests folder. If there are codecov issues, please add tests cases first.

PR types

PR changes

Description

Introduce a new data loading and encoding pipeline (datasets_v2) with: - Schema-based dataset registry with preprocessor auto-detection - Independent template system (chatml, llama3, deepseek3, etc.) - Lazy encoding dataset with packing and flashmask support - SFT-V2 workflow (workflow2.py) integrated via stage routing Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

paddle-bot · 2026-05-12T11:48:14Z

Thanks for your contribution!

weiyixuanxx · 2026-05-12T11:48:39Z

only test

…: SFT_v2. Not all features are supported yet, and the overall pipeline still needs to be reviewed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

weiyixuanxx · 2026-05-15T03:46:57Z

/re-run all-failed

Previously blocked by .gitignore global `dataset/` rule. This fixes the CI ModuleNotFoundError for paddleformers.datasets_v2.dataset. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov-commenter · 2026-05-18T03:54:02Z

Codecov Report

❌ Patch coverage is 22.12131% with 2401 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@e97d8b5). Learn more about missing BASE report.

⚠️ Current head 1477924 differs from pull request most recent head e76fd39

Please upload reports for the commit e76fd39 to get more accurate results.

Files with missing lines	Patch %	Lines
paddleformers/datasets_v2/mm_plugin.py	16.79%	654 Missing ⚠️
paddleformers/datasets_v2/datapipe/template.py	21.64%	257 Missing ⚠️
paddleformers/datasets_v2/datapipe/encode.py	16.04%	225 Missing ⚠️
paddleformers/datasets_v2/datapipe/collate.py	5.95%	221 Missing ⚠️
paddleformers/cli/train/sft/workflow2.py	9.09%	210 Missing ⚠️
paddleformers/cli/train/sft/workflow_vl_v2.py	12.68%	179 Missing ⚠️
paddleformers/datasets_v2/loaders.py	13.69%	126 Missing ⚠️
paddleformers/datasets_v2/datapipe/tool_utils.py	46.89%	94 Missing ⚠️
...addleformers/datasets_v2/preprocessors/messages.py	17.33%	62 Missing ⚠️
paddleformers/datasets_v2/preprocessors/base.py	28.75%	57 Missing ⚠️
... and 13 more

❌ Your patch status has failed because the patch coverage (22.12%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #4435   +/-   ##
==========================================
  Coverage           ?   45.64%           
==========================================
  Files              ?      501           
  Lines              ?    93720           
  Branches           ?        0           
==========================================
  Hits               ?    42774           
  Misses             ?    50946           
  Partials           ?        0

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…enhancements Migrate all missing template features from old datasets/ pipeline into datasets_v2/ as independent code (no cross-imports): - ReasoningTemplate (encode_multiturn_reasoning, thought tag management) - Tool calling (tool_utils.py with 9 model-specific formatters) - Function/Observation role support in encode_multiturn - fix_special_tokens and parse_template utilities - Grounding plugin (grounding_plugin.py) - mm_plugin for VL-SFT support - 25 registered templates (qwen3/3.5/vl, glm4/moe/v, ernie/vl, etc.) Also includes: streaming dataset, packing improvements, collate enhancements, workflow2 updates, and comprehensive tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The file was imported in __init__.py but never committed, causing ModuleNotFoundError in CI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

weiyixuanxx · 2026-05-21T08:50:33Z

/re-run all-failed

weiyixuanxx · 2026-05-21T09:08:54Z

/re-run all-failed

weiyixuanxx · 2026-05-21T09:50:39Z

/re-run all-failed

weiyixuanxx · 2026-05-21T10:01:35Z

/re-run all-failed

…g features - Extract 8 shared helpers (_dispatch_encode, _flatten_turns, _apply_dynamic_eos, _apply_efficient_eos, _apply_label_shift, _apply_truncation, _apply_auto_bos, _validate_and_build) to eliminate duplication between encode_sft and encode_vl_sft - Align template.py with old pipeline: reasoning dispatch, GLM5 close-tag-only thought - Switch CI config yamls to stage: SFT-V2 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

weiyixuanxx · 2026-05-21T11:36:40Z

/re-run all-failed

…aset_type default - Add ErnieKitPreprocessor to convert src/tgt format to messages - Expand schema _ALLOWED_MESSAGE_KEYS to include tool_calls/tool_call_id/name/tools - Add dataset_format parameter to load_dataset() with priority-based dispatch - Pass train_dataset_type/eval_dataset_type as format hints in workflow2 - Change DataArguments.dataset_type default from "iterable" to "map" (fixes packing conflict) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Paddle-CI-Bot · 2026-05-21T18:50:29Z

PaddleFormers Log Analysis

Run #26294991443 · Attempt 1

日志分析报告

流水线名称	问题标签	修复建议	日志片段
CI_ILUVATAR (`iluvatar_test`)	其他 — LossNan	训练第2步出现 `loss=nan`，与 `Got different data type, run type promotion automatically` 警告同步出现，PR 中 dataset 读取或数据预处理逻辑改动引入了非法数值（inf/nan），需排查 `dev_dataset_v2` 分支的数据 collator / tokenizer 变更是否改变了 attention_mask 或 input_ids 的 dtype	报错代码
Unittest GPU CI (`unittest-gpu-ci`)	单测 Bug × 2	① `tuner.py` 中未导出 `run_dpo`，测试 `patch.object` 失败；② `USE_CASUAL_MASK` 环境变量实际值为 `False` 而测试期望 `True`，需检查 DeepSeek V3 workflow 中的环境变量设置逻辑	报错代码

失败的测试 case:

# CI_ILUVATAR
scripts/iluvatar_ci/test_ernie_21b_sft.py::test_ernie_21b_sft_training

# Unittest GPU CI
tests/ai_edited_test/cli/test_ai_deepseek_v3_workflow.py::TestEnvironmentVariable::test_use_casual_mask_env
tests/ai_edited_test/cli/test_ai_tuner.py::TestTrainingFunction::test_dpo_stage_calls_run_dpo

根本原因分析:

dev_dataset_v2 分支引入了 dataset 相关改动，导致三处独立问题：

LossNan（iluvatar_test）：训练 global_step=1 正常（loss=11.34），global_step=2 立刻 nan，紧随其后有 Got different data type, run type promotion automatically 警告，说明数据集 v2 的 collate 逻辑改变了某 tensor 的 dtype（如 FP16 溢出或 mask 变成了全 0），导致前向传播在天数硬件上产生 nan。
run_dpo 属性缺失（unittest-gpu-ci）：测试期望 paddleformers.cli.train.tuner 模块中存在 run_dpo 函数（通过 patch.object 来 mock），但实际 tuner.py 未 import 或未定义该函数，属于测试与实现脱节，可能是 dataset_v2 重构中对 tuner.py 改动时遗漏了 run_dpo 的引入。
USE_CASUAL_MASK 环境变量不匹配（unittest-gpu-ci）：测试断言该变量为 "True"，但实际为 "False"，说明 DeepSeek V3 workflow 初始化路径变更后默认值被修改。

修复建议:

LossNan 排查：在 dev_dataset_v2 分支的 collator 或 dataset map 函数中，检查 attention_mask、labels、position_ids 的 dtype 是否仍为期望类型（int64/float32）；对比 develop 与本分支 ERNIE-21B-SFT.yaml 的 dtype 配置项；在 iluvatar 硬件上临时添加 --fp32_opt_level O0 复现确认是否为混精度溢出。
run_dpo 缺失：在 paddleformers/cli/train/tuner.py 顶部补充：
```
from paddleformers.cli.train.dpo.workflow import run_dpo
```
确保模块命名空间中存在该符号，使 patch.object(tuner_mod, "run_dpo") 可正常工作。
USE_CASUAL_MASK 环境变量：检查 paddleformers/cli/train/sft/workflow.py 或 DeepSeek V3 相关 workflow 初始化代码，确认 USE_CASUAL_MASK 的默认值设置路径是否随本次 dataset 重构被改为 False；若属业务逻辑变更，同步更新测试断言；若属误改，回退对应赋值行。

_{🔄 每次 Re-run 后自动更新}

…pipeline The previous commit changed DataArguments.dataset_type default from "iterable" to "map", which broke all Fleet CI tests (VL-SFT, FSDP, XPU) because the old pipeline uses dataset_type to select between MapSFTDataset and IteratorSFTDataset — completely different implementations. Fix: keep default as "iterable" (old pipeline untouched), and handle packing compatibility inside workflow2.py (V2 auto-switches to map when packing is enabled). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

weiyixuanxx · 2026-05-22T04:01:49Z

/re-run all-failed

weiyixuanxx · 2026-05-22T05:55:24Z

/re-run all-failed

Implement DPO training in the V2 data pipeline without modifying old code: - encode_dpo(): forked sequence encoding with shared prefix detection - collate_dpo(): batch collation with block-causal attention mask - workflow2.py: DPO-V2 workflow reusing existing DPOTrainer - ErnieKit preprocessor: extended to handle DPO format (response+sort) - Route stage "DPO-V2" to run_dpo_v2() in tuner.py Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move SFT-V2, DPO, and DPO-V2 imports inside their respective elif branches so that old pipeline stages (SFT, PT) never trigger the datasets_v2 import chain. This eliminates potential side-effects during distributed initialization (PP+TP). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

weiyixuanxx · 2026-05-22T09:28:54Z

/re-run all failed

weiyixuanxx · 2026-05-22T09:29:11Z

/re-run all-failed

- tuner.py: import run_dpo at module level (old DPO pipeline, no datasets_v2 dependency) so patch.object in tests works correctly - dpo/__init__.py: use module __getattr__ to lazy-load run_dpo_v2, preventing workflow2.py (which imports datasets_v2) from being loaded until DPO-V2 stage is actually invoked Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…pipeline sft/__init__.py eagerly imported workflow2.py and workflow_vl_v2.py, which pull in paddleformers.datasets_v2. This caused the datasets_v2 module to be loaded even for old pipeline stages (SFT, PT), potentially triggering side-effects in distributed environments (PP+TP on Iluvatar). Use module __getattr__ to defer loading until SFT-V2/VL-SFT-V2 is actually invoked. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Qwen3VLTextProvider does not define separate_mtp_headloss attribute. Use getattr with default False to avoid AttributeError for providers that lack this field. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

weiyixuanxx · 2026-05-25T09:48:41Z

/re-run all failed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

weiyixuanxx · 2026-05-25T12:33:09Z

/re-run all failed

weiyixuanxx · 2026-05-26T02:51:08Z

/re-run all failed

weiyixuanxx · 2026-05-26T02:51:40Z

/re-run all failed

…tains eos_token parse_template auto-detects whether the assistant format slot already includes the eos token. If so, set efficient_eos=False to prevent _apply_efficient_eos from appending a redundant eos_token_id. This fixes a 1-token encoding mismatch vs the old pipeline that caused training loss divergence. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…eFormers into dev_dataset_v2

- StreamingDataset: add lazy parameter for true streaming (lazy=True) vs V1-compat materialization mode (lazy=False), controlled by lazy_data_processing in training_args - workflow2.py: simplify dataset routing logic, wire lazy param - workflow_vl_v2.py / dpo/workflow2.py: align with updated API - template.py / encode.py: improve multi-modal and DPO encoding - Remove deprecated data_args fields (random_shuffle etc.) - docs: remove ms-swift references from design document Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

First version of changes: the YAML config file requires setting stage…

6bded35

…: SFT_v2. Not all features are supported yet, and the overall pipeline still needs to be reviewed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

weiyixuanxx and others added 2 commits May 18, 2026 11:11

fix .gitignore

21ff444

Add datasets_v2/dataset module (LazyEncodeDataset)

9a19676

Previously blocked by .gitignore global `dataset/` rule. This fixes the CI ModuleNotFoundError for paddleformers.datasets_v2.dataset. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

weiyixuanxx and others added 2 commits May 21, 2026 16:15

fix: add missing grounding_plugin.py to datasets_v2

3916027

The file was imported in __init__.py but never committed, causing ModuleNotFoundError in CI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

PaddlePaddle deleted a comment from github-actions Bot May 21, 2026

weiyixuanxx and others added 2 commits May 22, 2026 14:56

weiyixuanxx and others added 3 commits May 25, 2026 11:14

fix: use getattr for separate_mtp_headloss in gpt_provider

96c360f

Qwen3VLTextProvider does not define separate_mtp_headloss attribute. Use getattr with default False to avoid AttributeError for providers that lack this field. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

weiyixuanxx and others added 2 commits May 25, 2026 17:58

style: minor comment wording in template.py

1e5e63e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge remote-tracking branch 'upstream/develop' into dev_dataset_v2

dcc4ccb

weiyixuanxx and others added 6 commits June 1, 2026 21:53

Merge remote-tracking branch 'upstream/develop' into dev_dataset_v2

2dca7de

Merge remote-tracking branch 'upstream/develop' into dev_dataset_v2

14000cf

Merge remote-tracking branch 'upstream/develop' into dev_dataset_v2

87ba0b3

Merge branch 'dev_dataset_v2' of https://github.com/weiyixuanxx/Paddl…

cf11125

…eFormers into dev_dataset_v2

Uh oh!

Conversation

weiyixuanxx commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Before submitting

PR types

PR changes

Description

Uh oh!

paddle-bot Bot commented May 12, 2026

Uh oh!

weiyixuanxx commented May 12, 2026

Uh oh!

weiyixuanxx commented May 15, 2026

Uh oh!

codecov-commenter commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

weiyixuanxx commented May 21, 2026

Uh oh!

weiyixuanxx commented May 21, 2026

Uh oh!

weiyixuanxx commented May 21, 2026

Uh oh!

weiyixuanxx commented May 21, 2026

Uh oh!

weiyixuanxx commented May 21, 2026

Uh oh!

Paddle-CI-Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PaddleFormers Log Analysis

日志分析报告

Uh oh!

weiyixuanxx commented May 22, 2026

Uh oh!

weiyixuanxx commented May 22, 2026

Uh oh!

weiyixuanxx commented May 22, 2026

Uh oh!

weiyixuanxx commented May 22, 2026

Uh oh!

weiyixuanxx commented May 25, 2026

Uh oh!

weiyixuanxx commented May 25, 2026

Uh oh!

weiyixuanxx commented May 26, 2026

Uh oh!

weiyixuanxx commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

weiyixuanxx commented May 12, 2026 •

edited

Loading

codecov-commenter commented May 18, 2026 •

edited

Loading

Paddle-CI-Bot commented May 21, 2026 •

edited

Loading