feat: add datasets_v2 module and SFT-V2 training pipeline#4435
feat: add datasets_v2 module and SFT-V2 training pipeline#4435weiyixuanxx wants to merge 8 commits into
Conversation
Introduce a new data loading and encoding pipeline (datasets_v2) with: - Schema-based dataset registry with preprocessor auto-detection - Independent template system (chatml, llama3, deepseek3, etc.) - Lazy encoding dataset with packing and flashmask support - SFT-V2 workflow (workflow2.py) integrated via stage routing Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Thanks for your contribution! |
|
only test |
…: SFT_v2. Not all features are supported yet, and the overall pipeline still needs to be reviewed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/re-run all-failed |
Previously blocked by .gitignore global `dataset/` rule. This fixes the CI ModuleNotFoundError for paddleformers.datasets_v2.dataset. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is ❌ Your patch status has failed because the patch coverage (22.06%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #4435 +/- ##
==========================================
Coverage ? 45.66%
==========================================
Files ? 500
Lines ? 93656
Branches ? 0
==========================================
Hits ? 42770
Misses ? 50886
Partials ? 0 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…enhancements Migrate all missing template features from old datasets/ pipeline into datasets_v2/ as independent code (no cross-imports): - ReasoningTemplate (encode_multiturn_reasoning, thought tag management) - Tool calling (tool_utils.py with 9 model-specific formatters) - Function/Observation role support in encode_multiturn - fix_special_tokens and parse_template utilities - Grounding plugin (grounding_plugin.py) - mm_plugin for VL-SFT support - 25 registered templates (qwen3/3.5/vl, glm4/moe/v, ernie/vl, etc.) Also includes: streaming dataset, packing improvements, collate enhancements, workflow2 updates, and comprehensive tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The file was imported in __init__.py but never committed, causing ModuleNotFoundError in CI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/re-run all-failed |
3 similar comments
|
/re-run all-failed |
|
/re-run all-failed |
|
/re-run all-failed |
…g features - Extract 8 shared helpers (_dispatch_encode, _flatten_turns, _apply_dynamic_eos, _apply_efficient_eos, _apply_label_shift, _apply_truncation, _apply_auto_bos, _validate_and_build) to eliminate duplication between encode_sft and encode_vl_sft - Align template.py with old pipeline: reasoning dispatch, GLM5 close-tag-only thought - Switch CI config yamls to stage: SFT-V2 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/re-run all-failed |
…aset_type default - Add ErnieKitPreprocessor to convert src/tgt format to messages - Expand schema _ALLOWED_MESSAGE_KEYS to include tool_calls/tool_call_id/name/tools - Add dataset_format parameter to load_dataset() with priority-based dispatch - Pass train_dataset_type/eval_dataset_type as format hints in workflow2 - Change DataArguments.dataset_type default from "iterable" to "map" (fixes packing conflict) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PaddleFormers Log Analysis
日志分析报告
失败的测试 case: 根本原因分析: 本次 PR(
修复建议:
🔄 每次 Re-run 后自动更新 |
Introduce a new data loading and encoding pipeline (datasets_v2) with:
Before submitting
testsfolder. If there are codecov issues, please add tests cases first.PR types
PR changes
Description