
Support InternLM2.5 and model alignment #4131

Open
learncat163 wants to merge 9 commits into PaddlePaddle:develop from learncat163:pr-merge-terlm25

@learncat163

PR types

New features

PR changes

Add Models

Description

Migrate the InternLM2.5 model to PaddleFormers:

1. Model functionality and alignment tests (tests/transformers/intern_lm2_5/test_modeling.py)

  • Add functional test classes such as InternLM25CompatibilityTest
  • Verify that torch and paddle inference results align (loss tolerance 1e-2; the first 10 generated token ids match)

2. Locations of the converted models

- Original 1.8b model: https://aistudio.baidu.com/modelsdetail/45123
- Paddle 1.8b model: https://aistudio.baidu.com/modelsdetail/45124
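
The alignment criterion in item 1 (loss within 1e-2, first 10 generated token ids identical) can be sketched as a small framework-agnostic check. This is only an illustration: the helper name `check_alignment` and its arguments are hypothetical and are not taken from `test_modeling.py`, and treating the tolerance as an absolute difference is an assumption.

```python
from typing import Sequence


def check_alignment(
    torch_loss: float,
    paddle_loss: float,
    torch_token_ids: Sequence[int],
    paddle_token_ids: Sequence[int],
    loss_tol: float = 1e-2,
    num_tokens: int = 10,
) -> bool:
    """Return True when both runs agree under the criteria quoted above.

    Hypothetical helper: loss_tol and num_tokens mirror the numbers in the
    PR description; the tolerance is assumed to be an absolute difference.
    """
    loss_ok = abs(torch_loss - paddle_loss) <= loss_tol
    # Only the first num_tokens generated ids have to match.
    tokens_ok = list(torch_token_ids[:num_tokens]) == list(paddle_token_ids[:num_tokens])
    return loss_ok and tokens_ok
```

In a real test the losses and token ids would come from running the torch and paddle models on the same input.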

Comparison with ms-swift

1. ms-swift configuration

MODEL_PATH="/mnt/learncat/llm/internlm/internlm2_5-1_8b-chat-raw"
TRAIN_DATA="/mnt/learncat/code/swift/train/train-msg.jsonl"
OUTPUT_DIR="/tmp/internlm2_5-sft-full-ms-swift"

# Create the output directory
mkdir -p "$OUTPUT_DIR"

echo "========================================"
echo "Starting ms-swift full-parameter fine-tuning test"
echo "Model: internlm2_5-1_8b-chat"
echo "========================================"

# Activate the swift environment and run
source /mnt/learncat/code/swift/.venv/bin/activate && \
swift sft \
    --model "$MODEL_PATH" \
    --model_type internlm2 \
    --tuner_type full \
    --template default \
    --dataset "$TRAIN_DATA" \
    --max_length 512 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-5 \
    --max_steps 500 \
    --warmup_steps 5 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --bf16 true \
    --seed 23 \
    --output_dir "$OUTPUT_DIR" \
    --logging_steps 1 \
    --save_strategy no \
    --gradient_checkpointing false \
    2>&1 | tee "$OUTPUT_DIR/training.log"

2. paddleformers-cli configuration

### data
train_dataset_type: erniekit
eval_dataset_type: erniekit
train_dataset_path: ./tests/fixtures/dummy/sft/train.jsonl
train_dataset_prob: "1.0"
eval_dataset_path: ./tests/fixtures/dummy/sft/eval.jsonl
eval_dataset_prob: "1.0"
max_seq_len: 512
packing: false
dataloader_shuffle: false
mix_strategy: concat
template_backend: custom
template: internlm2_5
### model
# placeholder path; override via update_training_args in test or CLI --model_name_or_path
#model_name_or_path: Qwen/Qwen3-0.6B-Base
model_name_or_path: /mnt/learncat/llm/internlm/internlm2_5-1_8b-chat-paddle
_attn_implementation: flashmask


### finetuning
# base
stage: SFT
fine_tuning: full
seed: 23
do_train: true
do_eval: true
per_device_eval_batch_size: 1
per_device_train_batch_size: 1
num_train_epochs: 1
max_steps: 500
eval_steps: 1000
evaluation_strategy: steps
save_steps: 100000
save_strategy: steps
logging_steps: 1
gradient_accumulation_steps: 1
logging_dir: ./vdl_log
output_dir: ./checkpoints/qwen3-sft-full
disable_tqdm: true
eval_accumulation_steps: 16


# train
warmup_steps: 5
learning_rate: 1.0e-5

# performance
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
sharding: stage1
recompute_granularity: full
recompute_method: uniform
recompute_num_layers: 1
bf16: true
fp16_opt_level: O1
unified_checkpoint: false
# Commenting this out skips the save stage
# save_checkpoint_format: flex_checkpoint
load_checkpoint_format: sharding_io
continue_training: false

3. Script for diffing the loss output

#!/usr/bin/env python3
import argparse
import os
import re
from pathlib import Path
from typing import Dict, Tuple


def parse_paddle_log(file_path: str) -> Dict[int, float]:
    """Map global_step to loss for every matching line of a PaddleFormers training log."""
    loss_dict = {}
    loss_pattern = re.compile(r"loss:\s*([0-9\.eE+-]+)")
    step_pattern = re.compile(r"global_step:\s*(\d+)")

    with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            if "loss:" in line and "global_step:" in line:
                loss_match = loss_pattern.search(line)
                step_match = step_pattern.search(line)
                if loss_match and step_match:
                    loss_val = float(loss_match.group(1))
                    step_val = int(step_match.group(1))
                    loss_dict[step_val] = loss_val
    return loss_dict


def parse_swift_log(file_path: str) -> Dict[int, float]:
    """Map step to loss for every matching line of an ms-swift training log."""
    loss_dict = {}
    loss_pattern = re.compile(r"'loss':\s*'([0-9\.eE+-]+)'")
    step_pattern = re.compile(r"'global_step/max_steps':\s*'(\d+)/\d+'")

    with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            if "'loss'" in line and "'global_step/max_steps'" in line:
                loss_match = loss_pattern.search(line)
                step_match = step_pattern.search(line)
                if loss_match and step_match:
                    loss_val = float(loss_match.group(1))
                    step_val = int(step_match.group(1))
                    loss_dict[step_val] = loss_val
    return loss_dict


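def find_log_files(directory: str) -> Tuple[str, str]:
    """Return the first paddle*.log and swift*.log (or misspelled swfit*.log) found in directory."""
    dir_path = Path(directory)
    # Sort the matches so the chosen file does not depend on filesystem order.
    paddle_files = sorted(dir_path.glob("paddle*.log"))
    swift_files = sorted(dir_path.glob("swift*.log")) + sorted(dir_path.glob("swfit*.log"))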

    if not paddle_files:
        raise FileNotFoundError(f"paddle log not found: {directory}")
    if not swift_files:
        raise FileNotFoundError(f"swift log not found: {directory}")

    return str(paddle_files[0]), str(swift_files[0])


def generate_markdown_table(
    paddle_dict: Dict[int, float],
    swift_dict: Dict[int, float],
    max_rows: int = 100
) -> str:
    common_steps = sorted(set(paddle_dict.keys()) & set(swift_dict.keys()))
    if not common_steps:
        return "no common steps"

    display_steps = common_steps[:max_rows]

    lines = [
        "# Loss",
        "",
        f"common: {len(common_steps)}, show: {len(display_steps)}",
        "",
        "| step | ms-swift | paddle |",
        "|------|----------|--------|"
    ]

    for step in display_steps:
        swift_loss = swift_dict[step]
        paddle_loss = paddle_dict[step]
        lines.append(f"| {step} | {swift_loss:.6f} | {paddle_loss:.6f} |")

    if len(common_steps) > max_rows:
        lines.append("| ... | ... | ... |")
        lines.append(f"_total {len(common_steps)}_")

    return "\n".join(lines)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dir", type=str, default="log/lm25")
    parser.add_argument("--max_rows", type=int, default=100)
    parser.add_argument("--output", type=str, default=None)
    args = parser.parse_args()

    paddle_log, swift_log = find_log_files(args.dir)
    print(f"Paddle: {paddle_log}")
    print(f"Swift:  {swift_log}")

    paddle_dict = parse_paddle_log(paddle_log)
    swift_dict = parse_swift_log(swift_log)

    print(f"Paddle: {len(paddle_dict)}")
    print(f"Swift:  {len(swift_dict)}")

    markdown = generate_markdown_table(paddle_dict, swift_dict, args.max_rows)

    output_path = args.output
    if output_path is None:
        output_path = os.path.join(args.dir, "diff-loss.md")

    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)

    with open(output_path, "w", encoding="utf-8") as f:
        f.write(markdown)

    print(f"\nOutput: {output_path}")
    print("\n" + markdown)


if __name__ == "__main__":
    main()

4. Loss comparison results

common: 102, show: 100

step ms-swift paddle
1 5.953000 6.290356
2 5.730000 9.247890
3 5.861000 6.163038
4 5.663000 4.679254
5 4.272000 6.301551
6 4.699000 3.590961
7 4.390000 8.666958
8 3.389000 7.602499
9 3.542000 8.734646
10 2.256000 8.286204
11 3.168000 7.052961
12 2.073000 3.105869
13 1.677000 4.067309
14 2.472000 5.547118
15 1.549000 2.685346
16 0.719400 5.332623
17 1.491000 2.459121
18 1.392000 0.934952
19 1.072000 2.983633
20 0.747800 4.115916
21 0.683300 0.988917
22 0.365400 0.432274
23 0.648900 1.131235
24 0.041450 1.985609
25 0.266300 2.602725
26 0.285700 0.345454
27 0.550200 0.031843
28 0.436700 3.117400
29 0.143000 1.994643
30 0.177000 0.015647
31 0.045770 0.202331
32 0.095660 0.335955
33 0.160200 0.090093
34 0.095230 0.302440
35 0.044270 0.477116
36 0.004348 1.127114
37 0.021820 0.718798
38 0.055190 0.009806
39 0.061650 0.006342
40 0.002891 0.007223
41 0.023180 0.937800
42 0.010910 0.000712
43 0.012420 0.000518
44 0.005803 0.000433
45 0.002461 0.000340
46 0.002702 0.000880
47 0.057940 0.021273
48 0.007530 0.042879
49 0.000661 0.000632
50 0.005365 0.000252
51 0.062190 0.242925
52 0.002174 0.170648
53 0.002029 0.000216
54 0.008213 0.132959
55 0.014590 0.036908
56 0.001328 0.000189
57 0.000596 0.021828
58 0.002428 0.000161
59 0.022250 0.002319
60 0.000363 0.000220
61 0.004164 0.000065
62 0.000978 0.000229
63 0.000213 0.027424
64 0.000435 0.210934
65 0.036160 0.049647
66 0.001075 0.014757
67 0.023450 0.000127
68 0.000285 0.009107
69 0.000073 0.001742
70 0.002151 0.000031
71 0.000249 0.012602
72 0.000550 0.319398
73 0.000310 0.000080
74 0.000639 0.000063
75 0.000050 0.000099
76 0.000256 0.000521
77 0.000498 0.000518
78 0.000088 0.000098
79 0.000365 0.001257
80 0.000079 0.000100
81 0.000065 0.000131
82 0.000126 0.004159
83 0.000139 0.202090
84 0.000419 0.000043
85 0.000164 0.000994
86 0.000067 0.000262
87 0.000326 0.000062
88 0.000142 0.000059
89 0.000171 0.000058
90 0.000052 0.000471
91 0.000152 0.004269
92 0.000123 0.000048
93 0.000067 0.000173
94 0.000144 0.000022
95 0.000086 0.000486
96 0.000068 0.000055
97 0.000117 0.000212
98 0.000111 0.000060
99 0.000063 0.000058
100 0.000075 0.000019
... ... ...
total 102
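
To read a table like this at a glance, the per-step gap can be summarized. The sketch below takes two `{step: loss}` dictionaries of the same shape that `parse_paddle_log` and `parse_swift_log` return and reports the mean absolute difference plus the steps with the largest gaps. The function name `summarize_gap` is hypothetical, and the sample data is just the first five rows of the table above.

```python
from typing import Dict, List, Tuple


def summarize_gap(
    swift: Dict[int, float],
    paddle: Dict[int, float],
    top_k: int = 3,
) -> Tuple[float, List[Tuple[int, float]]]:
    """Return (mean absolute difference, top_k (step, difference) pairs)."""
    common = sorted(set(swift) & set(paddle))
    diffs = [(step, abs(swift[step] - paddle[step])) for step in common]
    # Largest gaps first.
    worst = sorted(diffs, key=lambda item: item[1], reverse=True)[:top_k]
    mean = sum(d for _, d in diffs) / len(diffs) if diffs else 0.0
    return mean, worst


# First five rows of the table above.
SWIFT_SAMPLE = {1: 5.953, 2: 5.730, 3: 5.861, 4: 5.663, 5: 4.272}
PADDLE_SAMPLE = {1: 6.290356, 2: 9.247890, 3: 6.163038, 4: 4.679254, 5: 6.301551}

MEAN_GAP, WORST_STEPS = summarize_gap(SWIFT_SAMPLE, PADDLE_SAMPLE)
print(f"mean |diff| = {MEAN_GAP:.4f}")
for step, gap in WORST_STEPS:
    print(f"step {step}: |diff| = {gap:.4f}")
```

Feeding it the full dictionaries parsed from the two logs would give the same summary over all common steps.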

@paddle-bot

paddle-bot Bot commented Mar 23, 2026

Thanks for your contribution!

@codecov-commenter

codecov-commenter commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 51.09127% with 493 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@95c3c8a). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...addleformers/transformers/intern_lm2_5/modeling.py 47.90% 385 Missing ⚠️
paddleformers/transformers/intern/configuration.py 18.00% 41 Missing ⚠️
...ddleformers/transformers/intern_lm2_5/tokenizer.py 69.52% 32 Missing ⚠️
paddleformers/transformers/intern/modeling.py 58.62% 24 Missing ⚠️
...formers/transformers/intern_lm2_5/configuration.py 79.54% 9 Missing ⚠️
paddleformers/cli/utils/llm_utils.py 0.00% 2 Missing ⚠️

❌ Your patch status has failed because the patch coverage (51.09%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #4131   +/-   ##
==========================================
  Coverage           ?   47.11%           
==========================================
  Files              ?      482           
  Lines              ?    91611           
  Branches           ?        0           
==========================================
  Hits               ?    43165           
  Misses             ?    48446           
  Partials           ?        0           


@learncat163
Author

/re-run all-failed

module.weight[module._padding_idx].zero_()

@classmethod
def _gen_aoa_config(cls, config: InternLM25Config):
Collaborator

Disabling this is not recommended; see the following implementation for reference:
def _gen_aoa_config(cls, config: InternLM25Config):
    model_prefix = cls.base_model_prefix + "." if cls != cls.base_model_class else ""
    aoa_statements = [
        f"model.tok_embeddings.weight -> {model_prefix}tok_embeddings.weight",
        f"model.norm.weight -> {model_prefix}norm.weight",
        f"model.layers.$LAYER_ID.attention_norm.weight -> {model_prefix}layers.$LAYER_ID.attention_norm.weight",
        f"model.layers.$LAYER_ID.ffn_norm.weight -> {model_prefix}layers.$LAYER_ID.ffn_norm.weight",
    ]
    aoa_statements.extend([
        f"model.layers.$LAYER_ID.attention.{w}.weight^T -> {model_prefix}layers.$LAYER_ID.attention.{w}.weight"
        for w in ["wqkv", "wo"]
    ])
    aoa_statements.extend([
        f"model.layers.$LAYER_ID.feed_forward.{w}.weight^T -> {model_prefix}layers.$LAYER_ID.feed_forward.{w}.weight"
        for w in ["w1", "w2", "w3"]
    ])
    if cls != cls.base_model_class:
        if getattr(config, "tie_word_embeddings", False):
            aoa_statements.append("model.tok_embeddings.weight -> output.weight")
        else:
            aoa_statements.append("output.weight^T -> output.weight")
    return {"aoa_statements": aoa_statements}

)

if attention_mask is not None and attention_mask.ndim == 4:
    if attention_mask.max() != 0:
Collaborator

Training raises an error here; suggest removing this check and using the 4D mask directly.

("Gemma3", "gemma3_text"),
("Glm4vMoe", "glm4v_moe"),
("GlmOcr", "glm_ocr"),
("InternLM2", "intern_lm2_5"),
Collaborator

Should this be InternLM25?

logging_steps: 1
gradient_accumulation_steps: 4
logging_dir: ./vdl_log
output_dir: ./checkpoints/qwen3-sft-full
Collaborator

This should not say qwen3.

Comment thread tests/integration_test/interlm_sft.sh Outdated
# limitations under the License.

# TODO: initially not enabled in .github/workflows/fleet-model-test.yml, to avoid blocking the pipeline outright
# TODO: the loss comparison materials will be submitted along with the PR
Collaborator

Please confirm whether the TODO comments should be kept.

@a31413510
Collaborator

1. The copyright year is wrong.
2. All Chinese comments need to be changed to English.

@Paddle-CI-Bot

Paddle-CI-Bot commented May 21, 2026

PaddleFormers Log Analysis

Run #26243605116 · Attempt 1

Log analysis report

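| Pipeline | Issue label | Suggested fix |
|----------|-------------|---------------|
| Fleet Model Test | Unit test bug (exit code 250 / libuv Assertion) | GLM4.5 pre-train (Grouped GEMM) exits with code 250. This is a known issue; a rerun is recommended. If the rerun still fails, investigate the compatibility between the libuv uv__finish_close Assertion and the glm45_pt_grouped_gemm.yaml configuration. |
| Fleet Model Test | Unit test bug (HFValidationError) | For Qwen vl sft, output_dir=/workspace/checkpoints/qwen3vl-sft is treated as a HuggingFace repo id, which causes a crash (exit -6 / SIGABRT). The handling of local paths passed to hf_hub_try_to_load_from_cache needs to be fixed. Qwen vl lora depends on the sft checkpoint, so it fails in cascade after sft fails. |
| CI_ILUVATAR | Environment issue (container failed) | The Run CI unittest step reports "failed to run script step: [object Object]": the container on the Iluvatar machine failed to start. This is unrelated to the changes in this PR; contact the CI maintainers to check the Docker environment on the Iluvatar machine. |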

Failed test cases:

Fleet Model Test (H20, multi-card):
  - GLM4.5 pre-train (Grouped GEMM)   → exit_code=250, libuv uv__finish_close Assertion failed
  - Qwen vl sft                        → exit code -6 (SIGABRT), HFValidationError: a local path was passed to hf_hub validation
  - Qwen vl lora                       → exit code 241, cascading failure (the qwen3vl-sft checkpoint it depends on was not generated)

CI_ILUVATAR:
  - iluvatar_test / Run CI unittest    → container failed: failed to run script step: [object Object]

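Root cause analysis:

The main change in this PR (#4131) adds InternLM2.5 model support (the new paddleformers/transformers/intern_lm2_5/ directory and accompanying tests) and also appends one dependency to tests/requirements.txt. The CI failures are not directly related to the code in this PR. Details:

  1. GLM4.5 Grouped GEMM (exit 250): after the TCP store is established, libuv triggers Assertion 'handle->flags & UV_HANDLE_CLOSING' failed. This is a known intermittent bug in the PaddlePaddle low-level stack and was not introduced by this PR.

  2. Qwen vl sft (exit -6 / SIGABRT + HFValidationError): the output_dir=/workspace/checkpoints/qwen3vl-sft set in qwen3vl_sft.sh (a local absolute path) is passed to try_to_load_from_cache in huggingface_hub. That function validates the repo id format and does not accept absolute paths, so the process terminates with SIGABRT. qwen3vl_lora uses the previous step's output directory as model_name_or_path, so it fails in cascade because sft did not finish.

  3. CI_ILUVATAR container failed: the custom container implementation of the self-hosted runner on the Iluvatar machine reports [object Object]. This is an infrastructure issue unrelated to the code.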


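Suggested fixes:

  1. GLM4.5 Grouped GEMM: rerun directly. Exit 250 is a known intermittent issue and a rerun usually passes.

  2. Qwen vl sft HFValidationError
    Locate paddleformers/cli/utils/llm_utils.py (modified in this PR), or the code that calls push_to_hub / try_to_load_from_cache after training ends, and add a local-path check before passing in model_name_or_path. If it is an absolute path, skip the huggingface_hub validation branch. Example:

    import os
    if not os.path.isabs(output_dir):
        hf_hub_try_to_load_from_cache(repo_id=output_dir, ...)

    Alternatively, check whether push_to_hub in qwen3vl_sft.yaml was accidentally set to True, and change it to False.

  3. CI_ILUVATAR: notify the CI maintainers (@nepeplwu / @lugimzzz) to check the self-hosted runner container configuration on the Iluvatar machine. This failure is unrelated to this PR and does not block merging.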
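
The local-path guard suggested in fix 2 can be sketched a little more fully. This is a heuristic illustration only: the helper name `looks_like_local_path` is hypothetical and is not part of PaddleFormers or huggingface_hub, and the rule it applies (absolute path, explicit relative prefix, or existing directory) is an assumption about what should count as local.

```python
import os


def looks_like_local_path(name_or_path: str) -> bool:
    """Heuristically decide whether a model reference is a local path.

    Hypothetical helper: a caller could skip any hub cache lookup when
    this returns True, so that local checkpoints never reach repo id
    validation.
    """
    return (
        os.path.isabs(name_or_path)
        or name_or_path.startswith(("./", "../"))
        or os.path.isdir(name_or_path)
    )


if __name__ == "__main__":
    for candidate in ("/workspace/checkpoints/qwen3vl-sft", "Qwen/Qwen3-0.6B-Base"):
        print(candidate, looks_like_local_path(candidate))
```

Checking `os.path.isdir` in addition to `os.path.isabs` also covers relative output directories that already exist, which the one-line check above would still send to the hub lookup.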


🔄 Automatically updated after each re-run
