Support InternLM2.5 with model alignment by learncat163 · Pull Request #4131 · PaddlePaddle/PaddleFormers

learncat163 · 2026-03-23T12:20:35Z

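Supports both InternLM 2.0 and 2.5.

The older PR for the 2.0 version is consolidated in #4018.

To avoid merge conflicts and follow-up trouble, the 2.0 and 2.5 versions are implemented together here.

Migrating the InternLM2.5 model to PaddleFormers:

1. Model functionality and alignment tests (`tests/transformers/intern_lm2_5/test_modeling.py`)

Adds functional test classes such as InternLM25CompatibilityTest.
Verifies that torch and paddle inference results align (loss tolerance 1e-2; the first 10 inferred token ids are identical).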
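
The two alignment criteria above can be sketched as a small framework-free helper. This is only an illustration: the function name `check_alignment`, its parameters, and the sample numbers are hypothetical, and the tolerance is assumed to be an absolute difference, which the description does not state explicitly.

```python
from typing import Sequence


def check_alignment(
    torch_loss: float,
    paddle_loss: float,
    torch_ids: Sequence[int],
    paddle_ids: Sequence[int],
    loss_tol: float = 1e-2,   # assumed to be an absolute tolerance
    num_tokens: int = 10,     # only the first N generated ids are compared
) -> bool:
    """Return True when the loss gap is within tolerance and the leading ids match."""
    loss_ok = abs(torch_loss - paddle_loss) <= loss_tol
    ids_ok = list(torch_ids[:num_tokens]) == list(paddle_ids[:num_tokens])
    return loss_ok and ids_ok


# Made-up numbers, only to exercise the helper.
print(check_alignment(2.3101, 2.3150, [1, 5, 9, 4], [1, 5, 9, 4]))  # True
print(check_alignment(2.3101, 2.3350, [1, 5, 9, 4], [1, 5, 9, 4]))  # False
```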

2. Converted model locations

- Original 1.8B model: https://aistudio.baidu.com/modelsdetail/45123
- Paddle version of the 1.8B model: https://aistudio.baidu.com/modelsdetail/45124

Comparison with ms-swift

1. ms-swift configuration

MODEL_PATH="/mnt/learncat/llm/internlm/internlm2_5-1_8b-chat-raw"
TRAIN_DATA="/mnt/learncat/code/swift/train/train-msg.jsonl"
OUTPUT_DIR="/tmp/internlm2_5-sft-full-ms-swift"

# Create the output directory
mkdir -p "$OUTPUT_DIR"

echo "========================================"
echo "Starting ms-swift full-parameter fine-tuning test"
echo "Model: internlm2_5-1_8b-chat"
echo "========================================"

# Activate the swift environment and run
source /mnt/learncat/code/swift/.venv/bin/activate && \
swift sft \
    --model "$MODEL_PATH" \
    --model_type internlm2 \
    --tuner_type full \
    --template default \
    --dataset "$TRAIN_DATA" \
    --max_length 512 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-5 \
    --max_steps 500 \
    --warmup_steps 5 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --bf16 true \
    --seed 23 \
    --output_dir "$OUTPUT_DIR" \
    --logging_steps 1 \
    --save_strategy no \
    --gradient_checkpointing false \
    2>&1 | tee "$OUTPUT_DIR/training.log"

2. paddleformers-cli configuration

### data
train_dataset_type: erniekit
eval_dataset_type: erniekit
train_dataset_path: ./tests/fixtures/dummy/sft/train.jsonl
train_dataset_prob: "1.0"
eval_dataset_path: ./tests/fixtures/dummy/sft/eval.jsonl
eval_dataset_prob: "1.0"
max_seq_len: 512
packing: false
dataloader_shuffle: false
mix_strategy: concat
template_backend: custom
template: internlm2_5
### model
# placeholder path; override via update_training_args in test or CLI --model_name_or_path
#model_name_or_path: Qwen/Qwen3-0.6B-Base
model_name_or_path: /mnt/learncat/llm/internlm/internlm2_5-1_8b-chat-paddle
_attn_implementation: flashmask


### finetuning
# base
stage: SFT
fine_tuning: full
seed: 23
do_train: true
do_eval: true
per_device_eval_batch_size: 1
per_device_train_batch_size: 1
num_train_epochs: 1
max_steps: 500
eval_steps: 1000
evaluation_strategy: steps
save_steps: 100000
save_strategy: steps
logging_steps: 1
gradient_accumulation_steps: 1
logging_dir: ./vdl_log
output_dir: ./checkpoints/qwen3-sft-full
disable_tqdm: true
eval_accumulation_steps: 16


# train
warmup_steps: 5
learning_rate: 1.0e-5

# performance
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
sharding: stage1
recompute_granularity: full
recompute_method: uniform
recompute_num_layers: 1
bf16: true
fp16_opt_level: O1
unified_checkpoint: false
# Commenting this out skips the save stage
# save_checkpoint_format: flex_checkpoint
load_checkpoint_format: sharding_io
continue_training: false

3. Loss diff script

#!/usr/bin/env python3
import argparse
import os
import re
from pathlib import Path
from typing import Dict, Tuple


def parse_paddle_log(file_path: str) -> Dict[int, float]:
    loss_dict = {}
    loss_pattern = re.compile(r"loss:\s*([0-9\.eE+-]+)")
    step_pattern = re.compile(r"global_step:\s*(\d+)")

    with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            if "loss:" in line and "global_step:" in line:
                loss_match = loss_pattern.search(line)
                step_match = step_pattern.search(line)
                if loss_match and step_match:
                    loss_val = float(loss_match.group(1))
                    step_val = int(step_match.group(1))
                    loss_dict[step_val] = loss_val
    return loss_dict


def parse_swift_log(file_path: str) -> Dict[int, float]:
    loss_dict = {}
    loss_pattern = re.compile(r"'loss':\s*'([0-9\.eE+-]+)'")
    step_pattern = re.compile(r"'global_step/max_steps':\s*'(\d+)/\d+'")

    with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            if "'loss'" in line and "'global_step/max_steps'" in line:
                loss_match = loss_pattern.search(line)
                step_match = step_pattern.search(line)
                if loss_match and step_match:
                    loss_val = float(loss_match.group(1))
                    step_val = int(step_match.group(1))
                    loss_dict[step_val] = loss_val
    return loss_dict


def find_log_files(directory: str) -> Tuple[str, str]:
    dir_path = Path(directory)
    paddle_files = list(dir_path.glob("paddle*.log"))
    swift_files = list(dir_path.glob("swift*.log")) + list(dir_path.glob("swfit*.log"))

    if not paddle_files:
        raise FileNotFoundError(f"paddle log not found: {directory}")
    if not swift_files:
        raise FileNotFoundError(f"swift log not found: {directory}")

    return str(paddle_files[0]), str(swift_files[0])


def generate_markdown_table(
    paddle_dict: Dict[int, float],
    swift_dict: Dict[int, float],
    max_rows: int = 100
) -> str:
    common_steps = sorted(set(paddle_dict.keys()) & set(swift_dict.keys()))
    if not common_steps:
        return "no common steps"

    display_steps = common_steps[:max_rows]

    lines = [
        "# Loss",
        "",
        f"common: {len(common_steps)}, show: {len(display_steps)}",
        "",
        "| step | ms-swift | paddle |",
        "|------|----------|--------|"
    ]

    for step in display_steps:
        swift_loss = swift_dict[step]
        paddle_loss = paddle_dict[step]
        lines.append(f"| {step} | {swift_loss:.6f} | {paddle_loss:.6f} |")

    if len(common_steps) > max_rows:
        lines.append(f"| ... | ... | ... |")
        lines.append(f"_total {len(common_steps)}_")

    return "\n".join(lines)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dir", type=str, default="log/lm25")
    parser.add_argument("--max_rows", type=int, default=100)
    parser.add_argument("--output", type=str, default=None)
    args = parser.parse_args()

    paddle_log, swift_log = find_log_files(args.dir)
    print(f"Paddle: {paddle_log}")
    print(f"Swift:  {swift_log}")

    paddle_dict = parse_paddle_log(paddle_log)
    swift_dict = parse_swift_log(swift_log)

    print(f"Paddle: {len(paddle_dict)}")
    print(f"Swift:  {len(swift_dict)}")

    markdown = generate_markdown_table(paddle_dict, swift_dict, args.max_rows)

    output_path = args.output
    if output_path is None:
        output_path = os.path.join(args.dir, "diff-loss.md")

    os.makedirs(os.path.dirname(output_path) if os.path.dirname(output_path) else ".", exist_ok=True)

    with open(output_path, "w", encoding="utf-8") as f:
        f.write(markdown)

    print(f"\nOutput: {output_path}")
    print("\n" + markdown)


if __name__ == "__main__":
    main()

4. Loss comparison results

common: 102, show: 100

step	ms-swift	paddle
1	5.953000	6.290356
2	5.730000	9.247890
3	5.861000	6.163038
4	5.663000	4.679254
5	4.272000	6.301551
6	4.699000	3.590961
7	4.390000	8.666958
8	3.389000	7.602499
9	3.542000	8.734646
10	2.256000	8.286204
11	3.168000	7.052961
12	2.073000	3.105869
13	1.677000	4.067309
14	2.472000	5.547118
15	1.549000	2.685346
16	0.719400	5.332623
17	1.491000	2.459121
18	1.392000	0.934952
19	1.072000	2.983633
20	0.747800	4.115916
21	0.683300	0.988917
22	0.365400	0.432274
23	0.648900	1.131235
24	0.041450	1.985609
25	0.266300	2.602725
26	0.285700	0.345454
27	0.550200	0.031843
28	0.436700	3.117400
29	0.143000	1.994643
30	0.177000	0.015647
31	0.045770	0.202331
32	0.095660	0.335955
33	0.160200	0.090093
34	0.095230	0.302440
35	0.044270	0.477116
36	0.004348	1.127114
37	0.021820	0.718798
38	0.055190	0.009806
39	0.061650	0.006342
40	0.002891	0.007223
41	0.023180	0.937800
42	0.010910	0.000712
43	0.012420	0.000518
44	0.005803	0.000433
45	0.002461	0.000340
46	0.002702	0.000880
47	0.057940	0.021273
48	0.007530	0.042879
49	0.000661	0.000632
50	0.005365	0.000252
51	0.062190	0.242925
52	0.002174	0.170648
53	0.002029	0.000216
54	0.008213	0.132959
55	0.014590	0.036908
56	0.001328	0.000189
57	0.000596	0.021828
58	0.002428	0.000161
59	0.022250	0.002319
60	0.000363	0.000220
61	0.004164	0.000065
62	0.000978	0.000229
63	0.000213	0.027424
64	0.000435	0.210934
65	0.036160	0.049647
66	0.001075	0.014757
67	0.023450	0.000127
68	0.000285	0.009107
69	0.000073	0.001742
70	0.002151	0.000031
71	0.000249	0.012602
72	0.000550	0.319398
73	0.000310	0.000080
74	0.000639	0.000063
75	0.000050	0.000099
76	0.000256	0.000521
77	0.000498	0.000518
78	0.000088	0.000098
79	0.000365	0.001257
80	0.000079	0.000100
81	0.000065	0.000131
82	0.000126	0.004159
83	0.000139	0.202090
84	0.000419	0.000043
85	0.000164	0.000994
86	0.000067	0.000262
87	0.000326	0.000062
88	0.000142	0.000059
89	0.000171	0.000058
90	0.000052	0.000471
91	0.000152	0.004269
92	0.000123	0.000048
93	0.000067	0.000173
94	0.000144	0.000022
95	0.000086	0.000486
96	0.000068	0.000055
97	0.000117	0.000212
98	0.000111	0.000060
99	0.000063	0.000058
100	0.000075	0.000019
...	...	...
total 102
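
To read the table more quantitatively, one option is to look at the absolute gap per step. The sketch below copies four rows from the table above; the helper names `abs_gaps` and `largest_gap` are hypothetical. Exact per-step agreement is not necessarily expected, since the two configurations shown earlier are not identical (for example, gradient_accumulation_steps is 4 in the ms-swift command and 1 in the paddleformers-cli config).

```python
# (step, ms-swift loss, paddle loss), copied from rows 1, 10, 50 and 100 of the table
rows = [
    (1, 5.953000, 6.290356),
    (10, 2.256000, 8.286204),
    (50, 0.005365, 0.000252),
    (100, 0.000075, 0.000019),
]


def abs_gaps(rows):
    """Absolute loss difference per step."""
    return [(step, abs(swift - paddle)) for step, swift, paddle in rows]


def largest_gap(rows):
    """(step, gap) pair with the largest absolute difference."""
    return max(abs_gaps(rows), key=lambda item: item[1])


for step, gap in abs_gaps(rows):
    print(f"step {step}: |ms-swift - paddle| = {gap:.6f}")
print("largest gap at step", largest_gap(rows)[0])  # largest gap at step 10
```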

paddle-bot · 2026-03-23T12:20:47Z

Thanks for your contribution!

codecov-commenter · 2026-03-24T02:08:03Z

Codecov Report

❌ Patch coverage is 51.09127% with 493 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@95c3c8a). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
...addleformers/transformers/intern_lm2_5/modeling.py	47.90%	385 Missing ⚠️
paddleformers/transformers/intern/configuration.py	18.00%	41 Missing ⚠️
...ddleformers/transformers/intern_lm2_5/tokenizer.py	69.52%	32 Missing ⚠️
paddleformers/transformers/intern/modeling.py	58.62%	24 Missing ⚠️
...formers/transformers/intern_lm2_5/configuration.py	79.54%	9 Missing ⚠️
paddleformers/cli/utils/llm_utils.py	0.00%	2 Missing ⚠️

❌ Your patch status has failed because the patch coverage (51.09%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #4131   +/-   ##
==========================================
  Coverage           ?   47.11%           
==========================================
  Files              ?      482           
  Lines              ?    91611           
  Branches           ?        0           
==========================================
  Hits               ?    43165           
  Misses             ?    48446           
  Partials           ?        0

learncat163 · 2026-03-24T02:25:33Z

/re-run all-failed

a31413510 · 2026-05-19T08:09:59Z

+                module.weight[module._padding_idx].zero_()
+
+    @classmethod
+    def _gen_aoa_config(cls, config: InternLM25Config):


Disabling this is not recommended; see the following implementation for reference:
def _gen_aoa_config(cls, config: InternLM25Config):
    model_prefix = cls.base_model_prefix + "." if cls != cls.base_model_class else ""
    aoa_statements = [
        f"model.tok_embeddings.weight -> {model_prefix}tok_embeddings.weight",
        f"model.norm.weight -> {model_prefix}norm.weight",
        f"model.layers.$LAYER_ID.attention_norm.weight -> {model_prefix}layers.$LAYER_ID.attention_norm.weight",
        f"model.layers.$LAYER_ID.ffn_norm.weight -> {model_prefix}layers.$LAYER_ID.ffn_norm.weight",
    ]
    aoa_statements.extend([
        f"model.layers.$LAYER_ID.attention.{w}.weight^T -> {model_prefix}layers.$LAYER_ID.attention.{w}.weight"
        for w in ["wqkv", "wo"]
    ])
    aoa_statements.extend([
        f"model.layers.$LAYER_ID.feed_forward.{w}.weight^T -> {model_prefix}layers.$LAYER_ID.feed_forward.{w}.weight"
        for w in ["w1", "w2", "w3"]
    ])
    if cls != cls.base_model_class:
        if getattr(config, "tie_word_embeddings", False):
            aoa_statements.append("model.tok_embeddings.weight -> output.weight")
        else:
            aoa_statements.append("output.weight^T -> output.weight")
    return {"aoa_statements": aoa_statements}

a31413510 · 2026-05-19T08:15:31Z

+            )
+
+        if attention_mask is not None and attention_mask.ndim == 4:
+            if attention_mask.max() != 0:


Training fails here; suggest removing this check and using the 4D mask directly.

a31413510 · 2026-05-19T10:18:26Z

        ("Gemma3", "gemma3_text"),
        ("Glm4vMoe", "glm4v_moe"),
        ("GlmOcr", "glm_ocr"),
+        ("InternLM2", "intern_lm2_5"),


Should this be InternLM25?

a31413510 · 2026-05-19T10:19:52Z

+logging_steps: 1
+gradient_accumulation_steps: 4
+logging_dir: ./vdl_log
+output_dir: ./checkpoints/qwen3-sft-full


This should not say qwen3.

a31413510 · 2026-05-19T10:23:36Z

+# limitations under the License.
+
+# TODO: not enabled in .github/workflows/fleet-model-test.yml for now, to avoid blocking the pipeline outright
+# TODO: loss comparison materials will be submitted together with the PR


Please confirm whether the TODO comments should be kept.

a31413510 · 2026-05-19T10:24:31Z

1. The copyright year is wrong.
2. All Chinese comments need to be changed to English.

Paddle-CI-Bot · 2026-05-21T10:04:01Z

PaddleFormers Log Analysis

Run #28777026633 · Attempt 1

Log analysis report

Pipeline	Issue label	Suggested fix	Log snippet
Unittest GPU CI	dtype mismatch bug	In `_get_unpad_data`, cast `cu_seqlens` from int64 to int32, or call `cumsum` with `dtype="int32"`	Failing code

Failed test case:

tests/transformers/intern_lm2/test_modeling.py::InternLM2ModelTest::test_flash_attention_2_with_padding_mask

Root cause analysis:
In paddleformers/transformers/intern_lm2/modeling.py, newly added by this PR, _get_unpad_data sums attention_mask with dtype="int32" and then applies paddle.cumsum, but cumsum promotes the result to int64 by default, so cu_seqlens_q/cu_seqlens_k actually have dtype int64. _C_ops.flash_attn_unpadded requires this argument to be strictly int32, so it raises InvalidArgument: dtype():9(int64) != CppTypeToDataType<T>::Type():7(int32).

Suggested fix:

In the _get_unpad_data function of paddleformers/transformers/intern_lm2/modeling.py, explicitly cast cu_seqlens back to int32:

def _get_unpad_data(attention_mask):
    seqlens_in_batch = attention_mask.sum(axis=-1, dtype="int32")
    indices = paddle.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    max_seqlen_in_batch = seqlens_in_batch.max().item()
-   cu_seqlens = F.pad(paddle.cumsum(seqlens_in_batch, axis=0), (1, 0))
+   cu_seqlens = F.pad(paddle.cumsum(seqlens_in_batch, axis=0), (1, 0)).cast("int32")
    return (indices, cu_seqlens, max_seqlen_in_batch)

Also check the cu_seqlens_q produced by paddle.arange(..., dtype=paddle.int32) at line 581 of _flash_attention_forward, and make sure the cu_seqlens_q/k passed to flash_attn_varlen_func are int32 on every path. If intern_lm2_5/modeling.py contains the same pattern, fix it there as well.
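
For readers unfamiliar with what cu_seqlens holds, here is a plain-Python sketch of the same bookkeeping, written with lists instead of Paddle tensors. It is an illustration under that assumption, not the PR's code, and the name `get_unpad_data` is only a stand-in. Python integers have no fixed width, so the int32/int64 mismatch itself cannot be reproduced here; the point is that cu_seqlens is a prefix sum of per-sequence lengths with a leading zero, and those offsets are what the kernel expects as int32.

```python
from itertools import accumulate


def get_unpad_data(attention_mask):
    """attention_mask: list of rows of 0/1 values, one row per sequence."""
    seqlens = [sum(row) for row in attention_mask]           # real tokens per sequence
    flat = [value for row in attention_mask for value in row]
    indices = [i for i, value in enumerate(flat) if value]   # flat positions of real tokens
    cu_seqlens = [0] + list(accumulate(seqlens))             # prefix sum with a leading zero
    return indices, cu_seqlens, max(seqlens)


mask = [[1, 1, 1, 0],
        [1, 1, 0, 0]]
indices, cu_seqlens, max_seqlen = get_unpad_data(mask)
print(indices)     # [0, 1, 2, 4, 5]
print(cu_seqlens)  # [0, 3, 5]
print(max_seqlen)  # 3
```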

risemeup1111

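This round of review is complete. There are still blocking issues that need to be fixed first; details are in the inline comments. They mainly concern the weight-loading and parameter-registration path of the intern compatibility proxy, and a runtime failure on the InternLM2.5 flash attention path with a mask. There is also a tokenizer issue where returned fields have inconsistent lengths, which should be fixed at the same time.

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.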

risemeup1111 · 2026-06-28T03:17:30Z

+        ImplModel = getattr(_impl_module, _cls_name)
+
+        # Store the actual implementation
+        self._impl = ImplModel(config)


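Priority: P1
The real model is placed inside the proxy object here, but _impl goes through the underscore branch of the current __setattr__, which calls object.__setattr__, so it is not registered as a Paddle sublayer; writing directly to __dict__ afterwards likewise does not register parameters such as model/output/layers. PretrainedModel.from_pretrained first computes the expected keys from model.state_dict(), and AutoModel loading of the public InternLM2.5 config (model_type: internlm2, architectures: InternLM2ForCausalLM) goes through this proxy, so weight loading, saving, and optimizer parameter enumeration cannot see the real parameters.

Please do not use a transparent proxy with unregistered sublayers. If the wrapper is kept, it needs to explicitly register the sublayer and delegate the weight-related interfaces, or AutoModel should return the concrete implementation class directly. For example:

impl = ImplModel(config)
self.add_sublayer("impl", impl)
self._impl = impl

# If the wrapper's public state_dict keys must remain the concrete model's keys, keep delegating explicitly:
def state_dict(self, *args, **kwargs):
    return self._impl.state_dict(*args, **kwargs)

def set_state_dict(self, state_dict, *args, **kwargs):
    return self._impl.set_state_dict(state_dict, *args, **kwargs)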
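
The registration problem can be reproduced in miniature without any framework. The sketch below uses a toy `ToyLayer` class that only imitates the pattern described in the comment (underscore-prefixed attributes bypass registration); it is not Paddle's Layer, and `Impl`, `HiddenProxy`, and `RegisteredProxy` are hypothetical names.

```python
class ToyLayer:
    """Tiny stand-in for a framework layer; NOT Paddle's real Layer."""

    def __init__(self):
        object.__setattr__(self, "_sublayers", {})
        object.__setattr__(self, "_params", {})

    def __setattr__(self, name, value):
        # Mimics the reviewed pattern: underscore names skip registration entirely.
        if name.startswith("_"):
            object.__setattr__(self, name, value)
            return
        if isinstance(value, ToyLayer):
            self._sublayers[name] = value
        object.__setattr__(self, name, value)

    def add_sublayer(self, name, layer):
        self._sublayers[name] = layer
        return layer

    def state_dict(self, prefix=""):
        out = {prefix + key: value for key, value in self._params.items()}
        for name, sub in self._sublayers.items():
            out.update(sub.state_dict(prefix + name + "."))
        return out


class Impl(ToyLayer):
    def __init__(self):
        super().__init__()
        self._params["weight"] = [0.0, 0.0]


class HiddenProxy(ToyLayer):
    def __init__(self):
        super().__init__()
        self._impl = Impl()  # never registered, so its parameters stay invisible


class RegisteredProxy(ToyLayer):
    def __init__(self):
        super().__init__()
        impl = Impl()
        self.add_sublayer("impl", impl)
        self._impl = impl

    def state_dict(self, prefix=""):
        # Delegate so the public keys are the implementation's own keys.
        return self._impl.state_dict(prefix)


print(HiddenProxy().state_dict())      # {}
print(RegisteredProxy().state_dict())  # {'weight': [0.0, 0.0]}
```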

risemeup1111 · 2026-06-28T03:18:01Z

+            from ..intern_lm2_5 import modeling as _impl_module
+        else:
+            logger.info("Detected InternLM2 2.0, loading 2.0 implementation")
+            from ..intern_lm2 import modeling as _impl_module


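Priority: P1
This 2.0 branch imports ..intern_lm2, but the repository currently only has intern/ and intern_lm2_5/; there is no paddleformers.transformers.intern_lm2 package. Loading any ordinary model_type: internlm2 config that lacks the AutoModelForSequenceClassification marker reaches this branch and raises ModuleNotFoundError immediately. This also conflicts with the compatibility goal stated in the comment, "supports both 2.0 and 2.5".

Please add a real InternLM2 2.0 implementation package, or do not declare/map the 2.0 proxy capability. If this PR only supports 2.5, consider restricting the Auto mapping to paths that can be identified as 2.5, so ordinary internlm2 configs do not land on a nonexistent module. An example of the shape of the fix:

if config.is_version_2_5:
    from ..intern_lm2_5 import modeling as _impl_module
else:
    raise NotImplementedError("InternLM2 2.0 is not supported by this PaddleFormers implementation yet.")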

risemeup1111 · 2026-06-28T03:18:40Z

+    has_flash_attn = False
+
+try:
+    from ..intern.bert_padding_delte import index_first_axis, pad_input, unpad_input


Priority: P1
This imports the nonexistent ..intern.bert_padding_delte, so execution always falls through to the fallback below. The fallback pad_input(hidden_states, attention_mask) takes only 2 arguments, while _flash_attention_forward calls it at line 441 with 4 arguments: pad_input(attn_output_unpad, indices_q, batch_size, query_length). Whenever attn_implementation="flash_attention_2" is used with a padding mask, restoring the batched output raises TypeError, so the masked flash attention path is unusable for training and inference.

Please import the correct unpad/pad utilities, or implement the fallback with the same signature and semantics as the call site, and add a flash attention test with a padding mask. An example of the shape of the fix:

def pad_input(hidden_states, indices, batch_size, seqlen):
    output = paddle.zeros([batch_size * seqlen, *hidden_states.shape[1:]], dtype=hidden_states.dtype)
    output = paddle.scatter(output, indices, hidden_states)
    return output.reshape([batch_size, seqlen, *hidden_states.shape[1:]])

def unpad_input(hidden_states, attention_mask):
    indices, cu_seqlens, max_seqlen = _get_unpad_data(attention_mask)
    hidden_states = index_first_axis(hidden_states.reshape([-1, *hidden_states.shape[2:]]), indices)
    return hidden_states, indices, cu_seqlens, max_seqlen
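
The unpad/pad round trip that these helpers perform can be sketched with plain lists. This is a framework-free illustration, not the PR's code: `unpad` and `pad` are hypothetical names, and strings stand in for hidden-state vectors. Note that `pad` takes the same four pieces of information as the 4-argument call site described above.

```python
def unpad(hidden, mask):
    """Drop padded positions. hidden and mask are [batch][seqlen] nested lists."""
    flat_hidden = [item for row in hidden for item in row]
    flat_mask = [value for row in mask for value in row]
    indices = [i for i, value in enumerate(flat_mask) if value]
    return [flat_hidden[i] for i in indices], indices


def pad(unpadded, indices, batch_size, seqlen, fill=0):
    """Scatter the kept tokens back into a [batch_size][seqlen] layout."""
    flat = [fill] * (batch_size * seqlen)
    for value, i in zip(unpadded, indices):
        flat[i] = value
    return [flat[b * seqlen:(b + 1) * seqlen] for b in range(batch_size)]


hidden = [["a", "b", "c", "x"], ["d", "e", "x", "x"]]
mask = [[1, 1, 1, 0], [1, 1, 0, 0]]
tokens, indices = unpad(hidden, mask)
print(tokens)  # ['a', 'b', 'c', 'd', 'e']
print(pad(tokens, indices, batch_size=2, seqlen=4, fill="-"))
# [['a', 'b', 'c', '-'], ['d', 'e', '-', '-']]
```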

risemeup1111 · 2026-06-28T03:19:13Z

+        if token_ids_1 is None:
+            return [1] + ([0] * len(token_ids_0)) + [1]
+        return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]


Priority: P2
By default build_inputs_with_special_tokens only adds BOS (add_eos_token=False), but this code unconditionally marks an EOS as a special token as well; create_token_type_ids_from_sequences likewise counts one extra EOS. As a result, with return_special_tokens_mask=True or when token_type_ids are returned, the length is 1 longer than the actual input_ids, which breaks batching and alignment.

Please make the mask and token_type_ids reuse the length of the sequence that is actually built, for example:

Suggested change

-        if token_ids_1 is None:
-            return [1] + ([0] * len(token_ids_0)) + [1]
-        return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]
+        output = self.build_inputs_with_special_tokens(token_ids_0, token_ids_1)
+        if token_ids_1 is None:
+            prefix_len = 1 if self.add_bos_token else 0
+            suffix_len = 1 if self.add_eos_token else 0
+            return ([1] * prefix_len) + ([0] * len(token_ids_0)) + ([1] * suffix_len)
+        prefix_len = 1 if self.add_bos_token else 0
+        suffix_len = 1 if self.add_eos_token else 0
+        return ([1] * prefix_len) + ([0] * len(token_ids_0)) + ([0] * len(token_ids_1)) + ([1] * suffix_len)

risemeup1111 · 2026-06-28T03:20:01Z

+        eos = [self.eos_token_id]
+
+        if token_ids_1 is None:
+            return len(token_ids_0 + eos) * [0]
+        return len(token_ids_0 + eos + token_ids_1 + eos) * [0]


Priority: P2
This also unconditionally counts EOS in the length, but with the default add_eos_token=False the actual input_ids contain no EOS; for example, 3 tokens return 4 token_type_ids. Please generate them from the real output length of build_inputs_with_special_tokens, so that batched fields do not end up with inconsistent lengths.

Suggested change

-        eos = [self.eos_token_id]
-
-        if token_ids_1 is None:
-            return len(token_ids_0 + eos) * [0]
-        return len(token_ids_0 + eos + token_ids_1 + eos) * [0]
+        return [0] * len(self.build_inputs_with_special_tokens(token_ids_0, token_ids_1))
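
The length invariant behind both tokenizer comments can be shown with a toy class. This is a hypothetical minimal tokenizer, not the PR's implementation: only single-sequence inputs are modeled, and the token ids 1 and 2 for BOS and EOS are placeholders.

```python
class ToyTokenizer:
    """Hypothetical tokenizer; only the special-token bookkeeping is modeled."""

    bos_token_id = 1  # placeholder id
    eos_token_id = 2  # placeholder id

    def __init__(self, add_bos_token=True, add_eos_token=False):
        self.add_bos_token = add_bos_token
        self.add_eos_token = add_eos_token

    def build_inputs_with_special_tokens(self, ids):
        bos = [self.bos_token_id] if self.add_bos_token else []
        eos = [self.eos_token_id] if self.add_eos_token else []
        return bos + ids + eos

    def special_tokens_mask(self, ids):
        # Mark a position as special only if the token is really added.
        prefix = [1] if self.add_bos_token else []
        suffix = [1] if self.add_eos_token else []
        return prefix + [0] * len(ids) + suffix

    def token_type_ids(self, ids):
        # Derive the length from the sequence that is actually built.
        return [0] * len(self.build_inputs_with_special_tokens(ids))


tok = ToyTokenizer()  # defaults: BOS added, EOS not added
ids = [11, 12, 13]
print(tok.build_inputs_with_special_tokens(ids))  # [1, 11, 12, 13]
print(tok.special_tokens_mask(ids))               # [1, 0, 0, 0]
print(tok.token_type_ids(ids))                    # [0, 0, 0, 0]
```

With either flag toggled, all three outputs keep the same length, which is the property the review asks for.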

risemeup1111

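Re-reviewed the new commits. The previous round's issues with the 2.0 branch import, the flash attention padding helper, and the tokenizer return lengths appear to be addressed. However, the intern compatibility proxy still has two missing-delegation issues that affect the default loading and generation paths; details are in the new inline comments, and they should be fixed before merging.

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.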

risemeup1111 · 2026-06-28T07:13:25Z

+        return self._impl.parameters(include_sublayers=include_sublayers)
+
+    def named_parameters(self, prefix="", include_sublayers=True):
+        return self._impl.named_parameters(prefix=prefix, include_sublayers=include_sublayers)


Priority: P1
Delegation for state_dict/named_parameters has been added here, but from_pretrained defaults to load_checkpoint_format="flex_checkpoint", which calls model.sharded_state_dict() directly. When the proxy class does not override this method, Paddle's Layer.sharded_state_dict() is used, which recurses through registered sublayers and produces keys such as _impl.model... / _impl.output.... Meanwhile _gen_aoa_config() still returns the real implementation's model... / output... target keys, so dist.load_state_dict cannot find those targets, and loading an InternLM2.5 flex checkpoint through AutoModel still fails.

Please delegate the sharded state dict to the real implementation as well, and keep the _impl. prefix out of the public weight keys:

Suggested change

        return self._impl.named_parameters(prefix=prefix, include_sublayers=include_sublayers)
    def named_parameters(self, prefix="", include_sublayers=True):
        return self._impl.named_parameters(prefix=prefix, include_sublayers=include_sublayers)
    def sharded_state_dict(self, *args, **kwargs):
        return self._impl.sharded_state_dict(*args, **kwargs)
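
The key mismatch described above comes down to a set difference between the targets a weight mapping names and the keys the model exposes. A minimal sketch, assuming two representative key names taken from the review text and a hypothetical helper `unmatched_targets`:

```python
def unmatched_targets(target_keys, model_keys):
    """Targets named by a weight mapping that the model does not expose."""
    return sorted(set(target_keys) - set(model_keys))


# Keys the concrete implementation exposes (names taken from the review text).
impl_keys = ["model.tok_embeddings.weight", "output.weight"]
# Keys produced when a wrapper recurses through a sublayer registered as "_impl".
recursive_keys = ["_impl." + key for key in impl_keys]

print(unmatched_targets(impl_keys, recursive_keys))
# ['model.tok_embeddings.weight', 'output.weight']
print(unmatched_targets(impl_keys, impl_keys))
# []
```

Delegating the sharded state dict to the implementation corresponds to the second call: the exposed keys are the implementation's own, so no target is left unmatched.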

risemeup1111 · 2026-06-28T07:14:11Z

+    _auto_class = "AutoModelForCausalLM"
+    _tied_weights_keys = ["output.weight"]
+
+    def __init__(self, config: InternLM2Config):


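Priority: P1
The InternLM2ForCausalLM proxy does not override prepare_inputs_for_generation, and because that method is already defined on GenerationMixin, __getattr__ never forwards it to _impl. As a result, PretrainedModel.can_generate() treats the proxy as unable to generate, from_pretrained does not load generation_config.json, and save_pretrained does not save the generation config. Calls to generate() also use the generic input preparation logic instead of the cache_position / DynamicCache trimming logic in the InternLM2.5 implementation.

Please explicitly delegate the generation entry point on the CausalLM proxy to restore generation behavior consistent with the real implementation:

Suggested change

-    def __init__(self, config: InternLM2Config):
+    def __init__(self, config: InternLM2Config):
+        super().__init__(config)
+    def prepare_inputs_for_generation(self, *args, **kwargs):
+        return self._impl.prepare_inputs_for_generation(*args, **kwargs)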

risemeup1111

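Re-reviewed the new commits. The two blocking proxy delegation issues from the previous round have been fixed as expected, and no new blocking issues were found. There is one medium-priority suggestion about embedding/head APIs that are not fully delegated on the AutoModel proxy path; details are in the inline comment.

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.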

risemeup1111 · 2026-06-28T16:20:48Z

+                return object.__getattribute__(self, name)
+            except AttributeError:
+                return getattr(self._impl, name)
+        return getattr(self._impl, name)


Priority: P2
__getattr__ can only proxy attributes that the current class cannot find, but get_input_embeddings / set_input_embeddings / get_output_embeddings / set_output_embeddings are already defined on PretrainedModel, so they never reach it. Currently InternLM2Model.get_input_embeddings() follows the base-class logic and looks for model/embed_tokens, whereas the real implementation uses tok_embeddings, so it raises NotImplementedError; InternLM2ForQuestionAnswering behaves similarly. InternLM2ForCausalLM.get_output_embeddings() returns None because the base class only looks for lm_head, which is inconsistent with the output head of the real InternLM25ForCausalLM. Common downstream uses such as resize_token_embeddings, NEFTune hooks, weight tying, or embedding APIs in tests will behave inconsistently on the AutoModel proxy path.

Please explicitly delegate these public embedding/head APIs that already exist on the base class as well, and add a lightweight test for the InternLM2* proxy classes, for example:

# in InternLM2PretrainedModel
def get_input_embeddings(self):
    return self._impl.get_input_embeddings()

def set_input_embeddings(self, value):
    return self._impl.set_input_embeddings(value)

# in InternLM2ForCausalLM
def get_output_embeddings(self):
    return self._impl.get_output_embeddings()

def set_output_embeddings(self, new_embeddings):
    return self._impl.set_output_embeddings(new_embeddings)
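
The Python behavior this comment relies on is that `__getattr__` is consulted only after normal attribute lookup fails, so a method inherited from a base class shadows the proxied one. A minimal sketch with hypothetical classes (`Base`, `Impl`, `Proxy`, `DelegatingProxy`) standing in for the real ones:

```python
class Base:
    def get_input_embeddings(self):
        return "base lookup"


class Impl:
    def get_input_embeddings(self):
        return "impl embeddings"

    def extra(self):
        return "impl extra"


class Proxy(Base):
    def __init__(self):
        self._impl = Impl()

    def __getattr__(self, name):
        # Reached only when normal lookup (instance, class, bases) fails.
        return getattr(self._impl, name)


class DelegatingProxy(Proxy):
    def get_input_embeddings(self):
        # Explicit override, because the inherited method would win otherwise.
        return self._impl.get_input_embeddings()


print(Proxy().extra())                           # impl extra
print(Proxy().get_input_embeddings())            # base lookup
print(DelegatingProxy().get_input_embeddings())  # impl embeddings
```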

risemeup1111

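Re-reviewed the new commits. The previous round's suggestion about embedding/head APIs on the AutoModel proxy path has been addressed with explicit delegation and covered by proxy tests. No issues that should continue to block merging were found, and CI has passed.

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.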

liuhao2638

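Re-reviewed the latest commits after merging develop. The InternLM2.5 changes are consistent with the version approved in the previous round, and the earlier proxy delegation issues remain fixed. CI currently passes, and no new issues that should block merging were found.

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.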

risemeup1111

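Re-reviewed the latest commits. The earlier proxy delegation issues have been resolved or superseded by the current factory routing implementation. This round found blocking issues on the flash attention path in the newly added InternLM2 2.0 implementation, plus a tokenizer issue with inconsistent returned field lengths; details are in the inline comments. CI currently passes, but these specific paths still need fixes and additional tests.

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.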

risemeup1111 · 2026-07-06T03:44:26Z

+try:
+    from ..intern.bert_padding_delete import index_first_axis, pad_input, unpad_input
+except ImportError:
+    index_first_axis = pad_input = unpad_input = None


Priority: P1
This imports ..intern.bert_padding_delete, but the current PR has no such module; after the ImportError, index_first_axis / pad_input / unpad_input are all set to None. InternLM2FlashAttention2 then calls these helpers on the padding-mask path (for example pad_input(...) when restoring the output), so whenever a 2.0 model enables flash_attention_2 and the batch contains padding, it fails with TypeError: 'NoneType' object is not callable, and the flash mask path is unusable for training and inference.

Please provide working helpers in this file as the 2.5 implementation does, or import a shared helper that actually exists, and add a test for 2.0 flash attention with a padding mask. They can be filled in directly from the current 2.5 implementation:

Suggested change

-try:
-    from ..intern.bert_padding_delete import index_first_axis, pad_input, unpad_input
-except ImportError:
-    index_first_axis = pad_input = unpad_input = None
+def index_first_axis(tensor, index):
+    return tensor[index]
+def pad_input(hidden_states, indices, batch, seqlen):
+    output = paddle.zeros([batch * seqlen, *hidden_states.shape[1:]], dtype=hidden_states.dtype)
+    output = paddle.scatter(output, indices, hidden_states)
+    return output.reshape([batch, seqlen, *hidden_states.shape[1:]])
+def unpad_input(hidden_states, attention_mask):
+    indices, cu_seqlens, max_seqlen = _get_unpad_data(attention_mask)
+    hidden_states = index_first_axis(hidden_states.reshape([-1, *hidden_states.shape[2:]]), indices)
+    return hidden_states, indices, cu_seqlens, max_seqlen

risemeup1111 · 2026-07-06T03:45:12Z

+        # [1, 1847, 8, 4, 128]
+
+        query_states = qkv_states[..., : self.num_key_value_groups, :]
+        query_states = query_states.reshape([bsz, q_len, self.num_heads * self.num_key_value_groups, self.head_dim])


Priority: P1
The number of query heads is written here as self.num_heads * self.num_key_value_groups, which multiplies the head count by the group count a second time under a real GQA config. Public InternLM2/2.5 configs commonly use num_attention_heads=16, num_key_value_heads=8 or 32, 8, in which case qkv_states[..., : self.num_key_value_groups, :] actually contains only num_heads query heads. The current reshape requires 2x/4x as many elements, so the flash attention path fails at the reshape even without a padding mask. The eager attention above and the 2.5 flash attention both use -1 to flatten the actual query heads.

Please stay consistent with the eager/2.5 implementations and do not multiply by the group count again:

Suggested change

-        query_states = query_states.reshape([bsz, q_len, self.num_heads * self.num_key_value_groups, self.head_dim])
+        query_states = query_states.reshape([bsz, q_len, -1, self.head_dim])
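
The element-count argument can be checked with a few lines of arithmetic. This sketch assumes the packed layout implied by the shape comment in the hunk, [bsz, q_len, num_kv_heads, groups + 2, head_dim], and takes head_dim = 128 from that same comment; the helper names are hypothetical.

```python
def query_elements_per_token(num_heads, num_kv_heads, head_dim):
    """Elements in the query slice [..., :groups, :] for a single token."""
    groups = num_heads // num_kv_heads
    return num_kv_heads * groups * head_dim


def reshape_fits(available, heads, head_dim):
    """A reshape to [heads, head_dim] per token needs exactly this many elements."""
    return available == heads * head_dim


head_dim = 128  # assumed, from the "[1, 1847, 8, 4, 128]" shape comment
for num_heads, num_kv_heads in [(16, 8), (32, 8)]:
    groups = num_heads // num_kv_heads
    available = query_elements_per_token(num_heads, num_kv_heads, head_dim)
    original_ok = reshape_fits(available, num_heads * groups, head_dim)  # reviewed reshape
    fixed_ok = reshape_fits(available, num_heads, head_dim)              # what -1 resolves to
    print(num_heads, num_kv_heads, original_ok, fixed_ok)
# 16 8 False True
# 32 8 False True
```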

risemeup1111 · 2026-07-06T03:45:47Z

+        if token_ids_1 is None:
+            return [1] + ([0] * len(token_ids_0)) + [1]
+        return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]


Priority: P2
The newly added 2.0 tokenizer still unconditionally counts EOS in the special mask and token_type_ids, but build_inputs_with_special_tokens() defaults to add_eos_token=False and only adds BOS. With the default config, 3 tokens give input_ids of length 4, while this code returns length 5, so a returned special_tokens_mask or token_type_ids does not line up with input_ids. The 2.5 tokenizer has already been fixed to use the actually built length, and 2.0 needs to match.

Suggest generating the mask from add_bos_token/add_eos_token and having token_type_ids reuse the actual output length directly:

Suggested change

-        if token_ids_1 is None:
-            return [1] + ([0] * len(token_ids_0)) + [1]
-        return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]
+        prefix_len = 1 if self.add_bos_token else 0
+        suffix_len = 1 if self.add_eos_token else 0
+        if token_ids_1 is None:
+            return ([1] * prefix_len) + ([0] * len(token_ids_0)) + ([1] * suffix_len)
+        return ([1] * prefix_len) + ([0] * len(token_ids_0)) + ([0] * len(token_ids_1)) + ([1] * suffix_len)

risemeup1111 · 2026-07-06T03:46:35Z

+        eos = [self.eos_token_id]
+
+        if token_ids_1 is None:
+            return len(token_ids_0 + eos) * [0]
+        return len(token_ids_0 + eos + token_ids_1 + eos) * [0]


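Priority: P2
This likewise unconditionally counts EOS in the length, but with the default add_eos_token=False the actual input_ids do not contain EOS. For example, 3 tokens actually produce [bos]+3 with length 4, while this returns 4 or 5 (the pair case also counts an extra separator EOS). Please generate the result directly from the real output length of build_inputs_with_special_tokens(), so batched fields do not have inconsistent lengths.

Suggested change

-        eos = [self.eos_token_id]
-
-        if token_ids_1 is None:
-            return len(token_ids_0 + eos) * [0]
-        return len(token_ids_0 + eos + token_ids_1 + eos) * [0]
+        return [0] * len(self.build_inputs_with_special_tokens(token_ids_0, token_ids_1))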

a31413510 · 2026-07-06T06:17:05Z

The code style issues reported by CI need to be fixed.

liuhao2638

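Re-reviewed the latest commits. The previous round's issues with the InternLM2 2.0 flash attention helper, the GQA reshape, and the tokenizer returned field lengths have all been fixed as expected, with corresponding coverage added. No issues that should continue to block merging were found, and CI has passed.

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.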

a31413510 · 2026-07-06T07:29:47Z

CI test failure: test_flash_attention_2_with_padding_mask

tests/transformers/intern_lm2/test_modeling.py::InternLM2ModelTest::test_flash_attention_2_with_padding_mask fails with:

ValueError: The type of data we are trying to retrieve (int32) does not match
the type of data (int64) currently contained in the container.

Likely cause: at paddleformers/transformers/intern_lm2/modeling.py:51, seqlens_in_batch in _get_unpad_data is created with dtype="int32", but in this Paddle version paddle.cumsum promotes the int32 output to int64. cu_seqlens therefore ends up as int64, while the flash_attn_unpadded API requires int32, hence the type mismatch error.

cu_seqlens needs to be cast to int32 in _get_unpad_data:
cu_seqlens = F.pad(paddle.cumsum(seqlens_in_batch, axis=0), (1, 0)).cast("int32")

a31413510 · 2026-07-09T06:17:54Z

The CI issues need to be resolved.

learncat163 added 3 commits March 23, 2026 19:57

Support InternLM2.5

9ec7337

Formatting cleanup

53c9c85

Add the internlm2_5 template

fe3c519

paddle-bot Bot added the contributor label Mar 23, 2026

learncat163 added 2 commits March 24, 2026 09:15

Fix internlm2_5 workflow errors

9e2a84a

Fix package path conflict caused by a missing __init__

f53f497

a31413510 reviewed May 19, 2026

learncat163 added 2 commits May 21, 2026 14:44

Merge remote-tracking branch 'upstream/develop' into pr-merge-terlm25

9f18bcf

Fix InternLM2.5 code per PR guidance

1987e44

learncat163 added 8 commits May 21, 2026 19:17

Fix InternLM2.5 code per PR guidance

252a16a

Fix the random number generator seed for automatic alignment to reduce the chance of mismatches

8778117

Merge remote-tracking branch 'upstream/develop' into pr-merge-terlm25

6947ff8

Merge remote-tracking branch 'upstream/develop' into pr-merge-terlm25

c2265c5

Merge remote-tracking branch 'upstream/develop' into pr-merge-terlm25

3c06fea

Address PR review comments

562430c

Merge remote-tracking branch 'upstream/develop' into pr-merge-terlm25

a9b7ae7

fix copyright and comment

a246d2d

risemeup1111 suggested changes Jun 28, 2026

fix internlm2 common model

65bd35b

risemeup1111 suggested changes Jun 28, 2026

fix internlm2 common model

3dd3419

risemeup1111 reviewed Jun 28, 2026

fix: delegate embedding/head APIs to _impl

5c56e4e

risemeup1111 approved these changes Jun 29, 2026

Merge remote-tracking branch 'upstream/develop' into pr-merge-terlm25

2a3fea5

liuhao2638 approved these changes Jul 3, 2026

Merge remote-tracking branch 'upstream/develop' into pr-merge-terlm25

dff2c7d

learncat163 mentioned this pull request Jul 5, 2026

Pure add internlm2 #4018

Draft

learncat163 added 3 commits July 5, 2026 23:58

fix:pr mix internlm2 and internlm2.5

66bad74

fix:pr merge internlm2 to internlm2.5

d9ab9cb

fix:pr mini proxy model class

1958c26

risemeup1111 suggested changes Jul 6, 2026

learncat163 added 2 commits July 6, 2026 14:19

fix:pr fix flash helper/GQA/token error

0151889

fix:pr fix code style

93758ed

liuhao2638 approved these changes Jul 6, 2026
