
Support InternLM2.5 and model alignment #4131

Open
learncat163 wants to merge 20 commits into PaddlePaddle:develop from learncat163:pr-merge-terlm25

@learncat163


PR types

New features

PR changes

Add Models

Description

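Migrate the InternLM2.5 model to PaddleFormers:

1. Model functionality and alignment tests (tests/transformers/intern_lm2_5/test_modeling.py)

  • Add functional test classes such as InternLM25CompatibilityTest
  • Verify that torch and paddle inference results align (loss tolerance 1e-2; the first 10 generated token ids match)

2. Converted model locations

- Original 1.8B model: https://aistudio.baidu.com/modelsdetail/45123
- Paddle version of the 1.8B model: https://aistudio.baidu.com/modelsdetail/45124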
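
The alignment criterion in item 1 (loss within 1e-2, first 10 token ids identical) can be sketched roughly as follows. This is only an illustration of the check, not the code in test_modeling.py; the function name `outputs_aligned`, its parameters, and the sample values are assumptions made for the example.

```python
from typing import Sequence


def outputs_aligned(
    torch_loss: float,
    paddle_loss: float,
    torch_token_ids: Sequence[int],
    paddle_token_ids: Sequence[int],
    loss_tol: float = 1e-2,
    num_tokens: int = 10,
) -> bool:
    """Return True when both alignment criteria hold (hypothetical helper)."""
    loss_ok = abs(torch_loss - paddle_loss) <= loss_tol
    # Compare only the leading tokens; later tokens may drift after one mismatch.
    tokens_ok = list(torch_token_ids[:num_tokens]) == list(paddle_token_ids[:num_tokens])
    return loss_ok and tokens_ok


if __name__ == "__main__":
    # Made-up values, used only to exercise the helper.
    ids = [1, 333, 352, 1621, 10005, 281, 454, 395, 777, 2, 99]
    print(outputs_aligned(2.3141, 2.3190, ids, ids[:10] + [42]))
```

Restricting the comparison to a short prefix keeps the check meaningful for greedy decoding, where a single differing token changes everything after it.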

Comparison with ms-swift

1. ms-swift configuration

MODEL_PATH="/mnt/learncat/llm/internlm/internlm2_5-1_8b-chat-raw"
TRAIN_DATA="/mnt/learncat/code/swift/train/train-msg.jsonl"
OUTPUT_DIR="/tmp/internlm2_5-sft-full-ms-swift"

# Create the output directory
mkdir -p "$OUTPUT_DIR"

echo "========================================"
echo "Starting ms-swift full-parameter fine-tuning test"
echo "Model: internlm2_5-1_8b-chat"
echo "========================================"

# Activate the swift environment and run
source /mnt/learncat/code/swift/.venv/bin/activate && \
swift sft \
    --model "$MODEL_PATH" \
    --model_type internlm2 \
    --tuner_type full \
    --template default \
    --dataset "$TRAIN_DATA" \
    --max_length 512 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-5 \
    --max_steps 500 \
    --warmup_steps 5 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --bf16 true \
    --seed 23 \
    --output_dir "$OUTPUT_DIR" \
    --logging_steps 1 \
    --save_strategy no \
    --gradient_checkpointing false \
    2>&1 | tee "$OUTPUT_DIR/training.log"

2. paddleformers-cli configuration

### data
train_dataset_type: erniekit
eval_dataset_type: erniekit
train_dataset_path: ./tests/fixtures/dummy/sft/train.jsonl
train_dataset_prob: "1.0"
eval_dataset_path: ./tests/fixtures/dummy/sft/eval.jsonl
eval_dataset_prob: "1.0"
max_seq_len: 512
packing: false
dataloader_shuffle: false
mix_strategy: concat
template_backend: custom
template: internlm2_5
### model
# placeholder path; override via update_training_args in test or CLI --model_name_or_path
#model_name_or_path: Qwen/Qwen3-0.6B-Base
model_name_or_path: /mnt/learncat/llm/internlm/internlm2_5-1_8b-chat-paddle
_attn_implementation: flashmask


### finetuning
# base
stage: SFT
fine_tuning: full
seed: 23
do_train: true
do_eval: true
per_device_eval_batch_size: 1
per_device_train_batch_size: 1
num_train_epochs: 1
max_steps: 500
eval_steps: 1000
evaluation_strategy: steps
save_steps: 100000
save_strategy: steps
logging_steps: 1
gradient_accumulation_steps: 1
logging_dir: ./vdl_log
output_dir: ./checkpoints/qwen3-sft-full
disable_tqdm: true
eval_accumulation_steps: 16


# train
warmup_steps: 5
learning_rate: 1.0e-5

# performance
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
sharding: stage1
recompute_granularity: full
recompute_method: uniform
recompute_num_layers: 1
bf16: true
fp16_opt_level: O1
unified_checkpoint: false
# Commented out so that the save stage is skipped
# save_checkpoint_format: flex_checkpoint
load_checkpoint_format: sharding_io
continue_training: false

3. Loss diff script

#!/usr/bin/env python3
import argparse
import os
import re
from pathlib import Path
from typing import Dict, Tuple


def parse_paddle_log(file_path: str) -> Dict[int, float]:
    """Map global_step -> loss for log lines that contain both "loss:" and "global_step:"."""
    loss_dict = {}
    loss_pattern = re.compile(r"loss:\s*([0-9\.eE+-]+)")
    step_pattern = re.compile(r"global_step:\s*(\d+)")

    with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            if "loss:" in line and "global_step:" in line:
                loss_match = loss_pattern.search(line)
                step_match = step_pattern.search(line)
                if loss_match and step_match:
                    loss_val = float(loss_match.group(1))
                    step_val = int(step_match.group(1))
                    loss_dict[step_val] = loss_val
    return loss_dict


def parse_swift_log(file_path: str) -> Dict[int, float]:
    """Map step -> loss for ms-swift lines with quoted 'loss' and 'global_step/max_steps' fields."""
    loss_dict = {}
    loss_pattern = re.compile(r"'loss':\s*'([0-9\.eE+-]+)'")
    step_pattern = re.compile(r"'global_step/max_steps':\s*'(\d+)/\d+'")

    with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            if "'loss'" in line and "'global_step/max_steps'" in line:
                loss_match = loss_pattern.search(line)
                step_match = step_pattern.search(line)
                if loss_match and step_match:
                    loss_val = float(loss_match.group(1))
                    step_val = int(step_match.group(1))
                    loss_dict[step_val] = loss_val
    return loss_dict


def find_log_files(directory: str) -> Tuple[str, str]:
    dir_path = Path(directory)
    paddle_files = list(dir_path.glob("paddle*.log"))
    swift_files = list(dir_path.glob("swift*.log")) + list(dir_path.glob("swfit*.log"))

    if not paddle_files:
        raise FileNotFoundError(f"paddle log not found: {directory}")
    if not swift_files:
        raise FileNotFoundError(f"swift log not found: {directory}")

    return str(paddle_files[0]), str(swift_files[0])


def generate_markdown_table(
    paddle_dict: Dict[int, float],
    swift_dict: Dict[int, float],
    max_rows: int = 100
) -> str:
    common_steps = sorted(set(paddle_dict.keys()) & set(swift_dict.keys()))
    if not common_steps:
        return "no common steps"

    display_steps = common_steps[:max_rows]

    lines = [
        "# Loss",
        "",
        f"common: {len(common_steps)}, show: {len(display_steps)}",
        "",
        "| step | ms-swift | paddle |",
        "|------|----------|--------|"
    ]

    for step in display_steps:
        swift_loss = swift_dict[step]
        paddle_loss = paddle_dict[step]
        lines.append(f"| {step} | {swift_loss:.6f} | {paddle_loss:.6f} |")

    if len(common_steps) > max_rows:
        lines.append("| ... | ... | ... |")
        # A blank line ends the table so the total is not parsed as a table row.
        lines.append("")
        lines.append(f"_total {len(common_steps)}_")

    return "\n".join(lines)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dir", type=str, default="log/lm25")
    parser.add_argument("--max_rows", type=int, default=100)
    parser.add_argument("--output", type=str, default=None)
    args = parser.parse_args()

    paddle_log, swift_log = find_log_files(args.dir)
    print(f"Paddle: {paddle_log}")
    print(f"Swift:  {swift_log}")

    paddle_dict = parse_paddle_log(paddle_log)
    swift_dict = parse_swift_log(swift_log)

    print(f"Paddle: {len(paddle_dict)}")
    print(f"Swift:  {len(swift_dict)}")

    markdown = generate_markdown_table(paddle_dict, swift_dict, args.max_rows)

    output_path = args.output
    if output_path is None:
        output_path = os.path.join(args.dir, "diff-loss.md")

    os.makedirs(os.path.dirname(output_path) if os.path.dirname(output_path) else ".", exist_ok=True)

    with open(output_path, "w", encoding="utf-8") as f:
        f.write(markdown)

    print(f"\nOutput: {output_path}")
    print("\n" + markdown)


if __name__ == "__main__":
    main()

4. Loss comparison results

common: 102, show: 100

| step | ms-swift | paddle |
|------|----------|--------|
| 1 | 5.953000 | 6.290356 |
| 2 | 5.730000 | 9.247890 |
| 3 | 5.861000 | 6.163038 |
| 4 | 5.663000 | 4.679254 |
| 5 | 4.272000 | 6.301551 |
| 6 | 4.699000 | 3.590961 |
| 7 | 4.390000 | 8.666958 |
| 8 | 3.389000 | 7.602499 |
| 9 | 3.542000 | 8.734646 |
| 10 | 2.256000 | 8.286204 |
| 11 | 3.168000 | 7.052961 |
| 12 | 2.073000 | 3.105869 |
| 13 | 1.677000 | 4.067309 |
| 14 | 2.472000 | 5.547118 |
| 15 | 1.549000 | 2.685346 |
| 16 | 0.719400 | 5.332623 |
| 17 | 1.491000 | 2.459121 |
| 18 | 1.392000 | 0.934952 |
| 19 | 1.072000 | 2.983633 |
| 20 | 0.747800 | 4.115916 |
| 21 | 0.683300 | 0.988917 |
| 22 | 0.365400 | 0.432274 |
| 23 | 0.648900 | 1.131235 |
| 24 | 0.041450 | 1.985609 |
| 25 | 0.266300 | 2.602725 |
| 26 | 0.285700 | 0.345454 |
| 27 | 0.550200 | 0.031843 |
| 28 | 0.436700 | 3.117400 |
| 29 | 0.143000 | 1.994643 |
| 30 | 0.177000 | 0.015647 |
| 31 | 0.045770 | 0.202331 |
| 32 | 0.095660 | 0.335955 |
| 33 | 0.160200 | 0.090093 |
| 34 | 0.095230 | 0.302440 |
| 35 | 0.044270 | 0.477116 |
| 36 | 0.004348 | 1.127114 |
| 37 | 0.021820 | 0.718798 |
| 38 | 0.055190 | 0.009806 |
| 39 | 0.061650 | 0.006342 |
| 40 | 0.002891 | 0.007223 |
| 41 | 0.023180 | 0.937800 |
| 42 | 0.010910 | 0.000712 |
| 43 | 0.012420 | 0.000518 |
| 44 | 0.005803 | 0.000433 |
| 45 | 0.002461 | 0.000340 |
| 46 | 0.002702 | 0.000880 |
| 47 | 0.057940 | 0.021273 |
| 48 | 0.007530 | 0.042879 |
| 49 | 0.000661 | 0.000632 |
| 50 | 0.005365 | 0.000252 |
| 51 | 0.062190 | 0.242925 |
| 52 | 0.002174 | 0.170648 |
| 53 | 0.002029 | 0.000216 |
| 54 | 0.008213 | 0.132959 |
| 55 | 0.014590 | 0.036908 |
| 56 | 0.001328 | 0.000189 |
| 57 | 0.000596 | 0.021828 |
| 58 | 0.002428 | 0.000161 |
| 59 | 0.022250 | 0.002319 |
| 60 | 0.000363 | 0.000220 |
| 61 | 0.004164 | 0.000065 |
| 62 | 0.000978 | 0.000229 |
| 63 | 0.000213 | 0.027424 |
| 64 | 0.000435 | 0.210934 |
| 65 | 0.036160 | 0.049647 |
| 66 | 0.001075 | 0.014757 |
| 67 | 0.023450 | 0.000127 |
| 68 | 0.000285 | 0.009107 |
| 69 | 0.000073 | 0.001742 |
| 70 | 0.002151 | 0.000031 |
| 71 | 0.000249 | 0.012602 |
| 72 | 0.000550 | 0.319398 |
| 73 | 0.000310 | 0.000080 |
| 74 | 0.000639 | 0.000063 |
| 75 | 0.000050 | 0.000099 |
| 76 | 0.000256 | 0.000521 |
| 77 | 0.000498 | 0.000518 |
| 78 | 0.000088 | 0.000098 |
| 79 | 0.000365 | 0.001257 |
| 80 | 0.000079 | 0.000100 |
| 81 | 0.000065 | 0.000131 |
| 82 | 0.000126 | 0.004159 |
| 83 | 0.000139 | 0.202090 |
| 84 | 0.000419 | 0.000043 |
| 85 | 0.000164 | 0.000994 |
| 86 | 0.000067 | 0.000262 |
| 87 | 0.000326 | 0.000062 |
| 88 | 0.000142 | 0.000059 |
| 89 | 0.000171 | 0.000058 |
| 90 | 0.000052 | 0.000471 |
| 91 | 0.000152 | 0.004269 |
| 92 | 0.000123 | 0.000048 |
| 93 | 0.000067 | 0.000173 |
| 94 | 0.000144 | 0.000022 |
| 95 | 0.000086 | 0.000486 |
| 96 | 0.000068 | 0.000055 |
| 97 | 0.000117 | 0.000212 |
| 98 | 0.000111 | 0.000060 |
| 99 | 0.000063 | 0.000058 |
| 100 | 0.000075 | 0.000019 |
| ... | ... | ... |

total 102
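
A per-step table is hard to judge by eye, so one possible follow-up is to reduce the two loss dictionaries to a few summary numbers. The sketch below is not part of the PR's script; `summarize_gap` is a hypothetical helper, and the sample values are the first three rows of the table above.

```python
from statistics import mean
from typing import Dict, Tuple


def summarize_gap(swift: Dict[int, float], paddle: Dict[int, float]) -> Tuple[float, float, int]:
    """Return (mean |diff|, max |diff|, step of max |diff|) over common steps."""
    steps = sorted(set(swift) & set(paddle))
    if not steps:
        raise ValueError("no common steps")
    gaps = {s: abs(paddle[s] - swift[s]) for s in steps}
    worst_step = max(gaps, key=gaps.get)
    return mean(gaps.values()), gaps[worst_step], worst_step


if __name__ == "__main__":
    swift = {1: 5.953, 2: 5.730, 3: 5.861}
    paddle = {1: 6.290356, 2: 9.247890, 3: 6.163038}
    mean_gap, max_gap, worst_step = summarize_gap(swift, paddle)
    print(f"mean={mean_gap:.4f} max={max_gap:.4f} at step {worst_step}")
```

The dictionaries returned by parse_paddle_log and parse_swift_log have the same shape, so they could be passed in directly.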

@paddle-bot

paddle-bot Bot commented Mar 23, 2026


Thanks for your contribution!

@codecov-commenter

codecov-commenter commented Mar 24, 2026


Codecov Report

❌ Patch coverage is 51.09127% with 493 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@95c3c8a). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| ...addleformers/transformers/intern_lm2_5/modeling.py | 47.90% | 385 Missing ⚠️ |
| paddleformers/transformers/intern/configuration.py | 18.00% | 41 Missing ⚠️ |
| ...ddleformers/transformers/intern_lm2_5/tokenizer.py | 69.52% | 32 Missing ⚠️ |
| paddleformers/transformers/intern/modeling.py | 58.62% | 24 Missing ⚠️ |
| ...formers/transformers/intern_lm2_5/configuration.py | 79.54% | 9 Missing ⚠️ |
| paddleformers/cli/utils/llm_utils.py | 0.00% | 2 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (51.09%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #4131   +/-   ##
==========================================
  Coverage           ?   47.11%           
==========================================
  Files              ?      482           
  Lines              ?    91611           
  Branches           ?        0           
==========================================
  Hits               ?    43165           
  Misses             ?    48446           
  Partials           ?        0           


@learncat163

Author

/re-run all-failed

module.weight[module._padding_idx].zero_()

@classmethod
def _gen_aoa_config(cls, config: InternLM25Config):

Collaborator

Disabling this is not recommended; see the following implementation for reference:
def _gen_aoa_config(cls, config: InternLM25Config):
    model_prefix = cls.base_model_prefix + "." if cls != cls.base_model_class else ""
    aoa_statements = [
        f"model.tok_embeddings.weight -> {model_prefix}tok_embeddings.weight",
        f"model.norm.weight -> {model_prefix}norm.weight",
        f"model.layers.$LAYER_ID.attention_norm.weight -> {model_prefix}layers.$LAYER_ID.attention_norm.weight",
        f"model.layers.$LAYER_ID.ffn_norm.weight -> {model_prefix}layers.$LAYER_ID.ffn_norm.weight",
    ]
    aoa_statements.extend([
        f"model.layers.$LAYER_ID.attention.{w}.weight^T -> {model_prefix}layers.$LAYER_ID.attention.{w}.weight"
        for w in ["wqkv", "wo"]
    ])
    aoa_statements.extend([
        f"model.layers.$LAYER_ID.feed_forward.{w}.weight^T -> {model_prefix}layers.$LAYER_ID.feed_forward.{w}.weight"
        for w in ["w1", "w2", "w3"]
    ])
    if cls != cls.base_model_class:
        if getattr(config, "tie_word_embeddings", False):
            aoa_statements.append("model.tok_embeddings.weight -> output.weight")
        else:
            aoa_statements.append("output.weight^T -> output.weight")
    return {"aoa_statements": aoa_statements}
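
The `^T` suffix on the linear-layer weights above presumably marks a transpose; that reading is an assumption based on how the statements are written. It would match the layout difference between the two frameworks: a PyTorch `nn.Linear` weight has shape [out_features, in_features], while a Paddle `nn.Linear` weight has shape [in_features, out_features]. A framework-free sketch with plain lists shows why the transposed weight yields the same projection:

```python
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Torch-style weight for a 3 -> 2 projection: shape [out_features, in_features].
w_torch = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]
# Paddle-style weight: shape [in_features, out_features].
w_paddle = transpose(w_torch)

x = [[1.0, 0.0, -1.0]]  # one input row with 3 features

# Torch-style layers compute x @ W^T; Paddle-style layers compute x @ W.
y_torch = matmul(x, transpose(w_torch))
y_paddle = matmul(x, w_paddle)
assert y_torch == y_paddle
```

Embedding and norm weights have the same layout in both frameworks, which is consistent with those statements carrying no `^T`.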

)

if attention_mask is not None and attention_mask.ndim == 4:
    if attention_mask.max() != 0:

Collaborator

Training fails here with an error. I suggest removing this check and using the 4D mask directly.

("Gemma3", "gemma3_text"),
("Glm4vMoe", "glm4v_moe"),
("GlmOcr", "glm_ocr"),
("InternLM2", "intern_lm2_5"),

Collaborator

Should this be InternLM25?

logging_steps: 1
gradient_accumulation_steps: 4
logging_dir: ./vdl_log
output_dir: ./checkpoints/qwen3-sft-full

Collaborator

This should not say qwen3.

Comment thread tests/integration_test/interlm_sft.sh Outdated
# limitations under the License.

# TODO: not enabled in .github/workflows/fleet-model-test.yml for now, to avoid blocking the pipeline outright
# TODO: loss comparison material will be submitted together with the PR

Collaborator

Please confirm whether the TODO comments should be kept.

@a31413510

Collaborator

1. The copyright year is wrong.
2. All Chinese comments need to be changed to English.

@Paddle-CI-Bot

Paddle-CI-Bot commented May 21, 2026


PaddleFormers Log Analysis

Run #26274089790 · Attempt 1

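Log analysis report

| Pipeline name | Issue label | Fix suggestion | Log snippet |
|---------------|-------------|----------------|-------------|
| Fleet Model Test · Integration test (H20, multi-card) | GPU dropped | Known issue; CI maintainers should check the GPU mount status on machine paddle-10-1567435 | Error code |
| CI_ILUVATAR · iluvatar_test | LossNan / automatic dtype promotion | Check whether the PR's changes to the ERNIE-21B-SFT compute path introduce a dtype mismatch (Got different data type, run type promotion automatically); focus on the loss accumulation logic near line 2248 of trainer.py | Error code |

Failed test cases:

scripts/iluvatar_ci/test_ernie_21b_sft.py::test_ernie_21b_sft_training

Root cause analysis:

  • Fleet Model Test (H20 multi-card): The GPU on machine paddle-10-1567435 was not correctly recognized by the CUDA driver. paddle.device.cuda.get_device_properties received Place(cpu) rather than a CUDA Place, and the system log also shows "Your CUDA device is not set properly. CPU device will be used by default". This is a host/container GPU mount failure and is unrelated to the PR code.

  • CI_ILUVATAR: When ERNIE-21B-SFT runs on the Iluvatar machine, "Got different data type, run type promotion automatically" appears at step 2, after which the loss becomes NaN. Given the PR's changes to the training flow or MoE-related computation, it is very likely that the precision (fp16/bf16/fp32) of some tensor is not handled correctly on the Iluvatar device, causing an overflow that turns the loss into NaN, at which point _check_loss_valid raises an error and terminates the run.

Fix suggestions:

  1. Fleet Model Test (H20 multi-card)

    • The dropped GPU on this machine is an environment issue; rerun the pipeline directly. If it reproduces repeatedly, notify the CI maintainers to check whether nvidia-smi on the paddle-10-1567435 host and the Docker --gpus all mount are working.
  2. CI_ILUVATAR (LossNan)

    • Check the dtype cast operations in trainer.py, workflow.py (sft), and the MoE-related files touched by the PR, and make sure all intermediate tensors use a consistent precision under Iluvatar corex 4.3.8.
    • Local reproduction command: paddleformers-cli train scripts/iluvatar_ci/config/ERNIE-21B-SFT.yaml. Print the dtype of each tensor around step 2 to locate the operator that triggers type promotion.
    • If it is confirmed to be an intermittent Iluvatar hardware precision issue (not introduced by the code), rerun once to verify.

🔄 Automatically updated after each re-run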

@risemeup1111 risemeup1111 left a comment


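This round of review is complete. There are still blocking issues that need to be fixed first; details are in the inline comments. They mainly concern the weight-loading/parameter-registration path of the intern compatibility proxy and a runtime failure in the InternLM2.5 flash attention path when a mask is passed. The tokenizer also returns fields of inconsistent length, which I suggest fixing at the same time.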

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

ImplModel = getattr(_impl_module, _cls_name)

# Store the actual implementation
self._impl = ImplModel(config)


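Priority: P1
Here the real model is placed inside a proxy object, but _impl goes through the underscore branch of the current __setattr__ and ends up in object.__setattr__, so it is not registered as a Paddle sublayer. Writing directly to __dict__ afterwards likewise does not register parameters such as model/output/layers. PretrainedModel.from_pretrained first computes the expected keys from model.state_dict(), and AutoModel loading the public InternLM2.5 config (model_type: internlm2, architectures: InternLM2ForCausalLM) goes through this proxy. As a result, weight loading, saving, and optimizer parameter enumeration cannot see the real parameters.

Please do not use a transparent proxy with unregistered sublayers. If the wrapper is kept, explicitly register the sublayer and delegate the weight-related interfaces, or have AutoModel return the concrete implementation class directly. For example:

impl = ImplModel(config)
self.add_sublayer("impl", impl)
self._impl = impl

# If the wrapper's public state_dict keys still need to match the concrete model's keys, keep delegating explicitly: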
def state_dict(self, *args, **kwargs):
    return self._impl.state_dict(*args, **kwargs)

def set_state_dict(self, state_dict, *args, **kwargs):
    return self._impl.set_state_dict(state_dict, *args, **kwargs)
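
The registration problem described above can be reproduced without Paddle. The sketch below uses a toy `ToyLayer` class that only imitates the idea of registering child layers in `__setattr__`; it is not Paddle's `Layer`, and every class name in it is hypothetical.

```python
class ToyLayer:
    """Toy stand-in for a framework layer; NOT Paddle's real Layer."""

    def __init__(self):
        object.__setattr__(self, "_sub_layers", {})
        object.__setattr__(self, "_params", {})

    def __setattr__(self, name, value):
        # Normal assignment registers child layers so they can be enumerated.
        if isinstance(value, ToyLayer):
            self._sub_layers[name] = value
        object.__setattr__(self, name, value)

    def state_dict(self, prefix=""):
        out = {prefix + k: v for k, v in self._params.items()}
        for name, layer in self._sub_layers.items():
            out.update(layer.state_dict(prefix + name + "."))
        return out


class Impl(ToyLayer):
    def __init__(self):
        super().__init__()
        self._params["weight"] = [0.0, 0.0]


class HiddenProxy(ToyLayer):
    def __init__(self):
        super().__init__()
        # Bypasses registration, like an underscore branch calling object.__setattr__.
        object.__setattr__(self, "_impl", Impl())


class RegisteredProxy(ToyLayer):
    def __init__(self):
        super().__init__()
        self.impl = Impl()  # goes through __setattr__ and is registered


print(HiddenProxy().state_dict())      # the wrapped parameters are invisible
print(RegisteredProxy().state_dict())  # the wrapped parameters are enumerated
```

In the toy, the attribute is still reachable as `proxy._impl` in both cases; only the enumeration that loading, saving, and the optimizer rely on differs.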

    from ..intern_lm2_5 import modeling as _impl_module
else:
    logger.info("Detected InternLM2 2.0, loading 2.0 implementation")
    from ..intern_lm2 import modeling as _impl_module


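Priority: P1
This 2.0 branch imports ..intern_lm2, but the repository currently only has intern/ and intern_lm2_5/; there is no paddleformers.transformers.intern_lm2 package. Any plain model_type: internlm2 config without the AutoModelForSequenceClassification marker reaches this branch and raises ModuleNotFoundError immediately. This also conflicts with the compatibility goal stated in the comment, "supports both 2.0 and 2.5".

Please add a real InternLM2 2.0 implementation package, or do not declare/map the 2.0 proxy capability. If this PR only supports 2.5, I suggest restricting the Auto mapping to paths that can be identified as 2.5, so that plain internlm2 configs do not land on a nonexistent module. Example of the fix: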

if config.is_version_2_5:
    from ..intern_lm2_5 import modeling as _impl_module
else:
    raise NotImplementedError("InternLM2 2.0 is not supported by this PaddleFormers implementation yet.")

has_flash_attn = False

try:
    from ..intern.bert_padding_delte import index_first_axis, pad_input, unpad_input


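Priority: P1
This imports the nonexistent ..intern.bert_padding_delte, so execution always falls through to the fallback below. However, the fallback pad_input(hidden_states, attention_mask) takes only 2 parameters, while _flash_attention_forward at line 441 calls pad_input(attn_output_unpad, indices_q, batch_size, query_length) with 4 arguments. Whenever attn_implementation="flash_attention_2" is used and a padding mask is passed, restoring the batch output raises TypeError, so the flash attention path with a mask is unusable for both training and inference.

Please import the correct unpad/pad utilities, or give the fallback a signature and semantics consistent with the call site, and add a flash attention test that uses a padding mask. Example of the fix: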

def pad_input(hidden_states, indices, batch_size, seqlen):
    output = paddle.zeros([batch_size * seqlen, *hidden_states.shape[1:]], dtype=hidden_states.dtype)
    output = paddle.scatter(output, indices, hidden_states)
    return output.reshape([batch_size, seqlen, *hidden_states.shape[1:]])

def unpad_input(hidden_states, attention_mask):
    indices, cu_seqlens, max_seqlen = _get_unpad_data(attention_mask)
    hidden_states = index_first_axis(hidden_states.reshape([-1, *hidden_states.shape[2:]]), indices)
    return hidden_states, indices, cu_seqlens, max_seqlen
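
To make the intended semantics of the four-argument `pad_input` concrete, here is a framework-free sketch of the unpad/pad round trip on plain lists. It uses one scalar per token instead of a hidden vector, and the helper names `unpad` and `pad` are illustrative rather than the PR's functions.

```python
from typing import List, Tuple


def unpad(batch: List[List[float]], mask: List[List[int]]) -> Tuple[List[float], List[int]]:
    """Flatten [batch, seqlen] and keep only positions where mask == 1."""
    flat = [v for row in batch for v in row]
    flat_mask = [m for row in mask for m in row]
    indices = [i for i, m in enumerate(flat_mask) if m == 1]
    return [flat[i] for i in indices], indices


def pad(values: List[float], indices: List[int], batch_size: int, seqlen: int) -> List[List[float]]:
    """Scatter kept values back into a zero-filled [batch, seqlen] layout."""
    flat = [0.0] * (batch_size * seqlen)
    for value, i in zip(values, indices):
        flat[i] = value
    return [flat[b * seqlen:(b + 1) * seqlen] for b in range(batch_size)]


batch = [[1.0, 2.0, 9.0], [3.0, 9.0, 9.0]]  # 9.0 marks padding slots
mask = [[1, 1, 0], [1, 0, 0]]

values, indices = unpad(batch, mask)
restored = pad(values, indices, batch_size=2, seqlen=3)
print(values, indices)
print(restored)
```

The restore step needs the flat indices plus the original batch size and sequence length, which is why a two-argument fallback cannot serve the call site.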

Comment on lines +164 to +166
if token_ids_1 is None:
    return [1] + ([0] * len(token_ids_0)) + [1]
return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]


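Priority: P2
build_inputs_with_special_tokens adds only BOS by default (add_eos_token=False), but here EOS is unconditionally marked as a special token as well; create_token_type_ids_from_sequences likewise counts one extra EOS. With return_special_tokens_mask=True, or when token_type_ids are returned, the length is 1 greater than the actual input_ids, which breaks batching and alignment.

Please make the mask and token_type_ids reuse the length of the sequence that is actually constructed, for example: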

Suggested change
- if token_ids_1 is None:
-     return [1] + ([0] * len(token_ids_0)) + [1]
- return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]
+ output = self.build_inputs_with_special_tokens(token_ids_0, token_ids_1)
+ if token_ids_1 is None:
+     prefix_len = 1 if self.add_bos_token else 0
+     suffix_len = 1 if self.add_eos_token else 0
+     return ([1] * prefix_len) + ([0] * len(token_ids_0)) + ([1] * suffix_len)
+ prefix_len = 1 if self.add_bos_token else 0
+ suffix_len = 1 if self.add_eos_token else 0
+ return ([1] * prefix_len) + ([0] * len(token_ids_0)) + ([0] * len(token_ids_1)) + ([1] * suffix_len)

Comment on lines +171 to +175
eos = [self.eos_token_id]

if token_ids_1 is None:
    return len(token_ids_0 + eos) * [0]
return len(token_ids_0 + eos + token_ids_1 + eos) * [0]


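Priority: P2
This also counts EOS unconditionally, but with the default add_eos_token=False the actual input_ids contain no EOS; for example, 3 tokens produce 4 token_type_ids. Please generate the result from the real output length of build_inputs_with_special_tokens, so that the fields returned for a batch do not differ in length.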

Suggested change
- eos = [self.eos_token_id]
- if token_ids_1 is None:
-     return len(token_ids_0 + eos) * [0]
- return len(token_ids_0 + eos + token_ids_1 + eos) * [0]
+ return [0] * len(self.build_inputs_with_special_tokens(token_ids_0, token_ids_1))
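
The invariant behind both tokenizer comments is that input_ids, the special-tokens mask, and token_type_ids must have the same length for any flag combination. The sketch below checks that on a hypothetical `ToyTokenizer`; the token ids and the pair layout (BOS and EOS applied around each segment) are assumptions for illustration, not the PR's tokenizer.

```python
from typing import List, Optional


class ToyTokenizer:
    """Hypothetical tokenizer fragment; ids and pair layout are illustrative."""

    bos_token_id = 1
    eos_token_id = 2

    def __init__(self, add_bos_token: bool = True, add_eos_token: bool = False):
        self.add_bos_token = add_bos_token
        self.add_eos_token = add_eos_token

    def build_inputs_with_special_tokens(
        self, ids_0: List[int], ids_1: Optional[List[int]] = None
    ) -> List[int]:
        bos = [self.bos_token_id] if self.add_bos_token else []
        eos = [self.eos_token_id] if self.add_eos_token else []
        out = bos + ids_0 + eos
        if ids_1 is not None:
            out += bos + ids_1 + eos
        return out

    def create_token_type_ids_from_sequences(self, ids_0, ids_1=None) -> List[int]:
        # Derive the length from the real output instead of re-counting specials.
        return [0] * len(self.build_inputs_with_special_tokens(ids_0, ids_1))

    def get_special_tokens_mask(self, ids_0, ids_1=None) -> List[int]:
        # Mirror the same flags that build_inputs_with_special_tokens uses.
        bos = [1] if self.add_bos_token else []
        eos = [1] if self.add_eos_token else []
        mask = bos + [0] * len(ids_0) + eos
        if ids_1 is not None:
            mask += bos + [0] * len(ids_1) + eos
        return mask


tok = ToyTokenizer()  # defaults: BOS only
ids = [10, 11, 12]
print(len(tok.build_inputs_with_special_tokens(ids)),
      len(tok.get_special_tokens_mask(ids)),
      len(tok.create_token_type_ids_from_sequences(ids)))
```

A test that loops over all four flag combinations and asserts equal lengths would catch a regression of this kind cheaply.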

@risemeup1111 risemeup1111 left a comment


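Re-reviewed the new commits. The issues from the previous round (the 2.0 branch import, the flash attention padding helper, and the tokenizer return lengths) appear to have been addressed. However, the intern compatibility proxy still has two missing delegations that affect the default loading/generation path. Details are in the new inline comments; I suggest fixing them before merging.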

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

    return self._impl.parameters(include_sublayers=include_sublayers)

def named_parameters(self, prefix="", include_sublayers=True):
    return self._impl.named_parameters(prefix=prefix, include_sublayers=include_sublayers)


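Priority: P1
Delegation for state_dict/named_parameters has been added here, but from_pretrained with the default load_checkpoint_format="flex_checkpoint" calls model.sharded_state_dict() directly. When the proxy class does not override this method, the call falls back to Paddle's Layer.sharded_state_dict(), which recursively generates keys such as _impl.model... / _impl.output... from the registered sublayers. Meanwhile _gen_aoa_config() still returns the real implementation's model... / output... target keys, so dist.load_state_dict cannot find those targets, and loading an InternLM2.5 flex checkpoint through AutoModel still fails.

Please delegate the sharded state dict to the real implementation as well, and keep the _impl. prefix out of the public weight keys: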

Suggested change
return self._impl.named_parameters(prefix=prefix, include_sublayers=include_sublayers)
def named_parameters(self, prefix="", include_sublayers=True):
return self._impl.named_parameters(prefix=prefix, include_sublayers=include_sublayers)
def sharded_state_dict(self, *args, **kwargs):
return self._impl.sharded_state_dict(*args, **kwargs)
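
The key mismatch can be illustrated with a toy pair of classes. Nothing here is Paddle's API: `recursive_keys` merely imitates what a generic walk over a registered child named `_impl` would produce, and the key names and shard labels are made up.

```python
class Impl:
    def sharded_state_dict(self, prefix=""):
        return {prefix + "model.norm.weight": "shard-0", prefix + "output.weight": "shard-1"}


class Proxy:
    def __init__(self):
        self._impl = Impl()

    def recursive_keys(self):
        # What a generic recursive walk over a child named "_impl" would yield.
        return self._impl.sharded_state_dict(prefix="_impl.")

    def sharded_state_dict(self, *args, **kwargs):
        # Explicit delegation keeps public keys identical to the implementation's.
        return self._impl.sharded_state_dict(*args, **kwargs)


targets = {"model.norm.weight", "output.weight"}  # keys a conversion config might target
proxy = Proxy()
print(targets <= set(proxy.recursive_keys()))      # prefixed keys do not match the targets
print(targets <= set(proxy.sharded_state_dict()))  # delegated keys match the targets
```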

_auto_class = "AutoModelForCausalLM"
_tied_weights_keys = ["output.weight"]

def __init__(self, config: InternLM2Config):


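Priority: P1
The InternLM2ForCausalLM proxy does not override prepare_inputs_for_generation, and because this method is already defined on GenerationMixin, __getattr__ does not forward it to _impl. As a result, PretrainedModel.can_generate() judges the proxy as unable to generate, from_pretrained does not load generation_config.json, and save_pretrained does not save the generation config. A generate() call also goes through the generic input-preparation logic instead of the cache_position / DynamicCache trimming logic in the InternLM2.5 implementation.

Please explicitly delegate the generation entry point on the CausalLM proxy to restore generation behavior consistent with the real implementation: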

Suggested change
- def __init__(self, config: InternLM2Config):
+ def __init__(self, config: InternLM2Config):
+     super().__init__(config)
+
+ def prepare_inputs_for_generation(self, *args, **kwargs):
+     return self._impl.prepare_inputs_for_generation(*args, **kwargs)

@risemeup1111 risemeup1111 left a comment


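Re-reviewed the new commits. The two blocking proxy delegation issues from the previous round have been fixed as expected, and no new blocking issues were found. There is one medium-priority suggestion about embedding/head APIs that are not fully delegated on the AutoModel proxy path; details are in the inline comment.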

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

return object.__getattribute__(self, name)
except AttributeError:
return getattr(self._impl, name)
return getattr(self._impl, name)


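Priority: P2
__getattr__ can only proxy attributes that are not found on the current class, but get_input_embeddings / set_input_embeddings / get_output_embeddings / set_output_embeddings are already defined on PretrainedModel and never reach it. Currently InternLM2Model.get_input_embeddings() follows the base-class logic and looks for model/embed_tokens, whereas the real implementation uses tok_embeddings, so it raises NotImplementedError; InternLM2ForQuestionAnswering behaves similarly. InternLM2ForCausalLM.get_output_embeddings() returns None because the base class only looks for lm_head, which is inconsistent with the output head of the real InternLM25ForCausalLM. Common downstream uses such as resize_token_embeddings, NEFTune hooks, weight tying, or embedding APIs in tests behave inconsistently on the AutoModel proxy path.

Please explicitly delegate these public embedding/head APIs that already exist on the base class as well, and add a lightweight test for the InternLM2* proxy classes, for example: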

# in InternLM2PretrainedModel
def get_input_embeddings(self):
    return self._impl.get_input_embeddings()

def set_input_embeddings(self, value):
    return self._impl.set_input_embeddings(value)

# in InternLM2ForCausalLM
def get_output_embeddings(self):
    return self._impl.get_output_embeddings()

def set_output_embeddings(self, new_embeddings):
    return self._impl.set_output_embeddings(new_embeddings)
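
The underlying Python rule is that `__getattr__` runs only after normal attribute lookup on the instance, its class, and its base classes has failed. A minimal sketch with placeholder classes (none of them from PaddleFormers) shows the difference between implicit forwarding and explicit delegation:

```python
class Base:
    def get_input_embeddings(self):
        return "base lookup"


class Impl:
    def get_input_embeddings(self):
        return "tok_embeddings"

    def custom_helper(self):
        return "impl helper"


class Proxy(Base):
    def __init__(self):
        self._impl = Impl()

    def __getattr__(self, name):
        # Only called when normal lookup on Proxy and its bases fails.
        return getattr(self._impl, name)


class DelegatingProxy(Proxy):
    def get_input_embeddings(self):
        # Explicit override, so the base-class method no longer shadows Impl's.
        return self._impl.get_input_embeddings()


print(Proxy().custom_helper())                   # forwarded via __getattr__
print(Proxy().get_input_embeddings())            # resolved on Base, never forwarded
print(DelegatingProxy().get_input_embeddings())  # explicit delegation reaches Impl
```

Any method that the base class already defines therefore needs its own override on the proxy, which is what the suggested delegations do.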

@risemeup1111 risemeup1111 left a comment


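Re-reviewed the new commits. The previous round's suggestion about embedding/head APIs on the AutoModel proxy path has been addressed with explicit delegation and is covered by proxy tests. No issues that need to block merging were found, and CI has passed.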

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

@liuhao2638 liuhao2638 left a comment

Contributor

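Re-reviewed the latest commits after develop was merged in. The InternLM2.5 changes are consistent with the previous round that passed, and the earlier proxy delegation issues remain fixed. CI currently passes, and no new issues that need to block merging were found.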

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

@learncat163 learncat163 mentioned this pull request Jul 5, 2026

6 participants