Support InternLM2.5 and model alignment #4131
Thanks for your contribution! |
Codecov Report
❌ Your patch status has failed because the patch coverage (51.09%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## develop #4131 +/- ##
==========================================
Coverage ? 47.11%
==========================================
Files ? 482
Lines ? 91611
Branches ? 0
==========================================
Hits ? 43165
Misses ? 48446
Partials ? 0
☔ View full report in Codecov by Sentry.
|
|
/re-run all-failed |
| module.weight[module._padding_idx].zero_() | ||
|
|
||
| @classmethod | ||
| def _gen_aoa_config(cls, config: InternLM25Config): |
Disabling this is not recommended. See the following implementation for reference:
def _gen_aoa_config(cls, config: InternLM25Config):
    model_prefix = cls.base_model_prefix + "." if cls != cls.base_model_class else ""
    aoa_statements = [
        f"model.tok_embeddings.weight -> {model_prefix}tok_embeddings.weight",
        f"model.norm.weight -> {model_prefix}norm.weight",
        f"model.layers.$LAYER_ID.attention_norm.weight -> {model_prefix}layers.$LAYER_ID.attention_norm.weight",
        f"model.layers.$LAYER_ID.ffn_norm.weight -> {model_prefix}layers.$LAYER_ID.ffn_norm.weight",
    ]
    aoa_statements.extend([
        f"model.layers.$LAYER_ID.attention.{w}.weight^T -> {model_prefix}layers.$LAYER_ID.attention.{w}.weight"
        for w in ["wqkv", "wo"]
    ])
    aoa_statements.extend([
        f"model.layers.$LAYER_ID.feed_forward.{w}.weight^T -> {model_prefix}layers.$LAYER_ID.feed_forward.{w}.weight"
        for w in ["w1", "w2", "w3"]
    ])
    if cls != cls.base_model_class:
        if getattr(config, "tie_word_embeddings", False):
            aoa_statements.append("model.tok_embeddings.weight -> output.weight")
        else:
            aoa_statements.append("output.weight^T -> output.weight")
    return {"aoa_statements": aoa_statements}
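To make the statement syntax in the suggestion concrete, here is a toy interpreter. It is not Paddle's actual AOA parser: it assumes that `$LAYER_ID` stands for each layer index, that `->` maps a source key to a target key, and that a `^T` suffix on the source means the tensor is transposed. The names `apply_statements` and `num_layers` are hypothetical.

```python
import numpy as np


def apply_statements(statements, source, num_layers):
    """Toy interpretation of 'src[^T] -> dst' statements over a dict of arrays."""
    target = {}
    for stmt in statements:
        # Expand the per-layer placeholder; statements without it are used once.
        ids = range(num_layers) if "$LAYER_ID" in stmt else [None]
        for i in ids:
            s = stmt if i is None else stmt.replace("$LAYER_ID", str(i))
            src, dst = [part.strip() for part in s.split("->")]
            transpose = src.endswith("^T")  # assumed meaning: transpose the source
            if transpose:
                src = src[: -len("^T")]
            value = source[src]
            target[dst] = value.T if transpose else value
    return target


source = {
    "model.tok_embeddings.weight": np.zeros((8, 4)),
    "model.layers.0.attention.wo.weight": np.zeros((4, 6)),
    "model.layers.1.attention.wo.weight": np.zeros((4, 6)),
}
statements = [
    "model.tok_embeddings.weight -> tok_embeddings.weight",
    "model.layers.$LAYER_ID.attention.wo.weight^T -> layers.$LAYER_ID.attention.wo.weight",
]
converted = apply_statements(statements, source, num_layers=2)
```

Under these assumptions, the embedding keeps its shape while each per-layer `wo` weight ends up transposed under its target key.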
| ) | ||
|
|
||
| if attention_mask is not None and attention_mask.ndim == 4: | ||
| if attention_mask.max() != 0: |
Training fails at this point. I suggest removing this check and using the 4D mask directly.
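For context, a 4D mask of this kind is commonly an additive mask of shape [batch, 1, query_len, key_len], with 0 at visible positions and a large negative value at masked ones. The NumPy sketch below builds one from a 2D padding mask. `build_4d_additive_mask` is a hypothetical helper for illustration, not code from this PR.

```python
import numpy as np


def build_4d_additive_mask(padding_mask, dtype=np.float32):
    """padding_mask: [batch, seq_len], 1 = real token, 0 = padding."""
    batch, seq_len = padding_mask.shape
    # Causal part: query position q may attend to key positions k <= q.
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Padding part: no query may attend to a padded key position.
    keys_visible = padding_mask.astype(bool)[:, None, None, :]
    visible = causal[None, None, :, :] & keys_visible
    neg = np.finfo(dtype).min
    # 0 where attention is allowed, a large negative value elsewhere.
    return np.where(visible, 0.0, neg).astype(dtype)


mask = build_4d_additive_mask(np.array([[1, 1, 1, 0], [1, 1, 0, 0]]))
```

A mask in this form can be added to the attention scores as is, which is why no extra inspection of its values should be needed before use.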
| ("Gemma3", "gemma3_text"), | ||
| ("Glm4vMoe", "glm4v_moe"), | ||
| ("GlmOcr", "glm_ocr"), | ||
| ("InternLM2", "intern_lm2_5"), |
| logging_steps: 1 | ||
| gradient_accumulation_steps: 4 | ||
| logging_dir: ./vdl_log | ||
| output_dir: ./checkpoints/qwen3-sft-full |
| # limitations under the License. | ||
|
|
||
| # TODO: not enabled in .github/workflows/fleet-model-test.yml at first, to avoid blocking the whole pipeline | ||
| # TODO: loss comparison materials will be submitted along with the PR |
|
1. The copyright year is wrong.
PaddleFormers Log Analysis
Log analysis report
Failed test cases:
Root cause analysis:
Suggested fixes:
🔄 Updated automatically after each re-run |
risemeup1111
left a comment
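This round of review is complete. There are still blocking issues that must be fixed first; details are in the inline comments. They mainly concern the weight-loading/parameter-registration path of the intern compatibility proxy, and a runtime failure on the InternLM2.5 flash attention path when a mask is passed. In addition, the tokenizer returns fields of inconsistent length, which I suggest fixing at the same time.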
| ImplModel = getattr(_impl_module, _cls_name) | ||
|
|
||
| # Store the actual implementation | ||
| self._impl = ImplModel(config) |
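Priority: P1
The real model is stored inside the proxy object here, but _impl goes through the underscore branch of the current __setattr__, which calls object.__setattr__, so it is not registered as a Paddle sublayer; writing to __dict__ directly afterwards likewise does not register parameters such as model/output/layers. PretrainedModel.from_pretrained first computes the expected keys from model.state_dict(), and AutoModel takes this proxy path when loading the public InternLM2.5 config (model_type: internlm2, architectures: InternLM2ForCausalLM). As a result, weight loading, saving, and optimizer parameter enumeration cannot see the real parameters.
Please do not use a transparent proxy with unregistered sublayers. If the wrapper is kept, it must explicitly register the implementation and delegate the weight-related interfaces, or AutoModel should return the concrete implementation class directly. For example: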
impl = ImplModel(config)
self.add_sublayer("impl", impl)
self._impl = impl
# If the wrapper's public state_dict keys must stay the same as the concrete model's keys, keep delegating explicitly:
def state_dict(self, *args, **kwargs):
    return self._impl.state_dict(*args, **kwargs)
def set_state_dict(self, state_dict, *args, **kwargs):
    return self._impl.set_state_dict(state_dict, *args, **kwargs)
| from ..intern_lm2_5 import modeling as _impl_module | ||
| else: | ||
| logger.info("Detected InternLM2 2.0, loading 2.0 implementation") | ||
| from ..intern_lm2 import modeling as _impl_module |
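Priority: P1
This 2.0 branch imports ..intern_lm2, but the repository currently has only intern/ and intern_lm2_5/; there is no paddleformers.transformers.intern_lm2 package. Any ordinary model_type: internlm2 config without the AutoModelForSequenceClassification marker will reach this branch and fail immediately with ModuleNotFoundError. This also conflicts with the compatibility goal stated in the comment, "supporting both 2.0 and 2.5".
Please either add a real InternLM2 2.0 implementation package or stop declaring/mapping the 2.0 proxy capability. If this PR supports only 2.5, I suggest restricting the Auto mapping to paths that can be identified as 2.5, so that ordinary internlm2 configs do not fall through to a nonexistent module. An example of the shape of the fix: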
if config.is_version_2_5:
    from ..intern_lm2_5 import modeling as _impl_module
else:
    raise NotImplementedError("InternLM2 2.0 is not supported by this PaddleFormers implementation yet.")
| has_flash_attn = False | ||
|
|
||
| try: | ||
| from ..intern.bert_padding_delte import index_first_axis, pad_input, unpad_input |
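Priority: P1
This imports the nonexistent ..intern.bert_padding_delte, so execution always falls through to the fallback below. However, the fallback pad_input(hidden_states, attention_mask) takes only 2 arguments, while _flash_attention_forward calls pad_input(attn_output_unpad, indices_q, batch_size, query_length) with 4 arguments at line 441. Whenever attn_implementation="flash_attention_2" is used with a padding mask, restoring the batched output raises TypeError, so the masked flash attention path is unusable for both training and inference.
Please import the correct unpad/pad utilities, or give the fallback the same signature and semantics as the call site, and add a flash attention test with a padding mask. An example of the shape of the fix: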
def pad_input(hidden_states, indices, batch_size, seqlen):
    output = paddle.zeros([batch_size * seqlen, *hidden_states.shape[1:]], dtype=hidden_states.dtype)
    output = paddle.scatter(output, indices, hidden_states)
    return output.reshape([batch_size, seqlen, *hidden_states.shape[1:]])
def unpad_input(hidden_states, attention_mask):
    indices, cu_seqlens, max_seqlen = _get_unpad_data(attention_mask)
    hidden_states = index_first_axis(hidden_states.reshape([-1, *hidden_states.shape[2:]]), indices)
    return hidden_states, indices, cu_seqlens, max_seqlen
| if token_ids_1 is None: | ||
| return [1] + ([0] * len(token_ids_0)) + [1] | ||
| return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1] |
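Priority: P2
build_inputs_with_special_tokens adds only BOS by default (add_eos_token=False), but this code unconditionally marks EOS as a special token as well; create_token_type_ids_from_sequences likewise counts one extra EOS. As a result, with return_special_tokens_mask=True or when token_type_ids is returned, the length is 1 greater than the actual input_ids, which breaks batching and alignment.
Please make the mask and token_type_ids follow the length of the sequence that is actually constructed, for example: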
| if token_ids_1 is None: | |
| return [1] + ([0] * len(token_ids_0)) + [1] | |
| return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1] | |
| output = self.build_inputs_with_special_tokens(token_ids_0, token_ids_1) | |
| if token_ids_1 is None: | |
| prefix_len = 1 if self.add_bos_token else 0 | |
| suffix_len = 1 if self.add_eos_token else 0 | |
| return ([1] * prefix_len) + ([0] * len(token_ids_0)) + ([1] * suffix_len) | |
| prefix_len = 1 if self.add_bos_token else 0 | |
| suffix_len = 1 if self.add_eos_token else 0 | |
| return ([1] * prefix_len) + ([0] * len(token_ids_0)) + ([0] * len(token_ids_1)) + ([1] * suffix_len) |
| eos = [self.eos_token_id] | ||
|
|
||
| if token_ids_1 is None: | ||
| return len(token_ids_0 + eos) * [0] | ||
| return len(token_ids_0 + eos + token_ids_1 + eos) * [0] |
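Priority: P2
EOS is unconditionally counted in the length here too, but with the default add_eos_token=False the actual input_ids contain no EOS; for example, 3 tokens yield 4 token_type_ids. Please generate these from the real output length of build_inputs_with_special_tokens so that batched return fields do not end up with inconsistent lengths.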
| eos = [self.eos_token_id] | |
| if token_ids_1 is None: | |
| return len(token_ids_0 + eos) * [0] | |
| return len(token_ids_0 + eos + token_ids_1 + eos) * [0] | |
| return [0] * len(self.build_inputs_with_special_tokens(token_ids_0, token_ids_1)) |
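The invariant behind both tokenizer comments is that every per-token field must have the same length as the input_ids actually built. The sketch below models only that bookkeeping with a stand-in class. `ToyTokenizer`, its token ids, and its sequence layout (optional BOS, both sequences, optional EOS) are assumptions for illustration, not the PR's tokenizer.

```python
class ToyTokenizer:
    """Stand-in tokenizer; only the special-token bookkeeping is modelled."""

    bos_token_id = 1
    eos_token_id = 2

    def __init__(self, add_bos_token=True, add_eos_token=False):
        self.add_bos_token = add_bos_token
        self.add_eos_token = add_eos_token

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        bos = [self.bos_token_id] if self.add_bos_token else []
        eos = [self.eos_token_id] if self.add_eos_token else []
        output = bos + list(token_ids_0)
        if token_ids_1 is not None:
            output += list(token_ids_1)
        return output + eos

    def get_special_tokens_mask(self, token_ids_0, token_ids_1=None):
        # Mark a special token only where one is actually inserted.
        prefix = [1] if self.add_bos_token else []
        suffix = [1] if self.add_eos_token else []
        n_second = len(token_ids_1) if token_ids_1 is not None else 0
        return prefix + [0] * (len(token_ids_0) + n_second) + suffix

    def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
        # Derive the length from the sequence that is really constructed.
        return [0] * len(self.build_inputs_with_special_tokens(token_ids_0, token_ids_1))


tok = ToyTokenizer()  # defaults: BOS added, EOS not added
ids = tok.build_inputs_with_special_tokens([5, 6, 7], [8, 9])
mask = tok.get_special_tokens_mask([5, 6, 7], [8, 9])
type_ids = tok.create_token_type_ids_from_sequences([5, 6, 7], [8, 9])
```

A regression test along these lines could loop over all four flag combinations and assert that the three lengths agree.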
risemeup1111
left a comment
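I have re-reviewed the new commits. The issues from the previous round (the 2.0 branch import, the flash attention padding helpers, and the tokenizer return lengths) appear to have been addressed. However, the intern compatibility proxy still has two missing delegations that affect the default loading and generation paths. Details are in the new inline comments; I suggest fixing them before merging.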
| return self._impl.parameters(include_sublayers=include_sublayers) | ||
|
|
||
| def named_parameters(self, prefix="", include_sublayers=True): | ||
| return self._impl.named_parameters(prefix=prefix, include_sublayers=include_sublayers) |
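Priority: P1
Delegation for state_dict/named_parameters has been added here, but from_pretrained defaults to load_checkpoint_format="flex_checkpoint", which calls model.sharded_state_dict() directly. Because the proxy class does not override this method, the call goes to Paddle's Layer.sharded_state_dict(), which recurses over registered sublayers and produces keys such as _impl.model... / _impl.output.... Meanwhile, _gen_aoa_config() still returns the real implementation's model... / output... target keys, so dist.load_state_dict cannot find these targets, and loading an InternLM2.5 flex checkpoint through AutoModel still fails.
Please delegate the sharded state dict to the real implementation as well, and keep the _impl. prefix out of the public weight keys: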
| return self._impl.named_parameters(prefix=prefix, include_sublayers=include_sublayers) | |
| def named_parameters(self, prefix="", include_sublayers=True): | |
| return self._impl.named_parameters(prefix=prefix, include_sublayers=include_sublayers) | |
| def sharded_state_dict(self, *args, **kwargs): | |
| return self._impl.sharded_state_dict(*args, **kwargs) |
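The key-prefix problem can be reproduced without Paddle. The sketch below uses `ToyLayer`, a minimal stand-in that is not paddle.nn.Layer and only mimics the idea that state-dict keys are built by recursing over registered sublayers. All class names here are hypothetical.

```python
class ToyLayer:
    """Minimal stand-in for a framework layer; NOT paddle.nn.Layer."""

    def __init__(self):
        self._params = {}
        self._sublayers = {}

    def add_parameter(self, name, value):
        self._params[name] = value

    def add_sublayer(self, name, layer):
        self._sublayers[name] = layer

    def state_dict(self, prefix=""):
        # Keys are built by recursing over *registered* sublayers only.
        out = {prefix + k: v for k, v in self._params.items()}
        for name, sub in self._sublayers.items():
            out.update(sub.state_dict(prefix + name + "."))
        return out


class Impl(ToyLayer):
    def __init__(self):
        super().__init__()
        self.add_parameter("output.weight", 0.0)


class UnregisteredProxy(ToyLayer):
    def __init__(self):
        super().__init__()
        self._impl = Impl()  # plain attribute, never registered


class RegisteredProxy(ToyLayer):
    def __init__(self):
        super().__init__()
        self.add_sublayer("_impl", Impl())  # visible, but keys gain "_impl."


class DelegatingProxy(RegisteredProxy):
    def state_dict(self, prefix=""):
        # Delegate so the public keys match the real implementation.
        return self._sublayers["_impl"].state_dict(prefix)


keys = {
    "unregistered": sorted(UnregisteredProxy().state_dict()),
    "registered": sorted(RegisteredProxy().state_dict()),
    "delegating": sorted(DelegatingProxy().state_dict()),
}
```

In this toy model, only the delegating variant exposes the same keys as the implementation, which is the property a key-mapping config written against the real model's names relies on.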
| _auto_class = "AutoModelForCausalLM" | ||
| _tied_weights_keys = ["output.weight"] | ||
|
|
||
| def __init__(self, config: InternLM2Config): |
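Priority: P1
The InternLM2ForCausalLM proxy does not override prepare_inputs_for_generation. Because this method is already defined on GenerationMixin, __getattr__ does not forward it to _impl. As a result, PretrainedModel.can_generate() judges the proxy to be non-generative, from_pretrained does not load generation_config.json, and save_pretrained does not save the generation config. A generate() call also goes through the generic input preparation logic rather than the cache_position / DynamicCache trimming logic in the InternLM2.5 implementation.
Please explicitly delegate the generation entry point on the CausalLM proxy to restore generation behavior consistent with the real implementation: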
| def __init__(self, config: InternLM2Config): | |
| def __init__(self, config: InternLM2Config): | |
| super().__init__(config) | |
| def prepare_inputs_for_generation(self, *args, **kwargs): | |
| return self._impl.prepare_inputs_for_generation(*args, **kwargs) |
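The underlying Python rule is that `__getattr__` runs only after normal attribute lookup has failed, so a method inherited from a base class is never forwarded. The sketch below shows this with stand-in classes. `ToyGenerationMixin`, `ToyImpl`, and both proxies are hypothetical and unrelated to the real PaddleFormers classes.

```python
class ToyGenerationMixin:
    """Stand-in base class; NOT the real GenerationMixin."""

    def prepare_inputs_for_generation(self, input_ids):
        return {"input_ids": input_ids, "source": "base"}


class ToyImpl(ToyGenerationMixin):
    def prepare_inputs_for_generation(self, input_ids):
        # Model-specific logic, here reduced to keeping only the last token.
        return {"input_ids": input_ids[-1:], "source": "impl"}

    def extra_method(self):
        return "impl-only"


class ForwardingProxy(ToyGenerationMixin):
    def __init__(self):
        self._impl = ToyImpl()

    def __getattr__(self, name):
        # Called only when normal lookup fails on the proxy and its classes.
        return getattr(self._impl, name)


class DelegatingProxy(ForwardingProxy):
    def prepare_inputs_for_generation(self, *args, **kwargs):
        # Explicit delegation: the inherited base method no longer wins.
        return self._impl.prepare_inputs_for_generation(*args, **kwargs)


forwarded = ForwardingProxy().prepare_inputs_for_generation([1, 2, 3])
delegated = DelegatingProxy().prepare_inputs_for_generation([1, 2, 3])
impl_only = ForwardingProxy().extra_method()
```

Here `extra_method` reaches the implementation through `__getattr__` because the proxy's classes do not define it, while `prepare_inputs_for_generation` reaches the implementation only in the explicitly delegating variant.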
risemeup1111
left a comment
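I have re-reviewed the new commits. The two blocking proxy delegation issues from the previous round have been fixed as expected, and I found no new blocking issues. There is one medium-priority suggestion about embedding/head APIs that are not fully delegated on the AutoModel proxy path; details are in the inline comment.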
| return object.__getattribute__(self, name) | ||
| except AttributeError: | ||
| return getattr(self._impl, name) | ||
| return getattr(self._impl, name) |
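Priority: P2
__getattr__ can only proxy attributes that are not found on the current class, but get_input_embeddings / set_input_embeddings / get_output_embeddings / set_output_embeddings are already defined on PretrainedModel, so they never reach this method. Currently InternLM2Model.get_input_embeddings() follows the base-class logic and looks for model/embed_tokens, whereas the real implementation uses tok_embeddings, so it raises NotImplementedError; InternLM2ForQuestionAnswering behaves similarly. InternLM2ForCausalLM.get_output_embeddings() returns None because the base class only looks for lm_head, which is inconsistent with the output head of the real InternLM25ForCausalLM. Common downstream uses such as resize_token_embeddings, NEFTune hooks, weight tying, or embedding APIs in tests will behave inconsistently on the AutoModel proxy path.
Please explicitly delegate these public embedding/head APIs that already exist on the base class, and add a lightweight test for the InternLM2* proxy classes, for example: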
# in InternLM2PretrainedModel
def get_input_embeddings(self):
    return self._impl.get_input_embeddings()
def set_input_embeddings(self, value):
    return self._impl.set_input_embeddings(value)
# in InternLM2ForCausalLM
def get_output_embeddings(self):
    return self._impl.get_output_embeddings()
def set_output_embeddings(self, new_embeddings):
    return self._impl.set_output_embeddings(new_embeddings)
risemeup1111
left a comment
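I have re-reviewed the new commits. The previous round's suggestion about embedding/head APIs on the AutoModel proxy path has been addressed through explicit delegation and proxy tests. I found no remaining issues that should block merging, and CI has passed.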
liuhao2638
left a comment
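I have re-reviewed the latest commits after develop was merged in. The InternLM2.5 changes are unchanged from the previously approved round, and the earlier proxy delegation issues remain fixed. CI currently passes, and I found no new issues that should block merging.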
PR types
New features
PR changes
Add Models
Description
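Migrate the InternLM2.5 model to PaddleFormers:
1. Model functionality and alignment tests (tests/transformers/intern_lm2_5/test_modeling.py): functional test classes such as InternLM25CompatibilityTest
2. Location of the converted model
Comparison with ms-swift
1. ms-swift configuration
2. paddleformers-cli configuration
3. Script for diffing the loss output
4. Loss comparison results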
common: 102, show: 100
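The loss diff script itself is not reproduced on this page. As a rough sketch of what such a comparison might look like, the snippet below assumes each framework's log has already been reduced to a {step: loss} mapping. `compare_losses`, the reading of "common" and "show" as "steps present in both runs" and "steps displayed", and the sample values are all assumptions; none of the numbers are results from this PR.

```python
def compare_losses(losses_a, losses_b, show=100):
    """Compare per-step losses from two runs over the steps they share."""
    common = sorted(set(losses_a) & set(losses_b))
    rows = []
    for step in common[:show]:
        a, b = losses_a[step], losses_b[step]
        rows.append((step, a, b, abs(a - b)))
    max_abs_diff = max((row[3] for row in rows), default=0.0)
    return {
        "common": len(common),  # steps present in both runs
        "show": len(rows),      # steps actually listed
        "max_abs_diff": max_abs_diff,
        "rows": rows,
    }


# Placeholder values for illustration only.
run_a = {1: 2.50, 2: 2.25, 3: 2.00}
run_b = {1: 2.50, 2: 2.00, 4: 1.75}
report = compare_losses(run_a, run_b, show=2)
```

Restricting the comparison to shared steps keeps logging-interval differences between the two tools from showing up as spurious mismatches.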