
[BugFix] fix nvfp4 #7879

Merged
lizexu123 merged 1 commit into PaddlePaddle:develop from lizexu123:fp4_fix_1 on May 21, 2026

@lizexu123
Collaborator

Motivation

Fix a bug where the multimodal weights of fp4 models were treated as not loaded and skipped, which caused errors in GLM.

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 21, 2026

Thanks for your contribution!

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


root seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter

codecov-commenter commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@4402396). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7879   +/-   ##
==========================================
  Coverage           ?   63.87%           
==========================================
  Files              ?      462           
  Lines              ?    64487           
  Branches           ?     9886           
==========================================
  Hits               ?    41188           
  Misses             ?    20499           
  Partials           ?     2800           
| Flag | Coverage Δ |
|---|---|
| GPU | 73.04% <ø> (?) |
| XPU | 7.11% <ø> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.

@PaddlePaddle-bot

PaddlePaddle-bot commented May 21, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-21 16:59:00

This CI report is generated from the following code (refreshed every 30 minutes):


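1 Task overview

All Required tasks have passed and the PR can be merged (2 optional tasks failed, which does not block merging).

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 59 (18) | 41 | 38 | 2 | 1 | 0 | 0 |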

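2 Task status summary

2.1 Required tasks: 10/10 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| | The other 10 required tasks passed | - | - | - | - | - |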

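2.2 Optional tasks: 28/31 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| | Check PR Template | 12s | Job | - |
| | Trigger Jenkins for PR | 7m54s | Job | - |
| | CI_HPU | - | Job | - |
| | The other 28 optional tasks passed | - | - | - |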

3 Failure details (required tasks only)

@PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-21 15:50:47

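📋 Review summary

PR overview: Removes the _is_initialized() check on the weight in process_weight_transpose, fixing a bug where fp4 multimodal weights were mistaken for uninitialized, skipped the transpose, and caused GLM inference errors.
Change scope: fastdeploy/model_executor/utils.py
Impact tags: [Loader] [Quantization]

Issues

| Level | File | Summary |
|---|---|---|
| ❓ Question | utils.py:135~138 (deleted lines) | Could removing the _is_initialized() check entirely introduce a new crash risk? |
| 📝 PR conventions | | The Modifications section is empty; the first Checklist item should be ticked |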

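❓ Question details

Original logic: if weight._is_initialized() == False, log a warning and return (a graceful skip). This fix removes the check entirely, so weight.shape is accessed directly regardless of the initialization state.

Potential risk: when a multimodal model runs pure-text inference, some vision-branch weights may never be loaded. If those weights also reach process_weight_transpose, accessing .shape after the check is removed may raise an exception directly, which is harder to debug than the original warning-and-skip.

Please confirm: what is the root cause of _is_initialized() returning False for the fp4 multimodal weights? Besides these weights, can other weights pass through this function while being genuinely uninitialized? If so, consider a targeted fix instead: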

def process_weight_transpose(layer, weight_name):
    weight = getattr(layer, weight_name)
    if not weight._is_initialized():
        # fp4 packed weights may report uninitialized; check dtype before skipping
        if not hasattr(weight, 'dtype') or weight.dtype is None:
            logger.warning(f"Weight {weight_name} not initialized, skipping transpose.")
            return
    if len(weight.shape) == 2:

Or at least add a comment explaining why it is safe to remove the check.
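The behaviour change this review is questioning can be sketched with stand-in objects. This is only an illustration under assumptions: `FakeWeight`, `FakeLayer`, `transpose_with_guard`, and `transpose_without_guard` are hypothetical names, the shape reversal is a placeholder for the real transpose, and none of it is the actual FastDeploy code. It also cannot reproduce the exception that a genuinely uninitialized framework tensor might raise on `.shape`.

```python
import logging

logger = logging.getLogger("transpose_sketch")


class FakeWeight:
    """Hypothetical stand-in for a framework parameter (not a Paddle class)."""

    def __init__(self, shape, initialized):
        self.shape = list(shape)
        self._initialized = initialized

    def _is_initialized(self):
        return self._initialized


class FakeLayer:
    """Hypothetical layer; weights are attached as plain attributes."""


def transpose_with_guard(layer, weight_name):
    """Behaviour before the PR, as described in the review: warn and skip."""
    weight = getattr(layer, weight_name)
    if not weight._is_initialized():
        logger.warning("Weight %s not initialized, skipping transpose.", weight_name)
        return False
    if len(weight.shape) == 2:
        weight.shape = weight.shape[::-1]  # placeholder for the real transpose
    return True


def transpose_without_guard(layer, weight_name):
    """Behaviour after the PR: the initialization check is gone."""
    weight = getattr(layer, weight_name)
    if len(weight.shape) == 2:
        weight.shape = weight.shape[::-1]  # placeholder for the real transpose
    return True
```

With a 2-D weight that reports itself as uninitialized, the guarded version returns early and leaves the shape untouched, while the unguarded version transposes it. That matches the skipped-transpose symptom described in the PR.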

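📝 PR conventions check

The title format is compliant (it contains the [BugFix] tag) ✓. The Modifications section is empty, and the first Checklist item should be ticked (the PR already has an official tag).

Suggested title (can be copied as-is):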

  • [BugFix] Fix nvfp4 multi-modal weight transpose skip bug in GLM

Suggested PR description (can be copied as-is; it must reproduce the full structure of the checklist §D2 template):

## Motivation
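Fix a bug in fp4 multimodal models where weights were mistaken for uninitialized because `_is_initialized()` returned False, causing `process_weight_transpose` to skip the transpose and the GLM model to produce incorrect inference results.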

## Modifications
- `fastdeploy/model_executor/utils.py`: remove the `weight._is_initialized()` check from `process_weight_transpose` so that fp4-format multimodal weights are no longer wrongly skipped during transpose.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

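Overall assessment

The fix is straightforward: removing the erroneous early return resolves the skipped fp4 multimodal weights. The author should either add an analysis of how this change affects weights that are genuinely uninitialized, or add unit tests covering this path, to reduce the risk of regression.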
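A regression test of the kind suggested above might look like the following minimal sketch. Everything here is an assumption for illustration: `StubWeight`, `StubLayer`, and `transpose_under_test` are hypothetical stand-ins rather than FastDeploy code, and the rule that only 2-D weights are transposed is inferred from the `len(weight.shape) == 2` line in the review snippet.

```python
import unittest


class StubWeight:
    """Hypothetical parameter stub that can report itself as uninitialized."""

    def __init__(self, shape, initialized=False):
        self.shape = list(shape)
        self._initialized = initialized

    def _is_initialized(self):
        return self._initialized


class StubLayer:
    """Hypothetical layer; weights are attached as plain attributes."""


def transpose_under_test(layer, weight_name):
    # Stand-in for the function under test, with no initialization guard.
    weight = getattr(layer, weight_name)
    if len(weight.shape) == 2:
        weight.shape = weight.shape[::-1]  # placeholder for the real transpose


class TransposeRegressionTest(unittest.TestCase):
    def test_uninitialized_flag_does_not_skip_2d_weight(self):
        layer = StubLayer()
        layer.weight = StubWeight([4, 8], initialized=False)
        transpose_under_test(layer, "weight")
        self.assertEqual(layer.weight.shape, [8, 4])

    def test_non_2d_weight_is_left_alone(self):
        layer = StubLayer()
        layer.bias = StubWeight([8], initialized=False)
        transpose_under_test(layer, "bias")
        self.assertEqual(layer.bias.shape, [8])
```

A real test would swap the stubs for actual framework parameters. It would also need a case for a weight that was never loaded at all, which these stubs cannot model.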

@lizexu123 merged commit fe146ec into PaddlePaddle:develop on May 21, 2026
55 of 60 checks passed


5 participants