[Loader] Add values natural order check to layers grouped validation #7498

Merged
bukejiyu merged 3 commits into PaddlePaddle:develop from bukejiyu:feat/weight-loading-natural-order
May 20, 2026

Conversation

@bukejiyu
Collaborator

@bukejiyu bukejiyu commented Apr 20, 2026

Motivation

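Fixes a potential OOM during weight loading when the values (weight file names) in weight_map are unordered. The previous logic only verified that the keys are grouped by layer; it did not check whether the values (the weight file name for each key) are in natural order. For some models this meant several shard files were read out of order, so data from multiple shards had to be held in memory at the same time, which led to OOM.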

Modifications

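  • load_weight_utils.py: adds a values_are_naturally_ordered(values) helper that uses natural_key to check whether the values are in natural order
  • get_all_weights_file: extends the computation of is_layers_are_grouped from checking only that keys are grouped to also requiring the values to be naturally ordered; the grouped loading strategy is enabled only when both conditions hold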
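
The check described above can be sketched as follows. This is a minimal illustration, not the PR's code: the body of natural_key is an assumed stand-in (the PR only names the function), and the two weight_map dictionaries and their shard file names are invented for the example.

```python
import re


def natural_key(name):
    """Assumed stand-in for the project's natural_key helper.

    Splits a file name into text and number chunks so that
    'shard-10' sorts after 'shard-2'.
    """
    parts = re.split(r"(\d+)", name)
    # With a capturing group, odd positions are always the digit runs.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


def values_are_naturally_ordered(values):
    """Check if values are sorted in natural order."""
    return list(values) == sorted(values, key=natural_key)


# Hypothetical weight_map entries: tensor name -> shard file name.
ordered_map = {
    "layers.0.weight": "model-00001-of-00003.safetensors",
    "layers.1.weight": "model-00002-of-00003.safetensors",
    "layers.2.weight": "model-00003-of-00003.safetensors",
}
unordered_map = {
    "layers.0.weight": "model-00002-of-00003.safetensors",
    "layers.1.weight": "model-00001-of-00003.safetensors",
    "layers.2.weight": "model-00003-of-00003.safetensors",
}

print(values_are_naturally_ordered(ordered_map.values()))    # True
print(values_are_naturally_ordered(unordered_map.values()))  # False
```

In the second map the keys are still grouped by layer, which is why a keys-only check would not catch the out-of-order shard reads.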

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented Apr 20, 2026

Thanks for your contribution!

@PaddlePaddle-bot

PaddlePaddle-bot commented Apr 28, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-20 01:40:52

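CI report generated from the following code (refreshed every 30 minutes):


1 Task overview

All Required tasks have passed ✅, so the PR can be merged (1 Optional task failed, which does not block merging).

Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped
42(0) | 42 | 41 | 1 | 0 | 0 | 0

2 Task status summary

2.1 Required tasks: 10/10 passed

Required tasks block merging; failures must be handled first.

Status | Task | Duration | Root cause | Suggested fix | Log | Rerun
✅ | Remaining 10 required tasks passed | - | - | - | - | -

2.2 Optional tasks: 31/32 passed

Optional tasks do not block merging; failures are for reference only.

Status | Task | Duration | Log | Rerun
❌ | CI_HPU | 1h4m | Job | -
✅ | Remaining 31 optional tasks passed | - | - | -

3 Failure details (required only)

No required tasks failed.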

@codecov-commenter

codecov-commenter commented Apr 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@bda1756). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7498   +/-   ##
==========================================
  Coverage           ?   63.33%           
==========================================
  Files              ?      462           
  Lines              ?    64378           
  Branches           ?     9871           
==========================================
  Hits               ?    40776           
  Misses             ?    20829           
  Partials           ?     2773           
Flag Coverage Δ
GPU 72.44% <100.00%> (?)
XPU 7.12% <0.00%> (?)


zoooo0820 previously approved these changes May 14, 2026

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-19 08:28:14

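CI report generated from the following code (refreshed every 30 minutes):


1 Task overview

2 Required tasks failed and must be resolved before the PR can be merged.

Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped
41(0) | 41 | 36 | 5 | 0 | 0 | 0

2 Task status summary

2.1 Required tasks: 7/9 passed

Required tasks block merging; failures must be handled first.

Status | Task | Duration | Root cause | Suggested fix | Log | Rerun
❌ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | 1h22m | PR issue: .tensor_parallel_size is accessed while parallel_config is None | Add a None check at load_weight_utils.py:138 | Job | -
❌ | xpu_4cards_case_test / run_xpu_4cards_cases | 53m38s | Flaky: the W4A8 XPU service failed to start, unrelated to the PR changes | Known flaky, rerun recommended | Job | -
✅ | Remaining 7 required tasks passed | - | - | - | - | -

2.2 Optional tasks: 29/32 passed

Optional tasks do not block merging; failures are for reference only.

Status | Task | Duration | Log | Rerun
❌ | Run iluvatar Tests / run_iluvatar_cases | 11m0s | Job | -
❌ | CI_HPU | 1h4m | Job | -
❌ | Trigger Jenkins for PR | 3m38s | Job | -
✅ | Remaining 29 optional tasks passed | - | - | -

3 Failure details (required only)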

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: test failure (confidence: high)

  • Status: ❌ Failed
  • Error type: test failure
  • Confidence: high
  • Root cause summary: the PR adds a check on parallel_config.tensor_parallel_size, which raises AttributeError when fd_config=None
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed test cases:

Test | Error | Root cause
test_load_weight_utils.py::TestWeightIterators::test_get_weight_iterator_ordered_and_kv_scale | AttributeError: 'NoneType' object has no attribute 'tensor_parallel_size' | parallel_config is None, so the attribute access raises

Root cause details:
The PR changes the signature of get_weight_iterator from load_config: Optional[LoadConfig] to fd_config: Optional[FDConfig] and adds the check if is_layers_are_grouped or parallel_config.tensor_parallel_size == 1:. When fd_config=None (the default in the test call), parallel_config = fd_config.parallel_config if fd_config else None evaluates to None, and the subsequent access to parallel_config.tensor_parallel_size raises AttributeError.

Key log:

> if is_layers_are_grouped or parallel_config.tensor_parallel_size == 1:
E AttributeError: 'NoneType' object has no attribute 'tensor_parallel_size'
fastdeploy/model_executor/load_weight_utils.py:138: AttributeError

Suggested fix:

  1. fastdeploy/model_executor/load_weight_utils.py L138: change the condition to if is_layers_are_grouped or parallel_config is None or parallel_config.tensor_parallel_size == 1:
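
The failure and two possible guards can be compared in a small standalone sketch. The function names below are hypothetical, and SimpleNamespace stands in for the real parallel_config object; only the three boolean conditions are taken from this page (the traceback, the CI agent's suggestion, and the condition quoted in the later review diff). Per the review comment, a true result selects file-by-file loading.

```python
from types import SimpleNamespace


def unguarded(is_layers_are_grouped, parallel_config):
    # Condition from the traceback: fails when parallel_config is None
    # and is_layers_are_grouped is False (no short-circuit).
    return is_layers_are_grouped or parallel_config.tensor_parallel_size == 1


def none_treated_as_tp1(is_layers_are_grouped, parallel_config):
    # Condition suggested by the CI agent: a missing config behaves like TP=1.
    return (
        is_layers_are_grouped
        or parallel_config is None
        or parallel_config.tensor_parallel_size == 1
    )


def none_defers_to_grouping(is_layers_are_grouped, parallel_config):
    # Condition quoted in the review diff: a missing config leaves the
    # decision to is_layers_are_grouped alone.
    return is_layers_are_grouped or (
        parallel_config is not None and parallel_config.tensor_parallel_size == 1
    )


tp1 = SimpleNamespace(tensor_parallel_size=1)  # stand-in config objects
tp4 = SimpleNamespace(tensor_parallel_size=4)

try:
    unguarded(False, None)
except AttributeError as exc:
    print("unguarded:", exc)

for cfg_name, cfg in [("None", None), ("TP=1", tp1), ("TP=4", tp4)]:
    print(
        cfg_name,
        none_treated_as_tp1(False, cfg),
        none_defers_to_grouping(False, cfg),
    )
```

The two guards differ only when parallel_config is None and the layers are not grouped: the first selects file-by-file loading, the second does not.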

Fix summary: add a parallel_config is None check at load_weight_utils.py:138

Related change: fastdeploy/model_executor/load_weight_utils.py L135-L138 (the parallel_config logic added by this PR)

Link: view log

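xpu_4cards_case_test / run_xpu_4cards_cases: test failure (confidence: medium)

  • Status: ❌ Failed
  • Error type: test failure
  • Confidence: medium
  • Root cause summary: the W4A8 XPU service failed to start; the v0 loading path has no direct connection to the PR changes
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed test cases:

Test | Error | Root cause
test_w4a8.py::test_w4a8 | Failed: W4A8模式服务启动失败 (W4A8 mode service failed to start) | The W4A8 service failed to start in the XPU environment
test_w4a8_cudagraph.py::test_w4a8 | Failed: W4A8 CudaGraph模式服务启动失败 (W4A8 CudaGraph mode service failed to start) | The W4A8 CudaGraph mode service failed to start

Root cause details:
ERNIE-4.5-300B-A47B-W4A8C8-TP4 ships pre-sliced (TP4) weights, which the v1 loader does not support. The log shows v1 loader currently does not support pre-sliced weights; fallback to the v0 loader, so the whole run goes through the v0 loading path. This PR modifies default_loader_v1.py and get_weight_iterator (the v1 path), so the v0 path should in theory be unaffected. Two W4A8 tests failing while the other 12 pass matches the known flaky behavior of W4A8 on XPU.

Key log: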

INFO utils.py[line:430] v1 loader currently does not support pre-sliced weights; fallback to the v0 loader for model loading.
...
tests/xpu_ci/4cards_cases/test_w4a8.py:63: in test_w4a8
E   Failed: W4A8模式服务启动失败
================== 2 failed, 12 passed in 3050.10s (0:50:50) ===================

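Suggested fix:

  1. Rerun first to check whether the failure is intermittent
  2. If it keeps failing, check whether the v0 loader path calls get_all_weights_file, which this PR modifies

Fix summary: known flaky, rerun to verify

Related change: unclear (v0 loader path, decoupled from the v1 path modified by the PR)

Link: view log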


@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-19 18:08:58

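📋 Review summary

PR overview: adds a natural-order check on values to the layers-grouped validation of weight_map, fixing a loading OOM caused by reading files out of order in multi-shard scenarios.
Scope of changes: fastdeploy/model_executor/load_weight_utils.py, model_loader/default_loader_v1.py
Impact tag: [Loader]

Issues

Level | File | Summary
🟡 Suggestion | load_weight_utils.py:138 | The TP=1 short-circuit condition lacks a comment explaining its intent
🟡 Suggestion | load_weight_utils.py:77 | The new function has no unit test coverage
📝 PR convention | Checklist | unit tests is unchecked and no reason is given

📝 PR convention check

The title format [Loader] Add values natural order check to layers grouped validation follows the convention, and the tag is valid. The description is structurally complete and contains all required sections.

In the Checklist, [ ] Add unit tests. is unchecked, but the PR body does not explain why (the template requires "Please write the reason in this PR if no unit tests"). Consider adding an explanation or adding the tests directly.

Overall assessment

The core fix is correct: the loading strategy is chosen by requiring both grouped keys and naturally ordered values, and forcing file-by-file loading when TP=1 to avoid OOM is also reasonable. The main suggestions are to add a comment explaining the intent of the TP=1 condition and to add unit tests for values_are_naturally_ordered to guard against future regressions.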

 )
 else:
-    if is_layers_are_grouped:
+    if is_layers_are_grouped or (parallel_config is not None and parallel_config.tensor_parallel_size == 1):

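🟡 Suggestion: the new TP=1 short-circuit condition lacks a comment explaining its intent

When tensor_parallel_size == 1, loading is forced through safetensors_weights_iterator(files_list) (file by file, in file-name order), bypassing the is_layers_are_grouped check. This is a reasonable optimization (in the single-card case the weight loading order does not affect correctness, and it avoids the OOM caused by the ordered iterator jumping between shards), but the intent is not obvious on a direct read.

Consider adding a comment that explains it: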

if is_layers_are_grouped or (
    # For TP=1, sequential file-by-file loading is always safe and avoids
    # the OOM risk of safetensors_weights_iterator_ordered jumping between
    # shard files in non-sequential order.
    parallel_config is not None and parallel_config.tensor_parallel_size == 1
):


def values_are_naturally_ordered(values):
    """Check if values are sorted in natural order."""
    return list(values) == sorted(values, key=natural_key)

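🟡 Suggestion: values_are_naturally_ordered lacks unit test coverage

This function is the entry point of the PR's core fix, but no corresponding test case was added to the test file. In the existing test_get_weight_iterator_ordered_and_kv_scale, every weight_map entry points to the same shard, so it cannot cover the triggering scenario of unordered values across multiple shard files.

Suggested additions:

  1. Unit tests for values_are_naturally_ordered (one ordered case and one unordered case)
  2. An integration scenario in the style of test_get_weight_iterator_ordered_and_kv_scale with multiple shards and unordered values (for example layer 0 → shard-2, layer 1 → shard-1)

This logic is on the critical path of the OOM fix, so missing regression tests carry a relatively high risk.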
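
A minimal sketch of the first suggested addition, written as pytest-style test functions. It is self-contained for illustration: natural_key is an assumed stand-in, values_are_naturally_ordered is copied from the snippet quoted above, and the shard file names are invented. A real test would import both helpers from fastdeploy/model_executor/load_weight_utils.py instead of redefining them, and the integration scenario (second suggestion) is not covered here.

```python
import re


def natural_key(name):
    # Assumed stand-in for the project's natural_key helper.
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


def values_are_naturally_ordered(values):
    """Check if values are sorted in natural order."""
    return list(values) == sorted(values, key=natural_key)


def test_values_naturally_ordered():
    # Repeated shard names are expected: many tensors share one file.
    values = [
        "model-00001-of-00002.safetensors",
        "model-00001-of-00002.safetensors",
        "model-00002-of-00002.safetensors",
    ]
    assert values_are_naturally_ordered(values)


def test_values_not_naturally_ordered():
    values = [
        "model-00002-of-00002.safetensors",
        "model-00001-of-00002.safetensors",
    ]
    assert not values_are_naturally_ordered(values)


def test_multi_shard_weight_map_with_unordered_values():
    # Hypothetical weight_map: layer 0 -> shard 2, layer 1 -> shard 1.
    weight_map = {
        "model.layers.0.weight": "model-00002-of-00002.safetensors",
        "model.layers.1.weight": "model-00001-of-00002.safetensors",
    }
    assert not values_are_naturally_ordered(weight_map.values())
```

Saved as a test module, these functions would be collected by pytest without further setup.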

@bukejiyu bukejiyu merged commit c1b7c08 into PaddlePaddle:develop May 20, 2026
42 of 43 checks passed