[Loader] Add values natural order check to layers grouped validation by bukejiyu · Pull Request #7498 · PaddlePaddle/FastDeploy

bukejiyu · 2026-04-20T03:24:55Z

Motivation

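Fixes a possible OOM during loading when the values (weight file names) in weight_map are out of order. The previous logic only verified that the keys were grouped by layer; it did not check whether the values (the weight file names) were in natural order. For some models, shard files were therefore read out of sequence, so data from several shards had to be held in memory at once, causing OOM.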

Modifications

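load_weight_utils.py: adds a values_are_naturally_ordered(values) helper that uses natural_key to check whether the values are in natural order.
get_all_weights_file: extends the computation of is_layers_are_grouped from checking only key grouping to also requiring that the values be naturally ordered; the grouped loading strategy is enabled only when both conditions hold.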
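To make the natural-order check concrete, here is a minimal sketch. The body of values_are_naturally_ordered matches the one quoted in the review diff on this page; natural_key is a hypothetical stand-in (this page does not show the repository's implementation), and the weight_map entries are made up for illustration.

```python
import re


def natural_key(name):
    # Hypothetical stand-in for the repository's natural_key: split the name
    # into text and digit runs so that digit runs compare as integers.
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


def values_are_naturally_ordered(values):
    """Check if values are sorted in natural order."""
    return list(values) == sorted(values, key=natural_key)


# Illustrative weight_map entries (parameter name -> shard file name).
ordered_map = {
    "layers.0.weight": "model-00001-of-00002.safetensors",
    "layers.1.weight": "model-00002-of-00002.safetensors",
}
shuffled_map = {
    "layers.0.weight": "model-00002-of-00002.safetensors",
    "layers.1.weight": "model-00001-of-00002.safetensors",
}

print(values_are_naturally_ordered(ordered_map.values()))   # True
print(values_are_naturally_ordered(shuffled_map.values()))  # False
```

With the shuffled map, a reader that walks the keys in order would open shard 2 before shard 1, which is the access pattern the PR description associates with holding several shards in memory at once.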

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-04-20T03:25:01Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-04-28T13:03:25Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-20 01:40:52

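CI report generated from the following code (refreshed every 30 minutes):

PR commit: f751858
Merge base: bda1756 (branch: develop)
View full diff
CI details

1 Task overview

All Required tasks have passed ✅; the PR can be merged (1 Optional task failed, which does not block merging).

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
42(0)	42	41	1	0	0	0

2 Task status summary

2.1 Required tasks: 10/10 passed

Required tasks block merging; failures must be fixed first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
✅	All 10 required tasks passed	-	-	-	-	-

2.2 Optional tasks: 31/32 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`CI_HPU`	1h4m	Job	-
✅	Remaining 31 optional tasks passed	-	-	-

3 Failure details (required only)

No required tasks failed.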

codecov-commenter · 2026-04-28T13:49:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@bda1756). Learn more about missing BASE report.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7498   +/-   ##
==========================================
  Coverage           ?   63.33%           
==========================================
  Files              ?      462           
  Lines              ?    64378           
  Branches           ?     9871           
==========================================
  Hits               ?    40776           
  Misses             ?    20829           
  Partials           ?     2773

Flag	Coverage Δ
GPU	`72.44% <100.00%> (?)`
XPU	`7.12% <0.00%> (?)`


PaddlePaddle-bot · 2026-05-19T00:30:10Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-19 08:28:14

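CI report generated from the following code (refreshed every 30 minutes):

PR commit: 22da62c
Merge base: 7bc29b5 (branch: develop)
View full diff
CI details

1 Task overview

❌ 2 Required tasks failed; they must be fixed before the PR can be merged.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
41(0)	41	36	5	0	0	0

2 Task status summary

2.1 Required tasks: 7/9 passed

Required tasks block merging; failures must be fixed first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	1h22m	PR issue: `.tensor_parallel_size` is accessed when `parallel_config` is None	Add a None check at `load_weight_utils.py:138`	Job	-
❌	`xpu_4cards_case_test / run_xpu_4cards_cases`	53m38s	Flaky: the W4A8 XPU service failed to start; unrelated to the PR changes	Known flaky; rerun recommended	Job	-
✅	Remaining 7 required tasks passed	-	-	-	-	-

2.2 Optional tasks: 29/32 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	11m0s	Job	-
❌	`CI_HPU`	1h4m	Job	-
❌	`Trigger Jenkins for PR`	3m38s	Job	-
✅	Remaining 29 optional tasks passed	-	-	-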

3 Failure details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

Status: ❌ Failed
Error type: Test failure
Confidence: High
Root cause summary: the PR adds a parallel_config.tensor_parallel_size check, which raises AttributeError when fd_config=None
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_load_weight_utils.py::TestWeightIterators::test_get_weight_iterator_ordered_and_kv_scale`	AttributeError: 'NoneType' object has no attribute 'tensor_parallel_size'	`parallel_config` is None, so the attribute access fails

Root cause details:
The PR changes the signature of get_weight_iterator from load_config: Optional[LoadConfig] to fd_config: Optional[FDConfig] and adds the check if is_layers_are_grouped or parallel_config.tensor_parallel_size == 1:. When fd_config=None (the default in the test call), parallel_config = fd_config.parallel_config if fd_config else None evaluates to None, and the subsequent access to parallel_config.tensor_parallel_size raises AttributeError.

Key log:

> if is_layers_are_grouped or parallel_config.tensor_parallel_size == 1:
E AttributeError: 'NoneType' object has no attribute 'tensor_parallel_size'
fastdeploy/model_executor/load_weight_utils.py:138: AttributeError

Suggested fix:

fastdeploy/model_executor/load_weight_utils.py L138: change the condition to if is_layers_are_grouped or parallel_config is None or parallel_config.tensor_parallel_size == 1:

Fix summary: add a parallel_config is None check at load_weight_utils.py:138

Related changes: fastdeploy/model_executor/load_weight_utils.py L135-L138 (the parallel_config logic added by the PR)

Link: View logs
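The guard suggested here and the condition that appears in the review diff further down this page are not equivalent when parallel_config is None. The sketch below contrasts the two; the helper names are hypothetical and SimpleNamespace stands in for the real config object, so treat it as an illustration of the boolean logic only, not of the loader itself.

```python
from types import SimpleNamespace


def sequential_none_counts_as_tp1(is_layers_are_grouped, parallel_config):
    # Variant suggested in the CI report: a missing config selects the
    # sequential file-by-file path.
    return (
        is_layers_are_grouped
        or parallel_config is None
        or parallel_config.tensor_parallel_size == 1
    )


def sequential_none_defers_to_grouping(is_layers_are_grouped, parallel_config):
    # Variant shown in the review diff: a missing config falls back to the
    # is_layers_are_grouped result alone.
    return is_layers_are_grouped or (
        parallel_config is not None and parallel_config.tensor_parallel_size == 1
    )


tp1 = SimpleNamespace(tensor_parallel_size=1)  # stand-in config, TP=1
tp4 = SimpleNamespace(tensor_parallel_size=4)  # stand-in config, TP=4

for cfg in (None, tp1, tp4):
    print(
        sequential_none_counts_as_tp1(False, cfg),
        sequential_none_defers_to_grouping(False, cfg),
    )
```

Both variants avoid the AttributeError because `or` and `and` short-circuit before the attribute access; they differ only in which iterator an ungrouped weight_map gets when no config is passed.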

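xpu_4cards_case_test / run_xpu_4cards_cases

Status: ❌ Failed
Error type: Test failure
Confidence: Medium
Root cause summary: the W4A8 XPU service failed to start; the v0 loading path is not directly related to the PR changes
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`test_w4a8.py::test_w4a8`	Failed: W4A8模式服务启动失败 (W4A8-mode service failed to start)	The W4A8 service failed to start in the XPU environment
`test_w4a8_cudagraph.py::test_w4a8`	Failed: W4A8 CudaGraph模式服务启动失败 (W4A8 CudaGraph-mode service failed to start)	The W4A8 CudaGraph-mode service failed to start

Root cause details:
ERNIE-4.5-300B-A47B-W4A8C8-TP4 is a pre-sliced (TP4) checkpoint. The v1 loader does not support pre-sliced weights; the log shows v1 loader currently does not support pre-sliced weights; fallback to the v0 loader, so loading goes through the v0 path throughout. This PR modifies default_loader_v1.py and get_weight_iterator (the v1 path), so the v0 path should in theory be unaffected. Two W4A8 tests failed while the other 12 passed, which matches the known flakiness of W4A8 on XPU.

Key log:

INFO utils.py[line:430] v1 loader currently does not support pre-sliced weights; fallback to the v0 loader for model loading.
...
tests/xpu_ci/4cards_cases/test_w4a8.py:63: in test_w4a8
E   Failed: W4A8模式服务启动失败
================== 2 failed, 12 passed in 3050.10s (0:50:50) ===================

Suggested fix:

Rerun first to check whether the failure is intermittent.
If it persists, check whether the v0 loader path calls get_all_weights_file, which this PR modifies.

Fix summary: known flaky; rerun to verify

Related changes: unclear (v0 loader path, decoupled from the v1 path modified by the PR)

Link: View logs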

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-19 18:08:58

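📋 Review summary

PR overview: adds a natural-order check on values to the layers-grouped validation of weight_map, fixing a loading OOM caused by reading files out of order in multi-shard scenarios.
Scope of change: fastdeploy/model_executor/load_weight_utils.py, model_loader/default_loader_v1.py
Affected tag: [Loader]

Issues

Severity	File	Summary
🟡 Suggestion	`load_weight_utils.py:138`	The TP=1 short-circuit condition lacks a comment explaining its intent
🟡 Suggestion	`load_weight_utils.py:77`	The new function lacks unit test coverage
📝 PR conventions	-	Checklist item `unit tests` is unchecked and no reason is given

📝 PR convention check

The title format [Loader] Add values natural order check to layers grouped validation is compliant, and the tag is valid. The description is structurally complete and includes every required section.

The checklist item [ ] Add unit tests. is unchecked, but the PR body gives no reason (the template requires "Please write the reason in this PR if no unit tests"). Consider adding an explanation or adding the tests.

Overall assessment

The core fix is correct: the loading strategy is chosen by requiring both grouped keys and naturally ordered values, and forcing file-by-file loading when TP=1 to avoid OOM is also reasonable. The main suggestions are to add a comment explaining the intent of the TP=1 condition and to add unit tests for values_are_naturally_ordered to guard against regressions.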

PaddlePaddle-bot · 2026-05-19T10:13:34Z

            )
        else:
-            if is_layers_are_grouped:
+            if is_layers_are_grouped or (parallel_config is not None and parallel_config.tensor_parallel_size == 1):


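🟡 Suggestion: the new TP=1 short-circuit condition lacks a comment explaining its intent

When tensor_parallel_size == 1, the code always takes safetensors_weights_iterator(files_list) (loading file by file in file-name order), bypassing the is_layers_are_grouped check. This is a reasonable optimization (in the single-card case the weight loading order does not affect correctness, and it avoids the OOM caused by the ordered iterator jumping across shards), but the intent is not obvious on a direct read.

Consider adding a comment:

if is_layers_are_grouped or (
    # For TP=1, sequential file-by-file loading is always safe and avoids
    # the OOM risk of safetensors_weights_iterator_ordered jumping between
    # shard files in non-sequential order.
    parallel_config is not None and parallel_config.tensor_parallel_size == 1
):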

PaddlePaddle-bot · 2026-05-19T10:13:34Z


+def values_are_naturally_ordered(values):
+    """Check if values are sorted in natural order."""
+    return list(values) == sorted(values, key=natural_key)


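🟡 Suggestion: values_are_naturally_ordered lacks unit test coverage

This function is the entry point of the PR's core fix, but no corresponding test case was added to the test file. In the existing test_get_weight_iterator_ordered_and_kv_scale, every weight_map entry points to the same shard, so it cannot exercise the triggering scenario of unordered values across multiple shard files.

Suggested additions:

A unit test for values_are_naturally_ordered (one ordered case and one unordered case)

An integration scenario in the style of test_get_weight_iterator_ordered_and_kv_scale with multiple shards and unordered values (for example, layer 0 → shard-2, layer 1 → shard-1)

This logic is on the critical path of the OOM fix, so the regression risk is high without tests.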
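The first suggestion could look roughly like the pytest-style sketch below. It is an assumption-laden illustration rather than the repository's test: natural_key is a hypothetical local stand-in, the function under test is copied from the diff above instead of being imported from fastdeploy, and the shard names are invented.

```python
import re


def natural_key(name):
    # Hypothetical stand-in for the repository's natural_key.
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


def values_are_naturally_ordered(values):
    """Check if values are sorted in natural order (copied from the diff)."""
    return list(values) == sorted(values, key=natural_key)


def test_ordered_values():
    # Repeated shard names are fine as long as the sequence never goes back.
    values = ["shard-1.safetensors", "shard-1.safetensors", "shard-2.safetensors"]
    assert values_are_naturally_ordered(values)


def test_unordered_values():
    # The scenario from the review: layer 0 -> shard-2, layer 1 -> shard-1.
    weight_map = {
        "layers.0.weight": "shard-2.safetensors",
        "layers.1.weight": "shard-1.safetensors",
    }
    assert not values_are_naturally_ordered(weight_map.values())


def test_numeric_runs_compare_as_numbers():
    # Plain string sorting would put shard-10 before shard-2; natural order
    # does not, so this sequence counts as ordered.
    assert values_are_naturally_ordered(["shard-2.safetensors", "shard-10.safetensors"])
```

In the real test suite the two helper definitions would be replaced by an import from load_weight_utils, and the multi-shard integration case would additionally need shard files on disk.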

bukejiyu had a problem deploying to Metax_ci April 20, 2026 03:24 — with GitHub Actions Failure

bukejiyu had a problem deploying to Metax_ci April 28, 2026 12:01 — with GitHub Actions Failure

bukejiyu had a problem deploying to Metax_ci May 14, 2026 11:43 — with GitHub Actions Failure

zoooo0820 previously approved these changes May 14, 2026

Add values natural order check to layers grouped validation

22da62c

bukejiyu dismissed zoooo0820’s stale review via 22da62c May 18, 2026 07:00

bukejiyu force-pushed the feat/weight-loading-natural-order branch from b682deb to 22da62c Compare May 18, 2026 07:00

bukejiyu had a problem deploying to Metax_ci May 18, 2026 07:00 — with GitHub Actions Failure

update

a90785d

bukejiyu had a problem deploying to Metax_ci May 19, 2026 06:30 — with GitHub Actions Failure

Merge branch 'develop' into feat/weight-loading-natural-order

f751858

bukejiyu temporarily deployed to Metax_ci May 19, 2026 08:43 — with GitHub Actions Inactive

PaddlePaddle-bot reviewed May 19, 2026

zoooo0820 approved these changes May 20, 2026

bukejiyu merged commit c1b7c08 into PaddlePaddle:develop May 20, 2026
42 of 43 checks passed

bukejiyu merged 3 commits into PaddlePaddle:develop from bukejiyu:feat/weight-loading-natural-order

4 participants
