[Cherry-Pick][Loader] Add values natural order check to layers grouped validation #7822

Merged
freeliuzc merged 4 commits into PaddlePaddle:release/online/20260415 from bukejiyu:cp/47a3cff6-add-values-check
May 26, 2026

@bukejiyu bukejiyu commented May 14, 2026

Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 14, 2026

Thanks for your contribution!

@bukejiyu bukejiyu changed the title Add values natural order check to layers grouped validation [Cherry-Pick][Loader] Add values natural order check to layers grouped validation May 14, 2026
zoooo0820
zoooo0820 previously approved these changes May 14, 2026
PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

PaddlePaddle-bot commented May 14, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-25 19:01:20

This CI report is generated from the following code (refreshed every 30 minutes):


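1 Task Overview

Required tasks currently show 1 failed, 0 running, and 0 pending. The failure was caused by a GitHub Actions account billing lock that prevented the job from starting; lift the billing lock first, then re-trigger CI.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 12 (0) | 12 | 1 | 2 | 0 | 0 | 9 |

Note: there is no action_required workflow in this run. 9 tasks are skipped, including the main test task, Run FastDeploy Unit Tests and Coverage; it must be rerun after the account/billing issue is resolved to obtain valid test results.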

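2 Task Status Summary

2.1 Required tasks: 0/7 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Pre Commit | 2s | Environment issue: GitHub Actions account locked for billing | Rerun after the billing lock is lifted | Job | - |
| ⏭️ | Remaining 6 required tasks not passed/skipped | - | Upstream CI not started/skipped | Re-trigger after fixing the account issue | - | - |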

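2.2 Optional tasks: 1/5 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| | FD-Clone-Linux / check_bypass / Check bypass | 3s | Job | - |
| ⏭️ | Remaining 3 optional tasks skipped/not run | - | - | - |
| | Remaining 1 optional task passed | - | - | - |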

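3 Failure Details (required only)

Pre Commit: infrastructure (confidence: high)

Pre Commit

  • Status: ❌ Failed
  • Error type: infrastructure
  • Confidence: high
  • Root cause summary: the GitHub Actions account is locked because of a billing issue
  • Analyzer: generic analysis (fallback)

Root cause details:
The job never actually started: in the detailed log, failed_steps / all_steps are empty and log_file_path is empty. The GitHub annotation in the quick status reads: The job was not started because your account is locked due to a billing issue. This is therefore not a failure of the pre-commit checks themselves, and the files modified by this PR were never reached.

Key log:

The job was not started because your account is locked due to a billing issue.

Suggested fixes:

  1. Resolve and lift the GitHub Actions account billing lock, then re-trigger CI.
  2. After unlocking, rerun Pre Commit and the skipped downstream tasks, and confirm that the main test task runs again.

Fix summary: rerun after the billing lock is lifted

Related changes: none. The failure occurred before the job started and is not directly related to the loader code modified in this PR.

Link: View log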

…addlePaddle#7498)

* Add values natural order check to layers grouped validation

* update
PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter

codecov-commenter commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/online/20260415@60261dc). Learn more about missing BASE report.

Additional details and impacted files
@@                    Coverage Diff                     @@
##             release/online/20260415    #7822   +/-   ##
==========================================================
  Coverage                           ?   72.36%           
==========================================================
  Files                              ?      387           
  Lines                              ?    54028           
  Branches                           ?     8467           
==========================================================
  Hits                               ?    39100           
  Misses                             ?    12236           
  Partials                           ?     2692           
Flag Coverage Δ
GPU 72.36% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-25 19:46:11

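📋 Review Summary

PR overview: adds a natural-order check on the weight_map values (file names) to the layers_grouped validation, and upgrades the get_weight_iterator parameter from LoadConfig to FDConfig to allow access to more configuration.
Scope of changes: model_executor/load_weight_utils.py, model_executor/model_loader/default_loader_v1.py
Affected-area tag: [Loader]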

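Issues

| Level | File | Summary |
|---|---|---|
| ❓ Question | load_weight_utils.py:138 | The semantics of the TP=1 short-circuit condition are unclear |
| 📝 PR convention | - | Title lacks the original PR number; all description sections are empty |

❓ Question: the tensor_parallel_size == 1 short-circuit condition (load_weight_utils.py, line 138)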

if is_layers_are_grouped or (parallel_config is not None and parallel_config.tensor_parallel_size == 1):
    weights_iterator = safetensors_weights_iterator(files_list)

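When is_layers_are_grouped=False (the keys are not grouped contiguously by layer, or the value file names are not in natural order) and TP=1, this short-circuit condition loads the weights with safetensors_weights_iterator(files_list) in lexicographic order (sorted(weight_files_in_index)), rather than with safetensors_weights_iterator_ordered(ordered_weight_map), which loads from the ordered map sorted by natural_key.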
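To make the difference between the two orderings concrete, here is a minimal sketch. The `natural_key` below is a hypothetical stand-in written for illustration (the loader's own implementation is not shown in this PR page), and the shard names are invented, unpadded examples; zero-padded names would sort identically under both orderings.

```python
import re


def natural_key(name: str) -> list:
    """Hypothetical stand-in for the loader's natural_key.

    Splits a name into alternating text and digit chunks and converts
    the digit chunks to int, so that 2 sorts before 10.
    """
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


# Invented shard names without zero padding (an assumption for illustration).
shards = ["model-10.safetensors", "model-2.safetensors", "model-1.safetensors"]

lexicographic = sorted(shards)           # plain string comparison
natural = sorted(shards, key=natural_key)  # numeric-aware comparison

print(lexicographic)
print(natural)
```

With these names, plain `sorted` places `model-10` before `model-2`, while the natural-key sort keeps the numeric order 1, 2, 10.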

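Could the author please confirm:

  1. In what scenario is is_layers_are_grouped=False triggered when TP=1? Does the weight loading order really have no effect on model correctness when TP=1?
  2. If this short circuit is meant to support a specific format (keys out of order, but the files already in correct natural order), please describe that scenario in a code comment so that future maintainers do not misread it.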
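The branch selection being questioned can be sketched as a small truth table. This is an illustration only: `pick_iterator` is a hypothetical helper, it takes the TP size directly instead of a `parallel_config` object, and it assumes the only alternative branch is the ordered iterator, which the real function may not guarantee.

```python
from typing import Optional


def pick_iterator(is_layers_are_grouped: bool, tensor_parallel_size: Optional[int]) -> str:
    """Return the name of the iterator the quoted condition would select.

    tensor_parallel_size=None models the case where parallel_config is None.
    """
    if is_layers_are_grouped or (tensor_parallel_size is not None and tensor_parallel_size == 1):
        return "safetensors_weights_iterator"
    # Assumed fallback branch, based on the review text above.
    return "safetensors_weights_iterator_ordered"


for grouped in (True, False):
    for tp in (None, 1, 8):
        print(f"grouped={grouped!s:<5} tp={tp!s:<4} -> {pick_iterator(grouped, tp)}")
```

The row the review asks about is `grouped=False, tp=1`: the grouping validation failed, yet the unordered iterator is still chosen.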

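📝 PR Convention Check

Issue 1: the Cherry-Pick title lacks the original PR number (required format: [Cherry-Pick][Tag] description(#original-PR-number)).

Suggested title (can be copied directly):

  • [Cherry-Pick][Loader] Add values natural order check to layers grouped validation(#original-PR-number)

Issue 2: every section of the PR description is empty or contains only placeholders, which does not meet the convention.

Suggested PR description (can be copied directly; reproduces the full structure of the checklist §D2 template):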

## Motivation
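In models that use multiple safetensors shards (`model.safetensors.index.json`), the `is_layers_are_grouped` check previously verified only that the weight_map keys are grouped contiguously by layer, not that the values (shard file names) are in natural order. When the values are out of order, `is_layers_are_grouped=True` causes loading to use `safetensors_weights_iterator(files_list)` instead of `safetensors_weights_iterator_ordered(ordered_weight_map)`, which may lead to an incorrect mapping between weights and parameter names.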

## Modifications
- `load_weight_utils.py`: add a `values_are_naturally_ordered(values)` function that checks whether the weight_map values (the list of shard file names) are in `natural_key` natural order
- `load_weight_utils.py`: in `get_all_weights_file`, change `is_layers_are_grouped` to a double check, `is_keys_orders and is_values_naturally_ordered`
- `load_weight_utils.py`: upgrade the `get_weight_iterator` parameter from `Optional[LoadConfig]` to `Optional[FDConfig]`, so that both `load_config` (multi-threaded loading) and `parallel_config` (TP size) are accessible; add a short-circuit condition that uses the unordered iterator directly when TP=1
- `default_loader_v1.py`: pass `fd_config` instead of `fd_config.load_config` when calling `get_weight_iterator`

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

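Overall Assessment

The approach is sound: supplementing the existing keys grouping check with a natural-order check on the values effectively reduces misjudgments. Upgrading the get_weight_iterator parameter to FDConfig is also more reasonable. The main open question is the new TP=1 short-circuit condition (line 138), whose semantics are still unclear; the author is advised to add a comment describing the triggering scenario.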
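For readers who want to see what such a double check might look like, the following is a rough sketch and not the code merged in this PR. `values_are_naturally_ordered` and `natural_key` are names taken from the review; their bodies here are assumptions. `keys_are_grouped_by_layer`, the `layers.<n>.` key pattern, and the wrapper `layers_are_grouped` are hypothetical and exist only for illustration.

```python
import re
from typing import Dict, List


def natural_key(name: str) -> list:
    """Assumed behavior: digit runs compare as integers, other text as strings."""
    return [int(t) if t.isdigit() else t for t in re.split(r"(\d+)", name)]


def values_are_naturally_ordered(values: List[str]) -> bool:
    """True if shard file names appear in non-decreasing natural order."""
    keys = [natural_key(v) for v in values]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def keys_are_grouped_by_layer(keys: List[str]) -> bool:
    """Hypothetical keys check: each layer index forms one contiguous run.

    Keys without a 'layers.<n>.' component are ignored.
    """
    seen, previous = set(), None
    for key in keys:
        match = re.search(r"layers\.(\d+)\.", key)
        if match is None:
            continue
        layer = int(match.group(1))
        if layer != previous:
            if layer in seen:  # the layer reappears after another layer
                return False
            seen.add(layer)
            previous = layer
    return True


def layers_are_grouped(weight_map: Dict[str, str]) -> bool:
    """Both conditions must hold, mirroring the double check described above."""
    return keys_are_grouped_by_layer(list(weight_map)) and values_are_naturally_ordered(
        list(weight_map.values())
    )


grouped_and_ordered = {
    "model.layers.0.q.weight": "model-1.safetensors",
    "model.layers.0.k.weight": "model-1.safetensors",
    "model.layers.1.q.weight": "model-2.safetensors",
    "model.layers.2.q.weight": "model-10.safetensors",
}
values_out_of_order = {
    "model.layers.0.q.weight": "model-10.safetensors",
    "model.layers.1.q.weight": "model-2.safetensors",
}
print(layers_are_grouped(grouped_and_ordered))
print(layers_are_grouped(values_out_of_order))
```

The second map has correctly grouped keys but shard names that fall out of natural order, which is the case a keys-only check would miss.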

@freeliuzc freeliuzc merged commit 0d7fccd into PaddlePaddle:release/online/20260415 May 26, 2026
21 checks passed
6 participants