Commit 29bfe93
authored
[megatron] feat: load dist checkpoint with customized prefix for state dict keys. (verl-project#4139)
### What does this PR do?
> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.
### Checklist Before Starting
- [x] Search for similar PRs. Paste at least one query link here:
https://github.com/search?q=repo%3Avolcengine%2Fverl+dist+checkpoint+prefix&type=pullrequests
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
- Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
### Test
> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.
### API and Usage Example
> Demonstrate how the API changes if any, and provide usage example(s)
if possible.
For Megatron dist checkpoint using customized prefix for state dict
keys, e.g. NeMo2 use the prefix`module.` for the keys in the state dict,
user can add the spec `dist_checkpointing_prefix` to the corresponding
role to load that dist ckpt. For instance, the snippet below shows an
example for actor model:
```python
actor_rollout_ref.actor.megatron.dist_checkpointing_prefix='module.'
```
### Design & Code Changes
> Demonstrate the high-level design if this PR is complex, and list the
specific changes.
### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: can be covered by
the existing tests.
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)1 parent d951de4 commit 29bfe93
6 files changed
Lines changed: 31 additions & 6 deletions
File tree
- verl
- trainer/config
- engine
- reward_model
- utils
- workers
- config
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
41 | 41 | | |
42 | 42 | | |
43 | 43 | | |
| 44 | + | |
44 | 45 | | |
45 | 46 | | |
46 | 47 | | |
| |||
165 | 166 | | |
166 | 167 | | |
167 | 168 | | |
| 169 | + | |
168 | 170 | | |
169 | 171 | | |
170 | 172 | | |
| |||
356 | 358 | | |
357 | 359 | | |
358 | 360 | | |
| 361 | + | |
359 | 362 | | |
360 | 363 | | |
361 | 364 | | |
| |||
473 | 476 | | |
474 | 477 | | |
475 | 478 | | |
| 479 | + | |
476 | 480 | | |
477 | 481 | | |
478 | 482 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
40 | 40 | | |
41 | 41 | | |
42 | 42 | | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
43 | 46 | | |
44 | 47 | | |
45 | 48 | | |
| |||
Lines changed: 3 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
52 | 52 | | |
53 | 53 | | |
54 | 54 | | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
55 | 58 | | |
56 | 59 | | |
57 | 60 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
531 | 531 | | |
532 | 532 | | |
533 | 533 | | |
534 | | - | |
| 534 | + | |
535 | 535 | | |
536 | 536 | | |
537 | 537 | | |
| |||
540 | 540 | | |
541 | 541 | | |
542 | 542 | | |
543 | | - | |
| 543 | + | |
544 | 544 | | |
545 | 545 | | |
546 | 546 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
65 | 65 | | |
66 | 66 | | |
67 | 67 | | |
| 68 | + | |
68 | 69 | | |
69 | 70 | | |
70 | 71 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
333 | 333 | | |
334 | 334 | | |
335 | 335 | | |
336 | | - | |
| 336 | + | |
| 337 | + | |
| 338 | + | |
| 339 | + | |
337 | 340 | | |
338 | 341 | | |
339 | 342 | | |
| |||
366 | 369 | | |
367 | 370 | | |
368 | 371 | | |
369 | | - | |
| 372 | + | |
| 373 | + | |
| 374 | + | |
| 375 | + | |
370 | 376 | | |
371 | 377 | | |
372 | 378 | | |
| |||
971 | 977 | | |
972 | 978 | | |
973 | 979 | | |
974 | | - | |
| 980 | + | |
| 981 | + | |
| 982 | + | |
| 983 | + | |
975 | 984 | | |
976 | 985 | | |
977 | 986 | | |
| |||
1233 | 1242 | | |
1234 | 1243 | | |
1235 | 1244 | | |
1236 | | - | |
| 1245 | + | |
| 1246 | + | |
| 1247 | + | |
| 1248 | + | |
| 1249 | + | |
| 1250 | + | |
1237 | 1251 | | |
1238 | 1252 | | |
1239 | 1253 | | |
| |||
0 commit comments