Commit 27a18b6
perf(pt_expt): share compiled forward_lower across multi-task shared-fitting tasks (deepmodeling#5457)
Make the dataset embedding and energy bias inputs rather than buffers for
compile. This lets multi-task training share one compiled model, which
resolves the OOM and NCCL timeout issues. Because the empty_cache and del
calls are removed, there are no more GC complaints.
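The idea above can be illustrated with a toy, framework-free sketch. It is not the code from this PR: `compile_forward`, `get_forward`, the structure keys, and the task values are all hypothetical names chosen for illustration. It only shows why passing per-task values as arguments, instead of baking them into the compiled function, lets tasks with the same structure reuse one compiled function.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an expensive compile step; we count how often it runs.
compile_calls = 0


def compile_forward(structure_key: Tuple[str, ...]) -> Callable[[List[float], float, float], float]:
    """Pretend to compile a forward function for one model structure."""
    global compile_calls
    compile_calls += 1

    def forward(features: List[float], dataset_embedding: float, energy_bias: float) -> float:
        # Per-task values arrive as arguments, so nothing task-specific
        # is captured inside the "compiled" function.
        return sum(features) * dataset_embedding + energy_bias

    return forward


_cache: Dict[Tuple[str, ...], Callable[[List[float], float, float], float]] = {}


def get_forward(structure_key: Tuple[str, ...]) -> Callable[[List[float], float, float], float]:
    """Return the cached function for this structure, compiling only on first use."""
    if structure_key not in _cache:
        _cache[structure_key] = compile_forward(structure_key)
    return _cache[structure_key]


# Hypothetical tasks: task_a and task_b share a structure, task_c does not.
tasks = {
    "task_a": {"structure": ("desc_x", "shared_fitting"), "embedding": 1.0, "bias": -3.0},
    "task_b": {"structure": ("desc_x", "shared_fitting"), "embedding": 2.0, "bias": 0.5},
    "task_c": {"structure": ("desc_y", "shared_fitting"), "embedding": 0.5, "bias": 1.0},
}

features = [1.0, 2.0, 3.0]
outputs = {
    name: get_forward(t["structure"])(features, t["embedding"], t["bias"])
    for name, t in tasks.items()
}
```

In this sketch three tasks trigger only two compile calls, since `task_a` and `task_b` share a structure key while still producing different outputs from their own embedding and bias values.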
Regression Test
<img width="3600" height="2100" alt="lcurve"
src="https://github.com/user-attachments/assets/c043bf6c-53bb-441f-ac98-0d021b68ec1b"
/>
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Multi-task training groups models by structure and caches/reuses
compiled computation graphs, with per-task buffer handling to support
shared fitting nets.
* **Bug Fixes**
* Checkpoint loading now skips extraneous per-task buffer entries so
only original model parameters are restored.
* Training aggregation coerces tensor-like loss/metric values to floats
for accurate reporting.
* **Tests**
* Added regression test ensuring compiled and eager outputs match per
task for shared-fitting, different-descriptor setups.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
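The two bug fixes listed in the summary can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PR's implementation: `filter_checkpoint`, `to_float`, `FakeScalar`, and the key names are hypothetical, and the real code may identify per-task entries differently.

```python
from typing import Any, Dict, Iterable


def filter_checkpoint(state: Dict[str, Any], model_keys: Iterable[str]) -> Dict[str, Any]:
    """Keep only checkpoint entries whose keys the model actually defines."""
    allowed = set(model_keys)
    return {k: v for k, v in state.items() if k in allowed}


def to_float(value: Any) -> float:
    """Coerce a tensor-like scalar (anything with .item()) or a plain number to float."""
    item = getattr(value, "item", None)
    return float(item()) if callable(item) else float(value)


class FakeScalar:
    """Hypothetical stand-in for a zero-dimensional tensor."""

    def __init__(self, v: float) -> None:
        self._v = v

    def item(self) -> float:
        return self._v


checkpoint = {
    "fitting.weight": [0.1, 0.2],
    "descriptor.weight": [0.3],
    "task_a.energy_bias": [-3.0],  # hypothetical extra per-task entry
}
model_keys = ["fitting.weight", "descriptor.weight"]
restored = filter_checkpoint(checkpoint, model_keys)

# Mixed tensor-like and plain values are averaged as ordinary floats.
losses = [FakeScalar(0.5), 1.5, FakeScalar(1.0)]
mean_loss = sum(to_float(x) for x in losses) / len(losses)
```

Filtering by the model's own key set keeps the loader indifferent to whatever extra entries a checkpoint carries, and coercing before aggregation avoids mixing tensor-like objects with Python floats in the reported averages.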
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 1de6de9 commit 27a18b6
5 files changed
Lines changed: 578 additions & 64 deletions
File tree
- deepmd/pt_expt
  - descriptor
  - infer
  - train
- source/tests/pt_expt