
[pre-commit.ci] pre-commit autoupdate#5456

Merged
njzjz merged 1 commit into master from pre-commit-ci-update-config
May 26, 2026


@pre-commit-ci pre-commit-ci Bot commented May 25, 2026

updates:
- [github.com/astral-sh/ruff-pre-commit: v0.15.13 → v0.15.14](astral-sh/ruff-pre-commit@v0.15.13...v0.15.14)
@dosubot dosubot Bot added the build label May 25, 2026

codecov Bot commented May 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.46%. Comparing base (f39a081) to head (63144d7).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5456   +/-   ##
=======================================
  Coverage   82.46%   82.46%           
=======================================
  Files         829      829           
  Lines       88763    88763           
  Branches     4225     4225           
=======================================
  Hits        73197    73197           
  Misses      14274    14274           
  Partials     1292     1292           


@njzjz njzjz added this pull request to the merge queue May 26, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 26, 2026
@njzjz njzjz added this pull request to the merge queue May 26, 2026
Merged via the queue into master with commit 4e64f8b May 26, 2026
82 of 90 checks passed
@njzjz njzjz deleted the pre-commit-ci-update-config branch May 26, 2026 21:37
njzjz pushed a commit to njzjz/deepmd-kit that referenced this pull request May 27, 2026
…ling#5467)

## Summary

`TestChangeBias` is the dominant memory hog in `Test Python` shard `(10,
3.13)` of the CI matrix — by itself it peaks at **~5 GB RSS**, leaving
so little headroom under the 7 GB GitHub-hosted runner that the shard
intermittently loses communication with the GitHub Actions server. This
causes the recurring `runner lost communication` failure that has
affected many recent PRs (deepmodeling#5446, deepmodeling#5448, deepmodeling#5450, deepmodeling#5455, deepmodeling#5456, …).

This PR shrinks the change-bias test dataset from 80 frames to 5 frames,
dropping the class's peak RSS to **~1.7 GB** while keeping all 9 tests
passing — including the strict `atol=1e-10` `pt2_pte_consistency` check.

## How I located it

Local reproduction of shard `(10, 3.13)` (using the same
`.test_durations` cache CI uses, so identical test partitioning):

| Test profiled in isolation | Peak RSS |
|---|---|
| **`test_change_bias_frozen_pte`** | **5.04 GB** ← outlier |
| `TestDeepEvalEnerPt2` class | 1.43 GB |
| `TestDeepEvalEnerAparamPt2` class | 1.44 GB |
| `TestSpinInference::test_get_use_spin` | 1.41 GB |
| `test_finetune_from_pt2_use_pretrain_script` | 1.41 GB |
| `test_training_loop_compiled` | 1.32 GB |
| `test_export_pipeline` | 1.57 GB |
| `test_descriptor_shape_dpa1` | 1.34 GB |
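
One way to take a per-test measurement like the ones above is to run each test in its own child process and read the child's high-water mark afterwards. The sketch below is an assumption about method, not the script that produced the table: `peak_rss_mb_of` is a hypothetical helper, and it relies on the Unix-only `resource` module, whose `ru_maxrss` is reported in KiB on Linux and in bytes on macOS.

```python
import resource
import subprocess
import sys


def peak_rss_mb_of(cmd):
    """Run `cmd` as a child process and return the peak RSS of waited-for children in MB.

    RUSAGE_CHILDREN reports the largest ru_maxrss over all children this
    process has waited for, so use a fresh parent process per measurement.
    """
    subprocess.run(cmd, check=True)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss units differ by platform: bytes on macOS, KiB on Linux.
    scale = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / scale


# Stand-in workload: a child interpreter that allocates roughly 50 MB.
demo_peak_mb = peak_rss_mb_of(
    [sys.executable, "-c", "buf = b'x' * (50 * 1024 * 1024)"]
)
print(f"child peak RSS: {demo_peak_mb:.0f} MB")
```

For a real measurement the stand-in command would be replaced by a `pytest` invocation that selects a single test or class.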

Then phase-by-phase RSS profiling inside `dp change-bias` showed the 4.3
GB jump happens entirely inside `compute_output_stats` →
`_compute_model_predict`. Scaling experiment confirms it: peak grows
**linearly at ~50 MB per frame** of input data.

| nbatches | Peak RSS | per-frame |
|---|---|---|
| 1  | 567 MB | — |
| 5  | 781 MB | +43 MB |
| 20 | 1583 MB | +53 MB |
| 80 | 4797 MB | +53 MB |
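
The linear trend can be checked with an ordinary least-squares fit over the four rows of the scaling table. This is only a sanity-check sketch; `fit_line` is a hypothetical helper and the data points are copied from the table.

```python
# (nbatches, peak RSS in MB) copied from the scaling table
samples = [(1, 567), (5, 781), (20, 1583), (80, 4797)]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(samples)
print(f"{slope:.1f} MB per frame, {intercept:.0f} MB baseline")
# prints "53.5 MB per frame, 513 MB baseline"
```

The fitted slope of about 53.5 MB per frame is consistent with the "~50 MB per frame" estimate.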

That points to a leak somewhere in the `torch.no_grad()`-wrapped
`forward_common_atomic`, unrelated to autograd. The water example has
80 frames at batch_size=1, so the CLI default
`nbatches = min(data.get_nbatches()) = 80` runs all 80 forward passes in
a single call.

## Why I can't just pass `-n 5`

`_load_batch_set` shuffles when it loads the set. If `nbatches <
total_frames`, the loop samples a random subset — and the two calls in
`test_change_bias_pt2_pte_consistency` (running in the **same Python
process** via `main(cmds)`, with `dp_random`'s state advancing between
calls) would see **different** subsets → different biases → the
`atol=1e-10` assertion fails.

`nbatches == total_frames` makes the forward enumerate **every** frame
regardless of shuffle order, so the aggregate bias is invariant under
shuffle. Determinism is preserved.
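
A toy model of this argument, assuming the aggregate is a plain sum over the visited frames (the real code fits a bias, which this sketch does not reproduce). `aggregate` and the frame values are hypothetical; `math.fsum` is used so that the result is exactly independent of summation order.

```python
import math
import random


def aggregate(frame_values, nbatches, rng):
    """Shuffle the frames, visit the first `nbatches`, and return their exact sum."""
    order = list(frame_values)
    rng.shuffle(order)
    # math.fsum is correctly rounded, so the result does not depend on visit order.
    return math.fsum(order[:nbatches])


# One shared RNG whose state advances between calls (stand-in for the
# shared random state described above).
rng = random.Random(0)
frames = [0.1 * i for i in range(80)]

full_a = aggregate(frames, len(frames), rng)
full_b = aggregate(frames, len(frames), rng)
sub_a = aggregate(frames, 5, rng)
sub_b = aggregate(frames, 5, rng)

print("full enumeration equal:", full_a == full_b)
print("5-frame subsets equal: ", sub_a == sub_b)
```

With full enumeration the two calls always agree, whereas two 5-frame draws from an advancing RNG will almost always pick different frames.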

## The fix

Build a 5-frame subset of `examples/water/data/data_0` in
`TestChangeBias.setUpClass` and point both the trainer config and the
change-bias `-s` argument at it. `nbatches` then resolves to 5 (= the
new dataset size = full enumeration), and all 9 tests pass with peak RSS
at ~1.7 GB.
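
A minimal sketch of how such a subset could be built, assuming the dataset uses the `deepmd/npy` layout (frame-independent `*.raw` files at the top level and `set.000/*.npy` arrays with frames along axis 0). `make_subset_dataset` here is a hypothetical stand-in, not the helper added by this PR, and the demo runs on a synthetic system rather than on `examples/water/data/data_0`.

```python
import shutil
import tempfile
from pathlib import Path

import numpy as np


def make_subset_dataset(src: Path, dst: Path, nframes: int = 5) -> Path:
    """Copy a system directory, keeping only the first `nframes` frames of set.000."""
    dst.mkdir(parents=True, exist_ok=True)
    # Top-level *.raw files (e.g. type.raw) are frame-independent: copy unchanged.
    for raw in src.glob("*.raw"):
        shutil.copy(raw, dst / raw.name)
    # Assumed layout: every array in set.000 has frames along axis 0.
    (dst / "set.000").mkdir(exist_ok=True)
    for npy in (src / "set.000").glob("*.npy"):
        np.save(dst / "set.000" / npy.name, np.load(npy)[:nframes])
    return dst


# Demo on a synthetic 80-frame, 3-atom system.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "data_0"
    (src / "set.000").mkdir(parents=True)
    (src / "type.raw").write_text("0\n1\n1\n")
    np.save(src / "set.000" / "coord.npy", np.zeros((80, 9)))
    np.save(src / "set.000" / "energy.npy", np.zeros(80))

    dst = make_subset_dataset(src, Path(tmp) / "data_0_subset")
    subset_shapes = {
        p.name: np.load(p).shape for p in sorted((dst / "set.000").glob("*.npy"))
    }
    copied_raw = sorted(p.name for p in dst.glob("*.raw"))

print(subset_shapes, copied_raw)
```

Taking the first frames rather than a random sample keeps the subset identical across runs.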

## Test plan

- [x] All 9 tests in `TestChangeBias` pass locally (CPU fp64):
  - `test_change_bias_with_data`
  - `test_change_bias_with_data_sys_file`
  - `test_change_bias_with_user_defined`
  - `test_change_bias_frozen_pte`
  - `test_change_bias_frozen_pt2`
  - `test_change_bias_frozen_pt2_user_defined`
  - `test_change_bias_pt2_pte_consistency` (atol=1e-10, the
    determinism-sensitive one)
  - `test_change_bias_pte_preserves_model_def_script`
  - `test_change_bias_pt2_preserves_model_def_script`
- [x] Peak RSS measurement (kernel `ru_maxrss`):
  - Before: 5.66 GB
  - After: **1.62 GB** (single test) / **1.75 GB** (whole class)
- [ ] CI shard `(10, 3.13)` confirms no more `runner lost communication`
on this branch (pending CI run)

## Known limitations

- **The underlying ~50 MB/frame leak in `forward_common_atomic` remains
as a production bug.** Users running `dp change-bias` on real datasets
with thousands of frames will see multi-GB RSS growth. Worth a separate
follow-up to find and patch the leak.
- The 5-frame number is somewhat arbitrary. It was chosen as a small
value that (a) keeps the assertion logic working and (b) leaves room for
the least-squares regression, which needs at least ntypes = 2 frames.
- `_make_subset_dataset` only handles `set.000`; multi-set datasets
would need extension. Not needed for water/data_0.

## Summary by CodeRabbit

* **Tests**
  * Reduced the dataset used by a change-bias test to a small, fixed
    subset (5 frames) so test runs use far less disk space and complete
    faster.
  * Test setup now builds and points to the truncated dataset for all
    related invocations, lowering resource overhead during CI and local
    testing.


---------

Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>