fix(pt_expt): fail-fast on .pt2 GNN inference without LAMMPS atom-map by wanghan-iapcm · Pull Request #5450 · deepmodeling/deepmd-kit

wanghan-iapcm · 2026-05-21T08:59:08Z

Summary

Surface a previously-silent corruption / CUDA index assert in LAMMPS .pt2 inference for message-passing models (DPA2, DPA3, hybrids over those) when the LAMMPS atom-map is not enabled. Previously the C++ side fell into an identity-mapping fallback (DeepPotPTExpt.cc:374-384) whose values are wrong for ghost slots; the model's _exchange_ghosts (deepmd/dpmodel/descriptor/repformers.py) then performed take_along_axis(g1[1, nloc, dim], mapping_tiled) with out-of-bounds gather indices for ghosts — CUDA index assert in the user's DPA4 report, undefined CPU output otherwise.
Add a has_message_passing field to .pt2 metadata (mirrors the descriptor's has_message_passing() API: true for DPA2/DPA3/hybrids over those; false for se_e2_a/DPA1/etc.). Gate the fail-fast in DeepPotPTExpt::compute_inner and DeepSpinPTExpt::compute_inner on it. Non-GNN models retain their previous behaviour.
Two error messages target the two distinct unsupported configurations:
- Single-rank without atom-map: "Single-rank LAMMPS .pt2 inference requires atom_modify map yes…"
- Multi-rank without a with-comm artifact: "Multi-rank LAMMPS .pt2 inference requires the model to be exported with use_loc_mapping=False…"
Refined predicate: has_message_passing_ && !use_with_comm && !atom_map_present && nghost > 0. The nghost > 0 guard skips NoPbc and isolated-cluster cases where identity over [0, nloc) is trivially correct.

Four-cell coverage matrix in `test_lammps_dpa3_pt2.py`

Cell	`use_loc_mapping`	atom-map	nprocs	Path	Test
A	True (regular only)	yes	1	regular w/ correct mapping	`test_pair_deepmd` (existing)
B	True	no	1	fail fast (single-rank msg)	`test_pair_deepmd_no_atom_map_fails_fast` (new)
B-mr	True	any	>1	fail fast (multi-rank msg)	`test_pair_deepmd_mpi_no_with_comm_fails_fast` (new, subprocess)
C	False (regular + with-comm)	yes	1	regular w/ atom-map	`test_pair_deepmd_with_comm` (new)
C-mr	False	any	>1	with-comm (`border_op`)	`test_pair_deepmd_mpi_dpa3` (existing)
D	False	no	1	fail fast (single-rank PBC can't drive border_op)	`test_pair_deepmd_with_comm_no_atom_map_fails_fast` (new)
D-mr	False	no	>1	with-comm (mapping-free)	`test_pair_deepmd_mpi_no_atom_map` (new, subprocess)

Investigation note (resolves an earlier mystery)

test_deeppot_dpa_ptexpt.cc is misleadingly named — despite the Dpa prefix it loads deeppot_dpa1.pt2 (DPA1, non-message-passing). Its regular .pt2 graph never consumes mapping for ghost gather, so the identity fallback was trivially safe and the test passed without explicit inlist.mapping. The genuinely-DPA2 ctest is test_deeppot_dpa2_ptexpt.cc (different file), which already explicitly sets inlist.mapping = mapping.data(); on all cpu_lmp_nlist* paths. No C++ ctest fixtures need editing in this PR — the metadata-gated fail-fast correctly skips DPA1.

Backward compatibility

has_message_passing_ defaults to false in C++ when the metadata field is missing — so pre-PR .pt2 archives retain their previous behaviour. Non-GNN pre-PR archives continue to work; GNN pre-PR archives must be regenerated to opt into the fail-fast guard. In-tree fixtures are generated by gen_*.py at CI time, which always writes the new field.

Test plan

Local C++ ctest *PtExpt* filter: 160 / 160 PASSED (270 s) against freshly-regenerated .pt2 fixtures.
CI runs the negative cells (B / B-mr / D) — they exercise the throw and verify the error-message substrings. The pytest assertions use pytest.raises(Exception, match=r\"atom_modify map yes\") and stdout/stderr substring use_loc_mapping=False; if LAMMPS wraps the exception with a prefix/suffix differently than expected, the match may need adjustment.
CI cell D-mr (test_pair_deepmd_mpi_no_atom_map) verifies the with-comm artifact handles ghosts via border_op without consuming the mapping tensor.

Known limitations

Multi-rank with use_loc_mapping=True is permanently unsupported by this fix — the fail-fast surfaces it clearly, no path forward without re-export.
Single-rank PBC + with-comm artifact + no atom-map (cell D) could be made to work via a synthesized self-mirror comm_dict; deferred to a follow-up.
MPI_Comm_size is not used as the multi-rank predicate because api_cc does not link MPI directly; lmp_list.nswap > 0 serves as the proxy (equivalent for all current LAMMPS configurations).
The pre-PR DPA3 use_loc_mapping=True archives lacking the new metadata field continue to exhibit the silent-corruption bug — users must regenerate.

Summary by CodeRabbit

New Features
- CLI flag to disable atom-ID→local-index mapping for test runs; generator now produces a single-artifact spin model variant; APIs allow callers to declare MPI rank count for neighbor lists.
Bug Fixes
- Serialized metadata now records a message-passing capability flag; runtime enforces compatibility and surfaces clear fail-fast errors when required atom-mapping or artifacts are missing.
Tests
- Expanded coverage for message-passing variants, atom-map on/off scenarios, single- vs multi-rank MPI cases, and related fail-fast behaviors.

Single-rank LAMMPS .pt2 inference for a message-passing model (DPA2, DPA3, hybrids over those) silently relied on LAMMPS atom-map to populate ``InputNlist.mapping`` — without ``atom_modify map yes`` the C++ side fell into an identity-mapping fallback (``DeepPotPTExpt.cc:374-384``) whose values are wrong for ghost slots ``[nloc, nall)``. The model graph's ``_exchange_ghosts`` (``deepmd/dpmodel/descriptor/repformers.py``) then performed ``take_along_axis(g1[1, nloc, dim], mapping_tiled)`` with out-of-bounds gather indices for ghosts, producing a CUDA index assert (reproduced by the user on a DPA4 model) or undefined CPU output. Multi-rank LAMMPS without a with-comm AOTI artifact has the same class of failure: ``pair_deepmd.cpp:243`` only populates ``lmp_list.mapping`` for ``nprocs == 1``, so the regular path always misses the ghost mapping under multi-rank. Both unsupported combinations now fail-fast with an actionable message instead of silently corrupting ghost features. Files: * deepmd/pt_expt/utils/serialization.py — emit ``has_message_passing`` in .pt2 metadata, mirroring the descriptor's ``has_message_passing()`` API (true for DPA2 / DPA3 / hybrids over those; false for se_e2_a / DPA1). * source/api_cc/{include,src}/DeepPotPTExpt.{h,cc} and DeepSpinPTExpt — read the metadata into ``has_message_passing_``, gate the fail-fast on it so non-GNN models retain their previous behaviour. Refined predicate: ``has_message_passing_ && !use_with_comm && !atom_map_present && nghost > 0`` Two error messages: single-rank ("add atom_modify map yes") and multi-rank ("re-export with use_loc_mapping=False"). Defaults to ``false`` for pre-PR .pt2 archives that lack the field, so non-GNN archives continue to work; GNN archives must be regenerated to opt into the fail-fast guard. * source/lmp/tests/run_mpi_pair_deepmd_dpa3_pt2.py — new ``--no-atom-map`` flag that omits ``atom_modify map yes`` from the LAMMPS input. * source/lmp/tests/test_lammps_dpa3_pt2.py — four-cell coverage matrix: A : test_pair_deepmd (existing) B : test_pair_deepmd_no_atom_map_fails_fast (new) C : test_pair_deepmd_with_comm (new) D : test_pair_deepmd_with_comm_no_atom_map_fails_fast (new) C-mr: test_pair_deepmd_mpi_dpa3 (existing) B-mr: test_pair_deepmd_mpi_no_with_comm_fails_fast (new) D-mr: test_pair_deepmd_mpi_no_atom_map (new) Investigation note: the ``test_deeppot_dpa_ptexpt.cc`` C++ ctest is misleadingly named — despite the "Dpa" prefix it loads ``deeppot_dpa1.pt2`` (DPA1, non-message-passing), so its regular .pt2 graph never consumes ``mapping`` for ghost gather and identity fallback is trivially safe. The genuinely-DPA2 ctest is ``test_deeppot_dpa2_ptexpt.cc``, which already explicitly sets ``inlist.mapping = mapping.data();`` on every ``cpu_lmp_nlist*`` case. No C++ ctest fixtures need to be edited by this PR — the metadata-gated fail-fast correctly skips non-message-passing models. Local verification: 160/160 ptexpt ctests pass against freshly-regenerated .pt2 fixtures (the new metadata field is written by all gen_*.py scripts). The negative cells B/B-mr/D/D-mr fail-fast paths are exercised only via the LAMMPS Python tests in CI. Known limitations: - Multi-rank DPA3 ``use_loc_mapping=True`` is permanently unsupported; the fail-fast surfaces this clearly. - Single-rank with-comm-artifact + no atom-map (cell D) could be made to work by populating a synthetic self-mirror comm_dict; deferred. - ``MPI_Comm_size`` is not used as the multi-rank predicate because api_cc does not link MPI; ``lmp_list.nswap > 0`` serves as the proxy (equivalent for all current LAMMPS configurations).

coderabbitai · 2026-05-21T09:02:48Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Serializer adds has_message_passing to PT metadata; DeepPotPTExpt and DeepSpinPTExpt load it into has_message_passing_, use it to choose with-comm vs regular dispatch and to fail fast when mapping tensors are required but absent; tests and runner options exercise these cases.

Changes

Message Passing Metadata Export and Dispatch Guard

Layer / File(s)	Summary
Metadata Export `deepmd/pt_expt/utils/serialization.py`	`_collect_metadata` now computes and exports `has_message_passing` by querying the descriptor's `has_message_passing()` method, with defensive fallback to `False` for missing or unsupported methods.
Runtime Data Structures `source/api_cc/include/DeepPotPTExpt.h`, `source/api_cc/include/DeepSpinPTExpt.h`	Both DeepPotPTExpt and DeepSpinPTExpt define a private `has_message_passing_` boolean member (default `false`) to store the exported metadata flag at runtime.
Neighbor-list API and hpp binding `source/lib/include/neighbor_list.h`, `source/api_c/include/c_api.h`, `source/api_c/include/deepmd.hpp`, `source/api_c/src/c_api.cc`	Adds `InputNlist::nprocs` and `set_nprocs` plus `DP_NlistSetNprocs` C API and hpp wrapper so callers can convey communicator size to the DeepMD neighbor-list input.
LAMMPS wiring: set nprocs and mapping `source/lmp/fix_dplr.cpp`, `source/lmp/pair_deepmd.cpp`, `source/lmp/pair_deepspin.cpp`	LAMMPS pair/fix code sets `lmp_list.nprocs` from `comm->nprocs`; PairDeepSpin builds a single-rank `mapping_vec` when atom-map style is enabled and passes it into the neighbor-list input when appropriate.
DeepPotPTExpt Initialization and Dispatch Logic `source/api_cc/src/DeepPotPTExpt.cc`	DeepPotPTExpt::init reads `has_message_passing`. In compute, `multi_rank` is derived from `lmp_list.nprocs`, `use_with_comm` is early-chosen as `has_comm_artifact_ && multi_rank`, and fail-fast errors are raised when `has_message_passing_ && nghost > 0` but mapping/with-comm prerequisites are unmet. Later dispatch reuses the earlier decision.
DeepSpinPTExpt Initialization and Dispatch Logic `source/api_cc/src/DeepSpinPTExpt.cc`	DeepSpinPTExpt mirrors DeepPot: init reads metadata, compute derives `multi_rank` and mapping presence, selects `use_with_comm` early, enforces fail-fast exceptions for missing mapping under `has_message_passing_`, and preserves the earlier routing decision.
Test Infrastructure, runners, and gen_spin `source/lmp/tests/*`, `source/tests/infer/gen_spin.py`	Runner scripts add `--no-atom-map`; test helpers gain `atom_map`, `pb_path`, and `capture` options; fixtures and tests added for single-rank and MPI fail-fast and baseline-matching cases; `gen_spin` can build a single-artifact spin PT2 for tests.

Sequence Diagram

sequenceDiagram
  participant Serializer
  participant Init as Runtime::init
  participant Compute as Runtime::compute(lmp_list)
  participant Dispatch as DispatchGate
  participant WithComm as run_model_with_comm
  participant Regular as run_model
  participant Error as FailFastError

  Serializer->>Init: write has_message_passing metadata
  Init->>Compute: set has_message_passing_
  Compute->>Compute: derive multi_rank, check mapping presence
  Compute->>Dispatch: set use_with_comm = has_comm_artifact_ && multi_rank
  alt has_message_passing_ && nghost > 0 && no mapping
    Dispatch->>Error: throw rank-specific error
  else use_with_comm
    Dispatch->>WithComm: call run_model_with_comm
  else
    Dispatch->>Regular: call run_model
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

deepmodeling/deepmd-kit#5354: Related metadata/layout refactor touching serialization and runtime init.
deepmodeling/deepmd-kit#5430: Related work extending with-comm/message-passing workflow and metadata usage.
deepmodeling/deepmd-kit#5370: Earlier DeepSpin pt_expt/backends and metadata parsing changes consumed by this PR.

Suggested reviewers

iProzd
njzjz
anyangml

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 31.25% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately summarizes the main change: adding fail-fast error handling for .pt2 GNN inference when LAMMPS atom-map is not enabled, which is the core objective described in the PR context.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 917524709b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

codecov · 2026-05-21T09:59:20Z

Codecov Report

❌ Patch coverage is 84.44444% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.50%. Comparing base (d3f08f3) to head (d4a9e5d).

Files with missing lines	Patch %	Lines
deepmd/pt_expt/utils/serialization.py	76.92%	3 Missing ⚠️
source/api_cc/src/DeepSpinPTExpt.cc	70.00%	0 Missing and 3 partials ⚠️
source/api_cc/src/DeepPotPTExpt.cc	90.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5450      +/-   ##
==========================================
+ Coverage   82.48%   82.50%   +0.02%     
==========================================
  Files         830      830              
  Lines       88522    88559      +37     
  Branches     4232     4243      +11     
==========================================
+ Hits        73015    73066      +51     
+ Misses      14220    14201      -19     
- Partials     1287     1292       +5

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

… map no Address Codex review (PR deepmodeling#5450) and CI failure: 1. **Codex P1 (DeepPotPTExpt.cc, DeepSpinPTExpt.cc)**: tighten the fail-fast predicate from ``!use_with_comm && !atom_map_present`` to ``!use_with_comm && (multi_rank || !atom_map_present)`` Previously a hypothetical caller that populated ``lmp_list.mapping`` on a multi-rank invocation (any C++ user bypassing pair_deepmd's ``nprocs == 1`` gate, or a future patch that propagated atom-map under multi-rank) could slip past the guard and silently use the regular path's rank-local mapping to gather cross-rank ghosts — exactly the silent-corruption class this PR is meant to close. The new predicate fails-fast multi-rank unconditionally on atom_map_present. 2. **CI failure**: LAMMPS rejects ``atom_modify map no`` ("Illegal atom_modify map command argument no", src/atom.cpp:870). The valid keywords are ``array | hash | yes``; to leave the atom-map disabled, the command must simply be omitted (default for ``atom_style atomic``). Update both the pytest ``_lammps`` helper and the MPI runner script to omit the line when ``atom_map="no"`` / ``--no-atom-map`` is set instead of emitting the invalid syntax. Local verification: ``*PtExpt*`` ctest filter — 160/160 PASSED. Known limitation: the codex P1 fix is defensive — no in-tree LAMMPS caller can reach the ``multi_rank && atom_map_present`` combination because pair_deepmd.cpp:242 gates atom-map propagation on ``nprocs == 1``. A C++ gtest could synthesize the case via ``inlist.nswap = 1`` + ``inlist.mapping = …`` (precedent: test_with_comm_load_failure_ptexpt.cc); deferred from this commit to keep the PR focused.

Per review feedback: the previous predicate has_message_passing_ && !use_with_comm && nghost > 0 && (multi_rank || !atom_map_present) collapsed two distinct conditions into one boolean, making it hard to trace which matrix cell each throw corresponds to. Restructure as two side-by-side throws, each mirroring one row of the PR's coverage matrix: if (has_message_passing_ && nghost > 0) { if (multi_rank && !has_comm_artifact_) throw multi-rank msg; // B-mr if (!multi_rank && !atom_map_present) throw single-rank msg; // B / D } Functionally identical to the prior boolean (verified equivalence by truth table; 160/160 ptexpt ctests pass), but the structure now maps 1-to-1 to the PR description's table. Applied symmetrically in DeepPotPTExpt.cc and DeepSpinPTExpt.cc. The four "no-throw" cells fall out naturally: - non-GNN (has_message_passing_=false) → no throw - nghost==0 (NoPbc / isolated cluster) → no throw - multi-rank && has_comm_artifact_ (cells C-mr, D-mr) → no throw (with-comm) - single-rank && atom_map_present (cells A, C) → no throw (regular w/ map)

Mirror the non-spin DPA3 coverage matrix in ``test_lammps_dpa3_pt2.py`` for the spin path (DeepSpinPTExpt / pair_deepspin), ensuring the fail-fast guard against silent corruption fires symmetrically for the spin GNN. * source/tests/infer/gen_spin.py — add ``_build_dpa3_single_yaml`` producing a DPA3 spin model with ``use_loc_mapping=True``. Together with the existing ``deeppot_dpa3_spin_mpi.pt2`` (use_loc_mapping= False), this covers both with-comm and no-with-comm spin variants needed for cells A/B/B-mr (single-artifact) and C/C-mr/D/D-mr (dual-artifact). * source/lmp/tests/run_mpi_pair_deepmd_spin_dpa3_pt2.py — add ``--no-atom-map`` flag that omits ``atom_modify map yes`` from the LAMMPS input (mirrors the non-spin runner; LAMMPS rejects ``atom_modify map no`` so omission is the only valid path). * source/lmp/tests/test_lammps_spin_dpa3_pt2.py — extend ``_run_mpi_subprocess`` with ``pb_path`` and ``capture`` parameters (mirrors the non-spin helper). Add six new tests covering the spin matrix: test_pair_deepspin_single_artifact_with_atom_map (cell A) test_pair_deepspin_no_atom_map_fails_fast (cell B) test_pair_deepspin_mpi_no_with_comm_fails_fast (cell B-mr) test_pair_deepspin_with_comm_single_atom_map (cell C) test_pair_deepspin_with_comm_no_atom_map_fails_fast (cell D) test_pair_deepspin_mpi_no_atom_map (cell D-mr) Cell C-mr remains covered by the existing ``test_pair_deepmd_mpi_dpa3_spin``. Local verification: gen_spin.py exit 0 produces ``deeppot_dpa3_spin.pt2`` with metadata ``has_message_passing=true, has_comm_artifact=false, is_spin=true`` — exactly the single-artifact-GNN-spin case for cells A/B/B-mr. pytest --collect-only reports 9 tests total (3 existing + 6 new). Negative cells (B/B-mr/D) are CI-only — require LAMMPS-with- deepmd runtime.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@source/lmp/tests/test_lammps_spin_dpa3_pt2.py`:
- Around line 458-476: The test function test_pair_deepspin_mpi_no_atom_map
currently compares potential energy, forces, and force magnitudes between
out_no_map and out_baseline but omits virial; add a parity assertion for the
virial tensor by comparing out_no_map["virial"] to out_baseline["virial"] (use
np.testing.assert_allclose with similar tolerances as forces, e.g., atol=1e-8,
rtol=0) so virial regressions are detected; place this assertion alongside the
existing force/force_mag checks referencing the same out_no_map and out_baseline
variables.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0f4fc365-5441-4066-8a47-36ef3404b756

📥 Commits

Reviewing files that changed from the base of the PR and between 7be4cd0 and 397a523.

📒 Files selected for processing (3)

source/lmp/tests/run_mpi_pair_deepmd_spin_dpa3_pt2.py
source/lmp/tests/test_lammps_spin_dpa3_pt2.py
source/tests/infer/gen_spin.py

njzjz-bot

Thanks, the new version addresses the earlier metadata probing/comment issues and the nprocs predicate is a good improvement over nswap for spin.

I left one blocking inline test-coverage comment: test_pair_deepspin_mpi_no_atom_map compares energy/forces/force_mag but still skips per-atom virials even though the runner parses them and the adjacent parity tests assert them. Please add the virial parity assertion before merge.

CI is otherwise green from GitHub checks.

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)

Replaces the ``lmp_list.nswap > 0`` proxy with a direct ``lmp_list.nprocs > 1`` predicate in DeepPotPTExpt / DeepSpinPTExpt dispatch. ``atom_style spin`` populates ``nswap`` even in single-rank runs (to propagate PBC ghost spins), so the proxy was unsound and caused single-rank spin LAMMPS jobs to be misclassified as multi-rank by the fail-fast guard. Also populates ``lmp_list.mapping`` from the LAMMPS atom-map in ``pair_deepspin.cpp`` (mirroring ``pair_deepmd.cpp``), so single-rank spin .pt2 GNN inference can resolve ghost indices. Mechanism: - Add ``int nprocs`` field + ``set_nprocs`` to ``InputNlist`` (and the matching ``DP_NlistSetNprocs`` / ``deepmd::hpp::InputNlist::set_nprocs`` wrappers) so LAMMPS pair styles can forward ``comm->nprocs``. - ``pair_deepmd.cpp``, ``pair_deepspin.cpp``, and ``fix_dplr.cpp`` now call ``lmp_list.set_nprocs(comm->nprocs)``. - Refactor the ``serialization.py`` ``has_message_passing`` probe to walk ``model -> atomic_model -> descriptor`` (per njzjz-bot review), so composite atomic models (e.g. LinearAtomicModel) report the flag. - Rewrite the stale ``world == nullptr`` carve-out comment in the identity-mapping fallback blocks.

Replaces the ``set_nprocs(comm->nprocs)`` setter with a constructor parameter on both ``InputNlist`` overloads (lightweight and comm-aware) and the matching ``DP_NewNlist`` / ``DP_NewNlist_comm`` C entry points plus the hpp wrappers. ``nprocs`` is a conceptual sibling of ``world`` and ``nswap`` and lives more naturally in the initializer than as a post-construction setter. Default value 1 keeps direct C++ API consumers source-compatible — they need not pass anything. - ``InputNlist`` lightweight ctor: new trailing ``int nprocs_ = 1``. - ``InputNlist`` comm-aware ctor: new trailing ``int nprocs_ = 1``. - ``DP_NewNlist`` / ``DP_NewNlist_comm``: new trailing ``int nprocs`` arg (no default — C API). - ``deepmd::hpp::InputNlist`` ctors: new trailing ``int nprocs = 1``. - ``DP_NlistSetNprocs`` / ``set_nprocs`` removed (single-use in-tree). - ``pair_deepmd.cpp``, ``pair_deepspin.cpp``, ``fix_dplr.cpp`` now pass ``comm->nprocs`` directly in the constructor call. - Update stale comments in ``DeepPotPTExpt.cc`` and ``DeepSpinPTExpt.cc`` to reference the constructor parameter instead of the removed setter.

Reviewer ask (njzjz-bot / coderabbitai on PR deepmodeling#5450): the ``test_pair_deepspin_mpi_no_atom_map`` parity check compared energy, forces, and force_mag against the with-atom-map multi-rank baseline but skipped virials. Add the matching ``virials`` assertion so a regression in the spin with-comm path that affects only the virial tensor cannot pass silently. Other spin MPI parity tests in this file already assert virials at the same tolerance.

…ment The lightweight 4-arg ``InputNlist`` (and matching ``DP_NewNlist`` / hpp wrapper) takes neither ``world`` nor any swap arrays, so it cannot drive the with-comm dispatch path even when constructed in a multi-rank LAMMPS context. The earlier addition of an ``int nprocs_ = 1`` parameter was therefore write-only — passing ``comm->nprocs`` to it (as ``fix_dplr.cpp`` did) had no observable effect on dispatch. Drop the parameter to restore the original ABI of the 4-arg form and make the contract honest: this overload is always single-rank-classified. Multi-rank GNN dispatch goes through the 12-arg comm-aware constructor. The ``nprocs`` data member stays (default 1), settable only via the comm-aware constructor's trailing ``nprocs_`` argument. Also mirror the decision-matrix comment from ``DeepPotPTExpt.cc`` into ``DeepSpinPTExpt.cc`` so the spin path documents the four-cell rationale in-line rather than redirecting the reader to the non-spin file.

njzjz-bot

Re-reviewed latest tip d4a9e5d. The previous concerns have been addressed, CI is green except the still-running CUDA aggregate check, and I did not find any new blocker.

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)

…ers (deepmodeling#5455) ## Summary - Adds an explicit `del lammps` before `MPI.Finalize()` in all four mpirun-driven LAMMPS test runners (`run_mpi_pair_deepmd.py`, `run_mpi_pair_deepmd_spin.py`, `run_mpi_pair_deepmd_dpa3_pt2.py`, `run_mpi_pair_deepmd_spin_dpa3_pt2.py`). - Fixes a teardown-order race that intermittently manifests as **subprocess exit code 136 (SIGFPE)** for `test_pair_deepmd_mpi_dpa3_spin_empty_subdomain` on the GitHub Actions CUDA runner image. ## Background Recent CI runs on multiple unrelated PRs (deepmodeling#5446, deepmodeling#5450) hit the identical failure signature: ``` short test summary info ============================ FAILED source/lmp/tests/test_lammps_spin_dpa3_pt2.py::test_pair_deepmd_mpi_dpa3_spin_empty_subdomain - subprocess.CalledProcessError: ... returned non-zero exit status 136. ``` - Reproduces ~1 in 5 runs on the GitHub Actions CUDA image (`nvidia/cuda:12.9.1-cudnn-devel-ubuntu22.04`). - **Does not reproduce on a V100 Bohrium dev box** — 60/60 consecutive passes. So it's a pre-existing flake, not caused by either of the recent PRs. ## Root cause (empirically confirmed) The runner ends with: ```python forces_global = lammps.lmp.gather_atoms(...) ... MPI.Finalize() ``` `lammps` is still alive when `MPI.Finalize()` returns. Python then garbage-collects it during interpreter shutdown, which triggers `LAMMPS::~LAMMPS` → `Finish::end()` → **`MPI_Allreduce`** for timing aggregation. By that time, MPI has already been finalized, which is undefined behavior. I instrumented the runner with timestamped prints to verify the order directly. Without the fix: ``` t=3311.770 R1: BEFORE MPI.Finalize t=3311.778 R0/R1: AFTER MPI.Finalize ← MPI is finalized t=3311.778 R0/R1: PY ATEXIT … process exit, LAMMPS destructor runs HERE ``` With the fix: ``` t=3423.100 R1: AFTER del lammps (LAMMPS destructor done) ← MPI still up t=3423.108 R0/R1: BEFORE MPI.Finalize t=3423.108 R0/R1: AFTER MPI.Finalize ``` So the LAMMPS destructor now runs while MPI is still up, which is what its `MPI_Allreduce`/`MPI_Gather` calls require. The reason this manifests as SIGFPE only on the CUDA CI image (not on V100) is most likely that the CI image (or one of its preloaded libraries) enables FP-exception trapping; on V100 the same MPI-after-Finalize errors return silently. The flake is environment-specific, but the underlying antipattern is unconditional and worth fixing in any environment. ## Test plan - [x] Local CPU: 29/29 LAMMPS tests pass (`test_lammps_dpa3_pt2.py`, `test_lammps_spin_dpa3_pt2.py`) - [x] Remote V100: 50/50 stress runs of the previously-failing test - [x] Empirical confirmation that the fix flips the LAMMPS-destructor-vs-MPI.Finalize ordering (see Background) - [ ] CI: re-run the spin LAMMPS suite multiple times to confirm the SIGFPE no longer appears ## Known limitations - Cannot directly observe the SIGFPE on V100, so the fix has not been observed *preventing* the actual crash — only correcting the antipattern that we have strong reason to believe causes it. - If the failure persists after merge, the next candidate root cause is CUDA stream destruction order, and we should revisit.  ## Summary by CodeRabbit * **Bug Fixes** * Improved MPI cleanup sequence in multiple test runners to prevent finalization-related crashes when executing tests in distributed MPI environments.  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/deepmodeling/deepmd-kit/pull/5455?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)   Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>

…ling#5467) ## Summary `TestChangeBias` is the dominant memory hog in `Test Python` shard `(10, 3.13)` of the CI matrix — by itself it peaks at **~5 GB RSS**, leaving so little headroom under the 7 GB GitHub-hosted runner that the shard intermittently loses communication with the GitHub Actions server. This causes the recurring `runner lost communication` failure that has affected many recent PRs (deepmodeling#5446, deepmodeling#5448, deepmodeling#5450, deepmodeling#5455, deepmodeling#5456, …). This PR shrinks the change-bias test dataset from 80 frames to 5 frames, dropping the class's peak RSS to **~1.7 GB** while keeping all 9 tests passing — including the strict `atol=1e-10` `pt2_pte_consistency` check. ## How I located it Local reproduction of shard `(10, 3.13)` (using the same `.test_durations` cache CI uses, so identical test partitioning): | Test profiled in isolation | Peak RSS | |---|---| | **`test_change_bias_frozen_pte`** | **5.04 GB** ← outlier | | `TestDeepEvalEnerPt2` class | 1.43 GB | | `TestDeepEvalEnerAparamPt2` class | 1.44 GB | | `TestSpinInference::test_get_use_spin` | 1.41 GB | | `test_finetune_from_pt2_use_pretrain_script` | 1.41 GB | | `test_training_loop_compiled` | 1.32 GB | | `test_export_pipeline` | 1.57 GB | | `test_descriptor_shape_dpa1` | 1.34 GB | Then phase-by-phase RSS profiling inside `dp change-bias` showed the 4.3 GB jump happens entirely inside `compute_output_stats` → `_compute_model_predict`. Scaling experiment confirms it: peak grows **linearly at ~50 MB per frame** of input data. | nbatches | Peak RSS | per-frame | |---|---|---| | 1 | 567 MB | — | | 5 | 781 MB | +43 MB | | 20 | 1583 MB | +53 MB | | 80 | 4797 MB | +53 MB | That's a leak in the `torch.no_grad()`-wrapped `forward_common_atomic` somewhere — separate from autograd. The water example has 80 frames at batch_size=1, so the CLI default `nbatches = min(data.get_nbatches()) = 80` triggers all 80 forwards in one go. ## Why I can't just pass `-n 5` `_load_batch_set` shuffles when it loads the set. If `nbatches < total_frames`, the loop samples a random subset — and the two calls in `test_change_bias_pt2_pte_consistency` (running in the **same Python process** via `main(cmds)`, with `dp_random`'s state advancing between calls) would see **different** subsets → different biases → the `atol=1e-10` assertion fails. `nbatches == total_frames` makes the forward enumerate **every** frame regardless of shuffle order, so the aggregate bias is invariant under shuffle. Determinism is preserved. ## The fix Build a 5-frame subset of `examples/water/data/data_0` in `TestChangeBias.setUpClass` and point both the trainer config and the change-bias `-s` argument at it. `nbatches` then resolves to 5 (= the new dataset size = full enumeration), and all 9 tests pass with peak RSS at ~1.7 GB. ## Test plan - [x] All 9 tests in `TestChangeBias` pass locally (CPU fp64): - `test_change_bias_with_data` - `test_change_bias_with_data_sys_file` - `test_change_bias_with_user_defined` - `test_change_bias_frozen_pte` - `test_change_bias_frozen_pt2` - `test_change_bias_frozen_pt2_user_defined` - `test_change_bias_pt2_pte_consistency` (atol=1e-10 — the determinism-sensitive one) - `test_change_bias_pte_preserves_model_def_script` - `test_change_bias_pt2_preserves_model_def_script` - [x] Peak RSS measurement (kernel `ru_maxrss`): - Before: 5.66 GB - After: **1.62 GB** (single test) / **1.75 GB** (whole class) - [ ] CI shard `(10, 3.13)` confirms no more `runner lost communication` on this branch (pending CI run) ## Known limitations - **The underlying ~50 MB/frame leak in `forward_common_atomic` remains as a production bug.** Users running `dp change-bias` on real datasets with thousands of frames will see multi-GB RSS growth. Worth a separate follow-up to find and patch the leak. - The 5-frame number is somewhat arbitrary. It's chosen as the smallest value that (a) keeps the assertion logic working, (b) leaves room for the least-squares regression (need ≥ ntypes = 2 frames). - `_make_subset_dataset` only handles `set.000`; multi-set datasets would need extension. Not needed for water/data_0.  ## Summary by CodeRabbit * **Tests** * Reduced the dataset used by a change-bias test to a small, fixed subset (5 frames) so test runs use far less disk space and complete faster. * Test setup now builds and points to the truncated dataset for all related invocations, lowering resource overhead during CI and local testing.  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/deepmodeling/deepmd-kit/pull/5467?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)   --------- Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>

dosubot Bot added the bug label May 21, 2026

github-actions Bot added Python C++ LAMMPS labels May 21, 2026

wanghan-iapcm requested review from OutisLi and njzjz May 21, 2026 09:00

OutisLi approved these changes May 21, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 21, 2026

View reviewed changes

Comment thread source/api_cc/src/DeepPotPTExpt.cc Outdated

njzjz-bot reviewed May 21, 2026

View reviewed changes

Comment thread deepmd/pt_expt/utils/serialization.py Outdated

njzjz-bot reviewed May 21, 2026

View reviewed changes

Comment thread source/api_cc/src/DeepPotPTExpt.cc Outdated

Han Wang added 3 commits May 22, 2026 08:54

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

Comment thread source/lmp/tests/test_lammps_spin_dpa3_pt2.py

github-actions Bot added Core C labels May 22, 2026

njzjz-bot reviewed May 22, 2026

View reviewed changes

Comment thread source/lmp/tests/test_lammps_spin_dpa3_pt2.py

njzjz-bot suggested changes May 22, 2026

View reviewed changes

Han Wang added 2 commits May 22, 2026 13:41

wanghan-iapcm force-pushed the fix-pt-expt-lammps-no-atom-map branch from d88a46a to 4fcd312 Compare May 22, 2026 05:45

Han Wang added 2 commits May 22, 2026 14:01

wanghan-iapcm requested a review from njzjz-bot May 22, 2026 06:48

wanghan-iapcm added the Test CUDA Trigger test CUDA workflow label May 22, 2026

github-actions Bot removed the Test CUDA Trigger test CUDA workflow label May 22, 2026

njzjz approved these changes May 22, 2026

View reviewed changes

njzjz-bot approved these changes May 22, 2026

View reviewed changes

njzjz added this pull request to the merge queue May 22, 2026

Merged via the queue into deepmodeling:master with commit 4604131 May 23, 2026
73 of 75 checks passed

wanghan-iapcm mentioned this pull request May 23, 2026

fix(test): tear down LAMMPS before MPI.Finalize() in mpirun test runners #5455

Merged

4 tasks

wanghan-iapcm mentioned this pull request May 27, 2026

test(pt_expt): shrink change-bias water dataset to 5 frames #5467

Merged

3 tasks

Conversation

wanghan-iapcm commented May 21, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Four-cell coverage matrix in test_lammps_dpa3_pt2.py

Investigation note (resolves an earlier mystery)

Backward compatibility

Test plan

Known limitations

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

njzjz-bot left a comment

Choose a reason for hiding this comment

Uh oh!

njzjz-bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

wanghan-iapcm commented May 21, 2026 •

edited by coderabbitai Bot

Loading

Four-cell coverage matrix in `test_lammps_dpa3_pt2.py`

coderabbitai Bot commented May 21, 2026 •

edited

Loading

codecov Bot commented May 21, 2026 •

edited

Loading