Skip to content

fix(pt_expt): fail-fast on .pt2 GNN inference without LAMMPS atom-map#5450

Merged
njzjz merged 8 commits into
deepmodeling:masterfrom
wanghan-iapcm:fix-pt-expt-lammps-no-atom-map
May 23, 2026
Merged

fix(pt_expt): fail-fast on .pt2 GNN inference without LAMMPS atom-map#5450
njzjz merged 8 commits into
deepmodeling:masterfrom
wanghan-iapcm:fix-pt-expt-lammps-no-atom-map

Conversation

@wanghan-iapcm
Copy link
Copy Markdown
Collaborator

@wanghan-iapcm wanghan-iapcm commented May 21, 2026

Summary

  • Surface a previously-silent corruption / CUDA index assert in LAMMPS .pt2 inference for message-passing models (DPA2, DPA3, hybrids over those) when the LAMMPS atom-map is not enabled. Previously the C++ side fell into an identity-mapping fallback (DeepPotPTExpt.cc:374-384) whose values are wrong for ghost slots; the model's _exchange_ghosts (deepmd/dpmodel/descriptor/repformers.py) then performed take_along_axis(g1[1, nloc, dim], mapping_tiled) with out-of-bounds gather indices for ghosts — CUDA index assert in the user's DPA4 report, undefined CPU output otherwise.
  • Add a has_message_passing field to .pt2 metadata (mirrors the descriptor's has_message_passing() API: true for DPA2/DPA3/hybrids over those; false for se_e2_a/DPA1/etc.). Gate the fail-fast in DeepPotPTExpt::compute_inner and DeepSpinPTExpt::compute_inner on it. Non-GNN models retain their previous behaviour.
  • Two error messages target the two distinct unsupported configurations:
    • Single-rank without atom-map: "Single-rank LAMMPS .pt2 inference requires atom_modify map yes…"
    • Multi-rank without a with-comm artifact: "Multi-rank LAMMPS .pt2 inference requires the model to be exported with use_loc_mapping=False…"
  • Refined predicate: has_message_passing_ && !use_with_comm && !atom_map_present && nghost > 0. The nghost > 0 guard skips NoPbc and isolated-cluster cases where identity over [0, nloc) is trivially correct.

Four-cell coverage matrix in test_lammps_dpa3_pt2.py

Cell use_loc_mapping atom-map nprocs Path Test
A True (regular only) yes 1 regular w/ correct mapping test_pair_deepmd (existing)
B True no 1 fail fast (single-rank msg) test_pair_deepmd_no_atom_map_fails_fast (new)
B-mr True any >1 fail fast (multi-rank msg) test_pair_deepmd_mpi_no_with_comm_fails_fast (new, subprocess)
C False (regular + with-comm) yes 1 regular w/ atom-map test_pair_deepmd_with_comm (new)
C-mr False any >1 with-comm (border_op) test_pair_deepmd_mpi_dpa3 (existing)
D False no 1 fail fast (single-rank PBC can't drive border_op) test_pair_deepmd_with_comm_no_atom_map_fails_fast (new)
D-mr False no >1 with-comm (mapping-free) test_pair_deepmd_mpi_no_atom_map (new, subprocess)

Investigation note (resolves an earlier mystery)

test_deeppot_dpa_ptexpt.cc is misleadingly named — despite the Dpa prefix it loads deeppot_dpa1.pt2 (DPA1, non-message-passing). Its regular .pt2 graph never consumes mapping for ghost gather, so the identity fallback was trivially safe and the test passed without explicit inlist.mapping. The genuinely-DPA2 ctest is test_deeppot_dpa2_ptexpt.cc (different file), which already explicitly sets inlist.mapping = mapping.data(); on all cpu_lmp_nlist* paths. No C++ ctest fixtures need editing in this PR — the metadata-gated fail-fast correctly skips DPA1.

Backward compatibility

has_message_passing_ defaults to false in C++ when the metadata field is missing — so pre-PR .pt2 archives retain their previous behaviour. Non-GNN pre-PR archives continue to work; GNN pre-PR archives must be regenerated to opt into the fail-fast guard. In-tree fixtures are generated by gen_*.py at CI time, which always writes the new field.

Test plan

  • Local C++ ctest *PtExpt* filter: 160 / 160 PASSED (270 s) against freshly-regenerated .pt2 fixtures.
  • CI runs the negative cells (B / B-mr / D) — they exercise the throw and verify the error-message substrings. The pytest assertions use pytest.raises(Exception, match=r\"atom_modify map yes\") and stdout/stderr substring use_loc_mapping=False; if LAMMPS wraps the exception with a prefix/suffix differently than expected, the match may need adjustment.
  • CI cell D-mr (test_pair_deepmd_mpi_no_atom_map) verifies the with-comm artifact handles ghosts via border_op without consuming the mapping tensor.

Known limitations

  • Multi-rank with use_loc_mapping=True is permanently unsupported by this fix — the fail-fast surfaces it clearly, no path forward without re-export.
  • Single-rank PBC + with-comm artifact + no atom-map (cell D) could be made to work via a synthesized self-mirror comm_dict; deferred to a follow-up.
  • MPI_Comm_size is not used as the multi-rank predicate because api_cc does not link MPI directly; lmp_list.nswap > 0 serves as the proxy (equivalent for all current LAMMPS configurations).
  • The pre-PR DPA3 use_loc_mapping=True archives lacking the new metadata field continue to exhibit the silent-corruption bug — users must regenerate.

Summary by CodeRabbit

  • New Features

    • CLI flag to disable atom-ID→local-index mapping for test runs; generator now produces a single-artifact spin model variant; APIs allow callers to declare MPI rank count for neighbor lists.
  • Bug Fixes

    • Serialized metadata now records a message-passing capability flag; runtime enforces compatibility and surfaces clear fail-fast errors when required atom-mapping or artifacts are missing.
  • Tests

    • Expanded coverage for message-passing variants, atom-map on/off scenarios, single- vs multi-rank MPI cases, and related fail-fast behaviors.

Review Change Stack

Single-rank LAMMPS .pt2 inference for a message-passing model (DPA2,
DPA3, hybrids over those) silently relied on LAMMPS atom-map to populate
``InputNlist.mapping`` — without ``atom_modify map yes`` the C++ side
fell into an identity-mapping fallback (``DeepPotPTExpt.cc:374-384``)
whose values are wrong for ghost slots ``[nloc, nall)``.  The model
graph's ``_exchange_ghosts`` (``deepmd/dpmodel/descriptor/repformers.py``)
then performed ``take_along_axis(g1[1, nloc, dim], mapping_tiled)`` with
out-of-bounds gather indices for ghosts, producing a CUDA index assert
(reproduced by the user on a DPA4 model) or undefined CPU output.

Multi-rank LAMMPS without a with-comm AOTI artifact has the same class
of failure: ``pair_deepmd.cpp:243`` only populates ``lmp_list.mapping``
for ``nprocs == 1``, so the regular path always misses the ghost mapping
under multi-rank.

Both unsupported combinations now fail-fast with an actionable message
instead of silently corrupting ghost features.

Files:

* deepmd/pt_expt/utils/serialization.py — emit ``has_message_passing``
  in .pt2 metadata, mirroring the descriptor's ``has_message_passing()``
  API (true for DPA2 / DPA3 / hybrids over those; false for se_e2_a /
  DPA1).
* source/api_cc/{include,src}/DeepPotPTExpt.{h,cc} and DeepSpinPTExpt
  — read the metadata into ``has_message_passing_``, gate the fail-fast
  on it so non-GNN models retain their previous behaviour.  Refined
  predicate:
    ``has_message_passing_ && !use_with_comm && !atom_map_present && nghost > 0``
  Two error messages: single-rank ("add atom_modify map yes") and
  multi-rank ("re-export with use_loc_mapping=False").  Defaults to
  ``false`` for pre-PR .pt2 archives that lack the field, so non-GNN
  archives continue to work; GNN archives must be regenerated to opt
  into the fail-fast guard.
* source/lmp/tests/run_mpi_pair_deepmd_dpa3_pt2.py — new
  ``--no-atom-map`` flag that omits ``atom_modify map yes`` from the
  LAMMPS input.
* source/lmp/tests/test_lammps_dpa3_pt2.py — four-cell coverage matrix:
    A   : test_pair_deepmd                              (existing)
    B   : test_pair_deepmd_no_atom_map_fails_fast       (new)
    C   : test_pair_deepmd_with_comm                    (new)
    D   : test_pair_deepmd_with_comm_no_atom_map_fails_fast  (new)
    C-mr: test_pair_deepmd_mpi_dpa3                     (existing)
    B-mr: test_pair_deepmd_mpi_no_with_comm_fails_fast  (new)
    D-mr: test_pair_deepmd_mpi_no_atom_map              (new)

Investigation note: the ``test_deeppot_dpa_ptexpt.cc`` C++ ctest is
misleadingly named — despite the "Dpa" prefix it loads
``deeppot_dpa1.pt2`` (DPA1, non-message-passing), so its regular .pt2
graph never consumes ``mapping`` for ghost gather and identity fallback
is trivially safe.  The genuinely-DPA2 ctest is
``test_deeppot_dpa2_ptexpt.cc``, which already explicitly sets
``inlist.mapping = mapping.data();`` on every ``cpu_lmp_nlist*`` case.
No C++ ctest fixtures need to be edited by this PR — the metadata-gated
fail-fast correctly skips non-message-passing models.

Local verification: 160/160 ptexpt ctests pass against freshly-regenerated
.pt2 fixtures (the new metadata field is written by all gen_*.py
scripts).  The negative cells B/B-mr/D/D-mr fail-fast paths are
exercised only via the LAMMPS Python tests in CI.

Known limitations:
- Multi-rank DPA3 ``use_loc_mapping=True`` is permanently unsupported;
  the fail-fast surfaces this clearly.
- Single-rank with-comm-artifact + no atom-map (cell D) could be made
  to work by populating a synthetic self-mirror comm_dict; deferred.
- ``MPI_Comm_size`` is not used as the multi-rank predicate because
  api_cc does not link MPI; ``lmp_list.nswap > 0`` serves as the proxy
  (equivalent for all current LAMMPS configurations).
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Serializer adds has_message_passing to PT metadata; DeepPotPTExpt and DeepSpinPTExpt load it into has_message_passing_, use it to choose with-comm vs regular dispatch and to fail fast when mapping tensors are required but absent; tests and runner options exercise these cases.

Changes

Message Passing Metadata Export and Dispatch Guard

Layer / File(s) Summary
Metadata Export
deepmd/pt_expt/utils/serialization.py
_collect_metadata now computes and exports has_message_passing by querying the descriptor's has_message_passing() method, with defensive fallback to False for missing or unsupported methods.
Runtime Data Structures
source/api_cc/include/DeepPotPTExpt.h, source/api_cc/include/DeepSpinPTExpt.h
Both DeepPotPTExpt and DeepSpinPTExpt define a private has_message_passing_ boolean member (default false) to store the exported metadata flag at runtime.
Neighbor-list API and hpp binding
source/lib/include/neighbor_list.h, source/api_c/include/c_api.h, source/api_c/include/deepmd.hpp, source/api_c/src/c_api.cc
Adds InputNlist::nprocs and set_nprocs plus DP_NlistSetNprocs C API and hpp wrapper so callers can convey communicator size to the DeepMD neighbor-list input.
LAMMPS wiring: set nprocs and mapping
source/lmp/fix_dplr.cpp, source/lmp/pair_deepmd.cpp, source/lmp/pair_deepspin.cpp
LAMMPS pair/fix code sets lmp_list.nprocs from comm->nprocs; PairDeepSpin builds a single-rank mapping_vec when atom-map style is enabled and passes it into the neighbor-list input when appropriate.
DeepPotPTExpt Initialization and Dispatch Logic
source/api_cc/src/DeepPotPTExpt.cc
DeepPotPTExpt::init reads has_message_passing. In compute, multi_rank is derived from lmp_list.nprocs, use_with_comm is early-chosen as has_comm_artifact_ && multi_rank, and fail-fast errors are raised when has_message_passing_ && nghost > 0 but mapping/with-comm prerequisites are unmet. Later dispatch reuses the earlier decision.
DeepSpinPTExpt Initialization and Dispatch Logic
source/api_cc/src/DeepSpinPTExpt.cc
DeepSpinPTExpt mirrors DeepPot: init reads metadata, compute derives multi_rank and mapping presence, selects use_with_comm early, enforces fail-fast exceptions for missing mapping under has_message_passing_, and preserves the earlier routing decision.
Test Infrastructure, runners, and gen_spin
source/lmp/tests/*, source/tests/infer/gen_spin.py
Runner scripts add --no-atom-map; test helpers gain atom_map, pb_path, and capture options; fixtures and tests added for single-rank and MPI fail-fast and baseline-matching cases; gen_spin can build a single-artifact spin PT2 for tests.

Sequence Diagram

sequenceDiagram
  participant Serializer
  participant Init as Runtime::init
  participant Compute as Runtime::compute(lmp_list)
  participant Dispatch as DispatchGate
  participant WithComm as run_model_with_comm
  participant Regular as run_model
  participant Error as FailFastError

  Serializer->>Init: write has_message_passing metadata
  Init->>Compute: set has_message_passing_
  Compute->>Compute: derive multi_rank, check mapping presence
  Compute->>Dispatch: set use_with_comm = has_comm_artifact_ && multi_rank
  alt has_message_passing_ && nghost > 0 && no mapping
    Dispatch->>Error: throw rank-specific error
  else use_with_comm
    Dispatch->>WithComm: call run_model_with_comm
  else
    Dispatch->>Regular: call run_model
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • iProzd
  • njzjz
  • anyangml
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 31.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately summarizes the main change: adding fail-fast error handling for .pt2 GNN inference when LAMMPS atom-map is not enabled, which is the core objective described in the PR context.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 917524709b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread source/api_cc/src/DeepPotPTExpt.cc Outdated
Comment thread deepmd/pt_expt/utils/serialization.py Outdated
Comment thread source/api_cc/src/DeepPotPTExpt.cc Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

Codecov Report

❌ Patch coverage is 84.44444% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.50%. Comparing base (d3f08f3) to head (d4a9e5d).

Files with missing lines Patch % Lines
deepmd/pt_expt/utils/serialization.py 76.92% 3 Missing ⚠️
source/api_cc/src/DeepSpinPTExpt.cc 70.00% 0 Missing and 3 partials ⚠️
source/api_cc/src/DeepPotPTExpt.cc 90.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5450      +/-   ##
==========================================
+ Coverage   82.48%   82.50%   +0.02%     
==========================================
  Files         830      830              
  Lines       88522    88559      +37     
  Branches     4232     4243      +11     
==========================================
+ Hits        73015    73066      +51     
+ Misses      14220    14201      -19     
- Partials     1287     1292       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Han Wang added 3 commits May 22, 2026 08:54
… map no

Address Codex review (PR deepmodeling#5450) and CI failure:

1. **Codex P1 (DeepPotPTExpt.cc, DeepSpinPTExpt.cc)**: tighten the
   fail-fast predicate from
     ``!use_with_comm && !atom_map_present``
   to
     ``!use_with_comm && (multi_rank || !atom_map_present)``
   Previously a hypothetical caller that populated ``lmp_list.mapping``
   on a multi-rank invocation (any C++ user bypassing pair_deepmd's
   ``nprocs == 1`` gate, or a future patch that propagated atom-map
   under multi-rank) could slip past the guard and silently use the
   regular path's rank-local mapping to gather cross-rank ghosts —
   exactly the silent-corruption class this PR is meant to close.
   The new predicate fails-fast multi-rank unconditionally on
   atom_map_present.

2. **CI failure**: LAMMPS rejects ``atom_modify map no`` ("Illegal
   atom_modify map command argument no", src/atom.cpp:870). The valid
   keywords are ``array | hash | yes``; to leave the atom-map
   disabled, the command must simply be omitted (default for
   ``atom_style atomic``). Update both the pytest ``_lammps`` helper
   and the MPI runner script to omit the line when ``atom_map="no"``
   / ``--no-atom-map`` is set instead of emitting the invalid syntax.

Local verification: ``*PtExpt*`` ctest filter — 160/160 PASSED.

Known limitation: the codex P1 fix is defensive — no in-tree LAMMPS
caller can reach the ``multi_rank && atom_map_present`` combination
because pair_deepmd.cpp:242 gates atom-map propagation on
``nprocs == 1``. A C++ gtest could synthesize the case via
``inlist.nswap = 1`` + ``inlist.mapping = …`` (precedent:
test_with_comm_load_failure_ptexpt.cc); deferred from this commit
to keep the PR focused.
Per review feedback: the previous predicate
  has_message_passing_ && !use_with_comm && nghost > 0 &&
      (multi_rank || !atom_map_present)
collapsed two distinct conditions into one boolean, making it hard to
trace which matrix cell each throw corresponds to.

Restructure as two side-by-side throws, each mirroring one row of the
PR's coverage matrix:

  if (has_message_passing_ && nghost > 0) {
    if (multi_rank && !has_comm_artifact_)  throw multi-rank msg;  // B-mr
    if (!multi_rank && !atom_map_present)   throw single-rank msg; // B / D
  }

Functionally identical to the prior boolean (verified equivalence by
truth table; 160/160 ptexpt ctests pass), but the structure now maps
1-to-1 to the PR description's table.  Applied symmetrically in
DeepPotPTExpt.cc and DeepSpinPTExpt.cc.

The four "no-throw" cells fall out naturally:
- non-GNN (has_message_passing_=false)             → no throw
- nghost==0 (NoPbc / isolated cluster)             → no throw
- multi-rank && has_comm_artifact_ (cells C-mr, D-mr) → no throw (with-comm)
- single-rank && atom_map_present (cells A, C)     → no throw (regular w/ map)
Mirror the non-spin DPA3 coverage matrix in ``test_lammps_dpa3_pt2.py``
for the spin path (DeepSpinPTExpt / pair_deepspin), ensuring the
fail-fast guard against silent corruption fires symmetrically for
the spin GNN.

* source/tests/infer/gen_spin.py — add ``_build_dpa3_single_yaml``
  producing a DPA3 spin model with ``use_loc_mapping=True``. Together
  with the existing ``deeppot_dpa3_spin_mpi.pt2`` (use_loc_mapping=
  False), this covers both with-comm and no-with-comm spin variants
  needed for cells A/B/B-mr (single-artifact) and C/C-mr/D/D-mr
  (dual-artifact).

* source/lmp/tests/run_mpi_pair_deepmd_spin_dpa3_pt2.py — add
  ``--no-atom-map`` flag that omits ``atom_modify map yes`` from the
  LAMMPS input (mirrors the non-spin runner; LAMMPS rejects
  ``atom_modify map no`` so omission is the only valid path).

* source/lmp/tests/test_lammps_spin_dpa3_pt2.py — extend
  ``_run_mpi_subprocess`` with ``pb_path`` and ``capture`` parameters
  (mirrors the non-spin helper). Add six new tests covering the
  spin matrix:

    test_pair_deepspin_single_artifact_with_atom_map      (cell A)
    test_pair_deepspin_no_atom_map_fails_fast             (cell B)
    test_pair_deepspin_mpi_no_with_comm_fails_fast        (cell B-mr)
    test_pair_deepspin_with_comm_single_atom_map          (cell C)
    test_pair_deepspin_with_comm_no_atom_map_fails_fast   (cell D)
    test_pair_deepspin_mpi_no_atom_map                    (cell D-mr)

  Cell C-mr remains covered by the existing
  ``test_pair_deepmd_mpi_dpa3_spin``.

Local verification: gen_spin.py exit 0 produces ``deeppot_dpa3_spin.pt2``
with metadata ``has_message_passing=true, has_comm_artifact=false,
is_spin=true`` — exactly the single-artifact-GNN-spin case for cells
A/B/B-mr. pytest --collect-only reports 9 tests total (3 existing +
6 new). Negative cells (B/B-mr/D) are CI-only — require LAMMPS-with-
deepmd runtime.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@source/lmp/tests/test_lammps_spin_dpa3_pt2.py`:
- Around line 458-476: The test function test_pair_deepspin_mpi_no_atom_map
currently compares potential energy, forces, and force magnitudes between
out_no_map and out_baseline but omits virial; add a parity assertion for the
virial tensor by comparing out_no_map["virial"] to out_baseline["virial"] (use
np.testing.assert_allclose with similar tolerances as forces, e.g., atol=1e-8,
rtol=0) so virial regressions are detected; place this assertion alongside the
existing force/force_mag checks referencing the same out_no_map and out_baseline
variables.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0f4fc365-5441-4066-8a47-36ef3404b756

📥 Commits

Reviewing files that changed from the base of the PR and between 7be4cd0 and 397a523.

📒 Files selected for processing (3)
  • source/lmp/tests/run_mpi_pair_deepmd_spin_dpa3_pt2.py
  • source/lmp/tests/test_lammps_spin_dpa3_pt2.py
  • source/tests/infer/gen_spin.py

Comment thread source/lmp/tests/test_lammps_spin_dpa3_pt2.py
Comment thread source/lmp/tests/test_lammps_spin_dpa3_pt2.py
Copy link
Copy Markdown
Contributor

@njzjz-bot njzjz-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, the new version addresses the earlier metadata probing/comment issues and the nprocs predicate is a good improvement over nswap for spin.

I left one blocking inline test-coverage comment: test_pair_deepspin_mpi_no_atom_map compares energy/forces/force_mag but still skips per-atom virials even though the runner parses them and the adjacent parity tests assert them. Please add the virial parity assertion before merge.

CI is otherwise green from GitHub checks.

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)

Han Wang added 2 commits May 22, 2026 13:41
Replaces the ``lmp_list.nswap > 0`` proxy with a direct
``lmp_list.nprocs > 1`` predicate in DeepPotPTExpt / DeepSpinPTExpt
dispatch.  ``atom_style spin`` populates ``nswap`` even in single-rank
runs (to propagate PBC ghost spins), so the proxy was unsound and caused
single-rank spin LAMMPS jobs to be misclassified as multi-rank by the
fail-fast guard.

Also populates ``lmp_list.mapping`` from the LAMMPS atom-map in
``pair_deepspin.cpp`` (mirroring ``pair_deepmd.cpp``), so single-rank
spin .pt2 GNN inference can resolve ghost indices.

Mechanism:
- Add ``int nprocs`` field + ``set_nprocs`` to ``InputNlist`` (and the
  matching ``DP_NlistSetNprocs`` / ``deepmd::hpp::InputNlist::set_nprocs``
  wrappers) so LAMMPS pair styles can forward ``comm->nprocs``.
- ``pair_deepmd.cpp``, ``pair_deepspin.cpp``, and ``fix_dplr.cpp`` now
  call ``lmp_list.set_nprocs(comm->nprocs)``.
- Refactor the ``serialization.py`` ``has_message_passing`` probe to
  walk ``model -> atomic_model -> descriptor`` (per njzjz-bot review),
  so composite atomic models (e.g. LinearAtomicModel) report the flag.
- Rewrite the stale ``world == nullptr`` carve-out comment in the
  identity-mapping fallback blocks.
Replaces the ``set_nprocs(comm->nprocs)`` setter with a constructor
parameter on both ``InputNlist`` overloads (lightweight and comm-aware)
and the matching ``DP_NewNlist`` / ``DP_NewNlist_comm`` C entry points
plus the hpp wrappers.  ``nprocs`` is a conceptual sibling of ``world``
and ``nswap`` and lives more naturally in the initializer than as a
post-construction setter.  Default value 1 keeps direct C++ API
consumers source-compatible — they need not pass anything.

- ``InputNlist`` lightweight ctor: new trailing ``int nprocs_ = 1``.
- ``InputNlist`` comm-aware ctor: new trailing ``int nprocs_ = 1``.
- ``DP_NewNlist`` / ``DP_NewNlist_comm``: new trailing ``int nprocs``
  arg (no default — C API).
- ``deepmd::hpp::InputNlist`` ctors: new trailing ``int nprocs = 1``.
- ``DP_NlistSetNprocs`` / ``set_nprocs`` removed (single-use in-tree).
- ``pair_deepmd.cpp``, ``pair_deepspin.cpp``, ``fix_dplr.cpp`` now pass
  ``comm->nprocs`` directly in the constructor call.
- Update stale comments in ``DeepPotPTExpt.cc`` and ``DeepSpinPTExpt.cc``
  to reference the constructor parameter instead of the removed setter.
@wanghan-iapcm wanghan-iapcm force-pushed the fix-pt-expt-lammps-no-atom-map branch from d88a46a to 4fcd312 Compare May 22, 2026 05:45
Han Wang added 2 commits May 22, 2026 14:01
Reviewer ask (njzjz-bot / coderabbitai on PR deepmodeling#5450): the
``test_pair_deepspin_mpi_no_atom_map`` parity check compared energy,
forces, and force_mag against the with-atom-map multi-rank baseline
but skipped virials.  Add the matching ``virials`` assertion so a
regression in the spin with-comm path that affects only the virial
tensor cannot pass silently.  Other spin MPI parity tests in this
file already assert virials at the same tolerance.
…ment

The lightweight 4-arg ``InputNlist`` (and matching ``DP_NewNlist`` /
hpp wrapper) takes neither ``world`` nor any swap arrays, so it cannot
drive the with-comm dispatch path even when constructed in a multi-rank
LAMMPS context.  The earlier addition of an ``int nprocs_ = 1``
parameter was therefore write-only — passing ``comm->nprocs`` to it
(as ``fix_dplr.cpp`` did) had no observable effect on dispatch.  Drop
the parameter to restore the original ABI of the 4-arg form and make
the contract honest: this overload is always single-rank-classified.
Multi-rank GNN dispatch goes through the 12-arg comm-aware constructor.

The ``nprocs`` data member stays (default 1), settable only via the
comm-aware constructor's trailing ``nprocs_`` argument.

Also mirror the decision-matrix comment from ``DeepPotPTExpt.cc`` into
``DeepSpinPTExpt.cc`` so the spin path documents the four-cell rationale
in-line rather than redirecting the reader to the non-spin file.
@wanghan-iapcm wanghan-iapcm requested a review from njzjz-bot May 22, 2026 06:48
@wanghan-iapcm wanghan-iapcm added the Test CUDA Trigger test CUDA workflow label May 22, 2026
@github-actions github-actions Bot removed the Test CUDA Trigger test CUDA workflow label May 22, 2026
Copy link
Copy Markdown
Contributor

@njzjz-bot njzjz-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed latest tip d4a9e5d. The previous concerns have been addressed, CI is green except the still-running CUDA aggregate check, and I did not find any new blocker.

— OpenClaw 2026.5.12 (model: custom-chat-jinzhezeng-group/gpt-5.5)

@njzjz njzjz added this pull request to the merge queue May 22, 2026
Merged via the queue into deepmodeling:master with commit 4604131 May 23, 2026
73 of 75 checks passed
pull Bot pushed a commit to ishandutta2007/deepmd-kit that referenced this pull request May 23, 2026
…ers (deepmodeling#5455)

## Summary

- Adds an explicit `del lammps` before `MPI.Finalize()` in all four
mpirun-driven LAMMPS test runners (`run_mpi_pair_deepmd.py`,
`run_mpi_pair_deepmd_spin.py`, `run_mpi_pair_deepmd_dpa3_pt2.py`,
`run_mpi_pair_deepmd_spin_dpa3_pt2.py`).
- Fixes a teardown-order race that intermittently manifests as
**subprocess exit code 136 (SIGFPE)** for
`test_pair_deepmd_mpi_dpa3_spin_empty_subdomain` on the GitHub Actions
CUDA runner image.

## Background

Recent CI runs on multiple unrelated PRs (deepmodeling#5446, deepmodeling#5450) hit the
identical failure signature:

```
short test summary info ============================
FAILED source/lmp/tests/test_lammps_spin_dpa3_pt2.py::test_pair_deepmd_mpi_dpa3_spin_empty_subdomain
  - subprocess.CalledProcessError: ... returned non-zero exit status 136.
```

- Reproduces ~1 in 5 runs on the GitHub Actions CUDA image
(`nvidia/cuda:12.9.1-cudnn-devel-ubuntu22.04`).
- **Does not reproduce on a V100 Bohrium dev box** — 60/60 consecutive
passes.

So it's a pre-existing flake, not caused by either of the recent PRs.

## Root cause (empirically confirmed)

The runner ends with:

```python
forces_global = lammps.lmp.gather_atoms(...)
...
MPI.Finalize()
```

`lammps` is still alive when `MPI.Finalize()` returns. Python then
garbage-collects it during interpreter shutdown, which triggers
`LAMMPS::~LAMMPS` → `Finish::end()` → **`MPI_Allreduce`** for timing
aggregation. By that time, MPI has already been finalized, which is
undefined behavior.

I instrumented the runner with timestamped prints to verify the order
directly. Without the fix:

```
t=3311.770  R1: BEFORE MPI.Finalize
t=3311.778  R0/R1: AFTER MPI.Finalize     ← MPI is finalized
t=3311.778  R0/R1: PY ATEXIT
… process exit, LAMMPS destructor runs HERE
```

With the fix:

```
t=3423.100  R1: AFTER del lammps (LAMMPS destructor done)   ← MPI still up
t=3423.108  R0/R1: BEFORE MPI.Finalize
t=3423.108  R0/R1: AFTER MPI.Finalize
```

So the LAMMPS destructor now runs while MPI is still up, which is what
its `MPI_Allreduce`/`MPI_Gather` calls require.

The reason this manifests as SIGFPE only on the CUDA CI image (not on
V100) is most likely that the CI image (or one of its preloaded
libraries) enables FP-exception trapping; on V100 the same
MPI-after-Finalize errors return silently. The flake is
environment-specific, but the underlying antipattern is unconditional
and worth fixing in any environment.

## Test plan

- [x] Local CPU: 29/29 LAMMPS tests pass (`test_lammps_dpa3_pt2.py`,
`test_lammps_spin_dpa3_pt2.py`)
- [x] Remote V100: 50/50 stress runs of the previously-failing test
- [x] Empirical confirmation that the fix flips the
LAMMPS-destructor-vs-MPI.Finalize ordering (see Background)
- [ ] CI: re-run the spin LAMMPS suite multiple times to confirm the
SIGFPE no longer appears

## Known limitations

- Cannot directly observe the SIGFPE on V100, so the fix has not been
observed *preventing* the actual crash — only correcting the antipattern
that we have strong reason to believe causes it.
- If the failure persists after merge, the next candidate root cause is
CUDA stream destruction order, and we should revisit.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved MPI cleanup sequence in multiple test runners to prevent
finalization-related crashes when executing tests in distributed MPI
environments.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/deepmodeling/deepmd-kit/pull/5455?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>
njzjz pushed a commit to njzjz/deepmd-kit that referenced this pull request May 27, 2026
…ling#5467)

## Summary

`TestChangeBias` is the dominant memory hog in `Test Python` shard `(10,
3.13)` of the CI matrix — by itself it peaks at **~5 GB RSS**, leaving
so little headroom under the 7 GB GitHub-hosted runner that the shard
intermittently loses communication with the GitHub Actions server. This
causes the recurring `runner lost communication` failure that has
affected many recent PRs (deepmodeling#5446, deepmodeling#5448, deepmodeling#5450, deepmodeling#5455, deepmodeling#5456, …).

This PR shrinks the change-bias test dataset from 80 frames to 5 frames,
dropping the class's peak RSS to **~1.7 GB** while keeping all 9 tests
passing — including the strict `atol=1e-10` `pt2_pte_consistency` check.

## How I located it

Local reproduction of shard `(10, 3.13)` (using the same
`.test_durations` cache CI uses, so identical test partitioning):

| Test profiled in isolation | Peak RSS |
|---|---|
| **`test_change_bias_frozen_pte`** | **5.04 GB** ← outlier |
| `TestDeepEvalEnerPt2` class | 1.43 GB |
| `TestDeepEvalEnerAparamPt2` class | 1.44 GB |
| `TestSpinInference::test_get_use_spin` | 1.41 GB |
| `test_finetune_from_pt2_use_pretrain_script` | 1.41 GB |
| `test_training_loop_compiled` | 1.32 GB |
| `test_export_pipeline` | 1.57 GB |
| `test_descriptor_shape_dpa1` | 1.34 GB |

Then phase-by-phase RSS profiling inside `dp change-bias` showed the 4.3
GB jump happens entirely inside `compute_output_stats` →
`_compute_model_predict`. Scaling experiment confirms it: peak grows
**linearly at ~50 MB per frame** of input data.

| nbatches | Peak RSS | per-frame |
|---|---|---|
| 1  | 567 MB | — |
| 5  | 781 MB | +43 MB |
| 20 | 1583 MB | +53 MB |
| 80 | 4797 MB | +53 MB |

That's a leak in the `torch.no_grad()`-wrapped `forward_common_atomic`
somewhere — separate from autograd. The water example has 80 frames at
batch_size=1, so the CLI default `nbatches = min(data.get_nbatches()) =
80` triggers all 80 forwards in one go.

## Why I can't just pass `-n 5`

`_load_batch_set` shuffles when it loads the set. If `nbatches <
total_frames`, the loop samples a random subset — and the two calls in
`test_change_bias_pt2_pte_consistency` (running in the **same Python
process** via `main(cmds)`, with `dp_random`'s state advancing between
calls) would see **different** subsets → different biases → the
`atol=1e-10` assertion fails.

`nbatches == total_frames` makes the forward enumerate **every** frame
regardless of shuffle order, so the aggregate bias is invariant under
shuffle. Determinism is preserved.

## The fix

Build a 5-frame subset of `examples/water/data/data_0` in
`TestChangeBias.setUpClass` and point both the trainer config and the
change-bias `-s` argument at it. `nbatches` then resolves to 5 (= the
new dataset size = full enumeration), and all 9 tests pass with peak RSS
at ~1.7 GB.

## Test plan

- [x] All 9 tests in `TestChangeBias` pass locally (CPU fp64):
  - `test_change_bias_with_data`
  - `test_change_bias_with_data_sys_file`
  - `test_change_bias_with_user_defined`
  - `test_change_bias_frozen_pte`
  - `test_change_bias_frozen_pt2`
  - `test_change_bias_frozen_pt2_user_defined`
- `test_change_bias_pt2_pte_consistency` (atol=1e-10 — the
determinism-sensitive one)
  - `test_change_bias_pte_preserves_model_def_script`
  - `test_change_bias_pt2_preserves_model_def_script`
- [x] Peak RSS measurement (kernel `ru_maxrss`):
  - Before: 5.66 GB
  - After: **1.62 GB** (single test) / **1.75 GB** (whole class)
- [ ] CI shard `(10, 3.13)` confirms no more `runner lost communication`
on this branch (pending CI run)

## Known limitations

- **The underlying ~50 MB/frame leak in `forward_common_atomic` remains
as a production bug.** Users running `dp change-bias` on real datasets
with thousands of frames will see multi-GB RSS growth. Worth a separate
follow-up to find and patch the leak.
- The 5-frame number is somewhat arbitrary. It's chosen as the smallest
value that (a) keeps the assertion logic working, (b) leaves room for
the least-squares regression (need ≥ ntypes = 2 frames).
- `_make_subset_dataset` only handles `set.000`; multi-set datasets
would need extension. Not needed for water/data_0.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Reduced the dataset used by a change-bias test to a small, fixed
subset (5 frames) so test runs use far less disk space and complete
faster.
* Test setup now builds and points to the truncated dataset for all
related invocations, lowering resource overhead during CI and local
testing.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/deepmodeling/deepmd-kit/pull/5467?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants