fix(pt_expt): fail-fast on .pt2 GNN inference without LAMMPS atom-map (#5450)
## Summary
- Surface a previously-silent corruption / CUDA index assert in LAMMPS
`.pt2` inference for message-passing models (DPA2, DPA3, hybrids over
those) when the LAMMPS atom-map is not enabled. Previously the C++ side
fell into an identity-mapping fallback (`DeepPotPTExpt.cc:374-384`)
whose values are wrong for ghost slots; the model's `_exchange_ghosts`
(`deepmd/dpmodel/descriptor/repformers.py`) then performed
`take_along_axis(g1[1, nloc, dim], mapping_tiled)` with out-of-bounds
gather indices for ghosts — CUDA index assert in the user's DPA4 report,
undefined CPU output otherwise.
- Add a `has_message_passing` field to .pt2 metadata (mirrors the
descriptor's `has_message_passing()` API: true for DPA2/DPA3/hybrids
over those; false for se_e2_a/DPA1/etc.). Gate the fail-fast in
`DeepPotPTExpt::compute_inner` and `DeepSpinPTExpt::compute_inner` on
it. Non-GNN models retain their previous behaviour.
- Two error messages target the two distinct unsupported configurations:
  - **Single-rank without atom-map**: "Single-rank LAMMPS .pt2 inference
    requires `atom_modify map yes`…"
  - **Multi-rank without a with-comm artifact**: "Multi-rank LAMMPS .pt2
    inference requires the model to be exported with
    `use_loc_mapping=False`…"
- Refined predicate: `has_message_passing_ && !use_with_comm &&
!atom_map_present && nghost > 0`. The `nghost > 0` guard skips NoPbc and
isolated-cluster cases where identity over `[0, nloc)` is trivially
correct.
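
To make the failure mode in the first bullet concrete, here is a minimal pure-Python sketch (not the repository's code) of a gather over a per-local-atom array of length `nloc`. The helper names `identity_fallback` and `gather_rows` and the example mapping are hypothetical, and the bounds-checked loop is only a simplified stand-in for `take_along_axis`.

```python
def identity_fallback(nall):
    """Identity mapping over all nloc + nghost slots (illustrative stand-in)."""
    return list(range(nall))


def gather_rows(g1, mapping):
    """Gather rows of g1 (one row per local atom) by index, with explicit bounds checks."""
    nloc = len(g1)
    out = []
    for slot, idx in enumerate(mapping):
        if not 0 <= idx < nloc:
            raise IndexError(f"slot {slot}: index {idx} out of bounds for nloc={nloc}")
        out.append(g1[idx])
    return out


nloc, nghost = 3, 2
g1 = [[10.0], [11.0], [12.0]]  # per-local-atom features, shape [nloc, dim]

# Hypothetical atom-map result: ghost slots 3 and 4 are images of local atoms 2 and 0.
mapping_from_atom_map = [0, 1, 2, 2, 0]
extended = gather_rows(g1, mapping_from_atom_map)

# Identity fallback: ghost slots carry indices 3 and 4, which exceed nloc - 1.
try:
    gather_rows(g1, identity_fallback(nloc + nghost))
    fallback_error = None
except IndexError as exc:
    fallback_error = str(exc)
```

With `nghost == 0` the identity mapping only covers `[0, nloc)` and the gather stays in bounds, which is why the predicate carries the `nghost > 0` guard.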
### Four-cell coverage matrix in `test_lammps_dpa3_pt2.py`
| Cell | `use_loc_mapping` | atom-map | nprocs | Path | Test |
|---|---|---|---|---|---|
| A | True (regular only) | yes | 1 | regular w/ correct mapping | `test_pair_deepmd` *(existing)* |
| B | True | no | 1 | **fail fast** (single-rank msg) | `test_pair_deepmd_no_atom_map_fails_fast` *(new)* |
| B-mr | True | any | >1 | **fail fast** (multi-rank msg) | `test_pair_deepmd_mpi_no_with_comm_fails_fast` *(new, subprocess)* |
| C | False (regular + with-comm) | yes | 1 | regular w/ atom-map | `test_pair_deepmd_with_comm` *(new)* |
| C-mr | False | any | >1 | with-comm (`border_op`) | `test_pair_deepmd_mpi_dpa3` *(existing)* |
| D | False | no | 1 | **fail fast** (single-rank PBC can't drive `border_op`) | `test_pair_deepmd_with_comm_no_atom_map_fails_fast` *(new)* |
| D-mr | False | no | >1 | with-comm (mapping-free) | `test_pair_deepmd_mpi_no_atom_map` *(new, subprocess)* |
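
The matrix can be read as a small decision function. The sketch below is a reconstruction from the table rows only; it is not the C++ control flow in `compute_inner`, and the function name, argument names, and return strings are hypothetical. It assumes that `use_loc_mapping=False` exports carry a with-comm artifact and `use_loc_mapping=True` exports do not, as the second column indicates.

```python
def select_path(has_message_passing, has_with_comm_artifact,
                atom_map_present, multi_rank, nghost):
    """Return the path a configuration takes, per the coverage matrix (illustrative)."""
    if not has_message_passing:
        return "regular"  # non-GNN models keep their previous behaviour
    if multi_rank:
        # Multi-rank runs need the with-comm artifact; the atom-map is irrelevant there.
        return "with-comm" if has_with_comm_artifact else "fail: multi-rank"
    # Single rank: the regular graph consumes the mapping for ghost slots.
    if not atom_map_present and nghost > 0:
        return "fail: single-rank"
    return "regular"


# One entry per table row:
# (cell, with-comm artifact present, atom-map present, multi-rank)
cells = {
    "A":    (False, True,  False),
    "B":    (False, False, False),
    "B-mr": (False, True,  True),
    "C":    (True,  True,  False),
    "C-mr": (True,  True,  True),
    "D":    (True,  False, False),
    "D-mr": (True,  False, True),
}
results = {
    name: select_path(True, with_comm, atom_map, multi, nghost=8)
    for name, (with_comm, atom_map, multi) in cells.items()
}
```

For the "any" atom-map rows (B-mr, C-mr) the sketch uses `True`; flipping it to `False` leaves their outcome unchanged, which is what "any" means in the table.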
### Investigation note (resolves an earlier mystery)
`test_deeppot_dpa_ptexpt.cc` is misleadingly named — despite the `Dpa`
prefix it loads `deeppot_dpa1.pt2` (DPA1, non-message-passing). Its
regular `.pt2` graph never consumes `mapping` for ghost gather, so the
identity fallback was trivially safe and the test passed without
explicit `inlist.mapping`. The genuinely-DPA2 ctest is
`test_deeppot_dpa2_ptexpt.cc` (different file), which already explicitly
sets `inlist.mapping = mapping.data();` on all `cpu_lmp_nlist*` paths.
**No C++ ctest fixtures need editing in this PR** — the metadata-gated
fail-fast correctly skips DPA1.
### Backward compatibility
`has_message_passing_` defaults to **false** in C++ when the metadata
field is missing — so pre-PR .pt2 archives retain their previous
behaviour. Non-GNN pre-PR archives continue to work; GNN pre-PR archives
must be regenerated to opt into the fail-fast guard. In-tree fixtures
are generated by `gen_*.py` at CI time, which always writes the new
field.
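
The default-to-false rule can be illustrated with a short Python sketch. It assumes the metadata is available as a JSON-like dictionary; the helper name and the `type_map` key are hypothetical and are not taken from the actual `.pt2` layout or the C++ reader.

```python
import json


def read_has_message_passing(metadata):
    """Treat a missing field as False, so older archives skip the fail-fast guard."""
    return bool(metadata.get("has_message_passing", False))


# Hypothetical metadata payloads: one written before the field existed, one after.
old_archive_meta = json.loads('{"type_map": ["O", "H"]}')
new_archive_meta = json.loads('{"type_map": ["O", "H"], "has_message_passing": true}')

old_flag = read_has_message_passing(old_archive_meta)
new_flag = read_has_message_passing(new_archive_meta)
```

The trade-off is the one stated above: a GNN archive exported before this change reads as `False` and therefore is not protected by the guard until it is regenerated.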
## Test plan
- [x] Local C++ ctest `*PtExpt*` filter: **160 / 160 PASSED** (270 s)
against freshly-regenerated `.pt2` fixtures.
- [ ] CI runs the negative cells (B / B-mr / D), which exercise the
throw and verify the error-message substrings. The pytest assertions use
`pytest.raises(Exception, match=r"atom_modify map yes")` and a
stdout/stderr substring check for `use_loc_mapping=False`; if LAMMPS
wraps the exception with a different prefix or suffix than expected, the
match may need adjustment.
- [ ] CI cell D-mr (`test_pair_deepmd_mpi_no_atom_map`) verifies the
with-comm artifact handles ghosts via `border_op` without consuming the
mapping tensor.
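
As a rough guide to how robust the substring checks are: `pytest.raises(..., match=...)` applies `re.search` to the string form of the exception, so text added before or after the message does not break the match, while a change inside the matched substring does. The sketch below uses `re` directly; the wrapper prefix and suffix and the line-broken variant are invented for illustration and are not actual LAMMPS output.

```python
import re

pattern = r"atom_modify map yes"

# Quoted start of the single-rank message; the remainder is elided in the PR text.
raw = "Single-rank LAMMPS .pt2 inference requires `atom_modify map yes`"

# Hypothetical wrapping that only adds text around the message.
wrapped = "ERROR on proc 0: " + raw + " (hypothetical wrapper suffix)"

# Hypothetical wrapping that re-flows the message and splits the substring.
reflowed = raw.replace("atom_modify map yes", "atom_modify\nmap yes")

matches = [bool(re.search(pattern, msg)) for msg in (raw, wrapped, reflowed)]

# The multi-rank check is a plain substring test on captured output.
captured = "... requires the model to be exported with `use_loc_mapping=False`"
multi_rank_found = "use_loc_mapping=False" in captured
```

Only the re-flowed variant fails to match, so a mismatch in CI would most likely point to the message body being altered rather than merely wrapped.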
## Known limitations
- Multi-rank with `use_loc_mapping=True` remains unsupported after this
fix: the fail-fast surfaces it clearly, but there is no path forward
without re-exporting the model.
- Single-rank PBC + with-comm artifact + no atom-map (cell D) could be
made to work via a synthesized self-mirror `comm_dict`; deferred to a
follow-up.
- `MPI_Comm_size` is not used as the multi-rank predicate because
`api_cc` does not link MPI directly; `lmp_list.nswap > 0` serves as the
proxy (equivalent for all current LAMMPS configurations).
- The pre-PR DPA3 `use_loc_mapping=True` archives lacking the new
metadata field continue to exhibit the silent-corruption bug — users
must regenerate.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
  * CLI flag to disable atom-ID→local-index mapping for test runs;
    generator now produces a single-artifact spin model variant; APIs allow
    callers to declare MPI rank count for neighbor lists.
* **Bug Fixes**
  * Serialized metadata now records a message-passing capability flag;
    runtime enforces compatibility and surfaces clear fail-fast errors when
    required atom-mapping or artifacts are missing.
* **Tests**
  * Expanded coverage for message-passing variants, atom-map on/off
    scenarios, single- vs multi-rank MPI cases, and related fail-fast
    behaviors.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>