Refactor/merge openmp by lijianing-sudo · Pull Request #7446 · deepmodeling/abacus-develop

lijianing-sudo · 2026-06-06T17:00:51Z

PR: OpenMP Parallel Optimization for ABACUS MD Module and ML Potential Interfaces (NEP/DPMD/LJ)

Reminder

Have you linked an issue with this pull request?
Have you added adequate unit tests and/or case tests for your pull request?
Have you noticed possible changes of behavior below or in the linked issue?
Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

Fix #...

Unit Tests and/or Case Tests for my changes

A unit test is added for each new feature or bug fix.

Existing Unit Tests Pass:

MODULE_MD_LJ_pot (6 tests)
MODULE_MD_func (7 tests)
MODULE_MD_fire
MODULE_MD_verlet
MODULE_MD_nhc
MODULE_MD_msst
MODULE_MD_lgv

Test Infrastructure:

Introduced shared MD test fixtures (source/source_md/test/md_test_fixture.h) to eliminate duplicated SetUp/TearDown across 6 test files.

Microbenchmark Verification:

Independent C++ microbenchmarks were written for each optimized kernel (see Test/openmp_nep_basic_benchmark.cpp and companion scripts).
2 million atoms, repeated 5 times, tested at 1/2/4/8/16 threads on Intel Xeon Platinum 8163.
All per-atom write loops produce bitwise-identical results (max_abs_diff = 0).
Reduction loops show floating-point differences at the 1e-10 to 1e-8 level due to summation order changes — expected and acceptable for MD trajectories.
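
Why a reduction can differ from the serial sum while per-atom writes cannot: floating-point addition is not associative, so splitting a sum into per-thread partial sums changes the rounding. The sketch below emulates that chunking in plain serial C++; `sum_serial` and `sum_chunked` are hypothetical helper names, not functions from the PR.

```cpp
#include <cstddef>
#include <vector>

// Serial left-to-right sum, the order a single-threaded loop uses.
double sum_serial(const std::vector<double>& x)
{
    double s = 0.0;
    for (double v : x)
    {
        s += v;
    }
    return s;
}

// Emulates a static-schedule reduction over `nchunk` threads:
// each contiguous chunk is summed privately, then the partials are combined.
double sum_chunked(const std::vector<double>& x, std::size_t nchunk)
{
    const std::size_t n = x.size();
    double total = 0.0;
    for (std::size_t c = 0; c < nchunk; ++c)
    {
        const std::size_t begin = c * n / nchunk;
        const std::size_t end = (c + 1) * n / nchunk;
        double partial = 0.0;
        for (std::size_t i = begin; i < end; ++i)
        {
            partial += x[i];
        }
        total += partial;
    }
    return total;
}
```

With the deliberately extreme input `{1e16, 1.0, -1e16, 1.0}` and IEEE 754 double arithmetic, `sum_serial` yields 1.0 while `sum_chunked(x, 2)` yields 0.0. Realistic MD data has far less cancellation, which is consistent with the much smaller differences reported above.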

What's changed?

This PR integrates OpenMP parallelization from three feature branches (refactor/md-factory, refactor/parallel-optimize, refactor/md-openmp-remainder) into the ABACUS MD module and ML potential interfaces. 22 parallel loops or worksharing regions are added across 12 source files (+3934/−342 lines total).

1. MD Base Loops (`source/source_md/`)

| Function | File | Strategy |
| --- | --- | --- |
| `MD_base::update_pos()` | `md_base.cpp` | `#pragma omp parallel for schedule(static)` |
| `MD_base::update_vel()` | `md_base.cpp` | `#pragma omp parallel for schedule(static)` |
| `kinetic_energy()` | `md_func.cpp` | `reduction(+:ke)` |
| `force_virial()` force copy | `md_func.cpp` | Parallel per-atom copy |
| `temp_vector()` | `md_func.cpp` | 9 scalar reductions instead of shared-matrix accumulation |
| `rescale_vel()` | `md_func.cpp` | `schedule(static)` |

All loops carry an if (natom >= 256) clause, so systems with fewer than 256 atoms run serially and skip the parallel-region overhead.
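
A minimal sketch of the two loop shapes listed in the table, a per-atom write loop and a scalar reduction, both guarded by the size threshold. The `Vec3` struct and the free-function signatures are simplified stand-ins assumed for illustration; the real ABACUS methods take different arguments.

```cpp
#include <vector>

struct Vec3
{
    double x, y, z;
};

// Per-atom write loop: each iteration touches only its own element,
// so a static schedule needs no synchronization.
void update_pos(std::vector<Vec3>& pos, const std::vector<Vec3>& vel, double dt)
{
    const int natom = static_cast<int>(pos.size());
#pragma omp parallel for schedule(static) if (natom >= 256)
    for (int i = 0; i < natom; ++i)
    {
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}

// Scalar reduction: each thread accumulates a private `ke`,
// and OpenMP combines the partial sums at the end of the loop.
double kinetic_energy(const std::vector<double>& mass, const std::vector<Vec3>& vel)
{
    const int natom = static_cast<int>(vel.size());
    double ke = 0.0;
#pragma omp parallel for schedule(static) reduction(+ : ke) if (natom >= 256)
    for (int i = 0; i < natom; ++i)
    {
        const Vec3& v = vel[i];
        ke += 0.5 * mass[i] * (v.x * v.x + v.y * v.y + v.z * v.z);
    }
    return ke;
}
```

When the code is compiled without OpenMP support the pragmas are ignored and both functions run serially with the same results.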

2. NEP Interface (`source/source_esolver/esolver_nep.cpp/.h`)

Added atom_type_index / atom_local_index index caches for flat iat-based parallel loops.
Parallelized: coordinate buffer fill, per-atom energy reduction, force copy-back with unit conversion, and 9-component per-atom virial reduction.
NEP virial: reorganized from 9 separate full-array scans into a single per-atom scan with 9 scalar reductions — algorithmic + parallel gains combined (14.24× speedup at 8 threads).
nep.compute() external library call remains serial.
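
The index cache and the single-scan virial reduction could look roughly like the sketch below. The cache field names follow the description above; `IndexCache`, `build_index_cache`, `sum_virial`, and the component-major layout `virial[c * nat + i]` are assumptions made for illustration and are not taken from the PR or from the NEP library.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Maps a flat atom index iat to (type index, index within that type).
struct IndexCache
{
    std::vector<int> atom_type_index;
    std::vector<int> atom_local_index;
};

IndexCache build_index_cache(const std::vector<int>& natoms_per_type)
{
    IndexCache cache;
    for (std::size_t it = 0; it < natoms_per_type.size(); ++it)
    {
        for (int ia = 0; ia < natoms_per_type[it]; ++ia)
        {
            cache.atom_type_index.push_back(static_cast<int>(it));
            cache.atom_local_index.push_back(ia);
        }
    }
    return cache;
}

// One pass over the atoms with nine scalar reductions, instead of
// nine separate passes over the per-atom virial array.
// Assumed layout: virial[c * nat + i] holds component c of atom i.
std::array<double, 9> sum_virial(const std::vector<double>& virial, int nat)
{
    double v0 = 0.0, v1 = 0.0, v2 = 0.0, v3 = 0.0, v4 = 0.0;
    double v5 = 0.0, v6 = 0.0, v7 = 0.0, v8 = 0.0;
#pragma omp parallel for schedule(static) \
    reduction(+ : v0, v1, v2, v3, v4, v5, v6, v7, v8) if (nat >= 256)
    for (int i = 0; i < nat; ++i)
    {
        v0 += virial[0 * nat + i];
        v1 += virial[1 * nat + i];
        v2 += virial[2 * nat + i];
        v3 += virial[3 * nat + i];
        v4 += virial[4 * nat + i];
        v5 += virial[5 * nat + i];
        v6 += virial[6 * nat + i];
        v7 += virial[7 * nat + i];
        v8 += virial[8 * nat + i];
    }
    return {{v0, v1, v2, v3, v4, v5, v6, v7, v8}};
}
```

Using nine named scalars rather than a shared 3×3 matrix lets each thread keep private accumulators, so no atomic or critical section is needed inside the loop.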

3. DPMD Interface (`source/source_esolver/esolver_dp.cpp/.h`)

Added iat → (it, ia) index caches.
Parallelized: coordinate buffer fill and model force copy-back with unit conversion.
Introduced persistent member buffers (dp_cell, dp_coord, dp_model_force, dp_model_virial) to avoid repeated allocations.
dp.compute() external library call and 3×3 virial copy-back remain serial.
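
A sketch of how persistent buffers and the parallel force copy-back might fit together. The buffer names come from the list above; the `DpBuffers` wrapper, `copy_force_back`, the `Vec3` type, and the `force_unit` conversion factor are hypothetical and only illustrate the pattern.

```cpp
#include <cstddef>
#include <vector>

struct Vec3
{
    double x, y, z;
};

// Buffers kept as members so repeated MD steps reuse the same storage.
struct DpBuffers
{
    std::vector<double> dp_cell;
    std::vector<double> dp_coord;
    std::vector<double> dp_model_force;
    std::vector<double> dp_model_virial;

    // Resizing to an unchanged size does not reallocate.
    void resize(int nat)
    {
        dp_cell.resize(9);
        dp_model_virial.resize(9);
        dp_coord.resize(3 * static_cast<std::size_t>(nat));
        dp_model_force.resize(3 * static_cast<std::size_t>(nat));
    }
};

// Copies the flat model force array into per-atom vectors,
// applying a unit conversion factor (value supplied by the caller).
void copy_force_back(const std::vector<double>& model_force,
                     double force_unit,
                     std::vector<Vec3>& force)
{
    const int nat = static_cast<int>(force.size());
#pragma omp parallel for schedule(static) if (nat >= 256)
    for (int i = 0; i < nat; ++i)
    {
        force[i].x = model_force[3 * i + 0] * force_unit;
        force[i].y = model_force[3 * i + 1] * force_unit;
        force[i].z = model_force[3 * i + 2] * force_unit;
    }
}
```

The external model call would sit between filling `dp_coord` and calling `copy_force_back`, and stays serial as stated above.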

4. Thermostat and Barostat (`source/source_md/`)

| Class | Method | File |
| --- | --- | --- |
| `Verlet` | `thermalize()` velocity rescaling | `verlet.cpp` |
| `MSST` | `rescale()` shock-direction velocity scaling | `msst.cpp` |
| `MSST` | `vel_sum()` velocity norm reduction | `msst.cpp` |
| `MSST` | `propagate_vel()` per-atom velocity propagation | `msst.cpp` |
| `NoseHoover` | `particle_thermo()` final velocity scaling | `nhchain.cpp` |
| `NoseHoover` | `vel_baro()` barostat velocity update | `nhchain.cpp` |

Thermostat chain recurrence integration and cell dilation remain serial.

5. FIRE Algorithm (`source/source_md/fire.cpp`)

FIRE::check_fire() parallelized in three phases:

Three-scalar reduction for P, sumforce, normvel
Parallel velocity-force mixing
Parallel velocity zeroing (in P <= 0 branch)

Scalar state updates (alpha, negative_count, dt) remain serial.
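
The three phases can be sketched as follows, assuming the standard FIRE mixing rule v ← (1 − α)v + α|v|F/|F|. The function name `fire_mix`, the flat `3 * natom` arrays, and the guard against a zero force norm are illustrative assumptions; the serial updates of alpha, negative_count, and dt are left to the caller.

```cpp
#include <cmath>
#include <vector>

// vel and force are flat arrays of length 3 * natom.
// Returns P = sum(F . v) so the caller can update the scalar FIRE state.
double fire_mix(std::vector<double>& vel, const std::vector<double>& force, double alpha)
{
    const int n = static_cast<int>(vel.size());
    const int natom = n / 3;

    // Phase 1: three scalar reductions in a single pass.
    double P = 0.0;
    double sumforce = 0.0;
    double normvel = 0.0;
#pragma omp parallel for schedule(static) reduction(+ : P, sumforce, normvel) if (natom >= 256)
    for (int k = 0; k < n; ++k)
    {
        P += vel[k] * force[k];
        sumforce += force[k] * force[k];
        normvel += vel[k] * vel[k];
    }
    sumforce = std::sqrt(sumforce);
    normvel = std::sqrt(normvel);

    // Phase 2: velocity-force mixing (skipped if the force norm is zero).
    if (sumforce > 0.0)
    {
        const double mix = alpha * normvel / sumforce;
#pragma omp parallel for schedule(static) if (natom >= 256)
        for (int k = 0; k < n; ++k)
        {
            vel[k] = (1.0 - alpha) * vel[k] + mix * force[k];
        }
    }

    // Phase 3: zero all velocities when the system moves uphill.
    if (P <= 0.0)
    {
#pragma omp parallel for schedule(static) if (natom >= 256)
        for (int k = 0; k < n; ++k)
        {
            vel[k] = 0.0;
        }
    }
    return P;
}
```

Fusing the three sums into one loop reads each velocity and force component once, which matters more than the thread count for a memory-bound kernel like this.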

6. LJ Interface (`source/source_esolver/esolver_lj.cpp/.h`)

Added global atom index cache.
Restructured nested type-iteration loops into flat iat-based loop.
Uses schedule(dynamic, 32) to balance threads when neighbor counts vary between atoms.
Thread-local potential and virial arrays with atomic (energy) and critical (virial) reduction at thread exit — no per-neighbor locks.
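
The accumulation pattern described in this list can be sketched as below. This is not the ABACUS implementation: it assumes a full neighbor list (each pair visited from both atoms, hence the factor 0.5), a hypothetical `lj_accumulate` function, and a virial sign convention chosen only for the example.

```cpp
#include <array>
#include <vector>

using Vec3 = std::array<double, 3>;

struct LJResult
{
    double potential;
    std::array<double, 9> virial;
};

LJResult lj_accumulate(const std::vector<Vec3>& pos,
                       const std::vector<std::vector<int>>& neigh,
                       double epsilon,
                       double sigma,
                       std::vector<Vec3>& force)
{
    const int nat = static_cast<int>(pos.size());
    double potential = 0.0;
    std::array<double, 9> virial{};

#pragma omp parallel if (nat >= 256)
    {
        // Thread-local accumulators: no locking inside the neighbor loop.
        double pot_local = 0.0;
        std::array<double, 9> vir_local{};

#pragma omp for schedule(dynamic, 32)
        for (int i = 0; i < nat; ++i)
        {
            Vec3 fi{};
            for (int j : neigh[i])
            {
                const Vec3 d = {pos[i][0] - pos[j][0], pos[i][1] - pos[j][1], pos[i][2] - pos[j][2]};
                const double r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2];
                const double sr2 = sigma * sigma / r2;
                const double sr6 = sr2 * sr2 * sr2;
                pot_local += 0.5 * 4.0 * epsilon * (sr6 * sr6 - sr6);
                const double fmag = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r2;
                for (int a = 0; a < 3; ++a)
                {
                    fi[a] += fmag * d[a];
                    for (int b = 0; b < 3; ++b)
                    {
                        vir_local[3 * a + b] += 0.5 * d[a] * fmag * d[b];
                    }
                }
            }
            force[i] = fi; // only the owning iteration writes force[i]
        }

        // One merge per thread at exit from the worksharing loop.
#pragma omp atomic
        potential += pot_local;

#pragma omp critical
        {
            for (int c = 0; c < 9; ++c)
            {
                virial[c] += vir_local[c];
            }
        }
    }
    return {potential, virial};
}
```

The cost of synchronization scales with the number of threads rather than the number of neighbor pairs, which is the point of merging only at thread exit.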

7. Code Quality Refactors

Extracted MD statistics helpers: calc_kinetic_state() / calc_stress_state() (md_func.h, md_statistics.h).
MD runner factory function: new/delete → std::unique_ptr (run_md.cpp).
Shared test fixture base classes to reduce duplication across 6 MD test files.
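
The ownership change in the runner factory follows a common pattern, sketched here with hypothetical class names (`MdRunner`, `VerletRunner`, `FireRunner`) and a hypothetical string selector; the actual classes and selection logic in run_md.cpp may differ.

```cpp
#include <memory>
#include <string>

// Hypothetical base class standing in for the MD runner interface.
class MdRunner
{
  public:
    virtual ~MdRunner() = default;
    virtual std::string name() const = 0;
};

class VerletRunner : public MdRunner
{
  public:
    std::string name() const override { return "verlet"; }
};

class FireRunner : public MdRunner
{
  public:
    std::string name() const override { return "fire"; }
};

// The factory returns an owning pointer, so the caller never calls delete.
// An unknown type yields nullptr in this sketch.
std::unique_ptr<MdRunner> make_md_runner(const std::string& md_type)
{
    if (md_type == "verlet")
    {
        return std::make_unique<VerletRunner>();
    }
    if (md_type == "fire")
    {
        return std::make_unique<FireRunner>();
    }
    return nullptr;
}
```

The runner is destroyed automatically when the returned pointer goes out of scope, including on early returns and exceptions.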

Performance Summary (Microbenchmark, 8 threads, 2M atoms, Xeon Platinum 8163)

| Category | Kernel | Speedup | Efficiency |
| --- | --- | --- | --- |
| MD Base | `update_pos` | 7.36× | 92.0% |
| MD Base | `update_vel` | 7.19× | 89.9% |
| MD Base | `kinetic_energy` | 6.98× | 87.2% |
| MD Base | `temp_vector` | 7.38× | 92.2% |
| NEP | `coord_fill` | 7.15× | 89.4% |
| NEP | `energy_sum` | 8.03× | 100.4% |
| NEP | `force_fill` | 6.99× | 87.4% |
| NEP | `virial_sum` | 14.24× | 177.9%* |
| DPMD | `coord_fill` | 5.50× | 68.8% |
| DPMD | `force_copy` | 7.19× | 89.9% |
| Verlet | `thermalize` | 7.80× | 97.5% |
| MSST | `rescale` | 7.28× | 91.1% |
| MSST | `propagate_vel` | 7.18× | 89.7% |
| NHC | `particle_thermo` | 7.21× | 90.2% |
| FIRE | `check_fire` (mix) | 7.64× | 95.6% |
| LJ | `runner` core loop | 6.96× | 87.0% |

*NEP virial 14.24× includes loop reorganization benefits beyond pure 8-thread scaling.
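
The Efficiency column appears to be the usual parallel efficiency, speedup divided by thread count; this reading is inferred from the numbers and is not stated in the PR. Here $T_1$ and $T_p$ denote the wall time on 1 and on $p$ threads:

```latex
S_p = \frac{T_1}{T_p}, \qquad
E_p = \frac{S_p}{p}, \qquad
\text{e.g. } E_8(\texttt{update\_pos}) = \frac{7.36}{8} = 0.92 = 92.0\%
```

Under this definition a value above 100%, as for the NEP virial, signals that the parallel version does less work than the serial baseline, which matches the footnote about loop reorganization.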

Known Limitations & Future Work

End-to-end tests: NEP and DPMD optimizations lack end-to-end tests with real external model libraries (__NEP, deepmd).
LJ parallel path: existing LJ unit tests use 4 atoms (< 256 threshold), covering only the serial path.
MPI + OpenMP hybrid: microbenchmarks are single-process; oversubscription risks under mixed MPI/OpenMP have not been characterized.
Thread threshold: nat >= 256 is an empirical uniform threshold; per-kernel tuning (64/128/256/512) is recommended.
LJ scheduling: schedule(dynamic, 32) vs static and optimal chunk size have not been systematically benchmarked across different neighbor distributions.
Microbenchmark results ≠ end-to-end wall-time: excluded overheads include MPI communication, neighbor-list construction, file I/O, and external model computation.

Any changes of core modules? (ignore if not applicable)

The MD ESolver interface layer (esolver_nep.cpp, esolver_dp.cpp, esolver_lj.cpp) is modified to add index caches and parallel worksharing constructs. No changes to the ESolver base class virtual function signatures. All external library calls (nep.compute(), dp.compute()) remain serial and their calling convention is unchanged.

mohanchen · 2026-06-09T03:27:12Z

this file is not needed

mohanchen · 2026-06-09T03:35:23Z

You may first remove unnecessary files, then add tests to show the effects of code refactoring.

lijianing-sudo · 2026-06-26T11:48:37Z

Done. Removed unnecessary files (Planners/, Results/, Test/, opt_logs/). The PR now contains only 26 source files across source/source_md/ and source/source_esolver/, plus unit tests in source/source_md/test/. Please let me know if additional performance tests are needed.

Add #pragma omp parallel for to major per-atom loops in MD module, enabling multi-threaded execution for NEP/DPMD potentials and thermostat/integrator operations.

Scope (23 files):
- source/source_md/: md_base, md_func, fire, msst, nhchain, verlet, run_md, md_statistics.h
- source/source_esolver/: esolver_nep, esolver_dp
- source/source_md/test/: 7 unit tests + md_test_fixture.h

Strategy: schedule(static) with if(nat>=256), reduction clauses, atomic/critical for shared accumulators.

LJ esolver excluded (upstream refactored to UnitCellLite API). Rebased onto deepmodeling/develop.

Co-Authored-By: Claude <noreply@anthropic.com>

lijianing-sudo · 2026-06-26T13:25:14Z

Updated the PR:

Removed unnecessary non-code files (Planners/, Results/, Test/, opt_logs/)
Excluded LJ esolver (upstream refactored to UnitCellLite API, incompatible)
Rebased onto latest deepmodeling/develop
Fixed test CMakeLists.txt for the new module_neighlist dependencies

The PR now focuses on 23 source files: core MD loops (md_base, md_func, fire, msst, nhchain, verlet, run_md) + NEP/DPMD esolver interfaces + unit tests. All #pragma omp directives follow the same strategy: schedule(static) with if(nat>=256), reduction clauses for energy/virial, atomic/critical for shared accumulators.

Please let me know if any further changes are needed.

lijianing-sudo · 2026-06-26T13:39:16Z

All 16 CI checks are now passing.

mohanchen added the labels Refactor (Refactor ABACUS codes), MD & LAM (MD and Large Atomic Models), and project_learning, and removed the label Refactor (Refactor ABACUS codes) on Jun 7, 2026

mohanchen reviewed Jun 9, 2026

Comment thread opt_logs/dpmd_interface_20260603.md Outdated

lijianing-sudo force-pushed the refactor/merge-openmp branch from 56538dc to b0fccfa on June 26, 2026 11:43

lijianing-sudo force-pushed the refactor/merge-openmp branch from b0fccfa to f4ebb41 on June 26, 2026 12:26

lijianing-sudo force-pushed the refactor/merge-openmp branch from 4483678 to 72fa195 on June 26, 2026 12:52

lijianing-sudo wants to merge 1 commit into deepmodeling:develop from Audrey-777:refactor/merge-openmp
