
Refactor/merge openmp #7446

Open
lijianing-sudo wants to merge 1 commit into deepmodeling:develop from Audrey-777:refactor/merge-openmp


@lijianing-sudo


PR: OpenMP Parallel Optimization for ABACUS MD Module and ML Potential Interfaces (NEP/DPMD/LJ)

Reminder

  • Have you linked an issue with this pull request?
  • Have you added adequate unit tests and/or case tests for your pull request?
  • Have you noticed possible changes of behavior below or in the linked issue?
  • Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

Fix #...

Unit Tests and/or Case Tests for my changes

  • A unit test is added for each new feature or bug fix.

Existing Unit Tests Pass:

  • MODULE_MD_LJ_pot (6 tests)
  • MODULE_MD_func (7 tests)
  • MODULE_MD_fire
  • MODULE_MD_verlet
  • MODULE_MD_nhc
  • MODULE_MD_msst
  • MODULE_MD_lgv

Test Infrastructure:

  • Introduced shared MD test fixtures (source/source_md/test/md_test_fixture.h) to eliminate duplicated SetUp/TearDown across 6 test files.

Microbenchmark Verification:

  • Independent C++ microbenchmarks were written for each optimized kernel (see Test/openmp_nep_basic_benchmark.cpp and companion scripts).
  • 2 million atoms, repeated 5 times, tested at 1/2/4/8/16 threads on Intel Xeon Platinum 8163.
  • All per-atom write loops produce bitwise-identical results (max_abs_diff = 0).
  • Reduction loops show floating-point differences at the 1e-10 to 1e-8 level due to summation order changes — expected and acceptable for MD trajectories.
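The difference between the two verification outcomes above can be illustrated with a minimal sketch. It emulates a statically scheduled reduction without threads by summing contiguous chunks and then combining the partial sums. The helper names (max_abs_diff, serial_sum, chunked_sum) are hypothetical and are not taken from the PR's benchmark files.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest element-wise absolute difference between two equally sized arrays.
double max_abs_diff(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size());
    double diff = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        diff = std::max(diff, std::abs(a[i] - b[i]));
    }
    return diff;
}

// Plain left-to-right sum, as a serial loop would compute it.
double serial_sum(const std::vector<double>& x)
{
    double s = 0.0;
    for (double v : x)
    {
        s += v;
    }
    return s;
}

// Emulates a reduction with static scheduling: each of nchunks "threads" sums a
// contiguous block, and the partial sums are combined afterwards.
double chunked_sum(const std::vector<double>& x, std::size_t nchunks)
{
    assert(nchunks > 0);
    const std::size_t n = x.size();
    double total = 0.0;
    for (std::size_t c = 0; c < nchunks; ++c)
    {
        const std::size_t begin = c * n / nchunks;
        const std::size_t end = (c + 1) * n / nchunks;
        double partial = 0.0;
        for (std::size_t i = begin; i < end; ++i)
        {
            partial += x[i];
        }
        total += partial;
    }
    return total;
}
```

Per-atom write loops compute every output element from the same operands in the same order regardless of thread count, so max_abs_diff is exactly zero. A reduction regroups the additions, so serial_sum and chunked_sum may differ in the last bits.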

What's changed?

This PR integrates OpenMP parallelization from three feature branches (refactor/md-factory, refactor/parallel-optimize, refactor/md-openmp-remainder) into the ABACUS MD module and ML potential interfaces. 22 parallel loops or worksharing regions are added across 12 source files (+3934/−342 lines total).

1. MD Base Loops (source/source_md/)

| Function | File | Strategy |
|---|---|---|
| MD_base::update_pos() | md_base.cpp | #pragma omp parallel for schedule(static) |
| MD_base::update_vel() | md_base.cpp | #pragma omp parallel for schedule(static) |
| kinetic_energy() | md_func.cpp | reduction(+:ke) |
| force_virial() force copy | md_func.cpp | Parallel per-atom copy |
| temp_vector() | md_func.cpp | 9 scalar reductions instead of shared-matrix accumulation |
| rescale_vel() | md_func.cpp | schedule(static) |

All loops use if (natom >= 256) to skip parallel overhead for small systems.
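A minimal sketch of the two loop shapes listed above, assuming a simplified update rule and a hypothetical Vec3 type (the real functions operate on the project's own vector type and carry extra logic that is omitted here). The pragmas are ignored when the code is compiled without OpenMP, so the functions give the same results serially.

```cpp
#include <cassert>
#include <vector>

// Hypothetical 3-component vector standing in for the project's own vector type.
struct Vec3
{
    double x, y, z;
};

// Simplified position update pos += vel * dt. Each iteration writes only pos[i],
// so schedule(static) is race-free; the if clause keeps small systems serial.
void update_pos(std::vector<Vec3>& pos, const std::vector<Vec3>& vel, double dt)
{
    assert(pos.size() == vel.size());
    const int natom = static_cast<int>(pos.size());
#pragma omp parallel for schedule(static) if (natom >= 256)
    for (int i = 0; i < natom; ++i)
    {
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}

// Kinetic energy 0.5 * sum_i m_i |v_i|^2 accumulated through a scalar reduction.
double kinetic_energy(const std::vector<double>& mass, const std::vector<Vec3>& vel)
{
    assert(mass.size() == vel.size());
    const int natom = static_cast<int>(vel.size());
    double ke = 0.0;
#pragma omp parallel for schedule(static) reduction(+ : ke) if (natom >= 256)
    for (int i = 0; i < natom; ++i)
    {
        const Vec3& v = vel[i];
        ke += 0.5 * mass[i] * (v.x * v.x + v.y * v.y + v.z * v.z);
    }
    return ke;
}
```

The reduction clause gives each thread a private copy of ke and combines the copies at the end of the loop, which is why the summation order differs from the serial loop.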

2. NEP Interface (source/source_esolver/esolver_nep.cpp/.h)

  • Added atom_type_index / atom_local_index index caches for flat iat-based parallel loops.
  • Parallelized: coordinate buffer fill, per-atom energy reduction, force copy-back with unit conversion, and 9-component per-atom virial reduction.
  • NEP virial: reorganized from 9 separate full-array scans into a single per-atom scan with 9 scalar reductions — algorithmic + parallel gains combined (14.24× speedup at 8 threads).
  • nep.compute() external library call remains serial.
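The index caches and the single-pass virial reduction described above might look roughly like the following sketch. The cache member names follow the PR text, but the IndexCache struct, the function names, and the component-major layout of the per-atom virial (component c of atom i stored at per_atom[c * natom + i]) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For each flat atom index iat: its type index and its index within that type.
struct IndexCache
{
    std::vector<int> atom_type_index;
    std::vector<int> atom_local_index;
};

// Built once (serially) so that later loops can run over a flat iat range.
IndexCache build_index_cache(const std::vector<int>& atoms_per_type)
{
    IndexCache cache;
    for (std::size_t it = 0; it < atoms_per_type.size(); ++it)
    {
        for (int ia = 0; ia < atoms_per_type[it]; ++ia)
        {
            cache.atom_type_index.push_back(static_cast<int>(it));
            cache.atom_local_index.push_back(ia);
        }
    }
    return cache;
}

// Sums all 9 virial components in one pass over atoms using 9 scalar reductions,
// instead of scanning the full array once per component.
std::vector<double> sum_virial(const std::vector<double>& per_atom, int natom)
{
    assert(natom >= 0);
    const std::size_t n = static_cast<std::size_t>(natom);
    assert(per_atom.size() == 9 * n);
    double v0 = 0.0, v1 = 0.0, v2 = 0.0, v3 = 0.0, v4 = 0.0;
    double v5 = 0.0, v6 = 0.0, v7 = 0.0, v8 = 0.0;
#pragma omp parallel for schedule(static) reduction(+ : v0, v1, v2, v3, v4, v5, v6, v7, v8) if (natom >= 256)
    for (int i = 0; i < natom; ++i)
    {
        const std::size_t k = static_cast<std::size_t>(i);
        v0 += per_atom[0 * n + k];
        v1 += per_atom[1 * n + k];
        v2 += per_atom[2 * n + k];
        v3 += per_atom[3 * n + k];
        v4 += per_atom[4 * n + k];
        v5 += per_atom[5 * n + k];
        v6 += per_atom[6 * n + k];
        v7 += per_atom[7 * n + k];
        v8 += per_atom[8 * n + k];
    }
    return {v0, v1, v2, v3, v4, v5, v6, v7, v8};
}
```

Scalar reduction variables are used here because they work with every OpenMP version; array-section reductions would be an alternative on compilers that support OpenMP 4.5 or later.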

3. DPMD Interface (source/source_esolver/esolver_dp.cpp/.h)

  • Added iat → (it, ia) index caches.
  • Parallelized: coordinate buffer fill and model force copy-back with unit conversion.
  • Introduced persistent member buffers (dp_cell, dp_coord, dp_model_force, dp_model_virial) to avoid repeated allocations.
  • dp.compute() external library call and 3×3 virial copy-back remain serial.
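The persistent-buffer idea can be sketched as follows. DpBuffers is a hypothetical stand-in for the esolver class, and only dp_coord is shown; the member name comes from the PR text, while the flat x0,y0,z0,x1,... layout and the single unit factor are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical holder for a coordinate buffer that outlives a single MD step.
class DpBuffers
{
  public:
    // Fills the flat buffer (x0, y0, z0, x1, ...) from per-atom coordinates,
    // multiplying by a unit-conversion factor.
    void fill_coord(const std::vector<double>& x,
                    const std::vector<double>& y,
                    const std::vector<double>& z,
                    double unit)
    {
        assert(x.size() == y.size() && y.size() == z.size());
        const std::size_t natom = x.size();
        // resize() reallocates only if the new size exceeds the capacity, so a
        // fixed atom count reuses the same storage on every step.
        dp_coord.resize(3 * natom);
        const int n = static_cast<int>(natom);
#pragma omp parallel for schedule(static) if (n >= 256)
        for (int i = 0; i < n; ++i)
        {
            dp_coord[3 * i + 0] = x[i] * unit;
            dp_coord[3 * i + 1] = y[i] * unit;
            dp_coord[3 * i + 2] = z[i] * unit;
        }
    }

    const std::vector<double>& coord() const { return dp_coord; }

  private:
    std::vector<double> dp_coord; // persistent across calls
};
```

Because the buffer is a member rather than a local variable, repeated calls with the same atom count write into the same allocation.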

4. Thermostat and Barostat (source/source_md/)

| Class | Method | File |
|---|---|---|
| Verlet | thermalize(): velocity rescaling | verlet.cpp |
| MSST | rescale(): shock-direction velocity scaling | msst.cpp |
| MSST | vel_sum(): velocity norm reduction | msst.cpp |
| MSST | propagate_vel(): per-atom velocity propagation | msst.cpp |
| NoseHoover | particle_thermo(): final velocity scaling | nhchain.cpp |
| NoseHoover | vel_baro(): barostat velocity update | nhchain.cpp |

Thermostat chain recurrence integration and cell dilation remain serial.

5. FIRE Algorithm (source/source_md/fire.cpp)

FIRE::check_fire() parallelized in three phases:

  1. Three-scalar reduction for P, sumforce, normvel
  2. Parallel velocity-force mixing
  3. Parallel velocity zeroing (in P <= 0 branch)

Scalar state updates (alpha, negative_count, dt) remain serial.
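The three phases can be sketched as below, assuming the standard FIRE mixing rule v = (1 - alpha) v + alpha |v| f / |f|. The function name fire_check, the Vec3 type, and the zero-force guard are additions of this sketch and are not taken from fire.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical 3-component vector standing in for the project's own vector type.
struct Vec3
{
    double x, y, z;
};

// Returns P = sum_i v_i . f_i so the caller can update alpha, negative_count
// and dt serially afterwards.
double fire_check(std::vector<Vec3>& vel, const std::vector<Vec3>& force, double alpha)
{
    assert(vel.size() == force.size());
    const int natom = static_cast<int>(vel.size());
    double P = 0.0, sumforce = 0.0, normvel = 0.0;

    // Phase 1: three scalar reductions in a single pass over atoms.
#pragma omp parallel for schedule(static) reduction(+ : P, sumforce, normvel) if (natom >= 256)
    for (int i = 0; i < natom; ++i)
    {
        const Vec3& v = vel[i];
        const Vec3& f = force[i];
        P += v.x * f.x + v.y * f.y + v.z * f.z;
        sumforce += f.x * f.x + f.y * f.y + f.z * f.z;
        normvel += v.x * v.x + v.y * v.y + v.z * v.z;
    }
    sumforce = std::sqrt(sumforce);
    normvel = std::sqrt(normvel);

    // Phase 2: mix each velocity with the force direction.
    // The zero-force guard is an addition of this sketch.
    const double scale = (sumforce > 0.0) ? alpha * normvel / sumforce : 0.0;
#pragma omp parallel for schedule(static) if (natom >= 256)
    for (int i = 0; i < natom; ++i)
    {
        vel[i].x = (1.0 - alpha) * vel[i].x + scale * force[i].x;
        vel[i].y = (1.0 - alpha) * vel[i].y + scale * force[i].y;
        vel[i].z = (1.0 - alpha) * vel[i].z + scale * force[i].z;
    }

    // Phase 3: the system is moving uphill (P <= 0), so stop all atoms.
    if (P <= 0.0)
    {
#pragma omp parallel for schedule(static) if (natom >= 256)
        for (int i = 0; i < natom; ++i)
        {
            vel[i] = Vec3{0.0, 0.0, 0.0};
        }
    }
    return P;
}
```

Phases 2 and 3 write only vel[i] inside iteration i, so they need no synchronization. Only phase 1 combines values across atoms.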

6. LJ Interface (source/source_esolver/esolver_lj.cpp/.h)

  • Added global atom index cache.
  • Restructured nested type-iteration loops into flat iat-based loop.
  • schedule(dynamic, 32) to handle neighbor-count imbalance.
  • Thread-local potential and virial arrays with atomic (energy) and critical (virial) reduction at thread exit — no per-neighbor locks.
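A minimal sketch of the thread-local accumulation pattern, assuming a full neighbor list (every pair appears in both atoms' lists) and a plain 12-6 Lennard-Jones potential without cutoff shifting. The types, the function name, and the virial sign and normalization are illustrative assumptions, not the esolver's actual conventions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3
{
    double x, y, z;
};

struct LjResult
{
    double energy;
    std::array<double, 9> virial;
};

LjResult lj_energy_virial(const std::vector<Vec3>& pos,
                          const std::vector<std::vector<int>>& neigh,
                          double eps,
                          double sigma)
{
    assert(pos.size() == neigh.size());
    const int natom = static_cast<int>(pos.size());
    double energy = 0.0;
    std::array<double, 9> virial{};
#pragma omp parallel if (natom >= 256)
    {
        // Thread-local accumulators: no locking inside the neighbor loop.
        double e_local = 0.0;
        std::array<double, 9> v_local{};
        // Dynamic chunks of 32 atoms balance uneven neighbor counts.
#pragma omp for schedule(dynamic, 32) nowait
        for (int i = 0; i < natom; ++i)
        {
            for (int j : neigh[i])
            {
                const double d[3] = {pos[i].x - pos[j].x, pos[i].y - pos[j].y, pos[i].z - pos[j].z};
                const double r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2];
                const double sr2 = sigma * sigma / r2;
                const double sr6 = sr2 * sr2 * sr2;
                // Factor 0.5: a full neighbor list visits every pair twice.
                e_local += 0.5 * 4.0 * eps * (sr6 * sr6 - sr6);
                // Force magnitude divided by r: -(1/r) dU/dr.
                const double f_over_r = 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / r2;
                for (int a = 0; a < 3; ++a)
                {
                    for (int b = 0; b < 3; ++b)
                    {
                        v_local[3 * a + b] += 0.5 * d[a] * d[b] * f_over_r;
                    }
                }
            }
        }
        // One synchronized update per thread, at thread exit.
#pragma omp atomic
        energy += e_local;
#pragma omp critical
        {
            for (std::size_t k = 0; k < 9; ++k)
            {
                virial[k] += v_local[k];
            }
        }
    }
    return {energy, virial};
}
```

The cost of synchronization scales with the number of threads rather than the number of neighbor pairs, which is the point of accumulating locally first.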

7. Code Quality Refactors

  • Extracted MD statistics helpers: calc_kinetic_state() / calc_stress_state() (md_func.h, md_statistics.h).
  • MD runner factory function: replaced new/delete with std::unique_ptr (run_md.cpp).
  • Shared test fixture base classes to reduce duplication across 6 MD test files.
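The factory refactor can be illustrated with a small sketch. The class bodies, the make_md_runner name, and the type strings are hypothetical; the actual runners take more constructor arguments.

```cpp
#include <memory>
#include <string>

// Hypothetical minimal hierarchy standing in for the MD runner classes.
class MD_base
{
  public:
    virtual ~MD_base() = default;
    virtual std::string name() const { return "base"; }
};

class Verlet : public MD_base
{
  public:
    std::string name() const override { return "verlet"; }
};

class FIRE : public MD_base
{
  public:
    std::string name() const override { return "fire"; }
};

// Ownership is returned as std::unique_ptr, so callers need no matching delete
// and the runner is released on every exit path, including exceptions.
std::unique_ptr<MD_base> make_md_runner(const std::string& md_type)
{
    if (md_type == "fire")
    {
        return std::unique_ptr<MD_base>(new FIRE());
    }
    if (md_type == "verlet")
    {
        return std::unique_ptr<MD_base>(new Verlet());
    }
    return nullptr; // unknown type: the caller decides how to report it
}
```

The virtual destructor in the base class is what makes deletion through std::unique_ptr<MD_base> well defined for derived runners.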

Performance Summary (Microbenchmark, 8 threads, 2M atoms, Xeon Platinum 8163)

| Category | Kernel | Speedup | Efficiency |
|---|---|---|---|
| MD Base | update_pos | 7.36× | 92.0% |
| MD Base | update_vel | 7.19× | 89.9% |
| MD Base | kinetic_energy | 6.98× | 87.2% |
| MD Base | temp_vector | 7.38× | 92.2% |
| NEP | coord_fill | 7.15× | 89.4% |
| NEP | energy_sum | 8.03× | 100.4% |
| NEP | force_fill | 6.99× | 87.4% |
| NEP | virial_sum | 14.24× | 177.9%* |
| DPMD | coord_fill | 5.50× | 68.8% |
| DPMD | force_copy | 7.19× | 89.9% |
| Verlet | thermalize | 7.80× | 97.5% |
| MSST | rescale | 7.28× | 91.1% |
| MSST | propagate_vel | 7.18× | 89.7% |
| NHC | particle_thermo | 7.21× | 90.2% |
| FIRE | check_fire (mix) | 7.64× | 95.6% |
| LJ | runner core loop | 6.96× | 87.0% |

*NEP virial 14.24× includes loop reorganization benefits beyond pure 8-thread scaling.
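The Efficiency column is not defined in the text. The listed values are consistent with the usual parallel-efficiency definition, speedup S divided by thread count p, which is assumed in the worked example below for update_pos:

```latex
E = \frac{S}{p}, \qquad
E_{\text{update\_pos}} = \frac{7.36}{8} = 0.920 = 92.0\%
```

Under this definition any value above 100% means the kernel gained more than thread scaling alone can explain, as the footnote notes for the NEP virial.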

Known Limitations & Future Work

  • End-to-end tests: NEP and DPMD optimizations lack end-to-end tests with real external model libraries (__NEP, deepmd).
  • LJ parallel path: existing LJ unit tests use 4 atoms (< 256 threshold), covering only the serial path.
  • MPI + OpenMP hybrid: microbenchmarks are single-process; oversubscription risks under mixed MPI/OpenMP have not been characterized.
  • Thread threshold: nat >= 256 is an empirical uniform threshold; per-kernel tuning (64/128/256/512) is recommended.
  • LJ scheduling: schedule(dynamic, 32) vs static and optimal chunk size have not been systematically benchmarked across different neighbor distributions.
  • Microbenchmark results ≠ end-to-end wall-time: excluded overheads include MPI communication, neighbor-list construction, file I/O, and external model computation.

Any changes of core modules? (ignore if not applicable)

The MD ESolver interface layer (esolver_nep.cpp, esolver_dp.cpp, esolver_lj.cpp) is modified to add index caches and parallel worksharing constructs. No changes to the ESolver base class virtual function signatures. All external library calls (nep.compute(), dp.compute()) remain serial and their calling convention is unchanged.

@mohanchen added the Refactor (Refactor ABACUS codes), MD & LAM (MD and Large Atomic Models), and project_learning labels and removed the Refactor (Refactor ABACUS codes) label on Jun 7, 2026
Comment thread opt_logs/dpmd_interface_20260603.md Outdated

Collaborator


this file is not needed

@mohanchen

Collaborator

You may first remove unnecessary files, then add tests to show the effects of code refactoring.

@lijianing-sudo force-pushed the refactor/merge-openmp branch from 56538dc to b0fccfa on June 26, 2026 11:43
@lijianing-sudo

Author

Done. Removed unnecessary files (Planners/, Results/, Test/, opt_logs/). The PR now contains only 26 source files across source/source_md/ and source/source_esolver/, plus unit tests in source/source_md/test/. Please let me know if additional performance tests are needed.


@lijianing-sudo force-pushed the refactor/merge-openmp branch from b0fccfa to f4ebb41 on June 26, 2026 12:26
Add #pragma omp parallel for to major per-atom loops in MD module,
enabling multi-threaded execution for NEP/DPMD potentials and
thermostat/integrator operations.

Scope (23 files):
- source/source_md/: md_base, md_func, fire, msst, nhchain, verlet,
  run_md, md_statistics.h
- source/source_esolver/: esolver_nep, esolver_dp
- source/source_md/test/: 7 unit tests + md_test_fixture.h

Strategy: schedule(static) with if(nat>=256), reduction clauses,
atomic/critical for shared accumulators.
LJ esolver excluded (upstream refactored to UnitCellLite API).

Rebased onto deepmodeling/develop.

Co-Authored-By: Claude <noreply@anthropic.com>
@lijianing-sudo force-pushed the refactor/merge-openmp branch from 4483678 to 72fa195 on June 26, 2026 12:52
@lijianing-sudo

Author

Updated the PR:

  • Removed unnecessary non-code files (Planners/, Results/, Test/, opt_logs/)
  • Excluded LJ esolver (upstream refactored to UnitCellLite API, incompatible)
  • Rebased onto latest deepmodeling/develop
  • Fixed test CMakeLists.txt for the new module_neighlist dependencies

The PR now focuses on 23 source files: core MD loops (md_base, md_func, fire, msst, nhchain, verlet, run_md) + NEP/DPMD esolver interfaces + unit tests. All #pragma omp directives follow the same strategy: schedule(static) with if(nat>=256), reduction clauses for energy/virial, atomic/critical for shared accumulators.

Please let me know if any further changes are needed.

@lijianing-sudo

Author

All 16 CI checks are now passing.

