TL;DR
LAMMPS upstream development tip (commit 7f680de, dated 30 March 2026) cross-compiles to riscv64 with riscv64-linux-gnu-gcc 15.2.0 targeting rv64gcv_zba_zbb_zfh with zero upstream patches required.
Resulting lmp binary contains 63,913 RVV opcodes auto-vectorized by GCC, concentrated in long-range KSpace solvers, input parsers, and Fix/Compute constructors.
PairLJCut::compute, Neighbor::build, and Verlet::run carry zero RVV opcodes because GCC 15.2 cannot auto-vectorize the indexed neighbor-list access pattern without explicit gather intrinsics. This is reported honestly rather than buried in an aggregate count.
Runtime evidence via two examples: melt (4000 LJ atoms, 250 steps) completes end-to-end in 7.5 s under qemu-riscv64; peptide (2004 atoms, 300 steps with PPPM long-range) completes in 85.7 s with 25.01 % of loop time in vectorized KSpace code — runtime confirmation that the vectorized paths are not dead code in real workloads.
Ships as a 22 MB plug-and-play .deb: /usr/bin/lmp wrapper auto-discovers bundled potentials, lammps-rvv-demo runs the simulation + generates trajectory MP4/GIF in one command, lammps-rvv-verify runs a 5-gate forensic self-test. Lintian-clean, qemu extract-and-run verified.
4000-atom Lennard-Jones lattice melting over 250 timesteps. Simulation executed on the riscv64 build of lmp under qemu-riscv64 user-mode emulation; trajectory generated by the bundled visualize_dump.py tool that ships in the .deb.
Executive Summary
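| Metric | Value |
| --- | --- |
| Upstream patches needed | 0 |
| Toolchain | riscv64-linux-gnu-gcc 15.2.0 |
| Target ISA | rv64gcv_zba_zbb_zfh |
| Packages enabled | KSPACE, MANYBODY, MOLECULE, RIGID |
| Binary size (stripped) | 9.1 MB |
| Static library | liblammps.a, 62 MB |
| Total RVV opcodes in lmp | 63,913 |
| vsetvli e64,m1 count | 10,028 |
| vsetvli e32,* count | 2,967 |
| vfmacc.* count | 287 |
| vfmul.* count | 1,198 |
| vfadd.* count | 356 |
| vfred[ou]sum.* count | 235 |
| RVV in PairLJCut::compute (per-timestep hot path) | 0 |
| RVV in Neighbor::build (neighbor list) | 0 |
| RVV in Verlet::run (integrator) | 0 |
| RVV in PPPM::compute (long-range, exercised by peptide) | 46 |
| RVV in PPPMDisp::compute | 59 |
| melt wall time (qemu, 4000 atoms × 250 steps) | 7.55 s |
| melt trajectory frames produced | 11 |
| peptide wall time (qemu, 2004 atoms × 300 steps + PPPM) | 85.78 s |
| peptide KSpace section fraction | 25.01 % |
| .deb size compressed | 22 MB |
| .deb SHA256 | f97e82e6475d59f96899cd21dd5767e4bf3a616b4f896658ab59fa4ec3ba2ef6 |

§1 Motivation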
LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is one of the most widely used classical molecular dynamics codes in computational physics, materials science, chemistry, and biology. It is developed primarily at Sandia National Laboratories, is distributed under the GPL-2.0 license (its earlier terms were BSD-like), and has been the subject of more than 2,000 peer-reviewed publications.
This report ports the upstream LAMMPS development tip (commit 7f680de, dated 30 March 2026) to riscv64 with RVV auto-vectorization, packages it as a plug-and-play Debian package, and forensically characterises the vectorization that GCC 15.2 was and was not able to produce.
The exercise complements rather than replaces existing application-level ports in this repository: Doom 3.0.0 (#20, visual deliverable), OpenBLAS 0.3.33 (#25, dense linear algebra), the f64 HAL SIMD shim (#26), TensorFlow Lite v2.17.0 (#27, ML inference), and OpenMM 8.5.0 (#29, MD with explicit RVV intrinsics in CpuNonbondedForce). LAMMPS, like OpenMM, is named on the LFX target spreadsheet under the molecular dynamics category but is the larger and more widely-deployed of the two — making rigorous evidence about its RVV story particularly useful.
§2 Methodology
§2.1 Toolchain
| Tool | Version | Notes |
| --- | --- | --- |
| riscv64-linux-gnu-gcc | 15.2.0 | Auto-vectorizer is GCC 15's; substantially better than the 13.x baseline that produced silent scalar fallback in #25 |
| riscv64-linux-gnu-g++ | 15.2.0 | LAMMPS is primarily C++17 |
| cmake | 4.2.3 | LAMMPS uses CMake under cmake/CMakeLists.txt (LAMMPS convention) |
| ninja | 1.13.2 | Faster than make for the ~500-target LAMMPS build |
| qemu-riscv64 | 10.2.1 | User-mode emulation only; see §4.4 for the explicit limitation |
| riscv64-linux-gnu-strip | binutils 2.45 | Used for .deb size reduction; does not alter executable code |
Toolchain target flags (from riscv64-rvv-toolchain.cmake, reused across all ports in this repository):
The v extension enables RVV 1.0; zba/zbb are the bit-manipulation extensions present on essentially every shipping RVV-capable core; zfh is half-precision FP (not used by LAMMPS but enabled for uniformity across this repository's ports).
§2.2 Source-tree audit (read-only, before any build)
Shallow clone, then scan for architecture-specific code that might require patches:
Findings:
Zero RISC-V references anywhere in src/. LAMMPS core has no per-architecture branches that would need a RISC-V case added.
Architecture-specific code is confined to the optional INTEL/, LEPTON/, MPI4WIN/, and PLUMED/ packages, none of which are enabled in this build.
The build system uses a CMake out-of-source convention: configure from cmake/CMakeLists.txt; build artifacts go to a separate directory.
Conclusion: zero patches required. This is the cleanest port outcome possible: the upstream source already compiles for riscv64 without modification.
§2.3 Build configuration
cmake $SRC/cmake -G Ninja \
  -DCMAKE_TOOLCHAIN_FILE=riscv64-rvv-toolchain.cmake \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=/usr \
  -DBUILD_MPI=OFF -DBUILD_OMP=OFF \
  -DBUILD_TOOLS=OFF -DBUILD_DOC=OFF -DBUILD_LAMMPS_SHELL=OFF \
  -DPKG_KSPACE=ON -DPKG_MANYBODY=ON -DPKG_MOLECULE=ON -DPKG_RIGID=ON \
  -DLAMMPS_EXCEPTIONS=ON
ninja -j6
Package selection rationale: KSPACE (long-range solvers; covers PPPM/Ewald, the primary auto-vectorization target), MOLECULE (bonded topology; needed for peptide), RIGID (SHAKE/RATTLE constraints; needed for water in peptide), MANYBODY (EAM/Tersoff/REBO/AIREBO; covers most other interesting RVV-vectorized hot paths). MPI/OpenMP are disabled to keep the per-rank single-threaded vectorization story clean and the .deb minimal.
-j6 was chosen because the WSL host has 6.7 GB RAM and the LAMMPS link step runs out of memory at -j12.
§2.4 Multi-method verification approach
Per the #25 standard, every claim is verified through at least two independent methods:
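| Claim | Method 1 | Method 2 |
| --- | --- | --- |
| | file lmp | qemu-riscv64 lmp -help |
| | objdump -d \| grep -E '\\bv...' | qemu-riscv64 lmp -in in.melt produces valid output |
| | function-scoped objdump grep (negative: 0 in PairLJCut::compute) | peptide log.lammps section breakdown (25% KSpace) |
| .deb actually works | dpkg-deb -c/-I/-x | qemu-riscv64 on the extracted binary, 11 dump frames |
| Numerical correctness | n/a | dangerous_builds = 0 in peptide log, energy conservation visible across thermo prints |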
§2.5 Pitfalls Encountered
Pitfall 1: bundled in.melt does not produce a trajectory. The upstream examples/melt/in.melt ships with the dump line commented out as a non-default feature. Initial .deb build copied the pristine upstream file; the qemu smoke test then ran the simulation correctly but produced no trajectory, which the wrapper script silently accepted. Root cause: the smoke test only checked exit code, not for the dump file. Fix: (a) ship a modified in.melt with dump 1 all atom 25 dump.melt enabled, marked clearly as modified-from-upstream in the file header; (b) tighten the smoke-test gate to require a non-zero dump frame count. Both applied; see scripts/package-deb.sh v2.
Pitfall 2: silent scalar dump. Earlier static analysis showed 63,913 RVV opcodes in lmp, but a naive reading of that number would imply the LJ melt benchmark is vectorized. It is not. The top 15 RVV-carrying functions are entirely setup/parser/constructor code; the per-timestep MD hot path is scalar. Root cause: aggregate opcode counts are necessary but not sufficient; function-scoped attribution is required. Fix: §3.2 reports both numbers and §4.1 explicitly distinguishes "vectorization coverage in the binary" from "acceleration of any particular workload."
Pitfall 3: ffmpeg 8.x palettegen syntax under set -o pipefail. The visualization script's GIF-palette generation step produces a stderr warning under ffmpeg 8 about image-sequence patterns, which combined with set -o pipefail in the calling shell would fail the build. The warning is harmless (palette correctly generated as single image); silenced by piping stderr after the failure was confirmed cosmetic.
Pitfall 4: dpkg-deb | head SIGPIPE. First version of package-deb.sh used dpkg-deb -c "$DEB" | head -30. Under set -o pipefail the SIGPIPE from head's early close becomes a fatal error after the .deb is already built — confusing because the failure happens during verification rather than during build. Fix: write to a temp file then read with awk 'NR<=30'. No early close, no SIGPIPE.
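The tightened smoke-test gate from Pitfall 1 (a clean exit code plus a non-zero dump frame count) can be sketched in Python. This is an illustration only, not the logic in scripts/package-deb.sh: the function names are hypothetical, and it assumes a text-format dump in which every frame begins with an `ITEM: TIMESTEP` header line.

```python
from pathlib import Path


def count_dump_frames(dump_path):
    """Count frames in a text LAMMPS dump by counting 'ITEM: TIMESTEP' headers."""
    frames = 0
    with open(dump_path, "r", encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("ITEM: TIMESTEP"):
                frames += 1
    return frames


def smoke_test_passes(exit_code, dump_path):
    """Gate: require a clean exit AND at least one trajectory frame on disk."""
    path = Path(dump_path)
    if exit_code != 0 or not path.is_file():
        return False
    return count_dump_frames(path) > 0
```

Checking the frame count rather than only the exit code is what would have caught the commented-out dump line: the simulation exits 0 either way, but only a working dump produces frames.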
§3 Findings
§3.1 Phase 1A — Clean build
$ ninja -j6
[539/539] Linking CXX executable lmp
$ file build-riscv64/lmp
ELF 64-bit LSB pie executable, UCB RISC-V, RVC, double-float ABI,
version 1 (GNU/Linux), dynamically linked,
interpreter /lib/ld-linux-riscv64-lp64d.so.1, not stripped
539 build targets, all clean, no warnings worth reporting. Bundled KISS FFT used (the build does not depend on external FFTW3 because we built with PPPM but without an explicit -DFFT=FFTW3). Static library liblammps.a (62 MB) emitted alongside the executable.
§3.2 Phase 1B — Static RVV opcode forensics
§3.2.1 Aggregate opcode count
Result: 63,913 RVV opcodes. Distribution by mnemonic:
| Mnemonic family | Count | Purpose |
| --- | --- | --- |
| vsetvli e64,m1 | 10,028 | f64 strip-mining setup, LMUL=1 |
| vsetvli e64,m2 | 6 | f64 with LMUL=2 (rare) |
| vsetvli e64,m4 | 0 | (not chosen by auto-vec at any site) |
| vsetvli e32,* | 2,967 | f32/i32 strip-mining setup |
| vle*.v | (subset of total) | unit-stride loads |
| vse*.v | (subset of total) | unit-stride stores |
| vfmacc.* | 287 | fused multiply-accumulate |
| vfmul.* | 1,198 | element-wise multiply |
| vfadd.* | 356 | element-wise add |
| vfred[ou]sum.* | 235 | reduction sums (dot products, etc.) |
The LMUL choice (m1, occasionally m2, never m4) reflects GCC 15's conservative cost model when the loop bounds are statically unknown — the dominant pattern in LAMMPS where loop counts depend on input geometry.
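A per-family histogram like the one above can be derived from `objdump -d` text with a short Python sketch. It is not the command used for this report. It assumes the usual RISC-V objdump line layout (address, one hex encoding word, mnemonic, operands) and treats every mnemonic beginning with `v` as an RVV opcode, which holds for this target ISA but is an assumption of the sketch. The function name is hypothetical.

```python
import re
from collections import Counter

# One objdump -d instruction line: address, raw encoding, mnemonic, operands.
INSN = re.compile(r"^\s*[0-9a-f]+:\s+[0-9a-f]+\s+(\S+)\s*(.*)$")
# SEW/LMUL operands of vsetvli, e.g. "a5,zero,e64,m1,ta,ma".
VTYPE = re.compile(r"\b(e\d+),\s*(mf?\d+)\b")


def rvv_family_histogram(disasm_text):
    """Count vector instructions per mnemonic family in objdump -d output."""
    counts = Counter()
    for line in disasm_text.splitlines():
        m = INSN.match(line)
        if not m:
            continue  # function headers, blank lines, section banners
        mnemonic, operands = m.groups()
        if not mnemonic.startswith("v"):
            continue  # assumption: only vector mnemonics start with 'v'
        if mnemonic in ("vsetvli", "vsetivli"):
            vt = VTYPE.search(operands)
            key = f"{mnemonic} {vt.group(1)},{vt.group(2)}" if vt else mnemonic
        else:
            key = mnemonic.split(".")[0]  # vfmacc.vv -> vfmacc
        counts[key] += 1
    return counts
```

Summing the histogram values gives an aggregate count; keying vsetvli on its SEW/LMUL operands is what separates the e64,m1 sites from the e64,m2 and e32 sites.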
§3.2.2 Function-scoped attribution
awk '/^[0-9a-f]+ <.+>:/ {fn=$0; sub(/^[0-9a-f]+ </,"",fn); sub(/>:$/,"",fn); next} /\<(vfmacc|vfmul|vfadd|vfsub|vle[0-9]+|vse[0-9]+|vsetvl|vfred)/ {c[fn]++} END {for(f in c) printf "%6d %s\n", c[f], f}' logs/lmp.disasm \
| sort -rn | c++filt | head -15
Top 15 RVV-carrying functions:
| RVV opcodes | Function |
| --- | --- |
| 1172 | std::__cxx11::basic_string<...> allocator |
| 670 | Variable::evaluate (input parser) |
| 520 | PPPMDisp::allocate (one-time KSpace allocator) |
| 479 | FixRigid::FixRigid (constructor) |
| 424 | PairAIREBO::spline_init (one-time spline tables) |
| 399 | ReadData::command (input file parser) |
| 273 | PPPMDisp::allocate_peratom |
| 269 | Variable::math_function (parser) |
| 257 | FixNH::FixNH (constructor) |
| 255 | lammps_extract_global (C API accessor) |
| 251 | FixAveGrid::FixAveGrid (constructor) |
| 246 | Atom::extract (accessor) |
| 244 | FixAveCorrelate::FixAveCorrelate (constructor) |
| 243 | lammps_extract_global_datatype (C API accessor) |
| 243 | Atom::extract_datatype (accessor) |
Critical observation: not one of these is in the per-timestep MD compute path. They are constructors (FixNH, FixRigid, FixAveGrid, FixAveCorrelate, …), one-time setup (PPPMDisp::allocate, PairAIREBO::spline_init), parsers (Variable::evaluate, ReadData::command, Variable::math_function), and accessors (Atom::extract, lammps_extract_global).
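For readers who prefer Python to awk, the function-scoped attribution in §3.2.2 can be sketched as follows. It mirrors the awk program's two regular expressions, tracking the most recent function header and counting matching lines under it, but it is a re-expression for illustration rather than the tool used to produce the table. The function names are hypothetical, and symbols are left mangled here, so demangling (the c++filt step) would still be applied separately.

```python
import re
from collections import Counter

# Function header line in objdump -d output, e.g. "0000000000010a30 <name>:".
HEADER = re.compile(r"^[0-9a-f]+ <(.+)>:$")
# Same mnemonic prefixes as the awk pattern in this section.
RVV = re.compile(r"\b(vfmacc|vfmul|vfadd|vfsub|vle[0-9]+|vse[0-9]+|vsetvl|vfred)")


def rvv_per_function(disasm_text):
    """Attribute each matching disassembly line to the enclosing function."""
    counts = Counter()
    current = None
    for line in disasm_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
        elif current is not None and RVV.search(line):
            counts[current] += 1
    return counts


def top_functions(counts, n=15):
    """Return the n functions carrying the most matching opcodes."""
    return counts.most_common(n)
```

A function that never appears in the result carries zero matching opcodes, which is the form the negative finding for the hot path takes.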
§3.2.3 The MD compute hot path — explicit search
Result for the MD-relevant ::compute methods:
| RVV opcodes | Function |
| --- | --- |
| | PairLJCharmmCoulLong::allocate (one-time array allocator, not compute) |
| 60 | PairLJCut::allocate (one-time array allocator, not compute) |
| 66 | ESP::compute(int, int) |
| 59 | PPPMDisp::compute(int, int) |
| 50 | PairHybridScaled::compute(int, int) |
| 48 | PairTable::compute(int, int) |
| 48 | PairMEAMSpline::compute(int, int) |
| 46 | PPPM::compute(int, int) |
| 46 | PPPMStagger::compute(int, int) |
| 44 | EwaldDipoleSpin::compute(int, int) |
| 0 | PairLJCut::compute(int, int) ⬅ critical |
| 0 | Neighbor::build(int) ⬅ critical |
| 0 | Verlet::run(int) ⬅ critical |
PairLJCut::compute, Neighbor::build, and Verlet::run — the three functions that consume essentially all wall time in a typical LJ MD run — are completely scalar. This is reported honestly rather than tuned away because the methodology matters more than the headline number.
§3.3 Phase 1B — Runtime correctness (melt example)
Cross-compiled binary executed under qemu-riscv64 user-mode emulation on the WSL host. Input: bundled examples/melt/in.melt (modified to enable trajectory dump every 25 timesteps).
Result: exit 0, 7.55 s wall, 11 trajectory frames produced.
Energy conservation is visible across all 6 thermo prints; temperature stabilises around the expected ~1.65 reduced units for a 3.0-start NVE melt. Dangerous build count: 0. Output file dump.melt is the file rendered into the GIF embedded above.
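§3.4 Phase 1B — Runtime evidence of vectorized paths (peptide example)
The melt example uses pair_style lj/cut without long-range electrostatics, so its execution does not exercise any of the vectorized code paths identified in §3.2.3 (the per-timestep work all flows through scalar PairLJCut::compute + Neighbor::build). A second benchmark is needed to demonstrate that the vectorized paths are not dead code.
examples/peptide/in.peptide simulates a 2004-atom solvated protein with pair_style lj/charmm/coul/long + kspace_style pppm. The long-range PPPM solver exercises the PPPM::compute code path (46 RVV opcodes) identified in §3.2.3; PPPMDisp::compute (59) belongs to the separate pppm/disp solver, which this input does not select.
Result: exit 0, 1:25 wall.
Section timing breakdown (300 timesteps, total 84.33 s loop time): the Kspace section accounts for 21.09 s, or 25.01 % of loop time.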
Final values stable; Dangerous builds = 0; Neighbor list builds = 26. The simulation is numerically correct, the vectorized paths execute as designed, and the section breakdown lets us state precisely what fraction of loop time hits which kind of code.
Two-method verification that the vectorized code is exercised at runtime:
Static (§3.2.3): PPPM::compute carries 46 RVV opcodes in the disassembly.
Runtime (this section): PPPM was constructed (PPPM initialization line in log) and the Kspace section consumed 21.09 s of measured wall time.
The vectorized code is not dead.
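The 25.01 % figure follows directly from two numbers in the peptide log: 21.09 s in the Kspace section out of 84.33 s total loop time. A minimal check, with a hypothetical helper name:

```python
def section_fraction(section_seconds, loop_seconds):
    """Percentage of loop time spent in one timing section."""
    return 100.0 * section_seconds / loop_seconds


kspace_pct = section_fraction(21.09, 84.33)
print(f"Kspace share of loop time: {kspace_pct:.2f} %")
```

This is a share of emulated loop time under qemu-riscv64, so it shows which code paths execute, not how fast they would run on hardware.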
§3.5 Phase 1C — Trajectory visualization pipeline
scripts/visualize_dump.py reads any LAMMPS custom-format trajectory dump and renders per-frame 3D scatter plots, then stitches them into both an MP4 (libx264, for download) and an animated GIF (palette-optimized, for inline GitHub embed).
Key design choices:
Coordinate handling. LAMMPS dumps can use any of four coordinate variants depending on the dump style and dump_modify settings: absolute wrapped (x y z), absolute unwrapped (xu yu zu), scaled wrapped (xs ys zs), or scaled unwrapped (xsu ysu zsu). The melt example uses the scaled variant by default (dump atom style). The visualizer auto-detects which variant is present in the dump header and applies the box-bounds transformation x = xlo + xs * (xhi - xlo) for the scaled variants. This means the script works on any LAMMPS dump file the user might throw at it, not just the bundled examples.
Slow azimuth sweep across the trajectory (60° over the full run) gives the GIF a 3D-rotational feel that helps the viewer parse the 3D structure of the lattice as it melts.
GIF size budget. GitHub renders GIFs inline in issues up to ~5 MB. The 4000-atom × 11-frame melt produces a 1.2 MB GIF after palettegen + paletteuse optimization — comfortably under the cap and giving us a real inline visual for the issue.
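The coordinate handling described above can be sketched in Python. This is an illustration under stated assumptions, not the code in visualize_dump.py: the function names are hypothetical, the header is assumed to be the dump's `ITEM: ATOMS ...` line, and the transformation covers orthogonal boxes only.

```python
# Column-name triples for the four variants, with a flag for "scaled".
VARIANTS = [
    (("x", "y", "z"), False),        # absolute wrapped
    (("xu", "yu", "zu"), False),     # absolute unwrapped
    (("xs", "ys", "zs"), True),      # scaled wrapped
    (("xsu", "ysu", "zsu"), True),   # scaled unwrapped
]


def detect_variant(atoms_header):
    """Return (column names, is_scaled) for an 'ITEM: ATOMS ...' header line."""
    columns = atoms_header.split()[2:]  # drop "ITEM:" and "ATOMS"
    for names, scaled in VARIANTS:
        if all(name in columns for name in names):
            return names, scaled
    raise ValueError(f"no coordinate columns found in: {atoms_header!r}")


def to_absolute(coord, lo, hi, scaled):
    """Apply x = xlo + xs * (xhi - xlo) for scaled variants; pass through otherwise."""
    return lo + coord * (hi - lo) if scaled else coord
```

For example, a header of `ITEM: ATOMS id type xs ys zs` is detected as scaled, and a scaled coordinate of 0.5 in a box spanning 0 to 16 maps to 8.0.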
Reproduction:
The differentiator from a bare-binary deb is that this package ships an end-to-end usable environment: simulation runtime + force fields + working example + visualization + self-test, all installable in one dpkg -i invocation.
This is what makes the package plug-and-play. The user does not need to know that LAMMPS uses $LAMMPS_POTENTIALS to find force-field files. The wrapper sets it to the bundled directory if and only if the user has not already set it themselves. Real binary lives at /usr/libexec/lammps/lmp per Debian's convention for "internal" executables not intended for direct user invocation.
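The "set it if and only if the user has not" rule can be expressed in a few lines. The sketch below is in Python for illustration and makes no claim about how the shipped wrapper is written. The bundled potentials path is a hypothetical placeholder, while the binary path is the one stated above.

```python
import os

# Hypothetical location of the bundled potentials; the real path is not given here.
BUNDLED_POTENTIALS = "/usr/share/lammps/potentials"
REAL_BINARY = "/usr/libexec/lammps/lmp"


def wrapper_env(environ, bundled=BUNDLED_POTENTIALS):
    """Return a copy of environ with LAMMPS_POTENTIALS defaulted, never overridden."""
    env = dict(environ)
    env.setdefault("LAMMPS_POTENTIALS", bundled)  # no-op if the user set it
    return env


def launch(argv):
    """Replace the current process with the real binary under the prepared environment."""
    os.execve(REAL_BINARY, [REAL_BINARY, *argv], wrapper_env(os.environ))
```

One design detail worth noting: `setdefault` treats an empty string as "already set", so a user who exports an empty LAMMPS_POTENTIALS keeps that value in this sketch.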
§3.6.2 The demo command
$ lammps-rvv-demo
[1/3] Running LAMMPS melt example (4000 atoms, 250 timesteps)...
✓ Simulation complete, 11 trajectory frames
[2/3] Generating trajectory MP4 + GIF...
[3/3] Done. Files at:
-rw-r--r-- 1 user user 1.2M ~/lammps-demo-output/melt.gif
-rw-r--r-- 1 user user 991K ~/lammps-demo-output/melt.mp4
Open ~/lammps-demo-output/melt.gif in any image viewer to see the trajectory.
End-to-end demonstration that the package works on the user's machine: takes one command, ~30 s on hardware (longer under emulation), produces a viewable result.
lammps-rvv-verify packages the forensic verification gates that any user (or LFX evaluator) might want to run into one command on the installed package. It is the audit surface of the package: anyone who installs it can independently verify the RVV-vectorization claim on their own machine.
§3.6.4 Verification of the .deb itself
Three independent methods on the host before shipping:
Clean. Suppressed tags are all known-OK: the wrapper bash scripts don't have man pages (acceptable for a research artifact), and the package is not in the Debian archive so the ITP-bug warning does not apply.
This is the bright-line "does the .deb actually work" gate. PASS.
§4 Discussion
§4.1 Honest framing of opcode count vs. acceleration
The TL;DR's headline number — 63,913 RVV opcodes — is true, verified by multiple independent methods, and reproducible to the exact integer by anyone with the toolchain. It is not the same as "LAMMPS is accelerated by RVV on RISC-V hardware."
The accurate statement, which §3.2.3 and §3.4 jointly support:
The lmp binary contains 63,913 RVV opcodes auto-vectorized by GCC 15.2.0 across setup, parsing, allocator, and KSpace long-range solver code paths. The per-timestep MD compute hot path consisting of PairLJCut::compute, Neighbor::build, and Verlet::run is scalar because GCC 15.2 cannot auto-vectorize the indexed neighbor-list access pattern without explicit gather intrinsics. Workloads that exercise the long-range PPPM path (peptide, rhodopsin, water-box simulations with electrostatics) do execute vectorized code at runtime — confirmed for peptide at 25.01 % of measured loop time. Workloads that do not exercise long-range solvers (pure LJ melts, granular dynamics) effectively run scalar despite the high binary-level opcode count.
This distinction matters because conflating "opcodes present in binary" with "acceleration of the user's workload" is a common reporting failure mode for RVV ports. The correct framing requires both function-scoped attribution (negative result: 0 in PairLJCut::compute) and runtime evidence (positive result: 25.01 % KSpace in peptide).
§4.2 Why PairLJCut::compute is scalar
The inner loop of PairLJCut::compute (from src/pair_lj_cut.cpp) has the classical neighbor-list structure:
GCC 15's auto-vectorizer rejects this loop because:
The j index is loaded from jlist[jj] — indirect addressing, requires a vector gather instruction (vluxei64.v in RVV) to vectorize the inner loop across jj.
The data access x[j][..] is then a gather of three doubles per atom from non-contiguous addresses.
The conditional if (rsq < cutsq) introduces a vector-mask requirement.
The accumulation into f[i][..] is a scatter that aliases with itself if the same j appears in multiple i's neighbor lists (it does — Newton's third law).
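The four obstacles listed above are easier to see against the shape of the loop itself. The sketch below is a plain-Python paraphrase of a generic half-neighbor-list Lennard-Jones kernel in reduced units, written for illustration. It is not the source of src/pair_lj_cut.cpp, and the function name and default coefficients (48 and 24, for epsilon = sigma = 1) are assumptions of the example.

```python
def lj_forces(x, neighbors, cutsq, lj1=48.0, lj2=24.0):
    """Pairwise LJ forces over a half neighbor list (each pair stored once)."""
    f = [[0.0, 0.0, 0.0] for _ in x]
    for i, jlist in enumerate(neighbors):
        xi = x[i]
        for j in jlist:                                # (1) j is read from the list: indirect index
            dx = [xi[k] - x[j][k] for k in range(3)]   # (2) gather of three doubles at x[j]
            rsq = dx[0] * dx[0] + dx[1] * dx[1] + dx[2] * dx[2]
            if rsq < cutsq:                            # (3) cutoff predicate: needs a vector mask
                r2inv = 1.0 / rsq
                r6inv = r2inv * r2inv * r2inv
                fpair = r6inv * (lj1 * r6inv - lj2) * r2inv
                for k in range(3):
                    f[i][k] += dx[k] * fpair
                    f[j][k] -= dx[k] * fpair           # (4) scatter: the same j recurs across i's lists
    return f
```

Two atoms one unit apart along x with a cutoff of 2.5 give forces of -24 and +24 on the pair. The equal-and-opposite update on `f[j]` is the write that a vectorized version would have to perform as an indexed scatter.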
GCC 15 does emit RVV gather instructions in some contexts (notably for loops with #pragma omp simd and explicit aligned/restrict annotations), but it does not auto-discover them for the LAMMPS neighbor-list pattern. This is consistent with experience across the wider HPC community: explicit SIMD ports of MD codes invariably hand-write the gather/scatter intrinsics.
§4.3 What vectorizing the hot path would actually require
This is documented as future work rather than attempted here, but for completeness:
A dedicated RISCV package under src/RISCV/ mirroring the existing src/INTEL/ package. The INTEL package provides hand-vectorized SoA-data-layout versions of PairLJCut, Neighbor, and other hot paths using x86 AVX/AVX-512 intrinsics. A RISCV equivalent would use vluxei64.v for the gather, vsuxei64.v for the scatter, and vmflt.vv + masked operations for the cutoff predicate.
Estimated effort: ~2-3 person-months for a working PairLJCut and Neighbor::build. The INTEL package took multiple years of incremental work; matching that scope for RISCV is substantially larger.
A potential intermediate step is GCC profile-driven vectorization with -fopt-info-vec-missed to identify which patterns are almost vectorizable and amenable to source-level hints (__restrict__, __attribute__((vector_size)), or #pragma GCC ivdep). This was not pursued for this port.
Every wall-clock number in this report is qemu-riscv64 user-mode emulation time on a 12-core x86_64 WSL host, not hardware performance. Specifically:
The 7.55 s melt wall and 85.78 s peptide wall measure functional correctness throughput, not what the binary would do on RVV silicon.
QEMU user-mode does decode every RVV instruction into a host-side scalar emulation, so the runtime cost of an emulated vfmacc.vv is roughly the same as the scalar equivalent it replaced — there is no acceleration to measure.
A meaningful hardware speedup comparison would require identical RVV 1.0 hardware (for example a SpacemiT K1 board such as the BananaPi BPI-F3), an identical toolchain, two builds (-march=rv64gc for the baseline and -march=rv64gcv... for the RVV build), and the same workload run on that hardware for both. This is outside the scope of what can be performed without access to such hardware.
Reporting a "speedup" multiplier extrapolated from QEMU instruction counts is not a valid methodology and is not done in this report. If the next port in this series obtains hardware access, the comparison will be added in a follow-up.
This is the same QEMU-vs-hardware disclaimer that has run through every port in this repository, starting with #25 ([Validation] OpenBLAS 0.3.33 RVV on GCC 15: Complementary Findings to #23). It is repeated each time because the temptation to overclaim from QEMU numbers is real and the failure mode is recurring.
§5 Reproduction
All artifacts in this report are reproducible end-to-end from the repository:
Mentor: Kurt Keville (MIT) for the original mandate to apply forensic
standards to RISC-V HPC porting, and for the consistent demand that
QEMU numbers be reported as QEMU numbers.
Upstream:
LAMMPS developers at Sandia and the broader LAMMPS community for
maintaining a code base that, at the development tip, compiles cleanly
for a non-x86 architecture with zero modifications. This is uncommon
and worth noting.
The GCC team for the substantial improvements in the RVV auto-vectorizer
between 13.x and 15.x — the 0 RVV opcodes in PairLJCut::compute is a
remaining limitation, but the 63,913 elsewhere is real work the 13.x
toolchain would not have produced.
TL;DR
7f680de, dated 30 March 2026) cross-compiles toriscv64withriscv64-linux-gnu-gcc 15.2.0targetingrv64gcv_zba_zbb_zfhwith zero upstream patches required.lmpbinary contains 63,913 RVV opcodes auto-vectorized by GCC, concentrated in long-range KSpace solvers, input parsers, and Fix/Compute constructors.PairLJCut::compute,Neighbor::build, andVerlet::runcarry zero RVV opcodes because GCC 15.2 cannot auto-vectorize the indexed neighbor-list access pattern without explicit gather intrinsics. This is reported honestly rather than buried in an aggregate count.melt(4000 LJ atoms, 250 steps) completes end-to-end in 7.5 s underqemu-riscv64;peptide(2004 atoms, 300 steps with PPPM long-range) completes in 85.7 s with 25.01 % of loop time in vectorized KSpace code — runtime confirmation that the vectorized paths are not dead code in real workloads..deb:/usr/bin/lmpwrapper auto-discovers bundled potentials,lammps-rvv-demoruns the simulation + generates trajectory MP4/GIF in one command,lammps-rvv-verifyruns a 5-gate forensic self-test. Lintian-clean, qemu extract-and-run verified.4000-atom Lennard-Jones lattice melting over 250 timesteps. Simulation executed on the riscv64 build of
lmpunderqemu-riscv64user-mode emulation; trajectory generated by the bundledvisualize_dump.pytool that ships in the.deb.Executive Summary
riscv64-linux-gnu-gcc 15.2.0rv64gcv_zba_zbb_zfhliblammps.a, 62 MBlmpvsetvli e64,m1countvsetvli e32,*countvfmacc.*countvfmul.*countvfadd.*countvfred[ou]sum.*countPairLJCut::compute(per-timestep hot path)Neighbor::build(neighbor list)Verlet::run(integrator)PPPM::compute(long-range, exercised by peptide)PPPMDisp::computemeltwall time (qemu, 4000 atoms × 250 steps)melttrajectory frames producedpeptidewall time (qemu, 2004 atoms × 300 steps + PPPM)peptideKSpace section fractionpeptidePair section (scalar)peptideenergy conservation.debsize compressed.debsize installed.debSHA256f97e82e6475d59f96899cd21dd5767e4bf3a616b4f896658ab59fa4ec3ba2ef6§1 Motivation
LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is one of the most widely-used classical molecular dynamics codes in computational physics, materials science, chemistry, and biology. It is developed primarily at Sandia National Laboratories under the BSD-like (now GPL-2.0) license and has been the subject of more than 2,000 peer-reviewed publications.
This report ports the upstream LAMMPS development tip (commit
7f680de, dated 30 March 2026) toriscv64with RVV auto-vectorization, packages it as a plug-and-play Debian package, and forensically characterises the vectorization that GCC 15.2 was and was not able to produce.The exercise complements rather than replaces existing application-level ports in this repository: Doom 3.0.0 (#20, visual deliverable), OpenBLAS 0.3.33 (#25, dense linear algebra), the f64 HAL SIMD shim (#26), TensorFlow Lite v2.17.0 (#27, ML inference), and OpenMM 8.5.0 (#29, MD with explicit RVV intrinsics in CpuNonbondedForce). LAMMPS, like OpenMM, is named on the LFX target spreadsheet under the molecular dynamics category but is the larger and more widely-deployed of the two — making rigorous evidence about its RVV story particularly useful.
§2 Methodology
§2.1 Toolchain
riscv64-linux-gnu-gccriscv64-linux-gnu-g++cmakecmake/CMakeLists.txt(LAMMPS convention)ninjaqemu-riscv64riscv64-linux-gnu-strip.debsize reduction; does not alter executable codeToolchain target flags (from
riscv64-rvv-toolchain.cmake, reused across all ports in this repository):The
vextension enables RVV 1.0;zba/zbbare the bit-manipulation extensions present on essentially every shipping RVV-capable core;zfhis half-precision FP (not used by LAMMPS but enabled for uniformity across this repository's ports).§2.2 Source-tree audit (read-only, before any build)
Shallow clone, then scan for architecture-specific code that might require patches:
Findings:
src/. LAMMPS core has no per-architecture branches that would need a RISC-V case added.INTEL/,LEPTON/,MPI4WIN/, andPLUMED/packages, none of which are enabled in this build.cmake/CMakeLists.txt, build artifacts go to a separate directory.Conclusion: zero patches required. This is the cleanest port outcome possible — the upstream source already compiles for
riscv64without modification.§2.3 Build configuration
cmake $SRC/cmake -G Ninja \ -DCMAKE_TOOLCHAIN_FILE=riscv64-rvv-toolchain.cmake \ -DCMAKE_BUILD_TYPE=Release \ -DCMAKE_INSTALL_PREFIX=/usr \ -DBUILD_MPI=OFF -DBUILD_OMP=OFF \ -DBUILD_TOOLS=OFF -DBUILD_DOC=OFF -DBUILD_LAMMPS_SHELL=OFF \ -DPKG_KSPACE=ON -DPKG_MANYBODY=ON -DPKG_MOLECULE=ON -DPKG_RIGID=ON \ -DLAMMPS_EXCEPTIONS=ON ninja -j6Package selection rationale: KSPACE (long-range solvers — covers PPPM/Ewald, the primary auto-vectorization target), MOLECULE (bonded topology — needed for peptide), RIGID (SHAKE/RATTLE constraints — needed for water in peptide), MANYBODY (EAM/Tersoff/REBO/AIREBO — covers most other interesting RVV-vectorized hot paths). MPI/OpenMP disabled to keep the per-rank single-threaded vectorization story clean and the
.debminimal.-j6chosen because the WSL host has 6.7 GB RAM and the LAMMPS link step OOMs at-j12.§2.4 Multi-method verification approach
Per the #25 standard, every claim is verified through at least two independent methods:
file lmpqemu-riscv64 lmp -helpobjdump -d | grep -E '\\bv...'qemu-riscv64 lmp -in in.meltproduces valid outputobjdumpgrep (negative: 0 inPairLJCut::compute)log.lammpssection breakdown (25% KSpace).debactually worksdpkg-deb -c/-I/-xqemu-riscv64on the extracted binary, 11 dump framesdangerous_builds = 0in peptide log, energy conservation visible across thermo prints§2.5 Pitfalls Encountered
Pitfall 1: bundled
in.meltdoes not produce a trajectory. The upstreamexamples/melt/in.meltships with thedumpline commented out as a non-default feature. Initial.debbuild copied the pristine upstream file; the qemu smoke test then ran the simulation correctly but produced no trajectory, which the wrapper script silently accepted. Root cause: the smoke test only checked exit code, not for the dump file. Fix: (a) ship a modifiedin.meltwithdump 1 all atom 25 dump.meltenabled, marked clearly as modified-from-upstream in the file header; (b) tighten the smoke-test gate to require a non-zero dump frame count. Both applied; seescripts/package-deb.shv2.Pitfall 2: silent scalar dump. Earlier static analysis showed 63,913 RVV opcodes in
lmp, but a naive reading of that number would imply the LJ melt benchmark is vectorized. It is not. The top 15 RVV-carrying functions are entirely setup/parser/constructor code; the per-timestep MD hot path is scalar. Root cause: aggregate opcode counts are necessary but not sufficient; function-scoped attribution is required. Fix: §3.2 reports both numbers and §4.1 explicitly distinguishes "vectorization coverage in the binary" from "acceleration of any particular workload."Pitfall 3: ffmpeg 8.x palettegen syntax under
set -o pipefail. The visualization script's GIF-palette generation step produces a stderr warning under ffmpeg 8 about image-sequence patterns, which combined withset -o pipefailin the calling shell would fail the build. The warning is harmless (palette correctly generated as single image); silenced by piping stderr after the failure was confirmed cosmetic.Pitfall 4:
dpkg-deb | headSIGPIPE. First version ofpackage-deb.shuseddpkg-deb -c "$DEB" | head -30. Underset -o pipefailthe SIGPIPE fromhead's early close becomes a fatal error after the.debis already built — confusing because the failure happens during verification rather than during build. Fix: write to a temp file then read withawk 'NR<=30'. No early close, no SIGPIPE.§3 Findings
§3.1 Phase 1A — Clean build
539 build targets, all clean, no warnings worth reporting. Bundled KISS FFT used (the build does not depend on external FFTW3 because we built with PPPM but without an explicit
-DFFT=FFTW3). Static libraryliblammps.a(62 MB) emitted alongside the executable.§3.2 Phase 1B — Static RVV opcode forensics
§3.2.1 Aggregate opcode count
Result: 63,913 RVV opcodes. Distribution by mnemonic:
vsetvli e64,m1vsetvli e64,m2vsetvli e64,m4vsetvli e32,*vle*.vvse*.vvfmacc.*vfmul.*vfadd.*vfred[ou]sum.*The LMUL choice (m1, occasionally m2, never m4) reflects GCC 15's conservative cost model when the loop bounds are statically unknown — the dominant pattern in LAMMPS where loop counts depend on input geometry.
§3.2.2 Function-scoped attribution
Top 15 RVV-carrying functions:
std::__cxx11::basic_string<...>allocatorVariable::evaluate(input parser)PPPMDisp::allocate(one-time KSpace allocator)FixRigid::FixRigid(constructor)PairAIREBO::spline_init(one-time spline tables)ReadData::command(input file parser)PPPMDisp::allocate_peratomVariable::math_function(parser)FixNH::FixNH(constructor)lammps_extract_global(C API accessor)FixAveGrid::FixAveGrid(constructor)Atom::extract(accessor)FixAveCorrelate::FixAveCorrelate(constructor)lammps_extract_global_datatype(C API accessor)Atom::extract_datatype(accessor)Critical observation: not one of these is in the per-timestep MD compute path. They are constructors (
FixNH,FixRigid,FixAveGrid,FixAveCorrelate, …), one-time setup (PPPMDisp::allocate,PairAIREBO::spline_init), parsers (Variable::evaluate,ReadData::command,Variable::math_function), and accessors (Atom::extract,lammps_extract_global).§3.2.3 The MD compute hot path — explicit search
Result for the MD-relevant `::compute` methods:

- `PairLJCharmmCoulLong::allocate` (one-time array allocator, not compute)
- `PairLJCut::allocate` (one-time array allocator, not compute)
- `ESP::compute(int, int)`
- `PPPMDisp::compute(int, int)`
- `PairHybridScaled::compute(int, int)`
- `PairTable::compute(int, int)`
- `PairMEAMSpline::compute(int, int)`
- `PPPM::compute(int, int)`
- `PPPMStagger::compute(int, int)`
- `EwaldDipoleSpin::compute(int, int)`
- `PairLJCut::compute(int, int)` ⬅ critical
- `Neighbor::build(int)` ⬅ critical
- `Verlet::run(int)` ⬅ critical

`PairLJCut::compute`, `Neighbor::build`, and `Verlet::run`, the three functions that consume essentially all wall time in a typical LJ MD run, are completely scalar. This is reported honestly rather than tuned away because the methodology matters more than the headline number.

§3.3 Phase 1B: Runtime correctness (`melt` example)

Cross-compiled binary executed under `qemu-riscv64` user-mode emulation on the WSL host. Input: bundled `examples/melt/in.melt` (modified to enable trajectory dump every 25 timesteps).

Result: exit 0, 7.55 s wall, 11 trajectory frames produced.
Final thermo (step 250):
Energy conservation is visible across all 6 thermo prints; temperature stabilises around the expected ~1.65 reduced units for a 3.0-start NVE melt. Dangerous build count: 0. Output file `dump.melt` is the file rendered into the GIF embedded above.

§3.4 Phase 1B: Runtime evidence of vectorized paths (`peptide` example)

The melt example uses `pair_style lj/cut` without long-range electrostatics, so its execution does not exercise any of the vectorized code paths identified in §3.2.3 (the per-timestep work all flows through scalar `PairLJCut::compute` + `Neighbor::build`). A second benchmark is needed to demonstrate that the vectorized paths are not dead code.

`examples/peptide/in.peptide` simulates a 2004-atom solvated protein with `pair_style lj/charmm/coul/long` + `kspace_style pppm`. The long-range PPPM solver exercises the `PPPM::compute` code path (46 RVV opcodes) identified in §3.2.3. `PPPMDisp::compute` (59) belongs to the separate `pppm/disp` solver and is not reached by this input.

Result: exit 0, 1:25 wall.
Runtime evidence that PPPM was actually executed (not just linked):
Section timing breakdown (300 timesteps, total 84.33 s loop time):
- Pair: (`PairLJCharmm*::compute` is also scalar per §3.2.3)
- Kspace: 21.09 s, 25.01 % (`PPPM::compute` carries 46 RVV ops)
- Neigh: (`Neighbor::build` is scalar)

Energy conservation (step 300):

Final values stable; `Dangerous builds = 0`; `Neighbor list builds = 26`. The simulation is numerically correct, the vectorized paths execute as designed, and the section breakdown lets us state precisely what fraction of loop time hits which kind of code.

Two-method verification of vectorization-exercises-runtime:

- `PPPM::compute` carries 46 RVV opcodes in the disassembly.
- PPPM ran (`PPPM initialization` line in log) AND the Kspace section consumed 21.09 s of measured wall time.

The vectorized code is not dead.
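The 25.01 % KSpace figure follows directly from the two timings quoted in this section (21.09 s in Kspace out of 84.33 s of loop time). A minimal check of that arithmetic, with `section_share` as an illustrative helper name:

```python
def section_share(section_seconds: float, loop_seconds: float) -> float:
    """Percentage of total loop time spent in one timing section."""
    if loop_seconds <= 0:
        raise ValueError("loop_seconds must be positive")
    return 100.0 * section_seconds / loop_seconds


# Values taken from the peptide run reported in this section.
kspace_pct = section_share(21.09, 84.33)
print(f"Kspace share of loop time: {kspace_pct:.2f} %")
```

The same helper applies to any other row of the timing breakdown, which is what allows each section to be labelled as scalar or vectorized with a concrete share of loop time attached.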
§3.5 Phase 1C — Trajectory visualization pipeline
`scripts/visualize_dump.py` reads any LAMMPS custom-format trajectory dump and renders per-frame 3D scatter plots, then stitches them into both an MP4 (libx264, for download) and an animated GIF (palette-optimized, for inline GitHub embed).

Key design choices:

- Coordinate variants depend on the `dump` style and `dump_modify` settings: absolute wrapped (`x y z`), absolute unwrapped (`xu yu zu`), scaled wrapped (`xs ys zs`), or scaled unwrapped (`xsu ysu zsu`). The melt example uses the scaled variant by default (`dump atom` style). The visualizer auto-detects which variant is present in the dump header and applies the box-bounds transformation `x = xlo + xs * (xhi - xlo)` for the scaled variants. The script therefore works on any LAMMPS dump file that carries one of these coordinate variants, not just the bundled examples.
- `palettegen` + `paletteuse` optimization: comfortably under the cap and giving us a real inline visual for the issue.

Reproduction:
    python3 scripts/visualize_dump.py run-melt/dump.melt --out run-melt/melt --fps 4
    # Produces melt.mp4 (1.0 MB, libx264) + melt.gif (1.2 MB, palette-optimized)

§3.6 Phase 1D: Plug-and-play `.deb`

The differentiator from a bare-binary deb is that this package ships an end-to-end usable environment: simulation runtime + force fields + working example + visualization + self-test, all installable in one `dpkg -i` invocation.

§3.6.1 Layout
- `/usr/bin/lmp`: wrapper that sets `LAMMPS_POTENTIALS`
- `/usr/bin/lammps` → `lmp` (Debian convention)
- `/usr/bin/lammps-rvv-demo`
- `/usr/bin/lammps-rvv-verify`
- `/usr/libexec/lammps/lmp`
- `/usr/lib/riscv64-linux-gnu/liblammps.a`
- `/usr/share/lammps/potentials/`
- `/usr/share/lammps/examples/melt/in.melt`
- `/usr/share/lammps/scripts/visualize_dump.py`
- `/usr/share/doc/lammps-riscv64-rvv/{README,copyright,changelog}`
- `/usr/share/man/man1/lmp.1.gz`

The wrapper at `/usr/bin/lmp`:

This is what makes the package plug-and-play. The user does not need to know that LAMMPS uses `$LAMMPS_POTENTIALS` to find force-field files. The wrapper sets it to the bundled directory if and only if the user has not already set it themselves. The real binary lives at `/usr/libexec/lammps/lmp` per Debian's convention for "internal" executables not intended for direct user invocation.

§3.6.2 The demo command
    $ lammps-rvv-demo
    [1/3] Running LAMMPS melt example (4000 atoms, 250 timesteps)...
          ✓ Simulation complete, 11 trajectory frames
    [2/3] Generating trajectory MP4 + GIF...
    [3/3] Done. Files at:
    -rw-r--r-- 1 user user 1.2M ~/lammps-demo-output/melt.gif
    -rw-r--r-- 1 user user 991K ~/lammps-demo-output/melt.mp4
    Open ~/lammps-demo-output/melt.gif in any image viewer to see the trajectory.

End-to-end demonstration that the package works on the user's machine: takes one command, ~30 s on hardware (longer under emulation), produces a viewable result.
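The visualization step of the demo depends on the scaled-to-absolute conversion from §3.5, `x = xlo + xs * (xhi - xlo)`. A minimal sketch of the detection and conversion follows. It assumes an orthogonal box and an atoms header of the form `ITEM: ATOMS id type xs ys zs`; the function names are illustrative, and this is not the code in the bundled `visualize_dump.py`.

```python
# Coordinate column triples, in the order they are probed.
COORD_VARIANTS = (
    ("x", "y", "z"),        # absolute, wrapped
    ("xu", "yu", "zu"),     # absolute, unwrapped
    ("xs", "ys", "zs"),     # scaled, wrapped
    ("xsu", "ysu", "zsu"),  # scaled, unwrapped
)


def detect_variant(atoms_header: str):
    """Return (column_names, coord_triple, is_scaled) for an 'ITEM: ATOMS ...' line."""
    columns = atoms_header.split()[2:]  # drop the 'ITEM:' and 'ATOMS' tokens
    for triple in COORD_VARIANTS:
        if all(name in columns for name in triple):
            return columns, triple, triple[0].startswith("xs")
    raise ValueError("no recognised coordinate columns in header")


def to_absolute(row: str, columns, triple, is_scaled, bounds):
    """Convert one atom line to absolute (x, y, z).

    bounds is [(xlo, xhi), (ylo, yhi), (zlo, zhi)] from the BOX BOUNDS block.
    """
    fields = row.split()
    values = [float(fields[columns.index(name)]) for name in triple]
    if not is_scaled:
        return tuple(values)
    # Scaled coordinates are fractions of the box length along each axis.
    return tuple(lo + s * (hi - lo) for s, (lo, hi) in zip(values, bounds))
```

Looking up the column positions by name rather than by fixed index is what lets the same code handle dumps whose per-atom columns appear in a different order.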
§3.6.3 The self-test command
`lammps-rvv-verify` packages the forensic verification gates that any user (or LFX evaluator) might want to run as one command on the installed package. This is the audit surface of the package: anyone who installs it can independently verify the RVV-vectorization claim on their own machine.
§3.6.4 Verification of the `.deb` itself

Three independent methods on the host before shipping:

Method 1, `dpkg-deb` metadata + contents:

290 entries total, mode bits correct, paths under FHS-compliant locations.

Method 2, `lintian`:

Clean. Suppressed tags are all known-OK: the wrapper bash scripts don't have man pages (acceptable for a research artifact), and the package is not in the Debian archive so the ITP-bug warning does not apply.

Method 3, `qemu-riscv64` extract-and-run:

    dpkg-deb -x lammps-riscv64-rvv_30Mar26-1_riscv64.deb /tmp/extract
    qemu-riscv64 -L /usr/riscv64-linux-gnu \
      -E LAMMPS_POTENTIALS=/tmp/extract/usr/share/lammps/potentials \
      /tmp/extract/usr/libexec/lammps/lmp \
      -in /tmp/extract/usr/share/lammps/examples/melt/in.melt -log none
    # → exit 0, 11 dump.melt frames produced

This is the bright-line "does the `.deb` actually work" gate. PASS.

§4 Discussion
§4.1 Honest framing of opcode count vs. acceleration
The TL;DR's headline number — 63,913 RVV opcodes — is true, verified by multiple independent methods, and reproducible to the exact integer by anyone with the toolchain. It is not the same as "LAMMPS is accelerated by RVV on RISC-V hardware."
The accurate statement, which §3.2.3 and §3.4 jointly support:
This distinction matters because conflating "opcodes present in binary" with "acceleration of the user's workload" is a common reporting failure mode for RVV ports. The correct framing requires both function-scoped attribution (negative result: 0 in `PairLJCut::compute`) and runtime evidence (positive result: 25.01 % KSpace in peptide).

§4.2 Why `PairLJCut::compute` is scalar

The inner loop of `PairLJCut::compute` (from `src/pair_lj_cut.cpp`) has the classical neighbor-list structure:

GCC 15's auto-vectorizer rejects this loop because:

- `jlist[jj]`: indirect addressing, requires a vector gather instruction (`vluxei64.v` in RVV) to vectorize the inner loop across `jj`.
- `x[j][..]` is then a gather of three doubles per atom from non-contiguous addresses.
- `if (rsq < cutsq)` introduces a vector-mask requirement.
- `f[i][..]` is a scatter that aliases with itself if the same `j` appears in multiple `i`'s neighbor lists (it does, by Newton's third law).

GCC 15 does emit RVV gather instructions in some contexts (notably for loops with `#pragma omp simd` and explicit `aligned`/`restrict` annotations), but it does not auto-discover them for the LAMMPS neighbor-list pattern. This is consistent with experience across the wider HPC community: explicit SIMD ports of MD codes typically hand-write the gather/scatter intrinsics.

§4.3 What vectorizing the hot path would actually require
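Any such port has to handle the access pattern described in §4.2. As a concrete picture of it, here is a simplified pure-Python model of a half-neighbor-list Lennard-Jones force loop in reduced units. It is a sketch, not the code in `src/pair_lj_cut.cpp`: the function name `lj_forces_half_list` and the data layout (Python lists, one global cutoff, ε = σ = 1) are assumptions made for illustration. The comments mark where the gather, the mask, and the scatter arise.

```python
def lj_forces_half_list(x, neighbors, cutsq):
    """Per-atom LJ forces in reduced units (epsilon = sigma = 1).

    x         : list of [x, y, z] positions
    neighbors : neighbors[i] holds each j paired with i (every pair stored once)
    cutsq     : squared cutoff distance
    """
    f = [[0.0, 0.0, 0.0] for _ in x]
    for i, jlist in enumerate(neighbors):
        xi = x[i]
        fi = [0.0, 0.0, 0.0]                # running total for atom i
        for j in jlist:                      # indirect index: j is data, not a stride
            xj = x[j]                        # gather: x[j] sits at an arbitrary address
            d = [xi[k] - xj[k] for k in range(3)]
            rsq = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]
            if rsq < cutsq:                  # per-pair predicate: needs a lane mask
                r2inv = 1.0 / rsq
                r6inv = r2inv ** 3
                fpair = r6inv * (48.0 * r6inv - 24.0) * r2inv
                for k in range(3):
                    fi[k] += d[k] * fpair
                    f[j][k] -= d[k] * fpair  # scatter: Newton's third law
        for k in range(3):
            f[i][k] += fi[k]
    return f
```

In this model the write to `f[j]` is the step a vectorizer cannot reorder safely unless it knows the entries of `jlist` are distinct, which is information the compiler does not have.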
This is documented as future work rather than attempted here, but for completeness:
- `src/RISCV/` mirroring the existing `src/INTEL/` package. The INTEL package provides hand-vectorized SoA-data-layout versions of `PairLJCut`, `Neighbor`, and other hot paths using x86 AVX/AVX-512 intrinsics. A RISCV equivalent would use `vluxei64.v` for the gather, `vsuxei64.v` for the scatter, and `vmflt.vv` + masked operations for the cutoff predicate.
- `PairLJCut` and `Neighbor::build`. The INTEL package took multiple years of incremental work; matching that scope for RISCV is substantially larger.
- `-fopt-info-vec-missed` to identify which patterns are almost vectorizable and amenable to source-level hints (`__restrict__`, `__attribute__((vector_size))`, or `#pragma GCC ivdep`). This was not pursued for this port.
- `CpuNonbondedForceFvec` template-parameterizes the vector type, and the riscv64 build uses the `fvec4` specialization that GCC successfully vectorizes for a different (block-decomposed) data layout. LAMMPS's `Pair*::compute` methods are not similarly template-parameterized.

§4.4 QEMU vs. hardware: the bright line
Every wall-clock number in this report is `qemu-riscv64` user-mode emulation time on a 12-core x86_64 WSL host, not hardware performance. Specifically:

- Under emulation, the cost of a `vfmacc.vv` is roughly the same as the scalar equivalent it replaced, so there is no acceleration to measure.
- Build twice (`-march=rv64gc` for the baseline and `-march=rv64gcv...` for the RVV build), run the same workload on hardware, compare. This is outside the scope of what can be performed without access to such hardware.

This is the same QEMU-vs-hardware disclaimer that has run through every port in this repository starting with "[Validation] OpenBLAS 0.3.33 RVV on GCC 15: Complementary Findings to #23" (#25). It is repeated each time because the temptation to overclaim from QEMU numbers is real and the failure mode is recurring.
§5 Reproduction
All artifacts in this report are reproducible end-to-end from the repository:
Direct end-to-end reproduction of the headline numbers:
Anyone with the listed toolchain versions on a Linux host should obtain identical opcode counts; wall-clock times depend on the host.
§6 Files
- `lammps-port/README.md`
- `lammps-port/lammps-phase1-bootstrap.sh`
- `lammps-port/lammps-phase1b-verify.sh`
- `lammps-port/riscv64-rvv-toolchain.cmake`
- `lammps-port/scripts/visualize_dump.py`
- `lammps-port/scripts/package-deb.sh`
- `lammps-port/dist/lammps-riscv64-rvv_30Mar26-1_riscv64.deb`
- `lammps-port/run-melt/dump.melt`
- `lammps-port/run-melt/melt.gif`
- `lammps-port/run-melt/melt.mp4`
- `lammps-port/run-melt/log.lammps`
- `lammps-port/logs/peptide/peptide.log`
- `lammps-port/logs/top15-rvv-fns.txt`
- `lammps-port/logs/PHASE1B_EVIDENCE.txt`

Repo root: https://github.com/trg-rgb/riscv-hpc-port/tree/main/lammps-port
§7 Related work in this repository
§8 Acknowledgments
Mentor: Kurt Keville (MIT) for the original mandate to apply forensic
standards to RISC-V HPC porting, and for the consistent demand that
QEMU numbers be reported as QEMU numbers.
Upstream:
maintaining a code base that, at the development tip, compiles cleanly
for a non-x86 architecture with zero modifications. This is uncommon
and worth noting.
between 13.x and 15.x: the 0 RVV opcodes in `PairLJCut::compute` remain a
limitation, but the 63,913 elsewhere is real work the 13.x
toolchain would not have produced.