This is my closing statement for the Summer 2026 LFX mentorship on Broadening the RISC-V High Precision Code Base and Reach under Kurt Keville (MIT). It collects what I built, distils the methodology that emerged from building it, and sketches the work that should come next.
The portfolio
Eleven contributions to this issue tracker over the program, plus one upstream PR. Listed in chronological order with the role each played:
| # | Subject | Role in the arc |
|---|---|---|
| #13 | Application + 12-week implementation plan | Proposal |
| #14 | RFC: ML/AI porting subdirectory structure and workflow | Architectural |
| #17 | TensorFlow Lite v2.17.0 first cross-compile (libtensorflow-lite.a, 21 MB, 243 objects) | First port |
| #20 | Chocolate Doom 3.0.0 on riscv64 | Named LFX target, visual deliverable |
| #25 | OpenBLAS 0.3.33 ZVL128B forensic: 14,355 RVV opcodes, Higham §3.5 eq. 3.13 bounds, 11/12 DGEMM | Forensic methodology established |
| #26 | f64 HAL SIMD shim: 4 backends, 20/20 bit-identical | Backend-selection methodology |
| #27 | TensorFlow Lite v2.17.0 plug-and-play .deb (INT8 CNN inference) | Packaging methodology |
| OpenMathLib/OpenBLAS#5819 | Upstream documentation PR | Taking findings to upstream |
| #29 | OpenMM 8.5.0: 4-hunk patch, 14,425 RVV ops, 861 in calculateBlockIxn hot path, 12/12 platform tests | Upstream-friendly minimal-patch port |
| #30 | LAMMPS 30 Mar 2026: zero patches, plug-and-play .deb with bundled trajectory visualiser | Plug-and-play maturity |
| (verification issue) | verify-rvv-port.sh + 3-port compliance matrix | Methodology mechanised |
| (this issue) | Closing statement | Synthesis |
The progression matters: it goes proposal → architecture → first port → progressively richer ports → forensic methodology → mechanised methodology → close. Each step built on the lessons of the previous one, and the corrections found along the way (most recently the #30 PairLJCut count) are part of the story rather than something to hide.
Repository: https://github.com/trg-rgb/riscv-hpc-port
§1 The methodology, distilled
The methodology emerged from doing the work, not from being designed upfront. Five principles fall out:
§1.1 Function-scoped attribution over aggregate counts
A binary with 64,750 RVV opcodes whose per-timestep MD hot path has 24 is not the same thing as a binary with 7,609 RVV opcodes whose per-timestep MD hot path has 861. Both numbers are true; only the second describes a workload that is meaningfully accelerated by RVV.
The aggregate count is necessary but not sufficient. Always identify which functions carry the opcodes, then check whether those functions are on the workload's critical path.
This principle caught the #30 PairLJCut error in my own published work. It was the inverse case: I had reported the hot path as scalar (0) when it actually carried 24 opcodes from SLP-vectorisation of the function epilogue. The qualitative conclusion (the inner loop is scalar) was correct; the specific number was wrong. The verification tool's exact-match mode surfaced the discrepancy and now locks the corrected count.
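As a rough illustration of function-scoped attribution, the sketch below counts vector instructions per function in `objdump -d` style text. It is not the logic of verify-rvv-port.sh; the function name `rvv_opcodes_per_function`, the sample disassembly, and the "mnemonic starts with v" heuristic are all assumptions made for this example.

```python
import re
from collections import Counter

# Matches objdump symbol headers such as "0000000000010430 <hot_kernel>:".
FUNC_HEADER = re.compile(r"^[0-9a-f]+ <(.+)>:$")


def rvv_opcodes_per_function(disassembly):
    """Count vector instructions per function in `objdump -d` style text."""
    counts = Counter()
    current = None
    for line in disassembly.splitlines():
        header = FUNC_HEADER.match(line.strip())
        if header:
            current = header.group(1)
            continue
        # Instruction lines: "  addr:<TAB>bytes<TAB>mnemonic<TAB>operands".
        fields = line.split("\t")
        if current is None or len(fields) < 3:
            continue
        mnemonic = fields[2].strip()
        # Heuristic (assumption): any mnemonic starting with "v" is RVV.
        if mnemonic.startswith("v"):
            counts[current] += 1
    return counts


# Hand-written sample in objdump's layout; the encoding column is a placeholder.
SAMPLE = "\n".join([
    "0000000000010430 <hot_kernel>:",
    "   10430:\t00000000\tvsetvli\tt0,a0,e64,m1,ta,ma",
    "   10434:\t00000000\tvle64.v\tv8,(a1)",
    "   10438:\t00000000\tvfmacc.vv\tv16,v8,v8",
    "   1043c:\t00000000\tret",
    "",
    "0000000000010440 <cold_init>:",
    "   10440:\t00000000\taddi\tsp,sp,-16",
    "   10444:\t00000000\tret",
])

per_function = rvv_opcodes_per_function(SAMPLE)
total = sum(per_function.values())
```

The aggregate (`total`) and the per-function breakdown (`per_function`) come from the same pass, so the second question, whether the functions carrying the opcodes sit on the critical path, can be asked immediately.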
§1.2 Backend-selection verification
When multiple backends exist (auto-vectoriser, intrinsics, hand-assembly, multi-target compile, multi-backend HAL), confirm via a backend-specific opcode that the intended one was compiled in.
- For auto-vec and intrinsics: vsetvli (register-VL variant).
- For hand-assembly with compile-time fixed VL: vsetivli (immediate-VL variant).
- For LMUL-specific ports targeting wide kernels: vsetvli ... e64, m4 and similar.
The HAL work in #26 demonstrated this for the four-backend case; the OpenBLAS finding in the verification issue extended it to recognise hand-asm.
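A minimal sketch of that fingerprinting, assuming the disassembly has already been reduced to (mnemonic, operands) pairs. The names `backend_fingerprint` and `likely_backend` and the three coarse categories are hypothetical; the real tool may classify differently.

```python
def backend_fingerprint(instructions):
    """Summarise vector-configuration opcodes from (mnemonic, operands) pairs."""
    summary = {"vsetvli": 0, "vsetivli": 0, "e64_m4": 0}
    for mnemonic, operands in instructions:
        if mnemonic not in ("vsetvli", "vsetivli"):
            continue
        summary[mnemonic] += 1
        # The vtype fields follow the destination and length operands.
        vtype = operands.replace(" ", "").split(",")
        if "e64" in vtype and "m4" in vtype:
            summary["e64_m4"] += 1
    return summary


def likely_backend(summary):
    """Map a fingerprint onto coarse categories (a heuristic, not a proof)."""
    if summary["vsetvli"] == 0 and summary["vsetivli"] == 0:
        return "scalar"
    if summary["vsetvli"] == 0:
        return "fixed-VL hand assembly"
    return "auto-vectoriser or intrinsics"


# Illustrative inputs, not taken from any binary in the portfolio.
hand_asm = [("vsetivli", "zero,2,e64,m1,ta,ma"), ("vfmacc.vf", "v16,fa0,v8")]
auto_vec = [("vsetvli", "t0,a0,e64,m4,ta,ma"), ("vle64.v", "v8,(a1)")]

hand_asm_kind = likely_backend(backend_fingerprint(hand_asm))
auto_vec_summary = backend_fingerprint(auto_vec)
auto_vec_kind = likely_backend(auto_vec_summary)
```

A binary that mixes both variants falls into the last category here, so in practice the fingerprint is most useful when applied per function rather than per binary.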
§1.3 Arith/setup ratio as a silent-fallback detector
A ratio dominated by vsetvli instructions with few arithmetic vector ops suggests the compiler emitted strip-mining preamble but bailed before vector arithmetic — the GCC 13.x silent-fallback pattern that #25 initially encountered before the toolchain upgrade.
The threshold I now use is 10% — derived empirically from observing that healthy auto-vectorised ports show ratios from 10% (LAMMPS) to over 400% (OpenMM), while a pathological silent-fallback would show under 1%. The threshold is forward-looking; no port in the current portfolio shows the pathology.
Hand-written assembly produces the opposite extreme — many arithmetic ops with zero vsetvli (one vsetivli setup at function entry handles many subsequent operations). The verification tool encodes this as an explicit escape clause rather than treating it as a failure.
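The gate described in this subsection could be sketched as follows. The function name, the verdict strings, and the choice to count both vsetvli and vsetivli as setup are assumptions for illustration; only the 10% threshold and the hand-assembly escape clause come from the text above.

```python
def arith_setup_verdict(arith, vsetvli, vsetivli=0, threshold_percent=10):
    """Classify a port from its arithmetic and vector-setup opcode counts."""
    if arith == 0 and vsetvli == 0 and vsetivli == 0:
        return "no RVV"
    if vsetvli == 0 and arith > 0:
        # Escape clause: arithmetic with no vsetvli matches the hand-written
        # assembly profile, so it is not treated as a failure.
        return "hand-asm pattern"
    setup = vsetvli + vsetivli
    # Integer comparison avoids floating-point edge cases at exactly 10%.
    if arith * 100 < threshold_percent * setup:
        return "silent-fallback suspect"
    return "healthy"


# Illustrative counts only; these are not measurements from the portfolio.
verdicts = {
    "at threshold": arith_setup_verdict(arith=10, vsetvli=100),
    "mostly setup": arith_setup_verdict(arith=1, vsetvli=500),
    "hand assembly": arith_setup_verdict(arith=400, vsetvli=0, vsetivli=1),
    "scalar only": arith_setup_verdict(arith=0, vsetvli=0),
}
```

A ratio of exactly 10% passes, which keeps the low end of the healthy range quoted above on the right side of the gate.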
§1.4 QEMU exposes correctness bugs, not emulator artifacts
When a numerical kernel fails its own correctness tests under qemu-riscv64 by orders of magnitude beyond machine epsilon, the kernel has a source-level bug. QEMU's RVV emulation is a functionally-correct implementation of the RVV 1.0 specification, maintained upstream with substantial review.
The statement "tests fail under QEMU; not a code bug; requires hardware validation" inverts cause and effect. Hardware will produce the same failed results because the kernel is incorrect. The correct framing for a partial-correctness result is: "QEMU exposed a numerical bug consistent with [strip-mining boundary / accumulator precision / VLEN assumption]; diagnosis and fix needed before hardware testing is meaningful."
Closely related: QEMU user-mode is for functional correctness, not performance. Wall-clock numbers under emulation, eBPF instruction counts × hardware-frequency multipliers, and similar extrapolations are not grounded in any hardware measurement. Every port in this repository explicitly labels QEMU times as QEMU times, and defers performance claims to follow-up benchmarks on actual shipping RISC-V silicon.
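To make "orders of magnitude beyond machine epsilon" concrete, here is a small sketch that compares a computed dot product against the standard forward-error bound γ_n·Σ|a_i·b_i|, with γ_n = n·u/(1 − n·u) and u the unit roundoff. It is an illustration under stated assumptions: the helper names are hypothetical, and the bound is the textbook dot-product bound, not necessarily the exact expression used in #25.

```python
import sys
from fractions import Fraction

UNIT_ROUNDOFF = sys.float_info.epsilon / 2  # 2**-53 for IEEE binary64


def gamma(n, u=UNIT_ROUNDOFF):
    """The constant gamma_n = n*u / (1 - n*u) from rounding-error analysis."""
    return n * u / (1 - n * u)


def dot_within_rounding_bound(a, b, computed):
    """Return (ok, error, bound) for a computed dot product of a and b."""
    # Exact rational arithmetic on the float inputs gives the reference value.
    exact = sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))
    magnitude = sum(abs(Fraction(x) * Fraction(y)) for x, y in zip(a, b))
    error = abs(Fraction(computed) - exact)
    bound = Fraction(gamma(len(a))) * magnitude
    return error <= bound, float(error), float(bound)


a = [0.1, 0.2, 0.3]
b = [0.4, 0.5, 0.6]

naive = 0.0
for x, y in zip(a, b):
    naive += x * y

ok_naive, _, _ = dot_within_rounding_bound(a, b, naive)
# A result that is off by 1e-6 cannot be explained by rounding alone.
ok_broken, _, _ = dot_within_rounding_bound(a, b, naive + 1e-6)
```

A kernel whose error lands outside such a bound under emulation is showing a logic problem, which is exactly the case where waiting for hardware would not change the outcome.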
§1.5 Document honestly when a checked number was wrong
The #30 PairLJCut correction is the most concrete demonstration. I had published a wrong number. I built a tool. The tool surfaced the wrong number. I posted a correction comment rather than silently editing the issue body.
The discipline this enforces is that the methodology is not about being right the first time. It is about catching errors quickly, posting them publicly, and locking the correction such that future regressions are visible. Silent edits hide the methodology working; posted corrections demonstrate it.
The verification tool's =N exact-count assertion mode is the mechanical encoding of this principle: every port-level claim that fits the tool's checks is locked at the exact reported number, so any toolchain or source change that drifts the number triggers the gate immediately.
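A sketch of what an exact-count assertion can look like. Only the `=N` form is described above; the `>=N` floor, the function name `check_count`, and the error handling are assumptions added for the example.

```python
import re

# "=N" is the exact-count form; ">=N" is a hypothetical floor form.
SPEC = re.compile(r"^(>=|=)(\d+)$")


def check_count(spec, observed):
    """Return True when an observed opcode count satisfies the assertion."""
    match = SPEC.match(spec)
    if match is None:
        raise ValueError(f"unsupported assertion spec: {spec!r}")
    operator, expected = match.group(1), int(match.group(2))
    if operator == "=":
        return observed == expected
    return observed >= expected


# Mirrors the #30 correction: the published claim was 0, the binary held 24.
original_claim_holds = check_count("=0", 24)
corrected_claim_holds = check_count("=24", 24)
```

The point of the exact form is that a floor such as `>=1` would have kept passing through any drift in the count, while `=24` fails the moment a toolchain or source change moves it.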
§2 What this work surfaced
Three threads run across the portfolio that I didn't expect at the start:
Vectorisation strategies differ more than the marketing suggests. Auto-vectorisation (LAMMPS), template-parameterised intrinsics (OpenMM fvec4), and hand-written assembly (OpenBLAS dgemm_kernel) produce qualitatively different opcode profiles that are immediately visible in the verification matrix — the arith/setup ratio, the presence or absence of indexed gathers (vluxei*), the use of vsetvli vs vsetivli. "Has RVV" is not a useful predicate; "has RVV in the workload's inner loop" is.
The same toolchain produces meaningfully different results between GCC versions. #25 initially saw silent scalar fallback under GCC 13 and full vectorisation under GCC 15. This means RVV-acceleration claims must specify the toolchain version, and re-verification on a different toolchain is a routine necessity, not an optional check.
Plug-and-play packaging is a separable engineering concern. #27, #29, and #30 each took their working binary and wrapped it in a .deb with auto-discovery of resource paths, bundled examples, a demo command (lammps-rvv-demo in #30 generates a trajectory MP4/GIF end-to-end), and a forensic self-test (lammps-rvv-verify). The combined effect is that "install the deb, run one command, see RVV-accelerated MD on your screen" became a real user experience by the end of the program. This is a different skill from porting but it is the skill that turns a port into a usable artifact.
§3 What comes next — Phase 2
The natural extension of verify-rvv-port.sh is end-to-end automated porting: given a source URL, mechanically produce a verified .deb. Examining the six ports in this repository, the workflow decomposes into mechanizable and non-mechanizable steps:
| Step | Mechanizable? |
|---|---|
| Clone source | ✓ trivial |
| Detect build system | ✓ doable (look for CMakeLists.txt, configure.ac, Makefile, meson.build, ...) |
| Cross-compile with toolchain | ✓ for cmake; ⚠ for autotools/bazel/custom |
| Choose configure flags (e.g. LAMMPS' PKG_KSPACE=ON) | ✗ judgment |
| Write patches when source doesn't build (e.g. OpenMM's 4-hunk patch) | ✗ judgment |
| Identify the hot function for verification | ✗ judgment |
| Run build | ✓ trivial |
| Verify RVV (verify-rvv-port.sh) | ✓ done |
| Package as .deb | ✓ done (per-port script) |
| Smoke-test under qemu | ✓ doable |
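The "detect build system" row is the easiest one to sketch. The marker files below are the ones named in the table; the priority order and the function name `detect_build_system` are assumptions.

```python
from pathlib import Path

# Checked in order, so a tree with both CMakeLists.txt and a Makefile is
# reported as cmake. The ordering is an assumption, not a rule from the repo.
MARKERS = [
    ("CMakeLists.txt", "cmake"),
    ("meson.build", "meson"),
    ("configure.ac", "autotools"),
    ("Makefile", "make"),
]


def detect_build_system(source_dir):
    """Return the first build system whose marker file exists, else None."""
    root = Path(source_dir)
    for marker, name in MARKERS:
        if (root / marker).is_file():
            return name
    return None
```

Returning `None` rather than guessing keeps the unknown cases (bazel, custom scripts) visible, which matches the ⚠ in the cross-compile row.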
The steps requiring judgment (the three marked ✗, plus cross-compilation for anything other than cmake) are the actual porting work. Rule-based systems cannot produce upstream-quality patches for arbitrary codebases or select the correct configure flags for a domain-specific package set.
The realistic Phase-2 work is therefore not a full autoporter but two smaller pieces:
- port-rebuild.sh: an orchestrator that takes a port directory containing hand-written bootstrap.sh + verify.conf + package-deb.sh, and runs the whole pipeline mechanically (clone → configure → build → verify → package → smoke-test → report).
- port-from-template.sh: a scaffolding tool that generates new port directories with the right skeleton so adding a new port becomes "fill in 3 files" rather than "discover the structure from scratch."
Together these would reduce the per-port effort from days to hours by automating the boilerplate, not the engineering.
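The control flow of such an orchestrator is simple enough to sketch. This is a design illustration in Python, not the planned shell script: `run_pipeline` and the callable-per-stage interface are hypothetical, and a real version would invoke the port's bootstrap.sh, verify-rvv-port.sh, and package-deb.sh instead of the stand-in lambdas.

```python
# Stage names follow the pipeline listed above; the returned list is the report.
STAGE_ORDER = ["clone", "configure", "build", "verify", "package", "smoke-test"]


def run_pipeline(stages):
    """Run stages in order, stopping at the first failure.

    `stages` maps each stage name to a zero-argument callable returning a
    truthy value on success. Stages after a failure are marked as skipped.
    """
    report = []
    failed = False
    for name in STAGE_ORDER:
        if failed:
            report.append((name, "skipped"))
            continue
        succeeded = bool(stages[name]())
        report.append((name, "ok" if succeeded else "failed"))
        failed = not succeeded
    return report


# Stand-in stages: everything succeeds except verification.
demo_stages = {name: (lambda: True) for name in STAGE_ORDER}
demo_stages["verify"] = lambda: False
demo_report = run_pipeline(demo_stages)
```

Stopping at the first failure means a port that does not pass verification is never packaged, so a .deb produced by the pipeline always corresponds to a binary that cleared the gate.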
Phase 3 is genuinely a research question. LLM-in-the-loop patch generation for build failures is interesting but substantively different from anything in this repository today, and over-promising on it would devalue the concrete work that has shipped.
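Two smaller items also belong in the queue:
- Extending verify-rvv-port.sh to handle statically-linked binaries cleanly (this would let the HAL shim in #26 enter the compliance matrix without libc/locale RVV noise).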
§4 Acknowledgments
Kurt Keville (MIT) for the original mandate to apply forensic standards to RISC-V HPC porting, and for the consistent demand that QEMU numbers be reported as QEMU numbers. The discipline that runs through every issue in this portfolio — distinguishing functional correctness from performance, aggregate counts from function-scoped attribution, what the emulator measures from what the hardware would measure — came from Kurt being a hard-nosed pragmatist about evidence quality. Mistakes I have made over the program have been corrected on the strength of that discipline, not despite it.
Upstream projects and maintainers. LAMMPS at Sandia for maintaining a code base that compiles cleanly for a non-x86 architecture with zero modifications at the development tip. The OpenBLAS and OpenMathLib maintainers for engagement on #5819. The OpenMM developers for the template-parameterised fvec4 machinery that made the riscv64 port a 4-hunk patch rather than a substantial rewrite. The GCC team for the substantial improvements in the RVV auto-vectoriser between 13.x and 15.x — the 0 RVV opcodes in PairLJCut::compute's inner loop is a remaining limitation, but the 63,913 elsewhere in the LAMMPS binary is real work that the 13.x toolchain would not have produced.
The Linux Foundation for the LFX mentorship infrastructure and the program that made this possible.
The repository https://github.com/trg-rgb/riscv-hpc-port will remain available, the tool is reusable for future ports beyond this program, and the methodology is documented to the level that anyone can pick up where I left off. Whoever does so should expect to find errors in my work — that is what the verification tool is for — and should feel free to post corrections in the same style I have, with the original claim preserved alongside the correction.
— Tanmay Gulhane (@trg-rgb), MIT World Peace University, Pune. May 2026.