Summary
A portable double-precision SIMD abstraction header — hal/simd.h — that
eliminates #ifdef chains at call sites and provides a single API
compiling correctly to four backends with compile-time dispatch:
| Backend |
Gate |
Key intrinsics |
| RISC-V RVV 1.0 |
__riscv && __riscv_v |
vfmacc_vv_f64m4, vle64_v_f64m4 |
| x86 AVX2 + FMA |
__AVX2__ && __FMA__ |
_mm256_fmadd_pd, _mm256_loadu_pd |
| x86 SSE2 |
__SSE2__ |
_mm_mul_pd, _mm_add_pd (2×__m128d) |
| Scalar |
any ISA |
Plain C — bit-identical reference |
Three design choices worth calling out, since this is where competing HAL
shims diverge:
- f64 throughout. The program description specifies "(double precision)."
f32-only shims do not satisfy this — and most HPC reference codes this
mentorship targets are f64 internally.
- LMUL=4 register grouping. Wider grouping than LMUL=1 — more
throughput per vsetvli strip-mine iteration on hardware that has
the registers free. The scalar tail handles arbitrary lengths.
- AVX2 + FMA gated separately. FMA3 is a separate CPUID feature
from AVX2 on older Intel parts; gating only on __AVX2__ produces
broken builds on those CPUs.
Higher-level operations
Built on the load / store / fmadd / hsum primitives:
hal_dot4 — 4-element dot product
hal_matvec_row — matrix row × vector (arbitrary length, SIMD body + scalar tail)
hal_axpy4 — BLAS-1 AXPY: y = alpha*x + y
hal_matvec_row is the operation driving TensorFlow Lite fully-connected
layer inference in the companion issue (link to be added once filed).
Validation — dual-backend on riscv64
Both backends built with the same source tree and tested under
qemu-riscv64 10.2.1:
| Build flag |
Binary |
Architecture (self-reported) |
Tests |
-march=rv64gc |
test_hal_riscv64_scalar |
scalar |
20/20 PASS |
-march=rv64gcv |
test_hal_riscv64_rvv |
riscv_rvv |
20/20 PASS |
Numerical results are bit-identical between backends (verified via
diff after stripping the architecture-identification line). Every test
across hal_dot4, hal_fmadd_f64x4, hal_matvec_row, hal_axpy4,
and hal_sub_f64x4 produces err = 0.00e+00 on both backends.
Backend selection — verified by disassembly
It is easy to claim "the RVV backend was selected" by reading the
self-reported architecture line. The harder claim is verified by
disassembling both binaries and counting opcodes:
| Binary |
Total RVV opcodes (a) |
f64m4-specific opcodes (b) |
test_hal_riscv64_scalar |
0 |
0 |
test_hal_riscv64_rvv |
596 |
8 |
(a) Counted via objdump -d | grep -cE 'vle64|vse64|vfmacc|vsetvl|vfmul|vfadd|vfredosum'
(b) Counted via objdump -d | grep -cE 'vfmacc.vv|vfredosum.vs|vsetvli.*e64,m4'
The f64m4-specific intrinsics (vfmacc.vv, vfredosum.vs,
vsetvli ... e64,m4) only appear when the HAL RVV backend code is
compiled — GCC auto-vectorization does not emit LMUL=4 grouping on
its own. Their presence in the RVV binary (8 ops) and total absence
in the scalar binary (0 ops) confirms backend selection works as
designed; it is not GCC auto-vec masquerading as the HAL path.
Reproduction
git clone https://github.com/trg-rgb/riscv-hpc-port
cd riscv-hpc-port/hal
make verify
Expected final output:
--- Scalar ---
Results: 20 PASS, 0 FAIL
--- RVV ---
Results: 20 PASS, 0 FAIL
BIT-IDENTICAL across backends
RVV opcodes in RVV binary: 596
f64m4 opcodes in scalar binary (must be 0): 0
Toolchain: riscv64-linux-gnu-gcc 15.2.0, qemu-riscv64 10.2.1
(Ubuntu 24.04, x86_64 host).
Files
hal/simd.h — the shim
hal/test_hal.c — 20-test harness
hal/Makefile — reproduction recipe (make verify)
hal/test_hal_results.txt — curated dual-backend summary
hal/test_hal_results_scalar.txt, hal/test_hal_results_rvv.txt — raw runs
Repository
https://github.com/trg-rgb/riscv-hpc-port/tree/main/hal
Related work
Summary
A portable double-precision SIMD abstraction header —
hal/simd.h— thateliminates
#ifdefchains at call sites and provides a single APIcompiling correctly to four backends with compile-time dispatch:
__riscv && __riscv_vvfmacc_vv_f64m4,vle64_v_f64m4__AVX2__ && __FMA___mm256_fmadd_pd,_mm256_loadu_pd__SSE2___mm_mul_pd,_mm_add_pd(2×__m128d)Three design choices worth calling out, since this is where competing HAL
shims diverge:
f32-only shims do not satisfy this — and most HPC reference codes this
mentorship targets are f64 internally.
throughput per
vsetvlistrip-mine iteration on hardware that hasthe registers free. The scalar tail handles arbitrary lengths.
from AVX2 on older Intel parts; gating only on
__AVX2__producesbroken builds on those CPUs.
Higher-level operations
Built on the load / store / fmadd / hsum primitives:
hal_dot4— 4-element dot producthal_matvec_row— matrix row × vector (arbitrary length, SIMD body + scalar tail)hal_axpy4— BLAS-1 AXPY:y = alpha*x + yhal_matvec_rowis the operation driving TensorFlow Lite fully-connectedlayer inference in the companion issue (link to be added once filed).
Validation — dual-backend on riscv64
Both backends built with the same source tree and tested under
qemu-riscv64 10.2.1:-march=rv64gctest_hal_riscv64_scalarscalar-march=rv64gcvtest_hal_riscv64_rvvriscv_rvvNumerical results are bit-identical between backends (verified via
diffafter stripping the architecture-identification line). Every testacross
hal_dot4,hal_fmadd_f64x4,hal_matvec_row,hal_axpy4,and
hal_sub_f64x4produceserr = 0.00e+00on both backends.Backend selection — verified by disassembly
It is easy to claim "the RVV backend was selected" by reading the
self-reported architecture line. The harder claim is verified by
disassembling both binaries and counting opcodes:
test_hal_riscv64_scalartest_hal_riscv64_rvv(a) Counted via
objdump -d | grep -cE 'vle64|vse64|vfmacc|vsetvl|vfmul|vfadd|vfredosum'(b) Counted via
objdump -d | grep -cE 'vfmacc.vv|vfredosum.vs|vsetvli.*e64,m4'The f64m4-specific intrinsics (
vfmacc.vv,vfredosum.vs,vsetvli ... e64,m4) only appear when the HAL RVV backend code iscompiled — GCC auto-vectorization does not emit LMUL=4 grouping on
its own. Their presence in the RVV binary (8 ops) and total absence
in the scalar binary (0 ops) confirms backend selection works as
designed; it is not GCC auto-vec masquerading as the HAL path.
Reproduction
Expected final output:
Toolchain:
riscv64-linux-gnu-gcc 15.2.0,qemu-riscv64 10.2.1(Ubuntu 24.04, x86_64 host).
Files
hal/simd.h— the shimhal/test_hal.c— 20-test harnesshal/Makefile— reproduction recipe (make verify)hal/test_hal_results.txt— curated dual-backend summaryhal/test_hal_results_scalar.txt,hal/test_hal_results_rvv.txt— raw runsRepository
https://github.com/trg-rgb/riscv-hpc-port/tree/main/hal
Related work
hal_matvec_rowfor dense-layer compute): link added once issue filed