[Results] Portable f64 SIMD HAL shim — RVV / AVX2+FMA / SSE2 / scalar — 20/20 bit-identical across backends on riscv64 #26

@trg-rgb

Summary

hal/simd.h is a portable double-precision SIMD abstraction header. It
removes #ifdef chains from call sites and exposes a single API that
compiles to one of four backends, selected at compile time:

| Backend        | Gate                 | Key intrinsics                     |
|----------------|----------------------|------------------------------------|
| RISC-V RVV 1.0 | __riscv && __riscv_v | vfmacc_vv_f64m4, vle64_v_f64m4     |
| x86 AVX2 + FMA | __AVX2__ && __FMA__  | _mm256_fmadd_pd, _mm256_loadu_pd   |
| x86 SSE2       | __SSE2__             | _mm_mul_pd, _mm_add_pd (2×__m128d) |
| Scalar         | any ISA              | Plain C (bit-identical reference)  |
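
The gates in the table could be wired up roughly as follows. This is a minimal sketch, not the contents of hal/simd.h: the sketch_ names and the two x86 backend strings are hypothetical, and only the gate macros and the intrinsics named in the table come from this report. The RVV branch assumes the __riscv_-prefixed intrinsic spellings provided by riscv_vector.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pick exactly one backend at compile time, most capable first. */
#if defined(__riscv) && defined(__riscv_v)
#  include <riscv_vector.h>
#  define SKETCH_BACKEND "riscv_rvv"
#elif defined(__AVX2__) && defined(__FMA__)
#  include <immintrin.h>
#  define SKETCH_BACKEND "x86_avx2_fma"   /* hypothetical label */
#elif defined(__SSE2__)
#  include <emmintrin.h>
#  define SKETCH_BACKEND "x86_sse2"       /* hypothetical label */
#else
#  define SKETCH_BACKEND "scalar"
#endif

static const char *sketch_backend_name(void) { return SKETCH_BACKEND; }

/* out[i] = a[i] * b[i] + c[i] for i = 0..3; call sites need no #ifdef. */
static void sketch_fmadd4(const double *a, const double *b,
                          const double *c, double *out)
{
    assert(a && b && c && out);
#if defined(__riscv) && defined(__riscv_v)
    /* e64 with LMUL=4 holds at least 4 doubles, so vl is 4 here. */
    size_t vl = __riscv_vsetvl_e64m4(4);
    vfloat64m4_t va = __riscv_vle64_v_f64m4(a, vl);
    vfloat64m4_t vb = __riscv_vle64_v_f64m4(b, vl);
    vfloat64m4_t vc = __riscv_vle64_v_f64m4(c, vl);
    vc = __riscv_vfmacc_vv_f64m4(vc, va, vb, vl);   /* vc += va * vb */
    __riscv_vse64_v_f64m4(out, vc, vl);
#elif defined(__AVX2__) && defined(__FMA__)
    __m256d r = _mm256_fmadd_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b),
                                _mm256_loadu_pd(c));
    _mm256_storeu_pd(out, r);
#elif defined(__SSE2__)
    /* Two 128-bit halves; multiply then add, so two roundings per lane. */
    for (size_t h = 0; h < 4; h += 2) {
        __m128d p = _mm_mul_pd(_mm_loadu_pd(a + h), _mm_loadu_pd(b + h));
        _mm_storeu_pd(out + h, _mm_add_pd(p, _mm_loadu_pd(c + h)));
    }
#else
    for (size_t i = 0; i < 4; ++i)
        out[i] = a[i] * b[i] + c[i];
#endif
}
```

A caller writes sketch_fmadd4(a, b, c, out) on every target. Which branch is compiled depends only on the -march / -m flags passed to the compiler.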

Three design choices worth calling out, since this is where competing HAL
shims diverge:

  1. f64 throughout. The program description specifies "(double precision)."
    f32-only shims do not satisfy this — and most HPC reference codes this
    mentorship targets are f64 internally.
  2. LMUL=4 register grouping. Wider grouping than LMUL=1 — more
    throughput per vsetvli strip-mine iteration on hardware that has
    the registers free. The scalar tail handles arbitrary lengths.
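  3. AVX2 + FMA gated separately. FMA3 has its own CPUID feature bit and
    its own compiler flag (-mfma); -mavx2 alone does not define __FMA__.
    Gating only on __AVX2__ therefore breaks builds that enable AVX2
    without FMA, because _mm256_fmadd_pd is not available to them.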
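
Choice 2 can be illustrated with a strip-mined AXPY loop. This is a sketch under stated assumptions, not the shim's code: sketch_axpy_f64 is a hypothetical name, the RVV branch uses the __riscv_-prefixed intrinsics from riscv_vector.h, and it lets vsetvl absorb the remainder instead of using the scalar tail the shim describes. On other targets it falls back to plain C so the snippet compiles anywhere.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__riscv) && defined(__riscv_v)
#  include <riscv_vector.h>
#endif

/* y[i] = alpha * x[i] + y[i] for i in [0, n). Hypothetical helper. */
static void sketch_axpy_f64(size_t n, double alpha,
                            const double *x, double *y)
{
    assert(n == 0 || (x && y));
#if defined(__riscv) && defined(__riscv_v)
    while (n > 0) {
        /* Ask for up to n elements of e64 at LMUL=4; hardware returns vl. */
        size_t vl = __riscv_vsetvl_e64m4(n);
        vfloat64m4_t vx = __riscv_vle64_v_f64m4(x, vl);
        vfloat64m4_t vy = __riscv_vle64_v_f64m4(y, vl);
        vy = __riscv_vfmacc_vf_f64m4(vy, alpha, vx, vl);  /* vy += alpha*vx */
        __riscv_vse64_v_f64m4(y, vy, vl);
        x += vl;
        y += vl;
        n -= vl;
    }
#else
    /* Portable fallback so the sketch builds on non-RVV targets. */
    for (size_t i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
#endif
}
```

With LMUL=4 each loop iteration covers four vector registers' worth of doubles, so the same array needs roughly a quarter of the vsetvl iterations that an LMUL=1 loop would.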

Higher-level operations

Built on the load / store / fmadd / hsum primitives:

  • hal_dot4 — 4-element dot product
  • hal_matvec_row — matrix row × vector (arbitrary length, SIMD body + scalar tail)
  • hal_axpy4 — BLAS-1 AXPY: y = alpha*x + y

hal_matvec_row is the operation driving TensorFlow Lite fully-connected
layer inference in the companion issue (link to be added once filed).
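
The "SIMD body + scalar tail" shape of hal_matvec_row can be sketched in plain C. The function below is a hypothetical stand-in (sketch_matvec_row), not the shim's implementation: it emulates a 4-lane body with four scalar accumulators, reduces them in a fixed order, then finishes the remainder element by element.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of one matrix row with x, for arbitrary n. */
static double sketch_matvec_row(const double *row, const double *x, size_t n)
{
    assert(n == 0 || (row && x));
    double acc[4] = {0.0, 0.0, 0.0, 0.0};
    size_t i = 0;

    /* Body: whole groups of 4, one accumulator per lane. */
    for (; i + 4 <= n; i += 4)
        for (size_t lane = 0; lane < 4; ++lane)
            acc[lane] += row[i + lane] * x[i + lane];

    /* Horizontal sum in a fixed order (the role of the hsum primitive). */
    double sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    /* Tail: the 0 to 3 elements that do not fill a group. */
    for (; i < n; ++i)
        sum += row[i] * x[i];
    return sum;
}
```

Floating-point addition is not associative, so a backend that wants bit-identical results has to reduce its lanes in the same order as the reference. The order used here is only an illustration.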

Validation — dual-backend on riscv64

Both backends built with the same source tree and tested under
qemu-riscv64 10.2.1:

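| Build flag     | Binary                  | Architecture (self-reported) | Tests       |
|----------------|-------------------------|------------------------------|-------------|
| -march=rv64gc  | test_hal_riscv64_scalar | scalar                       | 20/20 PASS  |
| -march=rv64gcv | test_hal_riscv64_rvv    | riscv_rvv                    | 20/20 PASS  |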

Numerical results are bit-identical between backends (verified via
diff after stripping the architecture-identification line). Every test
across hal_dot4, hal_fmadd_f64x4, hal_matvec_row, hal_axpy4,
and hal_sub_f64x4 produces err = 0.00e+00 on both backends.
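
The report checks bit-identity by diffing text output. An in-process check of the same property might compare raw bit patterns instead of values. The helpers below are hypothetical and not part of test_hal.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a double as its 64-bit pattern without aliasing issues. */
static uint64_t sketch_f64_bits(double v)
{
    uint64_t u;
    memcpy(&u, &v, sizeof u);
    return u;
}

/* Return 1 if every element of a and b has the same bit pattern, else 0. */
static int sketch_bit_identical(const double *a, const double *b, size_t n)
{
    assert(n == 0 || (a && b));
    for (size_t i = 0; i < n; ++i)
        if (sketch_f64_bits(a[i]) != sketch_f64_bits(b[i]))
            return 0;
    return 1;
}
```

This is stricter than an absolute-error test: 0.0 and -0.0 compare equal with == and give err = 0.00e+00, yet their bit patterns differ, so sketch_bit_identical reports a mismatch.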

Backend selection — verified by disassembly

It is easy to claim "the RVV backend was selected" by reading the
self-reported architecture line. The harder claim is verified by
disassembling both binaries and counting opcodes:

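| Binary                  | Total RVV opcodes (a) | f64m4-specific opcodes (b) |
|-------------------------|-----------------------|----------------------------|
| test_hal_riscv64_scalar | 0                     | 0                          |
| test_hal_riscv64_rvv    | 596                   | 8                          |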

(a) Counted via objdump -d | grep -cE 'vle64|vse64|vfmacc|vsetvl|vfmul|vfadd|vfredosum'
(b) Counted via objdump -d | grep -cE 'vfmacc.vv|vfredosum.vs|vsetvli.*e64,m4'

The instruction patterns counted in (b) (vfmacc.vv, vfredosum.vs,
vsetvli ... e64,m4) only appear when the HAL RVV backend code is
compiled, because GCC auto-vectorization does not emit LMUL=4 grouping
on its own. Their presence in the RVV binary (8 matches) and total
absence from the scalar binary (0 matches) confirms that backend
selection works as designed: the RVV binary is running the HAL path,
not GCC auto-vectorized code masquerading as it.

Reproduction

git clone https://github.com/trg-rgb/riscv-hpc-port
cd riscv-hpc-port/hal
make verify

Expected final output:

--- Scalar ---
Results: 20 PASS, 0 FAIL
--- RVV    ---
Results: 20 PASS, 0 FAIL
BIT-IDENTICAL across backends
RVV opcodes in RVV binary: 596
f64m4 opcodes in scalar binary (must be 0): 0

Toolchain: riscv64-linux-gnu-gcc 15.2.0, qemu-riscv64 10.2.1
(Ubuntu 24.04, x86_64 host).

Files

  • hal/simd.h — the shim
  • hal/test_hal.c — 20-test harness
  • hal/Makefile — reproduction recipe (make verify)
  • hal/test_hal_results.txt — curated dual-backend summary
  • hal/test_hal_results_scalar.txt, hal/test_hal_results_rvv.txt — raw runs

Repository

https://github.com/trg-rgb/riscv-hpc-port/tree/main/hal
