[Results] Portable f64 SIMD HAL shim — RVV / AVX2+FMA / SSE2 / scalar — 20/20 bit-identical across backends on riscv64

## Summary

A portable double-precision SIMD abstraction header — `hal/simd.h` — that
eliminates `#ifdef` chains at call sites and provides a single API
compiling correctly to four backends with compile-time dispatch:

| Backend           | Gate                          | Key intrinsics                            |
| ----------------- | ----------------------------- | ----------------------------------------- |
| RISC-V RVV 1.0    | `__riscv && __riscv_v`        | `vfmacc_vv_f64m4`, `vle64_v_f64m4`        |
| x86 AVX2 + FMA    | `__AVX2__ && __FMA__`         | `_mm256_fmadd_pd`, `_mm256_loadu_pd`      |
| x86 SSE2          | `__SSE2__`                    | `_mm_mul_pd`, `_mm_add_pd` (2×`__m128d`)  |
| Scalar            | any ISA                       | Plain C — bit-identical reference         |

Three design choices worth calling out, since this is where competing HAL
shims diverge:

1. **f64 throughout.** The program description specifies "(double precision)."
   f32-only shims do not satisfy this — and most HPC reference codes this
   mentorship targets are f64 internally.
2. **LMUL=4 register grouping.** Wider grouping than LMUL=1 — more
   throughput per `vsetvli` strip-mine iteration on hardware that has
   the registers free. The scalar tail handles arbitrary lengths.
3. **AVX2 + FMA gated separately.** FMA3 is a separate CPUID feature
   from AVX2 on older Intel parts; gating only on `__AVX2__` produces
   broken builds on those CPUs.

## Higher-level operations

Built on the load / store / fmadd / hsum primitives:

- `hal_dot4` — 4-element dot product
- `hal_matvec_row` — matrix row × vector (arbitrary length, SIMD body + scalar tail)
- `hal_axpy4` — BLAS-1 AXPY: `y = alpha*x + y`

`hal_matvec_row` is the operation driving TensorFlow Lite fully-connected
layer inference in the companion issue (link to be added once filed).

## Validation — dual-backend on riscv64

Both backends built with the same source tree and tested under
`qemu-riscv64 10.2.1`:

| Build flag        | Binary                        | Architecture (self-reported) | Tests       |
| ----------------- | ----------------------------- | ---------------------------- | ----------- |
| `-march=rv64gc`   | `test_hal_riscv64_scalar`     | `scalar`                     | **20/20 PASS** |
| `-march=rv64gcv`  | `test_hal_riscv64_rvv`        | `riscv_rvv`                  | **20/20 PASS** |

**Numerical results are bit-identical between backends** (verified via
`diff` after stripping the architecture-identification line). Every test
across `hal_dot4`, `hal_fmadd_f64x4`, `hal_matvec_row`, `hal_axpy4`,
and `hal_sub_f64x4` produces `err = 0.00e+00` on both backends.

## Backend selection — verified by disassembly

It is easy to claim "the RVV backend was selected" by reading the
self-reported architecture line. The harder claim is verified by
disassembling both binaries and counting opcodes:

| Binary                        | Total RVV opcodes (a) | f64m4-specific opcodes (b) |
| ----------------------------- | --------------------- | -------------------------- |
| `test_hal_riscv64_scalar`     | 0                     | **0**                      |
| `test_hal_riscv64_rvv`        | 596                   | **8**                      |

(a) Counted via `objdump -d | grep -cE 'vle64|vse64|vfmacc|vsetvl|vfmul|vfadd|vfredosum'`
(b) Counted via `objdump -d | grep -cE 'vfmacc.vv|vfredosum.vs|vsetvli.*e64,m4'`

The f64m4-specific intrinsics (`vfmacc.vv`, `vfredosum.vs`,
`vsetvli ... e64,m4`) only appear when the HAL RVV backend code is
compiled — GCC auto-vectorization does not emit LMUL=4 grouping on
its own. Their presence in the RVV binary (8 ops) and total absence
in the scalar binary (0 ops) confirms backend selection works as
designed; it is not GCC auto-vec masquerading as the HAL path.

## Reproduction

```
git clone https://github.com/trg-rgb/riscv-hpc-port
cd riscv-hpc-port/hal
make verify
```

Expected final output:

```
--- Scalar ---
Results: 20 PASS, 0 FAIL
--- RVV    ---
Results: 20 PASS, 0 FAIL
BIT-IDENTICAL across backends
RVV opcodes in RVV binary: 596
f64m4 opcodes in scalar binary (must be 0): 0
```

Toolchain: `riscv64-linux-gnu-gcc 15.2.0`, `qemu-riscv64 10.2.1`
(Ubuntu 24.04, x86_64 host).

## Files

- `hal/simd.h` — the shim
- `hal/test_hal.c` — 20-test harness
- `hal/Makefile` — reproduction recipe (`make verify`)
- `hal/test_hal_results.txt` — curated dual-backend summary
- `hal/test_hal_results_scalar.txt`, `hal/test_hal_results_rvv.txt` — raw runs

## Repository

https://github.com/trg-rgb/riscv-hpc-port/tree/main/hal

## Related work

- TensorFlow Lite riscv64 inference engine (uses HAL primitives in `hal_matvec_row` for dense-layer compute): _link added once issue filed_
- OpenBLAS 0.3.33 ZVL128B build with verified 14,355 RVV opcodes: #25
- Upstream OpenBLAS documentation PR clarifying RVV target selection: https://github.com/OpenMathLib/OpenBLAS/pull/5819
- Chocolate Doom 3.0.0 running on riscv64 (LFX spreadsheet-named target): #20

Build flag	Binary	Architecture (self-reported)	Tests
`-march=rv64gc`	`test_hal_riscv64_scalar`	`scalar`	20/20 PASS
`-march=rv64gcv`	`test_hal_riscv64_rvv`	`riscv_rvv`	20/20 PASS

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Results] Portable f64 SIMD HAL shim — RVV / AVX2+FMA / SSE2 / scalar — 20/20 bit-identical across backends on riscv64 #26

Summary

Higher-level operations

Validation — dual-backend on riscv64

Backend selection — verified by disassembly

Reproduction

Files

Repository

Related work

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Backend	Gate	Key intrinsics
RISC-V RVV 1.0	`__riscv && __riscv_v`	`vfmacc_vv_f64m4`, `vle64_v_f64m4`
x86 AVX2 + FMA	`__AVX2__ && __FMA__`	`_mm256_fmadd_pd`, `_mm256_loadu_pd`
x86 SSE2	`__SSE2__`	`_mm_mul_pd`, `_mm_add_pd` (2×`__m128d`)
Scalar	any ISA	Plain C — bit-identical reference

[Results] Portable f64 SIMD HAL shim — RVV / AVX2+FMA / SSE2 / scalar — 20/20 bit-identical across backends on riscv64 #26

Description

Summary

Higher-level operations

Validation — dual-backend on riscv64

Backend selection — verified by disassembly

Reproduction

Files

Repository

Related work

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions