Summary
TensorFlow Lite v2.17.0 cross-compiled to riscv64 as a static library
(libtensorflow-lite.a, 21 MB, 243 verified object files), with
end-to-end inference of a real-world INT8-quantized CNN validated under qemu-riscv64. Packaged as a Debian package
(libtensorflow-lite-dev_2.17.0-1_riscv64.deb, 4.1 MB compressed,
22 MB uncompressed) for installation on riscv64 systems.
The program description specifies "Community AI / ML and HPC (double
precision) applications." This issue addresses the AI / ML half of
that scope — to my knowledge the only ML inference deliverable in
the current applicant pool.
Build
CMake cross-compile from x86_64 host to riscv64 target:
Flag rationale:
-DTFLITE_ENABLE_XNNPACK=OFF
XNNPACK uses NEON / SSE2 intrinsics; no fast path for base RV64GC.
-DTFLITE_ENABLE_RUY=OFF
Ruy has no riscv64 fast path; same rationale.
-DBUILD_SHARED_LIBS=OFF
Static library — no dynamic linker path issues under qemu-riscv64.
With both XNNPACK and Ruy disabled, inference falls back to TFLite's
reference C++ kernels, which are correct but unoptimized. Adding
RISC-V RVV fast paths to either backend is a real upstream gap and
viable follow-up mentorship work: the HAL primitives in #26
(hal_matvec_row, hal_fmadd_f64x4) are exactly the building blocks
for accelerating TFLite's fully-connected and convolution kernels.
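The three flags above can be combined into a configure and build step along the following lines. This is a sketch, not the exact command from the build log: the source checkout location (`tensorflow_src/tensorflow/lite`) and the build directory name (`build-riscv64`) are assumptions, while the toolchain file path is the one listed under Files. By default the script only prints the commands; setting `DRY_RUN=0` executes them.

```shell
#!/bin/sh
# Sketch only: a plausible configure step assembled from the flags above.
# TFLITE_SRC and BUILD_DIR are assumed paths, not taken from the build log.
TFLITE_SRC="${TFLITE_SRC:-tensorflow_src/tensorflow/lite}"
BUILD_DIR="${BUILD_DIR:-build-riscv64}"
TOOLCHAIN="${TOOLCHAIN:-tflite/toolchain/riscv64-toolchain.cmake}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Configure: cross toolchain, both optimized backends off, static library.
run cmake -S "$TFLITE_SRC" -B "$BUILD_DIR" \
    -DCMAKE_TOOLCHAIN_FILE="$TOOLCHAIN" \
    -DTFLITE_ENABLE_XNNPACK=OFF \
    -DTFLITE_ENABLE_RUY=OFF \
    -DBUILD_SHARED_LIBS=OFF

# Build with the generator's default parallelism.
run cmake --build "$BUILD_DIR" --parallel
```

Keeping the dry-run default makes it easy to review the exact flag set before committing to a long cross-compile.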
Library verification
| Metric | Value |
| --- | --- |
| Output | libtensorflow-lite.a |
| Size | 21 MB (21,883,450 bytes) |
| Object files in archive | 243 |
| Binary format | elf64-littleriscv |
| Architecture | riscv:rv64 |
Verified with riscv64-linux-gnu-objdump and riscv64-linux-gnu-nm —
real inference engine symbols compiled for riscv64, not stubs.
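A check of this kind can be scripted roughly as follows. The sketch assumes the cross binutils (`riscv64-linux-gnu-ar`, `riscv64-linux-gnu-objdump`, `riscv64-linux-gnu-nm`) are on `PATH` and that the archive sits at the path listed under Files. The helper name `summarize_formats` is hypothetical, and searching for `tflite::Interpreter::Invoke` is just one example of a symbol that a real inference engine would define.

```shell
#!/bin/sh
# Tally the "file format" lines that `objdump -f` prints per archive member.
# Reads objdump output on stdin and prints "<format> <count>" per format.
summarize_formats() {
    grep 'file format' | awk '{print $NF}' | sort | uniq -c | awk '{print $2, $1}'
}

LIB="${LIB:-tflite/results/libtensorflow-lite.a}"

# Only run the real tools when they and the archive are actually present.
if command -v riscv64-linux-gnu-objdump >/dev/null 2>&1 && [ -f "$LIB" ]; then
    # Number of members in the archive.
    riscv64-linux-gnu-ar t "$LIB" | wc -l
    # One line per distinct binary format, with a member count.
    riscv64-linux-gnu-objdump -f "$LIB" | summarize_formats
    # Spot-check that a core engine symbol is defined (demangled names).
    riscv64-linux-gnu-nm -C --defined-only "$LIB" | grep 'tflite::Interpreter::Invoke'
fi
```

For an archive like the one described above, the tally would be expected to show a single format whose count equals the number of object files; a second format line would indicate host objects mixed into the archive.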
End-to-end inference on riscv64
| Metric | Value |
| --- | --- |
| Model | INT8-quantized CNN, 59.9 MB |
| Task | Real-world agricultural image classification |
| Init time | 227 ms |
| First inference | ~798 s |
| Average inference | ~781 s |
| Memory footprint | 100.7 MB |
The ~800 s inference time is qemu-riscv64 user-mode emulation
overhead, not a RISC-V performance number. Every riscv64 instruction
is translated and executed in software on the x86 host via QEMU's TCG
JIT. On real RV64GC silicon — even before adding RVV acceleration —
this CNN would run at orders of magnitude lower latency.
What this benchmark validates is correctness: the inference engine,
built from the 243 compiled object files, produces correct
classification outputs under emulation, with deterministic per-frame
results matching an x86_64 baseline run of the same model. Performance
validation is future work that requires real hardware (HiFive
Unmatched, VisionFive 2, or similar).
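The baseline comparison described above can be sketched as a plain file comparison. The example assumes each run writes one prediction per line to a text log; that log format, the file names, and the helper `compare_runs` are illustrative assumptions, not the harness actually used for this issue.

```shell
#!/bin/sh
# Compare two prediction logs, one line per frame (for example "frame_0007 class_3").
# The log format and file names are assumptions made for this sketch.
compare_runs() {
    baseline="$1"
    candidate="$2"
    if cmp -s "$baseline" "$candidate"; then
        # Byte-identical logs: report how many frames were checked.
        frames=$(wc -l < "$baseline" | tr -d ' ')
        echo "MATCH: $frames frames identical"
    else
        # Show which frames differ and signal failure to the caller.
        echo "MISMATCH between $baseline and $candidate:"
        diff "$baseline" "$candidate"
        return 1
    fi
}

# Typical use, with hypothetical file names:
#   compare_runs predictions_x86_64.txt predictions_riscv64.txt
```

A byte-for-byte comparison is a reasonable bar for an INT8 model, since integer kernels should not be subject to floating-point rounding differences between hosts; a float model would more likely need a tolerance-based comparison.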
Debian package
Packaged for installation on riscv64 systems:
$ dpkg-deb --info dist/libtensorflow-lite-dev_2.17.0-1_riscv64.deb
new Debian package, version 2.0.
size 4218580 bytes: control archive=540 bytes.
Package: libtensorflow-lite-dev
Version: 2.17.0-1
Architecture: riscv64
...
Contents: libtensorflow-lite.a (21 MB) under /usr/lib/riscv64-linux-gnu/,
plus 1161 .h files under /usr/include/tensorflow/lite/ preserving
the upstream directory structure. Installable via dpkg -i on a
riscv64 system.
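The package layout can be inspected on any host without installing it. The following sketch assumes `dpkg-deb` is available and that the package sits at the path listed under Files; the helper name `count_headers` is hypothetical.

```shell
#!/bin/sh
# Count header files in a `dpkg-deb --contents` listing read from stdin.
# The last field of each listing line is the path inside the package.
count_headers() {
    awk '{print $NF}' | grep -c '\.h$'
}

DEB="${DEB:-tflite/dist/libtensorflow-lite-dev_2.17.0-1_riscv64.deb}"

# Only inspect the package when dpkg-deb and the file are actually present.
if command -v dpkg-deb >/dev/null 2>&1 && [ -f "$DEB" ]; then
    # Number of .h files shipped in the package.
    dpkg-deb --contents "$DEB" | count_headers
    # Confirm where the static library lands.
    dpkg-deb --contents "$DEB" | grep 'libtensorflow-lite\.a$'
fi

# On the riscv64 target itself (requires root):
#   dpkg -i libtensorflow-lite-dev_2.17.0-1_riscv64.deb
#   dpkg -L libtensorflow-lite-dev
```

Checking the listing before installation is a cheap way to confirm that the header tree and the library path match what the package description claims.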
Files
tflite/results/tflite_build_results.txt: full CMake configure + build log
tflite/results/benchmark_results.txt: end-to-end inference timing
tflite/results/libtensorflow-lite.a: the static library
tflite/toolchain/riscv64-toolchain.cmake: cross-compile toolchain file
tflite/bin/benchmark_model: static riscv64 ELF benchmark binary (no sysroot dependency)
tflite/dist/libtensorflow-lite-dev_2.17.0-1_riscv64.deb: installable package
Repository
https://github.com/trg-rgb/riscv-hpc-port/tree/main/tflite
Future work (mentorship-scoped)
Related issues
hal_matvec_row (candidate backend for TFLite fully-connected layer acceleration): [Results] Portable f64 SIMD HAL shim — RVV / AVX2+FMA / SSE2 / scalar — 20/20 bit-identical across backends on riscv64 #26