[Results] TensorFlow Lite v2.17.0 inference engine on riscv64 — end-to-end CNN inference validated under qemu-riscv64, packaged as .deb #27

@trg-rgb

Summary

TensorFlow Lite v2.17.0 has been cross-compiled to riscv64 as a static
library (libtensorflow-lite.a, 21 MB, 243 verified object files), and
end-to-end inference of a real-world INT8-quantized CNN has been
validated under qemu-riscv64. The library is packaged as a Debian package
(libtensorflow-lite-dev_2.17.0-1_riscv64.deb, 4.1 MB compressed,
22 MB uncompressed) for installation on riscv64 systems.

The program description specifies "Community AI / ML and HPC (double
precision) applications." This issue addresses the AI / ML half of
that scope; to my knowledge it is the only ML inference deliverable in
the current applicant pool.

Build

CMake cross-compile from x86_64 host to riscv64 target:

cmake ~/tensorflow-v2.17.0/tensorflow/lite \
  -DCMAKE_TOOLCHAIN_FILE=riscv64-toolchain.cmake \
  -DCMAKE_BUILD_TYPE=Release \
  -DTFLITE_ENABLE_XNNPACK=OFF \
  -DTFLITE_ENABLE_RUY=OFF \
  -DBUILD_SHARED_LIBS=OFF
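For readers without the repository at hand, here is a minimal sketch of what a toolchain file for this kind of build typically contains. It assumes the Debian/Ubuntu gcc-riscv64-linux-gnu cross compilers and their /usr/riscv64-linux-gnu target root, and it writes to a hypothetical riscv64-toolchain.example.cmake so nothing is overwritten. The file in tflite/toolchain/ is authoritative and may differ.

```shell
# Write an illustrative cross-compile toolchain file.
# Assumptions: Debian/Ubuntu cross packages (riscv64-linux-gnu-gcc / -g++)
# and their target root at /usr/riscv64-linux-gnu.
cat > riscv64-toolchain.example.cmake <<'EOF'
# Target platform
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR riscv64)

# Cross compilers (assumed names)
set(CMAKE_C_COMPILER riscv64-linux-gnu-gcc)
set(CMAKE_CXX_COMPILER riscv64-linux-gnu-g++)

# Target root (assumed location of the cross libc and headers)
set(CMAKE_FIND_ROOT_PATH /usr/riscv64-linux-gnu)

# Find programs on the host, libraries and headers only in the target root
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE ONLY)
EOF
```

Passing this file via -DCMAKE_TOOLCHAIN_FILE makes CMake treat the build as a cross-compile, so it does not try to run target binaries on the x86_64 host during configuration.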

Flag rationale:

| Flag | Reason |
|---|---|
| -DTFLITE_ENABLE_XNNPACK=OFF | XNNPACK uses NEON / SSE2 intrinsics; no fast path for base RV64GC. |
| -DTFLITE_ENABLE_RUY=OFF | Ruy has no riscv64 fast path; same rationale. |
| -DBUILD_SHARED_LIBS=OFF | Static library; avoids dynamic linker path issues under qemu-riscv64. |

With both XNNPACK and Ruy disabled, inference uses TFLite's
reference C++ kernels. These are correct but unoptimized. Adding
RISC-V RVV fast paths to either backend is a real upstream gap and
is viable as follow-up mentorship work: the HAL primitives in #26
(hal_matvec_row, hal_fmadd_f64x4) are natural building blocks
for accelerating TFLite's fully-connected and convolution kernels.

Library verification

| Metric | Value |
|---|---|
| Output | libtensorflow-lite.a |
| Size | 21 MB (21,883,450 bytes) |
| Object files in archive | 243 |
| Binary format | elf64-littleriscv |
| Architecture | riscv:rv64 |

Verified with riscv64-linux-gnu-objdump and riscv64-linux-gnu-nm:
the archive contains real inference engine symbols compiled for
riscv64, not stubs.
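The checks above can be reproduced roughly as follows. This is a sketch and not the exact commands used for this report: the tool prefix, the archive path, and the helper name summarize_objdump are assumptions, and the block skips itself if the archive or the cross binutils are missing.

```shell
# Tool prefix and archive path are assumptions; override via the environment.
CROSS="${CROSS-riscv64-linux-gnu-}"
LIB="${LIB:-libtensorflow-lite.a}"

# Summarize `objdump -f` output read from stdin: one line per distinct
# (file format, architecture) pair, prefixed with the number of members.
summarize_objdump() {
    awk '
        /file format/    { fmt = $NF }
        /^architecture:/ { arch = $2; sub(/,$/, "", arch); count[fmt " " arch]++ }
        END              { for (k in count) print count[k], k }
    '
}

if [ -f "$LIB" ] && command -v "${CROSS}objdump" >/dev/null 2>&1; then
    # Number of members in the archive (243 for the build described here).
    "${CROSS}ar" t "$LIB" | wc -l
    # A single output line means every member has the same format and architecture.
    "${CROSS}objdump" -f "$LIB" | summarize_objdump
    # Count defined (T or W) symbols in the tflite:: namespace; non-zero means real code.
    "${CROSS}nm" -C --defined-only "$LIB" | grep -c ' [TW] tflite::'
else
    echo "skipping: $LIB or ${CROSS}objdump not found"
fi
```

A mixed archive (for example, a stray x86_64 object) would show up as more than one line from summarize_objdump.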

End-to-end inference on riscv64

| Metric | Value |
|---|---|
| Model | INT8-quantized CNN, 59.9 MB |
| Task | Real-world agricultural image classification |
| Init time | 227 ms |
| First inference | ~798 s |
| Average inference | ~781 s |
| Memory footprint | 100.7 MB |

The ~800 s inference time reflects qemu-riscv64 user-mode emulation
overhead; it is not a RISC-V performance number. Every riscv64
instruction is translated and executed in software on the x86 host via
QEMU's TCG JIT. On real RV64GC silicon, even before adding RVV
acceleration, this CNN is expected to run at much lower latency.

What this benchmark validates is correctness: the inference engine,
built from all 243 compiled object files, produces classification
outputs under emulation that are deterministic per frame and match
an x86_64 baseline run of the same model. Performance validation is
hardware-bound future work (HiFive Unmatched, VisionFive 2, or similar).
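One simple way to express the baseline comparison is a byte-for-byte check of per-frame predictions. The sketch below assumes each run writes one predicted label per line to a text file; the file names preds_x86_64.txt and preds_riscv64.txt and the helper compare_predictions are hypothetical and not part of the repository.

```shell
# compare_predictions BASELINE CANDIDATE
# Returns 0 if the two prediction files are byte-identical; otherwise
# prints the first differing lines and returns 1.
compare_predictions() {
    base="$1"
    cand="$2"
    if cmp -s "$base" "$cand"; then
        echo "match: $(wc -l < "$base" | tr -d ' ') frames identical"
        return 0
    fi
    echo "MISMATCH (first differences):"
    diff "$base" "$cand" | head -n 20
    return 1
}

# Hypothetical file names: one label per line, one file per platform.
if [ -f preds_x86_64.txt ] && [ -f preds_riscv64.txt ]; then
    compare_predictions preds_x86_64.txt preds_riscv64.txt
fi
```

Comparing final labels is appropriate for an INT8 model; comparing raw floating-point scores across architectures would normally need a tolerance instead of an exact match.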

Debian package

Packaged for installation on riscv64 systems:

$ dpkg-deb --info dist/libtensorflow-lite-dev_2.17.0-1_riscv64.deb
 new Debian package, version 2.0.
 size 4218580 bytes: control archive=540 bytes.
 Package: libtensorflow-lite-dev
 Version: 2.17.0-1
 Architecture: riscv64
 ...

Contents: libtensorflow-lite.a (21 MB) under /usr/lib/riscv64-linux-gnu/,
plus 1161 .h files under /usr/include/tensorflow/lite/ preserving
the upstream directory structure. Installable via dpkg -i on a
riscv64 system.
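The package contents can be checked without installing it. The following sketch filters the listing printed by dpkg-deb --contents; the helper names count_headers and list_static_libs are made up for illustration, and the block skips itself when the package or dpkg-deb is not available.

```shell
# Package path as listed under Files; override via the environment if needed.
DEB="${DEB:-dist/libtensorflow-lite-dev_2.17.0-1_riscv64.deb}"

# Count regular files ending in .h under /usr/include/tensorflow/lite/
# in a `dpkg-deb --contents` listing read from stdin.
count_headers() {
    grep -c '^-.*/usr/include/tensorflow/lite/.*\.h$'
}

# Print the paths of static libraries in the same kind of listing.
list_static_libs() {
    awk '/^-.*\.a$/ { print $NF }'
}

if [ -f "$DEB" ] && command -v dpkg-deb >/dev/null 2>&1; then
    dpkg-deb --contents "$DEB" | count_headers      # 1161 for the package described here
    dpkg-deb --contents "$DEB" | list_static_libs   # path under /usr/lib/riscv64-linux-gnu/
else
    echo "skipping: $DEB or dpkg-deb not found"
fi
```

Because dpkg-deb only unpacks the archive, this check also works on the x86_64 build host.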

Files

  • tflite/results/tflite_build_results.txt — full CMake configure + build log
  • tflite/results/benchmark_results.txt — end-to-end inference timing
  • tflite/results/libtensorflow-lite.a — the static library
  • tflite/toolchain/riscv64-toolchain.cmake — cross-compile toolchain file
  • tflite/bin/benchmark_model — static riscv64 ELF benchmark binary (no sysroot dependency)
  • tflite/dist/libtensorflow-lite-dev_2.17.0-1_riscv64.deb — installable package

Repository

https://github.com/trg-rgb/riscv-hpc-port/tree/main/tflite

Future work (mentorship-scoped)

  1. RVV 1.0 fast paths in TFLite's reference kernels (matvec, conv, depthwise) — directly using the HAL primitives in [Results] Portable f64 SIMD HAL shim — RVV / AVX2+FMA / SSE2 / scalar — 20/20 bit-identical across backends on riscv64 #26
  2. XNNPACK riscv64 backend prototype (NEON intrinsics → RVV translation)
  3. Validation on HiFive Unmatched silicon (real performance numbers)
  4. Extend the cross-compile recipe from TFLite to full TensorFlow
