WebAssembly for Embedded Systems: Comprehensive Research Report

ARM Cortex-M and RISC-V Optimization Study

Executive Summary

This report provides comprehensive research on WebAssembly deployment in embedded systems, specifically targeting ARM Cortex-M and RISC-V architectures. The findings cover runtime implementations, memory management strategies, performance optimizations, and real-world deployment considerations.

Key Findings:

WAMR (WebAssembly Micro Runtime) emerges as the most feature-complete embedded runtime
AOT compilation achieves 50-79% of native performance on embedded platforms
Memory footprints as low as 29.4KB (AOT) to 90KB (interpreter) are achievable
Hardware acceleration can provide up to 142x speedup for specific workloads
64KB page size and lack of small integer types remain significant embedded challenges

1. Existing WebAssembly Runtimes for Embedded Systems

1.1 WAMR (WebAssembly Micro Runtime)

Architecture & Platform Support:

Maintainer: Bytecode Alliance
License: Apache 2.0 with LLVM exception
Supported Architectures:
- ARM/THUMB (Cortex-M7, Cortex-A15 tested)
- AArch64 (Cortex-A57, Cortex-A53 tested)
- RISC-V (64-bit and 32-bit, LP64 and LP64D configurations)
- X86-32/64, XTENSA, MIPS, ARC

Memory Footprint (Cortex-M4F Configuration):

Fast Interpreter:  ~58.9 KB
Classic Interpreter: ~56.3 KB
AOT Runtime:       ~29.4 KB
WASI Library:      ~21.4 KB
Built-in libc:      ~3.7 KB

Execution Modes:

Classic Interpreter (CI)
- Small footprint, low memory consumption
- Relatively slow execution
- Required for source-level debugging
- Best for: Development and severely resource-constrained systems
Fast Interpreter (FI)
- ~2x faster than classic interpreter
- Pre-compiles WebAssembly opcodes to internal opcodes
- 150% performance improvement on CoreMark
- 42% reduction in generated instructions
- 30% increased memory consumption
- Cannot coexist with other engines in same binary
- Best for: Embedded systems needing speed without JIT overhead
AOT (Ahead-of-Time Compilation)
- Nearly native speed (50-79% of native on embedded)
- Very small footprint
- Quick startup
- Uses LLVM backend for optimization
- Self-contained module loader for Linux, Windows, macOS, Android, SGX, MCU
- Best for: Production environments requiring optimal performance
JIT (Just-in-Time Compilation)
- Fast JIT: Small footprint, quick startup, good performance
- LLVM JIT: Best execution speed, longer compilation time
- Multi-tier JIT: Supports dynamic tier-up from Fast to LLVM JIT
- Best for: Long-running applications with adequate resources

Embedded-Specific Features:

XIP (Execute In Place): Run AOT files directly from ROM/flash
Indirect function calls: Reduces relocations for XIP mode
Configurable libc: Minimal built-in subset or full WASI
Threading: Pthread APIs and wasi-threads support
Socket support: Berkeley/POSIX socket implementations
RTOS Integration: Zephyr, RT-Thread, ESP-IDF, FreeRTOS, NuttX

Fast Interpreter Optimizations:

Register-based conversion: Stack-based bytecode converted to register operations
- "Register-based architecture requires 47% less executed VM instructions"
Fast bytecode dispatching: Pre-resolved handler addresses during loading (~7% improvement)
Bytecode fusion: Eliminates redundant stack operations
Pre-decode LEB128: Integer decoding once during loading

Performance Benchmarks:

X86-64 Platform (Intel i7-7700):

Matrix: WAMR-AOT is 22x faster than wasm3
CoreMark: WAMR-AOT delivers 8.79x better scores than wasm3
Native comparison: WAMR-AOT achieves 68-79% of native performance

ARM Cortex-M7 (Zephyr OS):

Matrix: WAMR-AOT runs 30x faster than wasm3
Gimli: 19x performance advantage for WAMR-AOT
Fast interpreter: 1.65-2.03x faster than classic variant

ARM32 (AllWinner V3S MCU) - CoreMark:

Interpreter mode: 32 CoreMark
AOT compilation: 611 CoreMark
Native performance: 1157 CoreMark
AOT achieves ~50% of native performance

RISC-V 32-bit (ESP32 C3):

Performance: ~50% of native in AOT mode
Code size: AOT reduces size by ~25%

Memory Consumption (CoreMark workload):

WAMR Classic:    365 KB
WAMR Fast:       485 KB
wasm3:           514 KB

1.2 wasm3

Architecture & Philosophy:

Design: Interpreter-based approach (no JIT)
Philosophy: Prioritizes size, portability, and security over raw speed
Minimum Requirements:
- Code footprint: ~64 KB
- RAM requirement: ~10 KB

Platform Support:

x86, x86_64, ARM, RISC-V, PowerPC, MIPS, Xtensa, ARC32
MCUs: Arduino, ESP8266, ESP32
SBCs: Raspberry Pi, Orange Pi
Mobile platforms, browsers, routers

Features:

WebAssembly spec compliance with partial WASI
Linear memory limits under 64KB support
Custom page sizes for memory optimization
Gas metering for resource-controlled execution
Self-hosting capabilities
Available as Arduino library

Performance:

Significantly slower than AOT runtimes
10x slowdown vs native on Cortex-M (compared to aWsm's ~40% slowdown)
Trades speed for "easy to compile and integrate" characteristics

Use Case: Best for severely resource-constrained devices where JIT is unavailable or impractical

1.3 aWsm (Awasm)

Design Approach:

AOT compilation using LLVM
Focuses on generating fast code, simplicity, portability
Implements Software Fault Isolation (SFI) and Control-Flow Integrity (CFI)

Platform Support:

x86-64, aarch64 (Raspberry Pi), thumb (ARM Cortex-M4 and M7)

Performance on Cortex-M (PolyBench benchmarks):

Cortex-M7: 40.2% slowdown vs native
Cortex-M4: 24.9% slowdown vs native
Microprocessors: Within 10% of native
Microcontrollers: Within 40% of native

Optimizations:

Configurable page sizes (supports sub-64KB pages)
Selective linking (avoids expensive f32/f64 operations)
Minimal runtime footprint (<5K lines of C)
Seven distinct bounds checking approaches

Memory Capabilities:

Can run on systems with only 64-128KB SRAM

1.4 Other Embedded Runtimes

wasmi:

Strong embedding support
Interpreter-only execution
Rust-based implementation

Bobbin-wasm:

Written in Rust
#[no_std], allocation-free
Designed for ARM Cortex-M SoCs

wasmtime/wasmer:

Weak embedded support
Primarily desktop/server focused
Multiple backend options but larger footprints

1.5 Runtime Comparison Summary

Runtime	Embedding	Execution Modes	Best For
WAMR	Excellent	Interpreter, Fast Interpreter, AOT, JIT	Production embedded, IoT, edge
wasm3	Excellent	Interpreter only	Severely constrained devices
aWsm	Good	AOT only	Performance-critical embedded
wasmi	Good	Interpreter only	Rust-based embedded projects
wasmtime	Poor	JIT + AOT	Server/desktop environments
wasmer	Poor	Multiple backends	Server/desktop environments

2. Memory Management Optimizations (MMU/MPU Usage)

2.1 WebAssembly's Software-Based Memory Safety

Core Approach:

Each WebAssembly module executes within a sandboxed environment
Fault isolation techniques separate modules from host runtime
Software-based bounds checking by default
Applications execute independently and cannot escape sandbox

Limitations:

WebAssembly's sandboxing does not inherently provide memory safety for applications written in unsafe languages (C/C++)
Traditional approach relies on software checks, not hardware protection

2.2 ARM Cortex-M MPU Integration

MPU (Memory Protection Unit) Characteristics:

Trimmed-down version of MMU
Provides only memory protection support (no virtual memory)
Common in low-power processors
Ideal for sandboxing untrusted code (third-party applications)

OmniWasm Project:

Target: ARM Cortex-M7 (216 MHz processor)
Approach: Novel bounds checking mechanism leveraging MPU hardware
Features:
- Software Fault Isolation (SFI): Ensures loads/stores stay within sandbox
- Control-Flow Integrity (CFI): Prevents execution hijacking
- Granular fault isolation for legacy C/C++ code
Challenges:
- MPU usage complicated by need for interleaved memory instructions
- CFI runtimes require access to both runtime data structures and sandbox memory

Technical Challenges:

Most embedded systems lack MMU (hardware virtual memory)
Tiny IoT devices may not have hardware necessary for full Linux OS
MPU provides limited protection compared to full MMU
Requires careful integration with CFI runtime metadata access

2.3 Hardware-Accelerated Memory Protection

Cage Research (ARM MTE/PAC):

Uses ARM's Memory Tagging Extension (MTE)
Implements Pointer Authentication Codes (PAC)
Ensures memory safety at runtime
Works with unmodified C/C++ programs compiled to WebAssembly
Hardware-accelerated safe WebAssembly execution

Key Insight: WebAssembly traditionally relies on software-based sandboxing rather than hardware MPU/MMU features, but recent research explores hardware acceleration for enhanced memory protection in embedded environments.

2.4 Memory Layout Considerations

WebAssembly Memory Model:

Stack pointer stored at address 4
Stack allocated early in program
Malloc implementation must avoid allocating over stack
No guard pages currently (stack overflow can clobber heap)
Requires explicit stack checks in generated code

Embedded Challenges:

64KB minimum page size too large for many embedded systems
Some devices have only 64KB total memory
Requires patching LLVM to reduce page sizes (down to 1 byte)
Memory regions allocated in 64KB multiples cause unused memory waste

Stack Size Configuration:

Clang WebAssembly linker allows static stack size setting
Typical embedded configurations: 32KB stack
Must balance between adequate space and memory constraints
No simple rule for determining requirements (depends on RTOS, compilation options)

2.5 Memory Footprint Measurements

RISC-V 32-bit (ESP32 C3):

Interpreter:      94,928 bytes code + 2,068 bytes data
Fast Interpreter: 103,418 bytes code + 2,076 bytes data
AOT mode:         72,040 bytes code + 1,732 bytes data

Code Size Comparison:

WASM bytecode: 10.5 KB (CoreMark)
Native binary: 23 KB (CoreMark)
WebAssembly demonstrates significant size savings

3. Direct Function Calls and Linking Optimizations

3.1 AOT Compilation and Linking

WAMR AOT Process:

Uses wamrc tool to compile WebAssembly bytecode to native machine code
Leverages LLVM backend for optimization
Self-implemented AOT module loader for cross-platform support
Works on Linux, Windows, macOS, Android, SGX, MCU systems

Direct vs Indirect Calls:

Direct calls: Better performance but require relocations
Indirect calls: Required for XIP (Execute In Place) mode
AOT functions look up function pointers from table in exec_env
Tradeoff between performance and ROM/flash execution capability

3.2 XIP (Execute In Place) Optimization

Purpose:

Run AOT files directly from read-only memory (ROM/flash)
Reduces memory consumption
Solves lack of executable memory issue on some devices

Implementation Strategy:

Indirect Function Mode:
- Functions don't call each other directly
- Look up function pointers from table passed via exec_env
- Minimizes relocations needed
LLVM Intrinsic Replacement:
- Eliminates calls to LLVM intrinsic functions
- Replaces with runtime-implemented alternatives
- Example: aot_intrinsic_fadd_f32 instead of llvm.experimental.constrained.fadd.f32

AOT File Generation for XIP:

# Generic XIP
wamrc --xip -o output.aot input.wasm

# ARM Cortex-M55 (with FPU)
wamrc --target=thumbv8m.main --cpu=cortex-m55 --xip \
      --enable-builtin-intrinsics=i64.common

# ARM Cortex-M3 (no FPU)
wamrc --target=thumbv7m --cpu=cortex-m3 --xip \
      --enable-builtin-intrinsics=i64.common,fp.common,fpxint

Tuning Options:

--enable-indirect-mode: Use indirect function calls
--disable-llvm-intrinsics: Replace intrinsics with runtime functions
--enable-builtin-intrinsics=<list>: Customize based on hardware capabilities

Known Limitations:

Some relocations to .rodata sections may still require code patching
Future work needed for complete read-only execution

3.3 Function Call Optimization Techniques

Profile-Guided Optimization:

Build profile-guided inliner on top of AOT compiler
Profile WebAssembly indirect calls
Inline most frequent call targets
Can achieve 20% reduction in execution time for compute-intensive loops

Link-Time Optimization (LTO):

Optimizations across different source files
Better function inlining
Dead code elimination
Requires compilation flag support

WebAssembly Runtime Library Linking:

Compiled modules linked against target library (e.g., libwart.a)
End-to-end compiler workflow:
1. Run wat and LLVM's llc to create object file
2. Link against WebAssembly runtime library

3.4 Static Linking and Module Merging

Multi-Memory Support:

Tools can merge multiple WebAssembly modules into one (static linking)
Previously failed when modules defined >1 memory
Multi-memory proposal closes this gap
Enables better linking and optimization opportunities

4. Multi-Memory Proposals and Embedded Use Cases

4.1 Multi-Memory Proposal Overview

Status: Phase 4 (W3C process)

Core Feature:

Ability to use multiple memories within single WebAssembly module
Removes single-memory limitation

4.2 Embedded-Relevant Use Cases

1. Security (Memory Isolation):

Public memory: Shared with outside for data exchange
Private memory: Kept encapsulated inside module
Critical for embedded systems running untrusted code
Prevents data leakage between security domains

2. Threading Isolation:

Shared memory: Used between multiple threads
Thread-local memory: Used in single-threaded manner
Beneficial even within single module
Reduces contention and synchronization overhead

3. Linking Multiple Modules:

Static linking tools can now merge modules with multiple memories
Previously impossible when modules defined >1 memory
Closes gap in toolchain capabilities
Enables better code organization

4. Scaling Beyond 4GB:

32-bit address space limitation workaround
Multiple memories provide efficient scaling
Important for data-intensive embedded applications
Bridge until 64-bit memories available

5. Polyfilling Advanced Features:

Garbage collection emulation
Interface types emulation
Auxiliary memory distinct from module's address space
Enables advanced features on current WebAssembly

4.3 Embedded Systems Context

Environments:

Can be embedded in many different environments
Compiled on all modern architectures
Desktop, mobile, embedded systems alike
Multiple memories enhance portability and flexibility

Implementation Status:

Supported in modern runtimes (WAMR, Wasmtime, etc.)
Chrome shipped support (Intent to Ship declared)
Firefox implementation in progress
Enabled in LLVM backend

5. Code Size and Performance Optimizations

5.1 Code Size Optimization Strategies

Compilation Flags:

-Os and -Oz: Geared towards smaller code size
-O2 and -O3: Focus on speed
Link Time Optimization (LTO): Cross-file optimizations

Language-Specific Considerations:

Rust: Can produce very small WebAssembly (2KB compressed achievable)
C/C++: Smaller initial binary sizes, more control over memory
High-level languages: Larger runtime overhead

Compression:

WebAssembly compresses very well via gzip/brotli
Can significantly reduce apparent bloat
Important for network transfer in OTA updates

Dead Code Elimination:

Remove unused functions and data
LTO enables better dead code detection
Critical for embedded where every byte counts

5.2 Performance Optimization Techniques

Interpreter Optimizations (WAMR Fast Interpreter):

Stack-to-Register Conversion:
- 47% fewer executed VM instructions
- Simulates execution during preprocessing
- Calculates slot IDs instead of evaluating values
Bytecode Fusion:
- Combines related operations
- Example: get_local, i32.const, i32.add, set_local → 2 fused ops
- Eliminates redundant stack manipulation
Fast Bytecode Dispatching:
- Pre-resolve handler addresses at load time
- ~7% performance improvement on CoreMark
Pre-Decode LEB128:
- Decode integers once during loading
- Small integers (<255): no size overhead
- Larger constants: pooled with 16-bit indexing

Execution Frame Structure:

Constant space: Pre-calculated values
Local space: Function local variables
Dynamic space: Intermediate computation values
Preserve space: Original values when locals modified before consumption

AOT Optimization with LLVM:

Full LLVM optimization pipeline available
Platform-specific code generation
Sophisticated optimizations missed by source compilers can be applied
Target-specific instruction selection

5.3 Performance vs Native Comparison

Research Findings (General WebAssembly):

Average slowdown: 45% (Firefox) to 55% (Chrome) vs native
Peak slowdowns: up to 2.5x
Design constraints cause overhead:
- Stack overflow checks
- Indirect call checks
- Reserved registers

Embedded Specific (AOT Compilation):

WAMR AOT: 50-79% of native performance
aWsm on Cortex-M7: 40.2% slowdown
aWsm on Cortex-M4: 24.9% slowdown
Generally acceptable for embedded use cases

Interpreter Performance:

WAMR Fast Interpreter: ~150% improvement over classic
wasm3: >10x slowdown vs native
Generally too slow for real-time embedded applications

5.4 Embedded-Specific Performance Challenges

Resource Constraints:

Software WebAssembly execution involves interpretation, JIT, profiling
On resource-constrained devices, overhead exceeds actual computation
Runtime costs more significant than on desktop systems

WebAssembly Specification Limitations:

64KB pages too large (devices may have only 64KB total memory)
No separation of RO and RW memory
Prevents optimizations essential for density
Lacks i8/i16 types (only i32/i64)
Mandatory 64-bit arithmetic wasteful on 8/16-bit hardware

Memory Overhead:

Modules may need own runtime (memory allocator)
Increases module size and memory usage
64KB page alignment causes unused memory
Tasks requiring less still allocated full pages

5.5 Benchmark Results Summary

PolyBench/CoreMark Comparisons:

WAMR on ARM32 (CoreMark):

Native: 1157
AOT: 611 (52.8% of native)
Interpreter: 32 (2.8% of native)

WAMR on x86-64:

Matrix: AOT 22x faster than wasm3
CoreMark: AOT 8.79x faster than wasm3

aWsm on Cortex-M (PolyBench):

M7: 59.8% of native (40.2% slowdown)
M4: 75.1% of native (24.9% slowdown)

Code Size:

WebAssembly: Often smaller than native (10.5KB vs 23KB for CoreMark)
AOT: 25% smaller than interpreter on RISC-V
Compression further improves ratios

6. Real-Time Constraints and Deterministic Behavior

6.1 Real-Time Performance Characteristics

WAMR Real-Time Capabilities:

Meets many real-time use cases
Predictable and efficient performance
Minimal jitter
AOT compilation can outperform native GCC-compiled code in some cases

Memory Safety Without GC:

WebAssembly ensures memory safety without garbage collection
Critical for real-time systems
GC introduces latency and unpredictability
WebAssembly's linear memory model is deterministic

6.2 Determinism Considerations

Threading and Non-Determinism:

WebAssembly originally had no threads
No non-determinism from concurrent memory access
Recent thread support requires careful handling
WAMR supports pthread APIs and wasi-threads

Execution Determinism:

WebAssembly semantics are fully deterministic
Same input always produces same output (without threads)
Important for safety-critical embedded systems
Reproducible behavior aids debugging

6.3 Hardware Acceleration for Real-Time

WebAssembly Hardware Accelerator:

Platform: Altera Cyclone IV FPGA (DE2-115 board)
Design: Verilog HDL implementation
Performance: Up to 142x speedup for selected algorithms
Clock: 50 MHz on FPGA (ASIC could run much faster)
Resource Usage: 6,246 LUTs, 1,563 registers

Benefits:

Bypasses interpretation and JIT compilation overhead
Direct bytecode execution in hardware
Massive performance boost for compute-intensive tasks
Minimal hardware overhead for integration

Limitations:

FPGA-specific implementation
Limited to specific instruction subset
Not general-purpose solution
Best for specialized workloads

6.4 RTOS Integration

Supported Real-Time Operating Systems:

FreeRTOS: Lightweight, traditional embedded RTOS
Zephyr: Modern, feature-rich, open collaboration
ThreadX: Commercial RTOS option
NuttX: Apache-licensed RTOS
RT-Thread: Chinese open-source RTOS

Zephyr Integration (Ocre Project):

OCI-like application containers
1,000x lighter than Linux containers (Docker/Podman)
Built as Zephyr module
Easy integration with existing firmware
Supports OTA updates via WebAssembly modules

Example Deployment:

Nordic nRF52840 microcontroller running WAMR on Zephyr
Portability: Same WebAssembly binary runs on:
- Microcontroller
- Cloud servers
- Web browsers
Demonstrates "write once, run anywhere" for embedded

6.5 Resource Requirements for RTOS

Minimum Requirements:

WAMR footprint: As small as 50KB
RAM: Can run in systems with 64-128KB SRAM
Various RTOS options for tiny IoT devices
Real-time computation, memory management, networking support

RT-Thread Example:

Platform: ARM Cortex-M4 (120MHz)
RAM: 640KB
Runtime: WAMR
Demonstrates feasibility on modest hardware

6.6 Challenges for Hard Real-Time

Timing Predictability:

AOT provides most predictable timing
Interpreter has variable execution times
JIT introduces compilation delays
Hardware acceleration offers best determinism

Memory Allocation:

Dynamic allocation can cause unpredictability
WebAssembly linear memory pre-allocated
No garbage collection pauses
Fixed-size stack and heap preferred

Interrupt Handling:

RTOS integration must handle interrupts properly
WebAssembly isolation may add latency
Critical paths may need native implementation
Hybrid approach often necessary

7. ARM Cortex-M Specific Optimizations

7.1 Architecture-Specific Code Generation

WAMR Cortex-M Support:

Tested Platforms:
- ARM Cortex-M7 (ARMV7)
- ARM Cortex-M4 (THUMB)
- ARM Cortex-A15 (ARMV7)

Compiler Targets:

# Cortex-M55 with FPU
--target=thumbv8m.main --cpu=cortex-m55

# Cortex-M7 with FPU
--target=thumbv7em --cpu=cortex-m7

# Cortex-M4 with FPU
--target=thumbv7em --cpu=cortex-m4

# Cortex-M3 (no FPU)
--target=thumbv7m --cpu=cortex-m3

7.2 FPU Handling

With FPU Support:

Can use hardware floating-point operations
Faster f32/f64 operations
Enable with: --enable-builtin-intrinsics=i64.common

Without FPU Support:

Software floating-point emulation required
Significant performance penalty
Enable with: --enable-builtin-intrinsics=i64.common,fp.common,fpxint

7.3 Cortex-M Memory Protection (MPU)

MPU Features:

8-16 programmable regions (depending on variant)
Region size must be power of 2
Minimum region size varies (32 bytes to 256 bytes)
Access permissions: Read/Write/Execute
Useful for sandboxing third-party code

WebAssembly Integration:

OmniWasm leverages MPU for bounds checking
Efficient granular fault isolation
CFI metadata access challenges
Requires careful memory layout planning

7.4 Thumb Instruction Set

Advantages:

16-bit instruction encoding
Reduced code size (important for flash-constrained devices)
Lower memory bandwidth requirements
Power efficiency

WAMR Support:

Full THUMB instruction set support
AOT compiler generates Thumb code
Optimized for code density
Performance comparable to 32-bit ARM mode

7.5 Performance Results on Cortex-M

WAMR Benchmarks:

Cortex-M7 (Zephyr OS, -Os optimization):

Matrix: AOT 30x faster than wasm3
Gimli: AOT 19x faster than wasm3
Fast interpreter: 1.65-2.03x faster than classic

Cortex-M4F Configuration:

Binary sizes: 29.4KB (AOT) to 58.9KB (Fast Interpreter)
Acceptable performance for most embedded use cases

aWsm Benchmarks:

Cortex-M7: 40.2% slowdown vs native
Cortex-M4: 24.9% slowdown vs native
PolyBench suite used for testing

7.6 Cortex-M Memory Constraints

Typical Configurations:

Flash: 256KB to 2MB
RAM: 64KB to 512KB
Some variants: As low as 32KB RAM

WebAssembly Challenges:

64KB page size problematic
Stack + heap + module must fit in limited RAM
XIP mode critical for flash execution
AOT preferred for size/performance balance

7.7 ARM-Specific Optimizations

WebAssembly Bitmask Operations:

ARM community has documented specific optimizations
Efficient implementation of WebAssembly SIMD bitmask operations
Leverages AArch64 instruction set features

Memory Tagging Extension (MTE):

Available on ARMv8.5-A and later
Cage project uses MTE for memory safety
Hardware-accelerated bounds checking
Not available on Cortex-M (Cortex-A only)

8. RISC-V WebAssembly Implementations

8.1 RISC-V Platform Support

WAMR RISC-V Support:

64-bit: Full support (RISC-V LP64 and LP64D)
32-bit: Interpreter only
Tested on various RISC-V SoCs
WALI implementation supports riscv-64 host ISA

Wasmer RISC-V Support (v3.2+):

Linux RISC-V support
LLVM compiler backend
Cranelift compiler backend
Enables WebAssembly on RISC-V servers and embedded

8.2 Performance on RISC-V

ESP32-C3 (RISC-V 32-bit):

WebAssembly achieves ~50% of native performance
Performance gap linked to portability/isolation overhead
AOT mode: ~50% of native on CoreMark
Acceptable for many embedded use cases

Memory Footprint (ESP32-C3):

Interpreter:      94,928 bytes code + 2,068 bytes data
Fast Interpreter: 103,418 bytes code + 2,076 bytes data
AOT mode:         72,040 bytes code + 1,732 bytes data

8.3 RISC-V vs WebAssembly Comparison

Similarities:

Both are open ISAs
Both prioritize simplicity and modularity
Both support multiple privilege levels
Both have extensible design

Differences:

RISC-V is hardware ISA, WebAssembly is virtual ISA
RISC-V has physical memory model, WebAssembly has linear memory
RISC-V is closer to hardware, WebAssembly is higher abstraction
WebAssembly provides stronger isolation guarantees

Complementary Nature:

WebAssembly can run on RISC-V
RISC-V can host WebAssembly runtimes
Both benefit from open ecosystem
Together enable open software/hardware stack

8.4 RISC-V Embedded Applications

Use Cases:

IoT devices (ESP32-C3 example)
Edge computing nodes
Secure processing elements
Upgradeable firmware via WebAssembly

WALI Deployment:

Tested on 24 diverse edge devices
10 resource-constrained single-board computers
Demonstrates WebAssembly viability on RISC-V edge
Thin kernel interfaces for efficiency

8.5 RISC-V Optimization Opportunities

Instruction Set Extensions:

Custom extensions possible
Could accelerate WebAssembly operations
B extension (bit manipulation) useful for WebAssembly
V extension (vector) for SIMD support

Compiler Optimizations:

LLVM RISC-V backend improving
Better code generation for RISC-V targets
AOT compilation leverages RISC-V features
Ongoing optimization work in LLVM community

8.6 RISC-V Development Tools

Emulators and Simulators:

RISC-V emulators written in Rust+WebAssembly
WebAssembly-based RISC-V simulators for education
Browser-based RISC-V development environments
Cross-platform development workflows

Example Projects:

riscv-rust: RISC-V emulator in Rust+WebAssembly
rvemu: RISC-V emulator for CLI and Web
Enables RISC-V software development in browsers
WebAssembly and RISC-V mutual ecosystem support

9. Current State of Embedded WebAssembly (2024-2025)

9.1 Standardization Progress

WASI (WebAssembly System Interface):

WASI 0.2 (Preview 2): Released January 25, 2024
Component Model: Integrated with WASI 0.2
WASI 0.1: Still widely used in production
Embedded-specific APIs: In development
- wasi-i2c: I2C protocol interface
- USB interfaces
- GPIO and hardware control

WebAssembly Proposals:

Multi-memory: Phase 4 (standardized)
Reference types: Phase 4 (standardized)
Garbage collection: Phase 4 (standardized in 2024)
Threads: Available in major runtimes
SIMD: Fixed-width 128-bit SIMD standardized
Exception handling: In progress

9.2 Component Model Impact

Key Benefits:

Language-agnostic composition
Modular, portable, compositional interfaces
Mix and match languages in single application
Focus on problem-solving vs boilerplate
WIT (WebAssembly Interface Types) Bindgen tooling

Embedded Relevance:

Small binary size maintained
Low memory footprint
Deterministic execution preserved
Early support for constrained environments
Viable for IoT and embedded devices

Practical Status (2024):

Moving from theory to practice
WIT Bindgen production-ready
Real-world deployments emerging
Tooling ecosystem maturing

9.3 Industry Adoption

Embedded UI Development:

Qt exploring WebAssembly for embedded systems
Cross-platform UI development
Single codebase for multiple targets
Reduced development and maintenance costs

Industrial Automation:

Attraction for embedded industrial software
Safety-critical systems exploration
Predictable behavior important
Update/upgrade flexibility valued

IoT and Edge:

Lightweight, efficient, secure runtime
Perfect for resource-limited devices
Platform-independent deployment
OTA update capabilities

9.4 Research and Development (2024)

Recent Publications:

"Potential of WebAssembly for Embedded Systems" (ArXiv, 2024)
"Hardware-Based WebAssembly Accelerator" (Electronics, 2024)
"Benchmarking WebAssembly for Embedded Systems" (ACM TACO, 2024)
"Cyber-physical WebAssembly" (ArXiv, 2024)

Active Research Areas:

Hardware acceleration (FPGA/ASIC)
Memory protection integration (MPU/MTE)
Real-time guarantees
Code size reduction
Performance optimization for constrained devices

9.5 Tooling Ecosystem

Compilers:

LLVM: Primary backend for AOT compilation
Emscripten: C/C++ to WebAssembly
wasm-pack: Rust to WebAssembly
TinyGo: Go subset for embedded WebAssembly
AssemblyScript: TypeScript-like language

Runtimes (Embedded Focus):

WAMR: Most feature-complete for embedded
wasm3: Smallest footprint interpreter
wasmi: Rust-based embedded runtime
WasmEdge: Edge computing focus
aWsm: Performance-focused AOT

Development Tools:

wamrc: WAMR AOT compiler
wasm-objdump: Inspect WebAssembly binaries
wasm-opt: Optimize WebAssembly modules
WIT Bindgen: Component model tooling

9.6 Remaining Challenges

Specification Issues:

64KB page size too large for deeply embedded
Lack of i8/i16 types (only i32/i64)
No RO/RW memory separation in spec
Community discussion ongoing (GitHub issue #899)

Performance Gaps:

45-55% slowdown vs native (general WebAssembly)
25-50% slowdown on embedded (AOT compilation)
Interpreter mode too slow for many real-time tasks
Stack overflow checks add overhead

Memory Overhead:

Module runtime requirements
64KB page alignment waste
Stack + heap sizing challenges
Limited by 32-bit address space

Tooling Gaps:

Embedded-specific profiling tools
Real-time debugging capabilities
Size optimization toolchains
Hardware-specific optimizations

9.7 Future Outlook

Short Term (2025-2026):

Better WASI embedded APIs
Improved tooling for size optimization
More RTOS integrations
Component model adoption in embedded

Medium Term (2027-2028):

Hardware acceleration becoming practical
Custom memory page sizes in spec
i8/i16 type support
Enhanced real-time guarantees

Long Term (2029+):

WebAssembly as standard embedded runtime
Hardware WebAssembly accelerators in SoCs
Mature safety-critical certifications
Dominant platform for embedded software

9.8 Recommendations for Adoption

When to Use WebAssembly in Embedded:

✅ Need for portability across platforms
✅ Secure sandboxing of untrusted code
✅ Over-the-air updates and flexibility
✅ Multi-language support required
✅ Moderate performance requirements (50%+ of native acceptable)
✅ Memory available: >128KB RAM, >256KB flash

When to Avoid:

❌ Hard real-time requirements (<1ms jitter)
❌ Need >95% of native performance
❌ Severely constrained: <64KB RAM
❌ Safety-critical certified code required (not yet certified)
❌ Heavy floating-point on non-FPU systems

Best Practices:

Use AOT compilation for production
Enable XIP for flash-constrained systems
Profile and optimize module size
Consider hybrid approach (WebAssembly + native)
Test on target hardware early
Use Fast Interpreter for development, AOT for production

10. Performance Benchmarks and Case Studies

10.1 Benchmark Suites Used

CoreMark:

Industry-standard CPU benchmark
Measures processor and compiler performance
List processing, matrix manipulation, state machine, CRC
Single-number score for comparison
Widely used in embedded systems

PolyBench:

30 numerical computation benchmarks
Linear algebra, image processing, physics simulation
Static control flow
Mathematical operations focus
Good for WebAssembly evaluation but not fully representative

Dhrystone:

Older benchmark (being replaced by CoreMark)
More compiler benchmark than hardware
Still used in some embedded contexts
Less relevant for modern evaluation

10.2 Comprehensive Benchmark Results

WAMR Performance Summary:

Platform: X86-64 (Intel i7-7700, Ubuntu 18.04, GCC O3)

Workload	Native	WAMR AOT	WAMR Fast	WAMR Classic	wasm3
Matrix	100%	68-79%	~35%	~20%	3-4%
CoreMark	100%	68-79%	~40%	~25%	8-9%

Platform: ARM Cortex-M7 (Zephyr OS, -Os optimization)

Workload	WAMR AOT	WAMR Fast	wasm3
Matrix	30x	~10x	1x
Gimli	19x	~6x	1x

Platform: ARM32 (AllWinner V3S MCU, CoreMark)

Native: 1157 CoreMark
WAMR AOT: 611 CoreMark (52.8%)
WAMR Interpreter: 32 CoreMark (2.8%)

Platform: RISC-V 32-bit (ESP32-C3)

AOT: ~50% of native performance
Interpreter: ~20-25% of native performance

aWsm Performance (PolyBench):

ARM Cortex-M7:

Native: 100%
aWsm AOT: 59.8% (40.2% slowdown)
wasm3: <10% (>10x slowdown)

ARM Cortex-M4:

Native: 100%
aWsm AOT: 75.1% (24.9% slowdown)
wasm3: <10%

10.3 Memory Consumption Benchmarks

Runtime Binary Sizes (Cortex-M4F):

WAMR Components:
  AOT Runtime:          29.4 KB
  Classic Interpreter:  56.3 KB
  Fast Interpreter:     58.9 KB
  WASI Library:         21.4 KB
  Built-in libc:         3.7 KB

Module Sizes (CoreMark):

WebAssembly bytecode:  10.5 KB
Native ARM binary:     23.0 KB
Savings:               54.3%

Peak Memory Usage (CoreMark workload):

WAMR Classic:    365 KB
WAMR Fast:       485 KB
wasm3:           514 KB

RISC-V Memory Footprint (ESP32-C3):

Runtime          Code Size    Data Size    Total
Interpreter      94,928 B     2,068 B      96,996 B
Fast Interp.     103,418 B    2,076 B      105,494 B
AOT              72,040 B     1,732 B      73,772 B

10.4 Real-World Case Studies

Case Study 1: Vision-Based IoT Sensors

Application: Deep learning inference pipeline on edge devices

Architecture:

Image signal processor → raw sensor input
DNN inference → object detection
Output normalization
Configurable business logic

WebAssembly Benefits:

Each stage as isolated applet
Over-the-air programmability
Platform independence
Security isolation

Results:

Successful deployment on ARM Cortex-M
Acceptable performance with AOT
Flexible update mechanism
Reduced development time

Case Study 2: Nordic nRF52840 with Zephyr (Ocre)

Platform: Nordic nRF52840 Microcontroller

Configuration:

CPU: ARM Cortex-M4F @ 64 MHz
Flash: 1 MB
RAM: 256 KB
RTOS: Zephyr

Implementation:

WAMR runtime integrated as Zephyr module
WebAssembly application modules
OTA update capability

Portability Demonstration:

Same WebAssembly binary runs on:
- nRF52840 microcontroller
- Cloud servers (x86-64)
- Web browsers
True "write once, run anywhere"

Metrics:

Runtime footprint: ~60 KB
Application modules: 5-50 KB each
Update time: <1 second
Performance: Acceptable for sensor processing

Case Study 3: RT-Thread on Cortex-M4

Platform: ARM Cortex-M4 @ 120 MHz, 640 KB RAM

Runtime: WAMR on RT-Thread RTOS

Applications:

Sensor data processing
Communication protocols
Business logic modules

Results:

Modular application architecture
Easy addition of new features
Third-party code sandboxing
Successful production deployment

10.5 Hardware Accelerator Case Study

FPGA WebAssembly Accelerator

Platform: Altera Cyclone IV FPGA (DE2-115)

Design:

Verilog HDL implementation
Direct WebAssembly bytecode execution
Hardware instruction decoder
Integrated with ARM processor

Resources:

6,246 LUTs
1,563 registers
50 MHz clock (FPGA limitation)

Performance:

Up to 142x speedup for selected algorithms
Compute-intensive operations benefit most
Memory-bound operations see less benefit

Conclusions:

Hardware acceleration viable for critical paths
FPGA proves concept; ASIC would be faster
Hybrid approach (software + hardware) optimal
Cost-benefit analysis needed per application

10.6 Benchmark Analysis and Insights

Performance Patterns:

AOT vs Interpreter: 5-30x performance difference
WebAssembly vs Native: 25-50% overhead on embedded (AOT)
Fast Interpreter: 2-3x improvement over classic
Platform Dependency: Better results on more powerful cores

Memory Patterns:

Binary Size: WebAssembly often smaller than native
Runtime Overhead: 30-100% memory increase for runtime
Module Caching: Benefits repeated execution
AOT Efficiency: Best size/performance balance

Code Size Optimizations:

Compression (gzip/brotli): 60-80% reduction
Dead code elimination: 10-30% reduction
LTO: 5-15% additional reduction
Language choice: Significant impact (Rust smaller than C++ with STL)

Practical Takeaways:

AOT essential for production embedded use
Fast Interpreter good for development
Hardware acceleration worthwhile for compute-heavy workloads
Memory constraints more challenging than performance
WebAssembly overhead acceptable for 50%+ use cases

11. Key Findings and Recommendations

11.1 Runtime Selection Guide

Requirement	Recommended Runtime	Rationale
Production embedded, performance critical	WAMR (AOT mode)	Best performance, small footprint, XIP support
Severely constrained (<64KB RAM)	wasm3	Smallest footprint, simple integration
Development/debugging	WAMR (Classic Interpreter)	Debugging support, reasonable performance
Rust ecosystem	wasmi	Native Rust, no_std support
Maximum performance on Cortex-M	aWsm	Excellent performance, mature SFI/CFI
RTOS integration (Zephyr)	WAMR	Native Zephyr module support

11.2 Optimization Priorities

For Code Size:

Use AOT compilation (25% smaller than interpreter)
Enable LTO and size optimizations (-Os/-Oz)
Dead code elimination
Choose size-efficient language (Rust > C > C++)
Compress for OTA updates (gzip/brotli)

For Performance:

Always use AOT for production (>5x faster than interpreter)
Enable LLVM optimizations (-O2/-O3 during AOT)
Profile-guided optimization where available
Consider hardware acceleration for critical paths
Use Fast Interpreter for development balance

For Memory:

Minimize runtime features (disable unneeded WASI)
Use stack-based allocation where possible
Pre-allocate linear memory to exact needs
XIP mode for flash-constrained systems
Share runtime across multiple modules

11.3 Platform-Specific Recommendations

ARM Cortex-M:

Use WAMR with AOT compilation
Enable XIP for flash execution
Specify exact CPU variant for optimal code generation
Configure FPU intrinsics appropriately
Consider aWsm for maximum performance
Leverage MPU for additional isolation (OmniWasm approach)

RISC-V:

WAMR best supported (interpreter + AOT)
AOT achieves ~50% native performance
64-bit RISC-V preferred (better support)
32-bit limited to interpreter in most runtimes
Watch for improved LLVM RISC-V backend optimizations

Cortex-A (Application Processors):

Can use JIT compilation
More memory available for runtime
LLVM JIT provides best performance
Consider multi-tier JIT for balanced startup/runtime

11.4 Use Case Recommendations

IoT Sensors:

✅ Excellent fit
Use AOT for efficiency
OTA updates via WebAssembly modules
Sandboxing for third-party code

Industrial Control:

⚠️ Depends on real-time requirements
AOT for predictable timing
Hybrid approach (critical paths native)
Thorough testing required

Automotive Embedded:

⚠️ Promising but immature
Await safety certifications
Consider for non-critical subsystems
Monitor standardization progress

Consumer Devices:

✅ Good fit
Flexibility for feature updates
Cross-platform development savings
App ecosystem potential

Edge AI:

✅ Excellent fit
Isolated inference workloads
Model updates without firmware change
Reasonable performance overhead acceptable

11.5 Technical Recommendations

Memory Management:

Avoid garbage collection languages for hard real-time
Pre-allocate linear memory to avoid growth
Use multi-memory proposal for isolation when available
Monitor stack usage carefully (no guard pages)
Consider MPU integration for additional protection

Performance Optimization:

Profile on actual target hardware (not desktop)
Optimize hot paths (consider native for <5% of code)
Use SIMD where available (fixed-width 128-bit)
Minimize host function calls (overhead significant)
Batch operations to reduce isolation crossing

Development Workflow:

Develop with Fast Interpreter (quick iteration)
Test with AOT on target hardware (realistic performance)
Profile and optimize bottlenecks
Consider hardware acceleration for remaining gaps
Validate real-time constraints thoroughly

11.6 Future-Proofing Strategies

Standards Adoption:

Follow WASI evolution (0.2 released 2024)
Adopt Component Model for modularity
Prepare for embedded-specific WASI APIs (I2C, GPIO)
Monitor multi-memory proposal usage

Tooling:

Invest in LLVM/wamrc toolchain knowledge
Develop size optimization expertise
Build automated performance testing
Create embedded-specific testing frameworks

Architecture:

Design for module isolation
Plan for OTA update workflows
Consider hybrid native/WebAssembly approach
Build in profiling and monitoring

12. References and Resources

12.1 Key Papers

"Potential of WebAssembly for Embedded Systems" (ArXiv, 2024)
- https://arxiv.org/html/2405.09213v1
- Comprehensive analysis of embedded WebAssembly state
"Hardware-Based WebAssembly Accelerator for Embedded System" (Electronics, 2024)
- https://www.mdpi.com/2079-9292/13/20/3979
- FPGA accelerator achieving 142x speedup
"Benchmarking WebAssembly for Embedded Systems" (ACM TACO, 2024)
- https://dl.acm.org/doi/10.1145/3736169
- Systematic performance evaluation
"Not So Fast: Analyzing the Performance of WebAssembly vs. Native Code" (USENIX ATC 2019)
- https://www.usenix.org/conference/atc19/presentation/jangda
- Foundational performance analysis
"OmniWasm: Efficient, Granular Fault Isolation and Control-Flow Integrity for Arm"
- Research on MPU-based WebAssembly sandboxing for Cortex-M

12.2 Primary Resources

WAMR (WebAssembly Micro Runtime):

GitHub: https://github.com/bytecodealliance/wasm-micro-runtime
Documentation: https://bytecodealliance.github.io/wamr.dev/
Performance: https://github.com/bytecodealliance/wasm-micro-runtime/wiki/Performance

wasm3:

GitHub: https://github.com/wasm3/wasm3
Interpreter Design: https://github.com/wasm3/wasm3/blob/main/docs/Interpreter.md

aWsm:

GitHub: https://github.com/gwsystems/aWsm
Research-focused AOT runtime

WebAssembly Specifications:

Core Spec: https://webassembly.github.io/spec/
Multi-memory: https://github.com/WebAssembly/multi-memory
WASI: https://wasi.dev/

12.3 Community and Standards

Bytecode Alliance:

Main site: https://bytecodealliance.org/
Focus on WebAssembly security and standards

W3C WebAssembly Working Group:

Specifications and proposals
Community discussions

Embedded WebAssembly:

GitHub Org: https://github.com/embedded-wasm
Community projects and resources

12.4 Tools and Compilers

LLVM:

WebAssembly backend for AOT compilation
https://llvm.org/

wamrc:

WAMR AOT compiler
Part of WAMR repository

Emscripten:

C/C++ to WebAssembly toolchain
https://emscripten.org/

TinyGo:

Go subset for embedded WebAssembly
https://tinygo.org/

Rust + wasm-pack:

Rust to WebAssembly toolchain
https://rustwasm.github.io/

12.5 RTOS Integration

Zephyr:

https://zephyrproject.org/
Modern RTOS with WAMR support
Ocre project for container-like WebAssembly

FreeRTOS:

https://www.freertos.org/
Traditional embedded RTOS
Community WAMR integration

RT-Thread:

https://www.rt-thread.io/
Chinese open-source RTOS
Native WAMR support

12.6 Benchmarks

CoreMark:

https://www.eembc.org/coremark/
Industry-standard embedded benchmark

PolyBench:

https://github.com/MatthiasJReisinger/PolyBenchC-4.2.1
Numerical computation benchmarks

wasm-score:

https://github.com/bytecodealliance/wasm-score
Standalone WebAssembly benchmark suite

Appendix A: Compilation Examples

A.1 WAMR AOT Compilation

# Basic AOT compilation
wamrc -o output.aot input.wasm

# XIP mode (execute from ROM/flash)
wamrc --xip -o output.aot input.wasm

# ARM Cortex-M7 with FPU
wamrc --target=thumbv7em --cpu=cortex-m7 \
      --enable-builtin-intrinsics=i64.common \
      --xip -o output.aot input.wasm

# ARM Cortex-M4 with FPU
wamrc --target=thumbv7em --cpu=cortex-m4 \
      --enable-builtin-intrinsics=i64.common \
      --xip -o output.aot input.wasm

# ARM Cortex-M3 (no FPU)
wamrc --target=thumbv7m --cpu=cortex-m3 \
      --enable-builtin-intrinsics=i64.common,fp.common,fpxint \
      --xip -o output.aot input.wasm

# ARM Cortex-M55 (ARMv8-M)
wamrc --target=thumbv8m.main --cpu=cortex-m55 \
      --enable-builtin-intrinsics=i64.common \
      --xip -o output.aot input.wasm

# RISC-V 64-bit
wamrc --target=riscv64 --cpu=generic-rv64 \
      -o output.aot input.wasm

# With size optimization
wamrc --size-level=3 --xip \
      -o output.aot input.wasm

# With LLVM optimization level
wamrc -O3 --xip -o output.aot input.wasm

A.2 Building WAMR Runtime

# Clone WAMR
git clone https://github.com/bytecodealliance/wasm-micro-runtime.git
cd wasm-micro-runtime

# Build iwasm (interpreter) with Fast Interpreter
cd product-mini/platforms/linux
mkdir build && cd build
cmake .. -DWAMR_BUILD_FAST_INTERP=1
make

# Build iwasm with AOT support
cmake .. -DWAMR_BUILD_AOT=1
make

# Build with JIT support
cmake .. -DWAMR_BUILD_JIT=1
make

# Build with multi-tier JIT
cmake .. -DWAMR_BUILD_JIT=1 -DWAMR_BUILD_FAST_JIT=1
make

# Embedded configuration (minimal features)
cmake .. -DWAMR_BUILD_INTERP=1 \
         -DWAMR_BUILD_FAST_INTERP=0 \
         -DWAMR_BUILD_AOT=1 \
         -DWAMR_BUILD_LIBC_BUILTIN=1 \
         -DWAMR_BUILD_LIBC_WASI=0
make

A.3 C/C++ to WebAssembly (Emscripten)

# Basic compilation
emcc hello.c -o hello.wasm

# Size optimization
emcc hello.c -Os -o hello.wasm

# Aggressive size optimization
emcc hello.c -Oz --no-entry -o hello.wasm

# Standalone WASI module
emcc hello.c -o hello.wasm \
     -s STANDALONE_WASM=1 \
     -s EXPORTED_FUNCTIONS='["_main"]'

# With LTO
emcc hello.c -O3 -flto -o hello.wasm

# Stack size configuration
emcc hello.c -o hello.wasm \
     -s STACK_SIZE=32768

A.4 Rust to WebAssembly

# Add wasm32-wasi target
rustup target add wasm32-wasi

# Compile for WASI
cargo build --target wasm32-wasi --release

# Size optimization
RUSTFLAGS='-C opt-level=z -C link-arg=-s' \
cargo build --target wasm32-wasi --release

# Further size reduction with wasm-opt
wasm-opt -Oz -o optimized.wasm \
         target/wasm32-wasi/release/app.wasm

# Profile-guided optimization
RUSTFLAGS='-C profile-generate' \
cargo build --target wasm32-wasi --release
# ... run with sample data ...
RUSTFLAGS='-C profile-use' \
cargo build --target wasm32-wasi --release

Appendix B: Memory Configuration Examples

B.1 Linear Memory Configuration

// C example: WebAssembly module with custom memory
// Compiled with: emcc -s INITIAL_MEMORY=128KB

#include <stdlib.h>

// Stack allocation (preferred in embedded)
void process_data(void) {
    char buffer[1024];  // Stack-allocated
    // ... process ...
}

// Avoid dynamic allocation if possible
void* data = malloc(1024);  // Heap allocation
// ... use ...
free(data);

B.2 Multi-Memory Configuration

;; WebAssembly Text Format example with multiple memories
(module
  ;; Public memory for host communication
  (memory $public 1)  ;; 1 page (64KB)
  (export "memory" (memory $public))

  ;; Private memory for internal use
  (memory $private 2)  ;; 2 pages (128KB)

  ;; Function using both memories
  (func $process
    ;; Access public memory
    (i32.load (memory $public) (i32.const 0))

    ;; Access private memory
    (i32.load (memory $private) (i32.const 0))
  )
)

B.3 WAMR Memory Limits

// WAMR initialization with memory limits
RuntimeInitArgs init_args;
memset(&init_args, 0, sizeof(RuntimeInitArgs));

// Set memory pool (for embedded)
init_args.mem_alloc_type = Alloc_With_Pool;
init_args.mem_alloc_option.pool.heap_buf = global_heap_buf;
init_args.mem_alloc_option.pool.heap_size = sizeof(global_heap_buf);

// Initialize runtime
wasm_runtime_full_init(&init_args);

// Module instantiation with memory limit
wasm_module_inst_t module_inst =
    wasm_runtime_instantiate(module,
                            64 * 1024,  // 64KB stack
                            256 * 1024, // 256KB heap
                            error_buf,
                            sizeof(error_buf));

Appendix C: Performance Optimization Checklist

C.1 Compilation Optimizations

Use AOT compilation for production
Enable LLVM optimization level (-O2 or -O3)
Enable Link-Time Optimization (LTO)
Specify exact target CPU
Configure appropriate optimization for size vs speed
Enable XIP if executing from ROM/flash
Configure FPU intrinsics based on hardware
Use profile-guided optimization if available

C.2 Code Optimizations

Minimize host function calls
Batch operations to reduce boundary crossing
Use SIMD where appropriate
Avoid unnecessary type conversions
Minimize indirect calls
Use stack allocation over heap where possible
Pre-calculate constants
Optimize hot paths identified by profiling

C.3 Memory Optimizations

Set linear memory to exact required size
Disable unused WASI features
Use built-in libc instead of full WASI libc
Remove debug information from production builds
Enable dead code elimination
Compress modules for OTA transfer
Share runtime across multiple modules
Use XIP to reduce RAM usage

C.4 Runtime Configuration

Choose appropriate execution mode (AOT/interpreter/JIT)
Configure stack size based on profiling
Set heap size to minimum required
Disable debugging features in production
Enable fast interpreter if not using AOT
Configure gas metering only if needed
Optimize module loading path
Cache compiled modules where possible

Conclusion

WebAssembly is rapidly maturing as a viable runtime for embedded systems, particularly on ARM Cortex-M and RISC-V platforms. While challenges remain around memory overhead, page sizes, and performance gaps, the ecosystem has developed sophisticated solutions:

Runtime Maturity: WAMR provides production-ready embedded support with AOT achieving 50-79% of native performance, acceptable for most embedded use cases.

Memory Efficiency: Footprints as small as 29KB for AOT runtime and innovative XIP support enable deployment on flash-constrained systems.

Platform Support: Comprehensive ARM Cortex-M and RISC-V support with architecture-specific optimizations.

Standardization Progress: WASI 0.2, Component Model, and multi-memory proposals advancing embedded capabilities.

Real-World Viability: Successful deployments in IoT sensors, edge computing, and RTOS integration demonstrate practical applicability.

The technology is not a "silver bullet" but offers compelling benefits for portability, security, and flexibility. As tools, standards, and hardware acceleration mature, WebAssembly's role in embedded systems will continue to expand.

Recommendation: For new embedded projects requiring portability, OTA updates, or sandboxing, WebAssembly (via WAMR with AOT) should be seriously evaluated. Start with Fast Interpreter for development, profile thoroughly on target hardware, and deploy with AOT for optimal production performance.

Report compiled: 2025-11-16 Primary sources: Academic papers, official documentation, benchmark studies, and community resources Focus: ARM Cortex-M and RISC-V embedded systems

Uh oh!

FilesExpand file tree

01_embedded_systems.md

Latest commit

History

01_embedded_systems.md

File metadata and controls

WebAssembly for Embedded Systems: Comprehensive Research Report

ARM Cortex-M and RISC-V Optimization Study

Executive Summary

1. Existing WebAssembly Runtimes for Embedded Systems

1.1 WAMR (WebAssembly Micro Runtime)

1.2 wasm3

1.3 aWsm (Awasm)

1.4 Other Embedded Runtimes

1.5 Runtime Comparison Summary

2. Memory Management Optimizations (MMU/MPU Usage)

2.1 WebAssembly's Software-Based Memory Safety

2.2 ARM Cortex-M MPU Integration

2.3 Hardware-Accelerated Memory Protection

2.4 Memory Layout Considerations

2.5 Memory Footprint Measurements

3. Direct Function Calls and Linking Optimizations

3.1 AOT Compilation and Linking

3.2 XIP (Execute In Place) Optimization

3.3 Function Call Optimization Techniques

3.4 Static Linking and Module Merging

4. Multi-Memory Proposals and Embedded Use Cases

4.1 Multi-Memory Proposal Overview

4.2 Embedded-Relevant Use Cases

4.3 Embedded Systems Context

5. Code Size and Performance Optimizations

5.1 Code Size Optimization Strategies

5.2 Performance Optimization Techniques

5.3 Performance vs Native Comparison

5.4 Embedded-Specific Performance Challenges

5.5 Benchmark Results Summary

6. Real-Time Constraints and Deterministic Behavior

6.1 Real-Time Performance Characteristics

6.2 Determinism Considerations

6.3 Hardware Acceleration for Real-Time

6.4 RTOS Integration

6.5 Resource Requirements for RTOS

6.6 Challenges for Hard Real-Time

7. ARM Cortex-M Specific Optimizations

7.1 Architecture-Specific Code Generation

7.2 FPU Handling

7.3 Cortex-M Memory Protection (MPU)

7.4 Thumb Instruction Set

7.5 Performance Results on Cortex-M

7.6 Cortex-M Memory Constraints

7.7 ARM-Specific Optimizations

8. RISC-V WebAssembly Implementations

8.1 RISC-V Platform Support

8.2 Performance on RISC-V

8.3 RISC-V vs WebAssembly Comparison

8.4 RISC-V Embedded Applications

8.5 RISC-V Optimization Opportunities

8.6 RISC-V Development Tools

9. Current State of Embedded WebAssembly (2024-2025)

9.1 Standardization Progress

9.2 Component Model Impact

9.3 Industry Adoption

9.4 Research and Development (2024)

9.5 Tooling Ecosystem

9.6 Remaining Challenges

9.7 Future Outlook

9.8 Recommendations for Adoption

10. Performance Benchmarks and Case Studies

10.1 Benchmark Suites Used

10.2 Comprehensive Benchmark Results

10.3 Memory Consumption Benchmarks

10.4 Real-World Case Studies

10.5 Hardware Accelerator Case Study

10.6 Benchmark Analysis and Insights

11. Key Findings and Recommendations

11.1 Runtime Selection Guide

11.2 Optimization Priorities

11.3 Platform-Specific Recommendations

11.4 Use Case Recommendations