Commit a29d23c

inline branch and make opcode read unchecked
1 parent cc02503 commit a29d23c

3 files changed

Lines changed: 19 additions & 6 deletions

bench.md

Lines changed: 16 additions & 3 deletions
@@ -1,4 +1,4 @@
-This file documents the coremark bench results to document performance improvements over time.
+This file documents the coremark bench results to keep track of performance improvements over time.
 - ac6b813: avg ≈ 500 (introduced benchmarking without criterion)
 - 5f4af0c: avg = 529.773949, n = 20 (improved leb128 handling with unsafe)
 - we try to avoid unsafe code in module/validator/instance directly
@@ -7,14 +7,27 @@ This file documents the coremark bench results to document performance improveme
 - current design: two-level indirection (one array covering all possible code indices and directing them to a densely packed side table)
 - we also try to not use nightly features... making error creation cold path is a really elegant solution in this regard
 - repr(C) for the SideTableEntry struct caused mysterious improvements, not sure if it is a fluke
-- current: avg = 855.71814, n = 20 (remove defensive malformed check in main loop)
+- cc02503: avg = 855.71814, n = 20 (remove defensive malformed check in main loop)
 - since the module is already validated at run time, there is no reason for the check to exist, it was a remnant of early development phase that lacked proper handling for some malformed modules
+- current: no significant difference

+On nightly, the performance is slightly better (sometimes reaching 900)
+
+Next step: use direct threading to improve branch prediction

 Hardware Overview:
 - Model Name: MacBook Pro
 - Model Identifier: Mac16,8
 - Model Number: MX2H3LL/A
 - Chip: Apple M4 Pro
 - Total Number of Cores: 12 (8 performance and 4 efficiency)
-- Memory: 24 GB
+- Memory: 24 GB
+
+Performance of other Rust-based interpreters:
+wasmi: ~1700
+tinywasm: ~630
+
+Goal:
+We expect/hope to reach ~1200 after threaded dispatch implementation. It seems like Ben Titzer only reached performance comparable to production-ready, optimizing interpreters through manually crafted assembly code for hot paths.
+
+Higher performance may not be pursued after the point and instead I might focus on adding more instructions to achieve Wasm 2.0 spec parity (should be easy with AI).
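
Note on the "Next step" and "Goal" entries: they refer to threaded dispatch. As a rough illustration only, and not the project's design, the sketch below pre-decodes bytecode into a table of handler function pointers so the hot loop no longer goes through a central `match` on the opcode. Stable Rust has neither computed goto nor guaranteed tail calls, so this is closer to call threading than to true direct threading. The toy opcodes (`OP_PUSH1`, `OP_ADD`, `OP_HALT`), the `Vm` struct, and every function name here are hypothetical.

```rust
// Hypothetical toy opcodes; these are NOT Wasm opcodes.
const OP_PUSH1: u8 = 0; // push the constant 1
const OP_ADD: u8 = 1; // pop two values, push their sum
const OP_HALT: u8 = 2; // stop execution

// Hypothetical interpreter state for the sketch.
struct Vm {
    stack: Vec<i64>,
    pc: usize,
    halted: bool,
}

// Every handler shares one signature so it can be stored in a table.
type Handler = fn(&mut Vm);

fn op_push1(vm: &mut Vm) {
    vm.stack.push(1);
    vm.pc += 1;
}

fn op_add(vm: &mut Vm) {
    // A validated program would guarantee two operands; the fallback to 0
    // only keeps this sketch panic-free.
    let b = vm.stack.pop().unwrap_or(0);
    let a = vm.stack.pop().unwrap_or(0);
    vm.stack.push(a + b);
    vm.pc += 1;
}

fn op_halt(vm: &mut Vm) {
    vm.halted = true;
}

/// Pre-decode once: translate each opcode byte into its handler pointer.
/// Returns `None` if an unknown opcode is found.
fn predecode(code: &[u8]) -> Option<Vec<Handler>> {
    code.iter()
        .map(|&op| match op {
            OP_PUSH1 => Some(op_push1 as Handler),
            OP_ADD => Some(op_add as Handler),
            OP_HALT => Some(op_halt as Handler),
            _ => None,
        })
        .collect()
}

/// Run until `OP_HALT` or the end of the program; return the top of stack.
fn run(code: &[u8]) -> Option<i64> {
    let handlers = predecode(code)?;
    let mut vm = Vm { stack: Vec::new(), pc: 0, halted: false };
    while !vm.halted && vm.pc < handlers.len() {
        // Indirect call through the table: no opcode `match` in the hot loop.
        let handler = handlers[vm.pc];
        handler(&mut vm);
    }
    vm.stack.last().copied()
}
```

With this toy instruction set, `run(&[OP_PUSH1, OP_PUSH1, OP_ADD, OP_HALT])` evaluates 1 + 1. Whether such a layout actually helps on a given CPU is something only the coremark numbers can show.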

src/instance.rs

Lines changed: 2 additions & 2 deletions
@@ -640,7 +640,7 @@ impl Instance {
 let mem = self.memory.as_ref();
 let tab = self.table.as_ref();

-macro_rules! next_op { () => {{ let byte = bytes[pc]; pc += 1; byte }} }
+macro_rules! next_op { () => {{ let byte = unsafe { *bytes.get_unchecked(pc) }; pc += 1; byte }} }
 macro_rules! pop_val { () => {{
 match stack.pop() { Some(v) => v, None => return Err(Error::trap(STACK_UNDERFLOW)) }
 }} }
@@ -1375,7 +1375,7 @@ impl Instance {
 }
 }

-#[inline]
+#[inline(always)]
 fn branch(pc: &mut usize, stack: &mut Vec<WasmValue>, control: &mut Vec<ControlFrame>, depth: u32) -> bool {
 let len = control.len();
 if depth as usize >= len { return true; }
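
Note on the `next_op!` change: the new body swaps a bounds-checked index for `get_unchecked`. The sketch below shows the two variants side by side as plain functions, which may make the safety contract easier to see. The function names are hypothetical, and the sketch assumes that validation guarantees `pc < bytes.len()` whenever an opcode is fetched; the diff itself does not show where that guarantee is established.

```rust
/// Checked fetch, equivalent to the old macro body:
/// `bytes[*pc]` panics if `*pc` is out of bounds.
fn fetch_checked(bytes: &[u8], pc: &mut usize) -> u8 {
    let byte = bytes[*pc];
    *pc += 1;
    byte
}

/// Unchecked fetch, equivalent to the new macro body.
///
/// # Safety
/// The caller must guarantee `*pc < bytes.len()`; otherwise the read is
/// undefined behavior.
unsafe fn fetch_unchecked(bytes: &[u8], pc: &mut usize) -> u8 {
    // Catches violations in debug builds at no cost in release builds.
    debug_assert!(*pc < bytes.len());
    // SAFETY: the bound is upheld by the caller (see the doc comment).
    let byte = unsafe { *bytes.get_unchecked(*pc) };
    *pc += 1;
    byte
}
```

Adding a `debug_assert!` like the one above is one way to keep the check during development while removing it from release builds.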

src/wasm_memory.rs

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@ impl WasmMemory {
 pub fn store_f64(&mut self, ptr: u32, offset: u32, v: f64) -> Result<(), &'static str> {
 self.store_u64(ptr, offset, v.to_bits())
 }
-#[inline]
+#[inline(always)]
 pub fn write_bytes(&mut self, offset: u32, bytes: &[u8]) -> Result<(), &'static str> {
 let start = offset as usize;
 let end = start.checked_add(bytes.len()).ok_or(OOB_MEMORY_ACCESS)?;
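
Note on `write_bytes`: the hunk ends right after the overflow check. For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of a bounds-checked write that uses the same `checked_add` plus `ok_or` shape. The struct layout (`data: Vec<u8>`), the error string, and everything after the `end` computation are assumptions for illustration, not the repository's code.

```rust
// Assumed message; the real constant's value is not shown in the diff.
const OOB_MEMORY_ACCESS: &str = "out of bounds memory access";

// Hypothetical stand-in for the real `WasmMemory`.
struct WasmMemory {
    data: Vec<u8>,
}

impl WasmMemory {
    #[inline(always)]
    fn write_bytes(&mut self, offset: u32, bytes: &[u8]) -> Result<(), &'static str> {
        let start = offset as usize;
        // `checked_add` turns arithmetic overflow into an error instead of wrapping.
        let end = start.checked_add(bytes.len()).ok_or(OOB_MEMORY_ACCESS)?;
        // `get_mut` returns `None` when the range extends past the buffer.
        let dst = self.data.get_mut(start..end).ok_or(OOB_MEMORY_ACCESS)?;
        dst.copy_from_slice(bytes);
        Ok(())
    }
}
```

Both `#[inline]` and `#[inline(always)]` are hints to the compiler rather than guarantees, so the effect of this change is best confirmed with the benchmark.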
