inline branch and make opcode read unchecked

InvalidPathException · InvalidPathException · commit a29d23c4f046 · 2025-10-04T09:57:57.000-04:00
diff --git a/bench.md b/bench.md
@@ -1,4 +1,4 @@
-This file documents the coremark bench results to document performance improvements over time.
+This file documents the coremark bench results to keep track of performance improvements over time.
 - ac6b813: avg ≈ 500 (introduced benchmarking without criterion)
 - 5f4af0c: avg = 529.773949, n = 20 (improved leb128 handling with unsafe)
     - we try to avoid unsafe code in module/validator/instance directly
@@ -7,14 +7,27 @@ This file documents the coremark bench results to document performance improveme
     - current design: two-level indirection (one array covering all possible code indices and directing them to a densely packed side table)
     - we also try to not use nightly features... making error creation cold path is a really elegant solution in this regard
     - repr(C) for the SideTableEntry struct caused mysterious improvements, not sure if it is a fluke
-- current: avg = 855.71814, n = 20 (remove defensive malformed check in main loop)
+- cc02503: avg = 855.71814, n = 20 (remove defensive malformed check in main loop)
     - since the module is already validated at run time, there is no reason for the check to exist, it was a remnant of early development phase that lacked proper handling for some malformed modules
+- current: no significant difference
 
+On nightly, the performance is slightly better (sometimes reaching 900)
+
+Next step: use direct threading to improve branch prediction 
 
 Hardware Overview:
 - Model Name: MacBook Pro
 - Model Identifier: Mac16,8
 - Model Number: MX2H3LL/A
 - Chip: Apple M4 Pro
 - Total Number of Cores: 12 (8 performance and 4 efficiency)
-- Memory: 24 GB
+- Memory: 24 GB
+
+Performance of other Rust-based interpreters:
+wasmi: ~1700
+tinywasm: ~630
+
+Goal:
+We expect/hope to reach ~1200 after threaded dispatch implementation. It seems like Ben Titzer only reached performance comparable to production-ready, optimizing interpreters through manually crafted assembly code for hot paths. 
+
+Higher performance may not be pursued after the point and instead I might focus on adding more instructions to achieve Wasm 2.0 spec parity (should be easy with AI).
diff --git a/src/instance.rs b/src/instance.rs
@@ -640,7 +640,7 @@ impl Instance {
         let mem = self.memory.as_ref();
         let tab = self.table.as_ref();
 
-        macro_rules! next_op { () => {{ let byte = bytes[pc]; pc += 1; byte }} }
+        macro_rules! next_op { () => {{ let byte = unsafe { *bytes.get_unchecked(pc) }; pc += 1; byte }} }
         macro_rules! pop_val { () => {{
             match stack.pop() { Some(v) => v, None => return Err(Error::trap(STACK_UNDERFLOW)) }
         }} }
@@ -1375,7 +1375,7 @@ impl Instance {
         }
     }
 
-    #[inline]
+    #[inline(always)]
     fn branch(pc: &mut usize, stack: &mut Vec<WasmValue>, control: &mut Vec<ControlFrame>, depth: u32) -> bool {
         let len = control.len();
         if depth as usize >= len { return true; }
diff --git a/src/wasm_memory.rs b/src/wasm_memory.rs
@@ -76,7 +76,7 @@ impl WasmMemory {
     pub fn store_f64(&mut self, ptr: u32, offset: u32, v: f64) -> Result<(), &'static str> {
         self.store_u64(ptr, offset, v.to_bits())
     }
-    #[inline]
+    #[inline(always)]
     pub fn write_bytes(&mut self, offset: u32, bytes: &[u8]) -> Result<(), &'static str> {
         let start = offset as usize;
         let end = start.checked_add(bytes.len()).ok_or(OOB_MEMORY_ACCESS)?;
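
Note on the `next_op!` change: dropping the bounds check is only sound if an earlier validation pass guarantees that `pc` never runs past the end of `bytes`. The sketch below illustrates that invariant on a toy bytecode. It is not the code from this repository: the opcodes `OP_END`, `OP_PUSH`, `OP_ADD` and the functions `validate` and `run` are hypothetical names invented for the example.

```rust
// Toy bytecode, invented for this sketch (not the Wasm opcodes).
const OP_END: u8 = 0x00; // stop and return the top of the stack
const OP_PUSH: u8 = 0x01; // push the following immediate byte
const OP_ADD: u8 = 0x02; // pop two values, push their sum

/// Hypothetical validation pass. On success, execution starting at index 0
/// reaches an OP_END before leaving the slice, and every OP_PUSH has its
/// immediate byte in bounds.
fn validate(bytes: &[u8]) -> Result<(), &'static str> {
    let mut pc = 0usize;
    while pc < bytes.len() {
        match bytes[pc] {
            OP_END => return Ok(()),
            OP_PUSH => {
                if pc + 1 >= bytes.len() {
                    return Err("truncated immediate");
                }
                pc += 2;
            }
            OP_ADD => pc += 1,
            _ => return Err("unknown opcode"),
        }
    }
    Err("missing end opcode")
}

/// Runs the toy bytecode. The fetch is unchecked; stack pops stay checked,
/// mirroring the `pop_val!` macro in the diff.
fn run(bytes: &[u8]) -> Result<i64, &'static str> {
    validate(bytes)?;
    let mut pc = 0usize;
    let mut stack: Vec<i64> = Vec::new();

    macro_rules! next_op { () => {{
        // Debug builds still catch a broken invariant.
        debug_assert!(pc < bytes.len());
        // SAFETY: validate() succeeded, so every fetch on the straight-line
        // path before OP_END is in bounds.
        let byte = unsafe { *bytes.get_unchecked(pc) };
        pc += 1;
        byte
    }} }

    loop {
        match next_op!() {
            OP_END => return stack.pop().ok_or("stack underflow"),
            OP_PUSH => {
                let imm = next_op!();
                stack.push(imm as i64);
            }
            OP_ADD => {
                let b = stack.pop().ok_or("stack underflow")?;
                let a = stack.pop().ok_or("stack underflow")?;
                stack.push(a + b);
            }
            // validate() rejects every other byte in opcode position.
            _ => unreachable!("rejected by validate"),
        }
    }
}
```

Calling `run(&[OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_END])` evaluates 2 + 3, while a program that is truncated or lacks `OP_END` is rejected by `validate` before the unchecked read is ever reached. Keeping the `debug_assert!` next to the `unsafe` block is one way to retain the check in debug and test builds without paying for it in release builds.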
