| 1 | +# JitEngine Refactor — RwLock → LazyLock::get_mut (Pre-populate-then-freeze) |
| 2 | + |
| 3 | +## FIRST: Read .claude/rules/borrow-strategy.md (from q2, same principle applies) |
| 4 | + |
| 5 | +## The Problem |
| 6 | + |
| 7 | +JitEngine currently uses `RwLock<HashMap<u64, ScanKernel>>` for the kernel cache. |
| 8 | +Every query during runtime takes a read-lock, and cache misses take a write-lock |
| 9 | +to compile. This means: |
| 10 | + |
| 11 | +- Lock contention under concurrent access |
| 12 | +- Unpredictable latency spikes when a new config triggers compilation mid-tick |
| 13 | +- RwLock overhead on EVERY cache hit (atomic operations, memory barriers) |
| 14 | +- Compilation (521µs) happening during gameplay / graph queries / replay |
| 15 | + |
| 16 | +## The Fix |
| 17 | + |
| 18 | +Two-phase architecture using `LazyLock::get_mut` (stable in Rust 1.94): |
| 19 | + |
| 20 | +``` |
| 21 | +Phase 1 — BUILD (mutable, single-threaded): |
| 22 | + LazyLock::get_mut(&mut engine.cache) → Some(&mut cache) once initialized |
| 23 | + Compile ALL kernels upfront. 100 kernels × 521µs = 52ms. |
| 24 | + No locks. No contention. Runs during startup/loading. |
| 25 | + |
| 26 | +Phase 2 — RUN (frozen, zero-cost reads): |
| 27 | + &LazyLock → &HashMap (immutable reference) |
| 28 | + Function pointer lookup = HashMap::get(). No lock. No atomic read-modify-write. |
| 29 | + Zero contention. Zero compilation. Zero latency spikes. |
| 30 | + |
| 31 | + The freeze is enforced by borrows: get_mut() takes &mut, so once the |
| 32 | + engine is shared (&self or Arc), compile() is a compile error, not a |
| 33 | + runtime error. (get_mut() returns None only before initialization.) |
| 34 | +``` |
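
A minimal sketch of the two-phase idea, assuming a plain `HashMap` as a stand-in for the kernel cache. `Registry`, `fake_compile`, and `build_then_run` are hypothetical names for illustration and are not part of JitEngine. It shows where the freeze comes from: `compile` takes `&mut self`, `get` takes `&self`, and once the value sits in an `Arc` only shared references are reachable.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the kernel cache: hash -> "compiled" value.
pub struct Registry {
    map: HashMap<u64, u64>,
}

/// Placeholder for real compilation work (assumption: a pure function of the key).
fn fake_compile(key: u64) -> u64 {
    key * 2 + 1
}

impl Registry {
    pub fn new() -> Self {
        Registry { map: HashMap::new() }
    }

    /// BUILD phase: needs exclusive access, so it cannot run while shared.
    pub fn compile(&mut self, key: u64) -> u64 {
        *self.map.entry(key).or_insert_with(|| fake_compile(key))
    }

    /// RUN phase: shared access only, no lock involved.
    pub fn get(&self, key: u64) -> Option<u64> {
        self.map.get(&key).copied()
    }
}

/// Builds a registry, freezes it by moving it into an Arc, then reads it
/// from several threads. Returns the sum of all values read.
pub fn build_then_run(keys: &[u64], readers: usize) -> u64 {
    let mut reg = Registry::new();
    for &k in keys {
        reg.compile(k); // BUILD: &mut access
    }

    // Freeze: through Arc only &Registry is reachable, so calling
    // `frozen.compile(..)` here would be rejected by the compiler.
    let frozen = Arc::new(reg);

    let handles: Vec<_> = (0..readers)
        .map(|_| {
            let reg = Arc::clone(&frozen);
            let keys = keys.to_vec();
            thread::spawn(move || keys.iter().filter_map(|k| reg.get(*k)).sum::<u64>())
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

Calling `build_then_run(&[1, 2, 3], 4)` compiles three entries once and then reads them from four threads without any lock. The same shape applies to JitEngine regardless of what the cache field is wrapped in.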
| 35 | + |
| 36 | +## Implementation |
| 37 | + |
| 38 | +### Current code (src/hpc/jitson_cranelift/engine.rs): |
| 39 | + |
| 40 | +```rust |
| 41 | +// CURRENT — wrong |
| 42 | +pub struct JitEngine { |
| 43 | + module: JITModule, |
| 44 | + pub caps: CpuCaps, |
| 45 | + scan_cache: RwLock<HashMap<u64, (*const u8, FuncId)>>, // ← lock on every access |
| 46 | +} |
| 47 | + |
| 48 | +impl JitEngine { |
| 49 | + pub fn compile_scan(&self, params: ScanParams) -> Result<ScanKernel, JitError> { |
| 50 | + // read-lock to check cache |
| 51 | + // write-lock to compile on miss |
| 52 | + // contention, latency spike |
| 53 | + } |
| 54 | +} |
| 55 | +``` |
| 56 | + |
| 57 | +### New code: |
| 58 | + |
| 59 | +```rust |
| 60 | +use std::sync::LazyLock; |
| 61 | +use std::collections::HashMap; |
| 62 | + |
| 63 | +/// JIT compilation engine with two-phase lifecycle: |
| 64 | +/// 1. BUILD: compile kernels via `compile()` / `compile_batch()` (needs `&mut self`) |
| 65 | +/// 2. RUN: look up kernels via `get()` (needs only `&self`, no locking) |
| 66 | +/// |
| 67 | +/// While the engine is shared (`&JitEngine` or `Arc<JitEngine>`), no `&mut self` |
| 68 | +/// can be obtained, so no more kernels can be compiled. The cache is frozen. |
| 69 | +pub struct JitEngine { |
| 70 | + module: JITModule, |
| 71 | + pub caps: CpuCaps, |
| 72 | + /// Kernel cache. Mutable during BUILD, frozen during RUN. |
| 73 | + cache: LazyLock<KernelCache>, |
| 74 | +} |
| 75 | + |
| 76 | +/// The frozen kernel registry. Hash-indexed, plus an ordered chain for prefetching. |
| 77 | +struct KernelCache { |
| 78 | + /// Hash → function pointer. Immutable after freeze. |
| 79 | + map: HashMap<u64, CachedKernel>, |
| 80 | + /// Ordered list for prefetch chain (WAL precompile queue order). |
| 81 | + prefetch_chain: Vec<(u64, *const u8)>, |
| 82 | +} |
| 83 | + |
| 84 | +struct CachedKernel { |
| 85 | + fn_ptr: *const u8, |
| 86 | + func_id: FuncId, |
| 87 | + params: ScanParams, |
| 88 | +} |
| 89 | + |
| 90 | +// Safety (assumed): the raw pointers refer to finalized JIT code that is never written or freed while the engine is alive, so sharing them across threads is sound. |
| 91 | +unsafe impl Send for KernelCache {} |
| 92 | +unsafe impl Sync for KernelCache {} |
| 93 | + |
| 94 | +impl JitEngine { |
| 95 | + pub fn new() -> Result<Self, JitError> { |
| 96 | + JitEngineBuilder::new().build() |
| 97 | + } |
| 98 | + |
| 99 | + // ── Phase 1: BUILD (mutable) ────────────────────────────── |
| 100 | + |
| 101 | +    /// Compile a scan kernel and add it to the cache. |
| 102 | +    /// Needs `&mut self`, so it is unavailable once the engine is shared. |
| 103 | +    pub fn compile(&mut self, params: ScanParams) -> Result<u64, JitError> { |
| 104 | +        let hash = params_hash(&params, None); |
| 105 | +        if self.cache_mut().map.contains_key(&hash) { |
| 106 | +            return Ok(hash); // already compiled |
| 107 | +        } |
| 108 | + |
| 109 | +        let (fn_ptr, func_id) = self.compile_inner(&params, None)?; |
| 110 | +        let cache = self.cache_mut(); // re-borrow: the lookup borrow ended before compile_inner() |
| 111 | +        cache.map.insert(hash, CachedKernel { fn_ptr, func_id, params }); |
| 112 | +        cache.prefetch_chain.push((hash, fn_ptr)); |
| 113 | +        Ok(hash) |
| 114 | +    } |
| 115 | + |
| 116 | +    /// Compile a hybrid scan kernel (JIT loop + external SIMD function). |
| 117 | +    pub fn compile_hybrid(&mut self, params: ScanParams, distance_fn: &str) -> Result<u64, JitError> { |
| 118 | +        let hash = params_hash(&params, Some(distance_fn)); |
| 119 | +        if self.cache_mut().map.contains_key(&hash) { |
| 120 | +            return Ok(hash); |
| 121 | +        } |
| 122 | +        let (fn_ptr, func_id) = self.compile_inner(&params, Some(distance_fn))?; |
| 123 | +        let cache = self.cache_mut(); |
| 124 | +        cache.map.insert(hash, CachedKernel { fn_ptr, func_id, params }); |
| 125 | +        cache.prefetch_chain.push((hash, fn_ptr)); |
| 126 | +        Ok(hash) |
| 127 | +    } |
| 128 | + |
| 129 | +    /// Mutable cache access for BUILD. `LazyLock::get_mut` returns `None` only while |
| 130 | +    /// the cache is uninitialized, so this assumes `JitEngineBuilder::build()` forced it. |
| 131 | +    fn cache_mut(&mut self) -> &mut KernelCache { |
| 132 | +        LazyLock::get_mut(&mut self.cache).expect("JitEngine: kernel cache not initialized") |
| 133 | +    } |
| 134 | + |
| 135 | + /// Compile all kernels from a precompile queue (batch BUILD). |
| 136 | + pub fn compile_batch(&mut self, queue: &[ScanParams]) -> Result<Vec<u64>, JitError> { |
| 137 | + queue.iter().map(|p| self.compile(p.clone())).collect() |
| 138 | + } |
| 139 | + |
| 140 | + /// Compile all palette kernels (bits_per_index 1-8). |
| 141 | + pub fn compile_palette_kernels(&mut self) -> Result<(), JitError> { |
| 142 | + for bits in 1..=8 { |
| 143 | + self.compile(ScanParams { |
| 144 | + threshold: u32::MAX, // palette unpack doesn't threshold |
| 145 | + top_k: 4096, // full section |
| 146 | + prefetch_ahead: 4, |
| 147 | + focus_mask: None, |
| 148 | + record_size: bits as u32, |
| 149 | + })?; |
| 150 | + } |
| 151 | + Ok(()) |
| 152 | + } |
| 153 | + |
| 154 | + // ── Phase 2: RUN (frozen, zero-cost) ────────────────────── |
| 155 | + |
| 156 | +    /// Look up a compiled kernel by hash. Lock-free: takes only `&self`. |
| 157 | +    /// Returns None if the kernel wasn't compiled during BUILD. |
| 158 | +    #[inline(always)] |
| 159 | +    pub fn get(&self, hash: u64) -> Option<ScanKernel> { |
| 160 | +        // Deref of LazyLock checks its init state (one uncontended atomic load), |
| 161 | +        // then this is a plain HashMap lookup. No lock is taken. |
| 162 | + self.cache.map.get(&hash).map(|k| ScanKernel::from_raw(k.fn_ptr, k.params.clone())) |
| 163 | + } |
| 164 | + |
| 165 | + /// Prefetch the NEXT kernel's code page. |
| 166 | + /// Call while executing kernel N to warm L1 for kernel N+1. |
| 167 | + #[inline(always)] |
| 168 | + pub fn prefetch_next(&self, current_hash: u64) { |
| 169 | + let chain = &self.cache.prefetch_chain; |
| 170 | + if let Some(idx) = chain.iter().position(|(h, _)| *h == current_hash) { |
| 171 | + if let Some((_, next_ptr)) = chain.get(idx + 1) { |
| 172 | + #[cfg(target_arch = "x86_64")] |
| 173 | + unsafe { |
| 174 | + core::arch::x86_64::_mm_prefetch( |
| 175 | + *next_ptr as *const i8, |
| 176 | + core::arch::x86_64::_MM_HINT_T0, |
| 177 | + ); |
| 178 | + } |
| 179 | + } |
| 180 | + } |
| 181 | + } |
| 182 | + |
| 183 | + /// Number of compiled kernels. |
| 184 | + pub fn kernel_count(&self) -> usize { |
| 185 | + self.cache.map.len() |
| 186 | + } |
| 187 | + |
| 188 | +    /// Has the cache been initialized? Note that this is weaker than "frozen". |
| 189 | +    pub fn is_frozen(&self) -> bool { |
| 190 | +        // LazyLock::get returns Some once the value has been initialized. |
| 191 | +        // This reports initialization only; the freeze itself is enforced |
| 192 | +        // by the borrow checker (&mut self for compile, &self for get). |
| 193 | +        LazyLock::get(&self.cache).is_some() |
| 194 | +    } |
| 195 | +} |
| 196 | +``` |
| 197 | + |
| 198 | +## Usage Pattern |
| 199 | + |
| 200 | +### Pumpkin Server Startup |
| 201 | + |
| 202 | +```rust |
| 203 | +fn main() { |
| 204 | + // ── Phase 1: BUILD (during "Loading..." screen) ── |
| 205 | + let mut jit = JitEngine::new().unwrap(); |
| 206 | + |
| 207 | +    // Palette kernels (8 bit widths, 1..=8) |
| 208 | + jit.compile_palette_kernels().unwrap(); |
| 209 | + |
| 210 | + // Noise kernels (per biome) |
| 211 | + for biome in &world.biomes { |
| 212 | + jit.compile(biome.noise_params.to_scan_params()).unwrap(); |
| 213 | + } |
| 214 | + |
| 215 | + // Property mask kernels |
| 216 | + jit.compile(waterlogged_mask_params()).unwrap(); |
| 217 | + jit.compile(tick_eligible_params()).unwrap(); |
| 218 | + |
| 219 | + // Distance threshold kernels |
| 220 | + for radius in [16.0, 32.0, 64.0, 128.0] { |
| 221 | + jit.compile(radius_scan_params(radius)).unwrap(); |
| 222 | + } |
| 223 | + |
| 224 | + println!("JIT: {} kernels compiled in BUILD phase", jit.kernel_count()); |
| 225 | + // "JIT: 97 kernels compiled in BUILD phase" — took ~50ms |
| 226 | + |
| 227 | + // ── Phase 2: RUN (frozen, zero-cost) ── |
| 228 | +    // .get() takes &self, so it also works once the engine is shared. |
| 229 | +    let kernel = jit.get(palette_hash_4bit).unwrap(); |
| 230 | +    // Every .get() is a HashMap lookup behind one uncontended atomic load |
| 231 | +    // (LazyLock's init check). No lock. No contention. |
| 232 | + |
| 233 | +    // Share across threads: Arc exposes only &JitEngine, so .compile() no longer type-checks. |
| 234 | +    let jit = Arc::new(jit); // Arc, not Arc<RwLock>: this is the freeze |
| 235 | + |
| 236 | + // Tick loop — zero-cost kernel access |
| 237 | + loop { |
| 238 | + let kernel = jit.get(current_palette_hash).unwrap(); |
| 239 | + unsafe { kernel.scan(query, field, len, size, out) }; |
| 240 | + jit.prefetch_next(current_palette_hash); // warm next kernel |
| 241 | + } |
| 242 | +} |
| 243 | +``` |
| 244 | + |
| 245 | +### q2 Cockpit Replay |
| 246 | + |
| 247 | +```rust |
| 248 | +fn start_replay(engine: &mut VizEngine, jit: &mut JitEngine, versions: &[PathBuf]) { |
| 249 | + // ── Phase 1: BUILD — compile all version kernels before play ── |
| 250 | + for (i, version_file) in versions.iter().enumerate() { |
| 251 | + let params = version_scan_params(i, version_file); |
| 252 | + jit.compile(params).unwrap(); |
| 253 | + } |
| 254 | + println!("Replay: {} kernels pre-compiled", jit.kernel_count()); |
| 255 | +    // ── Phase 2: RUN (play button starts, zero compilation during playback) ── |
| 256 | +    let jit: &JitEngine = jit; // shared reborrow: compile() no longer type-checks |
| 257 | + for (i, version_file) in versions.iter().enumerate() { |
| 258 | + let hash = version_hash(i); |
| 259 | + let kernel = jit.get(hash).unwrap(); // frozen, instant |
| 260 | + // Process version through thinking graph... |
| 261 | + // No latency spikes. No lock contention. Smooth playback. |
| 262 | + } |
| 263 | +} |
| 264 | +``` |
| 265 | + |
| 266 | +## Why This Matters |
| 267 | + |
| 268 | +``` |
| 269 | +RwLock cache hit: ~25ns (atomic read, memory barrier) |
| 270 | +LazyLock frozen get: ~5ns (HashMap::get behind an init check, no lock) |
| 271 | + |
| 272 | +RwLock cache miss: 521µs + write-lock contention |
| 273 | +LazyLock cache miss: get() returns None, unwrap() panics (no compile at runtime) |
| 274 | + |
| 275 | +100 kernel lookups/tick × 20ns saved = 2µs/tick saved |
| 276 | +At 20 TPS = 40µs/second saved on lock overhead alone |
| 277 | +``` |
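
The figures above can be checked with a few lines of arithmetic. This sketch reuses only the numbers quoted in this document (521µs per compile, roughly 25ns vs 5ns per lookup, 100 lookups per tick, 20 TPS). The function names are illustrative, and the per-lookup timings remain the document's estimates rather than measurements.

```rust
/// Total BUILD-phase compile time in microseconds.
pub fn build_cost_us(kernels: u64, compile_us: u64) -> u64 {
    kernels * compile_us
}

/// Lock overhead saved per second, in nanoseconds.
/// `locked_ns` and `frozen_ns` are per-lookup costs; `tps` is ticks per second.
pub fn saved_ns_per_second(lookups_per_tick: u64, locked_ns: u64, frozen_ns: u64, tps: u64) -> u64 {
    lookups_per_tick * (locked_ns - frozen_ns) * tps
}
```

With the document's numbers, `build_cost_us(100, 521)` gives 52,100µs (about 52ms of startup cost) and `saved_ns_per_second(100, 25, 5, 20)` gives 40,000ns (40µs per second).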
| 278 | + |
| 279 | +The real win isn't nanoseconds. It's DETERMINISM. The tick loop never |
| 280 | +stalls for compilation. The replay never hiccups. The demo never stutters. |
| 281 | +Every kernel that will ever be needed is compiled before the first tick. |
| 282 | +If you forgot one, `get()` returns `None` and the `unwrap()` panics on the first |
| 283 | +lookup: a loud, reproducible bug, not a lag spike in front of an audience. |
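
A verification pass at the end of BUILD is one way to surface a forgotten kernel before the first tick. The sketch below assumes hypothetical helpers, `missing_kernels` and `assert_all_compiled`, working on a plain set of compiled hashes; neither is part of JitEngine, and the list of required hashes would have to come from the caller.

```rust
use std::collections::HashSet;

/// Returns the required hashes that were never compiled, in input order.
pub fn missing_kernels(compiled: &HashSet<u64>, required: &[u64]) -> Vec<u64> {
    required
        .iter()
        .copied()
        .filter(|h| !compiled.contains(h))
        .collect()
}

/// Panics with the full list of missing hashes; intended to be called once,
/// after the last compile() and before the engine is shared.
pub fn assert_all_compiled(compiled: &HashSet<u64>, required: &[u64]) {
    let missing = missing_kernels(compiled, required);
    assert!(
        missing.is_empty(),
        "kernels missing after BUILD: {:?}",
        missing
    );
}
```

Run once at the boundary between the two phases, this turns a missing kernel into a failure during loading instead of a failure on whichever tick first asks for it.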
| 284 | + |
| 285 | +## The Amiga Parallel |
| 286 | + |
| 287 | +Amiga demo coders pre-computed copper lists during the vertical blank |
| 288 | +interval. When the display beam reached the visible area, every list |
| 289 | +was ready. No computation during rendering. The beam just read addresses. |
| 290 | + |
| 291 | +`LazyLock::get_mut` IS the vertical blank interval: |
| 292 | +- BUILD = VBI (compile everything, nobody's watching) |
| 293 | +- RUN = visible area (just read function pointers, zero computation) |
| 294 | + |
| 295 | +The freeze is not a limitation. It's a GUARANTEE. |
| 296 | + |
| 297 | +## What NOT to do |
| 298 | + |
| 299 | +- Do NOT keep RwLock as a fallback "just in case" |
| 300 | +- Do NOT add a `compile_if_missing()` method that works during RUN |
| 301 | +- Do NOT use `OnceCell` or `OnceLock` (their `get_mut` behaves the same way; the freeze comes from `&mut self` vs `&self`, so switching gains nothing) |
| 302 | +- Do NOT make the panic optional — if a kernel is missing during RUN, that's a bug |
| 303 | +- SIMD stays on slices. This refactor doesn't touch SIMD paths. |