Commit 9b65097

Author: Jan Hübener
feat(.claude): add prompts — migration spec, Pumpkin actions, LazyLock refactor
Restoring .claude/prompts that were added separately. The actual code (jitson migration, LazyLock refactor) was already done by CC session on this branch.
1 parent 5c35f71 commit 9b65097

3 files changed

Lines changed: 780 additions & 0 deletions

Lines changed: 303 additions & 0 deletions
@@ -0,0 +1,303 @@
# JitEngine Refactor: RwLock → LazyLock::get_mut (Pre-populate-then-freeze)

## FIRST: Read .claude/rules/borrow-strategy.md (from q2, same principle applies)

## The Problem

JitEngine currently uses `RwLock<HashMap<u64, ScanKernel>>` for the kernel cache.
Every query during runtime takes a read-lock, and cache misses take a write-lock
to compile. This means:

- Lock contention under concurrent access
- Unpredictable latency spikes when a new config triggers compilation mid-tick
- RwLock overhead on EVERY cache hit (atomic operations, memory barriers)
- Compilation (521µs) happening during gameplay / graph queries / replay

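To make the cost concrete, here is a minimal sketch of the lock-per-access pattern described above. `LockedCache`, `get_or_compile`, and the `u64` stand-in for a kernel are hypothetical names for illustration, not the code in `engine.rs`.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::RwLock;

/// Hypothetical stand-in for the current design: every access goes through the lock.
pub struct LockedCache {
    kernels: RwLock<HashMap<u64, u64>>, // hash → fake "kernel" value
    compiles: AtomicUsize,              // counts how often the slow path ran
}

impl LockedCache {
    pub fn new() -> Self {
        Self {
            kernels: RwLock::new(HashMap::new()),
            compiles: AtomicUsize::new(0),
        }
    }

    pub fn get_or_compile(&self, hash: u64, compile: impl FnOnce() -> u64) -> u64 {
        // Hit path: still pays for acquiring and releasing the read lock.
        let hit = self.kernels.read().unwrap().get(&hash).copied();
        if let Some(kernel) = hit {
            return kernel;
        }
        // Miss path: the write lock blocks every reader while `compile` runs.
        let mut map = self.kernels.write().unwrap();
        *map.entry(hash).or_insert_with(|| {
            self.compiles.fetch_add(1, Ordering::Relaxed);
            compile()
        })
    }

    /// Number of times the slow (compile) path was taken.
    pub fn compile_count(&self) -> usize {
        self.compiles.load(Ordering::Relaxed)
    }
}
```

Every caller, including pure cache hits, goes through `read()`, and a single miss holds `write()` for the full compile time.
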
## The Fix

Two-phase architecture using `LazyLock::get_mut` (stable in Rust 1.94):

```
Phase 1: BUILD (mutable, single-threaded):
  &mut JitEngine → LazyLock::get_mut(&mut engine.cache) → &mut KernelCache
  (get_mut returns Some only once the cache is initialized;
  force_mut initializes it on demand and always returns &mut.)
  Compile ALL kernels upfront. 100 kernels × 521µs ≈ 52ms.
  No locks. No contention. Runs during startup/loading.

Phase 2: RUN (frozen, lock-free reads):
  &JitEngine → &LazyLock → &KernelCache (immutable reference)
  Function pointer lookup = HashMap::get(). No lock. No atomic
  read-modify-write, only LazyLock's "already initialized" check.
  Zero contention. Zero compilation. Zero latency spikes.

get_mut() needs &mut access. Once the engine is shared (&JitEngine,
Arc<JitEngine>), no &mut exists and the cache is frozen.
Any attempt to compile during Phase 2 is a compile error, not a
runtime error. The borrow checker enforces the freeze.
```

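The two phases can be sketched with standard-library types only. This is an illustration under assumptions: `ToyEngine`, `Kernel`, and `build_then_run` are hypothetical names, a plain `HashMap` stands in for the `LazyLock`-wrapped cache, and ordinary function pointers stand in for JIT-compiled code.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a compiled kernel: a plain function pointer.
type Kernel = fn(u32) -> u32;

/// Toy engine with the same two-phase shape as the spec.
pub struct ToyEngine {
    cache: HashMap<u64, Kernel>,
}

impl ToyEngine {
    pub fn new() -> Self {
        Self { cache: HashMap::new() }
    }

    /// BUILD: needs exclusive access, so it cannot run once the engine is shared.
    pub fn compile(&mut self, hash: u64, kernel: Kernel) {
        self.cache.entry(hash).or_insert(kernel);
    }

    /// RUN: shared access only; a plain HashMap lookup with no lock.
    pub fn get(&self, hash: u64) -> Option<Kernel> {
        self.cache.get(&hash).copied()
    }
}

fn double(x: u32) -> u32 { x * 2 }
fn square(x: u32) -> u32 { x * x }

/// Builds the cache, freezes it behind an `Arc`, and reads it from four threads.
pub fn build_then_run() -> Vec<u32> {
    // Phase 1: BUILD
    let mut engine = ToyEngine::new();
    engine.compile(1, double);
    engine.compile(2, square);

    // Phase 2: RUN. Moving the engine into an Arc gives up `&mut` access.
    let engine = Arc::new(engine);
    // engine.compile(3, double); // rejected by the compiler: an Arc only hands out `&ToyEngine`

    let handles: Vec<_> = (0..4u32)
        .map(|i| {
            let engine = Arc::clone(&engine);
            thread::spawn(move || {
                // Even i uses hash 1 (double), odd i uses hash 2 (square).
                let kernel = engine.get(1 + u64::from(i % 2)).expect("kernel missing");
                kernel(i + 1)
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

The freeze here is not a flag: it is the absence of any `&mut ToyEngine` once the value lives inside the `Arc`.
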
## Implementation

### Current code (src/hpc/jitson_cranelift/engine.rs):

```rust
// CURRENT: wrong
pub struct JitEngine {
    module: JITModule,
    pub caps: CpuCaps,
    scan_cache: RwLock<HashMap<u64, (*const u8, FuncId)>>, // ← lock on every access
}

impl JitEngine {
    pub fn compile_scan(&self, params: ScanParams) -> Result<ScanKernel, JitError> {
        // read-lock to check cache
        // write-lock to compile on miss
        // contention, latency spike
    }
}
```

### New code:

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

// Cranelift types used below.
use cranelift_jit::JITModule;
use cranelift_module::FuncId;

/// JIT compilation engine with two-phase lifecycle:
/// 1. BUILD: compile kernels via `compile()` (needs `&mut self`)
/// 2. RUN: look up kernels via `get()` (needs only `&self`, no lock)
///
/// Once the engine is shared (`&JitEngine` or `Arc<JitEngine>`), no `&mut`
/// access remains and no more kernels can be compiled. The cache is frozen.
pub struct JitEngine {
    module: JITModule,
    pub caps: CpuCaps,
    /// Kernel cache. Mutable during BUILD, frozen during RUN.
    cache: LazyLock<KernelCache>,
}

/// The frozen kernel registry. Hash-indexed for the hot path.
struct KernelCache {
    /// Hash → compiled kernel. Immutable after freeze.
    map: HashMap<u64, CachedKernel>,
    /// Ordered list for prefetch chain (WAL precompile queue order).
    prefetch_chain: Vec<(u64, *const u8)>,
}

struct CachedKernel {
    fn_ptr: *const u8,
    func_id: FuncId,
    params: ScanParams,
}

// SAFETY: raw pointers are not Send/Sync by default. These impls rely on the
// compiled code pages being immutable once finalized.
unsafe impl Send for KernelCache {}
unsafe impl Sync for KernelCache {}

impl JitEngine {
    pub fn new() -> Result<Self, JitError> {
        JitEngineBuilder::new().build()
    }

    // ── Phase 1: BUILD (mutable) ──────────────────────────────

    /// Compile a scan kernel and add it to the cache.
    /// Takes `&mut self`, so it is only callable during BUILD, before the
    /// engine is shared. Calling it after freeze is a compile error.
    pub fn compile(&mut self, params: ScanParams) -> Result<u64, JitError> {
        let hash = params_hash(&params, None);
        // force_mut initializes the cache on first use. get_mut would return
        // None here while the LazyLock is still uninitialized.
        if LazyLock::force_mut(&mut self.cache).map.contains_key(&hash) {
            return Ok(hash); // already compiled
        }

        // The cache borrow ended above, so `self` can be borrowed again here.
        let (fn_ptr, func_id) = self.compile_inner(&params, None)?;
        let cache = LazyLock::force_mut(&mut self.cache);
        cache.map.insert(hash, CachedKernel { fn_ptr, func_id, params });
        cache.prefetch_chain.push((hash, fn_ptr));
        Ok(hash)
    }

    /// Compile a hybrid scan kernel (JIT loop + external SIMD function).
    pub fn compile_hybrid(&mut self, params: ScanParams, distance_fn: &str) -> Result<u64, JitError> {
        let hash = params_hash(&params, Some(distance_fn));
        if LazyLock::force_mut(&mut self.cache).map.contains_key(&hash) {
            return Ok(hash);
        }

        let (fn_ptr, func_id) = self.compile_inner(&params, Some(distance_fn))?;
        let cache = LazyLock::force_mut(&mut self.cache);
        cache.map.insert(hash, CachedKernel { fn_ptr, func_id, params });
        cache.prefetch_chain.push((hash, fn_ptr));
        Ok(hash)
    }

    /// Compile all kernels from a precompile queue (batch BUILD).
    pub fn compile_batch(&mut self, queue: &[ScanParams]) -> Result<Vec<u64>, JitError> {
        queue.iter().map(|p| self.compile(p.clone())).collect()
    }

    /// Compile all palette kernels (bits_per_index 1-8).
    pub fn compile_palette_kernels(&mut self) -> Result<(), JitError> {
        for bits in 1..=8 {
            self.compile(ScanParams {
                threshold: u32::MAX, // palette unpack doesn't threshold
                top_k: 4096,         // full section
                prefetch_ahead: 4,
                focus_mask: None,
                record_size: bits as u32,
            })?;
        }
        Ok(())
    }

    // ── Phase 2: RUN (frozen, zero-cost) ──────────────────────

    /// Look up a compiled kernel by hash. No lock is taken.
    /// Returns None if the kernel wasn't compiled during BUILD.
    #[inline(always)]
    pub fn get(&self, hash: u64) -> Option<ScanKernel> {
        // Deref through &self yields &KernelCache. Nothing can mutate the
        // cache while any shared borrow of the engine is alive.
        self.cache.map.get(&hash).map(|k| ScanKernel::from_raw(k.fn_ptr, k.params.clone()))
    }

    /// Prefetch the NEXT kernel's code page.
    /// Call while executing kernel N to warm L1 for kernel N+1.
    #[inline(always)]
    pub fn prefetch_next(&self, current_hash: u64) {
        let chain = &self.cache.prefetch_chain;
        // Linear scan over the chain: O(n) in the number of kernels.
        if let Some(idx) = chain.iter().position(|(h, _)| *h == current_hash) {
            if let Some((_, next_ptr)) = chain.get(idx + 1) {
                #[cfg(target_arch = "x86_64")]
                unsafe {
                    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
                    _mm_prefetch::<_MM_HINT_T0>(*next_ptr as *const i8);
                }
            }
        }
    }

    /// Number of compiled kernels.
    pub fn kernel_count(&self) -> usize {
        self.cache.map.len()
    }

    /// Has the cache been initialized (by the first compile or the first read)?
    pub fn is_frozen(&self) -> bool {
        // LazyLock::get returns Some once the value is initialized. It cannot
        // tell whether `&mut self` is still reachable, so the actual freeze is
        // the moment the engine is shared, not a flag stored here.
        LazyLock::get(&self.cache).is_some()
    }
}
```

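The spec calls `params_hash(&params, distance_fn)` without showing it. One possible shape, as a sketch only: the `ScanParams` field types below are guesses based on the literals in `compile_palette_kernels` (the `focus_mask` type in particular is an assumption), and `DefaultHasher` is just a convenient standard-library choice whose output is not guaranteed to stay the same across Rust releases.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical mirror of the fields used in `compile_palette_kernels`.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct ScanParams {
    pub threshold: u32,
    pub top_k: u32,
    pub prefetch_ahead: u32,
    pub focus_mask: Option<Vec<u8>>, // assumed type
    pub record_size: u32,
}

/// Cache key: the parameters plus the optional external distance function name.
pub fn params_hash(params: &ScanParams, distance_fn: Option<&str>) -> u64 {
    // DefaultHasher::new() uses fixed keys, so equal inputs hash equally
    // within one build of the program.
    let mut hasher = DefaultHasher::new();
    params.hash(&mut hasher);
    distance_fn.hash(&mut hasher);
    hasher.finish()
}
```

Hashing `distance_fn` as an `Option` keeps a plain kernel and a hybrid kernel with identical `ScanParams` under different keys, which is what `compile` and `compile_hybrid` rely on.
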
## Usage Pattern

### Pumpkin Server Startup

```rust
use std::sync::Arc;

fn main() {
    // ── Phase 1: BUILD (during "Loading..." screen) ──
    let mut jit = JitEngine::new().unwrap();

    // Palette kernels (8 bit widths, bits_per_index 1-8)
    jit.compile_palette_kernels().unwrap();

    // Noise kernels (per biome)
    for biome in &world.biomes {
        jit.compile(biome.noise_params.to_scan_params()).unwrap();
    }

    // Property mask kernels
    jit.compile(waterlogged_mask_params()).unwrap();
    jit.compile(tick_eligible_params()).unwrap();

    // Distance threshold kernels
    for radius in [16.0, 32.0, 64.0, 128.0] {
        jit.compile(radius_scan_params(radius)).unwrap();
    }

    println!("JIT: {} kernels compiled in BUILD phase", jit.kernel_count());
    // "JIT: 97 kernels compiled in BUILD phase" (took ~50ms)

    // ── Phase 2: RUN (frozen, zero-cost) ──
    // Share across threads. Moving the engine into an Arc gives up `&mut`
    // access: that is the freeze.
    let jit = Arc::new(jit); // Arc, not Arc<RwLock>: already frozen
    // From here: .compile() no longer type-checks. Cache is immutable.
    // Every .get() is a HashMap lookup. No lock. No contention.
    let kernel = jit.get(palette_hash_4bit).unwrap();

    // Tick loop: zero-cost kernel access
    loop {
        let kernel = jit.get(current_palette_hash).unwrap();
        unsafe { kernel.scan(query, field, len, size, out) };
        jit.prefetch_next(current_palette_hash); // warm next kernel
    }
}
```

### q2 Cockpit Replay

```rust
fn start_replay(engine: &mut VizEngine, jit: &mut JitEngine, versions: &[PathBuf]) {
    // ── Phase 1: BUILD: compile all version kernels before play ──
    for (i, version_file) in versions.iter().enumerate() {
        let params = version_scan_params(i, version_file);
        jit.compile(params).unwrap();
    }
    println!("Replay: {} kernels pre-compiled", jit.kernel_count());

    // ── Phase 2: RUN: play button starts, zero compilation during playback ──
    for (i, version_file) in versions.iter().enumerate() {
        let hash = version_hash(i);
        let kernel = jit.get(hash).unwrap(); // frozen, instant
        // Process version through thinking graph...
        // No latency spikes. No lock contention. Smooth playback.
    }
}
```

## Why This Matters

```
RwLock cache hit:      ~25ns  (atomic read, memory barrier)
LazyLock frozen get:   ~5ns   (plain HashMap::get, no lock)

RwLock cache miss:     521µs + write-lock contention
LazyLock cache miss:   get() returns None → unwrap() panics
                       (compiling after freeze is a compile error)

100 kernel lookups/tick × 20ns saved = 2µs/tick saved
At 20 TPS = 40µs/second saved on lock overhead alone
```

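The arithmetic above, together with the BUILD cost quoted earlier (100 kernels × 521µs), can be checked with a few lines. The function names are hypothetical; the inputs are the figures quoted in this document, not new measurements.

```rust
use std::time::Duration;

/// Total one-time BUILD cost for `kernels` compilations.
pub fn build_cost(kernels: u32, compile_time: Duration) -> Duration {
    compile_time * kernels
}

/// Lock overhead avoided per second of wall-clock time.
pub fn lock_overhead_saved_per_second(
    lookups_per_tick: u32,
    saved_per_lookup: Duration,
    ticks_per_second: u32,
) -> Duration {
    saved_per_lookup * lookups_per_tick * ticks_per_second
}
```

With 100 kernels at 521µs each, `build_cost` gives 52.1ms; with 100 lookups per tick, 20ns saved per lookup, and 20 TPS, the second function gives 40µs per second.
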
The real win isn't nanoseconds. It's DETERMINISM. The tick loop never
stalls for compilation. The replay never hiccups. The demo never stutters.
Every kernel that will ever be needed is compiled before the first tick.
If you forgot one, you get a panic on its first lookup, not a lag spike
in front of an audience.

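To surface a forgotten kernel at the end of BUILD rather than on its first lookup, the set of compiled hashes can be compared against the hashes the RUN phase will ask for. A minimal sketch, assuming the caller can list the required hashes up front; `missing_kernels` and `assert_all_compiled` are hypothetical helpers, not part of the spec.

```rust
use std::collections::HashSet;

/// Hashes that RUN will request but BUILD did not compile, in request order.
pub fn missing_kernels(compiled: &HashSet<u64>, required: &[u64]) -> Vec<u64> {
    required
        .iter()
        .copied()
        .filter(|hash| !compiled.contains(hash))
        .collect()
}

/// Panics at startup if any required kernel is absent.
pub fn assert_all_compiled(compiled: &HashSet<u64>, required: &[u64]) {
    let missing = missing_kernels(compiled, required);
    assert!(
        missing.is_empty(),
        "JIT: kernels missing after BUILD: {:?}",
        missing
    );
}
```

Calling `assert_all_compiled` as the last step of Phase 1 moves the failure to the loading screen.
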
## The Amiga Parallel

Amiga demo coders pre-computed copper lists during the vertical blank
interval. When the display beam reached the visible area, every list
was ready. No computation during rendering. The beam just read addresses.

`LazyLock::get_mut` IS the vertical blank interval:
- BUILD = VBI (compile everything, nobody's watching)
- RUN = visible area (just read function pointers, zero computation)

The freeze is not a limitation. It's a GUARANTEE.

## What NOT to do

- Do NOT keep RwLock as a fallback "just in case"
- Do NOT add a `compile_if_missing()` method that works during RUN
- Do NOT use `OnceCell` or `OnceLock`; keep the cache in `LazyLock` as specified above
- Do NOT make the panic optional: if a kernel is missing during RUN, that's a bug
- SIMD stays on slices. This refactor doesn't touch SIMD paths.
