Skip to content

Commit d354424

Browse files
authored
release/v0.6.0 PR-A: CSE cost-aware dedup gate (eliminates gale +6.3% regression) (#94)
* fix(cse): cost-aware dedup gate eliminates gale +6.3% regression The v0.4.0 audit measured LOOM's CSE growing the gale_ffi kernel-scheduler code section by +6.3% while wasm-opt -O3 reduced it by -2.0%. The cause: enhanced-CSE deduplicated every duplicate expression including 1-2 byte constants. Replacing `i32.const -22` (2 bytes encoded) with `local.tee N / local.get N` (2+2 = 4 bytes) plus an additional local declaration (~2 bytes amortized) is unconditionally a size regression on cheap constants — and gale is full of them (errno values). Fix: add a cost gate `Expr::worth_dedup(occurrences)` that estimates net byte savings before deciding to dedup. Skip when: net = (N - 1) * (cost - 2) - 4 ≤ 0 Examples: i32.const 42 (cost=2, N=10): savings = -4 → skip i32.add cost=5, N=2: savings = -1 → skip i32.add cost=5, N=3: savings = +2 → keep Measurement on gale_ffi: v0.5.0: code section 811 → 862 bytes (+6.3% regression) v0.6.0: code section 811 → 808 bytes (-0.4% net win) Tests: - test_cse_phase4_duplicate_constants_above_cost_threshold: LARGE constants (5+ byte LEB128) still get deduplicated. - test_cse_phase4_keeps_small_constants: regression test for the gale fix. Cheap constants must survive CSE. Pick #1 from v0.6.0 wasm-opt-gap research agent's plan. Trace: REQ-3, REQ-14 * docs(research): gale source patterns + wasm-opt pass gap analysis Two research outputs from v0.6.0 planning, both grounded in real Verus-verified kernel-scheduler code at /Users/r/git/pulseengine/z/gale. source-pattern-analysis.md — eight optimization-relevant patterns in gale source with file:line citations: - Closed-set FSM dispatch (br_table targets) — 6 near-identical matches over SchedThreadState in sched.rs:649-779 - Default-then-override (the LOOM v0.4/v0.5 hoist guard pattern) — canonical example in sched.rs:404-444 (next_up_smp), four more found - Verus-bounded loops — 24 `decreases` clauses, all of form MAX_CONST - i (MAX_WAITERS=64, MAX_CPUS=16) - Tail-call dispatch matches, leaf-inline candidates, bit-mask axiom ingestion (event.rs lemmas), 2D state-machine matches, and Verus annotations as trusted axioms (1607 clauses total) wasm-opt-gap-analysis.md — top 7 wasm-opt passes ranked by expected payoff on kernel code, cheapest-first: 1. Constant-CSE suppression gate (50 LOC, 0.3 wks) — DONE in this PR 2. reorder-locals (slot renumbering) (250 LOC, 0.5 wks) 3. RedundantSetElimination liveness (600 LOC, 1.5 wks) 4. Compare-operand canonicalization (400 LOC, 1.0 wk) 5. merge-locals (500 LOC, 1.5 wks) 6. directize call_indirect → call (700 LOC, 2.0 wks) 7. simplify-locals sinking mode (1500 LOC, 4.0 wks) The first three together (~1.3 weeks combined) flip the sign of the gale +6.3% regression. Pick #1 is shipped in this PR; picks #2 and #3 are tracked for follow-up PRs in v0.6.0. Trace: REQ-7
1 parent 01b436a commit d354424

4 files changed

Lines changed: 945 additions & 32 deletions

File tree

.claude/settings.local.json

Lines changed: 19 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -33,7 +33,25 @@
3333
"Bash(cargo +nightly fuzz build:*)",
3434
"Bash(cargo +nightly fuzz run:*)",
3535
"Bash(timeout 10 cargo +nightly fuzz run:*)",
36-
"WebFetch(domain:docs.rs)"
36+
"WebFetch(domain:docs.rs)",
37+
"Bash(git checkout *)",
38+
"Bash(Z3_SYS_Z3_HEADER=/opt/homebrew/include/z3.h LIBRARY_PATH=/opt/homebrew/lib cargo build --release)",
39+
"Bash(git add *)",
40+
"Bash(git commit -m ' *)",
41+
"Bash(git push *)",
42+
"Bash(gh pr *)",
43+
"Bash(git fetch *)",
44+
"Bash(git stash *)",
45+
"Bash(git reset *)",
46+
"Bash(git tag -a v0.5.0 1e96c68 -m ' *)",
47+
"Bash(Z3_SYS_Z3_HEADER=/opt/homebrew/include/z3.h LIBRARY_PATH=/opt/homebrew/lib cargo test --release --lib)",
48+
"Bash(wasm-tools dump *)",
49+
"Bash(Z3_SYS_Z3_HEADER=/opt/homebrew/include/z3.h LIBRARY_PATH=/opt/homebrew/lib cargo test --release --lib -- test_cse)",
50+
"Bash(awk '/test_cse_phase4_keeps_small_constants/,/^test result/' /private/tmp/claude-501/-Users-r-git-pulseengine-loom/c4560ac2-258a-4b13-acc5-e9e2a47f62e5/tasks/bng9rzwln.output)",
51+
"Bash(Z3_SYS_Z3_HEADER=/opt/homebrew/include/z3.h LIBRARY_PATH=/opt/homebrew/lib cargo test --release --lib -- test_cse_phase4_keeps_small_constants --nocapture)",
52+
"Bash(Z3_SYS_Z3_HEADER=/opt/homebrew/include/z3.h LIBRARY_PATH=/opt/homebrew/lib cargo test --release --lib -- test_cse_phase4 --nocapture)",
53+
"Bash(grep -E \"Failed$|^error\" /private/tmp/claude-501/-Users-r-git-pulseengine-loom/c4560ac2-258a-4b13-acc5-e9e2a47f62e5/tasks/bafdrty57.output)",
54+
"Bash(cargo fmt *)"
3755
],
3856
"deny": [],
3957
"ask": []
Lines changed: 362 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,362 @@
1+
# Gale Pattern Analysis for LOOM Optimization Research
2+
3+
**Source tree:** `/Users/r/git/pulseengine/z/gale/src/` (Verus-verified Rust
4+
model of Zephyr's kernel-scheduler primitives — sem / mutex / sched /
5+
wait_queue / timeout / event / futex / cpu_mask / atomic).
6+
7+
**Method:** read-only inspection of the largest scheduler-relevant files
8+
(`sched.rs` 916 LoC, `mutex.rs` 560, `sem.rs` 539, `wait_queue.rs` 478,
9+
`timeout.rs` 709, `event.rs`, `futex.rs`, `cpu_mask.rs`, `priority.rs`,
10+
`poll.rs`, `atomic.rs`, `ipi.rs`).
11+
12+
**Module-wide quick facts**
13+
14+
| metric | count | comment |
15+
|---|---|---|
16+
| `requires` clauses | 627 | trusted preconditions |
17+
| `ensures` clauses | 980 | post-conditions LOOM can ingest |
18+
| `decreases` clauses | 24 | bounded-loop ranking functions |
19+
| `match` statements (8 hot files) | 29 | many are 7-arm enum dispatches |
20+
21+
The repository is *unusually* dense in (a) closed-set enum dispatches, (b)
22+
`Result<…, EINVAL>` early-return guards, and (c) bounded for-loops over
23+
fixed-size arrays (`[Option<Thread>; 64]`, `MAX_CPUS = 16`,
24+
`MAX_PARTITIONS`, `MAX_POLL_EVENTS`). Every property below is grounded in
25+
that structure.
26+
27+
---
28+
29+
## 1. Closed-set state-machine dispatch (br_table candidate)
30+
31+
**Pattern.** A 7-arm `match` on a small `#[repr(u8)]`-style enum where every
32+
arm is either `Ok(NewState)` or `Err(EINVAL)`. The discriminant is dense
33+
(0..6), so rustc/LLVM emits a `br_table` (jump table). Many of these
34+
functions are then composed into outer state machines.
35+
36+
**Examples (file:line).**
37+
- `src/sched.rs:649-657` `sched_suspend` — 7 arms, 6 of them `Err(EINVAL)`.
38+
- `src/sched.rs:669-677` `sched_resume` — 7 arms, only `Suspended` succeeds.
39+
- `src/sched.rs:721-729` `sched_sleep` — only `Running` succeeds.
40+
- `src/sched.rs:740-748` `sched_wakeup` — only `Sleeping` succeeds.
41+
- `src/sched.rs:759-767` `sched_pend` — only `Running` succeeds.
42+
- `src/sched.rs:779-787` `sched_unpend` — only `Pending` succeeds.
43+
44+
**Two-axis pair-match (state×event).**
45+
- `src/sched.rs:530-543` `is_valid_transition``match (from, to)` over
46+
`(ThreadState, ThreadState)` (4×4 grid).
47+
- `src/sched.rs:592-630` `sched_is_valid_transition``match (from, to)`
48+
over `(SchedThreadState, SchedThreadState)` (7×7 = 49 cases, encoded as
49+
~25 explicit arms + wildcard).
50+
51+
**Why it matters for LOOM.**
52+
1. *Constant-folding through dispatch.* A caller invoking `sched_resume`
53+
on a known-`Suspended` value should fold to `Ok(Ready)` — single move.
54+
Today this requires an inter-procedural enum-discriminant tracker.
55+
2. *Switch-table → perfect-hash.* Where >50 % of arms are
56+
`Err(EINVAL)`, LOOM can rewrite to `if state == OnlyOk { Ok(...) } else
57+
{ Err(EINVAL) }` — kills the indirect jump entirely. Branch-prediction
58+
wins on the sparse-success pattern that dominates here.
59+
3. *FSM equivalence proofs.* The pair-match form (`(from, to)`) is the
60+
canonical example LOOM/ISLE rule synthesis was designed for. Every
61+
`_ => false` is a refutable rule the optimizer can prove sound from the
62+
`ensures` clause directly above it (e.g. `from === Dead ==> !result`).
63+
64+
**Expected payoff.** Concrete: 10-30 % cycle reduction on `sched_*`
65+
hot paths (one less jump-table indirect, dead code eliminated). Strategic:
66+
this is the cleanest source-level evidence I've seen for the
67+
"trusted-axiom-driven branch elimination" idea in
68+
`docs/research/novel-optimization-ideas.md`.
69+
70+
---
71+
72+
## 2. Default-then-override (the v0.4.0/v0.5.0 hoist-guard pattern)
73+
74+
**Pattern.** `let mut x = DEFAULT; ...; if cond { x = OVERRIDE }; x`.
75+
LOOM v0.4/v0.5 added hoist guards specifically for this idiom because
76+
LLVM frequently fails to forward-propagate when the override is conditional
77+
on `Option::is_some()` or a state-machine match.
78+
79+
**Examples.**
80+
81+
1. **`src/sched.rs:404-444` `next_up_smp` — the canonical case.**
82+
```
83+
let mut thread: Option<Thread> = runq_best; // default
84+
if let Some(mirqp) = cpu_state.metairq_preempted { ... thread = Some(mirqp); ... }
85+
let candidate = match thread { Some(t) => t, None => cpu_state.idle_thread };
86+
let mut chosen = candidate; // default
87+
if current_is_active { ... chosen = current; ... }
88+
if !should_preempt(...) { chosen = current; }
89+
```
90+
Two stacked default-then-override variables (`thread`, `chosen`) feeding
91+
into a single returned `SmpSchedOutcome`. Exactly the shape that
92+
motivated the hoist guards.
93+
94+
2. **`src/sched.rs:183-208` `RunQueue::add` insertion search.**
95+
```
96+
let mut insert_pos: u32 = self.len; // default = "append at end"
97+
let mut i: u32 = 0;
98+
let mut found: bool = false;
99+
while ... { if thr_pri < entry_pri { insert_pos = i; found = true; } ... }
100+
```
101+
`insert_pos` defaults to `self.len`, overridden when the loop finds a
102+
spot. The companion `WaitQueue::pend` at `src/wait_queue.rs:292-332`
103+
is byte-for-byte the same pattern.
104+
105+
3. **`src/cpu_mask.rs:104-143` `cpu_mask_mod`.**
106+
```
107+
if is_running { return CpuMaskResult { mask: current_mask, error: EINVAL }; }
108+
let new_mask: u32 = (current_mask | enable) & !disable;
109+
if new_mask == 0 { return CpuMaskResult { mask: current_mask, error: EINVAL }; }
110+
if pin_only && (new_mask & (new_mask - 1)) != 0 { return CpuMaskResult { mask: current_mask, error: EINVAL }; }
111+
CpuMaskResult { mask: new_mask, error: OK }
112+
```
113+
The `mask: current_mask` field is "default", `error: OK/EINVAL` is the
114+
override. LLVM partly handles this via SROA but the `Result`-shaped
115+
struct often forces a stack slot.
116+
117+
4. **`src/futex.rs:290-344` `wake`.**
118+
`let mut woken: u32 = 0; ... match thread { Some(t) => { woken = 1 } None => {} }`
119+
— classic counter default + match-arm increment.
120+
121+
5. **`src/ipi.rs:109-141` `compute_ipi_mask`.**
122+
`let mut mask: u32 = 0u32; ... if cond { mask = mask | bit; }` — bitmap
123+
build via OR-into-default.
124+
125+
**Why it matters.** This is *the* pattern LOOM already targets; gale gives
126+
LOOM a clean, dense, real-world benchmark suite. Every example above has
127+
a Verus `ensures` clause that pins the final value as a function of inputs
128+
— LOOM can use those clauses to prove the override is unconditional in
129+
specific call sites and collapse the assignment.
130+
131+
---
132+
133+
## 3. Verus-verified bounded loops (full-unroll / k-induction targets)
134+
135+
24 `decreases` clauses across the kernel. All ranking functions are
136+
`MAX_CONST - i` or `len - i` with `MAX` as a `pub const u32`.
137+
138+
**Examples.**
139+
140+
| file:line | bound | iterations |
141+
|---|---|---|
142+
| `src/wait_queue.rs:206`, `319`, `359`, `437` | `MAX_WAITERS = 64` | ≤ 64 |
143+
| `src/sched.rs:197` | `MAX_RUNQ_SIZE = 64` | ≤ 64 |
144+
| `src/poll.rs:558,577,598` | `MAX_POLL_EVENTS` | ≤ 32 (typical) |
145+
| `src/mem_domain.rs:202,306,387,460,478` | `MAX_PARTITIONS` | ≤ 16 |
146+
| `src/device_init.rs:329` | `dev.num_deps - i` | ≤ small |
147+
| `src/ipi.rs:121` | `num_cpus - idx`, `MAX_CPUS = 16` | ≤ 16 |
148+
| `src/futex.rs:329` | `count - i`, `count ≤ MAX_WAITERS = 64` | ≤ 64 |
149+
150+
**Why it matters for LOOM.**
151+
- *Full unroll candidates.* Every loop with `decreases MAX_CPUS - i` and
152+
`MAX_CPUS = 16` is a guaranteed-bounded, fixed-trip loop. LOOM can fully
153+
unroll without runtime check (the `requires` clause already proved the
154+
bound) — converting a loop body into a sequence of 16 SIMD-friendly ops.
155+
- *k-induction-friendly.* Loops carrying invariants like
156+
`forall|k: int| 0 <= k < i ==> entries[k].is_some()` (e.g.
157+
`src/wait_queue.rs:194-200`) are perfect input to bounded SMT-based
158+
loop verification. The Verus invariants *are* the k-induction
159+
hypothesis.
160+
- *Auto-vectorization.* `runq_shift_left` / `runq_shift_and_insert`
161+
(`src/sched.rs:80-105`, marked `#[verifier::external_body]`) are pure
162+
array shifts on `[Option<Thread>; 64]` — LOOM can replace with
163+
`memmove` / SIMD shuffle once it ingests the bound.
164+
165+
**Expected payoff.** Big — 2-5× on the wait-queue / runqueue ops if
166+
unrolled + vectorized. These are the absolute hottest paths in any
167+
RTOS kernel.
168+
169+
---
170+
171+
## 4. Tail-call / dispatch-only `match`
172+
173+
**Pattern.** `match enum { V1 => f(), V2 => g(), V3 => h(), ... }` where
174+
every arm is a single function call with the same signature. LLVM should
175+
emit jump-threaded tail calls but often fails when arms construct
176+
different `Ok(...)` variants.
177+
178+
**Examples.**
179+
180+
- `src/sched.rs:649-657`, `669-677`, `721-729`, `740-748`, `759-767`,
181+
`779-787` — every `sched_*` lifecycle function has the form
182+
`match state { S1 => Ok(T1), _ => Err(EINVAL) }` where each arm is a
183+
single-expression constructor. Six functions, six identical shapes —
184+
the classic case for *cross-function rule synthesis* (LOOM ingests one
185+
rule, applies to all six).
186+
187+
- `src/event.rs:296-336` `wait_decide` — three-way dispatch
188+
(`Matched` / `Pend` / `Timeout`), each arm builds a `WaitDecideResult`
189+
struct. The struct constructions share the `matched_events: 0` tail —
190+
LOOM's struct-field constant-folding should collapse two of three arms.
191+
192+
- `src/sched.rs:318-321`, `421-424`, `510-513` — three sites with the
193+
*exact same* idiom `match opt { Some(t) => t, None => fallback }`.
194+
Identical AST → ideal for caching one optimization recipe.
195+
196+
**Payoff.** Smaller per-site (a few cycles) but accumulates because these
197+
helpers are called from every kernel object operation.
198+
199+
---
200+
201+
## 5. Cross-function leaf-inlining + constant propagation
202+
203+
**Tiny leaf functions called from one site.**
204+
205+
- `src/sched.rs:271-278` `prio_cmp(a,b) -> i64` — three-line wrapper around
206+
`i64` subtraction, called only from `next_up_smp`
207+
(`src/sched.rs:430`) where one operand is `current.priority`. Inline +
208+
fold = direct `priority` field comparison.
209+
210+
- `src/sched.rs:285-300` `should_preempt` — pure function of three bools,
211+
called twice in `next_up_smp` and once in `update_cache`. Three call
212+
sites, each with at least one boolean known at the call site (e.g.
213+
`chosen.is_metairq` is often statically `false` for non-MetaIRQ paths).
214+
Specialize → branch elimination.
215+
216+
- `src/sched.rs:483-494` `update_metairq_preempt` — called once. Inline.
217+
218+
- `src/priority.rs:62-70` `is_higher_than``self.value < other.value`
219+
wrapped in a method. Called from many sites; the `requires
220+
self.inv()` clause guarantees the field read is in-bounds — LOOM can
221+
drop it.
222+
223+
- `src/atomic.rs:271-279`/`282-290` `inc`/`dec` — wrappers around
224+
`add(1)`/`sub(1)`. Two-line shims; *every* increment in the kernel goes
225+
through them.
226+
227+
**Constant-argument propagation.**
228+
229+
- `compute_ipi_mask` (`src/ipi.rs:88-142`) is called with
230+
`MAX_CPUS = 16` in production. The loop bound `num_cpus <= max_cpus <=
231+
MAX_CPUS <= 32` means the inner `1u32 << idx` shift is always in range —
232+
LOOM can specialize the function for `max_cpus = 16` and unroll.
233+
234+
**Payoff.** Each individually small, cumulatively large because these
235+
helpers sit on every fast path.
236+
237+
---
238+
239+
## 6. Bit-manipulation / mask arithmetic
240+
241+
**Pattern.** Sequences of `&`/`|`/`!`/`<<` that are provably equivalent to
242+
simpler forms once their precondition is known.
243+
244+
**Examples.**
245+
246+
- `src/cpu_mask.rs:66-68` spec: `(current | enable) & !disable`. The
247+
module proves a power-of-two property
248+
`(m & (m-1)) == 0` (line 83) — LOOM can use this as an
249+
axiom whenever a mask flows from `cpu_pin_compute`.
250+
251+
- `src/event.rs:144-152` `clear`: `events &= !clear_events`. Combined
252+
with `post`'s `events |= new_events` (line 82-ish) the module gives
253+
*complete* monotonicity / idempotence proofs (`events.rs:208-258`).
254+
Every Verus `by (bit_vector)` proof in this file is an axiom LOOM can
255+
ingest:
256+
- `(events | new_events) & events == events` (post-monotonic)
257+
- `value & !value == 0u32` (set/clear roundtrip)
258+
- `(events | new_events) | new_events == events | new_events` (idempotent)
259+
260+
- `src/cpu_mask.rs:81-86` `validate_pin_mask`: `mask != 0 && (mask &
261+
(mask - 1)) == 0` — power-of-two test, hardware popcount candidate.
262+
263+
- `src/ipi.rs:127-134` mask construction inside an unrolled loop —
264+
`mask |= bit; bit = 1 << idx;` is exactly the pattern that compiles to
265+
PEXT/PDEP on x86 and `bit-build` on ARMv8.2+.
266+
267+
**Payoff.** ISA-specific (popcount, PDEP, single-cycle bit-set) — but
268+
LOOM's strategic goal of *ingesting verified bit-vector axioms as
269+
optimization rewrite rules* lines up perfectly. Each `assert(...) by
270+
(bit_vector)` in gale is, modulo translation, an ISLE/eqsat rewrite
271+
rule.
272+
273+
---
274+
275+
## 7. State-machine `match (state, event)` (LOOM ISLE rule-synth target)
276+
277+
The premier example in the codebase:
278+
279+
- `src/sched.rs:592-630` `sched_is_valid_transition(from, to)` — pure
280+
`match (from, to) { ... }` over `(SchedThreadState, SchedThreadState)`,
281+
7 states each. The `ensures` clause `from === Dead ==> !result` (line
282+
590) is a *trusted axiom* LOOM can use: any caller path that reaches
283+
this with statically-known `from = Dead` simplifies to `false` and
284+
prunes the entire downstream branch.
285+
286+
- `src/sched.rs:530-543` `is_valid_transition` — same shape, smaller FSM
287+
(4 states).
288+
289+
- `src/event.rs:289-336` `wait_decide(events, desired, wait_type,
290+
is_no_wait)` — effectively `match (wait_type, condition_met,
291+
is_no_wait)` collapsed into nested `if`s. The `ensures` clauses
292+
(lines 304-310) prove `matched_events == 0` on the non-Matched paths —
293+
LOOM can hoist that constant out.
294+
295+
**Why it matters.** LOOM's Wave 2 ISLE rule verification work was about
296+
precisely this kind of `match (s, e)` table. Gale is the test bed.
297+
298+
---
299+
300+
## 8. Verus annotations as ingestable axioms
301+
302+
`requires` (627), `ensures` (980), and `decreases` (24) clauses are
303+
machine-checked by Verus + Z3. LOOM can ingest them as *trusted* axioms
304+
without re-proving anything.
305+
306+
**Highest-value annotations (sample).**
307+
308+
| file:line | clause | what LOOM gains |
309+
|---|---|---|
310+
| `src/sched.rs:312-316` | `runq_best.is_none() ==> result === Idle` etc. | branch elimination on `next_up` calls |
311+
| `src/sched.rs:589-590` | `from === Dead ==> !result` | dead-code on FSM paths |
312+
| `src/sem.rs:265-273` | `wait_q.len_spec() == 0 && count < limit ==> count' == count + 1` | scalar replacement for the `count+1` path |
313+
| `src/timeout.rs:316-339` | full `Ok((new_tick, fired))` pre/post | timeout-fast-path inlining |
314+
| `src/cpu_mask.rs:111-122` | `error == OK ==> mask == ((current\|enable) & !disable)` | mask-arith folding |
315+
| `src/event.rs:213,220,242,250` | bit-vector lemmas (`by (bit_vector)`) | direct rewrite-rule ingestion |
316+
| `src/futex.rs:280-288` | `result.woken <= MAX_WAITERS` | bound propagation into callers |
317+
| `src/atomic.rs:222-231` | CAS success/failure semantics | lock-free fast-path specialization |
318+
319+
This is the largest, most concentrated source of verified pre/post-
320+
conditions I'm aware of in the LOOM benchmark space — strictly more
321+
machine-readable than C-level Frama-C ACSL in upstream Zephyr.
322+
323+
**Payoff.** Strategic, not numeric: gale lets LOOM measure what
324+
fraction of axioms actually trigger optimizations, and how often
325+
LLVM-only baselines miss the same opportunities. This is the
326+
primary research lever.
327+
328+
---
329+
330+
## Summary table
331+
332+
| # | Pattern | Sites (approx) | Payoff |
333+
|---|---|---|---|
334+
| 1 | Closed-set FSM `match` (br_table → const-fold) | ~10 in sched.rs alone | high |
335+
| 2 | Default-then-override (LOOM v0.4/v0.5 hoist) | ~7 hot sites | high (already targeted) |
336+
| 3 | Verus-bounded loops (full-unroll + SIMD) | 24 `decreases` | very high on wait/run queues |
337+
| 4 | Tail-call dispatch `match` | ~12 | medium |
338+
| 5 | Leaf-inline + const-prop | ~8 leaf fns | medium-high cumulative |
339+
| 6 | Bit-mask axiom ingestion | events/cpu_mask/ipi | ISA-dependent, strategic |
340+
| 7 | `match (state, event)` FSM | 2 large + 1 medium | high (LOOM ISLE target) |
341+
| 8 | Verus pre/post as trusted axioms | 1607 clauses | strategic — primary lever |
342+
343+
## Recommended next steps for LOOM
344+
345+
1. **Pick `sched.rs` lines 530-787 as the canary benchmark.** Six near-
346+
identical `match`-on-enum FSM functions, all with `ensures` clauses,
347+
all called from FFI (`pub`). It's the densest source of patterns 1, 7,
348+
and 8 simultaneously.
349+
2. **Wire `decreases` clauses into LOOM's loop-unrolling cost model.**
350+
Twenty-four loops with statically-known bounds tied to `pub const`s —
351+
minimal effort, large payoff.
352+
3. **Treat `assert(...) by (bit_vector)` as ground truth** for the
353+
bit-mask rewrite-rule database (pattern 6). The Verus bit-vector solver
354+
has already discharged them.
355+
4. **Use gale's compile output as the regression suite for hoist guards
356+
(pattern 2)**`next_up_smp` alone exercises three layered defaults.
357+
358+
---
359+
360+
*Generated 2026-04-29 from gale source tree at
361+
`/Users/r/git/pulseengine/z/gale/`. Read-only inspection; no files
362+
modified.*

0 commit comments

Comments
 (0)