|
| 1 | +# Gale Pattern Analysis for LOOM Optimization Research |
| 2 | + |
| 3 | +**Source tree:** `/Users/r/git/pulseengine/z/gale/src/` (Verus-verified Rust |
| 4 | +model of Zephyr's kernel-scheduler primitives — sem / mutex / sched / |
| 5 | +wait_queue / timeout / event / futex / cpu_mask / atomic). |
| 6 | + |
| 7 | +**Method:** read-only inspection of the largest scheduler-relevant files |
| 8 | +(`sched.rs` 916 LoC, `mutex.rs` 560, `sem.rs` 539, `wait_queue.rs` 478, |
| 9 | +`timeout.rs` 709, `event.rs`, `futex.rs`, `cpu_mask.rs`, `priority.rs`, |
| 10 | +`poll.rs`, `atomic.rs`, `ipi.rs`). |
| 11 | + |
| 12 | +**Module-wide quick facts** |
| 13 | + |
| 14 | +| metric | count | comment | |
| 15 | +|---|---|---| |
| 16 | +| `requires` clauses | 627 | trusted preconditions | |
| 17 | +| `ensures` clauses | 980 | post-conditions LOOM can ingest | |
| 18 | +| `decreases` clauses | 24 | bounded-loop ranking functions | |
| 19 | +| `match` statements (8 hot files) | 29 | many are 7-arm enum dispatches | |
| 20 | + |
| 21 | +The repository is *unusually* dense in (a) closed-set enum dispatches, (b) |
| 22 | +`Result<…, EINVAL>` early-return guards, and (c) bounded for-loops over |
| 23 | +fixed-size arrays (`[Option<Thread>; 64]`, `MAX_CPUS = 16`, |
| 24 | +`MAX_PARTITIONS`, `MAX_POLL_EVENTS`). Every property below is grounded in |
| 25 | +that structure. |
| 26 | + |
| 27 | +--- |
| 28 | + |
| 29 | +## 1. Closed-set state-machine dispatch (br_table candidate) |
| 30 | + |
| 31 | +**Pattern.** A 7-arm `match` on a small `#[repr(u8)]`-style enum where every |
| 32 | +arm is either `Ok(NewState)` or `Err(EINVAL)`. The discriminant is dense |
| 33 | +(0..6), so rustc/LLVM emits a `br_table` (jump table). Many of these |
| 34 | +functions are then composed into outer state machines. |
| 35 | + |
| 36 | +**Examples (file:line).** |
| 37 | +- `src/sched.rs:649-657` `sched_suspend` — 7 arms, 6 of them `Err(EINVAL)`. |
| 38 | +- `src/sched.rs:669-677` `sched_resume` — 7 arms, only `Suspended` succeeds. |
| 39 | +- `src/sched.rs:721-729` `sched_sleep` — only `Running` succeeds. |
| 40 | +- `src/sched.rs:740-748` `sched_wakeup` — only `Sleeping` succeeds. |
| 41 | +- `src/sched.rs:759-767` `sched_pend` — only `Running` succeeds. |
| 42 | +- `src/sched.rs:779-787` `sched_unpend` — only `Pending` succeeds. |
| 43 | + |
| 44 | +**Two-axis pair-match (state×event).** |
| 45 | +- `src/sched.rs:530-543` `is_valid_transition` — `match (from, to)` over |
| 46 | + `(ThreadState, ThreadState)` (4×4 grid). |
| 47 | +- `src/sched.rs:592-630` `sched_is_valid_transition` — `match (from, to)` |
| 48 | + over `(SchedThreadState, SchedThreadState)` (7×7 = 49 cases, encoded as |
| 49 | + ~25 explicit arms + wildcard). |
| 50 | + |
| 51 | +**Why it matters for LOOM.** |
| 52 | +1. *Constant-folding through dispatch.* A caller invoking `sched_resume` |
| 53 | + on a known-`Suspended` value should fold to `Ok(Ready)` — single move. |
| 54 | + Today this requires an inter-procedural enum-discriminant tracker. |
| 55 | +2. *Switch-table → perfect-hash.* Where >50 % of arms are |
| 56 | + `Err(EINVAL)`, LOOM can rewrite to `if state == OnlyOk { Ok(...) } else |
| 57 | + { Err(EINVAL) }` — kills the indirect jump entirely. Branch-prediction |
| 58 | + wins on the sparse-success pattern that dominates here. |
| 59 | +3. *FSM equivalence proofs.* The pair-match form (`(from, to)`) is the |
| 60 | + canonical example LOOM/ISLE rule synthesis was designed for. Every |
| 61 | + `_ => false` is a refutable rule the optimizer can prove sound from the |
| 62 | + `ensures` clause directly above it (e.g. `from === Dead ==> !result`). |
| 63 | + |
| 64 | +**Expected payoff.** Concrete: 10-30 % cycle reduction on `sched_*` |
| 65 | +hot paths (one less jump-table indirect, dead code eliminated). Strategic: |
| 66 | +this is the cleanest source-level evidence I've seen for the |
| 67 | +"trusted-axiom-driven branch elimination" idea in |
| 68 | +`docs/research/novel-optimization-ideas.md`. |
| 69 | + |
| 70 | +--- |
| 71 | + |
| 72 | +## 2. Default-then-override (the v0.4.0/v0.5.0 hoist-guard pattern) |
| 73 | + |
| 74 | +**Pattern.** `let mut x = DEFAULT; ...; if cond { x = OVERRIDE }; x`. |
| 75 | +LOOM v0.4/v0.5 added hoist guards specifically for this idiom because |
| 76 | +LLVM frequently fails to forward-propagate when the override is conditional |
| 77 | +on `Option::is_some()` or a state-machine match. |
| 78 | + |
| 79 | +**Examples.** |
| 80 | + |
| 81 | +1. **`src/sched.rs:404-444` `next_up_smp` — the canonical case.** |
| 82 | + ``` |
| 83 | + let mut thread: Option<Thread> = runq_best; // default |
| 84 | + if let Some(mirqp) = cpu_state.metairq_preempted { ... thread = Some(mirqp); ... } |
| 85 | + let candidate = match thread { Some(t) => t, None => cpu_state.idle_thread }; |
| 86 | + let mut chosen = candidate; // default |
| 87 | + if current_is_active { ... chosen = current; ... } |
| 88 | + if !should_preempt(...) { chosen = current; } |
| 89 | + ``` |
| 90 | + Two stacked default-then-override variables (`thread`, `chosen`) feeding |
| 91 | + into a single returned `SmpSchedOutcome`. Exactly the shape that |
| 92 | + motivated the hoist guards. |
| 93 | + |
| 94 | +2. **`src/sched.rs:183-208` `RunQueue::add` insertion search.** |
| 95 | + ``` |
| 96 | + let mut insert_pos: u32 = self.len; // default = "append at end" |
| 97 | + let mut i: u32 = 0; |
| 98 | + let mut found: bool = false; |
| 99 | + while ... { if thr_pri < entry_pri { insert_pos = i; found = true; } ... } |
| 100 | + ``` |
| 101 | + `insert_pos` defaults to `self.len`, overridden when the loop finds a |
| 102 | + spot. The companion `WaitQueue::pend` at `src/wait_queue.rs:292-332` |
| 103 | + is byte-for-byte the same pattern. |
| 104 | + |
| 105 | +3. **`src/cpu_mask.rs:104-143` `cpu_mask_mod`.** |
| 106 | + ``` |
| 107 | + if is_running { return CpuMaskResult { mask: current_mask, error: EINVAL }; } |
| 108 | + let new_mask: u32 = (current_mask | enable) & !disable; |
| 109 | + if new_mask == 0 { return CpuMaskResult { mask: current_mask, error: EINVAL }; } |
| 110 | + if pin_only && (new_mask & (new_mask - 1)) != 0 { return CpuMaskResult { mask: current_mask, error: EINVAL }; } |
| 111 | + CpuMaskResult { mask: new_mask, error: OK } |
| 112 | + ``` |
| 113 | + The `mask: current_mask` field is "default", `error: OK/EINVAL` is the |
| 114 | + override. LLVM partly handles this via SROA but the `Result`-shaped |
| 115 | + struct often forces a stack slot. |
| 116 | + |
| 117 | +4. **`src/futex.rs:290-344` `wake`.** |
| 118 | + `let mut woken: u32 = 0; ... match thread { Some(t) => { woken = 1 } None => {} }` |
| 119 | + — classic counter default + match-arm increment. |
| 120 | + |
| 121 | +5. **`src/ipi.rs:109-141` `compute_ipi_mask`.** |
| 122 | + `let mut mask: u32 = 0u32; ... if cond { mask = mask | bit; }` — bitmap |
| 123 | + build via OR-into-default. |
| 124 | + |
| 125 | +**Why it matters.** This is *the* pattern LOOM already targets; gale gives |
| 126 | +LOOM a clean, dense, real-world benchmark suite. Every example above has |
| 127 | +a Verus `ensures` clause that pins the final value as a function of inputs |
| 128 | +— LOOM can use those clauses to prove the override is unconditional in |
| 129 | +specific call sites and collapse the assignment. |
| 130 | + |
| 131 | +--- |
| 132 | + |
| 133 | +## 3. Verus-verified bounded loops (full-unroll / k-induction targets) |
| 134 | + |
| 135 | +24 `decreases` clauses across the kernel. All ranking functions are |
| 136 | +`MAX_CONST - i` or `len - i` with `MAX` as a `pub const u32`. |
| 137 | + |
| 138 | +**Examples.** |
| 139 | + |
| 140 | +| file:line | bound | iterations | |
| 141 | +|---|---|---| |
| 142 | +| `src/wait_queue.rs:206`, `319`, `359`, `437` | `MAX_WAITERS = 64` | ≤ 64 | |
| 143 | +| `src/sched.rs:197` | `MAX_RUNQ_SIZE = 64` | ≤ 64 | |
| 144 | +| `src/poll.rs:558,577,598` | `MAX_POLL_EVENTS` | ≤ 32 (typical) | |
| 145 | +| `src/mem_domain.rs:202,306,387,460,478` | `MAX_PARTITIONS` | ≤ 16 | |
| 146 | +| `src/device_init.rs:329` | `dev.num_deps - i` | ≤ small | |
| 147 | +| `src/ipi.rs:121` | `num_cpus - idx`, `MAX_CPUS = 16` | ≤ 16 | |
| 148 | +| `src/futex.rs:329` | `count - i`, `count ≤ MAX_WAITERS = 64` | ≤ 64 | |
| 149 | + |
| 150 | +**Why it matters for LOOM.** |
| 151 | +- *Full unroll candidates.* Every loop with `decreases MAX_CPUS - i` and |
| 152 | + `MAX_CPUS = 16` is a guaranteed-bounded, fixed-trip loop. LOOM can fully |
| 153 | + unroll without runtime check (the `requires` clause already proved the |
| 154 | + bound) — converting a loop body into a sequence of 16 SIMD-friendly ops. |
| 155 | +- *k-induction-friendly.* Loops carrying invariants like |
| 156 | + `forall|k: int| 0 <= k < i ==> entries[k].is_some()` (e.g. |
| 157 | + `src/wait_queue.rs:194-200`) are perfect input to bounded SMT-based |
| 158 | + loop verification. The Verus invariants *are* the k-induction |
| 159 | + hypothesis. |
| 160 | +- *Auto-vectorization.* `runq_shift_left` / `runq_shift_and_insert` |
| 161 | + (`src/sched.rs:80-105`, marked `#[verifier::external_body]`) are pure |
| 162 | + array shifts on `[Option<Thread>; 64]` — LOOM can replace with |
| 163 | + `memmove` / SIMD shuffle once it ingests the bound. |
| 164 | + |
| 165 | +**Expected payoff.** Big — 2-5× on the wait-queue / runqueue ops if |
| 166 | +unrolled + vectorized. These are the absolute hottest paths in any |
| 167 | +RTOS kernel. |
| 168 | + |
| 169 | +--- |
| 170 | + |
| 171 | +## 4. Tail-call / dispatch-only `match` |
| 172 | + |
| 173 | +**Pattern.** `match enum { V1 => f(), V2 => g(), V3 => h(), ... }` where |
| 174 | +every arm is a single function call with the same signature. LLVM should |
| 175 | +emit jump-threaded tail calls but often fails when arms construct |
| 176 | +different `Ok(...)` variants. |
| 177 | + |
| 178 | +**Examples.** |
| 179 | + |
| 180 | +- `src/sched.rs:649-657`, `669-677`, `721-729`, `740-748`, `759-767`, |
| 181 | + `779-787` — every `sched_*` lifecycle function has the form |
| 182 | + `match state { S1 => Ok(T1), _ => Err(EINVAL) }` where each arm is a |
| 183 | + single-expression constructor. Six functions, six identical shapes — |
| 184 | + the classic case for *cross-function rule synthesis* (LOOM ingests one |
| 185 | + rule, applies to all six). |
| 186 | + |
| 187 | +- `src/event.rs:296-336` `wait_decide` — three-way dispatch |
| 188 | + (`Matched` / `Pend` / `Timeout`), each arm builds a `WaitDecideResult` |
| 189 | + struct. The struct constructions share the `matched_events: 0` tail — |
| 190 | + LOOM's struct-field constant-folding should collapse two of three arms. |
| 191 | + |
| 192 | +- `src/sched.rs:318-321`, `421-424`, `510-513` — three sites with the |
| 193 | + *exact same* idiom `match opt { Some(t) => t, None => fallback }`. |
| 194 | + Identical AST → ideal for caching one optimization recipe. |
| 195 | + |
| 196 | +**Payoff.** Smaller per-site (a few cycles) but accumulates because these |
| 197 | +helpers are called from every kernel object operation. |
| 198 | + |
| 199 | +--- |
| 200 | + |
| 201 | +## 5. Cross-function leaf-inlining + constant propagation |
| 202 | + |
| 203 | +**Tiny leaf functions called from one site.** |
| 204 | + |
| 205 | +- `src/sched.rs:271-278` `prio_cmp(a,b) -> i64` — three-line wrapper around |
| 206 | + `i64` subtraction, called only from `next_up_smp` |
| 207 | + (`src/sched.rs:430`) where one operand is `current.priority`. Inline + |
| 208 | + fold = direct `priority` field comparison. |
| 209 | + |
| 210 | +- `src/sched.rs:285-300` `should_preempt` — pure function of three bools, |
| 211 | + called twice in `next_up_smp` and once in `update_cache`. Three call |
| 212 | + sites, each with at least one boolean known at the call site (e.g. |
| 213 | + `chosen.is_metairq` is often statically `false` for non-MetaIRQ paths). |
| 214 | + Specialize → branch elimination. |
| 215 | + |
| 216 | +- `src/sched.rs:483-494` `update_metairq_preempt` — called once. Inline. |
| 217 | + |
| 218 | +- `src/priority.rs:62-70` `is_higher_than` — `self.value < other.value` |
| 219 | + wrapped in a method. Called from many sites; the `requires |
| 220 | + self.inv()` clause guarantees the field read is in-bounds — LOOM can |
| 221 | + drop it. |
| 222 | + |
| 223 | +- `src/atomic.rs:271-279`/`282-290` `inc`/`dec` — wrappers around |
| 224 | + `add(1)`/`sub(1)`. Two-line shims; *every* increment in the kernel goes |
| 225 | + through them. |
| 226 | + |
| 227 | +**Constant-argument propagation.** |
| 228 | + |
| 229 | +- `compute_ipi_mask` (`src/ipi.rs:88-142`) is called with |
| 230 | + `MAX_CPUS = 16` in production. The loop bound `num_cpus <= max_cpus <= |
| 231 | + MAX_CPUS <= 32` means the inner `1u32 << idx` shift is always in range — |
| 232 | + LOOM can specialize the function for `max_cpus = 16` and unroll. |
| 233 | + |
| 234 | +**Payoff.** Each individually small, cumulatively large because these |
| 235 | +helpers sit on every fast path. |
| 236 | + |
| 237 | +--- |
| 238 | + |
| 239 | +## 6. Bit-manipulation / mask arithmetic |
| 240 | + |
| 241 | +**Pattern.** Sequences of `&`/`|`/`!`/`<<` that are provably equivalent to |
| 242 | +simpler forms once their precondition is known. |
| 243 | + |
| 244 | +**Examples.** |
| 245 | + |
| 246 | +- `src/cpu_mask.rs:66-68` spec: `(current | enable) & !disable`. The |
| 247 | + module proves a power-of-two property |
| 248 | + `(m & (m-1)) == 0` (line 83) — LOOM can use this as an |
| 249 | + axiom whenever a mask flows from `cpu_pin_compute`. |
| 250 | + |
| 251 | +- `src/event.rs:144-152` `clear`: `events &= !clear_events`. Combined |
| 252 | + with `post`'s `events |= new_events` (line 82-ish) the module gives |
| 253 | + *complete* monotonicity / idempotence proofs (`events.rs:208-258`). |
| 254 | + Every Verus `by (bit_vector)` proof in this file is an axiom LOOM can |
| 255 | + ingest: |
| 256 | + - `(events | new_events) & events == events` (post-monotonic) |
| 257 | + - `value & !value == 0u32` (set/clear roundtrip) |
| 258 | + - `(events | new_events) | new_events == events | new_events` (idempotent) |
| 259 | + |
| 260 | +- `src/cpu_mask.rs:81-86` `validate_pin_mask`: `mask != 0 && (mask & |
| 261 | + (mask - 1)) == 0` — power-of-two test, hardware popcount candidate. |
| 262 | + |
| 263 | +- `src/ipi.rs:127-134` mask construction inside an unrolled loop — |
| 264 | + `mask |= bit; bit = 1 << idx;` is exactly the pattern that compiles to |
| 265 | + PEXT/PDEP on x86 and `bit-build` on ARMv8.2+. |
| 266 | + |
| 267 | +**Payoff.** ISA-specific (popcount, PDEP, single-cycle bit-set) — but |
| 268 | +LOOM's strategic goal of *ingesting verified bit-vector axioms as |
| 269 | +optimization rewrite rules* lines up perfectly. Each `assert(...) by |
| 270 | +(bit_vector)` in gale is, modulo translation, an ISLE/eqsat rewrite |
| 271 | +rule. |
| 272 | + |
| 273 | +--- |
| 274 | + |
| 275 | +## 7. State-machine `match (state, event)` (LOOM ISLE rule-synth target) |
| 276 | + |
| 277 | +The premier example in the codebase: |
| 278 | + |
| 279 | +- `src/sched.rs:592-630` `sched_is_valid_transition(from, to)` — pure |
| 280 | + `match (from, to) { ... }` over `(SchedThreadState, SchedThreadState)`, |
| 281 | + 7 states each. The `ensures` clause `from === Dead ==> !result` (line |
| 282 | + 590) is a *trusted axiom* LOOM can use: any caller path that reaches |
| 283 | + this with statically-known `from = Dead` simplifies to `false` and |
| 284 | + prunes the entire downstream branch. |
| 285 | + |
| 286 | +- `src/sched.rs:530-543` `is_valid_transition` — same shape, smaller FSM |
| 287 | + (4 states). |
| 288 | + |
| 289 | +- `src/event.rs:289-336` `wait_decide(events, desired, wait_type, |
| 290 | + is_no_wait)` — effectively `match (wait_type, condition_met, |
| 291 | + is_no_wait)` collapsed into nested `if`s. The `ensures` clauses |
| 292 | + (lines 304-310) prove `matched_events == 0` on the non-Matched paths — |
| 293 | + LOOM can hoist that constant out. |
| 294 | + |
| 295 | +**Why it matters.** LOOM's Wave 2 ISLE rule verification work was about |
| 296 | +precisely this kind of `match (s, e)` table. Gale is the test bed. |
| 297 | + |
| 298 | +--- |
| 299 | + |
| 300 | +## 8. Verus annotations as ingestable axioms |
| 301 | + |
| 302 | +`requires` (627), `ensures` (980), and `decreases` (24) clauses are |
| 303 | +machine-checked by Verus + Z3. LOOM can ingest them as *trusted* axioms |
| 304 | +without re-proving anything. |
| 305 | + |
| 306 | +**Highest-value annotations (sample).** |
| 307 | + |
| 308 | +| file:line | clause | what LOOM gains | |
| 309 | +|---|---|---| |
| 310 | +| `src/sched.rs:312-316` | `runq_best.is_none() ==> result === Idle` etc. | branch elimination on `next_up` calls | |
| 311 | +| `src/sched.rs:589-590` | `from === Dead ==> !result` | dead-code on FSM paths | |
| 312 | +| `src/sem.rs:265-273` | `wait_q.len_spec() == 0 && count < limit ==> count' == count + 1` | scalar replacement for the `count+1` path | |
| 313 | +| `src/timeout.rs:316-339` | full `Ok((new_tick, fired))` pre/post | timeout-fast-path inlining | |
| 314 | +| `src/cpu_mask.rs:111-122` | `error == OK ==> mask == ((current\|enable) & !disable)` | mask-arith folding | |
| 315 | +| `src/event.rs:213,220,242,250` | bit-vector lemmas (`by (bit_vector)`) | direct rewrite-rule ingestion | |
| 316 | +| `src/futex.rs:280-288` | `result.woken <= MAX_WAITERS` | bound propagation into callers | |
| 317 | +| `src/atomic.rs:222-231` | CAS success/failure semantics | lock-free fast-path specialization | |
| 318 | + |
| 319 | +This is the largest, most concentrated source of verified pre/post- |
| 320 | +conditions I'm aware of in the LOOM benchmark space — strictly more |
| 321 | +machine-readable than C-level Frama-C ACSL in upstream Zephyr. |
| 322 | + |
| 323 | +**Payoff.** Strategic, not numeric: gale lets LOOM measure what |
| 324 | +fraction of axioms actually trigger optimizations, and how often |
| 325 | +LLVM-only baselines miss the same opportunities. This is the |
| 326 | +primary research lever. |
| 327 | + |
| 328 | +--- |
| 329 | + |
| 330 | +## Summary table |
| 331 | + |
| 332 | +| # | Pattern | Sites (approx) | Payoff | |
| 333 | +|---|---|---|---| |
| 334 | +| 1 | Closed-set FSM `match` (br_table → const-fold) | ~10 in sched.rs alone | high | |
| 335 | +| 2 | Default-then-override (LOOM v0.4/v0.5 hoist) | ~7 hot sites | high (already targeted) | |
| 336 | +| 3 | Verus-bounded loops (full-unroll + SIMD) | 24 `decreases` | very high on wait/run queues | |
| 337 | +| 4 | Tail-call dispatch `match` | ~12 | medium | |
| 338 | +| 5 | Leaf-inline + const-prop | ~8 leaf fns | medium-high cumulative | |
| 339 | +| 6 | Bit-mask axiom ingestion | events/cpu_mask/ipi | ISA-dependent, strategic | |
| 340 | +| 7 | `match (state, event)` FSM | 2 large + 1 medium | high (LOOM ISLE target) | |
| 341 | +| 8 | Verus pre/post as trusted axioms | 1607 clauses | strategic — primary lever | |
| 342 | + |
| 343 | +## Recommended next steps for LOOM |
| 344 | + |
| 345 | +1. **Pick `sched.rs` lines 530-787 as the canary benchmark.** Six near- |
| 346 | + identical `match`-on-enum FSM functions, all with `ensures` clauses, |
| 347 | + all called from FFI (`pub`). It's the densest source of patterns 1, 7, |
| 348 | + and 8 simultaneously. |
| 349 | +2. **Wire `decreases` clauses into LOOM's loop-unrolling cost model.** |
| 350 | + Twenty-four loops with statically-known bounds tied to `pub const`s — |
| 351 | + minimal effort, large payoff. |
| 352 | +3. **Treat `assert(...) by (bit_vector)` as ground truth** for the |
| 353 | + bit-mask rewrite-rule database (pattern 6). The Verus bit-vector solver |
| 354 | + has already discharged them. |
| 355 | +4. **Use gale's compile output as the regression suite for hoist guards |
| 356 | + (pattern 2)** — `next_up_smp` alone exercises three layered defaults. |
| 357 | + |
| 358 | +--- |
| 359 | + |
| 360 | +*Generated 2026-04-29 from gale source tree at |
| 361 | +`/Users/r/git/pulseengine/z/gale/`. Read-only inspection; no files |
| 362 | +modified.* |
0 commit comments