Skip to content

Commit af438fc

Browse files
avrabeclaude
andauthored
v1.1.15 — inliner: acyclic CF (#226) + multi-site small leaves & whole-function DCE (#228) (#229)
* test(219): add seam-SROA grounding fixture (pack/unpack round-trip) Faithful local repro of gale #219's C<->Rust decide seam: `decide` returns {action, new_count} packed into an i64; the shim unpacks them. After `--passes inline` the `run` body contains the pack (extend_i32_u/shl/or) → i64 carrier local → unpack (and/shr_u/wrap_i64) round-trip on one value — exactly the residue to dissolve. Verified: optimizes + validates today; inline leaves the round-trip intact (confirmed via `loom optimize --passes inline` + wasm-tools print). The seam-SROA pass (next) must forward the scalar components and drop the carrier, yielding `(action & 0xff) + new_count` — each rewrite Z3-proven. Refs #219 * test(219): add gale's verified real repro (sem.loom.wasm) + notes gale attached the authentic post-loom-inline dissolved z_impl_k_sem_give on the issue. Decoded + verified: sha256 f81da42d… (matches), 5254 B, valid wasm. It contains the u64 pack/unpack round-trip to dissolve — pack (extend_i32_u/shl/or) at body lines 59-63, unpack (and/shr_u/wrap_i64) at 90/104, with a dead i64 carrier. Grounds the seam-SROA pass against the real body, not just the synthetic fixture. Design confirmed on #219 (proof-carrying: Z3 obligation per ISLE rewrite; ISLE-term route preferred). Implementation (mem2reg + pack/unpack forwarding + DCE) follows. Refs #219 * feat(219): ISLE term scaffolding for seam-SROA (i32.wrap_i64 / i64.extend_i32_u) Lays the ISLE-layer foundation for the #219 pack/unpack forwarding rules: - types.isle: I32WrapI64 / I64ExtendI32U ValueData variants. - constructors.isle: decl + extern constructor + extern extractor for each. - loom-shared/src/lib.rs: i32_wrap_i64_extract / i64_extend_i32_u_extract (mirror iadd32_extract, single-field). - build.rs: register seam_sroa.isle AND add the two extractors to the fix_extractor_ownership list (cranelift-isle 0.132.1 emits extractor args as owned `Value`; the post-processor rewrites them to `&Value` — without this the generated trie move-then-borrows arg0 and loom-shared won't compile). - seam_sroa.isle: the unconditional, Z3-trivial rule `wrap_i64(extend_i32_u(x)) → x`. loom-shared builds clean. IMPORTANT (next step): the generated ISLE `simplify` is NOT in loom's live rewrite path — build.rs documents this, and the active rewriters are the Rust `rewrite_pure_impl` / `rewrite_with_dataflow` (loom-shared/src/lib.rs), validated by the Z3 translation validator per pass. So this rule does not fire yet; the LIVE seam-SROA rewrite must be added as a `rewrite_pure_impl` match arm (ValueData::I32WrapI64{val} where val is I64ExtendI32U{inner} → inner), with the Z3 verifier discharging the obligation. ISLE terms retained as proof-layer documentation of the identity. Refs #219 * feat(219): live seam-SROA rewrite — wrap_i64(extend_i32_u(x)) → x The actual rewrite (the ISLE rule committed earlier is in loom's dead simplify path; the live rewriters are the Rust rewrite_pure_impl/rewrite_with_dataflow, validated by the Z3 translation validator per pass). Adds rewrite_pure_impl match arms for the two #219 conversion terms: - I32WrapI64: recurse operand; if it is I64ExtendI32U(inner) → inner (unconditionally sound: extract(31,0, zero_ext(32,x)) = x — the u64-ABI round-trip a packed scalar pays at the dissolved decide seam); also folds wrap_i64(i64.const N) → i32.const (low 32). - I64ExtendI32U: recurse operand; folds extend_i32_u(i32.const N) → i64.const (zero-extended). Test: test_seam_sroa_wrap_extend_identity — wrap_i64(extend_i32_u(local.get 0)) dissolves to local.get 0. Full `cargo test --release --lib --test-threads=4` green, 0 failures. This is the foundation case. The sem repro's full dissolution additionally needs the carrier mem2reg + the masked and/shr_u forwarding rules (steps 2-3), since its unpack uses and/shr_u (not wrap) on the packed u64. Refs #219 * feat(219): step-3 masked pack/unpack SROA rules (the &mask and >>k halves) Dissolves the u64-pack unpack arithmetic, in rewrite_pure_impl (Z3-validated per pass; pure-integer so the verifier actually verifies — no skip). Each rule is unconditionally sound with only constant side-conditions (no range analysis): I64And (the `& mask` unpack): - (shl Z k) & M → 0 when M's set bits are all below k (M>>k==0, 0<k<64) - (or A B) & M → survivor & M when one OR operand is such a cleared shift I64ShrU (the `>> k` unpack): - (shr_u (shl Z k) k) → Z & (u64::MAX>>k) (shl-then-shr-same masks low 64-k) - (shr_u (or P Q) k) → (P>>k)|(Q>>k) when an operand is (shl _ k) (logical shift distributes over OR; targeted to the pack shape — no general bloat; recursion collapses the shl side via the rule above) Together these reduce a forwarded pack `(or (shl (extend_u hi) 32) lo)`: & 0xff → lo & 0xff (low field) >> 32 → (extend_u hi) & 0xffffffff (high field; lo const shifts out) Tests: test_seam_sroa_mask_clears_shifted_pack, test_seam_sroa_shr_extracts_ high_field (+ the step-1 wrap/extend test). Full `cargo test --release --lib --test-threads=4` green (399/28/17, 0 failures). These fire once inlining delivers pack/unpack adjacency (the inline-modelability extension for control-flow + trap-call decides is the remaining prerequisite, tracked separately — higher blast radius, needs gale silicon validation). Refs #219 * docs(219): PR-C design — precise acyclic CF symbolic execution (verifier) Design for the #219 inline prerequisite: wire in precise br_table/br_if/br symbolic execution (the deferred ExecutionState/merge_states scaffolding) so the verifier can soundly prove inline-equivalence of br_table-dispatching decides. Approach: state-merging-with-ITE via a label/continuation stack (NOT path forking), exact for the acyclic single-exit case. Grounded in Alive2 (PLDI'21, the production SMT bounded-TV precedent), Kuznetsov state-merging (PLDI'12), WASP (ECOOP'22 wasm br_table rules), Boogie block-predicate WP. Includes the soundness argument, the explicit pitfalls (selector constraint, fall-through join, ⊥-not- havoc for trap arms, branch arity, over-approx-and-skip), 2-phase staging (sem_give integer first, then memory for mutex/pipe), and the adversarial verify-or-revert + silicon test gate. For gale review before implementation. Refs #219 * feat(219): PR-C M2 — is_noreturn_callee classifier (divergent-call detection) First component of the PR-C precise-CF inline-verifier extension (design approved by gale on #219, 2026-06-17). `is_noreturn_callee(func)`: conservative-sound — true iff the body has no Return and no Br/BrIf/BrTable anywhere (recursing) and ends in Unreachable. Admits exactly the core::panicking helpers (panic_fmt → unreachable; panic_const_add_overflow → call panic_fmt; unreachable) and rejects any normal-returning or branching function. The by-body inline modeler (PR-C M3, next) will use this to encode a `call` to such a callee as ⊥ "diverge if reached" — never havoc — so a trapping branch constrains only reachability, never the post-state (Alive2 `ub` discipline; avoids the #155/#159 havoc-vs-concrete false-positive surface). Test: test_is_noreturn_callee (bare trap / trap-after-call / normal-return / no-trailing-trap / branch-escape). #[allow(dead_code)] until the M3 gate wires it. Refs #219 * feat(219): PR-C M1a — precise acyclic CF executor (isolated, Z3-tested) The soundness-critical core of the inline-modelability extension: a precise symbolic executor for structured, acyclic wasm control flow (block / if / br / br_if / br_table) by state-merging-with-ITE over a label/join stack — never path-forking. For a bounded acyclic region this is EXACT (a passing SMT check is a real proof). Precedent: Alive2 (PLDI'21), Kuznetsov state-merging (PLDI'12), WASP br_table rules (ECOOP'22). See docs/design/pr-c-precise-cf-verification.md. Built as a NEW ISOLATED function, not by mutating the shared encoders (encode_block_body / encode_function_to_smt_impl_inner stay untouched -> zero regression risk to existing passes). encode_simple_instruction's catch-all HAVOCS (pushes a fresh BV without popping operands), so delegating an unmodeled op would desync the stack; instead an allowlist gate (is_acyclic_modelable_body) front-loads soundness and the executor runs only on admitted functions. Over-approximate-and-skip everything else: loops (back-edge), calls, memory, unknown opcodes, multi-result/param blocks. br_table default guard is UNSIGNED bvuge(sel, n) — a signed >= would leave selectors in [2^31, 2^32) in a partition gap. All #[cfg(feature="verification")] + #[allow(dead_code)] until M3 wires it into inline_functions' verify-or-revert. Tests (5, green): br_table switch proves equivalent to the if/else chain; identity; detects wrong-default and swapped-arm counterexamples; gate rejects loops/memory, admits the switch. Also folds in cargo-fmt drift in loom-shared from #219 SROA @4356441. Refs #219 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(219): PR-C M1b — Call by-body modeling + no-return diverge in the acyclic executor Extends exec_acyclic to model a `call F` BY BODY (the inline-soundness principle from #155 extended to control flow): pop args, recursively encode the callee through exec_acyclic with params bound to the args, declared locals zero-init, globals SHARED with the caller (callee global writes propagate back); push the merged single result and narrow the caller path to the callee's returning paths. A call to a no-return callee (is_noreturn_callee: panic -> unreachable) sets reachable=false — the path DIVERGES (drops out of the result ITE), NEVER havoc (Alive2 ub discipline; havoc reopens the #155/#159 false-positive surface). Threads VerificationSignatureContext (function_bodies, the existing #151 callee lookup) + a recursion depth into exec_acyclic / encode_acyclic_function / is_acyclic_modelable_body. MAX_ACYCLIC_CALL_DEPTH=8 bounds by-body inlining and makes a cyclic/recursive call graph fail safely (reject, never infinite-unroll). The gate now admits Call(g) iff g is a local function that is no-return or recursively acyclic-modelable; imports/call_indirect/over-deep stay rejected. exec_acyclic remains the soundness enforcer (Err, never havoc, on anything unmodelable). Still #[cfg(verification)] + #[allow(dead_code)] until M3 wires it into inline_functions' verify-or-revert. Tests (7 total, green): + call-by-body of a br_table callee proves equivalent to its inlined switch (the seam capability); + no-return call diverges not havoc. Refs #219 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(219): PR-C M3.1 — verify_acyclic_equivalence precise fast-path Adds the equivalence harness over the precise acyclic CF executor: encodes both functions via encode_acyclic_function (deterministic input names unify in Z3) and checks a == b has no counterexample. Mirrors verify_function_equivalence's soundness boundaries (no width-matching the final equality; Unknown -> Err). Returns Err whenever either side is not acyclic-modelable, a width mismatches, or the encoding fails — so the caller (inline_functions' verify-or-revert, wired in M3.2) FALLS BACK to the existing encoder. This only ADDS proving power for the br_table-callee inline case; it never removes a check. Void functions defer to the fallback (no side-effect equivalence modeled yet). Still #[allow(dead_code)] until M3.2 wires it into inline_functions. Tests (9 total, green): + test_verify_acyclic_equivalence_inline — a caller that calls a br_table decide verifies equivalent to the inlined switch, a wrong-default body does NOT verify, and a loop function Errs (caller falls back). Refs #219 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(219): PR-C M3.2 — wire the precise acyclic fast-path into the inliner TranslationValidator::verify() now tries verify_acyclic_equivalence for the inline pass before the existing encoder. The main encoder models br_table approximately and a br_table callee is not straight-line-by-body-modelable, so it cannot prove `call F` (br_table callee) equal to its inlined body → the inline reverts. The acyclic executor is EXACT for acyclic CF + by-body calls, so it can. PURELY ADDITIVE + scoped to pass_name == "inline_functions": ONLY a precise Ok(true) short-circuits to accept; Ok(false)/Err/panic FALL THROUGH to the existing verify_function_equivalence_with_context unchanged (acyclic call wrapped in its own catch_unwind). This can only ADD accepts (the br_table-callee inlines), never weaken an existing check; every non-inline pass is byte-identical. This also makes the M1a/M1b/M3.1 executor chain live (the #[allow(dead_code)] markers are now harmless no-ops). No br_table callee is admitted as an inline CANDIDATE yet (function_inline_modelable is still straight-line — M3.3), so this is a behavioral no-op on the current corpus: full `cargo test --release --lib` is 408 passed / 0 failed / 2 ignored, unchanged. clippy + fmt clean. The capability is proven by the 9 acyclic unit tests; M3.3 enables it end-to-end on the seam. Refs #219 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(219): PR-C M3.3 — admit acyclic-CF callees as inline candidates + Unreachable trap function_inline_modelable now admits a callee under EITHER regime A (straight-line incl. memory — the existing by-body modeler, #155/#157/#161) OR regime B (acyclic control flow + pure-integer ops + by-body calls to no-return/recursively-modelable local callees — the precise acyclic executor, PR-C). Mirrors verify::is_acyclic_modelable_body so the candidate gate admits exactly what the acyclic verifier can prove (admitting an unprovable callee → caller reverts, the #159 collateral problem). Threads module context (functions + num_imported_funcs) into the gate + a recursion-depth bound; a br_table+memory callee passes neither regime → stays opaque until Phase 2. Also handles Unreachable as a trap (⊥) in exec_acyclic and both gates: the path diverges (reachable=false, drops out of the result ITE), never havoc — the seam decides end their overflow arms in `call panic_const; unreachable`, so this is required to model them at all. Full `cargo test --release --lib` = 409 passed / 0 failed / 2 ignored (+1: test_acyclic_unreachable_diverges). fmt + clippy clean. SCOPE NOTE (found by inspecting the real repro-219/sem.loom.wasm): decide (gale_k_sem_give_decide) IS regime-B-eligible (acyclic br_table + i64 pack + no-return panic call, no memory), so it is now an inline candidate. BUT its caller z_impl_k_sem_give READS MEMORY (i32.load offset=8/12 for decide's args) and calls imports, so proving decide's inline INTO z_impl needs the acyclic executor to model z_impl's memory = PHASE 2. So sem_give's seam will not dissolve until Phase 2 (memory Array threaded through exec_acyclic). M3.1-M3.3 are correct + necessary infra; Phase 2 is the next real enabler for sem_give. Refs #219 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(219): PR-C Phase 2a — thread memory Array through the acyclic executor The acyclic executor now models linear memory (loads/stores), required because real seam callers (z_impl_k_sem_give) read memory for the decide's args. AcyclicState gains `memory: Array`. Memory is MUTABLE branch state, so it is ITE-merged at every join exactly like locals/globals: BranchEntry/MergedState carry the Array, acyclic_merge_entries ITE-merges it (Bool::ite over the Array sort), acyclic_push_branch records it, acyclic_apply_join sets it, and the If arm saves/restores it for the else branch. exec_acyclic handles I32/I64 load/store + partial-width load8/16/store8/16 EXPLICITLY via the shared encode_i32_load_from / encode_i64_load_from / encode_*_store_into / encode_partial_* helpers (the same ones the main encoder + by-body modeler use -> bit-identical memory expressions); they are NEVER delegated to encode_simple_instruction (which havocs). memory is threaded through encode_acyclic_call (shared + propagated like globals) and initialized in encode_acyclic_function as Array::new_const("memory", ...) -- same name as the main encoder so both encodings unify. Gates: new is_acyclic_memory_op (verify.rs) admitted by is_acyclic_modelable_body; lib-side is_acyclic_memory_modelable_instr admitted by the regime-B inline gate. Call gate stays STRICT (no impure-call havoc yet -- that is Phase 2b, the soundness-critical part). So sem_give's z_impl (which also calls k_spin_lock import + z_unpend) is not yet fully modelable; 2a is the contained, testable foundation. Full `cargo test --release --lib` = 410 passed / 0 failed / 2 ignored. New test test_acyclic_memory_merge: store-on-each-branch then load proves equivalent to the direct if/else (validates the per-join memory ITE-merge feeds the loaded value). test_acyclic_modelable_gate updated (memory now admitted). fmt + clippy clean. Refs #219 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(219): PR-C Phase 2b — impure-call havoc in the acyclic executor The acyclic executor now models impure calls (imports + non-acyclic-modelable local callees) by HAVOC, so memory-bearing seam callers like z_impl_k_sem_give (which call k_spin_lock + z_unpend) become verifier-modelable end-to-end. exec_acyclic's Call arm is now 3-tier (mirrors the main encoder): no-return -> diverge; acyclic single-result local -> by-body (encode_acyclic_call); else (import / multi-result / too-deep / loop-containing) -> acyclic_havoc_call: pop args, push fresh `call_{g}_result_{i}`, havoc all globals + memory with a SHARED per-encode counter. The counter is threaded (&mut u64) through exec_acyclic + encode_acyclic_call and reset per encode_acyclic_function -- so the Nth impure call gets the SAME havoc id in the original and optimized encodings -> they unify in Z3. Havoc is conservative-sound: it never forges a value equal to a concrete one, so a havoc-vs-concrete mismatch refuses to prove (the #155/#159 guard). is_acyclic_modelable_body is now purely STRUCTURAL (drops ctx/depth params -- any direct call is modelable since havoc covers the opaque case; clippy only_used_in_recursion). The lib candidate gate is unchanged (decide already admitted via its no-return panic call). Full `cargo test --release --lib` = 413 passed / 0 failed / 2 ignored. New tests: test_acyclic_havoc_call_unifies (havoc id parity); test_acyclic_havoc_not_concrete (havoc'd result NOT provably equal to a concrete value -- soundness); test_acyclic_havoc_memory (load after impure call reads havoc'd memory, not the initial -- soundness). fmt + clippy clean. Next: e2e dissolve on repro-219/sem.loom.wasm (decide's inline into z_impl should now be PROVEN by the acyclic path -> no `call $gale_k_sem_give_decide`). Refs #219 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(228): inline modelable multi-site small leaves + expose whole-function DCE The inline_functions candidate predicate `(call_count==1 || size<10) && size<limit` made MULTI_CALL_SITE_LIMIT dead: a multi-call-site callee only passed when size<10, so nothing in [10,50) ever inlined (gale's 23-insn gust `mix` seam). Govern purely by the site-count-dependent budget (`size < limit`); the per-inline Z3 verify-or-revert stays the correctness backstop. Companion: multi-site inlining duplicates the body and orphans the original, but `loom optimize` never ran whole-function DCE (its `dce` = eliminate_dead_code is intra-function only). Expose eliminate_dead_functions as a CLI `dce-functions` pass after `inline` (default-on). It keeps exports, the start function, and element-table (indirect-call) targets live, conservative on parse failure (#196-safe). Test: test_inline_multisite_small_leaf_228 (11-insn 2-site leaf inlines at both sites). e2e: --passes inline,dce-functions removes the orphan + validates. 414 lib tests green. Refs #228 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * chore(release): v1.1.15 — inliner: acyclic CF (#226) + multi-site leaves & whole-function DCE (#228) Found by gale's gust codegen bench. Both fixes are Z3-gated (verify-or-revert). - #226: precise acyclic-CF symbolic executor (PR-C) + Regime B candidate gate — inlines acyclic-CF callees with by-body / divergent (no-return) calls (gale's mix: bounds-check if + panic + load). Was straight-line-only before. - #228: candidate predicate governed by the site-count budget (size < limit) so multi-site leaves in [10,50) inline; new default-on CLI dce-functions pass (#196-safe whole-function DCE) removes the orphans multi-site inlining creates. The #219 seam-teardown (carrier forwarding) is intentionally NOT here — frozen on feat/219-seam-sroa until the G474RE silicon dual-flash confirms it. Refs #226 #228 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent fe5dc4e commit af438fc

14 files changed

Lines changed: 2310 additions & 106 deletions

File tree

CHANGELOG.md

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,38 @@ All notable changes to LOOM will be documented in this file.
55
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
66
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
77

8+
## [1.1.15] - 2026-06-22
9+
10+
Inliner release: the dissolve pipeline now inlines two real seam shapes it
11+
previously left as opaque `call`s, both gated by the Z3 translation validator
12+
(verify-or-revert). Found by gale's `gust` codegen bench. The #219 seam-teardown
13+
(carrier forwarding) is intentionally NOT in this release — it stays on its branch
14+
until the G474RE silicon dual-flash confirms it.
15+
16+
### Added
17+
18+
- **Acyclic control-flow inlining (#226).** The inline-modelability gate was
19+
straight-line only, so a `#[inline]` leaf whose body is `if oob { call
20+
panic_bounds_check; unreachable }` + a load + arithmetic (gale's `mix`) stayed
21+
an opaque `call`. A precise acyclic-CF symbolic executor (PR-C) + a matching
22+
candidate gate (Regime B) now admit acyclic CF + by-body calls + divergent
23+
(no-return) calls, all proven by the validator. `mix` now inlines.
24+
- **Whole-function dead-code elimination in `loom optimize` (#228).** New
25+
`dce-functions` CLI pass (default-on, after `inline`) exposes
26+
`eliminate_dead_functions`: multi-site inlining duplicates a body and orphans
27+
the original, which the intra-function `dce` could never remove. Keeps exports,
28+
the start function, and element-table (indirect-call) targets live —
29+
conservative on a parse failure (#196-safe).
30+
31+
### Fixed
32+
33+
- **Small multi-call-site leaves are inlined again (#228).** The candidate
34+
predicate `(call_count == 1 || size < 10) && size < limit` made
35+
`MULTI_CALL_SITE_LIMIT` (50) dead code — a multi-site callee only passed when
36+
`size < 10`, so nothing in `[10, 50)` ever inlined (gale's 23-instruction,
37+
2-site `gust_mix`). Now governed purely by the site-count-dependent budget
38+
(`size < limit`); the per-inline Z3 verify-or-revert stays the correctness gate.
39+
840
## [1.1.14] - 2026-06-15
941

1042
Correctness + release-engineering release. One optimizer correctness fix (#220);

Cargo.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@ members = [
99
]
1010

1111
[workspace.package]
12-
version = "1.1.14"
12+
version = "1.1.15"
1313
authors = ["PulseEngine <https://github.com/pulseengine>"]
1414
edition = "2024"
1515
license = "Apache-2.0"
Lines changed: 168 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,168 @@
1+
# PR-C: Precise acyclic control-flow symbolic execution in the Z3 verifier
2+
3+
Status: **design, for review** (gale #219). Prerequisite for inlining the 5
4+
`br_table`-dispatching decides (sem_give 860, mutex_unlock 472, pipe_write,
5+
pipe_read, msgq_put) and for removing the existing `BrIf`/`BrTable` skips in
6+
LICM/CSE/code_folding/coalesce_locals.
7+
8+
## Problem
9+
10+
loom's live verifier (`verify.rs`) does **not** soundly model multi-way control
11+
flow. In both `encode_function_to_smt_impl_inner` and `encode_block_body`:
12+
13+
- `BrIf(_depth)` pops the condition and **continues** ("a more precise encoding
14+
would fork paths here").
15+
- `BrTable { .. }` pops the index and **`break`s** ("treat as terminating for
16+
now").
17+
- `Br(_depth)` / `Return` `break`; branch **depth is ignored** (no label stack).
18+
19+
So a function whose result depends on a `br_table` dispatch is modeled as if the
20+
dispatch didn't happen. Reusing this to prove an inline equivalent would let Z3
21+
pick any arm for any selector → a **false** equivalence → an unsound inline that
22+
loom's structural CI cannot catch (the #196/#220 failure class). This is why
23+
LICM/CSE/etc. currently **skip** `BrIf`/`BrTable` functions entirely.
24+
25+
The precise machinery already exists as **dead scaffolding**, flagged in-source
26+
as the deferred upgrade: `struct ExecutionState { stack, locals, globals,
27+
path_condition: Bool, reachable }`, `merge_states(cond, t, f)`, and
28+
`BlockResult.branch_depth`. `merge_bv(cond, a, b) = cond.ite(a, b)` is live (used
29+
for `If`). PR-C wires the scaffolding in.
30+
31+
## Chosen approach (with precedent)
32+
33+
**State-merging-with-ITE driven by an explicit label/continuation stack — never
34+
path-forking.** For a bounded, acyclic (loops=0 on the critical path),
35+
single-normal-exit region this is **exact** (a passing SMT check is a real
36+
proof, not a bounded approximation), and it produces **one** verification
37+
condition instead of N per-arm obligations.
38+
39+
Precedent (all verified against primary sources):
40+
41+
- **Alive2** (Lopes, Lee, Hur, Liu, Regehr, *Bounded Translation Validation for
42+
LLVM*, PLDI 2021) — the closest production precedent. "We do **not** fork
43+
expressions across paths in the CFG"; phi/branch merges become "a single SMT
44+
expression per register"; the final state is "a linear chain of `ite`
45+
expressions"; an explicit **`ub`** flag makes ill-defined paths refine rather
46+
than vanish. `switch` is the N-way generalization of its 2-way branch merge.
47+
- **Kuznetsov, Kinder, Bucur, Candea**, *Efficient State Merging in Symbolic
48+
Execution*, PLDI 2012 — the exact merge rule `(ℓ, pc'∨pc'', λv. ite(pc',
49+
s'[v], s''[v]))`; "does not over-approximate, no false positives."
50+
- **WASP** (Marques et al., *Concolic Execution for WebAssembly*, ECOOP 2022) —
51+
the literal wasm rules: `br_table` arm `k≤n``Cont(br j_{k+1})` with path
52+
condition `ŝ = k`; default `k≥n``Cont(br jn)` with `ŝ ≥ n`. The selector
53+
constraint is conjoined onto every arm; the target is resolved by label index.
54+
- **Barnett & Leino**, *Weakest-Precondition of Unstructured Programs*, PASTE
55+
2005 — per-block predicates keep the VC **linear** in paths (a wide `br_table`
56+
doesn't blow up); acyclic ⇒ topological, no recursion.
57+
- **WebAssembly Core Spec** + Haas et al. (PLDI 2017) — wasm is structured with
58+
a label stack; `br N` targets the N-th enclosing label (0 = innermost);
59+
`br_table` takes the default when the operand is out of bounds.
60+
61+
Per-path forking is also sound but is the wrong fit: exponential over a wide
62+
`br_table`, N separate obligations.
63+
64+
## The model
65+
66+
Carry a single symbolic `ExecutionState`:
67+
`{ stack: Vec<BV>, locals: Vec<BV>, globals: Vec<BV>, memory: Array,
68+
path_condition: Bool, trapped: Bool }` (`trapped` = Alive2's `ub`/⊥).
69+
70+
A **label/continuation stack**: each entry records `{ arity, join: JoinAccumulator }`.
71+
`block`/`if`/`loop` push a label; the entry's `join` accumulates the merged state
72+
of every branch that targets it.
73+
74+
- `block tf … end`, `if … end`: push a label whose continuation is the code
75+
**after** `end` (forward join).
76+
- `br N`: resolve the N-th label from the top; **merge** the current state into
77+
that label's `join` under `path_condition` (carry `arity` result values); mark
78+
the current path terminated.
79+
- `br_if N`: split — branch path `pc ∧ (cond≠0)` merges into the target join;
80+
fall-through path `pc ∧ (cond=0)` continues.
81+
- `br_table j0..jn-1 (default jd)` (per WASP): for each arm `i`, merge under
82+
`pc ∧ (sel = i)` into label `ji`'s join; default under `pc ∧ (sel ≥ n)` into
83+
`jd`'s join. Guards **partition** the selector domain (totality from the
84+
default complement).
85+
- At each `end` / the function exit (the arms' post-dominator), the label's
86+
`join` is the **ITE-merge** of all incoming branches: `path_condition` =
87+
disjunction; each value = `ite(guard, …)` (reusing `merge_bv` / `merge_states`).
88+
- `unreachable`, and a `Call` to a no-return callee (`is_noreturn_callee`: no
89+
`Return`, no `Br*` to the function label, every path ends in `Unreachable`),
90+
set `trapped` (⊥) — **never havoc**. They constrain reachability only; on a
91+
trapped path the return value is don't-care and the path drops out of the
92+
result ITE.
93+
94+
**Crucially**: this *one* executor models **both** the original `call decide`
95+
(by-body) and the inlined body, so they yield the **same** merged expression →
96+
equivalence is provable (the #155 by-body principle, extended to CF).
97+
98+
## Soundness — and the pitfalls we explicitly guard
99+
100+
1. **Exact for acyclic** — no fixpoint, no loop invariants; finite single pass;
101+
merge only at structured-block ends + the single exit (post-dominators).
102+
2. **Selector constraint preserved** — every `br_table` arm carries `sel = i`,
103+
default carries `sel ≥ n`; guards partition the domain. (Avoids the
104+
"terminate" false-equivalence.)
105+
3. **Fall-through resolved via the label stack** to the correct join (a `br` to
106+
an outer label skips intervening joins).
107+
4. **⊥, not havoc, for trap/unreachable/no-return arms** — they refine, never
108+
forge a matching value (the #155/#159 + Alive2 `ub` discipline).
109+
5. **Branch arity recorded**`br N` carries the label's result values; the
110+
merged stack shape stays correct.
111+
6. **Over-approximate-and-skip** anything unmodeled (LOOM's "skip rather than
112+
risk"): any instruction/shape outside the precise model → the callee is not
113+
by-body-modelable → opaque fallback (no inline), never a false ≡.
114+
115+
## Scope / staging
116+
117+
- **Loops stay out** — still unrolled (`MAX_LOOP_UNROLL`) / k-induction / skipped
118+
as today. PR-C is acyclic-only. A back-edge ⇒ not by-body-modelable.
119+
- **Phase 1 — integer acyclic CF (no memory):** activate `ExecutionState` +
120+
label stack + `merge_states`; implement `Br`/`BrIf`/`BrTable`/`Block`/`If`
121+
precisely in `encode_block_body`; add `is_noreturn_callee` + diverge-on-trap;
122+
extend `is_inline_modelable_instr`/`callee_inlinable_by_body` to admit acyclic
123+
CF + no-return calls; route the by-body modeler through the precise executor.
124+
**Unblocks sem_give (860)** — pure integer + br_table, no memory.
125+
- **Phase 2 — thread `memory: Array` through the CF executor** (today
126+
`encode_block_body` has no memory param). Unblocks **mutex_unlock (472)** +
127+
**pipe_write/read** (sret stores through the shadow frame).
128+
129+
## Test strategy (the soundness gate)
130+
131+
- **Adversarial verify-or-revert** (the hard gate): deliberately-wrong inlined
132+
bodies MUST produce a Z3 counterexample → revert — wrong arm selected, dropped
133+
selector guard, dropped trap branch, mis-merged join, wrong branch depth.
134+
- **Differential**: correct inline of a `br_table` fixture proves; a near-miss
135+
reverts. Fixtures ordered per gale: msgq_put (trap-only) → sem_give
136+
(`block`+`br_table`+`panic→unreachable`) → mutex_unlock (most blocks, sret).
137+
- **Unit**: `is_noreturn_callee`, the label-stack resolution, `merge_states`.
138+
- **Regression**: full `cargo test` + dogfooding (loom optimizing itself)
139+
unchanged; re-validate the LICM/CSE/etc. paths once their `br_table` skips are
140+
removed (a bonus coverage win, but it widens what they verify).
141+
- **e2e + silicon**: `loom optimize` on `repro-219/sem.loom.wasm` dissolves (no
142+
`call $..._decide`, no i64 pack/unpack in `z_impl_k_sem_give`); then gale's
143+
G474RE re-flash (sem_give 860→, mutex_unlock 472→) is the kill-criterion.
144+
145+
## Risk
146+
147+
Substantial, soundness-critical (touches the core verifier). De-risked by: the
148+
acyclic restriction (the only source of unsoundness/incompleteness — loops — is
149+
excluded), the existing `ExecutionState`/`merge_states` scaffolding, the
150+
Alive2/WASP precedent, and the adversarial-revert + silicon gates. Staged so
151+
Phase 1 (sem_give) lands and validates before Phase 2 (memory). WIP branch, no
152+
merge until adversarial tests pass + gale silicon-validates.
153+
154+
## References
155+
156+
- Lopes, Lee, Hur, Liu, Regehr. *Alive2: Bounded Translation Validation for
157+
LLVM.* PLDI 2021. https://users.cs.utah.edu/~regehr/alive2-pldi21.pdf
158+
- Kuznetsov, Kinder, Bucur, Candea. *Efficient State Merging in Symbolic
159+
Execution.* PLDI 2012. https://dslab.epfl.ch/pubs/stateMerging.pdf
160+
- Avgerinos, Rebert, Cha, Brumley. *Enhancing Symbolic Execution with
161+
Veritesting.* ICSE 2014. https://softsec.kaist.ac.kr/~sangkilc/papers/avgerinos-icse14.pdf
162+
- Marques, Fragoso Santos, Santos, Adão. *Concolic Execution for WebAssembly
163+
(WASP).* ECOOP 2022. https://drops.dagstuhl.de/opus/volltexte/2022/16239/pdf/LIPIcs-ECOOP-2022-11.pdf
164+
- Barnett & Leino. *Weakest-Precondition of Unstructured Programs.* PASTE 2005.
165+
https://www.microsoft.com/en-us/research/wp-content/uploads/2005/01/krml157.pdf
166+
- Haas et al. *Bringing the Web up to Speed with WebAssembly.* PLDI 2017.
167+
- Van Hattum et al. *Lightweight, Modular Verification for Wasm-to-Native
168+
Instruction Selection (Crocus/veri-isle).* ASPLOS 2024.

loom-cli/src/main.rs

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -502,6 +502,27 @@ fn optimize_command(
502502
track_pass("inline", before, after);
503503
}
504504

505+
// loom#228: whole-function (module-level) dead-function elimination.
506+
// Multi-site inlining duplicates a callee's body into each caller and
507+
// leaves the ORIGINAL orphaned; the body-level `dce` (eliminate_dead_code)
508+
// is intra-function and never removes a whole function, so the orphan
509+
// survived every `loom optimize` run and grew the module. Run it right
510+
// after `inline` so the orphan is gone before downstream passes touch it.
511+
// eliminate_dead_functions keeps exports, the start function, and every
512+
// element-segment (indirect-call table) target live — conservative on a
513+
// parse failure — so it can never drop a reachable function (#196).
514+
if should_run("dce-functions") {
515+
println!(" Running: dce-functions");
516+
let before = count_instructions(&module);
517+
let removed = loom_core::fused_optimizer::eliminate_dead_functions(&mut module)
518+
.context("Whole-function DCE failed")?;
519+
if removed > 0 {
520+
println!(" removed {removed} dead function(s)");
521+
}
522+
let after = count_instructions(&module);
523+
track_pass("dce-functions", before, after);
524+
}
525+
505526
if should_run("precompute") {
506527
println!(" Running: precompute");
507528
let before = count_instructions(&module);

0 commit comments

Comments
 (0)