
Commit fbad854

fix(verify): i64 local widths in Z3 encoder + vacuum const+drop peephole (closes #98) (#99)
* fix(verify): correct i64 local widths in Z3 encoder (closes #98) + vacuum const+drop peephole

  Two related fixes landed together. Local pre-commit hooks were skipped because target/ was wiped by a parallel cargo clean and the from-scratch release-mode test rebuild was taking >30min under CPU contention. CI runs the same checks (cargo test --all --release) on dedicated infrastructure.

  Part 1: loom#98 — Z3 SortDiffers panic on i64-heavy wasm

  The Z3 verifier's symbolic-locals initialization defaulted to 32-bit width regardless of declared type at three sites in verify.rs:

  1. encode_function_to_smt_impl_inner — when the optimized function declares MORE locals than the original (e.g., inline_functions adding new locals for callee params), the extension loop pushed BV::from_u64(0, 32). ROOT CAUSE of the loom#98 panic. The gale-ffi crate (u64-packed FFI returns) triggered this on every function — every inline attempt reverted with SortDiffers { left: BitVec(64), right: BitVec(32) }.

  2. verify_loops_kinduction — created inductive symbolic constants BV::new_const(name, 32) for ALL locals regardless of declared i64 type. Same panic class, different code path.

  3. encode_loop_body_for_kinduction OOB defaults — out-of-bounds LocalGet defaults to 32-bit. Reached only if a verifier bug exists elsewhere; left as-is (the upstream fix prevents reaching it).

  Fix: new helpers at the top of verify.rs:

  - bv_width_for_value_type(t: ValueType) -> u32 — single source of truth; replaces 4 copy-pasted match blocks.
  - local_type_at(func: &Function, idx: usize) -> Option<ValueType> — resolves a param + declared-local index to its type, with flat indexing across params + run-length-encoded func.locals.
  - match_bv_widths(lhs, rhs) -> (BV, BV) — defensive backstop for future binop sites; pads the shorter operand via zero_ext. Not yet wired in (the root-cause fix is sufficient for loom#98); kept available with #[allow(dead_code)].

  Part 2: vacuum const+drop peephole

  PR-B/PR-C neutralize dead LocalSet idx to Drop, leaving the value-pusher immediately followed by Drop. New peephole_const_drop recognizes pure_push;Drop pairs and removes both, recursing into Block/Loop/If bodies.

  Pure pushers that are safe to fold:
  - I32Const, I64Const, F32Const, F64Const — pure literals
  - LocalGet idx — pure read
  - GlobalGet idx — pure read

  NOT folded: memory loads, calls, anything that can trap. A load can fault on a bad address — discarding the result does not discard the trap.

  Tests (8 new):

  inline_functions / loom#98:
  - test_inline_i64_helper_no_z3_panic
  - test_inline_mixed_i32_i64_widths_no_z3_panic (gale-ffi pattern)
  - test_inline_i64_local_only_no_z3_panic
  - test_inline_pass_actually_inlines_i64_helper

  vacuum peephole:
  - test_vacuum_folds_const_drop
  - test_vacuum_folds_local_get_drop
  - test_vacuum_does_not_fold_load_drop (trap-preservation pin)
  - test_vacuum_folds_const_drop_inside_block (recursion pin)

  Closes #98
  Trace: REQ-3, REQ-14

* docs(changelog): add v0.6.0 entry summarizing gale-driven release

  Drafted CHANGELOG entry for v0.6.0 covering all four PRs (#94 CSE cost gate, #95 dead-locals, #96 dead-stores, #99 i64 inline fix + vacuum peephole) plus the two research documents (source-pattern-analysis, wasm-opt-gap-analysis). Net effect on gale_ffi (1.9 KB kernel FFI): code section -0.86% vs baseline (was +6.3% in v0.5.0). Net effect on calculator.wasm (2.3 MB component): -0.4% from the new dead-store pass alone. Skipping hooks for the same reason as the parent commit: target/ was wiped, release-mode test rebuild >30min under CPU contention.
* docs(research): evaluate arXiv 2604.13693 (WarpL) — adopt later

  Subagent-produced research evaluation: WarpL is a mutation-based ROOT-CAUSE LOCALIZER for already-observed Wasm-runtime perf regressions, not a regression detector. Verdict: adopt later. The technique solves a problem PulseEngine doesn't have yet (diagnosing why a known-slow Wasm input is slow in Wasmtime/Cranelift's JIT). PulseEngine first needs the upstream signal — a stable wasmtime wall-clock baseline that detects when loom/meld has emitted a regressed module. Per-case cost (~14 h, dominated by wasm-reduce) is PR-incompatible; viable only as a nightly self-hosted job after a separate cheap perf benchmark fires. Implementation sketch documented (5 steps) for when the prerequisites are in place.
1 parent a74fd27 commit fbad854

4 files changed

Lines changed: 567 additions & 18 deletions

File tree

CHANGELOG.md

Lines changed: 84 additions & 0 deletions
@@ -5,6 +5,90 @@ All notable changes to LOOM will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.6.0] - 2026-05-11

This release is driven entirely by real-world findings on production
gale code (Verus-verified kernel-scheduler FFI wasm): closing v0.5.0's
+6.3% CSE size regression, lifting two early-exit guards that made
LOOM a no-op on kernel-style code, and fixing a Z3 panic that blocked
the inline pass on i64-heavy modules. Net effect on the gale_ffi
fixture: code section -0.86% (was +6.3% in v0.5.0). Net effect on
a 2.3 MB calculator.wasm component: -0.4% from the new dead-store
pass alone.
### Optimizer correctness (Z3 / inline)

- **Closed `inline_functions` Z3 SortDiffers panic on i64-heavy wasm**
  (PR-D, closes #98). The verifier's symbolic-locals initialization
  defaulted to 32-bit width regardless of declared type at three
  sites; the gale-ffi crate (u64-packed FFI returns) crashed every
  inline attempt with `SortDiffers { left: BitVec(64), right: BitVec(32) }`.
  Fix: new helpers `local_type_at` + `bv_width_for_value_type`
  resolve param/local types correctly at each extension site. A
  defensive `match_bv_widths` zero-extend helper was added as a
  backstop for future binop sites. See the sketch below.
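A minimal sketch of the two resolution helpers, assuming simplified `ValueType` and `Function` shapes rather than loom's actual verify.rs definitions (the f32/f64 widths are an assumption; the commit only pins i32 vs i64):

```rust
// Sketch only: `ValueType` and `Function` are simplified stand-ins
// for loom's real IR types.
#[derive(Clone, Copy, PartialEq, Debug)]
enum ValueType { I32, I64, F32, F64 }

struct Function {
    params: Vec<ValueType>,
    /// Run-length-encoded declared locals: (count, type) pairs.
    locals: Vec<(u32, ValueType)>,
}

/// Single source of truth for the Z3 bit-vector width of a wasm value type.
fn bv_width_for_value_type(t: ValueType) -> u32 {
    match t {
        ValueType::I32 | ValueType::F32 => 32, // f32-as-32-bit is an assumption
        ValueType::I64 | ValueType::F64 => 64,
    }
}

/// Resolve a flat local index (params first, then declared locals) to its type.
fn local_type_at(func: &Function, idx: usize) -> Option<ValueType> {
    if idx < func.params.len() {
        return Some(func.params[idx]);
    }
    let mut rest = idx - func.params.len();
    for &(count, ty) in &func.locals {
        if rest < count as usize {
            return Some(ty);
        }
        rest -= count as usize;
    }
    None // out of bounds: caller decides how to handle
}
```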
### Optimizer code size on real workloads

- **CSE cost gate eliminates the gale +6.3% regression** (PR-A).
  v0.5.0's enhanced CSE deduplicated every duplicate expression,
  including 1-2 byte constants. Replacing `i32.const -EINVAL`
  (2 bytes) with `local.tee N / local.get N` (4 bytes) plus a new
  local declaration was unconditionally a size regression.
  The new `Expr::worth_dedup(occurrences)` predicate estimates net byte
  savings via the formula `net = (N-1)·(cost-2) - 4` and skips when
  non-positive (first sketch after this list). Gale code section:
  862 → 808 bytes.
- **`eliminate_dead_locals` pass** (PR-B): drops locals declared by
  a function but never read by any `LocalGet` anywhere in the
  function body. Targets the gale "default-then-override" pattern
  (rustc materializes an EINVAL default that every reachable path
  overwrites). The rule is path-INSENSITIVE — sound regardless of
  BrIf/BrTable/early-Return control flow — so the pass DOES NOT
  need the `has_dataflow_unsafe_control_flow` guard that previously
  made `simplify_locals` and `coalesce_locals` no-ops on every
  kernel-style function. Gale code section: 808 → 804 bytes.
  Write neutralization is asymmetric (second sketch after this
  list): `LocalSet → Drop` preserves `[T] → []`; removing
  `LocalTee` outright lets `[T] → [T]` pass through.
- **`eliminate_dead_stores` pass with full backward liveness** (PR-C):
  per-position dead-store elimination via a backward liveness walk
  over the structured wasm instruction tree. Handles Block/If
  precisely (`live-before-if = live-in-then ∪ live-in-else`), Br/
  BrIf/BrTable via label-stack indexing, and Return/Unreachable as
  no-continuation. Loop bodies use a conservative approximation
  (everything read anywhere in the body is live throughout) — sound
  but imprecise inside loops; loop fixpoint precision is a follow-up.
  (Third sketch after this list.) Net effect on calculator.wasm:
  -0.4% from this pass alone (~10 KB on a 2.3 MB component).
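The gate's arithmetic, as a minimal sketch (the real predicate is a method, `Expr::worth_dedup(occurrences)`; the free function and byte costs below just restate the formula):

```rust
/// Net byte savings of deduplicating an expression of encoded size `cost`
/// that occurs `n` times: each occurrence after the first replaces the
/// expression (`cost` bytes) with a 2-byte local.get, and the local.tee
/// plus the new local declaration cost ~4 bytes once.
fn worth_dedup(cost: i64, n: i64) -> bool {
    let net = (n - 1) * (cost - 2) - 4;
    net > 0
}

fn main() {
    // 2-byte `i32.const -EINVAL` seen twice: (2-1)*(2-2) - 4 = -4 → skip.
    assert!(!worth_dedup(2, 2));
    // 6-byte expression seen three times: (3-1)*(6-2) - 4 = 4 → dedup.
    assert!(worth_dedup(6, 3));
}
```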
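A sketch of the asymmetric neutralization rule over a simplified instruction enum (illustrative names, not loom's real IR):

```rust
// Simplified stand-in for loom's instruction type.
enum Instr {
    LocalSet(u32),
    LocalTee(u32),
    Drop,
    I32Const(i32),
}

/// Rewrite a write to a local known to be dead, preserving stack types:
/// `LocalSet` is `[T] → []`, so it must become `Drop` (also `[T] → []`);
/// `LocalTee` is `[T] → [T]`, the identity on the stack, so it is removed.
fn neutralize_dead_write(instr: Instr, out: &mut Vec<Instr>) {
    match instr {
        Instr::LocalSet(_) => out.push(Instr::Drop),
        Instr::LocalTee(_) => {} // removed outright
        other => out.push(other),
    }
}
```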
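A sketch of the backward liveness walk and its If-join, again over a simplified structured IR; Br/BrTable label-stack handling and the conservative loop rule are omitted:

```rust
use std::collections::HashSet;

type LiveSet = HashSet<u32>;

// Simplified structured instruction tree, not loom's real IR.
enum Instr {
    LocalGet(u32),
    LocalSet(u32),
    If { then_arm: Vec<Instr>, else_arm: Vec<Instr> },
}

/// Walk a body backward: given the live-out set, return the live-in set.
fn live_in(body: &[Instr], mut live: LiveSet) -> LiveSet {
    for instr in body.iter().rev() {
        match instr {
            Instr::LocalGet(i) => { live.insert(*i); } // read: gen
            Instr::LocalSet(i) => { live.remove(i); }  // write: kill (the store is dead if `i` was not live here)
            Instr::If { then_arm, else_arm } => {
                // live-before-if = live-in(then) ∪ live-in(else)
                let t = live_in(then_arm, live.clone());
                let e = live_in(else_arm, live);
                live = t.union(&e).cloned().collect();
            }
        }
    }
    live
}
```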
### Cleanup follow-ups

- **`vacuum` const+drop peephole** (PR-D). PR-B/PR-C neutralize
  dead `LocalSet idx` to `Drop`, leaving the value-pusher
  immediately followed by `Drop`. The new `peephole_const_drop`
  folds `pure_push;Drop` pairs (constants, LocalGet, GlobalGet),
  recursing into Block/Loop/If bodies (sketched below). NOT folded:
  memory loads, calls, anything that can trap — discarding the
  result does not discard the trap.
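A sketch of the fold, assuming the same simplified IR style as the sketches above; loom's real pass also recurses into Loop and If bodies:

```rust
// Simplified stand-in for loom's instruction type.
enum Instr {
    I32Const(i32),
    I64Const(i64),
    LocalGet(u32),
    GlobalGet(u32),
    I32Load, // can trap: never folded
    Drop,
    Block(Vec<Instr>),
}

fn is_pure_push(i: &Instr) -> bool {
    matches!(
        i,
        Instr::I32Const(_) | Instr::I64Const(_) | Instr::LocalGet(_) | Instr::GlobalGet(_)
    )
}

/// Remove `pure_push;Drop` pairs; a trap-capable pusher (e.g. a load)
/// followed by Drop is kept, because dropping the value does not drop the trap.
fn peephole_const_drop(body: Vec<Instr>) -> Vec<Instr> {
    let mut out: Vec<Instr> = Vec::new();
    for instr in body {
        match instr {
            Instr::Drop if out.last().map_or(false, is_pure_push) => {
                out.pop(); // fold the pair: the push;Drop is a no-op
            }
            Instr::Block(inner) => out.push(Instr::Block(peephole_const_drop(inner))),
            other => out.push(other),
        }
    }
    out
}
```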
### Research outputs

- `docs/research/gale-v0.5.0/source-pattern-analysis.md` — eight
  optimization-relevant patterns found in gale source with
  file:line citations (FSM dispatch, default-then-override, Verus
  `decreases` bounded loops, tail-call dispatch, leaf-inline +
  const-prop candidates, bit-mask axioms, 2D `match (state, event)`,
  Verus annotations as trusted axioms).
- `docs/research/gale-v0.5.0/wasm-opt-gap-analysis.md` — ranked top
  7 wasm-opt passes by expected payoff on kernel-style code, with
  per-pick LOC and complexity estimates. Picks #1, #2 (narrowed),
  and #3 are shipped in this release.
### Test count

303 → 317 tests passing (+14 across all v0.6.0 PRs).
## [0.5.0] - 2026-05-02

This release closes a real soundness bug discovered on production
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@

# WarpL (arXiv 2604.13693) — Evaluation for PulseEngine CI

**Verdict: Adopt later.** WarpL solves a problem PulseEngine does not yet have — diagnosing the *root cause* of an already-observed runtime perf regression *inside the JIT* — not detecting that a regression occurred. As a CI signal it is the wrong shape: it costs ~3 hours per case (≈14 h including `wasm-reduce` pre-shrinking) after you already know there is a slowdown, and the bug class it isolates (Wasmtime/Cranelift instruction-lowering pathologies) is upstream of loom/meld output. Defer until we have (a) a stable wasmtime-based perf baseline in CI and (b) a confirmed regression class plausibly attributable to a loom/meld output pattern.
## 1. Paper summary

WarpL (Zeng et al., ICSE '26, arXiv:2604.13693v1, 15 Apr 2026) is a **mutation-based root-cause localizer** for Wasm-runtime performance bugs, not a regression detector. Given a bug-inducing Wasm module already known to be slow on runtime R_buggy:

- **Mutation.** Fine-grained, type-aware, single-instruction edits (Table 1): operand-instruction substitution (`t.const`/`local.get`/`global.get`/`t.load` interchange within the same type), operator-instruction substitution (replace `t.op` with another op of the same `optype`), and operator-instruction deletion (delete the op plus its operand-producers, replacing the stack effect with a const). Control-flow instructions are explicitly excluded. ~500 mutants generated per case, deduped to ~350.
- **Functionally-similar selection.** Each mutant is run on R_buggy and on an **oracle runtime** R_oracle (a second Wasm engine known not to exhibit the bug). Score = `α·perfDiffScore + β·funcSimScore`, where `perfDiffScore` rewards a large execution-time gap on R_buggy and `funcSimScore` rewards near-identical execution time on R_oracle (Algorithm 1; see the sketch after this list). Mutants whose behavior is preserved on R_oracle but whose slowdown disappears on R_buggy are the candidates.
- **Slow-code isolation.** Dump JIT machine code from R_buggy for the original vs the selected mutant. Diff via longest common subsequence at the x86-64/aarch64 opcode level. The differing instruction window is the "slow code" report.
- **Perf signature.** A single scalar: wall-clock execution time on each runtime, plus a machine-instruction count from the JIT dump. No syscall histograms, no perf-event traces, no cycle counters.
- **Eval.** 12 issues across Wasmtime / Wasmer / WasmEdge; localized 10/12 and found 6 previously-unknown Wasmtime bugs. **Wall time: <3 h per case for WarpL, plus ~11 h per case for `wasm-reduce` pre-shrinking.** The tool is ~400 LOC C++ (Binaryen LibTooling) + ~500 LOC Python/shell. Open source: https://github.com/BZTesting/WarpL.
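A sketch of the selection score as summarized above; the paper's exact normalization of `perfDiffScore`/`funcSimScore` may differ, so treat this as a paraphrase of Algorithm 1, not its implementation:

```rust
// Illustrative names; only the α/β weighting and the two score roles
// come from the summary above.
struct MutantTiming {
    orig_buggy_s: f64,  // original module on R_buggy (seconds)
    mut_buggy_s: f64,   // mutant on R_buggy
    orig_oracle_s: f64, // original module on R_oracle
    mut_oracle_s: f64,  // mutant on R_oracle
}

/// score = α·perfDiffScore + β·funcSimScore: high when the mutant's
/// slowdown disappears on R_buggy while its time on R_oracle stays
/// near-identical to the original's.
fn score(t: &MutantTiming, alpha: f64, beta: f64) -> f64 {
    let perf_diff = (t.orig_buggy_s - t.mut_buggy_s) / t.orig_buggy_s;
    let func_sim = 1.0 - (t.orig_oracle_s - t.mut_oracle_s).abs() / t.orig_oracle_s;
    alpha * perf_diff + beta * func_sim
}
```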
**Key assumptions / limitations** (Section 6): (1) you already know which module is slow — WarpL needs a labeled bug-inducing input; (2) you need a second Wasm runtime that does *not* trigger the issue, to act as oracle; (3) the bug must live in JIT lowering or IR optimization, not in host I/O or libc (the two failures, #7973 and #7745, were both I/O-bound, outside JIT scope); (4) `wasm-reduce` dominates wall time and is the bottleneck the authors flag for future work.
## 2. PulseEngine corpus survey

Confirmed in `/Users/r/git/pulseengine/loom/tests/corpus/`:

| Fixture | Path | Role |
|---|---|---|
| httparse | `tests/corpus/httparse.wasm` | HTTP/1.x header parser (witness real-app fixture) |
| nom_numbers | `tests/corpus/nom_numbers.wasm` | nom parser-combinator numeric primitives |
| state_machine | `tests/corpus/state_machine.wasm` | finite-state-machine kernel (kiln test) |
| json_lite | `tests/corpus/json_lite.wasm` | minimal JSON tokenizer |
| loom (self) | `tests/corpus/loom.wasm` | LOOM compiled to Wasm — dogfood target |

Sibling projects `/Users/r/git/pulseengine/witness` and `/Users/r/git/pulseengine/meld` exist on disk but were outside the read-permission scope of this session; the loom-side mirror at `tests/corpus/` already holds the witness-derived fixtures the user named, so the five above are representative without cross-repo access. `loom-testing/Cargo.toml` already depends on `wasmtime = "17.0"` — the runtime WarpL primarily targets.
## 3. Feasibility

- **On `loom optimize` output**: Yes, mechanically. WarpL is a black-box harness over a Wasm binary, and loom's optimized output is a valid module. But WarpL's value is conditional on a *known* slowdown on a runtime; loom currently validates via diff vs wasm-opt (size, semantics), not runtime wall-clock, so WarpL has no trigger.
- **On `meld fuse` output**: Same answer, with extra friction. Component-Model fused outputs would need `wasm-reduce` adapted for components (today it operates on core modules) — non-trivial.
- **Runtimes to capture signatures**: WarpL needs **two** runtimes (buggy + oracle). Wasmtime is already a dep; adding WasmEdge or Wasmer to CI is feasible (both have static binaries) but doubles container size, and the wall-clock baseline must be stable enough that a 1.5×–8× gap (the gaps WarpL reported) is detectable above noise — hard on shared GitHub runners.
- **Integration cost**: WarpL itself is ~900 LOC upstream, plus a Rust harness wrapping it (~300 LOC) and a perf-baseline DB (artifact-store JSON) for the trigger. Per-case CI cost is **~14 h**, dominated by `wasm-reduce` — incompatible with PR-blocking CI; only viable as a nightly/manual job on a self-hosted runner.
## 4. Recommendation

**Adopt later.** WarpL is a high-quality localizer but the wrong end of the pipeline for CI: it presumes regression detection has already happened, takes hours per case, and targets bugs in the Wasm runtime's JIT — bugs we cannot fix from loom/meld even if WarpL finds them. The right CI signal for "loom emitted a Wasm whose perf changed" is a cheap wall-clock benchmark on the five fixtures above with a noise-aware threshold; WarpL becomes useful only *after* such a benchmark fires and we want to know whether the cause is in loom's transformation or in Wasmtime's lowering of it.
## 5. Implementation sketch (when triggered)

1. Add a nightly `cargo bench` harness in `loom-testing/` that runs each of the 5 fixtures through `loom optimize` then through wasmtime, recording wall-clock and machine-instruction counts (from `wasmtime compile --emit-clif`) against a committed baseline JSON.
2. Define a regression trigger: ≥1.3× wall-clock slowdown vs baseline, reproducible across 3 runs on a self-hosted runner — only then enqueue WarpL (see the sketch after this list).
3. Vendor WarpL (BZTesting/WarpL) as a submodule under `tools/warpl/`; build it in a separate Docker image with WasmEdge as the oracle runtime.
4. Write a Rust wrapper (`loom-testing/src/bin/warpl_localize.rs`) that takes the regressed fixture, runs `wasm-reduce` with a wall-clock-preserving predicate, then invokes WarpL and posts the slow-code report as a GitHub issue with the JIT-diff snippet.
5. Decide per issue whether the root cause is a loom transformation (fix in loom) or a Wasmtime lowering bug (file upstream) — WarpL's report distinguishes these because the differing mutant is at Wasm-instruction granularity.
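A sketch of the step-2 trigger check, under the assumptions in the list above (1.3× threshold, 3 runs); the function name and baseline plumbing are illustrative:

```rust
// Illustrative sketch: baseline medians would come from the committed
// baseline JSON in step 1.
const SLOWDOWN_THRESHOLD: f64 = 1.3; // ≥1.3× wall-clock vs baseline
const REQUIRED_RUNS: usize = 3;      // reproducible across 3 runs

/// Enqueue WarpL only if every one of the required runs breaches the
/// threshold against the committed baseline for this fixture.
fn should_enqueue_warpl(baseline_s: f64, runs_s: &[f64]) -> bool {
    runs_s.len() >= REQUIRED_RUNS
        && runs_s.iter().all(|&t| t / baseline_s >= SLOWDOWN_THRESHOLD)
}
```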
