Skip to content

Commit 887133e

Browse files
avrabeclaude
andauthored
fix(vcr-ra): optimized path preserves callee-saved r4-r8 — AAPCS prologue (#490, #242) (#495)
The optimized ARM selection path (the default, non-`--relocatable` self-contained image) uses r4-r8 as scratch / promoted locals but emitted no prologue, so a caller's callee-saved registers were silently clobbered — a systemic AAPCS violation. It was masked because non-leaf callers route through the direct selector (which over-saves) and the frozen byte gate only covers the `--relocatable` direct path; it blocks flipping the non-leaf thin-forwarder prologue lever (#428). Fix: after range-realloc, wrap a body that genuinely touches a callee-saved register in a conservative `push {r4-r8,lr}` / `pop {r4-r8,pc}`; `shrink_callee_saved_saves` then trims the save to the registers actually used. Two new liveness helpers: - `body_uses_callee_saved` — the emit-side twin of `shrink_callee_saved_saves`, reading the same control-flow allowlist + `reg_effect` classification as push-conditions. Conservative on `reg_effect` `None` (an unmodeled op may touch a callee-saved register → force the push): under-saving is the AAPCS bug, over-saving is harmless (shrink pads it down). - `ensure_callee_saved_prologue` — wraps the body, no-op when a prologue already exists (direct path) or the body is callee-saved-free (a spurious push is permanent — shrink never removes one, only pads). Decided on the POST-realloc body, where realloc has lowered low-pressure r4-r8 scratch back to r0-r3, so a save is added only for genuinely-clobbered registers (a pre-realloc decision over-saves). Runs in both realloc-on (shrink trims) and realloc-off (full save kept) configurations. Scope: callee-saved (r4-r8) only. Caller-saved (r0-r3) preservation across import calls on the optimized path is a separate known gap (#197). Functions with >4 params (incoming stack args) decline to the direct selector, so the wrap is a no-op there and cannot perturb stack-arg offsets (verified byte-identical to `--no-optimize`). Default-on (a correctness fix is not flag-gated); it adds save/restore cycles to optimized-path functions that use r4-r8 — a non-shipped path (gale ships `--relocatable`), so no silicon cost. Frozen byte gate stays bit-identical (direct path untouched). Validation: - New unicorn sentinel oracle (`callee_saved_490_differential.py`, CI-gated): sets r4-r8 to sentinels, asserts result == wasmtime AND sentinels restored, across 16-bit push and 32-bit PUSH.W (high-register) forms; passes default, `SYNTH_RANGE_REALLOC=0`, and `SYNTH_DEAD_FRAME_ELIM=1`. FAILs without the fix. - 4 new liveness unit tests; full workspace suite + bulk-memory optimized-path differential green; frozen byte gate bit-identical. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent d79f6ad commit 887133e

6 files changed

Lines changed: 472 additions & 1 deletion

File tree

.github/workflows/ci.yml

Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -343,3 +343,40 @@ jobs:
343343
env:
344344
SYNTH: ./target/debug/synth
345345
run: python scripts/repro/const_addr_fold_riscv_differential.py
346+
347+
callee-saved-490-oracle:
348+
name: optimized-path callee-saved preservation oracle
349+
# VCR-ORACLE-001 (#242, #490): EXECUTE optimized-path functions that use
350+
# r4-r8 under unicorn (UC_ARCH_ARM / Thumb) with r4-r8 set to sentinels, and
351+
# assert both the result matches wasmtime AND every callee-saved register is
352+
# restored at return. This is the runtime gate for the #490 fix (the
353+
# optimized path now emits the `push {r4-r8,lr}` / `pop {r4-r8,pc}` it was
354+
# missing); the frozen byte gate only covers the `--relocatable` direct path,
355+
# so nothing else exercises the optimized path's AAPCS compliance. Covers the
356+
# 16-bit push and the 32-bit PUSH.W (high-register) forms. Isolated job:
357+
# emulation deps pip-installed here ONLY.
358+
runs-on: ubuntu-latest
359+
steps:
360+
- uses: actions/checkout@v7
361+
- uses: dtolnay/rust-toolchain@stable
362+
- name: Cache Cargo dependencies
363+
uses: actions/cache@v5
364+
with:
365+
path: |
366+
~/.cargo/registry
367+
~/.cargo/git
368+
target/
369+
key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
370+
restore-keys: |
371+
${{ runner.os }}-cargo-
372+
- name: Build synth
373+
run: cargo build -p synth-cli
374+
- uses: actions/setup-python@v5
375+
with:
376+
python-version: "3.x"
377+
- name: Install emulation deps
378+
run: pip install wasmtime unicorn pyelftools
379+
- name: Run callee-saved preservation oracle
380+
env:
381+
SYNTH: ./target/debug/synth
382+
run: python scripts/repro/callee_saved_490_differential.py

crates/synth-backend/src/arm_backend.rs

Lines changed: 14 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -441,9 +441,22 @@ fn compile_wasm_to_arm(
441441
} else {
442442
out
443443
};
444+
// #490 (epic #242): the optimized selector uses r4-r8 as scratch /
445+
// promoted locals but emits no prologue, silently clobbering a caller's
446+
// callee-saved registers. Add the missing `push {r4-r8,lr}` /
447+
// `pop {r4-r8,pc}` HERE — on the post-realloc body, where realloc has
448+
// lowered low-pressure r4-r8 scratch back to r0-r3, so a save is added
449+
// only for registers genuinely clobbered. `shrink_callee_saved_saves`
450+
// (next) then trims it to the used set. No-op on the direct path (it
451+
// already has its own prologue) and on callee-saved-free leaves.
452+
let out = synth_synthesis::liveness::ensure_callee_saved_prologue(&out);
444453
synth_synthesis::liveness::shrink_callee_saved_saves(&out).unwrap_or(out)
445454
} else {
446-
arm_instrs
455+
// Range-realloc off (`SYNTH_RANGE_REALLOC=0`): the optimized path still
456+
// must preserve the callee-saved registers it clobbers (#490). No shrink
457+
// (it is coupled to the realloc lever), so the conservative full save
458+
// stays — correct, just not minimised in this debug configuration.
459+
synth_synthesis::liveness::ensure_callee_saved_prologue(&arm_instrs)
447460
};
448461

449462
// VCR-RA-001 SHADOW ALLOCATION (#209/#242): run the register allocator on

crates/synth-synthesis/src/liveness.rs

Lines changed: 201 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2487,6 +2487,112 @@ pub fn shrink_callee_saved_saves(instrs: &[ArmInstruction]) -> Option<Vec<ArmIns
24872487
Some(out)
24882488
}
24892489

2490+
/// #490 (epic #242): does this body clobber/read any callee-saved register, so
2491+
/// the optimized path must wrap it in a `push {r4-r8,lr}` / `pop {r4-r8,pc}`
2492+
/// prologue/epilogue to honour AAPCS?
2493+
///
2494+
/// This is the **emit-side twin** of [`shrink_callee_saved_saves`]: it reads the
2495+
/// exact same per-op classification (the control-flow allowlist + `reg_effect`)
2496+
/// as *push-conditions* rather than *decline-conditions*, so the two passes
2497+
/// always agree. [`ensure_callee_saved_prologue`] emits a conservative full
2498+
/// `push {r4-r8,lr}` when this returns `true`; `shrink_callee_saved_saves` then
2499+
/// trims that push to the registers actually touched (or declines, leaving the
2500+
/// full push, on frame/call/unmodeled bodies it cannot prove safe).
2501+
///
2502+
/// **Conservative on `None` (fail-safe).** An op `reg_effect` cannot model
2503+
/// (i64-pair pseudo-ops, calls, anything outside the modeled scope) *might* write
2504+
/// a callee-saved register, so it forces the push. Under-saving is the exact
2505+
/// AAPCS violation #490 fixes; over-saving an unused register is harmless (shrink
2506+
/// pads it down, or — on a body shrink declines — it is a few dead bytes). The
2507+
/// only load-bearing answer is `false`: a spurious push on a callee-saved-free
2508+
/// leaf is permanent (shrink never *removes* a push), so the control-flow
2509+
/// allowlist and the cs-intersection test must be exact, mirroring shrink. The
2510+
/// decision runs on the **post-range-realloc** body, where low-pressure r4-r8
2511+
/// scratch has already been lowered to r0-r3, so only genuine clobbers remain.
2512+
pub fn body_uses_callee_saved(instrs: &[ArmInstruction]) -> bool {
2513+
use ArmOp::*;
2514+
const CALLEE_SAVED: [Reg; 5] = [Reg::R4, Reg::R5, Reg::R6, Reg::R7, Reg::R8];
2515+
for ins in instrs {
2516+
match &ins.op {
2517+
// Function-boundary markers and pure label control flow carry no
2518+
// callee-saved data effect (mirrors shrink's allowlist exactly).
2519+
Push { .. }
2520+
| Pop { .. }
2521+
| Label { .. }
2522+
| B { .. }
2523+
| BOffset { .. }
2524+
| BCondOffset { .. }
2525+
| Bhs { .. }
2526+
| Blo { .. }
2527+
| Bcc { .. } => {}
2528+
Bx { rm } if *rm == Reg::LR => {}
2529+
op => match reg_effect(op) {
2530+
Some(e) => {
2531+
if e.defs
2532+
.iter()
2533+
.chain(e.uses.iter())
2534+
.any(|r| CALLEE_SAVED.contains(r))
2535+
{
2536+
return true;
2537+
}
2538+
}
2539+
// Unmodeled op → assume it may touch a callee-saved register.
2540+
None => return true,
2541+
},
2542+
}
2543+
}
2544+
false
2545+
}
2546+
2547+
/// #490 (epic #242): give the optimized path the callee-saved prologue/epilogue
2548+
/// it is missing. The optimized selector uses r4-r8 as scratch / promoted locals
2549+
/// but emits no `push`/`pop`, so a caller's r4-r8 are silently clobbered — a
2550+
/// systemic AAPCS violation. When the (post-realloc) body genuinely touches a
2551+
/// callee-saved register, wrap it in a conservative full `push {r4-r8,lr}` /
2552+
/// `pop {r4-r8,pc}`; [`shrink_callee_saved_saves`] (run next) trims that to the
2553+
/// registers actually used, or declines and leaves the full save on a
2554+
/// frame/call/unmodeled body it cannot prove safe.
2555+
///
2556+
/// No-op when a prologue already exists (the direct selector emits its own
2557+
/// `push {…,lr}`) or the body is callee-saved-free (e.g. a pure-r0-r3 leaf —
2558+
/// left byte-identical, which is load-bearing: an added push could never be
2559+
/// removed later). The push is inserted at index 0, landing before any
2560+
/// `sub sp,#frame` the optimized path reserved; each `bx lr` becomes
2561+
/// `pop {…,pc}`, landing after the matching `add sp,#frame`.
2562+
pub fn ensure_callee_saved_prologue(instrs: &[ArmInstruction]) -> Vec<ArmInstruction> {
2563+
use ArmOp::*;
2564+
// Already has a prologue (direct path) → leave it for shrink to size.
2565+
if instrs
2566+
.iter()
2567+
.any(|i| matches!(&i.op, Push { regs } if regs.contains(&Reg::LR)))
2568+
{
2569+
return instrs.to_vec();
2570+
}
2571+
if !body_uses_callee_saved(instrs) {
2572+
return instrs.to_vec();
2573+
}
2574+
let mut out = Vec::with_capacity(instrs.len() + 1);
2575+
out.push(ArmInstruction {
2576+
op: Push {
2577+
regs: vec![Reg::R4, Reg::R5, Reg::R6, Reg::R7, Reg::R8, Reg::LR],
2578+
},
2579+
source_line: None,
2580+
});
2581+
for ins in instrs {
2582+
if matches!(&ins.op, Bx { rm } if *rm == Reg::LR) {
2583+
out.push(ArmInstruction {
2584+
op: Pop {
2585+
regs: vec![Reg::R4, Reg::R5, Reg::R6, Reg::R7, Reg::R8, Reg::PC],
2586+
},
2587+
source_line: ins.source_line,
2588+
});
2589+
} else {
2590+
out.push(ins.clone());
2591+
}
2592+
}
2593+
out
2594+
}
2595+
24902596
/// VCR-RA-002 (#390, epic #242): eliminate a provably-dead stack frame.
24912597
///
24922598
/// `compute_local_layout` reserves a frame slot — materialized as
@@ -5885,6 +5991,101 @@ mod tests {
58855991
assert_eq!(out[4], seq[4]);
58865992
}
58875993

5994+
// ---- ensure_callee_saved_prologue (#490) ----
5995+
5996+
#[test]
5997+
fn ensure_prologue_wraps_body_that_touches_callee_saved() {
5998+
// Optimized-path shape: no prologue, body writes r4/r5, returns via bx lr.
5999+
let seq = vec![
6000+
ins(ArmOp::Movw {
6001+
rd: Reg::R4,
6002+
imm16: 100,
6003+
}),
6004+
ins(ArmOp::Add {
6005+
rd: Reg::R5,
6006+
rn: Reg::R0,
6007+
op2: Operand2::Reg(Reg::R4),
6008+
}),
6009+
ins(ArmOp::Mov {
6010+
rd: Reg::R0,
6011+
op2: Operand2::Reg(Reg::R5),
6012+
}),
6013+
ins(ArmOp::Bx { rm: Reg::LR }),
6014+
];
6015+
let out = ensure_callee_saved_prologue(&seq);
6016+
assert_eq!(
6017+
out[0].op,
6018+
ArmOp::Push {
6019+
regs: vec![Reg::R4, Reg::R5, Reg::R6, Reg::R7, Reg::R8, Reg::LR]
6020+
}
6021+
);
6022+
assert_eq!(
6023+
out.last().unwrap().op,
6024+
ArmOp::Pop {
6025+
regs: vec![Reg::R4, Reg::R5, Reg::R6, Reg::R7, Reg::R8, Reg::PC]
6026+
}
6027+
);
6028+
// Body untouched between the new prologue and epilogue.
6029+
assert_eq!(out[1..=3], seq[0..=2]);
6030+
// And shrink then sizes it to the registers actually used ({r4,r5} → pad).
6031+
let shrunk = shrink_callee_saved_saves(&out).expect("shrinks");
6032+
assert_eq!(
6033+
shrunk[0].op,
6034+
ArmOp::Push {
6035+
regs: vec![Reg::R4, Reg::R5, Reg::R6, Reg::LR]
6036+
}
6037+
);
6038+
}
6039+
6040+
#[test]
6041+
fn ensure_prologue_leaves_callee_saved_free_leaf_untouched() {
6042+
// Pure r0-r3 leaf: no callee-saved register touched → no push added
6043+
// (a spurious push would be permanent — shrink never removes one).
6044+
let seq = vec![
6045+
ins(ArmOp::Add {
6046+
rd: Reg::R0,
6047+
rn: Reg::R0,
6048+
op2: Operand2::Imm(1),
6049+
}),
6050+
ins(ArmOp::Bx { rm: Reg::LR }),
6051+
];
6052+
assert_eq!(ensure_callee_saved_prologue(&seq), seq);
6053+
assert!(!body_uses_callee_saved(&seq));
6054+
}
6055+
6056+
#[test]
6057+
fn ensure_prologue_is_conservative_on_unmodeled_ops() {
6058+
// An op reg_effect cannot model (here a Bl call) might clobber a
6059+
// callee-saved register → force the push (fail-safe). shrink then
6060+
// declines on the call and leaves the full save.
6061+
let seq = vec![
6062+
ins(ArmOp::Bl {
6063+
label: "func_1".to_string(),
6064+
}),
6065+
ins(ArmOp::Bx { rm: Reg::LR }),
6066+
];
6067+
assert!(body_uses_callee_saved(&seq));
6068+
let out = ensure_callee_saved_prologue(&seq);
6069+
assert!(matches!(&out[0].op, ArmOp::Push { regs } if regs.contains(&Reg::LR)));
6070+
assert_eq!(shrink_callee_saved_saves(&out), None);
6071+
}
6072+
6073+
#[test]
6074+
fn ensure_prologue_skips_body_that_already_has_one() {
6075+
// Direct-path output already carries its own `push {…,lr}` → no-op
6076+
// (shrink owns sizing it; double-wrapping would corrupt the frame).
6077+
let seq = vec![
6078+
prologue(),
6079+
ins(ArmOp::Add {
6080+
rd: Reg::R5,
6081+
rn: Reg::R0,
6082+
op2: Operand2::Imm(1),
6083+
}),
6084+
epilogue(),
6085+
];
6086+
assert_eq!(ensure_callee_saved_prologue(&seq), seq);
6087+
}
6088+
58886089
// ---- elide_dead_frame (VCR-RA-002, #390) ----
58896090

58906091
fn frame_sub(n: i32) -> ArmInstruction {

crates/synth-synthesis/src/optimizer_bridge.rs

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5516,6 +5516,13 @@ impl OptimizerBridge {
55165516
arm_instrs.push(ArmOp::Bx { rm: Reg::LR });
55175517
}
55185518

5519+
// #490: the callee-saved prologue/epilogue the optimized path needs to
5520+
// honour AAPCS is added AFTER range-realloc (in `arm_backend`), where the
5521+
// re-allocated body shows which callee-saved registers are *genuinely*
5522+
// live — realloc lowers low-pressure r4-r8 scratch back to r0-r3, so a
5523+
// decision here (pre-realloc) would over-save. See
5524+
// `liveness::ensure_callee_saved_prologue`.
5525+
55195526
Ok(arm_instrs)
55205527
}
55215528
}

scripts/repro/callee_saved_490.wat

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
(module
2+
(memory 1)
3+
;; #490 fixture (epic #242): functions the OPTIMIZED path lowers using r4-r8 as
4+
;; scratch. Before the fix the optimized path emitted no prologue and silently
5+
;; clobbered a caller's callee-saved registers; the differential sets r4-r8 to
6+
;; sentinels, runs each function, and asserts the sentinels survive AND the
7+
;; result matches wasmtime — proving the push/pop the fix adds preserves them.
8+
;;
9+
;; Pressure is tuned to exercise the callee-saved range WITHOUT exhausting the
10+
;; R0-R8 pool (the optimized path has no spill, so an exhausting fold would hit
11+
;; a separate pre-existing miscompile and conflate two issues). Each function
12+
;; below is differential-verified to compute correctly on the optimized path.
13+
14+
;; low: two live temps → 16-bit `push {r4,r5,r6,lr}` (r6 is shrink's even-pad).
15+
(func $low (export "low") (param i32) (result i32)
16+
(i32.add (local.get 0) (i32.const 100)))
17+
18+
;; mid: three live subterms → `push {r4,r5,r6,lr}`.
19+
(func $mid (export "mid") (param i32 i32) (result i32)
20+
(i32.add (i32.mul (local.get 0) (local.get 1))
21+
(i32.sub (local.get 0) (local.get 1))))
22+
23+
;; wide: enough simultaneously-live products to reach r8 → 32-bit Thumb-2
24+
;; `push.w {r4-r8,lr}`, exercising the high-register save the 16-bit form
25+
;; cannot encode.
26+
(func $wide (export "wide") (param i32 i32 i32) (result i32)
27+
(i32.add
28+
(i32.add (i32.mul (local.get 0) (local.get 1))
29+
(i32.mul (local.get 1) (local.get 2)))
30+
(i32.mul (local.get 0) (local.get 2))))
31+
)

0 commit comments

Comments
 (0)