cmd/internal/obj/arm64: fuse adjacent spill/reload LDR/STR into LDP/STP by hbrooks · Pull Request #9 · ellipsis-dev-test/go

hbrooks · 2026-05-28T14:54:22Z

Originally PR golang#79689 in golang/go by @gaul

The SSA pair pass (cmd/compile/internal/ssa/pair.go) runs before regalloc and only sees source-level MOVDload/MOVDstore values. Spill/reload code that regalloc inserts later for OpStoreReg/OpLoadReg becomes individual LDR/STR instructions that never get a chance to be paired, even when two spills target adjacent 8-byte stack slots.

Add a small post-preprocess peephole in obj7.go that walks the final Prog list and fuses strictly-adjacent AMOVD pairs that share a base register and have consecutive 8-byte offsets into a single ALDP/ASTP. The second Prog is converted to obj.ANOP (0 bytes) rather than unlinked so that branch targets, line-number entries, and inline markers referencing it remain valid.

Fusion is gated on several safety conditions:

- same base register and same Addr.Name (AUTO/PARAM/NONE), distinct destination registers (LDP with Rt == Rt2 is UNPREDICTABLE), and no pre/post-index or register-offset addressing
- lower offset in [-512, +504] step 8 so the resolved encoding is a single LDP/STP (not the assembler's adjuster + LDP, which would be no smaller than the original pair)
- for loads, neither destination equals the base register (LDP with Rt == Rn is UNPREDICTABLE, and an LDR pair where the first load writes the base reads the post-write base in the second instruction)
- the second instruction is not a branch target; otherwise paths that jump directly to it would skip the work the LDP/STP now does at the first instruction's position

TestPairLDRSTR in cmd/internal/obj/arm64 drives pairLDRSTR directly with hand-constructed Prog chains, covering each fusion path and each gating condition. test/codegen/memcombine.go pins down the spill/reload pattern that the SSA pair pass misses but this peephole catches. The end-to-end regression in test/fixedbugs/spillreload_arm64_pair.go exercises the conditional-call-with-adjacent-reloads pattern from runtime.schedule that miscompiled before the branch-target check was added.

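The gating conditions in the description can be sketched as a standalone predicate. This is a simplified model, not the code in obj7.go: `memInst`, `addrMode`, and `canFuse` are hypothetical names, `off` stands for the already-resolved immediate, and the sketch assumes that either offset order is acceptable and that the distinct-register check applies only to loads (the description does not say either way).

```go
package main

import "fmt"

// addrMode is a hypothetical stand-in for the addressing form of an operand.
type addrMode int

const (
	modeOffset    addrMode = iota // plain base + immediate
	modePreIndex                  // pre-indexed, never fused
	modePostIndex                 // post-indexed, never fused
	modeRegOffset                 // base + register, never fused
)

// memInst is a hypothetical, simplified model of one 8-byte LDR/STR.
// It is not obj.Prog; it only carries the fields the checks need.
type memInst struct {
	load         bool     // true for LDR, false for STR
	reg          int      // register loaded into or stored from
	base         int      // base register
	off          int64    // resolved immediate offset (assumption)
	name         int      // stand-in for Addr.Name (AUTO/PARAM/NONE)
	mode         addrMode // addressing form
	branchTarget bool     // some branch jumps to this instruction
}

// canFuse reports whether a followed immediately by b could become one LDP/STP.
func canFuse(a, b memInst) bool {
	if a.load != b.load {
		return false // never mix a load with a store
	}
	if a.mode != modeOffset || b.mode != modeOffset {
		return false // no pre/post-index or register-offset addressing
	}
	if a.base != b.base || a.name != b.name {
		return false // same base register and same name class
	}
	lo, hi := a.off, b.off
	if lo > hi {
		lo, hi = hi, lo // assumption: either order is acceptable
	}
	if hi-lo != 8 {
		return false // slots must be consecutive 8-byte offsets
	}
	if lo < -512 || lo > 504 || lo%8 != 0 {
		return false // lower offset must fit the scaled LDP/STP immediate
	}
	if a.load {
		if a.reg == b.reg {
			return false // LDP with Rt == Rt2 is UNPREDICTABLE
		}
		if a.reg == a.base || b.reg == a.base {
			return false // a load must not overwrite the base
		}
	}
	return !b.branchTarget // a jump to b would skip the fused access
}

func main() {
	const sp = 31 // hypothetical register number for the stack pointer
	a := memInst{load: true, reg: 1, base: sp, off: 16}
	b := memInst{load: true, reg: 2, base: sp, off: 24}
	fmt.Println(canFuse(a, b)) // true
	b.branchTarget = true
	fmt.Println(canFuse(a, b)) // false
}
```

Keeping the checks in one pure predicate mirrors how the description says TestPairLDRSTR exercises each gating condition in isolation.
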
BenchmarkSpillReloadPair in cmd/compile/internal/test goes from 1.96 to 1.82 ns/op on an Apple M4 Max with the coalescer enabled (~7%). Found via armlint: adjacent LDR/STR findings drop from 4345 -> 212 on gofmt (95.1% reduction) and 26007 -> 1969 on cmd/go (92.4% reduction). Text section shrinks by 16640 bytes (1.395%) on gofmt and 96288 bytes (1.449%) on cmd/go, eliding 4160 and 24072 instructions respectively.
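
The size and reduction figures above are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below assumes only that every A64 instruction is 4 bytes and that each fusion elides exactly one instruction; the helper names `bytesSaved` and `reductionPct` are made up for illustration.

```go
package main

import "fmt"

// instBytes is the fixed width of an A64 instruction.
const instBytes = 4

// bytesSaved converts a count of elided instructions into text-section bytes.
func bytesSaved(elided int) int { return elided * instBytes }

// reductionPct returns the percentage drop from before to after.
func reductionPct(before, after float64) float64 {
	return 100 * (before - after) / before
}

func main() {
	fmt.Println(bytesSaved(4160), bytesSaved(24072))   // 16640 96288
	fmt.Printf("%.1f%%\n", reductionPct(4345, 212))    // 95.1%
	fmt.Printf("%.1f%%\n", reductionPct(26007, 1969))  // 92.4%
	fmt.Printf("%.1f%%\n", reductionPct(1.96, 1.82))   // 7.1%
}
```
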

cmd/internal/obj/arm64: fuse adjacent spill/reload LDR/STR into LDP/STP #9
hbrooks wants to merge 1 commit into master from demo/pr-79689
