
cmd/internal/obj/arm64: fuse adjacent spill/reload LDR/STR into LDP/STP #9

Open
hbrooks wants to merge 1 commit into master from demo/pr-79689

hbrooks commented May 28, 2026

Originally PR golang#79689 in golang/go by @gaul

The SSA pair pass (cmd/compile/internal/ssa/pair.go) runs before regalloc
and only sees source-level MOVDload/MOVDstore values. Spill/reload code
that regalloc inserts later for OpStoreReg/OpLoadReg becomes individual
LDR/STR instructions that never get a chance to be paired, even when two
spills target adjacent 8-byte stack slots.

Add a small post-preprocess peephole in obj7.go that walks the final
Prog list and fuses strictly-adjacent AMOVD pairs that share a base
register and have consecutive 8-byte offsets into a single ALDP/ASTP.
The second Prog is converted to obj.ANOP (0 bytes) rather than unlinked
so that branch targets, line-number entries, and inline markers
referencing it remain valid.
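
The rewrite-in-place idea can be sketched with a toy instruction list. This is only an illustration: `prog`, the `op*` constants, and `fuseInPlace` are hypothetical stand-ins, not the real `obj.Prog` or `obj.A*` definitions.

```go
package main

import "fmt"

// op is a toy opcode. The names echo, but are not, the real obj.A* constants.
type op int

const (
	opNOP op = iota
	opMOVD
	opLDP
)

// prog is a hypothetical stand-in for obj.Prog: a singly linked instruction node.
type prog struct {
	as   op
	link *prog
}

// fuseInPlace turns first into a pair instruction and neutralizes second
// without unlinking it, so any pointer that refers to second stays valid.
func fuseInPlace(first, second *prog) {
	first.as = opLDP
	second.as = opNOP
}

// listLen counts the nodes reachable from head.
func listLen(head *prog) int {
	n := 0
	for p := head; p != nil; p = p.link {
		n++
	}
	return n
}

func main() {
	second := &prog{as: opMOVD}
	first := &prog{as: opMOVD, link: second}
	marker := second // e.g. a line-number entry or inline marker pointing at second

	fuseInPlace(first, second)

	// The list keeps both nodes; the marker still points into the list.
	fmt.Println(listLen(first), first.as == opLDP, marker.as == opNOP)
}
```

Keeping the node means nothing that holds a pointer to it has to be patched; the cost is one zero-size entry in the list.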

Fusion is gated on several safety conditions:
  - same base register and same Addr.Name (AUTO/PARAM/NONE), distinct
    destination registers (LDP with Rt == Rt2 is UNPREDICTABLE), and no
    pre/post-index or register-offset addressing
  - lower offset in [-512, +504] step 8 so the resolved encoding is a
    single LDP/STP (not the assembler's adjuster + LDP, which would be
    no smaller than the original pair)
  - for loads, neither destination equals the base register (in an LDR
    pair whose first load overwrites the base, the second load uses the
    updated base, which a fused LDP would not; LDP with Rt == Rn is also
    UNPREDICTABLE in its writeback forms)
  - the second instruction is not a branch target; otherwise paths that
    jump directly to it would skip the work the LDP/STP now does at the
    first instruction's position
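
A rough sketch of the gating conditions as a single predicate follows. The `memOp` type and `canFuse` function are hypothetical and simplified: the sketch assumes ascending offsets only, applies the distinct-register check to loads only, and leaves out the Addr.Name and addressing-mode checks. It is not the code in obj7.go.

```go
package main

import "fmt"

// memOp is a hypothetical, simplified view of one 8-byte load or store.
type memOp struct {
	load       bool  // true for LDR, false for STR
	reg        int   // data register (Rt)
	base       int   // base register (Rn)
	offset     int64 // immediate byte offset
	branchDest bool  // true if some branch jumps to this instruction
}

// canFuse reports whether a immediately followed by b could become one
// LDP/STP under the conditions listed above (ascending offsets only).
func canFuse(a, b memOp) bool {
	if a.load != b.load || a.base != b.base {
		return false // must be the same kind of access off the same base
	}
	if b.offset != a.offset+8 {
		return false // slots must be consecutive 8-byte slots
	}
	if a.offset < -512 || a.offset > 504 || a.offset%8 != 0 {
		return false // lower offset must fit the scaled 7-bit immediate
	}
	if a.load && (a.reg == b.reg || a.reg == a.base || b.reg == a.base) {
		return false // loads: distinct destinations, neither equal to the base
	}
	if b.branchDest {
		return false // a jump to b would skip the fused instruction
	}
	return true
}

func main() {
	const sp = 31
	a := memOp{load: true, reg: 1, base: sp, offset: 16}
	b := memOp{load: true, reg: 2, base: sp, offset: 24}
	fmt.Println(canFuse(a, b)) // adjacent reloads off the same base

	b.branchDest = true
	fmt.Println(canFuse(a, b)) // second reload is a branch target
}
```

The predicate is deliberately all-or-nothing: any failed condition leaves the original LDR/STR pair untouched.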

TestPairLDRSTR in cmd/internal/obj/arm64 drives pairLDRSTR directly with
hand-constructed Prog chains, covering each fusion path and each gating
condition. test/codegen/memcombine.go pins down the spill/reload pattern
that the SSA pair pass misses but this peephole catches. The end-to-end
regression in test/fixedbugs/spillreload_arm64_pair.go exercises the
conditional-call-with-adjacent-reloads pattern from runtime.schedule that
miscompiled before the branch-target check was added.

BenchmarkSpillReloadPair in cmd/compile/internal/test goes from 1.96
to 1.82 ns/op (~7% faster) on an Apple M4 Max with the coalescer enabled.

Found via armlint: adjacent LDR/STR findings drop from 4345 -> 212 on
gofmt (95.1% reduction) and 26007 -> 1969 on cmd/go (92.4% reduction).
Text section shrinks by 16640 bytes (1.395%) on gofmt and 96288 bytes
(1.449%) on cmd/go, eliding 4160 and 24072 instructions respectively.