cmd/internal/obj/arm64: fuse adjacent spill/reload LDR/STR into LDP/STP #9
Open
hbrooks wants to merge 1 commit into
The SSA pair pass (cmd/compile/internal/ssa/pair.go) runs before regalloc
and only sees source-level MOVDload/MOVDstore values. Spill/reload code
that regalloc inserts later for OpStoreReg/OpLoadReg becomes individual
LDR/STR instructions that never get a chance to be paired, even when two
spills target adjacent 8-byte stack slots.

Add a small post-preprocess peephole in obj7.go that walks the final
Prog list and fuses strictly-adjacent AMOVD pairs that share a base
register and have consecutive 8-byte offsets into a single ALDP/ASTP.
The second Prog is converted to obj.ANOP (0 bytes) rather than unlinked
so that branch targets, line-number entries, and inline markers
referencing it remain valid.
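
The rewrite-in-place idea can be sketched as follows. This is a minimal illustration, not the code in obj7.go: `Op`, `Prog`, and `fusePairs` are hypothetical stand-ins for the real cmd/internal/obj types, only ascending offsets are handled, and most of the safety gating is left out.

```go
package main

import "fmt"

// Op and Prog are simplified, hypothetical stand-ins for the assembler's
// instruction list; they are not the real cmd/internal/obj types.
type Op int

const (
	NOP Op = iota
	LDR
	STR
	LDP
	STP
)

var opNames = [...]string{"NOP", "LDR", "STR", "LDP", "STP"}

type Prog struct {
	Op     Op
	Reg    int   // data register
	Reg2   int   // second data register, used only by LDP/STP
	Base   int   // base register
	Offset int64 // byte offset from Base
	Link   *Prog // next instruction in the list
}

// fusePairs rewrites each strictly adjacent LDR/LDR or STR/STR pair that
// shares a base register and has ascending consecutive 8-byte offsets into a
// single LDP/STP held in the first Prog. The second Prog is turned into a NOP
// and left linked, so anything pointing at it stays valid.
func fusePairs(head *Prog) int {
	fused := 0
	for p := head; p != nil && p.Link != nil; p = p.Link {
		q := p.Link
		if p.Op != q.Op || (p.Op != LDR && p.Op != STR) {
			continue
		}
		if p.Base != q.Base || q.Offset != p.Offset+8 || p.Reg == q.Reg {
			continue
		}
		// A load that overwrites the base changes the address the next
		// load would use, so such pairs are left alone.
		if p.Op == LDR && (p.Reg == p.Base || q.Reg == p.Base) {
			continue
		}
		if p.Op == LDR {
			p.Op = LDP
		} else {
			p.Op = STP
		}
		p.Reg2 = q.Reg
		*q = Prog{Op: NOP, Link: q.Link} // keep the node, drop its effect
		fused++
	}
	return fused
}

func main() {
	// STR R1, 16(R31); STR R2, 24(R31); LDR R3, 40(R31)
	c := &Prog{Op: LDR, Reg: 3, Base: 31, Offset: 40}
	b := &Prog{Op: STR, Reg: 2, Base: 31, Offset: 24, Link: c}
	a := &Prog{Op: STR, Reg: 1, Base: 31, Offset: 16, Link: b}
	fmt.Println("fused pairs:", fusePairs(a))
	for p := a; p != nil; p = p.Link {
		fmt.Println(opNames[p.Op], p.Reg, p.Reg2, p.Base, p.Offset)
	}
}
```

Keeping the NOP node means the list length is unchanged after fusion; only the emitted size shrinks, since the NOP assembles to zero bytes.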
Fusion is gated on several safety conditions:

- same base register and same Addr.Name (AUTO/PARAM/NONE), distinct
  destination registers (LDP with Rt == Rt2 is UNPREDICTABLE), and no
  pre/post-index or register-offset addressing
- lower offset in [-512, +504] in steps of 8, so the resolved encoding
  is a single LDP/STP (not the assembler's adjuster + LDP, which would
  be no smaller than the original pair)
- for loads, neither destination equals the base register (LDP with
  Rt == Rn is UNPREDICTABLE, and in an LDR pair whose first load
  overwrites the base, the second load addresses through the new base
  value, which a single LDP cannot reproduce)
- the second instruction is not a branch target; otherwise paths that
  jump directly to it would skip the work the LDP/STP now does at the
  first instruction's position
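
The conditions above can be read as a single predicate over the two candidate instructions. The sketch below is an assumption-laden illustration rather than the PR's code: the `access` struct and `canFuse` are hypothetical names, the distinct-register check is applied to stores as well as loads, and pairs are accepted in either offset order.

```go
package main

import "fmt"

// access is a hypothetical, flattened view of one LDR/STR candidate.
type access struct {
	load         bool  // true for LDR, false for STR
	reg          int   // register loaded or stored
	base         int   // base register
	name         int   // stand-in for Addr.Name (AUTO/PARAM/NONE)
	offset       int64 // byte offset from base
	indexed      bool  // pre/post-index or register-offset addressing
	branchTarget bool  // some branch jumps directly to this instruction
}

// canFuse reports whether first and second (in program order) pass every
// gate listed above.
func canFuse(first, second access) bool {
	if first.load != second.load {
		return false
	}
	if first.base != second.base || first.name != second.name {
		return false
	}
	if first.indexed || second.indexed {
		return false
	}
	if first.reg == second.reg {
		return false
	}
	lo, hi := first.offset, second.offset
	if lo > hi {
		lo, hi = hi, lo
	}
	if hi-lo != 8 {
		return false // not consecutive 8-byte slots
	}
	if lo < -512 || lo > 504 || lo%8 != 0 {
		return false // would not fit a single LDP/STP encoding
	}
	if first.load && (first.reg == first.base || second.reg == first.base) {
		return false
	}
	if second.branchTarget {
		return false
	}
	return true
}

func main() {
	a := access{load: true, reg: 1, base: 31, offset: 16}
	b := access{load: true, reg: 2, base: 31, offset: 24}
	fmt.Println(canFuse(a, b))

	b.branchTarget = true
	fmt.Println(canFuse(a, b))
}
```

Ordering the cheap structural checks first keeps the predicate easy to test one gate at a time, which matches how the description says TestPairLDRSTR covers each gating condition.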
TestPairLDRSTR in cmd/internal/obj/arm64 drives pairLDRSTR directly with
hand-constructed Prog chains, covering each fusion path and each gating
condition. test/codegen/memcombine.go pins down the spill/reload pattern
that the SSA pair pass misses but this peephole catches. The end-to-end
regression in test/fixedbugs/spillreload_arm64_pair.go exercises the
conditional-call-with-adjacent-reloads pattern from runtime.schedule that
miscompiled before the branch-target check was added.

BenchmarkSpillReloadPair in cmd/compile/internal/test goes from 1.96
to 1.82 ns/op on an Apple M4 Max with the coalescer enabled (~7%).

Found via armlint: adjacent LDR/STR findings drop from 4345 to 212 on
gofmt (95.1% reduction) and from 26007 to 1969 on cmd/go (92.4% reduction).
Text section shrinks by 16640 bytes (1.395%) on gofmt and 96288 bytes
(1.449%) on cmd/go, eliding 4160 and 24072 instructions respectively.
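
The figures above are mutually consistent, which can be checked with a few lines of arithmetic. The sketch assumes only that every A64 instruction is 4 bytes, so each fused pair saves one instruction; `reductionPct` and `instrsElided` are names made up for this check.

```go
package main

import "fmt"

const a64InstrBytes = 4 // A64 instructions have a fixed 4-byte encoding

// reductionPct returns the percentage drop from before to after.
func reductionPct(before, after float64) float64 {
	return 100 * (before - after) / before
}

// instrsElided converts a text-size saving in bytes to an instruction count.
func instrsElided(bytes int) int {
	return bytes / a64InstrBytes
}

func main() {
	fmt.Printf("gofmt findings:  %.1f%%\n", reductionPct(4345, 212))   // 95.1%
	fmt.Printf("cmd/go findings: %.1f%%\n", reductionPct(26007, 1969)) // 92.4%
	fmt.Printf("benchmark:       %.1f%%\n", reductionPct(1.96, 1.82))  // 7.1%
	fmt.Println("gofmt instrs: ", instrsElided(16640))                 // 4160
	fmt.Println("cmd/go instrs:", instrsElided(96288))                 // 24072
}
```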
Originally PR golang#79689 in golang/go by @gaul