fix: go unwinding stops at systemstacks#1313
Conversation
Hope KubeCon went well! I ended up getting nerd-sniped by this problem and wanted to dig into it since the previous PR was closed. No worries at all if you already have a fix in the works based on #1279. I currently have something working, but if you think your approach is the better way to tackle this, we can absolutely close this PR. Just let me know how you'd like to proceed! |
dfb9247 to
6961a92
Compare
fabled
left a comment
There was a problem hiding this comment.
Awesome stuff! Added first round of comments and questions.
| // synthetic marker for Go's stack scanner and scheduler, not a real return address. | ||
| // | ||
| // https://github.com/golang/go/blob/917949cc1d16c652cb09ba369718f45e5d814d8f/src/runtime/asm_amd64.s#L886 | ||
| GoLabelsOffsets *go_offs = go_get_go_offsets(); |
There was a problem hiding this comment.
go_get_go_offsets is used in many places, and this is inside a rolled loop. I'm wondering if it'd make sense to read this data in the native unwinder beginning to the PerCPURecord? The Go plugins also could use it and avoid the lookup later on.
There was a problem hiding this comment.
We may want to copy GoLabelsOffsets into PerCPURecord to avoid reading the same values again. I'm not sure what would be the best place for the first read - collect_trace seems to initialize the struct, so it should work there, but from my understanding it would mean calling this for every non-Go process, adding one bpf_read for each of them (at the moment it's only for go processes).
Not sure if we want to do this here or in a follow-up.
florianl
left a comment
There was a problem hiding this comment.
Just minor comments. Overall I think a coredump test would be great here.
| // to perform Go stack unwinding. | ||
| interpreterLoaders = append(interpreterLoaders, golabels.Loader) |
There was a problem hiding this comment.
Load this interpreter only if it is configured.
| // to perform Go stack unwinding. | |
| interpreterLoaders = append(interpreterLoaders, golabels.Loader) | |
| // to perform Go stack unwinding. | |
| if includeTracers.Has(types.Labels) || includeTracers.Has(types.GoTracer) { | |
| interpreterLoaders = append(interpreterLoaders, golabels.Loader) | |
| } |
There was a problem hiding this comment.
I don't think we want to tie loading Go labels (and thus unwinding) to symbolication (GoTracer), which is why I made sure to always load the labels. Since the use of the Go tracer is configurable, if a user decides to disable it in favor of a method other than local symbolication, it will cause an issue because they won't have the offsets for unwinding.
There was a problem hiding this comment.
While I don't have a strong preference, there is a clear use case for users who need native unwinding and want to disable all other interpreters, including labels. This is often done to minimize memory usage.
For approaches that are specifically focused (e.g., only on Go), it is reasonable to expect users will enable GoTracer if they are interested in unwinding past runtime.mcall. Therefore, the question is: Are there many scenarios where users require unwinding beyond runtime.mcall but without Go symbolization?
There was a problem hiding this comment.
Are there many scenarios where users require unwinding beyond
runtime.mcallbut without Go symbolization?
Yes. When Go symbolization is done on the backend, we want to disable local symbolization (noticeable memory overhead on Go-heavy nodes) while still unwinding past runtime.mcall. golabels is cheap enough that we don't mind enabling it for the offsets.
That said, golabels now serves two purposes: extracting goroutine labels, and exposing the runtime offsets the native unwinder needs to cross runtime.mcall / runtime.systemstack (and runtime.asmcgocall later). We don't see a realistic case where a user wants native unwinding but not to cross these boundaries, so gating on Labels feels like the wrong abstraction.
What do you think about splitting the offsets out into a dedicated interpreter that's always loaded for Go binaries (e.g. a new GoRuntime or GoOffsets interpreter)? This would allow GoTracer to focus on symbolization and keep Labels for label extraction only. We can do that in a follow-up PR to keep this one focused on the unwinder fix, unless you'd rather see it
There was a problem hiding this comment.
I think we need to follow up here. There's now three separate Go support things: unwinding, labels and symbolization. Depending on configuration all three might want to be managed separately. It would also reduce memory overhead if all three shared the same ebpf per-pid structures.
It might make even sense to preload the data to UnwindState or PerCPURecord on the tracer entry function. This would reduce runtime overhead of looking up the data potentially multiple times through out the unwind.
But agreed, worth a separate follow up PR / ticket.
There was a problem hiding this comment.
Sounds good to me!
How do you usually track this kind of follow-up?
ea8cc5b to
c92c28e
Compare
|
I have included coredump tests. The relevant coredump files ( |
|
@fabled friendly ping for review |
fabled
left a comment
There was a problem hiding this comment.
Looks mostly good to me. Some nits and potential optimization opportunities pointed out in comments.
| // to perform Go stack unwinding. | ||
| interpreterLoaders = append(interpreterLoaders, golabels.Loader) |
There was a problem hiding this comment.
I think we need to follow up here. There's now three separate Go support things: unwinding, labels and symbolization. Depending on configuration all three might want to be managed separately. It would also reduce memory overhead if all three shared the same ebpf per-pid structures.
It might make even sense to preload the data to UnwindState or PerCPURecord on the tracer entry function. This would reduce runtime overhead of looking up the data potentially multiple times through out the unwind.
But agreed, worth a separate follow up PR / ticket.
380a486 to
e0bdb10
Compare
e0bdb10 to
d8ebcaf
Compare
|
@fabled could you have another look? |
| "lwp": 10910, | ||
| "frames": [ | ||
| "linux-vdso.1.so+0xcfd", | ||
| "runtime.nanotime1+0 in /home/ubuntu/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/runtime/sys_linux_amd64.s:265", |
There was a problem hiding this comment.
Seems nanotime1 needs special unwind handling also.
There was a problem hiding this comment.
Thanks for spotting this! I had already started looking on my end to follow up on asmcgocall, but I can add this one to the list too.
There was a problem hiding this comment.
See #1331 for asmcgocall support. Though it should probably be converted to do frame pointer function unwind override.
| "runtime.schedule+0 in /home/ubuntu/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/runtime/proc.go:4138", | ||
| "runtime.goschedImpl+0 in /home/ubuntu/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/runtime/proc.go:4301", | ||
| "runtime.gosched_m+0 in /home/ubuntu/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/runtime/proc.go:4306", | ||
| "runtime.mcall+0 in /home/ubuntu/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.0.linux-amd64/src/runtime/asm_amd64.s:462" |
There was a problem hiding this comment.
This, and other instances of mcall seem to be terminating the unwind. Just double checking if this is intended? And if yes, where's the stack trace that this testcase is expected to verify?
There was a problem hiding this comment.
Good catch. This testcase is mostly meant to exercise the hard-to-catch mcall states after dropg(), where we recover a candidate goroutine from the g0 stack and then have to validate candidate.g.m before using its gobuf.
The main case I wanted to preserve is when candidate.g.m is non-nil, meaning the goroutine has already been rescheduled on another M. In that case, continuing would risk reading a stale/overwritten gobuf and producing a wrong stack trace, so stopping at runtime.mcall is the expected safe behavior.
This was the scenario I had trouble catching reliably in deployments, so I considered this safe-stop regression coverage more valuable here than only adding success cases, which are usually much easier to notice when they break.
That said, I agree the intent is not obvious from the expected stack trace alone. This probably deserves either more documentation in the testsource, or an additional more targeted test.
There was a problem hiding this comment.
My understanding might also be wrong since it's my first time working on unwinding issues, so please feel free to challenge what I'm saying.
There was a problem hiding this comment.
So would it then make sense add a second mcall test case that actually shows unwind over mcall?
There was a problem hiding this comment.
While trying to add and adapt the existing tests, I ran into 3 problems:
1. Go offsets weren't loaded in the coredump tests
files: ebpfhelpers.go and ebpfmaps.go
I fixed that, and then realized my existing mcall tests weren't actually testing what I wanted - the result was biased by the fact that unwinding was failing in every case.
2. TLS reads return 0 on arm64
file: go_runtime.h
After the fix, running the coredump analysis:
./coredump analyze --case testdata/arm64/go-mcall.json --debug-ebpftsd_base at 0x0, g offset: 16
tsd_base at 0x0, g offset: 16
tsd_base at 0x0, g offset: 16
... (every Go-runtime thread)
Looking into it, this is expected for programs that don't explicitly import "C", because the Go runtime gates TLS on runtime detection via runtime.iscgo. The problem on our side is that we only gate on the buildInfo. That means a program compiled with CGO_ENABLED=1 but with no actual import "C" slips through, and the existing R28 fallback (which only fires when tls_offset == 0) never triggers. I modified that path so it falls back correctly when the TLS read fails or returns 0.
3. Spill-fallback path is broken on arm64
After fixing the other two, I noticed we had an unwinding issue on the argument-spill fallback path on arm64. gosched_m isn't supposed to spill, so we shouldn't be able to unwind past it - but here we do.
# LWP 22277
runtime.osyield
runtime.goschedImpl
runtime.gosched_m
runtime.mcall+0
runtime.gopark
goroutine
runtime.block
main.main
runtime.main
runtime.goexit
# LWP 22281
runtime.lock2
runtime.findRunnable
runtime.schedule
runtime.goschedImpl
runtime.gosched_m
runtime.mcall+0
runtime.gopark
runtime.bgsweep
runtime.gcenable.gowrap1
runtime.goexit
# LWP 22280
runtime.futex
runtime.semasleep
runtime.lock2
runtime.goschedImpl
runtime.gosched_m
runtime.mcall+0
runtime.gopark
runtime.(*scavengerState).park
runtime.bgscavenge
runtime.gcenable.gowrap2
runtime.goexit
On amd64, runtime.mcall deterministically pushes old_g to (g0.sched.sp - 8) via PUSHQ AX. Arm64 has no equivalent - only certain callees (park_m, preemptPark, exitsyscall0, goyield_m, goschedguarded_m) spill R0 there. When the callee doesn't spill (gosched_m, goexit0, gopreempt_m), the slot keeps whatever was there last - typically old_g from a previous mcall(park_m) on the same M. Our BPF checks only prove that the value still looks like a parked goroutine (m == nil, sched.sp != 0); they do not prove that it is the goroutine passed to the current mcall.
I don't see an obvious quick fix off the top of my head, so I suggest that for now we don't enable the fallback on arm64. This implies we'll have more unwinding stops than on amd64. Unless you see a way to fix this, I think the best path is to move forward with this PR and leave it for a follow-up.
Next steps
I suggest the following:
- Don't fall back on argument-spill on arm64 for
mcall. - Regenerate the
mcalltest data on arm64. - Open a follow-up for arm64
mcallunwinding. - Open a follow-up to split Go runtime offsets from Go labels.
Open question
On arm64, we currently decide whether to extract a TLS g offset from build info
(CGO_ENABLED) rather than from the runtime value of runtime.iscgo. This can
produce tls_offset != 0 for binaries built with CGO_ENABLED=1 but with no
actual import "C", while the Go runtime itself takes the non-cgo path and keeps
g in R28. Is that distinction intentional or a constraint for our use case, or should we try
to match Go's runtime behavior more closely and gate this on runtime.iscgo
instead?
Thanks @fabled for pushing on this. Sorry, I should have paid closer attention to the arm64 runtime path earlier.
Also, small ping @florianl since you already approved this. I’d appreciate your thoughts on the points above.
There was a problem hiding this comment.
@wehzzz can you please resolve the merge conflict?
So would it then make sense add a second mcall test case that actually shows unwind over mcall?
For me it would be fine to do this as a follow up.
There was a problem hiding this comment.
Yes, I will do so!
For me it would be fine to do this as a follow up.
Regarding this point, it was mostly about this comment: #1313 (comment). When I tried to add a second mcall test, I encountered several issues. I fixed some of them, but right now the main blocker is that the spill fallback I implemented for arm64 can't work, and we'll need to find a workaround.
Before I dive into anything, I wanted to make sure the steps documented in the previous comment works for you, so we can try to get this merged as quickly as possible.
7ac1a24 to
eac0e41
Compare
Unwinding Through Go Stack Switches
Reference: #1275, Based on @florianl's work #1279.
systemstack
Locating the user goroutine
During systemstack,
m.curgalways points to the frozen user goroutine (systemstack does not calldropg). The profiler readscurg.sched.spwhich points into the user stack where the frame pointer prologue saved FP and LR/RA.gobuf.pccontainssystemstack_switch+8(a synthetic UNDEF marker for Go's stack scanner) and is intentionally ignored.AMD64 recovery
The Go linker injects
PUSH RBP; MOV RSP, RBPinto systemstack (obj6.go#L637).gosave_systemstack_switchthen savesgobuf.sp = LEAQ 8(SP), which skips the useless return address fromCALL gosaveand points to the saved RBP:ARM64 recovery
Go's ARM64 assembler (
cmd/internal/obj/arm64, functionpreprocessin obj7.go#L691) injects aMOVD.Wprologue instead of the standardSTP. UnlikeSTP X29, X30which places R29 at the lower address and LR at the higher address, theMOVD.Wsequence places LR at SP+0 and R29 below SP at SP-8.Verified by disassembling
runtime.systemstack(go tool objdump -s 'runtime\.systemstack' ./binary):The profiler reads:
mcall
Locating the user goroutine
mcall's callee (
park_m,goexit0, etc.) callsdropg()which setsm.curg = nilandg.m = nil. In most samples,m.curgis nil. The profiler falls back to reading the goroutine pointer from the g0 stack at*(g0.sched.sp - 8).AMD64: mcall does
PUSHQ AX(old_g) before calling fn. This writes old_g tog0.sched.sp - 8deterministically. This is hand-written assembly, stable across all Go versions.Verified by disassembling
runtime.mcall(go tool objdump -s 'runtime\.mcall' ./binary):ARM64: mcall passes old_g in R0 (no push). The goroutine pointer is at
g0.sched.sp - 8only if fn spills R0 to its ABIInternal arg area. This depends on the compiler's spill decisions, not hand-written assembly. Stable for the current ABI; a future ABI change could alter the spill offset. The DWARFDW_OP_fbreg 8should be used as the source of truth.Verified by disassembling
runtime.mcallon ARM64:fn's prologue then spills R0 to
entry_SP + 8 = g0.sched.sp - 16 + 8 = g0.sched.sp - 8. This is confirmed by both disassembly and DWARF.park_mspills (go tool objdump -s 'runtime\.park_m' ./binary):DWARF confirms (
dwarfdump -i -S match=runtime.park_m -Wc ./binary):ARM64: functions that don't spill
goexit0does NOT spill - passes R0 directly togdestroy(go tool objdump -s 'runtime\.goexit0' ./binary):DWARF confirms - location list only contains
DW_OP_reg0, noDW_OP_fbreg:All mcall callees can be found with:
Each callee's spill behavior can be verified with DWARF:
If the
gpparameter's location list containsDW_OP_fbreg 8, it spills tog0.sched.sp - 8. If it only containsDW_OP_reg0followed byend-of-list, it does not spill.When the profiler cannot resolve the user goroutine through mcall (non-spilling function on ARM64, or goroutine already rescheduled on another M), it stops unwinding at the mcall frame (
*stop = true) without crossing to the user stack. The g0 frames (park_m, schedule, findRunnable, etc.) are still captured. This is the same behavior as before this change (UNWIND_COMMAND_STOP). There is no regression.park_mDW_OP_fbreg 8preemptParkDW_OP_fbreg 8exitsyscall0DW_OP_fbreg 8goyield_mDW_OP_fbreg 8goschedguarded_mDW_OP_fbreg 8goexit0DW_OP_reg0onlygosched_mDW_OP_reg0onlygopreempt_mStale goroutine detection
The goroutine at the g0 slot may have been rescheduled on another M since the mcall. Its gobuf could then contain values from systemstack on that other thread (observed in testing:
systemstack_switchframes from a different M). The profiler validatescandidate.m == nil(parked, gobuf reliable) vscandidate.m != nil(rescheduled, STOP).Recovery
Unlike systemstack, mcall saves the caller's actual registers into gobuf (not a synthetic marker):
systemstack test
amd64arm64mcall test
amd64arm64