🐛 Bug Report
Sapling version: 0.2.20260522-084851 (Homebrew bottle, commit 1e764c94)
OS: Amazon Linux 2023 (Linux 6.12.88, x86_64)
Install method: brew install sapling (Linuxbrew, Python 3.12.13_2 from Homebrew)
Mode: git-backed repos (sl clone <github URL>)
Summary
sl clone <github URL> reliably crashes with SIGSEGV (exit 139, or exit 255 via chg) after the pull completes but before the working copy is updated. The pull-completed metadata is left behind in .sl/, but no files are checked out.
The crash is in the always-on sampling profiler background thread, specifically in sapling_cext_evalframe_resolve_code_object, called from profiler_loop (not from the signal handler).
The root cause is closely related to the segfault that the existing eden/scm/lib/sampling-profiler/munmap-segv-example/ (commit 09fa42845cb) was added to reproduce, but the fix a82fe9f9804 "backtrace-python: avoid unsafe (segfault) interp frame read" only addresses the signal-handler-side frame read. The profiler-loop-side deferred dereference of the PyCodeObject* (captured from the C stack during the signal) is still racy, and PyCodeObject->co_filename / ->co_name can be read after the object is freed.
Reproduction
100% reproducible on this machine:
$ sl clone https://github.com/Marukome0743/raspi-signage.git
From https://github.com/Marukome0743/raspi-signage
* [new ref] 1b6ae7c2a0e26ac30efdbc39b109d510bff93a82 -> remote/main
chg: abort: cannot communicate (errno = 32, Broken pipe)
$ echo $?
255
# Without chg, the segfault surfaces directly:
$ CHGDISABLE=1 sl clone https://github.com/Marukome0743/raspi-signage.git
From https://github.com/Marukome0743/raspi-signage
* [new ref] 1b6ae7c2a0e26ac30efdbc39b109d510bff93a82 -> remote/main
Segmentation fault (core dumped)
$ echo $?
139
Reproduction matrix across multiple repos
Each run on a fresh empty target dir.
| Repo | sl clone exit | files in WC | with profiling.always-on-enabled=False |
|---|---|---|---|
| octocat/Hello-World | 0 | 1 | 0 (1 file) |
| Marukome0743/raspi-signage (155 files) | 255 / 139 (segv) | 0 | 0 (155 files) |
| facebook/sapling (~11k files) | 255 / 139 (segv) | 0 | 0 (10946 files) |
Hello-World most likely succeeds because its update step finishes before the profiler's 1s sampling interval elapses, so no sample lands on a problematic frame. Larger repos almost always trip the sampler.
Workaround
Set in ~/.config/sapling/sapling.conf (or pass via --config):
[profiling]
always-on-enabled = False
HGPROF=noop does not help — it only disables the Python-level profiler, not the Rust sampling profiler.
Crash location
Crashing thread is the pfc[worker/...] sampling-profiler worker. Crash at sapling_cext_evalframe_resolve_code_object+14 with rdi = 0x8000000000000002 — i.e., the first arg PyCodeObject* code is a non-canonical address (high bit set on x86-64), which faults on any access.
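For reference, the canonical-address rule that makes rdi = 0x8000000000000002 fault can be checked in a few lines of C. This is a minimal sketch that assumes 4-level paging (48-bit virtual addresses), where bits 63..47 must all be equal. The function names are hypothetical and are not part of Sapling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4-level paging, so a canonical address has bits 63..47
 * either all zero (user half) or all one (kernel half). With 5-level
 * paging the boundary would be bit 56 instead of bit 47. */
bool is_canonical_x86_64(uint64_t addr) {
    uint64_t top = addr >> 47;          /* keeps bits 63..47 (17 bits) */
    return top == 0 || top == 0x1FFFF;  /* all zeros or all ones */
}

/* Self-check using the rdi value from this report. */
void check_reported_address(void) {
    /* Bit 63 is set but bits 62..47 are clear: non-canonical. */
    assert(!is_canonical_x86_64(0x8000000000000002ULL));
    /* A typical user-space stack address is canonical. */
    assert(is_canonical_x86_64(0x00007fffffffe000ULL));
}
```

A check like this only rejects values that can never be valid pointers. It says nothing about whether a canonical address is mapped or still points at a live PyCodeObject.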
Full backtrace of the crashing thread
Program terminated with signal SIGSEGV, Segmentation fault.
#0 sapling_cext_evalframe_resolve_code_object ()
#1 <backtrace_python::PythonSupplementalFrameResolver as backtrace_ext::SupplementalFrameResolver>::resolve_supplemental_info ()
#2 backtrace_ext::Frame::resolve ()
#3 sampling_profiler::frame_handler::profiler_loop ()
#4 std::sys::backtrace::__rust_begin_short_backtrace ()
#5 core::ops::function::FnOnce::call_once{{vtable.shim}} ()
#6 <std::sys::thread::unix::Thread>::new::thread_start ()
#7 start_thread () from libc.so.6
#8 clone3 () from libc.so.6
Register state at crash (selected):
rip = sapling_cext_evalframe_resolve_code_object+14
rdi = 0x8000000000000002 (first arg PyCodeObject* code, non-canonical)
The garbage code pointer was captured from the C stack on the Python thread by the signal handler — read at sp + OFFSET_SP_CODE — and then dereferenced later on the profiler thread (no longer holding the Python thread paused, no GIL). At that point the captured value can be stale (frame slot not yet initialized when SIGPROF fired, or already reused) or point to a PyCodeObject that has since been freed.
Relevant source paths and captured offsets
Code paths involved:
eden/scm/lib/backtrace-python/src/lib.rs:149 — read_stack(OFFSET_SP_CODE) captures the code value from the signal handler.
eden/scm/lib/backtrace-python/src/lib.rs:114-133 — resolve_supplemental_info calls evalframe_sys::resolve_code_object(code, ...) from the profiler loop thread.
eden/scm/lib/backtrace-python/evalframe-sys/src/evalframe.c:203-224 — sapling_cext_evalframe_resolve_code_object dereferences code->co_filename / ->co_name. Its safety comment requires the owning Python thread to be paused, but at the call site above that is no longer true.
Runtime offsets captured on this build via bindings.backtrace:
OFFSET_IP: 91
OFFSET_SP_FRAME: 16
OFFSET_SP_CODE: 0
OFFSET_SP_LINE_NO: 40
SUPPORTED_INFO.os_arch: True
SUPPORTED_INFO.c_evalframe: True
OFFSET_SP_CODE = 0 is notable — the profiler reads code directly from *sp. A signal that fires before Sapling_PyEvalFrame has written code to its stack slot (or after the function has returned and the slot is reused) yields whatever is on the stack at that point.
Why the existing fix is not enough
09fa42845cb (sampling-profiler: add an example to reproduce segfault, Apr 23, 2026) added the munmap-segv-example/ reproducer.
a82fe9f9804 (backtrace-python: avoid unsafe (segfault) interp frame read, Apr 30, 2026) replaced the unsafe interp-frame read in maybe_extract_supplemental_info with values stored on the Sapling_PyEvalFrame stack frame.
Both are in this release (1e764c94). That fix was on the signal-handler side — the frame pointer is no longer dereferenced from the signal handler. But the profiler-loop side still dereferences the captured PyCodeObject* to read co_filename / co_name, and that read is what crashes here.
Suggested directions
Roughly in increasing invasiveness:
- Hold a strong reference to the captured PyCodeObject* for the duration the profiler may dereference it. (Tricky: INCREF/DECREF need the GIL.)
- Defer the dereference to a point that holds the GIL, e.g., extract the co_filename / co_name UTF-8 strings on the Python thread before resuming it after the signal, instead of from the profiler-loop thread.
- Validate the pointer before dereferencing (sigsetjmp around the read, process_vm_readv on the process's own pid, or a pread on /proc/self/mem). Easier, but does not cover the "pointer now points to a different live object" case.
- Skip capturing code at all from the signal handler: capture line_no plus a probe-time marker only, and synthesize the function-name string later by walking the current Python stack from the profiler thread under the GIL.
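To make the "validate the pointer" direction concrete, here is a rough sketch of a sigsetjmp-guarded read. It is not taken from Sapling: guarded_read and probe_fault_handler are hypothetical names, and the sketch assumes a single thread performs probes, because the jump buffer and the temporarily swapped SIGSEGV/SIGBUS handlers are process-global state. A real profiler thread would need a per-thread jump buffer and a handler that chains to the existing one.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper state. Process-global, so not thread-safe. */
static sigjmp_buf probe_env;

static void probe_fault_handler(int signo) {
    (void)signo;
    siglongjmp(probe_env, 1);  /* jump back to the sigsetjmp call site */
}

/* Copy len bytes from src to dst. Returns false instead of crashing
 * if reading src raises SIGSEGV or SIGBUS. */
bool guarded_read(const void *src, void *dst, size_t len) {
    struct sigaction sa, old_segv, old_bus;
    volatile bool ok = false;  /* volatile: read after siglongjmp */

    assert(dst != NULL);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = probe_fault_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);

    /* savemask = 1 so siglongjmp restores the signal mask that the
     * kernel changed while the handler was running. */
    if (sigsetjmp(probe_env, 1) == 0) {
        const volatile unsigned char *s = src;
        unsigned char *d = dst;
        for (size_t i = 0; i < len; i++) {
            d[i] = s[i];  /* a bad src faults here, before any store */
        }
        ok = true;
    }

    /* Restore whatever handlers were installed before the probe. */
    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    return ok;
}
```

With the rdi value from this report, guarded_read((const void *)(uintptr_t)0x8000000000000002ULL, &buf, sizeof buf) would be expected to return false rather than crash. As noted in the list, this only turns a fault into an error. A stale pointer that now lands on a different live object still reads successfully and returns wrong data.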
Build / install info
brew info sapling:
sapling: stable 0.2.20260522-084851 (bottled), HEAD
/home/linuxbrew/.linuxbrew/Cellar/sapling/0.2.20260522-084851
Dependencies (runtime):
gh, libssh2, node, openssl@3, python@3.12 (3.12.13_2), zlib-ng-compat,
bzip2, curl, gcc, glibc (2.39)
System:
Amazon Linux 2023, Linux 6.12.88-119.157.amzn2023.x86_64
System glibc: 2.34
/home/linuxbrew/.linuxbrew loader resolves libpython3.12.so to the Homebrew
glibc 2.39 build at runtime (verified via LD_DEBUG=libs).
I did not try to verify against a make oss source build: make oss is reported broken on Python 3.12+ (#1032, #1141) and this environment only has Python 3.14 available as the system default. However, the crashing code is unconditionally compiled in regardless of build flavor, and the default [profiling] always-on-enabled = True is set in eden/scm/lib/config/loader/src/builtin_static/production.rs, so the bug is not Homebrew-specific.
Happy to provide a core dump, additional traces, or test patches if useful.