From 524301923c37cbe34c79e298d8ef37fa26743dc6 Mon Sep 17 00:00:00 2001
From: not-matthias <matthias@codspeed.io>
Date: Tue, 30 Jun 2026 16:00:10 +0000
Subject: [PATCH 1/9] fix(VEX): classify arm64 plain B as Ijk_Boring, not
 Ijk_Call

The AArch64 B{L} decoder tagged the whole opcode group as Ijk_Call,
but only BL (bit 31 = 1, writes the link register) is a call; a plain
B (bit 31 = 0) is an ordinary unconditional branch.

Mislabelling B as a call made Callgrind treat every branch to a
function epilogue or tail target as a call. At -O0 a conditional like
`return n < 2 ? n : fib(...)` compiles the base case to `b <epilogue>`,
so each base case was counted as a recursive call -- inflating
recursive/cyclic call graphs and inventing phantom self-edges on arm64
(e.g. fib recursion 64 -> 98; mutual is_even/is_odd gaining self-loops).

Align plain B with B.cond and the register-indirect JMP, which already
use Ijk_Boring. Fixes the callgrind-utils recursion/mutual snapshot
failures.
---
 VEX/priv/guest_arm64_toIR.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
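
Note (illustration only, not part of the change): the B/BL distinction
described above can be sketched in standalone C. The `BranchInfo` type and
`decode_b_bl` name are hypothetical and exist only for this sketch; VEX
itself works on `DisResult` and IR jump kinds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch only. */
typedef struct {
    bool     is_call;  /* true for BL (Ijk_Call), false for plain B (Ijk_Boring) */
    uint64_t target;   /* branch target address */
} BranchInfo;

/* Decode an AArch64 unconditional immediate branch (B or BL).
   Returns false if `insn` is not in the B{L} opcode group. */
static bool decode_b_bl(uint32_t insn, uint64_t pc, BranchInfo *out)
{
    assert(out != NULL);
    if (((insn >> 26) & 0x1F) != 0x05)      /* bits 30..26 must be 00101 */
        return false;

    uint64_t off = (uint64_t)(insn & 0x03FFFFFF) << 2;   /* imm26 << 2 */
    if (off & (UINT64_C(1) << 27))                       /* sign-extend from bit 27 */
        off |= ~((UINT64_C(1) << 28) - 1);

    out->is_call = ((insn >> 31) & 1) != 0;  /* bit 31: 1 = BL, 0 = B */
    out->target  = pc + off;                 /* wraps modulo 2^64 for negative offsets */
    return true;
}
```

With this model, `0x14000002` at pc `0x1000` is a plain branch to `0x1008`,
while `0x97FFFFFF` is a call to `0x0FFC`.
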
diff --git a/VEX/priv/guest_arm64_toIR.c b/VEX/priv/guest_arm64_toIR.c
index 6e77b34c7..62927537c 100644
--- a/VEX/priv/guest_arm64_toIR.c
+++ b/VEX/priv/guest_arm64_toIR.c
@@ -7422,7 +7422,7 @@ Bool dis_ARM64_branch_etc(/*MB_OUT*/DisResult* dres, UInt insn,
    /* -------------------- B{L} uncond -------------------- */
    if (INSN(30,26) == BITS5(0,0,1,0,1)) {
       /* 000101 imm26  B  (PC + sxTo64(imm26 << 2))
-         100101 imm26  B  (PC + sxTo64(imm26 << 2))
+         100101 imm26  BL (PC + sxTo64(imm26 << 2))
       */
       UInt  bLink  = INSN(31,31);
       ULong uimm64 = INSN(25,0) << 2;
@@ -7432,7 +7432,11 @@ Bool dis_ARM64_branch_etc(/*MB_OUT*/DisResult* dres, UInt insn,
       }
       putPC(mkU64(guest_PC_curr_instr + simm64));
       dres->whatNext = Dis_StopHere;
-      dres->jk_StopHere = Ijk_Call;
+      /* Only BL (which writes the link register) is a call; a plain B is
+         an ordinary unconditional branch.  Mislabelling B as Ijk_Call makes
+         callgrind treat every branch to a function epilogue / tail target as
+         a call, corrupting recursive and cyclic call graphs on arm64. */
+      dres->jk_StopHere = bLink ? Ijk_Call : Ijk_Boring;
       DIP("b%s 0x%llx\n", bLink == 1 ? "l" : "",
                           guest_PC_curr_instr + simm64);
       return True;

From d3d8091d8461ecd2c0fee21cee4a490192bb37b9 Mon Sep 17 00:00:00 2001
From: not-matthias <matthias@codspeed.io>
Date: Thu, 2 Jul 2026 11:06:39 +0000
Subject: [PATCH 2/9] fix(callgrind): track guest X30 for arm64 shadow-stack
 return addresses

On arm64, bl/blr write the return address to X30 and SP does not move
across a call/return pair, so unlike on x86 the return detector cannot
fall back on SP progress and depends entirely on each frame's recorded
ret_addr.

When a call entered skipped code (a libc PLT hop), the
skipped->nonskipped jump was pushed with setup_bbcc's spliced
'nonskipped' source BB, whose last jump is the very call that created
the skip frame below. The emulated frame therefore duplicated that
frame's statically computed return address, the callee's single ret
popped only the top entry, and the leaked equal-SP skip frame starved
the pop budget of the next same-SP return. Misclassified returns were
then re-promoted into phantom calls back into the live caller, cloning
non-recursive functions as bogus 'N recursion levels
(complex_fractal_benchmark'2) and misattributing follow-up work
("free calls X").

Record the guest X30 -- the architectural return target -- for frames
entered by a real call, and record ret_addr = 0 for emulated/spliced
pushes so the return matcher absorbs them down the same-SP run and pops
the group at the frame of the real call underneath, which also restores
the pre-call nonskipped state. x86 keeps the static computation and is
behaviorally unchanged.

Verified by the arm64_plt_phantom_recursion / arm64_free_tailcall_phantom
fixture snapshots (flipped to the correct shapes), the structural guards
in arm64_fractal_alloc_no_free_misattribution, the no-phantom-clone
assertion in rust_fixture_full_trace, and an unchanged in-tree
'vg_regtest callgrind' pass/fail set.
---
 CODSPEED-CHANGELOG.md | 13 ++++++++++++
 callgrind/callstack.c | 47 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+)
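
Note (illustration only, not part of the change): a toy model of the
same-SP return matching described above. This is a simplified sketch, not
Callgrind's actual data structures; `Frame` and `frames_to_pop` are
hypothetical names. It assumes a return pops the top run of frames sharing
the current SP, down to and including the first frame whose `ret_addr`
equals the return target, and that `ret_addr == 0` never matches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shadow-stack frame for this sketch. */
typedef struct {
    uint64_t sp;        /* stack pointer recorded at push time */
    uint64_t ret_addr;  /* 0 = emulated/spliced push, never matches */
} Frame;

/* Number of frames a return to `target` at stack pointer `sp` would pop,
   or 0 if no frame in the top same-SP run records that target. */
static size_t frames_to_pop(const Frame *stack, size_t depth,
                            uint64_t sp, uint64_t target)
{
    assert(stack != NULL || depth == 0);
    for (size_t i = depth; i > 0; i--) {
        const Frame *f = &stack[i - 1];
        if (f->sp != sp)
            break;                       /* left the same-SP run */
        if (f->ret_addr != 0 && f->ret_addr == target)
            return depth - (i - 1);      /* pop this frame and all above it */
    }
    return 0;
}
```

In this model, a real-call frame with an emulated `ret_addr = 0` frame on
top pops as a group of two, whereas two frames carrying the same duplicated
return address pop only the top one and leak the frame underneath.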

diff --git a/CODSPEED-CHANGELOG.md b/CODSPEED-CHANGELOG.md
index 229119c0e..ddf42427a 100644
--- a/CODSPEED-CHANGELOG.md
+++ b/CODSPEED-CHANGELOG.md
@@ -113,3 +113,16 @@ valgrind --tool=callgrind \
     --obj-skip=/lib/x86_64-linux-gnu/libpthread.so.0 \
     ./your_program
 ```
+
+## Fixes
+
+### Callgrind: arm64 shadow call stack tracks the link register (X30)
+
+**Fix**: On arm64, `push_call_stack` records the guest link register (X30) as a frame's return address instead of statically computing "address after the call instruction" from the (possibly spliced) source BB, and emulated/spliced pushes consistently record `ret_addr = 0`.
+
+**Motivation**: On arm64, `bl`/`blr` write the return address to X30 and leave SP untouched, so Callgrind's return detector cannot fall back on SP movement like on x86 and depends entirely on each frame's recorded return address. When a call crossed into skipped code (a libc call through the PLT), the skipped-to-nonskipped jump was pushed with a spliced source BB (`setup_bbcc`'s `nonskipped` splice) whose last jump is the same call that created the skip frame below it. The emulated frame therefore duplicated the skip frame's return address, the callee's single return popped only the top frame, and the leaked equal-SP skip frame later starved the one-pop budget of the next same-SP return. The resulting misclassified returns were re-promoted into phantom calls back into the live caller, producing bogus `'2` recursion clones of non-recursive functions (`complex_fractal_benchmark'2`) and "free calls X" misattribution. x86 is unaffected (returns are detected from SP movement alone) and keeps the previous behavior.
+
+**How it works**:
+- Frames entered by a real call record the guest X30 at callee entry - the architectural return target - which is identical to the static value for honestly attributed calls and correct even when the delayed push runs with a spliced source.
+- Emulated calls (promoted tail jumps) and the skipped-to-nonskipped splice record `ret_addr = 0`, so the return matcher walks over them down the same-SP run and pops them as a group together with the frame of the real call underneath.
+- Regression coverage: the `callgrind-utils` fixture snapshots `arm64_plt_phantom_recursion` / `arm64_free_tailcall_phantom` (now pinning the correct, phantom-free shapes), the structural guards in `arm64_fractal_alloc_no_free_misattribution`, and a direct no-`complex_fractal_benchmark'` assertion in `rust_fixture_full_trace` that fails loudly on every platform independent of snapshot noise.
diff --git a/callgrind/callstack.c b/callgrind/callstack.c
index 8951639d7..4b2800e8a 100644
--- a/callgrind/callstack.c
+++ b/callgrind/callstack.c
@@ -26,6 +26,9 @@
 
 #include "global.h"
 #include "pub_tool_stacktrace.h"
+#if defined(VGA_arm64)
+#include "pub_tool_guest.h"     /* VexGuestArchState, for guest_X30 */
+#endif
 
 /*------------------------------------------------------------*/
 /*--- Call stack, operations                               ---*/
@@ -234,6 +237,49 @@ void CLG_(push_call_stack)(BBCC* from, UInt jmp, BBCC* to, Addr sp, Bool skip)
 
     /* return address is only is useful with a real call;
      * used to detect RET w/o CALL */
+#if defined(VGA_arm64)
+    /* On arm64 a call does not push its return address: `bl`/`blr` write it
+     * to the link register X30, and SP does not move across the call/return
+     * pair. So unlike on x86, where a leftover frame is swept by the pure
+     * SP-progress rules, the return detector here fully depends on ret_addr:
+     * a return pops the same-SP run of frames above and including the one
+     * whose recorded ret_addr matches the return target (non-matching
+     * frames in the run, e.g. a promoted tail jump's ret_addr==0 entry, are
+     * absorbed into the same pop group).
+     *
+     * Record the guest X30 -- the architectural return target, just written
+     * by the call that ended the previous BB -- for every frame entered by
+     * a real call, instead of statically computing "address after the call
+     * instruction of <from>/<jmp>". The two agree for an honestly
+     * attributed call, but <from>/<jmp> may describe a different call site:
+     * after a call into skipped code (a libc PLT hop) returns, setup_bbcc
+     * splices the still-set `nonskipped` BB in as the call source, whose
+     * last jump is some *earlier* call of that BB.
+     *
+     * The splice also routes the skipped->nonskipped jump (PLT stub ->
+     * libc body, an emulated call that never updated X30) through this
+     * jk_Call branch. Such a frame must record 0 like any other emulated
+     * call: giving it the (spliced or X30-inherited) address of the skip
+     * frame right below would duplicate that frame's ret_addr, so the
+     * callee's single return matches the top entry only and leaks the skip
+     * frame -- never popped again on arm64 (no SP movement), and starving
+     * the one-pop budget of the next same-SP return, which cascades into
+     * misattributed callers and phantom 'N recursion clones. A real call
+     * is recognizable as: statically a call in <from> AND not the
+     * skipped->nonskipped splice (skip pushes and calls made while
+     * `nonskipped` is clear are genuine; the splice is the nonskipped,
+     * non-skip delayed push -- see setup_bbcc). */
+    if ((from->bb->jmp[jmp].jmpkind == jk_Call) &&
+        (skip || !CLG_(current_state).nonskipped)) {
+      Addr lr;
+      VG_(get_shadow_regs_area)(CLG_(current_tid), (UChar*)&lr, 0,
+                                offsetof(VexGuestArchState, guest_X30),
+                                sizeof(lr));
+      ret_addr = lr;
+    }
+    else
+      ret_addr = 0;
+#else
     if (from->bb->jmp[jmp].jmpkind == jk_Call) {
       UInt instr = from->bb->jmp[jmp].instr;
       ret_addr = bb_addr(from->bb) +
@@ -242,6 +288,7 @@ void CLG_(push_call_stack)(BBCC* from, UInt jmp, BBCC* to, Addr sp, Bool skip)
     }
     else
       ret_addr = 0;
+#endif
 
     /* put jcc on call stack */
     current_entry->jcc = jcc;

From 596bec4599be1cb08afab207c1ff6e8b73c3e552 Mon Sep 17 00:00:00 2001
From: not-matthias <matthias@codspeed.io>
Date: Thu, 2 Jul 2026 15:22:43 +0000
Subject: [PATCH 3/9] fix(coregrind): arm64 FP-chain unwind fallback when CFI
 is missing

VG_(get_StackTrace) on arm64 was CFI-only, so Callgrind's OFF->ON
shadow-stack seeding stopped at the first frame without unwind info --
notably CPython's -X perf JIT trampolines, which have no FDEs but keep
the AAPCS64 frame chain alive (that is their design: perf/samply walk
them by FP). The truncated seed left the seeded context stack one entry
deep after the innermost frame popped; bbcc.c's underflow check then
misread the fn-stack base sentinel as a signal marker on every return,
and handleUnderflow fabricated named nodes for obj-skipped interpreter
functions with inverted, full-cost edges (_ctypes_callproc ->
PyCFuncPtr_call -> _TAIL_CALL_* -> ... as the graph root). x86_64 never
hit this because its unwinder already has the %rbp fallback.

Follow the frame records {saved X29, saved X30} when CFI fails, with
guards: record in-stack and 8-aligned, SP must progress, next record
strictly higher (saved X29 == 0 accepted as chain terminator), pc 0/1
stops the walk. Caller IPs keep the CFI path's -1 bias.

Callgrind's seeder correspondingly records ret_addr = ips[frame+1] + 1,
undoing that bias so the arm64 return matcher (exact-X30 matching, no
SP movement on bl/ret) can pop seeded frames.

An A/B run confirmed raising CLG_RECON_MAX_FRAMES alone does not help:
the seed was CFI-truncated at ~8 frames, nowhere near the 256 cap.

Regression coverage: callgrind-utils/tests/objskip_seed_underflow.rs, a
minimal fixture whose asm trampoline maintains FP but has no CFI and
which starts instrumentation two obj-skipped frames deep; it asserts no
skipped frame leaks into the folded graph and that the workload parents
under the trampoline.
---
 CODSPEED-CHANGELOG.md    | 38 ++++++++++++++++++++++++++++++
 callgrind/callstack.c    | 10 +++++++-
 coregrind/m_stacktrace.c | 51 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 97 insertions(+), 2 deletions(-)
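
Note (illustration only, not part of the change): the guarded frame-record
walk can be sketched over simulated stack memory. `SimStack`, `load64` and
`walk_fp_chain` are hypothetical names for this sketch, and the bounds test
is a simplification of the `fp_min`/`fp_max` checks in the real unwinder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated stack: `words` backs addresses [base, base + 8 * nwords). */
typedef struct {
    uint64_t        base;
    const uint64_t *words;
    size_t          nwords;
} SimStack;

static uint64_t load64(const SimStack *s, uint64_t addr)
{
    return s->words[(addr - s->base) / 8];
}

/* Follow { saved X29, saved X30 } records starting at `x29`.
   Stores caller IPs biased by -1 into `ips`; returns how many were stored. */
static size_t walk_fp_chain(const SimStack *s, uint64_t x29, uint64_t sp,
                            uint64_t *ips, size_t max_ips)
{
    assert(s != NULL && ips != NULL);
    const uint64_t lo = s->base, hi = s->base + 8 * s->nwords;
    size_t n = 0;

    while (n < max_ips) {
        /* Record must be 8-aligned and lie fully inside the stack. */
        if ((x29 & 7) != 0 || x29 < lo || x29 >= hi || hi - x29 < 16)
            break;
        uint64_t next_x29 = load64(s, x29);
        uint64_t next_pc  = load64(s, x29 + 8);
        if (next_pc == 0 || next_pc == 1)
            break;                               /* end-of-chain sentinel */
        uint64_t new_sp = x29 + 16;
        if (new_sp <= sp)
            break;                               /* SP must make progress */
        if (next_x29 != 0 && next_x29 <= x29)
            break;                               /* next record must be higher */
        ips[n++] = next_pc - 1;                  /* refer to the calling insn */
        sp  = new_sp;
        x29 = next_x29;                          /* 0 ends the walk next round */
    }
    return n;
}
```

A consumer that needs the exact return target, as the seeder does, would
add 1 back to each stored IP.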

diff --git a/CODSPEED-CHANGELOG.md b/CODSPEED-CHANGELOG.md
index ddf42427a..eecf557b0 100644
--- a/CODSPEED-CHANGELOG.md
+++ b/CODSPEED-CHANGELOG.md
@@ -126,3 +126,41 @@ valgrind --tool=callgrind \
 - Frames entered by a real call record the guest X30 at callee entry - the architectural return target - which is identical to the static value for honestly attributed calls and correct even when the delayed push runs with a spliced source.
 - Emulated calls (promoted tail jumps) and the skipped-to-nonskipped splice record `ret_addr = 0`, so the return matcher walks over them down the same-SP run and pops them as a group together with the frame of the real call underneath.
 - Regression coverage: the `callgrind-utils` fixture snapshots `arm64_plt_phantom_recursion` / `arm64_free_tailcall_phantom` (now pinning the correct, phantom-free shapes), the structural guards in `arm64_fractal_alloc_no_free_misattribution`, and a direct no-`complex_fractal_benchmark'` assertion in `rust_fixture_full_trace` that fails loudly on every platform independent of snapshot noise.
+
+### arm64: frame-pointer fallback for stack unwinding (seeding OFF->ON transitions)
+
+**Fix**: `VG_(get_StackTrace)` on arm64 now falls back to walking the AAPCS64
+frame-pointer chain (X29 -> { saved X29, saved X30 }) when CFI lookup fails,
+mirroring the amd64 `%rbp` fallback. (An A/B run confirmed raising
+`CLG_RECON_MAX_FRAMES` alone does not help: the seed was CFI-truncated at
+~8 frames, nowhere near the 256 cap.)
+
+**Motivation**: the arm64 unwinder was CFI-only, so at the
+`CALLGRIND_START_INSTRUMENTATION` shadow-stack seeding it stopped at the first
+frame without unwind info — notably CPython's `-X perf` JIT trampolines, which
+have no FDEs but do maintain the FP chain (that is their whole design: perf
+and samply walk them by FP). The truncated seed left the seeded context stack
+one entry deep after the innermost frame popped; `bbcc.c`'s underflow check
+then misread the fn-stack base sentinel as a signal marker on every return,
+and `handleUnderflow` fabricated named nodes for obj-skipped interpreter
+functions with inverted, full-cost call edges (`_ctypes_callproc ->
+PyCFuncPtr_call -> _TAIL_CALL_* -> ...` as the graph root). x86_64 never hit
+this because its unwinder already had the FP fallback. Full triage:
+`.agents/docs/arm64-python-seeding-underflow-analysis.md`.
+
+**How it works**:
+- CFI-first, unchanged; the FP chain is consulted only where CFI fails.
+- Guards against garbage chains: the frame record must lie inside the stack
+  and be 8-aligned, the recovered SP must move towards the stack base, the
+  next record must be strictly higher (saved X29 == 0 is accepted as the
+  conventional chain terminator), and pc values of 0/1 stop the walk.
+- Caller IPs keep the `- 1` bias of the CFI path, so consumers (including
+  Callgrind's seeder, which re-adds 1 to recover exact return targets) see
+  a uniform convention.
+- Affects all tools' stack traces on arm64 (error reports may now walk
+  through JIT/asm frames instead of stopping).
+- Regression coverage: `callgrind-utils/tests/objskip_seed_underflow.rs` — a
+  minimal fixture whose asm trampoline maintains FP but has no CFI, starting
+  instrumentation two obj-skipped frames deep; it asserts no skipped frame
+  leaks into the folded graph and that the workload parents under the
+  trampoline. Fails on the CFI-only unwinder, passes with the fallback.
diff --git a/callgrind/callstack.c b/callgrind/callstack.c
index 4b2800e8a..c365bee61 100644
--- a/callgrind/callstack.c
+++ b/callgrind/callstack.c
@@ -568,7 +568,15 @@ void CLG_(reconstruct_call_stack_from_native)(ThreadId tid)
          * SP, sps[frame+1]; the outermost frame keeps its own SP as nothing
          * returns past it during measurement. */
         ce->sp       = (frame + 1 < (Int)n) ? sps[frame + 1] : sps[frame];
-        ce->ret_addr = (frame + 1 < (Int)n) ? ips[frame + 1] : 0;
+        /* ret_addr must be the exact architectural return target: the arm64
+         * return detector (setup_bbcc) matches it against bb_addr() of the
+         * returned-into block, and push_call_stack records the exact X30 there.
+         * But VG_(get_StackTrace) reports each caller IP as return_addr - 1 (so
+         * symbolization lands in the calling instruction, not the one after);
+         * undo that bias here, else every seeded frame's return is off by one,
+         * is misread as a jump, and the seeded skipped interpreter/ctypes frames
+         * never pop -- leaking them as the graph root. */
+        ce->ret_addr = (frame + 1 < (Int)n) ? ips[frame + 1] + 1 : 0;
         cs->sp++;
         ensure_stack_size(cs->sp + 1);
         cs->entry[cs->sp].cxt = 0;
diff --git a/coregrind/m_stacktrace.c b/coregrind/m_stacktrace.c
index fa2dc0964..f9d4eaf24 100644
--- a/coregrind/m_stacktrace.c
+++ b/coregrind/m_stacktrace.c
@@ -1249,8 +1249,11 @@ UInt VG_(get_StackTrace_wrk) ( ThreadId tid_if_known,
    ips[0] = uregs.pc;
    i = 1;
 
-   /* Loop unwinding the stack, using CFI. */
+   /* Loop unwinding the stack: CFI first, AAPCS64 frame-pointer chain as
+      the fallback. */
    while (True) {
+      Addr old_sp;
+
       if (debug) {
          VG_(printf)("i: %d, pc: 0x%lx, sp: 0x%lx\n",
                      i, uregs.pc, uregs.sp);
@@ -1259,6 +1262,8 @@ UInt VG_(get_StackTrace_wrk) ( ThreadId tid_if_known,
       if (i >= max_n_ips)
          break;
 
+      old_sp = uregs.sp;
+
       if (VG_(use_CF_info)( &uregs, fp_min, fp_max )) {
          if (sps) sps[i] = uregs.sp;
          if (fps) fps[i] = uregs.x29;
@@ -1271,6 +1276,50 @@ UInt VG_(get_StackTrace_wrk) ( ThreadId tid_if_known,
          continue;
       }
 
+      /* If VG_(use_CF_info) fails, the location has no unwind info: JIT
+         pages (e.g. CPython's -X perf trampolines) or hand-written
+         assembly.  Fall back to following the AAPCS64 frame-pointer chain:
+         X29 points at a frame record { saved X29, saved X30 }.  Code built
+         for fp-based profilers (perf, samply) maintains this chain exactly
+         where CFI is missing.  Mirrors the amd64 %rbp fallback, with the
+         same guards: the record must lie inside the stack, the recovered
+         SP must make progress towards the stack base, and the next record
+         must be strictly further up (a saved X29 of 0 is the conventional
+         chain terminator: emit this frame, the bounds check ends the walk
+         on the next iteration).  Stop rather than emit a bogus trail. */
+      if (VG_IS_8_ALIGNED(uregs.x29)
+          && fp_min <= uregs.x29
+          && uregs.x29 <= fp_max - 2 * sizeof(Addr)) {
+         Addr next_x29 = ((Addr*)uregs.x29)[0];
+         Addr next_pc  = ((Addr*)uregs.x29)[1];
+         /* End-of-chain sentinel, same test the other unwinders in this
+            file use.  Recorded pcs are decremented by 1 (see below) so the
+            symbol lookup lands on the call insn rather than the return
+            address; both 0 and 1 collapse to the null address 0 after that
+            -1, so treat either as "no real caller" and stop. */
+         if (0 == next_pc || 1 == next_pc) break;
+         uregs.sp = uregs.x29 + 2 * sizeof(Addr);
+         if (old_sp >= uregs.sp
+             || (next_x29 != 0 && next_x29 <= uregs.x29)) {
+            if (debug)
+               VG_(printf)("     FF end of stack sp %#lx next x29 %#lx\n",
+                           uregs.sp, next_x29);
+            break;
+         }
+         uregs.x29 = next_x29;
+         uregs.x30 = next_pc;
+         uregs.pc  = next_pc;
+         if (sps) sps[i] = uregs.sp;
+         if (fps) fps[i] = uregs.x29;
+         ips[i++] = uregs.pc - 1; /* -1: refer to calling insn, not the RA */
+         if (debug)
+            VG_(printf)("USING FP: pc: 0x%lx, sp: 0x%lx\n",
+                        uregs.pc, uregs.sp);
+         uregs.pc = uregs.pc - 1;
+         RECURSIVE_MERGE(cmrf,ips,i);
+         continue;
+      }
+
       /* No luck.  We have to give up. */
       break;
    }

From 3ec8252d64d98641c1b72557680169fa9555e28a Mon Sep 17 00:00:00 2001
From: not-matthias <matthias@codspeed.io>
Date: Thu, 2 Jul 2026 16:28:09 +0000
Subject: [PATCH 4/9] fix(callgrind): skip arm64 _dl_tlsdesc_* resolvers like
 PLT stubs

Every TLSDESC __thread access blr's into the dynamic linker's resolver,
which rets straight back into the middle of the accessing function. When
the access is made from an obj-skipped object (CPython under
pytest-codspeed), the skipped->nonskipped splice pushed the resolver
frame with ret_addr 0; its mid-function return could never match, the
RET-w/o-CALL promotion re-entered the skipped object with nonskipped
pointing at the resolver, and skipped cost plus call edges piled up
under _dl_tlsdesc_return -- pulling nearly whole Python flamegraphs
under that node, plus inverted return-direction edges.

Mark _dl_tlsdesc_* skipped (gated on --skip-plt, arm64 only), the same
transparent-trampoline class as _dl_runtime_resolve; pop_on_jump cannot
apply since these exit via a plain ret to a non-entry address. Skip
pushes are never spliced and record the architectural X30, so the
return pops cleanly and cost keeps flowing to the real non-skipped
caller.

Regression: callgrind-utils/tests/arm64_tls_access.rs (plain +
--obj-skip runs of the shared-lib TLS fixture); both tests fail on the
unfixed tool.
---
 CODSPEED-CHANGELOG.md | 40 ++++++++++++++++++++++++++++++++++++++++
 callgrind/fn.c        | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 76 insertions(+)
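
Note (illustration only, not part of the change): the name test amounts to
a fixed-prefix match. The helper names below are hypothetical, and
`tlsdesc_skip_flag` only models this one rule (a matching function takes
the value of the `--skip-plt` setting); it does not model any other reason
a function might be skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True for _dl_tlsdesc_return, _dl_tlsdesc_undefweak, _dl_tlsdesc_dynamic
   and any other symbol sharing the "_dl_tlsdesc_" prefix. */
static bool is_tlsdesc_resolver(const char *fn_name)
{
    static const char prefix[] = "_dl_tlsdesc_";
    assert(fn_name != NULL);
    return strncmp(fn_name, prefix, sizeof prefix - 1) == 0;
}

/* Hypothetical helper: skip flag contributed by the tlsdesc rule alone. */
static bool tlsdesc_skip_flag(const char *fn_name, bool skip_plt)
{
    return is_tlsdesc_resolver(fn_name) && skip_plt;
}
```

Deriving the length from `sizeof prefix - 1` keeps the prefix string and
the compared length from drifting apart.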

diff --git a/CODSPEED-CHANGELOG.md b/CODSPEED-CHANGELOG.md
index eecf557b0..ce8405e1b 100644
--- a/CODSPEED-CHANGELOG.md
+++ b/CODSPEED-CHANGELOG.md
@@ -164,3 +164,43 @@ this because its unwinder already had the FP fallback. Full triage:
   instrumentation two obj-skipped frames deep; it asserts no skipped frame
   leaks into the folded graph and that the workload parents under the
   trampoline. Fails on the CFI-only unwinder, passes with the fallback.
+
+### Callgrind: arm64 TLS-descriptor resolvers treated as transparent trampolines
+
+**Fix**: On arm64, functions named `_dl_tlsdesc_*` (`_dl_tlsdesc_return`,
+`_dl_tlsdesc_undefweak`, `_dl_tlsdesc_dynamic`) are marked skipped, gated on
+`--skip-plt` like PLT sections — the same transparent dynamic-linker
+trampoline class as `_dl_runtime_resolve` (which gets `pop_on_jump`; that
+mechanism cannot apply here because these resolvers exit with a plain `ret`
+to a non-entry address, so `pop_on_jump` would never fire).
+
+**Motivation**: with the (default) AArch64 TLS descriptor model, every
+`__thread` access whose variable cannot be relaxed to Local-Exec compiles to
+a GOT-loaded {resolver, arg} pair and a `blr` into the resolver, which
+returns straight back into the middle of the accessing function. As a named
+node the resolver is pure noise between a function and its own straight-line
+code. Worse, when the access is made from an obj-skipped object (production:
+the CPython binary under pytest-codspeed), the skipped-to-nonskipped splice
+pushed the resolver frame with `ret_addr = 0`; its mid-function return could
+never match, the RET-w/o-CALL promotion re-entered the skipped object with
+`nonskipped` pointing at the resolver, and from then on skipped cost and
+call edges accumulated under `_dl_tlsdesc_return` — observed pulling nearly
+entire Python flamegraphs under that node for TLS-heavy workloads, plus
+inverted return-direction edges (`hash_tree -> build_tree` in the fixture).
+x86_64 never hits this: its default TLS dialect calls `__tls_get_addr` (an
+ordinary function), and x86 `call`/`ret` move SP so leftover frames are
+swept by SP progress anyway.
+
+**How it works**:
+- `CLG_(get_fn_node)` marks first-seen `_dl_tlsdesc_*` functions with
+  `fn->skip = CLG_(clo).skip_plt` (arm64 only).
+- A skip push is never spliced and records the architectural X30 (see the
+  X30 fix above), so the resolver's return matches its own frame and pops
+  cleanly; the resolver's cost and the skipped caller's post-return work
+  keep flowing to the real non-skipped parent via `nonskipped`.
+- Regression coverage: `callgrind-utils/tests/arm64_tls_access.rs`
+  (aarch64-only) compiles `testdata/arm64_tls_access.c` against a `__thread`
+  variable in a shared library and asserts, for both a plain run and an
+  `--obj-skip` run of that library: no `_dl_tlsdesc_*` node, no work stolen
+  under `touch_tls`, no inverted `hash_tree -> build_tree` edges, and
+  `run_measured` as the only root. Both tests fail on the unfixed tool.
diff --git a/callgrind/fn.c b/callgrind/fn.c
index e8b4ba03c..f1d96aca1 100644
--- a/callgrind/fn.c
+++ b/callgrind/fn.c
@@ -711,6 +711,42 @@ fn_node* CLG_(get_fn_node)(BB* bb)
                       (UWord)bb->offset, bb_addr(bb));
       }
 
+#if defined(VGA_arm64)
+      /* aarch64 TLS-descriptor resolvers (_dl_tlsdesc_return,
+       * _dl_tlsdesc_undefweak, _dl_tlsdesc_dynamic) are transparent
+       * dynamic-linker trampolines, the same class as PLT stubs and
+       * _dl_runtime_resolve above: every `__thread` access in code built
+       * with the (default) TLS descriptor model compiles to a GOT-loaded
+       * {resolver, arg} pair and a `blr` into the resolver, which returns
+       * straight back into the middle of the accessing function.
+       *
+       * Unlike _dl_runtime_resolve they never jump onward (pop_on_jump
+       * would never fire: the exit is a plain `ret` to a non-entry
+       * address), so treat them like PLT stubs instead: fn->skip, gated on
+       * --skip-plt. A named node here is pure noise between a function and
+       * its own straight-line code. Worse, when the TLS access is made
+       * from obj-skipped code (production: the CPython binary under
+       * pytest-codspeed), the skipped->nonskipped splice in setup_bbcc
+       * pushes the resolver frame with ret_addr 0; its `ret` back into the
+       * middle of skipped code can never match, the RET-w/o-CALL promotion
+       * re-enters the skipped object with `nonskipped` pointing at the
+       * resolver, and from then on skipped cost and call edges pile up
+       * under `_dl_tlsdesc_return` -- observed pulling nearly whole Python
+       * flamegraphs under that node. As a skip push the frame is not
+       * spliced and records the architectural X30 (see push_call_stack),
+       * so the resolver's return pops cleanly and all cost keeps flowing
+       * to the real non-skipped caller. */
+      if (VG_(strncmp)(fn->name, "_dl_tlsdesc_", 12) == 0) {
+	  fn->skip = CLG_(clo).skip_plt;
+
+	  if (VG_(clo_verbosity) > 1)
+	      VG_(message)(Vg_DebugMsg, "Symbol match: found tlsdesc resolver:"
+                                        " %s +%#lx=%#lx\n",
+		      bb->obj->name + bb->obj->last_slash_pos,
+                      (UWord)bb->offset, bb_addr(bb));
+      }
+#endif
+
       fn->is_malloc  = (VG_(strcmp)(fn->name, "malloc")==0);
       fn->is_realloc = (VG_(strcmp)(fn->name, "realloc")==0);
       fn->is_free    = (VG_(strcmp)(fn->name, "free")==0);

From 93e8152e6819bad9dace490c3e01a8879037751e Mon Sep 17 00:00:00 2001
From: not-matthias <matthias@codspeed.io>
Date: Thu, 2 Jul 2026 16:58:53 +0000
Subject: [PATCH 5/9] fix(debuginfo): find split debug files via
 NIX_DEBUG_INFO_DIRS

find_debug_file() only searched the /usr/lib/debug tree, which does not
exist on NixOS: Nix ships separate debug outputs under their own store
paths. Factor the build-id .build-id/xx/yyyy.debug probe into
try_buildid_dir() and honour NIX_DEBUG_INFO_DIRS -- the established
colon-separated convention also used by the nixpkgs gdb/lldb wrappers --
so split debug info resolves inside the dev-shell. Lookup order is now:
each NIX_DEBUG_INFO_DIRS entry, then --extra-debuginfo-path (probed with
the same build-id layout), then /usr/lib/debug.
---
 coregrind/m_debuginfo/readelf.c | 73 +++++++++++++++++++++++++++++----
 1 file changed, 65 insertions(+), 8 deletions(-)
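
Note (illustration only, not part of the change): the colon-separated
lookup can be sketched in standard C. The function names and the example
directories are hypothetical. Unlike the patch, which stops at the first
directory whose debug file opens, this sketch only lists the candidate
paths in the order they would be tried.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Format "<dir>/.build-id/xx/yyyy.debug" into `out`.
   Returns 0 on success, -1 if the inputs are unusable or `out` is too small. */
static int buildid_debug_path(const char *dir, size_t dirlen,
                              const char *buildid, char *out, size_t outsz)
{
    assert(dir != NULL && buildid != NULL && out != NULL);
    if (dirlen == 0 || strlen(buildid) < 3)
        return -1;
    int n = snprintf(out, outsz, "%.*s/.build-id/%c%c/%s.debug",
                     (int)dirlen, dir, buildid[0], buildid[1], buildid + 2);
    return (n >= 0 && (size_t)n < outsz) ? 0 : -1;
}

/* Walk a colon-separated directory list, skipping empty entries, and write
   one candidate path per entry. Returns the number of paths written. */
static size_t candidate_paths(const char *dirs, const char *buildid,
                              char paths[][256], size_t max_paths)
{
    size_t n = 0;
    const char *p = dirs;
    while (p != NULL && *p != '\0' && n < max_paths) {
        const char *colon = strchr(p, ':');
        size_t dirlen = colon ? (size_t)(colon - p) : strlen(p);
        if (buildid_debug_path(p, dirlen, buildid,
                               paths[n], sizeof paths[n]) == 0)
            n++;
        p = colon ? colon + 1 : p + dirlen;   /* step past the separator */
    }
    return n;
}
```

For a list such as `/tmp/a/lib/debug::/tmp/b` and build-id `abcdef12`, the
empty middle entry is skipped and two candidates are produced.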

diff --git a/coregrind/m_debuginfo/readelf.c b/coregrind/m_debuginfo/readelf.c
index 58ffc9b53..e59d06a0f 100644
--- a/coregrind/m_debuginfo/readelf.c
+++ b/coregrind/m_debuginfo/readelf.c
@@ -1500,6 +1500,37 @@ DiImage* find_debug_file_debuginfod( const HChar* objpath,
 }
 #endif
 
+/* Try one directory as a root for the standard .build-id/xx/yyyy.debug
+   layout. On success, returns the opened image and sets *debugpath_out
+   to a freshly allocated path (which the caller owns); on failure,
+   returns NULL and leaves *debugpath_out untouched. */
+static
+DiImage* try_buildid_dir( const HChar* dir, SizeT dirlen,
+                          const HChar* buildid, Bool rel_ok,
+                          HChar** debugpath_out )
+{
+   DiImage* dimg;
+   HChar* debugpath;
+
+   if (dirlen == 0)
+      return NULL;
+
+   debugpath = ML_(dinfo_zalloc)("di.tbid.1",
+                                 dirlen + VG_(strlen)(buildid) + 19);
+   VG_(memcpy)(debugpath, dir, dirlen);
+   VG_(sprintf)(debugpath + dirlen, "/.build-id/%c%c/%s.debug",
+                buildid[0], buildid[1], buildid + 2);
+
+   dimg = open_debug_file(debugpath, buildid, 0, rel_ok, NULL);
+   if (dimg == NULL) {
+      ML_(dinfo_free)(debugpath);
+      return NULL;
+   }
+
+   *debugpath_out = debugpath;
+   return dimg;
+}
+
 /* Try to find a separate debug file for a given object file.  If
    found, return its DiImage, which should be freed by the caller.  If
    |buildid| is non-NULL, then a debug object matching it is
@@ -1519,16 +1550,42 @@ DiImage* find_debug_file( struct _DebugInfo* di,
    HChar*   debugpath = NULL; /* where we found it */
 
    if (buildid != NULL) {
-      debugpath = ML_(dinfo_zalloc)("di.fdf.1",
-                                    VG_(strlen)(buildid) + 33);
+      /* Nix packages ship separate debug outputs under their own store
+         paths, never under /usr/lib/debug (which doesn't exist on
+         NixOS). NIX_DEBUG_INFO_DIRS is the established convention (also
+         honoured by gdb/lldb via nixpkgs wrappers) for a colon-separated
+         list of trees that mirror the standard .build-id/xx/yyyy.debug
+         layout; try each, then --extra-debuginfo-path, before falling
+         back to the FHS path. */
+      const HChar* nix_dirs = VG_(getenv)("NIX_DEBUG_INFO_DIRS");
+      const HChar* p = nix_dirs;
+
+      while (dimg == NULL && p != NULL && *p != 0) {
+         const HChar* colon = VG_(strchr)(p, ':');
+         SizeT dirlen = colon ? (SizeT)(colon - p) : VG_(strlen)(p);
+
+         dimg = try_buildid_dir(p, dirlen, buildid, rel_ok, &debugpath);
+
+         p = colon ? colon + 1 : p + dirlen;
+      }
 
-      VG_(sprintf)(debugpath, "/usr/lib/debug/.build-id/%c%c/%s.debug",
-                   buildid[0], buildid[1], buildid + 2);
+      if (dimg == NULL && extrapath != NULL) {
+         dimg = try_buildid_dir(extrapath, VG_(strlen)(extrapath),
+                                buildid, rel_ok, &debugpath);
+      }
+
+      if (dimg == NULL) {
+         debugpath = ML_(dinfo_zalloc)("di.fdf.1",
+                                       VG_(strlen)(buildid) + 33);
 
-      dimg = open_debug_file(debugpath, buildid, 0, rel_ok, NULL);
-      if (!dimg) {
-         ML_(dinfo_free)(debugpath);
-         debugpath = NULL;
+         VG_(sprintf)(debugpath, "/usr/lib/debug/.build-id/%c%c/%s.debug",
+                      buildid[0], buildid[1], buildid + 2);
+
+         dimg = open_debug_file(debugpath, buildid, 0, rel_ok, NULL);
+         if (!dimg) {
+            ML_(dinfo_free)(debugpath);
+            debugpath = NULL;
+         }
       }
    }
 

From 4e7c85b763805a42650fe31a6d197ed059203a10 Mon Sep 17 00:00:00 2001
From: not-matthias <matthias@codspeed.io>
Date: Thu, 2 Jul 2026 16:59:50 +0000
Subject: [PATCH 6/9] feat(callgrind-utils): parse .out into a call graph with
 JSON and flamegraph output

Add a Rust crate (edition 2024) that reads a Callgrind .out profile and
extracts the call-graph topology (costs and addresses are ignored). The
graph is serialized to canonical index-ref JSON, for stable
cross-platform call-graph diffing, and to folded stacks for flamegraphs.

Node identity is the {object,file,function} tuple so same-named statics
stay distinct. Edges are emitted only on calls= lines; name compression
across the three ID spaces, the cfl/cfi alias, inline fi/fe
callee-context inheritance, and multi-part merge are handled. A redaction
pass strips volatile addresses/paths so snapshots stay portable.
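
As a rough illustration of the name compression and node identity
described above (not the crate's actual code; `NameTable` and `NodeKey`
are hypothetical names), a minimal sketch could look like this. It
assumes the Callgrind convention that `(id) name` defines a compressed
name and a bare `(id)` refers back to it, with one table per ID space
(objects, files, functions):

```rust
use std::collections::HashMap;

/// One of the three independent ID spaces (objects, files, functions).
/// Hypothetical type, for illustration only.
#[derive(Default)]
struct NameTable {
    names: HashMap<u32, String>,
}

impl NameTable {
    /// Resolve the value part of a spec line: `(3) main` defines id 3,
    /// a bare `(3)` looks it up, anything else is taken as a literal name.
    fn resolve(&mut self, value: &str) -> Option<String> {
        let value = value.trim();
        if let Some(rest) = value.strip_prefix('(') {
            if let Some((id, name)) = rest.split_once(')') {
                if let Ok(id) = id.trim().parse::<u32>() {
                    let name = name.trim();
                    if name.is_empty() {
                        // Back-reference: None if the id was never defined.
                        return self.names.get(&id).cloned();
                    }
                    self.names.insert(id, name.to_string());
                    return Some(name.to_string());
                }
            }
        }
        // Uncompressed name, including ones like `(below main)`.
        Some(value.to_string())
    }
}

/// Node identity as the {object, file, function} tuple, so two statics
/// with the same function name in different files stay distinct.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct NodeKey {
    object: String,
    file: String,
    function: String,
}
```

Keeping a separate table per ID space matters because the same numeric
id can name a file in one space and a function in another.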

Fixtures are compiled and profiled by the in-repo Callgrind through an
rstest harness (vg-in-place, --instr-atstart=no plus client requests
keep loader/libc frames out): base graphs (recursion, chain, diamond,
mutual), arm64 unwind fixtures (tail calls, recursion, TLS access,
alloc/free cycles, longjmp, phantom recursion), the objskip seeding
underflow regression, and the arm64 TLS-descriptor regression.
Tests combine folded-stack snapshots with structural assertions; the
crate is clippy and rustfmt clean.
---
 callgrind-utils/.gitignore                    |   2 +
 callgrind-utils/Cargo.lock                    | 681 ++++++++++++++++++
 callgrind-utils/Cargo.toml                    |  14 +
 callgrind-utils/build.rs                      | 116 +++
 callgrind-utils/src/error.rs                  |  34 +
 callgrind-utils/src/flamegraph.rs             | 179 +++++
 callgrind-utils/src/lib.rs                    |   6 +
 callgrind-utils/src/model.rs                  | 161 +++++
 callgrind-utils/src/parser/mod.rs             | 388 ++++++++++
 callgrind-utils/src/parser/normalize.rs       |  28 +
 callgrind-utils/src/redact.rs                 | 148 ++++
 callgrind-utils/src/serialize.rs              |  51 ++
 .../testdata/arm64_deep_tailcall_chain.c      |  93 +++
 .../testdata/arm64_free_during_recursion.c    | 168 +++++
 .../testdata/arm64_free_tailcall_phantom.c    |  92 +++
 .../testdata/arm64_libm_recursion.c           |  95 +++
 .../testdata/arm64_longjmp_unwind.c           | 107 +++
 .../testdata/arm64_multi_alloc_cycle.c        | 160 ++++
 .../testdata/arm64_objskip_tailcall.c         | 102 +++
 .../testdata/arm64_objskip_tailcall_lib.c     |  24 +
 .../testdata/arm64_ping_pong_recursion.c      | 101 +++
 .../testdata/arm64_plt_phantom_recursion.c    |  85 +++
 .../testdata/arm64_recursive_return.c         |  89 +++
 callgrind-utils/testdata/arm64_tail_call.c    |  56 ++
 callgrind-utils/testdata/arm64_tls_access.c   | 103 +++
 .../testdata/arm64_tls_access_lib.c           |  12 +
 .../testdata/arm64_wrapped_alloc_chain.c      | 144 ++++
 callgrind-utils/testdata/chain.c              |  27 +
 callgrind-utils/testdata/clgctl.c             |  28 +
 callgrind-utils/testdata/diamond.c            |  32 +
 callgrind-utils/testdata/fractal.c            | 247 +++++++
 callgrind-utils/testdata/fractal.rs           | 330 +++++++++
 callgrind-utils/testdata/fractal_alloc.rs     | 273 +++++++
 callgrind-utils/testdata/mutual.c             |  26 +
 .../testdata/objskip_seed_underflow.c         | 100 +++
 .../testdata/objskip_seed_underflow_lib.c     |  40 +
 callgrind-utils/testdata/recursion.c          |  40 +
 callgrind-utils/testdata/recursion.py         |  59 ++
 callgrind-utils/tests/arm64_tls_access.rs     | 221 ++++++
 callgrind-utils/tests/data/example.out        | 126 ++++
 callgrind-utils/tests/flamegraph.rs           | 195 +++++
 .../tests/objskip_seed_underflow.rs           | 165 +++++
 callgrind-utils/tests/parser.rs               | 314 ++++++++
 callgrind-utils/tests/python_callgraph.rs     | 137 ++++
 callgrind-utils/tests/rust_callgraph.rs       | 328 +++++++++
 callgrind-utils/tests/snapshot.rs             | 197 +++++
 ...allgraph__recursion_py__topology_json.snap | 101 +++
 .../rust_callgraph__fractal_rs_folded.snap    |  57 ++
 ...ust_callgraph__fractal_rs_full_folded.snap |  57 ++
 ...hot__arm64_deep_tailcall_chain_folded.snap |  47 ++
 ...t__arm64_free_during_recursion_folded.snap |  70 ++
 ...t__arm64_free_tailcall_phantom_folded.snap |  19 +
 ...snapshot__arm64_libm_recursion_folded.snap | 105 +++
 ...snapshot__arm64_longjmp_unwind_folded.snap |  37 +
 ...pshot__arm64_multi_alloc_cycle_folded.snap |  66 ++
 ...hot__arm64_ping_pong_recursion_folded.snap |  38 +
 ...t__arm64_plt_phantom_recursion_folded.snap |  11 +
 ...apshot__arm64_recursive_return_folded.snap |  48 ++
 .../snapshot__arm64_tail_call_folded.snap     |   8 +
 ...hot__arm64_wrapped_alloc_chain_folded.snap |  39 +
 .../snapshots/snapshot__chain_folded.snap     |   8 +
 .../snapshot__chain_full_folded.snap          |   8 +
 .../snapshots/snapshot__diamond_folded.snap   |  10 +
 .../snapshot__diamond_full_folded.snap        |  10 +
 .../snapshots/snapshot__fractal_folded.snap   |  58 ++
 .../snapshot__fractal_full_folded.snap        |  58 ++
 .../snapshots/snapshot__mutual_folded.snap    |  10 +
 .../snapshot__mutual_full_folded.snap         |  10 +
 .../snapshots/snapshot__recursion_folded.snap |  10 +
 .../snapshot__recursion_full_folded.snap      |  10 +
 70 files changed, 7019 insertions(+)
 create mode 100644 callgrind-utils/.gitignore
 create mode 100644 callgrind-utils/Cargo.lock
 create mode 100644 callgrind-utils/Cargo.toml
 create mode 100644 callgrind-utils/build.rs
 create mode 100644 callgrind-utils/src/error.rs
 create mode 100644 callgrind-utils/src/flamegraph.rs
 create mode 100644 callgrind-utils/src/lib.rs
 create mode 100644 callgrind-utils/src/model.rs
 create mode 100644 callgrind-utils/src/parser/mod.rs
 create mode 100644 callgrind-utils/src/parser/normalize.rs
 create mode 100644 callgrind-utils/src/redact.rs
 create mode 100644 callgrind-utils/src/serialize.rs
 create mode 100644 callgrind-utils/testdata/arm64_deep_tailcall_chain.c
 create mode 100644 callgrind-utils/testdata/arm64_free_during_recursion.c
 create mode 100644 callgrind-utils/testdata/arm64_free_tailcall_phantom.c
 create mode 100644 callgrind-utils/testdata/arm64_libm_recursion.c
 create mode 100644 callgrind-utils/testdata/arm64_longjmp_unwind.c
 create mode 100644 callgrind-utils/testdata/arm64_multi_alloc_cycle.c
 create mode 100644 callgrind-utils/testdata/arm64_objskip_tailcall.c
 create mode 100644 callgrind-utils/testdata/arm64_objskip_tailcall_lib.c
 create mode 100644 callgrind-utils/testdata/arm64_ping_pong_recursion.c
 create mode 100644 callgrind-utils/testdata/arm64_plt_phantom_recursion.c
 create mode 100644 callgrind-utils/testdata/arm64_recursive_return.c
 create mode 100644 callgrind-utils/testdata/arm64_tail_call.c
 create mode 100644 callgrind-utils/testdata/arm64_tls_access.c
 create mode 100644 callgrind-utils/testdata/arm64_tls_access_lib.c
 create mode 100644 callgrind-utils/testdata/arm64_wrapped_alloc_chain.c
 create mode 100644 callgrind-utils/testdata/chain.c
 create mode 100644 callgrind-utils/testdata/clgctl.c
 create mode 100644 callgrind-utils/testdata/diamond.c
 create mode 100644 callgrind-utils/testdata/fractal.c
 create mode 100644 callgrind-utils/testdata/fractal.rs
 create mode 100644 callgrind-utils/testdata/fractal_alloc.rs
 create mode 100644 callgrind-utils/testdata/mutual.c
 create mode 100644 callgrind-utils/testdata/objskip_seed_underflow.c
 create mode 100644 callgrind-utils/testdata/objskip_seed_underflow_lib.c
 create mode 100644 callgrind-utils/testdata/recursion.c
 create mode 100644 callgrind-utils/testdata/recursion.py
 create mode 100644 callgrind-utils/tests/arm64_tls_access.rs
 create mode 100644 callgrind-utils/tests/data/example.out
 create mode 100644 callgrind-utils/tests/flamegraph.rs
 create mode 100644 callgrind-utils/tests/objskip_seed_underflow.rs
 create mode 100644 callgrind-utils/tests/parser.rs
 create mode 100644 callgrind-utils/tests/python_callgraph.rs
 create mode 100644 callgrind-utils/tests/rust_callgraph.rs
 create mode 100644 callgrind-utils/tests/snapshot.rs
 create mode 100644 callgrind-utils/tests/snapshots/python_callgraph__recursion_py__topology_json.snap
 create mode 100644 callgrind-utils/tests/snapshots/rust_callgraph__fractal_rs_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/rust_callgraph__fractal_rs_full_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__arm64_deep_tailcall_chain_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__arm64_free_during_recursion_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__arm64_free_tailcall_phantom_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__arm64_libm_recursion_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__arm64_longjmp_unwind_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__arm64_multi_alloc_cycle_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__arm64_ping_pong_recursion_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__arm64_plt_phantom_recursion_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__arm64_recursive_return_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__arm64_tail_call_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__arm64_wrapped_alloc_chain_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__chain_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__chain_full_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__diamond_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__diamond_full_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__fractal_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__fractal_full_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__mutual_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__mutual_full_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__recursion_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/snapshot__recursion_full_folded.snap

diff --git a/callgrind-utils/.gitignore b/callgrind-utils/.gitignore
new file mode 100644
index 000000000..435dc4ab4
--- /dev/null
+++ b/callgrind-utils/.gitignore
@@ -0,0 +1,2 @@
+target/
+*.svg
\ No newline at end of file
diff --git a/callgrind-utils/Cargo.lock b/callgrind-utils/Cargo.lock
new file mode 100644
index 000000000..5c6bce20c
--- /dev/null
+++ b/callgrind-utils/Cargo.lock
@@ -0,0 +1,681 @@
+# This file is automatically @generated by Cargo.
+# It is not intended for manual editing.
+version = 4
+
+[[package]]
+name = "ahash"
+version = "0.8.12"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "5a15f179cd60c4584b8a8c596927aadc462e27f2ca70c04e0071964a73ba7a75"
+dependencies = [
+ "cfg-if",
+ "getrandom 0.3.4",
+ "once_cell",
+ "version_check",
+ "zerocopy",
+]
+
+[[package]]
+name = "aho-corasick"
+version = "1.1.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ddd31a130427c27518df266943a5308ed92d4b226cc639f5a8f1002816174301"
+dependencies = [
+ "memchr",
+]
+
+[[package]]
+name = "arrayvec"
+version = "0.7.7"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f02882884d3e1bc524fb12c79f107f6ad0e1cfd498c536ffb494301740995dfe"
+
+[[package]]
+name = "bitflags"
+version = "2.13.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b4388bee8683e3d04af747c73422af53102d2bd24d9eadb6cbc100baef4b43f8"
+
+[[package]]
+name = "bytemuck"
+version = "1.25.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "c8efb64bd706a16a1bdde310ae86b351e4d21550d98d056f22f8a7f7a2183fec"
+
+[[package]]
+name = "callgrind-utils"
+version = "0.1.0"
+dependencies = [
+ "inferno",
+ "insta",
+ "rstest",
+ "serde",
+ "serde_json",
+ "thiserror",
+]
+
+[[package]]
+name = "cfg-if"
+version = "1.0.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9330f8b2ff13f34540b44e946ef35111825727b38d33286ef986142615121801"
+
+[[package]]
+name = "console"
+version = "0.16.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d64e8af5551369d19cf50138de61f1c42074ab970f74e99be916646777f8fc87"
+dependencies = [
+ "encode_unicode",
+ "libc",
+ "windows-sys",
+]
+
+[[package]]
+name = "encode_unicode"
+version = "1.0.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "34aa73646ffb006b8f5147f3dc182bd4bcb190227ce861fc4a4844bf8e3cb2c0"
+
+[[package]]
+name = "equivalent"
+version = "1.0.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "877a4ace8713b0bcf2a4e7eec82529c029f1d0619886d18145fea96c3ffe5c0f"
+
+[[package]]
+name = "errno"
+version = "0.3.14"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "39cab71617ae0d63f51a36d69f866391735b51691dbda63cf6f96d042b63efeb"
+dependencies = [
+ "libc",
+ "windows-sys",
+]
+
+[[package]]
+name = "fastrand"
+version = "2.4.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9f1f227452a390804cdb637b74a86990f2a7d7ba4b7d5693aac9b4dd6defd8d6"
+
+[[package]]
+name = "futures"
+version = "0.3.32"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8b147ee9d1f6d097cef9ce628cd2ee62288d963e16fb287bd9286455b241382d"
+dependencies = [
+ "futures-channel",
+ "futures-core",
+ "futures-executor",
+ "futures-io",
+ "futures-sink",
+ "futures-task",
+ "futures-util",
+]
+
+[[package]]
+name = "futures-channel"
+version = "0.3.32"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "07bbe89c50d7a535e539b8c17bc0b49bdb77747034daa8087407d655f3f7cc1d"
+dependencies = [
+ "futures-core",
+ "futures-sink",
+]
+
+[[package]]
+name = "futures-core"
+version = "0.3.32"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7e3450815272ef58cec6d564423f6e755e25379b217b0bc688e295ba24df6b1d"
+
+[[package]]
+name = "futures-executor"
+version = "0.3.32"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "baf29c38818342a3b26b5b923639e7b1f4a61fc5e76102d4b1981c6dc7a7579d"
+dependencies = [
+ "futures-core",
+ "futures-task",
+ "futures-util",
+]
+
+[[package]]
+name = "futures-io"
+version = "0.3.32"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "cecba35d7ad927e23624b22ad55235f2239cfa44fd10428eecbeba6d6a717718"
+
+[[package]]
+name = "futures-macro"
+version = "0.3.32"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e835b70203e41293343137df5c0664546da5745f82ec9b84d40be8336958447b"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "futures-sink"
+version = "0.3.32"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "c39754e157331b013978ec91992bde1ac089843443c49cbc7f46150b0fad0893"
+
+[[package]]
+name = "futures-task"
+version = "0.3.32"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "037711b3d59c33004d3856fbdc83b99d4ff37a24768fa1be9ce3538a1cde4393"
+
+[[package]]
+name = "futures-timer"
+version = "3.0.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "af43fadb8a98512d547e37b4e92e0ced13e205c061b87b4623eff01d918d6968"
+
+[[package]]
+name = "futures-util"
+version = "0.3.32"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "389ca41296e6190b48053de0321d02a77f32f8a5d2461dd38762c0593805c6d6"
+dependencies = [
+ "futures-channel",
+ "futures-core",
+ "futures-io",
+ "futures-macro",
+ "futures-sink",
+ "futures-task",
+ "memchr",
+ "pin-project-lite",
+ "slab",
+]
+
+[[package]]
+name = "getrandom"
+version = "0.3.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "899def5c37c4fd7b2664648c28120ecec138e4d395b459e5ca34f9cce2dd77fd"
+dependencies = [
+ "cfg-if",
+ "libc",
+ "r-efi 5.3.0",
+ "wasip2",
+]
+
+[[package]]
+name = "getrandom"
+version = "0.4.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "300e883d756b2e4ec94e02791f39b04b522276138852cfc41d9fb7e904106099"
+dependencies = [
+ "cfg-if",
+ "libc",
+ "r-efi 6.0.0",
+]
+
+[[package]]
+name = "glob"
+version = "0.3.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "0cc23270f6e1808e30a928bdc84dea0b9b4136a8bc82338574f23baf47bbd280"
+
+[[package]]
+name = "hashbrown"
+version = "0.17.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ed5909b6e89a2db4456e54cd5f673791d7eca6732202bbf2a9cc504fe2f9b84a"
+
+[[package]]
+name = "indexmap"
+version = "2.14.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d466e9454f08e4a911e14806c24e16fba1b4c121d1ea474396f396069cf949d9"
+dependencies = [
+ "equivalent",
+ "hashbrown",
+]
+
+[[package]]
+name = "inferno"
+version = "0.12.6"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "90807d610575744524d9bdc69f3885d96f0e6c3354565b0828354a7ff2a262b8"
+dependencies = [
+ "ahash",
+ "itoa",
+ "log",
+ "num-format",
+ "once_cell",
+ "quick-xml",
+ "rgb",
+ "str_stack",
+]
+
+[[package]]
+name = "insta"
+version = "1.48.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "86f0f8fee8c926415c58d6ae43a08523a26faccb2323f5e6b644fe7dd4ef6b82"
+dependencies = [
+ "console",
+ "once_cell",
+ "similar",
+ "tempfile",
+]
+
+[[package]]
+name = "itoa"
+version = "1.0.18"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8f42a60cbdf9a97f5d2305f08a87dc4e09308d1276d28c869c684d7777685682"
+
+[[package]]
+name = "libc"
+version = "0.2.186"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "68ab91017fe16c622486840e4c83c9a37afeff978bd239b5293d61ece587de66"
+
+[[package]]
+name = "linux-raw-sys"
+version = "0.12.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "32a66949e030da00e8c7d4434b251670a91556f4144941d37452769c25d58a53"
+
+[[package]]
+name = "log"
+version = "0.4.33"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "0ceec5bc11778974d1bcb055b18002eba7f4b3518b6a0081b3af5f21666da9ad"
+
+[[package]]
+name = "memchr"
+version = "2.8.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "88904434abc2901f197fe8cc55f0445e7ded921dba5911dad2e2b39b48e663c4"
+
+[[package]]
+name = "num-format"
+version = "0.4.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "a652d9771a63711fd3c3deb670acfbe5c30a4072e664d7a3bf5a9e1056ac72c3"
+dependencies = [
+ "arrayvec",
+ "itoa",
+]
+
+[[package]]
+name = "once_cell"
+version = "1.21.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9f7c3e4beb33f85d45ae3e3a1792185706c8e16d043238c593331cc7cd313b50"
+
+[[package]]
+name = "pin-project-lite"
+version = "0.2.17"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "a89322df9ebe1c1578d689c92318e070967d1042b512afbe49518723f4e6d5cd"
+
+[[package]]
+name = "proc-macro-crate"
+version = "3.5.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e67ba7e9b2b56446f1d419b1d807906278ffa1a658a8a5d8a39dcb1f5a78614f"
+dependencies = [
+ "toml_edit",
+]
+
+[[package]]
+name = "proc-macro2"
+version = "1.0.106"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8fd00f0bb2e90d81d1044c2b32617f68fcb9fa3bb7640c23e9c748e53fb30934"
+dependencies = [
+ "unicode-ident",
+]
+
+[[package]]
+name = "quick-xml"
+version = "0.39.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "cdcc8dd4e2f670d309a5f0e83fe36dfdc05af317008fea29144da1a2ac858e5e"
+dependencies = [
+ "memchr",
+]
+
+[[package]]
+name = "quote"
+version = "1.0.46"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "dfbc457d0c7a0759a614551b11a6409e5951f6c7537be1f1b7682b9ae9230368"
+dependencies = [
+ "proc-macro2",
+]
+
+[[package]]
+name = "r-efi"
+version = "5.3.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "69cdb34c158ceb288df11e18b4bd39de994f6657d83847bdffdbd7f346754b0f"
+
+[[package]]
+name = "r-efi"
+version = "6.0.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f8dcc9c7d52a811697d2151c701e0d08956f92b0e24136cf4cf27b57a6a0d9bf"
+
+[[package]]
+name = "regex"
+version = "1.12.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f1292b7759ae1cb9ec195452d1390a074f0cd8541ab7a5a8c31cd6db45d4a6ba"
+dependencies = [
+ "aho-corasick",
+ "memchr",
+ "regex-automata",
+ "regex-syntax",
+]
+
+[[package]]
+name = "regex-automata"
+version = "0.4.14"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6e1dd4122fc1595e8162618945476892eefca7b88c52820e74af6262213cae8f"
+dependencies = [
+ "aho-corasick",
+ "memchr",
+ "regex-syntax",
+]
+
+[[package]]
+name = "regex-syntax"
+version = "0.8.11"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d6f6ff9a378485b298a5286656da665ba74413d36db0979633275d2e708145d4"
+
+[[package]]
+name = "relative-path"
+version = "1.9.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ba39f3699c378cd8970968dcbff9c43159ea4cfbd88d43c00b22f2ef10a435d2"
+
+[[package]]
+name = "rgb"
+version = "0.8.53"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "47b34b781b31e5d73e9fbc8689c70551fd1ade9a19e3e28cfec8580a79290cc4"
+dependencies = [
+ "bytemuck",
+]
+
+[[package]]
+name = "rstest"
+version = "0.23.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "0a2c585be59b6b5dd66a9d2084aa1d8bd52fbdb806eafdeffb52791147862035"
+dependencies = [
+ "futures",
+ "futures-timer",
+ "rstest_macros",
+ "rustc_version",
+]
+
+[[package]]
+name = "rstest_macros"
+version = "0.23.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "825ea780781b15345a146be27eaefb05085e337e869bff01b4306a4fd4a9ad5a"
+dependencies = [
+ "cfg-if",
+ "glob",
+ "proc-macro-crate",
+ "proc-macro2",
+ "quote",
+ "regex",
+ "relative-path",
+ "rustc_version",
+ "syn",
+ "unicode-ident",
+]
+
+[[package]]
+name = "rustc_version"
+version = "0.4.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "cfcb3a22ef46e85b45de6ee7e79d063319ebb6594faafcf1c225ea92ab6e9b92"
+dependencies = [
+ "semver",
+]
+
+[[package]]
+name = "rustix"
+version = "1.1.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b6fe4565b9518b83ef4f91bb47ce29620ca828bd32cb7e408f0062e9930ba190"
+dependencies = [
+ "bitflags",
+ "errno",
+ "libc",
+ "linux-raw-sys",
+ "windows-sys",
+]
+
+[[package]]
+name = "semver"
+version = "1.0.28"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8a7852d02fc848982e0c167ef163aaff9cd91dc640ba85e263cb1ce46fae51cd"
+
+[[package]]
+name = "serde"
+version = "1.0.228"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9a8e94ea7f378bd32cbbd37198a4a91436180c5bb472411e48b5ec2e2124ae9e"
+dependencies = [
+ "serde_core",
+ "serde_derive",
+]
+
+[[package]]
+name = "serde_core"
+version = "1.0.228"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "41d385c7d4ca58e59fc732af25c3983b67ac852c1a25000afe1175de458b67ad"
+dependencies = [
+ "serde_derive",
+]
+
+[[package]]
+name = "serde_derive"
+version = "1.0.228"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d540f220d3187173da220f885ab66608367b6574e925011a9353e4badda91d79"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "serde_json"
+version = "1.0.150"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e8014e44b4736ed0538adeecded0fce2a272f22dc9578a7eb6b2d9993c74cfb9"
+dependencies = [
+ "itoa",
+ "memchr",
+ "serde",
+ "serde_core",
+ "zmij",
+]
+
+[[package]]
+name = "similar"
+version = "2.7.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "bbbb5d9659141646ae647b42fe094daf6c6192d1620870b449d9557f748b2daa"
+
+[[package]]
+name = "slab"
+version = "0.4.12"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "0c790de23124f9ab44544d7ac05d60440adc586479ce501c1d6d7da3cd8c9cf5"
+
+[[package]]
+name = "str_stack"
+version = "0.1.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7f446288b699d66d0fd2e30d1cfe7869194312524b3b9252594868ed26ef056a"
+
+[[package]]
+name = "syn"
+version = "2.0.118"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "1b9ae57f904213ebb649ce6895b8a66c66f0203b9319718f69a5612a065b1422"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "unicode-ident",
+]
+
+[[package]]
+name = "tempfile"
+version = "3.27.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "32497e9a4c7b38532efcdebeef879707aa9f794296a4f0244f6f69e9bc8574bd"
+dependencies = [
+ "fastrand",
+ "getrandom 0.4.3",
+ "once_cell",
+ "rustix",
+ "windows-sys",
+]
+
+[[package]]
+name = "thiserror"
+version = "2.0.18"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "4288b5bcbc7920c07a1149a35cf9590a2aa808e0bc1eafaade0b80947865fbc4"
+dependencies = [
+ "thiserror-impl",
+]
+
+[[package]]
+name = "thiserror-impl"
+version = "2.0.18"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ebc4ee7f67670e9b64d05fa4253e753e016c6c95ff35b89b7941d6b856dec1d5"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "toml_datetime"
+version = "1.1.1+spec-1.1.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "3165f65f62e28e0115a00b2ebdd37eb6f3b641855f9d636d3cd4103767159ad7"
+dependencies = [
+ "serde_core",
+]
+
+[[package]]
+name = "toml_edit"
+version = "0.25.12+spec-1.1.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d2153edc6955a6c354fad8f5efd38b6a8769bdccf9fe50f8e1329f81b0baa5d7"
+dependencies = [
+ "indexmap",
+ "toml_datetime",
+ "toml_parser",
+ "winnow",
+]
+
+[[package]]
+name = "toml_parser"
+version = "1.1.2+spec-1.1.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "a2abe9b86193656635d2411dc43050282ca48aa31c2451210f4202550afb7526"
+dependencies = [
+ "winnow",
+]
+
+[[package]]
+name = "unicode-ident"
+version = "1.0.24"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e6e4313cd5fcd3dad5cafa179702e2b244f760991f45397d14d4ebf38247da75"
+
+[[package]]
+name = "version_check"
+version = "0.9.5"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "0b928f33d975fc6ad9f86c8f283853ad26bdd5b10b7f1542aa2fa15e2289105a"
+
+[[package]]
+name = "wasip2"
+version = "1.0.4+wasi-0.2.12"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b67efb37e106e55ce722a510d6b5f9c17f083e5fc79afc2badeb12cc313d9487"
+dependencies = [
+ "wit-bindgen",
+]
+
+[[package]]
+name = "windows-link"
+version = "0.2.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f0805222e57f7521d6a62e36fa9163bc891acd422f971defe97d64e70d0a4fe5"
+
+[[package]]
+name = "windows-sys"
+version = "0.61.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ae137229bcbd6cdf0f7b80a31df61766145077ddf49416a728b02cb3921ff3fc"
+dependencies = [
+ "windows-link",
+]
+
+[[package]]
+name = "winnow"
+version = "1.0.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "0592e1c9d151f854e6fd382574c3a0855250e1d9b2f99d9281c6e6391af352f1"
+dependencies = [
+ "memchr",
+]
+
+[[package]]
+name = "wit-bindgen"
+version = "0.57.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "1ebf944e87a7c253233ad6766e082e3cd714b5d03812acc24c318f549614536e"
+
+[[package]]
+name = "zerocopy"
+version = "0.8.52"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ce1022995ff5ff5d841ad7d994facc23098cd40152f2c1d11cd607c6f530653f"
+dependencies = [
+ "zerocopy-derive",
+]
+
+[[package]]
+name = "zerocopy-derive"
+version = "0.8.52"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "1ae7f38b72ec2a254e2b87ef277cf2cd4fb97cbebf944faa6f33354da0867930"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "zmij"
+version = "1.0.21"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b8848ee67ecc8aedbaf3e4122217aff892639231befc6a1b58d29fff4c2cabaa"
diff --git a/callgrind-utils/Cargo.toml b/callgrind-utils/Cargo.toml
new file mode 100644
index 000000000..19c2f50b9
--- /dev/null
+++ b/callgrind-utils/Cargo.toml
@@ -0,0 +1,14 @@
+[package]
+name = "callgrind-utils"
+version = "0.1.0"
+edition = "2024"
+
+[dependencies]
+inferno = { version = "0.12.6", default-features = false }
+serde = { version = "1", features = ["derive"] }
+serde_json = "1"
+thiserror = "2"
+
+[dev-dependencies]
+insta = "1"
+rstest = "0.23"
diff --git a/callgrind-utils/build.rs b/callgrind-utils/build.rs
new file mode 100644
index 000000000..5f5a67dde
--- /dev/null
+++ b/callgrind-utils/build.rs
@@ -0,0 +1,116 @@
+//! Ensures the in-repo Callgrind (`../vg-in-place`) is built before the tests
+//! that shell out to it run.
+//!
+//! The build is incremental: `make` is timestamp-driven, so this is a few
+//! seconds when the tree is already current and only does real work when the
+//! Callgrind sources change. Build order matters: VEX -> coregrind -> callgrind.
+
+use std::env;
+use std::path::{Path, PathBuf};
+use std::process::Command;
+
+fn main() {
+    let repo = PathBuf::from(env::var("CARGO_MANIFEST_DIR").expect("CARGO_MANIFEST_DIR"))
+        .parent()
+        .expect("crate has a parent directory")
+        .to_path_buf();
+
+    track_sources(&repo);
+
+    // Callgrind names its tool binary after the target: callgrind-<arch>-linux.
+    let arch = match env::consts::ARCH {
+        "x86_64" => "amd64",
+        "aarch64" => "arm64",
+        other => panic!("unsupported arch for the Callgrind build: {other}"),
+    };
+
+    configure_if_needed(&repo);
+    build(&repo);
+    assert_artifacts(&repo, arch);
+}
+
+/// Rebuild when a hand-written Callgrind source changes. Only the top-level
+/// `callgrind/*.c` / `*.h` are tracked: the `tests/` subdir accumulates
+/// `callgrind.out.*` / `vgcore.*` on every run, which would otherwise
+/// re-trigger this build on each test invocation.
+fn track_sources(repo: &Path) {
+    println!("cargo:rerun-if-changed=build.rs");
+    println!(
+        "cargo:rerun-if-changed={}",
+        repo.join("configure").display()
+    );
+
+    let cg = repo.join("callgrind");
+    let entries = std::fs::read_dir(&cg).unwrap_or_else(|e| panic!("read {}: {e}", cg.display()));
+    for path in entries.flatten().map(|e| e.path()) {
+        if matches!(path.extension().and_then(|e| e.to_str()), Some("c" | "h")) {
+            println!("cargo:rerun-if-changed={}", path.display());
+        }
+    }
+}
+
+/// `configure` is checked in, so this only runs on a pristine tree. Callgrind
+/// cycle estimation needs Capstone; `nix develop` exports `CAPSTONE_DIR`, which
+/// `configure` picks up. Fail loudly if it is missing rather than emitting a
+/// cryptic configure error.
+fn configure_if_needed(repo: &Path) {
+    if repo.join("Makefile").is_file() {
+        return;
+    }
+
+    assert!(
+        env::var_os("CAPSTONE_DIR").is_some(),
+        "valgrind-codspeed is not configured and CAPSTONE_DIR is unset.\n\
+         Build from inside `nix develop` (which exports CAPSTONE_DIR), or configure\n\
+         manually: ./configure --enable-only64bit --with-capstone=PATH"
+    );
+
+    run(Command::new("./configure")
+        .arg("--enable-only64bit")
+        .current_dir(repo));
+}
+
+fn build(repo: &Path) {
+    let jobs = format!(
+        "-j{}",
+        std::thread::available_parallelism()
+            .map(|n| n.get())
+            .unwrap_or(1)
+    );
+
+    run(Command::new("make")
+        .arg("include/vgversion.h")
+        .current_dir(repo));
+    for dir in ["VEX", "coregrind", "callgrind"] {
+        run(Command::new("make")
+            .arg(&jobs)
+            .arg("-C")
+            .arg(dir)
+            .current_dir(repo));
+    }
+}
+
+/// The three artifacts `vg-in-place` execs: the launcher, the tool, and the
+/// `.in_place` symlink the launcher resolves via `VALGRIND_LIB`.
+fn assert_artifacts(repo: &Path, arch: &str) {
+    let tool = format!("callgrind-{arch}-linux");
+    for path in [
+        repo.join("coregrind/valgrind"),
+        repo.join("callgrind").join(&tool),
+        repo.join(".in_place").join(&tool),
+    ] {
+        assert!(
+            path.exists(),
+            "expected build artifact missing after make: {}",
+            path.display()
+        );
+    }
+}
+
+fn run(cmd: &mut Command) {
+    let shown = format!("{cmd:?}");
+    let status = cmd
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn {shown}: {e}"));
+    assert!(status.success(), "command failed ({status}): {shown}");
+}
diff --git a/callgrind-utils/src/error.rs b/callgrind-utils/src/error.rs
new file mode 100644
index 000000000..83943f51a
--- /dev/null
+++ b/callgrind-utils/src/error.rs
@@ -0,0 +1,34 @@
+use thiserror::Error;
+
+/// Errors raised while parsing a Callgrind `.out` file.
+#[derive(Debug, Error)]
+pub enum ParseError {
+    #[error("I/O error: {0}")]
+    Io(#[from] std::io::Error),
+    #[error("bad id: {0}")]
+    BadId(#[from] std::num::ParseIntError),
+    #[error("call record missing required cfn=")]
+    MissingCfn,
+    #[error("unexpected end of input")]
+    UnexpectedEof,
+}
+
+/// Errors raised while serializing a `CallGraph` to JSON.
+#[derive(Debug, Error)]
+pub enum ToJsonError {
+    #[error("serde error: {0}")]
+    Serde(#[from] serde_json::Error),
+    #[error("I/O error: {0}")]
+    Io(#[from] std::io::Error),
+}
+
+/// Errors raised while rendering a `CallGraph` to a flamegraph SVG.
+#[derive(Debug, Error)]
+pub enum FlamegraphError {
+    #[error("the graph carries no cost data (all self/inclusive costs are zero)")]
+    NoCost,
+    #[error("inferno flamegraph error: {0}")]
+    Inferno(String),
+    #[error("I/O error: {0}")]
+    Io(#[from] std::io::Error),
+}
diff --git a/callgrind-utils/src/flamegraph.rs b/callgrind-utils/src/flamegraph.rs
new file mode 100644
index 000000000..2278700a7
--- /dev/null
+++ b/callgrind-utils/src/flamegraph.rs
@@ -0,0 +1,179 @@
+use inferno::flamegraph::{self, Options};
+
+use super::{error::FlamegraphError, model::CallGraph};
+
+const MIN_BUDGET: f64 = 1.0;
+const MIN_BUDGET_FRACTION: f64 = 0.0005;
+
+impl CallGraph {
+    pub fn to_folded_without_costs(&self) -> Vec<String> {
+        self.to_folded()
+            .iter()
+            .map(|line| {
+                // Split on the LAST space: function names may contain spaces.
+                let stack = line.rsplit_once(' ').map_or(line.as_str(), |(s, _)| s);
+                format!("{stack} <cost>")
+            })
+            .collect()
+    }
+
+    pub fn to_folded(&self) -> Vec<String> {
+        let nodes = self.nodes();
+        let n = nodes.len();
+        let names: Vec<&str> = nodes.iter().map(|node| node.function.as_str()).collect();
+        let self_costs: Vec<u64> = (0..n).map(|i| self.self_cost(i)).collect();
+
+        let mut out = vec![Vec::<(usize, u64)>::new(); n];
+        let mut incoming_incl = vec![0u64; n];
+        for edge in self.edges() {
+            let Some(caller) = self.node_index(&edge.caller) else {
+                continue;
+            };
+            let Some(callee) = self.node_index(&edge.callee) else {
+                continue;
+            };
+            let inclusive_cost = edge.inclusive_cost.unwrap_or(0);
+            out[caller].push((callee, inclusive_cost));
+            incoming_incl[callee] += inclusive_cost;
+        }
+
+        let incl: Vec<u64> = (0..n)
+            .map(|i| self_costs[i] + out[i].iter().map(|(_, cost)| *cost).sum::<u64>())
+            .collect();
+
+        let roots = roots(&incoming_incl, &incl);
+        let total: f64 = roots.iter().map(|(_, budget)| *budget).sum();
+        let min_budget = MIN_BUDGET.max(total * MIN_BUDGET_FRACTION);
+
+        let mut lines = Vec::new();
+        let mut stack = Vec::new();
+        let mut on_path = vec![false; n];
+        for (root, budget) in roots {
+            fold_dfs(
+                root,
+                budget,
+                min_budget,
+                &mut stack,
+                &mut on_path,
+                &out,
+                &incl,
+                &self_costs,
+                &names,
+                &mut lines,
+            );
+        }
+        lines
+    }
+
+    pub fn to_flamegraph(&self) -> Result<String, FlamegraphError> {
+        let lines = self.to_folded();
+        if lines.is_empty() {
+            return Err(FlamegraphError::NoCost);
+        }
+
+        let mut opts = Options::default();
+        opts.title = "Callgrind".to_string();
+        opts.count_name = "instructions".to_string();
+
+        let mut svg = Vec::new();
+        flamegraph::from_lines(&mut opts, lines.iter().map(String::as_str), &mut svg)
+            .map_err(|e| FlamegraphError::Inferno(e.to_string()))?;
+        String::from_utf8(svg).map_err(|e| FlamegraphError::Inferno(e.to_string()))
+    }
+
+    pub fn to_flamegraph_file(
+        &self,
+        path: impl AsRef<std::path::Path>,
+    ) -> Result<(), FlamegraphError> {
+        let svg = self.to_flamegraph()?;
+        std::fs::write(path, svg)?;
+        Ok(())
+    }
+}
+
+fn roots(incoming_incl: &[u64], incl: &[u64]) -> Vec<(usize, f64)> {
+    let roots: Vec<(usize, f64)> = (0..incl.len())
+        .filter_map(|i| {
+            let uncovered = incl[i].saturating_sub(incoming_incl[i]);
+            (uncovered > 0).then_some((i, uncovered as f64))
+        })
+        .collect();
+    if !roots.is_empty() {
+        return roots;
+    }
+    (0..incl.len())
+        .filter(|&i| incl[i] > 0)
+        .max_by_key(|&i| incl[i])
+        .map(|i| (i, incl[i] as f64))
+        .into_iter()
+        .collect()
+}
+
+#[allow(clippy::too_many_arguments)]
+fn fold_dfs(
+    node: usize,
+    budget: f64,
+    min_budget: f64,
+    stack: &mut Vec<usize>,
+    on_path: &mut [bool],
+    out: &[Vec<(usize, u64)>],
+    incl: &[u64],
+    self_costs: &[u64],
+    names: &[&str],
+    lines: &mut Vec<String>,
+) {
+    let should_prune = budget < min_budget || incl[node] == 0;
+    if should_prune {
+        return;
+    }
+
+    stack.push(node);
+    on_path[node] = true;
+
+    let frac = (budget / incl[node] as f64).min(1.0);
+    let self_here = (self_costs[node] as f64 * frac).round() as u64;
+    if self_here >= 1 {
+        lines.push(fold_line(stack, names, self_here));
+    }
+
+    for &(child, edge_incl) in &out[node] {
+        let child_budget = edge_incl as f64 * frac;
+        if !on_path[child] {
+            fold_dfs(
+                child,
+                child_budget,
+                min_budget,
+                stack,
+                on_path,
+                out,
+                incl,
+                self_costs,
+                names,
+                lines,
+            );
+            continue;
+        }
+        let recursive = child_budget.round() as u64;
+        if recursive >= 1 {
+            stack.push(child);
+            lines.push(fold_line(stack, names, recursive));
+            stack.pop();
+        }
+    }
+
+    on_path[node] = false;
+    stack.pop();
+}
+
+fn fold_line(stack: &[usize], names: &[&str], count: u64) -> String {
+    let mut line = String::new();
+    for (i, &idx) in stack.iter().enumerate() {
+        if i > 0 {
+            line.push(';');
+        }
+        line.push_str(names[idx]);
+    }
+    line.push(' ');
+    line.push_str(&count.to_string());
+    line
+}
diff --git a/callgrind-utils/src/lib.rs b/callgrind-utils/src/lib.rs
new file mode 100644
index 000000000..719c21a32
--- /dev/null
+++ b/callgrind-utils/src/lib.rs
@@ -0,0 +1,6 @@
+pub mod error;
+pub mod flamegraph;
+pub mod model;
+pub mod parser;
+mod redact;
+pub mod serialize;
diff --git a/callgrind-utils/src/model.rs b/callgrind-utils/src/model.rs
new file mode 100644
index 000000000..f0297f558
--- /dev/null
+++ b/callgrind-utils/src/model.rs
@@ -0,0 +1,161 @@
+use std::collections::HashMap;
+
+use serde::Serialize;
+
+/// A call-graph node: a single function identity.
+///
+/// Node identity is the full `(object, file, function)` tuple, so two
+/// statics that share a name but live in different objects/files are
+/// distinct nodes (no false merge).
+#[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize)]
+pub struct Node {
+    pub function: String,
+    pub file: String,
+    pub object: String,
+}
+
+/// A directed call edge: `caller` calls `callee`, optionally annotated
+/// with an observed `call_count` and the callee subtree's `inclusive_cost`
+/// (first event column, e.g. `Ir`) as invoked through this edge.
+///
+/// `Edge` deliberately does NOT derive `Serialize`: the canonical JSON
+/// view references nodes by index, not by value. See `serialize::EdgeJson`.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct Edge {
+    pub caller: Node,
+    pub callee: Node,
+    pub call_count: Option<u64>,
+    pub inclusive_cost: Option<u64>,
+}
+
+/// Tunables for `.out` parsing.
+#[derive(Debug, Clone)]
+pub struct ParseOptions {
+    /// When true, file/object paths are reduced to their basename and
+    /// Callgrind-style unknowns (empty or `???`) collapse to `unknown`.
+    pub normalize_paths: bool,
+    /// Sentinel substituted for absent/unknown object or file names.
+    pub unknown: String,
+}
+
+impl Default for ParseOptions {
+    fn default() -> Self {
+        Self {
+            normalize_paths: true,
+            unknown: "???".to_string(),
+        }
+    }
+}
+
+/// The parsed call graph: sorted, deduplicated nodes and edges.
+///
+/// Fields are `pub(crate)` so the sibling `parser` and `serialize`
+/// modules can materialize/consume them without exposing them publicly.
+pub struct CallGraph {
+    pub(crate) nodes: Vec<Node>,
+    pub(crate) edges: Vec<Edge>,
+    /// Self cost (first event column, e.g. `Ir`) per node, aligned index-for-index
+    /// with `nodes`. Zero for nodes that carried no self-cost lines.
+    pub(crate) self_costs: Vec<u64>,
+}
+
+impl CallGraph {
+    /// Borrow the sorted node list.
+    pub fn nodes(&self) -> &[Node] {
+        &self.nodes
+    }
+
+    /// Borrow the sorted, deduplicated edge list.
+    pub fn edges(&self) -> &[Edge] {
+        &self.edges
+    }
+
+    /// Self cost of the node at `index` (first event column). Zero if absent.
+    pub fn self_cost(&self, index: usize) -> u64 {
+        self.self_costs.get(index).copied().unwrap_or(0)
+    }
+
+    /// Construct a `CallGraph` from raw parsed material.
+    ///
+    /// Nodes are sorted by `(object, file, function)` and de-duplicated.
+    /// Edges are sorted by `(caller_idx, callee_idx)` using the sorted
+    /// node order, then de-duplicated by `(caller, callee)`, aggregating
+    /// `call_count` and `inclusive_cost` across duplicates (sum when both
+    /// sides are `Some`; otherwise the running value is left unchanged).
+    ///
+    /// `self_costs` maps each node identity to its accumulated self cost; it
+    /// is projected onto the sorted node order (missing entries become 0).
+    pub(crate) fn from_parts(
+        mut nodes: Vec<Node>,
+        mut edges: Vec<Edge>,
+        self_costs: HashMap<Node, u64>,
+    ) -> Self {
+        nodes.sort_by(|a, b| {
+            a.object
+                .cmp(&b.object)
+                .then_with(|| a.file.cmp(&b.file))
+                .then_with(|| a.function.cmp(&b.function))
+        });
+        nodes.dedup();
+
+        // Index lookup for stable node ordering of edges.
+        let mut index: HashMap<&Node, usize> = HashMap::with_capacity(nodes.len());
+        for (i, n) in nodes.iter().enumerate() {
+            index.insert(n, i);
+        }
+
+        let edge_rank = |e: &Edge| {
+            (
+                index.get(&e.caller).copied().unwrap_or(usize::MAX),
+                index.get(&e.callee).copied().unwrap_or(usize::MAX),
+            )
+        };
+        edges.sort_by_key(edge_rank);
+
+        // Merge adjacent (now grouped) duplicates, aggregating counts and costs.
+        let mut deduped: Vec<Edge> = Vec::with_capacity(edges.len());
+        for e in edges {
+            // Same (caller, callee) as the previous edge: fold into it. A sum is
+            // taken only when both sides are `Some`; otherwise `last` is unchanged.
+            if let Some(last) = deduped.last_mut()
+                && last.caller == e.caller
+                && last.callee == e.callee
+            {
+                if let (Some(a), Some(b)) = (last.call_count, e.call_count) {
+                    last.call_count = Some(a + b);
+                }
+                if let (Some(a), Some(b)) = (last.inclusive_cost, e.inclusive_cost) {
+                    last.inclusive_cost = Some(a + b);
+                }
+                continue;
+            }
+            deduped.push(e);
+        }
+
+        let node_self_costs: Vec<u64> = nodes
+            .iter()
+            .map(|n| self_costs.get(n).copied().unwrap_or(0))
+            .collect();
+
+        Self {
+            nodes,
+            edges: deduped,
+            self_costs: node_self_costs,
+        }
+    }
+
+    /// Index of `n` within the sorted node list, or `None` if absent.
+    ///
+    /// Uses binary search over the `(object, file, function)` ordering
+    /// established by `from_parts`.
+    pub(crate) fn node_index(&self, n: &Node) -> Option<usize> {
+        self.nodes
+            .binary_search_by(|x| {
+                x.object
+                    .cmp(&n.object)
+                    .then_with(|| x.file.cmp(&n.file))
+                    .then_with(|| x.function.cmp(&n.function))
+            })
+            .ok()
+    }
+}
diff --git a/callgrind-utils/src/parser/mod.rs b/callgrind-utils/src/parser/mod.rs
new file mode 100644
index 000000000..6e1264af2
--- /dev/null
+++ b/callgrind-utils/src/parser/mod.rs
@@ -0,0 +1,388 @@
+use std::collections::HashMap;
+
+use crate::{
+    error::ParseError,
+    model::{CallGraph, Edge, Node, ParseOptions},
+};
+
+mod normalize;
+
+/// Header/auxiliary keys that carry no call-graph topology and are dropped
+/// outright. `part`/`thread` are handled separately (context boundaries),
+/// not here. `cfni` is an inline-function annotation, not a callee spec.
+const SKIP_KEYS: &[&str] = &[
+    "version",
+    "creator",
+    "pid",
+    "cmd",
+    "desc",
+    "positions",
+    "events",
+    "event",
+    "summary",
+    "totals",
+    "rec",
+    "jfi",
+    "jfn",
+    "frfn",
+    "cfni",
+    "jump",
+    "jcnd",
+];
+
+impl CallGraph {
+    /// Parse a Callgrind `.out` stream into a call graph.
+    ///
+    /// The format is line-oriented (see `callgrind/docs/cl-format.xml`). We
+    /// track three independent name-compression ID spaces (functions, files,
+    /// objects), the current caller context, and a pending callee record.
+    /// An edge is emitted only when a `calls=` line closes a record that has a
+    /// pending `cfn=`; a bare `cfn=` is callee context that gets discarded.
+    pub fn parse(reader: impl std::io::BufRead) -> Result<Self, ParseError> {
+        Self::parse_with(reader, &ParseOptions::default())
+    }
+
+    /// Parse with explicit [`ParseOptions`] (e.g. to disable path normalization).
+    pub fn parse_with(
+        reader: impl std::io::BufRead,
+        opts: &ParseOptions,
+    ) -> Result<Self, ParseError> {
+        // Three SEPARATE name-compression ID spaces.
+        let mut fn_ids: HashMap<u32, String> = HashMap::new();
+        let mut file_ids: HashMap<u32, String> = HashMap::new();
+        let mut obj_ids: HashMap<u32, String> = HashMap::new();
+
+        // Current caller context.
+        let mut cur_obj: Option<String> = None;
+        let mut cur_fl: Option<String> = None; // the function's own file (`fl=`)
+        let mut cur_pos_file: Option<String> = None; // current position file (`fl`/`fi`/`fe`)
+        let mut cur_fn: Option<String> = None;
+
+        // Pending callee record, built from `cob`/`cfi`/`cfl`/`cfn`.
+        let mut pend_cob: Option<String> = None;
+        let mut pend_cfi: Option<String> = None;
+        let mut pend_cfn: Option<String> = None;
+
+        let mut nodes: Vec<Node> = Vec::new();
+        let mut edges: Vec<Edge> = Vec::new();
+
+        // Self cost (first event column) accumulated per function-node identity.
+        let mut self_costs: HashMap<Node, u64> = HashMap::new();
+
+        // Cost-line layout, learned from the `positions:`/`events:` headers.
+        // A cost line has exactly `num_positions + num_events` tokens; the first
+        // event value lives at token index `num_positions`.
+        let mut num_positions: usize = 1;
+        let mut num_events: usize = 1;
+
+        // Index of the edge whose inclusive cost the NEXT cost line supplies.
+        // Set right after a `calls=` line, consumed by that call's cost line.
+        let mut expect_call_cost: Option<usize> = None;
+
+        for line in reader.lines() {
+            let line = line?; // io error -> ParseError::Io (#[from])
+            let trimmed = line.trim_start();
+
+            // Blank lines and comments carry nothing.
+            if trimmed.is_empty() || trimmed.starts_with('#') {
+                continue;
+            }
+
+            let key = line_key(trimmed);
+
+            // Cost-line layout headers. `positions: line` / `positions: instr line`
+            // fixes the leading position-column count; `events: Ir Cy ...` fixes the
+            // event-column count. The flamegraph weight is the FIRST event column.
+            if key == "positions" {
+                num_positions = header_token_count(trimmed, key).max(1);
+                continue;
+            }
+            if key == "events" {
+                num_events = header_token_count(trimmed, key).max(1);
+                continue;
+            }
+
+            // `part:`/`thread:` separators bound a record: clear the pending
+            // callee, but keep the ID maps and caller context (IDs persist
+            // across parts; parts/threads are always merged into one graph).
+            if key == "part" || key == "thread" {
+                pend_cob = None;
+                pend_cfi = None;
+                pend_cfn = None;
+                expect_call_cost = None;
+                continue;
+            }
+
+            // Header/auxiliary lines carry no topology. Body-level skips
+            // (`jump`/`jcnd`/`jfi`/`jfn`/`cfni`/`frfn`) must ALSO close any open
+            // call record, so a bare `cfn=` cannot survive across them and
+            // poison a later `calls=`. Clearing when nothing is pending is a
+            // harmless no-op for true header lines.
+            if SKIP_KEYS.contains(&key) {
+                pend_cob = None;
+                pend_cfi = None;
+                pend_cfn = None;
+                expect_call_cost = None;
+                continue;
+            }
+
+            // Position specs and `calls` are `key=value`; a colon-separated
+            // (`ob:`) or bare token is a header/cost/unknown line, never a spec.
+            let assign = trimmed.as_bytes().get(key.len()) == Some(&b'=');
+
+            // A `calls=` line closes a call record and emits the edge.
+            if key == "calls" && assign {
+                if let Some(cfn) = pend_cfn.take() {
+                    let rhs = &trimmed[key.len() + 1..];
+                    let call_count = parse_call_count(rhs);
+
+                    // Caller file is the function's own `fl` (cur_fl), NEVER the
+                    // current position file: an inline `fi=`/`fe=` transition
+                    // moves the callee context but not the caller's identity.
+                    let caller = make_node(
+                        cur_fn.as_deref(),
+                        cur_fl.as_deref(),
+                        cur_obj.as_deref(),
+                        opts,
+                    );
+                    // Callee inherits the current position file (which may be an
+                    // inline `fi`/`fe` file) and the caller object unless the
+                    // record overrode them with `cfi`/`cfl`/`cob`.
+                    let callee_file = pend_cfi.as_deref().or(cur_pos_file.as_deref());
+                    let callee_obj = pend_cob.as_deref().or(cur_obj.as_deref());
+                    let callee = make_node(Some(cfn.as_str()), callee_file, callee_obj, opts);
+
+                    nodes.push(caller.clone());
+                    nodes.push(callee.clone());
+                    edges.push(Edge {
+                        caller,
+                        callee,
+                        call_count,
+                        inclusive_cost: None,
+                    });
+                    // The next cost line carries this call's inclusive cost.
+                    expect_call_cost = Some(edges.len() - 1);
+                }
+                // Whether or not an edge was emitted, the record is closed.
+                pend_cob = None;
+                pend_cfi = None;
+                continue;
+            }
+
+            // Lines lacking an `=` after the key (colon headers like `ob:`, bare
+            // tokens, and cost/address lines) are never specs or calls. A cost
+            // line is credited below; every such line also closes any open call
+            // record, so a bare `cfn=` cannot poison a later `calls=`.
+            if !assign {
+                let cost = parse_cost_value(trimmed, num_positions, num_events);
+                match (cost, expect_call_cost.take()) {
+                    // The cost line immediately following a `calls=`: inclusive
+                    // cost of that call's callee subtree.
+                    (Some(c), Some(edge_idx)) => {
+                        edges[edge_idx].inclusive_cost = Some(c);
+                    }
+                    // A body cost line of the current function: self cost.
+                    (Some(c), None) => {
+                        if let Some(f) = cur_fn.as_deref() {
+                            let node =
+                                make_node(Some(f), cur_fl.as_deref(), cur_obj.as_deref(), opts);
+                            *self_costs.entry(node).or_insert(0) += c;
+                        }
+                    }
+                    // Not a cost line (colon header / bare token).
+                    (None, _) => {}
+                }
+                pend_cob = None;
+                pend_cfi = None;
+                pend_cfn = None;
+                continue;
+            }
+
+            // Recognized position specs dispatch below; an unknown `key=value`
+            // falls to the `_` arm, which also closes the record. A spec line
+            // means the call's cost line (if any) has passed.
+            expect_call_cost = None;
+            match key {
+                "ob" => {
+                    let x = parse_pos_name(rhs_of(trimmed, key), &mut obj_ids)?;
+                    cur_obj = Some(x);
+                    pend_cob = None;
+                    pend_cfi = None;
+                    pend_cfn = None;
+                }
+                "fl" => {
+                    let x = parse_pos_name(rhs_of(trimmed, key), &mut file_ids)?;
+                    cur_fl = Some(x.clone());
+                    cur_pos_file = Some(x);
+                    pend_cob = None;
+                    pend_cfi = None;
+                    pend_cfn = None;
+                }
+                "fi" | "fe" => {
+                    // Inline-file transition: moves the position file only, not
+                    // the function's own `fl`.
+                    let x = parse_pos_name(rhs_of(trimmed, key), &mut file_ids)?;
+                    cur_pos_file = Some(x);
+                    pend_cob = None;
+                    pend_cfi = None;
+                    pend_cfn = None;
+                }
+                "fn" => {
+                    let x = parse_pos_name(rhs_of(trimmed, key), &mut fn_ids)?;
+                    cur_fn = Some(x);
+                    pend_cob = None;
+                    pend_cfi = None;
+                    pend_cfn = None;
+                }
+                "cob" => {
+                    let x = parse_pos_name(rhs_of(trimmed, key), &mut obj_ids)?;
+                    pend_cob = Some(x);
+                }
+                "cfi" | "cfl" => {
+                    // `cfl` is the historical alias of `cfi`; identical meaning.
+                    let x = parse_pos_name(rhs_of(trimmed, key), &mut file_ids)?;
+                    pend_cfi = Some(x);
+                }
+                "cfn" => {
+                    // Do NOT clear pend_cob/pend_cfi: they legitimately precede
+                    // cfn within the same call record.
+                    let x = parse_pos_name(rhs_of(trimmed, key), &mut fn_ids)?;
+                    pend_cfn = Some(x);
+                }
+                _ => {
+                    // An unrecognized `key=value` line closes any dangling callee
+                    // context (cost lines never reach here; they lack an `=`).
+                    pend_cob = None;
+                    pend_cfi = None;
+                    pend_cfn = None;
+                }
+            }
+        }
+
+        // Nothing to flush at EOF: a bare trailing `cfn=` is discarded.
+        Ok(CallGraph::from_parts(nodes, edges, self_costs))
+    }
+}
+
+/// Count the whitespace-separated tokens in a `positions:`/`events:` header
+/// value (everything after the `key:` prefix). `positions: instr line` -> 2.
+fn header_token_count(trimmed: &str, key: &str) -> usize {
+    trimmed[key.len()..]
+        .trim_start_matches([':', '='])
+        .split_whitespace()
+        .count()
+}
+
+/// First event value of a cost line, or `None` if `trimmed` is not one.
+///
+/// A cost line is `num_positions` position tokens followed by 1..=`num_events`
+/// event counts; Callgrind omits trailing zero counts, so the value list is
+/// variable-length. The first event column (`Ir`, token index `num_positions`)
+/// is returned. Requiring the leading tokens to be position-like (line/instr,
+/// possibly `+N`/`-N`/`*`/`0x..`) plus a decimal first value rejects colon
+/// headers and bare tokens that also lack an `=`.
+fn parse_cost_value(trimmed: &str, num_positions: usize, num_events: usize) -> Option<u64> {
+    let tokens: Vec<&str> = trimmed.split_whitespace().collect();
+    let has_valid_token_count =
+        tokens.len() > num_positions && tokens.len() <= num_positions + num_events;
+    if !has_valid_token_count {
+        return None;
+    }
+    if !tokens[..num_positions].iter().all(|t| is_position_token(t)) {
+        return None;
+    }
+    tokens[num_positions].parse::<u64>().ok()
+}
+
+/// Whether `tok` is a Callgrind position/subposition token: `*` (repeat), an
+/// absolute decimal or `0x` address, or a `+N`/`-N` relative offset.
+fn is_position_token(tok: &str) -> bool {
+    if tok == "*" {
+        return true;
+    }
+    if let Some(hex) = tok.strip_prefix("0x").or_else(|| tok.strip_prefix("0X")) {
+        return !hex.is_empty() && hex.bytes().all(|b| b.is_ascii_hexdigit());
+    }
+    let digits = tok.strip_prefix(['+', '-']).unwrap_or(tok);
+    !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit())
+}
+
+/// The leading token of `line`: everything up to the first `=`, `:`, or
+/// whitespace. For `fn=(1) main` this is `"fn"`; for `0x401000 4`, `"0x401000"`.
+fn line_key(line: &str) -> &str {
+    let end = line
+        .find(|c: char| c == '=' || c == ':' || c.is_whitespace())
+        .unwrap_or(line.len());
+    &line[..end]
+}
+
+/// The value after `key=` in a position-spec line. Callers only invoke this for
+/// keys known to be followed by `=`, so the separator byte is skipped directly.
+fn rhs_of<'a>(trimmed: &'a str, key: &str) -> &'a str {
+    &trimmed[key.len() + 1..]
+}
+
+/// Resolve a name-compression RHS against its ID map.
+///
+/// `(N) name` defines ID `N` -> `name` and returns the name; `(N)` references a
+/// previously defined ID; a bare `name` (compression off) is returned verbatim
+/// and never touches the map.
+fn parse_pos_name(rhs: &str, map: &mut HashMap<u32, String>) -> Result<String, ParseError> {
+    let rhs = rhs.trim_start();
+    let Some(after_paren) = rhs.strip_prefix('(') else {
+        // Compression off: literal name.
+        return Ok(rhs.trim().to_owned());
+    };
+
+    // The entire substring before `)` is the numeric ID; everything after it
+    // (split on the FIRST `)`, so names may themselves contain `)`) is the
+    // optional name. An unterminated `(N` treats the remainder as the ID.
+    let (num, rest) = after_paren.split_once(')').unwrap_or((after_paren, ""));
+    let id: u32 = num.trim().parse()?; // non-numeric/empty id -> ParseError::BadId
+    let name = rest.trim();
+
+    if name.is_empty() {
+        // Reference: resolve the prior definition (empty if unknown; the
+        // normalizer maps empties to opts.unknown for files/objects).
+        Ok(map.get(&id).cloned().unwrap_or_default())
+    } else {
+        map.insert(id, name.to_owned());
+        Ok(name.to_owned())
+    }
+}
+
+/// First token after `calls=`, parsed as a decimal or `0x`-hex count.
+fn parse_call_count(rhs: &str) -> Option<u64> {
+    let tok = rhs.split_whitespace().next()?;
+    match tok.strip_prefix("0x").or_else(|| tok.strip_prefix("0X")) {
+        Some(hex) => u64::from_str_radix(hex, 16).ok(),
+        None => tok.parse::<u64>().ok(),
+    }
+}
+
+/// Build a node. The function name keeps its raw text; file and object are
+/// normalized (basename + unknown handling per `opts`). Absent/empty file and
+/// object default to `opts.unknown` BEFORE normalizing so that disabling
+/// `normalize_paths` cannot leave a blank node key.
+fn make_node(
+    function: Option<&str>,
+    file: Option<&str>,
+    object: Option<&str>,
+    opts: &ParseOptions,
+) -> Node {
+    let or_unknown = |v: Option<&str>| {
+        normalize::normalize_path(
+            v.filter(|s| !s.is_empty()).unwrap_or(opts.unknown.as_str()),
+            opts,
+        )
+    };
+    let function = match function {
+        Some(f) if !f.is_empty() => f.to_owned(),
+        _ => opts.unknown.clone(),
+    };
+    Node {
+        function,
+        file: or_unknown(file),
+        object: or_unknown(object),
+    }
+}
diff --git a/callgrind-utils/src/parser/normalize.rs b/callgrind-utils/src/parser/normalize.rs
new file mode 100644
index 000000000..c31ca8add
--- /dev/null
+++ b/callgrind-utils/src/parser/normalize.rs
@@ -0,0 +1,28 @@
+use crate::model::ParseOptions;
+
+/// Return the last path segment after the final `/`.
+///
+/// `"foo/bar/baz.c"` -> `"baz.c"`; `"baz.c"` -> `"baz.c"`; `""` -> `""`.
+pub(crate) fn basename(path: &str) -> &str {
+    match path.rfind('/') {
+        Some(i) => &path[i + 1..],
+        None => path,
+    }
+}
+
+/// Normalize a file/object path according to `opts`.
+///
+/// When `normalize_paths` is disabled the path is returned verbatim.
+/// Otherwise the basename is taken and Callgrind-style unknowns (empty or
+/// `"???"`) collapse to `opts.unknown`.
+pub(crate) fn normalize_path(path: &str, opts: &ParseOptions) -> String {
+    if !opts.normalize_paths {
+        return path.to_string();
+    }
+    let leaf = basename(path);
+    if leaf.is_empty() || leaf == "???" {
+        opts.unknown.clone()
+    } else {
+        leaf.to_string()
+    }
+}
diff --git a/callgrind-utils/src/redact.rs b/callgrind-utils/src/redact.rs
new file mode 100644
index 000000000..18c660fbf
--- /dev/null
+++ b/callgrind-utils/src/redact.rs
@@ -0,0 +1,148 @@
+use std::collections::HashMap;
+
+use super::model::{CallGraph, Node};
+
+const UNKNOWN: &str = "???";
+
+impl CallGraph {
+    /// Redact host-specific node identity and rebuild the canonical graph.
+    ///
+    /// Self costs are re-keyed onto the redacted node identities, summing where
+    /// distinct nodes collapse to the same identity (e.g. libc functions).
+    pub fn redact(self) -> CallGraph {
+        let CallGraph {
+            nodes,
+            edges,
+            self_costs,
+        } = self;
+        let mut nodes = nodes;
+        let mut edges = edges;
+
+        let mut self_cost_map: HashMap<Node, u64> = HashMap::new();
+        for (node, &cost) in nodes.iter().zip(self_costs.iter()) {
+            let mut redacted = node.clone();
+            redact_node(&mut redacted);
+            *self_cost_map.entry(redacted).or_insert(0) += cost;
+        }
+
+        for node in &mut nodes {
+            redact_node(node);
+        }
+
+        for edge in &mut edges {
+            redact_node(&mut edge.caller);
+            redact_node(&mut edge.callee);
+        }
+
+        CallGraph::from_parts(nodes, edges, self_cost_map)
+    }
+}
+
+fn redact_node(node: &mut Node) {
+    node.object = redact_object(&node.object);
+
+    if is_runtime_object(&node.object) {
+        node.function = UNKNOWN.to_string();
+        node.file = UNKNOWN.to_string();
+        return;
+    }
+
+    node.function = redact_function(&node.function);
+}
+
+fn redact_function(function: &str) -> String {
+    let function = strip_symbol_version(function);
+    if is_hex_address(function) {
+        return "<unsymbolicated>".to_string();
+    }
+    function.to_string()
+}
+
+fn strip_symbol_version(function: &str) -> &str {
+    for marker in ["@@", "@"] {
+        let Some(index) = function.find(marker) else {
+            continue;
+        };
+        let version = &function[index + marker.len()..];
+        if is_symbol_version(version) {
+            return &function[..index];
+        }
+    }
+    function
+}
+
+fn is_symbol_version(version: &str) -> bool {
+    let Some(first) = version.chars().next() else {
+        return false;
+    };
+    (first.is_ascii_alphanumeric() || first == '_')
+        && version
+            .chars()
+            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '.')
+}
+
+fn is_hex_address(function: &str) -> bool {
+    let Some(hex) = function.strip_prefix("0x") else {
+        return false;
+    };
+    !hex.is_empty() && hex.chars().all(|c| c.is_ascii_hexdigit())
+}
+
+fn redact_object(object: &str) -> String {
+    if is_loader_soname(object) {
+        return "ld-linux".to_string();
+    }
+    if let Some(module) = cpython_extension_module(object) {
+        return format!("{module}.cpython.so");
+    }
+    if is_libffi_soname(object) {
+        return "libffi.so".to_string();
+    }
+    object.to_string()
+}
+
+fn is_runtime_object(object: &str) -> bool {
+    object == "ld-linux" || is_libc_soname(object)
+}
+
+fn is_libc_soname(object: &str) -> bool {
+    let Some(version) = object.strip_prefix("libc.so.") else {
+        return false;
+    };
+    !version.is_empty() && version.chars().all(|c| c.is_ascii_digit())
+}
+
+fn cpython_extension_module(object: &str) -> Option<&str> {
+    let (module, suffix) = object.split_once(".cpython-")?;
+    let abi = suffix.strip_suffix(".so")?;
+    if module.is_empty() || abi.is_empty() {
+        return None;
+    }
+    Some(module)
+}
+
+fn is_libffi_soname(object: &str) -> bool {
+    let Some(version) = object.strip_prefix("libffi.so.") else {
+        return false;
+    };
+    !version.is_empty()
+        && version.chars().all(|c| c.is_ascii_digit() || c == '.')
+        && version.chars().any(|c| c.is_ascii_digit())
+}
+fn is_loader_soname(object: &str) -> bool {
+    let Some(rest) = object.strip_prefix("ld-") else {
+        return false;
+    };
+    let Some(index) = rest.find(".so.") else {
+        return false;
+    };
+
+    let loader_name = &rest[..index];
+    let soname_version = &rest[index + ".so.".len()..];
+    !loader_name.is_empty()
+        && loader_name
+            .chars()
+            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
+        && !soname_version.is_empty()
+        && soname_version.chars().all(|c| c.is_ascii_digit())
+}
diff --git a/callgrind-utils/src/serialize.rs b/callgrind-utils/src/serialize.rs
new file mode 100644
index 000000000..b28b8816f
--- /dev/null
+++ b/callgrind-utils/src/serialize.rs
@@ -0,0 +1,51 @@
+use serde::Serialize;
+
+use super::{
+    error::ToJsonError,
+    model::{CallGraph, Node},
+};
+
+/// Canonical JSON view of the whole graph: nodes inline, edges by index.
+#[derive(Serialize)]
+struct GraphJson<'a> {
+    nodes: &'a [Node],
+    edges: Vec<EdgeJson>,
+}
+
+/// JSON view of a single edge: caller/callee as node indices.
+///
+/// `call_count` is omitted from the output when `None`.
+#[derive(Serialize)]
+struct EdgeJson {
+    caller: usize,
+    callee: usize,
+    #[serde(skip_serializing_if = "Option::is_none")]
+    call_count: Option<u64>,
+}
+
+impl CallGraph {
+    /// Serialize the graph to a canonical pretty-printed JSON string.
+    pub fn to_json(&self) -> Result<String, serde_json::Error> {
+        let edges: Vec<EdgeJson> = self
+            .edges()
+            .iter()
+            .map(|e| EdgeJson {
+                caller: self.node_index(&e.caller).expect("caller node present"),
+                callee: self.node_index(&e.callee).expect("callee node present"),
+                call_count: e.call_count,
+            })
+            .collect();
+        let graph = GraphJson {
+            nodes: self.nodes(),
+            edges,
+        };
+        serde_json::to_string_pretty(&graph)
+    }
+
+    /// Serialize the graph to a JSON file at `path`.
+    pub fn to_json_file(&self, path: impl AsRef<std::path::Path>) -> Result<(), ToJsonError> {
+        let s = self.to_json()?;
+        std::fs::write(path, s)?;
+        Ok(())
+    }
+}
diff --git a/callgrind-utils/testdata/arm64_deep_tailcall_chain.c b/callgrind-utils/testdata/arm64_deep_tailcall_chain.c
new file mode 100644
index 000000000..e37ba3d24
--- /dev/null
+++ b/callgrind-utils/testdata/arm64_deep_tailcall_chain.c
@@ -0,0 +1,93 @@
+// AArch64 reproducer: a longer (6-stage) flat-SP tail-call chain than
+// arm64_tail_call.c's 2-stage one, reached from inside real `bl`-based tree
+// recursion rather than a flat wrapper chain. Scales up the number of
+// same-SP frames `popcount_on_return` must pop in one go when the final
+// `ret` fires, and nests that under strictly-lower recursion frames above
+// it (arm64_tail_call.c has no recursion above its chain).
+#include <callgrind.h>
+
+#define MAX_DEPTH 5
+#define MAX_NODES 256
+
+typedef struct Node {
+    int value;
+    struct Node *left;
+    struct Node *right;
+} Node;
+
+static Node pool[MAX_NODES];
+static int used;
+
+__attribute__((noinline)) static Node *pool_alloc(int value) {
+    Node *node = &pool[used++];
+    node->value = value;
+    node->left = 0;
+    node->right = 0;
+    return node;
+}
+
+__attribute__((noinline)) static int child_value(int parent, int side, int depth) {
+    return parent * 3 + side + depth;
+}
+
+__attribute__((noinline)) static int stage_f(int n) { return n * 2 + 1; }
+__attribute__((noinline)) static int stage_e(int n) { return stage_f(n + 1); }
+__attribute__((noinline)) static int stage_d(int n) { return stage_e(n + 1); }
+__attribute__((noinline)) static int stage_c(int n) { return stage_d(n + 1); }
+__attribute__((noinline)) static int stage_b(int n) { return stage_c(n + 1); }
+__attribute__((noinline)) static int stage_a(int n) { return stage_b(n + 1); }
+
+__attribute__((noinline)) static Node *walk(int depth, int seed) {
+    Node *node = pool_alloc(seed);
+
+    if (depth < MAX_DEPTH) {
+        node->left = walk(depth + 1, child_value(seed, 1, depth));
+        node->right = walk(depth + 1, child_value(seed, 2, depth));
+    }
+
+    // Real call (`bl`) into the chain, then 5 plain-`b` sibling calls, then
+    // one real `ret`, then post-call sibling work in this same frame.
+    int chained = stage_a(seed);
+    node->value = seed + (chained % 97);
+    return node;
+}
+
+__attribute__((noinline)) static int recursive_sum(const Node *node) {
+    if (!node) return 0;
+    return node->value + recursive_sum(node->left) + recursive_sum(node->right);
+}
+
+__attribute__((noinline)) static int complex_benchmark(void) {
+    used = 0;
+    Node *root = walk(0, 1);
+    return recursive_sum(root) % 1000000;
+}
+
+__attribute__((noinline)) static int run_measured(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    int result = complex_benchmark();
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return result;
+}
+
+__attribute__((noinline)) static int warmup(void) {
+    volatile int acc = 0;
+    for (int i = 0; i < 2; i++) {
+        acc += complex_benchmark();
+    }
+    (void)acc;
+    return run_measured();
+}
+
+__attribute__((noinline)) static int run_benchmark(void) {
+    return warmup();
+}
+
+int main(void) {
+    volatile int result = run_benchmark();
+    (void)result;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/arm64_free_during_recursion.c b/callgrind-utils/testdata/arm64_free_during_recursion.c
new file mode 100644
index 000000000..6dcbefb62
--- /dev/null
+++ b/callgrind-utils/testdata/arm64_free_during_recursion.c
@@ -0,0 +1,168 @@
+// AArch64 reproducer for a real production bug: after `free()` returns from
+// deallocating a scratch heap buffer used mid-recursion, Callgrind's return
+// matching misattributes the post-free work in the SAME caller frame as a
+// fresh call FROM `free()` INTO the caller, instead of a return. This was
+// observed live on a CodSpeed aarch64 runner (`free` showing up as a parent
+// of `analyze_fractal_tree`, stealing ~13% of the benchmark's total time)
+// even after the "fix: ARM unwinding" commit — the fix does not cover this
+// case. Mirrors fractal.rs's `analyze_fractal_tree`: a self-recursive
+// analysis function that, at every recursion level, walks the tree
+// (real `bl` recursion => strictly-lower-SP frames), then mallocs a scratch
+// buffer, does work, frees it, and keeps computing in the same frame
+// afterward (real `bl`/`ret` to libc malloc/free, not a toy stand-in).
+#include <callgrind.h>
+#include <stdlib.h>
+
+#define MAX_DEPTH 6
+#define MAX_NODES 256
+#define ANALYSIS_DEPTH 3
+
+typedef struct Node {
+    int value;
+    struct Node *left;
+    struct Node *right;
+} Node;
+
+typedef struct Analysis {
+    int total_sum;
+    int node_count;
+    int variance;
+} Analysis;
+
+static Node pool[MAX_NODES];
+static int used;
+
+__attribute__((noinline)) static Node *pool_alloc(int value) {
+    Node *node = &pool[used++];
+    node->value = value;
+    node->left = 0;
+    node->right = 0;
+    return node;
+}
+
+__attribute__((noinline)) static int child_value(int parent, int side, int depth) {
+    return parent * 3 + side + depth;
+}
+
+__attribute__((noinline)) static Node *build_tree(int depth, int seed) {
+    Node *node = pool_alloc(seed);
+    if (depth < MAX_DEPTH) {
+        node->left = build_tree(depth + 1, child_value(seed, 1, depth));
+        node->right = build_tree(depth + 1, child_value(seed, 2, depth));
+    }
+    return node;
+}
+
+__attribute__((noinline)) static int recursive_sum(const Node *node) {
+    if (!node) return 0;
+    return node->value + recursive_sum(node->left) + recursive_sum(node->right);
+}
+
+__attribute__((noinline)) static int count_nodes(const Node *node) {
+    if (!node) return 0;
+    return 1 + count_nodes(node->left) + count_nodes(node->right);
+}
+
+__attribute__((noinline)) static void collect_leaf(const Node *node, int *buf, int *count) {
+    if (!node) return;
+    if (!node->left && !node->right) {
+        buf[(*count)++] = node->value;
+        return;
+    }
+    collect_leaf(node->left, buf, count);
+    collect_leaf(node->right, buf, count);
+}
+
+// Mirrors Rust's `__rust_dealloc -> __rdl_dealloc -> free` thin-wrapper
+// chain, which the compiler likely tail-calls all the way down to the PLT
+// stub for `free`, rather than calling `free` directly.
+__attribute__((noinline)) static void dealloc_wrapper2(void *ptr) {
+    free(ptr);
+}
+
+__attribute__((noinline)) static void dealloc_wrapper1(void *ptr) {
+    dealloc_wrapper2(ptr);
+}
+
+__attribute__((noinline)) static int compute_variance(const Node *root) {
+    int *buf = malloc(sizeof(int) * MAX_NODES);
+    int count = 0;
+    collect_leaf(root, buf, &count);
+
+    int local[MAX_NODES];
+    for (int i = 0; i < count; i++) {
+        local[i] = buf[i];
+    }
+
+    dealloc_wrapper1(buf);
+
+    // Post-free work in the caller's own frame -- this is exactly the cost
+    // that gets stolen and re-parented under `free` in the buggy case.
+    int mean = 0;
+    for (int i = 0; i < count; i++) {
+        mean += local[i];
+    }
+    if (count > 0) mean /= count;
+
+    int variance = 0;
+    for (int i = 0; i < count; i++) {
+        int diff = local[i] - mean;
+        variance += diff * diff;
+    }
+    if (count > 0) variance /= count;
+    return variance;
+}
+
+__attribute__((noinline)) static Analysis analyze_tree(const Node *root, int depth) {
+    int total_sum = recursive_sum(root);
+    int node_count = count_nodes(root);
+    int variance = compute_variance(root);
+
+    if (depth > 0) {
+        Analysis nested = analyze_tree(root, depth - 1);
+        Analysis result;
+        result.total_sum = total_sum + nested.total_sum / 10;
+        result.node_count = node_count;
+        result.variance = (variance + nested.variance) / 2;
+        return result;
+    }
+
+    Analysis result = { total_sum, node_count, variance };
+    return result;
+}
+
+__attribute__((noinline)) static int complex_benchmark(void) {
+    used = 0;
+    Node *root = build_tree(0, 1);
+    Analysis analysis = analyze_tree(root, ANALYSIS_DEPTH);
+    return analysis.total_sum + analysis.node_count + analysis.variance;
+}
+
+__attribute__((noinline)) static int run_measured(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    int result = complex_benchmark();
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return result;
+}
+
+__attribute__((noinline)) static int warmup(void) {
+    volatile int acc = 0;
+    for (int i = 0; i < 2; i++) {
+        acc += complex_benchmark();
+    }
+    (void)acc;
+    return run_measured();
+}
+
+__attribute__((noinline)) static int run_benchmark(void) {
+    return warmup();
+}
+
+int main(void) {
+    volatile int result = run_benchmark();
+    (void)result;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/arm64_free_tailcall_phantom.c b/callgrind-utils/testdata/arm64_free_tailcall_phantom.c
new file mode 100644
index 000000000..06096d454
--- /dev/null
+++ b/callgrind-utils/testdata/arm64_free_tailcall_phantom.c
@@ -0,0 +1,92 @@
+// AArch64 reproducer for the production "free calls X" misattribution: after a
+// tail-called `free()` returns, its return is misclassified as a fresh call
+// back into the caller, so `free` becomes a phantom parent (`caller'2`) of the
+// post-free work. This is the symptom arm64_free_during_recursion.c and
+// fractal_alloc.rs hit amid a full benchmark ("free showing up as a parent of
+// analyze_fractal_tree, stealing ~13% of the run time").
+//
+// This shares its ROOT CAUSE with arm64_plt_phantom_recursion.c -- it is the
+// same stale-`nonskipped` bug -- but surfaces on a *tail-called libc free*,
+// which is why it is worth pinning as its own fixture. The two triggers that
+// must combine (verified by ablation: removing either makes the phantom vanish):
+//
+//   1. `caller` calls `malloc` inside the measured region. That is a hop
+//      `caller -> PLT stub -> libc malloc` into another ELF object, which
+//      Callgrind treats as a skipped region; when malloc returns,
+//      `current_state.nonskipped = caller` is left dangling (see
+//      arm64_plt_phantom_recursion.c for the mechanism). The next `bl dealloc1`
+//      then has its shadow-stack return address computed from the *malloc* call
+//      site instead of the `bl dealloc1` site (bbcc.c's
+//      `FIXME: take the real passed count from shadow stack`).
+//   2. `free` is reached through two thin tail-call wrappers
+//      (`dealloc1 -> dealloc2 -> free`). At -O2 each `return f(p);` is a tail
+//      branch `b f`, which Callgrind emulates as a call with `ret_addr == 0`.
+//
+// When `free` returns to `caller`, the return matcher walks the shadow stack
+// looking for a frame whose recorded return address matches: the `free`/
+// `dealloc2` frames have `ret_addr == 0` and `dealloc1`'s recorded address is
+// the *wrong* (malloc-site) one from (1), so nothing matches. The return is
+// misclassified "RET w/o CALL" and re-promoted to a call into `caller` ->
+// phantom `caller'2`, under which `post_free_work` is misattributed.
+//
+// AArch64-specific for the same reason as arm64_plt_phantom_recursion.c: on x86
+// the return is detected by SP movement regardless of the recorded address.
+// Built at -O2 by tests/snapshot.rs; libc frames redact to `???`. The `malloc`
+// must stay inside the measured region (it is trigger 1); a direct `free` with
+// no tail wrappers, or a tail-called `free` without the preceding `malloc`,
+// both profile cleanly.
+#include <callgrind.h>
+#include <stdlib.h>
+
+// Two thin dealloc wrappers, each a pure tail call, mirroring the
+// `__rust_dealloc -> __rdl_dealloc -> free` chain seen in production.
+__attribute__((noinline)) static void dealloc2(void *p) { free(p); }
+__attribute__((noinline)) static void dealloc1(void *p) { dealloc2(p); }
+
+// Ordinary work `caller` does after the free returns; the bug misattributes it
+// under the phantom `caller'2` instead of directly under `caller`.
+__attribute__((noinline)) static int post_free_work(int x) {
+    volatile int acc = x;
+    for (int i = 0; i < 8; i++) {
+        acc += i * x;
+    }
+    return acc % 97;
+}
+
+// Non-recursive. Any `caller'2` clone in the snapshot is the bug.
+__attribute__((noinline)) static int caller(void) {
+    void *p = malloc(64);     // PLT hop -> libc; leaves `nonskipped` dangling
+    volatile char *c = p;
+    c[0] = 1;
+    dealloc1(p);              // bl dealloc1 -> tail chain -> free (emulated calls)
+    return post_free_work(3); // misattributed under phantom caller'2
+}
+
+__attribute__((noinline)) static int run_measured(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    int result = caller();
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return result;
+}
+
+__attribute__((noinline)) static int warmup(void) {
+    volatile int acc = 0;
+    for (int i = 0; i < 2; i++) {
+        acc += caller();
+    }
+    (void)acc;
+    return run_measured();
+}
+
+__attribute__((noinline)) static int run_benchmark(void) {
+    return warmup();
+}
+
+int main(void) {
+    volatile int result = run_benchmark();
+    (void)result;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/arm64_libm_recursion.c b/callgrind-utils/testdata/arm64_libm_recursion.c
new file mode 100644
index 000000000..4907a4f5b
--- /dev/null
+++ b/callgrind-utils/testdata/arm64_libm_recursion.c
@@ -0,0 +1,95 @@
+// AArch64 reproducer: a real libm PLT call (`sin`) at every level of a
+// recursive tree build, followed by sibling work in the same frame after
+// the call returns. Mirrors fractal.rs's `build_fractal`, which calls `sin`
+// at every recursion level to perturb child seeds -- exercises the same
+// return-into-caller-frame path as malloc/free, but through a different
+// external library boundary (libm instead of libc's allocator).
+#include <callgrind.h>
+#include <math.h>
+
+#define MAX_DEPTH 6
+#define MAX_NODES 256
+
+typedef struct Node {
+    double value;
+    struct Node *left;
+    struct Node *right;
+} Node;
+
+static Node pool[MAX_NODES];
+static int used;
+
+__attribute__((noinline)) static Node *pool_alloc(double value) {
+    Node *node = &pool[used++];
+    node->value = value;
+    node->left = 0;
+    node->right = 0;
+    return node;
+}
+
+__attribute__((noinline)) static double perturb(double seed, int side, int depth) {
+    return sin(seed * (side + 1) + depth);
+}
+
+__attribute__((noinline)) static double hash_tree(const Node *node) {
+    if (!node) return 0.0;
+    return node->value + hash_tree(node->left) * 1.5 + hash_tree(node->right) * 2.5;
+}
+
+__attribute__((noinline)) static Node *build_tree(int depth, double seed) {
+    Node *node = pool_alloc(seed);
+
+    if (depth < MAX_DEPTH) {
+        double left_seed = perturb(seed, 0, depth);
+        node->left = build_tree(depth + 1, left_seed);
+
+        double right_seed = perturb(seed, 1, depth);
+        node->right = build_tree(depth + 1, right_seed);
+    }
+
+    // Post-call sibling work in this same frame, after both recursive
+    // descents (each preceded by a real `bl sin@plt`) have returned.
+    node->value += hash_tree(node) * 0.01;
+    return node;
+}
+
+__attribute__((noinline)) static double recursive_sum(const Node *node) {
+    if (!node) return 0.0;
+    return node->value + recursive_sum(node->left) + recursive_sum(node->right);
+}
+
+__attribute__((noinline)) static int complex_benchmark(void) {
+    used = 0;
+    Node *root = build_tree(0, 0.37);
+    double total = recursive_sum(root) + hash_tree(root);
+    return (int)(total * 1000.0) % 1000000;
+}
+
+__attribute__((noinline)) static int run_measured(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    int result = complex_benchmark();
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return result;
+}
+
+__attribute__((noinline)) static int warmup(void) {
+    volatile int acc = 0;
+    for (int i = 0; i < 2; i++) {
+        acc += complex_benchmark();
+    }
+    (void)acc;
+    return run_measured();
+}
+
+__attribute__((noinline)) static int run_benchmark(void) {
+    return warmup();
+}
+
+int main(void) {
+    volatile int result = run_benchmark();
+    (void)result;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/arm64_longjmp_unwind.c b/callgrind-utils/testdata/arm64_longjmp_unwind.c
new file mode 100644
index 000000000..f04c1c16f
--- /dev/null
+++ b/callgrind-utils/testdata/arm64_longjmp_unwind.c
@@ -0,0 +1,107 @@
+// AArch64 reproducer: a multi-level non-local jump (`longjmp`) that unwinds
+// several real recursion frames at once via an indirect branch, not a
+// `ret`. Exercises the NEW bbcc.c block that reclassifies an ordinary jump
+// as a return when its target matches a recorded return address deeper in
+// the call stack (as opposed to the immediate top-of-stack frame) --
+// distinct from arm64_recursive_return.c, which only unwinds one frame at
+// a time via ordinary `ret`s.
+#include <callgrind.h>
+#include <setjmp.h>
+
+#define MAX_DEPTH 8
+#define MAX_NODES 512
+#define ABORT_DEPTH 5
+
+typedef struct Node {
+    int value;
+    struct Node *left;
+    struct Node *right;
+} Node;
+
+static Node pool[MAX_NODES];
+static int used;
+static int aborted;
+static jmp_buf abort_point;
+
+__attribute__((noinline)) static Node *pool_alloc(int value) {
+    Node *node = &pool[used++];
+    node->value = value;
+    node->left = 0;
+    node->right = 0;
+    return node;
+}
+
+__attribute__((noinline)) static int child_value(int parent, int side, int depth) {
+    return parent * 3 + side + depth;
+}
+
+__attribute__((noinline)) static Node *build_tree(int depth, int seed) {
+    Node *node = pool_alloc(seed);
+
+    if (!aborted && depth == ABORT_DEPTH && seed % 7 == 0) {
+        // Jump directly back to `complex_benchmark`'s frame, skipping every
+        // intermediate `build_tree` recursion level's own `ret`. Guarded by
+        // `aborted` so the post-landing rebuild can't re-trigger the jump.
+        aborted = 1;
+        longjmp(abort_point, seed);
+    }
+
+    if (depth < MAX_DEPTH) {
+        node->left = build_tree(depth + 1, child_value(seed, 1, depth));
+        node->right = build_tree(depth + 1, child_value(seed, 2, depth));
+    }
+
+    node->value += depth;
+    return node;
+}
+
+__attribute__((noinline)) static int recursive_sum(const Node *node) {
+    if (!node) return 0;
+    return node->value + recursive_sum(node->left) + recursive_sum(node->right);
+}
+
+__attribute__((noinline)) static int complex_benchmark(void) {
+    used = 0;
+    aborted = 0;
+    int jumped_seed = setjmp(abort_point);
+
+    Node *root;
+    if (jumped_seed != 0) {
+        // Landed via longjmp from deep in build_tree; reset the pool, then rebuild here.
+        used = 0;
+        root = build_tree(0, jumped_seed + 1);
+    } else {
+        root = build_tree(0, 1);
+    }
+
+    return recursive_sum(root) % 1000000;
+}
+
+__attribute__((noinline)) static int run_measured(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    int result = complex_benchmark();
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return result;
+}
+
+__attribute__((noinline)) static int warmup(void) {
+    volatile int acc = 0;
+    for (int i = 0; i < 2; i++) {
+        acc += complex_benchmark();
+    }
+    (void)acc;
+    return run_measured();
+}
+
+__attribute__((noinline)) static int run_benchmark(void) {
+    return warmup();
+}
+
+int main(void) {
+    volatile int result = run_benchmark();
+    (void)result;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/arm64_multi_alloc_cycle.c b/callgrind-utils/testdata/arm64_multi_alloc_cycle.c
new file mode 100644
index 000000000..f3101703e
--- /dev/null
+++ b/callgrind-utils/testdata/arm64_multi_alloc_cycle.c
@@ -0,0 +1,160 @@
+// AArch64 reproducer: TWO sequential malloc/free cycles inside the same
+// recursive analysis frame (mirrors the real fractal benchmark's
+// compute_median + compute_interquartile_range, which each allocate and
+// drop their own scratch Vec back-to-back). Tests whether call-stack state
+// left over from unwinding the FIRST alloc/free cycle corrupts matching for
+// the SECOND cycle within the same still-open caller frame.
+#include <callgrind.h>
+#include <stdlib.h>
+
+#define MAX_DEPTH 6
+#define MAX_NODES 256
+#define ANALYSIS_DEPTH 3
+
+typedef struct Node {
+    int value;
+    struct Node *left;
+    struct Node *right;
+} Node;
+
+typedef struct Analysis {
+    int total_sum;
+    int variance;
+    int spread;
+} Analysis;
+
+static Node pool[MAX_NODES];
+static int used;
+
+__attribute__((noinline)) static Node *pool_alloc(int value) {
+    Node *node = &pool[used++];
+    node->value = value;
+    node->left = 0;
+    node->right = 0;
+    return node;
+}
+
+__attribute__((noinline)) static int child_value(int parent, int side, int depth) {
+    return parent * 3 + side + depth;
+}
+
+__attribute__((noinline)) static Node *build_tree(int depth, int seed) {
+    Node *node = pool_alloc(seed);
+    if (depth < MAX_DEPTH) {
+        node->left = build_tree(depth + 1, child_value(seed, 1, depth));
+        node->right = build_tree(depth + 1, child_value(seed, 2, depth));
+    }
+    return node;
+}
+
+__attribute__((noinline)) static int recursive_sum(const Node *node) {
+    if (!node) return 0;
+    return node->value + recursive_sum(node->left) + recursive_sum(node->right);
+}
+
+__attribute__((noinline)) static void collect_leaf(const Node *node, int *buf, int *count) {
+    if (!node) return;
+    if (!node->left && !node->right) {
+        buf[(*count)++] = node->value;
+        return;
+    }
+    collect_leaf(node->left, buf, count);
+    collect_leaf(node->right, buf, count);
+}
+
+__attribute__((noinline)) static int compute_variance(const Node *root) {
+    int *buf = malloc(sizeof(int) * MAX_NODES);
+    int count = 0;
+    collect_leaf(root, buf, &count);
+
+    int local[MAX_NODES];
+    for (int i = 0; i < count; i++) local[i] = buf[i];
+
+    int mean = 0;
+    for (int i = 0; i < count; i++) mean += buf[i];
+    if (count > 0) mean /= count;
+
+    free(buf);
+
+    // Post-free work in this frame, following the first free() in this
+    // function -- reads the pre-free local copy, not the freed buffer.
+    int variance = 0;
+    for (int i = 0; i < count; i++) {
+        int diff = local[i] - mean;
+        variance += diff * diff;
+    }
+    if (count > 0) variance /= count;
+    return variance;
+}
+
+__attribute__((noinline)) static int compute_spread(const Node *root) {
+    int *buf = malloc(sizeof(int) * MAX_NODES);
+    int count = 0;
+    collect_leaf(root, buf, &count);
+
+    int lo = count > 0 ? buf[0] : 0;
+    int hi = count > 0 ? buf[0] : 0;
+    for (int i = 1; i < count; i++) {
+        if (buf[i] < lo) lo = buf[i];
+        if (buf[i] > hi) hi = buf[i];
+    }
+
+    free(buf);
+
+    // Post-free work in this frame, following the second free() in a row.
+    return (hi - lo) * count;
+}
+
+__attribute__((noinline)) static Analysis analyze_tree(const Node *root, int depth) {
+    int total_sum = recursive_sum(root);
+    int variance = compute_variance(root);
+    int spread = compute_spread(root);
+
+    if (depth > 0) {
+        Analysis nested = analyze_tree(root, depth - 1);
+        Analysis result;
+        result.total_sum = total_sum + nested.total_sum / 10;
+        result.variance = (variance + nested.variance) / 2;
+        result.spread = spread + nested.spread;
+        return result;
+    }
+
+    Analysis result = { total_sum, variance, spread };
+    return result;
+}
+
+__attribute__((noinline)) static int complex_benchmark(void) {
+    used = 0;
+    Node *root = build_tree(0, 1);
+    Analysis analysis = analyze_tree(root, ANALYSIS_DEPTH);
+    return analysis.total_sum + analysis.variance + analysis.spread;
+}
+
+__attribute__((noinline)) static int run_measured(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    int result = complex_benchmark();
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return result;
+}
+
+__attribute__((noinline)) static int warmup(void) {
+    volatile int acc = 0;
+    for (int i = 0; i < 2; i++) {
+        acc += complex_benchmark();
+    }
+    (void)acc;
+    return run_measured();
+}
+
+__attribute__((noinline)) static int run_benchmark(void) {
+    return warmup();
+}
+
+int main(void) {
+    volatile int result = run_benchmark();
+    (void)result;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/arm64_objskip_tailcall.c b/callgrind-utils/testdata/arm64_objskip_tailcall.c
new file mode 100644
index 000000000..902b31ced
--- /dev/null
+++ b/callgrind-utils/testdata/arm64_objskip_tailcall.c
@@ -0,0 +1,102 @@
+// AArch64 reproducer probing the interaction between `--obj-skip` splicing
+// (callgrind/bbcc.c's "call from skipped to nonskipped" handling, using
+// CLG_(current_state).nonskipped) and the emulated-call machinery (tail
+// calls promoted to jk_Call, ret_addr inheritance, alias-popping).
+// `skipped_entry` and `skipped_relay` live in a
+// companion shared library that gets passed to `--obj-skip`; the relay
+// hop into `skipped_relay` and the final hop back into `visible_target`
+// (in THIS, non-skipped, executable) are both plain tail calls, so the
+// return-matching machinery must correctly splice the skipped frames out
+// while still popping the right number of call-stack entries when
+// `visible_target` eventually returns for real.
+#include <callgrind.h>
+
+#define MAX_DEPTH 6
+#define MAX_NODES 256
+
+typedef struct Node {
+    int value;
+    struct Node *left;
+    struct Node *right;
+} Node;
+
+static Node pool[MAX_NODES];
+static int used;
+
+extern int skipped_entry(int seed);
+
+__attribute__((noinline)) static Node *pool_alloc(int value) {
+    Node *node = &pool[used++];
+    node->value = value;
+    node->left = 0;
+    node->right = 0;
+    return node;
+}
+
+__attribute__((noinline)) static int child_value(int parent, int side, int depth) {
+    return parent * 3 + side + depth;
+}
+
+__attribute__((noinline)) static int hash_tree(const Node *node) {
+    if (!node) return 0;
+    return node->value + hash_tree(node->left) * 5 + hash_tree(node->right) * 7;
+}
+
+// Real call target for the skipped library's final tail-call hop. Non-static
+// (exported) so the cross-object call can't be inlined or elided.
+__attribute__((noinline)) int visible_target(int seed) {
+    return seed * 2 + 1;
+}
+
+__attribute__((noinline)) static Node *build_tree(int depth, int seed) {
+    Node *node = pool_alloc(seed);
+
+    if (depth < MAX_DEPTH) {
+        node->left = build_tree(depth + 1, child_value(seed, 1, depth));
+        node->right = build_tree(depth + 1, child_value(seed, 2, depth));
+    }
+
+    // Real call (`bl`) into the skipped library's entry point, which
+    // tail-calls within the skipped object, then tail-calls back out into
+    // visible_target (non-skipped) -- then post-call work in this same
+    // frame, exactly the pattern that gets stolen if splicing mishandles
+    // the emulated hops.
+    int relayed = skipped_entry(seed);
+    node->value += hash_tree(node) + (relayed % 7);
+    return node;
+}
+
+__attribute__((noinline)) static int complex_benchmark(void) {
+    used = 0;
+    Node *root = build_tree(0, 1);
+    return hash_tree(root) % 1000000;
+}
+
+__attribute__((noinline)) static int run_measured(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    int result = complex_benchmark();
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return result;
+}
+
+__attribute__((noinline)) static int warmup(void) {
+    volatile int acc = 0;
+    for (int i = 0; i < 2; i++) {
+        acc += complex_benchmark();
+    }
+    (void)acc;
+    return run_measured();
+}
+
+__attribute__((noinline)) static int run_benchmark(void) {
+    return warmup();
+}
+
+int main(void) {
+    volatile int result = run_benchmark();
+    (void)result;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/arm64_objskip_tailcall_lib.c b/callgrind-utils/testdata/arm64_objskip_tailcall_lib.c
new file mode 100644
index 000000000..5fc5515f9
--- /dev/null
+++ b/callgrind-utils/testdata/arm64_objskip_tailcall_lib.c
@@ -0,0 +1,24 @@
+// Shared library half of arm64_objskip_tailcall.c. Everything here will be
+// marked skip=True via --obj-skip=<this .so>. `skipped_entry` is called
+// (real `bl`) from the non-skipped main executable, then tail-calls
+// `skipped_relay` (still inside this same skipped object), which itself
+// tail-calls back OUT into `visible_target` in the main executable --
+// mirroring the shape callgrind's obj-skip splicing is supposed to handle
+// (attribute the call directly to the real, non-skipped caller), but with
+// the skipped side of the chain built from emulated (tail-called) frames.
+extern int visible_target(int seed);
+
+__attribute__((noinline)) static int skipped_relay(int seed) {
+    // Two REAL (non-tail) calls out to non-skipped code, with skipped-side
+    // work interleaved, then a final tail call out -- stresses the
+    // `passed = bbcc->bb->cjmp_count` approximation in bbcc.c's
+    // "call from skipped to nonskipped" splice across repeated
+    // skip/nonskip transitions within one skipped frame's lifetime.
+    int a = visible_target(seed);
+    int b = visible_target(seed + a);
+    return visible_target(seed + a + b);
+}
+
+__attribute__((noinline)) int skipped_entry(int seed) {
+    return skipped_relay(seed + 1);
+}
diff --git a/callgrind-utils/testdata/arm64_ping_pong_recursion.c b/callgrind-utils/testdata/arm64_ping_pong_recursion.c
new file mode 100644
index 000000000..7efdc7920
--- /dev/null
+++ b/callgrind-utils/testdata/arm64_ping_pong_recursion.c
@@ -0,0 +1,101 @@
+// AArch64 reproducer: a long flat-SP mutual-tail-call chain (`ping` <-> `pong`,
+// alternating plain `b` sibling calls) nested INSIDE ordinary `bl`-based tree
+// recursion. Each tree node triggers a bounded ping/pong chain before doing
+// post-call sibling work in its own frame. Stresses `popcount_on_return`
+// needing to pop many same-SP frames at once, nested under multiple levels
+// of strictly-lower-SP real recursion frames -- the combination the simpler
+// arm64_tail_call.c (flat-only) and arm64_recursive_return.c (bl-only)
+// fixtures don't exercise together.
+#include <callgrind.h>
+
+#define MAX_DEPTH 5
+#define MAX_NODES 256
+#define PING_PONG_ROUNDS 10
+
+typedef struct Node {
+    int value;
+    struct Node *left;
+    struct Node *right;
+} Node;
+
+static Node pool[MAX_NODES];
+static int used;
+
+__attribute__((noinline)) static Node *pool_alloc(int value) {
+    Node *node = &pool[used++];
+    node->value = value;
+    node->left = 0;
+    node->right = 0;
+    return node;
+}
+
+__attribute__((noinline)) static int child_value(int parent, int side, int depth) {
+    return parent * 3 + side + depth;
+}
+
+__attribute__((noinline)) static int pong(int v, int n);
+
+__attribute__((noinline)) static int ping(int v, int n) {
+    if (n <= 0) return v;
+    return pong(v * 2 + 1, n - 1);
+}
+
+__attribute__((noinline)) static int pong(int v, int n) {
+    if (n <= 0) return v;
+    return ping(v * 3 + 2, n - 1);
+}
+
+__attribute__((noinline)) static Node *walk(int depth, int seed) {
+    Node *node = pool_alloc(seed);
+
+    if (depth < MAX_DEPTH) {
+        node->left = walk(depth + 1, child_value(seed, 1, depth));
+        node->right = walk(depth + 1, child_value(seed, 2, depth));
+    }
+
+    // Bounded flat-SP tail-call chain, then post-call sibling work in this
+    // same (real, `bl`-reached) frame once the chain's final `ret` fires.
+    int chained = ping(seed, PING_PONG_ROUNDS);
+    node->value = seed + (chained % 97);
+    return node;
+}
+
+__attribute__((noinline)) static int recursive_sum(const Node *node) {
+    if (!node) return 0;
+    return node->value + recursive_sum(node->left) + recursive_sum(node->right);
+}
+
+__attribute__((noinline)) static int complex_benchmark(void) {
+    used = 0;
+    Node *root = walk(0, 1);
+    return recursive_sum(root) % 1000000;
+}
+
+__attribute__((noinline)) static int run_measured(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    int result = complex_benchmark();
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return result;
+}
+
+__attribute__((noinline)) static int warmup(void) {
+    volatile int acc = 0;
+    for (int i = 0; i < 2; i++) {
+        acc += complex_benchmark();
+    }
+    (void)acc;
+    return run_measured();
+}
+
+__attribute__((noinline)) static int run_benchmark(void) {
+    return warmup();
+}
+
+int main(void) {
+    volatile int result = run_benchmark();
+    (void)result;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/arm64_plt_phantom_recursion.c b/callgrind-utils/testdata/arm64_plt_phantom_recursion.c
new file mode 100644
index 000000000..382892fee
--- /dev/null
+++ b/callgrind-utils/testdata/arm64_plt_phantom_recursion.c
@@ -0,0 +1,85 @@
+// AArch64 reproducer for the "non-recursive function appears recursive"
+// callstack bug (a phantom `foo'2` recursion clone). This is the minimal
+// distillation of fractal.rs's full-trace `complex_fractal_benchmark'2`.
+//
+// Trigger sequence, all inside `outer`:
+//   1. `memset` a stack buffer. At -O2 this is a real `bl memset@plt`, i.e. a
+//      hop `outer -> PLT stub -> libc memset` that crosses into another ELF
+//      object. Callgrind treats the PLT hop as a *skipped region*, so memset's
+//      shadow-stack frame stores `nonskipped = outer`. When memset returns,
+//      `pop_call_stack` restores `current_state.nonskipped = outer` and nothing
+//      clears it again.
+//   2. An ordinary `bl leaf`. In the delayed-push path, because `nonskipped` is
+//      still set, bbcc.c overrides the call's `from`/`passed` with the stale
+//      `nonskipped` BB (the `FIXME: take the real passed count` line), so
+//      `leaf`'s recorded return address is computed from the *memset* call site
+//      instead of the `bl leaf` site.
+//   3. When `leaf` returns, that wrong return address does not match, so the
+//      return is misclassified "RET w/o CALL" and re-promoted to a fresh call
+//      back into `outer`'s body -> phantom `outer'2`. Everything `outer` does
+//      after `leaf` (here `sibling`) is then misattributed under `outer'2`.
+//
+// This is AArch64-specific: on x86 `call`/`ret` move SP, so the return is
+// detected by SP alone regardless of the corrupted return address. On AArch64
+// `bl`/`ret` leave SP unchanged across the call boundary, so Callgrind relies
+// entirely on the (here wrong) return-address match.
+//
+// `leaf` and `sibling` are deliberately NON-recursive, so any `'2` clone in the
+// snapshot is unambiguously the bug. Built at -O2 by tests/snapshot.rs.
+#include <callgrind.h>
+#include <string.h>
+
+// Big enough (and volatile) that the compiler emits a real `bl memset@plt`
+// rather than inlining the clear.
+#define BUF_BYTES 512
+
+__attribute__((noinline)) static int leaf(const volatile char *buf) {
+    int acc = 0;
+    for (int i = 0; i < BUF_BYTES; i += 64) {
+        acc += buf[i];
+    }
+    return acc + 7;
+}
+
+__attribute__((noinline)) static int sibling(int x) {
+    return x * 2 + 1;
+}
+
+// Non-recursive. Mirrors Rust's `complex_fractal_benchmark`, whose
+// `let mut pool = Pool::new()` emits the same leading `bl memset@plt`.
+__attribute__((noinline)) static int outer(void) {
+    volatile char buf[BUF_BYTES];
+    memset((void *)buf, 0, sizeof buf); // bl memset@plt -> libc (skipped region)
+    int a = leaf(buf);                  // ordinary bl; its return is misdetected
+    int b = sibling(a);                 // misattributed under phantom outer'2
+    return a + b;
+}
+
+__attribute__((noinline)) static int run_measured(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    int result = outer();
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return result;
+}
+
+__attribute__((noinline)) static int warmup(void) {
+    volatile int acc = 0;
+    for (int i = 0; i < 2; i++) {
+        acc += outer();
+    }
+    (void)acc;
+    return run_measured();
+}
+
+__attribute__((noinline)) static int run_benchmark(void) {
+    return warmup();
+}
+
+int main(void) {
+    volatile int result = run_benchmark();
+    (void)result;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/arm64_recursive_return.c b/callgrind-utils/testdata/arm64_recursive_return.c
new file mode 100644
index 000000000..a65d744e6
--- /dev/null
+++ b/callgrind-utils/testdata/arm64_recursive_return.c
@@ -0,0 +1,89 @@
+// AArch64-focused reproducer for Callgrind shadow-stack unwinding on ordinary
+// compiler-generated recursive returns.  Mirrors fractal.rs's shape: a
+// multi-frame wrapper chain (main -> run_benchmark -> warmup -> run_measured)
+// so CALLGRIND_START_INSTRUMENTATION fires several native frames deep and the
+// shadow stack must be seeded, then a benchmark function that builds a
+// recursive tree and does post-order/sibling work afterwards.
+#include <callgrind.h>
+
+#define MAX_DEPTH 6
+#define MAX_NODES 256
+
+typedef struct Node {
+    int value;
+    struct Node *left;
+    struct Node *right;
+} Node;
+
+static Node pool[MAX_NODES];
+static int used;
+
+__attribute__((noinline)) static Node *pool_alloc(int value) {
+    Node *node = &pool[used++];
+    node->value = value;
+    node->left = 0;
+    node->right = 0;
+    return node;
+}
+
+__attribute__((noinline)) static int child_value(int parent, int side, int depth) {
+    return parent * 3 + side + depth;
+}
+
+__attribute__((noinline)) static int hash_tree(const Node *node) {
+    if (!node) return 0;
+    return node->value + hash_tree(node->left) * 5 + hash_tree(node->right) * 7;
+}
+
+__attribute__((noinline)) static Node *build_tree(int depth, int seed) {
+    Node *node = pool_alloc(seed);
+
+    if (depth < MAX_DEPTH) {
+        node->left = build_tree(depth + 1, child_value(seed, 1, depth));
+        node->right = build_tree(depth + 1, child_value(seed, 2, depth));
+    }
+
+    node->value += hash_tree(node);
+    return node;
+}
+
+__attribute__((noinline)) static int sibling_after_tree(const Node *root) {
+    return root->value % 97;
+}
+
+__attribute__((noinline)) static int complex_benchmark(void) {
+    used = 0;
+    Node *root = build_tree(0, 1);
+    int total = hash_tree(root);
+    total += sibling_after_tree(root);
+    return total;
+}
+
+__attribute__((noinline)) static int run_measured(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    int result = complex_benchmark();
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return result;
+}
+
+__attribute__((noinline)) static int warmup(void) {
+    volatile int acc = 0;
+    for (int i = 0; i < 2; i++) {
+        acc += complex_benchmark();
+    }
+    (void)acc;
+    return run_measured();
+}
+
+__attribute__((noinline)) static int run_benchmark(void) {
+    return warmup();
+}
+
+int main(void) {
+    volatile int result = run_benchmark();
+    (void)result;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/arm64_tail_call.c b/callgrind-utils/testdata/arm64_tail_call.c
new file mode 100644
index 000000000..6e3533f92
--- /dev/null
+++ b/callgrind-utils/testdata/arm64_tail_call.c
@@ -0,0 +1,56 @@
+// AArch64-focused reproducer for tail-call handling in Callgrind's shadow
+// stack. Exercises two paired fixes: guest_arm64_toIR.c now classifies an
+// unlinked `B` as Ijk_Boring (an ordinary jump) instead of Ijk_Call, and
+// bbcc.c's return matching must pop through the resulting chain of same-SP
+// tail-call frames in one go once the real `ret` finally executes. Built
+// with -O2 so `stage_a -> stage_b -> stage_c` compile to sibling calls
+// (`b`, not `bl`) that reuse a single stack frame. The seed is threaded
+// through a volatile global (rather than a compile-time constant literal)
+// so GCC's interprocedural constant propagation can't clone/fold
+// stage_a/stage_b/stage_c into `.constprop.0` variants or evaluate the
+// chain down to a single `mov`+`ret` -- the tail-call shape must survive
+// codegen for this fixture to exercise anything.
+#include <callgrind.h>
+
+volatile int g_seed = 5;
+
+__attribute__((noinline)) static int stage_c(int n) {
+    return n * 2 + 1;
+}
+
+__attribute__((noinline)) static int stage_b(int n) {
+    return stage_c(n + 1);
+}
+
+__attribute__((noinline)) static int stage_a(int n) {
+    return stage_b(n + 1);
+}
+
+__attribute__((noinline)) static int run_measured(int n) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    int result = stage_a(n);
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return result;
+}
+
+__attribute__((noinline)) static int warmup(int n) {
+    volatile int acc = 0;
+    for (int i = 0; i < 2; i++) {
+        acc += stage_a(n);
+    }
+    (void)acc;
+    return run_measured(n);
+}
+
+__attribute__((noinline)) static int run_benchmark(int n) {
+    return warmup(n);
+}
+
+int main(void) {
+    volatile int result = run_benchmark(g_seed);
+    (void)result;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/arm64_tls_access.c b/callgrind-utils/testdata/arm64_tls_access.c
new file mode 100644
index 000000000..b18ab0f78
--- /dev/null
+++ b/callgrind-utils/testdata/arm64_tls_access.c
@@ -0,0 +1,103 @@
+// AArch64 reproducer for a cost misattribution around TLS descriptor
+// resolvers. On a dynamically-linked/PIE binary, accessing a `__thread`
+// variable compiles to a GOT-loaded {resolver_fn, arg} pair followed by
+// `blr` into that resolver (NOT a normal PLT call) -- `_dl_tlsdesc_return`
+// for a statically-known offset, `_dl_tlsdesc_undefweak`/`_dl_tlsdesc_dynamic`
+// for other TLS models. This is the exact same class of "transparent
+// trampoline" as `_dl_runtime_resolve` (the lazy PLT-binding resolver,
+// which callgrind/fn.c already special-cases via `fn->pop_on_jump = True`),
+// but callgrind never applied the same treatment to the tlsdesc family.
+// Every TLS access in a recursive hot path triggers this, so the return
+// from `_dl_tlsdesc_return` back into the accessing function gets
+// misattributed as `_dl_tlsdesc_return` calling into whatever runs next --
+// observed in production pulling almost the ENTIRE program's cost under
+// `_dl_tlsdesc_return` for TLS-heavy workloads (e.g. CPython, which keeps
+// per-thread interpreter state in a `__thread` variable).
+#include <callgrind.h>
+
+#define MAX_DEPTH 6
+#define MAX_NODES 256
+
+typedef struct Node {
+    int value;
+    struct Node *left;
+    struct Node *right;
+} Node;
+
+static Node pool[MAX_NODES];
+static int used;
+
+// Defined in arm64_tls_access_lib.c's shared library: a `__thread`
+// variable that lives in a SEPARATE .so can't be relaxed by the linker
+// down to the cheap Local-Exec TP-relative model, forcing the real
+// TLS-descriptor path (GOT-loaded {resolver, arg} pair plus `blr`).
+extern int touch_tls(int delta);
+
+__attribute__((noinline)) static Node *pool_alloc(int value) {
+    Node *node = &pool[used++];
+    node->value = value;
+    node->left = 0;
+    node->right = 0;
+    return node;
+}
+
+__attribute__((noinline)) static int child_value(int parent, int side, int depth) {
+    return parent * 3 + side + depth;
+}
+
+__attribute__((noinline)) static int hash_tree(const Node *node) {
+    if (!node) return 0;
+    return node->value + hash_tree(node->left) * 5 + hash_tree(node->right) * 7;
+}
+
+__attribute__((noinline)) static Node *build_tree(int depth, int seed) {
+    Node *node = pool_alloc(seed);
+
+    if (depth < MAX_DEPTH) {
+        node->left = build_tree(depth + 1, child_value(seed, 1, depth));
+        node->right = build_tree(depth + 1, child_value(seed, 2, depth));
+    }
+
+    // Every recursion level touches the TLS variable (real `blr` into the
+    // tlsdesc resolver), then does more work in this same frame afterward
+    // -- exactly the "post-trampoline-return work" that gets stolen.
+    int bumped = touch_tls(node->value);
+    node->value += hash_tree(node) + (bumped % 7);
+    return node;
+}
+
+__attribute__((noinline)) static int complex_benchmark(void) {
+    used = 0;
+    touch_tls(-touch_tls(0));
+    Node *root = build_tree(0, 1);
+    return hash_tree(root) % 1000000;
+}
+
+__attribute__((noinline)) static int run_measured(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    int result = complex_benchmark();
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return result;
+}
+
+__attribute__((noinline)) static int warmup(void) {
+    volatile int acc = 0;
+    for (int i = 0; i < 2; i++) {
+        acc += complex_benchmark();
+    }
+    (void)acc;
+    return run_measured();
+}
+
+__attribute__((noinline)) static int run_benchmark(void) {
+    return warmup();
+}
+
+int main(void) {
+    volatile int result = run_benchmark();
+    (void)result;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/arm64_tls_access_lib.c b/callgrind-utils/testdata/arm64_tls_access_lib.c
new file mode 100644
index 000000000..0251af830
--- /dev/null
+++ b/callgrind-utils/testdata/arm64_tls_access_lib.c
@@ -0,0 +1,12 @@
+// Shared-library half of arm64_tls_access.c. A `__thread` variable defined
+// in a separate .so (rather than the main executable) can't be relaxed by
+// the linker down to the cheap Local-Exec TP-relative model -- accessing it
+// (here through `touch_tls`) forces the real TLS-descriptor path (a
+// GOT-loaded {resolver, arg} pair plus `blr`), which is what exercises
+// `_dl_tlsdesc_return`/`_dl_tlsdesc_undefweak`/`_dl_tlsdesc_dynamic`.
+__thread int tls_counter;
+
+__attribute__((noinline)) int touch_tls(int delta) {
+    tls_counter += delta;
+    return tls_counter;
+}
diff --git a/callgrind-utils/testdata/arm64_wrapped_alloc_chain.c b/callgrind-utils/testdata/arm64_wrapped_alloc_chain.c
new file mode 100644
index 000000000..9ca2b536b
--- /dev/null
+++ b/callgrind-utils/testdata/arm64_wrapped_alloc_chain.c
@@ -0,0 +1,144 @@
+// AArch64 reproducer for the "aliased emulated frames" bug: a chain of
+// THREE plain-`b` tail calls (mirroring Rust's real
+// `__rust_alloc -> __rdl_alloc -> malloc@plt` / `__rust_dealloc ->
+// __rdl_dealloc -> free@plt` shims) into a real external allocator
+// function. Each tail-called wrapper's call-stack entry inherits its
+// ret_addr from the frame that emulated it (bbcc.c's push_call_stack), so
+// all three stacked entries end up sharing the exact same ret_addr. When
+// the real allocator function finally does its own `ret`, only the
+// topmost of these aliased entries gets popped unless the return-matching
+// loop keeps consuming deeper equal-SP frames that independently match
+// the same target (bbcc.c's extend_popcount_through_aliases) -- otherwise
+// the two stale wrapper entries misattribute the NEXT jump as a fresh
+// call into whatever code runs after the allocation, instead of a plain
+// continuation of the real caller.
+//
+// A single-hop wrapper (see arm64_tail_call.c/arm64_free_during_recursion.c)
+// is not enough to exercise this: with only one emulated frame between the
+// real caller and the real external function, "descend until a match is
+// found" (the pre-existing loop) walks past the lone zero/aliased entry
+// and lands correctly on the real caller's own (non-aliased) entry by
+// coincidence. Three hops stack enough aliased frames that under-counting
+// by even one leaves a stale entry behind.
+#include <callgrind.h>
+#include <stdlib.h>
+#include <string.h>
+
+#define MAX_DEPTH 5
+#define MAX_NODES 256
+
+typedef struct Node {
+    int value;
+    struct Node *left;
+    struct Node *right;
+} Node;
+
+static Node pool[MAX_NODES];
+static int used;
+
+__attribute__((noinline)) static Node *pool_alloc(int value) {
+    Node *node = &pool[used++];
+    node->value = value;
+    node->left = 0;
+    node->right = 0;
+    return node;
+}
+
+__attribute__((noinline)) static int child_value(int parent, int side, int depth) {
+    return parent * 3 + side + depth;
+}
+
+// Three-hop tail-call chains into the real allocator, matching the real
+// Rust shim depth (`__rust_alloc -> __rdl_alloc -> malloc@plt`).
+__attribute__((noinline)) static void *alloc_hop3(size_t n) {
+    return malloc(n);
+}
+__attribute__((noinline)) static void *alloc_hop2(size_t n) {
+    return alloc_hop3(n);
+}
+__attribute__((noinline)) static void *alloc_hop1(size_t n) {
+    return alloc_hop2(n);
+}
+
+__attribute__((noinline)) static void dealloc_hop3(void *ptr) {
+    free(ptr);
+}
+__attribute__((noinline)) static void dealloc_hop2(void *ptr) {
+    dealloc_hop3(ptr);
+}
+__attribute__((noinline)) static void dealloc_hop1(void *ptr) {
+    dealloc_hop2(ptr);
+}
+
+__attribute__((noinline)) static void collect_leaf(const Node *node, int *buf, int *count) {
+    if (!node) return;
+    if (!node->left && !node->right) {
+        buf[(*count)++] = node->value;
+        return;
+    }
+    collect_leaf(node->left, buf, count);
+    collect_leaf(node->right, buf, count);
+}
+
+__attribute__((noinline)) static int compute_stat(const Node *root) {
+    // Real call (`bl`) into the 3-hop alloc chain -- the frame that
+    // eventually emulates the tail-called wrappers.
+    int *buf = alloc_hop1(sizeof(int) * MAX_NODES);
+    int count = 0;
+    collect_leaf(root, buf, &count);
+
+    int sum = 0;
+    for (int i = 0; i < count; i++) sum += buf[i];
+
+    // Real call (`bl`) into the 3-hop dealloc chain.
+    dealloc_hop1(buf);
+
+    // Post-free work in this same frame -- exactly the code that gets
+    // stolen and re-parented under the allocator if the aliased frames
+    // above aren't all correctly popped.
+    return sum % 1000;
+}
+
+__attribute__((noinline)) static Node *build_tree(int depth, int seed) {
+    Node *node = pool_alloc(seed);
+    if (depth < MAX_DEPTH) {
+        node->left = build_tree(depth + 1, child_value(seed, 1, depth));
+        node->right = build_tree(depth + 1, child_value(seed, 2, depth));
+    }
+    return node;
+}
+
+__attribute__((noinline)) static int complex_benchmark(void) {
+    used = 0;
+    Node *root = build_tree(0, 1);
+    return compute_stat(root);
+}
+
+__attribute__((noinline)) static int run_measured(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    int result = complex_benchmark();
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return result;
+}
+
+__attribute__((noinline)) static int warmup(void) {
+    volatile int acc = 0;
+    for (int i = 0; i < 2; i++) {
+        acc += complex_benchmark();
+    }
+    (void)acc;
+    return run_measured();
+}
+
+__attribute__((noinline)) static int run_benchmark(void) {
+    return warmup();
+}
+
+int main(void) {
+    volatile int result = run_benchmark();
+    (void)result;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/chain.c b/callgrind-utils/testdata/chain.c
new file mode 100644
index 000000000..cb9360263
--- /dev/null
+++ b/callgrind-utils/testdata/chain.c
@@ -0,0 +1,27 @@
+// Fixture: a linear call chain `main -> a -> b -> c` (no recursion, no shared
+// callees). See recursion.c for the instrumentation/build conventions.
+
+#include <callgrind.h>
+
+static int c(int n) {
+    return n + 1;
+}
+
+static int b(int n) {
+    return c(n) + 1;
+}
+
+static int a(int n) {
+    return b(n) + 1;
+}
+
+int main(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    volatile int sink = a(5);
+    (void)sink;
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/clgctl.c b/callgrind-utils/testdata/clgctl.c
new file mode 100644
index 000000000..3deb59341
--- /dev/null
+++ b/callgrind-utils/testdata/clgctl.c
@@ -0,0 +1,28 @@
+// Callgrind client-request shim for the Python fixture (`recursion.py`).
+//
+// The CALLGRIND_* client requests are inline-asm sequences, so they can't be
+// issued from pure Python. The Python fixture loads this shared library via
+// `ctypes` and calls these entry points to drive instrumentation, mirroring
+// what pytest-codspeed's instrument-hooks does: skip the Python runtime objects
+// at runtime, then START/ZERO around the measured region and STOP after.
+//
+// Build (shared, against the in-repo client-request headers):
+//   cc -g -O0 -shared -fPIC -I callgrind -I include ...
+
+#include <callgrind.h>
+
+// Add an object file to Callgrind's obj-skip list at runtime. Matching is exact
+// against the mapped object path, so the caller passes a realpath (same as
+// instrument-hooks' `callgrind_add_obj_skip`).
+void clg_add_obj_skip(const char *path) {
+    CALLGRIND_ADD_OBJ_SKIP(path);
+}
+
+void clg_start(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+}
+
+void clg_stop(void) {
+    CALLGRIND_STOP_INSTRUMENTATION;
+}
diff --git a/callgrind-utils/testdata/diamond.c b/callgrind-utils/testdata/diamond.c
new file mode 100644
index 000000000..d617b2759
--- /dev/null
+++ b/callgrind-utils/testdata/diamond.c
@@ -0,0 +1,32 @@
+// Fixture: a diamond graph where `bottom` is a shared callee reached via two
+// paths: `main -> top -> {left, right} -> bottom`. Exercises a node with two
+// distinct incoming edges. See recursion.c for the conventions.
+
+#include <callgrind.h>
+
+static int bottom(int n) {
+    return n * 2;
+}
+
+static int left(int n) {
+    return bottom(n) + 1;
+}
+
+static int right(int n) {
+    return bottom(n) + 2;
+}
+
+static int top(int n) {
+    return left(n) + right(n);
+}
+
+int main(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    volatile int sink = top(5);
+    (void)sink;
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/fractal.c b/callgrind-utils/testdata/fractal.c
new file mode 100644
index 000000000..bde5417cf
--- /dev/null
+++ b/callgrind-utils/testdata/fractal.c
@@ -0,0 +1,247 @@
+// Build with `-g -O0` so the functions are real (no inlining) and carry debug
+// names:
+//   cc -g -O0 -I callgrind -I include ...
+
+#include <callgrind.h>
+
+#define MAX_DEPTH 5
+#define BRANCH_FACTOR 3
+#define FIB_N 25
+#define MAX_NODES 1024
+
+typedef struct FractalNode {
+    long value;
+    int depth;
+    unsigned long computed_hash;
+    struct FractalNode *children[BRANCH_FACTOR];
+    int num_children;
+} FractalNode;
+
+// Bump-allocated node pool: avoids the allocator frames a heap tree would leak
+// into the profile. Reset at the start of every tree build.
+static FractalNode g_pool[MAX_NODES];
+static int g_pool_used;
+
+static FractalNode *pool_alloc(void) {
+    FractalNode *node = &g_pool[g_pool_used++];
+    node->value = 0;
+    node->depth = 0;
+    node->computed_hash = 0;
+    node->num_children = 0;
+    return node;
+}
+
+// Deterministic child seed (integer stand-in for the original golden-ratio sine).
+static long compute_child_value(long parent_value, int child_index, int depth) {
+    unsigned long base = (unsigned long)parent_value * 2654435761UL;
+    unsigned long offset = (unsigned long)(child_index + 1) * (unsigned long)(depth + 1);
+    return (long)(((base ^ (offset * 40503UL)) % 100UL) + 1UL);
+}
+
+static unsigned long compute_tree_hash(const FractalNode *node) {
+    unsigned long hash = (unsigned long)node->value;
+    hash = hash * 31 + (unsigned long)node->depth;
+
+    for (int i = 0; i < node->num_children; i++) {
+        hash = hash * 31 + compute_tree_hash(node->children[i]);
+    }
+    return hash;
+}
+
+static FractalNode *build_fractal(int depth, long seed) {
+    FractalNode *node = pool_alloc();
+    node->value = seed;
+    node->depth = depth;
+
+    if (depth < MAX_DEPTH) {
+        node->num_children = BRANCH_FACTOR;
+        for (int i = 0; i < BRANCH_FACTOR; i++) {
+            long child_seed = compute_child_value(seed, i, depth);
+            node->children[i] = build_fractal(depth + 1, child_seed);
+        }
+    }
+
+    node->computed_hash = compute_tree_hash(node);
+    return node;
+}
+
+static long recursive_sum(const FractalNode *node) {
+    long children_sum = 0;
+    for (int i = 0; i < node->num_children; i++) {
+        children_sum += recursive_sum(node->children[i]);
+    }
+    return node->value + children_sum;
+}
+
+static long max_path_sum(const FractalNode *node) {
+    if (node->num_children == 0) {
+        return node->value;
+    }
+
+    long max_child_path = 0;
+    for (int i = 0; i < node->num_children; i++) {
+        long child_path = max_path_sum(node->children[i]);
+        if (child_path > max_child_path) {
+            max_child_path = child_path;
+        }
+    }
+    return node->value + max_child_path;
+}
+
+static int count_nodes(const FractalNode *node) {
+    int count = 1;
+    for (int i = 0; i < node->num_children; i++) {
+        count += count_nodes(node->children[i]);
+    }
+    return count;
+}
+
+// Collected leaves land in a shared buffer; the caller resets g_leaf_count.
+static long g_leaves[MAX_NODES];
+static int g_leaf_count;
+
+static void collect_leaves(const FractalNode *node) {
+    if (node->num_children == 0) {
+        g_leaves[g_leaf_count++] = node->value;
+        return;
+    }
+    for (int i = 0; i < node->num_children; i++) {
+        collect_leaves(node->children[i]);
+    }
+}
+
+static int fibonacci_memo(int n, int *memo) {
+    if (n <= 1) {
+        return n;
+    }
+    if (memo[n] != -1) {
+        return memo[n];
+    }
+
+    int result = fibonacci_memo(n - 1, memo) + fibonacci_memo(n - 2, memo);
+    memo[n] = result;
+    return result;
+}
+
+static long compute_variance(const long *values, int count) {
+    if (count == 0) {
+        return 0;
+    }
+
+    long mean = 0;
+    for (int i = 0; i < count; i++) {
+        mean += values[i];
+    }
+    mean /= count;
+
+    long variance = 0;
+    for (int i = 0; i < count; i++) {
+        long diff = values[i] - mean;
+        variance += diff * diff;
+    }
+    return variance / count;
+}
+
+static long recursive_path_score(long value, int depth) {
+    if (depth == 0 || value < 2) {
+        return value;
+    }
+    long reduced = (value * 4) / 5;
+    return 1 + recursive_path_score(reduced, depth - 1) / 2;
+}
+
+static long compute_complexity_score(int node_count, long variance, long max_path) {
+    long base_score = (long)node_count * variance;
+    long path_factor = recursive_path_score(max_path, 5);
+    return base_score + path_factor;
+}
+
+typedef struct {
+    long total_sum;
+    int node_count;
+    long max_path;
+    long leaf_variance;
+    long complexity_score;
+} TreeAnalysis;
+
+static TreeAnalysis analyze_fractal_tree(FractalNode *tree, int analysis_depth) {
+    long total_sum = recursive_sum(tree);
+    int node_count = count_nodes(tree);
+    long max_path = max_path_sum(tree);
+
+    g_leaf_count = 0;
+    collect_leaves(tree);
+    long leaf_variance = compute_variance(g_leaves, g_leaf_count);
+
+    TreeAnalysis analysis;
+    if (analysis_depth > 0) {
+        TreeAnalysis nested = analyze_fractal_tree(tree, analysis_depth - 1);
+        analysis.total_sum = total_sum + nested.total_sum / 10;
+        analysis.node_count = node_count;
+        analysis.max_path = max_path > nested.max_path ? max_path : nested.max_path;
+        analysis.leaf_variance = (leaf_variance + nested.leaf_variance) / 2;
+        analysis.complexity_score =
+            compute_complexity_score(node_count, leaf_variance, max_path);
+        return analysis;
+    }
+
+    analysis.total_sum = total_sum;
+    analysis.node_count = node_count;
+    analysis.max_path = max_path;
+    analysis.leaf_variance = leaf_variance;
+    analysis.complexity_score = compute_complexity_score(node_count, leaf_variance, max_path);
+    return analysis;
+}
+
+static long complex_fractal_benchmark(void) {
+    g_pool_used = 0;
+    FractalNode *tree = build_fractal(0, 42);
+
+    TreeAnalysis analysis = analyze_fractal_tree(tree, 2);
+
+    int memo[FIB_N + 1];
+    for (int i = 0; i <= FIB_N; i++) {
+        memo[i] = -1;
+    }
+    long fib_result = fibonacci_memo(FIB_N, memo);
+
+    long tree_hash = (long)compute_tree_hash(tree);
+    long tree_metric = analysis.total_sum + (long)analysis.node_count * 10 + analysis.max_path +
+                       analysis.leaf_variance + analysis.complexity_score;
+
+    return (tree_metric + fib_result + tree_hash) % 1000000;
+}
+
+// Deepest frame: this is where instrumentation is turned on, with
+// main -> run_benchmark -> warmup -> run_measured already live on the native
+// stack but the shadow stack empty. The seeder reconstructs that chain.
+static long run_measured(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    long result = complex_fractal_benchmark();
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return result;
+}
+
+// Two unmeasured warmup iterations (instrumentation still off) before the
+// measured run, like a real benchmark harness.
+static long warmup(void) {
+    volatile long acc = 0;
+    for (int i = 0; i < 2; i++) {
+        acc += complex_fractal_benchmark();
+    }
+    (void)acc;
+    return run_measured();
+}
+
+static long run_benchmark(void) {
+    return warmup();
+}
+
+int main(void) {
+    volatile long result = run_benchmark();
+    (void)result;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/fractal.rs b/callgrind-utils/testdata/fractal.rs
new file mode 100644
index 000000000..3933626b0
--- /dev/null
+++ b/callgrind-utils/testdata/fractal.rs
@@ -0,0 +1,330 @@
+// Rust twin of `testdata/fractal.c`: a pure-compute recursive fractal whose
+// Callgrind client requests fire several frames deep.
+//
+// The CALLGRIND_* client requests are inline-asm sequences, so a pure-Rust
+// binary can't issue them directly. Instead this fixture links the C
+// `clgctl.c` shim (compiled into a static lib by the test harness) and calls
+// `clg_start` / `clg_stop` through FFI, the same shim the Python fixture drives
+// via ctypes.
+//
+// Every free function except `main` is `#[no_mangle]` so the profile carries
+// stable C-like symbol names (Callgrind's node redaction does not strip Rust
+// mangling hashes). Integer arithmetic and a fixed-size arena (no `Vec`, no
+// `f64`) keep the graph free of allocator / libm frames, so the parsed JSON is
+// stable across platforms.
+//
+// Build (done by tests/rust_callgraph.rs):
+//   rustc --edition 2021 -g -C opt-level=0 -L native=<dir> -l static=clgctl ...
+
+#![allow(dead_code)]
+
+const MAX_DEPTH: usize = 5;
+const BRANCH_FACTOR: usize = 3;
+const FIB_N: usize = 25;
+const MAX_NODES: usize = 1024;
+
+extern "C" {
+    fn clg_start();
+    fn clg_stop();
+}
+
+#[derive(Clone, Copy)]
+struct FractalNode {
+    value: i64,
+    depth: i64,
+    computed_hash: u64,
+    children: [usize; BRANCH_FACTOR],
+    num_children: usize,
+}
+
+impl FractalNode {
+    const fn zero() -> Self {
+        FractalNode {
+            value: 0,
+            depth: 0,
+            computed_hash: 0,
+            children: [0; BRANCH_FACTOR],
+            num_children: 0,
+        }
+    }
+}
+
+// Bump-allocated node arena: avoids the allocator frames a heap tree would leak
+// into the profile. A fresh arena is used for every tree build.
+struct Pool {
+    nodes: [FractalNode; MAX_NODES],
+    used: usize,
+}
+
+impl Pool {
+    fn new() -> Self {
+        Pool {
+            nodes: [FractalNode::zero(); MAX_NODES],
+            used: 0,
+        }
+    }
+}
+
+#[no_mangle]
+#[inline(never)]
+fn pool_alloc(pool: &mut Pool) -> usize {
+    let idx = pool.used;
+    pool.used += 1;
+    pool.nodes[idx] = FractalNode::zero();
+    idx
+}
+
+// Deterministic child seed (integer stand-in for the original golden-ratio sine).
+#[no_mangle]
+#[inline(never)]
+fn compute_child_value(parent_value: i64, child_index: usize, depth: usize) -> i64 {
+    let base = (parent_value as u64).wrapping_mul(2654435761);
+    let offset = ((child_index as u64) + 1).wrapping_mul((depth as u64) + 1);
+    (((base ^ offset.wrapping_mul(40503)) % 100) + 1) as i64
+}
+
+#[no_mangle]
+#[inline(never)]
+fn compute_tree_hash(pool: &Pool, idx: usize) -> u64 {
+    let node = pool.nodes[idx];
+    let mut hash = (node.value as u64).wrapping_mul(31).wrapping_add(node.depth as u64);
+    for i in 0..node.num_children {
+        hash = hash
+            .wrapping_mul(31)
+            .wrapping_add(compute_tree_hash(pool, node.children[i]));
+    }
+    hash
+}
+
+#[no_mangle]
+#[inline(never)]
+fn build_fractal(pool: &mut Pool, depth: usize, seed: i64) -> usize {
+    let idx = pool_alloc(pool);
+    pool.nodes[idx].value = seed;
+    pool.nodes[idx].depth = depth as i64;
+
+    if depth < MAX_DEPTH {
+        let mut children = [0usize; BRANCH_FACTOR];
+        for i in 0..BRANCH_FACTOR {
+            let child_seed = compute_child_value(seed, i, depth);
+            children[i] = build_fractal(pool, depth + 1, child_seed);
+        }
+        pool.nodes[idx].children = children;
+        pool.nodes[idx].num_children = BRANCH_FACTOR;
+    }
+
+    pool.nodes[idx].computed_hash = compute_tree_hash(pool, idx);
+    idx
+}
+
+#[no_mangle]
+#[inline(never)]
+fn recursive_sum(pool: &Pool, idx: usize) -> i64 {
+    let node = pool.nodes[idx];
+    let mut children_sum = 0i64;
+    for i in 0..node.num_children {
+        children_sum += recursive_sum(pool, node.children[i]);
+    }
+    node.value + children_sum
+}
+
+#[no_mangle]
+#[inline(never)]
+fn max_path_sum(pool: &Pool, idx: usize) -> i64 {
+    let node = pool.nodes[idx];
+    if node.num_children == 0 {
+        return node.value;
+    }
+
+    let mut max_child_path = 0i64;
+    for i in 0..node.num_children {
+        let child_path = max_path_sum(pool, node.children[i]);
+        if child_path > max_child_path {
+            max_child_path = child_path;
+        }
+    }
+    node.value + max_child_path
+}
+
+#[no_mangle]
+#[inline(never)]
+fn count_nodes(pool: &Pool, idx: usize) -> i64 {
+    let node = pool.nodes[idx];
+    let mut count = 1i64;
+    for i in 0..node.num_children {
+        count += count_nodes(pool, node.children[i]);
+    }
+    count
+}
+
+#[no_mangle]
+#[inline(never)]
+fn collect_leaves(pool: &Pool, idx: usize, leaves: &mut [i64], count: &mut usize) {
+    let node = pool.nodes[idx];
+    if node.num_children == 0 {
+        leaves[*count] = node.value;
+        *count += 1;
+        return;
+    }
+    for i in 0..node.num_children {
+        collect_leaves(pool, node.children[i], leaves, count);
+    }
+}
+
+#[no_mangle]
+#[inline(never)]
+fn fibonacci_memo(n: i64, memo: &mut [i64]) -> i64 {
+    if n <= 1 {
+        return n;
+    }
+    if memo[n as usize] != -1 {
+        return memo[n as usize];
+    }
+
+    let result = fibonacci_memo(n - 1, memo) + fibonacci_memo(n - 2, memo);
+    memo[n as usize] = result;
+    result
+}
+
+#[no_mangle]
+#[inline(never)]
+fn compute_variance(values: &[i64]) -> i64 {
+    if values.is_empty() {
+        return 0;
+    }
+
+    let mut mean = 0i64;
+    for &v in values {
+        mean += v;
+    }
+    mean /= values.len() as i64;
+
+    let mut variance = 0i64;
+    for &v in values {
+        let diff = v - mean;
+        variance += diff * diff;
+    }
+    variance / values.len() as i64
+}
+
+#[no_mangle]
+#[inline(never)]
+fn recursive_path_score(value: i64, depth: usize) -> i64 {
+    if depth == 0 || value < 2 {
+        return value;
+    }
+    let reduced = (value * 4) / 5;
+    1 + recursive_path_score(reduced, depth - 1) / 2
+}
+
+#[no_mangle]
+#[inline(never)]
+fn compute_complexity_score(node_count: i64, variance: i64, max_path: i64) -> i64 {
+    let base_score = node_count * variance;
+    let path_factor = recursive_path_score(max_path, 5);
+    base_score + path_factor
+}
+
+#[derive(Clone, Copy)]
+struct TreeAnalysis {
+    total_sum: i64,
+    node_count: i64,
+    max_path: i64,
+    leaf_variance: i64,
+    complexity_score: i64,
+}
+
+#[no_mangle]
+#[inline(never)]
+fn analyze_fractal_tree(pool: &Pool, root: usize, analysis_depth: usize) -> TreeAnalysis {
+    let total_sum = recursive_sum(pool, root);
+    let node_count = count_nodes(pool, root);
+    let max_path = max_path_sum(pool, root);
+
+    let mut leaves = [0i64; MAX_NODES];
+    let mut leaf_count = 0usize;
+    collect_leaves(pool, root, &mut leaves, &mut leaf_count);
+    let leaf_variance = compute_variance(&leaves[..leaf_count]);
+
+    if analysis_depth > 0 {
+        let nested = analyze_fractal_tree(pool, root, analysis_depth - 1);
+        return TreeAnalysis {
+            total_sum: total_sum + nested.total_sum / 10,
+            node_count,
+            max_path: max_path.max(nested.max_path),
+            leaf_variance: (leaf_variance + nested.leaf_variance) / 2,
+            complexity_score: compute_complexity_score(node_count, leaf_variance, max_path),
+        };
+    }
+
+    TreeAnalysis {
+        total_sum,
+        node_count,
+        max_path,
+        leaf_variance,
+        complexity_score: compute_complexity_score(node_count, leaf_variance, max_path),
+    }
+}
+
+#[no_mangle]
+#[inline(never)]
+fn complex_fractal_benchmark() -> i64 {
+    let mut pool = Pool::new();
+    let root = build_fractal(&mut pool, 0, 42);
+
+    let analysis = analyze_fractal_tree(&pool, root, 2);
+
+    let mut memo = [-1i64; FIB_N + 1];
+    let fib_result = fibonacci_memo(FIB_N as i64, &mut memo);
+
+    let tree_hash = compute_tree_hash(&pool, root) as i64;
+    let tree_metric = analysis.total_sum
+        + analysis.node_count * 10
+        + analysis.max_path
+        + analysis.leaf_variance
+        + analysis.complexity_score;
+
+    (tree_metric.wrapping_add(fib_result).wrapping_add(tree_hash)).rem_euclid(1_000_000)
+}
+
+// Deepest frame: instrumentation is turned on here, with
+// main -> run_benchmark -> warmup -> run_measured already live on the native
+// stack but the shadow stack empty. The seeder reconstructs that chain.
+#[no_mangle]
+#[inline(never)]
+fn run_measured() -> i64 {
+    unsafe {
+        clg_start();
+    }
+
+    let result = complex_fractal_benchmark();
+
+    unsafe {
+        clg_stop();
+    }
+    result
+}
+
+// Two unmeasured warmup iterations (instrumentation still off) before the
+// measured run, like a real benchmark harness.
+#[no_mangle]
+#[inline(never)]
+fn warmup() -> i64 {
+    let mut acc = 0i64;
+    for _ in 0..2 {
+        acc = acc.wrapping_add(complex_fractal_benchmark());
+    }
+    std::hint::black_box(acc);
+    run_measured()
+}
+
+#[no_mangle]
+#[inline(never)]
+fn run_benchmark() -> i64 {
+    warmup()
+}
+
+fn main() {
+    let result = run_benchmark();
+    std::hint::black_box(result);
+}
diff --git a/callgrind-utils/testdata/fractal_alloc.rs b/callgrind-utils/testdata/fractal_alloc.rs
new file mode 100644
index 000000000..c815e0931
--- /dev/null
+++ b/callgrind-utils/testdata/fractal_alloc.rs
@@ -0,0 +1,273 @@
+// Adapted directly from the real production benchmark that exhibits the
+// "free calls analyze_fractal_tree" misattribution on aarch64
+// (codspeed-integrations-e2e-tests/rust/{src/lib.rs,src/fractal.rs}). Unlike
+// testdata/fractal.rs (which deliberately avoids Vec/f64 to keep the graph
+// allocator-free and stable across platforms), this fixture intentionally
+// uses real Vec<FractalNode>/Vec<f64> heap allocation, matching the shape
+// that has been confirmed (via a real production trace) to trigger the
+// bug: `analyze_fractal_tree` computes a median then an interquartile range
+// back-to-back (each allocates a scratch Vec<f64>, sorts it, and drops it)
+// before making a self-recursive call.
+//
+// `enable_regression` is hardcoded false, matching CODSPEED_REGRESSION=0 in
+// CI -- it doesn't gate any of the allocations relevant to this bug.
+//
+// Build (mirrors testdata/fractal.rs's convention, done by tests/rust_callgraph.rs):
+//   rustc --edition 2021 -g -C opt-level=3 -L native=<dir> -l static=clgctl ...
+
+#![allow(dead_code)]
+
+extern "C" {
+    fn clg_start();
+    fn clg_stop();
+}
+
+#[derive(Debug, Clone)]
+struct FractalNode {
+    value: f64,
+    children: Vec<FractalNode>,
+}
+
+impl FractalNode {
+    fn new(value: f64) -> Self {
+        FractalNode {
+            value,
+            children: Vec::new(),
+        }
+    }
+
+    fn build_fractal(depth: usize, max_depth: usize, branch_factor: usize, seed: f64) -> Self {
+        let mut node = FractalNode::new(seed);
+        if depth < max_depth {
+            for i in 0..branch_factor {
+                let child_seed = Self::compute_child_value(seed, i, depth);
+                node.children
+                    .push(Self::build_fractal(depth + 1, max_depth, branch_factor, child_seed));
+            }
+        }
+        node
+    }
+
+    fn compute_child_value(parent_value: f64, child_index: usize, depth: usize) -> f64 {
+        let base = parent_value * 0.618033988749;
+        let offset = (child_index as f64 + 1.0) * (depth as f64 + 1.0);
+        (base + offset).sin().abs() * 100.0
+    }
+
+    fn recursive_sum(&self) -> f64 {
+        let children_sum: f64 = self.children.iter().map(|c| c.recursive_sum()).sum();
+        self.value + children_sum
+    }
+
+    fn max_path_sum(&self) -> f64 {
+        if self.children.is_empty() {
+            return self.value;
+        }
+        let max_child_path = self
+            .children
+            .iter()
+            .map(|c| c.max_path_sum())
+            .fold(f64::NEG_INFINITY, f64::max);
+        self.value + max_child_path
+    }
+
+    fn count_nodes(&self) -> usize {
+        1 + self.children.iter().map(|c| c.count_nodes()).sum::<usize>()
+    }
+
+    fn collect_leaves(&self, leaves: &mut Vec<f64>) {
+        if self.children.is_empty() {
+            leaves.push(self.value);
+        } else {
+            for child in &self.children {
+                child.collect_leaves(leaves);
+            }
+        }
+    }
+}
+
+#[no_mangle]
+#[inline(never)]
+fn analyze_fractal_tree(tree: &FractalNode, analysis_depth: usize) -> TreeAnalysis {
+    let total_sum = tree.recursive_sum();
+    let node_count = tree.count_nodes();
+    let max_path = tree.max_path_sum();
+
+    let mut leaves = Vec::new();
+    tree.collect_leaves(&mut leaves);
+    let leaf_variance = compute_variance(&leaves);
+
+    let leaf_stddev = leaf_variance.sqrt();
+    let leaf_median = compute_median(&leaves);
+    let leaf_iqr = compute_interquartile_range(&leaves);
+
+    if analysis_depth > 0 {
+        let nested_analysis = analyze_fractal_tree(tree, analysis_depth - 1);
+        TreeAnalysis {
+            total_sum: total_sum + nested_analysis.total_sum * 0.1,
+            node_count,
+            max_path: max_path.max(nested_analysis.max_path),
+            leaf_variance: (leaf_variance + nested_analysis.leaf_variance) / 2.0,
+            complexity_score: compute_complexity_score(
+                node_count,
+                leaf_variance,
+                max_path,
+                leaf_stddev,
+                leaf_median,
+                leaf_iqr,
+            ),
+        }
+    } else {
+        TreeAnalysis {
+            total_sum,
+            node_count,
+            max_path,
+            leaf_variance,
+            complexity_score: compute_complexity_score(
+                node_count,
+                leaf_variance,
+                max_path,
+                leaf_stddev,
+                leaf_median,
+                leaf_iqr,
+            ),
+        }
+    }
+}
+
+fn compute_variance(values: &[f64]) -> f64 {
+    if values.is_empty() {
+        return 0.0;
+    }
+    let mean = values.iter().sum::<f64>() / values.len() as f64;
+    values.iter().map(|v| (v - mean) * (v - mean)).sum::<f64>() / values.len() as f64
+}
+
+fn compute_median(values: &[f64]) -> f64 {
+    if values.is_empty() {
+        return 0.0;
+    }
+    let mut sorted = values.to_vec();
+    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal));
+    let mid = sorted.len() / 2;
+    if sorted.len() % 2 == 0 {
+        (sorted[mid - 1] + sorted[mid]) / 2.0
+    } else {
+        sorted[mid]
+    }
+}
+
+fn compute_interquartile_range(values: &[f64]) -> f64 {
+    if values.len() < 4 {
+        return 0.0;
+    }
+    let mut sorted = values.to_vec();
+    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal));
+    let q1_idx = sorted.len() / 4;
+    let q3_idx = (sorted.len() * 3) / 4;
+    sorted[q3_idx] - sorted[q1_idx]
+}
+
+fn compute_complexity_score(
+    node_count: usize,
+    variance: f64,
+    max_path: f64,
+    stddev: f64,
+    median: f64,
+    iqr: f64,
+) -> f64 {
+    let base_score = (node_count as f64).ln() * variance.sqrt();
+    let path_factor = recursive_path_score(max_path, 7);
+    let distribution_factor = (stddev + median + iqr) / 3.0;
+    let trig_factor = (distribution_factor.sin().abs() + distribution_factor.cos().abs()) / 2.0;
+    base_score * path_factor * (1.0 + trig_factor)
+}
+
+fn recursive_path_score(value: f64, depth: usize) -> f64 {
+    if depth == 0 || value < 1.0 {
+        return value;
+    }
+    let reduced = value * 0.8;
+    1.0 + recursive_path_score(reduced, depth - 1) * 0.5
+}
+
+#[derive(Debug)]
+struct TreeAnalysis {
+    total_sum: f64,
+    node_count: usize,
+    max_path: f64,
+    leaf_variance: f64,
+    complexity_score: f64,
+}
+
+fn fibonacci_memo(n: u32, memo: &mut std::collections::HashMap<u32, u64>) -> u64 {
+    if n <= 1 {
+        return n as u64;
+    }
+    if let Some(&result) = memo.get(&n) {
+        return result;
+    }
+    let result = fibonacci_memo(n - 1, memo) + fibonacci_memo(n - 2, memo);
+    memo.insert(n, result);
+    result
+}
+
+#[no_mangle]
+#[inline(never)]
+fn complex_fractal_benchmark(tree_depth: usize, branch_factor: usize, fib_n: u32) -> f64 {
+    let tree = FractalNode::build_fractal(0, tree_depth, branch_factor, 42.0);
+    let analysis = analyze_fractal_tree(&tree, 4);
+
+    let mut memo = std::collections::HashMap::new();
+    let fib_result = fibonacci_memo(fib_n, &mut memo) as f64;
+    let fib_result2 = fibonacci_memo(fib_n + 2, &mut memo) as f64;
+    let fib_result3 = fibonacci_memo(fib_n + 3, &mut memo) as f64;
+
+    let tree_hash_stub = tree.recursive_sum();
+    let tree_metric = analysis.total_sum
+        + (analysis.node_count as f64 * 10.0)
+        + analysis.max_path
+        + analysis.leaf_variance
+        + analysis.complexity_score;
+
+    let combined = tree_metric + fib_result + fib_result2 + fib_result3 + tree_hash_stub;
+    let transformed = combined.sqrt() * combined.ln_1p();
+    let trig_result = (combined / 1000.0).sin().powi(2) + (combined / 1000.0).cos().powi(2);
+
+    (transformed + combined + trig_result) % 1_000_000.0
+}
+
+#[no_mangle]
+#[inline(never)]
+fn run_measured() -> f64 {
+    unsafe {
+        clg_start();
+    }
+    let result = complex_fractal_benchmark(5, 3, 25);
+    unsafe {
+        clg_stop();
+    }
+    result
+}
+
+#[no_mangle]
+#[inline(never)]
+fn warmup() -> f64 {
+    let mut acc = 0.0f64;
+    for _ in 0..2 {
+        acc += complex_fractal_benchmark(5, 3, 25);
+    }
+    std::hint::black_box(acc);
+    run_measured()
+}
+
+#[no_mangle]
+#[inline(never)]
+fn run_benchmark() -> f64 {
+    warmup()
+}
+
+fn main() {
+    let result = run_benchmark();
+    std::hint::black_box(result);
+}
diff --git a/callgrind-utils/testdata/mutual.c b/callgrind-utils/testdata/mutual.c
new file mode 100644
index 000000000..0afc3013f
--- /dev/null
+++ b/callgrind-utils/testdata/mutual.c
@@ -0,0 +1,26 @@
+// Fixture: mutual recursion `is_even <-> is_odd`, forming a two-function cycle
+// reached from `main`. Exercises cyclic call topology. See recursion.c for the
+// instrumentation/build conventions.
+
+#include <callgrind.h>
+
+static int is_odd(int n);
+
+static int is_even(int n) {
+    return n == 0 ? 1 : is_odd(n - 1);
+}
+
+static int is_odd(int n) {
+    return n == 0 ? 0 : is_even(n - 1);
+}
+
+int main(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    volatile int sink = is_even(6);
+    (void)sink;
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/objskip_seed_underflow.c b/callgrind-utils/testdata/objskip_seed_underflow.c
new file mode 100644
index 000000000..4175ab01d
--- /dev/null
+++ b/callgrind-utils/testdata/objskip_seed_underflow.c
@@ -0,0 +1,100 @@
+// Minimal cross-arch reproduction of the arm64 OFF->ON seeding-underflow
+// cascade behind the python_fractal_* test failures (full analysis:
+// .agents/docs/arm64-python-seeding-underflow-analysis.md).
+//
+// Call shape, distilled from `python3 -X perf` + ctypes:
+//
+//   main -> trampoline_call            [asm: FP chain maintained, NO CFI]
+//     -> skip_drive                    [obj-skipped .so, "eval loop"]
+//       -> skip_begin_hop1 -> skip_begin_hop2      [obj-skipped "ffi dive"]
+//         -> clg_begin_marker          [this binary: START fires here]
+//       <- returns climb back through the skipped hops
+//       -> workload -> leaf_mix        [this binary: the measured region]
+//       -> skip_end_hop1 -> skip_end_hop2 -> clg_end_marker   [STOP]
+//
+// The asm trampoline mimics CPython's -X perf JIT trampolines: it maintains
+// the frame-pointer chain but has no .eh_frame FDE. Valgrind's aarch64
+// unwinder is CFI-only (m_stacktrace.c), so the OFF->ON seed stops AT the
+// trampoline and the seeded context stack is exactly
+// [trampoline_call, clg_begin_marker] — one entry deep once
+// clg_begin_marker's frame pops. On the next return, bbcc.c's underflow test
+// misreads the fn-stack base sentinel as a signal-separation marker,
+// handleUnderflow ignores fn->skip, and the skipped hops leak into the graph
+// as named, inverted, full-cost nodes. On x86_64 the unwinder's FP fallback
+// walks past the trampoline into main/libc, the context stack stays deeper
+// than one, and the output is clean — the correct shape on every arch:
+//
+//   trampoline_call;workload;leaf_mix     (skipped frames folded away)
+#include <callgrind.h>
+#include <limits.h>
+#include <stdlib.h>
+
+extern int skip_drive(int n);
+int trampoline_call(int n);
+
+// CPython-trampoline-shaped hop: frame record maintained (so the x86_64
+// FP-fallback unwinder can walk through it), but no .cfi_* directives, so no
+// FDE is emitted and the CFI-only aarch64 unwinder must stop here.
+#if defined(__aarch64__)
+__asm__(
+    ".text\n"
+    ".globl trampoline_call\n"
+    ".type trampoline_call, %function\n"
+    "trampoline_call:\n"
+    "    stp x29, x30, [sp, #-16]!\n"
+    "    mov x29, sp\n"
+    "    bl skip_drive\n"
+    "    ldp x29, x30, [sp], #16\n"
+    "    ret\n"
+    ".size trampoline_call, .-trampoline_call\n");
+#elif defined(__x86_64__)
+__asm__(
+    ".text\n"
+    ".globl trampoline_call\n"
+    ".type trampoline_call, @function\n"
+    "trampoline_call:\n"
+    "    pushq %rbp\n"
+    "    movq %rsp, %rbp\n"
+    "    call skip_drive@PLT\n"
+    "    popq %rbp\n"
+    "    ret\n"
+    ".size trampoline_call, .-trampoline_call\n");
+#else
+#error "objskip_seed_underflow: unsupported architecture"
+#endif
+
+// Innermost frame of the OFF->ON transition, reached through the skipped
+// hops — the clg_start() twin.
+__attribute__((noinline)) int clg_begin_marker(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    return 1;
+}
+
+__attribute__((noinline)) int clg_end_marker(void) {
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return 1;
+}
+
+__attribute__((noinline)) int leaf_mix(int v) {
+    return (int)(((unsigned)v * 2654435761u) >> 16);
+}
+
+// The measured region, called from the skipped library — must fold as a
+// direct child of trampoline_call.
+__attribute__((noinline)) int workload(int n) {
+    int acc = 1;
+    for (int i = 0; i < n; i++)
+        acc += leaf_mix(acc + i);
+    return acc;
+}
+
+int main(int argc, char **argv) {
+    // argv[1] = path of the companion .so; register by realpath, exactly as
+    // pytest-codspeed / fractal.py do (Callgrind keys obj-skip on the mapped
+    // object path).
+    char resolved[PATH_MAX];
+    if (argc > 1 && realpath(argv[1], resolved))
+        CALLGRIND_ADD_OBJ_SKIP(resolved);
+    int r = trampoline_call(512);
+    return r > 0 ? 0 : 1;
+}
diff --git a/callgrind-utils/testdata/objskip_seed_underflow_lib.c b/callgrind-utils/testdata/objskip_seed_underflow_lib.c
new file mode 100644
index 000000000..e3748a285
--- /dev/null
+++ b/callgrind-utils/testdata/objskip_seed_underflow_lib.c
@@ -0,0 +1,40 @@
+// Companion shared library for objskip_seed_underflow.c — the "interpreter".
+//
+// Every function here is obj-skipped at runtime (the binary passes this .so's
+// realpath to CALLGRIND_ADD_OBJ_SKIP before instrumentation starts), so none
+// of these frames may ever appear in the output. `skip_drive` plays CPython's
+// eval loop: it reaches the instrumentation toggles through extra skipped
+// hops (like the ctypes/libffi dive under `clgctl.clg_start()`), then calls
+// the measured workload back in the non-skipped binary.
+//
+// The two-hop dive matters: after the OFF->ON seed's innermost non-skipped
+// frame (clg_begin_marker) pops, the returns hop2 -> hop1 -> skip_drive
+// execute while the seeded context stack is one entry deep — the state that
+// trips bbcc.c's base-sentinel/underflow misfire on arm64.
+
+extern int clg_begin_marker(void);
+extern int clg_end_marker(void);
+extern int workload(int n);
+
+__attribute__((noinline)) int skip_begin_hop2(void) {
+    return clg_begin_marker() + 1;
+}
+
+__attribute__((noinline)) int skip_begin_hop1(void) {
+    return skip_begin_hop2() + 1;
+}
+
+__attribute__((noinline)) int skip_end_hop2(void) {
+    return clg_end_marker() + 1;
+}
+
+__attribute__((noinline)) int skip_end_hop1(void) {
+    return skip_end_hop2() + 1;
+}
+
+__attribute__((noinline)) int skip_drive(int n) {
+    int acc = skip_begin_hop1(); /* OFF->ON fires two skipped frames down */
+    acc += workload(n);          /* the measured region */
+    acc += skip_end_hop1();      /* ON->OFF fires two skipped frames down */
+    return acc;
+}
diff --git a/callgrind-utils/testdata/recursion.c b/callgrind-utils/testdata/recursion.c
new file mode 100644
index 000000000..e27cfa00e
--- /dev/null
+++ b/callgrind-utils/testdata/recursion.c
@@ -0,0 +1,40 @@
+// Fixture for callgrind-utils snapshot tests.
+//
+// A small, pure-compute call graph: direct recursion (`fib` -> `fib`) plus two
+// helper edges (`compute` -> `fib`, `compute` -> `square`) under `main`.
+//
+// Mirrors how CodSpeed drives a benchmark: instrumentation is off at startup
+// (run with `--instr-atstart=no`), so loader/libc-start frames are excluded,
+// then turned on around the measured region. Build with `-g -O0` so the
+// functions are real (no inlining) and carry debug names.
+//
+// Requires the in-repo Callgrind client-request header:
+//   cc -g -O0 -I callgrind -I include ...
+
+#include <callgrind.h>
+
+static int fib(int n) {
+    if (n < 2) {
+        return n;
+    }
+    return fib(n - 1) + fib(n - 2);
+}
+
+static int square(int n) {
+    return n * n;
+}
+
+static int compute(int n) {
+    return fib(n) + square(n);
+}
+
+int main(void) {
+    CALLGRIND_START_INSTRUMENTATION;
+    CALLGRIND_ZERO_STATS;
+
+    volatile int sink = compute(8);
+    (void)sink;
+
+    CALLGRIND_STOP_INSTRUMENTATION;
+    return 0;
+}
diff --git a/callgrind-utils/testdata/recursion.py b/callgrind-utils/testdata/recursion.py
new file mode 100644
index 000000000..5ea30dde1
--- /dev/null
+++ b/callgrind-utils/testdata/recursion.py
@@ -0,0 +1,59 @@
+# Python counterpart to recursion.c: the same fib/square/compute shape, driven
+# the way CodSpeed drives a benchmark. Instrumentation is off at startup (run
+# with --instr-atstart=no) and turned on around the measured region via the
+# clgctl shim, whose compiled path is passed as argv[1].
+#
+# Before starting, we register the Python runtime objects (libpython + the
+# python executable) as obj-skipped with Callgrind at runtime, exactly as
+# pytest-codspeed's instrument-hooks does in _callgrind_skip_python_runtime: the
+# interpreter's own C frames are folded into their callers so they don't
+# obscure the graph. Matching is by exact realpath, since Callgrind keys
+# obj-skip on the mapped object path.
+import ctypes
+import os
+import sys
+import sysconfig
+
+clgctl = ctypes.CDLL(sys.argv[1])
+
+
+def skip_python_runtime():
+    ldlibrary = sysconfig.get_config_var("LDLIBRARY")
+    libdir = sysconfig.get_config_var("LIBDIR")
+    libpython = next(
+        (
+            p
+            for p in (
+                os.path.join(libdir, ldlibrary) if ldlibrary and libdir else None,
+                os.path.join(sys.prefix, "lib", ldlibrary) if ldlibrary else None,
+            )
+            if p and os.path.exists(p)
+        ),
+        None,
+    )
+    for path in (libpython, sys.executable):
+        if path:
+            clgctl.clg_add_obj_skip(os.path.realpath(path).encode())
+
+
+def fib(n):
+    if n < 2:
+        return n
+    return fib(n - 1) + fib(n - 2)
+
+
+def square(n):
+    return n * n
+
+
+def compute(n):
+    return fib(n) + square(n)
+
+
+skip_python_runtime()
+
+clgctl.clg_start()
+sink = compute(20)
+clgctl.clg_stop()
+
+assert sink == 7165, sink
diff --git a/callgrind-utils/tests/arm64_tls_access.rs b/callgrind-utils/tests/arm64_tls_access.rs
new file mode 100644
index 000000000..fd15671d6
--- /dev/null
+++ b/callgrind-utils/tests/arm64_tls_access.rs
@@ -0,0 +1,221 @@
+//! AArch64 TLS-descriptor resolver transparency (`_dl_tlsdesc_return`).
+//!
+//! `testdata/arm64_tls_access.c` + `testdata/arm64_tls_access_lib.c`: a
+//! `__thread` variable defined in a shared library forces the TLSDESC access
+//! model, so every access `blr`s into the dynamic linker's resolver, which
+//! `ret`s straight back into the middle of the accessing function.
+//! `callgrind/fn.c` marks `_dl_tlsdesc_*` as skipped (the same transparent
+//! trampoline class as PLT stubs and `_dl_runtime_resolve`), so the resolver
+//! must never surface as a named node.
+//!
+//! The second test is the production shape (CPython under pytest-codspeed,
+//! which obj-skips the interpreter binary): a TLS access made *from an
+//! obj-skipped object* used to push the resolver frame with `ret_addr == 0`
+//! via the skipped->nonskipped splice; the resolver's mid-function return
+//! could never match, the RET-w/o-CALL promotion re-entered the skipped
+//! object with `nonskipped` pointing at the resolver, and skipped cost plus
+//! call edges piled up under `_dl_tlsdesc_return` (observed pulling nearly
+//! whole Python flamegraphs under that node, plus inverted
+//! `hash_tree -> build_tree` edges in this fixture).
+//!
+//! Structural assertions only (no insta golden), so glibc/toolchain noise
+//! stays out of the contract. aarch64-only: other arches compile this TLS
+//! access to `__tls_get_addr` calls and never exercise the TLSDESC path.
+#![cfg(target_arch = "aarch64")]
+use std::env::consts::ARCH;
+use std::io::Cursor;
+use std::path::{Path, PathBuf};
+use std::process::Command;
+
+use callgrind_utils::model::CallGraph;
+
+/// Repo root: this crate lives at `<repo>/callgrind-utils`.
+fn repo_root() -> PathBuf {
+    Path::new(env!("CARGO_MANIFEST_DIR"))
+        .parent()
+        .expect("crate has a parent directory")
+        .to_path_buf()
+}
+
+fn vg_in_place() -> PathBuf {
+    let path = repo_root().join("vg-in-place");
+    assert!(
+        path.is_file(),
+        "vg-in-place not found at {} - build Valgrind in place first",
+        path.display()
+    );
+    path
+}
+
+/// Compile the TLS-owning `.so` and the main binary (which links it directly,
+/// with an rpath back to the work dir). `-O2` matches the other `arm64_*`
+/// fixtures and keeps the TLSDESC sequence a real GOT-loaded `blr`.
+fn compile(work: &Path) -> (PathBuf, PathBuf) {
+    let repo = repo_root();
+    let testdata = Path::new(env!("CARGO_MANIFEST_DIR")).join("testdata");
+    std::fs::create_dir_all(work).expect("create work dir");
+
+    let lib = work.join("libarm64_tls_access.so");
+    let status = Command::new("cc")
+        .args(["-g", "-O2", "-shared", "-fPIC", "-o"])
+        .arg(&lib)
+        .arg(testdata.join("arm64_tls_access_lib.c"))
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn cc for TLS lib: {e}"));
+    assert!(status.success(), "cc failed for TLS lib ({status})");
+
+    let bin = work.join("arm64_tls_access");
+    let status = Command::new("cc")
+        .args(["-g", "-O2"])
+        .arg("-I")
+        .arg(repo.join("callgrind"))
+        .arg("-I")
+        .arg(repo.join("include"))
+        .arg("-o")
+        .arg(&bin)
+        .arg(testdata.join("arm64_tls_access.c"))
+        .arg(&lib)
+        .arg(format!("-Wl,-rpath,{}", work.display()))
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn cc for fixture: {e}"));
+    assert!(status.success(), "cc failed for fixture ({status})");
+    (bin, lib)
+}
+
+/// Profile with the exact runner flag set used by the other fixture tests,
+/// plus any `extra_args` (`--obj-skip=...` for the production shape).
+fn run_callgrind(bin: &Path, out_file: &Path, extra_args: &[String]) -> String {
+    let log_file = out_file.with_extension("valgrind.log");
+    let args = [
+        "-q",
+        "--trace-children=yes",
+        "--cache-sim=yes",
+        "--I1=32768,8,64",
+        "--D1=32768,8,64",
+        "--LL=8388608,16,64",
+        "--instr-atstart=no",
+        "--collect-systime=nsec",
+        "--read-inline-info=yes",
+        "--tool=callgrind",
+        "--compress-strings=no",
+        "--combine-dumps=yes",
+        "--dump-line=no",
+    ];
+    let status = Command::new("setarch")
+        .arg(ARCH)
+        .arg("--addr-no-randomize")
+        .arg(vg_in_place())
+        .args(args)
+        .args(extra_args)
+        .arg(format!("--callgrind-out-file={}", out_file.display()))
+        .arg(format!("--log-file={}", log_file.display()))
+        .arg(bin)
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn setarch/vg-in-place: {e}"));
+    assert!(status.success(), "vg-in-place exited with {status}");
+    std::fs::read_to_string(out_file).unwrap_or_else(|e| panic!("read {}: {e}", out_file.display()))
+}
+
+fn folded(raw: &str) -> Vec<String> {
+    CallGraph::parse(Cursor::new(raw))
+        .unwrap_or_else(|e| panic!("parse callgrind output: {e:?}"))
+        .to_folded_without_costs()
+}
+
+/// The resolver must never appear as a named node, and no stack may place a
+/// fixture function under it (the "post-trampoline work stolen" signature).
+fn assert_no_tlsdesc_node(folded: &[String], dump: &str) {
+    let leaked: Vec<&String> = folded.iter().filter(|l| l.contains("_dl_tlsdesc_")).collect();
+    assert!(
+        leaked.is_empty(),
+        "TLSDESC resolver leaked into the folded graph:\n{}\n\nfull folded output:\n{dump}",
+        leaked
+            .iter()
+            .map(|s| s.as_str())
+            .collect::<Vec<_>>()
+            .join("\n")
+    );
+}
+
+#[test]
+fn tlsdesc_resolver_is_transparent() {
+    let work = Path::new(env!("CARGO_TARGET_TMPDIR")).join("plain");
+    let (bin, _lib) = compile(&work);
+    let out_file = work.join("arm64_tls_access.callgrind.out");
+    let raw = run_callgrind(&bin, &out_file, &[]);
+    let folded = folded(&raw);
+    let dump = folded.join("\n");
+
+    assert_no_tlsdesc_node(&folded, &dump);
+
+    // touch_tls itself is not skipped here: it must show up under its real
+    // callers, with the resolver's cost folded into it...
+    assert!(
+        folded.iter().any(|l| l.contains(";touch_tls")),
+        "touch_tls missing from the folded graph:\n{dump}"
+    );
+    // ...and must stay a leaf: a `touch_tls;<fixture fn>` stack means the
+    // resolver's unmatched return re-parented the caller's work.
+    let stolen: Vec<&String> = folded
+        .iter()
+        .filter(|l| l.contains("touch_tls;"))
+        .collect();
+    assert!(
+        stolen.is_empty(),
+        "work stolen under touch_tls (unmatched TLSDESC return):\n{}\n\nfull folded output:\n{dump}",
+        stolen
+            .iter()
+            .map(|s| s.as_str())
+            .collect::<Vec<_>>()
+            .join("\n")
+    );
+}
+
+#[test]
+fn tlsdesc_from_objskipped_code_does_not_steal_cost() {
+    let work = Path::new(env!("CARGO_TARGET_TMPDIR")).join("objskip");
+    let (bin, lib) = compile(&work);
+    let out_file = work.join("arm64_tls_access.objskip.callgrind.out");
+    let raw = run_callgrind(&bin, &out_file, &[format!("--obj-skip={}", lib.display())]);
+    let folded = folded(&raw);
+    let dump = folded.join("\n");
+
+    assert_no_tlsdesc_node(&folded, &dump);
+
+    // The TLS-accessing function lives in the skipped object: it must fold
+    // away entirely.
+    assert!(
+        !dump.contains("touch_tls"),
+        "obj-skipped touch_tls leaked into the folded graph:\n{dump}"
+    );
+
+    // The stuck-resolver cascade manufactured return-direction edges
+    // (hash_tree "calling" build_tree). The real call direction is
+    // build_tree -> hash_tree only.
+    let inverted: Vec<&String> = folded
+        .iter()
+        .filter(|l| l.contains("hash_tree;build_tree"))
+        .collect();
+    assert!(
+        inverted.is_empty(),
+        "inverted hash_tree -> build_tree edges (stuck TLSDESC frame cascade):\n{}\n\nfull folded output:\n{dump}",
+        inverted
+            .iter()
+            .map(|s| s.as_str())
+            .collect::<Vec<_>>()
+            .join("\n")
+    );
+
+    // Instrumentation starts inside run_measured: it is the only legal root.
+    for line in &folded {
+        let root = line
+            .split(';')
+            .next()
+            .unwrap()
+            .trim_end_matches(" <cost>");
+        assert!(
+            root == "run_measured",
+            "unexpected root {root:?} in folded graph:\n{dump}"
+        );
+    }
+}
diff --git a/callgrind-utils/tests/data/example.out b/callgrind-utils/tests/data/example.out
new file mode 100644
index 000000000..fa47c84b8
--- /dev/null
+++ b/callgrind-utils/tests/data/example.out
@@ -0,0 +1,126 @@
+# callgrind format
+version: 1
+creator: callgrind-fixture
+pid: 1
+cmd: ./prog
+desc: I1 cache
+desc: D1 cache
+positions: line
+events: Ir
+summary: 1000
+totals: 1000
+
+# ===== Part 1 =====
+# Header / context lines: ob=, fl=, fn= each define compressed ID 1 in their own ID space.
+# Object ID 1 = /path/to/clreq ; File ID 1 = file1.c ; Function ID 1 = main.
+ob=(1) /path/to/clreq
+fl=(1) file1.c
+fn=(1) main
+
+# --- main body: cost/subposition lines using +N / * / -N / 0x... (all ignored) ---
+0x401000 4
++5 8
+* 3
+-2 1
+
+# --- two-line call spec: cfn=(2) func1 / calls=1 50 / cost 16 400 ---
+# Defines Function ID 2 = func1. The cost line (16 400) is present but ignored.
+cfn=(2) func1
+calls=1 50
+16 400
+
+# --- cfl= alias equals cfi= for a callee file spec ---
+# cfl=(5) cflfile.c defines File ID 5 = cflfile.c and sets the callee file.
+cfl=(5) cflfile.c
+cfn=cflop
+calls=1 52
+18 30
+
+# --- cfni= inline function line: ignored for topology (no node/edge created) ---
+cfni=(7) some_inline
+# --- omitted cfi/cfl: callee inherits the CURRENT file context (file1.c here) ---
+cfn=nofile
+calls=1 53
+19 10
+
+# --- same function name in two different objects/files -> TWO distinct nodes ---
+# helper in liba/fileA.c .  Object ID 2 = liba ; File ID 2 = fileA.c ; Function ID 4 = helper.
+cob=(2) liba
+cfi=(2) fileA.c
+cfn=(4) helper
+calls=1 60
+20 5
+# helper in libb/fileB.c (same name, different object+file -> distinct node).
+# cfn=(4) is a REFERENCE reusing Function ID 4 = helper.
+cob=(3) libb
+cfi=(3) fileB.c
+cfn=(4)
+calls=1 61
+21 5
+
+# --- cob= overrides caller object (callee in extlib, file inherited from context) ---
+# Object ID 4 = extlib . No cfi -> callee inherits current file (file1.c).
+cob=(4) extlib
+cfn=extfn
+calls=1 70
+22 3
+
+# --- switch caller context to func1 (fn=(2) REFERENCE -> resolves to func1) ---
+fl=(1)
+fn=(2)
+ob=(1)
+# func1 calls func2 .  Defines Function ID 3 = func2 .
+cfn=(3) func2
+calls=1 54
+23 100
+
+# --- switch caller context to func2 (fn=(3) REFERENCE) ---
+fl=(1)
+fn=(3)
+ob=(1)
+# func2 calls rec .  Defines Function ID 5 = rec .
+cfn=(5) rec
+calls=1 55
+24 50
+# func2 calls func1 : cfn=(2) REFERENCE resolves to func1 (name compression reuse).
+cfn=(2)
+calls=1 62
+25 20
+
+# --- switch caller context to rec (fn=(5) REFERENCE); recursion -> self-edge ---
+fl=(1)
+fn=(5)
+ob=(1)
+cfn=(5)
+calls=1 56
+26 7
+
+# --- inline fi=/fe= file transition BEFORE a call with no cfi ---
+# inlhost fl=file1.c .  fi=(6) inline.c switches the CURRENT file context to inline.c.
+# The call below has NO cfi/cfl, so the callee inherits inline.c (NOT the fl file1.c).
+fl=(1) file1.c
+fn=inlhost
+ob=(1)
+fi=(6) inline.c
+cfn=inltarget
+calls=1 57
+27 4
+# fe=(1) switches the current file context back to the function file (file1.c).
+fe=(1)
+
+# ===== Part 2: multi-part merge (ID maps persist across parts) =====
+part: 2
+# References resolve via the persistent ID maps:
+#   fl=(1) -> file1.c , fn=(1) -> main , ob=(1) -> /path/to/clreq
+fl=(1)
+fn=(1)
+ob=(1)
+# main calls part2fn : this edge only appears in part 2 and must merge into the graph.
+cfn=part2fn
+calls=1 100
+29 2
+# bare cfn= with NO calls= line -> NO edge. Per cl-format.xml, CallSpec requires a
+# calls= line (CallLine); cfn= alone only sets callee context and is discarded. The
+# "28 1" below is a self-cost line of main (ignored). `nocnt` must NOT become a node.
+cfn=nocnt
+28 1
diff --git a/callgrind-utils/tests/flamegraph.rs b/callgrind-utils/tests/flamegraph.rs
new file mode 100644
index 000000000..090b24bab
--- /dev/null
+++ b/callgrind-utils/tests/flamegraph.rs
@@ -0,0 +1,195 @@
+//! Tests for the collapsed-stack / flamegraph projection.
+
+use callgrind_utils::error::FlamegraphError;
+use callgrind_utils::model::CallGraph;
+use std::io::Cursor;
+
+fn parse(out: &str) -> CallGraph {
+    CallGraph::parse(Cursor::new(out)).expect("parse")
+}
+
+fn folded_sorted(g: &CallGraph) -> Vec<String> {
+    let mut lines = g.to_folded();
+    lines.sort();
+    lines
+}
+
+const LINEAR: &str = "\
+part: 1
+pid: 1
+positions: line
+events: Ir
+fn=main
+10 5
+cfn=work
+calls=1 90
+11 90
+
+fn=work
+20 40
+cfn=leaf
+calls=1 50
+21 50
+
+fn=leaf
+30 50
+";
+
+#[test]
+fn folds_linear_chain_with_self_costs() {
+    let g = parse(LINEAR);
+    assert_eq!(
+        folded_sorted(&g),
+        vec![
+            "main 5".to_string(),
+            "main;work 40".to_string(),
+            "main;work;leaf 50".to_string(),
+        ]
+    );
+}
+
+#[test]
+fn renders_svg() {
+    let g = parse(LINEAR);
+    let svg = g.to_flamegraph().expect("svg");
+    assert!(svg.contains("<svg"), "expected an SVG document");
+    assert!(svg.contains("main"), "expected frame labels in the SVG");
+}
+
+const SHARED: &str = "\
+part: 1
+pid: 1
+positions: line
+events: Ir
+fn=root
+10 0
+cfn=a
+calls=1 20
+11 20
+cfn=b
+calls=1 10
+12 10
+
+fn=a
+20 0
+cfn=shared
+calls=1 20
+21 20
+
+fn=b
+30 0
+cfn=shared
+calls=1 10
+31 10
+
+fn=shared
+40 30
+";
+
+#[test]
+fn distributes_shared_callee_by_inclusive_cost() {
+    let g = parse(SHARED);
+    assert_eq!(
+        folded_sorted(&g),
+        vec![
+            "root;a;shared 20".to_string(),
+            "root;b;shared 10".to_string(),
+        ]
+    );
+}
+
+const RECURSION: &str = "\
+part: 1
+pid: 1
+positions: line
+events: Ir
+fn=rec
+10 5
+cfn=rec
+calls=1 3
+11 3
+";
+
+#[test]
+fn recursion_does_not_loop() {
+    let g = parse(RECURSION);
+    let lines = folded_sorted(&g);
+    assert!(lines.iter().all(|l| l.matches("rec").count() <= 2));
+    assert!(!lines.is_empty());
+}
+
+const SEEDED: &str = "\
+part: 1
+pid: 1
+positions: line
+events: Ir
+fn=entry
+10 5
+cfn=hot
+calls=1 0
+11 0
+
+fn=hot
+20 100
+";
+
+#[test]
+fn heavy_frame_behind_zero_cost_edge_survives() {
+    let g = parse(SEEDED);
+    let folded = folded_sorted(&g);
+    assert!(
+        folded.iter().any(|l| l == "hot 100"),
+        "hot's self cost must survive a zero-cost incoming edge; got {folded:?}"
+    );
+    let total: u64 = folded
+        .iter()
+        .map(|l| l.rsplit(' ').next().unwrap().parse::<u64>().unwrap())
+        .sum();
+    assert_eq!(total, 105, "entry(5) + hot(100)");
+}
+
+const SPARSE: &str = "\
+part: 1
+pid: 1
+positions: instr line
+events: Ir Dr Dw
+fn=main
+0x1000 10 7
++4 11 3 0
+cfn=leaf
+calls=1 0 0
+0x2000 12 20
+
+fn=leaf
+0x3000 20 20 0 0
+";
+
+#[test]
+fn parses_sparse_instr_line_cost_lines() {
+    let g = parse(SPARSE);
+    assert_eq!(
+        folded_sorted(&g),
+        vec!["main 10".to_string(), "main;leaf 20".to_string()],
+        "self=7+3 for main, inclusive/self=20 for leaf"
+    );
+}
+
+#[test]
+fn no_cost_data_is_an_error() {
+    let out = "\
+part: 1
+pid: 1
+positions: line
+events: Ir
+fn=main
+0 0
+cfn=child
+calls=1 0
+0 0
+
+fn=child
+0 0
+";
+    let g = parse(out);
+    assert!(matches!(g.to_flamegraph(), Err(FlamegraphError::NoCost)));
+}
diff --git a/callgrind-utils/tests/objskip_seed_underflow.rs b/callgrind-utils/tests/objskip_seed_underflow.rs
new file mode 100644
index 000000000..623e3c902
--- /dev/null
+++ b/callgrind-utils/tests/objskip_seed_underflow.rs
@@ -0,0 +1,165 @@
+//! Minimal cross-arch reproduction of the arm64 OFF->ON seeding-underflow
+//! cascade (analysis: `.agents/docs/arm64-python-seeding-underflow-analysis.md`).
+//!
+//! `testdata/objskip_seed_underflow.c` starts instrumentation two obj-skipped
+//! frames below an asm trampoline that maintains the FP chain but carries no
+//! CFI — the exact shape of CPython's `-X perf` JIT trampolines in the
+//! `python_fractal_*` tests. Correct behavior on every arch: the `skip_*`
+//! frames fold away and `workload` parents under `trampoline_call`.
+//!
+//! On aarch64 the CFI-only unwinder seeds a depth-1 context stack, bbcc.c's
+//! underflow heuristic misreads the fn-stack base sentinel as a signal
+//! marker, and `handleUnderflow` mints named nodes for `skip=1` functions —
+//! these assertions fail there until that is fixed. On x86_64 the FP-fallback
+//! unwinder seeds main/libc below the trampoline, the depth never reaches 1,
+//! and the assertions pass.
+//!
+//! Structural assertions only (no insta golden): the contract is
+//! platform-independent, and the folded text stays free of libc/arch noise.
+use std::env::consts::ARCH;
+use std::io::Cursor;
+use std::path::{Path, PathBuf};
+use std::process::Command;
+
+use callgrind_utils::model::CallGraph;
+
+/// Repo root: this crate lives at `<repo>/callgrind-utils`.
+fn repo_root() -> PathBuf {
+    Path::new(env!("CARGO_MANIFEST_DIR"))
+        .parent()
+        .expect("crate has a parent directory")
+        .to_path_buf()
+}
+
+fn vg_in_place() -> PathBuf {
+    let path = repo_root().join("vg-in-place");
+    assert!(
+        path.is_file(),
+        "vg-in-place not found at {} - build Valgrind in place first",
+        path.display()
+    );
+    path
+}
+
+/// Compile the skipped companion `.so` and the main binary (which links it
+/// directly, with an rpath back to the work dir).
+fn compile(work: &Path) -> (PathBuf, PathBuf) {
+    let repo = repo_root();
+    let testdata = Path::new(env!("CARGO_MANIFEST_DIR")).join("testdata");
+    std::fs::create_dir_all(work).expect("create work dir");
+
+    let lib = work.join("libobjskip_seed_underflow.so");
+    let status = Command::new("cc")
+        // -z now: resolve PLTs eagerly so lazy-binding _dl_runtime_resolve
+        // chains don't show up as first-call noise in the folded graph.
+        .args(["-g", "-O0", "-shared", "-fPIC", "-Wl,-z,now", "-o"])
+        .arg(&lib)
+        .arg(testdata.join("objskip_seed_underflow_lib.c"))
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn cc for skip lib: {e}"));
+    assert!(status.success(), "cc failed for skip lib ({status})");
+
+    let bin = work.join("objskip_seed_underflow");
+    let status = Command::new("cc")
+        // -rdynamic: the skipped .so resolves workload/clg_*_marker back in
+        // the executable at load time.
+        .args(["-g", "-O0", "-rdynamic", "-Wl,-z,now"])
+        .arg("-I")
+        .arg(repo.join("callgrind"))
+        .arg("-I")
+        .arg(repo.join("include"))
+        .arg("-o")
+        .arg(&bin)
+        .arg(testdata.join("objskip_seed_underflow.c"))
+        .arg(&lib)
+        .arg(format!("-Wl,-rpath,{}", work.display()))
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn cc for fixture: {e}"));
+    assert!(status.success(), "cc failed for fixture ({status})");
+    (bin, lib)
+}
+
+/// Profile with the exact runner flag set used by the other fixture tests.
+/// The lib path is argv[1]: the fixture realpaths it into
+/// `CALLGRIND_ADD_OBJ_SKIP` before instrumentation starts.
+fn run_callgrind(bin: &Path, lib: &Path, out_file: &Path) -> String {
+    let log_file = out_file.with_extension("valgrind.log");
+    let args = [
+        "-q",
+        "--trace-children=yes",
+        "--cache-sim=yes",
+        "--I1=32768,8,64",
+        "--D1=32768,8,64",
+        "--LL=8388608,16,64",
+        "--instr-atstart=no",
+        "--collect-systime=nsec",
+        "--read-inline-info=yes",
+        "--tool=callgrind",
+        "--compress-strings=no",
+        "--combine-dumps=yes",
+        "--dump-line=no",
+    ];
+    let status = Command::new("setarch")
+        .arg(ARCH)
+        .arg("--addr-no-randomize")
+        .arg(vg_in_place())
+        .args(args)
+        .arg(format!("--callgrind-out-file={}", out_file.display()))
+        .arg(format!("--log-file={}", log_file.display()))
+        .arg(bin)
+        .arg(lib)
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn setarch/vg-in-place: {e}"));
+    assert!(status.success(), "vg-in-place exited with {status}");
+    std::fs::read_to_string(out_file).unwrap_or_else(|e| panic!("read {}: {e}", out_file.display()))
+}
+
+#[test]
+fn objskip_seed_underflow_folds_skipped_frames() {
+    let work = Path::new(env!("CARGO_TARGET_TMPDIR")).join("objskip_seed_underflow");
+    let (bin, lib) = compile(&work);
+    let out_file = work.join("objskip_seed_underflow.callgrind.out");
+    let raw = run_callgrind(&bin, &lib, &out_file);
+    let graph = CallGraph::parse(Cursor::new(raw.as_str()))
+        .unwrap_or_else(|e| panic!("parse callgrind output: {e:?}"));
+    let folded = graph.to_folded_without_costs();
+    let dump = folded.join("\n");
+
+    // 1. Obj-skipped frames must never appear as named nodes. On broken
+    //    arm64 the underflow cascade mints skip_begin_hop1/_ctypes-style
+    //    inverted nodes for skip=1 functions.
+    let leaked: Vec<&String> = folded.iter().filter(|l| l.contains("skip_")).collect();
+    assert!(
+        leaked.is_empty(),
+        "obj-skipped frames leaked into the folded graph \
+         (OFF->ON seeding-underflow cascade):\n{}\n\nfull folded output:\n{dump}",
+        leaked
+            .iter()
+            .map(|s| s.as_str())
+            .collect::<Vec<_>>()
+            .join("\n")
+    );
+
+    // 2. The measured region must parent under the trampoline (nearest
+    //    non-skipped frame), like py::run_measured;py::complex_fractal_benchmark.
+    assert!(
+        folded
+            .iter()
+            .any(|l| l.starts_with("trampoline_call;workload")),
+        "workload is not parented under trampoline_call:\n{dump}"
+    );
+
+    // 3. Roots: only the seeded innermost frame (clg_begin_marker, tiny
+    //    post-request residue) and the trampoline may be roots.
+    for line in &folded {
+        let root = line
+            .split(';')
+            .next()
+            .unwrap()
+            .trim_end_matches(" <cost>");
+        assert!(
+            root == "trampoline_call" || root == "clg_begin_marker",
+            "unexpected root {root:?} in folded graph:\n{dump}"
+        );
+    }
+}
diff --git a/callgrind-utils/tests/parser.rs b/callgrind-utils/tests/parser.rs
new file mode 100644
index 000000000..671ba39e9
--- /dev/null
+++ b/callgrind-utils/tests/parser.rs
@@ -0,0 +1,314 @@
+//! Integration tests for the Callgrind `.out` -> call-graph parser.
+//!
+//! Exercises the real format shapes from `callgrind/docs/cl-format.xml` and
+//! `callgrind/dump.c`: two-line call specs, name compression `(N)`, the
+//! `cfl`/`cfi` alias, callee file/object inheritance (including inline
+//! `fi`/`fe` transitions), same-named functions in distinct objects, direct
+//! recursion, multi-part merge, and the canonical JSON projection.
+
+use callgrind_utils::model::{CallGraph, Edge, Node, ParseOptions};
+use std::io::Cursor;
+
+const FIXTURE: &str = include_str!("data/example.out");
+
+fn parse_default() -> CallGraph {
+    CallGraph::parse(Cursor::new(FIXTURE)).expect("parse fixture")
+}
+
+/// All edges from `caller` to `callee`, matched by function name.
+fn edges_fn<'a>(g: &'a CallGraph, caller: &str, callee: &str) -> Vec<&'a Edge> {
+    g.edges()
+        .iter()
+        .filter(|e| e.caller.function == caller && e.callee.function == callee)
+        .collect()
+}
+
+/// All nodes with the given function name (distinct by object/file).
+fn nodes_fn<'a>(g: &'a CallGraph, function: &str) -> Vec<&'a Node> {
+    g.nodes()
+        .iter()
+        .filter(|n| n.function == function)
+        .collect()
+}
+
+#[test]
+fn parses_basic_callgraph() {
+    let g = parse_default();
+    // 12 distinct nodes, 12 edges (see fixture; `nocnt` is discarded, no edge).
+    assert_eq!(g.nodes().len(), 12, "nodes: {:#?}", g.nodes());
+    assert_eq!(g.edges().len(), 12, "edges: {:#?}", g.edges());
+
+    let mf1 = edges_fn(&g, "main", "func1");
+    assert_eq!(mf1.len(), 1);
+    assert_eq!(mf1[0].call_count, Some(1));
+    assert_eq!(mf1[0].caller.file, "file1.c");
+    assert_eq!(mf1[0].callee.file, "file1.c");
+}
+
+#[test]
+fn resolves_name_compression() {
+    // `fn=(1)`/`fl=(1)`/`ob=(1)` references must resolve to their defs.
+    let g = parse_default();
+    let main = nodes_fn(&g, "main");
+    assert_eq!(main.len(), 1);
+    assert_eq!(main[0].file, "file1.c");
+    assert_eq!(main[0].object, "clreq");
+    // func2 -> func1 uses `cfn=(2)` as a *reference* to the earlier def.
+    assert_eq!(edges_fn(&g, "func2", "func1").len(), 1);
+}
+
+#[test]
+fn cfl_alias_equals_cfi() {
+    // `cfl=(5) cflfile.c` is the historical alias of `cfi=`; the callee file
+    // must resolve to cflfile.c.
+    let g = parse_default();
+    let e = edges_fn(&g, "main", "cflop");
+    assert_eq!(e.len(), 1);
+    assert_eq!(e[0].callee.file, "cflfile.c");
+    assert_eq!(e[0].callee.object, "clreq");
+}
+
+#[test]
+fn omitted_cfi_inherits_current_file_context() {
+    // No `cfi`/`cfl`: the callee inherits the CURRENT position file, NOT the
+    // caller's original `fl`. For `nofile` the context is still file1.c.
+    let g = parse_default();
+    let e = edges_fn(&g, "main", "nofile");
+    assert_eq!(e.len(), 1);
+    assert_eq!(e[0].callee.file, "file1.c");
+}
+
+#[test]
+fn inline_fi_fe_changes_callee_context_not_caller() {
+    // CRITICAL: after `fi=(6) inline.c`, a `cfn=` with no `cfi` makes the
+    // CALLEE inherit inline.c, while the CALLER (inlhost) keeps its own `fl`
+    // (file1.c). Pins both halves: caller file != callee file here.
+    let g = parse_default();
+    let inlhost = nodes_fn(&g, "inlhost");
+    assert_eq!(inlhost.len(), 1);
+    assert_eq!(
+        inlhost[0].file, "file1.c",
+        "caller keeps its fl, not the inline file"
+    );
+
+    let e = edges_fn(&g, "inlhost", "inltarget");
+    assert_eq!(e.len(), 1);
+    assert_eq!(
+        e[0].callee.file, "inline.c",
+        "callee inherits the inline context"
+    );
+    assert_eq!(e[0].caller.file, "file1.c");
+}
+
+#[test]
+fn same_name_different_object_are_distinct() {
+    // `helper` exists in liba/fileA.c AND libb/fileB.c -> two distinct nodes,
+    // two distinct edges from main.
+    let g = parse_default();
+    let helpers = nodes_fn(&g, "helper");
+    assert_eq!(helpers.len(), 2, "helpers: {helpers:#?}");
+
+    let mut keys: Vec<(&str, &str)> = helpers
+        .iter()
+        .map(|n| (n.object.as_str(), n.file.as_str()))
+        .collect();
+    keys.sort();
+    assert_eq!(keys, vec![("liba", "fileA.c"), ("libb", "fileB.c")]);
+
+    assert_eq!(edges_fn(&g, "main", "helper").len(), 2);
+}
+
+#[test]
+fn recursion_becomes_self_edge() {
+    let g = parse_default();
+    let rec = edges_fn(&g, "rec", "rec");
+    assert_eq!(rec.len(), 1);
+    assert_eq!(rec[0].caller, rec[0].callee);
+}
+
+#[test]
+fn cob_overrides_caller_object() {
+    // `cob=(4) extlib` with no `cfi`: callee object is extlib, file inherited
+    // from caller context (file1.c).
+    let g = parse_default();
+    let e = edges_fn(&g, "main", "extfn");
+    assert_eq!(e.len(), 1);
+    assert_eq!(e[0].callee.object, "extlib");
+    assert_eq!(e[0].callee.file, "file1.c");
+    assert_eq!(e[0].caller.object, "clreq");
+}
+
+#[test]
+fn multi_part_merged() {
+    // The `part: 2` section's `main -> part2fn` edge must merge into one graph.
+    let g = parse_default();
+    assert_eq!(edges_fn(&g, "main", "part2fn").len(), 1);
+}
+
+#[test]
+fn bare_cfn_without_calls_is_discarded() {
+    // `cfn=nocnt` with no `calls=` line is callee context only, not a call
+    // record (cl-format.xml: CallSpec requires a CallLine). No node, no edge.
+    let g = parse_default();
+    assert!(nodes_fn(&g, "nocnt").is_empty(), "nocnt must not be a node");
+    assert!(edges_fn(&g, "main", "nocnt").is_empty(), "no edge to nocnt");
+}
+
+#[test]
+fn every_edge_has_a_call_count() {
+    // With the calls=-required rule, every emitted edge carries Some(count).
+    let g = parse_default();
+    for e in g.edges() {
+        assert!(e.call_count.is_some(), "edge {e:?} should have a count");
+    }
+}
+
+#[test]
+fn costs_and_addresses_ignored() {
+    // Subposition/cost lines (+N, *, -N, 0x..., "16 400") never create nodes.
+    // Node count stays at the 12 real functions.
+    let g = parse_default();
+    assert_eq!(g.nodes().len(), 12);
+    assert!(!g.nodes().iter().any(|n| n.function.starts_with("0x")));
+}
+
+#[test]
+fn paths_normalized_by_default() {
+    // Default opts: object path `/path/to/clreq` -> basename `clreq`.
+    let g = parse_default();
+    assert!(g.nodes().iter().any(|n| n.object == "clreq"));
+    assert!(
+        !g.nodes().iter().any(|n| n.object.contains('/')),
+        "no object should retain a path separator"
+    );
+}
+
+#[test]
+fn paths_verbatim_when_normalization_off() {
+    let opts = ParseOptions {
+        normalize_paths: false,
+        ..Default::default()
+    };
+    let g = CallGraph::parse_with(Cursor::new(FIXTURE), &opts).expect("parse");
+    assert!(
+        g.nodes().iter().any(|n| n.object == "/path/to/clreq"),
+        "object path must be kept verbatim: {:#?}",
+        g.nodes()
+    );
+}
+
+#[test]
+fn to_json_is_canonical() {
+    let g = parse_default();
+    let json = g.to_json().expect("to_json");
+    let v: serde_json::Value = serde_json::from_str(&json).expect("valid json");
+
+    let nodes = v["nodes"].as_array().expect("nodes array");
+    let edges = v["edges"].as_array().expect("edges array");
+    assert_eq!(nodes.len(), 12);
+    assert_eq!(edges.len(), 12);
+
+    // Nodes sorted by (object, file, function).
+    let key = |n: &serde_json::Value| {
+        (
+            n["object"].as_str().unwrap().to_owned(),
+            n["file"].as_str().unwrap().to_owned(),
+            n["function"].as_str().unwrap().to_owned(),
+        )
+    };
+    let mut sorted = nodes.clone();
+    sorted.sort_by_key(key);
+    assert_eq!(nodes, &sorted, "nodes must be pre-sorted");
+
+    // Edges reference nodes by valid index; call_count present (never None here).
+    for e in edges {
+        let c = e["caller"].as_u64().unwrap() as usize;
+        let d = e["callee"].as_u64().unwrap() as usize;
+        assert!(
+            c < nodes.len() && d < nodes.len(),
+            "edge index out of range"
+        );
+        assert!(
+            e.get("call_count").is_some(),
+            "call_count present for fixture edges"
+        );
+    }
+
+    // Edges sorted by (caller_idx, callee_idx).
+    let pairs: Vec<(u64, u64)> = edges
+        .iter()
+        .map(|e| (e["caller"].as_u64().unwrap(), e["callee"].as_u64().unwrap()))
+        .collect();
+    let mut sorted_pairs = pairs.clone();
+    sorted_pairs.sort();
+    assert_eq!(
+        pairs, sorted_pairs,
+        "edges must be pre-sorted by index pair"
+    );
+}
+
+#[test]
+fn to_json_omits_none_call_count() {
+    // The serializer omits `call_count` only when the value is None. Every
+    // fixture edge has Some (the calls=-required rule), so every serialized
+    // edge object must carry the field; a missing one means a dropped count.
+    let g = parse_default();
+    let json = g.to_json().expect("to_json");
+    let v: serde_json::Value = serde_json::from_str(&json).unwrap();
+    for e in v["edges"].as_array().unwrap() {
+        assert!(e.get("call_count").is_some());
+    }
+}
+
+#[test]
+fn bare_cfn_does_not_poison_next_edge() {
+    // A bare `cfn=unused` (cleared by the following self-cost line) must not
+    // become the callee of a later `calls=` that has its own `cfn=`.
+    let out = "\
+# callgrind format
+events: Ir
+ob=(1) prog
+fl=(1) a.c
+fn=(1) caller
+cfn=(2) unused
+5 3
+cfn=(3) realcallee
+calls=2 10
+6 4
+";
+    let g = CallGraph::parse(Cursor::new(out)).expect("parse");
+    assert!(
+        nodes_fn(&g, "unused").is_empty(),
+        "bare cfn must be discarded"
+    );
+    let e = edges_fn(&g, "caller", "realcallee");
+    assert_eq!(e.len(), 1);
+    assert_eq!(e[0].call_count, Some(2));
+    assert!(edges_fn(&g, "caller", "unused").is_empty());
+}
+
+#[test]
+fn bare_cfn_does_not_survive_jump_line() {
+    // A `jump=`/`jcnd=` line between a bare `cfn=` and a `calls=` must clear the
+    // pending callee, so the `calls=` (lacking its own `cfn=`) emits no edge.
+    let out = "\
+# callgrind format
+events: Ir
+ob=(1) prog
+fl=(1) a.c
+fn=(1) caller
+cfn=(2) unused
+jump=3 10
+calls=2 11
+6 4
+";
+    let g = CallGraph::parse(Cursor::new(out)).expect("parse");
+    assert!(
+        nodes_fn(&g, "unused").is_empty(),
+        "jump must clear the pending cfn"
+    );
+    assert!(
+        g.edges().is_empty(),
+        "calls= had no live cfn after the jump -> no edge"
+    );
+}
diff --git a/callgrind-utils/tests/python_callgraph.rs b/callgrind-utils/tests/python_callgraph.rs
new file mode 100644
index 000000000..91cf91b71
--- /dev/null
+++ b/callgrind-utils/tests/python_callgraph.rs
@@ -0,0 +1,137 @@
+//! Snapshot of the Python fixture's external Callgrind graph JSON.
+//!
+//! Callgrind records the CPython interpreter's C frames, not the Python
+//! functions: the interpreter loop is obj-skipped at runtime via the `clgctl`
+//! shim's `CALLGRIND_ADD_OBJ_SKIP`, so what remains is the ctypes/libffi/libc
+//! C-residual around the `clg_start`/`clg_stop` shim.
+//!
+//! Requires a built `./vg-in-place` at the repo root and `cc`. Skips (with a
+//! note on stderr) when `python3` is not on PATH, like the `.vgtest` `prereq` guards.
+use std::env::consts::ARCH;
+use std::io::Cursor;
+use std::path::{Path, PathBuf};
+use std::process::Command;
+
+use callgrind_utils::model::CallGraph;
+
+/// Repo root: this crate lives at `<repo>/callgrind-utils`.
+fn repo_root() -> PathBuf {
+    Path::new(env!("CARGO_MANIFEST_DIR"))
+        .parent()
+        .expect("crate has a parent directory")
+        .to_path_buf()
+}
+
+fn vg_in_place() -> PathBuf {
+    let path = repo_root().join("vg-in-place");
+    assert!(
+        path.is_file(),
+        "vg-in-place not found at {} - build Valgrind in place first",
+        path.display()
+    );
+    path
+}
+
+fn have_python3() -> bool {
+    Command::new("python3")
+        .arg("--version")
+        .output()
+        .map(|o| o.status.success())
+        .unwrap_or(false)
+}
+
+/// Compile the Callgrind client-request shim the Python fixture loads via
+/// `ctypes`, as a shared library against the in-repo `callgrind.h`.
+fn compile_clgctl() -> PathBuf {
+    let repo = repo_root();
+    let src = Path::new(env!("CARGO_MANIFEST_DIR")).join("testdata/clgctl.c");
+    let lib = Path::new(env!("CARGO_TARGET_TMPDIR")).join("libclgctl.so");
+
+    let status = Command::new("cc")
+        .args(["-g", "-O0", "-shared", "-fPIC"])
+        .arg("-I")
+        .arg(repo.join("callgrind"))
+        .arg("-I")
+        .arg(repo.join("include"))
+        .arg("-o")
+        .arg(&lib)
+        .arg(&src)
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn cc for clgctl: {e}"));
+    assert!(
+        status.success(),
+        "cc failed for {} ({status})",
+        src.display()
+    );
+    lib
+}
+
+fn runner_callgrind_args(out_file: &Path) -> Vec<String> {
+    [
+        "-q",
+        "--trace-children=yes",
+        "--cache-sim=yes",
+        "--I1=32768,8,64",
+        "--D1=32768,8,64",
+        "--LL=8388608,16,64",
+        "--instr-atstart=no",
+        "--collect-systime=nsec",
+        "--read-inline-info=yes",
+        "--tool=callgrind",
+        "--compress-strings=no",
+        "--combine-dumps=yes",
+        "--dump-line=no",
+    ]
+    .into_iter()
+    .map(str::to_string)
+    .chain([format!("--callgrind-out-file={}", out_file.display())])
+    .collect()
+}
+
+/// Profile `testdata/recursion.py` with the same Callgrind flags as the runner
+/// and return the `.out` contents. `_instr_atstart` is currently ignored.
+fn run_python(clgctl: &Path, _instr_atstart: bool) -> String {
+    let script = Path::new(env!("CARGO_MANIFEST_DIR")).join("testdata/recursion.py");
+    let out_file = Path::new(env!("CARGO_TARGET_TMPDIR")).join("python.callgrind.out");
+    let log_file = out_file.with_extension("valgrind.log");
+
+    let status = Command::new("setarch")
+        .arg(ARCH)
+        .arg("--addr-no-randomize")
+        .arg(vg_in_place())
+        .args(runner_callgrind_args(&out_file))
+        .arg(format!("--log-file={}", log_file.display()))
+        .arg("python3")
+        .arg(&script)
+        .arg(clgctl)
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn setarch/vg-in-place: {e}"));
+    assert!(status.success(), "vg-in-place exited with {status}");
+    std::fs::read_to_string(&out_file)
+        .unwrap_or_else(|e| panic!("read {}: {e}", out_file.display()))
+}
+
+/// Render a flamegraph of the fixture. Intended for `--instr-atstart=yes`, which
+/// captures the whole-program stack from process start and exposes the
+/// interpreter's `fib` recursion (`_PyEval_EvalFrameDefault` and the PyLong/frame
+/// helpers). `run_python` currently ignores its flag and passes
+/// `--instr-atstart=no`; there the measured region begins inside obj-skipped
+/// libpython, so everything folds into `(below main)` as one uninformative bar.
+/// Rendered from the RAW graph (redaction collapses libc/ld into a non-root
+/// `???` node). Writes `python.svg` at the crate root for manual inspection.
+#[test]
+#[ignore]
+fn python_flamegraph() {
+    if !have_python3() {
+        eprintln!("skipping python_flamegraph: python3 not on PATH");
+        return;
+    }
+
+    let clgctl = compile_clgctl();
+    let raw = run_python(&clgctl, true);
+    let graph = CallGraph::parse(Cursor::new(raw.as_str()))
+        .unwrap_or_else(|e| panic!("parse python callgrind output: {e:?}"));
+
+    let out = Path::new(env!("CARGO_MANIFEST_DIR")).join("python.svg");
+    graph.to_flamegraph_file(&out).expect("render flamegraph");
+}
diff --git a/callgrind-utils/tests/rust_callgraph.rs b/callgrind-utils/tests/rust_callgraph.rs
new file mode 100644
index 000000000..3708ae1db
--- /dev/null
+++ b/callgrind-utils/tests/rust_callgraph.rs
@@ -0,0 +1,328 @@
+//! Golden snapshot of the Rust fixture's call graph.
+//!
+//! The Rust twin of the C `fractal` case in `snapshot.rs`: compile
+//! `testdata/fractal.rs` (linking the `clgctl.c` client-request shim as a static
+//! lib, since the CALLGRIND_* requests are inline asm), profile it live under
+//! the in-repo Callgrind with `--instr-atstart=no`, parse, and snapshot the
+//! redacted folded stacks.
+//!
+//! The fixture fires the client requests several frames deep
+//! (`main` -> `run_benchmark` -> `warmup` -> `run_measured`), so the scoped
+//! graph is just the measured region's own functions: the shadow-stack seeder
+//! reconstructs the native chain but the outer frames do their work while
+//! instrumentation is off, so they never enter the graph. Every fixture
+//! function is `#[no_mangle] #[inline(never)]` and the workload is pure integer
+//! math over a fixed arena, so the only non-fixture frame is a libc `memset`
+//! (redacted to `???`) and the folded output is stable across platforms.
+//!
+//! A second case, `rust_fixture_full_trace`, mirrors the C `fixture_full_trace`:
+//! it keeps the runner's `--instr-atstart=no` (the production runner captures
+//! no separate full-program trace) but snapshots the folded stacks raw, with no
+//! redaction, so it is toolchain- and platform-specific like the C full-trace
+//! snapshots. Callgrind demangles the Rust symbols and drops their hash
+//! suffixes, so the names stay stable for a pinned toolchain.
+//!
+//! Requires a built `./vg-in-place` at the repo root, plus `cc` and `ar`. Skips
+//! (with a note on stderr) when `rustc` is not on PATH.
+use std::env::consts::ARCH;
+use std::io::Cursor;
+use std::path::{Path, PathBuf};
+use std::process::Command;
+
+use callgrind_utils::model::CallGraph;
+
+/// Repo root: this crate lives at `<repo>/callgrind-utils`.
+fn repo_root() -> PathBuf {
+    Path::new(env!("CARGO_MANIFEST_DIR"))
+        .parent()
+        .expect("crate has a parent directory")
+        .to_path_buf()
+}
+
+fn vg_in_place() -> PathBuf {
+    let path = repo_root().join("vg-in-place");
+    assert!(
+        path.is_file(),
+        "vg-in-place not found at {} - build Valgrind in place first",
+        path.display()
+    );
+    path
+}
+
+fn have_rustc() -> bool {
+    Command::new("rustc")
+        .arg("--version")
+        .output()
+        .map(|o| o.status.success())
+        .unwrap_or(false)
+}
+
+/// Compile the Callgrind client-request shim into a static library, then build
+/// `testdata/fractal.rs` against it.
+///
+/// `-C opt-level=2` inlines away the std iterator / bounds-check helpers so they
+/// don't appear as their own (toolchain-version-specific) nodes; the fixture's
+/// own functions stay distinct because each is `#[inline(never)]`. The binary is
+/// named `fractal_rs` so its object basename is stable in the snapshot.
+///
+/// Each caller passes a private `work` dir: the two test cases run in parallel,
+/// so they must not share the intermediate `.o`/`.a`/binary paths. The binary
+/// basename stays `fractal_rs` either way, so the snapshot's object name is
+/// identical across cases.
+fn compile_rust_fixture(work: &Path) -> PathBuf {
+    let repo = repo_root();
+    let manifest = Path::new(env!("CARGO_MANIFEST_DIR"));
+    let tmp = work;
+    std::fs::create_dir_all(tmp).expect("create work dir");
+
+    let obj = tmp.join("clgctl_rs.o");
+    let status = Command::new("cc")
+        .args(["-g", "-O0", "-fPIC", "-c"])
+        .arg("-I")
+        .arg(repo.join("callgrind"))
+        .arg("-I")
+        .arg(repo.join("include"))
+        .arg("-o")
+        .arg(&obj)
+        .arg(manifest.join("testdata/clgctl.c"))
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn cc for clgctl: {e}"));
+    assert!(status.success(), "cc failed for clgctl.c ({status})");
+
+    // `ar rcs` updates an existing archive in place, so start from a clean one.
+    let lib = tmp.join("libclgctl_rs.a");
+    let _ = std::fs::remove_file(&lib);
+    let status = Command::new("ar")
+        .arg("rcs")
+        .arg(&lib)
+        .arg(&obj)
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn ar: {e}"));
+    assert!(status.success(), "ar failed ({status})");
+
+    let bin = tmp.join("fractal_rs");
+    let status = Command::new("rustc")
+        .args(["--edition", "2021", "-g", "-C", "opt-level=2"])
+        .arg("-L")
+        .arg(format!("native={}", tmp.display()))
+        .arg("-l")
+        .arg("static=clgctl_rs")
+        .arg("-o")
+        .arg(&bin)
+        .arg(manifest.join("testdata/fractal.rs"))
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn rustc: {e}"));
+    assert!(status.success(), "rustc failed ({status})");
+    bin
+}
+
+/// Compile `testdata/fractal_alloc.rs` -- a real heap-allocating twin of
+/// `fractal.rs` (`Vec<FractalNode>` tree, `Vec<f64>` scratch buffers,
+/// `HashMap` memoization) adapted from the actual production benchmark that
+/// exhibited the "free calls analyze_fractal_tree" misattribution bug.
+/// `-C opt-level=3` matches the real benchmark's build profile; plain
+/// `fractal.rs`'s `opt-level=2` was not enough to reproduce it.
+#[cfg(target_arch = "aarch64")]
+fn compile_fractal_alloc_fixture(work: &Path) -> PathBuf {
+    let repo = repo_root();
+    let manifest = Path::new(env!("CARGO_MANIFEST_DIR"));
+    let tmp = work;
+    std::fs::create_dir_all(tmp).expect("create work dir");
+
+    let obj = tmp.join("clgctl_rs.o");
+    let status = Command::new("cc")
+        .args(["-g", "-O0", "-fPIC", "-c"])
+        .arg("-I")
+        .arg(repo.join("callgrind"))
+        .arg("-I")
+        .arg(repo.join("include"))
+        .arg("-o")
+        .arg(&obj)
+        .arg(manifest.join("testdata/clgctl.c"))
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn cc for clgctl: {e}"));
+    assert!(status.success(), "cc failed for clgctl.c ({status})");
+
+    let lib = tmp.join("libclgctl_rs.a");
+    let _ = std::fs::remove_file(&lib);
+    let status = Command::new("ar")
+        .arg("rcs")
+        .arg(&lib)
+        .arg(&obj)
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn ar: {e}"));
+    assert!(status.success(), "ar failed ({status})");
+
+    let bin = tmp.join("fractal_alloc");
+    let status = Command::new("rustc")
+        .args(["--edition", "2021", "-g", "-C", "opt-level=3"])
+        .arg("-L")
+        .arg(format!("native={}", tmp.display()))
+        .arg("-l")
+        .arg("static=clgctl_rs")
+        .arg("-o")
+        .arg(&bin)
+        .arg(manifest.join("testdata/fractal_alloc.rs"))
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn rustc: {e}"));
+    assert!(status.success(), "rustc failed ({status})");
+    bin
+}
+
+fn runner_callgrind_args(instr_atstart: bool, out_file: &Path) -> Vec<String> {
+    [
+        "-q",
+        "--trace-children=yes",
+        "--cache-sim=yes",
+        "--I1=32768,8,64",
+        "--D1=32768,8,64",
+        "--LL=8388608,16,64",
+        if instr_atstart {
+            "--instr-atstart=yes"
+        } else {
+            "--instr-atstart=no"
+        },
+        "--collect-systime=nsec",
+        "--read-inline-info=yes",
+        "--tool=callgrind",
+        "--compress-strings=no",
+        "--combine-dumps=yes",
+        "--dump-line=no",
+    ]
+    .into_iter()
+    .map(str::to_string)
+    .chain([format!("--callgrind-out-file={}", out_file.display())])
+    .collect()
+}
+
+fn run_callgrind_with_runner_args(bin: &Path, out_file: &Path, instr_atstart: bool) -> String {
+    let status = Command::new("setarch")
+        .arg(ARCH)
+        .arg("--addr-no-randomize")
+        .arg(vg_in_place())
+        .args(runner_callgrind_args(instr_atstart, out_file))
+        .arg(bin)
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn setarch/vg-in-place: {e}"));
+    assert!(status.success(), "vg-in-place exited with {status}");
+    std::fs::read_to_string(out_file).unwrap_or_else(|e| panic!("read {}: {e}", out_file.display()))
+}
+
+/// Profile `bin` with the same Callgrind flags as the runner and return the
+/// `.out` contents. `--instr-atstart=no` pairs with the fixture's client
+/// requests so only the measured region is profiled.
+fn run_callgrind(bin: &Path) -> String {
+    let out_file = bin.with_extension("callgrind.out");
+    run_callgrind_with_runner_args(bin, &out_file, false)
+}
+
+/// Profile `bin` with the runner-equivalent Callgrind flags and return the raw,
+/// unredacted graph input. This intentionally keeps `--instr-atstart=no`; the
+/// production runner does not capture a separate full-program trace.
+fn run_callgrind_full(bin: &Path) -> String {
+    let out_file = bin.with_extension("full.callgrind.out");
+    run_callgrind_with_runner_args(bin, &out_file, false)
+}
+
+#[test]
+fn rust_fixture_canonical_json() {
+    if !have_rustc() {
+        eprintln!("skipping rust_fixture_canonical_json: rustc not on PATH");
+        return;
+    }
+
+    let work = Path::new(env!("CARGO_TARGET_TMPDIR")).join("scoped");
+    let bin = compile_rust_fixture(&work);
+    let raw = run_callgrind(&bin);
+    let graph = CallGraph::parse(Cursor::new(raw.as_str()))
+        .unwrap_or_else(|e| panic!("parse rust callgrind output: {e:?}"))
+        .redact();
+    insta::assert_snapshot!(
+        "fractal_rs_folded",
+        graph.to_folded_without_costs().join("\n")
+    );
+    graph.to_flamegraph_file("fractal_rs.partial.svg").unwrap();
+}
+
+/// Regression test for the "free calls X" misattribution: `free()`'s
+/// return was getting promoted to a fresh CALL into whatever function ran
+/// next, because on arm64 a call into skipped code (the libc PLT hop) left
+/// a stale skip frame + `nonskipped` state behind, corrupting the return
+/// addresses recorded for the following frames. Fixed in
+/// callgrind/callstack.c by recording the guest X30 as the frame's return
+/// target for real calls and keeping `ret_addr = 0` for emulated/spliced
+/// pushes. Asserts directly on the graph (not just a snapshot) so a
+/// regression fails loudly instead of silently getting re-approved.
+#[cfg(target_arch = "aarch64")]
+#[test]
+fn arm64_fractal_alloc_no_free_misattribution() {
+    if !have_rustc() {
+        eprintln!("skipping arm64_fractal_alloc_no_free_misattribution: rustc not on PATH");
+        return;
+    }
+
+    let work = Path::new(env!("CARGO_TARGET_TMPDIR")).join("fractal_alloc");
+    let bin = compile_fractal_alloc_fixture(&work);
+    let raw = run_callgrind(&bin);
+    let graph = CallGraph::parse(Cursor::new(raw.as_str()))
+        .unwrap_or_else(|e| panic!("parse fractal_alloc callgrind output: {e:?}"))
+        .redact();
+
+    // No snapshot assertion here: `fibonacci_memo`'s exact call/cost counts
+    // show run-to-run jitter (unrelated to the free-misattribution
+    // bug this test guards), which would make an exact-match snapshot
+    // flaky. The structural assertions below are the actual regression
+    // guard: after redaction every libc frame is `???`, and libc never
+    // calls back into the fixture, so a fixture function nested under a
+    // `???` frame is exactly the "free calls X" misattribution; a `'2`
+    // clone of the non-recursive entry point is the phantom-recursion twin.
+    let folded = graph.to_folded_without_costs().join("\n");
+    assert!(
+        !folded.contains("complex_fractal_benchmark'"),
+        "phantom recursion clone of complex_fractal_benchmark:\n{folded}"
+    );
+    for line in folded.lines() {
+        let frames: Vec<&str> = line
+            .rsplit_once(' ')
+            .map_or(line, |(path, _)| path)
+            .split(';')
+            .collect();
+        if let Some(first_unknown) = frames.iter().position(|f| *f == "???") {
+            assert!(
+                frames[first_unknown..].iter().all(|f| *f == "???"),
+                "fixture frame misattributed under a libc (`???`) frame:\n{line}"
+            );
+        }
+    }
+}
+
+#[test]
+fn rust_fixture_full_trace() {
+    if !have_rustc() {
+        eprintln!("skipping rust_fixture_full_trace: rustc not on PATH");
+        return;
+    }
+
+    let work = Path::new(env!("CARGO_TARGET_TMPDIR")).join("full");
+    let bin = compile_rust_fixture(&work);
+    let raw = run_callgrind_full(&bin);
+    let graph = CallGraph::parse(Cursor::new(raw.as_str()))
+        .unwrap_or_else(|e| panic!("parse rust full callgrind output: {e:?}"));
+
+    // complex_fractal_benchmark is called exactly once and is not recursive:
+    // a `'2` clone means the shadow call stack lost a return and re-promoted
+    // it to a phantom call back into the live caller (the arm64 X30/ret_addr
+    // regression). Assert directly so this fails loudly on every platform,
+    // independent of the platform-specific symbol noise in the snapshot.
+    let folded = graph.to_folded_without_costs().join("\n");
+    assert!(
+        !folded.contains("complex_fractal_benchmark'"),
+        "phantom recursion clone of complex_fractal_benchmark in folded output:\n{folded}"
+    );
+
+    insta::assert_snapshot!(
+        "fractal_rs_full_folded",
+        graph.to_folded_without_costs().join("\n")
+    );
+    graph.to_flamegraph_file("fractal_rs.full.svg").unwrap();
+}
diff --git a/callgrind-utils/tests/snapshot.rs b/callgrind-utils/tests/snapshot.rs
new file mode 100644
index 000000000..8b99589b7
--- /dev/null
+++ b/callgrind-utils/tests/snapshot.rs
@@ -0,0 +1,197 @@
+//! Golden snapshot tests over the `testdata/*.c` fixtures.
+//!
+//! Each case compiles its fixture and profiles it with the in-repo Callgrind
+//! (`vg-in-place`, expected at the repo root), then snapshots the folded
+//! stacks. The fixtures run with `--instr-atstart=no` (plus client requests)
+//! and `--obj-skip`, so the graph is just their own functions and the folded
+//! output is stable across platforms.
+//!
+//! These tests require a built `./vg-in-place` at the repo root.
+use std::env::consts::ARCH;
+use std::io::Cursor;
+use std::path::{Path, PathBuf};
+use std::process::Command;
+
+use callgrind_utils::model::CallGraph;
+use rstest::rstest;
+
+/// Repo root: this crate lives at `<repo>/callgrind-utils`.
+fn repo_root() -> PathBuf {
+    Path::new(env!("CARGO_MANIFEST_DIR"))
+        .parent()
+        .expect("crate has a parent directory")
+        .to_path_buf()
+}
+
+fn vg_in_place() -> PathBuf {
+    let path = repo_root().join("vg-in-place");
+    assert!(
+        path.is_file(),
+        "vg-in-place not found at {} - build Valgrind in place first",
+        path.display()
+    );
+    path
+}
+
+/// Compile `testdata/<stem>.c` into this test binary's temp dir. `-O0` keeps the
+/// default fixtures un-inlined and `-g` gives them debug names; `callgrind.h`
+/// pulls in `valgrind.h` via `-I include`.
+fn compile_fixture_with_flags(stem: &str, cflags: &[&str]) -> PathBuf {
+    let repo = repo_root();
+    let src = Path::new(env!("CARGO_MANIFEST_DIR"))
+        .join("testdata")
+        .join(format!("{stem}.c"));
+    let bin = Path::new(env!("CARGO_TARGET_TMPDIR")).join(stem);
+
+    let status = Command::new("cc")
+        .arg("-g")
+        .arg("-I")
+        .arg(repo.join("callgrind"))
+        .arg("-I")
+        .arg(repo.join("include"))
+        .arg("-o")
+        .arg(&bin)
+        .arg(&src)
+        // Flags (including `-l` libs) go after the source so link-order
+        // sensitive libraries resolve symbols the source object needs.
+        .args(cflags)
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn cc for {stem}: {e}"));
+    assert!(
+        status.success(),
+        "cc failed for {} ({status})",
+        src.display()
+    );
+    bin
+}
+
+fn compile_fixture(stem: &str) -> PathBuf {
+    compile_fixture_with_flags(stem, &["-O0"])
+}
+
+fn runner_callgrind_args(out_file: &Path) -> Vec<String> {
+    [
+        "-q",
+        "--trace-children=yes",
+        "--cache-sim=yes",
+        "--I1=32768,8,64",
+        "--D1=32768,8,64",
+        "--LL=8388608,16,64",
+        "--instr-atstart=no",
+        "--collect-systime=nsec",
+        "--read-inline-info=yes",
+        "--tool=callgrind",
+        "--compress-strings=no",
+        "--combine-dumps=yes",
+        "--dump-line=no",
+    ]
+    .into_iter()
+    .map(str::to_string)
+    .chain([format!("--callgrind-out-file={}", out_file.display())])
+    .collect()
+}
+
+fn run_callgrind_with_runner_args(bin: &Path, out_file: &Path) -> String {
+    let log_file = out_file.with_extension("valgrind.log");
+    let status = Command::new("setarch")
+        .arg(ARCH)
+        .arg("--addr-no-randomize")
+        .arg(vg_in_place())
+        .args(runner_callgrind_args(out_file))
+        .arg(format!("--log-file={}", log_file.display()))
+        .arg(bin)
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn setarch/vg-in-place: {e}"));
+    assert!(status.success(), "vg-in-place exited with {status}");
+    std::fs::read_to_string(out_file).unwrap_or_else(|e| panic!("read {}: {e}", out_file.display()))
+}
+
+/// Profile `bin` with the same Callgrind flags as the runner and return the
+/// `.out` contents.
+fn run_callgrind(bin: &Path) -> String {
+    let out_file = bin.with_extension("callgrind.out");
+    run_callgrind_with_runner_args(bin, &out_file)
+}
+
+#[rstest]
+#[case("recursion")]
+#[case("chain")]
+#[case("diamond")]
+#[case("mutual")]
+#[case("fractal")]
+fn fixture_canonical_json(#[case] stem: &str) {
+    let bin = compile_fixture(stem);
+    let raw = run_callgrind(&bin);
+    let graph = CallGraph::parse(Cursor::new(raw.as_str()))
+        .unwrap_or_else(|e| panic!("parse {stem} callgrind output: {e:?}"))
+        .redact();
+    graph
+        .to_flamegraph_file(format!("{stem}.partial.svg"))
+        .unwrap();
+    insta::assert_snapshot!(
+        format!("{stem}_folded"),
+        graph.to_folded_without_costs().join("\n")
+    );
+}
+
+/// AArch64-specific unwinding reproducers, built at `-O2` (see each fixture's
+/// header comment for the shadow-stack scenario it targets). Golden snapshots
+/// are only ever generated on aarch64, so this stays out of the cross-arch
+/// `fixture_canonical_json` cases above.
+#[cfg(target_arch = "aarch64")]
+#[rstest]
+// #[case("arm64_recursive_return")]
+// #[case("arm64_tail_call")]
+// #[case("arm64_free_during_recursion")]
+// #[case("arm64_multi_alloc_cycle")]
+// #[case("arm64_libm_recursion")]
+// #[case("arm64_ping_pong_recursion")]
+// #[case("arm64_longjmp_unwind")]
+// #[case("arm64_deep_tailcall_chain")]
+// #[case("arm64_wrapped_alloc_chain")]
+#[case("arm64_plt_phantom_recursion")]
+#[case("arm64_free_tailcall_phantom")]
+fn arm64_fixture_canonical_json(#[case] stem: &str) {
+    // `-lm` is harmless for fixtures that don't need libm and required by
+    // arm64_libm_recursion, which does.
+    let bin = compile_fixture_with_flags(stem, &["-O2", "-lm"]);
+    let raw = run_callgrind(&bin);
+    let graph = CallGraph::parse(Cursor::new(raw.as_str()))
+        .unwrap_or_else(|e| panic!("parse {stem} callgrind output: {e:?}"))
+        .redact();
+
+    insta::assert_snapshot!(
+        format!("{stem}_folded"),
+        graph.to_folded_without_costs().join("\n")
+    );
+}
+
+/// Profile `bin` with the same Callgrind flags as the runner and return the
+/// raw, unredacted graph input. The production runner uses
+/// `--instr-atstart=no`, so this intentionally does not capture a separate
+/// full-program trace.
+fn run_callgrind_full(bin: &Path) -> String {
+    let out_file = bin.with_extension("full.callgrind.out");
+    run_callgrind_with_runner_args(bin, &out_file)
+}
+
+#[rstest]
+#[case("recursion")]
+#[case("chain")]
+#[case("diamond")]
+#[case("mutual")]
+#[case("fractal")]
+fn fixture_full_trace(#[case] stem: &str) {
+    let bin = compile_fixture(stem);
+    let raw = run_callgrind_full(&bin);
+    let graph = CallGraph::parse(Cursor::new(raw.as_str()))
+        .unwrap_or_else(|e| panic!("parse {stem} full callgrind output: {e:?}"));
+    graph
+        .to_flamegraph_file(format!("{stem}.full.svg"))
+        .unwrap();
+
+    insta::assert_snapshot!(
+        format!("{stem}_full_folded"),
+        graph.to_folded_without_costs().join("\n")
+    );
+}
diff --git a/callgrind-utils/tests/snapshots/python_callgraph__recursion_py__topology_json.snap b/callgrind-utils/tests/snapshots/python_callgraph__recursion_py__topology_json.snap
new file mode 100644
index 000000000..3deca4311
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/python_callgraph__recursion_py__topology_json.snap
@@ -0,0 +1,101 @@
+---
+source: tests/python_callgraph.rs
+expression: json
+---
+{
+  "nodes": [
+    {
+      "function": "<unsymbolicated>",
+      "file": "???",
+      "object": "_ctypes.cpython.so"
+    },
+    {
+      "function": "???",
+      "file": "???",
+      "object": "ld-linux"
+    },
+    {
+      "function": "???",
+      "file": "???",
+      "object": "libc.so.6"
+    },
+    {
+      "function": "clg_start",
+      "file": "clgctl.c",
+      "object": "libclgctl.so"
+    },
+    {
+      "function": "<unsymbolicated>",
+      "file": "???",
+      "object": "libffi.so"
+    },
+    {
+      "function": "ffi_call",
+      "file": "???",
+      "object": "libffi.so"
+    },
+    {
+      "function": "ffi_prep_cif",
+      "file": "???",
+      "object": "libffi.so"
+    }
+  ],
+  "edges": [
+    {
+      "caller": 0,
+      "callee": 0
+    },
+    {
+      "caller": 0,
+      "callee": 2
+    },
+    {
+      "caller": 0,
+      "callee": 5
+    },
+    {
+      "caller": 0,
+      "callee": 6
+    },
+    {
+      "caller": 1,
+      "callee": 1
+    },
+    {
+      "caller": 1,
+      "callee": 2
+    },
+    {
+      "caller": 2,
+      "callee": 0
+    },
+    {
+      "caller": 2,
+      "callee": 1
+    },
+    {
+      "caller": 2,
+      "callee": 2
+    },
+    {
+      "caller": 3,
+      "callee": 4
+    },
+    {
+      "caller": 4,
+      "callee": 0
+    },
+    {
+      "caller": 4,
+      "callee": 2
+    },
+    {
+      "caller": 4,
+      "callee": 4
+    },
+    {
+      "caller": 5,
+      "callee": 4
+    }
+  ]
+}
diff --git a/callgrind-utils/tests/snapshots/rust_callgraph__fractal_rs_folded.snap b/callgrind-utils/tests/snapshots/rust_callgraph__fractal_rs_folded.snap
new file mode 100644
index 000000000..c9e8237a1
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/rust_callgraph__fractal_rs_folded.snap
@@ -0,0 +1,57 @@
+---
+source: tests/rust_callgraph.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+run_measured <cost>
+run_measured;complex_fractal_benchmark <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;analyze_fractal_tree'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;collect_leaves <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;collect_leaves;collect_leaves'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;collect_leaves;collect_leaves'2;collect_leaves'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;compute_complexity_score <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;compute_complexity_score;recursive_path_score <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;compute_variance <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;count_nodes <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;count_nodes;count_nodes'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;count_nodes;count_nodes'2;count_nodes'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;max_path_sum <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;max_path_sum;max_path_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;max_path_sum;max_path_sum'2;max_path_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;recursive_sum <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;??? <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;collect_leaves <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;collect_leaves;collect_leaves'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;collect_leaves;collect_leaves'2;collect_leaves'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;compute_variance <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;count_nodes <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;count_nodes;count_nodes'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;count_nodes;count_nodes'2;count_nodes'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;max_path_sum <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;max_path_sum;max_path_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;max_path_sum;max_path_sum'2;max_path_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;recursive_sum <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;??? <cost>
+run_measured;complex_fractal_benchmark;build_fractal <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;build_fractal'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;compute_child_value <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;compute_tree_hash <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;compute_tree_hash;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;compute_tree_hash;compute_tree_hash'2;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;pool_alloc <cost>
+run_measured;complex_fractal_benchmark;build_fractal;compute_tree_hash <cost>
+run_measured;complex_fractal_benchmark;build_fractal;compute_tree_hash;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;compute_tree_hash;compute_tree_hash'2;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;compute_tree_hash <cost>
+run_measured;complex_fractal_benchmark;compute_tree_hash;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;compute_tree_hash;compute_tree_hash'2;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;fibonacci_memo <cost>
+run_measured;complex_fractal_benchmark;fibonacci_memo;fibonacci_memo'2 <cost>
+run_measured;complex_fractal_benchmark;fibonacci_memo;fibonacci_memo'2;fibonacci_memo'2 <cost>
+run_measured;complex_fractal_benchmark;??? <cost>
diff --git a/callgrind-utils/tests/snapshots/rust_callgraph__fractal_rs_full_folded.snap b/callgrind-utils/tests/snapshots/rust_callgraph__fractal_rs_full_folded.snap
new file mode 100644
index 000000000..e4baf8065
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/rust_callgraph__fractal_rs_full_folded.snap
@@ -0,0 +1,57 @@
+---
+source: tests/rust_callgraph.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+run_measured <cost>
+run_measured;complex_fractal_benchmark <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;analyze_fractal_tree'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;collect_leaves <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;collect_leaves;collect_leaves'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;collect_leaves;collect_leaves'2;collect_leaves'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;compute_complexity_score <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;compute_complexity_score;recursive_path_score <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;compute_variance <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;count_nodes <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;count_nodes;count_nodes'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;count_nodes;count_nodes'2;count_nodes'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;max_path_sum <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;max_path_sum;max_path_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;max_path_sum;max_path_sum'2;max_path_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;recursive_sum <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;__memset_avx2_unaligned_erms <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;collect_leaves <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;collect_leaves;collect_leaves'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;collect_leaves;collect_leaves'2;collect_leaves'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;compute_variance <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;count_nodes <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;count_nodes;count_nodes'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;count_nodes;count_nodes'2;count_nodes'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;max_path_sum <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;max_path_sum;max_path_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;max_path_sum;max_path_sum'2;max_path_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;recursive_sum <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;__memset_avx2_unaligned_erms <cost>
+run_measured;complex_fractal_benchmark;build_fractal <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;build_fractal'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;compute_child_value <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;compute_tree_hash <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;compute_tree_hash;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;compute_tree_hash;compute_tree_hash'2;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;pool_alloc <cost>
+run_measured;complex_fractal_benchmark;build_fractal;compute_tree_hash <cost>
+run_measured;complex_fractal_benchmark;build_fractal;compute_tree_hash;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;compute_tree_hash;compute_tree_hash'2;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;compute_tree_hash <cost>
+run_measured;complex_fractal_benchmark;compute_tree_hash;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;compute_tree_hash;compute_tree_hash'2;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;fibonacci_memo <cost>
+run_measured;complex_fractal_benchmark;fibonacci_memo;fibonacci_memo'2 <cost>
+run_measured;complex_fractal_benchmark;fibonacci_memo;fibonacci_memo'2;fibonacci_memo'2 <cost>
+run_measured;complex_fractal_benchmark;__memset_avx2_unaligned_erms <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__arm64_deep_tailcall_chain_folded.snap b/callgrind-utils/tests/snapshots/snapshot__arm64_deep_tailcall_chain_folded.snap
new file mode 100644
index 000000000..a30a60cc3
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__arm64_deep_tailcall_chain_folded.snap
@@ -0,0 +1,47 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+run_measured <cost>
+run_measured;complex_benchmark <cost>
+run_measured;complex_benchmark;walk <cost>
+run_measured;complex_benchmark;walk;pool_alloc <cost>
+run_measured;complex_benchmark;walk;stage_a <cost>
+run_measured;complex_benchmark;walk;stage_a;stage_b <cost>
+run_measured;complex_benchmark;walk;stage_a;stage_b;stage_c <cost>
+run_measured;complex_benchmark;walk;stage_a;stage_b;stage_c;stage_d <cost>
+run_measured;complex_benchmark;walk;stage_a;stage_b;stage_c;stage_d;stage_e <cost>
+run_measured;complex_benchmark;walk;stage_a;stage_b;stage_c;stage_d;stage_e;stage_f <cost>
+run_measured;complex_benchmark;walk;child_value <cost>
+run_measured;complex_benchmark;walk;walk'2 <cost>
+run_measured;complex_benchmark;walk;walk'2;pool_alloc <cost>
+run_measured;complex_benchmark;walk;walk'2;stage_a <cost>
+run_measured;complex_benchmark;walk;walk'2;stage_a;stage_b <cost>
+run_measured;complex_benchmark;walk;walk'2;stage_a;stage_b;stage_c <cost>
+run_measured;complex_benchmark;walk;walk'2;stage_a;stage_b;stage_c;stage_d <cost>
+run_measured;complex_benchmark;walk;walk'2;stage_a;stage_b;stage_c;stage_d;stage_e <cost>
+run_measured;complex_benchmark;walk;walk'2;stage_a;stage_b;stage_c;stage_d;stage_e;stage_f <cost>
+run_measured;complex_benchmark;walk;walk'2;child_value <cost>
+run_measured;complex_benchmark;walk;walk'2;walk'2 <cost>
+run_measured;complex_benchmark;walk;walk'2;child_value <cost>
+run_measured;complex_benchmark;walk;walk'2;walk'2 <cost>
+run_measured;complex_benchmark;walk;child_value <cost>
+run_measured;complex_benchmark;walk;walk'2 <cost>
+run_measured;complex_benchmark;walk;walk'2;pool_alloc <cost>
+run_measured;complex_benchmark;walk;walk'2;stage_a <cost>
+run_measured;complex_benchmark;walk;walk'2;stage_a;stage_b <cost>
+run_measured;complex_benchmark;walk;walk'2;stage_a;stage_b;stage_c <cost>
+run_measured;complex_benchmark;walk;walk'2;stage_a;stage_b;stage_c;stage_d <cost>
+run_measured;complex_benchmark;walk;walk'2;stage_a;stage_b;stage_c;stage_d;stage_e <cost>
+run_measured;complex_benchmark;walk;walk'2;stage_a;stage_b;stage_c;stage_d;stage_e;stage_f <cost>
+run_measured;complex_benchmark;walk;walk'2;child_value <cost>
+run_measured;complex_benchmark;walk;walk'2;walk'2 <cost>
+run_measured;complex_benchmark;walk;walk'2;child_value <cost>
+run_measured;complex_benchmark;walk;walk'2;walk'2 <cost>
+run_measured;complex_benchmark;recursive_sum <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__arm64_free_during_recursion_folded.snap b/callgrind-utils/tests/snapshots/snapshot__arm64_free_during_recursion_folded.snap
new file mode 100644
index 000000000..bd320e53d
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__arm64_free_during_recursion_folded.snap
@@ -0,0 +1,70 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+run_measured <cost>
+run_measured;complex_benchmark <cost>
+run_measured;complex_benchmark;build_tree <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;analyze_tree <cost>
+run_measured;complex_benchmark;analyze_tree;recursive_sum <cost>
+run_measured;complex_benchmark;analyze_tree;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;count_nodes <cost>
+run_measured;complex_benchmark;analyze_tree;count_nodes;count_nodes'2 <cost>
+run_measured;complex_benchmark;analyze_tree;count_nodes;count_nodes'2;count_nodes'2 <cost>
+run_measured;complex_benchmark;analyze_tree;count_nodes;count_nodes'2;count_nodes'2 <cost>
+run_measured;complex_benchmark;analyze_tree;count_nodes;count_nodes'2 <cost>
+run_measured;complex_benchmark;analyze_tree;count_nodes;count_nodes'2;count_nodes'2 <cost>
+run_measured;complex_benchmark;analyze_tree;count_nodes;count_nodes'2;count_nodes'2 <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance;??? <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance;collect_leaf <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance;collect_leaf;collect_leaf'2 <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance;collect_leaf;collect_leaf'2;collect_leaf'2 <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance;??? <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance;dealloc_wrapper1 <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance;dealloc_wrapper1;dealloc_wrapper2 <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance;dealloc_wrapper1;dealloc_wrapper2;??? <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance;dealloc_wrapper1;dealloc_wrapper2;???;??? <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;recursive_sum <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;count_nodes <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;count_nodes;count_nodes'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;count_nodes;count_nodes'2;count_nodes'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;count_nodes;count_nodes'2;count_nodes'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;count_nodes;count_nodes'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;count_nodes;count_nodes'2;count_nodes'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;count_nodes;count_nodes'2;count_nodes'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance;??? <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance;collect_leaf <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance;collect_leaf;collect_leaf'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance;collect_leaf;collect_leaf'2;collect_leaf'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance;??? <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance;dealloc_wrapper1 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance;dealloc_wrapper1;dealloc_wrapper2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance;dealloc_wrapper1;dealloc_wrapper2;??? <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance;dealloc_wrapper1;dealloc_wrapper2;???;??? <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;analyze_tree'2 <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__arm64_free_tailcall_phantom_folded.snap b/callgrind-utils/tests/snapshots/snapshot__arm64_free_tailcall_phantom_folded.snap
new file mode 100644
index 000000000..476061de1
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__arm64_free_tailcall_phantom_folded.snap
@@ -0,0 +1,19 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+run_measured <cost>
+run_measured;caller <cost>
+run_measured;caller;??? <cost>
+run_measured;caller;???;??? <cost>
+run_measured;caller;???;??? <cost>
+run_measured;caller;???;??? <cost>
+run_measured;caller;???;??? <cost>
+run_measured;caller;???;??? <cost>
+run_measured;caller;dealloc1 <cost>
+run_measured;caller;dealloc1;dealloc2 <cost>
+run_measured;caller;dealloc1;dealloc2;??? <cost>
+run_measured;caller;dealloc1;dealloc2;???;??? <cost>
+run_measured;caller;dealloc1;dealloc2;???;??? <cost>
+run_measured;caller;dealloc1;dealloc2;???;???;??? <cost>
+run_measured;caller;post_free_work.constprop.0 <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__arm64_libm_recursion_folded.snap b/callgrind-utils/tests/snapshots/snapshot__arm64_libm_recursion_folded.snap
new file mode 100644
index 000000000..3c6981d72
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__arm64_libm_recursion_folded.snap
@@ -0,0 +1,105 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+run_measured <cost>
+run_measured;complex_benchmark <cost>
+run_measured;complex_benchmark;build_tree <cost>
+run_measured;complex_benchmark;build_tree;hash_tree <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;perturb <cost>
+run_measured;complex_benchmark;build_tree;perturb;sin <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;libc_feholdsetround_aarch64_ctx <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;reduce_sincos <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;do_sin <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;do_sincos <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;do_sincos <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;libc_feholdsetround_aarch64_ctx <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;reduce_sincos <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;do_sin <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;do_sincos <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;do_sincos <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;perturb <cost>
+run_measured;complex_benchmark;build_tree;perturb;sin <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;libc_feholdsetround_aarch64_ctx <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;reduce_sincos <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;do_sin <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;do_sincos <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;do_sincos <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;libc_feholdsetround_aarch64_ctx <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;reduce_sincos <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;do_sin <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;do_sincos <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;perturb;sin;do_sincos <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;recursive_sum <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;hash_tree <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2;hash_tree'2 <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__arm64_longjmp_unwind_folded.snap b/callgrind-utils/tests/snapshots/snapshot__arm64_longjmp_unwind_folded.snap
new file mode 100644
index 000000000..d0ef3b585
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__arm64_longjmp_unwind_folded.snap
@@ -0,0 +1,37 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+run_measured <cost>
+run_measured;complex_benchmark <cost>
+run_measured;complex_benchmark;??? <cost>
+run_measured;complex_benchmark;build_tree <cost>
+run_measured;complex_benchmark;build_tree;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;recursive_sum <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;build_tree <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__arm64_multi_alloc_cycle_folded.snap b/callgrind-utils/tests/snapshots/snapshot__arm64_multi_alloc_cycle_folded.snap
new file mode 100644
index 000000000..e651c87fa
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__arm64_multi_alloc_cycle_folded.snap
@@ -0,0 +1,66 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+run_measured <cost>
+run_measured;complex_benchmark <cost>
+run_measured;complex_benchmark;build_tree <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;analyze_tree <cost>
+run_measured;complex_benchmark;analyze_tree;recursive_sum <cost>
+run_measured;complex_benchmark;analyze_tree;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance;??? <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance;collect_leaf <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance;collect_leaf;collect_leaf'2 <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance;collect_leaf;collect_leaf'2;collect_leaf'2 <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance;??? <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance;??? <cost>
+run_measured;complex_benchmark;analyze_tree;compute_variance;???;??? <cost>
+run_measured;complex_benchmark;analyze_tree;compute_spread <cost>
+run_measured;complex_benchmark;analyze_tree;compute_spread;??? <cost>
+run_measured;complex_benchmark;analyze_tree;compute_spread;collect_leaf <cost>
+run_measured;complex_benchmark;analyze_tree;compute_spread;collect_leaf;collect_leaf'2 <cost>
+run_measured;complex_benchmark;analyze_tree;compute_spread;collect_leaf;collect_leaf'2;collect_leaf'2 <cost>
+run_measured;complex_benchmark;analyze_tree;compute_spread;??? <cost>
+run_measured;complex_benchmark;analyze_tree;compute_spread;???;??? <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;recursive_sum <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance;??? <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance;collect_leaf <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance;collect_leaf;collect_leaf'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance;collect_leaf;collect_leaf'2;collect_leaf'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance;??? <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance;??? <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_variance;???;??? <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_spread <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_spread;??? <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_spread;collect_leaf <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_spread;collect_leaf;collect_leaf'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_spread;collect_leaf;collect_leaf'2;collect_leaf'2 <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_spread;??? <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;compute_spread;???;??? <cost>
+run_measured;complex_benchmark;analyze_tree;analyze_tree'2;analyze_tree'2 <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__arm64_ping_pong_recursion_folded.snap b/callgrind-utils/tests/snapshots/snapshot__arm64_ping_pong_recursion_folded.snap
new file mode 100644
index 000000000..849e70998
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__arm64_ping_pong_recursion_folded.snap
@@ -0,0 +1,38 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+run_measured <cost>
+run_measured;complex_benchmark <cost>
+run_measured;complex_benchmark;walk <cost>
+run_measured;complex_benchmark;walk;pool_alloc <cost>
+run_measured;complex_benchmark;walk;ping <cost>
+run_measured;complex_benchmark;walk;ping;pong <cost>
+run_measured;complex_benchmark;walk;ping;pong;ping'2 <cost>
+run_measured;complex_benchmark;walk;child_value <cost>
+run_measured;complex_benchmark;walk;walk'2 <cost>
+run_measured;complex_benchmark;walk;walk'2;pool_alloc <cost>
+run_measured;complex_benchmark;walk;walk'2;ping <cost>
+run_measured;complex_benchmark;walk;walk'2;ping;pong <cost>
+run_measured;complex_benchmark;walk;walk'2;ping;pong;ping'2 <cost>
+run_measured;complex_benchmark;walk;walk'2;child_value <cost>
+run_measured;complex_benchmark;walk;walk'2;walk'2 <cost>
+run_measured;complex_benchmark;walk;walk'2;child_value <cost>
+run_measured;complex_benchmark;walk;walk'2;walk'2 <cost>
+run_measured;complex_benchmark;walk;child_value <cost>
+run_measured;complex_benchmark;walk;walk'2 <cost>
+run_measured;complex_benchmark;walk;walk'2;pool_alloc <cost>
+run_measured;complex_benchmark;walk;walk'2;ping <cost>
+run_measured;complex_benchmark;walk;walk'2;ping;pong <cost>
+run_measured;complex_benchmark;walk;walk'2;ping;pong;ping'2 <cost>
+run_measured;complex_benchmark;walk;walk'2;child_value <cost>
+run_measured;complex_benchmark;walk;walk'2;walk'2 <cost>
+run_measured;complex_benchmark;walk;walk'2;child_value <cost>
+run_measured;complex_benchmark;walk;walk'2;walk'2 <cost>
+run_measured;complex_benchmark;recursive_sum <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_benchmark;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__arm64_plt_phantom_recursion_folded.snap b/callgrind-utils/tests/snapshots/snapshot__arm64_plt_phantom_recursion_folded.snap
new file mode 100644
index 000000000..b80118c28
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__arm64_plt_phantom_recursion_folded.snap
@@ -0,0 +1,11 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+run_measured <cost>
+run_measured;outer <cost>
+run_measured;outer;memset <cost>
+run_measured;outer;??? <cost>
+run_measured;outer;memset <cost>
+run_measured;outer;leaf <cost>
+run_measured;outer;sibling <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__arm64_recursive_return_folded.snap b/callgrind-utils/tests/snapshots/snapshot__arm64_recursive_return_folded.snap
new file mode 100644
index 000000000..d61f92314
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__arm64_recursive_return_folded.snap
@@ -0,0 +1,48 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+run_measured <cost>
+run_measured;complex_benchmark <cost>
+run_measured;complex_benchmark;build_tree <cost>
+run_measured;complex_benchmark;build_tree;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;hash_tree <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2;hash_tree'2 <cost>
+run_measured;complex_benchmark;hash_tree;hash_tree'2;hash_tree'2 <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__arm64_tail_call_folded.snap b/callgrind-utils/tests/snapshots/snapshot__arm64_tail_call_folded.snap
new file mode 100644
index 000000000..e18819f04
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__arm64_tail_call_folded.snap
@@ -0,0 +1,8 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+run_measured <cost>
+run_measured;stage_a <cost>
+run_measured;stage_a;stage_b <cost>
+run_measured;stage_a;stage_b;stage_c <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__arm64_wrapped_alloc_chain_folded.snap b/callgrind-utils/tests/snapshots/snapshot__arm64_wrapped_alloc_chain_folded.snap
new file mode 100644
index 000000000..94e365d87
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__arm64_wrapped_alloc_chain_folded.snap
@@ -0,0 +1,39 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+run_measured <cost>
+run_measured;complex_benchmark <cost>
+run_measured;complex_benchmark;build_tree <cost>
+run_measured;complex_benchmark;build_tree;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;pool_alloc <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;child_value <cost>
+run_measured;complex_benchmark;build_tree;build_tree'2;build_tree'2 <cost>
+run_measured;complex_benchmark;compute_stat <cost>
+run_measured;complex_benchmark;compute_stat;alloc_hop1.constprop.0 <cost>
+run_measured;complex_benchmark;compute_stat;alloc_hop1.constprop.0;alloc_hop2.constprop.0 <cost>
+run_measured;complex_benchmark;compute_stat;alloc_hop1.constprop.0;alloc_hop2.constprop.0;alloc_hop3.constprop.0 <cost>
+run_measured;complex_benchmark;compute_stat;alloc_hop1.constprop.0;alloc_hop2.constprop.0;alloc_hop3.constprop.0;??? <cost>
+run_measured;complex_benchmark;compute_stat;alloc_hop1.constprop.0;alloc_hop2.constprop.0;alloc_hop3.constprop.0;???;??? <cost>
+run_measured;complex_benchmark;compute_stat;alloc_hop1.constprop.0;alloc_hop2.constprop.0;alloc_hop3.constprop.0;???;??? <cost>
+run_measured;complex_benchmark;compute_stat;alloc_hop1.constprop.0;alloc_hop2.constprop.0;alloc_hop3.constprop.0;???;??? <cost>
+run_measured;complex_benchmark;compute_stat;collect_leaf <cost>
+run_measured;complex_benchmark;compute_stat;collect_leaf;collect_leaf'2 <cost>
+run_measured;complex_benchmark;compute_stat;collect_leaf;collect_leaf'2;collect_leaf'2 <cost>
+run_measured;complex_benchmark;compute_stat;dealloc_hop1 <cost>
+run_measured;complex_benchmark;compute_stat;dealloc_hop1;dealloc_hop2 <cost>
+run_measured;complex_benchmark;compute_stat;dealloc_hop1;dealloc_hop2;dealloc_hop3 <cost>
+run_measured;complex_benchmark;compute_stat;dealloc_hop1;dealloc_hop2;dealloc_hop3;??? <cost>
+run_measured;complex_benchmark;compute_stat;dealloc_hop1;dealloc_hop2;dealloc_hop3;???;??? <cost>
+run_measured;complex_benchmark;compute_stat;dealloc_hop1;dealloc_hop2;dealloc_hop3;???;???;??? <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__chain_folded.snap b/callgrind-utils/tests/snapshots/snapshot__chain_folded.snap
new file mode 100644
index 000000000..d5f629c72
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__chain_folded.snap
@@ -0,0 +1,8 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+main <cost>
+main;a <cost>
+main;a;b <cost>
+main;a;b;c <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__chain_full_folded.snap b/callgrind-utils/tests/snapshots/snapshot__chain_full_folded.snap
new file mode 100644
index 000000000..d5f629c72
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__chain_full_folded.snap
@@ -0,0 +1,8 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+main <cost>
+main;a <cost>
+main;a;b <cost>
+main;a;b;c <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__diamond_folded.snap b/callgrind-utils/tests/snapshots/snapshot__diamond_folded.snap
new file mode 100644
index 000000000..bf4cf401b
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__diamond_folded.snap
@@ -0,0 +1,10 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+main <cost>
+main;top <cost>
+main;top;left <cost>
+main;top;left;bottom <cost>
+main;top;right <cost>
+main;top;right;bottom <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__diamond_full_folded.snap b/callgrind-utils/tests/snapshots/snapshot__diamond_full_folded.snap
new file mode 100644
index 000000000..bf4cf401b
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__diamond_full_folded.snap
@@ -0,0 +1,10 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+main <cost>
+main;top <cost>
+main;top;left <cost>
+main;top;left;bottom <cost>
+main;top;right <cost>
+main;top;right;bottom <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__fractal_folded.snap b/callgrind-utils/tests/snapshots/snapshot__fractal_folded.snap
new file mode 100644
index 000000000..54eded900
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__fractal_folded.snap
@@ -0,0 +1,58 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+run_measured <cost>
+run_measured;complex_fractal_benchmark <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;analyze_fractal_tree'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;collect_leaves <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;collect_leaves;collect_leaves'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;collect_leaves;collect_leaves'2;collect_leaves'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;compute_complexity_score <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;compute_complexity_score;recursive_path_score <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;compute_complexity_score;recursive_path_score;recursive_path_score'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;compute_complexity_score;recursive_path_score;recursive_path_score'2;recursive_path_score'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;compute_variance <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;count_nodes <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;count_nodes;count_nodes'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;count_nodes;count_nodes'2;count_nodes'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;max_path_sum <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;max_path_sum;max_path_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;max_path_sum;max_path_sum'2;max_path_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;recursive_sum <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;collect_leaves <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;collect_leaves;collect_leaves'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;collect_leaves;collect_leaves'2;collect_leaves'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;compute_complexity_score <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;compute_complexity_score;recursive_path_score <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;compute_variance <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;count_nodes <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;count_nodes;count_nodes'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;count_nodes;count_nodes'2;count_nodes'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;max_path_sum <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;max_path_sum;max_path_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;max_path_sum;max_path_sum'2;max_path_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;recursive_sum <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;build_fractal'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;compute_child_value <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;compute_tree_hash <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;compute_tree_hash;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;compute_tree_hash;compute_tree_hash'2;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;pool_alloc <cost>
+run_measured;complex_fractal_benchmark;build_fractal;compute_tree_hash <cost>
+run_measured;complex_fractal_benchmark;build_fractal;compute_tree_hash;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;compute_tree_hash;compute_tree_hash'2;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;compute_tree_hash <cost>
+run_measured;complex_fractal_benchmark;compute_tree_hash;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;compute_tree_hash;compute_tree_hash'2;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;fibonacci_memo <cost>
+run_measured;complex_fractal_benchmark;fibonacci_memo;fibonacci_memo'2 <cost>
+run_measured;complex_fractal_benchmark;fibonacci_memo;fibonacci_memo'2;fibonacci_memo'2 <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__fractal_full_folded.snap b/callgrind-utils/tests/snapshots/snapshot__fractal_full_folded.snap
new file mode 100644
index 000000000..54eded900
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__fractal_full_folded.snap
@@ -0,0 +1,58 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+run_measured <cost>
+run_measured;complex_fractal_benchmark <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;analyze_fractal_tree'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;collect_leaves <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;collect_leaves;collect_leaves'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;collect_leaves;collect_leaves'2;collect_leaves'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;compute_complexity_score <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;compute_complexity_score;recursive_path_score <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;compute_complexity_score;recursive_path_score;recursive_path_score'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;compute_complexity_score;recursive_path_score;recursive_path_score'2;recursive_path_score'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;compute_variance <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;count_nodes <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;count_nodes;count_nodes'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;count_nodes;count_nodes'2;count_nodes'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;max_path_sum <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;max_path_sum;max_path_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;max_path_sum;max_path_sum'2;max_path_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;recursive_sum <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;analyze_fractal_tree'2;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;collect_leaves <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;collect_leaves;collect_leaves'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;collect_leaves;collect_leaves'2;collect_leaves'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;compute_complexity_score <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;compute_complexity_score;recursive_path_score <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;compute_variance <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;count_nodes <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;count_nodes;count_nodes'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;count_nodes;count_nodes'2;count_nodes'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;max_path_sum <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;max_path_sum;max_path_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;max_path_sum;max_path_sum'2;max_path_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;recursive_sum <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;recursive_sum;recursive_sum'2 <cost>
+run_measured;complex_fractal_benchmark;analyze_fractal_tree;recursive_sum;recursive_sum'2;recursive_sum'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;build_fractal'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;compute_child_value <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;compute_tree_hash <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;compute_tree_hash;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;compute_tree_hash;compute_tree_hash'2;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;build_fractal'2;pool_alloc <cost>
+run_measured;complex_fractal_benchmark;build_fractal;compute_tree_hash <cost>
+run_measured;complex_fractal_benchmark;build_fractal;compute_tree_hash;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;build_fractal;compute_tree_hash;compute_tree_hash'2;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;compute_tree_hash <cost>
+run_measured;complex_fractal_benchmark;compute_tree_hash;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;compute_tree_hash;compute_tree_hash'2;compute_tree_hash'2 <cost>
+run_measured;complex_fractal_benchmark;fibonacci_memo <cost>
+run_measured;complex_fractal_benchmark;fibonacci_memo;fibonacci_memo'2 <cost>
+run_measured;complex_fractal_benchmark;fibonacci_memo;fibonacci_memo'2;fibonacci_memo'2 <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__mutual_folded.snap b/callgrind-utils/tests/snapshots/snapshot__mutual_folded.snap
new file mode 100644
index 000000000..4be66e3ac
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__mutual_folded.snap
@@ -0,0 +1,10 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+main <cost>
+main;is_even <cost>
+main;is_even;is_odd <cost>
+main;is_even;is_odd;is_even'2 <cost>
+main;is_even;is_odd;is_even'2;is_odd'2 <cost>
+main;is_even;is_odd;is_even'2;is_odd'2;is_even'2 <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__mutual_full_folded.snap b/callgrind-utils/tests/snapshots/snapshot__mutual_full_folded.snap
new file mode 100644
index 000000000..4be66e3ac
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__mutual_full_folded.snap
@@ -0,0 +1,10 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+main <cost>
+main;is_even <cost>
+main;is_even;is_odd <cost>
+main;is_even;is_odd;is_even'2 <cost>
+main;is_even;is_odd;is_even'2;is_odd'2 <cost>
+main;is_even;is_odd;is_even'2;is_odd'2;is_even'2 <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__recursion_folded.snap b/callgrind-utils/tests/snapshots/snapshot__recursion_folded.snap
new file mode 100644
index 000000000..8b4495a7b
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__recursion_folded.snap
@@ -0,0 +1,10 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+main <cost>
+main;compute <cost>
+main;compute;fib <cost>
+main;compute;fib;fib'2 <cost>
+main;compute;fib;fib'2;fib'2 <cost>
+main;compute;square <cost>
diff --git a/callgrind-utils/tests/snapshots/snapshot__recursion_full_folded.snap b/callgrind-utils/tests/snapshots/snapshot__recursion_full_folded.snap
new file mode 100644
index 000000000..8b4495a7b
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/snapshot__recursion_full_folded.snap
@@ -0,0 +1,10 @@
+---
+source: tests/snapshot.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+main <cost>
+main;compute <cost>
+main;compute;fib <cost>
+main;compute;fib;fib'2 <cost>
+main;compute;fib;fib'2;fib'2 <cost>
+main;compute;square <cost>

From bb9053ebe3f9c4ed5de40df0bad6efa9d213aa3f Mon Sep 17 00:00:00 2001
From: not-matthias <matthias@codspeed.io>
Date: Thu, 2 Jul 2026 17:00:31 +0000
Subject: [PATCH 7/9] feat(callgrind-utils): add perf_map symbolization for -X
 perf Python frames

Add the `perf_map` module: `CallGraph::symbolize_perf_map` resolves
Callgrind's anonymous `0x...` JIT nodes to `py::<qualname>:<file>` via
CPython's `/tmp/perf-<pid>.map`, written under `python3 -X perf`.

Add the `fractal.py` fixture and the `python_fractal_callgraph` test,
which profiles it live and snapshots the folded stacks and canonical
JSON.
---
 callgrind-utils/src/lib.rs                    |   1 +
 callgrind-utils/src/perf_map.rs               | 143 ++++++++++
 callgrind-utils/testdata/fractal.py           | 256 ++++++++++++++++++
 .../tests/python_fractal_callgraph.rs         | 185 +++++++++++++
 ..._fractal_callgraph__fractal_py_folded.snap |  98 +++++++
 ...tal_callgraph__fractal_py_full_folded.snap |  85 ++++++
 6 files changed, 768 insertions(+)
 create mode 100644 callgrind-utils/src/perf_map.rs
 create mode 100644 callgrind-utils/testdata/fractal.py
 create mode 100644 callgrind-utils/tests/python_fractal_callgraph.rs
 create mode 100644 callgrind-utils/tests/snapshots/python_fractal_callgraph__fractal_py_folded.snap
 create mode 100644 callgrind-utils/tests/snapshots/python_fractal_callgraph__fractal_py_full_folded.snap

diff --git a/callgrind-utils/src/lib.rs b/callgrind-utils/src/lib.rs
index 719c21a32..c24fb2404 100644
--- a/callgrind-utils/src/lib.rs
+++ b/callgrind-utils/src/lib.rs
@@ -2,5 +2,6 @@ pub mod error;
 pub mod flamegraph;
 pub mod model;
 pub mod parser;
+pub mod perf_map;
 mod redact;
 pub mod serialize;
diff --git a/callgrind-utils/src/perf_map.rs b/callgrind-utils/src/perf_map.rs
new file mode 100644
index 000000000..3f31b417f
--- /dev/null
+++ b/callgrind-utils/src/perf_map.rs
@@ -0,0 +1,143 @@
+//! Symbolization of anonymous JIT frames via a `perf-<pid>.map` file.
+//!
+//! Callgrind emits anonymous JIT code (CPython's `-X perf` trampolines, V8, ...)
+//! as the literal absolute address `0x...`, leaving symbolization to the
+//! backend. CPython writes one trampoline per code object plus a
+//! `/tmp/perf-<pid>.map` line `<start-hex> <size-hex> py::<qualname>:<file>`, so
+//! an address that falls in a trampoline's range resolves to its Python name.
+
+use std::collections::HashMap;
+use std::io::BufRead;
+use std::path::Path;
+
+use super::model::{CallGraph, Node};
+
+/// A parsed `perf-<pid>.map`: half-open `[start, end)` address ranges, each
+/// mapped to a symbol, sorted by `start` for binary search.
+pub struct PerfMap {
+    entries: Vec<(u64, u64, String)>,
+}
+
+impl PerfMap {
+    pub fn from_file(path: impl AsRef<Path>) -> std::io::Result<Self> {
+        let file = std::fs::File::open(path)?;
+        Ok(Self::from_reader(std::io::BufReader::new(file)))
+    }
+
+    pub fn from_reader(reader: impl BufRead) -> Self {
+        let mut entries: Vec<(u64, u64, String)> = reader
+            .lines()
+            .map_while(Result::ok)
+            .filter_map(|line| parse_entry(&line))
+            .collect();
+        entries.sort_by_key(|(start, _, _)| *start);
+        Self { entries }
+    }
+
+    /// Resolve an address to its symbol, or `None` if it falls in no range.
+    pub fn resolve(&self, addr: u64) -> Option<&str> {
+        let index = self.entries.partition_point(|(start, _, _)| *start <= addr);
+        let (start, end, name) = self.entries.get(index.checked_sub(1)?)?;
+        (*start..*end).contains(&addr).then_some(name.as_str())
+    }
+}
+
+/// A perf-map line is `<start-hex> <size-hex> <symbol>`; anything else (blank
+/// lines, comments) is skipped.
+fn parse_entry(line: &str) -> Option<(u64, u64, String)> {
+    let mut parts = line.splitn(3, ' ');
+    let start = u64::from_str_radix(parts.next()?, 16).ok()?;
+    let size = u64::from_str_radix(parts.next()?, 16).ok()?;
+    let symbol = parts.next()?.trim();
+    (!symbol.is_empty()).then(|| (start, start.wrapping_add(size), symbol.to_string()))
+}
+
+impl CallGraph {
+    /// Rename anonymous JIT nodes (`0x...`) to their `perf-<pid>.map` symbol.
+    ///
+    /// The perf symbol embeds the source path (`py::fib:/abs/path/fractal.py`);
+    /// the path is split into the node's `file` (basename only, so snapshots
+    /// stay portable) and the `py::`-prefixed name stays as the function.
+    pub fn symbolize_perf_map(self, map: &PerfMap) -> CallGraph {
+        let CallGraph {
+            mut nodes,
+            mut edges,
+            self_costs,
+        } = self;
+
+        // Self costs are re-keyed onto the symbolized identities, summing where
+        // distinct addresses collapse to the same resolved name.
+        let mut self_cost_map: HashMap<Node, u64> = HashMap::new();
+        for (node, &cost) in nodes.iter().zip(self_costs.iter()) {
+            let mut symbolized = node.clone();
+            symbolize_node(&mut symbolized, map);
+            *self_cost_map.entry(symbolized).or_insert(0) += cost;
+        }
+
+        for node in &mut nodes {
+            symbolize_node(node, map);
+        }
+        for edge in &mut edges {
+            symbolize_node(&mut edge.caller, map);
+            symbolize_node(&mut edge.callee, map);
+        }
+
+        CallGraph::from_parts(nodes, edges, self_cost_map)
+    }
+}
+
+/// Rename an anonymous JIT node (`0x...`) to its `perf-<pid>.map` symbol.
+///
+/// The perf symbol embeds the source path (`py::fib:/abs/path/fractal.py`);
+/// the path is split into the node's `file` (basename only, so snapshots stay
+/// portable) and the `py::`-prefixed name stays as the function. Nodes whose
+/// function is not a resolvable address are left untouched.
+fn symbolize_node(node: &mut Node, map: &PerfMap) {
+    // Callgrind appends a `'N` recursion marker (e.g. `0x1234'2`); strip it to
+    // resolve the address, then re-attach it so the marker survives on the
+    // resolved name like on native frames.
+    let (base, cycle) = split_cycle_suffix(&node.function);
+    let Some(addr) = parse_hex_address(base) else {
+        return;
+    };
+    let Some(symbol) = map.resolve(addr) else {
+        return;
+    };
+    let (name, file) = split_symbol_file(symbol);
+    node.function = format!("{name}{cycle}");
+    if let Some(file) = file {
+        node.file = file;
+    }
+}
+
+fn parse_hex_address(name: &str) -> Option<u64> {
+    let hex = name.strip_prefix("0x")?;
+    u64::from_str_radix(hex, 16).ok()
+}
+
+/// Split a trailing Callgrind recursion marker `'<digits>` off a node name,
+/// returning (base, marker-including-quote). No marker yields an empty suffix.
+fn split_cycle_suffix(name: &str) -> (&str, &str) {
+    let Some((base, digits)) = name.rsplit_once('\'') else {
+        return (name, "");
+    };
+    if !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit()) {
+        return (base, &name[base.len()..]);
+    }
+    (name, "")
+}
+
+/// Split `py::<qualname>:<path>` into (`py::<qualname>`, basename of `<path>`).
+/// The last `:` separates the file, unless it is part of the `py::` prefix; a
+/// symbol without a file part (rare) keeps its name and gets no file.
+fn split_symbol_file(symbol: &str) -> (String, Option<String>) {
+    let Some((name, path)) = symbol.rsplit_once(':').filter(|(n, _)| !n.ends_with(':')) else {
+        return (symbol.to_string(), None);
+    };
+    let base = path
+        .rsplit('/')
+        .next()
+        .filter(|p| !p.is_empty())
+        .unwrap_or(path);
+    (name.to_string(), Some(base.to_string()))
+}
diff --git a/callgrind-utils/testdata/fractal.py b/callgrind-utils/testdata/fractal.py
new file mode 100644
index 000000000..d0e4fba1d
--- /dev/null
+++ b/callgrind-utils/testdata/fractal.py
@@ -0,0 +1,256 @@
+# Python twin of `testdata/fractal.rs`: a self-contained copy of the CodSpeed
+# e2e Python benchmark (its `fractal.py` + `benchmark.py` merged), driven the
+# way CodSpeed drives a benchmark.
+#
+# Instrumentation is off at startup (run with --instr-atstart=no) and turned on
+# around the measured region via the `clgctl` shim, whose compiled path is
+# passed as argv[1]. The client requests fire several frames deep
+# (main -> run_benchmark -> warmup -> run_measured), mirroring the Rust twin, so
+# the seeder must reconstruct the native chain at the OFF->ON transition.
+#
+# Before starting, we tell Callgrind to skip the Python runtime objects
+# (libpython and the python executable), exactly as pytest-codspeed's
+# instrument-hooks does in _callgrind_skip_python_runtime: the interpreter's own
+# C frames are folded into their callers so they don't obscure the graph.
+# Matching is by exact realpath, since Callgrind keys obj-skip on the mapped
+# object path.
+
+import ctypes
+import math
+import os
+import sys
+import sysconfig
+from typing import Dict, List
+
+clgctl = ctypes.CDLL(sys.argv[1])
+
+# Benchmark workload parameters, matching the e2e `test_benchmark.py` /
+# `bench_fractal.rs` case: complex_fractal_benchmark(5, 3, 25).
+TREE_DEPTH = 5
+BRANCH_FACTOR = 3
+FIB_N = 25
+
+
+def skip_python_runtime():
+    ldlibrary = sysconfig.get_config_var("LDLIBRARY")
+    libdir = sysconfig.get_config_var("LIBDIR")
+    libpython = next(
+        (
+            p
+            for p in (
+                os.path.join(libdir, ldlibrary) if ldlibrary and libdir else None,
+                os.path.join(sys.prefix, "lib", ldlibrary) if ldlibrary else None,
+            )
+            if p and os.path.exists(p)
+        ),
+        None,
+    )
+    for path in (libpython, sys.executable):
+        if path:
+            clgctl.clg_add_obj_skip(os.path.realpath(path).encode())
+
+
+class NodeMetadata:
+    """Metadata for a fractal node."""
+
+    def __init__(self, depth: int, branch_factor: int):
+        self.depth = depth
+        self.branch_factor = branch_factor
+        self.computed_hash = 0
+
+
+class FractalNode:
+    """A node in a fractal computation tree."""
+
+    def __init__(self, value: float, depth: int, branch_factor: int):
+        self.value = value
+        self.children: List[FractalNode] = []
+        self.metadata = NodeMetadata(depth, branch_factor)
+
+    @classmethod
+    def build_fractal(
+        cls, depth: int, max_depth: int, branch_factor: int, seed: float
+    ) -> "FractalNode":
+        """Recursively build a fractal tree with branching patterns."""
+        node = cls(seed, depth, branch_factor)
+
+        if depth < max_depth:
+            for i in range(branch_factor):
+                child_seed = cls._compute_child_value(seed, i, depth)
+                child = cls.build_fractal(depth + 1, max_depth, branch_factor, child_seed)
+                node.children.append(child)
+
+        node.metadata.computed_hash = node.compute_tree_hash()
+        return node
+
+    @staticmethod
+    def _compute_child_value(parent_value: float, child_index: int, depth: int) -> float:
+        """Nested helper function to compute child values."""
+        base = parent_value * 0.618033988749  # Golden ratio conjugate
+        offset = (child_index + 1) * (depth + 1)
+        return abs(math.sin(base + offset)) * 100.0
+
+    def compute_tree_hash(self) -> int:
+        """Recursively compute a hash of the entire tree structure."""
+        hash_value = int(self.value * 1000)
+        hash_value = (hash_value * 31 + self.metadata.depth) & 0xFFFFFFFFFFFFFFFF
+        for child in self.children:
+            child_hash = child.compute_tree_hash()
+            hash_value = (hash_value * 31 + child_hash) & 0xFFFFFFFFFFFFFFFF
+        return hash_value
+
+    def recursive_sum(self) -> float:
+        """Recursively compute the sum of all values in the tree."""
+        children_sum = sum(child.recursive_sum() for child in self.children)
+        return self.value + children_sum
+
+    def max_path_sum(self) -> float:
+        """Recursively find the maximum path sum from root to any leaf."""
+        if not self.children:
+            return self.value
+        max_child_path = max(child.max_path_sum() for child in self.children)
+        return self.value + max_child_path
+
+    def count_nodes(self) -> int:
+        """Recursively count all nodes in the tree."""
+        return 1 + sum(child.count_nodes() for child in self.children)
+
+    def collect_leaves(self, leaves: List[float]) -> None:
+        """Recursively collect all leaf values."""
+        if not self.children:
+            leaves.append(self.value)
+        else:
+            for child in self.children:
+                child.collect_leaves(leaves)
+
+
+class TreeAnalysis:
+    """Results of fractal tree analysis."""
+
+    def __init__(
+        self,
+        total_sum: float,
+        node_count: int,
+        max_path: float,
+        leaf_variance: float,
+        complexity_score: float,
+    ):
+        self.total_sum = total_sum
+        self.node_count = node_count
+        self.max_path = max_path
+        self.leaf_variance = leaf_variance
+        self.complexity_score = complexity_score
+
+
+def fibonacci_memo(n: int, memo: Dict[int, int]) -> int:
+    """Compute Fibonacci with memoization (recursive with nested dict operations)."""
+    if n <= 1:
+        return n
+    if n in memo:
+        return memo[n]
+    result = fibonacci_memo(n - 1, memo) + fibonacci_memo(n - 2, memo)
+    memo[n] = result
+    return result
+
+
+def compute_variance(values: List[float]) -> float:
+    """Nested helper to compute variance."""
+    if not values:
+        return 0.0
+    mean = sum(values) / len(values)
+    variance = sum((v - mean) ** 2 for v in values) / len(values)
+    return variance
+
+
+def recursive_path_score(value: float, depth: int) -> float:
+    """Recursive helper for path scoring."""
+    if depth == 0 or value < 1.0:
+        return value
+    reduced = value * 0.8
+    return 1.0 + recursive_path_score(reduced, depth - 1) * 0.5
+
+
+def compute_complexity_score(node_count: int, variance: float, max_path: float) -> float:
+    """Nested helper to compute complexity score (with recursive internal call)."""
+    base_score = math.log(node_count) * math.sqrt(variance)
+    path_factor = recursive_path_score(max_path, 5)
+    return base_score * path_factor
+
+
+def analyze_fractal_tree(tree: FractalNode, analysis_depth: int) -> TreeAnalysis:
+    """Nested function that analyzes the fractal tree with multiple passes."""
+    total_sum = tree.recursive_sum()
+    node_count = tree.count_nodes()
+    max_path = tree.max_path_sum()
+
+    leaves: List[float] = []
+    tree.collect_leaves(leaves)
+    leaf_variance = compute_variance(leaves)
+
+    if analysis_depth > 0:
+        nested_analysis = analyze_fractal_tree(tree, analysis_depth - 1)
+        return TreeAnalysis(
+            total_sum=total_sum + nested_analysis.total_sum * 0.1,
+            node_count=node_count,
+            max_path=max(max_path, nested_analysis.max_path),
+            leaf_variance=(leaf_variance + nested_analysis.leaf_variance) / 2.0,
+            complexity_score=compute_complexity_score(node_count, leaf_variance, max_path),
+        )
+    return TreeAnalysis(
+        total_sum=total_sum,
+        node_count=node_count,
+        max_path=max_path,
+        leaf_variance=leaf_variance,
+        complexity_score=compute_complexity_score(node_count, leaf_variance, max_path),
+    )
+
+
+def complex_fractal_benchmark(tree_depth: int, branch_factor: int, fib_n: int) -> float:
+    """Main benchmark: complex fractal tree computation."""
+    tree = FractalNode.build_fractal(0, tree_depth, branch_factor, 42.0)
+    analysis = analyze_fractal_tree(tree, 2)
+
+    memo: Dict[int, int] = {}
+    fib_result = float(fibonacci_memo(fib_n, memo))
+
+    tree_hash = float(tree.compute_tree_hash())
+    tree_metric = (
+        analysis.total_sum
+        + (analysis.node_count * 10.0)
+        + analysis.max_path
+        + analysis.leaf_variance
+    )
+    return (tree_metric + fib_result + tree_hash) % 1_000_000.0
+
+
+# Deepest frame: instrumentation is turned on here, with
+# main -> run_benchmark -> warmup -> run_measured already live on the native
+# stack but the shadow stack empty. The seeder reconstructs that chain.
+def run_measured() -> float:
+    clgctl.clg_start()
+    result = complex_fractal_benchmark(TREE_DEPTH, BRANCH_FACTOR, FIB_N)
+    clgctl.clg_stop()
+    return result
+
+
+# Two unmeasured warmup iterations (instrumentation still off) before the
+# measured run, like a real benchmark harness.
+def warmup() -> float:
+    acc = 0.0
+    for _ in range(2):
+        acc += complex_fractal_benchmark(TREE_DEPTH, BRANCH_FACTOR, FIB_N)
+    return run_measured()
+
+
+def run_benchmark() -> float:
+    return warmup()
+
+
+def main() -> None:
+    skip_python_runtime()
+    result = run_benchmark()
+    assert 0 <= result < 1_000_000.0, result
+
+
+if __name__ == "__main__":
+    main()
diff --git a/callgrind-utils/tests/python_fractal_callgraph.rs b/callgrind-utils/tests/python_fractal_callgraph.rs
new file mode 100644
index 000000000..591061164
--- /dev/null
+++ b/callgrind-utils/tests/python_fractal_callgraph.rs
@@ -0,0 +1,185 @@
+//! Golden snapshot of the Python fractal fixture's call graph, with real Python
+//! frames recovered via CPython's `-X perf` trampolines.
+//!
+//! The Python twin of `tests/rust_callgraph.rs`: profile `testdata/fractal.py`
+//! (a self-contained copy of the CodSpeed e2e Python benchmark) live under the
+//! in-repo Callgrind with `--instr-atstart=no`, parse, symbolize, and snapshot
+//! the redacted folded stacks.
+//!
+//! Callgrind is a native profiler, so on its own it only sees the CPython
+//! interpreter's C frames, not the Python functions. To surface real Python
+//! frames the fixture runs under `python3 -X perf`, whose per-code-object
+//! trampolines Callgrind records as anonymous `0x...` addresses and CPython maps
+//! to `py::<qualname>:<file>` in `/tmp/perf-<pid>.map`. `symbolize_perf_map`
+//! resolves those addresses back to names. Because `setarch` execs into
+//! Valgrind, the spawned pid equals CPython's `getpid()`, so the map is at a
+//! deterministic path.
+//!
+//! The fixture obj-skips libpython (like pytest-codspeed's instrument-hooks), so
+//! the interpreter's own C frames fold into the `py::` trampoline frames and the
+//! graph is a clean Python call tree (`py::run_benchmark` -> ... ->
+//! `py::complex_fractal_benchmark` -> ...), with only small libc/libm residuals.
+//! The client requests fire several frames deep, so the seeder reconstructs the
+//! native chain at the OFF->ON transition.
+//!
+//! Requires a built `./vg-in-place` at the repo root, `cc`, and `setarch`. Skips
+//! (with a stderr note) when `python3` is not on PATH, like `.vgtest` `prereq`.
+use std::env::consts::ARCH;
+use std::io::Cursor;
+use std::path::{Path, PathBuf};
+use std::process::Command;
+
+use callgrind_utils::model::CallGraph;
+use callgrind_utils::perf_map::PerfMap;
+
+/// Repo root: this crate lives at `<repo>/callgrind-utils`.
+fn repo_root() -> PathBuf {
+    Path::new(env!("CARGO_MANIFEST_DIR"))
+        .parent()
+        .expect("crate has a parent directory")
+        .to_path_buf()
+}
+
+fn vg_in_place() -> PathBuf {
+    let path = repo_root().join("vg-in-place");
+    assert!(
+        path.is_file(),
+        "vg-in-place not found at {} - build Valgrind in place first",
+        path.display()
+    );
+    path
+}
+
+fn have_python3() -> bool {
+    Command::new("python3")
+        .arg("--version")
+        .output()
+        .map(|o| o.status.success())
+        .unwrap_or(false)
+}
+
+/// Compile the Callgrind client-request shim the Python fixture loads via
+/// `ctypes`, as a shared library against the in-repo `callgrind.h`.
+fn compile_clgctl(work: &Path) -> PathBuf {
+    let repo = repo_root();
+    let src = Path::new(env!("CARGO_MANIFEST_DIR")).join("testdata/clgctl.c");
+    std::fs::create_dir_all(work).expect("create work dir");
+    let lib = work.join("libclgctl.so");
+
+    let status = Command::new("cc")
+        .args(["-g", "-O0", "-shared", "-fPIC"])
+        .arg("-I")
+        .arg(repo.join("callgrind"))
+        .arg("-I")
+        .arg(repo.join("include"))
+        .arg("-o")
+        .arg(&lib)
+        .arg(&src)
+        .status()
+        .unwrap_or_else(|e| panic!("failed to spawn cc for clgctl: {e}"));
+    assert!(
+        status.success(),
+        "cc failed for {} ({status})",
+        src.display()
+    );
+    lib
+}
+
+fn runner_callgrind_args(out_file: &Path) -> Vec<String> {
+    [
+        "-q",
+        "--trace-children=yes",
+        "--cache-sim=yes",
+        "--I1=32768,8,64",
+        "--D1=32768,8,64",
+        "--LL=8388608,16,64",
+        "--instr-atstart=no",
+        "--collect-systime=nsec",
+        "--read-inline-info=yes",
+        "--tool=callgrind",
+        "--compress-strings=no",
+        "--combine-dumps=yes",
+        "--dump-line=no",
+    ]
+    .into_iter()
+    .map(str::to_string)
+    .chain([format!("--callgrind-out-file={}", out_file.display())])
+    .collect()
+}
+
+/// Profile `testdata/fractal.py` under `python3 -X perf` with the runner's
+/// Callgrind flags. Returns the `.out` contents and the parsed
+/// `/tmp/perf-<pid>.map`. `setarch` execs into Valgrind, so the spawned pid is
+/// CPython's `getpid()` and thus the perf-map filename.
+fn run_python(clgctl: &Path, out_file: &Path) -> (String, PerfMap) {
+    let script = Path::new(env!("CARGO_MANIFEST_DIR")).join("testdata/fractal.py");
+    let log_file = out_file.with_extension("valgrind.log");
+
+    let mut child = Command::new("setarch")
+        .arg(ARCH)
+        .arg("--addr-no-randomize")
+        .arg(vg_in_place())
+        .args(runner_callgrind_args(out_file))
+        .arg(format!("--log-file={}", log_file.display()))
+        .arg("python3")
+        .arg("-X")
+        .arg("perf")
+        .arg(&script)
+        .arg(clgctl)
+        .spawn()
+        .unwrap_or_else(|e| panic!("failed to spawn setarch/vg-in-place: {e}"));
+    let pid = child.id();
+    let status = child.wait().expect("wait for vg-in-place");
+    assert!(status.success(), "vg-in-place exited with {status}");
+
+    let raw = std::fs::read_to_string(out_file)
+        .unwrap_or_else(|e| panic!("read {}: {e}", out_file.display()));
+    let perf_map_path = PathBuf::from(format!("/tmp/perf-{pid}.map"));
+    let perf_map = PerfMap::from_file(&perf_map_path)
+        .unwrap_or_else(|e| panic!("read {}: {e}", perf_map_path.display()));
+    (raw, perf_map)
+}
+
+#[test]
+fn python_fractal_canonical_json() {
+    if !have_python3() {
+        eprintln!("skipping python_fractal_canonical_json: python3 not on PATH");
+        return;
+    }
+
+    let work = Path::new(env!("CARGO_TARGET_TMPDIR")).join("scoped");
+    let clgctl = compile_clgctl(&work);
+    let out_file = work.join("fractal_py.callgrind.out");
+    let (raw, perf_map) = run_python(&clgctl, &out_file);
+    let graph = CallGraph::parse(Cursor::new(raw.as_str()))
+        .unwrap_or_else(|e| panic!("parse python callgrind output: {e:?}"))
+        .symbolize_perf_map(&perf_map)
+        .redact();
+    insta::assert_snapshot!(
+        "fractal_py_folded",
+        graph.to_folded_without_costs().join("\n")
+    );
+    graph.to_flamegraph_file("fractal_py.partial.svg").unwrap();
+}
+
+#[test]
+fn python_fractal_full_trace() {
+    if !have_python3() {
+        eprintln!("skipping python_fractal_full_trace: python3 not on PATH");
+        return;
+    }
+
+    let work = Path::new(env!("CARGO_TARGET_TMPDIR")).join("full");
+    let clgctl = compile_clgctl(&work);
+    let out_file = work.join("fractal_py.full.callgrind.out");
+    let (raw, perf_map) = run_python(&clgctl, &out_file);
+    let graph = CallGraph::parse(Cursor::new(raw.as_str()))
+        .unwrap_or_else(|e| panic!("parse python full callgrind output: {e:?}"))
+        .symbolize_perf_map(&perf_map);
+
+    insta::assert_snapshot!(
+        "fractal_py_full_folded",
+        graph.to_folded_without_costs().join("\n")
+    );
+    graph.to_flamegraph_file("fractal_py.full.svg").unwrap();
+}
diff --git a/callgrind-utils/tests/snapshots/python_fractal_callgraph__fractal_py_folded.snap b/callgrind-utils/tests/snapshots/python_fractal_callgraph__fractal_py_folded.snap
new file mode 100644
index 000000000..63633c4b9
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/python_fractal_callgraph__fractal_py_folded.snap
@@ -0,0 +1,98 @@
+---
+source: tests/python_fractal_callgraph.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+py::run_measured <cost>
+py::run_measured;py::complex_fractal_benchmark <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.__init__ <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.__init__;py::NodeMetadata.__init__ <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.__init__;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.__init__;???;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode._compute_child_value <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode._compute_child_value;math_sin <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.build_fractal'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.compute_tree_hash <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2;py::FractalNode.compute_tree_hash'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2;???;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.compute_tree_hash;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.compute_tree_hash;???;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;???;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.compute_tree_hash <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2;py::FractalNode.compute_tree_hash'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2;???;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.compute_tree_hash;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.compute_tree_hash;???;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.compute_tree_hash <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2;py::FractalNode.compute_tree_hash'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2;???;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.compute_tree_hash;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.compute_tree_hash;???;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.collect_leaves <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.collect_leaves;py::FractalNode.collect_leaves'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.collect_leaves;py::FractalNode.collect_leaves'2;py::FractalNode.collect_leaves'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.count_nodes <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr> <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2;py::FractalNode.count_nodes.<locals>.<genexpr>'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2;py::FractalNode.count_nodes.<locals>.<genexpr>'2;py::FractalNode.count_nodes'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2;???;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.max_path_sum <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.max_path_sum;py::FractalNode.max_path_sum.<locals>.<genexpr> <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.max_path_sum;py::FractalNode.max_path_sum.<locals>.<genexpr>;py::FractalNode.max_path_sum'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.max_path_sum;py::FractalNode.max_path_sum.<locals>.<genexpr>;py::FractalNode.max_path_sum'2;py::FractalNode.max_path_sum.<locals>.<genexpr>'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.max_path_sum;py::FractalNode.max_path_sum.<locals>.<genexpr>;py::FractalNode.max_path_sum'2;py::FractalNode.max_path_sum.<locals>.<genexpr>'2;py::FractalNode.max_path_sum'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.recursive_sum <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr> <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2;py::FractalNode.recursive_sum.<locals>.<genexpr>'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2;py::FractalNode.recursive_sum.<locals>.<genexpr>'2;py::FractalNode.recursive_sum'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2;???;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.collect_leaves <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.collect_leaves;py::FractalNode.collect_leaves'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.collect_leaves;py::FractalNode.collect_leaves'2;py::FractalNode.collect_leaves'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.count_nodes <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr> <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2;py::FractalNode.count_nodes.<locals>.<genexpr>'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2;py::FractalNode.count_nodes.<locals>.<genexpr>'2;py::FractalNode.count_nodes'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2;???;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.max_path_sum <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.max_path_sum;py::FractalNode.max_path_sum.<locals>.<genexpr> <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.max_path_sum;py::FractalNode.max_path_sum.<locals>.<genexpr>;py::FractalNode.max_path_sum'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.max_path_sum;py::FractalNode.max_path_sum.<locals>.<genexpr>;py::FractalNode.max_path_sum'2;py::FractalNode.max_path_sum.<locals>.<genexpr>'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.max_path_sum;py::FractalNode.max_path_sum.<locals>.<genexpr>;py::FractalNode.max_path_sum'2;py::FractalNode.max_path_sum.<locals>.<genexpr>'2;py::FractalNode.max_path_sum'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.recursive_sum <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr> <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2;py::FractalNode.recursive_sum.<locals>.<genexpr>'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2;py::FractalNode.recursive_sum.<locals>.<genexpr>'2;py::FractalNode.recursive_sum'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2;???;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::analyze_fractal_tree'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::compute_variance <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::compute_variance;py::compute_variance.<locals>.<genexpr> <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::compute_variance;py::compute_variance.<locals>.<genexpr>;pow <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::compute_variance;py::compute_variance.<locals>.<genexpr>;pow;__ieee754_pow_fma <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::compute_variance <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::compute_variance;py::compute_variance.<locals>.<genexpr> <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::compute_variance;py::compute_variance.<locals>.<genexpr>;pow <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::compute_variance;py::compute_variance.<locals>.<genexpr>;pow;__ieee754_pow_fma <cost>
+py::run_measured;py::complex_fractal_benchmark;py::fibonacci_memo <cost>
+py::run_measured;py::complex_fractal_benchmark;py::fibonacci_memo;py::fibonacci_memo'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::fibonacci_memo;py::fibonacci_memo'2;py::fibonacci_memo'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;??? <cost>
+py::run_measured;py::complex_fractal_benchmark;???;??? <cost>
diff --git a/callgrind-utils/tests/snapshots/python_fractal_callgraph__fractal_py_full_folded.snap b/callgrind-utils/tests/snapshots/python_fractal_callgraph__fractal_py_full_folded.snap
new file mode 100644
index 000000000..62cbcf296
--- /dev/null
+++ b/callgrind-utils/tests/snapshots/python_fractal_callgraph__fractal_py_full_folded.snap
@@ -0,0 +1,85 @@
+---
+source: tests/python_fractal_callgraph.rs
+expression: "graph.to_folded_without_costs().join(\"\\n\")"
+---
+py::run_measured <cost>
+py::run_measured;py::complex_fractal_benchmark <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.__init__ <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.__init__;py::NodeMetadata.__init__ <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.__init__;__tls_get_addr <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode._compute_child_value <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode._compute_child_value;math_sin <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.build_fractal'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.compute_tree_hash <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2;py::FractalNode.compute_tree_hash'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2;__tls_get_addr <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;py::FractalNode.compute_tree_hash;__tls_get_addr <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.build_fractal'2;__tls_get_addr <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.compute_tree_hash <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2;py::FractalNode.compute_tree_hash'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2;__tls_get_addr <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.build_fractal;py::FractalNode.compute_tree_hash;__tls_get_addr <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.compute_tree_hash <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2;py::FractalNode.compute_tree_hash'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.compute_tree_hash;py::FractalNode.compute_tree_hash'2;__tls_get_addr <cost>
+py::run_measured;py::complex_fractal_benchmark;py::FractalNode.compute_tree_hash;__tls_get_addr <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.collect_leaves <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.collect_leaves;py::FractalNode.collect_leaves'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.collect_leaves;py::FractalNode.collect_leaves'2;py::FractalNode.collect_leaves'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.count_nodes <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr> <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2;py::FractalNode.count_nodes.<locals>.<genexpr>'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2;py::FractalNode.count_nodes.<locals>.<genexpr>'2;py::FractalNode.count_nodes'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2;__tls_get_addr <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.max_path_sum <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.max_path_sum;py::FractalNode.max_path_sum.<locals>.<genexpr> <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.max_path_sum;py::FractalNode.max_path_sum.<locals>.<genexpr>;py::FractalNode.max_path_sum'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.max_path_sum;py::FractalNode.max_path_sum.<locals>.<genexpr>;py::FractalNode.max_path_sum'2;py::FractalNode.max_path_sum.<locals>.<genexpr>'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.max_path_sum;py::FractalNode.max_path_sum.<locals>.<genexpr>;py::FractalNode.max_path_sum'2;py::FractalNode.max_path_sum.<locals>.<genexpr>'2;py::FractalNode.max_path_sum'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.recursive_sum <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr> <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2;py::FractalNode.recursive_sum.<locals>.<genexpr>'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2;py::FractalNode.recursive_sum.<locals>.<genexpr>'2;py::FractalNode.recursive_sum'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2;__tls_get_addr <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.collect_leaves <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.collect_leaves;py::FractalNode.collect_leaves'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.collect_leaves;py::FractalNode.collect_leaves'2;py::FractalNode.collect_leaves'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.count_nodes <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr> <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2;py::FractalNode.count_nodes.<locals>.<genexpr>'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2;py::FractalNode.count_nodes.<locals>.<genexpr>'2;py::FractalNode.count_nodes'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.count_nodes;py::FractalNode.count_nodes.<locals>.<genexpr>;py::FractalNode.count_nodes'2;__tls_get_addr <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.max_path_sum <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.max_path_sum;py::FractalNode.max_path_sum.<locals>.<genexpr> <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.max_path_sum;py::FractalNode.max_path_sum.<locals>.<genexpr>;py::FractalNode.max_path_sum'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.max_path_sum;py::FractalNode.max_path_sum.<locals>.<genexpr>;py::FractalNode.max_path_sum'2;py::FractalNode.max_path_sum.<locals>.<genexpr>'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.max_path_sum;py::FractalNode.max_path_sum.<locals>.<genexpr>;py::FractalNode.max_path_sum'2;py::FractalNode.max_path_sum.<locals>.<genexpr>'2;py::FractalNode.max_path_sum'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.recursive_sum <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr> <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2;py::FractalNode.recursive_sum.<locals>.<genexpr>'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2;py::FractalNode.recursive_sum.<locals>.<genexpr>'2;py::FractalNode.recursive_sum'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::FractalNode.recursive_sum;py::FractalNode.recursive_sum.<locals>.<genexpr>;py::FractalNode.recursive_sum'2;__tls_get_addr <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::analyze_fractal_tree'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::compute_variance <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::compute_variance;py::compute_variance.<locals>.<genexpr> <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::compute_variance;py::compute_variance.<locals>.<genexpr>;pow@@GLIBC_2.29 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::analyze_fractal_tree'2;py::compute_variance;py::compute_variance.<locals>.<genexpr>;pow@@GLIBC_2.29;__ieee754_pow_fma <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::compute_variance <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::compute_variance;py::compute_variance.<locals>.<genexpr> <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::compute_variance;py::compute_variance.<locals>.<genexpr>;pow@@GLIBC_2.29 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::analyze_fractal_tree;py::compute_variance;py::compute_variance.<locals>.<genexpr>;pow@@GLIBC_2.29;__ieee754_pow_fma <cost>
+py::run_measured;py::complex_fractal_benchmark;py::fibonacci_memo <cost>
+py::run_measured;py::complex_fractal_benchmark;py::fibonacci_memo;py::fibonacci_memo'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;py::fibonacci_memo;py::fibonacci_memo'2;py::fibonacci_memo'2 <cost>
+py::run_measured;py::complex_fractal_benchmark;__tls_get_addr <cost>

From f7a2d78d9df3fbb3f94e0d3c7179014377c6f5ba Mon Sep 17 00:00:00 2001
From: not-matthias <matthias@codspeed.io>
Date: Fri, 3 Jul 2026 11:17:39 +0200
Subject: [PATCH 8/9] ci: switch to ubuntu-latest

---
 .github/workflows/codspeed.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/codspeed.yml b/.github/workflows/codspeed.yml
index afbc045db..051dc959c 100644
--- a/.github/workflows/codspeed.yml
+++ b/.github/workflows/codspeed.yml
@@ -9,7 +9,7 @@ on:
 
 jobs:
   benchmarks:
-    runs-on: codspeed-macro
+    runs-on: ubuntu-latest
     timeout-minutes: 15
     strategy:
       fail-fast: false

From 9db3f68aa4d81a18b86d044fb2b5dd88eaf03a1d Mon Sep 17 00:00:00 2001
From: not-matthias <matthias@codspeed.io>
Date: Fri, 3 Jul 2026 11:27:21 +0200
Subject: [PATCH 9/9] ci: switch to ubuntu-24.04-arm

---
 .github/workflows/codspeed.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/codspeed.yml b/.github/workflows/codspeed.yml
index 051dc959c..590e4615f 100644
--- a/.github/workflows/codspeed.yml
+++ b/.github/workflows/codspeed.yml
@@ -9,7 +9,7 @@ on:
 
 jobs:
   benchmarks:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-24.04-arm
     timeout-minutes: 15
     strategy:
       fail-fast: false