Skip to content

Use libunwind for jemalloc memory profiling stack traces#5602

Merged
ndr-ds merged 1 commit into
mainfrom
ndr-ds/jemalloc-prof-libunwind
Apr 24, 2026
Merged

Use libunwind for jemalloc memory profiling stack traces#5602
ndr-ds merged 1 commit into
mainfrom
ndr-ds/jemalloc-prof-libunwind

Conversation

@ndr-ds
Copy link
Copy Markdown
Contributor

@ndr-ds ndr-ds commented Mar 5, 2026

Motivation

The default libgcc DWARF unwinder (_Unwind_Find_FDE) used by jemalloc for profiling
stack traces has a known livelock
(jemalloc/jemalloc#2282) — one
thread hangs in _Unwind_Find_FDE and all others block on futex_wait. We've hit this in
production on validator shards.

Proposal

Switch jemalloc's stack unwinder to libunwind by building with
--enable-prof-libunwind.

  • Point tikv-jemallocator at
    linera-io/jemallocator fork which adds a
    profiling_libunwind feature (upstream doesn't support it yet —
    tikv/jemallocator#146)
  • Enable profiling_libunwind in the jemalloc feature
  • Add libunwind-dev (builder) and libunwind8 (runtime) to docker/Dockerfile

Test Plan

  • CI
  • Verify memory profiling works end-to-end in a temporary network deployment

Release Plan

  • These changes shobe backported to testnet_conway.

Links

Copy link
Copy Markdown
Contributor Author

ndr-ds commented Mar 5, 2026

@ndr-ds ndr-ds force-pushed the ndr-ds/jemalloc-prof-libunwind branch from a006f3b to 9a3f534 Compare March 5, 2026 12:50
@ndr-ds ndr-ds force-pushed the ndr-ds/enable-memory-profiling-flag branch from 0ec146a to 14f2a9f Compare March 5, 2026 12:50
@ndr-ds ndr-ds changed the base branch from ndr-ds/enable-memory-profiling-flag to graphite-base/5602 March 25, 2026 03:43
@ndr-ds ndr-ds force-pushed the ndr-ds/jemalloc-prof-libunwind branch from 9a3f534 to 3cf1a03 Compare March 25, 2026 03:44
@ndr-ds ndr-ds force-pushed the graphite-base/5602 branch from 14f2a9f to 23c683e Compare March 25, 2026 03:44
@ndr-ds ndr-ds changed the base branch from graphite-base/5602 to ndr-ds/enable-memory-profiling-flag March 25, 2026 03:44
@ndr-ds ndr-ds marked this pull request as ready for review March 25, 2026 03:47
ndr-ds added a commit that referenced this pull request Mar 25, 2026
…ces (#5792)

## Motivation

Backport of #5602, stacked on the backport of #5601. Switches jemalloc's
stack trace
collection from the default (frame pointers) to libunwind, which
produces complete stack
traces in release builds without requiring frame pointers.

## Proposal

Cherry-pick of `9a3f534cbdb` onto the backport of #5601.

## Test Plan

CI
@ndr-ds ndr-ds changed the base branch from ndr-ds/enable-memory-profiling-flag to graphite-base/5602 March 25, 2026 04:25
@ndr-ds ndr-ds force-pushed the graphite-base/5602 branch from 23c683e to ec6dcd1 Compare March 25, 2026 04:25
@ndr-ds ndr-ds force-pushed the ndr-ds/jemalloc-prof-libunwind branch from 3cf1a03 to 432b59f Compare March 25, 2026 04:25
@ndr-ds ndr-ds changed the base branch from graphite-base/5602 to ndr-ds/enable-memory-profiling-flag March 25, 2026 04:25
@ndr-ds ndr-ds changed the base branch from ndr-ds/enable-memory-profiling-flag to graphite-base/5602 March 25, 2026 12:51
@ndr-ds ndr-ds force-pushed the graphite-base/5602 branch from ec6dcd1 to 1489741 Compare March 25, 2026 12:52
@ndr-ds ndr-ds force-pushed the ndr-ds/jemalloc-prof-libunwind branch from 432b59f to 05f5482 Compare March 25, 2026 12:52
@ndr-ds ndr-ds changed the base branch from graphite-base/5602 to ndr-ds/enable-memory-profiling-flag March 25, 2026 12:52
@ndr-ds ndr-ds force-pushed the ndr-ds/enable-memory-profiling-flag branch from 1489741 to 2490b0e Compare March 25, 2026 13:10
@ndr-ds ndr-ds force-pushed the ndr-ds/jemalloc-prof-libunwind branch from 05f5482 to eca652a Compare March 25, 2026 13:10
@ndr-ds ndr-ds force-pushed the ndr-ds/enable-memory-profiling-flag branch from 2490b0e to cc79847 Compare March 25, 2026 13:25
@ndr-ds ndr-ds force-pushed the ndr-ds/jemalloc-prof-libunwind branch from eca652a to d2f41a8 Compare March 25, 2026 13:25
@ndr-ds ndr-ds requested review from afck, deuszx and ma2bd March 25, 2026 13:47
Comment thread Cargo.toml Outdated
thiserror = "1.0.65"
thiserror-context = "0.1.1"
tikv-jemallocator = "0.6.0"
tikv-jemallocator = { git = "https://github.com/linera-io/jemallocator.git", rev = "16f3e75fb145e12f5ae15ed232d1368d4539fc14" }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we use git dependencies we won't be able to publish to crates.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good point, we might have to publish this fork to crates.io then. I have a PR to hopefully merge this upstream tikv/jemallocator#159, but until that gets merged (if it ever does), we'll need to use our published fork

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nvm, maybe if we just add a version it'll work?

Base automatically changed from ndr-ds/enable-memory-profiling-flag to main March 26, 2026 20:49
@ndr-ds ndr-ds force-pushed the ndr-ds/jemalloc-prof-libunwind branch from d2f41a8 to 6e26770 Compare March 27, 2026 18:12
@ndr-ds ndr-ds requested a review from deuszx March 27, 2026 18:13
ndr-ds added a commit that referenced this pull request Mar 28, 2026
## Motivation

The `test-crates-and-docrs` CI job has been failing on `testnet_conway`
since the
backport of #5602 (#5792). `cargo package` requires all dependencies to
have a `version`
field — the backport included the git spec but dropped `version =
"0.6"`.

## Proposal

Add `version = "0.6"` to the `tikv-jemallocator` workspace dependency,
matching what
#5602 has on `main`. The `[patch.crates-io]` section still ensures the
git fork is used
at build time; the version is only needed so `cargo package` knows what
to reference
when
stripping git specs for packaging.

Also temporarily adds a `pull_request` trigger to the Documentation
workflow so the fix
can be verified on this PR before merging. **This trigger should be
removed before or
after merge.**

## Test Plan

- CI — the `test-crates-and-docrs` job will run on this PR via the
temporary trigger
Comment thread Cargo.toml Outdated
thiserror = "1.0.65"
thiserror-context = "0.1.1"
tikv-jemallocator = "0.6.0"
tikv-jemallocator = { version = "0.6", git = "https://github.com/linera-io/jemallocator.git", rev = "16f3e75fb145e12f5ae15ed232d1368d4539fc14" }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not gonna break the crates' publishing step? If there's a version for it, can't we use one from crates?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. AFAIK cargo publish will strip the git/rev stuff and keep just the version

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm based on this theory what would happen when someone runs cargo install?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's where I was getting at – yes :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good point, compilation might fail :(
I might have to just bite the bullet then and publish the fork crate

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compilation will fail if someone tries to cargo install with the jemalloc feature, but like I said in the other thread here we could split the profiling into a separate feature, then people could still install binaries with jemalloc just not with profiling

Comment thread Cargo.toml Outdated
Comment on lines +413 to +415
[patch.crates-io]
tikv-jemallocator = { git = "https://github.com/linera-io/jemallocator.git", rev = "16f3e75fb145e12f5ae15ed232d1368d4539fc14" }
tikv-jemalloc-sys = { git = "https://github.com/linera-io/jemallocator.git", rev = "16f3e75fb145e12f5ae15ed232d1368d4539fc14" }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe cargo publish will also strip the patches

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand then. Either we don't need it or the published crate will contain a different deps than the non-published code?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the published crate will use the vanilla jemallocator version without our fork changes. Because SDK users probably don't care about jemalloc and memory profiling? And then our own binaries when we build them will have the fork changes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like accidental outcome and not planned 🤔 maybe that's ok.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a crate linera-jemallocator?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So according to Claude's analysis, if we go the published crates route, we need to actually publish 4 crates:

  1. linera-jemalloc-sys — fork of tikv-jemalloc-sys
  2. linera-jemalloc-ctl — fork of tikv-jemalloc-ctl, depends on linera-jemalloc-sys
  3. linera-jemallocator — fork of tikv-jemallocator, depends on linera-jemalloc-sys
  4. linera-jemalloc-pprof — fork of jemalloc_pprof, depends on linera-jemalloc-ctl

If we decide not to publish crates, I could split the profiling into another jemalloc-profiling feature or something, and if people try to cargo install with profiling, the build will fail. I think that's the tradeoff

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm!

@ndr-ds ndr-ds requested a review from deuszx April 2, 2026 13:52
@ndr-ds ndr-ds force-pushed the ndr-ds/jemalloc-prof-libunwind branch from 6e26770 to ae7c5be Compare April 24, 2026 16:39
@ndr-ds ndr-ds added this pull request to the merge queue Apr 24, 2026
Merged via the queue into main with commit db29d41 Apr 24, 2026
37 checks passed
@ndr-ds ndr-ds deleted the ndr-ds/jemalloc-prof-libunwind branch April 24, 2026 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants