feat(node,cli): expose on-demand pprof HTTP endpoint #668
Draft
ByteYue wants to merge 2 commits into
The node previously supported only a hard-coded periodic CPU profiler
(ENABLE_PPROF=1) that writes `.pb` files to disk every three minutes.
That works for post-mortems but is awkward when you want to profile a
specific interval: you have to scp the file off the node and match it
to wall-clock by hand.
Node side:
- New `--pprof_addr <ADDR>` flag on GravityNodeArgs (e.g.
127.0.0.1:6060). When set, starts an axum HTTP server on the
existing tokio runtime exposing `GET /debug/pprof/profile?seconds=N`
which returns a protobuf CPU profile consumable by `go tool pprof`.
A process-wide mutex serializes overlapping requests (pprof uses
global SIGPROF state; overlapping guards produce garbage); the
second concurrent request gets 409 Conflict.
- `--pprof_addr` disables the periodic disk-dump mode because the
two modes conflict over the same profiler state.
- Integration test covers the end-to-end flow (bind → index →
profile → concurrent-reject).
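The reject-instead-of-queue behaviour described above can be sketched with a non-blocking lock attempt. This is a minimal std-only illustration, not the PR's code: `PROFILER_LOCK` and `handle_profile` are hypothetical names, the status codes are plain integers rather than axum response types, and the closure stands in for the actual profile collection.

```rust
use std::sync::{Mutex, TryLockError};

/// Hypothetical process-wide lock; the PR's actual type and name are not shown.
static PROFILER_LOCK: Mutex<()> = Mutex::new(());

/// Runs `collect` while holding the lock and returns `(status, body)`.
/// A caller that finds the lock already held gets 409 instead of waiting,
/// so two profiling sessions never overlap.
pub fn handle_profile<F>(collect: F) -> (u16, Vec<u8>)
where
    F: FnOnce() -> Vec<u8>,
{
    match PROFILER_LOCK.try_lock() {
        // `_guard` stays alive for the whole arm, i.e. while `collect` runs.
        Ok(_guard) => (200, collect()),
        // Another request is profiling right now: reject, do not queue.
        Err(TryLockError::WouldBlock) => (409, b"profile already in progress".to_vec()),
        // A previous holder panicked; report a server error (an assumption,
        // the PR text does not say how this case is handled).
        Err(TryLockError::Poisoned(_)) => (500, b"profiler lock poisoned".to_vec()),
    }
}
```

In an async handler the same idea would apply with whatever lock type the server uses; the point is that the second request fails fast with 409 rather than blocking until the first profile finishes.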
CLI side:
- `gravity-cli node pprof cpu [--addr ADDR] [--duration SECS]
[--frequency HZ] [--output-file PATH]` downloads a profile via the
HTTP endpoint. `--output-file -` streams protobuf to stdout.
Default addr 127.0.0.1:6060 matches the common local setup.
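As a rough illustration of what the CLI has to assemble before downloading, the sketch below builds the request URL from the address, duration and optional frequency. The function name `profile_url`, the `http://` scheme, and the mapping of `--duration` to `seconds` and `--frequency` to a `frequency` query parameter are assumptions made for the example; only the path and the default address come from the PR text.

```rust
/// Default address taken from the PR text.
pub const DEFAULT_ADDR: &str = "127.0.0.1:6060";

/// Builds the profile URL (hypothetical helper, not the PR's code).
/// `addr` falls back to `DEFAULT_ADDR`; `frequency_hz` is appended only
/// when the caller supplies it.
pub fn profile_url(addr: Option<&str>, seconds: u64, frequency_hz: Option<u32>) -> String {
    let addr = addr.unwrap_or(DEFAULT_ADDR);
    let mut url = format!("http://{addr}/debug/pprof/profile?seconds={seconds}");
    if let Some(hz) = frequency_hz {
        url.push_str(&format!("&frequency={hz}"));
    }
    url
}
```

With no address given and a 30-second duration this yields `http://127.0.0.1:6060/debug/pprof/profile?seconds=30`.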
Heap profiling is intentionally out of scope: exposing jemalloc heap
snapshots requires coordinating MALLOC_CONF prof_prefix with the HTTP
server and is deferred.
Summary
The node previously supported only a hard-coded periodic CPU profiler (`ENABLE_PPROF=1`) that writes `.pb` files to disk every three minutes. That works for post-mortems but is awkward when you want to profile a specific interval: you have to scp the file off the node and match it to wall-clock by hand.

This PR adds an on-demand HTTP endpoint and a CLI wrapper.

Node side
- `--pprof_addr <ADDR>` flag on `GravityNodeArgs` (e.g. `127.0.0.1:6060`). When set, starts an axum HTTP server on the existing tokio runtime exposing:
  - `GET /`: index page documenting the endpoints
  - `GET /debug/pprof/profile?seconds=N[&frequency=Hz]`: returns a protobuf CPU profile consumable by `go tool pprof`
- A process-wide mutex serializes overlapping requests: `pprof` uses the global SIGPROF handler and overlapping `ProfilerGuard`s produce garbage data, so the second request gets `409 Conflict`.
- `--pprof_addr` disables the periodic disk-dump mode (`ENABLE_PPROF=1`) because they conflict over the same profiler state. A warning is emitted if both are set.

CLI side
- `--addr` defaults to `127.0.0.1:6060` (matches common local setup), overridable via `GRAVITY_PPROF_ADDR`.
- `--output-file -` streams protobuf bytes to stdout.
- […] `go tool pprof -http=:8080 <file>` command to view the flamegraph.

Scope
- Heap profiling is out of scope: exposing jemalloc heap snapshots requires coordinating `MALLOC_CONF prof_prefix` with the HTTP server and reading back dump files; deferred to a follow-up.
- […] `ENABLE_PPROF` mode except the new precedence rule.

Test plan
- `pprof_server_end_to_end` (in-process): binds ephemeral port → index returns 200 + documents endpoint → profile returns 200 with `application/octet-stream` and valid protobuf tag byte → concurrent overlapping request returns 409 while the first succeeds with 200
- `gravity_node --help` lists `--pprof_addr` with full description
- `gravity-cli node pprof cpu --help` lists all flags
- `cargo build -p gravity_node -p gravity_cli --profile quick-release` succeeds with `RUSTFLAGS="--cfg tokio_unstable"`
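
The "valid protobuf tag byte" check in the test plan is not spelled out here, so the following is only one way such a sanity check could look. It relies on the protobuf wire format, where a tag is `(field_number << 3) | wire_type`; the function names are hypothetical and the actual test may check something different.

```rust
/// Splits a single-byte protobuf tag into `(field_number, wire_type)`.
/// Returns `None` for a multi-byte tag (continuation bit set), for field
/// number 0, or for a wire type outside the defined range 0..=5.
pub fn decode_tag_byte(b: u8) -> Option<(u8, u8)> {
    if b & 0x80 != 0 {
        return None; // varint continues; not handled in this sketch
    }
    let field = b >> 3;
    let wire = b & 0x07;
    if field == 0 || wire > 5 {
        return None;
    }
    Some((field, wire))
}

/// Cheap plausibility check on a response body: the first byte must be a
/// well-formed single-byte tag. This does not validate the whole message.
pub fn looks_like_protobuf(body: &[u8]) -> bool {
    body.first().copied().and_then(decode_tag_byte).is_some()
}
```

For example, a body starting with `0x0a` decodes as field 1 with wire type 2 (length-delimited) and passes, while an empty body fails. A gzip-compressed body starts with `0x1f`, which this check rejects, so a stricter or looser rule may be appropriate depending on how the server encodes the profile.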