
feat(node,cli): expose on-demand pprof HTTP endpoint #668

Draft
ByteYue wants to merge 2 commits into main from feat/node-pprof-http

Contributor

@ByteYue ByteYue commented Apr 19, 2026

Summary

The node previously supported only a hard-coded periodic CPU profiler (ENABLE_PPROF=1) that writes .pb files to disk every three minutes. That works for post-mortems but is awkward when you want to profile a specific interval: you have to scp the file off the node and match it to wall-clock time by hand.

This PR adds an on-demand HTTP endpoint and a CLI wrapper.

Node side

  • New --pprof_addr <ADDR> flag on GravityNodeArgs (e.g. 127.0.0.1:6060). When set, starts an axum HTTP server on the existing tokio runtime exposing:
    • GET / — index page documenting the endpoints
    • GET /debug/pprof/profile?seconds=N[&frequency=Hz] — returns a protobuf CPU profile consumable by go tool pprof
  • Concurrency: a process-wide mutex allows only one profile at a time. pprof uses the global SIGPROF handler, and overlapping ProfilerGuards produce garbage data, so a request that arrives while another profile is running is rejected with 409 Conflict rather than queued.
  • Interaction with existing mode: --pprof_addr disables the periodic disk-dump mode (ENABLE_PPROF=1) because they conflict over the same profiler state. A warning is emitted if both are set.
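
The reject-on-overlap behaviour can be sketched with the standard library alone. This is a minimal, synchronous illustration and not the PR's actual handler: `PROFILER_LOCK`, `try_claim_profiler`, and `handle_profile_request` are hypothetical names, and the `collect` closure stands in for the real sampling step.

```rust
use std::sync::{Mutex, MutexGuard, TryLockError};

/// Process-wide lock: at most one CPU profile may be in flight.
/// (Hypothetical name; the PR only says "a process-wide mutex".)
static PROFILER_LOCK: Mutex<()> = Mutex::new(());

const STATUS_OK: u16 = 200;
const STATUS_CONFLICT: u16 = 409;

/// Try to claim the profiler without blocking.
/// Returns the guard on success, or 409 if another request holds the lock.
fn try_claim_profiler() -> Result<MutexGuard<'static, ()>, u16> {
    match PROFILER_LOCK.try_lock() {
        Ok(guard) => Ok(guard),
        // A previous holder panicked; the lock protects no data, so reuse it.
        Err(TryLockError::Poisoned(poisoned)) => Ok(poisoned.into_inner()),
        // Someone else is profiling right now: reject instead of waiting.
        Err(TryLockError::WouldBlock) => Err(STATUS_CONFLICT),
    }
}

/// Hypothetical handler body: `collect` stands in for the sampling step.
fn handle_profile_request<F: FnOnce() -> Vec<u8>>(collect: F) -> (u16, Vec<u8>) {
    match try_claim_profiler() {
        // The guard lives until the end of this arm, so the lock is held
        // for the whole duration of `collect()`.
        Ok(_guard) => (STATUS_OK, collect()),
        Err(status) => (status, Vec::new()),
    }
}
```

Using `try_lock` rather than `lock` is what turns the second request into an immediate 409 instead of a queued wait. An async handler would need a lock type that suits the runtime; that detail is left out here.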

CLI side

gravity-cli node pprof cpu [--addr ADDR] [--duration SECS] [--frequency HZ] [--output-file PATH]
  • --addr defaults to 127.0.0.1:6060 (matches the common local setup) and can be overridden via GRAVITY_PPROF_ADDR.
  • --output-file - streams protobuf bytes to stdout.
  • On success, prints the go tool pprof -http=:8080 <file> command to view the flamegraph.
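
As a rough sketch of how the CLI options map onto the HTTP request: `resolve_addr` and `profile_url` below are hypothetical helpers, not the PR's code. The precedence shown (explicit flag, then GRAVITY_PPROF_ADDR, then the default) is an assumption; the description only says the default is overridable via the environment variable.

```rust
/// Default address from the PR description.
const DEFAULT_PPROF_ADDR: &str = "127.0.0.1:6060";

/// Resolve the target address.
/// Assumed precedence: --addr flag, then GRAVITY_PPROF_ADDR, then the default.
fn resolve_addr(flag: Option<&str>, env_value: Option<&str>) -> String {
    flag.or(env_value).unwrap_or(DEFAULT_PPROF_ADDR).to_string()
}

/// Build the profile URL for `GET /debug/pprof/profile?seconds=N[&frequency=Hz]`.
fn profile_url(addr: &str, seconds: u64, frequency_hz: Option<u32>) -> String {
    let mut url = format!("http://{}/debug/pprof/profile?seconds={}", addr, seconds);
    // The frequency parameter is optional and only appended when given.
    if let Some(hz) = frequency_hz {
        url.push_str(&format!("&frequency={}", hz));
    }
    url
}
```

For example, `profile_url(&resolve_addr(None, None), 30, Some(99))` yields `http://127.0.0.1:6060/debug/pprof/profile?seconds=30&frequency=99`.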

Scope

  • Heap profiling is out of scope. Exposing jemalloc heap snapshots requires coordinating MALLOC_CONF prof_prefix with the HTTP server and reading back dump files; deferred to a follow-up.
  • No changes to the periodic ENABLE_PPROF mode except the new precedence rule.
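
The precedence rule can be written as a small decision table. This is an illustrative sketch only: `ProfilerMode` and `select_mode` are hypothetical names, the warning text is made up, and `enable_pprof` is assumed to mean "ENABLE_PPROF=1 was set".

```rust
/// Which CPU profiling mode the node starts in (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum ProfilerMode {
    Disabled,
    PeriodicDiskDump,
    OnDemandHttp { addr: String },
}

/// Pick a mode plus an optional warning for the operator.
/// `--pprof_addr` wins over ENABLE_PPROF because both modes would
/// otherwise compete for the same profiler state.
fn select_mode(pprof_addr: Option<&str>, enable_pprof: bool) -> (ProfilerMode, Option<String>) {
    match (pprof_addr, enable_pprof) {
        // Both set: HTTP mode wins and the operator is warned.
        (Some(addr), true) => (
            ProfilerMode::OnDemandHttp { addr: addr.to_string() },
            Some("--pprof_addr is set; periodic ENABLE_PPROF dumps are disabled".to_string()),
        ),
        (Some(addr), false) => (ProfilerMode::OnDemandHttp { addr: addr.to_string() }, None),
        (None, true) => (ProfilerMode::PeriodicDiskDump, None),
        (None, false) => (ProfilerMode::Disabled, None),
    }
}
```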

Test plan

  • Integration test pprof_server_end_to_end (in-process):
    • binds an ephemeral port
    • index returns 200 and documents the endpoint
    • profile returns 200 with application/octet-stream and a valid protobuf tag byte
    • a concurrent overlapping request returns 409 while the first succeeds with 200
  • gravity_node --help lists --pprof_addr with full description
  • gravity-cli node pprof cpu --help lists all flags
  • CLI error path: connecting to non-listening port → clean error with hint, exit 1
  • cargo build -p gravity_node -p gravity_cli --profile quick-release succeeds with RUSTFLAGS="--cfg tokio_unstable"
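
For reference, one way a "valid protobuf tag byte" check could look. What the integration test actually asserts is not shown in this description, so `decode_tag_byte` is a hypothetical helper based only on the protobuf wire format: a single-byte tag packs the field number in the upper bits and the wire type in the low three bits.

```rust
/// Split a single-byte protobuf tag into (field number, wire type).
/// Returns None for multi-byte tags (continuation bit set), for field
/// number 0, and for the undefined wire types 6 and 7.
fn decode_tag_byte(byte: u8) -> Option<(u8, u8)> {
    // High bit set means the varint tag continues in the next byte.
    if byte & 0x80 != 0 {
        return None;
    }
    let field_number = byte >> 3;
    let wire_type = byte & 0x07;
    // Field numbers start at 1; wire types 0..=5 are defined.
    if field_number == 0 || wire_type > 5 {
        return None;
    }
    Some((field_number, wire_type))
}
```

A body starting with 0x0A decodes to field 1 with wire type 2 (length-delimited), while a first byte of 0x1F, as in a gzip stream, is rejected because wire type 7 is undefined.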

ByteYue added 2 commits April 19, 2026 18:21
The node previously supported only a hard-coded periodic CPU profiler
(ENABLE_PPROF=1) that writes `.pb` files to disk every three minutes.
That works for post-mortems but is awkward when you want to profile a
specific interval: you have to scp the file off the node and match it
to wall-clock time by hand.

Node side:
  - New `--pprof_addr <ADDR>` flag on GravityNodeArgs (e.g.
    127.0.0.1:6060). When set, starts an axum HTTP server on the
    existing tokio runtime exposing `GET /debug/pprof/profile?seconds=N`
    which returns a protobuf CPU profile consumable by `go tool pprof`.
    A process-wide mutex allows only one profile at a time (pprof uses
    global SIGPROF state; overlapping guards produce garbage); a
    request that overlaps a running profile gets 409 Conflict.
  - `--pprof_addr` disables the periodic disk-dump mode because the
    two modes conflict over the same profiler state.
  - Integration test covers the end-to-end flow (bind → index →
    profile → concurrent-reject).

CLI side:
  - `gravity-cli node pprof cpu [--addr ADDR] [--duration SECS]
    [--frequency HZ] [--output-file PATH]` downloads a profile via the
    HTTP endpoint. `--output-file -` streams protobuf to stdout.
    Default addr 127.0.0.1:6060 matches the common local setup.

Heap profiling is intentionally out of scope: exposing jemalloc heap
snapshots requires coordinating MALLOC_CONF prof_prefix with the HTTP
server and is deferred.
@ByteYue ByteYue marked this pull request as draft April 22, 2026 02:09