Skip to content

backport feat(process-metrics): add memory gauges and CI soak harness (#4330)#4436

Merged
sveitser merged 1 commit into
release-mainnet-1.2.0-rcfrom
backport-4330-to-release-mainnet-1.2.0-rc
Jun 5, 2026
Merged

backport feat(process-metrics): add memory gauges and CI soak harness (#4330)#4436
sveitser merged 1 commit into
release-mainnet-1.2.0-rcfrom
backport-4330-to-release-mainnet-1.2.0-rc

Conversation

@sveitser
Copy link
Copy Markdown
Collaborator

@sveitser sveitser commented Jun 5, 2026

No description provided.

* feat(process-metrics): add memory gauges and CI soak harness

- new process-metrics crate exposes 5 process gauges: resident_memory_bytes,
  virtual_memory_bytes, open_fds, threads, uptime_seconds
- espresso-node spawns the 5s sampler at startup via SequencerContext::init
- scripts/soak.py orchestrates the docker demo end to end: compose up, wait
  for nodes ready, sample docker stats + /v0/status/metrics every 1s, render
  a peak-total + per-service Markdown summary
- memory-soak-pr and memory-soak-non-pr jobs added to build.yml, matrix over
  demo-drb-header.toml and demo-epoch-reward.toml
- 90-day artifact retention for raw JSONL and summary

* fix(process-metrics): fire-and-forget docker compose up

- `docker compose up -d` blocks on `service_completed_successfully`
  dependencies and propagates their exit code, so a one-shot helper that
  fails (e.g. wait-for-lc-epoch-2 in alpine with a bash shebang exits 127)
  takes the whole soak job down even though all 5 nodes are healthy
- mirror the pattern in binary-upgrade-tests/run.py: launch `up -d` via
  `Popen` to a log file, then verify readiness via node HTTP polling
- write the compose-up log into OUTPUT_DIR so it ends up in the artifact

* fix(demo): wait-for-light-client-epoch runs in alpine

- POSIX sh shebang and case/[ ] syntax: the badouralix/curl-jq image has
  no bash, so the previous `#!/usr/bin/env bash` exited 127 and the LC
  epoch gate completed instantly without polling
- cherry-picked from 6fe542a (script only;
  the original commit also touched binary-upgrade-tests/run.py)

* fix(process-metrics): gate soak on smoke test, add chart + logs artifact

- soak.py no longer manages docker compose or polls per-node readiness;
  node 2's command lacks the status module so /v0/status/block-height was
  timing out at 300s. The CI workflow now brings the stack up and runs
  scripts/smoke-test-demo as the readiness gate instead
- add RSS-over-time chart to the summary: inline mermaid xychart-beta so
  GitHub renders it natively, plus a full-resolution PNG saved to the
  samples artifact (matplotlib via PEP 723 uv inline deps)
- new soak-logs artifact captures docker compose logs + ps with 1-day
  retention so failures are diagnosable
- swap python3 invocation for `uv run` (matplotlib gets installed from the
  script's inline deps); add astral-sh/setup-uv@v8 step

* fix(ci): use proven `just demo --pull never` pattern for soak

- previous `docker compose up -d &` lost step exit info and didn't compose
  with the way images are tagged in PR builds (PR variant loads pr-<N>
  tagged images via `docker load`)
- match the existing test-demo-pr/non-pr pattern: pre-pull missing images,
  then `just demo --pull never &` + smoke test as the gate
- add `taiki-e/install-action@just` so `just demo` is available

* feat(process-metrics): split soak.py into sample + render subcommands

- `soak.py sample`: stdlib only, no matplotlib import; produces JSONL artifacts
- `soak.py render`: reads saved JSONL, writes summary.md + chart; needs matplotlib
  (PEP 723 inline deps). Re-runnable locally on saved data without re-sampling.
- CI splits into two steps so render runs on `if: always()` even when sample
  fails partway, surfacing partial data.
- Set LD_LIBRARY_PATH in the project flake dev shell so uv-installed binary
  wheels (matplotlib/numpy) find libstdc++ on NixOS. Verified end-to-end:
  sample under plain python3, render via `uv run` produces a 1200x600 PNG.

* fix(ci): pin setup-uv to v8.1.0 (no floating v8 tag)

* feat(process-metrics): filter summary to nodes, fold RSS sources

- summary table now shows only espresso-node-N rows; the other containers
  are still sampled into the JSONL artifact but not surfaced in the table
  (user only cares about node memory)
- replace Min/Avg/Max/p99 RSS columns with Max RSS (docker) + Max RSS
  (process gauge) + Max CPU%; p99 sometimes exceeded Max due to quantile
  interpolation on small samples
- drop the redundant cross-check section: both sources now sit in the
  main table side by side
- chart inherits the node-only filter and drops the matplotlib legend in
  favor of color-matched end-of-line annotations
- drop empty-Name rows that produced a phantom `--` row in earlier output
- render infers duration from sample timestamps so the heading matches the
  actual run when `soak.py render` is invoked without DURATION_SECONDS

* ci(memory-soak): drop aggregator job

* ci(memory-soak): trim env to what isn't derivable

Drop DURATION_SECONDS, OUTPUT_DIR, GENESIS_LABEL: all already have sensible
defaults in soak.py (300s, ./soak-samples, basename of genesis file).
Collapse the matrix object into a flat list of genesis basenames; build the
file path and artifact name inline.

* debug(process-metrics): dump raw metrics response on first scrape

To diagnose why node-metrics.jsonl came out empty in CI: on the first
scrape attempt per node, log byte count + process_* match count, and save
the raw response body to OUTPUT_DIR/raw-metrics-espresso-node-N.txt so
the artifact contains the exact endpoint output.

* fix(metrics): drop consensus_ subgroup from populate_metrics

- populate_metrics() was wrapping the root in a `consensus` subgroup, so
  every metric registered through it (HotShot consensus, storage, proposal
  fetcher, and now our process metrics) got a misleading consensus_ prefix
- the SQL / scanner / aggregator metrics avoid this because they register
  on the root PrometheusMetrics directly; populate_metrics is the only
  consumer that adds the extra layer
- gauges now publish under their natural names (process_resident_memory_bytes,
  append_da_duration, ...). External Grafana / alert queries that depended
  on the consensus_ prefix will need to be updated
- revert soak.py to match process_* directly and drop the diagnostic raw
  response dump introduced in c916948, no longer needed

* fix(process-metrics): match consensus_process_* gauge names

populate_metrics in hotshot-query-service wraps registered metrics in a
"consensus" subgroup. Our process gauges land there and publish as
consensus_process_*. Stop trying to scrape process_* and use the actual
name. Leave the shared library alone; the prefix is a project-wide
convention that external dashboards depend on.

* ci(memory-soak): use 100 delegators per validator

- pass DELEGATION_CONFIG and NUM_DELEGATORS_PER_VALIDATOR through to the
  stake-for-demo docker compose service
- set multiple-delegators mode with 100 delegators per validator in both
  soak workflow jobs to load the stake table beyond a handful of stakers

* fix(process-metrics): add legend to RSS-over-time chart

- pin Mermaid xychart-beta palette via %%{init}%% so colors are deterministic
- render a Mermaid flowchart legend below the chart with matching node fills
  (xychart-beta has no native legend; HTML-styled spans get sanitized in
  GitHub step summaries)
- align PNG palette with Mermaid and add an in-figure legend()

* chore(process-metrics): add soak just module, simplify CI

- new crates/process-metrics/justfile registered at root as `mod soak`
- recipes: up, sample, render, logs, down, run, test, fmt, lint
- defaults via env_var_or_default so CI only sets per-matrix variables
- memory-soak-pr / memory-soak-non-pr jobs collapsed to `just soak::*` calls
- updated README run-locally section

* ci(memory-soak): run 1 hour per matrix entry

- set DURATION_SECONDS=3600 in both memory-soak jobs (local default stays 300s)
- timeout-minutes: 90 caps a hung job below the GH default 6h

* refactor(process-metrics): shrink soak.py, add click CLI

- Swap manual parsing for pandas, prometheus-client, humanize,
  python-dotenv. Drops ~290 lines.
- Replace bare command positional with click subcommands so `sample`
  and `render` each have their own `--help` with env-var fallbacks.
- Add `--docker-stats`, `--node-metrics`, `--out-dir`, `--label` to
  `render` so it can run on an arbitrary JSONL pair without depending
  on `--output-dir` layout.
- MemUsage parser now handles KiB/MiB/GiB/TiB (was MiB-only after the
  earlier refactor, would silently drop containers reporting GiB).
- Add empty-file and GiB regression tests.
- Show the full PNG path in the summary, not just the basename.
- Add `*args` passthrough to the wrapper just recipes.

* refactor(process-metrics): drop dead code and indirection in soak.py

- Sampling no longer sets `ESPRESSO_NODE_GENESIS_FILE` /
  `ESPRESSO_SEQUENCER_GENESIS_FILE` into the env. The `.env` does not
  interpolate them, so this was a no-op. The `--genesis-file` option
  on `sample` only existed to set those vars and is removed; the
  matching `export` is dropped from the just recipe.
- `scrape_node` writes `\"node\": \"espresso-node-N\"` directly instead
  of `http://localhost:<port>`, so `_process_rss_max` no longer needs
  to recover the index by regex. `NODE_BASE_PORT` removed.
- Aggregate max-per-bucket for the Mermaid chart instead of picking
  every Nth point, so peaks are not missed at 1Hz sampling.
- Compute the chart df (`seconds` + `rss_mb`, sorted) once in
  `render_summary` and pass it to both chart renderers, dropping
  `_series_mb`.
- Inline `_process_rss_max` (single caller) and `_hb`. Flatten
  `METRIC_NAMES` to a hardcoded frozenset; drop `METRIC_PREFIX`.
- Rename `_load_docker` to `_load_docker_metrics`.
- Narrow `scrape_node`'s except clause to `OSError`. URLError /
  TimeoutError / ConnectionError are subclasses.
- Show full PNG path in the summary, not just the basename.

* refactor(process-metrics): register at metrics root, trim boilerplate

- spawn ProcessMetrics in api/options.rs where the root PrometheusMetrics
  is in scope, so gauges publish as process_* instead of consensus_process_*
- drop the spawn from SequencerContext::init (was using the consensus
  subgroup handle from populate_metrics)
- swap Arc<dyn Gauge> for Box<dyn Gauge>, remove unused #[derive(Clone)]
- delete the test module: both tests were low-value (tautology over
  hardcoded gauge names; smoke test of sysinfo + /proc) and required ~80
  lines of Metrics/Gauge trait stub boilerplate

* fix(process-metrics): match process_* gauge names in soak scripts

- gauges now register at the prometheus root (commit 82992b6456),
  so the consensus_ prefix is gone
- update RSS_METRIC, METRIC_NAMES, and the test fixture accordingly

* fix(process-metrics): chart RSS in MiB to match binary-unit source

- docker stats reports MemUsage in KiB/MiB/GiB (powers of 1024) and the
  process gauge is raw bytes; dividing by 1_000_000 and labeling "MB"
  inflated the displayed value by ~4.8% and disagreed with docker stats
- divide by 1024**2 and label "MiB" across the PNG axis, Mermaid title
  and y-axis label

* feat(process-metrics): add CPU, PSI, cgroup, and I/O gauges

- process_cpu_seconds_total, process_read_bytes_total, process_write_bytes_total from /proc/self
- node_cpu_count and node_load{1,5,15}_milli (loadavg x1000 to fit usize gauges)
- node_pressure_{cpu,memory,io}_{waiting,stalled}_seconds_total from cgroup v2 PSI with /proc/pressure fallback
- cgroup_cpu_{periods,throttled_periods,throttled_seconds}_total from /sys/fs/cgroup/cpu.stat
- cgroup_memory_current_bytes always, cgroup_memory_max_bytes only when finite
- SecondsAccumulator preserves sub-second precision across delta adds to integer Counters
- best-effort reads: missing kernel files debug-log and skip, never break the scrape

* chore: cargo sort

(cherry picked from commit 36b751e)
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Jun 5, 2026

Claude finished @sveitser's task in 4m 47s —— View job


Code Review: feat(process-metrics): add memory gauges and CI soak harness (#4330)

  • Read repo guidelines and existing review comments
  • Review PR diff and changed files
  • Analyze code quality and correctness
  • Check for security concerns
  • Evaluate performance implications
  • Review test coverage
  • Review documentation
  • Post detailed review feedback

Summary

This is a well-structured PR that adds a new process-metrics crate for exposing process/host/cgroup metrics via Prometheus gauges and counters, plus a comprehensive CI soak harness for tracking memory over time. The code is clean, well-documented, and the design choices (best-effort reads, startup detection, SecondsAccumulator for precision) are solid.

Strengths

  • Excellent error handling philosophy: All /proc and /sys/fs/cgroup reads are best-effort with debug-level logging on failure — the right choice for observability that should never break the process it's observing.
  • SecondsAccumulator is well designed: Preserving sub-second remainders across ticks avoids cumulative drift. The saturating_sub on counter resets is the correct defensive approach.
  • Good separation of concerns: The Rust crate is purely a metrics producer, the Python soak harness is purely a metrics consumer/renderer, and the justfile + CI workflow ties them together cleanly.
  • Thorough README: The metrics table with sources, the env var reference, and the recipe list are exactly what operators and future contributors need.
  • Good test coverage: Unit tests for SecondsAccumulator and U64Delta cover the important edge cases (first sample, remainder carry, counter reset). The Python test_soak.py covers the rendering pipeline well with edge cases (sparse rows, GiB units, empty data, missing metrics).

Issues Found

See inline comments for details. Summary of the notable ones:

Severity File Issue
Medium src/lib.rs:242 System::new().cpus().len() likely returns 0 — sysinfo doesn't populate CPU list without an explicit refresh
Medium soak.py:126-128 Missing ESPRESSO_NODE_{i}_API_PORT env var crashes the entire sampling loop with KeyError
Low soak.py:62-64 REPO_ROOT computed at import time via git rev-parse — breaks if imported outside a git repo
Low src/lib.rs:246-250 run() loops forever with no cancellation mechanism beyond task abort
Info src/lib.rs:79 u64 as usize casts are safe on 64-bit Linux but would silently truncate on 32-bit

No Issues Found In

  • Security: No untrusted input paths. File reads are all to well-known kernel pseudo-filesystems. The Python scraper only hits localhost. No injection vectors.
  • Performance: 5-second sampling interval is reasonable. sysinfo refresh is scoped to the current PID only. File reads are small kernel files. No allocations in the hot path beyond what sysinfo does internally.
  • CI workflow: The PR/non-PR split with if guards is correct. Matrix strategy with fail-fast: false is appropriate for independent soak runs. Artifact retention (90 days samples, 1 day logs) is sensible.
  • Integration: The three ProcessMetrics::new(ds.metrics()).run() spawn sites in options.rs correctly cover the status-only, filesystem, and SQL data source paths.

let total = self.remainder + delta;
let whole = total / units_per_second;
self.remainder = total % units_per_second;
whole as usize
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug risk): whole is a u64 (result of dividing two u64 values). On 32-bit targets, as usize would silently truncate if the accumulated delta ever exceeded u32::MAX seconds (~136 years). Not a practical concern today since the node would never run that long in one stretch, but worth noting this crate is inherently Linux/64-bit-only anyway (procfs, /proc/self, etc.), so the cast is safe in practice. Same applies to U64Delta::observe at line 94.

nit: If you ever want to make the 64-bit assumption explicit, a #[cfg(not(target_pointer_width = "64"))] compile_error would prevent accidental 32-bit builds.

Comment on lines +242 to +244
self.cpu_count.set(sysinfo::System::new().cpus().len());

let mut system = System::new();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Two System::new() calls — one for CPU count (line 242) and one for the sampling loop (line 244). sysinfo::System::new() doesn't populate CPU info by default (requires refresh_cpu_list()), so cpus().len() will likely return 0 here.

Consider using sysinfo::System::new_with_specifics(RefreshKind::nothing().with_cpu(CpuRefreshKind::nothing())) to populate the CPU list, or use std::thread::available_parallelism() / num_cpus::get() which are cheaper and don't require sysinfo at all.

Could you verify this returns the expected value in your soak runs?

Comment on lines +266 to +269
if let Some(process) = system.process(pid) {
self.resident_memory_bytes.set(process.memory() as usize);
self.virtual_memory_bytes
.set(process.virtual_memory() as usize);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note (data fidelity): process.memory() and process.virtual_memory() return u64 (bytes), and the Gauge trait takes usize. On 64-bit Linux this is fine, but values above usize::MAX would silently wrap on 32-bit targets. Since this whole crate is Linux-only (procfs dependency), and all production targets are 64-bit, this is acceptable — just documenting the assumption.

Comment on lines +246 to +250
let mut ticker = interval(SAMPLE_INTERVAL);
loop {
ticker.tick().await;
self.sample(&mut system, env, &mut previous);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (graceful shutdown): run() loops forever with no cancellation mechanism. The task is spawned into the TaskList, so presumably dropping the task handle cancels it — is that the intended shutdown path? If yes, this is fine. If not, consider accepting a CancellationToken or tokio::sync::watch to allow clean shutdown without relying on task abort.

Comment on lines +62 to +64
REPO_ROOT = Path(
subprocess.check_output(["git", "rev-parse", "--show-toplevel"], text=True).strip()
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (robustness): REPO_ROOT is computed at module import time via subprocess.check_output(["git", ...]). If this script is ever imported or run outside a git repository (e.g., from a packaged wheel or tmp directory), it will crash with a CalledProcessError before any CLI argument is even parsed.

Consider deferring this to a lazy property or computing it inside the CLI commands that need it.

Suggested change
REPO_ROOT = Path(
subprocess.check_output(["git", "rev-parse", "--show-toplevel"], text=True).strip()
)
REPO_ROOT = None
def _repo_root() -> Path:
global REPO_ROOT
if REPO_ROOT is None:
REPO_ROOT = Path(
subprocess.check_output(["git", "rev-parse", "--show-toplevel"], text=True).strip()
)
return REPO_ROOT

Comment on lines +126 to +128
node_ports = [
(i, int(os.environ[f"ESPRESSO_NODE_{i}_API_PORT"])) for i in NODE_INDICES
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (crash on missing env var): If any ESPRESSO_NODE_{i}_API_PORT environment variable is missing, this raises a KeyError and the entire sampling loop crashes. The .env file might not define all 5 ports depending on the deployment configuration.

Consider using os.environ.get() with a default, or wrapping with a try/except and logging a warning for missing nodes:

Suggested change
node_ports = [
(i, int(os.environ[f"ESPRESSO_NODE_{i}_API_PORT"])) for i in NODE_INDICES
]
node_ports = []
for i in NODE_INDICES:
key = f"ESPRESSO_NODE_{i}_API_PORT"
port = os.environ.get(key)
if port is None:
log.warning(f"{key} not set; skipping node {i}")
continue
node_ports.append((i, int(port)))

Comment on lines +146 to +149
log.info(f"sample t={int(now - t0)}s samples={samples}")
next_progress += PROGRESS_INTERVAL
if (sleep_for := (t0 + samples) - time.time()) > 0:
time.sleep(sleep_for)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (sampling drift): The sleep calculation t0 + samples assumes 1-second sample intervals, but this is hardcoded implicitly. If scraping takes longer than 1 second (e.g., a node is slow to respond with the 2s timeout on line 75), samples will stack up with no sleep. Not a bug, but the samples counter then no longer represents actual seconds elapsed — it's just a sequence number. The progress log on line 146 already uses wall-clock now - t0, which is correct.

Comment on lines +589 to +595
- name: Upload soak samples
if: always()
uses: actions/upload-artifact@v7
with:
name: memory-soak-${{ matrix.genesis }}
path: ./soak-samples
retention-days: 90
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (CI): With matrix: [demo-drb-header, demo-epoch-reward] and fail-fast: false, both matrix jobs produce artifacts with the same name pattern memory-soak-${{ matrix.genesis }}. This is fine as-is since the genesis values differ, but if a future matrix entry has the same name, artifacts would collide. More importantly, since the PR and non-PR jobs share the same artifact names, if both somehow run in the same workflow (shouldn't happen due to the if guards), they'd conflict. This is safe as designed — just flagging for awareness.

Comment on lines +73 to +76
try:
with urllib.request.urlopen(
f"http://localhost:{port}/v0/status/metrics", timeout=2.0
) as resp:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (security): urllib.request.urlopen with an http://localhost URL is fine for local scraping, but worth noting that if port were ever sourced from untrusted input, this could be used for SSRF. In this case the ports come from env vars set by the trusted .env file, so it's safe.

@sveitser sveitser changed the title feat(process-metrics): add memory gauges and CI soak harness (#4330) backport feat(process-metrics): add memory gauges and CI soak harness (#4330) Jun 5, 2026
@sveitser sveitser enabled auto-merge (squash) June 5, 2026 11:25
@sveitser sveitser merged commit 49fa5d0 into release-mainnet-1.2.0-rc Jun 5, 2026
194 of 197 checks passed
@sveitser sveitser deleted the backport-4330-to-release-mainnet-1.2.0-rc branch June 5, 2026 12:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants