backport feat(process-metrics): add memory gauges and CI soak harness (#4330)#4436
Conversation
* feat(process-metrics): add memory gauges and CI soak harness - new process-metrics crate exposes 5 process gauges: resident_memory_bytes, virtual_memory_bytes, open_fds, threads, uptime_seconds - espresso-node spawns the 5s sampler at startup via SequencerContext::init - scripts/soak.py orchestrates the docker demo end to end: compose up, wait for nodes ready, sample docker stats + /v0/status/metrics every 1s, render a peak-total + per-service Markdown summary - memory-soak-pr and memory-soak-non-pr jobs added to build.yml, matrix over demo-drb-header.toml and demo-epoch-reward.toml - 90-day artifact retention for raw JSONL and summary * fix(process-metrics): fire-and-forget docker compose up - `docker compose up -d` blocks on `service_completed_successfully` dependencies and propagates their exit code, so a one-shot helper that fails (e.g. wait-for-lc-epoch-2 in alpine with a bash shebang exits 127) takes the whole soak job down even though all 5 nodes are healthy - mirror the pattern in binary-upgrade-tests/run.py: launch `up -d` via `Popen` to a log file, then verify readiness via node HTTP polling - write the compose-up log into OUTPUT_DIR so it ends up in the artifact * fix(demo): wait-for-light-client-epoch runs in alpine - POSIX sh shebang and case/[ ] syntax: the badouralix/curl-jq image has no bash, so the previous `#!/usr/bin/env bash` exited 127 and the LC epoch gate completed instantly without polling - cherry-picked from 6fe542a (script only; the original commit also touched binary-upgrade-tests/run.py) * fix(process-metrics): gate soak on smoke test, add chart + logs artifact - soak.py no longer manages docker compose or polls per-node readiness; node 2's command lacks the status module so /v0/status/block-height was timing out at 300s. The CI workflow now brings the stack up and runs scripts/smoke-test-demo as the readiness gate instead - add RSS-over-time chart to the summary: inline mermaid xychart-beta so GitHub renders it natively, plus a full-resolution PNG saved to the samples artifact (matplotlib via PEP 723 uv inline deps) - new soak-logs artifact captures docker compose logs + ps with 1-day retention so failures are diagnosable - swap python3 invocation for `uv run` (matplotlib gets installed from the script's inline deps); add astral-sh/setup-uv@v8 step * fix(ci): use proven `just demo --pull never` pattern for soak - previous `docker compose up -d &` lost step exit info and didn't compose with the way images are tagged in PR builds (PR variant loads pr-<N> tagged images via `docker load`) - match the existing test-demo-pr/non-pr pattern: pre-pull missing images, then `just demo --pull never &` + smoke test as the gate - add `taiki-e/install-action@just` so `just demo` is available * feat(process-metrics): split soak.py into sample + render subcommands - `soak.py sample`: stdlib only, no matplotlib import; produces JSONL artifacts - `soak.py render`: reads saved JSONL, writes summary.md + chart; needs matplotlib (PEP 723 inline deps). Re-runnable locally on saved data without re-sampling. - CI splits into two steps so render runs on `if: always()` even when sample fails partway, surfacing partial data. - Set LD_LIBRARY_PATH in the project flake dev shell so uv-installed binary wheels (matplotlib/numpy) find libstdc++ on NixOS. Verified end-to-end: sample under plain python3, render via `uv run` produces a 1200x600 PNG. * fix(ci): pin setup-uv to v8.1.0 (no floating v8 tag) * feat(process-metrics): filter summary to nodes, fold RSS sources - summary table now shows only espresso-node-N rows; the other containers are still sampled into the JSONL artifact but not surfaced in the table (user only cares about node memory) - replace Min/Avg/Max/p99 RSS columns with Max RSS (docker) + Max RSS (process gauge) + Max CPU%; p99 sometimes exceeded Max due to quantile interpolation on small samples - drop the redundant cross-check section: both sources now sit in the main table side by side - chart inherits the node-only filter and drops the matplotlib legend in favor of color-matched end-of-line annotations - drop empty-Name rows that produced a phantom `--` row in earlier output - render infers duration from sample timestamps so the heading matches the actual run when `soak.py render` is invoked without DURATION_SECONDS * ci(memory-soak): drop aggregator job * ci(memory-soak): trim env to what isn't derivable Drop DURATION_SECONDS, OUTPUT_DIR, GENESIS_LABEL: all already have sensible defaults in soak.py (300s, ./soak-samples, basename of genesis file). Collapse the matrix object into a flat list of genesis basenames; build the file path and artifact name inline. * debug(process-metrics): dump raw metrics response on first scrape To diagnose why node-metrics.jsonl came out empty in CI: on the first scrape attempt per node, log byte count + process_* match count, and save the raw response body to OUTPUT_DIR/raw-metrics-espresso-node-N.txt so the artifact contains the exact endpoint output. * fix(metrics): drop consensus_ subgroup from populate_metrics - populate_metrics() was wrapping the root in a `consensus` subgroup, so every metric registered through it (HotShot consensus, storage, proposal fetcher, and now our process metrics) got a misleading consensus_ prefix - the SQL / scanner / aggregator metrics avoid this because they register on the root PrometheusMetrics directly; populate_metrics is the only consumer that adds the extra layer - gauges now publish under their natural names (process_resident_memory_bytes, append_da_duration, ...). External Grafana / alert queries that depended on the consensus_ prefix will need to be updated - revert soak.py to match process_* directly and drop the diagnostic raw response dump introduced in c916948, no longer needed * fix(process-metrics): match consensus_process_* gauge names populate_metrics in hotshot-query-service wraps registered metrics in a "consensus" subgroup. Our process gauges land there and publish as consensus_process_*. Stop trying to scrape process_* and use the actual name. Leave the shared library alone; the prefix is a project-wide convention that external dashboards depend on. * ci(memory-soak): use 100 delegators per validator - pass DELEGATION_CONFIG and NUM_DELEGATORS_PER_VALIDATOR through to the stake-for-demo docker compose service - set multiple-delegators mode with 100 delegators per validator in both soak workflow jobs to load the stake table beyond a handful of stakers * fix(process-metrics): add legend to RSS-over-time chart - pin Mermaid xychart-beta palette via %%{init}%% so colors are deterministic - render a Mermaid flowchart legend below the chart with matching node fills (xychart-beta has no native legend; HTML-styled spans get sanitized in GitHub step summaries) - align PNG palette with Mermaid and add an in-figure legend() * chore(process-metrics): add soak just module, simplify CI - new crates/process-metrics/justfile registered at root as `mod soak` - recipes: up, sample, render, logs, down, run, test, fmt, lint - defaults via env_var_or_default so CI only sets per-matrix variables - memory-soak-pr / memory-soak-non-pr jobs collapsed to `just soak::*` calls - updated README run-locally section * ci(memory-soak): run 1 hour per matrix entry - set DURATION_SECONDS=3600 in both memory-soak jobs (local default stays 300s) - timeout-minutes: 90 caps a hung job below the GH default 6h * refactor(process-metrics): shrink soak.py, add click CLI - Swap manual parsing for pandas, prometheus-client, humanize, python-dotenv. Drops ~290 lines. - Replace bare command positional with click subcommands so `sample` and `render` each have their own `--help` with env-var fallbacks. - Add `--docker-stats`, `--node-metrics`, `--out-dir`, `--label` to `render` so it can run on an arbitrary JSONL pair without depending on `--output-dir` layout. - MemUsage parser now handles KiB/MiB/GiB/TiB (was MiB-only after the earlier refactor, would silently drop containers reporting GiB). - Add empty-file and GiB regression tests. - Show the full PNG path in the summary, not just the basename. - Add `*args` passthrough to the wrapper just recipes. * refactor(process-metrics): drop dead code and indirection in soak.py - Sampling no longer sets `ESPRESSO_NODE_GENESIS_FILE` / `ESPRESSO_SEQUENCER_GENESIS_FILE` into the env. The `.env` does not interpolate them, so this was a no-op. The `--genesis-file` option on `sample` only existed to set those vars and is removed; the matching `export` is dropped from the just recipe. - `scrape_node` writes `\"node\": \"espresso-node-N\"` directly instead of `http://localhost:<port>`, so `_process_rss_max` no longer needs to recover the index by regex. `NODE_BASE_PORT` removed. - Aggregate max-per-bucket for the Mermaid chart instead of picking every Nth point, so peaks are not missed at 1Hz sampling. - Compute the chart df (`seconds` + `rss_mb`, sorted) once in `render_summary` and pass it to both chart renderers, dropping `_series_mb`. - Inline `_process_rss_max` (single caller) and `_hb`. Flatten `METRIC_NAMES` to a hardcoded frozenset; drop `METRIC_PREFIX`. - Rename `_load_docker` to `_load_docker_metrics`. - Narrow `scrape_node`'s except clause to `OSError`. URLError / TimeoutError / ConnectionError are subclasses. - Show full PNG path in the summary, not just the basename. * refactor(process-metrics): register at metrics root, trim boilerplate - spawn ProcessMetrics in api/options.rs where the root PrometheusMetrics is in scope, so gauges publish as process_* instead of consensus_process_* - drop the spawn from SequencerContext::init (was using the consensus subgroup handle from populate_metrics) - swap Arc<dyn Gauge> for Box<dyn Gauge>, remove unused #[derive(Clone)] - delete the test module: both tests were low-value (tautology over hardcoded gauge names; smoke test of sysinfo + /proc) and required ~80 lines of Metrics/Gauge trait stub boilerplate * fix(process-metrics): match process_* gauge names in soak scripts - gauges now register at the prometheus root (commit 82992b6456), so the consensus_ prefix is gone - update RSS_METRIC, METRIC_NAMES, and the test fixture accordingly * fix(process-metrics): chart RSS in MiB to match binary-unit source - docker stats reports MemUsage in KiB/MiB/GiB (powers of 1024) and the process gauge is raw bytes; dividing by 1_000_000 and labeling "MB" inflated the displayed value by ~4.8% and disagreed with docker stats - divide by 1024**2 and label "MiB" across the PNG axis, Mermaid title and y-axis label * feat(process-metrics): add CPU, PSI, cgroup, and I/O gauges - process_cpu_seconds_total, process_read_bytes_total, process_write_bytes_total from /proc/self - node_cpu_count and node_load{1,5,15}_milli (loadavg x1000 to fit usize gauges) - node_pressure_{cpu,memory,io}_{waiting,stalled}_seconds_total from cgroup v2 PSI with /proc/pressure fallback - cgroup_cpu_{periods,throttled_periods,throttled_seconds}_total from /sys/fs/cgroup/cpu.stat - cgroup_memory_current_bytes always, cgroup_memory_max_bytes only when finite - SecondsAccumulator preserves sub-second precision across delta adds to integer Counters - best-effort reads: missing kernel files debug-log and skip, never break the scrape * chore: cargo sort (cherry picked from commit 36b751e)
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. |
|
Claude finished @sveitser's task in 4m 47s —— View job Code Review: feat(process-metrics): add memory gauges and CI soak harness (#4330)
SummaryThis is a well-structured PR that adds a new Strengths
Issues FoundSee inline comments for details. Summary of the notable ones:
No Issues Found In
|
| let total = self.remainder + delta; | ||
| let whole = total / units_per_second; | ||
| self.remainder = total % units_per_second; | ||
| whole as usize |
There was a problem hiding this comment.
issue (bug risk): whole is a u64 (result of dividing two u64 values). On 32-bit targets, as usize would silently truncate if the accumulated delta ever exceeded u32::MAX seconds (~136 years). Not a practical concern today since the node would never run that long in one stretch, but worth noting this crate is inherently Linux/64-bit-only anyway (procfs, /proc/self, etc.), so the cast is safe in practice. Same applies to U64Delta::observe at line 94.
nit: If you ever want to make the 64-bit assumption explicit, a #[cfg(not(target_pointer_width = "64"))] compile_error would prevent accidental 32-bit builds.
| self.cpu_count.set(sysinfo::System::new().cpus().len()); | ||
|
|
||
| let mut system = System::new(); |
There was a problem hiding this comment.
nit: Two System::new() calls — one for CPU count (line 242) and one for the sampling loop (line 244). sysinfo::System::new() doesn't populate CPU info by default (requires refresh_cpu_list()), so cpus().len() will likely return 0 here.
Consider using sysinfo::System::new_with_specifics(RefreshKind::nothing().with_cpu(CpuRefreshKind::nothing())) to populate the CPU list, or use std::thread::available_parallelism() / num_cpus::get() which are cheaper and don't require sysinfo at all.
Could you verify this returns the expected value in your soak runs?
| if let Some(process) = system.process(pid) { | ||
| self.resident_memory_bytes.set(process.memory() as usize); | ||
| self.virtual_memory_bytes | ||
| .set(process.virtual_memory() as usize); |
There was a problem hiding this comment.
note (data fidelity): process.memory() and process.virtual_memory() return u64 (bytes), and the Gauge trait takes usize. On 64-bit Linux this is fine, but values above usize::MAX would silently wrap on 32-bit targets. Since this whole crate is Linux-only (procfs dependency), and all production targets are 64-bit, this is acceptable — just documenting the assumption.
| let mut ticker = interval(SAMPLE_INTERVAL); | ||
| loop { | ||
| ticker.tick().await; | ||
| self.sample(&mut system, env, &mut previous); | ||
| } |
There was a problem hiding this comment.
question (graceful shutdown): run() loops forever with no cancellation mechanism. The task is spawned into the TaskList, so presumably dropping the task handle cancels it — is that the intended shutdown path? If yes, this is fine. If not, consider accepting a CancellationToken or tokio::sync::watch to allow clean shutdown without relying on task abort.
| REPO_ROOT = Path( | ||
| subprocess.check_output(["git", "rev-parse", "--show-toplevel"], text=True).strip() | ||
| ) |
There was a problem hiding this comment.
issue (robustness): REPO_ROOT is computed at module import time via subprocess.check_output(["git", ...]). If this script is ever imported or run outside a git repository (e.g., from a packaged wheel or tmp directory), it will crash with a CalledProcessError before any CLI argument is even parsed.
Consider deferring this to a lazy property or computing it inside the CLI commands that need it.
| REPO_ROOT = Path( | |
| subprocess.check_output(["git", "rev-parse", "--show-toplevel"], text=True).strip() | |
| ) | |
| REPO_ROOT = None | |
| def _repo_root() -> Path: | |
| global REPO_ROOT | |
| if REPO_ROOT is None: | |
| REPO_ROOT = Path( | |
| subprocess.check_output(["git", "rev-parse", "--show-toplevel"], text=True).strip() | |
| ) | |
| return REPO_ROOT |
| node_ports = [ | ||
| (i, int(os.environ[f"ESPRESSO_NODE_{i}_API_PORT"])) for i in NODE_INDICES | ||
| ] |
There was a problem hiding this comment.
issue (crash on missing env var): If any ESPRESSO_NODE_{i}_API_PORT environment variable is missing, this raises a KeyError and the entire sampling loop crashes. The .env file might not define all 5 ports depending on the deployment configuration.
Consider using os.environ.get() with a default, or wrapping with a try/except and logging a warning for missing nodes:
| node_ports = [ | |
| (i, int(os.environ[f"ESPRESSO_NODE_{i}_API_PORT"])) for i in NODE_INDICES | |
| ] | |
| node_ports = [] | |
| for i in NODE_INDICES: | |
| key = f"ESPRESSO_NODE_{i}_API_PORT" | |
| port = os.environ.get(key) | |
| if port is None: | |
| log.warning(f"{key} not set; skipping node {i}") | |
| continue | |
| node_ports.append((i, int(port))) |
| log.info(f"sample t={int(now - t0)}s samples={samples}") | ||
| next_progress += PROGRESS_INTERVAL | ||
| if (sleep_for := (t0 + samples) - time.time()) > 0: | ||
| time.sleep(sleep_for) |
There was a problem hiding this comment.
nit (sampling drift): The sleep calculation t0 + samples assumes 1-second sample intervals, but this is hardcoded implicitly. If scraping takes longer than 1 second (e.g., a node is slow to respond with the 2s timeout on line 75), samples will stack up with no sleep. Not a bug, but the samples counter then no longer represents actual seconds elapsed — it's just a sequence number. The progress log on line 146 already uses wall-clock now - t0, which is correct.
| - name: Upload soak samples | ||
| if: always() | ||
| uses: actions/upload-artifact@v7 | ||
| with: | ||
| name: memory-soak-${{ matrix.genesis }} | ||
| path: ./soak-samples | ||
| retention-days: 90 |
There was a problem hiding this comment.
issue (CI): With matrix: [demo-drb-header, demo-epoch-reward] and fail-fast: false, both matrix jobs produce artifacts with the same name pattern memory-soak-${{ matrix.genesis }}. This is fine as-is since the genesis values differ, but if a future matrix entry has the same name, artifacts would collide. More importantly, since the PR and non-PR jobs share the same artifact names, if both somehow run in the same workflow (shouldn't happen due to the if guards), they'd conflict. This is safe as designed — just flagging for awareness.
| try: | ||
| with urllib.request.urlopen( | ||
| f"http://localhost:{port}/v0/status/metrics", timeout=2.0 | ||
| ) as resp: |
There was a problem hiding this comment.
nit (security): urllib.request.urlopen with an http://localhost URL is fine for local scraping, but worth noting that if port were ever sourced from untrusted input, this could be used for SSRF. In this case the ports come from env vars set by the trusted .env file, so it's safe.
No description provided.