Commit 36b751e

feat(process-metrics): add memory gauges and CI soak harness (#4330)
* feat(process-metrics): add memory gauges and CI soak harness

- new process-metrics crate exposes 5 process gauges: resident_memory_bytes, virtual_memory_bytes, open_fds, threads, uptime_seconds
- espresso-node spawns the 5s sampler at startup via SequencerContext::init
- scripts/soak.py orchestrates the docker demo end to end: compose up, wait for nodes ready, sample docker stats + /v0/status/metrics every 1s, render a peak-total + per-service Markdown summary
- memory-soak-pr and memory-soak-non-pr jobs added to build.yml, matrix over demo-drb-header.toml and demo-epoch-reward.toml
- 90-day artifact retention for raw JSONL and summary

* fix(process-metrics): fire-and-forget docker compose up

- `docker compose up -d` blocks on `service_completed_successfully` dependencies and propagates their exit code, so a one-shot helper that fails (e.g. wait-for-lc-epoch-2 in alpine with a bash shebang exits 127) takes the whole soak job down even though all 5 nodes are healthy
- mirror the pattern in binary-upgrade-tests/run.py: launch `up -d` via `Popen` to a log file, then verify readiness via node HTTP polling
- write the compose-up log into OUTPUT_DIR so it ends up in the artifact

* fix(demo): wait-for-light-client-epoch runs in alpine

- POSIX sh shebang and case/[ ] syntax: the badouralix/curl-jq image has no bash, so the previous `#!/usr/bin/env bash` exited 127 and the LC epoch gate completed instantly without polling
- cherry-picked from 6fe542a (script only; the original commit also touched binary-upgrade-tests/run.py)

* fix(process-metrics): gate soak on smoke test, add chart + logs artifact

- soak.py no longer manages docker compose or polls per-node readiness; node 2's command lacks the status module so /v0/status/block-height was timing out at 300s. The CI workflow now brings the stack up and runs scripts/smoke-test-demo as the readiness gate instead
- add RSS-over-time chart to the summary: inline mermaid xychart-beta so GitHub renders it natively, plus a full-resolution PNG saved to the samples artifact (matplotlib via PEP 723 uv inline deps)
- new soak-logs artifact captures docker compose logs + ps with 1-day retention so failures are diagnosable
- swap python3 invocation for `uv run` (matplotlib gets installed from the script's inline deps); add astral-sh/setup-uv@v8 step

* fix(ci): use proven `just demo --pull never` pattern for soak

- previous `docker compose up -d &` lost step exit info and didn't compose with the way images are tagged in PR builds (PR variant loads pr-<N> tagged images via `docker load`)
- match the existing test-demo-pr/non-pr pattern: pre-pull missing images, then `just demo --pull never &` + smoke test as the gate
- add `taiki-e/install-action@just` so `just demo` is available

* feat(process-metrics): split soak.py into sample + render subcommands

- `soak.py sample`: stdlib only, no matplotlib import; produces JSONL artifacts
- `soak.py render`: reads saved JSONL, writes summary.md + chart; needs matplotlib (PEP 723 inline deps). Re-runnable locally on saved data without re-sampling.
- CI splits into two steps so render runs on `if: always()` even when sample fails partway, surfacing partial data.
- Set LD_LIBRARY_PATH in the project flake dev shell so uv-installed binary wheels (matplotlib/numpy) find libstdc++ on NixOS.

Verified end-to-end: sample under plain python3, render via `uv run` produces a 1200x600 PNG.
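
The sample/render split described above could look roughly like the sketch below. This is a hypothetical skeleton, not the real soak.py: the function names (`sample`, `load`, `render`), the JSONL fields (`t`, `rss_bytes`), and the fake sample values are all assumptions made for illustration. It shows a PEP 723 inline metadata header together with a deferred matplotlib import, so the sampling path stays stdlib-only.

```python
# /// script
# requires-python = ">=3.11"
# dependencies = ["matplotlib"]
# ///
"""Hypothetical skeleton of a sample/render split (not the real soak.py)."""
import json


def sample(jsonl_path, n=3):
    """Stdlib-only path: write n placeholder samples as JSONL."""
    with open(jsonl_path, "w") as f:
        for i in range(n):
            # Fake data; a real sampler would read docker stats or a metrics endpoint.
            f.write(json.dumps({"t": i, "rss_bytes": 1024 * (i + 1)}) + "\n")


def load(jsonl_path):
    """Read saved samples back, skipping blank lines."""
    with open(jsonl_path) as f:
        return [json.loads(line) for line in f if line.strip()]


def render(jsonl_path, png_path):
    """Only this path needs matplotlib, so the import is deferred to here."""
    import matplotlib

    matplotlib.use("Agg")  # headless backend, suitable for CI runners
    import matplotlib.pyplot as plt

    rows = load(jsonl_path)
    # 12x6 inches at 100 dpi gives a 1200x600 pixel figure.
    fig, ax = plt.subplots(figsize=(12, 6), dpi=100)
    ax.plot([r["t"] for r in rows], [r["rss_bytes"] / 1024**2 for r in rows])
    ax.set_xlabel("seconds")
    ax.set_ylabel("RSS (MiB)")
    fig.savefig(png_path)
    plt.close(fig)
```

Under these assumptions, `sample` runs under plain `python3`, while `render` is the only entry point that needs `uv run` to resolve the inline dependency.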

* fix(ci): pin setup-uv to v8.1.0 (no floating v8 tag)

* feat(process-metrics): filter summary to nodes, fold RSS sources

- summary table now shows only espresso-node-N rows; the other containers are still sampled into the JSONL artifact but not surfaced in the table (user only cares about node memory)
- replace Min/Avg/Max/p99 RSS columns with Max RSS (docker) + Max RSS (process gauge) + Max CPU%; p99 sometimes exceeded Max due to quantile interpolation on small samples
- drop the redundant cross-check section: both sources now sit in the main table side by side
- chart inherits the node-only filter and drops the matplotlib legend in favor of color-matched end-of-line annotations
- drop empty-Name rows that produced a phantom `--` row in earlier output
- render infers duration from sample timestamps so the heading matches the actual run when `soak.py render` is invoked without DURATION_SECONDS

* ci(memory-soak): drop aggregator job

* ci(memory-soak): trim env to what isn't derivable

Drop DURATION_SECONDS, OUTPUT_DIR, GENESIS_LABEL: all already have sensible defaults in soak.py (300s, ./soak-samples, basename of genesis file). Collapse the matrix object into a flat list of genesis basenames; build the file path and artifact name inline.

* debug(process-metrics): dump raw metrics response on first scrape

To diagnose why node-metrics.jsonl came out empty in CI: on the first scrape attempt per node, log byte count + process_* match count, and save the raw response body to OUTPUT_DIR/raw-metrics-espresso-node-N.txt so the artifact contains the exact endpoint output.

* fix(metrics): drop consensus_ subgroup from populate_metrics

- populate_metrics() was wrapping the root in a `consensus` subgroup, so every metric registered through it (HotShot consensus, storage, proposal fetcher, and now our process metrics) got a misleading consensus_ prefix
- the SQL / scanner / aggregator metrics avoid this because they register on the root PrometheusMetrics directly; populate_metrics is the only consumer that adds the extra layer
- gauges now publish under their natural names (process_resident_memory_bytes, append_da_duration, ...). External Grafana / alert queries that depended on the consensus_ prefix will need to be updated
- revert soak.py to match process_* directly and drop the diagnostic raw response dump introduced in c916948, no longer needed

* fix(process-metrics): match consensus_process_* gauge names

populate_metrics in hotshot-query-service wraps registered metrics in a "consensus" subgroup. Our process gauges land there and publish as consensus_process_*. Stop trying to scrape process_* and use the actual name. Leave the shared library alone; the prefix is a project-wide convention that external dashboards depend on.

* ci(memory-soak): use 100 delegators per validator

- pass DELEGATION_CONFIG and NUM_DELEGATORS_PER_VALIDATOR through to the stake-for-demo docker compose service
- set multiple-delegators mode with 100 delegators per validator in both soak workflow jobs to load the stake table beyond a handful of stakers

* fix(process-metrics): add legend to RSS-over-time chart

- pin Mermaid xychart-beta palette via %%{init}%% so colors are deterministic
- render a Mermaid flowchart legend below the chart with matching node fills (xychart-beta has no native legend; HTML-styled spans get sanitized in GitHub step summaries)
- align PNG palette with Mermaid and add an in-figure legend()

* chore(process-metrics): add soak just module, simplify CI

- new crates/process-metrics/justfile registered at root as `mod soak`
- recipes: up, sample, render, logs, down, run, test, fmt, lint
- defaults via env_var_or_default so CI only sets per-matrix variables
- memory-soak-pr / memory-soak-non-pr jobs collapsed to `just soak::*` calls
- updated README run-locally section

* ci(memory-soak): run 1 hour per matrix entry

- set DURATION_SECONDS=3600 in both memory-soak jobs (local default stays 300s)
- timeout-minutes: 90 caps a hung job below the GH default 6h

* refactor(process-metrics): shrink soak.py, add click CLI

- Swap manual parsing for pandas, prometheus-client, humanize, python-dotenv. Drops ~290 lines.
- Replace bare command positional with click subcommands so `sample` and `render` each have their own `--help` with env-var fallbacks.
- Add `--docker-stats`, `--node-metrics`, `--out-dir`, `--label` to `render` so it can run on an arbitrary JSONL pair without depending on `--output-dir` layout.
- MemUsage parser now handles KiB/MiB/GiB/TiB (was MiB-only after the earlier refactor, would silently drop containers reporting GiB).
- Add empty-file and GiB regression tests.
- Show the full PNG path in the summary, not just the basename.
- Add `*args` passthrough to the wrapper just recipes.
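
The MemUsage fix above (a MiB-only parser silently dropping containers that report GiB) can be illustrated with a small unit-aware parser. This is a sketch under assumptions, not the parser in soak.py: the name `parse_mem_usage`, the regex, and the decision to return `None` on unparseable input are hypothetical, and the plain `B` unit is included on the assumption that docker stats can also report bare bytes. The input shape (`used / limit`) follows what `docker stats` prints in its MemUsage column.

```python
import re
from typing import Optional

# Binary units (powers of 1024). "B" is an assumption beyond the units named above.
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}
_MEM_RE = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*(B|KiB|MiB|GiB|TiB)\b")


def parse_mem_usage(mem_usage: str) -> Optional[int]:
    """Return the 'used' side of a 'used / limit' MemUsage string in bytes.

    Returns None when the string does not start with a number and a known unit,
    so the caller can decide whether to skip or report the row.
    """
    match = _MEM_RE.match(mem_usage)
    if match is None:
        return None
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit])
```

A regression test in the spirit of the "GiB regression tests" bullet would assert that `parse_mem_usage("1.5GiB / 7.77GiB")` yields a byte count rather than `None`.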

* refactor(process-metrics): drop dead code and indirection in soak.py

- Sampling no longer sets `ESPRESSO_NODE_GENESIS_FILE` / `ESPRESSO_SEQUENCER_GENESIS_FILE` into the env. The `.env` does not interpolate them, so this was a no-op. The `--genesis-file` option on `sample` only existed to set those vars and is removed; the matching `export` is dropped from the just recipe.
- `scrape_node` writes `"node": "espresso-node-N"` directly instead of `http://localhost:<port>`, so `_process_rss_max` no longer needs to recover the index by regex. `NODE_BASE_PORT` removed.
- Aggregate max-per-bucket for the Mermaid chart instead of picking every Nth point, so peaks are not missed at 1Hz sampling.
- Compute the chart df (`seconds` + `rss_mb`, sorted) once in `render_summary` and pass it to both chart renderers, dropping `_series_mb`.
- Inline `_process_rss_max` (single caller) and `_hb`. Flatten `METRIC_NAMES` to a hardcoded frozenset; drop `METRIC_PREFIX`.
- Rename `_load_docker` to `_load_docker_metrics`.
- Narrow `scrape_node`'s except clause to `OSError`. URLError / TimeoutError / ConnectionError are subclasses.
- Show full PNG path in the summary, not just the basename.
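
The max-per-bucket change above can be illustrated by comparing it with stride-based downsampling. The sketch below is an assumption-laden illustration, not the code in soak.py (which works on a pandas DataFrame): `downsample_max` and `downsample_stride` are hypothetical names operating on plain lists.

```python
def downsample_max(values, buckets):
    """Split values into `buckets` contiguous chunks and keep each chunk's max."""
    if buckets <= 0:
        raise ValueError("buckets must be positive")
    n = len(values)
    if n <= buckets:
        return list(values)  # nothing to reduce
    out = []
    for b in range(buckets):
        lo = b * n // buckets
        hi = (b + 1) * n // buckets
        out.append(max(values[lo:hi]))
    return out


def downsample_stride(values, buckets):
    """Keep every Nth point; a short spike between kept points is lost."""
    if buckets <= 0:
        raise ValueError("buckets must be positive")
    step = max(1, -(-len(values) // buckets))  # ceiling division
    return list(values[::step])


# Twelve 1 Hz samples with a single spike at index 3.
rss = [1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 1]
peaks_kept = downsample_max(rss, 3)
peaks_lost = downsample_stride(rss, 3)
```

With three buckets, the bucketed version still contains the spike value, while the stride version samples indices 0, 4, and 8 and never sees it.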

* refactor(process-metrics): register at metrics root, trim boilerplate

- spawn ProcessMetrics in api/options.rs where the root PrometheusMetrics is in scope, so gauges publish as process_* instead of consensus_process_*
- drop the spawn from SequencerContext::init (was using the consensus subgroup handle from populate_metrics)
- swap Arc<dyn Gauge> for Box<dyn Gauge>, remove unused #[derive(Clone)]
- delete the test module: both tests were low-value (tautology over hardcoded gauge names; smoke test of sysinfo + /proc) and required ~80 lines of Metrics/Gauge trait stub boilerplate

* fix(process-metrics): match process_* gauge names in soak scripts

- gauges now register at the prometheus root (commit 82992b6456), so the consensus_ prefix is gone
- update RSS_METRIC, METRIC_NAMES, and the test fixture accordingly

* fix(process-metrics): chart RSS in MiB to match binary-unit source

- docker stats reports MemUsage in KiB/MiB/GiB (powers of 1024) and the process gauge is raw bytes; dividing by 1_000_000 and labeling "MB" inflated the displayed value by ~4.8% and disagreed with docker stats
- divide by 1024**2 and label "MiB" across the PNG axis, Mermaid title and y-axis label

* feat(process-metrics): add CPU, PSI, cgroup, and I/O gauges

- process_cpu_seconds_total, process_read_bytes_total, process_write_bytes_total from /proc/self
- node_cpu_count and node_load{1,5,15}_milli (loadavg x1000 to fit usize gauges)
- node_pressure_{cpu,memory,io}_{waiting,stalled}_seconds_total from cgroup v2 PSI with /proc/pressure fallback
- cgroup_cpu_{periods,throttled_periods,throttled_seconds}_total from /sys/fs/cgroup/cpu.stat
- cgroup_memory_current_bytes always, cgroup_memory_max_bytes only when finite
- SecondsAccumulator preserves sub-second precision across delta adds to integer Counters
- best-effort reads: missing kernel files debug-log and skip, never break the scrape

* chore: cargo sort
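
The MB versus MiB discrepancy fixed above is plain arithmetic and can be checked directly. The names below (`MB`, `MIB`, `to_mib`) are illustrative only and are not taken from soak.py.

```python
MB = 1_000_000  # decimal megabyte
MIB = 1024**2   # binary mebibyte, the unit docker stats reports


def to_mib(n_bytes):
    """Convert a raw byte count (e.g. a process gauge value) to MiB."""
    return n_bytes / MIB


rss_bytes = 512 * MIB                 # a container docker stats would show as 512MiB
as_decimal_mb = rss_bytes / MB        # what dividing by 1_000_000 displays
as_binary_mib = to_mib(rss_bytes)     # what dividing by 1024**2 displays
inflation = MIB / MB                  # ratio between the two displayed numbers
```

Here the binary conversion reproduces the 512 that docker stats would print, while the decimal conversion shows a number a few percent larger for the same byte count.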
1 parent fcb5beb commit 36b751e

14 files changed

Lines changed: 1634 additions & 12 deletions

.github/workflows/build.yml

Lines changed: 110 additions & 0 deletions
@@ -708,3 +708,113 @@ jobs:
             echo "No jobs passed. Failing."
             exit 1
           fi
+
+  memory-soak-pr:
+    if: github.event_name == 'pull_request'
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    needs: [build-dockers-amd]
+    strategy:
+      fail-fast: false
+      matrix:
+        genesis: [demo-drb-header, demo-epoch-reward]
+    env:
+      DOCKER_TAG: pr-${{ github.event.pull_request.number }}
+      ESPRESSO_NODE_GENESIS_FILE: genesis/${{ matrix.genesis }}.toml
+      DURATION_SECONDS: 3600
+    steps:
+      - uses: actions/checkout@v6
+      - uses: astral-sh/setup-uv@v8.1.0
+      - uses: taiki-e/install-action@just
+
+      - name: Download docker image artifacts
+        uses: actions/download-artifact@v8
+        with:
+          path: ${{ runner.temp }}/docker-images
+          pattern: "*-docker-image"
+
+      - name: Load docker images
+        run: |
+          for file in $(find ${{ runner.temp }}/docker-images -name "*.tar"); do
+            docker load --input $file
+          done
+
+      - name: Start demo + smoke test
+        run: just soak::up
+
+      - name: Sample
+        run: just soak::sample
+
+      - name: Render summary + chart
+        if: always()
+        run: just soak::render
+
+      - name: Dump compose logs
+        if: always()
+        run: just soak::logs
+
+      - name: Upload soak samples
+        if: always()
+        uses: actions/upload-artifact@v7
+        with:
+          name: memory-soak-${{ matrix.genesis }}
+          path: ./soak-samples
+          retention-days: 90
+
+      - name: Upload soak logs
+        if: always()
+        uses: actions/upload-artifact@v7
+        with:
+          name: memory-soak-${{ matrix.genesis }}-logs
+          path: ./soak-logs
+          retention-days: 1
+
+  memory-soak-non-pr:
+    if: github.event_name != 'pull_request'
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    needs: [build-dockers-amd, create-multiplatform-docker-image]
+    strategy:
+      fail-fast: false
+      matrix:
+        genesis: [demo-drb-header, demo-epoch-reward]
+    env:
+      ESPRESSO_NODE_GENESIS_FILE: genesis/${{ matrix.genesis }}.toml
+      DURATION_SECONDS: 3600
+    steps:
+      - uses: actions/checkout@v6
+      - uses: astral-sh/setup-uv@v8.1.0
+      - uses: taiki-e/install-action@just
+
+      - name: Set docker tag
+        run: echo "DOCKER_TAG=$(echo '${{ github.ref_name }}' | tr '/' '-')" >> $GITHUB_ENV
+
+      - name: Start demo + smoke test
+        run: just soak::up
+
+      - name: Sample
+        run: just soak::sample
+
+      - name: Render summary + chart
+        if: always()
+        run: just soak::render
+
+      - name: Dump compose logs
+        if: always()
+        run: just soak::logs
+
+      - name: Upload soak samples
+        if: always()
+        uses: actions/upload-artifact@v7
+        with:
+          name: memory-soak-${{ matrix.genesis }}
+          path: ./soak-samples
+          retention-days: 90
+
+      - name: Upload soak logs
+        if: always()
+        uses: actions/upload-artifact@v7
+        with:
+          name: memory-soak-${{ matrix.genesis }}-logs
+          path: ./soak-logs
+          retention-days: 1

Cargo.lock

Lines changed: 123 additions & 11 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 4 additions & 0 deletions
@@ -35,6 +35,7 @@ members = [
   "crates/hotshot/testing",
   "crates/hotshot/types",
   "crates/hotshot/utils",
+  "crates/process-metrics",
   "crates/serialization/api",
   "crates/versions",
   "hotshot-events-service",
@@ -90,6 +91,7 @@ default-members = [
   "light-client",
   "light-client-query-service",
   "node-metrics",
+  "crates/process-metrics",
   "request-response",
   "sdks/crypto-helper",
   "crates/serialization/api",
@@ -334,6 +336,8 @@ pretty_assertions = { version = "1.4", features = ["unstable"] }
 primitive-types = "0.13"
 priority-queue = "2"
 proc-macro2 = "1"
+process-metrics = { path = "crates/process-metrics" }
+procfs = "0.18"
 prometheus = { version = "0.13", default-features = false }
 prometheus-parse = "0.2.5"
 proptest = "1"

crates/espresso/node/Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -85,6 +85,7 @@ moka = { workspace = true }
 num_enum = { workspace = true }
 parking_lot = { workspace = true }
 priority-queue = { workspace = true }
+process-metrics = { workspace = true }
 rand = { workspace = true }
 rand_chacha = { workspace = true }
 rand_distr = { workspace = true }
