Commit 49fa5d0

backport feat(process-metrics): add memory gauges and CI soak harness (#4330) (#4436)
feat(process-metrics): add memory gauges and CI soak harness (#4330)

* feat(process-metrics): add memory gauges and CI soak harness

- new process-metrics crate exposes 5 process gauges: resident_memory_bytes, virtual_memory_bytes, open_fds, threads, uptime_seconds
- espresso-node spawns the 5s sampler at startup via SequencerContext::init
- scripts/soak.py orchestrates the docker demo end to end: compose up, wait for nodes ready, sample docker stats + /v0/status/metrics every 1s, render a peak-total + per-service Markdown summary
- memory-soak-pr and memory-soak-non-pr jobs added to build.yml, matrix over demo-drb-header.toml and demo-epoch-reward.toml
- 90-day artifact retention for raw JSONL and summary

* fix(process-metrics): fire-and-forget docker compose up

- `docker compose up -d` blocks on `service_completed_successfully` dependencies and propagates their exit code, so a one-shot helper that fails (e.g. wait-for-lc-epoch-2 in alpine with a bash shebang exits 127) takes the whole soak job down even though all 5 nodes are healthy
- mirror the pattern in binary-upgrade-tests/run.py: launch `up -d` via `Popen` to a log file, then verify readiness via node HTTP polling
- write the compose-up log into OUTPUT_DIR so it ends up in the artifact

* fix(demo): wait-for-light-client-epoch runs in alpine

- POSIX sh shebang and case/[ ] syntax: the badouralix/curl-jq image has no bash, so the previous `#!/usr/bin/env bash` exited 127 and the LC epoch gate completed instantly without polling
- cherry-picked from 6fe542a (script only; the original commit also touched binary-upgrade-tests/run.py)

* fix(process-metrics): gate soak on smoke test, add chart + logs artifact

- soak.py no longer manages docker compose or polls per-node readiness; node 2's command lacks the status module so /v0/status/block-height was timing out at 300s. The CI workflow now brings the stack up and runs scripts/smoke-test-demo as the readiness gate instead
- add RSS-over-time chart to the summary: inline mermaid xychart-beta so GitHub renders it natively, plus a full-resolution PNG saved to the samples artifact (matplotlib via PEP 723 uv inline deps)
- new soak-logs artifact captures docker compose logs + ps with 1-day retention so failures are diagnosable
- swap python3 invocation for `uv run` (matplotlib gets installed from the script's inline deps); add astral-sh/setup-uv@v8 step

* fix(ci): use proven `just demo --pull never` pattern for soak

- previous `docker compose up -d &` lost step exit info and didn't compose with the way images are tagged in PR builds (PR variant loads pr-<N> tagged images via `docker load`)
- match the existing test-demo-pr/non-pr pattern: pre-pull missing images, then `just demo --pull never &` + smoke test as the gate
- add `taiki-e/install-action@just` so `just demo` is available

* feat(process-metrics): split soak.py into sample + render subcommands

- `soak.py sample`: stdlib only, no matplotlib import; produces JSONL artifacts
- `soak.py render`: reads saved JSONL, writes summary.md + chart; needs matplotlib (PEP 723 inline deps). Re-runnable locally on saved data without re-sampling.
- CI splits into two steps so render runs on `if: always()` even when sample fails partway, surfacing partial data.
- Set LD_LIBRARY_PATH in the project flake dev shell so uv-installed binary wheels (matplotlib/numpy) find libstdc++ on NixOS.

Verified end-to-end: sample under plain python3, render via `uv run` produces a 1200x600 PNG.

* fix(ci): pin setup-uv to v8.1.0 (no floating v8 tag)

* feat(process-metrics): filter summary to nodes, fold RSS sources

- summary table now shows only espresso-node-N rows; the other containers are still sampled into the JSONL artifact but not surfaced in the table (user only cares about node memory)
- replace Min/Avg/Max/p99 RSS columns with Max RSS (docker) + Max RSS (process gauge) + Max CPU%; p99 sometimes exceeded Max due to quantile interpolation on small samples
- drop the redundant cross-check section: both sources now sit in the main table side by side
- chart inherits the node-only filter and drops the matplotlib legend in favor of color-matched end-of-line annotations
- drop empty-Name rows that produced a phantom `--` row in earlier output
- render infers duration from sample timestamps so the heading matches the actual run when `soak.py render` is invoked without DURATION_SECONDS

* ci(memory-soak): drop aggregator job

* ci(memory-soak): trim env to what isn't derivable

Drop DURATION_SECONDS, OUTPUT_DIR, GENESIS_LABEL: all already have sensible defaults in soak.py (300s, ./soak-samples, basename of genesis file). Collapse the matrix object into a flat list of genesis basenames; build the file path and artifact name inline.

* debug(process-metrics): dump raw metrics response on first scrape

To diagnose why node-metrics.jsonl came out empty in CI: on the first scrape attempt per node, log byte count + process_* match count, and save the raw response body to OUTPUT_DIR/raw-metrics-espresso-node-N.txt so the artifact contains the exact endpoint output.

* fix(metrics): drop consensus_ subgroup from populate_metrics

- populate_metrics() was wrapping the root in a `consensus` subgroup, so every metric registered through it (HotShot consensus, storage, proposal fetcher, and now our process metrics) got a misleading consensus_ prefix
- the SQL / scanner / aggregator metrics avoid this because they register on the root PrometheusMetrics directly; populate_metrics is the only consumer that adds the extra layer
- gauges now publish under their natural names (process_resident_memory_bytes, append_da_duration, ...). External Grafana / alert queries that depended on the consensus_ prefix will need to be updated
- revert soak.py to match process_* directly and drop the diagnostic raw response dump introduced in c916948, no longer needed

* fix(process-metrics): match consensus_process_* gauge names

populate_metrics in hotshot-query-service wraps registered metrics in a "consensus" subgroup. Our process gauges land there and publish as consensus_process_*. Stop trying to scrape process_* and use the actual name. Leave the shared library alone; the prefix is a project-wide convention that external dashboards depend on.

* ci(memory-soak): use 100 delegators per validator

- pass DELEGATION_CONFIG and NUM_DELEGATORS_PER_VALIDATOR through to the stake-for-demo docker compose service
- set multiple-delegators mode with 100 delegators per validator in both soak workflow jobs to load the stake table beyond a handful of stakers

* fix(process-metrics): add legend to RSS-over-time chart

- pin Mermaid xychart-beta palette via %%{init}%% so colors are deterministic
- render a Mermaid flowchart legend below the chart with matching node fills (xychart-beta has no native legend; HTML-styled spans get sanitized in GitHub step summaries)
- align PNG palette with Mermaid and add an in-figure legend()

* chore(process-metrics): add soak just module, simplify CI

- new crates/process-metrics/justfile registered at root as `mod soak`
- recipes: up, sample, render, logs, down, run, test, fmt, lint
- defaults via env_var_or_default so CI only sets per-matrix variables
- memory-soak-pr / memory-soak-non-pr jobs collapsed to `just soak::*` calls
- updated README run-locally section

* ci(memory-soak): run 1 hour per matrix entry

- set DURATION_SECONDS=3600 in both memory-soak jobs (local default stays 300s)
- timeout-minutes: 90 caps a hung job below the GH default 6h

* refactor(process-metrics): shrink soak.py, add click CLI

- Swap manual parsing for pandas, prometheus-client, humanize, python-dotenv. Drops ~290 lines.
- Replace bare command positional with click subcommands so `sample` and `render` each have their own `--help` with env-var fallbacks.
- Add `--docker-stats`, `--node-metrics`, `--out-dir`, `--label` to `render` so it can run on an arbitrary JSONL pair without depending on `--output-dir` layout.
- MemUsage parser now handles KiB/MiB/GiB/TiB (was MiB-only after the earlier refactor, would silently drop containers reporting GiB).
- Add empty-file and GiB regression tests.
- Show the full PNG path in the summary, not just the basename.
- Add `*args` passthrough to the wrapper just recipes.

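The unit-aware MemUsage parsing mentioned in the click CLI refactor above could look roughly like the sketch below. It is not the code from soak.py: the function name `parse_mem_usage`, the regex, and the choice to return `None` for unparseable input are all assumptions made for illustration. It only relies on `docker stats` printing the used amount first with a binary suffix (B, KiB, MiB, GiB, TiB), as in `1.5GiB / 7.6GiB`.

```python
import re
from typing import Optional

# Binary multipliers for the unit suffixes that `docker stats` prints.
_UNIT_BYTES = {
    "B": 1,
    "KiB": 1024,
    "MiB": 1024**2,
    "GiB": 1024**3,
    "TiB": 1024**4,
}

# Matches the "used" half of a MemUsage string such as "1.5GiB / 7.6GiB".
_MEM_USAGE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(B|KiB|MiB|GiB|TiB)\b")


def parse_mem_usage(mem_usage: str) -> Optional[int]:
    """Return the used memory in bytes, or None if the string is unparseable.

    Hypothetical helper: the name and return convention are illustrative.
    """
    match = _MEM_USAGE_RE.match(mem_usage)
    if match is None:
        return None
    value, unit = match.groups()
    return int(float(value) * _UNIT_BYTES[unit])
```

A MiB-only pattern would return `None` for `"1.5GiB / 7.6GiB"`, which is the silent-drop failure mode the commit describes; keeping every suffix in one lookup table makes that harder to reintroduce.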
* refactor(process-metrics): drop dead code and indirection in soak.py

- Sampling no longer sets `ESPRESSO_NODE_GENESIS_FILE` / `ESPRESSO_SEQUENCER_GENESIS_FILE` into the env. The `.env` does not interpolate them, so this was a no-op. The `--genesis-file` option on `sample` only existed to set those vars and is removed; the matching `export` is dropped from the just recipe.
- `scrape_node` writes `"node": "espresso-node-N"` directly instead of `http://localhost:<port>`, so `_process_rss_max` no longer needs to recover the index by regex. `NODE_BASE_PORT` removed.
- Aggregate max-per-bucket for the Mermaid chart instead of picking every Nth point, so peaks are not missed at 1Hz sampling.
- Compute the chart df (`seconds` + `rss_mb`, sorted) once in `render_summary` and pass it to both chart renderers, dropping `_series_mb`.
- Inline `_process_rss_max` (single caller) and `_hb`. Flatten `METRIC_NAMES` to a hardcoded frozenset; drop `METRIC_PREFIX`.
- Rename `_load_docker` to `_load_docker_metrics`.
- Narrow `scrape_node`'s except clause to `OSError`. URLError / TimeoutError / ConnectionError are subclasses.
- Show full PNG path in the summary, not just the basename.

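The max-per-bucket aggregation described above can be illustrated with a small stdlib-only sketch. The commit says soak.py uses pandas for this; the function below (`downsample_max`, a hypothetical name) is a simplified stand-in that only shows why bucketing preserves peaks where stride sampling can miss them.

```python
from typing import List, Sequence


def downsample_max(values: Sequence[float], max_points: int) -> List[float]:
    """Reduce `values` to at most `max_points` entries, keeping each bucket's max.

    Hypothetical helper for illustration; not the soak.py implementation.
    """
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    n = len(values)
    if n <= max_points:
        return list(values)
    # Ceiling division so that the number of buckets never exceeds max_points.
    bucket_size = -(-n // max_points)
    return [max(values[i:i + bucket_size]) for i in range(0, n, bucket_size)]


# A short spike (9) sits between the points a stride-based sampler would pick.
samples = [1, 2, 9, 3, 4, 5, 6, 7, 8, 1]
every_nth = samples[::4]                  # stride sampling skips the spike
bucketed = downsample_max(samples, 3)     # bucket maxima keep it
```

With these inputs the stride sample never sees the value 9, while the bucketed series reports it as the first point, which matters when the chart is meant to show peak RSS.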
* refactor(process-metrics): register at metrics root, trim boilerplate

- spawn ProcessMetrics in api/options.rs where the root PrometheusMetrics is in scope, so gauges publish as process_* instead of consensus_process_*
- drop the spawn from SequencerContext::init (was using the consensus subgroup handle from populate_metrics)
- swap Arc<dyn Gauge> for Box<dyn Gauge>, remove unused #[derive(Clone)]
- delete the test module: both tests were low-value (tautology over hardcoded gauge names; smoke test of sysinfo + /proc) and required ~80 lines of Metrics/Gauge trait stub boilerplate

* fix(process-metrics): match process_* gauge names in soak scripts

- gauges now register at the prometheus root (commit 82992b6456), so the consensus_ prefix is gone
- update RSS_METRIC, METRIC_NAMES, and the test fixture accordingly

* fix(process-metrics): chart RSS in MiB to match binary-unit source

- docker stats reports MemUsage in KiB/MiB/GiB (powers of 1024) and the process gauge is raw bytes; dividing by 1_000_000 and labeling "MB" inflated the displayed value by ~4.8% and disagreed with docker stats
- divide by 1024**2 and label "MiB" across the PNG axis, Mermaid title and y-axis label

* feat(process-metrics): add CPU, PSI, cgroup, and I/O gauges

- process_cpu_seconds_total, process_read_bytes_total, process_write_bytes_total from /proc/self
- node_cpu_count and node_load{1,5,15}_milli (loadavg x1000 to fit usize gauges)
- node_pressure_{cpu,memory,io}_{waiting,stalled}_seconds_total from cgroup v2 PSI with /proc/pressure fallback
- cgroup_cpu_{periods,throttled_periods,throttled_seconds}_total from /sys/fs/cgroup/cpu.stat
- cgroup_memory_current_bytes always, cgroup_memory_max_bytes only when finite
- SecondsAccumulator preserves sub-second precision across delta adds to integer Counters
- best-effort reads: missing kernel files debug-log and skip, never break the scrape

* chore: cargo sort

(cherry picked from commit 36b751e)

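The SecondsAccumulator bullet above describes carrying fractional seconds between samples so that integer counters do not lose time to truncation. The crate itself is Rust and its code is not shown on this page; the Python sketch below only illustrates the general idea. The class name `SecondsAccumulatorSketch`, the microsecond input, and the reset handling are assumptions, not the crate's API.

```python
class SecondsAccumulatorSketch:
    """Turn a growing microsecond total into whole-second counter increments.

    Illustrative only: the remainder below one second is carried forward
    instead of being truncated on every sample.
    """

    USEC_PER_SEC = 1_000_000

    def __init__(self) -> None:
        self._last_total_usec = 0
        self._carry_usec = 0

    def observe(self, total_usec: int) -> int:
        """Return the whole seconds to add to an integer counter."""
        delta = total_usec - self._last_total_usec
        self._last_total_usec = total_usec
        if delta < 0:
            # Assumed policy: the source was reset, so rebaseline and emit nothing.
            return 0
        self._carry_usec += delta
        whole, self._carry_usec = divmod(self._carry_usec, self.USEC_PER_SEC)
        return whole


acc = SecondsAccumulatorSketch()
readings = [400_000, 900_000, 1_300_000, 2_700_000]  # cumulative microseconds
emitted = [acc.observe(r) for r in readings]

# Truncating each delta independently loses the sub-second parts.
deltas = [400_000, 500_000, 400_000, 1_400_000]
naive_total = sum(d // 1_000_000 for d in deltas)
```

After 2.7 seconds of cumulative time the accumulator has emitted 2 whole seconds, whereas per-delta truncation has emitted only 1; the gap would keep growing over a long soak run.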
1 parent e4ea214 commit 49fa5d0

14 files changed

Lines changed: 1634 additions & 12 deletions

.github/workflows/build.yml

Lines changed: 110 additions & 0 deletions
@@ -541,3 +541,113 @@ jobs:
             echo "No jobs passed. Failing."
             exit 1
           fi
+
+  memory-soak-pr:
+    if: github.event_name == 'pull_request'
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    needs: [build-dockers-amd]
+    strategy:
+      fail-fast: false
+      matrix:
+        genesis: [demo-drb-header, demo-epoch-reward]
+    env:
+      DOCKER_TAG: pr-${{ github.event.pull_request.number }}
+      ESPRESSO_NODE_GENESIS_FILE: genesis/${{ matrix.genesis }}.toml
+      DURATION_SECONDS: 3600
+    steps:
+      - uses: actions/checkout@v6
+      - uses: astral-sh/setup-uv@v8.1.0
+      - uses: taiki-e/install-action@just
+
+      - name: Download docker image artifacts
+        uses: actions/download-artifact@v8
+        with:
+          path: ${{ runner.temp }}/docker-images
+          pattern: "*-docker-image"
+
+      - name: Load docker images
+        run: |
+          for file in $(find ${{ runner.temp }}/docker-images -name "*.tar"); do
+            docker load --input $file
+          done
+
+      - name: Start demo + smoke test
+        run: just soak::up
+
+      - name: Sample
+        run: just soak::sample
+
+      - name: Render summary + chart
+        if: always()
+        run: just soak::render
+
+      - name: Dump compose logs
+        if: always()
+        run: just soak::logs
+
+      - name: Upload soak samples
+        if: always()
+        uses: actions/upload-artifact@v7
+        with:
+          name: memory-soak-${{ matrix.genesis }}
+          path: ./soak-samples
+          retention-days: 90
+
+      - name: Upload soak logs
+        if: always()
+        uses: actions/upload-artifact@v7
+        with:
+          name: memory-soak-${{ matrix.genesis }}-logs
+          path: ./soak-logs
+          retention-days: 1
+
+  memory-soak-non-pr:
+    if: github.event_name != 'pull_request'
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    needs: [build-dockers-amd, create-multiplatform-docker-image]
+    strategy:
+      fail-fast: false
+      matrix:
+        genesis: [demo-drb-header, demo-epoch-reward]
+    env:
+      ESPRESSO_NODE_GENESIS_FILE: genesis/${{ matrix.genesis }}.toml
+      DURATION_SECONDS: 3600
+    steps:
+      - uses: actions/checkout@v6
+      - uses: astral-sh/setup-uv@v8.1.0
+      - uses: taiki-e/install-action@just
+
+      - name: Set docker tag
+        run: echo "DOCKER_TAG=$(echo '${{ github.ref_name }}' | tr '/' '-')" >> $GITHUB_ENV
+
+      - name: Start demo + smoke test
+        run: just soak::up
+
+      - name: Sample
+        run: just soak::sample
+
+      - name: Render summary + chart
+        if: always()
+        run: just soak::render
+
+      - name: Dump compose logs
+        if: always()
+        run: just soak::logs
+
+      - name: Upload soak samples
+        if: always()
+        uses: actions/upload-artifact@v7
+        with:
+          name: memory-soak-${{ matrix.genesis }}
+          path: ./soak-samples
+          retention-days: 90
+
+      - name: Upload soak logs
+        if: always()
+        uses: actions/upload-artifact@v7
+        with:
+          name: memory-soak-${{ matrix.genesis }}-logs
+          path: ./soak-logs
+          retention-days: 1

Cargo.lock

Lines changed: 123 additions & 11 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 4 additions & 0 deletions
@@ -34,6 +34,7 @@ members = [
   "crates/hotshot/testing",
   "crates/hotshot/types",
   "crates/hotshot/utils",
+  "crates/process-metrics",
   "crates/serialization/api",
   "crates/versions",
   "hotshot-events-service",
@@ -89,6 +90,7 @@ default-members = [
   "light-client",
   "light-client-query-service",
   "node-metrics",
+  "crates/process-metrics",
   "request-response",
   "sdks/crypto-helper",
   "crates/serialization/api",
@@ -324,6 +326,8 @@ pretty_assertions = { version = "1.4", features = ["unstable"] }
 primitive-types = "0.13"
 priority-queue = "2"
 proc-macro2 = "1"
+process-metrics = { path = "crates/process-metrics" }
+procfs = "0.18"
 prometheus = { version = "0.13", default-features = false }
 prometheus-parse = "0.2.5"
 proptest = "1"

crates/espresso/node/Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -83,6 +83,7 @@ moka = { workspace = true }
 num_enum = { workspace = true }
 parking_lot = { workspace = true }
 priority-queue = { workspace = true }
+process-metrics = { workspace = true }
 rand = { workspace = true }
 rand_chacha = { workspace = true }
 rand_distr = { workspace = true }
