Commit 36b751e
authored
feat(process-metrics): add memory gauges and CI soak harness (#4330)
* feat(process-metrics): add memory gauges and CI soak harness
- new process-metrics crate exposes 5 process gauges: resident_memory_bytes,
virtual_memory_bytes, open_fds, threads, uptime_seconds
- espresso-node spawns the 5s sampler at startup via SequencerContext::init
- scripts/soak.py orchestrates the docker demo end to end: compose up, wait
for nodes ready, sample docker stats + /v0/status/metrics every 1s, render
a peak-total + per-service Markdown summary
- memory-soak-pr and memory-soak-non-pr jobs added to build.yml, matrix over
demo-drb-header.toml and demo-epoch-reward.toml
- 90-day artifact retention for raw JSONL and summary
* fix(process-metrics): fire-and-forget docker compose up
- `docker compose up -d` blocks on `service_completed_successfully`
dependencies and propagates their exit code, so a one-shot helper that
fails (e.g. wait-for-lc-epoch-2 in alpine with a bash shebang exits 127)
takes the whole soak job down even though all 5 nodes are healthy
- mirror the pattern in binary-upgrade-tests/run.py: launch `up -d` via
`Popen` to a log file, then verify readiness via node HTTP polling
- write the compose-up log into OUTPUT_DIR so it ends up in the artifact
* fix(demo): wait-for-light-client-epoch runs in alpine
- POSIX sh shebang and case/[ ] syntax: the badouralix/curl-jq image has
no bash, so the previous `#!/usr/bin/env bash` exited 127 and the LC
epoch gate completed instantly without polling
- cherry-picked from 6fe542a (script only;
the original commit also touched binary-upgrade-tests/run.py)
* fix(process-metrics): gate soak on smoke test, add chart + logs artifact
- soak.py no longer manages docker compose or polls per-node readiness;
node 2's command lacks the status module so /v0/status/block-height was
timing out at 300s. The CI workflow now brings the stack up and runs
scripts/smoke-test-demo as the readiness gate instead
- add RSS-over-time chart to the summary: inline mermaid xychart-beta so
GitHub renders it natively, plus a full-resolution PNG saved to the
samples artifact (matplotlib via PEP 723 uv inline deps)
- new soak-logs artifact captures docker compose logs + ps with 1-day
retention so failures are diagnosable
- swap python3 invocation for `uv run` (matplotlib gets installed from the
script's inline deps); add astral-sh/setup-uv@v8 step
* fix(ci): use proven `just demo --pull never` pattern for soak
- previous `docker compose up -d &` lost step exit info and didn't compose
with the way images are tagged in PR builds (PR variant loads pr-<N>
tagged images via `docker load`)
- match the existing test-demo-pr/non-pr pattern: pre-pull missing images,
then `just demo --pull never &` + smoke test as the gate
- add `taiki-e/install-action@just` so `just demo` is available
* feat(process-metrics): split soak.py into sample + render subcommands
- `soak.py sample`: stdlib only, no matplotlib import; produces JSONL artifacts
- `soak.py render`: reads saved JSONL, writes summary.md + chart; needs matplotlib
(PEP 723 inline deps). Re-runnable locally on saved data without re-sampling.
- CI splits into two steps so render runs on `if: always()` even when sample
fails partway, surfacing partial data.
- Set LD_LIBRARY_PATH in the project flake dev shell so uv-installed binary
wheels (matplotlib/numpy) find libstdc++ on NixOS. Verified end-to-end:
sample under plain python3, render via `uv run` produces a 1200x600 PNG.
* fix(ci): pin setup-uv to v8.1.0 (no floating v8 tag)
* feat(process-metrics): filter summary to nodes, fold RSS sources
- summary table now shows only espresso-node-N rows; the other containers
are still sampled into the JSONL artifact but not surfaced in the table
(user only cares about node memory)
- replace Min/Avg/Max/p99 RSS columns with Max RSS (docker) + Max RSS
(process gauge) + Max CPU%; p99 sometimes exceeded Max due to quantile
interpolation on small samples
- drop the redundant cross-check section: both sources now sit in the
main table side by side
- chart inherits the node-only filter and drops the matplotlib legend in
favor of color-matched end-of-line annotations
- drop empty-Name rows that produced a phantom `--` row in earlier output
- render infers duration from sample timestamps so the heading matches the
actual run when `soak.py render` is invoked without DURATION_SECONDS
* ci(memory-soak): drop aggregator job
* ci(memory-soak): trim env to what isn't derivable
Drop DURATION_SECONDS, OUTPUT_DIR, GENESIS_LABEL: all already have sensible
defaults in soak.py (300s, ./soak-samples, basename of genesis file).
Collapse the matrix object into a flat list of genesis basenames; build the
file path and artifact name inline.
* debug(process-metrics): dump raw metrics response on first scrape
To diagnose why node-metrics.jsonl came out empty in CI: on the first
scrape attempt per node, log byte count + process_* match count, and save
the raw response body to OUTPUT_DIR/raw-metrics-espresso-node-N.txt so
the artifact contains the exact endpoint output.
* fix(metrics): drop consensus_ subgroup from populate_metrics
- populate_metrics() was wrapping the root in a `consensus` subgroup, so
every metric registered through it (HotShot consensus, storage, proposal
fetcher, and now our process metrics) got a misleading consensus_ prefix
- the SQL / scanner / aggregator metrics avoid this because they register
on the root PrometheusMetrics directly; populate_metrics is the only
consumer that adds the extra layer
- gauges now publish under their natural names (process_resident_memory_bytes,
append_da_duration, ...). External Grafana / alert queries that depended
on the consensus_ prefix will need to be updated
- revert soak.py to match process_* directly and drop the diagnostic raw
response dump introduced in c916948, no longer needed
* fix(process-metrics): match consensus_process_* gauge names
populate_metrics in hotshot-query-service wraps registered metrics in a
"consensus" subgroup. Our process gauges land there and publish as
consensus_process_*. Stop trying to scrape process_* and use the actual
name. Leave the shared library alone; the prefix is a project-wide
convention that external dashboards depend on.
* ci(memory-soak): use 100 delegators per validator
- pass DELEGATION_CONFIG and NUM_DELEGATORS_PER_VALIDATOR through to the
stake-for-demo docker compose service
- set multiple-delegators mode with 100 delegators per validator in both
soak workflow jobs to load the stake table beyond a handful of stakers
* fix(process-metrics): add legend to RSS-over-time chart
- pin Mermaid xychart-beta palette via %%{init}%% so colors are deterministic
- render a Mermaid flowchart legend below the chart with matching node fills
(xychart-beta has no native legend; HTML-styled spans get sanitized in
GitHub step summaries)
- align PNG palette with Mermaid and add an in-figure legend()
* chore(process-metrics): add soak just module, simplify CI
- new crates/process-metrics/justfile registered at root as `mod soak`
- recipes: up, sample, render, logs, down, run, test, fmt, lint
- defaults via env_var_or_default so CI only sets per-matrix variables
- memory-soak-pr / memory-soak-non-pr jobs collapsed to `just soak::*` calls
- updated README run-locally section
* ci(memory-soak): run 1 hour per matrix entry
- set DURATION_SECONDS=3600 in both memory-soak jobs (local default stays 300s)
- timeout-minutes: 90 caps a hung job below the GH default 6h
* refactor(process-metrics): shrink soak.py, add click CLI
- Swap manual parsing for pandas, prometheus-client, humanize,
python-dotenv. Drops ~290 lines.
- Replace bare command positional with click subcommands so `sample`
and `render` each have their own `--help` with env-var fallbacks.
- Add `--docker-stats`, `--node-metrics`, `--out-dir`, `--label` to
`render` so it can run on an arbitrary JSONL pair without depending
on `--output-dir` layout.
- MemUsage parser now handles KiB/MiB/GiB/TiB (was MiB-only after the
earlier refactor, would silently drop containers reporting GiB).
- Add empty-file and GiB regression tests.
- Show the full PNG path in the summary, not just the basename.
- Add `*args` passthrough to the wrapper just recipes.
* refactor(process-metrics): drop dead code and indirection in soak.py
- Sampling no longer sets `ESPRESSO_NODE_GENESIS_FILE` /
`ESPRESSO_SEQUENCER_GENESIS_FILE` into the env. The `.env` does not
interpolate them, so this was a no-op. The `--genesis-file` option
on `sample` only existed to set those vars and is removed; the
matching `export` is dropped from the just recipe.
- `scrape_node` writes `\"node\": \"espresso-node-N\"` directly instead
of `http://localhost:<port>`, so `_process_rss_max` no longer needs
to recover the index by regex. `NODE_BASE_PORT` removed.
- Aggregate max-per-bucket for the Mermaid chart instead of picking
every Nth point, so peaks are not missed at 1Hz sampling.
- Compute the chart df (`seconds` + `rss_mb`, sorted) once in
`render_summary` and pass it to both chart renderers, dropping
`_series_mb`.
- Inline `_process_rss_max` (single caller) and `_hb`. Flatten
`METRIC_NAMES` to a hardcoded frozenset; drop `METRIC_PREFIX`.
- Rename `_load_docker` to `_load_docker_metrics`.
- Narrow `scrape_node`'s except clause to `OSError`. URLError /
TimeoutError / ConnectionError are subclasses.
- Show full PNG path in the summary, not just the basename.
* refactor(process-metrics): register at metrics root, trim boilerplate
- spawn ProcessMetrics in api/options.rs where the root PrometheusMetrics
is in scope, so gauges publish as process_* instead of consensus_process_*
- drop the spawn from SequencerContext::init (was using the consensus
subgroup handle from populate_metrics)
- swap Arc<dyn Gauge> for Box<dyn Gauge>, remove unused #[derive(Clone)]
- delete the test module: both tests were low-value (tautology over
hardcoded gauge names; smoke test of sysinfo + /proc) and required ~80
lines of Metrics/Gauge trait stub boilerplate
* fix(process-metrics): match process_* gauge names in soak scripts
- gauges now register at the prometheus root (commit 82992b6456),
so the consensus_ prefix is gone
- update RSS_METRIC, METRIC_NAMES, and the test fixture accordingly
* fix(process-metrics): chart RSS in MiB to match binary-unit source
- docker stats reports MemUsage in KiB/MiB/GiB (powers of 1024) and the
process gauge is raw bytes; dividing by 1_000_000 and labeling "MB"
inflated the displayed value by ~4.8% and disagreed with docker stats
- divide by 1024**2 and label "MiB" across the PNG axis, Mermaid title
and y-axis label
* feat(process-metrics): add CPU, PSI, cgroup, and I/O gauges
- process_cpu_seconds_total, process_read_bytes_total, process_write_bytes_total from /proc/self
- node_cpu_count and node_load{1,5,15}_milli (loadavg x1000 to fit usize gauges)
- node_pressure_{cpu,memory,io}_{waiting,stalled}_seconds_total from cgroup v2 PSI with /proc/pressure fallback
- cgroup_cpu_{periods,throttled_periods,throttled_seconds}_total from /sys/fs/cgroup/cpu.stat
- cgroup_memory_current_bytes always, cgroup_memory_max_bytes only when finite
- SecondsAccumulator preserves sub-second precision across delta adds to integer Counters
- best-effort reads: missing kernel files debug-log and skip, never break the scrape
* chore: cargo sort1 parent fcb5beb commit 36b751e
14 files changed
Lines changed: 1634 additions & 12 deletions
File tree
- .github/workflows
- crates
- espresso/node
- src/api
- process-metrics
- scripts
- src
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
708 | 708 | | |
709 | 709 | | |
710 | 710 | | |
| 711 | + | |
| 712 | + | |
| 713 | + | |
| 714 | + | |
| 715 | + | |
| 716 | + | |
| 717 | + | |
| 718 | + | |
| 719 | + | |
| 720 | + | |
| 721 | + | |
| 722 | + | |
| 723 | + | |
| 724 | + | |
| 725 | + | |
| 726 | + | |
| 727 | + | |
| 728 | + | |
| 729 | + | |
| 730 | + | |
| 731 | + | |
| 732 | + | |
| 733 | + | |
| 734 | + | |
| 735 | + | |
| 736 | + | |
| 737 | + | |
| 738 | + | |
| 739 | + | |
| 740 | + | |
| 741 | + | |
| 742 | + | |
| 743 | + | |
| 744 | + | |
| 745 | + | |
| 746 | + | |
| 747 | + | |
| 748 | + | |
| 749 | + | |
| 750 | + | |
| 751 | + | |
| 752 | + | |
| 753 | + | |
| 754 | + | |
| 755 | + | |
| 756 | + | |
| 757 | + | |
| 758 | + | |
| 759 | + | |
| 760 | + | |
| 761 | + | |
| 762 | + | |
| 763 | + | |
| 764 | + | |
| 765 | + | |
| 766 | + | |
| 767 | + | |
| 768 | + | |
| 769 | + | |
| 770 | + | |
| 771 | + | |
| 772 | + | |
| 773 | + | |
| 774 | + | |
| 775 | + | |
| 776 | + | |
| 777 | + | |
| 778 | + | |
| 779 | + | |
| 780 | + | |
| 781 | + | |
| 782 | + | |
| 783 | + | |
| 784 | + | |
| 785 | + | |
| 786 | + | |
| 787 | + | |
| 788 | + | |
| 789 | + | |
| 790 | + | |
| 791 | + | |
| 792 | + | |
| 793 | + | |
| 794 | + | |
| 795 | + | |
| 796 | + | |
| 797 | + | |
| 798 | + | |
| 799 | + | |
| 800 | + | |
| 801 | + | |
| 802 | + | |
| 803 | + | |
| 804 | + | |
| 805 | + | |
| 806 | + | |
| 807 | + | |
| 808 | + | |
| 809 | + | |
| 810 | + | |
| 811 | + | |
| 812 | + | |
| 813 | + | |
| 814 | + | |
| 815 | + | |
| 816 | + | |
| 817 | + | |
| 818 | + | |
| 819 | + | |
| 820 | + | |
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
35 | 35 | | |
36 | 36 | | |
37 | 37 | | |
| 38 | + | |
38 | 39 | | |
39 | 40 | | |
40 | 41 | | |
| |||
90 | 91 | | |
91 | 92 | | |
92 | 93 | | |
| 94 | + | |
93 | 95 | | |
94 | 96 | | |
95 | 97 | | |
| |||
334 | 336 | | |
335 | 337 | | |
336 | 338 | | |
| 339 | + | |
| 340 | + | |
337 | 341 | | |
338 | 342 | | |
339 | 343 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
85 | 85 | | |
86 | 86 | | |
87 | 87 | | |
| 88 | + | |
88 | 89 | | |
89 | 90 | | |
90 | 91 | | |
| |||
0 commit comments