Commit 49fa5d0
feat(process-metrics): add memory gauges and CI soak harness (#4330)
* feat(process-metrics): add memory gauges and CI soak harness
- new process-metrics crate exposes 5 process gauges: resident_memory_bytes,
virtual_memory_bytes, open_fds, threads, uptime_seconds
- espresso-node spawns the 5s sampler at startup via SequencerContext::init
- scripts/soak.py orchestrates the docker demo end to end: compose up, wait
for nodes ready, sample docker stats + /v0/status/metrics every 1s, render
a peak-total + per-service Markdown summary
- memory-soak-pr and memory-soak-non-pr jobs added to build.yml, matrix over
demo-drb-header.toml and demo-epoch-reward.toml
- 90-day artifact retention for raw JSONL and summary
* fix(process-metrics): fire-and-forget docker compose up
- `docker compose up -d` blocks on `service_completed_successfully`
dependencies and propagates their exit code, so a one-shot helper that
fails (e.g. wait-for-lc-epoch-2 in alpine with a bash shebang exits 127)
takes the whole soak job down even though all 5 nodes are healthy
- mirror the pattern in binary-upgrade-tests/run.py: launch `up -d` via
`Popen` to a log file, then verify readiness via node HTTP polling
- write the compose-up log into OUTPUT_DIR so it ends up in the artifact
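A minimal sketch of the fire-and-forget pattern described above, assuming a Python harness. The helper names `launch_detached`, `http_ok` and `wait_until` are hypothetical, not taken from soak.py or binary-upgrade-tests/run.py. The launcher never inspects the child's exit code, so a failing one-shot dependency cannot fail the job. Readiness is decided only by polling.

```python
import subprocess
import time
import urllib.request


def launch_detached(cmd, log_path):
    """Start cmd without waiting for it; stdout and stderr go to log_path."""
    with open(log_path, "wb") as log:
        # The child inherits its own copy of the descriptor, so closing
        # the file in the parent right after spawning is safe.
        return subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)


def http_ok(url, timeout_s=2.0):
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # URLError, timeouts and connection errors are subclasses
        return False


def wait_until(probe, timeout_s, interval_s=1.0):
    """Poll probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

A caller would pass something like `["docker", "compose", "up", "-d"]` and a log path under OUTPUT_DIR to `launch_detached`, then gate on `wait_until(lambda: http_ok(url), 300)`, where `url` is whichever node endpoint serves as the readiness check.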
* fix(demo): wait-for-light-client-epoch runs in alpine
- POSIX sh shebang and case/[ ] syntax: the badouralix/curl-jq image has
no bash, so the previous `#!/usr/bin/env bash` exited 127 and the LC
epoch gate completed instantly without polling
- cherry-picked from 6fe542a (script only;
the original commit also touched binary-upgrade-tests/run.py)
* fix(process-metrics): gate soak on smoke test, add chart + logs artifact
- soak.py no longer manages docker compose or polls per-node readiness;
node 2's command lacks the status module so /v0/status/block-height was
timing out at 300s. The CI workflow now brings the stack up and runs
scripts/smoke-test-demo as the readiness gate instead
- add RSS-over-time chart to the summary: inline mermaid xychart-beta so
GitHub renders it natively, plus a full-resolution PNG saved to the
samples artifact (matplotlib via PEP 723 uv inline deps)
- new soak-logs artifact captures docker compose logs + ps with 1-day
retention so failures are diagnosable
- swap python3 invocation for `uv run` (matplotlib gets installed from the
script's inline deps); add astral-sh/setup-uv@v8 step
* fix(ci): use proven `just demo --pull never` pattern for soak
- previous `docker compose up -d &` lost step exit info and didn't compose
with the way images are tagged in PR builds (PR variant loads pr-<N>
tagged images via `docker load`)
- match the existing test-demo-pr/non-pr pattern: pre-pull missing images,
then `just demo --pull never &` + smoke test as the gate
- add `taiki-e/install-action@just` so `just demo` is available
* feat(process-metrics): split soak.py into sample + render subcommands
- `soak.py sample`: stdlib only, no matplotlib import; produces JSONL artifacts
- `soak.py render`: reads saved JSONL, writes summary.md + chart; needs matplotlib
(PEP 723 inline deps). Re-runnable locally on saved data without re-sampling.
- CI splits into two steps so render runs on `if: always()` even when sample
fails partway, surfacing partial data.
- Set LD_LIBRARY_PATH in the project flake dev shell so uv-installed binary
wheels (matplotlib/numpy) find libstdc++ on NixOS. Verified end-to-end:
sample under plain python3, render via `uv run` produces a 1200x600 PNG.
* fix(ci): pin setup-uv to v8.1.0 (no floating v8 tag)
* feat(process-metrics): filter summary to nodes, fold RSS sources
- summary table now shows only espresso-node-N rows; the other containers
are still sampled into the JSONL artifact but not surfaced in the table
(user only cares about node memory)
- replace Min/Avg/Max/p99 RSS columns with Max RSS (docker) + Max RSS
(process gauge) + Max CPU%; p99 sometimes exceeded Max due to quantile
interpolation on small samples
- drop the redundant cross-check section: both sources now sit in the
main table side by side
- chart inherits the node-only filter and drops the matplotlib legend in
favor of color-matched end-of-line annotations
- drop empty-Name rows that produced a phantom `--` row in earlier output
- render infers duration from sample timestamps so the heading matches the
actual run when `soak.py render` is invoked without DURATION_SECONDS
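The duration inference in the last bullet can be sketched as follows. This is an illustration only: the field name `ts` and the assumption that it holds epoch seconds are hypothetical, since the JSONL schema is not shown in this commit message.

```python
import json


def infer_duration_seconds(jsonl_lines, ts_field="ts"):
    """Return max(ts) - min(ts) over the samples, or 0.0 with fewer than two."""
    stamps = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the artifact
        record = json.loads(line)
        if ts_field in record:
            stamps.append(float(record[ts_field]))
    if len(stamps) < 2:
        return 0.0
    return max(stamps) - min(stamps)
```

Deriving the span from the data keeps the summary heading honest when sampling was cut short or the file is rendered later on another machine.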
* ci(memory-soak): drop aggregator job
* ci(memory-soak): trim env to what isn't derivable
Drop DURATION_SECONDS, OUTPUT_DIR, GENESIS_LABEL: all already have sensible
defaults in soak.py (300s, ./soak-samples, basename of genesis file).
Collapse the matrix object into a flat list of genesis basenames; build the
file path and artifact name inline.
* debug(process-metrics): dump raw metrics response on first scrape
To diagnose why node-metrics.jsonl came out empty in CI: on the first
scrape attempt per node, log byte count + process_* match count, and save
the raw response body to OUTPUT_DIR/raw-metrics-espresso-node-N.txt so
the artifact contains the exact endpoint output.
* fix(metrics): drop consensus_ subgroup from populate_metrics
- populate_metrics() was wrapping the root in a `consensus` subgroup, so
every metric registered through it (HotShot consensus, storage, proposal
fetcher, and now our process metrics) got a misleading consensus_ prefix
- the SQL / scanner / aggregator metrics avoid this because they register
on the root PrometheusMetrics directly; populate_metrics is the only
consumer that adds the extra layer
- gauges now publish under their natural names (process_resident_memory_bytes,
append_da_duration, ...). External Grafana / alert queries that depended
on the consensus_ prefix will need to be updated
- revert soak.py to match process_* directly and drop the diagnostic raw
response dump introduced in c916948, no longer needed
* fix(process-metrics): match consensus_process_* gauge names
populate_metrics in hotshot-query-service wraps registered metrics in a
"consensus" subgroup. Our process gauges land there and publish as
consensus_process_*. Stop trying to scrape process_* and use the actual
name. Leave the shared library alone; the prefix is a project-wide
convention that external dashboards depend on.
* ci(memory-soak): use 100 delegators per validator
- pass DELEGATION_CONFIG and NUM_DELEGATORS_PER_VALIDATOR through to the
stake-for-demo docker compose service
- set multiple-delegators mode with 100 delegators per validator in both
soak workflow jobs to load the stake table beyond a handful of stakers
* fix(process-metrics): add legend to RSS-over-time chart
- pin Mermaid xychart-beta palette via %%{init}%% so colors are deterministic
- render a Mermaid flowchart legend below the chart with matching node fills
(xychart-beta has no native legend; HTML-styled spans get sanitized in
GitHub step summaries)
- align PNG palette with Mermaid and add an in-figure legend()
* chore(process-metrics): add soak just module, simplify CI
- new crates/process-metrics/justfile registered at root as `mod soak`
- recipes: up, sample, render, logs, down, run, test, fmt, lint
- defaults via env_var_or_default so CI only sets per-matrix variables
- memory-soak-pr / memory-soak-non-pr jobs collapsed to `just soak::*` calls
- updated README run-locally section
* ci(memory-soak): run 1 hour per matrix entry
- set DURATION_SECONDS=3600 in both memory-soak jobs (local default stays 300s)
- timeout-minutes: 90 caps a hung job below the GH default 6h
* refactor(process-metrics): shrink soak.py, add click CLI
- Swap manual parsing for pandas, prometheus-client, humanize,
python-dotenv. Drops ~290 lines.
- Replace bare command positional with click subcommands so `sample`
and `render` each have their own `--help` with env-var fallbacks.
- Add `--docker-stats`, `--node-metrics`, `--out-dir`, `--label` to
`render` so it can run on an arbitrary JSONL pair without depending
on `--output-dir` layout.
- MemUsage parser now handles KiB/MiB/GiB/TiB (was MiB-only after the
earlier refactor, would silently drop containers reporting GiB).
- Add empty-file and GiB regression tests.
- Show the full PNG path in the summary, not just the basename.
- Add `*args` passthrough to the wrapper just recipes.
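For the MemUsage bullet above, a hedged stdlib sketch of a unit-aware parser. It assumes the `used / limit` string form that `docker stats` prints (for example `1.5GiB / 15.6GiB`). The function name and regex are illustrative, and the real soak.py parser may be built differently.

```python
import re

_UNIT_BYTES = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}
_MEM_RE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*(B|KiB|MiB|GiB|TiB)\b")


def parse_mem_usage(mem_usage):
    """Return the 'used' part of a 'used / limit' string in bytes, or None."""
    match = _MEM_RE.match(mem_usage)
    if match is None:
        return None  # unknown unit: let the caller decide, never guess
    value, unit = match.groups()
    return int(float(value) * _UNIT_BYTES[unit])
```

Returning `None` for an unrecognized unit lets the caller log or count the failure, which avoids the silent-drop behavior the bullet describes.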
* refactor(process-metrics): drop dead code and indirection in soak.py
- Sampling no longer sets `ESPRESSO_NODE_GENESIS_FILE` /
`ESPRESSO_SEQUENCER_GENESIS_FILE` into the env. The `.env` does not
interpolate them, so this was a no-op. The `--genesis-file` option
on `sample` only existed to set those vars and is removed; the
matching `export` is dropped from the just recipe.
- `scrape_node` writes `"node": "espresso-node-N"` directly instead
of `http://localhost:<port>`, so `_process_rss_max` no longer needs
to recover the index by regex. `NODE_BASE_PORT` removed.
- Aggregate max-per-bucket for the Mermaid chart instead of picking
every Nth point, so peaks are not missed at 1Hz sampling.
- Compute the chart df (`seconds` + `rss_mb`, sorted) once in
`render_summary` and pass it to both chart renderers, dropping
`_series_mb`.
- Inline `_process_rss_max` (single caller) and `_hb`. Flatten
`METRIC_NAMES` to a hardcoded frozenset; drop `METRIC_PREFIX`.
- Rename `_load_docker` to `_load_docker_metrics`.
- Narrow `scrape_node`'s except clause to `OSError`. URLError /
TimeoutError / ConnectionError are subclasses.
- Show full PNG path in the summary, not just the basename.
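The max-per-bucket aggregation mentioned above can be illustrated with a small stdlib sketch. The commit says soak.py uses pandas, so this is not the actual code, and `downsample_max` is a hypothetical name. It shows why bucketed maxima keep a one-sample spike that every-Nth picking can skip.

```python
def downsample_max(points, buckets):
    """Reduce (seconds, value) points to at most `buckets` points,
    keeping the highest value seen in each equal-width time bucket."""
    if not points or buckets <= 0:
        return []
    points = sorted(points)
    start, end = points[0][0], points[-1][0]
    width = (end - start) / buckets or 1.0  # avoid zero width for a single instant
    maxima = {}
    for seconds, value in points:
        index = min(int((seconds - start) / width), buckets - 1)
        if index not in maxima or value > maxima[index][1]:
            maxima[index] = (seconds, value)
    return [maxima[i] for i in sorted(maxima)]


# Ten 1 Hz samples with a single spike at t=5.
samples = [(t, 900 if t == 5 else 100) for t in range(10)]
every_fourth = samples[::4]           # t = 0, 4, 8: the spike is lost
bucketed = downsample_max(samples, 2)  # the spike survives in the second bucket
```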
* refactor(process-metrics): register at metrics root, trim boilerplate
- spawn ProcessMetrics in api/options.rs where the root PrometheusMetrics
is in scope, so gauges publish as process_* instead of consensus_process_*
- drop the spawn from SequencerContext::init (was using the consensus
subgroup handle from populate_metrics)
- swap Arc<dyn Gauge> for Box<dyn Gauge>, remove unused #[derive(Clone)]
- delete the test module: both tests were low-value (tautology over
hardcoded gauge names; smoke test of sysinfo + /proc) and required ~80
lines of Metrics/Gauge trait stub boilerplate
* fix(process-metrics): match process_* gauge names in soak scripts
- gauges now register at the prometheus root (commit 82992b6456),
so the consensus_ prefix is gone
- update RSS_METRIC, METRIC_NAMES, and the test fixture accordingly
* fix(process-metrics): chart RSS in MiB to match binary-unit source
- docker stats reports MemUsage in KiB/MiB/GiB (powers of 1024) and the
process gauge is raw bytes; dividing by 1_000_000 and labeling "MB"
inflated the displayed value by ~4.9% and disagreed with docker stats
- divide by 1024**2 and label "MiB" across the PNG axis, Mermaid title
and y-axis label
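The size of the discrepancy follows from the ratio 1024**2 / 1_000_000 = 1.048576. A short worked check, using an arbitrary 512 MiB reading as the example value:

```python
MIB = 1024**2

rss_bytes = 512 * MIB                  # example gauge reading: 512 MiB in raw bytes
as_decimal_mb = rss_bytes / 1_000_000  # what the old chart displayed
as_mib = rss_bytes / MIB               # what docker stats reports
inflation = as_decimal_mb / as_mib - 1

print(f"{as_decimal_mb:.2f} 'MB' vs {as_mib:.0f} MiB ({inflation:.2%} higher)")
```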
* feat(process-metrics): add CPU, PSI, cgroup, and I/O gauges
- process_cpu_seconds_total, process_read_bytes_total, process_write_bytes_total from /proc/self
- node_cpu_count and node_load{1,5,15}_milli (loadavg x1000 to fit usize gauges)
- node_pressure_{cpu,memory,io}_{waiting,stalled}_seconds_total from cgroup v2 PSI with /proc/pressure fallback
- cgroup_cpu_{periods,throttled_periods,throttled_seconds}_total from /sys/fs/cgroup/cpu.stat
- cgroup_memory_current_bytes always, cgroup_memory_max_bytes only when finite
- SecondsAccumulator preserves sub-second precision across delta adds to integer Counters
- best-effort reads: missing kernel files debug-log and skip, never break the scrape
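The SecondsAccumulator idea can be sketched as follows. The crate itself is Rust and its implementation is not shown here, so this Python version is only a guess at the technique: carry the sub-second remainder between calls and add only whole seconds to the integer counter.

```python
class SecondsAccumulator:
    """Feed fractional second deltas into an integer counter without
    discarding the fraction on each add."""

    NANOS = 1_000_000_000

    def __init__(self):
        self.total = 0      # stands in for an integer Counter
        self._carry_ns = 0  # sub-second remainder, in nanoseconds

    def add(self, delta_seconds):
        if delta_seconds <= 0:
            return  # counters only move forward
        self._carry_ns += round(delta_seconds * self.NANOS)
        whole, self._carry_ns = divmod(self._carry_ns, self.NANOS)
        self.total += whole


acc = SecondsAccumulator()
for _ in range(3):
    acc.add(0.4)  # truncating each delta separately would add 0 every time
```

After three deltas of 0.4 s the counter holds 1 and the accumulator carries 0.2 s forward, whereas truncating each delta on its own would leave the counter at 0.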
* chore: cargo sort
(cherry picked from commit 36b751e)
1 parent e4ea214 commit 49fa5d0
14 files changed
Lines changed: 1634 additions & 12 deletions
File tree
- .github/workflows
- crates
- espresso/node
- src/api
- process-metrics
- scripts
- src