Commit 215859a

Merge pull request #1 from lambdaclass/dora-monitor
Add dora_monitor: Slack alerting tool for ethrex devnet
2 parents c1110d6 + 37a3a34 commit 215859a

16 files changed

Lines changed: 1726 additions & 0 deletions

dora_monitor/.gitignore

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
# Virtual env
.venv/

# Python bytecode / build
__pycache__/
*.py[cod]
*.egg-info/
*.egg
build/
dist/

# Tooling caches
.pytest_cache/
.mypy_cache/
.ruff_cache/
.tox/
.coverage
coverage.xml
htmlcov/
.python-version

# Local config / runtime state
config.yaml
dora_monitor_state.json
*.log

dora_monitor/Makefile

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
.PHONY: all deps run dry-run dry-run-once clean

CONFIG ?= config.yaml

all: deps

.venv:
	uv venv
	uv pip install -e .

deps: .venv

run: deps
	uv run dora-monitor -c $(CONFIG)

dry-run: deps
	uv run dora-monitor -c $(CONFIG) --dry-run --debug

dry-run-once: deps
	uv run dora-monitor -c $(CONFIG) --once --dry-run --debug

clean:
	rm -rf .venv *.egg-info build dist dora_monitor_state.json

dora_monitor/README.md

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
# dora_monitor

Polls a [Dora explorer](https://github.com/ethpandaops/dora) API and posts Slack and/or Discord alerts when a specific client (e.g. `ethrex`) misses a block, orphans a block, drifts onto a fork, falls behind the canonical head, or drops offline.

## Install

```bash
cd dora_monitor
python -m venv .venv && source .venv/bin/activate
pip install -e .
```

## Configure

Copy `config.example.yaml` and edit it; the mandatory fields are `dora_url`, `client_match`, and at least one webhook (`slack_webhook_url` or `discord_webhook_url`; either can come from the `SLACK_WEBHOOK_URL` / `DISCORD_WEBHOOK_URL` env var instead). Both may be set, in which case every alert and heartbeat is fanned out to each.

```bash
cp config.example.yaml config.yaml
$EDITOR config.yaml
```

`client_match` is a case-insensitive substring matched against Dora's `proposer_name` and client `name` fields (e.g. `lighthouse-ethrex-1`), so `ethrex` catches every CL/EL pair that runs the ethrex execution layer.
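
The matching rule above can be illustrated with a minimal sketch. The helper name `matches_client` and the sample names other than `lighthouse-ethrex-1` are hypothetical and not taken from the monitor's source; the sketch only shows what a case-insensitive substring test looks like, under the assumption that missing names should never match.

```python
from typing import Optional


def matches_client(name: Optional[str], client_match: str) -> bool:
    """Case-insensitive substring test (hypothetical helper).

    Assumption: a missing or empty name, or an empty pattern, never matches.
    """
    if not name or not client_match:
        return False
    return client_match.casefold() in name.casefold()


# Sample names; only the first appears in the README, the rest are made up.
names = ["lighthouse-ethrex-1", "Prysm-Ethrex-2", "teku-geth-1", None]
ours = [n for n in names if matches_client(n, "ethrex")]
```

Here `ours` keeps the two ethrex-paired names and drops the geth pair and the missing name.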

## Run

```bash
# foreground
dora-monitor -c config.yaml

# one tick, no Slack
SLACK_WEBHOOK_URL=unused dora-monitor -c config.yaml --once --dry-run
```

The process holds dedup state in `state_file` so restarts don't re-alert on already-reported missed slots / open conditions.

## Alerts

| Trigger | Source endpoint |
|---|---|
| Missed slot by `client_match` proposer | `/api/v1/slots?with_missing=1` |
| Orphaned block by `client_match` proposer | `/api/v1/slots?with_orphaned=1` |
| `client_match` node on a non-canonical fork | `/api/v1/network/client_head_forks` |
| `client_match` node lagging `>= sync_lag_threshold` slots | `/api/v1/network/client_head_forks` |
| `client_match` node `status != online` | `/api/v1/network/client_head_forks` |
| `client_match` EL `version` string changes (deploy/rollback) | scraped from `/clients/execution` HTML |

Recoveries (fork resolved, caught up, back online) are posted as well.
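
As a rough illustration of how a fork alert and its matching recovery might pair up, the sketch below debounces fork detection over consecutive polls, in the spirit of the `fork_confirm_ticks` setting documented in `config.example.yaml`. The class `ForkDebouncer` and its method names are hypothetical; the monitor's actual implementation is not shown in this commit excerpt and may differ.

```python
from collections import defaultdict
from typing import Optional


class ForkDebouncer:
    """Fire a fork alert only after N consecutive off-canonical polls (sketch)."""

    def __init__(self, confirm_ticks: int = 3) -> None:
        self.confirm_ticks = confirm_ticks
        self.streak = defaultdict(int)  # client name -> consecutive off-canonical polls
        self.alerted = set()            # clients with an open fork alert

    def observe(self, client: str, on_canonical: bool) -> Optional[str]:
        """Return "fork", "recovered", or None for this poll."""
        if on_canonical:
            self.streak[client] = 0
            if client in self.alerted:
                self.alerted.discard(client)
                return "recovered"
            return None
        self.streak[client] += 1
        if self.streak[client] >= self.confirm_ticks and client not in self.alerted:
            self.alerted.add(client)
            return "fork"
        return None


debouncer = ForkDebouncer(confirm_ticks=3)
polls = [False, False, True, False, False, False, False, True]
events = [debouncer.observe("lighthouse-ethrex-1", p) for p in polls]
```

In this run the two-poll blip at the start is ignored, the alert fires on the third consecutive off-canonical poll, and the recovery fires once when the client returns to the canonical head.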

The periodic heartbeat digest uses Slack Block Kit (`{"blocks": [...]}`) on Slack and a rich embed on Discord, both built from the same gathered data plus a shared plain-text fallback. Action alerts (offline / fork / lag / version change / missed-block) use plain mrkdwn `text` posts on Slack; the same strings are run through a mrkdwn→Discord-markdown translator (single-asterisk bold → double, `:shortcode:` emoji → unicode) before being posted to Discord. Clients with status `online`, on the canonical fork, and at `distance == 0` from canonical head collapse into a single "online @ canonical" bucket so the digest highlights outliers instead of repeating identical rows; use `heartbeat_other_clients: detailed` (default) to list the healthy names, `summary` for just a count, or `off` to drop the section entirely.
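
The bucketing rule described above could look roughly like the following. This is a sketch under assumptions: the `ClientHead` dataclass, its field names, and the rendered strings are hypothetical, chosen only to mirror the three conditions (`online`, canonical fork, `distance == 0`) and the three `heartbeat_other_clients` modes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ClientHead:
    # Hypothetical shape; the real Dora response fields may be named differently.
    name: str
    status: str
    on_canonical: bool
    distance: int


def split_healthy(clients: List[ClientHead]) -> Tuple[List[ClientHead], List[ClientHead]]:
    """Separate 'online @ canonical' clients from outliers."""
    healthy, outliers = [], []
    for c in clients:
        if c.status == "online" and c.on_canonical and c.distance == 0:
            healthy.append(c)
        else:
            outliers.append(c)
    return healthy, outliers


def render_healthy(healthy: List[ClientHead], mode: str = "detailed") -> Optional[str]:
    """Render the healthy bucket according to the heartbeat_other_clients mode."""
    if mode == "off" or not healthy:
        return None
    if mode == "summary":
        return f"online @ canonical: {len(healthy)}"
    names = ", ".join(sorted(c.name for c in healthy))
    return f"online @ canonical ({len(healthy)}): {names}"
```

Outliers would still get one row each, so a digest with twelve healthy clients and one lagging node shows a single bucket line plus the one node that needs attention.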

## A note on what "client" means here

`/api/v1/network/client_head_forks` lists Dora's **beacon (CL)** clients; their names embed the paired EL (e.g. `lighthouse-ethrex-1` is the Lighthouse beacon paired with an ethrex EL). So the offline / fork / lag signals are observed on the beacon side. An ethrex-EL crash shows up indirectly: the paired beacon's head stops advancing (sync_lag) or its status flips to non-online (offline).

Dora's `/api/v1/clients/execution` is deliberately NOT used. Its `status` field only reflects whether Dora's devp2p crawler could fetch `admin_nodeInfo` from the node, not whether the EL is healthy; ethrex EL nodes typically show `disconnected` there even when the UI shows them as `Ready` and following the chain. The execution clients page's real status (`Ready`/`Synchronizing`/`Offline`) is not exposed via any JSON API as of Dora master.

dora_monitor/config.example.yaml

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
# Dora explorer base URL (no trailing slash).
dora_url: "https://dora.bal-devnet-7.ethpandaops.io"

# Substring (case-insensitive) matched against proposer_name / client name
# in the Dora API. Any name containing this is considered "ours".
client_match: "ethrex"

# Slack incoming webhook URL. Can also be set via SLACK_WEBHOOK_URL env var.
# Leave empty to disable Slack delivery.
slack_webhook_url: ""

# Discord webhook URL. Can also be set via DISCORD_WEBHOOK_URL env var.
# Leave empty to disable Discord delivery. At least one of slack / discord
# must be configured (unless --dry-run).
discord_webhook_url: ""

# Optional label shown in alert headers (e.g. network name).
network_label: "bal-devnet-7"

# Poll interval in seconds.
poll_interval: 30

# How many recent slots to scan each tick for missed/orphaned blocks.
slot_scan_limit: 64

# Sync lag threshold (slots behind the canonical head) before alerting.
sync_lag_threshold: 16

# Number of consecutive polls a matched client must be on a non-canonical
# head before we fire a fork alert. Filters propagation-timing jitter
# (1-2 slot leads/lags that self-resolve in the next poll). With
# poll_interval=30, fork_confirm_ticks=3 = ~90s of confirmation.
fork_confirm_ticks: 3

# Path for persisted dedup state. Use null for in-memory only.
state_file: "./dora_monitor_state.json"

# HTTP timeout in seconds.
http_timeout: 10

# Set true for verbose logging.
debug: false

# Periodic digest ("heartbeat") posted to Slack to confirm the monitor
# is alive and give a network snapshot. Set 0 to disable.
heartbeat_interval_minutes: 360
# Slots scanned for missed/orphaned counts in each heartbeat.
heartbeat_slot_window: 256
# How to include non-`client_match` clients in the heartbeat:
#   off      - skip them entirely
#   summary  - one-line aggregate ("12 total, 1 non-online, 0 off-canonical")
#   detailed - per-client list with status / head / distance
heartbeat_other_clients: "detailed"

# Missed-block storm mode. When a single proposer hits
# `missed_burst_threshold` misses inside the last `missed_burst_window_minutes`,
# we stop posting per-slot alerts and switch to one "burst started" alert
# plus periodic "still bursting" updates whose interval backs off through
# `missed_burst_update_schedule_minutes` (last value repeats indefinitely).
# A "burst resolved" alert fires once the window has been clear of misses.
missed_burst_threshold: 5
missed_burst_window_minutes: 15
missed_burst_update_schedule_minutes: [15, 30, 60, 120]

# Which checks to enable.
checks:
  missed_blocks: true
  forks: true
  offline: true
  sync_lag: true
  # Posts an alert when an ethrex EL's reported version string changes
  # between polls (deploy / rollback detection). Reads from the HTML page
  # since the v1 JSON API doesn't expose ethrex's EL version.
  version_drift: true
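
The missed-block storm mode described in the config comments above can be sketched in a few lines. Both function names are hypothetical, and two details are assumptions rather than facts from the source: that the window is a sliding interval ending at the current time, and that the schedule is indexed by the number of "still bursting" updates already sent.

```python
from typing import Sequence


def in_burst(miss_times_min: Sequence[float], now_min: float,
             threshold: int = 5, window_min: float = 15) -> bool:
    """True when at least `threshold` misses fall inside the trailing window."""
    recent = [t for t in miss_times_min if now_min - window_min < t <= now_min]
    return len(recent) >= threshold


def update_interval_minutes(updates_sent: int, schedule: Sequence[int]) -> int:
    """Minutes to wait before the next update; the last value repeats."""
    if not schedule:
        raise ValueError("schedule must not be empty")
    return schedule[min(updates_sent, len(schedule) - 1)]


schedule = [15, 30, 60, 120]
intervals = [update_interval_minutes(i, schedule) for i in range(6)]
```

With the example values, `intervals` walks through the schedule once and then stays at the final 120-minute interval for every later update.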

dora_monitor/dora_monitor/__init__.py

Whitespace-only changes.
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
from dora_monitor.main import cli

if __name__ == "__main__":
    cli()
