| 1 | +--- |
| 2 | +name: local-monitoring |
| 3 | +description: Query the local Grafana/Prometheus/Loki stack shipped with this CDVN repo. Use when investigating cluster health, charon/beacon/EL errors, peer connectivity, validator performance, or log patterns against the locally-running monitoring stack (not Obol's hosted Grafana). |
| 4 | +user-invocable: true |
| 5 | +--- |
| 6 | + |
| 7 | +# Local Monitoring |
| 8 | + |
| 9 | +Query the local monitoring stack (Grafana, Prometheus, Loki) that ships with this repo to investigate cluster health and diagnose issues. |
| 10 | + |
| 11 | +For Obol's hosted Grafana (across all clusters), use the `obol-monitoring` skill instead. This skill is for the local stack only. |
| 12 | + |
| 13 | +## Prerequisites |
| 14 | + |
| 15 | +Before running, verify: |
| 16 | +1. The monitoring stack is up: `docker compose ps prometheus grafana loki` shows them running |
| 17 | +2. Grafana is reachable on the host at `http://localhost:${MONITORING_PORT_GRAFANA:-3000}` (default 3000) |
| 18 | +3. The user knows their Grafana admin credentials, or has unauthenticated access enabled (default in this repo's `grafana.ini`) |
| 19 | + |
| 20 | +If the stack isn't up, point the user to `docker compose up -d prometheus grafana loki` first. |
| 21 | + |
| 22 | +## Architecture notes |
| 23 | + |
| 24 | +- **Prometheus** (`:9090`) and **Loki** (`:3100`) are on the Docker network only and are not exposed to the host by default. Query them through one of: |
| 25 | + - **Grafana datasource proxy** (preferred): `http://localhost:${MONITORING_PORT_GRAFANA:-3000}/api/datasources/proxy/uid/<prometheus|loki>/<path>`; Grafana forwards the request over the Docker network |
| 26 | + - **`docker compose exec`** fallback: `docker compose exec prometheus wget -qO- 'http://localhost:9090/api/v1/query?query=...'` |
| 27 | +- Datasource UIDs (from `grafana/datasource.yml`): `prometheus`, `loki`, `tempo` |
| 28 | +- Charon metrics are labeled with `cluster_name` and `cluster_peer` — get these from `.env` (`CLUSTER_NAME`, `CLUSTER_PEER`) before querying |
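A minimal sketch of reading the two label values from `.env` and turning them into a PromQL selector. The helper names `env_value` and `charon_selector` are hypothetical and not part of this repo, and the sketch assumes plain `KEY=value` lines (optionally quoted) with no `export` prefix.

```shell
# Read one KEY=value entry from an env file without sourcing it.
# Prints the last match, with one pair of surrounding quotes stripped.
env_value() {
  key="$1"
  file="${2:-.env}"
  sed -n "s/^${key}=//p" "$file" | tail -n 1 |
    sed -e 's/^"\(.*\)"$/\1/' -e "s/^'\(.*\)'\$/\1/"
}

# Build a PromQL label selector from a cluster name and a peer name.
charon_selector() {
  printf '{cluster_name="%s",cluster_peer="%s"}' "$1" "$2"
}

# Example (run from the repo root, where .env lives):
#   sel="$(charon_selector "$(env_value CLUSTER_NAME)" "$(env_value CLUSTER_PEER)")"
#   echo "app_monitoring_readyz$sel"
```

Parsing the file instead of sourcing it avoids executing anything else that `.env` might contain.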
| 29 | + |
| 30 | +## Gather Arguments |
| 31 | + |
| 32 | +Use AskUserQuestion to clarify what the user wants to investigate. Common shapes: |
| 33 | + |
| 34 | +1. **What to investigate** — pick one: |
| 35 | + - Cluster health snapshot (readyz, peers, active validators) |
| 36 | + - Charon error/log search (last N minutes) |
| 37 | + - Beacon node performance (latency, sync status) |
| 38 | + - Peer connectivity (ping latency, connection types) |
| 39 | + - Custom PromQL / LogQL query |
| 40 | +2. **Time range** — default last 15m; ask if investigating a specific incident |
| 41 | +3. **Cluster scope** — usually their own (`$CLUSTER_NAME` from `.env`); ask only if multiple clusters share this Prometheus |
| 42 | + |
| 43 | +If the request is already specific (e.g. "show me charon errors from the last hour"), skip AskUserQuestion and proceed. |
| 44 | + |
| 45 | +## Execution |
| 46 | + |
| 47 | +### Instant query (Prometheus) |
| 48 | + |
| 49 | +```bash |
| 50 | +GRAFANA_URL="http://localhost:${MONITORING_PORT_GRAFANA:-3000}" |
| 51 | +curl -sG "$GRAFANA_URL/api/datasources/proxy/uid/prometheus/api/v1/query" \ |
| 52 | + --data-urlencode 'query=<PROMQL>' |
| 53 | +``` |
| 54 | + |
| 55 | +### Range query (Prometheus) |
| 56 | + |
| 57 | +```bash |
| 58 | +curl -sG "$GRAFANA_URL/api/datasources/proxy/uid/prometheus/api/v1/query_range" \ |
| 59 | + --data-urlencode 'query=<PROMQL>' \ |
| 60 | + --data-urlencode "start=$(( $(date -u +%s) - 900 ))" \ |
| 61 | + --data-urlencode "end=$(date -u +%s)" \ |
| 62 | + --data-urlencode 'step=30s' |
| 63 | +``` |
| 64 | + |
| 65 | +### Log search (Loki) |
| 66 | + |
| 67 | +```bash |
| 68 | +curl -sG "$GRAFANA_URL/api/datasources/proxy/uid/loki/loki/api/v1/query_range" \ |
| 69 | + --data-urlencode 'query={service_name="charon"} |= "error"' \ |
| 70 | + --data-urlencode "start=$(( $(date -u +%s) - 900 ))000000000" \ |
| 71 | + --data-urlencode "end=$(date -u +%s)000000000" \ |
| 72 | + --data-urlencode 'limit=200' |
| 73 | +``` |
| 74 | + |
| 75 | +### Fallback via `docker compose exec` |
| 76 | + |
| 77 | +If the Grafana proxy is unavailable: |
| 78 | +```bash |
| 79 | +docker compose exec prometheus wget -qO- "http://localhost:9090/api/v1/query?query=<URL_ENCODED_PROMQL>" |
| 80 | +docker compose exec loki wget -qO- "http://localhost:3100/loki/api/v1/query_range?query=<...>" |
| 81 | +``` |
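The fallback has no `--data-urlencode` equivalent, so the query must be percent-encoded before it goes into the URL. One possible way to do that in plain shell is sketched below; `urlencode` is a hypothetical helper, and it assumes the query contains ASCII characters only.

```shell
# Percent-encode a string for use in a URL query parameter.
# Unreserved characters (letters, digits, . ~ _ -) pass through unchanged.
# Assumption: input is ASCII; multi-byte characters are not handled.
urlencode() {
  s="$1"
  out=""
  while [ -n "$s" ]; do
    c="${s%"${s#?}"}"   # first character of the remaining string
    s="${s#?}"          # drop that character
    case "$c" in
      [A-Za-z0-9.~_-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;
    esac
  done
  printf '%s' "$out"
}

# Example:
#   q="$(urlencode 'app_monitoring_readyz != 1')"
#   docker compose exec prometheus wget -qO- "http://localhost:9090/api/v1/query?query=$q"
```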
| 82 | + |
| 83 | +For a query cookbook (cluster health, charon errors, peer ping, BN latency, validator effectiveness), see [queries.md](queries.md). |
| 84 | + |
| 85 | +## Output handling |
| 86 | + |
| 87 | +Parse the JSON response and present results clearly: |
| 88 | + |
| 89 | +- **Prometheus instant query** — show metric labels + value, flag anomalies (zeros where non-zero expected, threshold breaches) |
| 90 | +- **Prometheus range query** — summarise min/max/avg over the window; call out spikes |
| 91 | +- **Loki logs** — group by `cluster_peer` if present; surface error/warn lines verbatim with timestamps; suppress repetitive noise |
| 92 | +- Always print the **exact query that was run** so the user can re-run it in Grafana |
| 93 | + |
| 94 | +If the response contains `"status":"error"`, surface the `error` and `errorType` fields and stop — do not invent results. |
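The error check and the label/value listing described above could look roughly like this with `jq`. The function name `summarise_instant` is hypothetical; the sketch assumes a Prometheus instant-query response of result type `vector` arriving on stdin.

```shell
# Summarise a Prometheus instant-query response read from stdin.
# On success: print one line per series, "<labels as JSON> <value>".
# Otherwise: print "<errorType>: <error>" to stderr and return 1.
summarise_instant() {
  resp="$(cat)"
  status="$(printf '%s' "$resp" | jq -r '.status' 2>/dev/null)"
  if [ "$status" != "success" ]; then
    printf '%s' "$resp" | jq -r '"\(.errorType): \(.error)"' >&2
    return 1
  fi
  printf '%s' "$resp" |
    jq -r '.data.result[] | "\(.metric | tojson) \(.value[1])"'
}

# Example:
#   curl -sG "$GRAFANA_URL/api/datasources/proxy/uid/prometheus/api/v1/query" \
#     --data-urlencode 'query=app_monitoring_readyz' | summarise_instant
```

A non-JSON body (for example a Grafana login page) also takes the failure path, so the caller stops instead of reporting empty results.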
| 95 | + |
| 96 | +## Common diagnoses |
| 97 | + |
| 98 | +When showing results, watch for these patterns and call them out: |
| 99 | + |
| 100 | +- **`app_monitoring_readyz != 1`** — node is not ready; explain what readyz state means (1=ready, other=various failure modes documented in charon docs) |
| 101 | +- **High `p2p_ping_latency_secs` p90** — peer network is slow; check `p2p_peer_connection_types` for relayed vs direct |
| 102 | +- **`p2p_ping_success == 0`** for a peer — that operator is unreachable |
| 103 | +- **Charon log `error` spikes** — group by `topic` / `component` to identify which subsystem |
| 104 | +- **`core_scheduler_validators_active` lower than `cluster_validators`** — some validators not active (not yet activated, or exited) |
| 105 | +- **EL/CL container missing from metrics** — check `docker compose ps` and respective container logs |
| 106 | + |
| 107 | +## Pointers to dashboards |
| 108 | + |
| 109 | +Direct the user to the pre-provisioned dashboards in `grafana/dashboards/` rather than reinventing them: |
| 110 | +- `charon_overview_dashboard.json` — readyz, peers, validator activity (start here) |
| 111 | +- `cluster_dashboard.json` — full cluster view across operators |
| 112 | +- `node_overview_dashboard.json` — host/EL/CL/VC resource usage |
| 113 | +- `logs_dashboard.json` — Loki log explorer with charon filters |
| 114 | + |
| 115 | +Open in browser: `http://localhost:${MONITORING_PORT_GRAFANA:-3000}/dashboards`. |
| 116 | + |
| 117 | +## Dependencies |
| 118 | + |
| 119 | +- `curl`, `jq` (for parsing responses cleanly) |
| 120 | +- Running `prometheus`, `grafana`, `loki` containers from this compose stack |
| 121 | +- `CLUSTER_NAME` and `CLUSTER_PEER` set in `.env` (used as Prometheus label values) |