docs(observability): add Grafana dashboard starter #6970

Draft
everettVT wants to merge 1 commit into main from docs/observability-grafana-dashboard

@everettVT
Contributor

Summary

Adds an importable Grafana dashboard for Daft observability, built on the OTel metrics already documented in docs/observability/telemetry.md. 5 panels: rows/sec by operator type, bytes flow, task lifecycle, top operators by cumulative duration, and a failed-tasks stat panel.

Companion to the in-process Daft Dashboard (daft dashboard start):

  • In-process dashboard → live per-query debugging (plan tree, tasks view, heatmap, event log replay)
  • Grafana dashboard → fleet monitoring / on-call / SLO tracking alongside your existing infra observability

Why a draft

Shipped untested end-to-end. I built this from the documented metric surface but haven't run it against a live Daft + OTel + Prometheus + Grafana stack. The metric query shapes assume the standard OTel Collector → Prometheus exporter naming convention (daft.X.Y → daft_x_y_total); this should be verified against an actual scrape before users follow the docs.
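As a rough illustration of the naming assumption above, the mapping can be sketched in a few lines of Python. The helper name `otel_to_prometheus_counter` is hypothetical, and the rule it encodes (dots become underscores, counters gain `_total`) is exactly the assumption this PR still needs to verify; some exporter configurations also append unit suffixes, which this sketch ignores.

```python
def otel_to_prometheus_counter(otel_name: str) -> str:
    """Map a dotted OTel counter name to its assumed Prometheus name."""
    # Assumption: the exporter replaces dots with underscores and appends
    # `_total` to counters. Unit suffixes some configs add are not modelled.
    base = otel_name.replace(".", "_")
    return base if base.endswith("_total") else base + "_total"


# `daft.x.y` is the placeholder shape used in the PR description,
# not a real metric name.
print(otel_to_prometheus_counter("daft.x.y"))
```

If a real scrape shows different names, only this mapping and the dashboard queries that depend on it should need to change.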

Files

  • docs/observability/grafana.md — new docs page (prerequisites, import instructions, panel reference, metric naming convention, "what's not covered yet", adaptation tips)
  • docs/observability/grafana/daft-dashboard.json — the dashboard JSON (Grafana 10+, schema v39, $DS_PROMETHEUS datasource variable)
  • docs/SUMMARY.md — adds "Grafana Dashboard" entry under Observability nav, after Telemetry
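For reviewers unfamiliar with importable Grafana dashboards, the overall shape of such a file can be sketched as below. This is not the contents of daft-dashboard.json: the function name, panel layout, and the `rate(...[5m])` query are illustrative assumptions, and only the schema version, the `DS_PROMETHEUS` input, and the metric and label names come from this PR.

```python
import json


def make_dashboard_sketch() -> dict:
    """Build a minimal one-panel dashboard dict (illustrative only)."""
    # The datasource is left as a variable so Grafana prompts for it on import.
    datasource = {"type": "prometheus", "uid": "${DS_PROMETHEUS}"}
    panel = {
        "id": 1,
        "type": "timeseries",
        "title": "Rows/sec by operator type",
        "datasource": datasource,
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
            {
                "refId": "A",
                "datasource": datasource,
                # Assumed query shape; metric name is unverified per the PR.
                "expr": "sum by (node_type) (rate(daft_rows_out_total[5m]))",
                "legendFormat": "{{node_type}}",
            }
        ],
    }
    return {
        "__inputs": [
            {
                "name": "DS_PROMETHEUS",
                "label": "Prometheus",
                "type": "datasource",
                "pluginId": "prometheus",
                "pluginName": "Prometheus",
            }
        ],
        "title": "Daft (sketch)",
        "schemaVersion": 39,
        "panels": [panel],
    }


print(json.dumps(make_dashboard_sketch(), indent=2))
```

Keeping the datasource behind `${DS_PROMETHEUS}` means the same JSON can be imported into any Grafana instance without editing UIDs by hand.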

Test plan

  • Import daft-dashboard.json into a real Grafana instance pointed at a Prometheus scraping OTel-exported Daft metrics
  • @samstokes / @cckellogg / @universalmind303 — verify metric query names against actual Prometheus output (daft_rows_out_total, daft_duration_total, daft_task_*, etc.)
  • @desmondcheongzx — confirm the process-level memory + CPU metrics (PR feat(observability): add process-level memory and CPU monitoring #6428) are visible via Prometheus and what their names are, so a future PR can add a "process resources" panel row
  • Verify Top 10 operators by cumulative duration query (topk(10, sum by (node_type, node_id) (daft_duration_total))) returns expected shape — the µs counter may need to be rescaled in the panel display
  • Confirm the "distributed-only" caveat in the docs page matches reality (we list daft_bytes_read_total, daft_task_active, daft_task_* as distributed-only per telemetry.md)
  • When @samstokes' peak resident state PR (feat(metrics): track peak resident state bytes for stateful operators #6883) lands, add a memory-attribution panel
  • Once per-operator bytes_in / bytes_out Prometheus exposure is confirmed, add inflation/deflation panels
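The name-verification items above could be partly automated with a small check like the following sketch. It parses Prometheus text-exposition output (for example, saved from a `/metrics` scrape) and reports which expected metric names are absent. The helper names, the `EXPECTED` list, and the sample labels are assumptions for illustration; the rescaled query at the end assumes the duration counter is in microseconds, as the test plan suggests.

```python
EXPECTED = (
    "daft_rows_out_total",
    "daft_duration_total",
    "daft_bytes_read_total",
    "daft_task_active",
)


def metric_names(exposition_text: str) -> set[str]:
    """Collect sample names from Prometheus text-exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Sample lines look like `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split(None, 1)[0])
    return names


def missing_metrics(exposition_text: str, expected=EXPECTED) -> list[str]:
    """Return the expected names that do not appear in the scrape."""
    present = metric_names(exposition_text)
    return [name for name in expected if name not in present]


# Assumed rescale: divide the microsecond counter by 1e6 to display seconds.
TOP_OPERATORS_SECONDS = (
    "topk(10, sum by (node_type, node_id) (daft_duration_total)) / 1e6"
)

# Hypothetical scrape excerpt; label values are made up.
SAMPLE = """\
# TYPE daft_rows_out_total counter
daft_rows_out_total{node_type="Filter",node_id="3"} 1200
daft_duration_total{node_type="Filter",node_id="3"} 450000
"""

print(missing_metrics(SAMPLE))  # ['daft_bytes_read_total', 'daft_task_active']
```

An alternative to rescaling in the query is to leave the expression unchanged and set the panel's display unit to microseconds in Grafana.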

Context

This started as a marketing-repo canon artifact (Eventual-Inc/marketing#360) supporting the W21 observability launch. Moved here because:

  1. The Daft repo can actually test it end-to-end against a real stack
  2. Users reading docs.daft.ai/observability/ will find it where they're already looking
  3. Engineering reviews the metric queries against actual scraper output before users import it

The marketing PR (#360 in Eventual-Inc/marketing) will be closed in favor of this one.

5-panel Grafana dashboard JSON + docs page covering throughput, bytes
flow, task lifecycle, top operators by duration, and a failed-task
counter. Reads OTel-exported Daft metrics via Prometheus.

Complement to the in-process Daft Dashboard: in-process for live
per-query debugging, Grafana for fleet monitoring / on-call /
SLO tracking alongside the rest of your infrastructure observability.

Built from the documented OTel metric surface in
docs/observability/telemetry.md. Shipped untested end-to-end — needs
verification against a real Daft + OTel + Prometheus + Grafana stack
(see PR test plan).

Adds entry to docs/SUMMARY.md under Guide → Configuration & Optimization
→ Observability.
@github-actions bot added the docs label on May 20, 2026
@github-actions

Rust Dependency Diff

Head: 2407e359aa6076108ab9153a0503b54365846539 vs Base: 4dd8ee211db381293f8ac75c9e47c9c160d560ce.

OK: Within budget.

  • New Crates: 0
  • Removed Crates: 0
