Add OpenTelemetry metrics for self-hosted worker observability by ianhodge · Pull Request #58 · warpdotdev/oz-agent-worker

ianhodge · 2026-04-29T15:43:25Z

Summary

Adds OpenTelemetry metrics export to oz-agent-worker so customers running self-hosted workers can monitor cluster health from their existing OTel/Prometheus stack — directly addressing the customer ask for # workers available, # workers active, active worker duration, and # successes / failures.

Exporter selection is driven by the standard OpenTelemetry environment variables (OTEL_METRICS_EXPORTER=prometheus|otlp|console|none) via go.opentelemetry.io/contrib/exporters/autoexport. Default behavior is unchanged: when the variable is unset, the worker emits no metrics.

Metric catalog

All metrics carry the resource attributes service.name=oz-agent-worker, service.version, worker.id, and worker.backend.

oz_worker_connected (gauge) — 1 while WebSocket connected, 0 otherwise.
oz_worker_tasks_active (UpDownCounter) — tasks currently executing.
oz_worker_tasks_max_concurrent (gauge) — configured concurrency limit.
oz_worker_tasks_claimed_total (counter) — tasks accepted.
oz_worker_tasks_rejected_total{reason} (counter) — e.g. reason="at_capacity".
oz_worker_tasks_completed_total{result} (counter) — result ∈ {succeeded, failed}.
oz_worker_task_duration_seconds{result} (histogram) — wall-clock task duration.
oz_worker_websocket_reconnects_total{reason} (counter) — reconnect attempts.
oz_worker_info{version,backend,worker_id} (gauge, value 1) — build metadata.

PromQL examples for the customer asks:

Workers available: sum(oz_worker_connected)
Workers active: count(oz_worker_tasks_active > 0)
Failure rate: sum(rate(oz_worker_tasks_completed_total{result="failed"}[5m]))
p95 task duration: histogram_quantile(0.95, sum by (le) (rate(oz_worker_task_duration_seconds_bucket[5m])))

Changes

internal/metrics/metrics.go — new package owning the OTel Meter, instrument creation, and typed helpers. Helpers fall back to no-op instruments before Init (or when metrics are disabled) so worker code can call them unconditionally. Uses an atomic.Pointer[instruments] so Init can hot-swap the no-op set for the SDK-backed set without locking.
internal/metrics/metrics_test.go — covers the noop-before-init contract, Init env-var gating, resource attributes, and counter/gauge/histogram semantics via a manual.Reader.
internal/worker/worker.go — instruments the connect/reconnect lifecycle, task claim/reject/complete paths, and active-task accounting.
main.go — calls metrics.Init before worker.New, defers shutdown after w.Shutdown(), sets SetMaxConcurrent and SetWorkerInfo once. Adds a Version build var (overridable via -ldflags="-X main.Version=...").
charts/oz-agent-worker/values.yaml — adds an opt-in metrics: section (default enabled: false).
charts/oz-agent-worker/templates/deployment.yaml — when enabled, injects OTEL_METRICS_EXPORTER, OTEL_EXPORTER_PROMETHEUS_HOST=0.0.0.0, OTEL_EXPORTER_PROMETHEUS_PORT, plus a named metrics containerPort. Also forwards metrics.extraEnv for OTLP endpoint configuration.
charts/oz-agent-worker/templates/service.yaml — new namespace-scoped Service with prometheus.io/scrape annotations; gated on metrics.enabled and metrics.exporter=prometheus.
charts/oz-agent-worker/templates/podmonitor.yaml — optional Prometheus Operator PodMonitor, gated on metrics.podMonitor.create=true.
README.md — Monitoring section with quick-start, Helm config, and PromQL examples.
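The no-op fallback and hot-swap described for internal/metrics/metrics.go can be sketched with the standard library alone. Everything here except the `atomic.Pointer[instruments]` idea is a hypothetical stand-in: the `counter` interface, `memCounter`, and the `TaskClaimed` helper are illustrative names, not the PR's actual code.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counter is a hypothetical one-method stand-in for an OTel instrument.
type counter interface{ Add(n int64) }

// noopCounter discards every observation.
type noopCounter struct{}

func (noopCounter) Add(int64) {}

// memCounter stands in for an SDK-backed instrument.
type memCounter struct{ total atomic.Int64 }

func (c *memCounter) Add(n int64) { c.total.Add(n) }

// instruments groups the instruments so the whole set is swapped at once.
type instruments struct{ tasksClaimed counter }

// current always points at a usable set; it starts as the no-op set.
var current atomic.Pointer[instruments]

func init() {
	current.Store(&instruments{tasksClaimed: noopCounter{}})
}

// Init replaces the no-op set with a real one without taking a lock.
func Init(real *instruments) { current.Store(real) }

// TaskClaimed is safe to call before Init; the observation is then dropped.
func TaskClaimed() { current.Load().tasksClaimed.Add(1) }

func main() {
	TaskClaimed() // before Init: silently discarded

	backing := &memCounter{}
	Init(&instruments{tasksClaimed: backing})
	TaskClaimed() // after Init: recorded

	fmt.Println(backing.total.Load())
}
```

Swapping one pointer to a whole struct means a helper never sees a half-initialized mix of no-op and real instruments.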

Validation

go vet ./... ✅
go build ./... ✅
go test ./... ✅ (new internal/metrics tests pass; existing tests unchanged)
helm lint charts/oz-agent-worker ✅
helm template ... --set metrics.enabled=true --set metrics.podMonitor.create=true renders the expected env vars, containerPort, Service, and PodMonitor.
With metrics.enabled=false (default) the chart renders no OTEL_* env vars, no Service, and no PodMonitor.

Out of scope

warp-server-side aggregate metrics: the customer needs data in their own monitoring stack, and the Cursor analogue they cited is also worker-side.
Tracing / logs export. The same autoexport pattern can be applied for those signals in a follow-up.

Conversation: https://staging.warp.dev/conversation/9f135c58-959f-411e-bb5f-5ed20a26e8d6
Run: https://oz.staging.warp.dev/runs/019dd9d9-609e-7514-957a-b91c5a86596a
Plans:

Self-hosted worker metrics via OpenTelemetry autoexport

This PR was generated with Oz.

Adds an internal metrics package wired through go.opentelemetry.io/contrib/exporters/autoexport so operators choose the exporter via standard OTel environment variables (OTEL_METRICS_EXPORTER=prometheus|otlp|console|none). Metrics default to off when OTEL_METRICS_EXPORTER is unset, so existing deployments are unaffected.

Metric catalog (each carries service.name, service.version, worker.id, worker.backend):

- oz_worker_connected (gauge): 1 while WebSocket connected, 0 otherwise.
- oz_worker_tasks_active (UpDownCounter): tasks currently executing.
- oz_worker_tasks_max_concurrent (gauge): configured concurrency limit.
- oz_worker_tasks_claimed_total (counter): tasks accepted.
- oz_worker_tasks_rejected_total{reason} (counter): tasks declined (e.g. at_capacity).
- oz_worker_tasks_completed_total{result} (counter): succeeded/failed.
- oz_worker_task_duration_seconds{result} (histogram): wall-clock task duration.
- oz_worker_websocket_reconnects_total{reason} (counter): reconnects.
- oz_worker_info{version,backend,worker_id} (gauge): build/runtime metadata.

These cover the customer-facing asks: workers available (sum oz_worker_connected), workers active (count oz_worker_tasks_active>0), task duration, success/failure rates.

Helm chart gains a metrics: section that, when enabled, injects the OTel env vars, opens a containerPort, and creates a Service plus optional PodMonitor. README adds a Monitoring section with quick-start and PromQL examples.

Co-Authored-By: Oz <oz-agent@warp.dev>

CI's lint job runs gofmt -s -l . and rejects unformatted code. The struct field alignment in `instruments` and a doc-comment indentation needed canonical formatting. Co-Authored-By: Oz <oz-agent@warp.dev>

The OTel Prometheus exporter only emits instruments that have been observed at least once, which meant fresh workers exposed only the gauges that were recorded synchronously during startup (`oz_worker_connected`, `oz_worker_info`, `oz_worker_tasks_max_concurrent`). Counters and the UpDownCounter stayed invisible until the first task or reject, which broke the customer-facing pattern of writing alerts like `rate(oz_worker_tasks_completed_total{result="failed"}[5m])` against a freshly deployed worker.

`Init` now calls `primeInstruments` to record a zero observation against every known (instrument, label) tuple. Bounded label values are pulled from new exported constants (`RejectReasonAtCapacity`, `WSReconnectReasonDialFailed`, `WSReconnectReasonRemoteClose`) and reused at the runtime call sites in worker.go to keep priming and recording in sync.

The duration histogram is intentionally not primed: a synthetic 0-second observation would distort latency quantiles. Dashboards that need to query duration before the first task should rely on `_count` rate semantics, which behave correctly when no samples have been observed.

Adds a unit test that asserts every counter and the UpDownCounter are visible at startup with primed value 0, every bounded label combination is exposed, and the histogram remains absent.

Co-Authored-By: Oz <oz-agent@warp.dev>
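The priming step described in this commit can be illustrated without the OTel SDK. In the sketch below, `registry` is a hypothetical stand-in for an exporter that only exposes series it has seen, and the string values of the two reconnect-reason constants are assumptions (only `at_capacity` appears in the PR text).

```go
package main

import "fmt"

const (
	RejectReasonAtCapacity       = "at_capacity"
	WSReconnectReasonDialFailed  = "dial_failed"  // assumed value
	WSReconnectReasonRemoteClose = "remote_close" // assumed value
)

// series identifies one (instrument, label value) tuple.
type series struct{ name, label string }

// registry is a hypothetical exporter model: a series becomes visible only
// after its first observation, even if that observation is zero.
type registry struct{ values map[series]float64 }

func (r *registry) add(name, label string, delta float64) {
	r.values[series{name, label}] += delta
}

// prime records a zero against every known counter tuple so each series is
// exposed at startup. Runtime call sites would reuse the same constants.
func prime(r *registry) {
	r.add("oz_worker_tasks_active", "", 0)
	r.add("oz_worker_tasks_claimed_total", "", 0)
	r.add("oz_worker_tasks_rejected_total", RejectReasonAtCapacity, 0)
	for _, result := range []string{"succeeded", "failed"} {
		r.add("oz_worker_tasks_completed_total", result, 0)
	}
	for _, reason := range []string{WSReconnectReasonDialFailed, WSReconnectReasonRemoteClose} {
		r.add("oz_worker_websocket_reconnects_total", reason, 0)
	}
	// oz_worker_task_duration_seconds is deliberately left unprimed: a fake
	// 0-second sample would skew latency quantiles.
}

func main() {
	r := &registry{values: map[series]float64{}}
	prime(r)
	_, hasHistogram := r.values[series{"oz_worker_task_duration_seconds", "succeeded"}]
	fmt.Println(len(r.values), hasHistogram)
}
```

Adding zero leaves every counter's value unchanged, so priming affects visibility only.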

ianhodge · 2026-04-29T16:30:32Z

+// Dashboards that need to query the duration histogram before the first task
+// runs should fall back to `_count`-based rate queries, which behave correctly
+// when no samples have been observed.
+func primeInstruments(ctx context.Context, set *instruments) {


we need to set a baseline for a lot of the metrics or else they error out if nothing has been recorded

bnavetta

Nice!

bnavetta · 2026-04-29T16:46:48Z

+//   - console:    writes metrics to stdout.
+//   - none:       disables metrics export entirely.
+//
+// When OTEL_METRICS_EXPORTER is unset, autoexport defaults to OTLP. To make


What does autoexport do if no OTLP destination is set? If we generate metrics unobtrusively when exporting to OTLP and it won't break anything, that seems like a fine default?

yeah that's a good point, will change the default

bnavetta · 2026-04-29T16:48:57Z

+	if err != nil {
+		// The no-op meter can't fail in practice; fall back to a zero-value
+		// instruments struct rather than panicking from package init.
+		noopSet = &instruments{}


This would cause a nil panic at runtime, right? It's kind of all the same if this is unreachable, but a panic at startup seems safer

bnavetta · 2026-04-29T16:52:15Z

 )

+// Version is the build-time version string. Override at link time with
+// -ldflags="-X main.Version=...".


We don't currently do this - as a followup, could you update the release workflow to include a version?

ahh good catch, will do
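For context on the follow-up discussed in this thread, a link-time version stamp generally looks like the sketch below. The `"dev"` default and the `versionString` helper are assumptions for illustration; the PR text only states that a `Version` variable exists in package main and can be overridden with `-X`.

```go
package main

import "fmt"

// Version is overridable at link time, for example:
//
//	go build -ldflags="-X main.Version=1.2.3" .
//
// The -X flag only works on package-level string variables. The "dev"
// default here is an assumed placeholder.
var Version = "dev"

// versionString is a hypothetical helper showing where the value would be
// surfaced, e.g. as the service.version resource attribute.
func versionString() string {
	return "oz-agent-worker " + Version
}

func main() {
	fmt.Println(versionString())
}
```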

… init failure

Two review-feedback fixes:

1. Default metrics export to whatever autoexport selects when OTEL_METRICS_EXPORTER is unset, rather than treating "unset" as off. This matches the OpenTelemetry SDK convention and means a worker dropped into an environment that already runs an OTLP collector picks it up automatically. Operators who don't run an OTLP collector and want to avoid periodic localhost push errors should set OTEL_METRICS_EXPORTER=none. README and the package doc comment are updated to spell out the new default.

2. The init() noop fallback used to install a zero-value `instruments` struct on the unreachable error path, which would nil-panic the first time a helper got called. Switch to an explicit panic at startup so any future change that breaks the no-op meter fails loud immediately rather than silently corrupting the runtime.

The previous TestInitDisabledByDefault is folded back into TestInitNoneIsDisabled, since "" is no longer disabled.

Co-Authored-By: Oz <oz-agent@warp.dev>
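The revised default can be summarized as a small decision function. This sketch models only what the commit message and the quoted doc comment state (unset falls through to autoexport's OTLP default, `none` disables export); `resolveExporter` is a hypothetical name and does not reproduce autoexport's actual parsing.

```go
package main

import (
	"fmt"
	"os"
)

// resolveExporter returns the exporter that would be selected and whether
// metrics export is enabled, given a lookup function for environment
// variables. Passing the lookup in keeps the function easy to test.
func resolveExporter(getenv func(string) string) (exporter string, enabled bool) {
	switch v := getenv("OTEL_METRICS_EXPORTER"); v {
	case "":
		// Unset: defer to the autoexport default, which the PR says is OTLP.
		return "otlp", true
	case "none":
		// Explicit opt-out, e.g. when no OTLP collector is running.
		return "none", false
	default:
		// prometheus, otlp, console: passed through as-is.
		return v, true
	}
}

func main() {
	exporter, enabled := resolveExporter(os.Getenv)
	fmt.Printf("exporter=%s enabled=%t\n", exporter, enabled)
}
```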

ianhodge · 2026-04-29T17:01:36Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

oz-agent and others added 3 commits April 29, 2026 15:42

Run gofmt -s on internal/metrics/metrics.go

45b43f0


ianhodge requested a review from bnavetta April 29, 2026 16:29

ianhodge commented Apr 29, 2026


ianhodge marked this pull request as ready for review April 29, 2026 16:30

bnavetta approved these changes Apr 29, 2026


ianhodge mentioned this pull request Apr 29, 2026

Stamp main.Version into release binaries via -ldflags #59

Merged

ianhodge merged commit 489d48f into main Apr 29, 2026
9 checks passed

ianhodge deleted the metrics/otel-autoexport branch April 29, 2026 17:06

ianhodge merged 4 commits into main from metrics/otel-autoexport
