Add OpenTelemetry metrics for self-hosted worker observability #58
Adds an internal metrics package wired through go.opentelemetry.io/contrib/exporters/autoexport
so operators choose the exporter via standard OTel environment variables
(OTEL_METRICS_EXPORTER=prometheus|otlp|console|none). Metrics default to off
when OTEL_METRICS_EXPORTER is unset, so existing deployments are unaffected.
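The gating described in this commit message (unset or `none` means no metrics) can be sketched with the standard library alone. `metricsEnabled` is a hypothetical helper, not the PR's code, and trimming whitespace is an assumption; the real selection is delegated to `autoexport`. A later commit in this PR changes the unset default.

```go
package main

import "strings"

// metricsEnabled is a hypothetical helper illustrating the gating in this
// commit: metrics stay off when OTEL_METRICS_EXPORTER is unset or "none".
// getenv is injected so the function can be exercised without touching the
// process environment; pass os.Getenv in real code.
func metricsEnabled(getenv func(string) string) bool {
	// Trimming surrounding whitespace is an assumption of this sketch.
	v := strings.TrimSpace(getenv("OTEL_METRICS_EXPORTER"))
	return v != "" && v != "none"
}
```

Injecting `getenv` rather than calling `os.Getenv` directly keeps the check easy to unit-test for each exporter value.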
Metric catalog (each carries service.name, service.version, worker.id, worker.backend):
- oz_worker_connected (gauge): 1 while WebSocket connected, 0 otherwise.
- oz_worker_tasks_active (UpDownCounter): tasks currently executing.
- oz_worker_tasks_max_concurrent (gauge): configured concurrency limit.
- oz_worker_tasks_claimed_total (counter): tasks accepted.
- oz_worker_tasks_rejected_total{reason} (counter): tasks declined (e.g. at_capacity).
- oz_worker_tasks_completed_total{result} (counter): succeeded/failed.
- oz_worker_task_duration_seconds{result} (histogram): wall-clock task duration.
- oz_worker_websocket_reconnects_total{reason} (counter): reconnects.
- oz_worker_info{version,backend,worker_id} (gauge): build/runtime metadata.
These cover the customer-facing asks: workers available (sum oz_worker_connected),
workers active (count oz_worker_tasks_active>0), task duration, success/failure rates.
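To make the two fleet-level queries concrete, here is a small sketch of what `sum(oz_worker_connected)` and `count(oz_worker_tasks_active > 0)` compute over one sample per worker. The `workerSample` type and both function names are hypothetical and exist only for illustration; in practice the aggregation happens in the metrics backend.

```go
package main

// workerSample is a hypothetical snapshot of two series for one worker.
type workerSample struct {
	Connected   float64 // value of oz_worker_connected (1 or 0)
	TasksActive float64 // value of oz_worker_tasks_active
}

// workersAvailable mirrors sum(oz_worker_connected): connected workers
// contribute 1 each, disconnected workers contribute 0.
func workersAvailable(samples []workerSample) float64 {
	var total float64
	for _, s := range samples {
		total += s.Connected
	}
	return total
}

// workersActive mirrors count(oz_worker_tasks_active > 0): it counts the
// workers that are executing at least one task, not the number of tasks.
func workersActive(samples []workerSample) int {
	n := 0
	for _, s := range samples {
		if s.TasksActive > 0 {
			n++
		}
	}
	return n
}
```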
Helm chart gains a metrics: section that, when enabled, injects the OTel env
vars, opens a containerPort, and creates a Service plus optional PodMonitor.
README adds a Monitoring section with quick-start and PromQL examples.
Co-Authored-By: Oz <oz-agent@warp.dev>
CI's lint job runs `gofmt -s -l .` and rejects unformatted code. The struct field alignment in `instruments` and a doc-comment indentation needed canonical formatting.
Co-Authored-By: Oz <oz-agent@warp.dev>
The OTel Prometheus exporter only emits instruments that have been observed
at least once, which meant fresh workers exposed only the gauges that were
recorded synchronously during startup (`oz_worker_connected`,
`oz_worker_info`, `oz_worker_tasks_max_concurrent`). Counters and the
UpDownCounter stayed invisible until the first task or reject, which broke
the customer-facing pattern of writing alerts like
`rate(oz_worker_tasks_completed_total{result="failed"}[5m])` against a
freshly deployed worker.
`Init` now calls `primeInstruments` to record a zero observation against
every known (instrument, label) tuple. Bounded label values are pulled from
new exported constants (`RejectReasonAtCapacity`,
`WSReconnectReasonDialFailed`, `WSReconnectReasonRemoteClose`) and reused at
the runtime call sites in worker.go to keep priming and recording in sync.
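The priming idea can be illustrated with a toy in-memory registry instead of the OTel SDK. Everything here is a sketch under assumptions: `registry` and `primeCounters` are hypothetical, and the string values of the two reconnect-reason constants are guesses (only `at_capacity` appears in the commit text). The point is that a series becomes visible once it has been recorded, even with a delta of 0, and that sharing constants keeps priming and runtime recording aligned.

```go
package main

// Label constants shared by priming and runtime call sites. The two
// reconnect values are assumed for illustration.
const (
	RejectReasonAtCapacity       = "at_capacity"
	WSReconnectReasonDialFailed  = "dial_failed"  // assumed value
	WSReconnectReasonRemoteClose = "remote_close" // assumed value
)

type seriesKey struct{ name, label string }

// registry is a toy stand-in for an exporter that only exposes series that
// have been observed at least once. It is not safe for concurrent use.
type registry struct{ series map[seriesKey]float64 }

func newRegistry() *registry { return &registry{series: map[seriesKey]float64{}} }

// Add records delta; the first Add for a key makes the series visible.
func (r *registry) Add(name, label string, delta float64) {
	r.series[seriesKey{name, label}] += delta
}

// Value returns the current value and whether the series is visible.
func (r *registry) Value(name, label string) (float64, bool) {
	v, ok := r.series[seriesKey{name, label}]
	return v, ok
}

// primeCounters records a zero observation for every bounded
// (instrument, label) tuple. The duration histogram is deliberately skipped.
func primeCounters(r *registry) {
	r.Add("oz_worker_tasks_active", "", 0)
	r.Add("oz_worker_tasks_claimed_total", "", 0)
	r.Add("oz_worker_tasks_rejected_total", RejectReasonAtCapacity, 0)
	for _, result := range []string{"succeeded", "failed"} {
		r.Add("oz_worker_tasks_completed_total", result, 0)
	}
	for _, reason := range []string{WSReconnectReasonDialFailed, WSReconnectReasonRemoteClose} {
		r.Add("oz_worker_websocket_reconnects_total", reason, 0)
	}
}
```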
The duration histogram is intentionally not primed: a synthetic 0-second
observation would distort latency quantiles. Dashboards that need to query
duration before the first task should rely on `_count` rate semantics, which
behave correctly when no samples have been observed.
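A short arithmetic sketch shows why a synthetic 0-second observation is harmful. It uses an exact sample median rather than Prometheus's bucket interpolation, so the numbers are illustrative only, and `median` is a hypothetical helper.

```go
package main

import "sort"

// median returns the exact sample median of xs (the mean of the two middle
// values for an even count). It copies the input so the caller's slice is
// not reordered, and returns 0 for an empty slice.
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	n := len(s)
	if n == 0 {
		return 0
	}
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}
```

With real task durations of 10, 12, and 14 seconds the median is 12. Adding one primed observation of 0 gives the sorted samples 0, 10, 12, 14, whose median is (10 + 12) / 2 = 11, so the reported latency drops even though no task got faster.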
Adds a unit test that asserts every counter and the UpDownCounter are
visible at startup with primed value 0, every bounded label combination is
exposed, and the histogram remains absent.
Co-Authored-By: Oz <oz-agent@warp.dev>
// Dashboards that need to query the duration histogram before the first task
// runs should fall back to `_count`-based rate queries, which behave correctly
// when no samples have been observed.
func primeInstruments(ctx context.Context, set *instruments) {
we need to set a baseline for a lot of the metrics or else they error out if nothing has been recorded
// - console: writes metrics to stdout.
// - none: disables metrics export entirely.
//
// When OTEL_METRICS_EXPORTER is unset, autoexport defaults to OTLP. To make
What does autoexport do if no OTLP destination is set? If we generate metrics unobtrusively when exporting to OTLP and it won't break anything, that seems like a fine default?
yeah that's a good point, will change the default
if err != nil {
// The no-op meter can't fail in practice; fall back to a zero-value
// instruments struct rather than panicking from package init.
noopSet = &instruments{}
This would cause a nil panic at runtime, right? It's kind of all the same if this is unreachable, but a panic at startup seems safer
)

// Version is the build-time version string. Override at link time with
// -ldflags="-X main.Version=...".
We don't currently do this - as a followup, could you update the release workflow to include a version?
ahh good catch, will do
… init failure

Two review-feedback fixes:

1. Default metrics export to whatever autoexport selects when OTEL_METRICS_EXPORTER is unset, rather than treating "unset" as off. This matches the OpenTelemetry SDK convention and means a worker dropped into an environment that already runs an OTLP collector picks it up automatically. Operators who don't run an OTLP collector and want to avoid periodic localhost push errors should set OTEL_METRICS_EXPORTER=none. README and the package doc comment are updated to spell out the new default.
2. The init() noop fallback used to install a zero-value `instruments` struct on the unreachable error path, which would nil-panic the first time a helper got called. Switch to an explicit panic at startup so any future change that breaks the no-op meter fails loud immediately rather than silently corrupting the runtime.

The previous TestInitDisabledByDefault is folded back into TestInitNoneIsDisabled, since "" is no longer disabled.
Co-Authored-By: Oz <oz-agent@warp.dev>
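The second fix can be illustrated without the OTel SDK. In this sketch the `counter` interface, `newNoopInstruments`, `mustNoopInstruments`, and `callPanics` are all hypothetical stand-ins. It shows that a zero-value struct holding a nil interface field panics only when a helper is first called, whereas an explicit panic surfaces the failure at startup.

```go
package main

import "errors"

// counter is a hypothetical stand-in for an OTel counter instrument.
type counter interface{ Add(n int64) }

type noopCounter struct{}

func (noopCounter) Add(int64) {}

type instruments struct{ tasksClaimed counter }

// newNoopInstruments stands in for building instruments from a no-op meter.
// The fail flag simulates the "unreachable" error path.
func newNoopInstruments(fail bool) (*instruments, error) {
	if fail {
		return nil, errors.New("no-op meter failed")
	}
	return &instruments{tasksClaimed: noopCounter{}}, nil
}

// mustNoopInstruments panics immediately on failure instead of returning a
// zero-value struct whose nil fields would panic later at a call site.
func mustNoopInstruments(fail bool) *instruments {
	set, err := newNoopInstruments(fail)
	if err != nil {
		panic("metrics: building no-op instruments: " + err.Error())
	}
	return set
}

// callPanics reports whether f panics; it exists only to demonstrate both
// failure modes safely.
func callPanics(f func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	f()
	return false
}
```

Both paths end in a panic when construction fails; the difference is when. The zero-value fallback defers it to an arbitrary helper call during task handling, while the explicit version fails during process start where it is easiest to diagnose.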

Summary
Adds OpenTelemetry metrics export to `oz-agent-worker` so customers running self-hosted workers can monitor cluster health from their existing OTel/Prometheus stack, directly addressing the customer ask for `# workers available`, `# workers active`, active worker duration, and `# successes / failures`.

Exporter selection is driven by the standard OpenTelemetry environment variables (`OTEL_METRICS_EXPORTER=prometheus|otlp|console|none`) via `go.opentelemetry.io/contrib/exporters/autoexport`. Default behavior is unchanged: when the variable is unset, the worker emits no metrics.

Metric catalog
All metrics carry the resource attributes `service.name=oz-agent-worker`, `service.version`, `worker.id`, and `worker.backend`.
- `oz_worker_connected` (gauge): `1` while WebSocket connected, `0` otherwise.
- `oz_worker_tasks_active` (UpDownCounter): tasks currently executing.
- `oz_worker_tasks_max_concurrent` (gauge): configured concurrency limit.
- `oz_worker_tasks_claimed_total` (counter): tasks accepted.
- `oz_worker_tasks_rejected_total{reason}` (counter): e.g. `reason="at_capacity"`.
- `oz_worker_tasks_completed_total{result}` (counter): `result` ∈ {`succeeded`, `failed`}.
- `oz_worker_task_duration_seconds{result}` (histogram): wall-clock task duration.
- `oz_worker_websocket_reconnects_total{reason}` (counter): reconnect attempts.
- `oz_worker_info{version,backend,worker_id}` (gauge, value `1`): build metadata.

PromQL examples for the customer asks:
- `sum(oz_worker_connected)`
- `count(oz_worker_tasks_active > 0)`
- `sum(rate(oz_worker_tasks_completed_total{result="failed"}[5m]))`
- `histogram_quantile(0.95, sum by (le) (rate(oz_worker_task_duration_seconds_bucket[5m])))`

Changes
- `internal/metrics/metrics.go`: new package owning the OTel `Meter`, instrument creation, and typed helpers. Helpers fall back to no-op instruments before `Init` (or when metrics are disabled) so worker code can call them unconditionally. Uses an `atomic.Pointer[instruments]` so `Init` can hot-swap the no-op set for the SDK-backed set without locking.
- `internal/metrics/metrics_test.go`: covers the noop-before-init contract, `Init` env-var gating, resource attributes, and counter/gauge/histogram semantics via a `manual.Reader`.
- `internal/worker/worker.go`: instruments the connect/reconnect lifecycle, task claim/reject/complete paths, and active-task accounting.
- `main.go`: calls `metrics.Init` before `worker.New`, defers `shutdown` after `w.Shutdown()`, sets `SetMaxConcurrent` and `SetWorkerInfo` once. Adds a `Version` build var (overridable via `-ldflags="-X main.Version=..."`).
- `charts/oz-agent-worker/values.yaml`: adds an opt-in `metrics:` section (default `enabled: false`).
- `charts/oz-agent-worker/templates/deployment.yaml`: when enabled, injects `OTEL_METRICS_EXPORTER`, `OTEL_EXPORTER_PROMETHEUS_HOST=0.0.0.0`, `OTEL_EXPORTER_PROMETHEUS_PORT`, plus a named `metrics` containerPort. Also forwards `metrics.extraEnv` for OTLP endpoint configuration.
- `charts/oz-agent-worker/templates/service.yaml`: new namespace-scoped Service with `prometheus.io/scrape` annotations; gated on `metrics.enabled` and `metrics.exporter=prometheus`.
- `charts/oz-agent-worker/templates/podmonitor.yaml`: optional Prometheus Operator `PodMonitor`, gated on `metrics.podMonitor.create=true`.
- `README.md`: Monitoring section with quick-start, Helm config, and PromQL examples.

Validation
- `go vet ./...` ✅
- `go build ./...` ✅
- `go test ./...` ✅ (new `internal/metrics` tests pass; existing tests unchanged)
- `helm lint charts/oz-agent-worker` ✅
- `helm template ... --set metrics.enabled=true --set metrics.podMonitor.create=true` renders the expected env vars, containerPort, Service, and PodMonitor.
- `metrics.enabled=false` (default): the chart renders no `OTEL_*` env vars, no Service, and no PodMonitor.

Out of scope
`autoexport` pattern can be applied for those signals in a follow-up.

Conversation: https://staging.warp.dev/conversation/9f135c58-959f-411e-bb5f-5ed20a26e8d6
Run: https://oz.staging.warp.dev/runs/019dd9d9-609e-7514-957a-b91c5a86596a
Plans:
This PR was generated with Oz.
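The hot-swap described for `internal/metrics/metrics.go` (helpers that are safe before `Init`, backed by an `atomic.Pointer[instruments]`) might look roughly like the following. This is a sketch, not the PR's code: the `counter` interface, `memCounter`, and the `Init` signature taking a ready-made set are assumptions for illustration, and it requires Go 1.19 or later for the generic atomic types.

```go
package main

import "sync/atomic"

// counter is a hypothetical stand-in for an OTel counter instrument.
type counter interface{ Add(n int64) }

// noopCounter discards every observation.
type noopCounter struct{}

func (noopCounter) Add(int64) {}

// memCounter is a toy "real" instrument that keeps its value in memory.
type memCounter struct{ n atomic.Int64 }

func (c *memCounter) Add(n int64) { c.n.Add(n) }

type instruments struct{ tasksClaimed counter }

// current always points at a usable instrument set, so helpers never need a
// nil check or a lock.
var current atomic.Pointer[instruments]

// The no-op set is installed at package init so helpers work before Init.
func init() {
	current.Store(&instruments{tasksClaimed: noopCounter{}})
}

// Init swaps in the SDK-backed set. In this sketch the caller supplies it.
func Init(set *instruments) { current.Store(set) }

// RecordTaskClaimed can be called unconditionally from worker code.
func RecordTaskClaimed() { current.Load().tasksClaimed.Add(1) }
```

Readers pay one atomic load per call, and a concurrent `Init` is seen either entirely or not at all, which is why no mutex is needed around the swap.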