Add OpenTelemetry metrics for self-hosted worker observability #58

Merged: ianhodge merged 4 commits into main from metrics/otel-autoexport on Apr 29, 2026

Conversation

@ianhodge
Member

Summary

Adds OpenTelemetry metrics export to oz-agent-worker so customers running self-hosted workers can monitor cluster health from their existing OTel/Prometheus stack — directly addressing the customer ask for # workers available, # workers active, active worker duration, and # successes / failures.

Exporter selection is driven by the standard OpenTelemetry environment variables (OTEL_METRICS_EXPORTER=prometheus|otlp|console|none) via go.opentelemetry.io/contrib/exporters/autoexport. As originally proposed, the worker emitted no metrics when the variable was unset; after review, an unset variable defers to autoexport's default (OTLP), and OTEL_METRICS_EXPORTER=none disables export.

Metric catalog

All metrics carry the resource attributes service.name=oz-agent-worker, service.version, worker.id, and worker.backend.

  • oz_worker_connected (gauge) — 1 while WebSocket connected, 0 otherwise.
  • oz_worker_tasks_active (UpDownCounter) — tasks currently executing.
  • oz_worker_tasks_max_concurrent (gauge) — configured concurrency limit.
  • oz_worker_tasks_claimed_total (counter) — tasks accepted.
  • oz_worker_tasks_rejected_total{reason} (counter) — e.g. reason="at_capacity".
  • oz_worker_tasks_completed_total{result} (counter) — result ∈ {succeeded, failed}.
  • oz_worker_task_duration_seconds{result} (histogram) — wall-clock task duration.
  • oz_worker_websocket_reconnects_total{reason} (counter) — reconnect attempts.
  • oz_worker_info{version,backend,worker_id} (gauge, value 1) — build metadata.

PromQL examples for the customer asks:

  • Workers available: sum(oz_worker_connected)
  • Workers active: count(oz_worker_tasks_active > 0)
  • Failure rate (failed tasks per second): sum(rate(oz_worker_tasks_completed_total{result="failed"}[5m]))
  • p95 task duration: histogram_quantile(0.95, sum by (le) (rate(oz_worker_task_duration_seconds_bucket[5m])))

Changes

  • internal/metrics/metrics.go — new package owning the OTel Meter, instrument creation, and typed helpers. Helpers fall back to no-op instruments before Init (or when metrics are disabled) so worker code can call them unconditionally. Uses an atomic.Pointer[instruments] so Init can hot-swap the no-op set for the SDK-backed set without locking.
  • internal/metrics/metrics_test.go — covers the noop-before-init contract, Init env-var gating, resource attributes, and counter/gauge/histogram semantics via a manual.Reader.
  • internal/worker/worker.go — instruments the connect/reconnect lifecycle, task claim/reject/complete paths, and active-task accounting.
  • main.go — calls metrics.Init before worker.New, defers shutdown after w.Shutdown(), sets SetMaxConcurrent and SetWorkerInfo once. Adds a Version build var (overridable via -ldflags="-X main.Version=...").
  • charts/oz-agent-worker/values.yaml — adds an opt-in metrics: section (default enabled: false).
  • charts/oz-agent-worker/templates/deployment.yaml — when enabled, injects OTEL_METRICS_EXPORTER, OTEL_EXPORTER_PROMETHEUS_HOST=0.0.0.0, OTEL_EXPORTER_PROMETHEUS_PORT, plus a named metrics containerPort. Also forwards metrics.extraEnv for OTLP endpoint configuration.
  • charts/oz-agent-worker/templates/service.yaml — new namespace-scoped Service with prometheus.io/scrape annotations; gated on metrics.enabled and metrics.exporter=prometheus.
  • charts/oz-agent-worker/templates/podmonitor.yaml — optional Prometheus Operator PodMonitor, gated on metrics.podMonitor.create=true.
  • README.md — Monitoring section with quick-start, Helm config, and PromQL examples.
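The no-op fallback and hot-swap described for internal/metrics/metrics.go can be sketched with the standard library alone. This is an illustration of the pattern under stated assumptions, not the package's code: `counter`, `noopCounter`, `sdkCounter`, `RecordTaskClaimed`, and `initMetrics` are hypothetical names, and the real package stores OpenTelemetry instruments rather than these toy types.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counter is a hypothetical minimal instrument interface for this sketch.
type counter interface{ Add(n int64) }

// noopCounter discards every observation.
type noopCounter struct{}

func (noopCounter) Add(int64) {}

// sdkCounter stands in for an SDK-backed instrument that actually records.
type sdkCounter struct{ total atomic.Int64 }

func (c *sdkCounter) Add(n int64) { c.total.Add(n) }

// instruments groups every instrument the helpers record into.
type instruments struct{ tasksClaimed counter }

// current always holds a usable set: no-op until initMetrics swaps in a
// recording set. Readers never take a lock.
var current atomic.Pointer[instruments]

func init() {
	current.Store(&instruments{tasksClaimed: noopCounter{}})
}

// RecordTaskClaimed is safe to call before or after initMetrics.
func RecordTaskClaimed() { current.Load().tasksClaimed.Add(1) }

// initMetrics atomically replaces the no-op set and returns the recording
// counter so the caller can inspect it.
func initMetrics() *sdkCounter {
	c := &sdkCounter{}
	current.Store(&instruments{tasksClaimed: c})
	return c
}

func main() {
	RecordTaskClaimed() // before init: silently dropped
	c := initMetrics()
	RecordTaskClaimed() // after init: recorded
	fmt.Println(c.total.Load())
}
```

Because the whole set is replaced through a single pointer store, a helper observes either the complete no-op set or the complete recording set, never a mix of the two.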

Validation

  • go vet ./...
  • go build ./...
  • go test ./... ✅ (new internal/metrics tests pass; existing tests unchanged)
  • helm lint charts/oz-agent-worker
  • helm template ... --set metrics.enabled=true --set metrics.podMonitor.create=true renders the expected env vars, containerPort, Service, and PodMonitor.
  • With metrics.enabled=false (default) the chart renders no OTEL_* env vars, no Service, and no PodMonitor.

Out of scope

  • warp-server-side aggregate metrics: the customer needs data in their own monitoring stack, and the Cursor analogue they cited is also worker-side.
  • Tracing / logs export. The same autoexport pattern can be applied for those signals in a follow-up.

Conversation: https://staging.warp.dev/conversation/9f135c58-959f-411e-bb5f-5ed20a26e8d6
Run: https://oz.staging.warp.dev/runs/019dd9d9-609e-7514-957a-b91c5a86596a
Plans:

This PR was generated with Oz.

oz-agent and others added 3 commits April 29, 2026 15:42
Adds an internal metrics package wired through go.opentelemetry.io/contrib/exporters/autoexport
so operators choose the exporter via standard OTel environment variables
(OTEL_METRICS_EXPORTER=prometheus|otlp|console|none). Metrics default to off
when OTEL_METRICS_EXPORTER is unset, so existing deployments are unaffected.

Metric catalog (each carries service.name, service.version, worker.id, worker.backend):

- oz_worker_connected (gauge): 1 while WebSocket connected, 0 otherwise.
- oz_worker_tasks_active (UpDownCounter): tasks currently executing.
- oz_worker_tasks_max_concurrent (gauge): configured concurrency limit.
- oz_worker_tasks_claimed_total (counter): tasks accepted.
- oz_worker_tasks_rejected_total{reason} (counter): tasks declined (e.g. at_capacity).
- oz_worker_tasks_completed_total{result} (counter): succeeded/failed.
- oz_worker_task_duration_seconds{result} (histogram): wall-clock task duration.
- oz_worker_websocket_reconnects_total{reason} (counter): reconnects.
- oz_worker_info{version,backend,worker_id} (gauge): build/runtime metadata.

These cover the customer-facing asks: workers available (sum oz_worker_connected),
workers active (count oz_worker_tasks_active>0), task duration, success/failure rates.

Helm chart gains a metrics: section that, when enabled, injects the OTel env
vars, opens a containerPort, and creates a Service plus optional PodMonitor.
README adds a Monitoring section with quick-start and PromQL examples.

Co-Authored-By: Oz <oz-agent@warp.dev>

CI's lint job runs `gofmt -s -l .` and rejects unformatted code. The struct
field alignment in `instruments` and a doc-comment indentation needed
canonical formatting.

Co-Authored-By: Oz <oz-agent@warp.dev>

The OTel Prometheus exporter only emits instruments that have been observed
at least once, which meant fresh workers exposed only the gauges that were
recorded synchronously during startup (`oz_worker_connected`,
`oz_worker_info`, `oz_worker_tasks_max_concurrent`). Counters and the
UpDownCounter stayed invisible until the first task or reject, which broke
the customer-facing pattern of writing alerts like
`rate(oz_worker_tasks_completed_total{result="failed"}[5m])` against a
freshly deployed worker.

`Init` now calls `primeInstruments` to record a zero observation against
every known (instrument, label) tuple. Bounded label values are pulled from
new exported constants (`RejectReasonAtCapacity`,
`WSReconnectReasonDialFailed`, `WSReconnectReasonRemoteClose`) and reused at
the runtime call sites in worker.go to keep priming and recording in sync.

The duration histogram is intentionally not primed: a synthetic 0-second
observation would distort latency quantiles. Dashboards that need to query
duration before the first task should rely on `_count` rate semantics, which
behave correctly when no samples have been observed.

Adds a unit test that asserts every counter and the UpDownCounter are
visible at startup with primed value 0, every bounded label combination is
exposed, and the histogram remains absent.
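The priming idea can be illustrated with a toy registry that, like the exporter behavior described above, only exposes series that have been observed at least once. This is a sketch, not the `primeInstruments` implementation: `registry`, `add`, `exposed`, and `prime` are hypothetical, and the reconnect reason strings `dial_failed` and `remote_close` are assumed values (only `at_capacity` appears in the PR text).

```go
package main

import (
	"fmt"
	"sort"
)

// Label values used for priming. "at_capacity" comes from the PR text; the
// two reconnect reasons are assumed strings for illustration.
const (
	reasonAtCapacity  = "at_capacity"
	reasonDialFailed  = "dial_failed"
	reasonRemoteClose = "remote_close"
)

// registry is a toy exporter: a series exists only once it has been observed.
type registry struct{ series map[string]float64 }

// add records v against the series identified by name and label.
func (r *registry) add(name, label string, v float64) {
	r.series[name+"{"+label+"}"] += v
}

// exposed returns the sorted series names a scrape would see.
func (r *registry) exposed() []string {
	names := make([]string, 0, len(r.series))
	for name := range r.series {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// prime records a zero against every bounded (instrument, label) pair so the
// series exist from startup. The duration histogram is deliberately skipped:
// a synthetic 0-second sample would distort latency quantiles.
func prime(r *registry) {
	r.add("oz_worker_tasks_active", "", 0)
	r.add("oz_worker_tasks_claimed_total", "", 0)
	r.add("oz_worker_tasks_rejected_total", `reason="`+reasonAtCapacity+`"`, 0)
	for _, result := range []string{"succeeded", "failed"} {
		r.add("oz_worker_tasks_completed_total", `result="`+result+`"`, 0)
	}
	for _, reason := range []string{reasonDialFailed, reasonRemoteClose} {
		r.add("oz_worker_websocket_reconnects_total", `reason="`+reason+`"`, 0)
	}
}

func main() {
	r := &registry{series: map[string]float64{}}
	fmt.Println(len(r.exposed())) // nothing is visible before priming
	prime(r)
	for _, name := range r.exposed() {
		fmt.Println(name, r.series[name])
	}
}
```

Sharing the label constants between the priming code and the runtime call sites is what keeps the two in sync: a label value that is recorded but never primed would reintroduce the invisible-series problem.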

Co-Authored-By: Oz <oz-agent@warp.dev>

@ianhodge ianhodge requested a review from bnavetta April 29, 2026 16:29
// Dashboards that need to query the duration histogram before the first task
// runs should fall back to `_count`-based rate queries, which behave correctly
// when no samples have been observed.
func primeInstruments(ctx context.Context, set *instruments) {
Member Author

we need to set a baseline for a lot of the metrics, or else the exporter doesn't emit them at all until something has been recorded (which breaks alert queries against a fresh worker)

@ianhodge ianhodge marked this pull request as ready for review April 29, 2026 16:30
Collaborator

@bnavetta bnavetta left a comment

Nice!

Comment thread internal/metrics/metrics.go Outdated
// - console: writes metrics to stdout.
// - none: disables metrics export entirely.
//
// When OTEL_METRICS_EXPORTER is unset, autoexport defaults to OTLP. To make
Collaborator

What does autoexport do if no OTLP destination is set? If we generate metrics unobtrusively when exporting to OTLP and it won't break anything, that seems like a fine default?

Member Author

yeah that's a good point, will change the default

Comment thread internal/metrics/metrics.go Outdated
if err != nil {
	// The no-op meter can't fail in practice; fall back to a zero-value
	// instruments struct rather than panicking from package init.
	noopSet = &instruments{}
Collaborator

This would cause a nil panic at runtime, right? It's kind of all the same if this is unreachable, but a panic at startup seems safer

Comment thread main.go
)

// Version is the build-time version string. Override at link time with
// -ldflags="-X main.Version=...".
Collaborator

We don't currently do this - as a followup, could you update the release workflow to include a version?

Member Author

ahh good catch, will do

… init failure

Two review-feedback fixes:

1. Default metrics export to whatever autoexport selects when
   OTEL_METRICS_EXPORTER is unset, rather than treating "unset" as off.
   This matches the OpenTelemetry SDK convention and means a worker
   dropped into an environment that already runs an OTLP collector picks
   it up automatically. Operators who don't run an OTLP collector and
   want to avoid periodic localhost push errors should set
   OTEL_METRICS_EXPORTER=none. README and the package doc comment are
   updated to spell out the new default.

2. The init() noop fallback used to install a zero-value `instruments`
   struct on the unreachable error path, which would nil-panic the first
   time a helper got called. Switch to an explicit panic at startup so
   any future change that breaks the no-op meter fails loud immediately
   rather than silently corrupting the runtime.
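The difference between the old zero-value fallback and the explicit panic in fix 2 can be sketched as follows. All names here (`newInstruments`, `mustInstruments`, `callPanics`, and the `fail` flag that simulates the otherwise unreachable error path) are hypothetical and exist only for this illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// counter is a hypothetical minimal instrument interface.
type counter interface{ Add(int64) }

type noopCounter struct{}

func (noopCounter) Add(int64) {}

type instruments struct{ tasksClaimed counter }

// newInstruments is a hypothetical constructor; fail simulates the error path.
func newInstruments(fail bool) (*instruments, error) {
	if fail {
		return nil, errors.New("instrument creation failed")
	}
	return &instruments{tasksClaimed: noopCounter{}}, nil
}

// mustInstruments panics at startup instead of returning a zero-value set
// whose nil interface fields would only panic later, on first use.
func mustInstruments(fail bool) *instruments {
	set, err := newInstruments(fail)
	if err != nil {
		panic(fmt.Sprintf("metrics: building no-op instruments: %v", err))
	}
	return set
}

// callPanics reports whether f panics.
func callPanics(f func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	// Old fallback: the zero-value struct looks fine until a helper uses it.
	zero := &instruments{}
	fmt.Println(callPanics(func() { zero.tasksClaimed.Add(1) }))

	// New behavior: the failure surfaces immediately, at construction time.
	fmt.Println(callPanics(func() { mustInstruments(true) }))

	// Happy path: the returned set is always safe to call.
	mustInstruments(false).tasksClaimed.Add(1)
}
```

Both failure modes end in a panic; the change only moves it from an arbitrary later call site to startup, where the cause is obvious.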

The previous TestInitDisabledByDefault is folded back into
TestInitNoneIsDisabled, since "" is no longer disabled.
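The resulting default can be summarized as a small decision function. This is a sketch of the behavior described in fix 1, not the package's `Init` code or autoexport's implementation: `resolveExporter` is a hypothetical helper, and it assumes an empty value is treated the same as an unset one, as the test note above suggests.

```go
package main

import (
	"fmt"
	"os"
)

// resolveExporter returns the effective exporter name and whether metrics
// export is enabled. lookup has the same shape as os.LookupEnv so tests can
// substitute a fake environment.
func resolveExporter(lookup func(string) (string, bool)) (name string, enabled bool) {
	v, ok := lookup("OTEL_METRICS_EXPORTER")
	if !ok || v == "" {
		// Unset or empty: defer to autoexport's default, which is OTLP.
		return "otlp", true
	}
	if v == "none" {
		// The only value that disables export.
		return "none", false
	}
	return v, true
}

func main() {
	name, enabled := resolveExporter(os.LookupEnv)
	fmt.Println(name, enabled)
}
```

Operators without an OTLP collector would set OTEL_METRICS_EXPORTER=none to take the disabled branch and avoid periodic localhost push errors.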

Co-Authored-By: Oz <oz-agent@warp.dev>

@ianhodge ianhodge merged commit 489d48f into main Apr 29, 2026
9 checks passed
@ianhodge ianhodge deleted the metrics/otel-autoexport branch April 29, 2026 17:06