|
1 | 1 | # Conductor Client Metrics |
2 | 2 |
|
3 | | -**Status: Incubating.** |
| 3 | +The `conductor-client-metrics` module provides Prometheus metrics for Java SDK clients and workers. It helps operators monitor worker polling, task execution, task result updates, payload sizes, workflow starts, and HTTP client latency. |
4 | 4 |
|
5 | | -Provides metrics and monitoring capabilities for Conductor clients. |
| 5 | +## Installation |
6 | 6 |
|
7 | | -It helps developers track the performance and health of their workers, offering insights into task execution times, error rates, and system throughput. |
| 7 | +Add the metrics module to the worker application: |
8 | 8 |
|
9 | | -As an incubating module, it's still under development and subject to changes. |
| 9 | +```groovy |
| 10 | +dependencies { |
| 11 | + implementation 'org.conductoross:conductor-client-metrics:5.1.0' |
| 12 | +} |
| 13 | +``` |
| 14 | + |
| 15 | +## Usage |
| 16 | + |
| 17 | +Create a `PrometheusMetricsCollector`, start the scrape server, and pass the collector to `ConductorClient.Builder`. All downstream clients and the task runner auto-register themselves as listeners. |
| 18 | + |
| 19 | +```java |
| 20 | +import com.netflix.conductor.client.metrics.prometheus.PrometheusMetricsCollector; |
| 21 | + |
| 22 | +PrometheusMetricsCollector metricsCollector = new PrometheusMetricsCollector(); |
| 23 | +metricsCollector.startServer(); // http://localhost:9991/metrics |
| 24 | + |
| 25 | +ConductorClient client = ConductorClient.builder() |
| 26 | + .basePath("http://conductor-server:8080/api") |
| 27 | + .withMetricsCollector(metricsCollector) |
| 28 | + .build(); |
| 29 | + |
| 30 | +TaskClient taskClient = new TaskClient(client); |
| 31 | +WorkflowClient workflowClient = new WorkflowClient(client); |
| 32 | + |
| 33 | +TaskRunnerConfigurer configurer = new TaskRunnerConfigurer.Builder(taskClient, workers) // workers: the application's Worker instances |
| 34 | + .withThreadCount(10) |
| 35 | + .build(); |
| 36 | +configurer.init(); |
| 37 | +``` |
| 38 | + |
| 39 | +`startServer()` also accepts `(port, endpoint)` for custom scrape configurations. The client builder accepts the usual timeout, SSL, authentication, and other options alongside `withMetricsCollector`; none of them changes how metrics are wired. |
| 40 | + |
| 41 | +### How Auto-Registration Works |
| 42 | + |
| 43 | +When a `MetricsCollector` is passed to `ConductorClient.Builder.withMetricsCollector()`: |
| 44 | + |
| 45 | +1. The `ConductorClient` installs an OkHttp interceptor that records `http_api_client_request_seconds`, `task_result_size_bytes`, and `workflow_input_size_bytes`. |
| 46 | +2. `TaskClient` detects the collector from the `ConductorClient` it receives and calls `registerListener` and `registerTaskRunnerListener` on itself. |
| 47 | +3. `WorkflowClient` detects the collector from the `ConductorClient` it receives and calls `registerListener` on itself. |
| 48 | +4. `TaskRunnerConfigurer.Builder.build()` detects the collector from the `TaskClient`'s `ConductorClient` and registers task-runner events automatically, unless `withMetricsCollector` was called explicitly on the builder. |
| 49 | + |
| 50 | +All registrations are idempotent. If you call both `withMetricsCollector` on the builder and `registerListener` manually with the same collector, events are not duplicated. |
| 51 | + |
| 52 | +The collector exposes Prometheus text format from the embedded HTTP server. Metrics are created lazily, so a metric family appears after the corresponding worker or client event has occurred. |
| 53 | + |
| 54 | +### Manual Wiring |
| 55 | + |
| 56 | +For advanced use cases where you need fine-grained control over which listeners are registered where, or you want to mix the metrics collector with custom event listeners, create the `ConductorClient` without `withMetricsCollector` and register listeners explicitly: |
| 57 | + |
| 58 | +```java |
| 59 | +import com.netflix.conductor.client.metrics.prometheus.PrometheusMetricsCollector; |
| 60 | + |
| 61 | +PrometheusMetricsCollector metricsCollector = new PrometheusMetricsCollector(); |
| 62 | +metricsCollector.startServer(); // http://localhost:9991/metrics |
| 63 | + |
| 64 | +ConductorClient client = ConductorClient.builder() |
| 65 | + .basePath("http://conductor-server:8080/api") |
| 66 | + .build(); |
| 67 | + |
| 68 | +TaskClient taskClient = new TaskClient(client); |
| 69 | +taskClient.registerListener(metricsCollector); |
| 70 | +taskClient.registerTaskRunnerListener(metricsCollector); |
| 71 | + |
| 72 | +TaskRunnerConfigurer configurer = new TaskRunnerConfigurer.Builder(taskClient, workers) |
| 73 | + .withThreadCount(10) |
| 74 | + .withMetricsCollector(metricsCollector) |
| 75 | + .build(); |
| 76 | + |
| 77 | +configurer.init(); |
| 78 | + |
| 79 | +WorkflowClient workflowClient = new WorkflowClient(client); |
| 80 | +workflowClient.registerListener(metricsCollector); |
| 81 | +``` |
| 82 | + |
| 83 | +Note that manual wiring does not install the OkHttp interceptor for `http_api_client_request_seconds`, `task_result_size_bytes`, or `workflow_input_size_bytes`. Use `withMetricsCollector` on the builder for those metrics. |
| 84 | + |
| 85 | +### Event Dispatch Threading |
| 86 | + |
| 87 | +Events are dispatched asynchronously on a single shared daemon thread (`conductor-event-dispatch`). This avoids contention with the application's `ForkJoinPool.commonPool()`. Metrics collector listeners (counter increments, timer recordings) are lock-free and sub-microsecond, so the single thread keeps up under normal load. Custom listeners registered via `EventDispatcher.register()` must be non-blocking; a slow listener will delay delivery of all events across the process. |
| 88 | + |
| 89 | +## Metrics Catalog |
| 90 | + |
| 91 | +Time metrics are recorded in seconds and size metrics in bytes; each kind uses a fixed set of bucket boundaries, listed in its section below. Exception labels use bounded exception type names, not exception messages or stack traces. |
| 92 | + |
| 93 | +### Counters |
| 94 | + |
| 95 | +| Meter | Labels | Meaning | |
| 96 | +|---|---|---| |
| 97 | +| `task_poll_total` | `taskType` | Incremented each time a worker issues a poll request. | |
| 98 | +| `task_execution_started_total` | `taskType` | Incremented when a polled task is dispatched to the worker function. | |
| 99 | +| `task_poll_error_total` | `taskType`, `exception` | Incremented when polling fails with a client-side exception. | |
| 100 | +| `task_execute_error_total` | `taskType`, `exception` | Incremented when worker code throws while executing a task. | |
| 101 | +| `task_update_error_total` | `taskType`, `exception` | Incremented when reporting a task result back to Conductor fails. | |
| 102 | +| `task_ack_failed_total` | `taskType` | Incremented when an explicit task ack response is unsuccessful. The internal task runner uses batch poll responses as ack and may not emit this during normal polling. | |
| 103 | +| `task_ack_error_total` | `taskType`, `exception` | Incremented when an explicit task ack call throws. The internal task runner uses batch poll responses as ack and may not emit this during normal polling. | |
| 104 | +| `task_execution_queue_full_total` | `taskType` | Incremented when a poll cycle is skipped because all worker threads are busy (zero permits available). | |
| 105 | +| `task_paused_total` | `taskType` | Incremented when a worker is paused and skips acting on a poll. | |
| 106 | +| `thread_uncaught_exceptions_total` | `exception` | Incremented when a worker thread raises an uncaught exception. | |
| 107 | +| `external_payload_used_total` | `entityName`, `operation`, `payloadType` | Incremented when external payload storage is used for task or workflow payloads. | |
| 108 | +| `workflow_start_error_total` | `workflowType`, `exception` | Incremented when starting a workflow fails client-side. | |
| 109 | + |
| 110 | +### Time Metrics |
| 111 | + |
| 112 | +| Meter | Labels | Meaning | |
| 113 | +|---|---|---| |
| 114 | +| `task_poll_time_seconds` | `taskType`, `status` | Poll request latency. `status` is `SUCCESS` or `FAILURE`. | |
| 115 | +| `task_execute_time_seconds` | `taskType`, `status` | Worker function execution latency. `status` is `SUCCESS` or `FAILURE`. | |
| 116 | +| `task_update_time_seconds` | `taskType`, `status` | Latency for reporting a task result back to Conductor. `status` is `SUCCESS` or `FAILURE`. | |
| 117 | +| `http_api_client_request_seconds` | `method`, `uri`, `status` | Latency of HTTP requests made by the API client. `status` is the HTTP status code as a string, or `0` when no response status is available. | |
| 118 | + |
| 119 | +Time metrics use these service-level objective buckets, in seconds: |
| 120 | + |
| 121 | +```text |
| 122 | +0.001, 0.005, 0.010, 0.025, 0.050, 0.100, 0.250, 0.500, 1, 2.5, 5, 10 |
| 123 | +``` |
| 124 | + |
| 125 | +The `uri` label for `http_api_client_request_seconds` uses the path template (e.g. `/workflow/{workflowId}`, `/tasks/poll/batch/{taskType}`) rather than the resolved path. This keeps the label space bounded regardless of how many unique workflow or task IDs are processed. |
| 126 | + |
| 127 | +### Size Metrics |
| 128 | + |
| 129 | +| Meter | Labels | Meaning | |
| 130 | +|---|---|---| |
| 131 | +| `task_result_size_bytes` | `taskType` | Serialized task result output size, captured from `RequestBody.contentLength()` of the outbound `POST /tasks` (or `POST /tasks/update-v2`) request. `taskType` is empty when the caller used the single-argument `TaskClient.updateTask(TaskResult)` overload. | |
| 132 | +| `workflow_input_size_bytes` | `workflowType`, `version` | Serialized workflow input size, captured from `RequestBody.contentLength()` of the outbound `POST /workflow` request. `version` is an empty string when the workflow version is absent. | |
| 133 | + |
| 134 | +Both histograms are populated at wire time by the `ApiClientMetrics` OkHttp interceptor, reading a `PayloadKind` tag attached by `TaskClient`/`WorkflowClient`. The byte count is read off the request body the HTTP layer is about to send, so no extra JSON serialization is needed. |
| 135 | + |
| 136 | +Size metrics use these service-level objective buckets, in bytes: |
| 137 | + |
| 138 | +```text |
| 139 | +100, 1000, 10000, 100000, 1000000, 10000000 |
| 140 | +``` |
| 141 | + |
| 142 | +### Gauges |
| 143 | + |
| 144 | +| Meter | Labels | Meaning | |
| 145 | +|---|---|---| |
| 146 | +| `active_workers` | `taskType` | Current number of worker threads actively executing tasks. | |
| 147 | + |
| 148 | +### Micrometer `_max` Sidecars |
| 149 | + |
| 150 | +Micrometer publishes a `*_max` gauge alongside every Timer and DistributionSummary. These appear in scrape output as, for example, `task_poll_time_seconds_max` and `task_result_size_bytes_max`. The `_max` value tracks the largest observation within a recent time window that Micrometer rolls over, not an all-time maximum. It is a Micrometer artifact rather than part of the metric catalog; it is harmless, and dashboards that don't use it can ignore it. |
| 151 | + |
| 152 | +## Labels |
| 153 | + |
| 154 | +| Label | Used by | Values | |
| 155 | +|---|---|---| |
| 156 | +| `taskType` | Worker metrics | Task definition name. | |
| 157 | +| `workflowType` | Workflow metrics | Workflow definition name. | |
| 158 | +| `version` | `workflow_input_size_bytes` | Workflow version as a string. Empty string when the version is absent. | |
| 159 | +| `status` | Task time metrics, `http_api_client_request_seconds` | For task time metrics, `SUCCESS` or `FAILURE`. For `http_api_client_request_seconds`, the HTTP status code as a string, or `0` when no response status is available. | |
| 160 | +| `exception` | Error counters | Exception type name, such as `SocketTimeoutException`. | |
| 161 | +| `entityName` | `external_payload_used_total` | Task type or workflow name associated with the external payload. | |
| 162 | +| `operation` | `external_payload_used_total` | External payload operation, such as `READ` or `WRITE`. | |
| 163 | +| `payloadType` | `external_payload_used_total` | Payload type, such as `TASK_INPUT`, `TASK_OUTPUT`, `WORKFLOW_INPUT`, or `WORKFLOW_OUTPUT`. | |
| 164 | +| `method` | HTTP metrics | HTTP verb. | |
| 165 | +| `uri` | HTTP metrics | Path template from the Java HTTP client (e.g. `/workflow/{workflowId}`). Resolved identifiers are not included for templated requests, which keeps cardinality bounded; requests without a template fall back to the resolved path. | |
| 166 | + |
| 167 | +## Troubleshooting |
| 168 | + |
| 169 | +### Metrics Are Empty |
| 170 | + |
| 171 | +- Verify that the collector is wired into the client. The simplest check: was `withMetricsCollector` called on `ConductorClient.Builder`? |
| 172 | +- Verify workers have polled or executed tasks. Metrics are created lazily when the relevant event occurs. |
| 173 | +- Confirm the scrape endpoint is reachable at the expected host and port. |
| 174 | + |
| 175 | +### Missing HTTP or Size Metrics |
| 176 | + |
| 177 | +- `http_api_client_request_seconds` requires the HTTP interceptor, which is installed automatically when `withMetricsCollector` is called on the builder. |
| 178 | +- `task_result_size_bytes` and `workflow_input_size_bytes` likewise require the HTTP interceptor -- they are recorded at wire time from `RequestBody.contentLength()` for requests tagged with a `PayloadKind`. |
| 179 | +- `task_ack_failed_total` and `task_ack_error_total` require `taskClient.registerTaskRunnerListener(metricsCollector)`. This is automatic when using `withMetricsCollector` on the builder. |
| 180 | + |
| 181 | +### High Cardinality |
| 182 | + |
| 183 | +- The `uri` label on `http_api_client_request_seconds` uses the path template, so it is bounded by the number of distinct API endpoints (not by request volume or unique IDs). For requests that are not tagged with a template, the interceptor falls back to the resolved path, whose label values may be unbounded. |
| 184 | +- Avoid embedding user identifiers or unbounded values in task type, workflow type, or external payload labels. |
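
A note on reading the time buckets listed under Time Metrics: Prometheus histogram buckets are cumulative, so each `le` bucket counts every observation at or below its upper bound. The sketch below illustrates that counting rule with plain Java. It is not how the collector records values (Micrometer handles that internally); the `BucketSketch` class and the sample latencies are hypothetical and only the bucket boundaries come from the documentation above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of conductor-client-metrics.
class BucketSketch {
    // Time bucket upper bounds in seconds, copied from the Time Metrics section.
    static final double[] TIME_BUCKETS =
            {0.001, 0.005, 0.010, 0.025, 0.050, 0.100, 0.250, 0.500, 1, 2.5, 5, 10};

    // Returns the cumulative count per upper bound, plus a final "+Inf" bucket
    // that always equals the total number of observations.
    static Map<String, Long> cumulativeCounts(double[] bounds, double[] observations) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (double bound : bounds) {
            long n = 0;
            for (double value : observations) {
                if (value <= bound) {
                    n++;
                }
            }
            counts.put(String.valueOf(bound), n);
        }
        counts.put("+Inf", (long) observations.length);
        return counts;
    }

    public static void main(String[] args) {
        // Made-up poll latencies in seconds, for illustration only.
        double[] pollLatencies = {0.003, 0.040, 0.700, 12.0};
        cumulativeCounts(TIME_BUCKETS, pollLatencies)
                .forEach((le, n) -> System.out.println("le=" + le + " count=" + n));
    }
}
```

An observation above the largest bound (12 seconds here) lands only in `+Inf`, which is why a gap between the `le="10"` count and the `+Inf` count is a quick signal that some requests exceeded ten seconds.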