Skip to content

Commit 2366dd5

Browse files
committed
fixes
1 parent 9129b07 commit 2366dd5

20 files changed

Lines changed: 3209 additions & 250 deletions

File tree

AGENTS.md

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -86,6 +86,9 @@ After every code change, you **must** run the following before considering the w
8686
npx ts-node -P tsconfig.json --transpile-only examples/workflow-ops.ts
8787
npx ts-node -P tsconfig.json --transpile-only examples/workers-e2e.ts
8888
npx ts-node -P tsconfig.json --transpile-only examples/perf-test.ts
89+
npx ts-node -P tsconfig.json --transpile-only examples/advanced/human-tasks.ts
90+
npx ts-node -P tsconfig.json --transpile-only examples/api-journeys/applications.ts
91+
npx ts-node -P tsconfig.json --transpile-only examples/api-journeys/event-handlers.ts
8992
```
9093

9194
Do not skip any example. If an example fails for reasons unrelated to your change (e.g., server down), note it explicitly.
@@ -204,6 +207,23 @@ public async someMethod(args): Promise<T> {
204207
- Client classes: `FooClient` (PascalCase + "Client" suffix)
205208
- Enums from LLM types: `Role.USER`, `LLMProvider.OPEN_AI` (use enum values, not raw strings)
206209

210+
## Documentation Maintenance
211+
212+
### Metrics Documentation (METRICS.md)
213+
214+
When adding, removing, or renaming metrics in `src/sdk/worker/metrics/MetricsCollector.ts`:
215+
1. Update `METRICS.md` to reflect the change (name, type, labels, description)
216+
2. Ensure both `MetricsCollector.toPrometheusText()` and `PrometheusRegistry.createMetrics()` are updated in sync — missing a summary/counter in either causes silent data loss
217+
3. Update the metric count in the METRICS.md overview section
218+
4. Add or update the corresponding direct recording method documentation if applicable
219+
220+
### SDK_NEW_LANGUAGE_GUIDE.md
221+
222+
When adding new client methods, builders, worker features, or examples:
223+
1. Update the relevant feature accounting table in `SDK_NEW_LANGUAGE_GUIDE.md` (Section 4)
224+
2. Update method/feature counts in the Phase 4 client inventory
225+
3. If adding a new metric, update Section 5.6 (Metrics Collector) tables
226+
207227
## Test Coverage
208228

209229
| Component | Unit | E2E | Total |

METRICS.md

Lines changed: 259 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,259 @@
1+
# Metrics Reference
2+
3+
The Conductor JavaScript SDK provides built-in Prometheus metrics for monitoring worker performance, API latency, and task execution.
4+
5+
## Overview
6+
7+
`MetricsCollector` implements `TaskRunnerEventsListener` and records **18 metric types** (12 counters + 6 summaries). Metrics are exposed in [Prometheus exposition format](https://prometheus.io/docs/instrumenting/exposition_formats/).
8+
9+
- **Default prefix:** `conductor_worker`
10+
- **Quantiles:** p50, p75, p90, p95, p99 (computed from a sliding window)
11+
- **Sliding window:** Last 1,000 observations (configurable)
12+
13+
## Quick Start
14+
15+
### HTTP Server
16+
17+
```typescript
18+
import { MetricsCollector, MetricsServer, TaskHandler } from "@io-orkes/conductor-javascript";
19+
20+
const metrics = new MetricsCollector({ httpPort: 9090 });
21+
22+
const handler = new TaskHandler({
23+
client,
24+
scanForDecorated: true,
25+
eventListeners: [metrics],
26+
});
27+
28+
await handler.startWorkers();
29+
// GET http://localhost:9090/metrics — Prometheus text format
30+
// GET http://localhost:9090/health — { "status": "UP" }
31+
```
32+
33+
### File Output
34+
35+
```typescript
36+
const metrics = new MetricsCollector({
37+
filePath: "/tmp/conductor_metrics.prom",
38+
fileWriteIntervalMs: 10000, // write every 10s
39+
});
40+
```
41+
42+
The file writer performs an immediate first write, then writes periodically at the configured interval. The timer is unreferenced so it does not prevent Node.js process exit.
43+
44+
### prom-client Integration
45+
46+
```typescript
47+
const metrics = new MetricsCollector({ usePromClient: true });
48+
// Metrics are registered in prom-client's default registry.
49+
// Use prom-client's register.metrics() for native scraping.
50+
```
51+
52+
Requires `npm install prom-client`. Falls back to built-in text format if not installed.
53+
54+
### All-in-One
55+
56+
```typescript
57+
const metrics = new MetricsCollector({
58+
prefix: "myapp_worker",
59+
httpPort: 9090,
60+
filePath: "/tmp/metrics.prom",
61+
fileWriteIntervalMs: 10000,
62+
slidingWindowSize: 500,
63+
usePromClient: true,
64+
});
65+
```
66+
67+
## Configuration
68+
69+
| Option | Type | Default | Description |
70+
|--------|------|---------|-------------|
71+
| `prefix` | `string` | `"conductor_worker"` | Prometheus metric name prefix |
72+
| `httpPort` | `number` || Start built-in HTTP server on this port |
73+
| `filePath` | `string` || Periodically write metrics to this file path |
74+
| `fileWriteIntervalMs` | `number` | `5000` | File write interval in milliseconds |
75+
| `slidingWindowSize` | `number` | `1000` | Max observations kept for quantile calculation |
76+
| `usePromClient` | `boolean` | `false` | Use `prom-client` for native Prometheus integration |
77+
78+
---
79+
80+
## Counter Metrics
81+
82+
### Labeled by `task_type`
83+
84+
| Prometheus Name | Internal Key | Description |
85+
|----------------|-------------|-------------|
86+
| `{prefix}_task_poll_total` | `pollTotal` | Total number of task polls initiated |
87+
| `{prefix}_task_poll_error_total` | `pollErrorTotal` | Total number of failed task polls |
88+
| `{prefix}_task_execute_total` | `taskExecutionTotal` | Total number of task executions completed |
89+
| `{prefix}_task_execute_error_total` | `taskExecutionErrorTotal` | Total task execution errors. Label format: `taskType:ExceptionName` |
90+
| `{prefix}_task_update_error_total` | `taskUpdateFailureTotal` | Total task result update failures (result lost from Conductor) |
91+
| `{prefix}_task_ack_error_total` | `taskAckErrorTotal` | Total task acknowledgement errors |
92+
| `{prefix}_task_execution_queue_full_total` | `taskExecutionQueueFullTotal` | Times the execution queue was full (concurrency limit reached) |
93+
| `{prefix}_task_paused_total` | `taskPausedTotal` | Total task paused events |
94+
95+
### Labeled by `payload_type`
96+
97+
| Prometheus Name | Internal Key | Description |
98+
|----------------|-------------|-------------|
99+
| `{prefix}_external_payload_used_total` | `externalPayloadUsedTotal` | External payload storage usage (e.g., `"workflow_input"`, `"task_output"`) |
100+
101+
### Global (no labels)
102+
103+
| Prometheus Name | Internal Key | Description |
104+
|----------------|-------------|-------------|
105+
| `{prefix}_thread_uncaught_exceptions_total` | `uncaughtExceptionTotal` | Total uncaught exceptions in worker processes |
106+
| `{prefix}_worker_restart_total` | `workerRestartTotal` | Total worker restart events |
107+
| `{prefix}_workflow_start_error_total` | `workflowStartErrorTotal` | Total workflow start errors |
108+
109+
---
110+
111+
## Summary Metrics
112+
113+
Each summary emits quantile values, a count, and a sum:
114+
115+
```
116+
{name}{task_type="myTask",quantile="0.5"} 12.3
117+
{name}{task_type="myTask",quantile="0.75"} 15.1
118+
{name}{task_type="myTask",quantile="0.9"} 18.7
119+
{name}{task_type="myTask",quantile="0.95"} 22.0
120+
{name}{task_type="myTask",quantile="0.99"} 45.2
121+
{name}_count{task_type="myTask"} 1000
122+
{name}_sum{task_type="myTask"} 14523.7
123+
```
124+
125+
### Labeled by `task_type`
126+
127+
| Prometheus Name | Internal Key | Unit | Description |
128+
|----------------|-------------|------|-------------|
129+
| `{prefix}_task_poll_time` | `pollDurationMs` | ms | Task poll round-trip duration |
130+
| `{prefix}_task_execute_time` | `executionDurationMs` | ms | Worker function execution duration |
131+
| `{prefix}_task_update_time` | `updateDurationMs` | ms | Task result update (SDK to server) duration |
132+
| `{prefix}_task_result_size_bytes` | `outputSizeBytes` | bytes | Task result output payload size |
133+
134+
### Labeled by `workflow_type`
135+
136+
| Prometheus Name | Internal Key | Unit | Description |
137+
|----------------|-------------|------|-------------|
138+
| `{prefix}_workflow_input_size_bytes` | `workflowInputSizeBytes` | bytes | Workflow input payload size |
139+
140+
### Labeled by `endpoint`
141+
142+
| Prometheus Name | Internal Key | Unit | Description |
143+
|----------------|-------------|------|-------------|
144+
| `{prefix}_http_api_client_request` | `apiRequestDurationMs` | ms | API request duration. Label format: `METHOD:/api/path:STATUS` |
145+
146+
---
147+
148+
## Event Listener Methods
149+
150+
These methods are called automatically by the `TaskRunner` when `MetricsCollector` is registered as an event listener:
151+
152+
| Method | Metrics Updated |
153+
|--------|----------------|
154+
| `onPollStarted(event)` | Increments `pollTotal` |
155+
| `onPollCompleted(event)` | Records `pollDurationMs` |
156+
| `onPollFailure(event)` | Increments `pollErrorTotal`, records `pollDurationMs` |
157+
| `onTaskExecutionStarted(event)` | _(no-op, counted on completion)_ |
158+
| `onTaskExecutionCompleted(event)` | Increments `taskExecutionTotal`, records `executionDurationMs` and `outputSizeBytes` |
159+
| `onTaskExecutionFailure(event)` | Increments `taskExecutionErrorTotal`, records `executionDurationMs` |
160+
| `onTaskUpdateCompleted(event)` | Records `updateDurationMs` |
161+
| `onTaskUpdateFailure(event)` | Increments `taskUpdateFailureTotal` |
162+
163+
## Direct Recording Methods
164+
165+
For metrics outside the event listener system, call these methods directly:
166+
167+
```typescript
168+
const collector = new MetricsCollector();
169+
170+
collector.recordTaskExecutionQueueFull("my_task");
171+
collector.recordUncaughtException();
172+
collector.recordWorkerRestart();
173+
collector.recordTaskPaused("my_task");
174+
collector.recordTaskAckError("my_task");
175+
collector.recordWorkflowStartError();
176+
collector.recordExternalPayloadUsed("task_output");
177+
collector.recordWorkflowInputSize("my_workflow", 2048);
178+
collector.recordApiRequestTime("POST", "/api/tasks", 200, 35);
179+
```
180+
181+
## Exposition Formats
182+
183+
### Built-in Prometheus Text
184+
185+
```typescript
186+
const text = collector.toPrometheusText();
187+
// Returns Prometheus text format (text/plain; version=0.0.4)
188+
```
189+
190+
### Async (with prom-client support)
191+
192+
```typescript
193+
const text = await collector.toPrometheusTextAsync();
194+
// Uses prom-client registry when available, falls back to built-in
195+
```
196+
197+
### HTTP Server (MetricsServer)
198+
199+
```typescript
200+
import { MetricsServer } from "@io-orkes/conductor-javascript";
201+
202+
const server = new MetricsServer(collector, 9090);
203+
await server.start();
204+
// GET /metrics — Content-Type from collector.getContentType()
205+
// GET /health — { "status": "UP" }
206+
await server.stop();
207+
```
208+
209+
### File Output
210+
211+
Configured via `filePath` in `MetricsCollectorConfig`. Writes `toPrometheusText()` output to disk. The file writer performs an immediate first write on construction, then writes periodically at the configured interval.
212+
213+
---
214+
215+
## Sliding Window and Quantile Calculation
216+
217+
Summary metrics use a **sliding window** (default: 1,000 observations) to calculate percentiles. This provides:
218+
219+
- Accurate recent percentiles without unbounded memory growth
220+
- No need to pre-configure histogram bucket boundaries
221+
- Direct percentile values without interpolation artifacts
222+
223+
Quantiles are computed on-demand using linear interpolation on sorted observations when `toPrometheusText()` is called.
224+
225+
When using `prom-client` (`usePromClient: true`), summaries use prom-client's native implementation with `maxAgeSeconds: 600` and `ageBuckets: 5`.
226+
227+
---
228+
229+
## Monitoring Best Practices
230+
231+
- **Use p95/p99 for SLO monitoring** rather than averages. Percentile-based thresholds better capture user-impacting performance variations.
232+
- **Alert on `task_update_error_total`** — a rising count indicates task results are being lost and workers are failing to report back to the Conductor server.
233+
- **Alert on `task_execution_queue_full_total`** — indicates the concurrency limit is consistently reached. Consider increasing worker `concurrency`.
234+
- **Monitor `task_poll_time` p99** — high poll latency suggests network issues or server overload.
235+
- **Monitor `task_execute_time` p95** — watch for execution time regression in worker functions.
236+
- **File output interval**: 10-60 seconds recommended for production. Lower intervals increase disk I/O.
237+
- **Clean metrics directory on startup** when using file output with multiprocess workers to avoid stale data.
238+
239+
---
240+
241+
## Programmatic Access
242+
243+
```typescript
244+
const metrics = collector.getMetrics();
245+
246+
// Counter values
247+
metrics.pollTotal.get("my_task"); // number
248+
metrics.taskExecutionTotal.get("my_task"); // number
249+
250+
// Summary observations (raw array)
251+
metrics.pollDurationMs.get("my_task"); // number[]
252+
metrics.executionDurationMs.get("my_task"); // number[]
253+
254+
// Reset all metrics
255+
collector.reset();
256+
257+
// Stop file writer and HTTP server
258+
await collector.stop();
259+
```

0 commit comments

Comments
 (0)