Open
25 commits
a3b0703
implement canonical metrics, feature flagged to preserve legacy where…
chrishagglund-ship-it Apr 29, 2026
083f173
standardize on what truthiness means for env var
chrishagglund-ship-it Apr 29, 2026
be00e08
add some tests for new stuff
chrishagglund-ship-it May 5, 2026
2384d2d
delint
chrishagglund-ship-it May 5, 2026
0b082d6
add retry for 502,503,504 error codes
chrishagglund-ship-it May 5, 2026
396fcc5
experiments with flaky test runs ...
chrishagglund-ship-it May 5, 2026
bf71a1e
remove editorialization
chrishagglund-ship-it May 5, 2026
c8c4e9f
cleaner reporting on which metrics implementation is used in the test…
chrishagglund-ship-it May 6, 2026
c3da927
updates for metrics related documentation
chrishagglund-ship-it May 6, 2026
19773f8
add or update a changelog
chrishagglund-ship-it May 7, 2026
6cd069f
adjustments to internal mechanics
chrishagglund-ship-it May 15, 2026
db1fd4c
harness improvements for generating traffic with uris in metrics
chrishagglund-ship-it May 15, 2026
cb97171
harness improvements for generating traffic with uris in metrics .. c…
chrishagglund-ship-it May 15, 2026
24b162c
revisions to internal mechanics for http request metrics
chrishagglund-ship-it May 18, 2026
c8e97af
update to changelog and metrics docs
chrishagglund-ship-it May 18, 2026
5b0becd
address self-review concerns regarding backward compatibility
chrishagglund-ship-it May 18, 2026
7a4c991
address backward compat concerned from self review
chrishagglund-ship-it May 18, 2026
e4d8e67
trying to settle on a better implementation of retry fetch wrapper wi…
chrishagglund-ship-it May 18, 2026
a20cc3f
fix for label defect in canonical metrics
chrishagglund-ship-it May 18, 2026
f42bec9
give 1 flaky integration test more time and log it's progress
chrishagglund-ship-it May 19, 2026
01ca97f
test out pre-existing prom-client renderer
chrishagglund-ship-it May 19, 2026
29b71fb
testing integration tests against alternate servers
chrishagglund-ship-it May 21, 2026
af0c97d
adjust notes around integration testing. add another test server - no…
chrishagglund-ship-it May 22, 2026
58990af
make http bin url dynamic b/c different clusters have it on different…
chrishagglund-ship-it May 22, 2026
ded2ce2
add new constants file that can grab from env before setting constant
chrishagglund-ship-it May 22, 2026
5 changes: 4 additions & 1 deletion .env.example
@@ -6,4 +6,7 @@ CONDUCTOR_AUTH_SECRET=""
CONDUCTOR_MAX_HTTP2_CONNECTIONS=

CONDUCTOR_TLS_INSECURE=
CONDUCTOR_DISABLE_HTTP2=
CONDUCTOR_DISABLE_HTTP2=

# Hostname of the httpbin service reachable from the Conductor server (default: httpbin-server)
HTTPBIN_SERVICE_HOSTNAME=
183 changes: 180 additions & 3 deletions .github/workflows/pull_request.yml
@@ -79,14 +79,15 @@ jobs:
name: codecov-unit-node-${{ matrix.node-version }}
fail_ci_if_error: false

# Integration tests (v5): one job at a time (max-parallel: 1) to avoid 502/503 on shared Conductor.
# Sharding (--shard i/N) splits the suite so each job runs ~1/N of tests — keeps per-job under timeout.
# Integration tests (v5): lower max-parallel reduces 502/503 from the shared Conductor server
# but makes CI slower without eliminating flakes entirely — feel free to experiment.
# Sharding (--shard i/N) splits the suite so each job runs ~1/N of tests.
integration-tests:
runs-on: ubuntu-latest
timeout-minutes: 25
strategy:
fail-fast: false
max-parallel: 2
max-parallel: 3
matrix:
node-version: [20, 22, 24]
shard: [1, 2, 3]
@@ -118,6 +119,7 @@ jobs:
CONDUCTOR_AUTH_KEY: ${{ secrets.AUTH_KEY }}
CONDUCTOR_AUTH_SECRET: ${{ secrets.AUTH_SECRET }}
CONDUCTOR_REQUEST_TIMEOUT_MS: "120000"
CONDUCTOR_RETRY_SERVER_ERRORS: "true"
JEST_JUNIT_OUTPUT_NAME: integration-v5-node-${{ matrix.node-version }}-shard-${{ matrix.shard }}-test-results.xml
- name: Publish Test Results
uses: dorny/test-reporter@v2
@@ -136,6 +138,122 @@ jobs:
name: codecov-integration-v5-node-${{ matrix.node-version }}-shard-${{ matrix.shard }}
fail_ci_if_error: false

# Integration tests (v5 sdkdev): mirrors integration-tests but targets the sdkdev environment.
integration-tests-v5-sdkdev:
runs-on: ubuntu-latest
timeout-minutes: 25
strategy:
fail-fast: false
max-parallel: 3
matrix:
node-version: [20, 22, 24]
shard: [1, 2, 3]
name: Node.js v${{ matrix.node-version }} - integration v5 sdkdev (shard ${{ matrix.shard }}/3)
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Set up Node
uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node-version }}
cache: "npm"
- name: Cache node_modules
id: cache
uses: actions/cache@v4
with:
path: node_modules
key: npm-${{ matrix.node-version }}-${{ hashFiles('package-lock.json') }}
restore-keys: |
npm-${{ matrix.node-version }}-
- name: Install Dependencies
if: steps.cache.outputs.cache-hit != 'true'
run: npm ci
- name: Run integration tests (v5 sdkdev) shard ${{ matrix.shard }}/3
run: npm run test:integration:v5 -- --ci --coverage --runInBand --testTimeout=120000 --shard=${{ matrix.shard }}/3 --reporters=default --reporters=github-actions --reporters=jest-junit
env:
ORKES_BACKEND_VERSION: "5"
CONDUCTOR_SERVER_URL: ${{ vars.SDKDEV_V5_SERVER_URL }}
CONDUCTOR_AUTH_KEY: ${{ vars.SDKDEV_V5_AUTH_KEY }}
CONDUCTOR_AUTH_SECRET: ${{ secrets.SDKDEV_V5_AUTH_SECRET }}
CONDUCTOR_REQUEST_TIMEOUT_MS: "120000"
CONDUCTOR_RETRY_SERVER_ERRORS: "true"
HTTPBIN_SERVICE_HOSTNAME: "httpbin"
JEST_JUNIT_OUTPUT_NAME: integration-v5-sdkdev-node-${{ matrix.node-version }}-shard-${{ matrix.shard }}-test-results.xml
- name: Publish Test Results
uses: dorny/test-reporter@v2
if: ${{ !cancelled() }}
with:
name: integration v5 sdkdev Node ${{ matrix.node-version }} shard ${{ matrix.shard }}/3
path: reports/integration-v5-sdkdev-node-${{ matrix.node-version }}-shard-${{ matrix.shard }}-test-results.xml
reporter: jest-junit
- name: Upload coverage to Codecov
if: always()
uses: codecov/codecov-action@v4
with:
token: ${{ secrets.CODECOV_TOKEN }}
files: ./coverage/lcov.info
flags: integration-v5-sdkdev
name: codecov-integration-v5-sdkdev-node-${{ matrix.node-version }}-shard-${{ matrix.shard }}
fail_ci_if_error: false

# Integration tests (v5 sm): mirrors integration-tests but targets the SiliconMint v5 environment.
integration-tests-v5-sm:
runs-on: ubuntu-latest
timeout-minutes: 25
strategy:
fail-fast: false
max-parallel: 3
matrix:
node-version: [20, 22, 24]
shard: [1, 2, 3]
name: Node.js v${{ matrix.node-version }} - integration v5 sm (shard ${{ matrix.shard }}/3)
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Set up Node
uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node-version }}
cache: "npm"
- name: Cache node_modules
id: cache
uses: actions/cache@v4
with:
path: node_modules
key: npm-${{ matrix.node-version }}-${{ hashFiles('package-lock.json') }}
restore-keys: |
npm-${{ matrix.node-version }}-
- name: Install Dependencies
if: steps.cache.outputs.cache-hit != 'true'
run: npm ci
- name: Run integration tests (v5 sm) shard ${{ matrix.shard }}/3
run: npm run test:integration:v5 -- --ci --coverage --runInBand --testTimeout=120000 --shard=${{ matrix.shard }}/3 --reporters=default --reporters=github-actions --reporters=jest-junit
env:
ORKES_BACKEND_VERSION: "5"
CONDUCTOR_SERVER_URL: ${{ vars.SM_V5_SERVER_URL }}
CONDUCTOR_AUTH_KEY: ${{ vars.SM_V5_AUTH_KEY }}
CONDUCTOR_AUTH_SECRET: ${{ secrets.SM_V5_AUTH_SECRET }}
CONDUCTOR_REQUEST_TIMEOUT_MS: "120000"
CONDUCTOR_RETRY_SERVER_ERRORS: "true"
HTTPBIN_SERVICE_HOSTNAME: "httpbin"
JEST_JUNIT_OUTPUT_NAME: integration-v5-sm-node-${{ matrix.node-version }}-shard-${{ matrix.shard }}-test-results.xml
- name: Publish Test Results
uses: dorny/test-reporter@v2
if: ${{ !cancelled() }}
with:
name: integration v5 sm Node ${{ matrix.node-version }} shard ${{ matrix.shard }}/3
path: reports/integration-v5-sm-node-${{ matrix.node-version }}-shard-${{ matrix.shard }}-test-results.xml
reporter: jest-junit
- name: Upload coverage to Codecov
if: always()
uses: codecov/codecov-action@v4
with:
token: ${{ secrets.CODECOV_TOKEN }}
files: ./coverage/lcov.info
flags: integration-v5-sm
name: codecov-integration-v5-sm-node-${{ matrix.node-version }}-shard-${{ matrix.shard }}
fail_ci_if_error: false

# Integration tests (v4): same sharding as v5. v4 fails in CI (passes locally); do not block PRs.
integration-tests-v4:
runs-on: ubuntu-latest
@@ -175,6 +293,7 @@ jobs:
CONDUCTOR_AUTH_KEY: ${{ secrets.AUTH_KEY_V4 }}
CONDUCTOR_AUTH_SECRET: ${{ secrets.AUTH_SECRET_V4 }}
CONDUCTOR_REQUEST_TIMEOUT_MS: "120000"
CONDUCTOR_RETRY_SERVER_ERRORS: "true"
JEST_JUNIT_OUTPUT_NAME: integration-v4-node-${{ matrix.node-version }}-shard-${{ matrix.shard }}-test-results.xml
- name: Publish Test Results
uses: dorny/test-reporter@v2
@@ -192,3 +311,61 @@ jobs:
flags: integration-v4
name: codecov-integration-v4-node-${{ matrix.node-version }}-shard-${{ matrix.shard }}
fail_ci_if_error: false

# Integration tests (v4 sm): mirrors integration-tests-v4 but targets the sm environment.
integration-tests-v4-sm:
runs-on: ubuntu-latest
timeout-minutes: 25
continue-on-error: true
strategy:
fail-fast: false
max-parallel: 1
matrix:
node-version: [20, 22, 24]
shard: [1, 2, 3]
name: Node.js v${{ matrix.node-version }} - integration v4 sm (shard ${{ matrix.shard }}/3)
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Set up Node
uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node-version }}
cache: "npm"
- name: Cache node_modules
id: cache
uses: actions/cache@v4
with:
path: node_modules
key: npm-${{ matrix.node-version }}-${{ hashFiles('package-lock.json') }}
restore-keys: |
npm-${{ matrix.node-version }}-
- name: Install Dependencies
if: steps.cache.outputs.cache-hit != 'true'
run: npm ci
- name: Run integration tests (v4 sm) shard ${{ matrix.shard }}/3
run: npm run test:integration:v4 -- --ci --coverage --runInBand --testTimeout=120000 --shard=${{ matrix.shard }}/3 --reporters=default --reporters=github-actions --reporters=jest-junit
env:
ORKES_BACKEND_VERSION: "4"
CONDUCTOR_SERVER_URL: ${{ vars.SM_V4_SERVER_URL }}
CONDUCTOR_AUTH_KEY: ${{ vars.SM_V4_AUTH_KEY }}
CONDUCTOR_AUTH_SECRET: ${{ secrets.SM_V4_AUTH_SECRET }}
CONDUCTOR_REQUEST_TIMEOUT_MS: "120000"
CONDUCTOR_RETRY_SERVER_ERRORS: "true"
JEST_JUNIT_OUTPUT_NAME: integration-v4-sm-node-${{ matrix.node-version }}-shard-${{ matrix.shard }}-test-results.xml
- name: Publish Test Results
uses: dorny/test-reporter@v2
if: ${{ !cancelled() }}
with:
name: integration v4 sm Node ${{ matrix.node-version }} shard ${{ matrix.shard }}/3
path: reports/integration-v4-sm-node-${{ matrix.node-version }}-shard-${{ matrix.shard }}-test-results.xml
reporter: jest-junit
- name: Upload coverage to Codecov
if: always()
uses: codecov/codecov-action@v4
with:
token: ${{ secrets.CODECOV_TOKEN }}
files: ./coverage/lcov.info
flags: integration-v4-sm
name: codecov-integration-v4-sm-node-${{ matrix.node-version }}-shard-${{ matrix.shard }}
fail_ci_if_error: false
10 changes: 5 additions & 5 deletions AGENTS.md
@@ -32,7 +32,7 @@ src/sdk/ # Main SDK source
decorators/worker.ts # @worker decorator + dual-mode support
decorators/registry.ts # Global registry (register/get/clear)
context/TaskContext.ts # AsyncLocalStorage per-task context
metrics/ # MetricsCollector, MetricsServer, PrometheusRegistry
metrics/ # LegacyMetricsCollector, CanonicalMetricsCollector, metricsFactory, MetricsServer, PrometheusRegistry, CanonicalPrometheusRegistry, accumulators, httpObserver
schema/ # jsonSchema, schemaField decorators
generators/ # Legacy generators (pre-v3, still exported for compat)
src/open-api/ # OpenAPI layer
@@ -211,10 +211,10 @@ public async someMethod(args): Promise<T> {

### Metrics Documentation (METRICS.md)

When adding, removing, or renaming metrics in `src/sdk/worker/metrics/MetricsCollector.ts`:
1. Update `METRICS.md` to reflect the change (name, type, labels, description)
2. Ensure both `MetricsCollector.toPrometheusText()` and `PrometheusRegistry.createMetrics()` are updated in sync — missing a summary/counter in either causes silent data loss
3. Update the metric count in the METRICS.md overview section
When adding, removing, or renaming metrics in `src/sdk/worker/metrics/`:
1. Update both `LegacyMetricsCollector.ts` and `CanonicalMetricsCollector.ts` (or add a no-op stub in the collector that does not emit the metric)
2. Ensure `toPrometheusText()` and the corresponding `PrometheusRegistry` / `CanonicalPrometheusRegistry` are updated in sync — missing a metric in either causes silent data loss
3. Update `METRICS.md` to reflect the change in both the legacy and canonical catalog tables
4. Add or update the corresponding direct recording method documentation if applicable

### SDK_NEW_LANGUAGE_GUIDE.md
29 changes: 29 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,29 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added

- **Canonical metrics** -- opt-in harmonized metric surface via `WORKER_CANONICAL_METRICS=true`. See [METRICS.md](METRICS.md) for the full catalog, configuration, and migration guide.
- Bounded `uri` label on `http_api_client_request_seconds`: canonical mode uses path templates (e.g. `/workflow/{workflowId}`) instead of fully-resolved paths, preventing metric cardinality explosion from dynamic IDs.
- `TaskPaused` event type and `PollerOptions.onPaused` callback: emitted when a poll cycle is skipped because the worker is paused. Canonical mode records `task_paused_total`; legacy mode does not (see Implementation Notes in METRICS.md).
- `measurePayloadSize` option in `MetricsCollectorConfig`: controls whether `workflow_input_size_bytes` is recorded via `JSON.stringify` on each `startWorkflow` call. Defaults to `true` for canonical, `false` for legacy.
- `retryServerErrors` option in `OrkesApiConfig` / `RetryFetchOptions` and `CONDUCTOR_RETRY_SERVER_ERRORS` env var: opt-in retry of HTTP 502/503/504 for idempotent methods (GET, HEAD, OPTIONS, PUT, DELETE). Default `false`; set to `true` to enable.
- `WorkflowStatusProbe` in harness: opt-in probe (via `HARNESS_PROBE_RATE_PER_SEC`) that exercises UUID-bearing endpoints to validate template URI metrics.
- `WORKER_LEGACY_METRICS` is reserved for future use. Once canonical metrics become the default, setting `WORKER_LEGACY_METRICS=true` will re-activate the legacy surface. It is not read by the current implementation.
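
The `retryServerErrors` entry above amounts to a small decision rule: retry only when the option is on, the method is idempotent, and the status is 502, 503, or 504. The following is a minimal TypeScript sketch of that rule, not the SDK's implementation. The names `shouldRetryServerError`, `fetchWithServerErrorRetry`, `maxAttempts`, and `baseDelayMs` are hypothetical, and the exponential backoff schedule is an assumption.

```typescript
// Hypothetical sketch: these helpers are illustrative and are not part of the SDK.
const RETRYABLE_STATUS = new Set<number>([502, 503, 504]);
const IDEMPOTENT_METHODS = new Set<string>(["GET", "HEAD", "OPTIONS", "PUT", "DELETE"]);

// Decide whether a response qualifies for a retry under the opt-in rule.
function shouldRetryServerError(
  method: string,
  status: number,
  retryServerErrors: boolean
): boolean {
  if (!retryServerErrors) return false; // default is off
  return IDEMPOTENT_METHODS.has(method.toUpperCase()) && RETRYABLE_STATUS.has(status);
}

// Minimal fetch-like shape so the sketch does not depend on a specific runtime.
type FetchLike = (url: string, init?: { method?: string }) => Promise<{ status: number }>;

// Re-issue the request while the rule above holds, up to maxAttempts in total.
// The backoff (baseDelayMs * 2^(attempt - 1)) is an assumption for illustration.
async function fetchWithServerErrorRetry(
  fetchFn: FetchLike,
  url: string,
  init: { method?: string } = {},
  opts: { retryServerErrors: boolean; maxAttempts?: number; baseDelayMs?: number } = {
    retryServerErrors: false,
  }
): Promise<{ status: number }> {
  const method = init.method ?? "GET";
  const maxAttempts = opts.maxAttempts ?? 3;
  const baseDelayMs = opts.baseDelayMs ?? 100;

  let response = await fetchFn(url, init);
  for (
    let attempt = 1;
    attempt < maxAttempts &&
    shouldRetryServerError(method, response.status, opts.retryServerErrors);
    attempt++
  ) {
    const delay = baseDelayMs * 2 ** (attempt - 1);
    await new Promise<void>((resolve) => setTimeout(resolve, delay));
    response = await fetchFn(url, init);
  }
  return response;
}
```

POST is left out of the retryable set because repeating a non-idempotent request can duplicate its side effects, which is presumably why the option is limited to idempotent methods.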

### Changed

- Legacy metrics emit unchanged when constructing `LegacyMetricsCollector` directly (the pre-existing pattern). Using `createMetricsCollector()` additionally enables automatic HTTP request timing via OpenAPI interceptors for both legacy and canonical modes; no other action required for existing deployments.
- `MetricsCollector.ts` renamed to `LegacyMetricsCollector.ts`; the public symbol is preserved via re-export so existing imports keep working.
- `http_api_client_request` timing is now recorded automatically by `wrapFetchWithRetry` when a metrics collector is active (via `createMetricsCollector()` or `setHttpMetricsObserver`), covering both successful responses and network-error fallback paths. A lightweight request interceptor captures OpenAPI path templates so the canonical `uri` label uses bounded-cardinality templates in all cases. Previously, `recordApiRequestTime` existed but was not wired into the HTTP pipeline -- [details](METRICS.md#implementation-notes).
- Added optional `durationMs` field to `TaskUpdateFailure` event, recording the duration of the last update attempt. Declared optional so existing event listener implementations are unaffected.

### Deprecated

- Legacy metric names remain the default during the transition period. Migration guidance is in [METRICS.md](METRICS.md#migrating-from-legacy-to-canonical).