| 1 | +# Copilot Instructions — DurableTask JavaScript SDK |
| 2 | + |
| 3 | +This document provides architectural context for AI assistants working with this codebase. |
| 4 | +Focus is on **stable patterns, invariants, and pitfalls** — not file paths or function signatures. |
| 5 | + |
| 6 | +--- |
| 7 | + |
| 8 | +## What This Project Is |
| 9 | + |
| 10 | +The **Durable Task SDK for JavaScript/TypeScript** enables writing long-running, stateful |
| 11 | +workflows as code. It is the JavaScript implementation of Microsoft's cross-language |
| 12 | +Durable Task Framework (sibling SDKs exist in .NET, Python, Java, Go). |
| 13 | + |
| 14 | +**This is NOT Azure Durable Functions.** It's a lower-level SDK for the Azure Durable Task |
| 15 | +Scheduler service (or its emulator). The two are related but have different APIs, deployment |
| 16 | +models, and npm packages. |
| 17 | + |
| 18 | +--- |
| 19 | + |
| 20 | +## Core Execution Model — Generator-Based Replay |
| 21 | + |
| 22 | +This is the single most important thing to understand. Everything else follows from it. |
| 23 | + |
| 24 | +### How Orchestrations Work |
| 25 | + |
| 26 | +Orchestrations are **async generator functions** (`async function*`) that `yield` `Task<T>` |
| 27 | +objects. The runtime replays completed history events to rebuild state, then advances the |
| 28 | +generator for new work. |
| 29 | + |
| 30 | +``` |
| 31 | +Orchestrator yields Task → Runtime checks history → Already done? Send result back |
| 32 | + → Not done? Suspend generator, create action for sidecar |
| 33 | +``` |
| 34 | + |
| 35 | +On every re-invocation: |
| 36 | +1. The executor receives `(instanceId, oldEvents[], newEvents[])`. |
| 37 | +2. With `isReplaying = true`, it feeds `oldEvents` through event processing. Completed tasks |
| 38 | + resolve instantly via `generator.next(result)`. |
| 39 | +3. With `isReplaying = false`, it processes `newEvents`. Incomplete tasks suspend the generator. |
| 40 | +4. The resulting `OrchestratorAction[]` are sent back to the sidecar. |
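
The replay steps above can be sketched with a toy model. This is a minimal illustration, not the SDK's executor: it uses a synchronous generator for brevity, and `PendingTask`, `toyOrchestrator`, and `runToy` are hypothetical names invented for this example. History is modeled as a map from task ID to recorded result.

```typescript
// Toy model of generator-based replay. All names here are hypothetical.
type PendingTask = { id: number; name: string };
type ToyHistory = Map<number, unknown>; // task id -> recorded result

interface ToyOutcome {
  done: boolean;
  output?: string;
  pendingActions: PendingTask[];
}

// The "orchestrator": yields task descriptors and receives their results.
function* toyOrchestrator(): Generator<PendingTask, string, unknown> {
  const a = (yield { id: 0, name: "stepA" }) as string;
  const b = (yield { id: 1, name: "stepB" }) as string;
  return `${a}+${b}`;
}

// Re-runs the orchestrator from the start on every invocation.
function runToy(history: ToyHistory): ToyOutcome {
  const gen = toyOrchestrator();
  let step = gen.next(); // advance to the first yield
  while (!step.done) {
    const task = step.value;
    if (!history.has(task.id)) {
      // Not in history: stop here and ask the backend to schedule the task.
      return { done: false, pendingActions: [task] };
    }
    // Already in history: feed the recorded result back into the generator.
    step = gen.next(history.get(task.id));
  }
  return { done: true, output: step.value as string, pendingActions: [] };
}
```

Calling `runToy` with an empty history yields a pending action for `stepA`; calling it again with results recorded for both task IDs runs the generator to completion without scheduling anything new.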
| 41 | + |
| 42 | +### The Determinism Rule (Critical) |
| 43 | + |
| 44 | +**Orchestrator code MUST be deterministic.** Every replay must produce the exact same |
| 45 | +sequence of actions. The runtime checks this during replay and throws `NonDeterminismError` on a detected mismatch. |
| 46 | + |
| 47 | +What this means in practice: |
| 48 | +- **No `Date.now()`** — use `context.currentUtcDateTime` |
| 49 | +- **No `Math.random()` or `crypto.randomUUID()`** — use `context.newGuid()` (deterministic UUID v5) |
| 50 | +- **No direct I/O** (HTTP calls, file reads, database queries) — use activities |
| 51 | +- **No non-deterministic control flow** — no data-dependent branching on external state |
| 52 | + |
| 53 | +Adding, removing, or reordering `yield` statements between code versions **breaks replay** |
| 54 | +for in-flight orchestrations. The validator catches type/name mismatches but not all |
| 55 | +ordering issues. |
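
A small sketch of the rule, under stated assumptions: `SketchContext` is a hypothetical two-member stand-in for the real orchestration context (only `currentUtcDateTime` and `newGuid()` are taken from the list above), and `fakeContext` imitates what a replay would supply. The point is that values taken from the context repeat across replays, while `Date.now()` and `Math.random()` do not.

```typescript
// Hypothetical, minimal context shape for illustration only.
interface SketchContext {
  currentUtcDateTime: Date;
  newGuid(): string;
}

type Decision = { orderId: string; expiresAt: Date };

// Deterministic: every value comes from the context, so a replay with the
// same context state produces the same decision.
function decideDeterministic(ctx: SketchContext): Decision {
  return {
    orderId: ctx.newGuid(),
    expiresAt: new Date(ctx.currentUtcDateTime.getTime() + 60000),
  };
}

// Anti-pattern: each call produces different values, so a replay would
// diverge from what was recorded in history.
function decideNonDeterministic(): Decision {
  return {
    orderId: String(Math.random()),
    expiresAt: new Date(Date.now() + 60000),
  };
}

// Stand-in for replay: a fresh context that always starts from the same state.
function fakeContext(): SketchContext {
  let counter = 0;
  return {
    currentUtcDateTime: new Date("2024-01-01T00:00:00Z"),
    newGuid: () => `guid-${counter++}`,
  };
}
```

Two calls to `decideDeterministic(fakeContext())` return equal values, which is the property replay relies on.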
| 56 | + |
| 57 | +### Activities vs Orchestrations |
| 58 | + |
| 59 | +Activities are the leaf nodes: they execute side effects and persist results as history events. |
| 60 | +Execution is at-least-once (a retry or worker crash can re-run an activity), so keep them idempotent where possible. They can do anything: HTTP calls, DB writes, etc. |
| 61 | + |
| 62 | +**Key mental model:** Orchestrations = coordination logic (deterministic). |
| 63 | +Activities = real work (non-deterministic allowed). |
| 64 | + |
| 65 | +--- |
| 66 | + |
| 67 | +## Architecture — How Pieces Connect |
| 68 | + |
| 69 | +### Communication Model |
| 70 | + |
| 71 | +The SDK **never touches storage directly**. All communication happens over gRPC to a |
| 72 | +sidecar (Azure Durable Task Scheduler, emulator, or Dapr): |
| 73 | + |
| 74 | +``` |
| 75 | +Client → gRPC → Sidecar (owns storage & scheduling) |
| 76 | + ↕ |
| 77 | +Worker ← gRPC (streaming) ← Sidecar pushes work items |
| 78 | +``` |
| 79 | + |
| 80 | +The worker opens a **server-streaming RPC** (`getWorkItems`) and dispatches incoming work: |
| 81 | +orchestration replays, activity executions, and entity operations. |
| 82 | + |
| 83 | +### Package Structure |
| 84 | + |
| 85 | +This is an npm workspaces monorepo with two published packages: |
| 86 | + |
| 87 | +| Package | Role | |
| 88 | +|---|---| |
| 89 | +| Core SDK | Orchestration engine, client, worker, entities, testing, tracing | |
| 90 | +| Azure Managed Backend | Connection string / Entra ID auth, credential factory, builder pattern | |
| 91 | + |
| 92 | +The Azure package **peers on** the core package — it adds authentication and connection |
| 93 | +management but delegates all domain logic to core. |
| 94 | + |
| 95 | +### Protocol Layer |
| 96 | + |
| 97 | +Protobuf definitions live in `internal/protocol/`. Generated JavaScript stubs (not |
| 98 | +TypeScript) provide the gRPC message types. A helper layer converts between protobuf |
| 99 | +messages and SDK types (history events, orchestrator actions). |
| 100 | + |
| 101 | +**Generated proto files should never be manually edited.** They're rebuilt from the proto |
| 102 | +definitions using `grpc_tools_node_protoc`. |
| 103 | + |
| 104 | +--- |
| 105 | + |
| 106 | +## Task System — The Heart of the SDK |
| 107 | + |
| 108 | +### Task Hierarchy |
| 109 | + |
| 110 | +``` |
| 111 | +Task<T> — Base: result, exception, isComplete, isFailed, parent ref |
| 112 | +├── CompletableTask — Can be completed or failed; notifies parent on completion |
| 113 | +│ ├── RetryableTask — Policy-driven retries (exponential backoff) |
| 114 | +│ └── RetryHandlerTask — Custom retry handler function |
| 115 | +└── CompositeTask — Holds children with parent backlinks |
| 116 | + ├── WhenAllTask — Completes when ALL children done; fails fast on first failure |
| 117 | + └── WhenAnyTask — Completes when ANY child done |
| 118 | +``` |
| 119 | + |
| 120 | +**Parent notification chain:** When a child task completes, it calls |
| 121 | +`parent.onChildCompleted()`. This chain drives composite task completion. |
| 122 | + |
| 123 | +**Important:** `CompositeTask` checks already-complete children in its constructor. |
| 124 | +A composite task can be **immediately complete upon construction** — callers must handle this. |
| 125 | + |
| 126 | +### WhenAll Fail-Fast Behavior |
| 127 | + |
| 128 | +`WhenAllTask` marks itself complete on the **first** failed child. Other children may still |
| 129 | +be in flight. The `isComplete` guard prevents double-completion, but the mental model is: |
| 130 | +one failure = whole task fails immediately, remaining results ignored. |
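
The parent notification chain, completion during construction, and fail-fast behavior can be sketched together. This is a simplified model, not the SDK's classes: `SketchTask` and `SketchWhenAll` are hypothetical names, and only the behaviors described above are modeled.

```typescript
// Simplified model of a completable task with a parent backlink.
class SketchTask<T> {
  isComplete = false;
  isFailed = false;
  result?: T;
  error?: Error;
  parent?: SketchWhenAll<T>;

  complete(result: T): void {
    if (this.isComplete) return;
    this.result = result;
    this.isComplete = true;
    this.parent?.onChildCompleted(this);
  }

  fail(error: Error): void {
    if (this.isComplete) return;
    this.error = error;
    this.isFailed = true;
    this.isComplete = true;
    this.parent?.onChildCompleted(this);
  }
}

// Simplified when-all composite.
class SketchWhenAll<T> {
  isComplete = false;
  isFailed = false;
  results: T[] = [];
  error?: Error;
  private readonly children: SketchTask<T>[];
  private completedCount = 0;

  constructor(children: SketchTask<T>[]) {
    this.children = children;
    for (const child of children) {
      child.parent = this;
      // Children that finished earlier are counted right here, so the
      // composite may already be complete when the constructor returns.
      if (child.isComplete) this.onChildCompleted(child);
    }
  }

  onChildCompleted(child: SketchTask<T>): void {
    if (this.isComplete) return; // guard against double completion
    if (child.isFailed) {
      // Fail fast: the first failure completes the composite.
      this.error = child.error;
      this.isFailed = true;
      this.isComplete = true;
      return;
    }
    this.completedCount += 1;
    if (this.completedCount === this.children.length) {
      this.results = this.children.map((c) => c.result as T);
      this.isComplete = true;
    }
  }
}
```

In this model, constructing a `SketchWhenAll` from two already-completed tasks returns a composite whose `isComplete` is already `true`, which is the case callers need to check for before waiting on it.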
| 131 | + |
| 132 | +--- |
| 133 | + |
| 134 | +## Entity System |
| 135 | + |
| 136 | +### Identity Model |
| 137 | + |
| 138 | +Entities are addressed as `@name@key`. Names are **always lowercased** (cross-SDK |
| 139 | +consistency with .NET). Keys **preserve case**. |
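
A minimal sketch of the `@name@key` convention, assuming the name itself never contains `@` (that validation is an assumption of this example, as are the helper names `formatEntityId` and `parseEntityId`):

```typescript
interface SketchEntityId {
  name: string;
  key: string;
}

// Builds the string form: the name is lowercased, the key is kept as given.
function formatEntityId(name: string, key: string): string {
  if (name.length === 0 || name.includes("@")) {
    throw new Error("entity name must be non-empty and must not contain '@'");
  }
  return `@${name.toLowerCase()}@${key}`;
}

// Splits on the first '@' after the leading one, so keys may contain '@'.
function parseEntityId(id: string): SketchEntityId {
  if (!id.startsWith("@")) {
    throw new Error(`not an entity id: ${id}`);
  }
  const separator = id.indexOf("@", 1);
  if (separator < 0) {
    throw new Error(`missing key separator: ${id}`);
  }
  return {
    name: id.slice(1, separator).toLowerCase(),
    key: id.slice(separator + 1),
  };
}
```

With these helpers, `formatEntityId("Counter", "User-A")` produces `@counter@User-A`: the name is normalized while the key keeps its case.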
| 140 | + |
| 141 | +### Execution Model |
| 142 | + |
| 143 | +Entities process operations in **single-threaded batches** with per-operation transactional |
| 144 | +semantics: |
| 145 | +1. Checkpoint state + accumulated actions before each operation |
| 146 | +2. Execute the operation |
| 147 | +3. On success: commit. On failure: rollback to checkpoint. |
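
The checkpoint, execute, commit-or-rollback cycle can be sketched as follows. This is a toy batch runner, not the SDK's entity executor: the state is a plain number, actions are strings, and the operation names (`add`, `explode`) are invented for the example.

```typescript
type Operation = { name: string; input?: number };
type OpResult = { ok: true } | { ok: false; error: string };

interface BatchState {
  state: number;
  actions: string[];
}

// Applies one operation, mutating state and queued actions in place.
function applyOperation(s: BatchState, op: Operation): void {
  switch (op.name) {
    case "add":
      s.state += op.input ?? 0;
      s.actions.push(`signal:added:${op.input ?? 0}`);
      break;
    case "explode":
      // Mutates state and queues an action, then fails part-way through.
      s.state = -1;
      s.actions.push("signal:never-sent");
      throw new Error("operation failed");
    default:
      throw new Error(`unknown operation: ${op.name}`);
  }
}

function runBatch(
  initial: number,
  ops: Operation[],
): { state: number; actions: string[]; results: OpResult[] } {
  const current: BatchState = { state: initial, actions: [] };
  const results: OpResult[] = [];
  for (const op of ops) {
    // 1. Checkpoint state and the number of accumulated actions.
    const checkpoint = { state: current.state, actionCount: current.actions.length };
    try {
      // 2. Execute the operation.
      applyOperation(current, op);
      // 3a. Success: keep the changes.
      results.push({ ok: true });
    } catch (err) {
      // 3b. Failure: restore state and drop actions queued by this operation.
      current.state = checkpoint.state;
      current.actions.length = checkpoint.actionCount;
      results.push({ ok: false, error: err instanceof Error ? err.message : String(err) });
    }
  }
  return { state: current.state, actions: current.actions, results };
}
```

Running `add 2`, `explode`, `add 3` against an initial state of `0` ends with state `5` and two queued actions: the failed operation leaves no trace beyond its error result.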
| 148 | + |
| 149 | +State uses **lazy JSON deserialization** — deserialize only on first read. If state is |
| 150 | +written but never read, the original JSON is discarded (intentional optimization). |
| 151 | + |
| 152 | +### Method Dispatch |
| 153 | + |
| 154 | +`TaskEntity<TState>` does automatic method dispatch: it walks the prototype chain, finds |
| 155 | +a method matching the operation name (case-insensitive), and calls it. The implicit |
| 156 | +`delete` operation sets state to `null`. |
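
A rough sketch of prototype-chain dispatch with case-insensitive matching. The helper names (`findOperation`, `dispatch`) and the decision to stop at `Object.prototype` and skip `constructor` are assumptions of this example; the SDK's own lookup may differ in details.

```typescript
type OperationFn = (...args: unknown[]) => unknown;

// Walks the prototype chain looking for a method whose name matches,
// ignoring case. Stops before Object.prototype so built-ins such as
// toString are not dispatchable.
function findOperation(target: object, operationName: string): OperationFn | undefined {
  const wanted = operationName.toLowerCase();
  let proto: object | null = Object.getPrototypeOf(target);
  while (proto !== null && proto !== Object.prototype) {
    for (const key of Object.getOwnPropertyNames(proto)) {
      if (key === "constructor" || key.toLowerCase() !== wanted) continue;
      // Read the descriptor rather than the property to avoid invoking getters.
      const value = Object.getOwnPropertyDescriptor(proto, key)?.value;
      if (typeof value === "function") return value as OperationFn;
    }
    proto = Object.getPrototypeOf(proto);
  }
  return undefined;
}

function dispatch(entity: object, operationName: string, input?: unknown): unknown {
  const method = findOperation(entity, operationName);
  if (!method) {
    throw new Error(`no method found for operation '${operationName}'`);
  }
  return method.call(entity, input);
}

// Example entity classes: one method on the base, one on the subclass.
class BaseCounter {
  state = 0;
  reset(): void {
    this.state = 0;
  }
}

class Counter extends BaseCounter {
  add(amount: number): void {
    this.state += amount;
  }
}
```

Here `dispatch(new Counter(), "ADD", 3)` resolves to `add` on `Counter.prototype`, and `"Reset"` resolves to the inherited `reset` one level up the chain.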
| 157 | + |
| 158 | +### Entity Lock Ordering |
| 159 | + |
| 160 | +`lockEntities()` **sorts entity IDs** before acquiring locks to prevent deadlocks. |
| 161 | +This is enforced in the orchestration context — never bypass it or implement custom locking. |
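
Why sorting prevents deadlocks: if every orchestration acquires locks in the same global order, no two of them can each hold a lock the other is waiting for. A minimal sketch, assuming plain string comparison as the ordering (the SDK's actual comparator is not specified here, and `sortForLocking` is a hypothetical helper):

```typescript
// Returns a sorted copy so callers' arrays are left untouched.
function sortForLocking(entityIds: string[]): string[] {
  return [...entityIds].sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
}

// Two orchestrations requesting the same entities in different orders
// end up with the same acquisition order.
const requestedByFirst = ["@account@bob", "@account@alice"];
const requestedBySecond = ["@account@alice", "@account@bob"];
const firstOrder = sortForLocking(requestedByFirst);
const secondOrder = sortForLocking(requestedBySecond);
```

Both `firstOrder` and `secondOrder` start with `@account@alice`, so whichever orchestration gets that lock first proceeds while the other waits, instead of each holding one lock and blocking on the other.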
| 162 | + |
| 163 | +--- |
| 164 | + |
| 165 | +## Error Handling Patterns |
| 166 | + |
| 167 | +### Exception Hierarchy |
| 168 | + |
| 169 | +| Error | When | |
| 170 | +|---|---| |
| 171 | +| `TaskFailedError` | Activity or sub-orchestration fails (carries `FailureDetails`) | |
| 172 | +| `NonDeterminismError` | Replayed action mismatches history (not recoverable) | |
| 173 | +| `OrchestrationStateError` | Invalid state transition attempted | |
| 174 | +| `EntityOperationFailedException` | Entity operation throws | |
| 175 | +| `TimeoutError` | Wait timeout exceeded | |
| 176 | + |
| 177 | +### Error Propagation Flow |
| 178 | + |
| 179 | +1. Activity throws → sidecar records failure event → executor calls `task.fail()` → |
| 180 | + generator receives thrown `TaskFailedError` → orchestrator can catch or propagate. |
| 181 | +2. Uncaught orchestrator error → executor catches → creates complete action with |
| 182 | + failure details → sidecar records failed status. |
| 183 | +3. Non-determinism → immediately throws, aborts replay. **Not recoverable by user code.** |
| 184 | + |
| 185 | +### Retry System |
| 186 | + |
| 187 | +Two flavors, both using **durable timers** for backoff (survives process restarts): |
| 188 | +- **Policy-driven:** `RetryPolicy` with max attempts, backoff coefficient, intervals. |
| 189 | + Delay formula: `firstRetryInterval × backoffCoefficient^(attempt - 1)`, capped at the policy's maximum retry interval. |
| 190 | +- **Custom handler:** User function receives `RetryContext`, returns `RetryAction`. |
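
The policy-driven delay formula can be worked through in a few lines. This is a sketch only: the field names on `SketchRetryPolicy` are hypothetical, and the cap is assumed to be an optional maximum interval on the policy.

```typescript
// Hypothetical policy shape; the real RetryPolicy may name these differently.
interface SketchRetryPolicy {
  firstRetryIntervalMs: number;
  backoffCoefficient: number;
  maxRetryIntervalMs?: number;
}

// attempt is 1-based: attempt 1 waits exactly firstRetryIntervalMs.
function retryDelayMs(policy: SketchRetryPolicy, attempt: number): number {
  if (attempt < 1) {
    throw new RangeError("attempt must be >= 1");
  }
  const uncapped =
    policy.firstRetryIntervalMs * Math.pow(policy.backoffCoefficient, attempt - 1);
  return policy.maxRetryIntervalMs === undefined
    ? uncapped
    : Math.min(uncapped, policy.maxRetryIntervalMs);
}
```

With a first interval of 1000 ms, a coefficient of 2, and a 5000 ms cap, attempts 1 through 4 wait 1000, 2000, 4000, and 5000 ms: the fourth would be 8000 ms uncapped.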
| 191 | + |
| 192 | +**Critical invariant:** Retry tasks create timers whose `taskId` differs from the original |
| 193 | +task's ID. The executor maintains maps to link timer IDs back to retry tasks. If this |
| 194 | +linkage breaks (e.g., during continue-as-new), retries silently fail. |
| 195 | + |
| 196 | +--- |
| 197 | + |
| 198 | +## Tracing (OpenTelemetry) |
| 199 | + |
| 200 | +OpenTelemetry is an **optional peer dependency**, loaded via `require()` with fallback. |
| 201 | +If not available, tracing is silently disabled. |
| 202 | + |
| 203 | +Cross-SDK conventions (shared with .NET, Python, Java, Go): |
| 204 | +- Tracer name: `"Microsoft.DurableTask"` |
| 205 | +- Span naming: `"{type}:{name}"` (e.g., `"orchestration:MyWorkflow"`) |
| 206 | +- W3C `traceparent` propagated via protobuf `TraceContext` fields |
| 207 | +- Replay spans carry forward original span ID for cross-replay correlation |
| 208 | + |
| 209 | +--- |
| 210 | + |
| 211 | +## Logging Conventions |
| 212 | + |
| 213 | +### Logger Interface |
| 214 | + |
| 215 | +The SDK supports two logging modes, detected at runtime via type guard: |
| 216 | +- **Plain `Logger`:** `error()`, `warn()`, `info()`, `debug()` — string messages |
| 217 | +- **`StructuredLogger`:** Adds `logEvent(level, event, message)` with event IDs and |
| 218 | + categories matching the .NET SDK for cross-SDK log correlation |
| 219 | + |
| 220 | +### Replay-Safe Logging |
| 221 | + |
| 222 | +`ReplaySafeLogger` wraps any logger and **suppresses all output when `isReplaying = true`**. |
| 223 | +This prevents duplicate log entries during history replay. Always use it in orchestrator |
| 224 | +context. |
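
The wrapping idea can be sketched like this. `SketchLogger` and `SketchReplaySafeLogger` are hypothetical stand-ins; how the real `ReplaySafeLogger` learns the replay state is not shown in this document, so the sketch takes it as a callback.

```typescript
// Plain logger shape, using the four methods listed above.
interface SketchLogger {
  error(message: string): void;
  warn(message: string): void;
  info(message: string): void;
  debug(message: string): void;
}

// Forwards to the inner logger only when the orchestration is not replaying.
class SketchReplaySafeLogger implements SketchLogger {
  private readonly inner: SketchLogger;
  private readonly isReplaying: () => boolean;

  constructor(inner: SketchLogger, isReplaying: () => boolean) {
    this.inner = inner;
    this.isReplaying = isReplaying;
  }

  error(message: string): void {
    if (!this.isReplaying()) this.inner.error(message);
  }

  warn(message: string): void {
    if (!this.isReplaying()) this.inner.warn(message);
  }

  info(message: string): void {
    if (!this.isReplaying()) this.inner.info(message);
  }

  debug(message: string): void {
    if (!this.isReplaying()) this.inner.debug(message);
  }
}
```

Because the replay state is read on every call, the same logger instance stays silent while history is replayed and starts emitting once execution reaches new work.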
| 225 | + |
| 226 | +--- |
| 227 | + |
| 228 | +## Testing Approach |
| 229 | + |
| 230 | +### In-Memory Backend |
| 231 | + |
| 232 | +The SDK provides a full in-memory testing stack that mirrors the gRPC path: |
| 233 | + |
| 234 | +``` |
| 235 | +TestOrchestrationClient → InMemoryOrchestrationBackend ← TestOrchestrationWorker |
| 236 | +``` |
| 237 | + |
| 238 | +This enables testing orchestrations with the **real executor logic** but no sidecar, no |
| 239 | +gRPC, no network. The backend manages instance state, work queues, timers, and state |
| 240 | +waiters. |
| 241 | + |
| 242 | +Key test utilities: |
| 243 | +- `backend.waitForState(predicate, timeout)` — polling with timeout for async assertions |
| 244 | +- `backend.reset()` — clear all state between tests |
| 245 | +- Direct timer control for deterministic time-dependent tests |
| 246 | + |
| 247 | +### Test Conventions |
| 248 | + |
| 249 | +- Framework: Jest 29 with `ts-jest` |
| 250 | +- File naming: `.spec.ts` suffix |
| 251 | +- Protobuf test helpers: factory functions create `HistoryEvent` instances without a sidecar |
| 252 | +- Three tiers: unit tests (no sidecar), e2e with sidecar, e2e with Azure |
| 253 | + |
| 254 | +--- |
| 255 | + |
| 256 | +## Where Bugs Tend to Hide |
| 257 | + |
| 258 | +These are stable architectural areas where complexity concentrates: |
| 259 | + |
| 260 | +1. **Replay event matching** — The executor's large switch statement processes 27+ event |
| 261 | + types. Unmatched events are logged but silently dropped in several places. |
| 262 | + The TODOs in the code explicitly question whether these should be errors. |
| 263 | + |
| 264 | +2. **Timer-to-retry linkage** — Retry tasks create timers with different IDs. The maps |
| 265 | + connecting them are critical. Any break in linkage = silent retry failure. |
| 266 | + |
| 267 | +3. **Generator lifecycle edge cases** — What happens when the generator yields `null`? |
| 268 | + What about when `done` is true but the loop keeps checking, or the initial `next()` return value? |
| 269 | + Several TODOs mark these as not fully hardened. |
| 270 | + |
| 271 | +4. **Composite task constructor side effects** — Already-complete children trigger |
| 272 | + `onChildCompleted()` during construction, potentially completing the composite |
| 273 | + before the caller inspects it. |
| 274 | + |
| 275 | +5. **Continue-as-new state reset** — Resetting history while preserving carryover events |
| 276 | + (buffered external events) is a subtle operation. Incorrect carryover handling leaks |
| 277 | + state between iterations. |
| 278 | + |
| 279 | +6. **Suspend/resume event buffering** — Suspended orchestrations buffer "suspendable" |
| 280 | + events. On resume, all buffered events are replayed in order. If events arrive at |
| 281 | + the boundary of suspend/resume transitions, ordering may be surprising. |
| 282 | + |
| 283 | +7. **gRPC stream lifecycle** — The worker's streaming connection can enter states where |
| 284 | + it's neither cleanly closed nor cleanly connected. The reconnection logic handles most |
| 285 | + cases, but simultaneous close + reconnect races exist. |
| 286 | + |
| 287 | +8. **Entity V1 vs V2 code paths** — The worker supports both entity execution paths. |
| 288 | + V2 is current; V1 is legacy. Incorrect version detection or mixed-version scenarios |
| 289 | + could cause issues. |
| 290 | + |
| 291 | +--- |
| 292 | + |
| 293 | +## Code Conventions |
| 294 | + |
| 295 | +### Naming |
| 296 | +- Files: `kebab-case.ts` |
| 297 | +- Classes: `PascalCase` |
| 298 | +- Private fields: `_underscorePrefixed` |
| 299 | +- Type aliases: `T`-prefixed (`TOrchestrator`, `TActivity`) |
| 300 | +- Constants: `UPPER_SNAKE_CASE` or `PascalCase` objects with `UPPER_SNAKE_CASE` properties |
| 301 | +- Enums: `PascalCase` names with `PascalCase` members |
| 302 | + |
| 303 | +### Module System |
| 304 | +- CommonJS output, ES2022 target |
| 305 | +- OpenTelemetry loaded via synchronous `require()` with catch (future ESM migration noted) |
| 306 | + |
| 307 | +### Cross-SDK Consistency |
| 308 | +Many decisions mirror the .NET SDK intentionally: entity name lowercasing, structured |
| 309 | +log event IDs, tracing attribute names, span naming conventions. When in doubt about |
| 310 | +"why is it done this way?", the answer is often ".NET SDK compatibility." |
| 311 | + |
| 312 | +### License |
| 313 | +Every source file starts with the Microsoft copyright header + MIT license reference. |
| 314 | + |
| 315 | +### Builder Pattern |
| 316 | +Both packages use the builder pattern for constructing clients and workers. |
| 317 | +Fluent API with `.host()`, `.port()`, `.connectionString()`, `.endpoint()`, `.build()`. |
| 318 | + |
| 319 | +### Registration Lock |
| 320 | +`addOrchestrator()`, `addActivity()`, `addEntity()` all throw if called after `start()`. |
| 321 | +The registry is frozen once the worker begins processing. |
| 322 | + |
| 323 | +--- |
| 324 | + |
| 325 | +## What Not to Touch |
| 326 | + |
| 327 | +- **Generated proto files** (`*_pb.js`, `*_pb.d.ts`, `*_grpc_pb.js`) — regenerated from |
| 328 | + proto definitions, never manually edited |
| 329 | +- **`version.ts`** — auto-generated by prebuild script |
| 330 | +- **Proto definitions** in `internal/protocol/` — shared across language SDKs, changes |
| 331 | + must be coordinated |
| 332 | + |
| 333 | +--- |
| 334 | + |
| 335 | +## Key Design Constraints to Respect |
| 336 | + |
| 337 | +1. **No new dependencies** without careful consideration — this is a core SDK |
| 338 | +2. **Node.js >= 22 required** — the engines field enforces this |
| 339 | +3. **Single-threaded execution** — no locking primitives; correctness depends on not |
| 340 | + yielding to the event loop during replay processing |
| 341 | +4. **Pre-1.0 API** — public API surface is still evolving but changes should be deliberate |
| 342 | +5. **Cross-SDK alignment** — behavior should match .NET/Python/Java/Go SDKs where applicable |