Commit f7b9b9c

committed
Merge branch 'main' of https://github.com/microsoft/durabletask-js into wangbill/export
2 parents 1354f83 + 933f78a commit f7b9b9c

40 files changed

Lines changed: 13071 additions & 1384 deletions

.github/agents/daily-code-review.agent.md

Lines changed: 370 additions & 0 deletions
Large diffs are not rendered by default.

.github/agents/pr-verification.agent.md

Lines changed: 491 additions & 0 deletions
Large diffs are not rendered by default.

.github/copilot-instructions.md

Lines changed: 342 additions & 0 deletions
@@ -0,0 +1,342 @@
1+
# Copilot Instructions — DurableTask JavaScript SDK
2+
3+
This document provides architectural context for AI assistants working with this codebase.
4+
Focus is on **stable patterns, invariants, and pitfalls** — not file paths or function signatures.
5+
6+
---
7+
8+
## What This Project Is
9+
10+
The **Durable Task SDK for JavaScript/TypeScript** enables writing long-running, stateful
11+
workflows as code. It is the JavaScript implementation of Microsoft's cross-language
12+
Durable Task Framework (sibling SDKs exist in .NET, Python, Java, Go).
13+
14+
**This is NOT Azure Durable Functions.** It's a lower-level SDK for the Azure Durable Task
15+
Scheduler service (or its emulator). The two are related but have different APIs, deployment
16+
models, and npm packages.
17+
18+
---
19+
20+
## Core Execution Model — Generator-Based Replay
21+
22+
This is the single most important thing to understand. Everything else follows from it.
23+
24+
### How Orchestrations Work
25+
26+
Orchestrations are **async generator functions** (`async function*`) that `yield` `Task<T>`
27+
objects. The runtime replays completed history events to rebuild state, then advances the
28+
generator for new work.
29+
30+
```
31+
Orchestrator yields Task → Runtime checks history → Already done? Send result back
32+
→ Not done? Suspend generator, create action for sidecar
33+
```
34+
35+
On every re-invocation:
36+
1. The executor receives `(instanceId, oldEvents[], newEvents[])`.
37+
2. With `isReplaying = true`: feeds `oldEvents` through event processing. Completed tasks
38+
resolve instantly via `generator.next(result)`.
39+
3. With `isReplaying = false`: processes `newEvents`. Incomplete tasks suspend the generator.
40+
4. The resulting `OrchestratorAction[]` are sent back to the sidecar.
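
The replay steps above can be sketched in a few lines. This is a simplified model, not the SDK's executor: it uses a synchronous generator instead of `async function*`, and the `Task`, `Action`, and `replay` names are hypothetical stand-ins introduced only for illustration.

```typescript
// Hypothetical, simplified model of generator-based replay (not the SDK's executor).
type Task = { id: number; name: string };
type Action = { kind: "scheduleTask"; id: number; name: string };

// The "orchestrator": yields tasks and receives their results back.
function* orchestrator(): Generator<Task, string, string> {
  const a = yield { id: 0, name: "stepA" };
  const b = yield { id: 1, name: "stepB" };
  return `${a}+${b}`;
}

// `history` maps a task id to its recorded result.
function replay(history: Map<number, string>): { actions: Action[]; output?: string } {
  const gen = orchestrator();
  const actions: Action[] = [];
  let step = gen.next(); // run to the first yield
  while (!step.done) {
    const task = step.value;
    if (history.has(task.id)) {
      // Already done: send the recorded result back into the generator.
      step = gen.next(history.get(task.id)!);
    } else {
      // Not done: emit an action and leave the generator suspended.
      actions.push({ kind: "scheduleTask", id: task.id, name: task.name });
      return { actions };
    }
  }
  return { actions, output: step.value as string };
}

const first = replay(new Map<number, string>());
const second = replay(new Map<number, string>([[0, "A"]]));
const third = replay(new Map<number, string>([[0, "A"], [1, "B"]]));
```

Each invocation restarts the generator from the top. Only the growing history lets it get further than the previous run, which is why the orchestrator body must behave identically every time.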
41+
42+
### The Determinism Rule (Critical)
43+
44+
**Orchestrator code MUST be deterministic.** Every replay must produce the exact same
45+
sequence of actions. This is enforced at runtime — mismatches throw `NonDeterminismError`.
46+
47+
What this means in practice:
48+
- **No `Date.now()`** — use `context.currentUtcDateTime`
49+
- **No `Math.random()` or `crypto.randomUUID()`** — use `context.newGuid()` (deterministic UUID v5)
50+
- **No direct I/O** (HTTP calls, file reads, database queries) — use activities
51+
- **No non-deterministic control flow** — no data-dependent branching on external state
52+
53+
Adding, removing, or reordering `yield` statements between code versions **breaks replay**
54+
for in-flight orchestrations. The validator catches type/name mismatches but not all
55+
ordering issues.
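
As a rough illustration of the rule, the sketch below runs the same planning function twice against a fake context and gets identical actions. The `DeterministicContext` interface and `makeContext` helper are hypothetical; only the member names `currentUtcDateTime` and `newGuid()` come from the list above, and the fake ID scheme merely stands in for a name-based UUID.

```typescript
// Hypothetical stand-in for the orchestration context; only the two members
// named in the text above are modeled.
interface DeterministicContext {
  currentUtcDateTime: Date;
  newGuid(): string;
}

// Builds a fake context whose values depend only on recorded inputs.
function makeContext(recordedTime: string, instanceId: string): DeterministicContext {
  let counter = 0;
  return {
    currentUtcDateTime: new Date(recordedTime),
    // Stand-in for a name-based ID: same inputs always give the same output.
    newGuid: () => `${instanceId}-${counter++}`,
  };
}

// Deterministic: every value comes from the context, never from the process
// (no Date.now(), no Math.random()).
function planActions(ctx: DeterministicContext): string[] {
  const requestId = ctx.newGuid();
  const stamp = ctx.currentUtcDateTime.toISOString();
  return [`callActivity:charge:${requestId}`, `callActivity:audit:${stamp}`];
}

const run1 = planActions(makeContext("2024-01-01T00:00:00.000Z", "order-42"));
const run2 = planActions(makeContext("2024-01-01T00:00:00.000Z", "order-42"));
```

Swapping `ctx.currentUtcDateTime` for `new Date()` would make `run1` and `run2` diverge, which is the kind of mismatch replay cannot tolerate.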
56+
57+
### Activities vs Orchestrations
58+
59+
Activities are the leaf nodes: they execute side effects at least once (retries and crash recovery can re-run them)
60+
and persist results as history events. They can do anything: HTTP calls, DB writes, etc.
61+
62+
**Key mental model:** Orchestrations = coordination logic (deterministic).
63+
Activities = real work (non-deterministic allowed).
64+
65+
---
66+
67+
## Architecture — How Pieces Connect
68+
69+
### Communication Model
70+
71+
The SDK **never touches storage directly**. All communication happens over gRPC to a
72+
sidecar (Azure Durable Task Scheduler, emulator, or Dapr):
73+
74+
```
75+
Client → gRPC → Sidecar (owns storage & scheduling)
76+
77+
Worker ← gRPC (streaming) ← Sidecar pushes work items
78+
```
79+
80+
The worker opens a **server-streaming RPC** (`getWorkItems`) and dispatches incoming work:
81+
orchestration replays, activity executions, and entity operations.
82+
83+
### Package Structure
84+
85+
This is an npm workspaces monorepo with two published packages:
86+
87+
| Package | Role |
88+
|---|---|
89+
| Core SDK | Orchestration engine, client, worker, entities, testing, tracing |
90+
| Azure Managed Backend | Connection string / Entra ID auth, credential factory, builder pattern |
91+
92+
The Azure package declares the core package as a **peer dependency**: it adds authentication and connection
93+
management but delegates all domain logic to core.
94+
95+
### Protocol Layer
96+
97+
Protobuf definitions live in `internal/protocol/`. Generated JavaScript stubs (not
98+
TypeScript) provide the gRPC message types. A helper layer converts between protobuf
99+
messages and SDK types (history events, orchestrator actions).
100+
101+
**Generated proto files should never be manually edited.** They're rebuilt from the proto
102+
definitions using `grpc_tools_node_protoc`.
103+
104+
---
105+
106+
## Task System — The Heart of the SDK
107+
108+
### Task Hierarchy
109+
110+
```
111+
Task<T> — Base: result, exception, isComplete, isFailed, parent ref
112+
├── CompletableTask — Can be completed or failed; notifies parent on completion
113+
│ ├── RetryableTask — Policy-driven retries (exponential backoff)
114+
│ └── RetryHandlerTask — Custom retry handler function
115+
└── CompositeTask — Holds children with parent backlinks
116+
├── WhenAllTask — Completes when ALL children done; fails fast on first failure
117+
└── WhenAnyTask — Completes when ANY child done
118+
```
119+
120+
**Parent notification chain:** When a child task completes, it calls
121+
`parent.onChildCompleted()`. This chain drives composite task completion.
122+
123+
**Important:** `CompositeTask` checks already-complete children in its constructor.
124+
A composite task can be **immediately complete upon construction** — callers must handle this.
125+
126+
### WhenAll Fail-Fast Behavior
127+
128+
`WhenAllTask` marks itself complete on the **first** failed child. Other children may still
129+
be in flight. The `isComplete` guard prevents double-completion, but the mental model is:
130+
one failure = whole task fails immediately, remaining results ignored.
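
The two behaviors described above (completion during construction and fail-fast) can be sketched with stripped-down classes. `SimpleTask` and `WhenAllSketch` are hypothetical names; the bodies illustrate the idea and are not the SDK's implementation.

```typescript
// Hypothetical, stripped-down task classes; illustrative only.
class SimpleTask<T> {
  isComplete = false;
  isFailed = false;
  result?: T;
  error?: Error;
  parent?: WhenAllSketch<T>;

  complete(result: T): void {
    if (this.isComplete) return;
    this.result = result;
    this.isComplete = true;
    this.parent?.onChildCompleted(this);
  }

  fail(error: Error): void {
    if (this.isComplete) return;
    this.error = error;
    this.isFailed = true;
    this.isComplete = true;
    this.parent?.onChildCompleted(this);
  }
}

class WhenAllSketch<T> {
  isComplete = false;
  isFailed = false;
  error?: Error;
  results: T[] = [];
  private completed = 0;
  private children: SimpleTask<T>[];

  constructor(children: SimpleTask<T>[]) {
    this.children = children;
    for (const child of children) child.parent = this;
    // Children that finished earlier are counted during construction.
    for (const child of children) {
      if (child.isComplete) this.onChildCompleted(child);
    }
    if (children.length === 0) this.isComplete = true;
  }

  onChildCompleted(child: SimpleTask<T>): void {
    if (this.isComplete) return; // guard against double completion
    if (child.isFailed) {
      // Fail fast: the first failure completes the composite.
      this.error = child.error;
      this.isFailed = true;
      this.isComplete = true;
      return;
    }
    this.completed++;
    if (this.completed === this.children.length) {
      this.results = this.children.map((c) => c.result as T);
      this.isComplete = true;
    }
  }
}

const done = new SimpleTask<number>();
done.complete(1);
const alsoDone = new SimpleTask<number>();
alsoDone.complete(2);
const immediate = new WhenAllSketch([done, alsoDone]); // complete on construction

const pending = new SimpleTask<number>();
const failing = new SimpleTask<number>();
const failFast = new WhenAllSketch([pending, failing]);
failing.fail(new Error("boom")); // `pending` is still in flight
```

A caller that constructs a composite and then unconditionally waits for a completion callback would hang in the `immediate` case, so checking `isComplete` right after construction is the safer habit.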
131+
132+
---
133+
134+
## Entity System
135+
136+
### Identity Model
137+
138+
Entities are addressed as `@name@key`. Names are **always lowercased** (cross-SDK
139+
consistency with .NET). Keys **preserve case**.
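
A minimal sketch of the identity rule follows. `formatEntityId` and `parseEntityId` are hypothetical helpers, and the parsing rule (split at the first `@` after the leading one, so keys may contain `@`) is an assumption rather than documented SDK behavior.

```typescript
// Hypothetical helpers illustrating the `@name@key` format.
function formatEntityId(name: string, key: string): string {
  // Names are lowercased; keys keep their original case.
  return `@${name.toLowerCase()}@${key}`;
}

function parseEntityId(id: string): { name: string; key: string } {
  if (!id.startsWith("@")) throw new Error(`Invalid entity ID: ${id}`);
  // Assumption: the first '@' after the leading one separates name from key.
  const separator = id.indexOf("@", 1);
  if (separator < 0) throw new Error(`Invalid entity ID: ${id}`);
  return { name: id.slice(1, separator).toLowerCase(), key: id.slice(separator + 1) };
}

const formatted = formatEntityId("Counter", "User-ABC");
const parsed = parseEntityId(formatted);
```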
140+
141+
### Execution Model
142+
143+
Entities process operations in **single-threaded batches** with per-operation transactional
144+
semantics:
145+
1. Checkpoint state + accumulated actions before each operation
146+
2. Execute the operation
147+
3. On success: commit. On failure: rollback to checkpoint.
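
The checkpoint, execute, and commit-or-rollback cycle can be sketched as below. The `Operation` and `BatchState` types and the `runBatch` function are hypothetical and exist only to illustrate the three steps.

```typescript
// Hypothetical sketch of per-operation checkpoint and rollback.
type Operation = { name: string; input?: number };
type BatchState = { value: number; actions: string[] };

function applyOperation(state: BatchState, op: Operation): void {
  if (op.name === "add") {
    state.value += op.input ?? 0;
    state.actions.push(`signal:added:${state.value}`);
  } else {
    // Mutate first, then throw, to show that rollback undoes partial work.
    state.actions.push("signal:partial");
    throw new Error(`unknown operation: ${op.name}`);
  }
}

function runBatch(initial: number, ops: Operation[]): { state: BatchState; results: string[] } {
  const state: BatchState = { value: initial, actions: [] };
  const results: string[] = [];
  for (const op of ops) {
    // 1. Checkpoint state and accumulated actions.
    const checkpoint = { value: state.value, actions: [...state.actions] };
    try {
      applyOperation(state, op); // 2. Execute.
      results.push("ok"); // 3a. Commit: keep the mutations.
    } catch (err) {
      // 3b. Roll back to the checkpoint; later operations still run.
      state.value = checkpoint.value;
      state.actions = checkpoint.actions;
      results.push(`failed: ${(err as Error).message}`);
    }
  }
  return { state, results };
}

const batch = runBatch(0, [
  { name: "add", input: 2 },
  { name: "explode" },
  { name: "add", input: 3 },
]);
```

The failed middle operation leaves no trace in `batch.state`: both the value and the queued actions revert, and the third operation proceeds from the committed state.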
148+
149+
State uses **lazy JSON deserialization** — deserialize only on first read. If state is
150+
written but never read, the original JSON is discarded (intentional optimization).
151+
152+
### Method Dispatch
153+
154+
`TaskEntity<TState>` does automatic method dispatch: it walks the prototype chain, finds
155+
a method matching the operation name (case-insensitive), and calls it. The implicit
156+
`delete` operation sets state to `null`.
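
One way to picture the dispatch is the sketch below. `findMethod`, `dispatch`, and the `Counter` classes are hypothetical. The sketch also assumes that the implicit `delete` only applies when no method with that name is found, which the text above does not state.

```typescript
// Hypothetical sketch of case-insensitive dispatch over the prototype chain.
type AnyMethod = (...args: unknown[]) => unknown;

function findMethod(target: object, operationName: string): AnyMethod | undefined {
  const wanted = operationName.toLowerCase();
  let proto: object | null = Object.getPrototypeOf(target);
  // Walk from the most derived class upward, stopping before Object.prototype.
  while (proto !== null && proto !== Object.prototype) {
    for (const key of Object.getOwnPropertyNames(proto)) {
      if (key === "constructor" || key.toLowerCase() !== wanted) continue;
      const descriptor = Object.getOwnPropertyDescriptor(proto, key);
      if (descriptor && typeof descriptor.value === "function") {
        return descriptor.value as AnyMethod;
      }
    }
    proto = Object.getPrototypeOf(proto);
  }
  return undefined;
}

class CounterBase {
  state: number | null = 0;
  reset(): void {
    this.state = 0;
  }
}

class Counter extends CounterBase {
  add(amount: number): number {
    this.state = (this.state ?? 0) + amount;
    return this.state;
  }
}

function dispatch(entity: CounterBase, operation: string, input?: unknown): unknown {
  const method = findMethod(entity, operation);
  if (method) return method.call(entity, input);
  if (operation.toLowerCase() === "delete") {
    entity.state = null; // assumed: implicit delete when no method handles it
    return undefined;
  }
  throw new Error(`No method found for operation '${operation}'`);
}

const counter = new Counter();
const sum = dispatch(counter, "ADD", 5); // matches `add` despite the casing
dispatch(counter, "Reset"); // found one level up, on CounterBase
const stateAfterReset = counter.state;
dispatch(counter, "delete"); // no such method: implicit delete
```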
157+
158+
### Entity Lock Ordering
159+
160+
`lockEntities()` **sorts entity IDs** before acquiring locks to prevent deadlocks.
161+
This is enforced in the orchestration context — never bypass it or implement custom locking.
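
The reason sorting prevents deadlocks is that every caller ends up acquiring locks in the same global order. A minimal sketch, with a hypothetical `lockOrder` helper and plain string IDs:

```typescript
// Hypothetical sketch: a canonical lock order derived by sorting ID strings.
function lockOrder(entityIds: string[]): string[] {
  // Copy before sorting so the caller's array is left untouched.
  return [...entityIds].sort();
}

// Two orchestrations request the same entities in opposite order...
const orderA = lockOrder(["@account@bob", "@account@alice"]);
const orderB = lockOrder(["@account@alice", "@account@bob"]);
// ...but both would acquire "@account@alice" first, so neither can hold one
// lock while waiting on the other in a cycle.
```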
162+
163+
---
164+
165+
## Error Handling Patterns
166+
167+
### Exception Hierarchy
168+
169+
| Error | When |
170+
|---|---|
171+
| `TaskFailedError` | Activity or sub-orchestration fails (carries `FailureDetails`) |
172+
| `NonDeterminismError` | Replayed action mismatches history (not recoverable) |
173+
| `OrchestrationStateError` | Invalid state transition attempted |
174+
| `EntityOperationFailedException` | Entity operation throws |
175+
| `TimeoutError` | Wait timeout exceeded |
176+
177+
### Error Propagation Flow
178+
179+
1. Activity throws → sidecar records failure event → executor calls `task.fail()`
180+
   → generator receives thrown `TaskFailedError` → orchestrator can catch or propagate.
181+
2. Uncaught orchestrator error → executor catches → creates complete action with
182+
failure details → sidecar records failed status.
183+
3. Non-determinism → immediately throws, aborts replay. **Not recoverable by user code.**
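
The first flow can be modeled with the standard `Generator.prototype.throw()` method, which raises an error at the suspended `yield`. This is a simplified sketch with a synchronous generator. `TaskFailedErrorSketch` is a hypothetical stand-in for the SDK's error class, and the string-valued "tasks" are placeholders.

```typescript
// Hypothetical stand-in for the SDK's TaskFailedError.
class TaskFailedErrorSketch extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TaskFailedError";
    // Keeps `instanceof` working on older compile targets.
    Object.setPrototypeOf(this, TaskFailedErrorSketch.prototype);
  }
}

// An orchestrator that compensates when its first task fails.
function* orchestratorWithCompensation(): Generator<string, string, string> {
  try {
    const receipt = yield "charge";
    return `charged:${receipt}`;
  } catch (err) {
    if (err instanceof TaskFailedErrorSketch) {
      const refund = yield "refund";
      return `compensated:${refund}`;
    }
    throw err; // anything else propagates and fails the orchestration
  }
}

const gen = orchestratorWithCompensation();
const scheduled = gen.next(); // suspended at the "charge" yield
// The executor learns that the task failed and throws into the generator.
const afterFailure = gen.throw(new TaskFailedErrorSketch("card declined"));
const final = gen.next("ok"); // the compensation task succeeds
```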
184+
185+
### Retry System
186+
187+
Two flavors, both using **durable timers** for backoff (the timers survive process restarts):
188+
- **Policy-driven:** `RetryPolicy` with max attempts, backoff coefficient, intervals.
189+
Delay formula: `firstRetryInterval × backoffCoefficient^(attempt - 1)`, capped.
190+
- **Custom handler:** User function receives `RetryContext`, returns `RetryAction`.
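
The policy-driven delay formula can be worked through numerically. The helper below is a sketch: `retryDelayMs` and its parameter names are hypothetical, and it assumes the cap is a maximum retry interval.

```typescript
// Hypothetical helper; parameter names mirror the formula in the text.
function retryDelayMs(
  attempt: number, // 1-based attempt number
  firstRetryIntervalMs: number,
  backoffCoefficient: number,
  maxRetryIntervalMs: number, // assumed meaning of "capped"
): number {
  const raw = firstRetryIntervalMs * Math.pow(backoffCoefficient, attempt - 1);
  return Math.min(raw, maxRetryIntervalMs);
}

// With a 1 s first interval, coefficient 2, and a 10 s cap:
const delays = [1, 2, 3, 4, 5].map((n) => retryDelayMs(n, 1000, 2, 10000));
// delays is [1000, 2000, 4000, 8000, 10000]; attempt 5 would be 16000 uncapped.
```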
191+
192+
**Critical invariant:** Retry tasks create timers whose `taskId` differs from the original
193+
task's ID. The executor maintains maps to link timer IDs back to retry tasks. If this
194+
linkage breaks (e.g., during continue-as-new), retries silently fail.
195+
196+
---
197+
198+
## Tracing (OpenTelemetry)
199+
200+
OpenTelemetry is an **optional peer dependency**, loaded via `require()` with fallback.
201+
If not available, tracing is silently disabled.
202+
203+
Cross-SDK conventions (shared with .NET, Python, Java, Go):
204+
- Tracer name: `"Microsoft.DurableTask"`
205+
- Span naming: `"{type}:{name}"` (e.g., `"orchestration:MyWorkflow"`)
206+
- W3C `traceparent` propagated via protobuf `TraceContext` fields
207+
- Replay spans carry forward original span ID for cross-replay correlation
208+
209+
---
210+
211+
## Logging Conventions
212+
213+
### Logger Interface
214+
215+
The SDK supports two logging modes, detected at runtime via type guard:
216+
- **Plain `Logger`:** `error()`, `warn()`, `info()`, `debug()` — string messages
217+
- **`StructuredLogger`:** Adds `logEvent(level, event, message)` with event IDs and
218+
categories matching the .NET SDK for cross-SDK log correlation
219+
220+
### Replay-Safe Logging
221+
222+
`ReplaySafeLogger` wraps any logger and **suppresses all output when `isReplaying = true`**.
223+
This prevents duplicate log entries during history replay. Always use it in orchestrator
224+
context.
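
The wrapping idea can be sketched as follows. `ReplaySafeLoggerSketch` is a hypothetical class, and passing the replay flag as a callback is an assumption made to keep the example self-contained. The SDK's `ReplaySafeLogger` may obtain the flag differently.

```typescript
// Hypothetical sketch; the SDK's ReplaySafeLogger may differ in shape.
interface Logger {
  error(message: string): void;
  warn(message: string): void;
  info(message: string): void;
  debug(message: string): void;
}

class ReplaySafeLoggerSketch implements Logger {
  private readonly inner: Logger;
  private readonly isReplaying: () => boolean;

  constructor(inner: Logger, isReplaying: () => boolean) {
    this.inner = inner;
    this.isReplaying = isReplaying;
  }

  // Each method forwards only when the orchestration is not replaying.
  error(message: string): void {
    if (!this.isReplaying()) this.inner.error(message);
  }
  warn(message: string): void {
    if (!this.isReplaying()) this.inner.warn(message);
  }
  info(message: string): void {
    if (!this.isReplaying()) this.inner.info(message);
  }
  debug(message: string): void {
    if (!this.isReplaying()) this.inner.debug(message);
  }
}

// Collects messages in memory so the effect is easy to observe.
const seen: string[] = [];
const sink: Logger = {
  error: (m) => { seen.push(`error:${m}`); },
  warn: (m) => { seen.push(`warn:${m}`); },
  info: (m) => { seen.push(`info:${m}`); },
  debug: (m) => { seen.push(`debug:${m}`); },
};

let replaying = true;
const log = new ReplaySafeLoggerSketch(sink, () => replaying);
log.info("scheduling activity"); // suppressed: history is being replayed
replaying = false;
log.info("scheduling activity"); // forwarded: this is new execution
```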
225+
226+
---
227+
228+
## Testing Approach
229+
230+
### In-Memory Backend
231+
232+
The SDK provides a full in-memory testing stack that mirrors the gRPC path:
233+
234+
```
235+
TestOrchestrationClient → InMemoryOrchestrationBackend ← TestOrchestrationWorker
236+
```
237+
238+
This enables testing orchestrations with the **real executor logic** but no sidecar, no
239+
gRPC, no network. The backend manages instance state, work queues, timers, and state
240+
waiters.
241+
242+
Key test utilities:
243+
- `backend.waitForState(predicate, timeout)` — polling with timeout for async assertions
244+
- `backend.reset()` — clear all state between tests
245+
- Direct timer control for deterministic time-dependent tests
246+
247+
### Test Conventions
248+
249+
- Framework: Jest 29 with `ts-jest`
250+
- File naming: `.spec.ts` suffix
251+
- Protobuf test helpers: factory functions create `HistoryEvent` instances without a sidecar
252+
- Three tiers: unit tests (no sidecar), e2e with sidecar, e2e with Azure
253+
254+
---
255+
256+
## Where Bugs Tend to Hide
257+
258+
These are stable architectural areas where complexity concentrates:
259+
260+
1. **Replay event matching** — The executor's large switch statement processes 27+ event
261+
types. In several places, unmatched events are logged and then dropped without raising an error.
262+
The TODOs in the code explicitly question whether these should be errors.
263+
264+
2. **Timer-to-retry linkage** — Retry tasks create timers with different IDs. The maps
265+
connecting them are critical. Any break in linkage = silent retry failure.
266+
267+
3. **Generator lifecycle edge cases** — What happens when the generator yields `null`?
268+
When `done` is true but the loop keeps checking? The initial `next()` return value?
269+
Several TODOs mark these as not fully hardened.
270+
271+
4. **Composite task constructor side effects** — Already-complete children trigger
272+
`onChildCompleted()` during construction, potentially completing the composite
273+
before the caller inspects it.
274+
275+
5. **Continue-as-new state reset** — Resetting history while preserving carryover events
276+
(buffered external events) is a subtle operation. Incorrect carryover handling leaks
277+
state between iterations.
278+
279+
6. **Suspend/resume event buffering** — Suspended orchestrations buffer "suspendable"
280+
events. On resume, all buffered events are replayed in order. If events arrive at
281+
the boundary of suspend/resume transitions, ordering may be surprising.
282+
283+
7. **gRPC stream lifecycle** — The worker's streaming connection can enter states where
284+
it's neither cleanly closed nor cleanly connected. The reconnection logic handles most
285+
cases, but simultaneous close + reconnect races exist.
286+
287+
8. **Entity V1 vs V2 code paths** — The worker supports both entity execution paths.
288+
V2 is current; V1 is legacy. Incorrect version detection or mixed-version scenarios
289+
could cause issues.
290+
291+
---
292+
293+
## Code Conventions
294+
295+
### Naming
296+
- Files: `kebab-case.ts`
297+
- Classes: `PascalCase`
298+
- Private fields: `_underscorePrefixed`
299+
- Type aliases: `T`-prefixed (`TOrchestrator`, `TActivity`)
300+
- Constants: `UPPER_SNAKE_CASE` or `PascalCase` objects with `UPPER_SNAKE_CASE` properties
301+
- Enums: `PascalCase` names with `PascalCase` members
302+
303+
### Module System
304+
- CommonJS output, ES2022 target
305+
- OpenTelemetry loaded via synchronous `require()` with catch (future ESM migration noted)
306+
307+
### Cross-SDK Consistency
308+
Many decisions mirror the .NET SDK intentionally: entity name lowercasing, structured
309+
log event IDs, tracing attribute names, span naming conventions. When in doubt about
310+
"why is it done this way?", the answer is often ".NET SDK compatibility."
311+
312+
### License
313+
Every source file starts with the Microsoft copyright header + MIT license reference.
314+
315+
### Builder Pattern
316+
Both packages use the builder pattern for constructing clients and workers.
317+
Fluent API with `.host()`, `.port()`, `.connectionString()`, `.endpoint()`, `.build()`.
318+
319+
### Registration Lock
320+
`addOrchestrator()`, `addActivity()`, `addEntity()` all throw if called after `start()`.
321+
The registry is frozen once the worker begins processing.
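
The freeze-on-start behavior might look roughly like this. `RegistrySketch` is a hypothetical class that models only `addOrchestrator()`, and the error message is invented for the example.

```typescript
// Hypothetical sketch of a registry that rejects registration after start().
type Handler = (...args: unknown[]) => unknown;

class RegistrySketch {
  private started = false;
  private readonly orchestrators = new Map<string, Handler>();

  addOrchestrator(name: string, fn: Handler): void {
    if (this.started) {
      throw new Error("Cannot register orchestrators after the worker has started");
    }
    this.orchestrators.set(name, fn);
  }

  start(): void {
    this.started = true; // from here on, the registry is frozen
  }

  has(name: string): boolean {
    return this.orchestrators.has(name);
  }
}

const registry = new RegistrySketch();
registry.addOrchestrator("early", () => "ok");
registry.start();

let lateError = "";
try {
  registry.addOrchestrator("late", () => "too late");
} catch (err) {
  lateError = (err as Error).message;
}
```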
322+
323+
---
324+
325+
## What Not to Touch
326+
327+
- **Generated proto files** (`*_pb.js`, `*_pb.d.ts`, `*_grpc_pb.js`) — regenerated from
328+
proto definitions, never manually edited
329+
- **`version.ts`** — auto-generated by prebuild script
330+
- **Proto definitions** in `internal/protocol/` — shared across language SDKs, changes
331+
must be coordinated
332+
333+
---
334+
335+
## Key Design Constraints to Respect
336+
337+
1. **No new dependencies** without careful consideration — this is a core SDK
338+
2. **Node.js >= 22 required** — the engines field enforces this
339+
3. **Single-threaded execution** — no locking primitives; correctness depends on not
340+
yielding to the event loop during replay processing
341+
4. **Pre-1.0 API** — public API surface is still evolving but changes should be deliberate
342+
5. **Cross-SDK alignment** — behavior should match .NET/Python/Java/Go SDKs where applicable
