Commit cffdfc6

docs: Update outdated design and API documentation (#327)
1 parent 41ad397 commit cffdfc6

12 files changed

Lines changed: 327 additions & 1983 deletions

docs/adr/003-completable-future-based-coordination.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # ADR-003: CompletableFuture-Based Operation Coordination
 
-**Status:** Review
+**Status:** Accepted
 **Date:** 2026-02-18
 
 ## Context

docs/advanced/configuration.md

Lines changed: 2 additions & 0 deletions
@@ -35,5 +35,7 @@ public class OrderProcessor extends DurableHandler<Order, OrderResult> {
 | `withSerDes()` | Serializer for step results | Jackson with default settings |
 | `withExecutorService()` | Thread pool for user-defined operations | Cached daemon thread pool |
 | `withLoggerConfig()` | Logger behavior configuration | Suppress logs during replay |
+| `withPollingStrategy()` | Backend polling strategy | Exponential backoff: 1s base, 2x rate, FULL jitter, 10s max |
+| `withCheckpointDelay()` | How often the SDK checkpoints updates | `Duration.ofSeconds(0)` (as soon as possible) |
 
 The `withExecutorService()` option configures the thread pool used for running user-defined operations. Internal SDK coordination (checkpoint batching, polling) runs on an SDK-managed thread pool.
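
Reviewer note on the `withPollingStrategy()` default above: the row describes exponential backoff with a 1s base, 2x rate, full jitter, and a 10s cap. The sketch below shows one common reading of those parameters. `FullJitterBackoff` and its methods are hypothetical names for illustration, not SDK classes, and the 1-based attempt numbering is an assumption; the SDK's actual polling implementation may differ.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper (not part of the SDK): exponential backoff with full jitter.
final class FullJitterBackoff {
    private final long baseMillis;
    private final double rate;
    private final long maxMillis;

    FullJitterBackoff(Duration base, double rate, Duration max) {
        this.baseMillis = base.toMillis();
        this.rate = rate;
        this.maxMillis = max.toMillis();
    }

    /** Upper bound for a 1-based attempt: min(max, base * rate^(attempt - 1)). */
    long ceilingMillis(int attempt) {
        double raw = baseMillis * Math.pow(rate, attempt - 1);
        return (long) Math.min((double) maxMillis, raw);
    }

    /** Full jitter: a uniformly random delay between 0 and the ceiling, inclusive. */
    long nextDelayMillis(int attempt) {
        return ThreadLocalRandom.current().nextLong(ceilingMillis(attempt) + 1);
    }

    public static void main(String[] args) {
        // Parameters taken from the documented default: 1s base, 2x rate, 10s max.
        FullJitterBackoff backoff =
                new FullJitterBackoff(Duration.ofSeconds(1), 2.0, Duration.ofSeconds(10));
        for (int attempt = 1; attempt <= 6; attempt++) {
            System.out.printf("attempt %d: ceiling=%dms, delay=%dms%n",
                    attempt, backoff.ceilingMillis(attempt), backoff.nextDelayMillis(attempt));
        }
    }
}
```

With these parameters the ceiling doubles from 1s until it reaches the 10s cap, while the actual delay is randomized below that ceiling so that concurrent pollers do not synchronize.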

docs/advanced/error-handling.md

Lines changed: 44 additions & 16 deletions
@@ -3,22 +3,34 @@
 The SDK throws specific exceptions to help you handle different failure scenarios:
 
 ```
-DurableExecutionException - General durable exception
-├── NonDeterministicExecutionException - Code changed between original execution and replay. Fix code to maintain determinism; don't change step order/names.
-├── SerDesException - Serialization and deserialization exception.
-└── DurableOperationException - General operation exception
-    ├── StepException - General Step exception
-    │   ├── StepFailedException - Step exhausted all retry attempts.Catch to implement fallback logic or let execution fail.
-    │   └── StepInterruptedException - `AT_MOST_ONCE` step was interrupted before completion. Implement manual recovery (check if operation completed externally)
-    ├── InvokeException - General chained invocation exception
-    │   ├── InvokeFailedException - Chained invocation failed. Handle the error or propagate failure.
-    │   ├── InvokeTimedoutException - Chained invocation timed out. Handle the error or propagate failure.
-    │   └── InvokeStoppedException - Chained invocation stopped. Handle the error or propagate failure.
-    ├── CallbackException - General callback exception
-    │   ├── CallbackFailedException - External system sent an error response to the callback. Handle the error or propagate failure
-    │   └── CallbackTimeoutException - Callback exceeded its timeout duration. Handle the error or propagate the failure
-    ├── WaitForConditionFailedException- waitForCondition exceeded max polling attempts or failed. Catch to implement fallback logic.
-    └── ChildContextFailedException - Child context failed and the original exception could not be reconstructed
+RuntimeException
+├── SuspendExecutionException - Internal control-flow exception thrown by the SDK to suspend execution
+│     (e.g., during wait(), waitForCallback(), waitForCondition()).
+│     The SDK catches this internally — you will never see it unless you have
+│     a broad catch(Exception) block around durable operations. If caught
+│     accidentally, you MUST re-throw it so the SDK can suspend correctly.
+
+└── DurableExecutionException - General durable exception
+    ├── SerDesException - Serialization and deserialization exception.
+    ├── UnrecoverableDurableExecutionException - Execution cannot be recovered. The durable execution will be immediately terminated.
+    │   ├── NonDeterministicExecutionException - Code changed between original execution and replay. Fix code to maintain determinism; don't change step order/names.
+    │   └── IllegalDurableOperationException - An illegal operation was detected. The execution will be immediately terminated.
+    └── DurableOperationException - General operation exception
+        ├── StepException - General Step exception
+        │   ├── StepFailedException - Step exhausted all retry attempts. Catch to implement fallback logic or let execution fail.
+        │   └── StepInterruptedException - `AT_MOST_ONCE` step was interrupted before completion. Implement manual recovery (check if operation completed externally)
+        ├── InvokeException - General chained invocation exception
+        │   ├── InvokeFailedException - Chained invocation failed. Handle the error or propagate failure.
+        │   ├── InvokeTimedOutException - Chained invocation timed out. Handle the error or propagate failure.
+        │   └── InvokeStoppedException - Chained invocation stopped. Handle the error or propagate failure.
+        ├── CallbackException - General callback exception
+        │   ├── CallbackFailedException - External system sent an error response to the callback. Handle the error or propagate failure
+        │   ├── CallbackTimeoutException - Callback exceeded its timeout duration. Handle the error or propagate the failure
+        │   └── CallbackSubmitterException - Submitter step failed to submit the callback. Handle the error or propagate failure
+        ├── WaitForConditionFailedException- waitForCondition exceeded max polling attempts or failed. Catch to implement fallback logic.
+        ├── ChildContextFailedException - Child context failed and the original exception could not be reconstructed
+        ├── MapIterationFailedException - Map iteration failed and the original exception could not be reconstructed
+        └── ParallelBranchFailedException - Parallel branch failed and the original exception could not be reconstructed
 ```
 
 ```java
@@ -36,4 +48,20 @@ try {
         throw e; // Let it fail - manual intervention needed
     }
 }
+```
+
+### Handling SuspendExecutionException
+
+If you have a broad `catch (Exception e)` block around durable operations, you must re-throw `SuspendExecutionException` to let the SDK suspend correctly:
+
+```java
+try {
+    ctx.step("work", String.class, stepCtx -> doWork());
+    ctx.wait("pause", Duration.ofDays(1));
+    ctx.step("more-work", String.class, stepCtx -> doMoreWork());
+} catch (SuspendExecutionException e) {
+    throw e; // Always re-throw — lets the SDK suspend the execution
+} catch (Exception e) {
+    log.error("Operation failed", e);
+}
 ```

docs/core/callbacks.md

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ var waitForCallbackConfig = WaitForCallbackConfig.builder()
     .callbackConfig(config)
     .stepConfig(StepConfig.builder().retryStrategy(...).build())
     .build();
-ctx.waitForCallback("approval", String.class, callbackId -> sendApprovalRequest(callbackId), waitForCallbackConfig);
+ctx.waitForCallback("approval", String.class, (callbackId, stepCtx) -> sendApprovalRequest(callbackId), waitForCallbackConfig);
 ```
 
 | Option | Description |

docs/core/invoke.md

Lines changed: 1 addition & 2 deletions
@@ -9,8 +9,7 @@ var result = ctx.invoke("invoke-function",
     Result.class,
     InvokeConfig.builder()
         .payloadSerDes(...) // payload serializer
-        .resultSerDes(...) // result deserializer
-        .timeout(Duration.of(...)) // wait timeout
+        .serDes(...) // result deserializer
         .tenantId(...) // Lambda tenantId
         .build()
 );

docs/core/map.md

Lines changed: 5 additions & 5 deletions
@@ -57,9 +57,9 @@ Each `MapResultItem<T>` contains:
 
 | Field | Description |
 |-------|-------------|
-| `status()` | `SUCCEEDED`, `FAILED`, or `NOT_STARTED` |
-| `result()` | The result value, or `null` if failed/not started |
-| `error()` | The error details as `MapError`, or `null` if succeeded/not started |
+| `status()` | `SUCCEEDED`, `FAILED`, or `SKIPPED` |
+| `result()` | The result value, or `null` if failed/skipped |
+| `error()` | The error details as `MapError`, or `null` if succeeded/skipped |
 
 ### MapError
 
@@ -135,10 +135,10 @@ var config = MapConfig.builder()
     .build();
 
 var result = ctx.map("find-two", items, String.class, fn, config);
-assertEquals(CompletionReason.MIN_SUCCESSFUL_REACHED, result.completionReason());
+assertEquals(ConcurrencyCompletionStatus.MIN_SUCCESSFUL_REACHED, result.completionReason());
 ```
 
-When early termination triggers, items that were never started have `NOT_STARTED` status with `null` for both result and error in the `MapResult`.
+When early termination triggers, items that were never started have `SKIPPED` status with `null` for both result and error in the `MapResult`.
 
 ### Checkpoint-and-Replay
 
docs/core/parallel.md

Lines changed: 94 additions & 113 deletions
@@ -1,143 +1,124 @@
-# Parallel Operations Design Plan
+## parallel() – Concurrent Branch Execution
 
-## Overview
-
-Add parallel execution capability to the AWS Lambda Durable Execution SDK, allowing multiple branches to run concurrently within a single durable function execution.
-
-## API Design
-
-### User Interface
+`parallel()` runs multiple independent branches concurrently, each in its own child context. Branches are registered via `branch()` and execute immediately (respecting `maxConcurrency`). The operation completes when all branches finish or completion criteria are met.
 
 ```java
-try (var parallelContext = ctx.parallel(ParallelConfig.builder().build())) {
-    DurableFuture<Boolean> task1 = parallelContext.branch("validate", Boolean.class, branchContext -> validate());
-    DurableFuture<String> task2 = parallelContext.branch("process", String.class, branchContext -> process());
-    parallelContext.join(); // Wait for completion based on config
-
-    // Access results
-    Boolean validated = task1.get();
-    String processed = task2.get();
-}
+// Basic parallel execution
+var parallel = ctx.parallel("validate-and-process");
+DurableFuture<Boolean> task1 = parallel.branch("validate", Boolean.class, branchCtx -> {
+    return branchCtx.step("check", Boolean.class, stepCtx -> validate());
+});
+DurableFuture<String> task2 = parallel.branch("process", String.class, branchCtx -> {
+    return branchCtx.step("work", String.class, stepCtx -> process());
+});
+
+// Wait for all branches and get the aggregate result
+ParallelResult result = parallel.get();
+
+// Access individual branch results
+Boolean validated = task1.get();
+String processed = task2.get();
 ```
 
-### Core Components
-
-#### 1. ParallelConfig
-Configuration object controlling parallel execution behavior:
+`ParallelDurableFuture` implements `AutoCloseable` — calling `close()` triggers `get()` if it hasn't been called yet, ensuring all branches complete.
 
 ```java
-ParallelConfig config = ParallelConfig.builder()
-    .maxConcurrency(5) // Max branches running simultaneously
-    .minSuccessful(3) // Minimum successful branches required (-1 = all)
-    .toleratedFailureCount(2) // Max failures before stopping execution
-    .build();
+// AutoCloseable pattern
+try (var parallel = ctx.parallel("work")) {
+    parallel.branch("a", String.class, branchCtx -> branchCtx.step("a1", String.class, stepCtx -> "a"));
+    parallel.branch("b", String.class, branchCtx -> branchCtx.step("b1", String.class, stepCtx -> "b"));
+} // close() calls get() automatically
 ```
 
-**Configuration Rules:**
-- `maxConcurrency`: Controls resource usage, prevents overwhelming the system
-- `minSuccessful`: Enables "best effort" scenarios where not all branches need to succeed
-- `toleratedFailureCount`: Fail-fast behavior when too many branches fail
+### ParallelResult
 
-#### 2. ParallelContext
-Manages the lifecycle of parallel branches:
+`ParallelResult` is a summary of the parallel execution:
 
-```java
-public class ParallelContext implements AutoCloseable {
-    // Create branches
-    public <T> DurableFuture<T> branch(String name, Class<T> resultType, Function<DurableContext, T> func);
-    public <T> DurableFuture<T> branch(String name, TypeToken<T> resultType, Function<DurableContext, T> func);
-
-    // Wait for completion
-    public void join();
-
-    // AutoCloseable ensures join() is called
-    public void close();
-}
-```
+| Field | Description |
+|-------|-------------|
+| `size()` | Total number of registered branches |
+| `succeeded()` | Number of branches that succeeded |
+| `failed()` | Number of branches that failed |
+| `completionStatus()` | Why the operation completed (`ALL_COMPLETED`, `MIN_SUCCESSFUL_REACHED`, `FAILURE_TOLERANCE_EXCEEDED`) |
 
-#### 3. DurableContext Integration
-Add single method to existing `DurableContext`:
-
-```java
-public ParallelContext parallel(ParallelConfig config);
-```
+### ParallelConfig
 
-## Implementation Strategy
+Configure concurrency limits and completion criteria:
 
-### 1. Leverage Existing Child Context Infrastructure
-
-Each parallel branch will be implemented as a `ChildContextOperation`:
-- **Isolation**: Each branch has its own checkpoint log
-- **Replay Safety**: Branches replay independently
-- **Error Handling**: Branch failures don't affect other branches directly
-
-### 2. Execution Flow
-
-1. **Branch Registration**: `branch()` calls create `ChildContextOperation` instances but don't execute immediately
-2. **Execution Start**: `join()` triggers execution of branches respecting `maxConcurrency`
-3. **Concurrency Control**: Use a queue to manage pending branches when `maxConcurrency` is reached
-4. **Completion Logic**: Monitor success/failure counts against configuration thresholds
-5. **Result Collection**: Return results via `DurableFuture` instances
+```java
+var config = ParallelConfig.builder()
+    .maxConcurrency(5) // at most 5 branches run at once
+    .completionConfig(CompletionConfig.allCompleted()) // default: run all branches
+    .build();
 
+var parallel = ctx.parallel("work", config);
+```
 
-### 4. Error Handling Strategy
+| Option | Default | Description |
+|--------|---------|-------------|
+| `maxConcurrency` | Unlimited | Maximum branches running simultaneously (must be ≥ 1) |
+| `completionConfig` | `allCompleted()` | Controls when the operation stops starting new branches |
 
-**Branch-Level Failures:**
-- Individual branch failures are captured in their respective `DurableFuture`
-- Don't immediately fail the entire parallel operation
-- Count towards `failureCount` for threshold checking
+#### CompletionConfig
 
-**Parallel-Level Failures:**
-- Exceed `toleratedFailureCount`: Stop starting new branches, wait for running ones
-- Insufficient `minSuccessful`: Throw `ParallelExecutionException` after all branches complete
-- Configuration validation errors: Fail immediately
+`CompletionConfig` controls when the parallel operation stops starting new branches:
 
-## Key Design Decisions
+| Factory Method | Behavior |
+|----------------|----------|
+| `allCompleted()` (default) | All branches run regardless of failures |
+| `allSuccessful()` | Stop if any branch fails (zero failures tolerated) |
+| `firstSuccessful()` | Stop after the first branch succeeds |
+| `minSuccessful(n)` | Stop after `n` branches succeed |
+| `toleratedFailureCount(n)` | Stop after more than `n` failures |
 
-### 1. Build on Child Contexts
-- **Pros**: Reuses existing isolation and checkpointing logic
-- **Cons**: Each branch has overhead of a separate child context
-- **Decision**: Acceptable trade-off for clean isolation and replay safety
+Note: `toleratedFailurePercentage` is not supported for parallel operations.
 
-### 2. Eager vs Lazy Execution
-- **Chosen**: Lazy execution (branches start only on `join()`)
-- **Rationale**: Allows all branches to be registered before execution starts, enabling better concurrency planning
+### ParallelBranchConfig
 
-### 3. AutoCloseable Pattern
-- **Purpose**: Ensures `join()` is called even if user forgets
-- **Behavior**: If `close()` is called before `join()`, automatically call `join()`
+Per-branch configuration can be provided:
 
-### 4. Configuration Validation
-- Validate at `ParallelConfig.build()` time:
-  - `maxConcurrency > 0`
-  - `minSuccessful >= -1` (where -1 means "all")
-  - `toleratedFailureCount >= 0`
-  - `minSuccessful + toleratedFailureCount <= total branches` (validated at runtime)
+```java
+parallel.branch("work", String.class, branchCtx -> doWork(),
+    ParallelBranchConfig.builder()
+        .serDes(customSerDes)
+        .build());
+```
 
-## Implementation Files
+### Error Handling
 
-### New Files to Create
-1. `ParallelConfig.java` - Configuration builder
-2. `ParallelContext.java` - User-facing parallel context
-3. `operation/ParallelOperation.java` - Core execution logic
-4. `exception/ParallelExecutionException.java` - Parallel-specific exceptions
+Branch failures are captured individually. A failed branch throws its exception when you call `get()` on its `DurableFuture`:
 
-### Files to Modify
-1. `DurableContext.java` - Add `parallel()` method
-2. `DurableFuture.java` - Ensure compatibility with parallel results (likely no changes needed)
+```java
+var parallel = ctx.parallel("work");
+var risky = parallel.branch("risky", String.class, branchCtx -> {
+    throw new RuntimeException("failed");
+});
+var safe = parallel.branch("safe", String.class, branchCtx -> {
+    return branchCtx.step("ok", String.class, stepCtx -> "done");
+});
+
+ParallelResult result = parallel.get();
+
+String safeResult = safe.get(); // "done"
+try {
+    risky.get(); // throws
+} catch (ParallelBranchFailedException e) {
+    // Branch failed and the SDK could not reconstruct the original exception.
+    // This happens when: the error info was not checkpointed, the exception
+    // class is not on the classpath, or deserialization of the error data
+    // failed. The original error type and message are in e.getMessage().
+}
+```
 
-## Testing Strategy
+| Exception | When Thrown |
+|-----------|-------------|
+| `ParallelBranchFailedException` | Branch failed and the original exception could not be reconstructed |
+| User's exception | Branch threw a reconstructable exception — propagated through `get()` |
 
-### Unit Tests
-- `ParallelConfigTest` - Configuration validation
-- `ParallelOperationTest` - Core execution logic with mocked child contexts
+### Checkpoint-and-Replay
 
-### Integration Tests
-- Success scenarios with various configurations
-- Failure scenarios (exceeding thresholds)
-- Concurrency limits
-- Replay behavior
+Parallel operations are fully durable. On replay after interruption:
 
-### Example Implementation
-- `ParallelExample.java` in examples module
-- Demonstrate common patterns and error handling
+- Completed branches return cached results without re-execution
+- Incomplete branches resume from their last checkpoint
+- Branches that never started execute fresh
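
Reviewer note on the `CompletionConfig` table added in `docs/core/parallel.md`: the five factory methods reduce to two thresholds, a minimum success count and a maximum tolerated failure count. The sketch below is a rough model of that reading. `CompletionModel`, `Criteria`, and `evaluate` are hypothetical names, and the order in which the thresholds are checked is an assumption; only the three status values come from the documented `completionStatus()` field.

```java
import java.util.Optional;

// Hypothetical model for illustration only; not SDK code.
final class CompletionModel {
    // Status names quoted from the ParallelResult table in the diff above.
    enum Status { ALL_COMPLETED, MIN_SUCCESSFUL_REACHED, FAILURE_TOLERANCE_EXCEEDED }

    /** Stop criteria; a null threshold means "no limit". */
    record Criteria(Integer minSuccess, Integer maxFailures) {
        static Criteria allCompleted() { return new Criteria(null, null); }
        static Criteria allSuccessful() { return new Criteria(null, 0); }
        static Criteria firstSuccessful() { return new Criteria(1, null); }
        static Criteria minSuccessful(int n) { return new Criteria(n, null); }
        static Criteria toleratedFailureCount(int n) { return new Criteria(null, n); }
    }

    /** Returns the completion status once a stop condition holds, or empty to keep going. */
    static Optional<Status> evaluate(Criteria c, int total, int succeeded, int failed) {
        if (c.maxFailures() != null && failed > c.maxFailures()) {
            return Optional.of(Status.FAILURE_TOLERANCE_EXCEEDED);
        }
        if (c.minSuccess() != null && succeeded >= c.minSuccess()) {
            return Optional.of(Status.MIN_SUCCESSFUL_REACHED);
        }
        if (succeeded + failed == total) {
            return Optional.of(Status.ALL_COMPLETED);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Three registered branches, one success so far, no failures.
        System.out.println(evaluate(Criteria.firstSuccessful(), 3, 1, 0));
        System.out.println(evaluate(Criteria.allCompleted(), 3, 1, 0));
    }
}
```

In this model `allSuccessful()` is simply `toleratedFailureCount(0)` and `firstSuccessful()` is `minSuccessful(1)`, which matches the behavior column of the table.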

docs/core/steps.md

Lines changed: 1 addition & 1 deletion
@@ -119,4 +119,4 @@ var orderMap = ctx.step("fetch-orders", new TypeToken<Map<String, Order>>() {},
     stepCtx -> orderService.getOrdersByCustomer());
 ```
 
-This is needed for the SDK to deserialize a checkpointed result and get the exact type to reconstruct. See [TypeToken and Type Erasure](docs/internal-design.md#typetoken-and-type-erasure) for technical details.
+This is needed for the SDK to deserialize a checkpointed result and get the exact type to reconstruct. See [TypeToken and Type Erasure](../design.md#custom-serdes-and-typetoken) for technical details.
