Skip to content

Commit 8d2c6c7

Browse files
committed
feat(retry): implement story 4.3 retry strategy configuration
- Add NonRetryableErrorTypes list to ToTemporalRetryPolicy (7 error types) - Enhance ExecuteStepActivity with NonRetryableError detection and conversion - Add retry_nonretry_test.go with comprehensive unit tests for error types - Update docs/nodes/README.md with error handling best practices and examples - Simplify workflow/activity tests, defer integration tests to test/integration/ - Add test scripts and testdata for retry strategy validation - Update Story 4.3 status to done with code review fixes documented - Fix errcheck linting issues in plugin manager tests Resolves: Story 4.3 - Retry Strategy Configuration Coverage: pkg/dsl 89.3%, pkg/dsl/node 84.4% Tests: All pkg/temporal and pkg/dsl tests passing Note: gocyclo warnings in validator.go from Story 4.2 are acceptable for complex validation logic
1 parent 789b094 commit 8d2c6c7

36 files changed

Lines changed: 13266 additions & 210 deletions

agent

39.3 MB
Binary file not shown.

docs/configuration.md

Lines changed: 80 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -281,3 +281,83 @@ Failed to load config: invalid configuration: log.level must be one of [debug, i
281281
- 检查配置文件中的值是否符合要求
282282
- 参考本文档中的可选值列表
283283
- 使用默认值或环境变量覆盖
284+
## Step 配置参考
285+
286+
### retry-strategy (重试策略)
287+
288+
配置 Step 失败时的重试行为。(Story 4.3)
289+
290+
**字段:**
291+
292+
| 字段 | 类型 | 必需 | 默认值 | 说明 |
293+
|------|------|------|--------|------|
294+
| max-attempts | integer || 3 | 最大尝试次数 (1-10) |
295+
| initial-interval | string || "1s" | 首次重试间隔 (≥1s) |
296+
| backoff-coefficient | float || 2.0 | 退避系数 (1.0-10.0) |
297+
| max-interval | string || "60s" | 最大重试间隔 |
298+
299+
**示例:**
300+
301+
```yaml
302+
steps:
303+
# 默认重试策略 (3次,指数退避)
304+
- name: Quick Task
305+
uses: exec/script@v1
306+
with:
307+
command: ./task.sh
308+
309+
# 自定义重试策略 - 网络调用
310+
- name: API Call
311+
uses: http/request@v1
312+
retry-strategy:
313+
max-attempts: 10
314+
initial-interval: 2s
315+
backoff-coefficient: 1.5
316+
max-interval: 60s
317+
with:
318+
url: https://api.example.com/data
319+
320+
# 禁用重试
321+
- name: One Shot
322+
uses: exec/script@v1
323+
retry-strategy:
324+
max-attempts: 1
325+
with:
326+
command: ./critical-task.sh
327+
```
328+
329+
**重试算法:**
330+
331+
指数退避算法:
332+
```
333+
下次间隔 = min(
334+
initial-interval * (backoff-coefficient ^ attempt),
335+
max-interval
336+
)
337+
```
338+
339+
示例 (initial-interval=1s, backoff-coefficient=2.0):
340+
```
341+
尝试 1: 失败 → 等待 1s
342+
尝试 2: 失败 → 等待 2s
343+
尝试 3: 失败 → 等待 4s
344+
尝试 4: 失败 → 等待 8s
345+
```
346+
347+
**永久性错误 (不重试):**
348+
349+
以下错误类型不会重试,立即失败:
350+
- `validation_error` - 参数验证错误
351+
- `schema_error` - JSON Schema 验证错误
352+
- `not_found` - 资源不存在
353+
- `permission_denied` - 权限不足
354+
- `invalid_argument` - 参数无效
355+
- `node_not_registered` - 节点未注册
356+
- `plugin_load_error` - 插件加载失败
357+
358+
**最佳实践:**
359+
360+
- 🎯 **网络调用:** 使用较高的 max-attempts (5-10次)
361+
- 🎯 **本地脚本:** 使用默认策略 (3次)
362+
- 🎯 **关键任务:** 禁用重试 (max-attempts: 1)
363+
- 🎯 **长时间任务:** 增大 max-interval (避免过长等待)

docs/guides/node-development.md

Lines changed: 79 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -87,29 +87,93 @@ func (n *MyNode) Params() map[string]node.ParamSpec {
8787
- `Enum` - Allowed values
8888
- `MinValue` / `MaxValue` - Numeric ranges (for int/float)
8989

90+
#### Parameter Validation Details
91+
92+
**Automatic Validation**: Waterflow automatically validates all parameters against your `ParamSpec` definitions:
93+
94+
1. **Submission Time** (Static Values): DSL parser validates non-expression parameters when workflow is submitted
95+
2. **Runtime** (All Values): Activity validates all parameters (including expression results) before calling `Execute()`
96+
97+
**Validation Rules**:
98+
99+
| Constraint | Applies To | Example |
100+
|------------|-----------|---------|
101+
| `Required` | All types | Missing required parameter → Error |
102+
| `Type` | All types | String expected, int provided → Error |
103+
| `Pattern` | `string` | Email regex `^[a-z]+@[a-z]+\.[a-z]+$` |
104+
| `Enum` | All types | Must be one of `["GET", "POST", "PUT"]` |
105+
| `MinValue`/`MaxValue` | `int`, `float` | Value must be in range `[1, 100]` |
106+
| `Default` | Optional params | Applied if value not provided |
107+
108+
**Type Compatibility** (JSON/YAML Parsing):
109+
- Numbers in JSON/YAML are parsed as `float64`
110+
- `Type: "int"` accepts `float64` if no decimal part (e.g., `30.0` → valid int)
111+
- `Type: "float"` accepts both `int` and `float64`
112+
113+
**Expression Parameters**:
114+
- Expressions like `${{ vars.timeout }}` skip validation at submission time
115+
- Validated after expression evaluation at runtime
116+
- Expression results must match the declared type
117+
118+
**Error Handling**:
119+
```go
120+
// Validation errors are NonRetryableError (permanent)
121+
err := node.ValidateInputs(inputs, n.Params())
122+
if err != nil {
123+
// Error type: *node.InputValidationError
124+
// Properties:
125+
// - NodeName: "exec/shell@v1"
126+
// - Errors: []ParameterError
127+
// - ParamName: "timeout"
128+
// - ErrorType: "RangeViolation"
129+
// - Expected: "<= 60"
130+
// - Actual: 120
131+
// - Message: "value 120 exceeds maximum 60"
132+
return nil, err
133+
}
134+
```
135+
136+
**Validation Error Types**:
137+
- `Missing` - Required parameter not provided
138+
- `TypeMismatch` - Wrong type (e.g., string instead of int)
139+
- `PatternMismatch` - String doesn't match regex pattern
140+
- `EnumViolation` - Value not in allowed list
141+
- `RangeViolation` - Number outside min/max range
142+
143+
**Best Practices**:
144+
- Use `Pattern` for formats (emails, URLs, file paths)
145+
- Use `Enum` for limited choices (methods, log levels)
146+
- Use `MinValue`/`MaxValue` for sensible ranges (timeouts, counts)
147+
- Provide `Default` values for optional parameters
148+
- Write descriptive `Description` to help users
149+
90150
### 5. Implement Execute Logic
91151

92152
```go
93153
func (n *MyNode) Execute(ctx context.Context, inputs map[string]interface{}) (*node.NodeResult, error) {
94154
start := time.Now()
95155

96-
// 1. Validate inputs
97-
if err := node.ValidateInputs(inputs, n.Params()); err != nil {
98-
return nil, err
99-
}
156+
// ⚠️ 参数已由 Temporal Activity 验证,无需再次验证
157+
// Waterflow 保证传入的 inputs 100% 符合 ParamSpec 定义
100158

101-
// 2. Extract parameters
159+
// 1. 提取参数 (类型断言安全,因为已验证)
102160
command := inputs["command"].(string)
103-
timeout := 60
161+
timeout := 60 // Default value
104162
if t, ok := inputs["timeout"]; ok {
105-
timeout = t.(int)
163+
// Type already validated - safe to assert
164+
switch v := t.(type) {
165+
case int:
166+
timeout = v
167+
case float64:
168+
timeout = int(v) // JSON numbers are float64
169+
}
106170
}
107171

108-
// 3. Create result
172+
// 2. Create result
109173
result := node.NewNodeResult()
110174
result.AddLog("Execution started")
111175

112-
// 4. Perform work (check context cancellation)
176+
// 3. Perform work (check context cancellation)
113177
select {
114178
case <-ctx.Done():
115179
return nil, ctx.Err()
@@ -124,7 +188,7 @@ func (n *MyNode) Execute(ctx context.Context, inputs map[string]interface{}) (*n
124188
result.SetOutput("exit_code", 0)
125189
}
126190

127-
// 5. Record duration and logs
191+
// 4. Record duration and logs
128192
result.Duration = time.Since(start)
129193
result.AddLog("Execution completed")
130194

@@ -133,11 +197,11 @@ func (n *MyNode) Execute(ctx context.Context, inputs map[string]interface{}) (*n
133197
```
134198

135199
**Best Practices**:
136-
- Always validate inputs first
137-
- Handle `ctx.Done()` for cancellation
138-
- Log execution progress
139-
- Return structured outputs
140-
- Record execution duration
200+
- **Do NOT validate inputs** - Already validated by Activity layer
201+
- Handle `ctx.Done()` for cancellation support
202+
- Log execution progress for debugging
203+
- Return structured outputs in OutputSchema format
204+
- Record execution duration for metrics
141205

142206
### 6. Provide Metadata
143207

docs/nodes/README.md

Lines changed: 94 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -146,25 +146,111 @@ steps:
146146
continue-on-error: true # 容器不存在也继续执行
147147
```
148148

149-
#### 重试策略
149+
#### 重试策略 (Story 4.3)
150150

151151
```yaml
152152
steps:
153153
- name: Pull image with retry
154154
uses: docker/exec@v1
155+
retry-strategy:
156+
max-attempts: 3
157+
initial-interval: 5s
158+
backoff-coefficient: 2.0
159+
max-interval: 60s
155160
with:
156161
command: pull
157162
args: ["nginx:latest"]
158-
retry:
159-
max_attempts: 3
160-
initial_interval: 5s
161-
backoff_coefficient: 2.0
162163
```
163164

164165
**重试参数:**
165-
- `max_attempts`: 最大尝试次数
166-
- `initial_interval`: 初始重试间隔
167-
- `backoff_coefficient`: 退避系数(每次重试间隔乘以此系数)
166+
- `max-attempts`: 最大尝试次数 (1-10)
167+
- `initial-interval`: 初始重试间隔 (≥1s)
168+
- `backoff-coefficient`: 退避系数 (1.0-10.0)
169+
- `max-interval`: 最大重试间隔
170+
171+
**重试算法:** 指数退避 (`间隔 = initial-interval * backoff-coefficient ^ attempt`)
172+
173+
**永久性错误 (不重试):**
174+
某些错误类型会立即失败,不进行重试:
175+
- `validation_error` - 参数验证错误
176+
- `schema_error` - Schema 验证错误
177+
- `not_found` - 资源不存在
178+
- `permission_denied` - 权限不足
179+
- `invalid_argument` - 无效参数
180+
- `node_not_registered` - 节点未注册
181+
- `plugin_load_error` - 插件加载失败
182+
183+
更多详情参考: [配置文档 - retry-strategy](../configuration.md#retry-strategy-重试策略)
184+
185+
## 节点错误处理最佳实践
186+
187+
在自定义节点中,应该正确区分永久性错误和临时性错误:
188+
189+
### 永久性错误 (NonRetryableError)
190+
191+
用于参数错误、权限问题等,重试无意义的场景:
192+
193+
```go
194+
import "github.com/Websoft9/waterflow/pkg/dsl/node"
195+
196+
func (n *MyNode) Execute(ctx context.Context, inputs map[string]interface{}) (*node.NodeResult, error) {
197+
// 参数验证
198+
replicas, ok := inputs["replicas"].(float64)
199+
if !ok || replicas <= 0 {
200+
return nil, &node.NonRetryableError{
201+
ErrorType: "validation_error",
202+
Message: "parameter 'replicas' must be positive integer",
203+
}
204+
}
205+
206+
// 权限检查
207+
if !hasPermission(ctx) {
208+
return nil, &node.NonRetryableError{
209+
ErrorType: "permission_denied",
210+
Message: "insufficient permissions to execute this operation",
211+
}
212+
}
213+
214+
// ... 执行逻辑
215+
}
216+
```
217+
218+
### 临时性错误 (可重试)
219+
220+
用于网络超时、服务不可用等,重试可能成功的场景:
221+
222+
```go
223+
func (n *MyNode) Execute(ctx context.Context, inputs map[string]interface{}) (*node.NodeResult, error) {
224+
// 网络调用
225+
resp, err := http.Get(url)
226+
if err != nil {
227+
// 返回普通 error,Waterflow 会自动重试
228+
return nil, fmt.Errorf("network call failed: %w", err)
229+
}
230+
231+
// 服务不可用 (503)
232+
if resp.StatusCode == 503 {
233+
return nil, fmt.Errorf("service temporarily unavailable")
234+
}
235+
236+
// ... 处理响应
237+
}
238+
```
239+
240+
### 错误类型对照表
241+
242+
| 场景 | 错误类型 | 重试 | 示例 |
243+
|------|----------|------|------|
244+
| 参数验证失败 | validation_error | ❌ | 缺少必填参数 |
245+
| JSON Schema 错误 | schema_error | ❌ | 参数类型不匹配 |
246+
| 资源不存在 | not_found | ❌ | 文件/容器不存在 |
247+
| 权限不足 | permission_denied | ❌ | SSH 认证失败 |
248+
| 无效参数 | invalid_argument | ❌ | 端口号超出范围 |
249+
| 网络超时 | (普通 error) | ✅ | http.Get() timeout |
250+
| 服务不可用 | (普通 error) | ✅ | API 返回 503 |
251+
| 临时故障 | (普通 error) | ✅ | 数据库连接失败 |
252+
253+
更多节点开发指南,参考: [节点开发文档](../guides/node-development.md)
168254

169255
### 使用 Secrets
170256

0 commit comments

Comments
 (0)