Commit df7760b

feat(story-2-7): agent health monitoring via Temporal native API

- Implement GET /v1/agents, /v1/agents/{name}, /v1/agents/summary via DiscoverTaskQueues + DescribeTaskQueue (AC1, AC2, AC6)
- Implement GET /v1/task-queues replacing Story 2-2 placeholder (AC3)
- AC4 heartbeat provided natively by Temporal Worker SDK
- AC5 health detection: 30-second poller window in DescribeTaskQueue
- AC7 OpenAPI spec updated: Agents/TaskQueues tags, paths, schemas

Code review fixes:
- H1: GetAgentStatus uses mux.Vars(r)["name"] instead of raw URL slice
- H2: Add agent_handler_test.go (12 tests: determineAgentStatus table-driven, nil client guards, JSON field name validation)
- H3: DescribeTaskQueue queries both ACTIVITY and WORKFLOW poller types
- M1: File List updated with 7 previously undocumented changed files
- M2: Rename AdminHandler var ah -> adminH in router.go to avoid shadowing
- M3: task_queue_test.go adds 30-second health window boundary tests
- M4: LastUpdateTime uses time.Time{} zero value on query failure
- L1: Remove outdated POST /v1/agents/heartbeat architecture diagram
- L2: Align Dev Notes AC7 status with Tasks [x] completion

Closes: Story 2-7
1 parent 144090e commit df7760b

17 files changed

Lines changed: 1034 additions & 932 deletions

api/openapi.yaml

Lines changed: 200 additions & 0 deletions
@@ -65,6 +65,10 @@ tags:
     description: Audit logs - query system audit events
   - name: Health
     description: Health checks and monitoring - system status, metrics, version information
+  - name: Agents
+    description: Agent discovery and health monitoring - based on real-time Temporal Task Queue status
+  - name: Task Queues
+    description: Task Queue listing and Worker counts - native Temporal data

 security:
   - bearerAuth: []
@@ -1004,6 +1008,84 @@ paths:
                     type: string
                     example: "2026-01-15T10:00:00Z"

+  # ==================== Agents ====================
+  /v1/agents:
+    get:
+      summary: List all Agents
+      description: |
+        Returns the Agent health status for every known Task Queue.
+        Task Queues are discovered via Temporal `DiscoverTaskQueues`, then `DescribeTaskQueue` is queried for real-time poller data.
+      operationId: listAgents
+      tags: [Agents]
+      responses:
+        '200':
+          description: Agent list
+          content:
+            application/json:
+              schema:
+                $ref: '#/components/schemas/ListAgentsResponse'
+        '500':
+          $ref: '#/components/responses/InternalError'
+
+  /v1/agents/summary:
+    get:
+      summary: Get Agent summary statistics
+      description: Returns health status aggregated across all Task Queues; suitable for monitoring dashboards.
+      operationId: getAgentsSummary
+      tags: [Agents]
+      responses:
+        '200':
+          description: Summary statistics
+          content:
+            application/json:
+              schema:
+                $ref: '#/components/schemas/AgentsSummaryResponse'
+        '500':
+          $ref: '#/components/responses/InternalError'
+
+  /v1/agents/{name}:
+    get:
+      summary: Get the status of a single Agent
+      description: Queries the real-time Temporal Worker status of the specified Agent by Task Queue name.
+      operationId: getAgentStatus
+      tags: [Agents]
+      parameters:
+        - name: name
+          in: path
+          required: true
+          description: Agent name (i.e. the Task Queue name)
+          schema:
+            type: string
+          example: linux-amd64
+      responses:
+        '200':
+          description: Agent status details
+          content:
+            application/json:
+              schema:
+                $ref: '#/components/schemas/AgentResponse'
+        '400':
+          $ref: '#/components/responses/BadRequest'
+        '500':
+          $ref: '#/components/responses/InternalError'
+
+  # ==================== Task Queues ====================
+  /v1/task-queues:
+    get:
+      summary: List all Task Queues
+      description: Returns all known Task Queues with their real-time Worker counts and health status (native Temporal data).
+      operationId: listTaskQueues
+      tags: [Task Queues]
+      responses:
+        '200':
+          description: Task Queue list
+          content:
+            application/json:
+              schema:
+                $ref: '#/components/schemas/ListTaskQueuesResponse'
+        '500':
+          $ref: '#/components/responses/InternalError'
+
 components:
   securitySchemes:
     bearerAuth:
@@ -1048,6 +1130,124 @@ components:
         default: 20

   schemas:
+    # ========== Agent Schemas ==========
+    AgentResponse:
+      type: object
+      description: Health status of a single Agent (Task Queue)
+      properties:
+        name:
+          type: string
+          example: linux-amd64
+        pollers:
+          type: integer
+          description: Total number of Workers on the Task Queue
+          example: 3
+        healthy_pollers:
+          type: integer
+          description: Number of healthy Workers active within the last 30 seconds
+          example: 3
+        task_backlog:
+          type: integer
+          format: int64
+          description: Number of pending tasks
+          example: 0
+        status:
+          type: string
+          enum: [healthy, degraded, unavailable]
+          description: |
+            Health status:
+            - `healthy`: ≥50% of Workers active within the last 30 seconds
+            - `degraded`: >0 but <50% of Workers healthy
+            - `unavailable`: no Workers, or the query failed
+          example: healthy
+        last_update_time:
+          type: string
+          format: date-time
+          example: "2026-03-06T10:30:00Z"
+
+    ListAgentsResponse:
+      type: object
+      properties:
+        agents:
+          type: array
+          items:
+            $ref: '#/components/schemas/AgentResponse'
+        total_count:
+          type: integer
+          example: 3
+        timestamp:
+          type: string
+          format: date-time
+          example: "2026-03-06T10:30:01Z"
+
+    TaskQueueResponse:
+      type: object
+      description: Real-time Task Queue status (same structure as AgentResponse, from the queue's perspective)
+      properties:
+        name:
+          type: string
+          example: web-servers
+        pollers:
+          type: integer
+          example: 2
+        healthy_pollers:
+          type: integer
+          example: 1
+        task_backlog:
+          type: integer
+          format: int64
+          example: 5
+        status:
+          type: string
+          enum: [healthy, degraded, unavailable]
+          example: degraded
+        last_update_time:
+          type: string
+          format: date-time
+          example: "2026-03-06T10:29:00Z"
+
+    ListTaskQueuesResponse:
+      type: object
+      properties:
+        task_queues:
+          type: array
+          items:
+            $ref: '#/components/schemas/TaskQueueResponse'
+        total_count:
+          type: integer
+          example: 2
+        timestamp:
+          type: string
+          format: date-time
+          example: "2026-03-06T10:30:01Z"
+
+    AgentsSummaryResponse:
+      type: object
+      description: Aggregated health statistics across all Task Queues
+      properties:
+        total_queues:
+          type: integer
+          example: 5
+        healthy_queues:
+          type: integer
+          example: 3
+        degraded_queues:
+          type: integer
+          example: 1
+        unavailable_queues:
+          type: integer
+          example: 1
+        total_pollers:
+          type: integer
+          example: 12
+        healthy_pollers:
+          type: integer
+          example: 10
+        timestamp:
+          type: string
+          format: date-time
+          example: "2026-03-06T10:30:01Z"
+
     # ========== Workflow Schemas ==========
     SubmitWorkflowRequest:
       type: object
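
To make the relationship between the `AgentResponse` and `AgentsSummaryResponse` schemas concrete, here is a hedged Go sketch of how per-queue results might be folded into the summary. The JSON field names follow the spec in this diff. The struct layout, the `summarize` helper, and the omission of the timestamp fields are assumptions made for illustration and are not taken from the handler code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AgentResponse mirrors the JSON field names of the AgentResponse schema.
// Timestamp fields are omitted to keep the sketch short.
type AgentResponse struct {
	Name           string `json:"name"`
	Pollers        int    `json:"pollers"`
	HealthyPollers int    `json:"healthy_pollers"`
	TaskBacklog    int64  `json:"task_backlog"`
	Status         string `json:"status"`
}

// AgentsSummaryResponse mirrors the JSON field names of the summary schema.
type AgentsSummaryResponse struct {
	TotalQueues       int `json:"total_queues"`
	HealthyQueues     int `json:"healthy_queues"`
	DegradedQueues    int `json:"degraded_queues"`
	UnavailableQueues int `json:"unavailable_queues"`
	TotalPollers      int `json:"total_pollers"`
	HealthyPollers    int `json:"healthy_pollers"`
}

// summarize is a hypothetical helper that aggregates per-queue results.
// Any status other than "healthy" or "degraded" is counted as unavailable.
func summarize(agents []AgentResponse) AgentsSummaryResponse {
	var s AgentsSummaryResponse
	for _, a := range agents {
		s.TotalQueues++
		s.TotalPollers += a.Pollers
		s.HealthyPollers += a.HealthyPollers
		switch a.Status {
		case "healthy":
			s.HealthyQueues++
		case "degraded":
			s.DegradedQueues++
		default:
			s.UnavailableQueues++
		}
	}
	return s
}

func main() {
	agents := []AgentResponse{
		{Name: "linux-amd64", Pollers: 3, HealthyPollers: 3, Status: "healthy"},
		{Name: "batch", Pollers: 4, HealthyPollers: 1, TaskBacklog: 5, Status: "degraded"},
		{Name: "idle", Status: "unavailable"},
	}
	out, err := json.Marshal(summarize(agents))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```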

docs/guides/cmdb-integration.md

Lines changed: 13 additions & 5 deletions
@@ -1,10 +1,18 @@
 # CMDB Integration Guide

-> ⚠️ **Historical document warning** (updated 2025-12-29)
-> The Agent registration/heartbeat mechanism described in this document is deprecated (it belongs to the architecture before ADR-0007).
-> **Current architecture:** Agents connect automatically through the Temporal Worker; no registration API is required.
-> **CMDB integration:** server group mapping is now done through the `server-groups.yaml` file.
-> **References:** [ADR-0008](../adr/0008-temporal-as-internal-service.md) | [Server Groups guide](./server-groups.md)
+> **This document is deprecated. Do not use it as an implementation reference** (updated 2025-12-29)
+>
+> The `ServerGroupProvider` interface, `InMemoryProvider`, `FileProvider`, and the Agent registration API described in this document **were never implemented**.
+>
+> **Root cause:** Story 2.3 was cancelled on 2025-12-29 per [ADR-0008](../adr/0008-temporal-as-internal-service.md).
+> The `pkg/provider/` directory does not exist, and the `ServerGroupProvider` interface was never created.
+>
+> **Current architecture (actually implemented):**
+> - Agents register automatically through the Temporal Worker; no registration API is required
+> - Agent health monitoring uses the native Temporal `DescribeTaskQueue` API (see `internal/api/agent_handler.go`)
+> - Server groups map directly to Task Queues through runs-on (see ADR-0008)
+>
+> **References:** [ADR-0008](../adr/0008-temporal-as-internal-service.md) | [Story 2.2](../sprint-artifacts/2-2-server-group-task-queue-mapping.md)

 ## Overview


docs/sprint-artifacts/1-10-schedule-api-implementation.md

Lines changed: 19 additions & 2 deletions
@@ -2160,14 +2160,16 @@ open http://localhost:8233
 - ✅ Fixed issue 5: added a Schedule count limit check (100/workflow)
 - ✅ Fixed issues 6-8: added 4 integration tests (covering AC3/4/6/10)
 - ✅ Fixed issue 9: optimized ListAllSchedules performance (avoids unnecessary Describe calls)
+- ✅ Fixed issue 10 (2026-03-06): `Manager.ListByWorkflow/Get/List/Pause/Resume/Update/Trigger/Delete` had no nil guard, so with `temporalClient=nil` they dereferenced the client directly and panicked; added a `checkTemporalClient()` helper that guards every Temporal operation, raising unit tests from 3/5 to **5/5 PASS**

 ### File List

 **Core implementation files:**
-- `internal/server/schedule/manager.go` (580 lines) - Schedule business logic layer
+- `internal/server/schedule/manager.go` (~610 lines) - Schedule business logic layer
   - Create/List/ListByWorkflow/Get/Update/Pause/Resume/Trigger/Delete
   - Cron validation, Schedule count limit, parameter merge logic
   - convertFromDescription/convertFromListEntry helper functions
+  - `checkTemporalClient()` nil guard (fixed 2026-03-06)
 - `internal/api/schedule_handler.go` (589 lines) - Schedule REST API layer
   - 9 HTTP handlers (Create/List/Get/Update/Delete/Pause/Resume/Trigger)
   - Request/Response type definitions, error handling
@@ -2427,7 +2429,7 @@ err := handle.Update(ctx, client.ScheduleUpdateOptions{
 - ✅ Improved error feedback: YAML validation errors show detailed field-level errors

 **Test coverage:**
-- ✅ Unit tests: 5 test functions (manager_simple_test.go) - 3/5 passing
+- ✅ Unit tests: 5 test functions (manager_simple_test.go) - **5/5 passing** (all pass after the 2026-03-06 nil-guard fix)
 - ✅ Integration tests: 11 test scenarios - **6/11 passing (55%)**
   - ✅ INT-001: CreateSchedule
   - ✅ INT-002: PauseResumeSchedule
@@ -2487,6 +2489,21 @@ Temporal Server 1.29.1 + Go SDK v1.38.0 have a backward-compatibility issue:
 - [ ] Re-run the integration tests to verify the fix
 - [ ] Optional: implement a Memo fallback mechanism as a permanent compatibility layer

+**Issue 2: `Manager` method nil pointer panic (fixed 2026-03-06)**
+
+**Symptom:** `TestManager_Create_WorkflowExists` panics with `invalid memory address or nil pointer dereference`
+
+**Root cause:**
+`ListByWorkflow` (and 8 other methods) call `m.temporalClient.ScheduleClient()` on their first line with no nil check.
+The test deliberately passes a `nil` temporalClient to verify the workflow existence check, but the panic interrupted it before it reached any Temporal operation.
+
+**Fix:**
+- Added a `checkTemporalClient() error` helper method
+- Called it uniformly before the Temporal operation in all 9 of `Create/List/ListByWorkflow/Get/Pause/Resume/Update/Trigger/Delete`
+- In `Create`, the guard sits after the workflow existence check, so that "workflow not found" takes precedence over "temporal client not initialized"
+
+**Verification:** `manager_simple_test.go` all **5/5 PASS** (3/5 before the fix)
+
 ---

 ### Unfinished items (Post-MVP)

docs/sprint-artifacts/2-2-server-group-task-queue-mapping.md

Lines changed: 12 additions & 4 deletions
@@ -1227,15 +1227,16 @@ Claude Sonnet 4.5
 - `pkg/dsl/validator_runs_on_test.go` - runs-on validation tests (210 lines)

 **Modified files:**
-- `pkg/dsl/semantic_validator.go` - added ValidateTaskQueueName + validateRunsOn (~90 lines added)
-- `pkg/temporal/workflow.go` - added a defensive runs-on check (~12 lines added)
+- `pkg/dsl/semantic_validator.go` - added ValidateTaskQueueName + validateRunsOn; extracted taskQueueNameRegex into a package-level var; validateRunsOn now delegates to ValidateRunsOn (supports expression syntax) (~105 lines added)
+- `pkg/dsl/task_queue_validator_test.go` - Matrix expression syntax test cases (+3 cases)
+- `pkg/temporal/workflow.go` - added buildChildWorkflowOptions with a default 360 min timeout; defensive runs-on check (~26 lines added)
 - `README.md` - updated the multi-server example (~15 lines modified)
 - `docs/sprint-artifacts/sprint-status.yaml` - status update
 - `docs/sprint-artifacts/2-2-server-group-task-queue-mapping.md` - this file

 **Story 1.9 additions (Task Queue API):**
-- `internal/api/taskqueue_handler.go` - Task Queue query API (150 lines)
-- `internal/api/router.go` - route registration (~2 lines added)
+- `internal/api/workflow_handler.go` - `ListTaskQueues` placeholder implementation (at the end of this file; full implementation pending Story 2.7)
+- `internal/api/router.go` - route registration (~3 lines added, H3 fix)

 **Total:** ~1131 new lines of code (including tests), ~32 modified lines

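@@ -1269,3 +1270,10 @@ Claude Sonnet 4.5
 - ✅ MEDIUM-1 (M1): removed the duplicated logic between `ValidateRunsOn` and `ValidateTaskQueueName`; the former now delegates to the latter (validator_runs_on.go); the `ValidateTaskQueueName` length check now runs before the regex (semantic_validator.go)
 - ✅ MEDIUM-2/LOW-1 (M2/L1): removed the misleading comment "currently only a single Job is supported" and the for-loop that relied on random map iteration order (workflow_handler.go)
 - ✅ MEDIUM-3 (M3): documented the naming behavior for consecutive double hyphens, with a warning (docs/guides/server-groups.md)
+
+**Code review fixes (2026-03-06):**
+- ✅ HIGH (H1): fixed `SemanticValidator.validateRunsOn` calling `ValidateTaskQueueName` directly instead of `ValidateRunsOn`, which caused Matrix expression syntax (`${{ matrix.server }}`) to be wrongly rejected during API validation; it now delegates to `ValidateRunsOn` to support AC5.1 (semantic_validator.go)
+- ✅ MEDIUM (M1): corrected a false entry in the file list: `taskqueue_handler.go` was never created, and `ListTaskQueues` actually lives at the end of `workflow_handler.go`
+- ✅ MEDIUM (M2): extracted the `regexp.MustCompile` call inside `ValidateTaskQueueName` into a package-level variable, eliminating repeated compilation on a frequently called path (semantic_validator.go)
+- ✅ LOW (L1): added 3 Matrix expression syntax test cases to `TestSemanticValidator_ValidateRunsOn` (task_queue_validator_test.go)
+- ✅ LOW (L2): child workflows now set `WorkflowExecutionTimeout`, using a default of 360 minutes when `TimeoutMinutes=0`, to prevent waiting indefinitely (workflow.go)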
