Commit 817401d

khaliqgant, Proactive Runtime Bot, github-actions[bot], and claude authored
Broker reliability: half-start recovery + surfaced read failures + doctor diagnostics (#916)
* Fix relay reliability delivery events
* Fix delivery retry review gaps
* Cook agent log output by default
* Harden agents logs byte handling
* style: auto-format Rust code with cargo fmt
* Fix delivery terminal reliability review findings
* Recover half-started broker launches
* fix: surface broker read failures
* test(doctor): add broker stale/auth/half-start repro regression coverage

  Cherry-picks the doctor diagnostic regression from test/doctor-stale-broker-repro (a0ee79bc) without the trajectory churn: docs/doctor-orchestration-repros.md plus doctor.ts/doctor.test.ts assertions that `agent-relay doctor` flags stale connection, unresolved API-key template, and half-started/orphaned broker states.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(changelog): note doctor now fails on half-started/stale/unresolved-key brokers

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Proactive Runtime Bot <agent@agent-relay.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 9c1849b commit 817401d

19 files changed

Lines changed: 1067 additions & 113 deletions

.trajectories/compacted/release-6.2.3.json

Lines changed: 4 additions & 14 deletions
@@ -3,21 +3,15 @@
   "version": 1,
   "type": "compacted",
   "compactedAt": "2026-05-19T14:41:29.877Z",
-  "sourceTrajectories": [
-    "traj_47akjihewlow",
-    "traj_f9wxa8ujeg78",
-    "traj_sqerp89tc436"
-  ],
+  "sourceTrajectories": ["traj_47akjihewlow", "traj_f9wxa8ujeg78", "traj_sqerp89tc436"],
   "dateRange": {
     "start": "2026-05-19T00:45:33.159Z",
     "end": "2026-05-19T01:38:29.105Z"
   },
   "summary": {
     "totalDecisions": 4,
     "totalEvents": 4,
-    "uniqueAgents": [
-      "default"
-    ]
+    "uniqueAgents": ["default"]
   },
   "decisionGroups": [
     {
@@ -83,9 +77,5 @@
     "crates/broker/src/worker.rs",
     "crates/broker/src/wrap.rs"
   ],
-  "commits": [
-    "7182810c",
-    "2ecfb018",
-    "040e6d9f"
-  ]
-}
+  "commits": ["7182810c", "2ecfb018", "040e6d9f"]
+}
Lines changed: 6 additions & 1 deletion
@@ -1,6 +1,7 @@
 # Trajectory Compaction: May 19, 2026 - May 19, 2026
 
 ## Summary
+
 - Sessions: 3
 - Decisions: 4
 - Events: 4
@@ -9,15 +10,19 @@
 - Commits: 3
 
 ## Architecture
+
 - Split broker runtime by responsibility -> Split broker runtime by responsibility (traj_47akjihewlow)
 - Split broker binary entrypoint mechanically -> Split broker binary entrypoint mechanically (traj_f9wxa8ujeg78)
 
 ## Other
+
 - Apply issue 875 to crates/broker/src/main.rs -> Apply issue 875 to crates/broker/src/main.rs (traj_sqerp89tc436)
 - Removed repository-root bin fallback -> Removed repository-root bin fallback (traj_sqerp89tc436)
 
 ## Key Learnings
+
 - None
 
 ## Key Findings
-- None
+
+- None
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+{
+  "id": "traj_2gpglosdsq7s",
+  "version": 1,
+  "task": {
+    "title": "Fix broker session read paths and agent listing errors"
+  },
+  "status": "completed",
+  "startedAt": "2026-05-19T12:37:18.367Z",
+  "completedAt": "2026-05-19T12:48:50.116Z",
+  "agents": [
+    {
+      "name": "default",
+      "role": "lead",
+      "joinedAt": "2026-05-19T12:44:57.363Z"
+    }
+  ],
+  "chapters": [
+    {
+      "id": "chap_fxwoou59eukx",
+      "title": "Work",
+      "agentName": "default",
+      "startedAt": "2026-05-19T12:44:57.363Z",
+      "endedAt": "2026-05-19T12:48:50.116Z",
+      "events": [
+        {
+          "ts": 1779194697364,
+          "type": "decision",
+          "content": "Resolved CLI read surfaces through the project broker connection file: Resolved CLI read surfaces through the project broker connection file",
+          "raw": {
+            "question": "Resolved CLI read surfaces through the project broker connection file",
+            "chosen": "Resolved CLI read surfaces through the project broker connection file",
+            "alternatives": [],
+            "reasoning": "status already reads the project .agent-relay/connection.json; passing that path explicitly prevents AGENT_RELAY_STATE_DIR from redirecting who/agents/history/replies to a stale broker"
+          },
+          "significance": "high"
+        }
+      ]
+    }
+  ],
+  "retrospective": {
+    "summary": "Fixed broker read surfaces to use the project broker connection and report agent-list query failures instead of empty lists",
+    "approach": "Standard approach",
+    "confidence": 0.9
+  },
+  "commits": [],
+  "filesChanged": [],
+  "projectId": "/Users/khaliqgant/Projects/AgentWorkforce/relay-worktrees/broker-session-query-and-listagents",
+  "tags": [],
+  "_trace": {
+    "startRef": "f5dd259e2a7009bcdbc1e9aa30c750ac74e5aeca",
+    "endRef": "f5dd259e2a7009bcdbc1e9aa30c750ac74e5aeca"
+  }
+}
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+# Trajectory: Fix broker session read paths and agent listing errors
+
+> **Status:** ✅ Completed
+> **Confidence:** 90%
+> **Started:** May 19, 2026 at 02:37 PM
+> **Completed:** May 19, 2026 at 02:48 PM
+
+---
+
+## Summary
+
+Fixed broker read surfaces to use the project broker connection and report agent-list query failures instead of empty lists
+
+**Approach:** Standard approach
+
+---
+
+## Key Decisions
+
+### Resolved CLI read surfaces through the project broker connection file
+
+- **Chose:** Resolved CLI read surfaces through the project broker connection file
+- **Reasoning:** status already reads the project .agent-relay/connection.json; passing that path explicitly prevents AGENT_RELAY_STATE_DIR from redirecting who/agents/history/replies to a stale broker
+
+---
+
+## Chapters
+
+### 1. Work
+
+_Agent: default_
+
+- Resolved CLI read surfaces through the project broker connection file: Resolved CLI read surfaces through the project broker connection file
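The reasoning recorded in this trajectory can be illustrated with a small TypeScript sketch. It is not the code from this commit: `defaultConnectionPath`, `projectConnectionPath`, `listAgents`, and the exact way `AGENT_RELAY_STATE_DIR` is consulted are assumptions made for illustration. Only the `.agent-relay/connection.json` location and the environment variable name come from the trajectory text.

```typescript
// Hypothetical sketch: these names are illustrative, not the agent-relay source.
type Env = Record<string, string | undefined>;

function stripTrailingSlash(p: string): string {
  return p.length > 1 ? p.replace(/\/+$/, "") : p;
}

// Assumed default lookup: an environment override, when set, wins.
// This is how a read command could end up talking to a stale broker.
function defaultConnectionPath(projectRoot: string, env: Env): string {
  const stateDir = env.AGENT_RELAY_STATE_DIR ?? `${stripTrailingSlash(projectRoot)}/.agent-relay`;
  return `${stripTrailingSlash(stateDir)}/connection.json`;
}

// Explicit lookup: always the project's own connection file, ignoring the override.
function projectConnectionPath(projectRoot: string): string {
  return `${stripTrailingSlash(projectRoot)}/.agent-relay/connection.json`;
}

// A query result that keeps "no agents" distinct from "the query failed".
type AgentListResult =
  | { kind: "ok"; agents: string[] }
  | { kind: "error"; message: string };

// Wraps a broker query so a failure is reported instead of collapsing to [].
function listAgents(query: () => string[]): AgentListResult {
  try {
    return { kind: "ok", agents: query() };
  } catch (err) {
    return { kind: "error", message: err instanceof Error ? err.message : String(err) };
  }
}
```

Under these assumptions, a read command that passes `projectConnectionPath(root)` explicitly resolves the same file whether or not `AGENT_RELAY_STATE_DIR` is set, and a caller can exit non-zero on `kind: "error"` rather than printing an empty list.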
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+{
+  "id": "traj_gnqvtoxtc8dy",
+  "version": 1,
+  "task": {
+    "title": "Fix broker half-start recovery"
+  },
+  "status": "completed",
+  "startedAt": "2026-05-19T12:34:36.057Z",
+  "completedAt": "2026-05-19T12:47:18.115Z",
+  "agents": [
+    {
+      "name": "default",
+      "role": "lead",
+      "joinedAt": "2026-05-19T12:41:16.026Z"
+    }
+  ],
+  "chapters": [
+    {
+      "id": "chap_b3d0wym6dp08",
+      "title": "Work",
+      "agentName": "default",
+      "startedAt": "2026-05-19T12:41:16.026Z",
+      "endedAt": "2026-05-19T12:47:18.115Z",
+      "events": [
+        {
+          "ts": 1779194476027,
+          "type": "decision",
+          "content": "Recover half-started detached brokers by killing foreground CLI wrappers and unready broker PIDs before retrying: Recover half-started detached brokers by killing foreground CLI wrappers and unready broker PIDs before retrying",
+          "raw": {
+            "question": "Recover half-started detached brokers by killing foreground CLI wrappers and unready broker PIDs before retrying",
+            "chosen": "Recover half-started detached brokers by killing foreground CLI wrappers and unready broker PIDs before retrying",
+            "alternatives": [],
+            "reasoning": "The failure mode leaves a live agent-relay up --foreground process without usable connection metadata; scanning only agent-relay-broker misses that wrapper, so up/down --force must reap both wrapper and broker PID candidates."
+          },
+          "significance": "high"
+        }
+      ]
+    }
+  ],
+  "retrospective": {
+    "summary": "Added deterministic recovery for detached broker half-starts by reaping unready broker PIDs and metadata-less foreground wrappers before restart, and by cleaning failed detached children on readiness timeout.",
+    "approach": "Standard approach",
+    "confidence": 0.9
+  },
+  "commits": [],
+  "filesChanged": [],
+  "projectId": "/Users/khaliqgant/Projects/AgentWorkforce/relay-worktrees/broker-half-start-recovery",
+  "tags": [],
+  "_trace": {
+    "startRef": "f5dd259e2a7009bcdbc1e9aa30c750ac74e5aeca",
+    "endRef": "f5dd259e2a7009bcdbc1e9aa30c750ac74e5aeca"
+  }
+}
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+# Trajectory: Fix broker half-start recovery
+
+> **Status:** ✅ Completed
+> **Confidence:** 90%
+> **Started:** May 19, 2026 at 02:34 PM
+> **Completed:** May 19, 2026 at 02:47 PM
+
+---
+
+## Summary
+
+Added deterministic recovery for detached broker half-starts by reaping unready broker PIDs and metadata-less foreground wrappers before restart, and by cleaning failed detached children on readiness timeout.
+
+**Approach:** Standard approach
+
+---
+
+## Key Decisions
+
+### Recover half-started detached brokers by killing foreground CLI wrappers and unready broker PIDs before retrying
+
+- **Chose:** Recover half-started detached brokers by killing foreground CLI wrappers and unready broker PIDs before retrying
+- **Reasoning:** The failure mode leaves a live agent-relay up --foreground process without usable connection metadata; scanning only agent-relay-broker misses that wrapper, so up/down --force must reap both wrapper and broker PID candidates.
+
+---
+
+## Chapters
+
+### 1. Work
+
+_Agent: default_
+
+- Recover half-started detached brokers by killing foreground CLI wrappers and unready broker PIDs before retrying: Recover half-started detached brokers by killing foreground CLI wrappers and unready broker PIDs before retrying
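The candidate selection described in this trajectory might look roughly like the TypeScript sketch below. It is an illustration under stated assumptions, not the commit's implementation: `ProcessInfo`, `isBrokerRelated`, and `reapCandidates` are hypothetical names, and matching processes by command-line substring is a simplification. The two markers (`agent-relay-broker` and `agent-relay up --foreground`) are the only details taken from the trajectory.

```typescript
// Hypothetical sketch: not the agent-relay implementation.
interface ProcessInfo {
  pid: number;
  command: string; // full command line, as a process listing would report it
}

// Markers named in the trajectory; substring matching is an assumption.
const BROKER_MARKER = "agent-relay-broker";
const WRAPPER_MARKER = "agent-relay up --foreground";

// True for both the broker binary and the foreground CLI wrapper.
function isBrokerRelated(p: ProcessInfo): boolean {
  return p.command.indexOf(BROKER_MARKER) !== -1 || p.command.indexOf(WRAPPER_MARKER) !== -1;
}

// PIDs to reap before retrying: every broker or wrapper process that is not
// backed by readable connection metadata (readyPids).
function reapCandidates(processes: ProcessInfo[], readyPids: number[]): number[] {
  return processes
    .filter(isBrokerRelated)
    .filter((p) => readyPids.indexOf(p.pid) === -1)
    .map((p) => p.pid);
}

// Half-start: wrapper and broker are alive, but no metadata names a ready PID.
const halfStarted: ProcessInfo[] = [
  { pid: 100, command: "node agent-relay up --foreground" },
  { pid: 101, command: "/usr/local/bin/agent-relay-broker" },
  { pid: 102, command: "vim notes.md" },
];
const candidates = reapCandidates(halfStarted, []); // wrapper 100 and broker 101
```

Scanning for only the broker marker would leave PID 100 alive in this example, which is the gap the trajectory's reasoning calls out.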

CHANGELOG.md

Lines changed: 23 additions & 0 deletions
@@ -68,6 +68,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - PTY context budget detection uses the latest percentage in output and can re-emit after the budget rises.
 - `agent-relay agents:logs` now cooks PTY redraws into line-oriented output by default and keeps raw terminal bytes behind `--raw`.
 - `agent-relay agents:logs --raw` preserves non-UTF-8 bytes, and follow mode keeps split escape/codepoint sequences intact.
+- `agent-relay up --no-dashboard` and `agent-relay down --force` now recover half-started brokers that stayed alive without readable connection metadata.
+- `agent-relay who` and `agent-relay agents` now fail clearly when broker queries fail instead of printing an empty agent list.
+- `agent-relay history` and `agent-relay replies` now resolve the project broker session even when `AGENT_RELAY_STATE_DIR` points elsewhere.
+- `agent-relay doctor` now fails with an actionable diagnostic for half-started, stale-connection, and unresolved-API-key-template brokers instead of reporting "healthy".
 - CLI readiness checks use the live VT grid and cursor position to avoid false ready states in alternate screens and menus.
 - `agent-relay history --from <agent>` returns the newest messages after chronological sorting.
 - `agent-relay replies --unread` prints nothing when there are no unread messages.
@@ -84,57 +88,74 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [6.2.6] - 2026-05-20
 
 ### Product Perspective
+
 #### User-Impacting Fixes
+
 - Flush UTF-8 decoder on normal pty_closed path
 - Preserve split multi-byte UTF-8 in worker_stream (#922) (#922)
 
 ### Technical Perspective
+
 #### Releases
+
 - v6.2.6
 
 ---
 
 ## [6.2.5] - 2026-05-19
 
 ### Product Perspective
+
 #### User-Impacting Fixes
+
 - Handle write_pty frames in PTY worker (#920)
 
 ### Technical Perspective
+
 #### Dependencies & Tooling
+
 - Sync package-lock.json for next 15.5.18 bump
 - Bump next from 15.5.14 to 15.5.18 in /web
 
 #### Releases
+
 - v6.2.5
 
 ---
 
 ## [6.2.4] - 2026-05-19
 
 ### Technical Perspective
+
 #### Architecture & API Changes
+
 - Use relaycast sdk 1.1 helpers
 
 #### Releases
+
 - v6.2.4
 
 ---
 
 ## [6.2.3] - 2026-05-19
 
 ### Product Perspective
+
 #### User-Facing Features & Improvements
+
 - **Align reported version with product release line** (#904)
 
 #### User-Impacting Fixes
+
 - Address coderabbit review on version handling
 - Use next/link for docs navigation
 - Pass idle threshold to spawned workers
 - Address runtime review findings
 
 ### Technical Perspective
+
 #### Architecture & API Changes
+
 - Narrow public crate API
 - Group relaycast broker integration
 - Extract broker runtime event handlers
@@ -143,11 +164,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Move broker crate under crates
 
 #### Dependencies & Tooling
+
 - Record runtime split trajectory
 - Complete issue 875 trajectory file list
 - Update issue 875 trajectory metadata
 
 #### Releases
+
 - v6.2.3
 
 ---
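The `agent-relay doctor` changelog entry names three failure states. A rough TypeScript sketch of how such a classification could be expressed follows. Everything in it is an assumption for illustration: the `BrokerSnapshot` shape, the `apiKey` field, the `${NAME}` template syntax, and the extra `not-running` state do not come from this commit, and the real checks in doctor.ts may differ.

```typescript
// Hypothetical sketch: field names and the template syntax are assumptions.
type Diagnosis =
  | "healthy"
  | "not-running"
  | "half-started"
  | "stale-connection"
  | "unresolved-api-key-template";

interface ConnectionMetadata {
  pid: number;
  apiKey: string;
}

interface BrokerSnapshot {
  connection: ConnectionMetadata | null; // null when metadata is missing or unreadable
  livePids: number[]; // broker-related processes currently alive
}

// Assumes an unresolved template looks like "${NAME}".
const TEMPLATE_PATTERN = /\$\{[^}]+\}/;

function diagnose(s: BrokerSnapshot): Diagnosis {
  if (s.connection === null) {
    // Processes alive without readable metadata: the half-started case.
    return s.livePids.length > 0 ? "half-started" : "not-running";
  }
  // Metadata points at a PID that is no longer alive.
  if (s.livePids.indexOf(s.connection.pid) === -1) return "stale-connection";
  // The key was never substituted, so authentication cannot succeed.
  if (TEMPLATE_PATTERN.test(s.connection.apiKey)) return "unresolved-api-key-template";
  return "healthy";
}
```

The order of checks matters in this sketch: a missing connection file is classified before liveness, so a half-started broker is never reported as merely stale.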
