fix(broker): route all delivery through Relaycast + honor spawn startup timeout#1221
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (7)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughBroker Relaycast send paths now require an explicit sender identity, validated against workspace-scoped worker registration, plus reply-target filtering for threaded messages and a new delivery_queued fleet event. HarnessDriverClient.spawn() replaces a fixed 10-attempt handshake retry with a deadline-based retry tolerating only 503 errors. ChangesBroker Relaycast send routing and delivery events
Harness driver handshake retry
Estimated code review effort: 3 (Moderate) | ~30 minutes Possibly related issues
Possibly related PRs
Suggested reviewers: Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors the broker's message delivery system to route all messages through Relaycast, removing the direct local-injection bypass to ensure consistent state tracking. It also updates HarnessDriverClient.spawn() to poll the broker's startup handshake for the full timeout budget instead of a fixed 10 attempts, preventing premature spawn failures. The review feedback correctly notes that the added changelog entry contains internal CI details, violating the repository style guide's requirement for concise, impact-first entries.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/broker/src/runtime/api.rs (1)
719-732: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winRecord thread history only after Relaycast publish succeeds.
This inserts the message into
recent_thread_messagesbefore the Relaycast call. If the publish fails or times out,/api/threadscan show a message that was never accepted by Relaycast. Move this record into theOk(Ok(()))branch, or mark failed attempts separately.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/broker/src/runtime/api.rs` around lines 719 - 732, The thread history entry is being recorded before Relaycast confirms delivery, which can leave `/api/threads` showing messages that were never accepted. Move the `record_thread_history_event` call in `crates/broker/src/runtime/api.rs` so it only runs inside the `Ok(Ok(()))` success branch after the Relaycast publish completes, or otherwise record failed publishes separately. Use the existing publish result handling around the Relaycast call and the `record_thread_history_event` helper to keep the history consistent with actual delivery.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/broker/src/relaycast/ws.rs`:
- Around line 525-529: The threaded reply path in relaycast_ws::send_with_mode
currently allows Steer to fall through to AgentClient::reply, which ignores
MessageInjectionMode and silently behaves like a plain reply. Update the logic
around the thread_id branch so that when the requested mode is Steer and a
thread_id is present, it returns an explicit unsupported-operation error instead
of calling AgentClient::reply; keep plain threaded replies working for supported
modes and preserve the existing error wrapping pattern.
In `@crates/broker/src/runtime/api.rs`:
- Around line 713-717: The publish sender selection in the `publish_from` block
only checks `workers.has_worker(&delivery_from)`, which is not enough to ensure
the worker belongs to `selected_workspace`. Update the logic in `api.rs` so
impersonation is allowed only when the worker exists and is associated with the
currently selected workspace; otherwise fall back to `workspace_self_name`. Use
the existing `delivery_from`, `selected_workspace`, `workers.has_worker`, and
`workspace_self_name` symbols to locate and adjust the guard before calling
`selected_workspace.http_client.send_with_mode`.
In `@packages/harness-driver/src/client.ts`:
- Around line 377-395: The handshake retry loop in client.ts is checking
err.message for 503 text, but BrokerTransport.request() already exposes
structured retry data. Update the retry condition in the session handshake logic
to use the thrown error’s code or status (for example HarnessDriverProtocolError
with http_503 / 503) instead of parsing the formatted message, while keeping the
existing deadline and brokerExited behavior in the client.getSession retry flow.
---
Outside diff comments:
In `@crates/broker/src/runtime/api.rs`:
- Around line 719-732: The thread history entry is being recorded before
Relaycast confirms delivery, which can leave `/api/threads` showing messages
that were never accepted. Move the `record_thread_history_event` call in
`crates/broker/src/runtime/api.rs` so it only runs inside the `Ok(Ok(()))`
success branch after the Relaycast publish completes, or otherwise record failed
publishes separately. Use the existing publish result handling around the
Relaycast call and the `record_thread_history_event` helper to keep the history
consistent with actual delivery.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 89de1fa1-1af1-46aa-a909-07039ae1d9d1
📒 Files selected for processing (9)
CHANGELOG.mdcrates/broker/src/lib.rscrates/broker/src/relaycast/ws.rscrates/broker/src/runtime/api.rscrates/broker/src/runtime/fleet.rscrates/broker/src/runtime/mod.rscrates/broker/src/runtime/util.rscrates/broker/src/worker.rspackages/harness-driver/src/client.ts
💤 Files with no reviewable changes (1)
- crates/broker/src/worker.rs
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5b4a6bfd3c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| InboundQueueOutcome::Queued => { | ||
| tracing::info!( | ||
| target = "relay_broker::fleet", | ||
| agent = %deliver.agent, | ||
| delivery_id = %deliver.delivery_id, | ||
| msg_id = %deliver.msg_id, | ||
| "queued node delivery (manual_flush inbound delivery mode)" | ||
| ); | ||
| Ok(()) |
There was a problem hiding this comment.
Emit queued events for manual-flush deliveries
When a worker is in manual_flush, this new node-delivery branch only logs the queued message and then ACKs it. The removed local /api/send path emitted a delivery_queued broker event in this case, and attach-drive relies on that event to increment its pending counter after the initial /pending poll. Since /api/send now always reaches this branch via Relaycast, messages queued while driving an agent become invisible in the drive status line until a later flush/poll even though they are held in the broker.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in ee2c139. The InboundQueueOutcome::Queued branch in the fleet path now re-emits the delivery_queued event (with the name field attach --drive scopes on, plus event_id/delivery_id/from/target), matching the removed local send path. Held messages are visible in the drive status line again.
| if let Some(thread_id) = thread_id { | ||
| agent_client | ||
| .reply(thread_id, text, None, None) |
There was a problem hiding this comment.
Validate thread IDs before using Relaycast reply
For channel targets, any supplied thread_id is now treated as a Relaycast message id. Broker clients commonly pass the event_id returned by /api/send, which is still a synthetic http_* id, and /api/threads can also surface synthetic grouping keys such as #general/direct:*; Relaycast cannot reply to those ids, so these channel follow-ups now fail instead of posting with the provided thread context. Gate this path to real Relaycast message ids or return/use the Relaycast id from the initial publish.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in bf68b92. Added is_relaycast_reply_target() (reuses the existing synthetic-event-id classifier and also rejects #channel / direct:* grouping keys); the send handler only forwards thread_id to reply() when it's a real message id and otherwise falls back to a plain post, so a synthetic http_*/grouping id no longer fails the send.
There was a problem hiding this comment.
4 issues found and verified against the latest diff
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/broker/src/relaycast/ws.rs">
<violation number="1" location="crates/broker/src/relaycast/ws.rs:527">
P1: Thread replies now require a Relaycast message id, but the broker send API exposes broker `event_id`s to callers. Follow-up sends using the returned id will fail at `reply` instead of publishing the message.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
| .map_err(|e| anyhow::anyhow!("relaycast send_to_channel failed: {e}"))?; | ||
| if let Some(thread_id) = thread_id { | ||
| agent_client | ||
| .reply(thread_id, text, None, None) |
There was a problem hiding this comment.
P1: Thread replies now require a Relaycast message id, but the broker send API exposes broker event_ids to callers. Follow-up sends using the returned id will fail at reply instead of publishing the message.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/broker/src/relaycast/ws.rs, line 527:
<comment>Thread replies now require a Relaycast message id, but the broker send API exposes broker `event_id`s to callers. Follow-up sends using the returned id will fail at `reply` instead of publishing the message.</comment>
<file context>
@@ -464,31 +486,57 @@ impl RelaycastHttpClient {
- .map_err(|e| anyhow::anyhow!("relaycast send_to_channel failed: {e}"))?;
+ if let Some(thread_id) = thread_id {
+ agent_client
+ .reply(thread_id, text, None, None)
+ .await
+ .map_err(|e| anyhow::anyhow!("relaycast thread reply failed: {e}"))?;
</file context>
There was a problem hiding this comment.
Fixed in bf68b92 — see reply above. Synthetic http_* event ids (and #channel/direct:* keys) are no longer passed to reply(); they fall back to a plain post so the follow-up send still publishes.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
crates/broker/src/worker.rs (1)
1590-1596: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd test coverage for the matching/mismatching workspace branches.
The only new test covers the "unknown worker" case. The security-relevant branches — worker present with a matching
workspace_id, worker present with a mismatchedworkspace_id, and the unscoped (None) worker fallback — are untested, even though this function is the sole gate against cross-workspace sender impersonation.♻️ Suggested additional tests
#[test] fn has_worker_in_workspace_returns_false_for_unknown() { let reg = make_registry(vec![]); let workspace = crate::ids::WorkspaceId::new("ws_1".to_string()); assert!(!reg.has_worker_in_workspace("nonexistent", &workspace)); } + + // TODO: add cases exercising a spawned worker with a matching + // `workspace_id`, a mismatched `workspace_id`, and `workspace_id: None` + // once a test helper exists to insert a `WorkerHandle` directly.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/broker/src/worker.rs` around lines 1590 - 1596, Add test coverage in has_worker_in_workspace for the security-critical branches that are currently missing: verify a known worker with a matching WorkspaceId returns true, a known worker with a mismatched WorkspaceId returns false, and a worker with no workspace scope (None) still falls back correctly. Extend the existing worker registry test setup helpers such as make_registry and the has_worker_in_workspace test group in worker.rs so these cases are explicitly asserted.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/broker/src/worker.rs`:
- Around line 212-230: Keep the workspace check in has_worker_in_workspace, but
add tests covering every branch of the worker lookup logic. Add cases for a
worker with the same workspace_id, a worker with a different workspace_id, and a
worker whose workspace_id is None to verify the legacy/SDK fallback behavior
matches the routing rule. Use the has_worker_in_workspace method and the workers
map/handle.workspace_id paths as the targets for the new coverage.
---
Nitpick comments:
In `@crates/broker/src/worker.rs`:
- Around line 1590-1596: Add test coverage in has_worker_in_workspace for the
security-critical branches that are currently missing: verify a known worker
with a matching WorkspaceId returns true, a known worker with a mismatched
WorkspaceId returns false, and a worker with no workspace scope (None) still
falls back correctly. Extend the existing worker registry test setup helpers
such as make_registry and the has_worker_in_workspace test group in worker.rs so
these cases are explicitly asserted.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: a3550598-2908-442a-8cfa-4196e929c9ab
📒 Files selected for processing (6)
CHANGELOG.mdcrates/broker/src/relaycast/ws.rscrates/broker/src/runtime/api.rscrates/broker/src/runtime/fleet.rscrates/broker/src/worker.rspackages/harness-driver/src/client.ts
✅ Files skipped from review due to trivial changes (1)
- CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (4)
- packages/harness-driver/src/client.ts
- crates/broker/src/runtime/api.rs
- crates/broker/src/relaycast/ws.rs
- crates/broker/src/runtime/fleet.rs
| /// True when a worker named `name` exists and either has no recorded | ||
| /// workspace or belongs to `workspace_id`. Gates sender impersonation on | ||
| /// Relaycast publish: a worker attached to workspace A must not be | ||
| /// impersonated when publishing into workspace B, which would register or | ||
| /// rotate that name's token in the wrong workspace. | ||
| pub(crate) fn has_worker_in_workspace( | ||
| &self, | ||
| name: &str, | ||
| workspace_id: &crate::ids::WorkspaceId, | ||
| ) -> bool { | ||
| match self.workers.get(name) { | ||
| Some(handle) => match &handle.workspace_id { | ||
| Some(worker_ws) => worker_ws == workspace_id, | ||
| None => true, | ||
| }, | ||
| None => false, | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
🔒 Security & Privacy | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: check where WorkerHandle.workspace_id is set to None vs Some,
# to confirm None only occurs in single-workspace configurations.
rg -n -B3 -A3 'workspace_id\s*:\s*None' crates/broker/src/worker.rs crates/broker/src/runtime/*.rs
rg -n -B3 -A10 'struct WorkerHandle' crates/broker/src/worker.rsRepository: AgentWorkforce/relay
Length of output: 1240
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Map the worker and API call sites first.
ast-grep outline crates/broker/src/worker.rs --view expanded
ast-grep outline crates/broker/src/runtime/api.rs --view expanded
ast-grep outline crates/broker/src/runtime/app_server.rs --view expanded
# Read the relevant sections around workspace_id handling and the impersonation gate.
sed -n '1,140p' crates/broker/src/worker.rs
sed -n '1,240p' crates/broker/src/runtime/api.rs
sed -n '340,430p' crates/broker/src/runtime/app_server.rs
# Find all assignments/usages of WorkerHandle.workspace_id and related workspace scoping.
rg -n -B3 -A3 'workspace_id\s*:' crates/broker/src
rg -n -B3 -A6 'has_worker_in_workspace|from:|workspace_id' crates/broker/src/runtime/api.rs crates/broker/src/worker.rs
# Check tests around the new helper and any multi-workspace behavior.
rg -n -B2 -A8 'has_worker_in_workspace|workspace_id: None|workspace_id: Some|RelayDelivery' crates/broker/srcRepository: AgentWorkforce/relay
Length of output: 50377
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the impersonation gate and worker creation paths with tight ranges.
sed -n '210,260p' crates/broker/src/worker.rs
sed -n '760,900p' crates/broker/src/worker.rs
sed -n '1,220p' crates/broker/src/runtime/api.rs
# Find all places a WorkerHandle is constructed and how workspace_id is populated.
rg -n -B4 -A12 'WorkerHandle \{' crates/broker/src
rg -n -B4 -A8 'workspace_id:\s*(Some\(|None|workspace_id)' crates/broker/src/worker.rs crates/broker/src/runtime crates/broker/src
# Read the routing rule that already treats None as match-all.
sed -n '1,80p' crates/broker/src/routing.rs
# Narrow tests around the new helper and any workspace-related behavior.
sed -n '1570,1610p' crates/broker/src/worker.rsRepository: AgentWorkforce/relay
Length of output: 50376
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show the exact impersonation gate in runtime/api.rs.
rg -n -B8 -A18 'has_worker_in_workspace|impersonat|sender|from' crates/broker/src/runtime/api.rs
# Inspect the worker workspace propagation path from spawn call sites.
rg -n -B6 -A10 'spawn\(' crates/broker/src/runtime/api.rs crates/broker/src/runtime/tests.rs crates/broker/src/listen_api.rs
# Check the helper tests and add any nearby assertions around None/workspace match behavior.
sed -n '1588,1608p' crates/broker/src/worker.rs
sed -n '1,80p' crates/broker/src/routing.rsRepository: AgentWorkforce/relay
Length of output: 50376
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find every production construction of WorkerHandle and inspect its workspace_id value.
rg -n -B6 -A14 'WorkerHandle \{' crates/broker/src --glob '!**/*test*' --glob '!**/tests.rs'
# Find every direct use of `workspace_id: None` in worker creation contexts.
rg -n -B4 -A6 'workspace_id:\s*None' crates/broker/src --glob '!**/*test*' --glob '!**/tests.rs'
# Show the full spawn signature and the workspace_id value path again, plus any other spawn call sites.
rg -n -B4 -A12 'pub\(crate\) async fn spawn\(' crates/broker/src/worker.rs
rg -n -n 'spawn\(' crates/broker/src/runtime/api.rs crates/broker/src/listen_api.rs crates/broker/src/pty_worker.rs crates/broker/src/node_control.rs crates/broker/src/fleet.rsRepository: AgentWorkforce/relay
Length of output: 14237
Keep the workspace check, but add coverage for all branches. The None fallback matches the existing legacy/SDK worker rule in crates/broker/src/routing.rs, so it looks intentional; add tests for the same-workspace, mismatched-workspace, and unscoped cases so this impersonation gate doesn’t regress.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/broker/src/worker.rs` around lines 212 - 230, Keep the workspace check
in has_worker_in_workspace, but add tests covering every branch of the worker
lookup logic. Add cases for a worker with the same workspace_id, a worker with a
different workspace_id, and a worker whose workspace_id is None to verify the
legacy/SDK fallback behavior matches the routing rule. Use the
has_worker_in_workspace method and the workers map/handle.workspace_id paths as
the targets for the new coverage.
|
Addressed all bot review findings across two commits (ee2c139, bf68b92):
New tests: |
There was a problem hiding this comment.
2 issues found across 7 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/broker/src/runtime/api.rs">
<violation number="1" location="crates/broker/src/runtime/api.rs:770">
P2: Pass the trimmed reply id to Relaycast, not the original untrimmed slice. Current validation trims for classification but forwards whitespace to `AgentClient::reply`, which can still fail the send this fallback is meant to protect.</violation>
</file>
<file name="crates/broker/src/worker.rs">
<violation number="1" location="crates/broker/src/worker.rs:1591">
P2: The test for `has_worker_in_workspace` only covers the "unknown worker" branch. Since this function gates sender impersonation (security-sensitive), add tests for the same-workspace (returns true), mismatched-workspace (returns false), and unscoped/None workspace (returns true) branches to prevent regressions on the impersonation gate.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
| let reply_thread_id = thread_id | ||
| .as_deref() | ||
| .filter(|tid| is_relaycast_reply_target(tid)); |
There was a problem hiding this comment.
P2: Pass the trimmed reply id to Relaycast, not the original untrimmed slice. Current validation trims for classification but forwards whitespace to AgentClient::reply, which can still fail the send this fallback is meant to protect.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/broker/src/runtime/api.rs, line 770:
<comment>Pass the trimmed reply id to Relaycast, not the original untrimmed slice. Current validation trims for classification but forwards whitespace to `AgentClient::reply`, which can still fail the send this fallback is meant to protect.</comment>
<file context>
@@ -755,6 +760,24 @@ impl BrokerRuntime {
+ // `/api/threads` aren't reply targets — Relaycast would reject
+ // the reply and fail the whole send. Fall back to a plain post
+ // (unthreaded) for those, preserving delivery.
+ let reply_thread_id = thread_id
+ .as_deref()
+ .filter(|tid| is_relaycast_reply_target(tid));
</file context>
| let reply_thread_id = thread_id | |
| .as_deref() | |
| .filter(|tid| is_relaycast_reply_target(tid)); | |
| let reply_thread_id = thread_id | |
| .as_deref() | |
| .map(str::trim) | |
| .filter(|tid| is_relaycast_reply_target(tid)); |
| } | ||
|
|
||
| #[test] | ||
| fn has_worker_in_workspace_returns_false_for_unknown() { |
There was a problem hiding this comment.
P2: The test for has_worker_in_workspace only covers the "unknown worker" branch. Since this function gates sender impersonation (security-sensitive), add tests for the same-workspace (returns true), mismatched-workspace (returns false), and unscoped/None workspace (returns true) branches to prevent regressions on the impersonation gate.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/broker/src/worker.rs, line 1591:
<comment>The test for `has_worker_in_workspace` only covers the "unknown worker" branch. Since this function gates sender impersonation (security-sensitive), add tests for the same-workspace (returns true), mismatched-workspace (returns false), and unscoped/None workspace (returns true) branches to prevent regressions on the impersonation gate.</comment>
<file context>
@@ -1568,6 +1587,13 @@ mod tests {
}
+ #[test]
+ fn has_worker_in_workspace_returns_false_for_unknown() {
+ let reg = make_registry(vec![]);
+ let workspace = crate::ids::WorkspaceId::new("ws_1".to_string());
</file context>
HarnessDriverClient.spawn() polled the broker's startup handshake for a fixed 10 attempts × 1s (~10s), ignoring the caller's startupTimeoutMs. When the Relaycast handshake took longer than ~10s on a cold or slow network, the loop exhausted its attempts and threw the broker's retryable "Broker is starting" 503 even though the broker was healthy and still warming up. This failed the release smoke tests, which skipped the internal @agent-relay/* publish jobs, which then made "Publish Main Package" time out waiting for deps that were never published. Poll until the full startup budget (default 45s) elapses instead. The brokerExited race still surfaces a dead broker immediately, so this only extends how long we wait on a broker that is alive and warming up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e queued event, steer/thread, structured 503 - api.rs/worker.rs: gate sender impersonation on workspace membership, not just name. `has_worker_in_workspace` requires the custodial worker to belong to the workspace being published into, so a worker attached to workspace A can't be impersonated (registering/rotating its token) in workspace B. - fleet.rs: re-emit the `delivery_queued` broker event when a node delivery is held under manual_flush. The removed local send path emitted it and `attach --drive` counts it to show pending messages; node delivery is the only path now, so without this queued messages were invisible until a later flush. - ws.rs: a `Steer` send with a `thread_id` can't be honored (`AgentClient::reply` takes no injection mode). Downgrade to a normal reply as before, but log a warning and document it instead of dropping steer silently. - client.ts: match the startup-handshake 503 on the structured `HarnessDriverProtocolError` fields (`status`/`code`) with a message-regex fallback, so a custom broker 503 body still retries. - CHANGELOG: trim the spawn-timeout entry to impact-first per the style guide. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Since the send handler now posts via AgentClient::reply when thread_id is present, a caller echoing back a synthetic id — the broker-minted http_* event_id from /api/send, or a #channel / direct:* grouping key from /api/threads — would make Relaycast reject the reply and fail the whole send. Gate the reply path on is_relaycast_reply_target (reusing the existing synthetic-event-id classifier plus the channel/DM grouping-key prefixes); non-message-id thread_ids fall back to a plain post so the message is still delivered, just unthreaded. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bf68b92 to
11ac64b
Compare
Summary
Makes all message delivery Relaycast-mediated (no local PTY bypass), so any client that only observes state through Relaycast sees every message — plus follow-on fixes for sender identity/thread grouping at the relay boundary and a harness-driver startup-handshake timeout bug surfaced by the release smoke test.
Changes
Route all delivery through Relaycast, no local bypass (
253f4315)ListenApiRequest::Send(HTTP API + MCPsend_message) previously injected directly into a locally-attached worker's PTY and only published via Relaycast when no local target existed. Any Relaycast-only observer (hosted observer, teammate's Pear, cross-device sync) silently missed messages whenever sender and recipient shared a broker.handle_fleet_deliver) to reach local workers.handle_fleet_deliveris wired into the same per-workerInboundDeliveryModechoke point (queue_inbound_for_delivery_mode) the old bypass used, somanual_flushkeeps working now that node delivery is the only path.Preserve sender identity + thread grouping on publish (
b1b05882)from/thread_idbeing dropped at the relay boundary went from a rare edge case to affecting every message. Impersonate the originalfromviaregistered_agent_client_as, and post viaAgentClient::replywhenthread_idis present so Relaycast records a real thread reply.Don't impersonate an unvalidated sender (
bc2a2cdb)register_agent_tokensilently registers a new Relaycast agent for an unknown name, or rotates the token of an existing one (invalidating it). Sincefromis caller-supplied, only impersonate names this broker actually spawned; otherwise fall back to the broker's own identity.Poll broker handshake for the full
startupTimeoutMs(5b4a6bfd)HarnessDriverClient.spawn()polled the broker's startup handshake for a fixed ~10s (10 × 1s), ignoring the caller'sstartupTimeoutMs(default 45s). The broker answers a retryable503 "Broker is starting"on its local API untilconnect_relaycompletes; when the Relaycast handshake (fresh-workspace create + agent register against prod) took >10s on a cold CI runner, the client threw that 503 even though the broker was healthy and still warming up. This failed the release smoke test, which skipped the internal@agent-relay/*publish jobs, which then made "Publish Main Package" time out waiting for deps that were never published.Date.now() + timeoutMsdeadline. ThebrokerExitedrace still surfaces a dead broker immediately, so this only extends patience for a broker that's alive and handshaking.Testing
npm run build --workspace=@agent-relay/harness-driver+tsc --noEmitpass.#[allow(dead_code)]convention for their tests.🤖 Generated with Claude Code