Skip to content

fix(broker): route all delivery through Relaycast + honor spawn startup timeout#1221

Merged
willwashburn merged 5 commits into
mainfrom
fix/relaycast-always-mediated-delivery
Jul 1, 2026
Merged

fix(broker): route all delivery through Relaycast + honor spawn startup timeout#1221
willwashburn merged 5 commits into
mainfrom
fix/relaycast-always-mediated-delivery

Conversation

@willwashburn

@willwashburn willwashburn commented Jul 1, 2026

Copy link
Copy Markdown
Member

Summary

Makes all message delivery Relaycast-mediated (no local PTY bypass), so any client that only observes state through Relaycast sees every message — plus follow-on fixes for sender identity/thread grouping at the relay boundary and a harness-driver startup-handshake timeout bug surfaced by the release smoke test.

Changes

Route all delivery through Relaycast, no local bypass (253f4315)

  • ListenApiRequest::Send (HTTP API + MCP send_message) previously injected directly into a locally-attached worker's PTY and only published via Relaycast when no local target existed. Any Relaycast-only observer (hosted observer, teammate's Pear, cross-device sync) silently missed messages whenever sender and recipient shared a broker.
  • Delivery is now unconditionally relaycast-mediated: always publish via Relaycast and rely on node/fleet redelivery (handle_fleet_deliver) to reach local workers.
  • handle_fleet_deliver is wired into the same per-worker InboundDeliveryMode choke point (queue_inbound_for_delivery_mode) the old bypass used, so manual_flush keeps working now that node delivery is the only path.

Preserve sender identity + thread grouping on publish (b1b05882)

  • Now that every send publishes through Relaycast, from/thread_id being dropped at the relay boundary went from a rare edge case to affecting every message. Impersonate the original from via registered_agent_client_as, and post via AgentClient::reply when thread_id is present so Relaycast records a real thread reply.

Don't impersonate an unvalidated sender (bc2a2cdb)

  • register_agent_token silently registers a new Relaycast agent for an unknown name, or rotates the token of an existing one (invalidating it). Since from is caller-supplied, only impersonate names this broker actually spawned; otherwise fall back to the broker's own identity.

Poll broker handshake for the full startupTimeoutMs (5b4a6bfd)

  • HarnessDriverClient.spawn() polled the broker's startup handshake for a fixed ~10s (10 × 1s), ignoring the caller's startupTimeoutMs (default 45s). The broker answers a retryable 503 "Broker is starting" on its local API until connect_relay completes; when the Relaycast handshake (fresh-workspace create + agent register against prod) took >10s on a cold CI runner, the client threw that 503 even though the broker was healthy and still warming up. This failed the release smoke test, which skipped the internal @agent-relay/* publish jobs, which then made "Publish Main Package" time out waiting for deps that were never published.
  • Now polls against a Date.now() + timeoutMs deadline. The brokerExited race still surfaces a dead broker immediately, so this only extends patience for a broker that's alive and handshaking.

Testing

  • npm run build --workspace=@agent-relay/harness-driver + tsc --noEmit pass.
  • Broker Rust changes carry their existing unit tests; dead local-routing helpers retained under the crate's #[allow(dead_code)] convention for their tests.

🤖 Generated with Claude Code

Review in cubic

@willwashburn willwashburn requested a review from khaliqgant as a code owner July 1, 2026 14:40
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: f1cb4571-ed88-4f5b-a485-3c0814d5fa2b

📥 Commits

Reviewing files that changed from the base of the PR and between bf68b92 and 7235a1c.

📒 Files selected for processing (7)
  • CHANGELOG.md
  • crates/broker/src/relaycast/ws.rs
  • crates/broker/src/runtime/api.rs
  • crates/broker/src/runtime/delivery.rs
  • crates/broker/src/runtime/fleet.rs
  • crates/broker/src/worker.rs
  • packages/harness-driver/src/client.ts
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • crates/broker/src/runtime/delivery.rs
  • packages/harness-driver/src/client.ts
  • crates/broker/src/runtime/api.rs
  • crates/broker/src/relaycast/ws.rs

📝 Walkthrough

Walkthrough

Broker Relaycast send paths now require an explicit sender identity, validated against workspace-scoped worker registration, plus reply-target filtering for threaded messages and a new delivery_queued fleet event. HarnessDriverClient.spawn() replaces a fixed 10-attempt handshake retry with a deadline-based retry tolerating only 503 errors.

Changes

Broker Relaycast send routing and delivery events

Layer / File(s) Summary
Reply-target eligibility filter
crates/broker/src/runtime/delivery.rs
Adds is_relaycast_reply_target, rejecting blank, grouping (#, direct:), and synthetic thread ids, with unit tests.
Workspace-scoped worker lookup
crates/broker/src/worker.rs
Adds has_worker_in_workspace, confirming a worker exists and is unscoped or matches the given workspace, plus a unit test.
RelaycastHttpClient explicit sender identity
crates/broker/src/relaycast/ws.rs
send_dm, send_dm_with_mode, send, and send_with_mode now accept/require an explicit from identity and authenticate via registered_agent_client_as(from, None); threaded channel replies route through reply with Steer downgraded and logged; documentation and DM steer test updated accordingly.
HTTP Send route validated identity and reply target
crates/broker/src/runtime/api.rs
publish_from impersonates delivery_from only if confirmed via has_worker_in_workspace, else falls back to workspace_self_name; reply_thread_id is derived via is_relaycast_reply_target and passed to send_with_mode instead of the raw thread_id.
Fleet queued delivery event
crates/broker/src/runtime/fleet.rs
surface_fleet_deliver's Inject/Queued branch now emits a delivery_queued event with event/delivery IDs, from/target, and a fixed reason before returning.

Harness driver handshake retry

Layer / File(s) Summary
Deadline-based handshake retry
packages/harness-driver/src/client.ts, CHANGELOG.md
spawn()'s handshake loop now retries getSession() until a computed handshakeDeadline (from timeoutMs), tolerating only 503/"Service Unavailable" errors; changelog updated to describe this behavior.

Estimated code review effort: 3 (Moderate) | ~30 minutes

Possibly related issues

Possibly related PRs

  • AgentWorkforce/relay#1062: Both PRs modify crates/broker/src/runtime/delivery.rs to classify/skip synthetic Relaycast message IDs, tying together eligible ID logic for reply/read-ack actions.
  • AgentWorkforce/relay#1219: Builds on the same Relaycast send plumbing (publish_from/thread_id validation, has_worker_in_workspace, is_relaycast_reply_target) in the same files/functions.

Suggested reviewers: khaliqgant

Poem

A hop through code, a thread to send,
With names confirmed and true, my friend.
No more ten tries against the clock—
Just wait, retry, till 503s unlock.
Delivery queued, a signal bright,
This rabbit thumps its paws in delight! 🐇✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly matches the main changes: Relaycast delivery routing and spawn handshake timeout handling.
Description check ✅ Passed The description covers Summary, changes, and testing well, but it omits the template's exact Test Plan checklist and Screenshots sections.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/relaycast-always-mediated-delivery

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the broker's message delivery system to route all messages through Relaycast, removing the direct local-injection bypass to ensure consistent state tracking. It also updates HarnessDriverClient.spawn() to poll the broker's startup handshake for the full timeout budget instead of a fixed 10 attempts, preventing premature spawn failures. The review feedback correctly notes that the added changelog entry contains internal CI details, violating the repository style guide's requirement for concise, impact-first entries.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread CHANGELOG.md Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/broker/src/runtime/api.rs (1)

719-732: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Record thread history only after Relaycast publish succeeds.

This inserts the message into recent_thread_messages before the Relaycast call. If the publish fails or times out, /api/threads can show a message that was never accepted by Relaycast. Move this record into the Ok(Ok(())) branch, or mark failed attempts separately.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/broker/src/runtime/api.rs` around lines 719 - 732, The thread history
entry is being recorded before Relaycast confirms delivery, which can leave
`/api/threads` showing messages that were never accepted. Move the
`record_thread_history_event` call in `crates/broker/src/runtime/api.rs` so it
only runs inside the `Ok(Ok(()))` success branch after the Relaycast publish
completes, or otherwise record failed publishes separately. Use the existing
publish result handling around the Relaycast call and the
`record_thread_history_event` helper to keep the history consistent with actual
delivery.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/broker/src/relaycast/ws.rs`:
- Around line 525-529: The threaded reply path in relaycast_ws::send_with_mode
currently allows Steer to fall through to AgentClient::reply, which ignores
MessageInjectionMode and silently behaves like a plain reply. Update the logic
around the thread_id branch so that when the requested mode is Steer and a
thread_id is present, it returns an explicit unsupported-operation error instead
of calling AgentClient::reply; keep plain threaded replies working for supported
modes and preserve the existing error wrapping pattern.

In `@crates/broker/src/runtime/api.rs`:
- Around line 713-717: The publish sender selection in the `publish_from` block
only checks `workers.has_worker(&delivery_from)`, which is not enough to ensure
the worker belongs to `selected_workspace`. Update the logic in `api.rs` so
impersonation is allowed only when the worker exists and is associated with the
currently selected workspace; otherwise fall back to `workspace_self_name`. Use
the existing `delivery_from`, `selected_workspace`, `workers.has_worker`, and
`workspace_self_name` symbols to locate and adjust the guard before calling
`selected_workspace.http_client.send_with_mode`.

In `@packages/harness-driver/src/client.ts`:
- Around line 377-395: The handshake retry loop in client.ts is checking
err.message for 503 text, but BrokerTransport.request() already exposes
structured retry data. Update the retry condition in the session handshake logic
to use the thrown error’s code or status (for example HarnessDriverProtocolError
with http_503 / 503) instead of parsing the formatted message, while keeping the
existing deadline and brokerExited behavior in the client.getSession retry flow.

---

Outside diff comments:
In `@crates/broker/src/runtime/api.rs`:
- Around line 719-732: The thread history entry is being recorded before
Relaycast confirms delivery, which can leave `/api/threads` showing messages
that were never accepted. Move the `record_thread_history_event` call in
`crates/broker/src/runtime/api.rs` so it only runs inside the `Ok(Ok(()))`
success branch after the Relaycast publish completes, or otherwise record failed
publishes separately. Use the existing publish result handling around the
Relaycast call and the `record_thread_history_event` helper to keep the history
consistent with actual delivery.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 89de1fa1-1af1-46aa-a909-07039ae1d9d1

📥 Commits

Reviewing files that changed from the base of the PR and between 7809b20 and 5b4a6bf.

📒 Files selected for processing (9)
  • CHANGELOG.md
  • crates/broker/src/lib.rs
  • crates/broker/src/relaycast/ws.rs
  • crates/broker/src/runtime/api.rs
  • crates/broker/src/runtime/fleet.rs
  • crates/broker/src/runtime/mod.rs
  • crates/broker/src/runtime/util.rs
  • crates/broker/src/worker.rs
  • packages/harness-driver/src/client.ts
💤 Files with no reviewable changes (1)
  • crates/broker/src/worker.rs

Comment thread crates/broker/src/relaycast/ws.rs
Comment thread crates/broker/src/runtime/api.rs Outdated
Comment thread packages/harness-driver/src/client.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5b4a6bfd3c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +430 to +438
InboundQueueOutcome::Queued => {
tracing::info!(
target = "relay_broker::fleet",
agent = %deliver.agent,
delivery_id = %deliver.delivery_id,
msg_id = %deliver.msg_id,
"queued node delivery (manual_flush inbound delivery mode)"
);
Ok(())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Emit queued events for manual-flush deliveries

When a worker is in manual_flush, this new node-delivery branch only logs the queued message and then ACKs it. The removed local /api/send path emitted a delivery_queued broker event in this case, and attach-drive relies on that event to increment its pending counter after the initial /pending poll. Since /api/send now always reaches this branch via Relaycast, messages queued while driving an agent become invisible in the drive status line until a later flush/poll even though they are held in the broker.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ee2c139. The InboundQueueOutcome::Queued branch in the fleet path now re-emits the delivery_queued event (with the name field attach --drive scopes on, plus event_id/delivery_id/from/target), matching the removed local send path. Held messages are visible in the drive status line again.

Comment on lines +525 to +527
if let Some(thread_id) = thread_id {
agent_client
.reply(thread_id, text, None, None)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate thread IDs before using Relaycast reply

For channel targets, any supplied thread_id is now treated as a Relaycast message id. Broker clients commonly pass the event_id returned by /api/send, which is still a synthetic http_* id, and /api/threads can also surface synthetic grouping keys such as #general/direct:*; Relaycast cannot reply to those ids, so these channel follow-ups now fail instead of posting with the provided thread context. Gate this path to real Relaycast message ids or return/use the Relaycast id from the initial publish.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bf68b92. Added is_relaycast_reply_target() (reuses the existing synthetic-event-id classifier and also rejects #channel / direct:* grouping keys); the send handler only forwards thread_id to reply() when it's a real message id and otherwise falls back to a plain post, so a synthetic http_*/grouping id no longer fails the send.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 issues found and verified against the latest diff

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/broker/src/relaycast/ws.rs">

<violation number="1" location="crates/broker/src/relaycast/ws.rs:527">
P1: Thread replies now require a Relaycast message id, but the broker send API exposes broker `event_id`s to callers. Follow-up sends using the returned id will fail at `reply` instead of publishing the message.</violation>
</file>

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

.map_err(|e| anyhow::anyhow!("relaycast send_to_channel failed: {e}"))?;
if let Some(thread_id) = thread_id {
agent_client
.reply(thread_id, text, None, None)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Thread replies now require a Relaycast message id, but the broker send API exposes broker event_ids to callers. Follow-up sends using the returned id will fail at reply instead of publishing the message.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/broker/src/relaycast/ws.rs, line 527:

<comment>Thread replies now require a Relaycast message id, but the broker send API exposes broker `event_id`s to callers. Follow-up sends using the returned id will fail at `reply` instead of publishing the message.</comment>

<file context>
@@ -464,31 +486,57 @@ impl RelaycastHttpClient {
-                .map_err(|e| anyhow::anyhow!("relaycast send_to_channel failed: {e}"))?;
+            if let Some(thread_id) = thread_id {
+                agent_client
+                    .reply(thread_id, text, None, None)
+                    .await
+                    .map_err(|e| anyhow::anyhow!("relaycast thread reply failed: {e}"))?;
</file context>

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bf68b92 — see reply above. Synthetic http_* event ids (and #channel/direct:* keys) are no longer passed to reply(); they fall back to a plain post so the follow-up send still publishes.

Comment thread crates/broker/src/runtime/api.rs Outdated
Comment thread crates/broker/src/runtime/fleet.rs
Comment thread crates/broker/src/relaycast/ws.rs

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
crates/broker/src/worker.rs (1)

1590-1596: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add test coverage for the matching/mismatching workspace branches.

The only new test covers the "unknown worker" case. The security-relevant branches — worker present with a matching workspace_id, worker present with a mismatched workspace_id, and the unscoped (None) worker fallback — are untested, even though this function is the sole gate against cross-workspace sender impersonation.

♻️ Suggested additional tests
     #[test]
     fn has_worker_in_workspace_returns_false_for_unknown() {
         let reg = make_registry(vec![]);
         let workspace = crate::ids::WorkspaceId::new("ws_1".to_string());
         assert!(!reg.has_worker_in_workspace("nonexistent", &workspace));
     }
+
+    // TODO: add cases exercising a spawned worker with a matching
+    // `workspace_id`, a mismatched `workspace_id`, and `workspace_id: None`
+    // once a test helper exists to insert a `WorkerHandle` directly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/broker/src/worker.rs` around lines 1590 - 1596, Add test coverage in
has_worker_in_workspace for the security-critical branches that are currently
missing: verify a known worker with a matching WorkspaceId returns true, a known
worker with a mismatched WorkspaceId returns false, and a worker with no
workspace scope (None) still falls back correctly. Extend the existing worker
registry test setup helpers such as make_registry and the
has_worker_in_workspace test group in worker.rs so these cases are explicitly
asserted.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/broker/src/worker.rs`:
- Around line 212-230: Keep the workspace check in has_worker_in_workspace, but
add tests covering every branch of the worker lookup logic. Add cases for a
worker with the same workspace_id, a worker with a different workspace_id, and a
worker whose workspace_id is None to verify the legacy/SDK fallback behavior
matches the routing rule. Use the has_worker_in_workspace method and the workers
map/handle.workspace_id paths as the targets for the new coverage.

---

Nitpick comments:
In `@crates/broker/src/worker.rs`:
- Around line 1590-1596: Add test coverage in has_worker_in_workspace for the
security-critical branches that are currently missing: verify a known worker
with a matching WorkspaceId returns true, a known worker with a mismatched
WorkspaceId returns false, and a worker with no workspace scope (None) still
falls back correctly. Extend the existing worker registry test setup helpers
such as make_registry and the has_worker_in_workspace test group in worker.rs so
these cases are explicitly asserted.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: a3550598-2908-442a-8cfa-4196e929c9ab

📥 Commits

Reviewing files that changed from the base of the PR and between c57603e and ee2c139.

📒 Files selected for processing (6)
  • CHANGELOG.md
  • crates/broker/src/relaycast/ws.rs
  • crates/broker/src/runtime/api.rs
  • crates/broker/src/runtime/fleet.rs
  • crates/broker/src/worker.rs
  • packages/harness-driver/src/client.ts
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • packages/harness-driver/src/client.ts
  • crates/broker/src/runtime/api.rs
  • crates/broker/src/relaycast/ws.rs
  • crates/broker/src/runtime/fleet.rs

Comment on lines +212 to +230
/// True when a worker named `name` exists and either has no recorded
/// workspace or belongs to `workspace_id`. Gates sender impersonation on
/// Relaycast publish: a worker attached to workspace A must not be
/// impersonated when publishing into workspace B, which would register or
/// rotate that name's token in the wrong workspace.
pub(crate) fn has_worker_in_workspace(
&self,
name: &str,
workspace_id: &crate::ids::WorkspaceId,
) -> bool {
match self.workers.get(name) {
Some(handle) => match &handle.workspace_id {
Some(worker_ws) => worker_ws == workspace_id,
None => true,
},
None => false,
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: check where WorkerHandle.workspace_id is set to None vs Some,
# to confirm None only occurs in single-workspace configurations.
rg -n -B3 -A3 'workspace_id\s*:\s*None' crates/broker/src/worker.rs crates/broker/src/runtime/*.rs
rg -n -B3 -A10 'struct WorkerHandle' crates/broker/src/worker.rs

Repository: AgentWorkforce/relay

Length of output: 1240


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Map the worker and API call sites first.
ast-grep outline crates/broker/src/worker.rs --view expanded
ast-grep outline crates/broker/src/runtime/api.rs --view expanded
ast-grep outline crates/broker/src/runtime/app_server.rs --view expanded

# Read the relevant sections around workspace_id handling and the impersonation gate.
sed -n '1,140p' crates/broker/src/worker.rs
sed -n '1,240p' crates/broker/src/runtime/api.rs
sed -n '340,430p' crates/broker/src/runtime/app_server.rs

# Find all assignments/usages of WorkerHandle.workspace_id and related workspace scoping.
rg -n -B3 -A3 'workspace_id\s*:' crates/broker/src
rg -n -B3 -A6 'has_worker_in_workspace|from:|workspace_id' crates/broker/src/runtime/api.rs crates/broker/src/worker.rs

# Check tests around the new helper and any multi-workspace behavior.
rg -n -B2 -A8 'has_worker_in_workspace|workspace_id: None|workspace_id: Some|RelayDelivery' crates/broker/src

Repository: AgentWorkforce/relay

Length of output: 50377


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the impersonation gate and worker creation paths with tight ranges.
sed -n '210,260p' crates/broker/src/worker.rs
sed -n '760,900p' crates/broker/src/worker.rs
sed -n '1,220p' crates/broker/src/runtime/api.rs

# Find all places a WorkerHandle is constructed and how workspace_id is populated.
rg -n -B4 -A12 'WorkerHandle \{' crates/broker/src
rg -n -B4 -A8 'workspace_id:\s*(Some\(|None|workspace_id)' crates/broker/src/worker.rs crates/broker/src/runtime crates/broker/src

# Read the routing rule that already treats None as match-all.
sed -n '1,80p' crates/broker/src/routing.rs

# Narrow tests around the new helper and any workspace-related behavior.
sed -n '1570,1610p' crates/broker/src/worker.rs

Repository: AgentWorkforce/relay

Length of output: 50376


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the exact impersonation gate in runtime/api.rs.
rg -n -B8 -A18 'has_worker_in_workspace|impersonat|sender|from' crates/broker/src/runtime/api.rs

# Inspect the worker workspace propagation path from spawn call sites.
rg -n -B6 -A10 'spawn\(' crates/broker/src/runtime/api.rs crates/broker/src/runtime/tests.rs crates/broker/src/listen_api.rs

# Check the helper tests and add any nearby assertions around None/workspace match behavior.
sed -n '1588,1608p' crates/broker/src/worker.rs
sed -n '1,80p' crates/broker/src/routing.rs

Repository: AgentWorkforce/relay

Length of output: 50376


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find every production construction of WorkerHandle and inspect its workspace_id value.
rg -n -B6 -A14 'WorkerHandle \{' crates/broker/src --glob '!**/*test*' --glob '!**/tests.rs'

# Find every direct use of `workspace_id: None` in worker creation contexts.
rg -n -B4 -A6 'workspace_id:\s*None' crates/broker/src --glob '!**/*test*' --glob '!**/tests.rs'

# Show the full spawn signature and the workspace_id value path again, plus any other spawn call sites.
rg -n -B4 -A12 'pub\(crate\) async fn spawn\(' crates/broker/src/worker.rs
rg -n -n 'spawn\(' crates/broker/src/runtime/api.rs crates/broker/src/listen_api.rs crates/broker/src/pty_worker.rs crates/broker/src/node_control.rs crates/broker/src/fleet.rs

Repository: AgentWorkforce/relay

Length of output: 14237


Keep the workspace check, but add coverage for all branches. The None fallback matches the existing legacy/SDK worker rule in crates/broker/src/routing.rs, so it looks intentional; add tests for the same-workspace, mismatched-workspace, and unscoped cases so this impersonation gate doesn’t regress.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/broker/src/worker.rs` around lines 212 - 230, Keep the workspace check
in has_worker_in_workspace, but add tests covering every branch of the worker
lookup logic. Add cases for a worker with the same workspace_id, a worker with a
different workspace_id, and a worker whose workspace_id is None to verify the
legacy/SDK fallback behavior matches the routing rule. Use the
has_worker_in_workspace method and the workers map/handle.workspace_id paths as
the targets for the new coverage.

@willwashburn

Copy link
Copy Markdown
Member Author

Addressed all bot review findings across two commits (ee2c139, bf68b92):

  • Workspace-scoped impersonation (P1, api.rs/worker.rs) — impersonation now requires the custodial worker to belong to the workspace being published into (has_worker_in_workspace), so a worker from workspace A can't be impersonated (token registered/rotated) in workspace B. Workers with no recorded workspace still match.
  • delivery_queued regression (P2, fleet.rs) — the node-delivery Queued branch re-emits the delivery_queued event (with the name field attach --drive scopes on) as the removed local path did, so held messages are visible in the drive status line again.
  • Synthetic thread ids (P1, ws.rs/api.rs/delivery.rs) — the send handler only calls AgentClient::reply for real Relaycast message ids (is_relaycast_reply_target); synthetic http_* event ids and #channel/direct:* grouping keys fall back to a plain post instead of failing the whole send.
  • Steer + threaded reply (P2, ws.rs) — reply has no injection mode, so Steer + thread_id is downgraded to a normal reply, now with a tracing::warn! and doc note instead of silently. Chose downgrade-and-log over a hard error so the message is still delivered.
  • Structured 503 match (Minor, client.ts) — the handshake retry checks the typed HarnessDriverProtocolError status/code fields, with the message regex as a fallback.
  • Changelog — trimmed to an impact-first line.

New tests: has_worker_in_workspace_returns_false_for_unknown, is_relaycast_reply_target (real vs synthetic/grouping ids). Full broker lib suite (797) + tsc + prettier pass locally.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 7 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/broker/src/runtime/api.rs">

<violation number="1" location="crates/broker/src/runtime/api.rs:770">
P2: Pass the trimmed reply id to Relaycast, not the original untrimmed slice. Current validation trims for classification but forwards whitespace to `AgentClient::reply`, which can still fail the send this fallback is meant to protect.</violation>
</file>

<file name="crates/broker/src/worker.rs">

<violation number="1" location="crates/broker/src/worker.rs:1591">
P2: The test for `has_worker_in_workspace` only covers the "unknown worker" branch. Since this function gates sender impersonation (security-sensitive), add tests for the same-workspace (returns true), mismatched-workspace (returns false), and unscoped/None workspace (returns true) branches to prevent regressions on the impersonation gate.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment on lines +770 to +772
let reply_thread_id = thread_id
.as_deref()
.filter(|tid| is_relaycast_reply_target(tid));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Pass the trimmed reply id to Relaycast, not the original untrimmed slice. Current validation trims for classification but forwards whitespace to AgentClient::reply, which can still fail the send this fallback is meant to protect.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/broker/src/runtime/api.rs, line 770:

<comment>Pass the trimmed reply id to Relaycast, not the original untrimmed slice. Current validation trims for classification but forwards whitespace to `AgentClient::reply`, which can still fail the send this fallback is meant to protect.</comment>

<file context>
@@ -755,6 +760,24 @@ impl BrokerRuntime {
+                // `/api/threads` aren't reply targets — Relaycast would reject
+                // the reply and fail the whole send. Fall back to a plain post
+                // (unthreaded) for those, preserving delivery.
+                let reply_thread_id = thread_id
+                    .as_deref()
+                    .filter(|tid| is_relaycast_reply_target(tid));
</file context>
Suggested change
let reply_thread_id = thread_id
.as_deref()
.filter(|tid| is_relaycast_reply_target(tid));
let reply_thread_id = thread_id
.as_deref()
.map(str::trim)
.filter(|tid| is_relaycast_reply_target(tid));

}

#[test]
fn has_worker_in_workspace_returns_false_for_unknown() {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: The test for has_worker_in_workspace only covers the "unknown worker" branch. Since this function gates sender impersonation (security-sensitive), add tests for the same-workspace (returns true), mismatched-workspace (returns false), and unscoped/None workspace (returns true) branches to prevent regressions on the impersonation gate.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/broker/src/worker.rs, line 1591:

<comment>The test for `has_worker_in_workspace` only covers the "unknown worker" branch. Since this function gates sender impersonation (security-sensitive), add tests for the same-workspace (returns true), mismatched-workspace (returns false), and unscoped/None workspace (returns true) branches to prevent regressions on the impersonation gate.</comment>

<file context>
@@ -1568,6 +1587,13 @@ mod tests {
     }
 
+    #[test]
+    fn has_worker_in_workspace_returns_false_for_unknown() {
+        let reg = make_registry(vec![]);
+        let workspace = crate::ids::WorkspaceId::new("ws_1".to_string());
</file context>

willwashburn and others added 4 commits July 1, 2026 08:43
HarnessDriverClient.spawn() polled the broker's startup handshake for a
fixed 10 attempts × 1s (~10s), ignoring the caller's startupTimeoutMs.
When the Relaycast handshake took longer than ~10s on a cold or slow
network, the loop exhausted its attempts and threw the broker's
retryable "Broker is starting" 503 even though the broker was healthy
and still warming up. This failed the release smoke tests, which skipped
the internal @agent-relay/* publish jobs, which then made "Publish Main
Package" time out waiting for deps that were never published.

Poll until the full startup budget (default 45s) elapses instead. The
brokerExited race still surfaces a dead broker immediately, so this only
extends how long we wait on a broker that is alive and warming up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e queued event, steer/thread, structured 503

- api.rs/worker.rs: gate sender impersonation on workspace membership, not just
  name. `has_worker_in_workspace` requires the custodial worker to belong to the
  workspace being published into, so a worker attached to workspace A can't be
  impersonated (registering/rotating its token) in workspace B.
- fleet.rs: re-emit the `delivery_queued` broker event when a node delivery is
  held under manual_flush. The removed local send path emitted it and
  `attach --drive` counts it to show pending messages; node delivery is the only
  path now, so without this queued messages were invisible until a later flush.
- ws.rs: a `Steer` send with a `thread_id` can't be honored (`AgentClient::reply`
  takes no injection mode). Downgrade to a normal reply as before, but log a
  warning and document it instead of dropping steer silently.
- client.ts: match the startup-handshake 503 on the structured
  `HarnessDriverProtocolError` fields (`status`/`code`) with a message-regex
  fallback, so a custom broker 503 body still retries.
- CHANGELOG: trim the spawn-timeout entry to impact-first per the style guide.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Since the send handler now posts via AgentClient::reply when thread_id is
present, a caller echoing back a synthetic id — the broker-minted http_*
event_id from /api/send, or a #channel / direct:* grouping key from
/api/threads — would make Relaycast reject the reply and fail the whole
send. Gate the reply path on is_relaycast_reply_target (reusing the
existing synthetic-event-id classifier plus the channel/DM grouping-key
prefixes); non-message-id thread_ids fall back to a plain post so the
message is still delivered, just unthreaded.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@willwashburn willwashburn force-pushed the fix/relaycast-always-mediated-delivery branch from bf68b92 to 11ac64b Compare July 1, 2026 15:44
@willwashburn willwashburn merged commit 0835a61 into main Jul 1, 2026
43 checks passed
@willwashburn willwashburn deleted the fix/relaycast-always-mediated-delivery branch July 1, 2026 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant